nlp-sociology-methodology

To install the spaCy transformer model, run:
python -m spacy download uk_core_news_trf
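
Since spaCy installs its pipelines as ordinary Python packages, a quick check like the sketch below can confirm that the download succeeded before running the later scripts. The helper name is_model_installed is hypothetical and not part of this repository.

```python
import importlib.util


def is_model_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name can be imported."""
    # find_spec returns None when no such top-level package is found.
    return importlib.util.find_spec(package_name) is not None


MODEL = "uk_core_news_trf"

if is_model_installed(MODEL):
    print(f"{MODEL} is installed")
else:
    print(f"{MODEL} is missing; run: python -m spacy download {MODEL}")
```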

Key Data Files

scraping-bs4.py: Scrapes news headlines from suspilne.media; collects ID, headline, URL, and time; saves the output to data/bs4_data.json.
scraping-pytesseract.py: Extracts text from an image using pytesseract; requires a .env file with tesseract_path and img_path; saves the output to data/pytesseract_data.txt.
stopwords-and-punctuations.py: Removes stopwords and punctuation; normalizes whitespace; adds headline_clean; saves the output to data/bs4_data_clean.json.
lemmatisation.py: Lemmatizes the headlines and the tone dictionary entries using spaCy (uk_core_news_trf); adds headline_lemma; saves the output to bs4_data_lemma.json and tone-dict-uk-lemma.json.
stemming.py: Stems the headlines and the tone dictionary entries using SnowballStemmer; adds headline_stem_of_lemma; saves the output to bs4_data_stem.json and tone-dict-uk-stem.json.
sentiment.py: Uses VADER with two analyzers (lemma-based and stem-based); computes neg, neu, pos, and compound scores; saves the final dataset to data/dataset.json.
check-tone-diff.py: Compares lemma_compound_tone and stem_compound_tone for each entry; prints the IDs of entries whose two scores are identical; saves the entries whose scores differ to data/wdiff_dataset.json.
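
The comparison step in check-tone-diff.py can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: it assumes the dataset is a list of records with an "id" key alongside the two compound scores, and the function name split_by_tone_agreement and the tol parameter are hypothetical.

```python
def split_by_tone_agreement(records, tol=0.0):
    """Split records into IDs with matching scores and records that differ.

    Assumes each record is a dict with "id", "lemma_compound_tone",
    and "stem_compound_tone" keys (an assumed layout for illustration).
    """
    identical_ids = []
    differing = []
    for rec in records:
        gap = abs(rec["lemma_compound_tone"] - rec["stem_compound_tone"])
        if gap <= tol:
            identical_ids.append(rec["id"])
        else:
            differing.append(rec)
    return identical_ids, differing


# Made-up sample records, only to show the shape of the input.
sample = [
    {"id": 1, "lemma_compound_tone": 0.25, "stem_compound_tone": 0.25},
    {"id": 2, "lemma_compound_tone": -0.5, "stem_compound_tone": 0.0},
]

identical_ids, differing = split_by_tone_agreement(sample)
print("Identical IDs:", identical_ids)
print("Differing entries:", len(differing))
```

In a real run, the differing list would be the part written to data/wdiff_dataset.json; passing ensure_ascii=False to json.dump keeps Ukrainian headlines readable in the output file.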
