To install the spaCy transformer model, run:
python -m spacy download uk_core_news_trf
- `tone-dict-uk.tsv`: original tone dictionary
- `dataset.json`: final sentiment analysis results
- `wdiff_dataset.json`: entries where the methods differ

- `scraping-bs4.py`: Scrapes news headlines from suspilne.media; collects ID, headline, URL, and time; saves the output to `data/bs4_data.json`.
- `scraping-pytesseract.py`: Extracts text from an image using pytesseract; requires a `.env` file with `tesseract_path` and `img_path`; saves the output to `data/pytesseract_data.txt`.
- `stopwords-and-punctuations.py`: Removes stopwords and punctuation; normalizes whitespace; adds `headline_clean`; saves the output to `data/bs4_data_clean.json`.
- `lemmatisation.py`: Lemmatizes the headlines and the tone dictionary using spaCy (`uk_core_news_trf`); adds `headline_lemma`; saves the output to `bs4_data_lemma.json` and `tone-dict-uk-lemma.json`.
- `stemming.py`: Applies stemming using SnowballStemmer and stems the tone dictionary entries; adds `headline_stem_of_lemma`; saves the output to `bs4_data_stem.json` and `tone-dict-uk-stem.json`.
- `sentiment.py`: Uses VADER with two analyzers (lemma-based and stem-based); computes `neg`, `neu`, `pos`, and `compound` scores; saves the final dataset to `data/dataset.json`.
- `check-tone-diff.py`: Compares `lemma_compound_tone` and `stem_compound_tone`; prints the IDs of entries where the two are identical; saves the differing entries to `data/wdiff_dataset.json`.
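
The comparison step in `check-tone-diff.py` can be pictured with the minimal sketch below. It is not the script itself. It assumes each entry is a dictionary with an `id` key and string-valued `lemma_compound_tone` and `stem_compound_tone` fields; the `id` key name, the tone labels, the sample records, and the helper `split_by_tone_agreement` are all hypothetical.

```python
import json


def split_by_tone_agreement(entries):
    """Return (IDs where both tones agree, entries where they differ).

    Assumes every entry has "id", "lemma_compound_tone" and
    "stem_compound_tone" keys (the "id" name is an assumption).
    """
    identical_ids = []
    differing = []
    for entry in entries:
        if entry["lemma_compound_tone"] == entry["stem_compound_tone"]:
            identical_ids.append(entry["id"])
        else:
            differing.append(entry)
    return identical_ids, differing


# Hypothetical sample records; real entries would be loaded from data/dataset.json.
sample = [
    {"id": 1, "lemma_compound_tone": "positive", "stem_compound_tone": "positive"},
    {"id": 2, "lemma_compound_tone": "neutral", "stem_compound_tone": "negative"},
    {"id": 3, "lemma_compound_tone": "negative", "stem_compound_tone": "negative"},
]

identical_ids, differing = split_by_tone_agreement(sample)
print("Identical IDs:", identical_ids)

# The script saves the differing entries to a JSON file; here they are only
# serialized to a string. ensure_ascii=False keeps Ukrainian text readable.
print(json.dumps(differing, ensure_ascii=False, indent=2))
```

To run the same logic on real data, the `sample` list could be replaced with the result of `json.load` on `data/dataset.json`, provided the entries actually carry these field names.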