BigToothDev/nlp-sociology-methodology

nlp-sociology-methodology

To install the spaCy transformer model, run:
python -m spacy download uk_core_news_trf

Key Data Files

  • tone-dict-uk.tsv — original tone dictionary
  • dataset.json — final sentiment analysis results
  • wdiff_dataset.json — entries where the methods differ
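As a rough illustration of working with the tone dictionary, the sketch below reads a two-column TSV into a word-to-score mapping. The column layout (word, tab, numeric score) is an assumption and is not confirmed by this README. The function name `load_tone_dict` and the sample words are hypothetical and are not taken from tone-dict-uk.tsv.

```python
import csv
import io


def load_tone_dict(handle):
    """Read 'word<TAB>score' rows into a dict (assumed layout, see note above)."""
    tones = {}
    # QUOTE_NONE keeps quote characters inside words from being treated as CSV quoting.
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        word, score = row[0].strip(), row[1].strip()
        try:
            tones[word] = float(score)
        except ValueError:
            continue  # skip rows whose second column is not numeric
    return tones


# Hypothetical in-memory sample standing in for the real file.
sample_tsv = io.StringIO("добрий\t2\nпоганий\t-2\n")
tones = load_tone_dict(sample_tsv)
print(tones)
```

For the real file, the same function could be called on `open("tone-dict-uk.tsv", encoding="utf-8", newline="")`, assuming the layout matches.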

Script File Descriptions

  • scraping-bs4.py: Scrapes news headlines from suspilne.media; collects ID, headline, URL, and time; saves the output to data/bs4_data.json.

  • scraping-pytesseract.py: Extracts text from an image using pytesseract; requires a .env file with tesseract_path and img_path; saves the output to data/pytesseract_data.txt.

  • stopwords-and-punctuations.py: Removes stopwords and punctuation; normalizes whitespace; adds headline_clean; saves the output to data/bs4_data_clean.json.

  • lemmatisation.py: Lemmatizes the headlines and the tone dictionary entries using spaCy (uk_core_news_trf); adds headline_lemma; saves the output to bs4_data_lemma.json and tone-dict-uk-lemma.json.

  • stemming.py: Applies stemming using SnowballStemmer and stems tone dictionary entries; adds headline_stem_of_lemma; saves the output to bs4_data_stem.json and tone-dict-uk-stem.json.

  • sentiment.py: Uses VADER with two analyzers (lemma-based and stem-based); computes neg, neu, pos, and compound scores; saves the final dataset to data/dataset.json.

  • check-tone-diff.py: Compares lemma_compound_tone and stem_compound_tone; prints the IDs of entries where the two scores are identical; saves the differing entries to data/wdiff_dataset.json.
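The comparison step described for check-tone-diff.py can be sketched roughly as follows. This is a minimal illustration and not the repository's actual script. It assumes each entry is a dict with an `id` key plus the `lemma_compound_tone` and `stem_compound_tone` fields named above. The helper `split_by_tone_agreement`, its `tol` parameter, and the sample records are hypothetical.

```python
import json


def split_by_tone_agreement(records, tol=0.0):
    """Return (ids with matching compound scores, records whose scores differ)."""
    matching_ids, differing = [], []
    for rec in records:
        lemma = rec["lemma_compound_tone"]
        stem = rec["stem_compound_tone"]
        # tol=0.0 means exact equality; a small tolerance is an optional assumption.
        if abs(lemma - stem) <= tol:
            matching_ids.append(rec["id"])
        else:
            differing.append(rec)
    return matching_ids, differing


# Hypothetical records standing in for the contents of data/dataset.json.
sample = [
    {"id": 1, "lemma_compound_tone": 0.0, "stem_compound_tone": 0.0},
    {"id": 2, "lemma_compound_tone": -0.5, "stem_compound_tone": 0.25},
]

matching_ids, differing = split_by_tone_agreement(sample)
print(matching_ids)
# ensure_ascii=False keeps Cyrillic text readable if the records are written to JSON.
print(json.dumps(differing, ensure_ascii=False, indent=2))
```

Writing `differing` to data/wdiff_dataset.json with `json.dump(..., ensure_ascii=False)` would mirror the output described above, assuming the real dataset has the same shape.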
