Project Summary: This project focuses on applying Topic Modeling to a BBC News Dataset. Topic Modeling is a statistical modeling technique used to uncover the main themes and topics present in a collection of documents or other textual data. The project was carried out in several parts, as follows:

  1. Analysis and Preprocessing

    • Performed an in-depth analysis of the dataset and the textual content it contained by using various descriptive statistics.
    • Used visualizations, including word clouds, to identify the most frequently used words and patterns in word occurrence.
    • Combined the titles and descriptions of the news articles to extract more information from the text.
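As a rough illustration of the analysis steps above, the sketch below combines titles and descriptions, computes a simple descriptive statistic, and counts word frequencies (the numbers a word cloud is drawn from). The sample records and the field names `title` and `description` are hypothetical stand-ins; the actual dataset's columns may differ.

```python
from collections import Counter
from statistics import mean

# Hypothetical sample records; the real dataset's field names may differ.
articles = [
    {"title": "Markets rally", "description": "Shares rise as markets rally on earnings."},
    {"title": "Cup final set", "description": "Two clubs reach the cup final."},
]

def combine(article):
    """Join title and description into a single text field."""
    return f"{article['title']} {article['description']}"

texts = [combine(a) for a in articles]

# Simple descriptive statistic: number of whitespace-separated words per article.
word_counts = [len(t.split()) for t in texts]
print("mean words per article:", mean(word_counts))

# Lowercased token frequencies across all articles, with trailing
# periods and commas stripped from each token.
frequencies = Counter(word.lower().strip(".,") for t in texts for word in t.split())
print(frequencies.most_common(3))
```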
  2. Text Preprocessing

    • Performed text preprocessing to clean the text and prepare it for analysis.
    • Removed stopwords, punctuation, extra spaces, and unnecessary characters, which add no semantic meaning to the text and could interfere with the accuracy of the analysis.
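A minimal sketch of this kind of cleaning is shown below, using only the standard library. The stopword list is a small illustrative set, and treating digits as "unnecessary characters" is an assumption; the project may well have used a fuller list from a library such as NLTK or spaCy.

```python
import re

# Small illustrative stopword list (an assumption, not the project's list).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "on", "is", "as", "for"}

def preprocess(text):
    """Lowercase, drop punctuation and digits, collapse spaces, remove stopwords."""
    text = text.lower()
    # Replace everything except letters and whitespace with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # split() with no arguments also collapses runs of extra spaces.
    tokens = text.split()
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("The  Bank of England raised rates, again, in 2023!"))
```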
  3. Text Vectorization using FastText and Embedding Visualizations using UMAP

    • Converted the preprocessed text into vector representations using FastText embeddings.
    • Utilized FastText embeddings to capture the semantic meaning of words and generate numerical representations of the text.
    • Created Embedding Visualizations using UMAP.
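To give a feel for how FastText arrives at word vectors, the sketch below shows the subword step it is known for: each word is wrapped in boundary markers and broken into character n-grams, and the word's vector is built from the vectors of those n-grams. It does not call the FastText library or use trained vectors. The `mean_vector` helper and the toy two-dimensional vectors are hypothetical, and averaging word vectors into one vector per article is an assumed (though common) pooling choice, not something the summary confirms. The UMAP projection is not shown.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return FastText-style character n-grams for a word, plus the whole word."""
    wrapped = f"<{word}>"  # "<" and ">" mark the word boundaries
    grams = [
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    ]
    # The full wrapped word is also kept as its own unit.
    if wrapped not in grams:
        grams.append(wrapped)
    return grams

def mean_vector(vectors):
    """Average equal-length vectors component-wise into one vector."""
    return [sum(component) / len(vectors) for component in zip(*vectors)]

print(char_ngrams("where", n_min=3, n_max=3))

# Hypothetical 2-dimensional word vectors, for illustration only.
toy_vectors = {"bank": [1.0, 0.0], "rates": [0.0, 1.0]}
document = ["bank", "rates"]
print(mean_vector([toy_vectors[w] for w in document]))
```

Because vectors are assembled from n-grams, a word that never appeared in training can still receive a vector from the n-grams it shares with known words.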
  4. Topic Modeling with LDA

    • Applied the Latent Dirichlet Allocation (LDA) algorithm, a popular topic modeling technique, to identify underlying topics within the dataset.
    • Analyzed patterns of word co-occurrence in the documents to uncover latent themes or topics.
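To make the idea of learning topics from word co-occurrence concrete, here is a small sketch of LDA fitted by collapsed Gibbs sampling, written with the standard library only. This is one possible inference method and not necessarily what the project used; a library implementation (for example in gensim or scikit-learn) is the more likely choice in practice. The function name `lda_gibbs`, the hyperparameter values, and the toy documents are all assumptions for illustration.

```python
import random

def lda_gibbs(docs, num_topics, iterations=100, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling on tokenized documents.

    Returns (doc_topic, topic_word): per-document topic counts and
    per-topic word counts after the final sweep.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [{} for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    def update(d, w, k, delta):
        """Add or remove one token of word w in document d from topic k."""
        doc_topic[d][k] += delta
        topic_word[k][w] = topic_word[k].get(w, 0) + delta
        topic_total[k] += delta

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        assignments.append([rng.randrange(num_topics) for _ in doc])
        for w, k in zip(doc, assignments[d]):
            update(d, w, k, +1)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                update(d, w, assignments[d][i], -1)  # take the token out
                # Unnormalized probability of each topic for this token.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k].get(w, 0) + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                update(d, w, k, +1)  # put it back under the sampled topic
    return doc_topic, topic_word

# Toy tokenized documents (hypothetical, not from the BBC dataset).
docs = [
    ["match", "goal", "team", "goal"],
    ["team", "coach", "match"],
    ["market", "shares", "bank"],
    ["bank", "rates", "market", "shares"],
]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2)
print(doc_topic)
```

Each token's topic is resampled in proportion to how often that topic already appears in the same document and how often the word already appears under that topic, which is how words that co-occur tend to end up grouped together.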
  5. Analysis of LDA results

    • Interpreted the results obtained from the LDA model.
    • Identified the most significant terms within each topic to gain insights into the main themes present in the dataset.
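One simple way to read off the most significant terms per topic is to rank each topic's words by weight and keep the top few. The sketch below does this for per-topic word counts; the helper `top_terms` and the counts themselves are hypothetical and are not results from the project.

```python
def top_terms(topic_word_counts, n=3):
    """Return the n highest-count terms for each topic, most frequent first."""
    return [
        # Sort by descending count, breaking ties alphabetically.
        [term for term, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]]
        for counts in topic_word_counts
    ]

# Hypothetical word counts for two topics, for illustration only.
hypothetical_counts = [
    {"match": 12, "goal": 9, "team": 7, "bank": 1},
    {"market": 10, "shares": 8, "bank": 6, "team": 1},
]

for index, terms in enumerate(top_terms(hypothetical_counts)):
    print(f"topic {index}: {', '.join(terms)}")
```

A human label for each topic (such as "sport" or "business") would then be assigned by inspecting these term lists.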