Hespress Comment Sentiment Analysis

This project performs sentiment analysis on comments scraped from Hespress articles. It uses a big data pipeline consisting of Apache Kafka, Apache Spark, HDFS, and MongoDB to process and store the data.

Architecture

The project follows a hybrid batch and real-time processing architecture:

  1. Data Source (Hespress): Comments are scraped from Hespress articles using a custom scraper.

  2. Data Ingestion (Kafka): Scraped comments are streamed into a Kafka topic.

  3. Batch Processing (Spark):

    • Spark reads comments from the Kafka topic in batches.
    • Preprocessing steps (cleaning, normalization) are applied.
    • Sentiment is predicted using a pre-trained deep learning model.
    • Processed comments, including sentiment, are stored in MongoDB.
  4. Storage (MongoDB): MongoDB stores both batch and real-time processed comments.

  5. Persistent Storage (HDFS): The raw comments ingested from Kafka are stored on HDFS for data durability and potential replay/reprocessing.
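
The flow described above can be sketched in plain Python without any of the real services. This is only an illustration under stated assumptions: a `deque` stands in for the Kafka topic, plain lists stand in for HDFS and MongoDB, `predict_sentiment` is a placeholder for the pre-trained model, and the cleaning rules (URL removal, stripping combining marks such as Arabic diacritics, whitespace collapsing) are hypothetical choices, not the project's actual preprocessing.

```python
import re
import unicodedata
from collections import deque

URL_RE = re.compile(r"https?://\S+")
WHITESPACE_RE = re.compile(r"\s+")


def clean_comment(text: str) -> str:
    """Hypothetical cleaning step: drop URLs and diacritics, collapse whitespace."""
    text = URL_RE.sub(" ", text)
    # Remove combining marks (Unicode category Mn), which includes Arabic tashkeel.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    return WHITESPACE_RE.sub(" ", text).strip()


def predict_sentiment(text: str) -> str:
    """Placeholder for the pre-trained model; it always returns 'neutral'."""
    return "neutral"


def process_batch(topic: deque, raw_archive: list, store: list, batch_size: int = 2) -> int:
    """Drain up to batch_size comments from the stand-in topic and store the results."""
    processed = 0
    while topic and processed < batch_size:
        raw = topic.popleft()
        raw_archive.append(raw)  # stands in for the raw copy kept on HDFS
        cleaned = clean_comment(raw)
        # The appended dict stands in for a document written to MongoDB.
        store.append({"text": cleaned, "sentiment": predict_sentiment(cleaned)})
        processed += 1
    return processed
```

Keeping the raw comment separate from the cleaned one mirrors the split between HDFS (raw, replayable) and MongoDB (processed, queryable) in the list above.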

Screenshots

[Images: Dashboard; Recent Comments]

Project Structure

hespress-comments-analysis/
├── config/                   # Configuration files
│   ├── kafka_config.py
│   └── mongodb_config.py
├── models/                   # Data models
│   ├── comment.py
│   ├── sentiment_model.h5
│   ├── tokenizer.json
│   └── label_encoder.pkl
├── processors/               # Data processing logic
│   ├── batch_processor.py
│   └── spark_processor.py
├── storage/                  # Data storage handlers
│   ├── hdfs_handler.py
│   ├── kafka_handler.py
│   └── mongodb_handler.py
├── utils/                    # Utility functions
│   ├── scrapper.py
│   └── sentiments_processor.py
├── dashboard/                # Flask dashboard
├── main.py                   # Main application entry point
├── requirements.txt          # Project dependencies
└── README.md                 # This file
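
The tree lists `models/comment.py` as the data model but does not show its contents. As a rough sketch of what a comment record of this kind might look like, the example below defines a dataclass that round-trips through UTF-8 JSON, a common payload format for Kafka messages. Every field name here (`comment_id`, `article_url`, `text`, `sentiment`) is an assumption made for illustration, not the project's actual schema.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Comment:
    # All field names are illustrative assumptions, not the project's schema.
    comment_id: str
    article_url: str
    text: str
    sentiment: Optional[str] = None  # left empty until a prediction is made

    def to_json(self) -> bytes:
        """Serialize to UTF-8 JSON bytes; ensure_ascii=False keeps Arabic text readable."""
        return json.dumps(asdict(self), ensure_ascii=False).encode("utf-8")

    @classmethod
    def from_json(cls, payload: bytes) -> "Comment":
        """Rebuild a Comment from bytes produced by to_json."""
        return cls(**json.loads(payload.decode("utf-8")))
```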

Getting Started

Prerequisites

Installation (Docker-based)

  1. Clone the repository:

    git clone https://github.com/abdellatif-laghjaj/hespress-comments-analysis
    cd hespress-comments-analysis
  2. Generate Model Files:

  • Before building the Docker image, you need to generate the sentiment analysis model files.

  • Run the Notebook: Execute model_training_notebook.ipynb, which trains and saves the sentiment analysis model, tokenizer, and label encoder. Use the attached CSV file as the dataset.

  • Place Model Files: Ensure that the generated files (sentiment_model.h5, tokenizer.json, label_encoder.pkl) are placed in the models/ directory.

  3. Build and Run with Docker Compose:

    • Navigate to the project root directory in your terminal (where docker-compose.yml is located).
    • Build the Docker image:

    docker-compose build
  4. Run the entire application using Docker Compose:

    docker-compose up
  5. Access the Dashboard:

    http://localhost:5001/
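
The build depends on the three generated model files being in place first. A minimal sketch of a pre-build check is shown below; the file names come from the steps above, while the helper name `missing_model_files` and the idea of running such a check at all are assumptions added for illustration. The directory is passed as an argument so it can match your checkout.

```python
from pathlib import Path

# File names as listed in the "Generate Model Files" step.
REQUIRED_MODEL_FILES = ("sentiment_model.h5", "tokenizer.json", "label_encoder.pkl")


def missing_model_files(model_dir: str) -> list:
    """Return the names of required model files that are absent from model_dir."""
    directory = Path(model_dir)
    return [name for name in REQUIRED_MODEL_FILES if not (directory / name).is_file()]
```

Calling `missing_model_files("models")` from the project root returns an empty list when all three files are present, and otherwise names the ones that still need to be generated by the notebook.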

Contributing

Contributions are welcome! Please open an issue or submit a pull request.
