This project performs sentiment analysis on comments scraped from Hespress articles. It uses a big data pipeline consisting of Apache Kafka, Apache Spark, HDFS, and MongoDB to process and store the data.
The project follows a hybrid batch and real-time processing architecture:
- Data Source (Hespress): Comments are scraped from Hespress articles using a custom scraper.
- Data Ingestion (Kafka): Scraped comments are streamed into a Kafka topic.
- Batch Processing (Spark):
  - Spark reads comments from the Kafka topic in batches.
  - Preprocessing steps (cleaning, normalization) are applied.
  - Sentiment is predicted using a pre-trained deep learning model.
  - Processed comments, including sentiment, are stored in MongoDB.
- Storage (MongoDB): MongoDB stores both batch- and real-time-processed comments.
- Persistent Storage (HDFS): Raw comments ingested from Kafka are stored on HDFS for durability and potential replay/reprocessing.
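The README does not spell out what "cleaning" and "normalization" involve; the sketch below shows one plausible preprocessing pass for Arabic-language comments (URL stripping, diacritic and tatweel removal, whitespace collapsing) using only the standard library. The function name `clean_comment` is illustrative and is not taken from the project code.

```python
import re

# Arabic diacritics (tashkeel) occupy the Unicode range U+064B..U+0652.
DIACRITICS = re.compile(r"[\u064B-\u0652]")
URLS = re.compile(r"https?://\S+")
TATWEEL = "\u0640"  # elongation character, carries no meaning

def clean_comment(text: str) -> str:
    """Normalize a raw comment before tokenization."""
    text = URLS.sub(" ", text)          # drop links
    text = DIACRITICS.sub("", text)     # strip vowel marks
    text = text.replace(TATWEEL, "")    # strip elongation
    text = re.sub(r"\s+", " ", text)    # collapse whitespace
    return text.strip()

if __name__ == "__main__":
    print(clean_comment("رائـــع   جدا https://example.com"))
```

The real pipeline would apply a function like this inside the Spark batch job, before feeding the text to the tokenizer.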
```
hespress-comments-analysis/
├── config/                    # Configuration files
│   ├── kafka_config.py
│   └── mongodb_config.py
├── models/                    # Data models
│   ├── comment.py
│   ├── sentiment_model.h5
│   ├── tokenizer.json
│   └── label_encoder.pkl
├── processors/                # Data processing logic
│   ├── batch_processor.py
│   └── spark_processor.py
├── storage/                   # Data storage handlers
│   ├── hdfs_handler.py
│   ├── kafka_handler.py
│   └── mongodb_handler.py
├── utils/                     # Utility functions
│   ├── scrapper.py
│   └── sentiments_processor.py
├── dashboard/                 # Flask dashboard
├── main.py                    # Main application entry point
├── requirements.txt           # Project dependencies
└── README.md                  # This file
```
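Since comments travel from the scraper through Kafka to Spark and MongoDB as serialized records, a small data model keeps all stages in sync. The sketch below shows one plausible shape for such a record; the field names are assumptions for illustration, not copied from `models/comment.py`.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class Comment:
    """Illustrative shape of a scraped comment record.

    Field names are hypothetical, not taken from models/comment.py.
    """
    article_url: str
    author: str
    body: str
    likes: int = 0
    sentiment: Optional[str] = None  # filled in later by the Spark batch job

    def to_json(self) -> str:
        # Kafka values are bytes on the wire; JSON keeps the record
        # language-agnostic and easy to insert into MongoDB.
        return json.dumps(asdict(self), ensure_ascii=False)

    @classmethod
    def from_json(cls, payload: str) -> "Comment":
        return cls(**json.loads(payload))

if __name__ == "__main__":
    c = Comment(article_url="https://www.hespress.com/...", author="user1", body="تعليق")
    print(Comment.from_json(c.to_json()) == c)
```

A JSON round trip like this is what the Kafka producer (scraper side) and consumer (Spark side) would agree on.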
- Docker: Ensure you have Docker installed on your system (see the official Docker installation guide).
- Docker Compose: Ensure you have Docker Compose installed (see the official Docker Compose installation guide).
- Clone the repository:

  ```
  git clone https://github.com/abdellatif-laghjaj/hespress-comments-analysis
  cd hespress-comments-analysis
  ```
- Generate the model files:
  - Before building the Docker image, you need to generate the sentiment analysis model files.
  - Run the notebook: execute `model_training_notebook.ipynb`, which trains and saves the sentiment analysis model, tokenizer, and label encoder. Use the attached CSV file as the dataset.
  - Place the model files: ensure that the generated files (`sentiment_model.h5`, `tokenizer.json`, `label_encoder.pkl`) are placed in the `models/` directory.
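The notebook itself is the source of truth for how the three artifacts are written; typically the Keras model goes through `model.save(...)`, while the tokenizer and label encoder are serialized with JSON and pickle. The sketch below round-trips stand-in versions of the two lightweight artifacts with only the standard library, just to show the expected file layout; the contents (`num_words`, the label names) are placeholders, not values from this project.

```python
import json
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-ins for what the notebook produces; the real
# tokenizer.json comes from Keras and label_encoder.pkl from scikit-learn.
tokenizer_config = {"num_words": 10000, "oov_token": "<OOV>"}
label_classes = ["negative", "neutral", "positive"]

models_dir = Path(tempfile.mkdtemp())  # stands in for the models/ directory

# Save step (done once by the training notebook).
(models_dir / "tokenizer.json").write_text(
    json.dumps(tokenizer_config), encoding="utf-8")
with open(models_dir / "label_encoder.pkl", "wb") as f:
    pickle.dump(label_classes, f)

# Load step (done by the batch job at inference time).
cfg = json.loads((models_dir / "tokenizer.json").read_text(encoding="utf-8"))
with open(models_dir / "label_encoder.pkl", "rb") as f:
    classes = pickle.load(f)
print(cfg == tokenizer_config, classes == label_classes)
```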
- Build and run with Docker Compose:
  - Navigate to the project root directory in your terminal (where `docker-compose.yml` is located).
  - Build the Docker image:

    ```
    docker-compose build
    ```

  - Run the entire application:

    ```
    docker-compose up
    ```

- Accessing the Dashboard: once the containers are running, open http://localhost:5001/ in your browser.

Contributions are welcome! Please open an issue or submit a pull request.