Natural Language Processing Project – Naive Bayes & Logistic Regression
This project was developed for academic purposes (NLP midterm) and can be used as a learning resource.
Twitter is often a primary source of real-time information during disasters. However, not all tweets containing words such as “ablaze” or “fire” actually refer to real disaster events.
Examples:
- "The concert was ablaze!" → Not a disaster
- "The entire forest is ablaze." → Disaster
This project aims to build Naive Bayes and Logistic Regression models to classify whether a tweet truly contains disaster-related information.
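To illustrate why keyword matching alone falls short, here is a minimal sketch of a keyword filter applied to the two example tweets. The keyword set and the `keyword_flag` helper are hypothetical and are not part of the project code.

```python
import re

# Hypothetical keyword list, for illustration only.
DISASTER_KEYWORDS = {"ablaze", "fire"}


def keyword_flag(tweet: str) -> bool:
    """Return True if the tweet contains any of the disaster keywords."""
    tokens = set(re.findall(r"[a-z]+", tweet.lower()))
    return bool(tokens & DISASTER_KEYWORDS)


# (text, true label) pairs taken from the examples above.
examples = [
    ("The concert was ablaze!", 0),
    ("The entire forest is ablaze.", 1),
]

for text, label in examples:
    print(f"{text!r}: keyword_flag={keyword_flag(text)}, true_label={label}")
```

Both tweets are flagged even though only the second describes a disaster, which is the gap the trained classifiers are meant to close.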
- Source: Kaggle – Real or Not? Disaster Tweets
- Total data: 10,876 English tweets
  - 7,613 training data (used in the analysis)
  - 3,263 test data (not used because labels are unavailable)
Labels:
- 1 → Tweet references a real disaster
- 0 → Tweet does not reference a disaster
- Exploratory Data Analysis (EDA) – tweet distribution, text length, frequent words
- Preprocessing – lowercasing; removal of URLs, numbers, and punctuation; stopword removal; lemmatization; tokenization
- Vectorization – TF-IDF with `ngram_range=(1,2)` and `max_features=10,000`
- Feature Selection – Chi-Square (SelectKBest, top 3,000 features)
- Modeling – Naive Bayes and Logistic Regression with GridSearchCV (5-fold cross-validation)
- Evaluation – confusion matrix, precision, recall, F1-score, accuracy
- Interpretability – error analysis, POS tagging, Named Entity Recognition (NER)
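The preprocessing step in the list above can be sketched with the standard library alone. This is a simplified illustration, not the project's code: the `preprocess` function and the tiny `STOPWORDS` set are assumptions, and lemmatization is left out because it needs an NLP library such as NLTK or spaCy.

```python
import re
import string

# Assumption: a tiny illustrative stopword list; a real run would use a
# full list from an NLP library.
STOPWORDS = {"a", "an", "the", "is", "are", "was", "in", "on", "at", "of", "to", "and"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
DIGIT_RE = re.compile(r"\d+")
# Translation table that deletes ASCII punctuation characters.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(tweet: str) -> list[str]:
    """Lowercase, strip URLs/numbers/punctuation, tokenize, drop stopwords."""
    text = tweet.lower()
    text = URL_RE.sub(" ", text)        # remove URLs before punctuation is stripped
    text = DIGIT_RE.sub(" ", text)      # remove numbers
    text = text.translate(PUNCT_TABLE)  # remove punctuation
    tokens = text.split()               # whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]


print(preprocess("The entire forest is ablaze! 3 towns evacuated http://t.co/abc123"))
# → ['entire', 'forest', 'ablaze', 'towns', 'evacuated']
```

URLs are removed first so that stripping punctuation does not leave fragments such as `httptcoabc123` behind as tokens.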
| Model | Precision | Recall | F1-Score | Accuracy |
|---|---|---|---|---|
| Logistic Regression | 0.84 | 0.83 | 0.83 | 83.7% |
| Naive Bayes | 0.86 | 0.85 | 0.86 | 85.9% |
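A comparison like the one in the table above could be set up roughly as follows with scikit-learn. This is a minimal sketch under stated assumptions, not the project's actual code: the toy tweets are invented, the parameter grids are guesses, and `k` is reduced to 20 because the toy vocabulary is tiny (the project selected the top 3,000 features). It will not reproduce the reported scores.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented toy data standing in for the Kaggle training set.
disaster = [
    "forest fire spreading near town",
    "earthquake destroys buildings downtown",
    "flood waters rising evacuate now",
    "wildfire forces evacuation of homes",
    "storm damage leaves thousands without power",
    "bridge collapsed after the earthquake",
]
other = [
    "the concert was ablaze with energy",
    "my new mixtape is fire",
    "this pizza is the bomb",
    "traffic was a disaster lol",
    "that game was a total wreck",
    "flooded with emails this morning",
]
texts = disaster + other
labels = [1] * len(disaster) + [0] * len(other)


def build_search(classifier, param_grid):
    """TF-IDF -> chi-square feature selection -> classifier, tuned with 5-fold CV."""
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2), max_features=10_000)),
        ("select", SelectKBest(chi2, k=20)),  # project setting: k=3000
        ("clf", classifier),
    ])
    return GridSearchCV(pipeline, param_grid, cv=5)


# Assumed hyperparameter grids; the source does not list the ones it searched.
searches = {
    "Naive Bayes": build_search(MultinomialNB(), {"clf__alpha": [0.1, 0.5, 1.0]}),
    "Logistic Regression": build_search(
        LogisticRegression(max_iter=1000), {"clf__C": [0.1, 1.0, 10.0]}
    ),
}

for name, search in searches.items():
    search.fit(texts, labels)
    print(name, search.best_params_, round(search.best_score_, 3))
    # Report on the training data, only to show the call; a real evaluation
    # would use a held-out split.
    print(classification_report(labels, search.predict(texts), zero_division=0))
```

Keeping the vectorizer and feature selector inside the `Pipeline` means they are refit on each training fold, so the cross-validation scores are not inflated by vocabulary or chi-square statistics leaking from the validation fold.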
- Naive Bayes outperforms Logistic Regression on every reported metric by roughly two to three points (for example, 85.9% vs. 83.7% accuracy).
- Both models perform well for short-text classification tasks such as tweets.
- Preprocessing is crucial to reduce noise commonly found in social media text.
- Naive Bayes performs strongly on short, sparse text, plausibly because its feature-independence assumption leaves few parameters to estimate from limited data.
- Logistic Regression is more stable on balanced feature distributions.
- POS tagging and NER help identify recurring locations, times, and organizations mentioned in disaster-related tweets.
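The independence assumption mentioned above can be made concrete with a small hand-rolled multinomial Naive Bayes using Laplace smoothing. This is an illustrative sketch only: the project presumably relied on a library implementation, and the function names and toy documents here are invented.

```python
import math
from collections import Counter


def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods per class."""
    vocab = {w for doc in docs for w in doc}
    log_prior, log_lik = {}, {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for d in class_docs for w in d)
        total = sum(counts.values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_lik


def predict_nb(tokens, log_prior, log_lik):
    """Pick the class with the highest posterior log score."""
    scores = {}
    for c in log_prior:
        # Independence assumption: the joint log-likelihood is a plain sum
        # of per-word terms. Words outside the vocabulary are skipped.
        scores[c] = log_prior[c] + sum(
            log_lik[c][w] for w in tokens if w in log_lik[c]
        )
    return max(scores, key=scores.get)


# Invented, already-tokenized toy tweets (1 = disaster, 0 = not a disaster).
docs = [
    ["forest", "ablaze"],
    ["wildfire", "forest", "evacuation"],
    ["concert", "ablaze"],
    ["concert", "tonight", "amazing"],
]
labels = [1, 1, 0, 0]

log_prior, log_lik = train_nb(docs, labels)
print(predict_nb(["forest", "ablaze"], log_prior, log_lik))   # → 1
print(predict_nb(["concert", "ablaze"], log_prior, log_lik))  # → 0
```

Because "ablaze" appears equally often in both classes, it contributes the same term to each score, and the decision rests on the surrounding words ("forest" or "concert").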
Clone the repository:

    git clone https://github.com/username/tweet-disaster-classification.git
    cd tweet-disaster-classification