Tweet Classification for Disaster Information

Natural Language Processing Project – Naive Bayes & Logistic Regression

This project was developed for academic purposes (NLP midterm) and can be used as a learning resource.

Project Description

Twitter is often a primary source of real-time information during disasters. However, not all tweets containing words such as “ablaze” or “fire” actually refer to real disaster events.

Examples:

"The concert was ablaze!" → Not a disaster
"The entire forest is ablaze." → Disaster

This project builds Naive Bayes and Logistic Regression models that predict whether a tweet refers to a real disaster.

Dataset

Source: Kaggle – Real or Not? Disaster Tweets
Total: 10,876 English tweets
- 7,613 training tweets (used in the analysis)
- 3,263 test tweets (not used because their labels are unavailable)

Labels:

1 → Tweet references a real disaster
0 → Tweet does not reference a disaster
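
The label encoding above can be checked against the data with a quick count. The following is a minimal sketch, not the notebook's code. It assumes the Kaggle training file has been saved locally as `train.csv` (a hypothetical path) and that the label is stored in a column named `target`; `label_counts` is a helper name introduced here for illustration.

```python
import csv
from collections import Counter


def label_counts(csv_file, label_column="target"):
    """Count how many rows carry each label in an already opened CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(int(row[label_column]) for row in reader)


# Usage, assuming the training file was saved locally as train.csv:
# with open("train.csv", newline="", encoding="utf-8") as f:
#     print(label_counts(f))
```

A strongly skewed count would suggest reading accuracy alongside precision and recall rather than on its own.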

Methodology

Exploratory Data Analysis (EDA) – tweet distribution, text length, frequent words
Preprocessing – lowercasing; removal of URLs, numbers, and punctuation; stopword removal; lemmatization; tokenization
Vectorization – TF-IDF with ngram_range=(1,2) and max_features=10,000
Feature Selection – Chi-Square (SelectKBest, top 3,000 features)
Modeling – Naive Bayes and Logistic Regression with GridSearchCV (5-fold cross-validation)
Evaluation – confusion matrix, precision, recall, F1-score, accuracy
Interpretability – error analysis, POS tagging, Named Entity Recognition (NER)
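
The vectorization, feature selection, and modeling steps above can be sketched as a single scikit-learn pipeline. This is a hedged illustration under stated assumptions, not the notebook's actual code:

- `clean_tweet` and `build_search` are names introduced here.
- `MultinomialNB` is assumed as the Naive Bayes variant.
- The hyperparameter grids and the `f1` scoring choice are placeholders.
- scikit-learn's built-in English stopword list stands in for whichever list the project used.
- Lemmatization is left out, because it needs an external library such as NLTK or spaCy.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")  # also drops digits and non-ASCII letters


def clean_tweet(text):
    """Lowercase, then remove URLs, numbers, and punctuation (no lemmatization)."""
    text = URL_RE.sub(" ", text.lower())
    text = NON_LETTER_RE.sub(" ", text)
    return " ".join(text.split())


def build_search(estimator, param_grid, k=3000, max_features=10_000, cv=5):
    """TF-IDF (unigrams + bigrams) -> chi-square selection -> classifier, tuned by CV."""
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(
            preprocessor=clean_tweet,   # replaces the default lowercasing step
            stop_words="english",       # assumption: built-in stopword list
            ngram_range=(1, 2),
            max_features=max_features,
        )),
        ("chi2", SelectKBest(chi2, k=k)),
        ("clf", estimator),
    ])
    return GridSearchCV(pipeline, param_grid, cv=cv, scoring="f1")


# Assumed grids; the values actually searched in the project are not stated.
nb_search = build_search(MultinomialNB(), {"clf__alpha": [0.1, 0.5, 1.0]})
lr_search = build_search(LogisticRegression(max_iter=1000), {"clf__C": [0.1, 1.0, 10.0]})
```

Calling `nb_search.fit(texts, labels)` on a list of raw tweets and their 0/1 labels runs the 5-fold search. Afterwards `nb_search.best_params_` holds the selected setting and `nb_search.predict(new_texts)` classifies unseen tweets. Keeping TF-IDF and chi-square selection inside the pipeline means both are refit on each training fold, so no information from the validation fold leaks into feature selection.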

Results

| Model | Precision | Recall | F1-Score | Accuracy |
|---|---|---|---|---|
| Logistic Regression | 0.84 | 0.83 | 0.83 | 83.7% |
| Naive Bayes | 0.86 | 0.85 | 0.86 | 85.9% |

Naive Bayes outperforms Logistic Regression on every evaluation metric, by two to three points (for example, 85.9% versus 83.7% accuracy).
Both models perform well for short-text classification tasks such as tweets.
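
For reference, the reported metrics all derive from the confusion matrix. The sketch below shows the arithmetic for a single positive class (label 1). `binary_metrics` is a name introduced here. The source does not say whether the table reports per-class, macro, or weighted averages, so this illustrates the definitions rather than reproducing the exact numbers above.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute confusion-matrix counts and derived scores for one class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "precision": precision, "recall": recall,
        "f1": f1, "accuracy": accuracy,
    }
```

In this task a false negative is a real disaster tweet that gets missed, which is why recall is worth reading next to accuracy.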

Insights

Preprocessing is crucial to reduce noise commonly found in social media text.
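Naive Bayes performs strongly on short, sparse text: under its conditional-independence assumption each word contributes to the class score separately, so relatively few parameters have to be estimated from limited data.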
Logistic Regression is more stable on balanced feature distributions.
POS tagging and NER help identify recurring locations, times, and organizations mentioned in disaster-related tweets.
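
To make the independence assumption behind the Naive Bayes insight concrete, the decision rule can be written out. This assumes the multinomial variant, which the source does not specify. Here x_i is the weight (count or TF-IDF value) of feature w_i in the tweet, V is the number of selected features, N_{ic} is the total weight of w_i in class c, N_c is the total weight of all features in class c, and α is the smoothing parameter.

```latex
\hat{y} = \arg\max_{c \in \{0, 1\}}
  \left[ \log P(c) + \sum_{i=1}^{V} x_i \log P(w_i \mid c) \right],
\qquad
P(w_i \mid c) = \frac{N_{ic} + \alpha}{N_c + \alpha V}
```

Because the score is a plain sum over features, only V smoothed probabilities per class need to be estimated, and no interaction terms between words are modeled.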

How to Run

Clone the repository

git clone https://github.com/username/tweet-disaster-classification.git
cd tweet-disaster-classification

