📧 Spam Message Classifier (Machine Learning Project)

An end-to-end Machine Learning project that classifies messages as Spam 🚨 or Not Spam ✅ using Natural Language Processing (NLP) techniques.

This project demonstrates the complete ML pipeline: data cleaning, feature engineering, model training, evaluation, and real-time prediction.

🚀 Project Overview

Spam messages are a common problem in emails and SMS. This project builds a machine learning model that can automatically detect whether a message is spam or not.

The model is trained on a large dataset of labeled messages. It converts each message into TF-IDF features and passes them to a classifier (Logistic Regression or Multinomial Naive Bayes) to make predictions.

📂 Dataset

Source: Kaggle Spam Dataset
Total samples: 83,448 messages
Features:
- label → 0 (Not Spam), 1 (Spam)
- text → Message content
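
A minimal sketch of loading a file with this two-column layout and checking it, assuming pandas and a CSV with label and text columns. The rows below are invented stand-ins for combined_data.csv so the snippet runs on its own; they are not taken from the real dataset.

```python
import io

import pandas as pd

# Invented sample rows standing in for combined_data.csv (assumed columns: label, text)
sample_csv = io.StringIO(
    "label,text\n"
    "1,Get this free offer today\n"
    "0,Hey are you coming to class?\n"
    "0,See you at lunch\n"
)
df = pd.read_csv(sample_csv)

# Confirm the expected schema and count missing values per column
assert list(df.columns) == ["label", "text"]
missing_per_column = df.isna().sum().to_dict()

# Count how many messages fall under each label (0 = Not Spam, 1 = Spam)
label_counts = df["label"].value_counts().to_dict()

print(missing_per_column)
print(label_counts)
```

With the real file, replacing sample_csv with the path to combined_data.csv should give the same kind of summary.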

🧠 Machine Learning Pipeline

1️⃣ Data Cleaning

Checked for missing values (none found)
Filtered valid labels (0 and 1)
Text preprocessing:
- Lowercasing
- Removing special characters
- Removing extra spaces
- Handling line-break noise (\n)
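
The text preprocessing steps above can be sketched with the standard library alone. The function name clean_text and the exact regular expressions are assumptions for illustration; the project's script may implement these steps differently.

```python
import re


def clean_text(text: str) -> str:
    """Lowercase a message, strip special characters and collapse whitespace."""
    text = text.lower()
    # Treat both literal "\n" sequences and real line breaks as spaces
    text = text.replace("\\n", " ")
    text = re.sub(r"[\r\n]+", " ", text)
    # Replace anything that is not a letter, digit or whitespace with a space
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse runs of whitespace and trim the ends
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("WIN a FREE prize!!!\nClick   here >>> now"))
# win a free prize click here now
```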

2️⃣ Feature Engineering

Converted text into numerical features using:
- TF-IDF Vectorization
Improvements applied:
- Stopword removal (stop_words='english')
- N-grams (1 to 3 words) for better context understanding
- Larger vocabulary (more TF-IDF features) for a richer representation
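
A sketch of a vectorizer configured along these lines, using scikit-learn's TfidfVectorizer. The stop_words and n-gram settings come from the list above; the max_features value is an assumed placeholder, since the actual feature size is not stated here, and the three messages are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented example messages
messages = [
    "get this free offer today",
    "hey are you coming to class",
    "free offer just for you today",
]

vectorizer = TfidfVectorizer(
    stop_words="english",  # drop common English words
    ngram_range=(1, 3),    # unigrams, bigrams and trigrams
    max_features=50_000,   # assumed cap; the project's real value is not documented
)
X = vectorizer.fit_transform(messages)

print(X.shape)
print(sorted(vectorizer.vocabulary_))
```

scikit-learn removes stop words before it builds n-grams, so a bigram such as "free offer" is formed from the words that remain after filtering.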

3️⃣ Model Training

Models explored:
- Logistic Regression (with class balancing)
- Multinomial Naive Bayes (well suited to word-count and TF-IDF text features)
Final model trained on processed data
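
A minimal sketch of fitting both candidate models on the same features, assuming scikit-learn pipelines. The toy messages are invented, and max_iter=1000 is an assumed setting rather than one taken from the project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy data (1 = Spam, 0 = Not Spam)
texts = [
    "free offer claim your prize now",
    "win cash with this exclusive free offer",
    "urgent winner claim your free reward",
    "are you coming to class tomorrow",
    "lunch at noon sounds good",
    "please send me the lecture notes",
]
labels = [1, 1, 1, 0, 0, 0]


def make_features():
    """Build a fresh vectorizer so each pipeline owns its own vocabulary."""
    return TfidfVectorizer(stop_words="english", ngram_range=(1, 3))


candidates = {
    "logistic_regression": make_pipeline(
        make_features(),
        LogisticRegression(class_weight="balanced", max_iter=1000),
    ),
    "multinomial_nb": make_pipeline(make_features(), MultinomialNB()),
}

predictions = {}
for name, model in candidates.items():
    model.fit(texts, labels)
    predictions[name] = int(model.predict(["claim your free prize"])[0])

print(predictions)
```

Wrapping the vectorizer and classifier in one pipeline keeps the same preprocessing at training and prediction time.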

4️⃣ Model Evaluation

📊 Performance Metrics:

Accuracy: 98.24%
Precision: 97.99%
Recall: 98.67%
F1 Score: 98.33%

📉 Confusion Matrix:

[[7761  177]
 [ 116 8636]]

🔍 Interpretation:

High precision → Few false positives (177 legitimate messages flagged as spam)
High recall → Most spam messages correctly detected (116 spam messages missed)
Balanced performance across all metrics
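
The reported metrics can be reproduced from the confusion matrix by hand. This sketch assumes scikit-learn's layout (rows are true labels, columns are predicted labels, in the order 0 then 1), which is consistent with the precision and recall quoted above. The matrix sums to 16,690 messages, roughly 20% of the 83,448 samples, which would fit a 20% hold-out set, although the split is not stated here.

```python
# Confusion matrix entries, assuming rows = true label and columns = predicted label
tn, fp = 7761, 177   # true Not Spam: predicted Not Spam / predicted Spam
fn, tp = 116, 8636   # true Spam:     predicted Not Spam / predicted Spam

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Evaluated messages: {total}")
print(f"Accuracy:  {accuracy:.2%}")
print(f"Precision: {precision:.2%}")
print(f"Recall:    {recall:.2%}")
print(f"F1 Score:  {f1:.2%}")
```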

🧪 Real-Time Prediction (User Input)

The model allows live testing through user input:

Example:

Enter a message: get this free offer today
Prediction: 🚨 Spam

Enter a message: hey are you coming to class?
Prediction: ✅ Not Spam
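
A sketch of how such a prompt loop could be wired to a fitted model. The helper names predict_message and run_cli are hypothetical, and the tiny model is trained on invented messages only so the snippet is self-contained; its predictions say nothing about the real model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

LABEL_NAMES = {0: "✅ Not Spam", 1: "🚨 Spam"}


def predict_message(model, message: str) -> str:
    """Return the display label for a single message."""
    label = int(model.predict([message])[0])
    return LABEL_NAMES[label]


def run_cli(model) -> None:
    """Prompt for messages until an empty line is entered."""
    while True:
        message = input("Enter a message: ").strip()
        if not message:
            break
        print("Prediction:", predict_message(model, message))


# Invented toy training data, only to make the sketch runnable
toy_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_model.fit(
    [
        "free offer win a prize today",
        "claim your free cash reward",
        "are you coming to class",
        "see you at lunch tomorrow",
    ],
    [1, 1, 0, 0],
)

print(predict_message(toy_model, "get this free offer today"))
```

Calling run_cli(toy_model) starts the interactive loop. User input should go through the same cleaning steps as the training text before prediction.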

⚠️ Observations & Learnings

The model sometimes gives different predictions for similar phrases:
- "free offer" → Spam ✅ (correct)
- "free deal" → Not Spam ❌ (misclassified)

Reason:

The model learns statistical patterns in word and n-gram frequencies, not the actual meaning of the language
Some words (like "deal") appear in both spam and normal messages, so they carry a weaker spam signal
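
One way to see this effect is to inspect the weight a linear model assigns to each n-gram. The sketch below is an illustration on invented messages, not the project's data or its actual weights: it fits a Logistic Regression on TF-IDF features and prints the coefficients for a few terms, where positive values push toward Spam and negative values toward Not Spam.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Invented toy data in which "deal" shows up in both classes (1 = Spam, 0 = Not Spam)
texts = [
    "free offer claim your prize now",
    "exclusive free offer win cash",
    "free deal on phones click here",
    "we closed the deal with the client",
    "great deal on lunch want to join",
    "are you coming to class tomorrow",
]
labels = [1, 1, 1, 0, 0, 0]

vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 3))
X = vectorizer.fit_transform(texts)
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, labels)

# Look up the learned coefficient for each term of interest
terms = ["free", "offer", "deal", "free offer", "free deal"]
weights = {t: float(clf.coef_[0][vectorizer.vocabulary_[t]]) for t in terms}

for term, weight in weights.items():
    print(f"{term!r}: {weight:+.3f}")
```

Running the same lookup on the real fitted model would show which n-grams drive a given prediction.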

🔧 Improvements Applied

To enhance performance, the following optimizations were implemented:

✅ Stopword removal
✅ N-gram feature expansion (1–3 words)
✅ Class imbalance handling (class_weight='balanced')
✅ Tried Multinomial Naive Bayes (a fast, common baseline for text classification)
✅ Cleaned noisy text (line breaks, formatting issues)

🏁 Limitations

Model does not fully understand semantic meaning
Some edge cases still misclassified
Performance depends heavily on dataset quality

💡 Future Improvements

Use richer text representations (such as word embeddings)
Hyperparameter tuning
Try Deep Learning models (LSTM, or transformers such as BERT)
Deploy as a web app
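
Of the items above, hyperparameter tuning is the most direct to try with the current stack. A minimal sketch using scikit-learn's GridSearchCV follows; the grid values, the step names tfidf and clf, the F1 scoring choice and the toy messages are all assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Invented toy data (1 = Spam, 0 = Not Spam)
texts = [
    "free offer claim your prize now",
    "exclusive free offer win cash",
    "urgent winner claim your free reward",
    "free cash prize click here",
    "are you coming to class tomorrow",
    "lunch at noon sounds good",
    "please send me the lecture notes",
    "see you after class today",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline(
    [
        ("tfidf", TfidfVectorizer(stop_words="english")),
        ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
    ]
)

# Assumed search space: n-gram range and regularization strength
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 3)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=2)
search.fit(texts, labels)

print(search.best_params_)
```

On the full dataset a larger cv value and a wider grid would be more informative than this two-fold toy run.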

🛠️ Tech Stack

Python 🐍
Pandas
NumPy
Scikit-learn
NLP (TF-IDF)

📁 Project Structure

spam_classifier/
├── spam_classifier.py
├── combined_data.csv
└── README.md

▶️ How to Run

Clone the repository:

git clone <your-repo-link>

Install dependencies:

pip install pandas numpy scikit-learn

Run the script:

python spam_classifier.py

🎯 Key Takeaway

This project is not just about building a model, but also about understanding:

How ML models behave
Why predictions can go wrong
How preprocessing impacts performance

⭐ Final Note

Even high-performing models are not perfect. The goal is to continuously improve, analyze mistakes, and learn from them.

🚀 This project marks a strong foundation in Machine Learning and NLP.
