An end-to-end Machine Learning project that classifies messages as Spam 🚨 or Not Spam ✅ using Natural Language Processing (NLP) techniques.
This project demonstrates the complete ML pipeline: data cleaning, feature engineering, model training, evaluation, and real-time prediction.
Spam messages are a common problem in emails and SMS. This project builds a machine learning model that can automatically detect whether a message is spam or not.
The model is built from a dataset of 83,448 labeled messages and uses TF-IDF vectorization followed by a classification algorithm to make predictions.
- Source: Kaggle Spam Dataset
- Total samples: 83,448 messages
- Features:
  - label → 0 (Not Spam), 1 (Spam)
  - text → Message content
- Removed missing values (none found)
- Filtered valid labels (0 and 1)
- Text preprocessing:
  - Lowercasing
  - Removing special characters
  - Removing extra spaces
  - Handling line-break noise (\n)
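The preprocessing steps listed above could look roughly like the sketch below. The function name `clean_text` is hypothetical, and two details are assumptions rather than facts from the project: that "line-break noise" includes literal backslash-n sequences left in the raw text, and that special characters are replaced by a space rather than deleted.

```python
import re


def clean_text(text: str) -> str:
    """Normalize one message (hypothetical helper, not the project's own code)."""
    text = text.lower()                        # lowercasing
    text = text.replace("\\n", " ")            # assumption: literal "\n" sequences are noise
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # replace special characters with a space
    text = re.sub(r"\s+", " ", text)           # collapse spaces and real line breaks
    return text.strip()


print(clean_text("Hello,\nWORLD!!  Free $$$"))
```

Replacing special characters with a space instead of an empty string keeps words such as "free!!!offer" from being glued together.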
- Converted text into numerical features using:
  - TF-IDF Vectorization
- Improvements applied:
  - Stopword removal (stop_words='english')
  - N-grams (1 to 3 words) for better context understanding
  - Increased feature size for richer representation
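A minimal sketch of a vectorizer configured with the improvements listed above, using scikit-learn's `TfidfVectorizer`. The `max_features` value is a placeholder, since the actual feature size is not stated here, and the two-message corpus is made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    stop_words="english",   # drop common English stopwords
    ngram_range=(1, 3),     # unigrams, bigrams and trigrams
    max_features=50_000,    # hypothetical cap; the real value is not documented
)

# Toy corpus, not taken from the real dataset.
corpus = ["claim your free prize now", "see you at lunch tomorrow"]
X = vectorizer.fit_transform(corpus)

print(X.shape)
print(sorted(vectorizer.vocabulary_))
```

Stopwords are removed before n-grams are built, so "claim your free prize" yields the trigram "claim free prize".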
- Models explored:
  - Logistic Regression (with class balancing)
  - Multinomial Naive Bayes (optimized for text classification)
- Final model trained on processed data
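The two models above can be compared with the same TF-IDF front end. The following is a sketch under assumptions: `build_models` is a hypothetical helper, the toy messages are invented, and `max_iter=1000` is an illustrative setting rather than a documented one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data for illustration only (1 = spam, 0 = not spam).
texts = [
    "get this free offer today",
    "win a free prize now",
    "free cash offer click here",
    "hey are you coming to class",
    "lunch tomorrow with the team",
    "please send me the lecture notes",
]
labels = [1, 1, 1, 0, 0, 0]


def build_models():
    """Return both candidate pipelines (hypothetical helper)."""
    return {
        "logistic_regression": make_pipeline(
            TfidfVectorizer(stop_words="english", ngram_range=(1, 3)),
            LogisticRegression(class_weight="balanced", max_iter=1000),
        ),
        "multinomial_nb": make_pipeline(
            TfidfVectorizer(stop_words="english", ngram_range=(1, 3)),
            MultinomialNB(),
        ),
    }


predictions = {}
for name, model in build_models().items():
    model.fit(texts, labels)
    predictions[name] = model.predict(["free prize offer", "notes from class"])
    print(name, predictions[name])
```

Wrapping the vectorizer and classifier in one pipeline means raw text goes in at prediction time, and the same transformation used in training is applied automatically.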
- Accuracy: 98.24%
- Precision: 97.99%
- Recall: 98.67%
- F1 Score: 98.33%
Confusion matrix (rows: actual, columns: predicted; class order: Not Spam, Spam):
[[7761  177]
 [ 116 8636]]
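The reported metrics can be re-derived from the confusion matrix. The sketch below assumes the scikit-learn layout (rows are actual classes, columns are predicted classes, class 0 first), which is the layout under which the reported precision and recall both come out right. The 16,690 evaluated messages are about 20% of 83,448, which is consistent with an 80/20 train/test split, although the split is not stated explicitly.

```python
# Cells of the confusion matrix, assuming [[TN, FP], [FN, TP]].
tn, fp = 7761, 177
fn, tp = 116, 8636

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)          # of messages flagged as spam, how many were spam
recall = tp / (tp + fn)             # of actual spam, how much was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"Evaluated messages: {total}")
print(f"Accuracy:  {accuracy:.2%}")
print(f"Precision: {precision:.2%}")
print(f"Recall:    {recall:.2%}")
print(f"F1 Score:  {f1:.2%}")
```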
- High precision → Few false positives
- High recall → Most spam messages correctly detected
- Balanced performance across all metrics
The model allows live testing through user input:
Enter a message: get this free offer today
Prediction: 🚨 Spam
Enter a message: hey are you coming to class?
Prediction: ✅ Not Spam
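A prediction helper behind a session like the one above might be sketched as follows. The names `predict_message` and `LABEL_NAMES` are hypothetical, and the tiny model is fitted on invented messages only so that the example runs on its own; in the project the trained model would be used instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

LABEL_NAMES = {0: "✅ Not Spam", 1: "🚨 Spam"}


def predict_message(model, message: str) -> str:
    """Classify one raw message and format the result (hypothetical helper)."""
    label = int(model.predict([message])[0])
    return f"Prediction: {LABEL_NAMES[label]}"


# Stand-in model trained on toy data, for illustration only.
toy_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_model.fit(
    ["get this free offer today", "win a free prize now",
     "hey are you coming to class", "see you at lunch tomorrow"],
    [1, 1, 0, 0],
)

result = predict_message(toy_model, "get this free offer today")
print(result)
```

For live testing, the same function can be called inside a loop that reads each message with `input("Enter a message: ")`.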
- The model sometimes misclassifies similar phrases:
  - "free offer" → Spam ✅
  - "free deal" → Not Spam ❌
- Machine learning models learn from data patterns, not actual language meaning
- Some words (like "deal") may appear in both spam and normal messages
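One way to see why a word like "deal" gives weaker evidence than "offer" is to count how often each word appears per class. The sketch below uses a handful of invented messages, not the real dataset, and `word_counts_by_label` is a hypothetical helper.

```python
import re
from collections import Counter

# Hypothetical toy messages (1 = spam, 0 = not spam).
toy_rows = [
    ("free offer just for you", 1),
    ("claim your free offer now", 1),
    ("great deal ends tonight", 1),
    ("did you get a good deal on the car", 0),
    ("deal, see you at five", 0),
    ("are you free tonight", 0),
]


def word_counts_by_label(rows):
    """Count word occurrences separately for each label."""
    counts = {0: Counter(), 1: Counter()}
    for text, label in rows:
        counts[label].update(re.findall(r"[a-z]+", text.lower()))
    return counts


counts = word_counts_by_label(toy_rows)
for word in ("offer", "deal", "free"):
    print(f"{word}: spam={counts[1][word]}, not spam={counts[0][word]}")
```

In this toy sample "offer" occurs only in spam, while "deal" occurs in both classes, so a model that relies on such frequencies has less reason to flag "free deal" than "free offer".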
To enhance performance, the following optimizations were implemented:
- ✅ Stopword removal
- ✅ N-gram feature expansion (1–3 words)
- ✅ Class imbalance handling (class_weight='balanced')
- ✅ Tried Naive Bayes (better for text classification)
- ✅ Cleaned noisy text (line breaks, formatting issues)
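For context on the class imbalance option, scikit-learn's "balanced" mode weights each class by n_samples / (n_classes * class_count). The sketch below applies that formula to the class counts visible in the test-set confusion matrix (7,938 Not Spam, 8,752 Spam). The training-set counts are not stated, so these numbers are illustrative only, and `balanced_class_weights` is a hypothetical helper.

```python
def balanced_class_weights(class_counts: dict) -> dict:
    """Weights as computed by the 'balanced' heuristic: n / (k * n_c)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Counts taken from the confusion matrix rows (test split, illustrative).
weights = balanced_class_weights({0: 7761 + 177, 1: 116 + 8636})
for label, weight in weights.items():
    print(f"class {label}: weight {weight:.3f}")
```

The rarer class receives the larger weight, so its errors count more during training; with classes this close in size the two weights stay near 1.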
- Model does not fully understand semantic meaning
- Some edge cases still misclassified
- Performance depends heavily on dataset quality
- Use advanced NLP models (like word embeddings or transformers)
- Hyperparameter tuning
- Try Deep Learning models (LSTM, BERT)
- Deploy as a web app
- Python 🐍
- Pandas
- NumPy
- Scikit-learn
- NLP (TF-IDF)
spam_classifier/
├── spam_classifier.py
├── combined_data.csv
└── README.md
- Clone the repository:
git clone <your-repo-link>
- Install dependencies:
pip install pandas numpy scikit-learn
- Run the script:
python spam_classifier.py
This project is not just about building a model, but understanding:
- How ML models behave
- Why predictions can go wrong
- How preprocessing impacts performance
Even high-performing models are not perfect. The goal is to continuously improve, analyze mistakes, and learn from them.
🚀 This project marks a strong foundation in Machine Learning and NLP.