aeindri-tech/Spam_Claasifier

📧 Spam Message Classifier (Machine Learning Project)

An end-to-end Machine Learning project that classifies messages as Spam 🚨 or Not Spam ✅ using Natural Language Processing (NLP) techniques.

This project demonstrates the complete ML pipeline: data cleaning, feature engineering, model training, evaluation, and real-time prediction.


🚀 Project Overview

Spam messages are a common problem in emails and SMS. This project builds a machine learning model that can automatically detect whether a message is spam or not.

The model is trained on a large dataset of labeled messages and uses TF-IDF vectorization + classification algorithms to make predictions.


📂 Dataset

  • Source: Kaggle Spam Dataset

  • Total samples: 83,448 messages

  • Features:

    • label → 0 (Not Spam), 1 (Spam)
    • text → Message content

🧠 Machine Learning Pipeline

1️⃣ Data Cleaning

  • Checked for missing values (none found)

  • Filtered valid labels (0 and 1)

  • Text preprocessing:

    • Lowercasing
    • Removing special characters
    • Removing extra spaces
    • Handling line-break noise (\n)
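The text preprocessing steps above can be sketched as a single function. This is a minimal illustration, not the project's actual code: the name `clean_text`, the choice to keep digits, and the exact regular expressions are assumptions.

```python
import re


def clean_text(text: str) -> str:
    """Normalize a raw message (hypothetical helper, not from the repo)."""
    # Lowercasing
    text = text.lower()
    # Line-break noise: literal backslash-n sequences left over in the data
    text = text.replace("\\n", " ")
    # Special characters: keep only letters, digits, and whitespace
    # (keeping digits is an assumption; the README does not say)
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Extra spaces: collapse runs of whitespace, including real newlines
    text = re.sub(r"\s+", " ", text)
    return text.strip()


cleaned = clean_text("  FREE!!! Offer\nToday  ")
```

Applied to a whole dataset, a function like this would typically be mapped over the `text` column before vectorization.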

2️⃣ Feature Engineering

  • Converted text into numerical features using:

    • TF-IDF Vectorization
  • Improvements applied:

    • Stopword removal (stop_words='english')
    • N-grams (1 to 3 words) for better context understanding
    • Larger feature vocabulary for a richer representation
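A vectorizer configured along these lines might look like the sketch below. `stop_words='english'` and the 1 to 3 word n-gram range come from the list above; the `max_features` value is an assumption, since the README does not state the feature size that was used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    stop_words="english",   # stopword removal, as listed above
    ngram_range=(1, 3),     # unigrams, bigrams, and trigrams
    max_features=50_000,    # assumed cap; the actual value is not given
)

# Two toy messages, only to show the shape of the output
sample = ["get this free offer today", "hey are you coming to class"]
X = vectorizer.fit_transform(sample)

# Each row of X is one message; each column is one word or n-gram
vocabulary = vectorizer.vocabulary_
```

With n-grams enabled, a phrase such as "free offer" becomes its own feature alongside the single words "free" and "offer".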

3️⃣ Model Training

  • Models explored:

    • Logistic Regression (with class balancing)
    • Multinomial Naive Bayes (a common baseline for text classification)
  • Final model trained on processed data
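As a rough sketch of how the two models could be set up with scikit-learn: the toy messages, the `build_pipeline` helper, and `max_iter=1000` are assumptions for illustration. The README does not say which model was kept as the final one, so both are fitted here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data standing in for the real dataset (1 = Spam, 0 = Not Spam)
texts = [
    "win a free prize now",
    "claim your free offer today",
    "exclusive cash reward waiting",
    "are you coming to class",
    "see you at lunch tomorrow",
    "please send the meeting notes",
]
labels = [1, 1, 1, 0, 0, 0]


def build_pipeline(classifier):
    """Hypothetical helper: TF-IDF features followed by a classifier."""
    return make_pipeline(
        TfidfVectorizer(stop_words="english", ngram_range=(1, 3)),
        classifier,
    )


log_reg = build_pipeline(
    LogisticRegression(class_weight="balanced", max_iter=1000)
)
naive_bayes = build_pipeline(MultinomialNB())

for model in (log_reg, naive_bayes):
    model.fit(texts, labels)

log_reg_predictions = log_reg.predict(texts)
naive_bayes_predictions = naive_bayes.predict(texts)
```

Wrapping the vectorizer and classifier in one pipeline means raw strings can be passed to `predict` directly, which is convenient for the user-input testing described later.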


4️⃣ Model Evaluation

📊 Performance Metrics:

  • Accuracy: 98.24%
  • Precision: 97.99%
  • Recall: 98.67%
  • F1 Score: 98.33%

📉 Confusion Matrix:

[[7761  177]
 [ 116 8636]]

🔍 Interpretation:

  • High precision → Few false positives (177 legitimate messages flagged as spam)
  • High recall → Most spam messages correctly detected (116 spam messages missed)
  • Balanced performance across all metrics
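The reported metrics can be reproduced directly from the confusion matrix. The sketch below assumes the scikit-learn layout (rows are true labels, columns are predicted labels, ordered 0 then 1), which is the reading that matches the reported precision and recall.

```python
# Confusion matrix from above, read as [[TN, FP], [FN, TP]]
tn, fp = 7761, 177
fn, tp = 116, 8636

total = tn + fp + fn + tp  # number of evaluated messages

accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of messages flagged as spam, how many were spam
recall = tp / (tp + fn)      # of actual spam, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

for name, value in [
    ("Accuracy", accuracy),
    ("Precision", precision),
    ("Recall", recall),
    ("F1 Score", f1),
]:
    print(f"{name}: {value * 100:.2f}%")
```

The matrix covers 16,690 messages, roughly 20% of the 83,448 samples, which would be consistent with an 80/20 train/test split, although the README does not state the split.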

🧪 Real-Time Prediction (User Input)

The model allows live testing through user input:

Example:

Enter a message: get this free offer today
Prediction: 🚨 Spam

Enter a message: hey are you coming to class?
Prediction: ✅ Not Spam
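One way such a prompt could be wired up is sketched below. The function names and the `KeywordStub` class are hypothetical; the stub only stands in for a trained model so the example runs on its own, and any object with a scikit-learn style `predict` method that accepts raw text (such as a fitted pipeline) could be passed instead.

```python
def format_prediction(label: int) -> str:
    """Map a numeric label to the display strings used above."""
    return "🚨 Spam" if label == 1 else "✅ Not Spam"


def predict_message(model, message: str) -> str:
    """Classify one message with any object exposing predict(list_of_texts)."""
    label = int(model.predict([message])[0])
    return format_prediction(label)


def prompt_loop(model) -> None:
    """Read messages until an empty line is entered (not called here)."""
    while True:
        message = input("Enter a message: ").strip()
        if not message:
            break
        print("Prediction:", predict_message(model, message))


class KeywordStub:
    """Stand-in for a trained model; NOT the project's classifier."""

    def predict(self, messages):
        return [1 if "free" in m.lower() else 0 for m in messages]


stub = KeywordStub()
spam_result = predict_message(stub, "get this free offer today")
ham_result = predict_message(stub, "hey are you coming to class?")
```

Calling `prompt_loop(model)` with a fitted model would give an interactive session like the one shown above.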

⚠️ Observations & Learnings

  • The model sometimes misclassifies similar phrases:

    • "free offer" → predicted Spam (correct ✅)
    • "free deal" → predicted Not Spam (incorrect ❌, expected Spam)

Reason:

  • Machine learning models learn from data patterns, not actual language meaning
  • Some words (like "deal") may appear in both spam and normal messages

🔧 Improvements Applied

To enhance performance, the following optimizations were implemented:

  • ✅ Stopword removal
  • ✅ N-gram feature expansion (1–3 words)
  • ✅ Class imbalance handling (class_weight='balanced')
  • ✅ Tried Naive Bayes (often a strong baseline for text classification)
  • ✅ Cleaned noisy text (line breaks, formatting issues)

🏁 Limitations

  • Model does not fully understand semantic meaning
  • Some edge cases still misclassified
  • Performance depends heavily on dataset quality

💡 Future Improvements

  • Use advanced NLP models (like word embeddings or transformers)
  • Hyperparameter tuning
  • Try Deep Learning models (LSTM, BERT)
  • Deploy as a web app

🛠️ Tech Stack

  • Python 🐍
  • Pandas
  • NumPy
  • Scikit-learn
  • NLP (TF-IDF)

📁 Project Structure

spam_classifier/
│── spam_classifier.py
│── combined_data.csv
│── README.md

▶️ How to Run

  1. Clone the repository:
     git clone <your-repo-link>
  2. Install dependencies:
     pip install pandas numpy scikit-learn
  3. Run the script:
     python spam_classifier.py

🎯 Key Takeaway

This project is not just about building a model, but also about understanding:

  • How ML models behave
  • Why predictions can go wrong
  • How preprocessing impacts performance

⭐ Final Note

Even high-performing models are not perfect. The goal is to continuously improve, analyze mistakes, and learn from them.


🚀 This project marks a strong foundation in Machine Learning and NLP.
