This project uses machine learning to detect phishing (scam) emails and help prevent email-based fraud.
It demonstrates how a model can be trained on email text to classify each email as legitimate or phishing.
This project uses a CSV dataset named phishing_dataset.csv.
Columns:
- email_text: The body of the email
- label: 1 for phishing, 0 for legitimate
Note: Ensure your dataset is placed inside the /data folder.
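As a rough illustration of the dataset layout described above, the sketch below loads a CSV with pandas and checks for the two documented columns. The helper name `load_dataset` and the in-memory sample rows are hypothetical and exist only for this example; they are not part of the project.

```python
import io

import pandas as pd

REQUIRED_COLUMNS = ["email_text", "label"]


def load_dataset(source):
    """Load the phishing CSV (path or file-like object) and validate its columns.

    Hypothetical helper for illustration only.
    """
    df = pd.read_csv(source)
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Dataset is missing columns: {missing}")
    # Drop rows that have no email body or no label.
    df = df.dropna(subset=REQUIRED_COLUMNS).reset_index(drop=True)
    df = df.assign(label=df["label"].astype(int))
    # Labels are expected to be 1 (phishing) or 0 (legitimate).
    if not df["label"].isin([0, 1]).all():
        raise ValueError("label must be 0 or 1")
    return df


# Made-up sample standing in for phishing_dataset.csv; the third row has no text.
sample_csv = io.StringIO(
    "email_text,label\n"
    '"Verify your account now or it will be suspended",1\n'
    '"Lunch at noon tomorrow?",0\n'
    ",1\n"
)
df = load_dataset(sample_csv)
print(df["label"].value_counts())
```

With the real file, the call would presumably look like `load_dataset("data/phishing_dataset.csv")`, assuming the data folder sits next to the script.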
- Machine Learning:
  - Logistic Regression, Naive Bayes, Random Forest, etc.
- Libraries & Frameworks:
  - Scikit-learn for model training and evaluation
  - Pandas for data manipulation
  - NumPy for numerical computations
  - Matplotlib & Seaborn for data visualization
- Text Processing:
  - TF-IDF for text vectorization
  - NLTK or spaCy for natural language processing (optional)
- Text vectorization: TF-IDF
- Classifier: Logistic Regression
- Evaluation Metrics: Accuracy, confusion matrix, precision, recall
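The combination listed above (TF-IDF features, a Logistic Regression classifier, and accuracy, confusion matrix, precision, and recall for evaluation) could be wired together roughly as follows. This is a minimal sketch using scikit-learn, not the project's actual training script; the toy emails, the 75/25 split, and the hyperparameters are assumptions chosen so that the example runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up toy emails standing in for the email_text column.
phishing = [
    "Verify your account now or it will be suspended",
    "You have won a prize, click here to claim it",
    "Urgent: confirm your password to avoid account closure",
    "Your bank account is locked, login here to restore access",
    "Claim your free gift card by clicking this link",
    "Security alert: update your payment details immediately",
    "Final notice: your account will be closed, verify now",
    "Click here to confirm your identity and unlock your account",
]
legitimate = [
    "Are we still meeting for lunch tomorrow at noon?",
    "Attached is the agenda for Monday's project meeting",
    "Thanks for sending the report, I will review it today",
    "Can you share the slides from yesterday's presentation?",
    "The team dinner is moved to Friday evening",
    "Here are the notes from our call this morning",
    "Please review the draft and send me your comments",
    "Happy birthday! Hope you have a great day",
]
texts = phishing + legitimate
labels = [1] * len(phishing) + [0] * len(legitimate)  # 1 = phishing, 0 = legitimate

# Hold out a stratified test set so both classes appear in it.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# The pipeline fits TF-IDF on the training text only, then trains the classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])  # rows: true, columns: predicted
prec = precision_score(y_test, y_pred, zero_division=0)
rec = recall_score(y_test, y_pred, zero_division=0)

print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f}")
print(cm)
```

Wrapping the vectorizer and classifier in one pipeline keeps the TF-IDF vocabulary from being fitted on test emails. Scores on a toy sample this small say nothing about real-world performance.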
You can extend this by trying other ML models, such as:
- Naive Bayes
- Random Forest
- SVM
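One possible way to try the alternatives listed above is to swap the classifier inside the same TF-IDF pipeline and compare cross-validated accuracy. The sketch below assumes scikit-learn's `MultinomialNB`, `RandomForestClassifier`, and `LinearSVC` as the three models; the toy emails and the 4-fold setting are made up for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up toy emails; replace with the email_text and label columns in practice.
phishing = [
    "Verify your account now or it will be suspended",
    "You have won a prize, click here to claim it",
    "Urgent: confirm your password to avoid account closure",
    "Your bank account is locked, login here to restore access",
    "Claim your free gift card by clicking this link",
    "Security alert: update your payment details immediately",
    "Final notice: your account will be closed, verify now",
    "Click here to confirm your identity and unlock your account",
]
legitimate = [
    "Are we still meeting for lunch tomorrow at noon?",
    "Attached is the agenda for Monday's project meeting",
    "Thanks for sending the report, I will review it today",
    "Can you share the slides from yesterday's presentation?",
    "The team dinner is moved to Friday evening",
    "Here are the notes from our call this morning",
    "Please review the draft and send me your comments",
    "Happy birthday! Hope you have a great day",
]
texts = phishing + legitimate
labels = [1] * len(phishing) + [0] * len(legitimate)

candidates = {
    "Naive Bayes": MultinomialNB(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": LinearSVC(),
}

results = {}
for name, classifier in candidates.items():
    # Each candidate gets its own TF-IDF step so folds stay independent.
    pipeline = make_pipeline(TfidfVectorizer(), classifier)
    scores = cross_val_score(pipeline, texts, labels, cv=4, scoring="accuracy")
    results[name] = scores.mean()

for name, mean_accuracy in results.items():
    print(f"{name}: mean accuracy {mean_accuracy:.2f}")
```

Because phishing detection usually cares about missed phishing emails as much as overall accuracy, passing `scoring="recall"` or `scoring="f1"` may give a more useful comparison on the real dataset.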
- Experiment with deep learning models (LSTM, BERT)
- Real-time email scanning via API or web app (Flask/Streamlit)
- Integration with Gmail API for real-time inbox monitoring
- Deployment with Docker or as a web service