A complete end-to-end machine learning project predicting customer churn using demographic, transactional, and behavioural data.
| Item | Detail |
|---|---|
| Problem | Binary classification — predict if a customer will churn |
| Dataset | 1,000 customers across 5 data sources |
| Best Model | Logistic Regression (Tuned) |
| Best Recall | 46.3% |
| Best F1 Score | 27.9% |
| Best AUC-ROC | 52.9% (Random Forest) |
```
churn_prediction_project/
├── data/
│   └── raw/                              ← original data (not uploaded)
├── notebooks/
│   ├── 01_EDA_Churn.ipynb                ← exploratory data analysis
│   └── 02_Preprocessing_Modelling.ipynb  ← ML pipeline
├── reports/
│   ├── 01_churn_distribution.png
│   ├── 02_age_analysis.png
│   ├── 03_categorical_churn.png
│   ├── 04_transaction_analysis.png
│   ├── 05_online_activity_churn.png
│   ├── 06_service_churn.png
│   ├── 07_correlation_heatmap.png
│   ├── 08_model_comparison.png
│   └── 09_feature_importance.png
├── .gitignore
├── README.md
└── requirements.txt
```
- Dataset is 80/20 imbalanced — 796 stayed, 204 churned
- LoginFrequency has the strongest linear correlation with churn, though the relationship is weak (r = -0.08)
- TotalSpent and NumTransactions have 0.90 correlation — multicollinearity detected
- 332 customers never contacted customer service (structural zeros)
- Categorical features (Gender, MaritalStatus) show weak churn signal
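The imbalance and multicollinearity checks above can be sketched in a few lines of pandas. The tiny frame below is invented for illustration; only the column names (`Churn`, `LoginFrequency`, `TotalSpent`, `NumTransactions`) mirror the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the real customer data (values invented).
df = pd.DataFrame({
    "Churn":           [0, 0, 0, 0, 1],
    "LoginFrequency":  [30, 25, 28, 22, 5],
    "TotalSpent":      [500, 400, 450, 380, 100],
    "NumTransactions": [50, 40, 46, 39, 10],
})

# Class balance: a split like 80/20 warrants class weighting or
# resampling before modelling.
balance = df["Churn"].value_counts(normalize=True)

# Pairwise correlation flags multicollinearity; one feature of a
# highly correlated pair (e.g. TotalSpent vs NumTransactions) can be dropped.
corr = df[["TotalSpent", "NumTransactions", "LoginFrequency"]].corr()

print(balance)
print(corr.round(2))
```

The same two checks on the full dataset are what surfaced the 80/20 imbalance and the 0.90 TotalSpent/NumTransactions correlation reported above.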
| Model | Accuracy | Precision | Recall | F1 Score | AUC-ROC |
|---|---|---|---|---|---|
| Logistic Regression | 51.0% | 20.0% | 46.3% | 27.9% | 47.4% |
| Random Forest | 63.0% | 20.0% | 26.8% | 22.9% | 52.9% |
| XGBoost | 55.5% | 14.7% | 24.4% | 18.3% | 47.6% |
- Highest Recall (46.3%) — catches nearly half of actual churners, the most of any model tested
- No overfitting — Train-Test gap < 10%
- Interpretable — coefficient-level explainability supports the model transparency expected under FCA banking regulation
- Random Forest and XGBoost showed severe overfitting before tuning
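As a sketch of how the five metrics in the comparison table are produced, the snippet below trains a class-weighted Logistic Regression on synthetic data (a single invented login-frequency feature, not the project's real pipeline) and scores a held-out split. `class_weight="balanced"` is one plausible tuning choice that trades precision for recall, consistent with the tuned model leading on recall above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

rng = np.random.default_rng(42)

# Synthetic stand-in: ~80/20 imbalance, churners (y=1) log in less.
n = 1000
y = (rng.random(n) < 0.2).astype(int)
X = rng.normal(20 - 10 * y, 6, n).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Balanced class weights up-weight the minority churn class.
model = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)
pred = model.predict(X_te)
proba = model.predict_proba(X_te)[:, 1]

print(f"Accuracy:  {accuracy_score(y_te, pred):.3f}")
print(f"Precision: {precision_score(y_te, pred):.3f}")
print(f"Recall:    {recall_score(y_te, pred):.3f}")
print(f"F1:        {f1_score(y_te, pred):.3f}")
print(f"AUC-ROC:   {roc_auc_score(y_te, proba):.3f}")
```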
| Rank | Feature | Direction | Business Meaning |
|---|---|---|---|
| 1 | LoginFrequency | ↓ decreases churn | More logins = loyal customer |
| 2 | AvgSpent | ↑ increases churn | Higher spenders still churn |
| 3 | DaysSinceLogin | ↑ increases churn | Inactivity = churn signal |
| 4 | IncomeLevel_Low | ↑ increases churn | Price-sensitive segment |
| 5 | MainInteractionType_Complaint | ↑ increases churn | Complaints predict leaving |
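A ranking like the table above can be read straight off a fitted Logistic Regression: standardise the features so coefficient magnitudes are comparable, then rank by absolute coefficient, with the sign giving direction. The frame below is a toy example; feature names echo the table but the values are invented.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Toy data: churners (y=1) log in rarely, spend more, and lapse longer.
X = pd.DataFrame({
    "LoginFrequency": [30, 25, 28, 22, 5, 8, 26, 4],
    "AvgSpent":       [50, 40, 45, 38, 90, 85, 42, 95],
    "DaysSinceLogin": [2, 3, 1, 4, 40, 35, 2, 50],
})
y = np.array([0, 0, 0, 0, 1, 1, 0, 1])

# Standardising first makes coefficient magnitudes comparable as
# rough importances.
Xs = StandardScaler().fit_transform(X)
model = LogisticRegression().fit(Xs, y)

importance = (pd.Series(model.coef_[0], index=X.columns)
              .sort_values(key=np.abs, ascending=False))
# Negative coefficients decrease churn odds; positive ones increase them.
print(importance)
```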
```bash
git clone https://github.com/YOUR_USERNAME/churn_prediction_project.git
cd churn_prediction_project
python -m venv venv
venv\Scripts\activate       # Windows
source venv/bin/activate    # Mac/Linux
pip install -r requirements.txt
```
Place Customer_Churn_Data_Large.xlsx in data/raw/
01_EDA_Churn.ipynb → EDA and findings
02_Preprocessing_Modelling.ipynb → ML pipeline
| Tool | Purpose |
|---|---|
| Python 3.13 | Core language |
| Pandas 3.0.1 | Data manipulation |
| NumPy 2.4.3 | Numerical computing |
| Matplotlib 3.10 | Visualisation |
| Seaborn 0.13.2 | Statistical plots |
| Scikit-learn 1.8.0 | ML models and preprocessing |
| XGBoost 3.2.0 | Gradient boosting |
| Jupyter Notebook | Development environment |
- Deploy Logistic Regression for monthly customer scoring
- Three-tier risk segmentation — High/Medium/Low risk
- Estimated impact — £14M+ annual revenue protected at scale
- Key action — Re-engage customers with <15 logins/month
- Data improvement — Add account tenure and product holdings
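The three-tier segmentation recommended above reduces to bucketing each customer's predicted churn probability. A minimal sketch, assuming monthly scores from the deployed model; the 0.3 / 0.6 cut-offs are illustrative and would in practice be calibrated against retention-campaign capacity.

```python
import pandas as pd

# Hypothetical churn probabilities from a monthly scoring run.
scores = pd.Series([0.05, 0.12, 0.35, 0.48, 0.71, 0.90],
                   index=["C001", "C002", "C003", "C004", "C005", "C006"])

# Three risk tiers; bin edges are assumptions, not project outputs.
tiers = pd.cut(scores, bins=[0.0, 0.3, 0.6, 1.0],
               labels=["Low", "Medium", "High"])
print(tiers)
```

High-tier customers would be the first targets for the re-engagement action above (e.g. customers with <15 logins/month).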
Samitha Sandaruwan
Aspiring Data Scientist
GitHub
Project developed as part of Lloyds Banking Group Data Science Job Simulation.