
# Customer Churn Prediction — Lloyds Banking Group

A complete end-to-end machine learning project predicting customer churn using demographic, transactional, and behavioural data.


## Project Overview

| Item | Detail |
| --- | --- |
| Problem | Binary classification — predict if a customer will churn |
| Dataset | 1,000 customers across 5 data sources |
| Best Model | Logistic Regression (tuned) |
| Best Recall | 46.3% |
| Best F1 Score | 27.9% |
| Best AUC-ROC | 52.9% (Random Forest) |

## Project Structure

```text
churn_prediction_project/
├── data/
│   └── raw/                    ← original data (not uploaded)
├── notebooks/
│   ├── 01_EDA_Churn.ipynb      ← exploratory data analysis
│   └── 02_Preprocessing_Modelling.ipynb ← ML pipeline
├── reports/
│   ├── 01_churn_distribution.png
│   ├── 02_age_analysis.png
│   ├── 03_categorical_churn.png
│   ├── 04_transaction_analysis.png
│   ├── 05_online_activity_churn.png
│   ├── 06_service_churn.png
│   ├── 07_correlation_heatmap.png
│   ├── 08_model_comparison.png
│   └── 09_feature_importance.png
├── .gitignore
├── README.md
└── requirements.txt
```

## Key Findings

### EDA Findings

- Dataset is 80/20 imbalanced — 796 stayed, 204 churned
- LoginFrequency is the strongest single predictor, though the correlation itself is weak (-0.08)
- TotalSpent and NumTransactions have a 0.90 correlation — multicollinearity detected
- 332 customers never contacted customer service (structural zeros)
- Categorical features (Gender, MaritalStatus) show a weak churn signal
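The imbalance and multicollinearity checks above are easy to reproduce with pandas. This is a minimal sketch on synthetic stand-in data — the column names (`ChurnStatus`, `LoginFrequency`, `NumTransactions`, `TotalSpent`) are assumed from the findings, not read from the real workbook:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset (1,000 rows, ~80/20 class split)
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "ChurnStatus": rng.choice([0, 1], size=n, p=[0.8, 0.2]),
    "LoginFrequency": rng.integers(0, 40, size=n),
    "NumTransactions": rng.integers(1, 100, size=n),
})
# TotalSpent built from NumTransactions, so the two are strongly collinear
df["TotalSpent"] = df["NumTransactions"] * rng.normal(25, 3, size=n)

# Class balance check — the report notes an 80/20 split
print(df["ChurnStatus"].value_counts(normalize=True))

# Pairwise correlation flags the multicollinearity
print(df[["TotalSpent", "NumTransactions"]].corr())
```

On the real data, a correlation this high usually means dropping one of the pair (or combining them, e.g. as average spend per transaction) before fitting a linear model.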

### Model Results

| Model | Accuracy | Precision | Recall | F1 Score | AUC-ROC |
| --- | --- | --- | --- | --- | --- |
| Logistic Regression | 51.0% | 20.0% | 46.3% | 27.9% | 47.4% |
| Random Forest | 63.0% | 20.0% | 26.8% | 22.9% | 52.9% |
| XGBoost | 55.5% | 14.7% | 24.4% | 18.3% | 47.6% |

### Why Logistic Regression was selected

- Highest recall (46.3%) — catches the most actual churners
- No overfitting — train–test gap under 10%
- Interpretable — required for model explainability under FCA banking regulation
- Random Forest and XGBoost showed severe overfitting before tuning
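The recall-first comparison can be sketched with scikit-learn on synthetic imbalanced data. Using `class_weight="balanced"` is one common way to lift recall on an 80/20 split — an assumption here, since the notebooks may use a different tuning strategy:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 80/20 imbalanced data standing in for the real features
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

models = {
    "LogReg (balanced)": LogisticRegression(class_weight="balanced", max_iter=1000),
    "Random Forest": RandomForestClassifier(class_weight="balanced", random_state=42),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    print(f"{name}: recall={recall_score(y_te, pred):.3f} "
          f"f1={f1_score(y_te, pred):.3f} "
          f"auc={roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]):.3f}")
```

Selecting on test-set recall while also checking the train–test gap (as the project does) guards against picking a model that simply memorised the minority class.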

### Top Churn Risk Factors

| Rank | Feature | Direction | Business Meaning |
| --- | --- | --- | --- |
| 1 | LoginFrequency | ↓ decreases churn | More logins = loyal customer |
| 2 | AvgSpent | ↑ increases churn | Higher spenders still churn |
| 3 | DaysSinceLogin | ↑ increases churn | Inactivity = churn signal |
| 4 | IncomeLevel_Low | ↑ increases churn | Price-sensitive segment |
| 5 | MainInteractionType_Complaint | ↑ increases churn | Complaints predict leaving |
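With a fitted logistic regression, both the direction and the rank of risk factors come straight from the standardised coefficients. A hypothetical sketch — the feature names mirror the table above, but the data-generating rule is invented purely for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Invented data: churn rises with inactivity and falls with login frequency
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(500, 3)),
                 columns=["LoginFrequency", "AvgSpent", "DaysSinceLogin"])
y = (X["DaysSinceLogin"] - X["LoginFrequency"]
     + rng.normal(size=500) > 0).astype(int)

# Scaling first makes coefficient magnitudes comparable across features
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X, y)

# Sign gives direction (↑/↓ churn), absolute magnitude gives the ranking
coefs = pd.Series(pipe[-1].coef_[0], index=X.columns)
print(coefs.sort_values(key=abs, ascending=False))
```

This is why the interpretability point matters: each row of the risk-factor table can be traced to a single signed coefficient.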

## How to Run This Project

### 1. Clone the repository

```bash
git clone https://github.com/YOUR_USERNAME/churn_prediction_project.git
cd churn_prediction_project
```

### 2. Create a virtual environment

```bash
python -m venv venv
venv\Scripts\activate        # Windows
source venv/bin/activate     # macOS/Linux
```

### 3. Install dependencies

```bash
pip install -r requirements.txt
```

### 4. Add your data

Place `Customer_Churn_Data_Large.xlsx` in `data/raw/`.
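Since the workbook spans multiple sheets (the five data sources), `sheet_name=None` loads them all into a dict in one call. A defensive sketch, assuming the path above:

```python
from pathlib import Path

import pandas as pd

data_path = Path("data/raw/Customer_Churn_Data_Large.xlsx")
if data_path.exists():
    # sheet_name=None returns {sheet_name: DataFrame} for every sheet
    sheets = pd.read_excel(data_path, sheet_name=None)
    print({name: frame.shape for name, frame in sheets.items()})
else:
    print(f"Place the workbook at {data_path} before running the notebooks.")
```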

### 5. Run the notebooks in order

1. `notebooks/01_EDA_Churn.ipynb` — EDA and findings
2. `notebooks/02_Preprocessing_Modelling.ipynb` — ML pipeline

## Technologies Used

| Tool | Purpose |
| --- | --- |
| Python 3.13 | Core language |
| Pandas 3.0.1 | Data manipulation |
| NumPy 2.4.3 | Numerical computing |
| Matplotlib 3.10 | Visualisation |
| Seaborn 0.13.2 | Statistical plots |
| Scikit-learn 1.8.0 | ML models and preprocessing |
| XGBoost 3.2.0 | Gradient boosting |
| Jupyter Notebook | Development environment |

## Business Recommendations

1. Deploy Logistic Regression for monthly customer scoring
2. Three-tier risk segmentation — High/Medium/Low risk
3. Estimated impact — £14M+ annual revenue protected at scale
4. Key action — re-engage customers with fewer than 15 logins/month
5. Data improvement — add account tenure and product holdings
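Recommendation 2 amounts to mapping scored churn probabilities onto tiers, which `pandas.cut` does in one call. The cut-offs below are illustrative placeholders, not thresholds from the project:

```python
import numpy as np
import pandas as pd

# Hypothetical churn probabilities from the monthly scoring model
rng = np.random.default_rng(7)
scores = pd.Series(rng.uniform(0, 1, size=10), name="churn_prob")

# Illustrative cut-offs; real thresholds would be set with the business
tiers = pd.cut(scores, bins=[0, 0.3, 0.6, 1.0],
               labels=["Low", "Medium", "High"], include_lowest=True)
print(pd.concat([scores.round(2), tiers.rename("risk_tier")], axis=1))
```

In practice the bin edges would be calibrated against retention-team capacity, e.g. so the High tier is small enough to action every month.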

## Author

**Samitha Sandaruwan**
Aspiring Data Scientist
[GitHub](https://github.com/Samitha2001)


## Acknowledgements

Project developed as part of the Lloyds Banking Group Data Science Job Simulation.

