This project simulates a customer analytics use case and demonstrates how machine learning can be used to identify customers with a high churn risk.
It is designed as an end-to-end pipeline that reflects real-world data workflows in a simplified and structured way.
The pipeline generates synthetic customer data, stores it in a database, trains a machine learning model, predicts churn risk, and provides basic analysis and visualization of the results.
The goal is to demonstrate how business-relevant insights can be derived from data using a clean and modular approach.
Typical applications of this type of pipeline include:
- identifying customers with high churn risk
- supporting retention strategies
- prioritizing customer follow-ups
- analyzing behavioral patterns
The project is built with the following tools and libraries:
- Python
- pandas
- numpy
- scikit-learn
- SQLite
- matplotlib
- joblib
The pipeline covers the following steps:
- Generate synthetic customer data
- Store structured data in SQLite
- Train a Random Forest classification model
- Predict churn risk
- Evaluate model performance (accuracy, classification report)
- Export predictions to CSV
- Save trained model
- Analyze customer segments
- Visualize churn distribution
- Visualize feature importance
- Predict churn risk for a new customer
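The core steps above (generate, store, train, evaluate) can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names (`age`, `tenure_months`, `monthly_spend`, `support_tickets`), the table name `customers`, the churn rule, and the hyperparameters are all assumptions, and an in-memory SQLite database stands in for the on-disk file.

```python
import sqlite3

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split


def generate_customers(n: int = 500, seed: int = 42) -> pd.DataFrame:
    """Create synthetic customers; every column here is a hypothetical example."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "age": rng.integers(18, 80, size=n),
        "tenure_months": rng.integers(1, 72, size=n),
        "monthly_spend": rng.uniform(10, 200, size=n).round(2),
        "support_tickets": rng.poisson(1.5, size=n),
    })
    # Assumed rule: many support tickets and short tenure raise churn probability.
    logit = 0.6 * df["support_tickets"] - 0.05 * df["tenure_months"]
    prob = 1 / (1 + np.exp(-logit))
    df["churn"] = (rng.random(n) < prob).astype(int)
    return df


customers = generate_customers()

# Store the table in SQLite and read it back (in memory for this sketch).
conn = sqlite3.connect(":memory:")
customers.to_sql("customers", conn, index=False, if_exists="replace")
data = pd.read_sql("SELECT * FROM customers", conn)
conn.close()

# Split features and target, then train a Random Forest classifier.
X = data.drop(columns="churn")
y = data["churn"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate on the held-out split.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"accuracy: {accuracy:.3f}")
print(classification_report(y_test, y_pred, zero_division=0))
```

Because the labels are drawn from an invented rule, the reported scores say nothing about real customer data; they only confirm that the steps connect.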
The pipeline generates the following outputs:
- data/customer_data.db (database)
- data/churn_predictions.csv (predictions)
- data/churn_distribution.png (visualization)
- data/feature_importance.png (model insights)
- models/risk_model.pkl (trained model)
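As a rough sketch of how the model and prediction files could be written, the example below saves a fitted model with joblib and exports predictions with pandas. The stand-in data from `make_classification`, the feature names `f0` to `f3`, and the `predicted_churn` column are assumptions for illustration, and a temporary directory is used instead of `data/` and `models/` so that nothing is overwritten.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in data; the real pipeline would use its own customer table.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
features = pd.DataFrame(X, columns=["f0", "f1", "f2", "f3"])
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(features, y)

# Temporary output folder (assumption) in place of data/ and models/.
out_dir = Path(tempfile.mkdtemp())
model_path = out_dir / "risk_model.pkl"
csv_path = out_dir / "churn_predictions.csv"

# Persist the trained model.
joblib.dump(model, model_path)

# Export the features together with the predicted label.
predictions = features.copy()
predictions["predicted_churn"] = model.predict(features)
predictions.to_csv(csv_path, index=False)

# Reload the model and check that it reproduces the same predictions.
reloaded = joblib.load(model_path)
same = bool((reloaded.predict(features) == predictions["predicted_churn"]).all())
print(f"reloaded model matches: {same}")
```

Files written with joblib rely on pickle, so they should only be loaded from trusted sources and ideally with the same scikit-learn version that produced them.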
The model predicts whether a customer is likely to churn (1) or not (0) based on behavioral and demographic features.
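To illustrate what a prediction for a single new customer might look like, the sketch below trains a small Random Forest on toy data and then scores one record. The feature names, the labeling rule, and the example customer are hypothetical; the project's real features may differ.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 300
train = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "tenure_months": rng.integers(1, 72, size=n),
    "support_tickets": rng.poisson(1.5, size=n),
})
# Assumed toy label: short tenure combined with several tickets means churn.
train["churn"] = (
    (train["tenure_months"] < 12) & (train["support_tickets"] >= 2)
).astype(int)

feature_cols = ["age", "tenure_months", "support_tickets"]
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(train[feature_cols], train["churn"])

# A new customer must use the same columns, in the same order, as training.
new_customer = pd.DataFrame(
    [{"age": 34, "tenure_months": 3, "support_tickets": 5}]
)[feature_cols]

label = int(model.predict(new_customer)[0])             # 1 = churn, 0 = no churn
churn_index = list(model.classes_).index(1)             # column of class 1
churn_probability = float(model.predict_proba(new_customer)[0, churn_index])
print(f"predicted class: {label}, churn probability: {churn_probability:.2f}")
```

Reporting the probability next to the hard 0/1 label makes it possible to rank customers by risk rather than only splitting them into two groups.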
Additional outputs include:
- churn rate
- average values per risk group
- feature importance ranking
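These summary figures could be computed along the following lines. The table is a tiny hand-written example with invented values and column names, and the groups are formed from the `churn` label here; the project may group by predicted risk instead.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Tiny illustrative table; values and column names are assumptions.
df = pd.DataFrame({
    "tenure_months":   [2, 40, 5, 60, 3, 24, 48, 1],
    "monthly_spend":   [80.0, 30.0, 95.0, 25.0, 70.0, 45.0, 20.0, 90.0],
    "support_tickets": [4, 0, 3, 1, 5, 1, 0, 6],
    "churn":           [1, 0, 1, 0, 1, 0, 0, 1],
})

# Churn rate: share of customers labeled 1.
churn_rate = df["churn"].mean()

# Average feature values per group (0 = no churn, 1 = churn).
group_means = df.groupby("churn").mean()

# Feature importance ranking from a fitted Random Forest.
features = df.drop(columns="churn")
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(features, df["churn"])
importance = pd.Series(
    model.feature_importances_, index=features.columns
).sort_values(ascending=False)

print(f"churn rate: {churn_rate:.2f}")
print(group_means)
print(importance)
```

Impurity-based importances from a Random Forest indicate which features the model relied on, not which ones cause churn, so the ranking is best read as a starting point for further analysis.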
Install the dependencies:
pip install -r requirements.txt
Run the pipeline:
python main.py