This project performs customer segmentation using unsupervised learning techniques on the UK-based Online Retail dataset. The goal is to identify customer clusters based on RFM (Recency, Frequency, Monetary) features to support marketing strategy, personalization, and business insights.
- K-Means Clustering
- Gaussian Mixture Models (GMM)
- DBSCAN (Density-Based Spatial Clustering)
Each model is tuned using multiple strategies (e.g., Elbow Method, Silhouette Score, AIC/BIC, K-Distance Knee) and compared using clustering performance metrics.
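As a rough illustration of two of the tuning strategies named above, the sketch below picks k for K-Means by silhouette score and the number of GMM components by BIC. It runs on synthetic blobs standing in for the scaled RFM matrix. The helper names `best_k_by_silhouette` and `best_n_by_bic` and the candidate ranges are assumptions made for this example, not necessarily what the code in src/ does.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture


def best_k_by_silhouette(X, k_values):
    """Return the k whose K-Means labels give the highest silhouette score (hypothetical helper)."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get), scores


def best_n_by_bic(X, n_values):
    """Return the component count with the lowest BIC (hypothetical helper)."""
    bics = {}
    for n in n_values:
        gmm = GaussianMixture(n_components=n, random_state=0).fit(X)
        bics[n] = gmm.bic(X)
    return min(bics, key=bics.get), bics


# Synthetic stand-in for the scaled RFM matrix: three well-separated groups in 3 dimensions
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0, 0], [6, 6, 6], [-6, 6, 0]],
    cluster_std=0.6,
    random_state=42,
)

best_k, sil_scores = best_k_by_silhouette(X, range(2, 7))
best_n, bic_scores = best_n_by_bic(X, range(1, 7))
print(f"K-Means k by silhouette: {best_k}, GMM components by BIC: {best_n}")
```

Silhouette is maximized while BIC is minimized, which is why the two helpers use `max` and `min` respectively. The elbow method and AIC could be added alongside in the same loops.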
- Data Loading and Cleaning
  - Load raw data from data/raw/Online Retail.xlsx
  - Clean and preprocess the data (remove nulls, invalid entries)
- Feature Engineering
  - Calculate RFM features per customer
  - Standardize RFM features for clustering
- Model Training / Loading
  - Ask whether to load saved models or retrain
  - K-Means: best k from elbow and silhouette
  - GMM: best n from AIC/BIC and silhouette
  - DBSCAN: best ε from silhouette and knee distance graph
- Model Evaluation
  - Evaluate clusters using silhouette score and other internal metrics
  - Save models to models/ and metrics to results/scores.txt
- Visualization
  - Visualize clusters in 2D and 3D using PCA
  - Save plots under results/visuals/
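The feature-engineering and PCA steps above can be sketched on a tiny in-memory table. The column names (`CustomerID`, `InvoiceNo`, `InvoiceDate`, `Quantity`, `UnitPrice`) follow the UCI Online Retail dataset. The sample rows, the `Amount` column, and the choice of snapshot date (one day after the last invoice) are assumptions made for this example, and the project's own src/feature_engineering.py may compute RFM differently.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Made-up transactions standing in for the cleaned dataset
tx = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 3, 3],
    "InvoiceNo": ["A1", "A2", "B1", "C1", "C2", "C2"],
    "InvoiceDate": pd.to_datetime([
        "2011-01-05", "2011-03-10", "2011-02-01",
        "2011-03-01", "2011-03-20", "2011-03-20",
    ]),
    "Quantity": [2, 1, 5, 1, 3, 2],
    "UnitPrice": [10.0, 20.0, 4.0, 7.5, 2.0, 5.0],
})
tx["Amount"] = tx["Quantity"] * tx["UnitPrice"]

# Assumed reference date: the day after the most recent invoice
snapshot = tx["InvoiceDate"].max() + pd.Timedelta(days=1)

# One row per customer: days since last purchase, distinct invoices, total spend
rfm = tx.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),
    Frequency=("InvoiceNo", "nunique"),
    Monetary=("Amount", "sum"),
)

# Standardize so that no single feature dominates distance-based clustering
scaled = StandardScaler().fit_transform(rfm)

# Project to 2D for plotting; n_components=3 would give the 3D view
coords = PCA(n_components=2).fit_transform(scaled)

print(rfm)
print(coords.shape)
```

Standardization matters here because Monetary values are typically orders of magnitude larger than Frequency counts, and K-Means, GMM, and DBSCAN all rely on distances in feature space.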
.
├── data/
│ ├── raw/ # Raw Excel file goes here
│ └── processed/ # Scaled RFM CSV will be saved here
├── models/ # Trained clustering models (.joblib)
├── results/
│ ├── visuals/ # All plots (elbow, silhouette, clusters)
│ └── scores.txt # Model evaluation scores
├── notebooks/ # (Optional) Exploratory notebooks
├── src/ # Core codebase
│ ├── data_loader.py
│ ├── feature_engineering.py
│ ├── kmeans_clustering.py
│ ├── gmm_clustering.py
│ ├── dbscan_clustering.py
│ ├── evaluator.py
│ ├── utils.py
│ └── visualization.py
├── main.py # Pipeline entrypoint
└── README.md
pip install -r requirements.txt
Download the Online Retail Dataset from the UCI Repository and place it in:
data/raw/Online Retail.xlsx
python main.py
- 📈 Model performance in results/scores.txt
- 📊 Cluster plots in results/visuals/
- 💾 Trained models in models/
- If saved models already exist, you'll be prompted to reuse them or retrain from scratch.
- DBSCAN’s ε is set to the average of the silhouette-based and k-distance-based estimates, with the aim of more stable clustering.
- All visualizations are saved automatically; no GUI or manual plotting required.
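The note above about averaging two ε estimates for DBSCAN can be illustrated with a minimal sketch. The source does not say how each estimate is computed, so everything here is an assumption: the knee is taken as the point on the sorted k-distance curve farthest from the chord joining its endpoints, the silhouette estimate comes from a small grid search that ignores noise points, and the function names, `min_samples=5`, and the candidate grid are hypothetical.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.neighbors import NearestNeighbors


def eps_from_k_distance(X, k=5):
    """Estimate eps as the knee of the sorted k-th nearest-neighbour distance curve."""
    # k + 1 neighbours because each point is its own nearest neighbour at distance 0
    dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    curve = np.sort(dists[:, -1])
    # Normalize both axes to [0, 1]; the knee is where the curve lies farthest below the diagonal
    x = np.linspace(0.0, 1.0, len(curve))
    y = (curve - curve[0]) / (curve[-1] - curve[0])
    return float(curve[np.argmax(x - y)])


def eps_from_silhouette(X, candidates, min_samples=5):
    """Return the candidate eps with the best silhouette over non-noise points, or None."""
    best_eps, best_score = None, -1.0
    for eps in candidates:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        mask = labels != -1  # DBSCAN marks noise with the label -1
        if len(set(labels[mask])) < 2:
            continue  # silhouette needs at least two clusters
        score = silhouette_score(X[mask], labels[mask])
        if score > best_score:
            best_eps, best_score = float(eps), score
    return best_eps


# Synthetic stand-in for the scaled RFM matrix
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0, 0], [6, 6, 6], [-6, 6, 0]],
    cluster_std=0.5,
    random_state=0,
)

eps_knee = eps_from_k_distance(X, k=5)
eps_sil = eps_from_silhouette(X, np.linspace(0.2, 2.0, 10))

# Average the two estimates; fall back to the knee if no candidate produced two clusters
eps_final = eps_knee if eps_sil is None else (eps_sil + eps_knee) / 2
labels = DBSCAN(eps=eps_final, min_samples=5).fit_predict(X)
print(f"eps_knee={eps_knee:.3f}, eps_sil={eps_sil}, eps_final={eps_final:.3f}")
```

Averaging is a simple way to hedge between the two heuristics: the silhouette search can favour small ε values that discard many points as noise, while the knee estimate depends only on local point density.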
- Integrate hyperparameter tuning automation
- Add t-SNE visualizations
- Export cluster labels with customer IDs for business use
- Add comparison with supervised labeling (if any labels are added)
Project by [Your Name]
For questions, feel free to open an Issue or reach out via GitHub.