Author: Hitesh Taneja
Project for: MSc Computer Science, University of Greenwich
Supervisor: Dr. Chris Walshaw
Date: September 2025
This project provides a universal, end-to-end Python pipeline for performing interpretable cluster analysis on high-dimensional data.
Standard clustering algorithms are effective at identifying groups in data but often produce a "black box" result. They assign labels like "Cluster 0" or "Cluster 1" without explaining why a data point belongs to a certain group. This makes it difficult for analysts and stakeholders to trust the results and derive actionable insights.
This pipeline addresses the "black box" problem: it discovers clusters and then automatically generates human-readable explanations for them from several complementary angles. It is designed to be robust, reusable, and applicable to any tabular dataset.
- Universal & Configurable: Easily adaptable to any dataset by modifying a simple CONFIG dictionary. No code changes are required.
- Automated Preprocessing: Automatically detects numerical and categorical features and applies appropriate imputation, scaling, and encoding.
- Data-Driven Methodology:
  - Uses a data-driven approach to select the best DR algorithm (UMAP/t-SNE) based on dataset size.
  - Automatically determines the optimal number of clusters (k) using the Elbow Method.
- Multi-Faceted Interpretation Module: Provides four complementary explanations for the discovered clusters:
  - Centroid Profile: The statistical "average" of a cluster.
  - Medoid Prototype: The single best real-world example from the dataset.
  - Decision Tree Surrogate: A set of simple, human-readable rules that define the clusters.
  - SHAP Analysis: A clear ranking of the features that were most important in separating the clusters.
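The automated preprocessing feature described above can be sketched with scikit-learn's `ColumnTransformer`. This is a minimal illustration, not the project's actual code: the function name `build_preprocessor`, the median and most-frequent imputation strategies, and the toy columns are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_preprocessor(df: pd.DataFrame) -> ColumnTransformer:
    """Hypothetical helper: pick a treatment for each column from its dtype."""
    numeric_cols = df.select_dtypes(include="number").columns.tolist()
    categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

    # Numeric features: fill gaps with the median, then standardise.
    numeric_steps = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    # Categorical features: fill gaps with the mode, then one-hot encode.
    categorical_steps = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    # sparse_threshold=0.0 forces a dense array, which is easier to inspect.
    return ColumnTransformer(
        [
            ("num", numeric_steps, numeric_cols),
            ("cat", categorical_steps, categorical_cols),
        ],
        sparse_threshold=0.0,
    )


# Toy data with missing values in both kinds of column (made up for the example).
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0],
    "fare": [7.25, 71.3, 8.05, np.nan],
    "embarked": ["S", "C", np.nan, "S"],
})
X = build_preprocessor(toy).fit_transform(toy)
print(X.shape)
```

Detecting column types from the DataFrame's dtypes is what lets a step like this run without per-dataset code changes, although columns that store categories as integers would need explicit handling.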
The system is designed as a sequential, 4-stage pipeline:
- Configuration & Data Ingestion: The user defines the dataset path and parameters.
- Automated Preprocessing: The data is cleaned, scaled, and encoded.
- Core Analysis Engine: Dimensionality reduction and clustering are performed.
- Interpretation & Reporting: Quantitative metrics and qualitative explanations are generated.
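To make the final stage more concrete, the sketch below shows one way three of the qualitative explanations (centroid profile, medoid prototype, decision tree surrogate) could be computed with scikit-learn. It runs on synthetic data, the feature names are hypothetical, and the tree depth is an arbitrary choice. SHAP is left out to keep the example dependency-light. It is an illustration under these assumptions, not the project's implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for a preprocessed feature matrix (not project data).
X, _ = make_blobs(n_samples=300, centers=3, n_features=4, random_state=0)
feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical column names

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_

# Centroid profile: the per-feature mean of each cluster.
centroids = kmeans.cluster_centers_


def medoid_index(X, labels, cluster_id):
    """Index of the real sample with the smallest total distance to its cluster mates."""
    members = np.flatnonzero(labels == cluster_id)
    distances = pairwise_distances(X[members])
    return int(members[np.argmin(distances.sum(axis=1))])


# Medoid prototype: one actual row per cluster.
medoids = [medoid_index(X, labels, k) for k in range(3)]

# Decision tree surrogate: a shallow tree trained to reproduce the cluster labels.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, labels)
rules = export_text(surrogate, feature_names=feature_names)
fidelity = surrogate.score(X, labels)  # share of cluster labels the rules reproduce

print(rules)
print(f"Surrogate fidelity: {fidelity:.2f}")
```

The pairwise-distance medoid is quadratic in cluster size, so for large clusters a common shortcut is to take the sample nearest the centroid instead. Reporting the surrogate's fidelity alongside its rules indicates how far the rules can be trusted as a description of the clusters.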
- Language: Python
- Core Libraries: Scikit-learn, Pandas, NumPy
- Dimensionality Reduction: UMAP, t-SNE
- Clustering & Interpretation: k-means, SHAP, Decision Trees
- Visualization: Matplotlib, Seaborn
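The k-means step relies on the Elbow Method to choose k. The sketch below shows one common way to automate that choice: fit k-means over a range of k and pick the point on the inertia curve that lies furthest below the straight line joining its endpoints. The function name `elbow_k`, this particular knee heuristic, and the synthetic data are assumptions for illustration. The project may detect the elbow differently.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def elbow_k(X, k_values):
    """Hypothetical helper: return the k at the 'knee' of the inertia curve."""
    inertias = np.array([
        KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        for k in k_values
    ])
    ks = np.asarray(k_values, dtype=float)
    # Rescale both axes to [0, 1] so the comparison does not depend on units.
    x = (ks - ks[0]) / (ks[-1] - ks[0])
    y = (inertias - inertias[-1]) / (inertias[0] - inertias[-1])
    # The chord between the endpoints is the line x + y = 1;
    # the elbow is the point that sits furthest below it.
    gap = 1.0 - x - y
    return int(k_values[int(np.argmax(gap))]), inertias


# Four well-separated synthetic blobs (made up for the example).
X, _ = make_blobs(
    n_samples=400,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=0.5,
    random_state=0,
)
best_k, inertias = elbow_k(X, list(range(1, 9)))
print(best_k)
```

On real data the inertia curve is often smoother than in this toy case, so it is worth checking the automatic choice against the plotted curve.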
Ensure you have the required Python libraries installed:
pip install pandas scikit-learn umap-learn matplotlib seaborn shap
- Clone the Repository:
  git clone "https://github.com/hiteshtanejaa/MSc-Project.git"
  cd MSc-Project
- Configure the Pipeline: Open the pipeline_integration.py script and modify the CONFIG dictionary at the top to point to your dataset.
Example for the Forest Cover Type dataset:
CONFIG = {
"file_path": "file.csv",
"target_column": "Cover_Type",
"id_column": None,
"features_to_drop": []
}
- Run the Script: Execute the pipeline from your terminal:
  python Final_pipeline.py
- View the Results: The script will print all quantitative metrics and qualitative interpretations to the console and display the Elbow Method plot, the final UMAP scatter plot, and the SHAP summary plot.
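As a rough illustration of how the CONFIG keys shown above might be consumed, the sketch below reads a CSV, sets the target column aside, and drops the ID column and any listed features. The function `load_dataset` and its `source` parameter are hypothetical and not taken from the repository. Treating `target_column` as a held-out label is likewise an assumption, and the three data rows are made up.

```python
import io

import pandas as pd


def load_dataset(config, source=None):
    """Hypothetical loader: return (features, target) according to the config keys."""
    # `source` lets the example read from memory; normally the file path is used.
    df = pd.read_csv(source if source is not None else config["file_path"])

    # Set the target aside so it does not influence the clustering.
    target = None
    if config.get("target_column") is not None:
        target = df.pop(config["target_column"])

    # Remove the ID column and any features the user asked to drop.
    to_drop = list(config.get("features_to_drop", []))
    if config.get("id_column") is not None:
        to_drop.append(config["id_column"])
    features = df.drop(columns=to_drop)
    return features, target


CONFIG = {
    "file_path": "file.csv",
    "target_column": "Cover_Type",
    "id_column": None,
    "features_to_drop": [],
}

# In-memory stand-in for file.csv with made-up rows.
sample = io.StringIO("Elevation,Slope,Cover_Type\n2596,3,5\n2590,2,5\n2804,9,2\n")
features, target = load_dataset(CONFIG, source=sample)
print(list(features.columns))
```

Keeping every dataset-specific choice in one dictionary is what allows the rest of the pipeline to stay unchanged between datasets.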
The pipeline's robustness was validated on five diverse, high-dimensional datasets:
- Titanic (Demographics)
- PISA 2018 (Social Science)
- Forest Cover Type (Ecology, >580k samples)
- Gene Expression Cancer RNA-Seq (Bioinformatics, >20k features)
- MNIST (Computer Vision, 70k samples)
Across all datasets, the pipeline successfully identified and explained meaningful clusters, demonstrating its effectiveness and universality.
The next logical step for this project is to formally package this tool for distribution on the Python Package Index (PyPI), making it easily installable and accessible to the wider data science community.