End-to-End Pipeline for Interpretable Cluster Analysis

Author: Hitesh Taneja
Project for: MSc Computer Science, University of Greenwich
Supervisor: Dr. Chris Walshaw
Date: September 2025

1. Overview

This project provides a universal, end-to-end Python pipeline for performing interpretable cluster analysis on high-dimensional data.

The Problem: The "Black Box" Cluster

Standard clustering algorithms are effective at identifying groups in data but often produce a "black box" result. They assign labels like "Cluster 0" or "Cluster 1" without explaining why a data point belongs to a certain group. This makes it difficult for analysts and stakeholders to trust the results and derive actionable insights.

The Solution: An Automated Interpretation Pipeline

This pipeline solves the "black box" problem by not only discovering clusters but also automatically generating a rich, multi-faceted, and human-readable explanation for them. It is designed to be robust, reusable, and applicable to any tabular dataset.

2. Key Features

Universal & Configurable: Adapts to any tabular dataset by editing a single CONFIG dictionary. No other code changes are required.
Automated Preprocessing: Automatically detects numerical and categorical features and applies appropriate imputation, scaling, and encoding.
Data-Driven Methodology:
- Selects the dimensionality reduction (DR) algorithm (UMAP or t-SNE) based on dataset size.
- Automatically determines the optimal number of clusters (k) using the Elbow Method.
Multi-Faceted Interpretation Module: Provides four complementary explanations for the discovered clusters:
1. Centroid Profile: The statistical "average" of a cluster.
2. Medoid Prototype: The single best real-world example from the dataset.
3. Decision Tree Surrogate: A set of simple, human-readable rules that define the clusters.
4. SHAP Analysis: A clear ranking of the features that were most important in separating the clusters.
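
The two data-driven choices listed under Key Features can be illustrated with a small, dependency-free sketch. It is not the pipeline's actual code: the function names, the 10,000-sample cutoff, and the rule that t-SNE handles the smaller datasets while UMAP handles the larger ones are all assumptions, since this README states only that the choice depends on dataset size. The elbow is located with one common heuristic (the point farthest from the line joining the first and last inertia values), which may differ from what the script does.

```python
from math import hypot

# Hypothetical cutoff: the README does not state the size at which the
# pipeline switches algorithms, nor which algorithm serves which side.
LARGE_DATASET_THRESHOLD = 10_000


def choose_dr_algorithm(n_samples, threshold=LARGE_DATASET_THRESHOLD):
    """Assumed rule: t-SNE below the threshold, UMAP at or above it."""
    return "umap" if n_samples >= threshold else "tsne"


def elbow_k(k_values, inertias):
    """Return the k whose (k, inertia) point lies farthest from the straight
    line joining the first and last points of the inertia curve."""
    if len(k_values) < 3 or len(k_values) != len(inertias):
        raise ValueError("need at least three (k, inertia) pairs")
    x0, y0 = k_values[0], inertias[0]
    x1, y1 = k_values[-1], inertias[-1]
    chord_length = hypot(x1 - x0, y1 - y0)
    best_k, best_dist = k_values[0], -1.0
    for k, inertia in zip(k_values, inertias):
        # Perpendicular distance from the point to the chord.
        dist = abs((y1 - y0) * k - (x1 - x0) * inertia + x1 * y0 - y1 * x0) / chord_length
        if dist > best_dist:
            best_k, best_dist = k, dist
    return best_k


# Made-up inertia values for k = 1..6, for illustration only.
print(choose_dr_algorithm(500))                                          # tsne
print(elbow_k([1, 2, 3, 4, 5, 6], [1000, 400, 150, 120, 100, 90]))      # 3
```

In practice the inertia values would come from fitting k-means once per candidate k. The heuristic only automates the visual reading of the Elbow Method plot.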

3. Pipeline Architecture

The system is designed as a sequential, four-stage pipeline:

1. Configuration & Data Ingestion: The user defines the dataset path and parameters.
2. Automated Preprocessing: The data is cleaned, scaled, and encoded.
3. Core Analysis Engine: Dimensionality reduction and clustering are performed.
4. Interpretation & Reporting: Quantitative metrics and qualitative explanations are generated.

4. Tech Stack

Language: Python
Core Libraries: Scikit-learn, Pandas, NumPy
Dimensionality Reduction: UMAP, t-SNE
Clustering & Interpretation: k-means, SHAP, Decision Trees
Visualization: Matplotlib, Seaborn

5. How to Use

Prerequisites

Ensure you have the required Python libraries installed:

pip install pandas scikit-learn umap-learn matplotlib seaborn shap

Steps

Clone the Repository:

git clone "https://github.com/hiteshtanejaa/MSc-Project.git"
cd MSc-Project

Configure the Pipeline: Open the pipeline_integration.py script and modify the CONFIG dictionary at the top to point to your dataset.

Example for the Forest Cover Type dataset:

CONFIG = {
    "file_path": "file.csv",
    "target_column": "Cover_Type",
    "id_column": None,
    "features_to_drop": []
}
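
As a rough illustration of how such a dictionary might be consumed, the sketch below separates the columns used for clustering from the target, ID, and dropped columns. The helper name split_columns is hypothetical, and so is the assumption that the target column is held out of clustering; the script may treat these keys differently. The two-row table uses toy values and only mimics the shape of the Forest Cover Type data.

```python
import pandas as pd

CONFIG = {
    "file_path": "file.csv",
    "target_column": "Cover_Type",
    "id_column": None,
    "features_to_drop": [],
}


def split_columns(df, config):
    """Hypothetical helper: return (features, target) according to config.

    Assumes the target, the ID column, and any listed features are all
    excluded from the data passed on to clustering.
    """
    target_col = config.get("target_column")
    target = df[target_col] if target_col else None
    excluded = [target_col, config.get("id_column"), *config.get("features_to_drop", [])]
    excluded = [col for col in excluded if col is not None]
    return df.drop(columns=excluded), target


# In real use the table would come from pd.read_csv(CONFIG["file_path"]).
toy = pd.DataFrame({
    "Elevation": [2596, 2590],
    "Slope": [3, 2],
    "Cover_Type": [5, 5],
})
features, target = split_columns(toy, CONFIG)
print(list(features.columns))  # ['Elevation', 'Slope']
```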

Run the Script: Execute the pipeline from your terminal:

python Final_pipeline.py

View the Results: The script will print all quantitative metrics and qualitative interpretations to the console and display the Elbow Method plot, the final UMAP scatter plot, and the SHAP summary plot.

6. Validation

The pipeline's robustness was validated on five diverse, high-dimensional datasets:

Titanic (Demographics)
PISA 2018 (Social Science)
Forest Cover Type (Ecology, >580k samples)
Gene Expression Cancer RNA-Seq (Bioinformatics, >20k features)
MNIST (Computer Vision, 70k samples)

Across all datasets, the pipeline successfully identified and explained meaningful clusters, demonstrating its effectiveness and universality.

7. Future Work

The next logical step for this project is to formally package this tool for distribution on the Python Package Index (PyPI), making it easily installable and accessible to the wider data science community.
