This repository implements an unsupervised learning pipeline for clustering peptide and protein sequence data using UMAP for dimensionality reduction and HDBSCAN for density-based clustering.
The goal is to discover latent structure in high-dimensional biological sequence data without requiring predefined labels or cluster counts.
Biological sequence data is inherently high-dimensional and difficult to cluster directly.
This project addresses that challenge by:
- Converting peptide sequences into numerical feature representations
- Reducing dimensionality with UMAP, which preserves local neighborhood structure and, to a lesser extent, global structure
- Applying HDBSCAN to identify dense clusters and to label sparse points as noise instead of forcing them into a cluster
This approach is well suited to exploratory analysis in bioinformatics and computational biology.
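The text above does not say which numerical representation is used. As a minimal sketch, assuming a simple amino-acid composition encoding (one of several possible choices, not necessarily the one this repository implements), each sequence can be mapped to a 20-dimensional vector of residue frequencies. The names `AMINO_ACIDS` and `composition_features` are hypothetical and introduced only for illustration.

```python
from collections import Counter

import numpy as np

# Assumption: only the 20 standard residues are counted; other symbols are ignored.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_features(sequences):
    """Return an (n_sequences, 20) matrix of per-residue frequencies."""
    features = np.zeros((len(sequences), len(AMINO_ACIDS)))
    for row, seq in enumerate(sequences):
        counts = Counter(seq.upper())
        total = sum(counts[aa] for aa in AMINO_ACIDS)
        if total == 0:
            continue  # no standard residues: leave the row as all zeros
        features[row] = [counts[aa] / total for aa in AMINO_ACIDS]
    return features


X = composition_features(["ACDA", "KKKK"])
print(X.shape)
```

Composition vectors discard residue order, so richer encodings (for example k-mer counts or learned embeddings) may separate sequences better; the downstream UMAP and HDBSCAN steps only need a numeric matrix with one row per sequence.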
- Data Loading
  - Import peptide/protein sequences from FASTA files
- Feature Engineering
  - Convert sequences into numerical representations suitable for ML
- Dimensionality Reduction
  - Apply UMAP to project high-dimensional features into 2D/3D space
- Clustering
  - Use HDBSCAN to identify clusters of varying density
  - Automatically detect noise and outliers
- Visualization & Analysis
  - Visualize embeddings and cluster assignments
  - Interpret biological structure and separation
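The loading, reduction, clustering, and plotting steps above might look roughly like the sketch below. It is an illustration under stated assumptions, not the repository's actual code: the function names (`read_fasta`, `embed_and_cluster`, `plot_clusters`) and the parameter values (`min_dist=0.0`, `min_cluster_size=10`, `seed=42`) are hypothetical, and the feature step is left to the caller as any function that turns sequences into a numeric matrix. It assumes the `umap-learn`, `hdbscan`, and `matplotlib` packages are installed.

```python
def read_fasta(path):
    """Parse a FASTA file into a dict mapping record ID to sequence."""
    records = {}
    current_id = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Header line: use the first whitespace-delimited token as the ID.
                tokens = line[1:].split()
                current_id = tokens[0] if tokens else f"record_{len(records)}"
                records[current_id] = []
            elif current_id is not None:
                # Sequence lines may be wrapped; collect and join them later.
                records[current_id].append(line)
    return {rec_id: "".join(parts) for rec_id, parts in records.items()}


def embed_and_cluster(features, n_components=2, min_cluster_size=10, seed=42):
    """Project features with UMAP, then cluster the embedding with HDBSCAN."""
    # Imported lazily so read_fasta works without the heavier dependencies.
    import hdbscan
    import umap

    reducer = umap.UMAP(n_components=n_components, min_dist=0.0, random_state=seed)
    embedding = reducer.fit_transform(features)
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size)
    labels = clusterer.fit_predict(embedding)
    return embedding, labels  # a label of -1 marks points treated as noise


def plot_clusters(embedding, labels):
    """Scatter-plot a 2D embedding colored by cluster label."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    points = ax.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap="tab10", s=8)
    ax.set_xlabel("UMAP 1")
    ax.set_ylabel("UMAP 2")
    fig.colorbar(points, ax=ax, label="cluster label (-1 = noise)")
    return fig
```

A typical call sequence would be `records = read_fasta(path)`, then building a feature matrix from `records.values()`, then `embedding, labels = embed_and_cluster(features)` and `plot_clusters(embedding, labels)`. Setting `min_dist=0.0` packs points more tightly, which tends to help density-based clustering, and fixing `random_state` makes the UMAP embedding reproducible between runs; both are choices made here for illustration.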
- Python
- UMAP
- HDBSCAN
- NumPy, Pandas
- Matplotlib
- Jupyter Notebook