Building structure-informed Profile HMMs for Kunitz/BPTI-type protease inhibitor domain detection, comparing sequence-based (MUSCLE) vs. structure-based (PDBeFold) approaches using 2-fold cross-validation on Swiss-Prot data.
This repository contains the code, datasets, and results for a project focused on developing structure-informed Profile Hidden Markov Models (HMMs) for the detection and annotation of Kunitz/BPTI-type protease inhibitor domains. Two HMMs were built:
- Sequence-based HMM — from MUSCLE multiple sequence alignment
- Structure-based HMM — from PDBeFold structural alignment
The models were evaluated on curated datasets from Swiss-Prot, with performance assessed via 2-fold cross-validation. The structure-based HMM performed slightly better, reaching a peak MCC of 0.997 against 0.991 for the sequence-based HMM, which is consistent with better sensitivity to remote homologs.
HMM_KunitzDomain/
│
├── Data/ # Input datasets
│ ├── Raw/ # Original downloaded sequences (Swiss-Prot, PDB structures)
│ └── Processed/ # Curated datasets, alignments, HMM models, evaluation files
│
├── Docs/ # Project documentation
│ └── report/ # Final project report (PDF)
│
├── Script/ # Custom scripts for analysis
│ └── performance.py # HMM performance evaluation (MCC, TPR, PPV)
│
├── environment.yml # Conda environment specification
├── requirements.txt # Python dependencies
├── .gitignore # Files to ignore in Git
├── LICENSE # MIT License
└── README.md # This file
- Construct profile HMMs from sequence-based and structure-based alignments of Kunitz domains.
- Evaluate model performance on positive and negative datasets using classification metrics, addressing annotation incompleteness in databases like Pfam.
- Demonstrate the advantage of incorporating structural information for improved sensitivity to remote homologs.
git clone https://github.com/MahanBalooei/HMM_KunitzDomain.git
cd HMM_KunitzDomain
conda env create -f environment.yml
conda activate hmm-kunitz

Or, using pip instead of Conda:

git clone https://github.com/MahanBalooei/HMM_KunitzDomain.git
cd HMM_KunitzDomain
pip install -r requirements.txt

External tools required: HMMER ≥ 3.3.2, MUSCLE v5, MMseqs2, CD-HIT ≥ 4.8.1, UCSF ChimeraX (for structural visualization)
Positive set — Retrieved from UniProtKB/Swiss-Prot (release 2025_03):
- Annotated with Pfam PF00014, InterPro IPR036880, or Prosite (PS00280, PS50279)
- 368 high-confidence Kunitz proteins after filtering, split into two folds (pos_1, pos_2)
Negative set — All Swiss-Prot proteins lacking Kunitz annotations
PDB structures — Pfam PF00014, resolution ≤ 3.5 Å, length 45–80 aa, clustered at 90% identity with CD-HIT
Contamination removal — BLASTp to filter sequences ≥95% identical to structural seeds
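
The README does not say how the 368 positives were assigned to the two folds. As a minimal sketch, assuming a simple reproducible random split, the assignment could look like the following. The identifiers, the function name `split_two_folds`, and the seed are all hypothetical.

```python
import random


def split_two_folds(ids, seed=42):
    """Shuffle identifiers reproducibly and split them into two halves."""
    shuffled = sorted(ids)  # sort first so the result does not depend on input order
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical identifiers standing in for the 368 curated positives.
positives = [f"P{i:05d}" for i in range(368)]
pos_1, pos_2 = split_two_folds(positives)
print(len(pos_1), len(pos_2))  # 184 184
```

The same function could be applied to the negative identifiers so that each fold has its own positive and negative subset.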
The structures were aligned with PDBeFold (SSM) and the alignment was exported in FASTA format. All structures showed the canonical Kunitz fold: two β-strands, a central α-helix, the inhibitory loop, and three disulfide bridges.
The sequence-based alignment was generated with MUSCLE v5 on the same representative sequences, for a fair comparison.
hmmbuild sequence_based.hmm muscle_alignment.ali
hmmbuild structure_based.hmm pdbe_alignment.ali

hmmsearch -Z 1000 --max --tblout pos_1.out structure_based.hmm pos_1.fasta
hmmsearch -Z 1000 --max --tblout neg_1.out structure_based.hmm neg_1.fasta

for i in `seq 1 12`; do python Script/performance.py set_1.class 1e-$i; done

performance.py computes the confusion matrix, Q2 (accuracy), MCC, TPR (sensitivity), and PPV (precision) across E-value thresholds (10⁻¹ to 10⁻¹²).
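
For readers who want to see how these metrics relate to an E-value threshold, here is a minimal sketch. It is not the repository's `performance.py`: the record format (a list of `(evalue, label)` pairs), the function names, the toy data, and the choice of `<=` for the threshold comparison are all assumptions made for illustration.

```python
import math


def confusion_matrix(records, threshold):
    """Count TP, FP, TN, FN; a sequence is predicted positive if E-value <= threshold."""
    tp = fp = tn = fn = 0
    for evalue, label in records:
        predicted_positive = evalue <= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Return Q2, MCC, TPR and PPV, using 0.0 when a denominator is zero."""
    total = tp + fp + tn + fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Q2": (tp + tn) / total if total else 0.0,
        "MCC": (tp * tn - fp * fn) / denom if denom else 0.0,
        "TPR": tp / (tp + fn) if (tp + fn) else 0.0,
        "PPV": tp / (tp + fp) if (tp + fp) else 0.0,
    }


# Toy data (made up): (E-value, label) with label 1 = Kunitz, 0 = non-Kunitz.
toy_records = [(1e-40, 1), (1e-20, 1), (1e-3, 1), (1e-4, 0), (0.5, 0), (2.0, 0)]

# Sweep the same thresholds as the shell loop: 1e-1 down to 1e-12.
for i in range(1, 13):
    threshold = float(f"1e-{i}")
    cm = confusion_matrix(toy_records, threshold)
    m = metrics(*cm)
    print(f"1e-{i}: TP,FP,TN,FN={cm} Q2={m['Q2']:.3f} MCC={m['MCC']:.3f}")
```

Returning 0.0 for MCC when the denominator is zero is a common convention for degenerate confusion matrices; the actual script may handle that case differently.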
| Model | Peak MCC | Optimal E-value |
|---|---|---|
| Structure-based HMM | 0.997 | 10⁻⁵ – 10⁻⁶ |
| Sequence-based HMM | 0.991 | 10⁻⁵ – 10⁻⁶ |
Key findings:
- E-value separation: positives had E-values between 10⁻⁶⁰ and 10⁻³⁶, while negatives had E-values between 1 and 10
- Cross-validation: Near-perfect — 0 false positives, 0–2 false negatives per fold
- Misclassification: Apparent false positives were genuine Kunitz domains missed by Pfam
- Domain-level scoring outperformed full-sequence scoring for multidomain proteins
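
To compare full-sequence and domain-level scoring, both E-values can be read from the `--tblout` files produced by `hmmsearch`. In that format, whitespace-separated column 5 is the full-sequence E-value and column 8 is the best-domain E-value. The sketch below parses those two columns; the function name, the toy lines, the model name `kunitz_model`, and the threshold are hypothetical, and the numbers are invented for illustration only.

```python
def parse_tblout(lines):
    """Map each target name to its full-sequence and best-domain E-values."""
    hits = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the commented header/footer
        fields = line.split()
        hits[fields[0]] = {
            "full": float(fields[4]),         # full-sequence E-value
            "best_domain": float(fields[7]),  # best single-domain E-value
        }
    return hits


# Made-up tblout lines, not real search results.
toy_tblout = [
    "# target name  accession  query name  accession  E-value  score  bias ...",
    "seqA - kunitz_model - 1e-30 100.0 0.1 2e-30 99.0 0.1 1.0 1 0 0 1 1 1 1 toy single-domain protein",
    "seqB - kunitz_model - 5e-12 45.0 0.4 3e-09 36.0 0.2 2.0 2 0 0 2 2 2 1 toy two-domain protein",
]

hits = parse_tblout(toy_tblout)
threshold = 1e-10
accepted_full = sorted(t for t, e in hits.items() if e["full"] <= threshold)
accepted_domain = sorted(t for t, e in hits.items() if e["best_domain"] <= threshold)
print(accepted_full)    # ['seqA', 'seqB']
print(accepted_domain)  # ['seqA']
```

With real output, the same function could be called on `open("pos_1.out")` to build the per-sequence E-values that feed the classification step.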
See the full report in Docs/report/Mahan_Balooei_LAB1_Report.pdf for figures (MCC curves, confusion matrices, E-value distributions).
Mahan Balooei
- 🏛️ Department of Pharmacy and Biotechnology, Alma Mater Studiorum – Università di Bologna
- 📧 mahan.balooei@studio.unibo.it
- 🔬 ORCID: 0009-0006-5358-0784
This project is licensed under the MIT License — see the LICENSE file for details.