This README is the developer-facing reference for the full PepSeqPred training and evaluation pipeline.
For lightweight inference usage and API quickstart, use README.pypi.md.
PepSeqPred supports two usage profiles:
- PyPI quickstart profile (`pip install pepseqpred`): user-facing inference API with bundled pretrained artifacts and artifact-path inference helpers.
- Repository developer profile (`pip install -e .[dev]`): full source tree for preprocessing, embeddings, label generation, FFNN training, Optuna tuning, prediction, evaluation, and HPC orchestration.
The repository profile is the source of truth for reproducing experiments end-to-end.
| Path | Purpose |
|---|---|
| `src/pepseqpred/apps/` | CLI entrypoints for each pipeline stage |
| `src/pepseqpred/core/preprocess/` | Metadata and z-score preprocessing |
| `src/pepseqpred/core/embeddings/` | ESM-2 sequence embedding generation |
| `src/pepseqpred/core/labels/` | Residue-level label construction |
| `src/pepseqpred/core/data/` | Iterable dataset and windowing/padding logic |
| `src/pepseqpred/core/models/` | FFNN model definitions |
| `src/pepseqpred/core/train/` | DDP, splitting, metrics, thresholds, trainer, seeds, weights |
| `src/pepseqpred/core/predict/` | Checkpoint/manifest resolution and inference logic |
| `src/pepseqpred/core/io/` | FASTA/TSV readers, key parsing, logging, CSV appends |
| `src/pepseqpred/api/` | Stable Python inference API and pretrained registry |
| `scripts/hpc/` | SLURM wrappers for each production stage |
| `scripts/tools/` | Zipapp build tools and Cocci eval prep/compare tooling |
| `tests/` | Unit, integration, and e2e coverage |
| `envs/` | Conda environment specs for local and HPC |
- Stage 1: preprocess metadata + z-scores
- Stage 2: generate ESM-2 per-residue embeddings
- Stage 3: build residue-level label shards
- Stage 4: train FFNN (seeded or ensemble-kfold, DDP-aware)
- Stage 5: optional Optuna tuning (DDP-aware)
- Stage 6: predict residue masks from checkpoint/manifest
- Stage 7: evaluate residue metrics (+ optional Cocci peptide compare)
CLI: `pepseqpred-preprocess` (`src/pepseqpred/apps/preprocess_cli.py`)

Inputs
- metadata TSV (PV1-style)
- z-score TSV

Core modules
- `core/preprocess/pv1.py`
- `core/preprocess/zscores.py`
- `core/io/read.py`

Command

```
pepseqpred-preprocess data/meta.tsv data/zscores.tsv --save
```

Output
- training-ready metadata TSV with `Def epitope`, `Uncertain`, and `Not epitope`
- default filename pattern: `input_data_<is_epi_z>_<is_epi_min_subs>_<not_epi_z>_<not_epi_max_subs|all>.tsv`
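The default filename pattern above can be sketched as a small helper. This is an illustrative reconstruction, not the actual implementation; the function name and parameter names are assumptions based on the placeholders in the pattern.

```python
# Hypothetical helper illustrating the default output filename pattern;
# the real CLI composes this internally during preprocessing.
def default_output_name(is_epi_z, is_epi_min_subs, not_epi_z, not_epi_max_subs=None):
    """Build input_data_<is_epi_z>_<is_epi_min_subs>_<not_epi_z>_<not_epi_max_subs|all>.tsv"""
    subs = "all" if not_epi_max_subs is None else str(not_epi_max_subs)
    return f"input_data_{is_epi_z}_{is_epi_min_subs}_{not_epi_z}_{subs}.tsv"

print(default_output_name(20, 4, 10))  # input_data_20_4_10_all.tsv
```

The `input_data_20_4_10_all.tsv` name used in the Stage 3 example below follows this pattern.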
CLI: `pepseqpred-esm` (`src/pepseqpred/apps/esm_cli.py`)

Inputs
- FASTA file
- optional metadata TSV for `id-family` naming mode

Core modules
- `core/embeddings/esm2.py`
- `core/io/read.py`
- `core/io/keys.py`

Command

```
pepseqpred-esm \
    --fasta-file data/targets.fasta \
    --out-dir localdata/esm2 \
    --embedding-key-mode id-family \
    --key-delimiter - \
    --model-name esm2_t33_650M_UR50D \
    --max-tokens 1022 \
    --batch-size 8
```

Output
- per-protein embedding files under `<out-dir>/artifacts/pts/*.pt`
- embedding index CSV under `<out-dir>/artifacts/*.csv`
- optional shard-specific outputs when `--num-shards > 1`
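The two embedding key modes determine the per-protein filenames. The sketch below is a hedged reconstruction of that naming behavior (the real logic lives in `core/io/keys.py`); the function name and signature are assumptions.

```python
# Hypothetical sketch of embedding file naming under the two key modes.
# "id" mode yields ID.pt; "id-family" joins the protein ID and its
# metadata family with the configured delimiter, yielding ID-family.pt.
def embedding_filename(protein_id, key_mode="id", family=None, delimiter="-"):
    if key_mode == "id-family":
        if family is None:
            raise ValueError("id-family mode requires a family from the metadata TSV")
        return f"{protein_id}{delimiter}{family}.pt"
    return f"{protein_id}.pt"

print(embedding_filename("P12345"))                                # P12345.pt
print(embedding_filename("P12345", "id-family", "coronaviridae"))  # P12345-coronaviridae.pt
```

Downstream stages must use the matching delimiter (see `--embedding-key-delim` in Stage 3).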
CLI: `pepseqpred-labels` (`src/pepseqpred/apps/labels_cli.py`)

Inputs
- preprocessed metadata TSV
- one or more embedding directories

Core module
- `core/labels/builder.py`

Command

```
pepseqpred-labels \
    data/input_data_20_4_10_all.tsv \
    localdata/labels/labels_shard_000.pt \
    --emb-dir localdata/esm2/artifacts/pts/shard_000 \
    --restrict-to-embeddings \
    --calc-pos-weight \
    --embedding-key-delim -
```

Output
- label shard `.pt` with protein label tensors and peptide metadata
- optional `class_stats` payload when `--calc-pos-weight` is enabled
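As a rough sketch of what `--calc-pos-weight` produces, the snippet below counts positive and negative residues and derives the neg/pos ratio, which is the usual `pos_weight` convention for `BCEWithLogitsLoss`. The exact computation in `core/labels/builder.py` may differ; this only illustrates the `class_stats` payload shape documented below.

```python
# Illustrative class_stats computation (assumption: pos_weight is the
# neg/pos residue ratio, the standard BCEWithLogitsLoss convention).
def class_stats(residue_labels):
    pos = sum(1 for y in residue_labels if y == 1)
    neg = sum(1 for y in residue_labels if y == 0)
    return {"pos_count": pos, "neg_count": neg, "pos_weight": neg / pos}

print(class_stats([1, 0, 0, 0, 1, 0]))
# {'pos_count': 2, 'neg_count': 4, 'pos_weight': 2.0}
```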
CLI: `pepseqpred-train-ffnn` (`src/pepseqpred/apps/train_ffnn_cli.py`)

Modes
- `seeded`: split/train seed pairs define runs
- `ensemble-kfold`: K-fold members per set, optional multiple set seeds

Core modules
- `core/data/proteindataset.py`
- `core/models/ffnn.py`
- `core/train/{trainer,split,ddp,metrics,threshold,weights,seed,embedding}.py`

Command (smoke)

```
pepseqpred-train-ffnn \
    --embedding-dirs localdata/esm2/artifacts/pts/shard_000 \
    --label-shards localdata/labels/labels_shard_000.pt \
    --epochs 1 \
    --subset 100 \
    --save-path localdata/models/ffnn_smoke \
    --results-csv localdata/models/ffnn_smoke/runs.csv
```

Outputs
- run checkpoint(s), usually `fully_connected.pt`
- per-run CSV (`runs.csv` or `multi_run_results.csv`)
- aggregate `multi_run_summary.json`
- ensemble manifest JSON in `ensemble-kfold` mode
CLI: `pepseqpred-train-ffnn-optuna` (`src/pepseqpred/apps/train_ffnn_optuna_cli.py`)

Core modules
- same data/model/train stack as Stage 4
- Optuna trial orchestration in the app layer

Command (smoke)

```
pepseqpred-train-ffnn-optuna \
    --embedding-dirs localdata/esm2/artifacts/pts/shard_000 \
    --label-shards localdata/labels/labels_shard_000.pt \
    --n-trials 2 \
    --epochs 1 \
    --save-path localdata/models/optuna_smoke \
    --csv-path localdata/models/optuna_smoke/trials.csv
```

Outputs
- trial rows CSV
- study storage (if configured)
- per-trial checkpoints under `trials/trial_*`
- `best_trial.json` and a copy of the best checkpoint
CLI: `pepseqpred-predict` (`src/pepseqpred/apps/prediction_cli.py`)

Accepted model artifact types
- single checkpoint `.pt`
- ensemble manifest `.json` (schema v1 or v2)

Core modules
- `core/predict/artifacts.py`
- `core/predict/inference.py`

Command

```
pepseqpred-predict \
    localdata/models/run_001/fully_connected.pt \
    data/inference_targets.fasta \
    --output-fasta localdata/predictions/predictions.fasta
```

Output
- FASTA containing binary residue masks
CLI: `pepseqpred-eval-ffnn` (`src/pepseqpred/apps/evaluate_ffnn_cli.py`)

Capabilities
- evaluate a single checkpoint or an ensemble manifest
- optional set auto-selection from `runs.csv`
- optional fold-level metrics and ROC/PR curves
- optional plot generation

Core modules
- `core/predict/artifacts.py`
- `core/predict/inference.py`
- `core/data/proteindataset.py`
- `core/train/metrics.py`

Command

```
pepseqpred-eval-ffnn \
    localdata/models/run_001/fully_connected.pt \
    --embedding-dirs localdata/esm2/artifacts/pts/shard_000 \
    --label-shards localdata/labels/labels_shard_000.pt \
    --output-json localdata/eval/ffnn_eval_summary.json
```

Output
- residue-level evaluation JSON
- optional fold payloads, curves, and plot files
The stable Python API is implemented in:
- `src/pepseqpred/api/predictor.py`
- `src/pepseqpred/api/pretrainedregistry.py`
- `src/pepseqpred/api/types.py`

Top-level exports (`import pepseqpred`):
- `load_pretrained_predictor`
- `list_pretrained_models`
- `load_predictor`
- `predict_sequence`
- `predict_fasta`
- `PepSeqPredictor`
- `PredictionResult`

The bundled pretrained registry currently includes:
- `flagship1-v1` (alias: `flagship1`)
- `flagship2-v1` (aliases: `flagship2`, `default`)
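The alias table implied by the registry listing can be sketched as follows. This is an illustration only; the actual resolution logic and data structures live in `src/pepseqpred/api/pretrainedregistry.py`, and `resolve_model_name` is a hypothetical name.

```python
# Illustrative alias resolution implied by the registry listing above.
REGISTRY = {"flagship1-v1": "<artifact paths>", "flagship2-v1": "<artifact paths>"}
ALIASES = {"flagship1": "flagship1-v1", "flagship2": "flagship2-v1", "default": "flagship2-v1"}

def resolve_model_name(name):
    """Map an alias (or canonical name) to its canonical registry entry."""
    canonical = ALIASES.get(name, name)
    if canonical not in REGISTRY:
        raise KeyError(f"unknown pretrained model: {name}")
    return canonical

print(resolve_model_name("default"))    # flagship2-v1
print(resolve_model_name("flagship1"))  # flagship1-v1
```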
- tensor shape: `(L, D+1)`, where `L` is the residue count
- the final feature column stores the sequence length
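The `(L, D+1)` layout described above can be illustrated with a toy example; plain lists stand in for the real torch tensor, and the helper name is hypothetical.

```python
# Toy illustration of the (L, D+1) per-protein embedding layout:
# L rows (one per residue), D embedding features per row, plus a
# final column that stores the sequence length L.
def with_length_column(per_residue_features):
    L = len(per_residue_features)
    return [row + [float(L)] for row in per_residue_features]

emb = with_length_column([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])  # L=3, D=2
print(len(emb), len(emb[0]))  # 3 3  -> shape (L, D+1)
```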
Label shard payload:

```
{
    "labels": {"<protein_id>": Tensor[(L,3)] or Tensor[(L,)]},
    "proteins": {"<protein_id>": {"tax_info": {...}, "peptides": [...]}},
    # optional:
    "class_stats": {"pos_count": int, "neg_count": int, "pos_weight": float}
}
```

Checkpoint payload:

```
{
    "model_state_dict": ...,
    "optim_state_dict": ...,
    "epoch": int,
    "config": {...},
    "best_loss": float,
    "metrics": {...}
}
```

Ensemble manifest schemas:
- schema v1: single set, `members` list
- schema v2: root `sets` list with `set_index`, each with `members`
- members are filtered by `status == "OK"` in predict/eval resolution
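The manifest resolution rules above (v1 vs. v2 layout, `status == "OK"` filtering) can be sketched in a few lines. This is a hedged reconstruction, not the code in `core/predict/artifacts.py`; member fields beyond `status` (such as `ckpt`) are assumptions for illustration.

```python
# Sketch of ensemble-manifest member resolution: schema v1 has a
# top-level "members" list; schema v2 has a root "sets" list, each
# entry carrying its own "members". Only members whose status is
# "OK" are kept for prediction/evaluation.
def resolve_members(manifest):
    if "sets" in manifest:  # schema v2
        member_sets = [s["members"] for s in manifest["sets"]]
    else:                   # schema v1
        member_sets = [manifest["members"]]
    return [[m for m in members if m.get("status") == "OK"] for members in member_sets]

v1 = {"members": [{"status": "OK", "ckpt": "a.pt"}, {"status": "FAILED", "ckpt": "b.pt"}]}
v2 = {"sets": [{"set_index": 0, "members": [{"status": "OK", "ckpt": "c.pt"}]}]}
print(resolve_members(v1))  # [[{'status': 'OK', 'ckpt': 'a.pt'}]]
```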
| CLI | File | Purpose |
|---|---|---|
| `pepseqpred-preprocess` | `apps/preprocess_cli.py` | metadata + z-score preprocessing |
| `pepseqpred-esm` | `apps/esm_cli.py` | ESM-2 embedding generation |
| `pepseqpred-labels` | `apps/labels_cli.py` | residue label shard generation |
| `pepseqpred-train-ffnn` | `apps/train_ffnn_cli.py` | seeded or ensemble-kfold training |
| `pepseqpred-train-ffnn-optuna` | `apps/train_ffnn_optuna_cli.py` | Optuna tuning |
| `pepseqpred-predict` | `apps/prediction_cli.py` | FASTA inference from checkpoint/manifest |
| `pepseqpred-eval-ffnn` | `apps/evaluate_ffnn_cli.py` | residue-level evaluation |
These wrappers are production-facing interfaces and should be treated as first-class entrypoints.
| Script | Stage | Default resources |
|---|---|---|
| `generateembeddings.sh` | Embeddings | GPU, array 0-3, a100, 2 CPU/GPU, 8G/GPU, 01:00:00 |
| `generatelabels.sh` | Labels | CPU, 1 CPU, 16G, 01:00:00 |
| `trainffnn.sh` | Train FFNN | GPU, 4x a100, 20 CPU, 256G, 12:00:00 |
| `trainffnnoptuna.sh` | Optuna | GPU, 4x a100, 20 CPU, 448G, 48:00:00 |
| `predictepitope.sh` | Predict | GPU, a100, 4 CPU, 32G, 00:30:00 |
| `evaluateffnn.sh` | End-to-end eval pipeline | GPU, a100, 8 CPU, 128G, 04:00:00 |
| `evalffnnsweep.sh` | Seeded eval batch submitter | wrapper script (calls `evaluateffnn.sh`) |
| `preprocessdata.sh` | Preprocess helper | local helper, not a SLURM script |
- `evaluateffnn.sh` orchestrates the prepare, embed, labels, predict, eval, and peptide compare stages with stage toggles (`RUN_PREP`, `RUN_EMBED`, `RUN_LABELS`, `RUN_PREDICT`, `RUN_EVAL`, `RUN_COMPARE`).
- `evaluateffnn.sh` and `evalffnnsweep.sh` depend on `scripts/tools/cocci_eval_pipeline.py`.
- HPC wrappers expect `.pyz` runtime artifacts in the working directory (for example `esm.pyz`, `train_ffnn.pyz`, `predict.pyz`).
| Tool | Purpose |
|---|---|
| `buildpyz.py` | build `.pyz` runtime apps from `src/pepseqpred` |
| `pyzapps.py` | registry of app target names to module entrypoints |
| `cocci_eval_pipeline.py` | Cocci-specific eval subset prep and peptide compare |
| `rename_embeddings_id_family.py` | rename `ID.pt` to `ID-family.pt` embeddings from metadata |
Build examples:

```
python scripts/tools/buildpyz.py --list
python scripts/tools/buildpyz.py esm
python scripts/tools/buildpyz.py all
```

Test suites are organized as:
- `tests/unit/`: module-level behavior and edge cases
- `tests/integration/`: CLI-level smoke and interactions
- `tests/e2e/`: train-to-predict boundary validation
Representative coverage areas include:
- API registry and predictor behavior (`tests/unit/api/*`)
- dataset, embeddings, labels, predict, and train internals (`tests/unit/core/*`)
- CLI parsers and eval selection/curve logic (`tests/unit/apps/*`)
- checkpoint/manifest prediction/evaluation smoke tests (`tests/integration/*`)
Run sequence:

```
ruff check .
pytest tests/unit
pytest tests/integration
pytest tests/e2e
```

Local Conda environment:

```
conda env create -f envs/environment.local.yml
conda activate pepseqpred
pip install -e .[dev]
```

For GPU/HPC development:

```
conda env create -f envs/environment.hpc.yml
conda activate pepseqpred
pip install -e .[dev]
```

Plain virtualenv:

```
python -m venv .venv
. .venv/bin/activate   # Linux/macOS
pip install --upgrade pip
pip install -e .[dev]
```

Windows PowerShell:

```
python -m venv .venv
.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
pip install -e .[dev]
```

When changing training or evaluation logic:
- preserve split semantics (`id` vs `id-family`)
- preserve seed handling and deterministic run planning
- avoid rank-dependent side effects in DDP code paths
- write shared artifacts only from the intended rank
- avoid output path collisions between experiments
- prefer smoke tests over expensive full retraining during development
- `id-family` embedding key mode requires a metadata family mapping.
- Label generation must align embedding naming with `--embedding-key-delim` (`""` for `ID.pt`, `-` for `ID-family.pt`).
- Prediction/evaluation threshold overrides must remain in `(0.0, 1.0)`.
- Ensemble manifests are resolved by valid `status == "OK"` members and optional `k_folds` truncation.
- For HPC pipelines, keep `.pyz` artifacts and shell scripts in the same execution directory unless intentionally using module imports.
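The open-interval threshold rule above can be expressed as a minimal guard. This is a sketch only; the real CLI validation (and its error message) may differ, and `validate_threshold` is a hypothetical name.

```python
# Minimal guard matching the note above: threshold overrides must
# lie strictly inside the open interval (0.0, 1.0).
def validate_threshold(t):
    if not (0.0 < t < 1.0):
        raise ValueError(f"threshold must be in (0.0, 1.0), got {t}")
    return t

print(validate_threshold(0.5))  # 0.5
```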
- GitHub issues: bug reports, feature requests, and development questions
- Maintainers: Jeffrey Hoelzel, Jason Ladner