This README is the developer-facing reference for the full PepSeqPred training and evaluation pipeline.
For lightweight inference usage and API quickstart, use README.pypi.md.
PepSeqPred supports two usage profiles:
- PyPI quickstart profile (`pip install pepseqpred`): user-facing inference API with bundled pretrained artifacts and artifact-path inference helpers.
- Repository developer profile (`pip install -e .[dev]`): full source tree for preprocessing, embeddings, label generation, FFNN training, Optuna tuning, prediction, evaluation, and HPC orchestration.
The repository profile is the source of truth for reproducing experiments end-to-end.
| Path | Purpose |
|---|---|
| `src/pepseqpred/apps/` | CLI entrypoints for each pipeline stage |
| `src/pepseqpred/core/preprocess/` | Metadata and z-score preprocessing |
| `src/pepseqpred/core/embeddings/` | ESM-2 sequence embedding generation |
| `src/pepseqpred/core/labels/` | Residue-level label construction |
| `src/pepseqpred/core/data/` | Iterable dataset and windowing/padding logic |
| `src/pepseqpred/core/models/` | FFNN model definitions |
| `src/pepseqpred/core/train/` | DDP, splitting, metrics, thresholds, trainer, seeds, weights |
| `src/pepseqpred/core/predict/` | Checkpoint/manifest resolution and inference logic |
| `src/pepseqpred/core/io/` | FASTA/TSV readers, key parsing, logging, CSV appends |
| `src/pepseqpred/api/` | Stable Python inference API and pretrained registry |
| `scripts/hpc/` | SLURM wrappers for each production stage |
| `scripts/tools/` | Zipapp build tools and Cocci eval prep/compare tooling |
| `tests/` | Unit, integration, and e2e coverage |
| `envs/` | Conda environment specs for local and HPC |
- Stage 1: preprocess metadata + z-scores
- Stage 2: generate ESM-2 per-residue embeddings
- Stage 3: build residue-level label shards
- Stage 4: train FFNN (seeded or ensemble-kfold, DDP-aware)
- Stage 5: optional Optuna tuning (DDP-aware)
- Stage 6: predict residue masks from checkpoint/manifest
- Stage 7: evaluate residue metrics (+ optional Cocci peptide compare)
CLI: `pepseqpred-preprocess` (`src/pepseqpred/apps/preprocess_cli.py`)

Inputs
- metadata TSV (PV1-style)
- z-score TSV

Core modules
- `core/preprocess/pv1.py`
- `core/preprocess/zscores.py`
- `core/io/read.py`

Command

```
pepseqpred-preprocess data/meta.tsv data/zscores.tsv --save
```

Output
- training-ready metadata TSV with `Def epitope`, `Uncertain`, and `Not epitope`
- default filename pattern: `input_data_<is_epi_z>_<is_epi_min_subs>_<not_epi_z>_<not_epi_max_subs|all>.tsv`
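The default filename pattern above can be sketched as a small helper. This is an illustrative reconstruction, not the actual implementation; the function name and parameter names are assumptions based on the placeholders in the pattern.

```python
# Hypothetical helper illustrating the default output filename pattern;
# the real CLI composes this internally during preprocessing.
def default_output_name(is_epi_z, is_epi_min_subs, not_epi_z, not_epi_max_subs=None):
    """Build input_data_<is_epi_z>_<is_epi_min_subs>_<not_epi_z>_<not_epi_max_subs|all>.tsv"""
    subs = "all" if not_epi_max_subs is None else str(not_epi_max_subs)
    return f"input_data_{is_epi_z}_{is_epi_min_subs}_{not_epi_z}_{subs}.tsv"

print(default_output_name(20, 4, 10))  # input_data_20_4_10_all.tsv
```

The `input_data_20_4_10_all.tsv` name used in the Stage 3 example below follows this pattern.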
CLI: `pepseqpred-esm` (`src/pepseqpred/apps/esm_cli.py`)

Inputs
- FASTA file
- optional metadata TSV for `id-family` naming mode

Core modules
- `core/embeddings/esm2.py`
- `core/io/read.py`
- `core/io/keys.py`

Command

```
pepseqpred-esm \
    --fasta-file data/targets.fasta \
    --out-dir localdata/esm2 \
    --embedding-key-mode id-family \
    --key-delimiter - \
    --model-name esm2_t33_650M_UR50D \
    --max-tokens 1022 \
    --batch-size 8
```

Output
- per-protein embedding files under `<out-dir>/artifacts/pts/*.pt`
- embedding index CSV under `<out-dir>/artifacts/*.csv`
- optional shard-specific outputs when `--num-shards > 1`
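The two embedding key modes determine the per-protein filenames. The sketch below is a hedged reconstruction of that naming behavior (the real logic lives in `core/io/keys.py`); the function name and signature are assumptions.

```python
# Hypothetical sketch of embedding file naming under the two key modes.
# "id" mode yields ID.pt; "id-family" joins the protein ID and its
# metadata family with the configured delimiter, yielding ID-family.pt.
def embedding_filename(protein_id, key_mode="id", family=None, delimiter="-"):
    if key_mode == "id-family":
        if family is None:
            raise ValueError("id-family mode requires a family from the metadata TSV")
        return f"{protein_id}{delimiter}{family}.pt"
    return f"{protein_id}.pt"

print(embedding_filename("P12345"))                                # P12345.pt
print(embedding_filename("P12345", "id-family", "coronaviridae"))  # P12345-coronaviridae.pt
```

Downstream stages must use the matching delimiter (see `--embedding-key-delim` in Stage 3).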
CLI: `pepseqpred-labels` (`src/pepseqpred/apps/labels_cli.py`)

Inputs
- preprocessed metadata TSV
- one or more embedding directories

Core module
- `core/labels/builder.py`

Command

```
pepseqpred-labels \
    data/input_data_20_4_10_all.tsv \
    localdata/labels/labels_shard_000.pt \
    --emb-dir localdata/esm2/artifacts/pts/shard_000 \
    --restrict-to-embeddings \
    --calc-pos-weight \
    --embedding-key-delim -
```

Output
- label shard `.pt` with protein label tensors and peptide metadata
- optional `class_stats` payload when `--calc-pos-weight` is enabled
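As a rough sketch of what `--calc-pos-weight` produces, the snippet below counts positive and negative residues and derives the neg/pos ratio, which is the usual `pos_weight` convention for `BCEWithLogitsLoss`. The exact computation in `core/labels/builder.py` may differ; this only illustrates the `class_stats` payload shape documented below.

```python
# Illustrative class_stats computation (assumption: pos_weight is the
# neg/pos residue ratio, the standard BCEWithLogitsLoss convention).
def class_stats(residue_labels):
    pos = sum(1 for y in residue_labels if y == 1)
    neg = sum(1 for y in residue_labels if y == 0)
    return {"pos_count": pos, "neg_count": neg, "pos_weight": neg / pos}

print(class_stats([1, 0, 0, 0, 1, 0]))
# {'pos_count': 2, 'neg_count': 4, 'pos_weight': 2.0}
```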
CLI: `pepseqpred-train-ffnn` (`src/pepseqpred/apps/train_ffnn_cli.py`)

Modes
- `seeded`: split/train seed pairs define runs
- `ensemble-kfold`: K-fold members per set, optional multiple set seeds

Core modules
- `core/data/proteindataset.py`
- `core/models/ffnn.py`
- `core/train/{trainer,split,ddp,metrics,threshold,weights,seed,embedding}.py`

Command (smoke)

```
pepseqpred-train-ffnn \
    --embedding-dirs localdata/esm2/artifacts/pts/shard_000 \
    --label-shards localdata/labels/labels_shard_000.pt \
    --epochs 1 \
    --subset 100 \
    --save-path localdata/models/ffnn_smoke \
    --results-csv localdata/models/ffnn_smoke/runs.csv
```

Outputs
- run checkpoint(s), usually `fully_connected.pt`
- per-run CSV (`runs.csv` or `multi_run_results.csv`)
- aggregate `multi_run_summary.json`
- ensemble manifest JSON in `ensemble-kfold` mode
CLI: `pepseqpred-train-ffnn-optuna` (`src/pepseqpred/apps/train_ffnn_optuna_cli.py`)

Core modules
- same data/model/train stack as Stage 4
- Optuna trial orchestration in the app layer

Command (smoke)

```
pepseqpred-train-ffnn-optuna \
    --embedding-dirs localdata/esm2/artifacts/pts/shard_000 \
    --label-shards localdata/labels/labels_shard_000.pt \
    --n-trials 2 \
    --epochs 1 \
    --save-path localdata/models/optuna_smoke \
    --csv-path localdata/models/optuna_smoke/trials.csv
```

Outputs
- trial rows CSV
- study storage (if configured)
- per-trial checkpoints under `trials/trial_*`
- `best_trial.json` and a copy of the best checkpoint
CLI: `pepseqpred-predict` (`src/pepseqpred/apps/prediction_cli.py`)

Accepted model artifact types
- single checkpoint `.pt`
- ensemble manifest `.json` (schema v1 or v2)

Core modules
- `core/predict/artifacts.py`
- `core/predict/inference.py`

Command

```
pepseqpred-predict \
    localdata/models/run_001/fully_connected.pt \
    data/inference_targets.fasta \
    --output-fasta localdata/predictions/predictions.fasta
```

Output
- FASTA containing binary residue masks
CLI: `pepseqpred-eval-ffnn` (`src/pepseqpred/apps/evaluate_ffnn_cli.py`)

Capabilities
- evaluate a single checkpoint or an ensemble manifest
- optional set auto-selection from `runs.csv`
- optional fold-level metrics and ROC/PR curves
- optional plot generation

Core modules
- `core/predict/artifacts.py`
- `core/predict/inference.py`
- `core/data/proteindataset.py`
- `core/train/metrics.py`

Command

```
pepseqpred-eval-ffnn \
    localdata/models/run_001/fully_connected.pt \
    --embedding-dirs localdata/esm2/artifacts/pts/shard_000 \
    --label-shards localdata/labels/labels_shard_000.pt \
    --output-json localdata/eval/ffnn_eval_summary.json
```

Output
- residue-level evaluation JSON
- optional fold payloads, curves, and plot files
The stable Python API is implemented in:
- `src/pepseqpred/api/predictor.py`
- `src/pepseqpred/api/pretrainedregistry.py`
- `src/pepseqpred/api/types.py`

Top-level exports (`import pepseqpred`):
- `load_pretrained_predictor`
- `list_pretrained_models`
- `load_predictor`
- `predict_sequence`
- `predict_fasta`
- `PepSeqPredictor`
- `PredictionResult`

The bundled pretrained registry currently includes:
- `flagship1-v1` (alias: `flagship1`)
- `flagship2-v1` (aliases: `flagship2`, `default`)
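The alias table implied by the registry listing can be sketched as follows. This is an illustration only; the actual resolution logic and data structures live in `src/pepseqpred/api/pretrainedregistry.py`, and `resolve_model_name` is a hypothetical name.

```python
# Illustrative alias resolution implied by the registry listing above.
REGISTRY = {"flagship1-v1": "<artifact paths>", "flagship2-v1": "<artifact paths>"}
ALIASES = {"flagship1": "flagship1-v1", "flagship2": "flagship2-v1", "default": "flagship2-v1"}

def resolve_model_name(name):
    """Map an alias (or canonical name) to its canonical registry entry."""
    canonical = ALIASES.get(name, name)
    if canonical not in REGISTRY:
        raise KeyError(f"unknown pretrained model: {name}")
    return canonical

print(resolve_model_name("default"))    # flagship2-v1
print(resolve_model_name("flagship1"))  # flagship1-v1
```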
- tensor shape: `(L, D+1)`, where `L` is the residue count
- the final feature column stores the sequence length
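The `(L, D+1)` layout described above can be illustrated with a toy example; plain lists stand in for the real torch tensor, and the helper name is hypothetical.

```python
# Toy illustration of the (L, D+1) per-protein embedding layout:
# L rows (one per residue), D embedding features per row, plus a
# final column that stores the sequence length L.
def with_length_column(per_residue_features):
    L = len(per_residue_features)
    return [row + [float(L)] for row in per_residue_features]

emb = with_length_column([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])  # L=3, D=2
print(len(emb), len(emb[0]))  # 3 3  -> shape (L, D+1)
```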
Label shard payload:

```
{
    "labels": {"<protein_id>": Tensor[(L,3)] or Tensor[(L,)]},
    "proteins": {"<protein_id>": {"tax_info": {...}, "peptides": [...]}},
    # optional:
    "class_stats": {"pos_count": int, "neg_count": int, "pos_weight": float}
}
```

Checkpoint payload:

```
{
    "model_state_dict": ...,
    "optim_state_dict": ...,
    "epoch": int,
    "config": {...},
    "best_loss": float,
    "metrics": {...}
}
```

Ensemble manifest schemas:
- schema v1: single set, `members` list
- schema v2: root `sets` list with `set_index`, each with `members`
- members are filtered by `status == "OK"` in predict/eval resolution
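The manifest resolution rules above (v1 vs. v2 layout, `status == "OK"` filtering) can be sketched in a few lines. This is a hedged reconstruction, not the code in `core/predict/artifacts.py`; member fields beyond `status` (such as `ckpt`) are assumptions for illustration.

```python
# Sketch of ensemble-manifest member resolution: schema v1 has a
# top-level "members" list; schema v2 has a root "sets" list, each
# entry carrying its own "members". Only members whose status is
# "OK" are kept for prediction/evaluation.
def resolve_members(manifest):
    if "sets" in manifest:  # schema v2
        member_sets = [s["members"] for s in manifest["sets"]]
    else:                   # schema v1
        member_sets = [manifest["members"]]
    return [[m for m in members if m.get("status") == "OK"] for members in member_sets]

v1 = {"members": [{"status": "OK", "ckpt": "a.pt"}, {"status": "FAILED", "ckpt": "b.pt"}]}
v2 = {"sets": [{"set_index": 0, "members": [{"status": "OK", "ckpt": "c.pt"}]}]}
print(resolve_members(v1))  # [[{'status': 'OK', 'ckpt': 'a.pt'}]]
```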
| CLI | File | Purpose |
|---|---|---|
| `pepseqpred-preprocess` | `apps/preprocess_cli.py` | metadata + z-score preprocessing |
| `pepseqpred-esm` | `apps/esm_cli.py` | ESM-2 embedding generation |
| `pepseqpred-labels` | `apps/labels_cli.py` | residue label shard generation |
| `pepseqpred-train-ffnn` | `apps/train_ffnn_cli.py` | seeded or ensemble-kfold training |
| `pepseqpred-train-ffnn-optuna` | `apps/train_ffnn_optuna_cli.py` | Optuna tuning |
| `pepseqpred-predict` | `apps/prediction_cli.py` | FASTA inference from checkpoint/manifest |
| `pepseqpred-eval-ffnn` | `apps/evaluate_ffnn_cli.py` | residue-level evaluation |
These wrappers are production-facing interfaces and should be treated as first-class entrypoints.
| Script | Stage | Default resources |
|---|---|---|
| `generateembeddings.sh` | Embeddings | GPU, array 0-3, a100, 2 CPU/GPU, 8G/GPU, 01:00:00 |
| `generatelabels.sh` | Labels | CPU, 1 CPU, 16G, 01:00:00 |
| `trainffnn.sh` | Train FFNN | GPU, 4x a100, 20 CPU, 256G, 12:00:00 |
| `trainffnnoptuna.sh` | Optuna | GPU, 4x a100, 20 CPU, 448G, 48:00:00 |
| `predictepitope.sh` | Predict | GPU, a100, 4 CPU, 32G, 00:30:00 |
| `evaluateffnn.sh` | End-to-end eval pipeline | GPU, a100, 8 CPU, 128G, 04:00:00 |
| `evalffnnsweep.sh` | Seeded eval batch submitter | wrapper script (calls `evaluateffnn.sh`) |
| `preprocessdata.sh` | Preprocess helper | local helper, not a SLURM script |
- `evaluateffnn.sh` orchestrates the prepare, embed, labels, predict, eval, and peptide compare stages with stage toggles (`RUN_PREP`, `RUN_EMBED`, `RUN_LABELS`, `RUN_PREDICT`, `RUN_EVAL`, `RUN_COMPARE`).
- `evaluateffnn.sh` and `evalffnnsweep.sh` depend on `scripts/tools/cocci_eval_pipeline.py`.
- HPC wrappers expect `.pyz` runtime artifacts in the working directory (for example `esm.pyz`, `train_ffnn.pyz`, `predict.pyz`).
| Tool | Purpose |
|---|---|
| `buildpyz.py` | build `.pyz` runtime apps from `src/pepseqpred` |
| `pyzapps.py` | registry of app target names to module entrypoints |
| `cocci_eval_pipeline.py` | Cocci-specific eval subset prep and peptide compare |
| `rename_embeddings_id_family.py` | rename `ID.pt` to `ID-family.pt` embeddings from metadata |
Build examples:

```
python scripts/tools/buildpyz.py --list
python scripts/tools/buildpyz.py esm
python scripts/tools/buildpyz.py all
```

Test suites are organized as:
- `tests/unit/`: module-level behavior and edge cases
- `tests/integration/`: CLI-level smoke and interactions
- `tests/e2e/`: train-to-predict boundary validation
Representative coverage areas include:
- API registry and predictor behavior (`tests/unit/api/*`)
- dataset, embeddings, labels, predict, and train internals (`tests/unit/core/*`)
- CLI parsers and eval selection/curve logic (`tests/unit/apps/*`)
- checkpoint/manifest prediction/evaluation smoke tests (`tests/integration/*`)
Run sequence:

```
ruff check .
pytest tests/unit
pytest tests/integration
pytest tests/e2e
```

Local Conda environment:

```
conda env create -f envs/environment.local.yml
conda activate pepseqpred
pip install -e .[dev]
```

For GPU/HPC development:

```
conda env create -f envs/environment.hpc.yml
conda activate pepseqpred
pip install -e .[dev]
```

Plain virtualenv:

```
python -m venv .venv
. .venv/bin/activate   # Linux/macOS
pip install --upgrade pip
pip install -e .[dev]
```

Windows PowerShell:

```
python -m venv .venv
.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
pip install -e .[dev]
```

When changing training or evaluation logic:
- preserve split semantics (`id` vs `id-family`)
- preserve seed handling and deterministic run planning
- avoid rank-dependent side effects in DDP code paths
- write shared artifacts only from the intended rank
- avoid output path collisions between experiments
- prefer smoke tests over expensive full retraining during development
- `id-family` embedding key mode requires a metadata family mapping.
- Label generation must align embedding naming with `--embedding-key-delim` (`""` for `ID.pt`, `-` for `ID-family.pt`).
- Prediction/evaluation threshold overrides must remain in `(0.0, 1.0)`.
- Ensemble manifests are resolved by valid `status == "OK"` members and optional `k_folds` truncation.
- For HPC pipelines, keep `.pyz` artifacts and shell scripts in the same execution directory unless intentionally using module imports.
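The open-interval threshold rule above can be expressed as a minimal guard. This is a sketch only; the real CLI validation (and its error message) may differ, and `validate_threshold` is a hypothetical name.

```python
# Minimal guard matching the note above: threshold overrides must
# lie strictly inside the open interval (0.0, 1.0).
def validate_threshold(t):
    if not (0.0 < t < 1.0):
        raise ValueError(f"threshold must be in (0.0, 1.0), got {t}")
    return t

print(validate_threshold(0.5))  # 0.5
```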
- GitHub issues: bug reports, feature requests, and development questions
- Maintainers: Jeffrey Hoelzel, Jason Ladner