Train a Random Forest classifier from labeled genome sequences.
- Read labeled FASTA (first header token = class label)
- Vectorize sequences via feature hashing (2²⁰ buckets, hashing trick)
- Evaluate accuracy via train/test split or stratified k-fold CV
- Train the final model on all data (100 trees, bootstrap sampling)
- Compute out-of-bag (OOB) accuracy from the bootstrap samples
- Save compressed model bundle (bincode + zstd)
- Export feature importance with genomic coordinates
```
pathotypr train \
  --input labeled_genomes.fasta \
  --output model.pathotypr.zst \
  --kmer-size 21 \
  --test-split 0.2 \
  --threads 8 \
  --excel
```

| Flag | Default | Description |
|---|---|---|
| `-i, --input` | required | Labeled FASTA file |
| `-o, --output` | required | Output model path (`.pathotypr.zst`) |
| `-k, --kmer-size` | 21 | K-mer size for feature hashing (1–31) |
| `-s, --test-split` | 0.2 | Fraction held out for accuracy estimation |
| `-t, --threads` | all cores | Number of CPU threads |
| `--cv-folds` | — | Stratified k-fold CV instead of single split (e.g., 5 or 10) |
| `--max-depth` | 20 | Maximum tree depth (regularization) |
| `--min-samples-leaf` | 5 | Minimum samples per leaf node (regularization) |
| File | Content |
|---|---|
| `model.pathotypr.zst` | Compressed model bundle (vectorizer + encoder + 100 trees) |
| `model.importance.tsv` | Top 500 features: rank, bucket, split count, k-mers |
| `model.importance.coords.tsv` | Genomic coordinates for discriminant k-mers |
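The importance table is plain TSV, so it is easy to post-process. A minimal sketch of reading it, assuming one column per documented field (`rank`, `bucket`, `split_count`, `kmers`) with comma-separated k-mers; the actual header names and sample values below are hypothetical:

```python
import csv
import io

# Hypothetical sample in the documented shape (rank, bucket, split count,
# k-mers); the real file is produced by `pathotypr train`.
sample = """rank\tbucket\tsplit_count\tkmers
1\t523441\t87\tACGTACGTACGTACGTACGTA
2\t90112\t63\tTTGACGTTGACGTTGACGTTG,GACGTTGACGTTGACGTTGAC
"""

def top_features(tsv_text, n=10):
    """Return the n highest-ranked features as dicts with typed fields."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = [
        {"rank": int(r["rank"]),
         "bucket": int(r["bucket"]),
         "split_count": int(r["split_count"]),
         "kmers": r["kmers"].split(",")}
        for r in reader
    ]
    return sorted(rows, key=lambda r: r["rank"])[:n]

feats = top_features(sample)
print(feats[0]["rank"], feats[0]["bucket"])
```

Note that one bucket can map back to several k-mers, since feature hashing allows collisions; that is why the `kmers` field is a list.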
- Single split (default): trains on 80% of the samples, tests on the held-out 20%
- k-fold CV (`--cv-folds`): stratified by class, reports mean ± std across folds
- OOB accuracy: always computed from the bootstrap samples; effectively free and nearly unbiased

The final model is always trained on 100% of the data, regardless of the evaluation method.
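A minimal sketch of the stratified fold assignment described above, using only the standard library; the class names, fold count, and per-fold accuracies are placeholders, not output of the real pipeline:

```python
import random
import statistics
from collections import defaultdict

def stratified_folds(labels, k, seed=0):
    """Assign sample indices to k folds, spreading each class evenly
    so every fold preserves the overall class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):
            folds[j % k].append(i)
    return folds

# Toy label set (class names are made up): 60 vs 40 samples.
labels = ["pathotype_A"] * 60 + ["pathotype_B"] * 40
folds = stratified_folds(labels, k=5)
for fold in folds:
    n_a = sum(1 for i in fold if labels[i] == "pathotype_A")
    print(len(fold), n_a)  # every fold: 20 samples, 12 of class A

# Per-fold accuracies (placeholder numbers) reported as mean ± std.
accs = [0.91, 0.89, 0.93, 0.90, 0.92]
print(f"{statistics.mean(accs):.3f} ± {statistics.stdev(accs):.3f}")
```

Stratification matters here because pathotype classes are often imbalanced: a plain random split could leave a rare class entirely out of a test fold, making its accuracy estimate meaningless.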
- Feature hashing: 2²⁰ = 1,048,576 buckets via bitmask (no vocabulary)
- Trees: 100 decision trees, sqrt(n_features) candidates per split, Gini impurity
- Bootstrap: each tree trained on ~63% of samples; ~37% are out-of-bag
- Serialization: bincode → zstd streaming compression
- Memory: model size depends on tree complexity, typically 5–50 MB compressed
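The bucket masking and the bootstrap arithmetic above can be checked in a few lines. The hash function below is a stand-in (the tool's actual hash is not specified here); only the power-of-two masking and the ~63%/37% split are the point:

```python
import hashlib

BUCKETS = 1 << 20   # 2**20 = 1,048,576 buckets
MASK = BUCKETS - 1  # power-of-two size: a bitmask replaces modulo

def bucket(kmer: str) -> int:
    # Hashing trick: map any k-mer to a fixed bucket range with no stored
    # vocabulary; distinct k-mers may collide in the same bucket.
    # blake2b is a placeholder for the tool's real hash function.
    h = int.from_bytes(
        hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")
    return h & MASK

b = bucket("ACGTACGTACGTACGTACGTA")  # a 21-mer
print(b)

# Bootstrap: drawing n samples with replacement leaves any given sample
# out with probability (1 - 1/n)**n -> 1/e, i.e. ~63% of samples are
# in-bag for each tree and ~37% are out-of-bag.
n = 10_000
p_oob = (1 - 1 / n) ** n
print(round(p_oob, 3))  # 0.368
```

Because every tree has its own ~37% out-of-bag set, each sample can be scored by the trees that never saw it, which is why OOB accuracy costs nothing extra to compute.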
For in-depth documentation of the underlying algorithms:
- Feature Hashing — The hashing trick, 2-bit encoding, bucket collisions, and reverse mapping
- Random Forest — Sparse CART trees, binary search on sparse rows, bootstrap aggregation, and OOB accuracy
- Training Pipeline — End-to-end flow: vectorize → evaluate (CV or split) → train final model → serialize with bincode+zstd