Standard FASTA format. The first whitespace-delimited token in each header is used as the class label.
>L4 sample_0001 extra_info
ACTGATCGATCG...
>L2 sample_0002
ACTGATCGATCG...
Labels extracted from the example above: `L4`, `L2`.
Requirements:
- At least 2 distinct classes
- At least a few sequences per class (10+ recommended for reliable training)
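The label rule and the requirements above can be checked before training. The following is a minimal sketch, not the tool's own code: `count_class_labels`, `check_training_set`, and the `recommended` parameter are hypothetical names introduced for illustration.

```python
from collections import Counter
from typing import Iterable, List


def count_class_labels(lines: Iterable[str]) -> Counter:
    """Count sequences per class label (first whitespace-delimited header token)."""
    counts: Counter = Counter()
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if not tokens:
                raise ValueError("FASTA header has no label")
            counts[tokens[0]] += 1
    return counts


def check_training_set(counts: Counter, recommended: int = 10) -> List[str]:
    """Fail on fewer than 2 classes; return labels below the recommended count."""
    if len(counts) < 2:
        raise ValueError("need at least 2 distinct classes")
    return sorted(label for label, n in counts.items() if n < recommended)


example = [
    ">L4 sample_0001 extra_info",
    "ACTGATCGATCG",
    ">L2 sample_0002",
    "ACTGATCGATCG",
]
counts = count_class_labels(example)
under_sampled = check_training_set(counts)
```

Here `under_sampled` lists both labels, because each class has only one sequence in the toy input.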
Tab-separated, no header required (lines starting with # are skipped).
| Column | Content | Required |
|---|---|---|
| 1 | Genomic position (1-based) | ✅ |
| 2 | REF allele | ✅ |
| 3 | ALT allele | ✅ |
| 4+ | Lineage hierarchy (one level per column) | ✅ (at least one) |
| After the first empty column | Annotation columns (gene, mutation) | Optional |
#pos	ref	alt	level1	level2		gene	mutation
761155	C	T	L4	L4.9		gyrA	Ser95Thr
2155168	G	A	L2	L2.2		katG	Ser315Thr
4247431	CC	C	L1	L1.1		ethA	frameshift
- Lineage columns are read left-to-right until the first empty cell
- Hierarchical nesting: `L4` → `L4;L4.9` internally (semicolon-joined path)
- `classify` works best with SNP markers (single-base REF and ALT)
- `split-fastq` handles SNPs, MNVs, and small indels. However, indels in FASTQ mode are intentionally skipped because short reads produce unreliable k-mer matches across repetitive regions
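The column rules above (comment lines skipped, lineage columns read until the first empty cell, annotations after it) can be sketched as a small parser. This is an illustration under stated assumptions, not the tool's implementation: the `Marker` class, `parse_markers`, and the `is_snp` property are hypothetical names, and the example rows use explicit `\t` characters with an empty cell separating lineage from annotation.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Marker:
    pos: int                 # 1-based genomic position
    ref: str
    alt: str
    lineage: str             # semicolon-joined path, e.g. "L4;L4.9"
    annotations: List[str]   # optional columns after the first empty cell

    @property
    def is_snp(self) -> bool:
        # SNP = single-base REF and single-base ALT
        return len(self.ref) == 1 and len(self.alt) == 1


def parse_markers(lines: Iterable[str]) -> List[Marker]:
    markers: List[Marker] = []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip() or line.startswith("#"):
            continue  # blank lines and comment lines are skipped
        fields = [f.strip() for f in line.split("\t")]
        if len(fields) < 4 or not fields[3]:
            raise ValueError(f"line {line_no}: need pos, ref, alt and one lineage level")
        idx = 3
        levels: List[str] = []
        while idx < len(fields) and fields[idx]:
            levels.append(fields[idx])  # read left-to-right until an empty cell
            idx += 1
        annotations = [f for f in fields[idx + 1:] if f]
        markers.append(
            Marker(int(fields[0]), fields[1], fields[2], ";".join(levels), annotations)
        )
    return markers


rows = [
    "#pos\tref\talt\tlevel1\tlevel2\t\tgene\tmutation",
    "761155\tC\tT\tL4\tL4.9\t\tgyrA\tSer95Thr",
    "4247431\tCC\tC\tL1\tL1.1\t\tethA\tframeshift",
]
markers = parse_markers(rows)
snp_markers = [m for m in markers if m.is_snp]
```

Filtering on `is_snp` is one way to keep only the single-base markers that `classify` is said to work best with.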
sample_name /absolute/path/to/sample.fasta /optional/path/to/sample.gff3
- Column 1: sample name (must be unique)
- Column 2: path to FASTA file
- Column 3: optional path to GFF3 annotation
sample_name /path/to/reads_R1.fastq.gz /path/to/reads_R2.fastq.gz
- Column 1: sample name (must be unique)
- Column 2+: paths to FASTQ files (one for single-end, two for paired-end)
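Both sample sheets described above share the same shape: a unique name followed by one or two paths. The sketch below validates that shape. It assumes tab-delimited columns, which the text does not state for these files; `parse_sample_sheet` and the `/data/...` paths are hypothetical.

```python
from typing import Dict, Iterable, List


def parse_sample_sheet(
    lines: Iterable[str], min_paths: int, max_paths: int
) -> Dict[str, List[str]]:
    """Map each unique sample name to its list of paths."""
    samples: Dict[str, List[str]] = {}
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue
        # Assumption: columns are separated by tabs
        name, *paths = [field.strip() for field in line.split("\t")]
        paths = [p for p in paths if p]
        if not name:
            raise ValueError(f"line {line_no}: missing sample name")
        if name in samples:
            raise ValueError(f"line {line_no}: duplicate sample name {name!r}")
        if not min_paths <= len(paths) <= max_paths:
            raise ValueError(f"line {line_no}: expected {min_paths}-{max_paths} paths")
        samples[name] = paths
    return samples


# FASTA sheet: FASTA path plus an optional GFF3 path
assemblies = parse_sample_sheet(
    ["s1\t/data/s1.fasta\t/data/s1.gff3", "s2\t/data/s2.fasta"], 1, 2
)
# FASTQ sheet: one file for single-end, two for paired-end
reads = parse_sample_sheet(
    ["s1\t/data/s1_R1.fastq.gz\t/data/s1_R2.fastq.gz", "s2\t/data/s2.fastq.gz"], 1, 2
)
layout = {name: "paired" if len(p) == 2 else "single" for name, p in reads.items()}
```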
For `classify` and `split-fastq`: a single-record FASTA file containing the reference genome used to define marker positions.
For `match`: a multi-record FASTA file containing all candidate reference genomes.
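A quick pre-flight check on the reference file can catch a mismatch between command and record count. This is a sketch only: `check_reference` is a hypothetical helper, and accepting a single record for `match` is an assumption, since the text only says the file holds all candidate genomes.

```python
from typing import Iterable

SINGLE_RECORD_COMMANDS = {"classify", "split-fastq"}


def count_fasta_records(lines: Iterable[str]) -> int:
    """Count FASTA records by counting header lines."""
    return sum(1 for line in lines if line.startswith(">"))


def check_reference(lines: Iterable[str], command: str) -> int:
    n = count_fasta_records(lines)
    if command in SINGLE_RECORD_COMMANDS:
        if n != 1:
            raise ValueError(f"{command} expects exactly 1 record, found {n}")
    elif command == "match":
        # Assumption: any non-empty set of candidates is accepted
        if n < 1:
            raise ValueError("match expects at least 1 record, found 0")
    else:
        raise ValueError(f"unknown command {command!r}")
    return n


single = [">ref", "ACTGATCGATCG"]
multi = [">ref_a", "ACTG", ">ref_b", "ACTG"]
n_single = check_reference(single, "classify")
n_multi = check_reference(multi, "match")
```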
Standard GFF3 format. Only CDS features are used. Gene names are extracted from attributes in this priority order:
1. `gene=...`
2. `locus_tag=...`
3. `Name=...`
4. `ID=...`
Coordinates are 1-based (GFF3 standard) and converted internally to 0-based for interval tree queries.
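The attribute priority and the coordinate conversion can be illustrated with a short parser. This is a sketch under assumptions, not the tool's code: `cds_intervals` and `gene_name` are hypothetical names, and representing the converted interval as 0-based half-open `[start - 1, end)` is an assumption (the text only says 0-based).

```python
from typing import Dict, Iterable, List, Optional, Tuple
from urllib.parse import unquote

ATTRIBUTE_PRIORITY = ("gene", "locus_tag", "Name", "ID")


def parse_attributes(column9: str) -> Dict[str, str]:
    """Split the GFF3 attributes column into a key/value dict."""
    attrs: Dict[str, str] = {}
    for item in column9.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = unquote(value.strip())  # GFF3 percent-encodes values
    return attrs


def gene_name(attrs: Dict[str, str]) -> Optional[str]:
    """Return the first attribute present, in priority order."""
    for key in ATTRIBUTE_PRIORITY:
        if attrs.get(key):
            return attrs[key]
    return None


def cds_intervals(lines: Iterable[str]) -> List[Tuple[str, int, int, Optional[str]]]:
    intervals = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\r\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue  # only CDS features are used
        start, end = int(cols[3]), int(cols[4])
        # GFF3 is 1-based inclusive; assumed target is 0-based half-open
        intervals.append((cols[0], start - 1, end, gene_name(parse_attributes(cols[8]))))
    return intervals


gff = [
    "##gff-version 3",
    "chr\tsrc\tgene\t7302\t9818\t.\t+\t.\tID=gene1",
    "chr\tsrc\tCDS\t7302\t9818\t.\t+\t0\tID=cds1;locus_tag=Rv0006;gene=gyrA",
    "chr\tsrc\tCDS\t100\t200\t.\t-\t0\tID=cds2",
]
intervals = cds_intervals(gff)
```

With half-open intervals, a 1-based marker position `p` falls inside a CDS when `start <= p - 1 < end`.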