Divergence-based truncation of forwards trajectories#20
Conversation
Full root-to-tip paths over-represent the root and deep internal nodes in training data (they appear in nearly every trajectory) and those deep nodes are also the least-reliably reconstructed. Add an optional --max-divergence flag to trajectory.py that truncates each trajectory's basal end, keeping only the portion near the tip. When --max-divergence is set, walk from the tip toward the root accumulating per-branch substitution counts (the same gap-ignoring Hamming values written in the |n_subs| header field) and drop ancestors once cumulative subs / L exceeds the threshold, where L is the marker's alignment.fasta column count. The --min-nodes guard (default 3) extends the path basally if the cutoff yields too few nodes, since training requires >= 3 nodes per path. The most-basal kept node becomes the new origin (header reset to |0|0). When --max-divergence is unset behaviour is unchanged: the full path is emitted and the summary JSON's tree-edge cumulative distance is identical to before. Verified on 6 markers across all 5 phyla (cyanobacteriota, bacteroidota incl. a 20.8k-tip marker, actinomycetota, bacillota, pseudomonadota incl. the 247k-tip TIGR00064): every truncated trajectory is a contiguous suffix of the full path, has >= min(min_nodes, full_depth) nodes, and is a tight cut (kept node within the threshold, next ancestor over it). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two bug fixes to the --max-divergence basal truncation feature:
Bug 1 — wrong divergence metric. Truncation previously computed
divergence as cumulative observed-Hamming substitutions / alignment
column count. bac120 alignments are ~70% gap columns, so that number
maxes at ~0.13-0.58 root-to-tip and --max-divergence 1.0 truncated
nothing. Now divergence(ancestor -> tip) = div[tip] - div[ancestor]
where div is the auspice tree divergence (node_attrs.div, cumulative
subs/site augur computed). Adds a --auspice argument (required when
--max-divergence is set); trajectory.py loads the JSON, builds a
{node: div} map, then frees the JSON. The Snakefile trajectory rule
gains auspice_raw.json as an input and passes --auspice.
Bug 2 — min-nodes wrongly extended past the cutoff. min_nodes was a
minimum-length guard that re-added the exact over-divergence deep
nodes the truncation exists to remove. min_nodes is now a drop
filter: the path is never extended basally past the cutoff; a
truncated trajectory with fewer than min_nodes frames is dropped
entirely.
Drop accounting: build_trajectory_content returns path_depth == -1
to mark a divergence-attributable drop -- one where the untruncated
trajectory would have been emitted (>= 2 frames) -- distinct from
path_depth 0 for paths too short to emit regardless of truncation.
The per-marker "Dropped N tips" report and the summary JSON
dropped_tips field count only -1 drops, so N is exactly the number
of trajectories truncation removed from the output.
Summary stats use an empty-list-tolerant _stats() helper so a marker
with every tip dropped writes a summary instead of crashing on
min()/max() of an empty list.
Backward compatible: with --max-divergence unset, output is
byte-identical to origin/main (verified).
Verified on 5 markers across all 5 phyla (incl. a 181k-tip
pseudomonadota marker): cutoff at 1.0 fires (truncates/drops a
meaningful fraction), kept trajectories are contiguous suffixes with
identical shared sequences and >= min_nodes, truncation is tight, the
drop set matches the iff predicate exactly, and the reported drop
count matches the verified iff count on every marker.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…jectory.py branches.py already parses the auspice JSON to extract branch structure, so it's the natural place to also extract divergence. This adds a "div" column to branches.tsv containing per-branch divergence (child_div - parent_div from node_attrs.div), consistent with how the hamming column stores per-branch Hamming distance. trajectory.py reads div from branches.tsv and accumulates per-branch values along each root-to-tip path to get cumulative divergence for the truncation cutoff comparison. This removes the need to separately parse the auspice JSON: the --auspice flag and load_div_map() function are dropped entirely. Changes: - branches.py: emit per-branch div (child_div - parent_div) as 4th column - train_test_split.py: pass through div column between hamming and train_test - trajectory.py: parse_branches() returns div_of[(parent, child)] dict; build_trajectory_content() accumulates along path for truncation - Snakefile: remove auspice_raw.json input and --auspice from trajectories rule - tests: update fixtures to use per-branch div_of format All 51 tests pass. Integration-tested on flu-h3-xs with --max-divergence 0.02: 9219 train trajectories written, 333 tips dropped. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Cool! I like where this is headed. I cleaned this up a bit to drop I like including Yeah, this is currently specified only per-analysis in the config, like https://github.com/blab/trajectories/blob/main/Snakefile#L225. We should probably change this to a global setting in |
Summary
Adds optional basal truncation to forwards trajectories in
trajectory.py, so the root and deep internal nodes aren't over-represented in training data.Currently each forwards trajectory is the full root→tip path, which means the root and the deepest internal nodes appear in essentially every trajectory — and those deep nodes are also the least-reliably reconstructed (saturation). Per team discussion (truncate so deep nodes aren't trained on repeatedly), this truncates each trajectory's basal end by cumulative divergence, not hop count, since reconstruction accuracy tracks substitutions/site rather than node count.
Changes
trajectory.pyflags:--max-divergence(float, subs/site) — truncate each trajectory to the tip-ward subpath within this divergence of the leaf.--min-nodes(int, default 3) — drop filter: if the truncated trajectory has fewer than this many nodes, the tip's trajectory is dropped entirely (never extended past the cutoff).--auspice(path) — required with--max-divergence; supplies the tree.div[tip] − div[ancestor]fromnode_attrs.div. The most-basal kept node becomes the trajectory origin (|0|0); per-branch substitution counts and node sequences are unchanged.Snakefiletrajectory rule gainsauspice_raw.jsonas an input and optionalmax_divergence/min_nodesconfig keys.Backward compatibility
With
--max-divergenceunset, output is byte-identical to the previous behavior (full root→tip paths). Existing tests unchanged; 51/51 pass (includes new truncation/drop-filter tests).Verification
A/B tested on 5 markers across all 5 phyla (incl. a 181k-tip pseudomonadota marker),
--max-divergence 1.0 --min-nodes 3:🤖 Generated with Claude Code