Divergence-based truncation of forwards trajectories by rchikhi · Pull Request #20 · blab/trajectories

rchikhi · 2026-05-16T14:38:37Z

Summary

Adds optional basal truncation to forwards trajectories in trajectory.py, so the root and deep internal nodes aren't over-represented in training data.

Currently each forwards trajectory is the full root→tip path, which means the root and the deepest internal nodes appear in essentially every trajectory — and those deep nodes are also the least-reliably reconstructed (saturation). Per team discussion (truncate so deep nodes aren't trained on repeatedly), this truncates each trajectory's basal end by cumulative divergence, not hop count, since reconstruction accuracy tracks substitutions/site rather than node count.

Changes

New trajectory.py flags:
- --max-divergence (float, subs/site) — truncate each trajectory to the tip-ward subpath within this divergence of the leaf.
- --min-nodes (int, default 3) — drop filter: if the truncated trajectory has fewer than this many nodes, the tip's trajectory is dropped entirely (never extended past the cutoff).
- --auspice (path) — required with --max-divergence; supplies the tree.
Divergence metric is the auspice tree divergence: div[tip] − div[ancestor] from node_attrs.div. The most-basal kept node becomes the trajectory origin (|0|0); per-branch substitution counts and node sequences are unchanged.
Snakefile trajectory rule gains auspice_raw.json as an input and optional max_divergence / min_nodes config keys.
Per-marker dropped-tip count reported and written to the summary JSON.

Backward compatibility

With --max-divergence unset, output is byte-identical to the previous behavior (full root→tip paths). Existing tests unchanged; 51/51 pass (includes new truncation/drop-filter tests).

Verification

A/B tested on 5 markers across all 5 phyla (incl. a 181k-tip pseudomonadota marker), --max-divergence 1.0 --min-nodes 3:

The cutoff fires (was previously a no-op under an earlier metric).
Every kept trajectory is a contiguous suffix of the untruncated trajectory, with identical shared sequences and ≥3 nodes.
Truncation is tight: the most-basal kept node is within the cutoff; the next-deeper ancestor would exceed it.
The dropped-tip set matches the divergence predicate exactly.
Median trajectory length drops as expected (e.g. pseudomonadota 63→40 nodes); ~1.5% of tips dropped on the large marker.

🤖 Generated with Claude Code

Full root-to-tip paths over-represent the root and deep internal nodes in training data (they appear in nearly every trajectory) and those deep nodes are also the least-reliably reconstructed. Add an optional --max-divergence flag to trajectory.py that truncates each trajectory's basal end, keeping only the portion near the tip. When --max-divergence is set, walk from the tip toward the root accumulating per-branch substitution counts (the same gap-ignoring Hamming values written in the |n_subs| header field) and drop ancestors once cumulative subs / L exceeds the threshold, where L is the marker's alignment.fasta column count. The --min-nodes guard (default 3) extends the path basally if the cutoff yields too few nodes, since training requires >= 3 nodes per path. The most-basal kept node becomes the new origin (header reset to |0|0). When --max-divergence is unset behaviour is unchanged: the full path is emitted and the summary JSON's tree-edge cumulative distance is identical to before. Verified on 6 markers across all 5 phyla (cyanobacteriota, bacteroidota incl. a 20.8k-tip marker, actinomycetota, bacillota, pseudomonadota incl. the 247k-tip TIGR00064): every truncated trajectory is a contiguous suffix of the full path, has >= min(min_nodes, full_depth) nodes, and is a tight cut (kept node within the threshold, next ancestor over it). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two bug fixes to the --max-divergence basal truncation feature: Bug 1 — wrong divergence metric. Truncation previously computed divergence as cumulative observed-Hamming substitutions / alignment column count. bac120 alignments are ~70% gap columns, so that number maxes at ~0.13-0.58 root-to-tip and --max-divergence 1.0 truncated nothing. Now divergence(ancestor -> tip) = div[tip] - div[ancestor] where div is the auspice tree divergence (node_attrs.div, cumulative subs/site augur computed). Adds a --auspice argument (required when --max-divergence is set); trajectory.py loads the JSON, builds a {node: div} map, then frees the JSON. The Snakefile trajectory rule gains auspice_raw.json as an input and passes --auspice. Bug 2 — min-nodes wrongly extended past the cutoff. min_nodes was a minimum-length guard that re-added the exact over-divergence deep nodes the truncation exists to remove. min_nodes is now a drop filter: the path is never extended basally past the cutoff; a truncated trajectory with fewer than min_nodes frames is dropped entirely. Drop accounting: build_trajectory_content returns path_depth == -1 to mark a divergence-attributable drop -- one where the untruncated trajectory would have been emitted (>= 2 frames) -- distinct from path_depth 0 for paths too short to emit regardless of truncation. The per-marker "Dropped N tips" report and the summary JSON dropped_tips field count only -1 drops, so N is exactly the number of trajectories truncation removed from the output. Summary stats use an empty-list-tolerant _stats() helper so a marker with every tip dropped writes a summary instead of crashing on min()/max() of an empty list. Backward compatible: with --max-divergence unset, output is byte-identical to origin/main (verified). Verified on 5 markers across all 5 phyla (incl. a 181k-tip pseudomonadota marker): cutoff at 1.0 fires (truncates/drops a meaningful fraction), kept trajectories are contiguous suffixes with identical shared sequences and >= min_nodes, truncation is tight, the drop set matches the iff predicate exactly, and the reported drop count matches the verified iff count on every marker. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…jectory.py branches.py already parses the auspice JSON to extract branch structure, so it's the natural place to also extract divergence. This adds a "div" column to branches.tsv containing per-branch divergence (child_div - parent_div from node_attrs.div), consistent with how the hamming column stores per-branch Hamming distance. trajectory.py reads div from branches.tsv and accumulates per-branch values along each root-to-tip path to get cumulative divergence for the truncation cutoff comparison. This removes the need to separately parse the auspice JSON: the --auspice flag and load_div_map() function are dropped entirely. Changes: - branches.py: emit per-branch div (child_div - parent_div) as 4th column - train_test_split.py: pass through div column between hamming and train_test - trajectory.py: parse_branches() returns div_of[(parent, child)] dict; build_trajectory_content() accumulates along path for truncation - Snakefile: remove auspice_raw.json input and --auspice from trajectories rule - tests: update fixtures to use per-branch div_of format All 51 tests pass. Integration-tested on flu-h3-xs with --max-divergence 0.02: 9219 train trajectories written, 333 tips dropped. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

trvrb · 2026-05-17T02:43:01Z

Cool! I like where this is headed. I cleaned this up a bit to drop --auspice input to trajectories.py and instead include div in branches.tsv. This tidies trajectories.py and workflow logic.

I like including --max-divergence as a command line option to trajectories.py. But we should have a way to set this as (global) config. And to store it in the metadata alongside the trajectories so that we can inspect how the data was processed. I think so far there hasn't been need for global config values?

Yeah, this is currently specified only per-analysis in the config, like https://github.com/blab/trajectories/blob/main/Snakefile#L225. We should probably change this to a global setting in config.yaml that can be over-ridden per analysis. For things like test_proportion, mutations_back, max_clade_proportion and new max_divergence. What do you think?

Rayan Chikhi and others added 3 commits May 16, 2026 11:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Divergence-based truncation of forwards trajectories#20

Divergence-based truncation of forwards trajectories#20
rchikhi wants to merge 3 commits into
mainfrom
feat/divergence-truncation

rchikhi commented May 16, 2026

Uh oh!

trvrb commented May 17, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

rchikhi commented May 16, 2026

Summary

Changes

Backward compatibility

Verification

Uh oh!

trvrb commented May 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

trvrb commented May 17, 2026 •

edited

Loading