Skip to content

Divergence-based truncation of forwards trajectories#20

Open
rchikhi wants to merge 3 commits into
mainfrom
feat/divergence-truncation
Open

Divergence-based truncation of forwards trajectories#20
rchikhi wants to merge 3 commits into
mainfrom
feat/divergence-truncation

Conversation

@rchikhi
Copy link
Copy Markdown
Collaborator

@rchikhi rchikhi commented May 16, 2026

Summary

Adds optional basal truncation to forwards trajectories in trajectory.py, so the root and deep internal nodes aren't over-represented in training data.

Currently each forwards trajectory is the full root→tip path, which means the root and the deepest internal nodes appear in essentially every trajectory — and those deep nodes are also the least-reliably reconstructed (saturation). Per team discussion (truncate so deep nodes aren't trained on repeatedly), this truncates each trajectory's basal end by cumulative divergence, not hop count, since reconstruction accuracy tracks substitutions/site rather than node count.

Changes

  • New trajectory.py flags:
    • --max-divergence (float, subs/site) — truncate each trajectory to the tip-ward subpath within this divergence of the leaf.
    • --min-nodes (int, default 3) — drop filter: if the truncated trajectory has fewer than this many nodes, the tip's trajectory is dropped entirely (never extended past the cutoff).
    • --auspice (path) — required with --max-divergence; supplies the tree.
  • Divergence metric is the auspice tree divergence: div[tip] − div[ancestor] from node_attrs.div. The most-basal kept node becomes the trajectory origin (|0|0); per-branch substitution counts and node sequences are unchanged.
  • Snakefile trajectory rule gains auspice_raw.json as an input and optional max_divergence / min_nodes config keys.
  • Per-marker dropped-tip count reported and written to the summary JSON.

Backward compatibility

With --max-divergence unset, output is byte-identical to the previous behavior (full root→tip paths). Existing tests unchanged; 51/51 pass (includes new truncation/drop-filter tests).

Verification

A/B tested on 5 markers across all 5 phyla (incl. a 181k-tip pseudomonadota marker), --max-divergence 1.0 --min-nodes 3:

  • The cutoff fires (was previously a no-op under an earlier metric).
  • Every kept trajectory is a contiguous suffix of the untruncated trajectory, with identical shared sequences and ≥3 nodes.
  • Truncation is tight: the most-basal kept node is within the cutoff; the next-deeper ancestor would exceed it.
  • The dropped-tip set matches the divergence predicate exactly.
  • Median trajectory length drops as expected (e.g. pseudomonadota 63→40 nodes); ~1.5% of tips dropped on the large marker.

🤖 Generated with Claude Code

Rayan Chikhi and others added 3 commits May 16, 2026 11:50
Full root-to-tip paths over-represent the root and deep internal nodes
in training data (they appear in nearly every trajectory) and those
deep nodes are also the least-reliably reconstructed. Add an optional
--max-divergence flag to trajectory.py that truncates each trajectory's
basal end, keeping only the portion near the tip.

When --max-divergence is set, walk from the tip toward the root
accumulating per-branch substitution counts (the same gap-ignoring
Hamming values written in the |n_subs| header field) and drop ancestors
once cumulative subs / L exceeds the threshold, where L is the marker's
alignment.fasta column count. The --min-nodes guard (default 3) extends
the path basally if the cutoff yields too few nodes, since training
requires >= 3 nodes per path. The most-basal kept node becomes the new
origin (header reset to |0|0). When --max-divergence is unset behaviour
is unchanged: the full path is emitted and the summary JSON's tree-edge
cumulative distance is identical to before.

Verified on 6 markers across all 5 phyla (cyanobacteriota, bacteroidota
incl. a 20.8k-tip marker, actinomycetota, bacillota, pseudomonadota incl.
the 247k-tip TIGR00064): every truncated trajectory is a contiguous
suffix of the full path, has >= min(min_nodes, full_depth) nodes, and is
a tight cut (kept node within the threshold, next ancestor over it).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two bug fixes to the --max-divergence basal truncation feature:

Bug 1 — wrong divergence metric. Truncation previously computed
divergence as cumulative observed-Hamming substitutions / alignment
column count. bac120 alignments are ~70% gap columns, so that number
maxes at ~0.13-0.58 root-to-tip and --max-divergence 1.0 truncated
nothing. Now divergence(ancestor -> tip) = div[tip] - div[ancestor]
where div is the auspice tree divergence (node_attrs.div, cumulative
subs/site augur computed). Adds a --auspice argument (required when
--max-divergence is set); trajectory.py loads the JSON, builds a
{node: div} map, then frees the JSON. The Snakefile trajectory rule
gains auspice_raw.json as an input and passes --auspice.

Bug 2 — min-nodes wrongly extended past the cutoff. min_nodes was a
minimum-length guard that re-added the exact over-divergence deep
nodes the truncation exists to remove. min_nodes is now a drop
filter: the path is never extended basally past the cutoff; a
truncated trajectory with fewer than min_nodes frames is dropped
entirely.

Drop accounting: build_trajectory_content returns path_depth == -1
to mark a divergence-attributable drop -- one where the untruncated
trajectory would have been emitted (>= 2 frames) -- distinct from
path_depth 0 for paths too short to emit regardless of truncation.
The per-marker "Dropped N tips" report and the summary JSON
dropped_tips field count only -1 drops, so N is exactly the number
of trajectories truncation removed from the output.

Summary stats use an empty-list-tolerant _stats() helper so a marker
with every tip dropped writes a summary instead of crashing on
min()/max() of an empty list.

Backward compatible: with --max-divergence unset, output is
byte-identical to origin/main (verified).

Verified on 5 markers across all 5 phyla (incl. a 181k-tip
pseudomonadota marker): cutoff at 1.0 fires (truncates/drops a
meaningful fraction), kept trajectories are contiguous suffixes with
identical shared sequences and >= min_nodes, truncation is tight, the
drop set matches the iff predicate exactly, and the reported drop
count matches the verified iff count on every marker.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…jectory.py

branches.py already parses the auspice JSON to extract branch structure, so
it's the natural place to also extract divergence. This adds a "div" column
to branches.tsv containing per-branch divergence (child_div - parent_div
from node_attrs.div), consistent with how the hamming column stores
per-branch Hamming distance.

trajectory.py reads div from branches.tsv and accumulates per-branch values
along each root-to-tip path to get cumulative divergence for the truncation
cutoff comparison. This removes the need to separately parse the auspice
JSON: the --auspice flag and load_div_map() function are dropped entirely.

Changes:
- branches.py: emit per-branch div (child_div - parent_div) as 4th column
- train_test_split.py: pass through div column between hamming and train_test
- trajectory.py: parse_branches() returns div_of[(parent, child)] dict;
  build_trajectory_content() accumulates along path for truncation
- Snakefile: remove auspice_raw.json input and --auspice from trajectories rule
- tests: update fixtures to use per-branch div_of format

All 51 tests pass. Integration-tested on flu-h3-xs with --max-divergence
0.02: 9219 train trajectories written, 333 tips dropped.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@trvrb
Copy link
Copy Markdown
Member

trvrb commented May 17, 2026

Cool! I like where this is headed. I cleaned this up a bit to drop --auspice input to trajectories.py and instead include div in branches.tsv. This tidies trajectories.py and workflow logic.

I like including --max-divergence as a command line option to trajectories.py. But we should have a way to set this as (global) config. And to store it in the metadata alongside the trajectories so that we can inspect how the data was processed. I think so far there hasn't been need for global config values?

Yeah, this is currently specified only per-analysis in the config, like https://github.com/blab/trajectories/blob/main/Snakefile#L225. We should probably change this to a global setting in config.yaml that can be over-ridden per analysis. For things like test_proportion, mutations_back, max_clade_proportion and new max_divergence. What do you think?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants