evo2 SAE eval (2/2): probing harness + CLI (on #1630) #1636

Open
polinabinder1 wants to merge 3 commits into pbinder/evo2-eval-harness from pbinder/evo2-eval-probe

Conversation

polinabinder1 (Collaborator) commented Jun 12, 2026

Part 2 of 2 of the eval, on top of the label producers in #1630 — review/merge together. #1630 defines the concepts; this runs the model and emits the artifacts.

Summary

The probing harness / CLI ("how to run it"), stacked on the label producers (#1630).

Contents

  • probe.py: extract / auroc / linear / codon-aa / euk-f1 / domain-eval / annotate
  • evo2_buffer.py: Evo2 engine → ActivationBuffer
  • probe_loss_recovered.py: fidelity via shared sae.eval.loss_recovered

How to run ($CKPT=mbridge dir, $SAE=trained SAE .pt)

# 1. build the probing buffer once (needs the model)
python probe.py extract --evo2-ckpt-dir $CKPT --sae-checkpoint $SAE --layer 26 --fasta probe.fa --out buf.npz
# 2. score (no model)
python probe.py auroc  --acts buf.npz --labels base_A,base_C,base_G,base_T,motif_ATG,cds_coding
python probe.py linear --acts buf.npz --labels cds_coding,is_prok
# 3. persist annotations -> the feature_metadata parquet the engine/dashboard load
python probe.py annotate --acts buf.npz --out feature_annotations.parquet --min-auroc 0.85
# 4. annotated-dataset domain-F1 (prec/nt, recall/annotation) + AUROC vs any BED/GFF (needs model)
python probe.py domain-eval --evo2-ckpt-dir $CKPT --sae-checkpoint $SAE --layer 26 \
    --fasta GRCh38_chr20.fa --track exon=refseq.gff3:exon --track cCRE=encode_ccre.bed
# 5. SAE fidelity (loss recovered)
python probe_loss_recovered.py --evo2-ckpt-dir $CKPT --sae-checkpoint $SAE --layer 26 --fasta probe.fa

Imports labelers (#1630) + scoring/annotate (#1629); needs the evo2_sae engine (#1622) importable in the env.

Summary by CodeRabbit

  • New Features
    • Added an Evo2 sparse autoencoder interpretability suite with tools for extracting activation buffers, computing feature performance metrics, generating feature annotations, evaluating genomic domain structures, and measuring model reconstruction fidelity, all accessible from the command line.

Expected output

  • extract → writes buf.npz (codes [N_tokens, n_features] + per-token labels); prints the sequence/token counts.
  • auroc → table with columns label, %pos, best AUROC, feature. Expect strong detectors on top: base_A/C/G/T ≈ 0.95–1.0 (a feature firing on one nucleotide separates it cleanly), and motif_ATG/cds_coding in the 0.85–0.95 band for the ~5 monosemantic features found at 7B/L26.
  • annotate → "[annotate] K features labeled (AUROC ≥ 0.85) over L concepts → parquet" + {feature_id, label, auroc, …}.
  • domain-eval → per-concept domF1 / AUROC / %pos; exon/cds should beat the shuffle null.

Measured (7B/L26): variance-explained ~89.5%, dead latents ~21%, ~5 features with motif AUROC > 0.85. Base-detector AUROC is expected near 1.0 but not yet measured — extract→auroc run pending the GPU env.
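
To make the "best AUROC feature" column concrete, here is a minimal, dependency-free sketch of how a best single-feature AUROC per label could be computed from a codes matrix and a binary label vector. The function names `auroc` and `best_feature` are hypothetical and are not taken from probe.py or sae.eval.probing; the real implementation is likely vectorized and may differ in tie handling. The sketch returns nan when only one class is present, since AUROC is undefined in that case.

```python
import math


def auroc(scores, labels):
    """Rank-based AUROC (Mann-Whitney U); nan if only one class is present."""
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        return math.nan  # undefined for all-positive or all-negative labels

    # Assign 1-based ranks, averaging the ranks of tied scores.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def best_feature(codes, labels):
    """Return (feature_index, auroc) for the highest-AUROC column of `codes`.

    `codes` is a list of per-token rows, shape [n_tokens][n_features].
    """
    best_idx, best_score = None, math.nan
    for f in range(len(codes[0])):
        score = auroc([row[f] for row in codes], labels)
        if not math.isnan(score) and (best_idx is None or score > best_score):
            best_idx, best_score = f, score
    return best_idx, best_score
```

A feature that fires only on the positive tokens scores 1.0 here, which is the pattern the base_A/C/G/T expectation above describes.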

The 'how to run it' half: probe.py (extract/auroc/linear/codon-aa/euk-f1/domain-eval/
annotate CLI), evo2_buffer (Evo2 -> ActivationBuffer), probe_loss_recovered (fidelity).
Imports the label producers from the base PR + sae.eval.probing from #1629.

Signed-off-by: Polina Binder <pbinder@nvidia.com>

copy-pr-bot Bot commented Jun 12, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

coderabbitai Bot (Contributor) commented Jun 12, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 26748685-d4ae-43c1-a0d9-dac2db2324af

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Three new scripts implement a complete Evo2 sparse autoencoder interpretability suite. evo2_buffer.py reads FASTA files, samples DNA sequences from prokaryotic/eukaryotic pools, and builds ActivationBuffer objects containing sparse codes and per-token labels. probe.py provides a unified CLI for buffer-based metric probing (AUROC, linear classifiers, codon-AA analysis, feature annotation) and model-backed genomic domain evaluation. probe_loss_recovered.py computes loss-recovery metrics by wrapping the SAE and installing layer hooks to capture or replace residual activations.

Changes

Evo2 SAE Interpretability Probing

Layer / File(s) Summary
Activation buffer construction from sequences
bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/evo2_buffer.py
FASTA parser, sequence sampler with kingdom-based oversampling and token-budget controls, and buffer builder that preallocates tensors, tokenizes sequences with kingdom tags, runs locked Evo2 forward passes, encodes sparse codes, computes per-token labeler outputs, and returns NumPy-backed ActivationBuffer.
Metric-based probing from precomputed buffers
bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/probe.py (lines 1–352, 408–409)
Buffer loading and label resolution utilities, AUROC command to compute per-label metrics and print tables, _eval_matrix helper to fit logistic regression and compute train/test scores, linear command to compare SAE vs. dense representations, codon-aa classifier to compute family-disjoint amino-acid recall, annotate command to generate per-feature concept metadata as Parquet, and CLI dispatcher with shared argument handling for all subcommands.
Model-backed streaming and genomic domain evaluation
bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/probe.py (lines 206–318, 320–341)
Core _encode_windows helper that tiles windows through batches, performs locked Evo2 forward passes, fills label/instance masks, extract command to build and save buffers using Evo2, euk-f1 command to compute exon/intron/CDS domain F1, and domain-eval command to evaluate arbitrary genomic tracks against SAE features with per-concept AUROC and optimal thresholds.
Loss recovery and reconstruction fidelity evaluation
bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/probe_loss_recovered.py
SAEWrap class to produce reconstructions in residual space with optional denormalization, L26Hook forward-hook for capturing or replacing layer outputs, and main evaluation routine that samples sequences, registers the layer hook, computes next-token cross-entropy deltas under locked engine execution, and reports loss-recovered metrics.
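
As a rough illustration of the fidelity metric named in the last row, the sketch below uses the definition of loss recovered that is common in the SAE literature: the fraction of the gap between zero-ablation and the original model that the SAE reconstruction closes. This is an assumption; the shared sae.eval.loss_recovered used by this PR may normalize differently. The helper names are hypothetical, and the three inputs correspond to the ce_original, ce_sae, and ce_zero quantities mentioned in the review.

```python
def mean_cross_entropy(per_batch_ce):
    """Average per-batch cross-entropy, failing fast on an empty list."""
    if not per_batch_ce:
        # Without this guard an empty evaluation could silently report a score.
        raise ValueError(
            "no evaluable sequences were built; check --n-seqs, "
            "the FASTA filter, and the per-sequence token length"
        )
    return sum(per_batch_ce) / len(per_batch_ce)


def loss_recovered(ce_original, ce_sae, ce_zero):
    """Fraction of the zero-ablation loss gap closed by the SAE reconstruction.

    1.0 means the spliced-in reconstruction matches the original model;
    0.0 means it is no better than zeroing the layer output.
    """
    gap = ce_zero - ce_original
    if gap == 0:
        raise ValueError("zero-ablation loss equals original loss; metric undefined")
    return (ce_zero - ce_sae) / gap
```

For example, with ce_original = 2.0, ce_sae = 2.2, and ce_zero = 4.0, the reconstruction closes 1.8 of a 2.0 gap, so the score is about 0.9.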

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Suggested reviewers

  • jstjohn
  • trvachov

Poem

🐰 A rabbit hops through DNA so bright,
Sparse codes hidden from the light,
Probing domains, aurocs gleam,
Building buffers in the stream,
Loss recovered—fidelity's dream! 🧬✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 32.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name | Status | Explanation
Title check | ✅ Passed | The title 'evo2 SAE eval (2/2): probing harness + CLI (on #1630)' accurately summarizes the main change: adding the probing harness and CLI for Evo2 SAE evaluation as part 2 of the evaluation workflow.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Description check | ✅ Passed | The PR description comprehensively covers the objectives, contents, usage examples, and expected outputs of the probing harness and CLI implementation for Evo2 SAE evaluation.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

cmd_auroc/cmd_linear/cmd_annotate each repeated 'load buffer + resolve names (filtered to
buffer, default all)'. One _load(a) -> (buf, names) helper. CPU annotate smoke still green.

Signed-off-by: Polina Binder <pbinder@nvidia.com>
polinabinder1 changed the title from "evo2 SAE eval: probing harness + CLI" to "evo2 SAE eval (2/2): probing harness + CLI (on #1630)" Jun 12, 2026

polinabinder1 (Collaborator, Author) commented

@coderabbitai review

coderabbitai Bot (Contributor) commented Jun 12, 2026

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

polinabinder1 (Collaborator, Author) commented

@coderabbitai review

coderabbitai Bot (Contributor) commented Jun 12, 2026

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 6

🧹 Nitpick comments (2)
bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/evo2_buffer.py (1)

37-52: ⚡ Quick win

Add Google-style docstrings instead of suppressing D103 in bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/evo2_buffer.py and bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/probe.py.

The shared root cause is the same in both files: new public helpers and CLI entry points are being introduced with # noqa: D103 (or no docstring) instead of repo-required docstrings. Please document the FASTA/sampling helpers in evo2_buffer.py and the public CLI commands in probe.py with Google-style docstrings rather than suppressing the lint rule. As per coding guidelines, **/*.py: Use Google-style docstrings (pydocstyle convention) in Python code and Use Ruff for Python linting and formatting with Google-style docstrings.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/evo2_buffer.py`
around lines 37 - 52, Replace the "# noqa: D103" suppressions by adding
Google-style docstrings for the public helpers and CLI entry points: add a short
summary, Args, Returns, and Examples where appropriate to the read_fasta and
sample_sequences functions in evo2_buffer.py (document parameters like path,
fasta, max_tokens, seq_len, kingdoms, seed and what is yielded/returned), and
likewise add Google-style docstrings to the public CLI functions/commands in
probe.py (document command purpose, parameters/flags and exit behavior); ensure
docstrings follow pydocstyle/Google conventions so Ruff/linting passes.

Source: Coding guidelines
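
For reference, a Google-style docstring on a FASTA helper might look like the sketch below. This is an illustration only: the bodies are a hypothetical minimal parser, not the code in evo2_buffer.py, and `parse_fasta_lines` is a helper introduced here so the parsing logic can be exercised without touching the filesystem.

```python
from typing import Iterable, Iterator, Tuple


def parse_fasta_lines(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Parse FASTA-formatted lines into records.

    Args:
        lines: Iterable of text lines, with or without trailing newlines.

    Yields:
        Tuples of ``(header, sequence)``, where ``header`` is the text after
        ``>`` and ``sequence`` is the upper-cased concatenation of the
        record's sequence lines.
    """
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def read_fasta(path: str) -> Iterator[Tuple[str, str]]:
    """Read records from a plain-text FASTA file.

    Args:
        path: Path to the FASTA file.

    Yields:
        Tuples of ``(header, sequence)`` in file order.
    """
    with open(path) as handle:
        yield from parse_fasta_lines(handle)
```

Because both functions are generators, the section is `Yields:` rather than `Returns:`, which is the distinction pydocstyle's Google convention expects.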

bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/probe_loss_recovered.py (1)

51-85: ⚡ Quick win

Replace the D10x suppressions with actual Google-style docstrings.

This new file opts out of pydocstyle for SAEWrap, L26Hook, and main() instead of documenting the new CLI surface and hook behavior. Adding concise Google-style docstrings here is a small cleanup that keeps the script aligned with the repository standard.

As per coding guidelines, **/*.py: "Ensure all Python files follow Google-style docstrings (pydocstyle convention)" and "Use Ruff for Python linting and formatting with Google-style docstrings".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/probe_loss_recovered.py`
around lines 51 - 85, Add Google-style docstrings for SAEWrap, L26Hook, and
main() instead of suppressing pydocstyle (remove the noqa D10x markers). For
SAEWrap, document the class purpose, the constructor parameter sae (type and
expected interface: encode, decoder, pre_bias, optional normalize_input) and
describe forward(x) return values (recon, codes) and normalization behavior. For
L26Hook, document the class purpose, the attributes mode/override/captured,
valid mode values ("off", "capture", "replace"), and __call__ behavior (how it
captures or replaces the first tensor in output and dtype handling). For main(),
add a concise docstring describing the CLI surface and what the script does when
executed (setup, hooking behavior, and overall flow). Ensure docstrings follow
Google style for one-line summary, optional Args/Returns sections where useful.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/probe_loss_recovered.py`:
- Around line 107-115: The code can proceed with an empty `batches` list and
produce misleading perfect scores; add a fail-fast guard after building
`batches` (after the loop that calls `sample_sequences`, `engine.tokenize`, and
appends to `batches`) that checks `if not batches:` and raises a clear exception
(e.g., ValueError) explaining no evaluable sequences were produced (mention
`--n-seqs`, filtered FASTA, or token length <= 4) so the evaluator (which
computes `ce_original`, `ce_sae`, `ce_zero` and `loss_recovered`) never runs on
empty data.

In
`@bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/probe.py`:
- Around line 343-351: The --auroc-device argument in _add_model_args is
hardcoded to "cuda:1", causing failures on single-GPU hosts; change its default
to match the primary device (e.g., "cuda:0") or make it None and resolve to the
same device as --device at runtime; update the same hardcoded default at the
other occurrence (lines referenced in the review) so all model-backed buffers
use the primary device by default (reference: _add_model_args and the
--auroc-device argument).
- Around line 19-27: The CLI help lists a "loss-recovered" subcommand but main()
never registers it; update main() to add a subparser for "loss-recovered" (using
the existing ArgumentParser.add_subparsers usage), set its help text to match
the help block, and wire it to the actual handler (e.g.,
set_defaults(func=probe_loss_recovered.main) or call the probe_loss_recovered
entrypoint) so invoking "probe.py loss-recovered" runs the intended logic; also
audit the similar missing registrations noted around the 354-405 region and
register any other commands mentioned in the top help that lack corresponding
subparsers.
- Around line 86-90: The _load function currently silently filters unknown
labels; change it to validate requested labels and fail fast: after loading buf
via ActivationBuffer.load(a.acts) and parsing labels from a.labels or
buf.label_names, compute missing = [t for t in labels if t not in buf.name_idx]
and if missing is non-empty raise a ValueError (or SystemExit) with a clear
message listing the unknown label names and available label names; otherwise
return buf and the requested labels in the original order (not filtered).
- Around line 114-120: The current guard skips only all-zero test labels but not
all-one test labels, causing best_single_train_test and auroc_vec to receive
single-class targets; update the conditional that checks ytr and yte so it also
treats an all-ones yte as invalid (e.g., check yte.sum() in (0, len(yte)) or
equivalent) and set out[n] to (nan,nan) in that case; modify the logic around
ytr, yte, the existing if-block, and the block that calls fit_logreg,
best_single_train_test and auroc_vec so both all-negative and all-positive test
folds are skipped.
- Around line 159-165: The codon→aa probe is leaking test info because Xz is
standardized with full-dataset mean/std; compute mean/std from the training
split and apply that normalization to both train and test instead. Specifically,
after calling split_indices(...) (tr, te), compute mu and sigma from X[tr] (use
X[tr].mean(0) and X[tr].std(0)+1e-6), then set Xz = (X - mu)/sigma before
calling decode_eval and fit_softmax; also ensure trn is chosen as the
intersection of tr with the non-hidx indices (instead of using the full-dataset
non-hidx mask) so fit_softmax(Xz[trn], ...) only uses training examples.
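
The train-only standardization described in the last item can be sketched as follows. This is a dependency-free illustration with hypothetical helper names; in NumPy the same statistics are `X[tr].mean(0)` and `X[tr].std(0) + 1e-6`, as the comment suggests. The point is that mean and standard deviation come from training rows only and are then applied to every row, and that the softmax training indices are the intersection of the train split with the non-held-out rows.

```python
import math


def fit_standardizer(train_rows, eps=1e-6):
    """Per-feature mean and (population) std computed from training rows only."""
    n = len(train_rows)
    n_features = len(train_rows[0])
    mu = [sum(row[j] for row in train_rows) / n for j in range(n_features)]
    sigma = [
        math.sqrt(sum((row[j] - mu[j]) ** 2 for row in train_rows) / n) + eps
        for j in range(n_features)
    ]
    return mu, sigma


def apply_standardizer(rows, mu, sigma):
    """Apply previously fitted statistics to any rows (train or test)."""
    return [[(x - m) / s for x, m, s in zip(row, mu, sigma)] for row in rows]


def train_without_held_out(train_idx, held_out_idx):
    """Indices that are in the train split and not in the held-out set."""
    held = set(held_out_idx)
    return [i for i in train_idx if i not in held]


# Usage: fit on the train split, transform everything with those statistics.
X = [[0.0], [2.0], [100.0]]
tr, te = [0, 1], [2]
mu, sigma = fit_standardizer([X[i] for i in tr])
Xz = apply_standardizer(X, mu, sigma)
```

In the usage lines the outlying test row (100.0) does not influence `mu` or `sigma`, which is exactly what removes the leak.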

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5bc44345-d335-4a1a-b937-a1042492fd94

📥 Commits

Reviewing files that changed from the base of the PR and between fe6bb14 and d892bb8.

📒 Files selected for processing (3)
  • bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/evo2_buffer.py
  • bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/probe.py
  • bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/probe_loss_recovered.py

…standardization, robustness

- cmd_auroc/_eval_matrix: skip all-positive test folds (AUROC undefined), not just all-negative
- codon-aa: standardize on the train split only (was leaking test-set mean/std)
- _load: reject unknown --labels explicitly instead of silently dropping them
- --auroc-device now defaults to --device (was hardcoded cuda:1; broke single-GPU)
- probe_loss_recovered: fail fast when no evaluable sequences were built
- help: loss-recovered is a separate script, not a subcommand

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
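
The `--auroc-device` change in the commit above can be sketched with argparse: give the flag a default of None and resolve it to `--device` after parsing. This is a minimal illustration, not the code in probe.py; the `cuda:0` default for `--device` and the function names are assumptions.

```python
import argparse


def build_parser():
    """Build a parser with the two device flags (defaults are illustrative)."""
    parser = argparse.ArgumentParser(prog="probe-sketch")
    parser.add_argument("--device", default="cuda:0", help="primary device")
    parser.add_argument(
        "--auroc-device",
        default=None,
        help="device for AUROC buffers; defaults to --device",
    )
    return parser


def resolve_devices(argv):
    """Parse argv and fill in --auroc-device from --device when unset."""
    args = build_parser().parse_args(argv)
    if args.auroc_device is None:
        args.auroc_device = args.device
    return args
```

Resolving after parsing, rather than hardcoding a second GPU, keeps single-GPU and CPU-only hosts working while still allowing an explicit override.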
polinabinder1 marked this pull request as ready for review June 12, 2026 05:32
polinabinder1 requested review from jstjohn and pstjohn and removed request for jstjohn, pstjohn and trvachov June 12, 2026 05:32