Skip to content

feat(validation): unified leakage_probes module (PR 3.1)#65

Merged
shaypal5 merged 3 commits into
mainfrom
feat/leakage-probes-module
May 5, 2026
Merged

feat(validation): unified leakage_probes module (PR 3.1)#65
shaypal5 merged 3 commits into
mainfrom
feat/leakage-probes-module

Conversation

@shaypal5

@shaypal5 shaypal5 commented May 5, 2026

Copy link
Copy Markdown
Contributor

Summary

PR 3.1 of the v1 dataset release sequence — first of three Phase-3 PRs (release validation hardening). Lays the foundation that PR 3.2 (release_quality.py + reporting.py) and PR 3.3 (validate_release_candidate.py driver) build on.

Subsumes the PR 2.1 relational_leakage.py module and broadens it to the full leakage taxonomy from docs/release/v1_release_design.md / docs/release/v1_acceptance_gates.md: direct, time-window, relational, split, and model-realism. Every public-bundle leakage gate G4–G6 has a probe in this module.

What's new

leadforge/validation/leakage_probes.py (new, ~600 LOC + ~1k LOC tests)

Direct (G4.1-G4.3). probe_banned_columns, probe_banned_tables. Lifted from PR 2.1 and generalised to accept caller-supplied banned sets — non-relational publication channels (Kaggle / HF / instructor companion) can declare their own without depending on the snapshot-safe defaults.

Time-window (G4.4). probe_snapshot_window. Lifted and generalised over (table, ts_col) pairs so future event tables can register their own anchor.

Relational (G4.5). probe_deterministic_reconstruction, deterministic_relational_reconstruction. Lifted unchanged.

Split (G6.1-G6.3, NEW).

  • probe_split_id_overlap — lead/account/contact cross-split overlap;
  • probe_split_near_duplicates — deterministic rounded-vector hashing (no sklearn dep, no O(N²); under-reports rather than over-reports — documented in the docstring);
  • probe_split_label_drift — opt-in.

Model-realism (G5.1-G5.4, NEW). All opt-in with required max_auc=... (PR 3.3 supplies per-tier bands).

  • probe_bonus_model_auc — lifted from PR 2.1 (also opt-in);
  • probe_id_only_baseline (G5.3) — hashes IDs to bound feature width and avoid memorisation; HistGBM only;
  • probe_feature_subset_baseline (G5.1/G5.2 umbrella) — caller declares the suspect column subset; HistGBM only (handles NaN natively, LR would not).

Probe registry & meta-test. PROBE_REGISTRY: Mapping[str, ProbeSpec] is the single source of truth (probe → taxonomy / opt-in flag). Meta-test test_probe_registry_covers_every_module_level_probe enforces that any new module-level probe_* function is registered, so the "I-added-a-probe-but-forgot-to-wire-it" regression fails loudly.

Orchestrators.

  • run_all_probes / run_all_probes_on_dataframes — structural, signature unchanged so validate_bundle keeps working without touching bundle_checks.py semantics. Bonus probe stays opt-in.
  • run_split_probes (new) — split-level over {split_name: DataFrame}. PR 3.2/3.3 will plumb this through the release-quality driver.

Migration

leadforge/validation/relational_leakage.py deleted. Every internal call site updated:

  • leadforge/validation/{bundle_checks,invariants}.py
  • leadforge/render/{manifests,relational_snapshot_safe}.py
  • leadforge/exposure/filters.py (docstring only)
  • scripts/probe_relational_leakage.py

tests/validation/test_relational_leakage.py renamed via git mv to test_leakage_probes.py (preserves blame); existing 33 tests preserved verbatim plus 24 new tests for the new probes + meta-coverage.

RelationalLeakageError retained — it now spans every taxonomy in the module, not just relational join reconstruction; LeakageError alias added for the new umbrella name.

What's NOT in scope

  • Calibrating per-tier bands for the opt-in model-realism probes (PR 3.3).
  • Wiring split / model-realism probes into validate_bundle (PR 3.3 once bands are calibrated).
  • release_quality.py + reporting.py (PR 3.2).
  • scripts/validate_release_candidate.py + resolving TBD-* gates in v1_acceptance_gates.md (PR 3.3).

Test plan

  • pytest — 1067/1067 pass (994 pre-PR + 24 new probe tests + 8 generalisation tests + 5 meta-tests; one renamed; the rest unaffected).
  • ruff check — clean.
  • ruff format --check — clean.
  • mypy leadforge/ — clean (80 source files).
  • scripts/probe_relational_leakage.py release/{intro,intermediate,advanced} --max-accuracy 0.65 — all three exit 0 (path A-E rates = 0.000).
  • scripts/verify_hash_determinism.py — PASS, 67/67 files identical.
  • BUNDLE_SCHEMA_VERSION unchanged at "5" — PR is purely additive on the validator side; on-disk bundle shape is unchanged.
  • Round-trip integration: tests/integration/test_snapshot_safe_bundle.py (existing) still validates clean bundles via the renamed module.

Risks / non-obvious choices

  • Kept RelationalLeakageError as the canonical exception class. Renaming would cascade through every except clause for negligible benefit; the docstring now explicitly documents that it spans every taxonomy.
  • Near-duplicate detection uses rounded-vector hashing, not cosine similarity. Cosine machinery would flake on sparse one-hot data and drift across sklearn versions; rounded hashing is reproducible, deterministic, and orthogonal to model choice. The docstring is explicit that it under-reports rather than over-reports.
  • probe_id_only_baseline hashes IDs to int64 instead of one-hot encoding. One-hot would explode the dimension and guarantee AUC ≈ 1.0 on train via memorisation, defeating the probe's purpose.
  • Model-realism probes use HistGBM, not LR. HistGBM handles NaN natively; the probes are calibrated baselines, not pipeline replicas.

🤖 Generated with Claude Code

Subsume `relational_leakage.py` and broaden it to the full design-doc /
acceptance-gates leakage taxonomy. PR 3.2's `release_quality.py` and PR
3.3's `validate_release_candidate.py` driver consume this module.

Probe families
--------------

* Direct (G4.1-G4.3) — `probe_banned_columns`, `probe_banned_tables`.
  Generalised to accept caller-supplied banned sets so non-relational
  publication channels (Kaggle / HF / instructor companion) can declare
  their own without depending on the snapshot-safe defaults.
* Time-window (G4.4) — `probe_snapshot_window`. Generalised over
  `(table, ts_col)` pairs so future event tables can register their
  own anchor.
* Relational (G4.5) — `probe_deterministic_reconstruction`,
  `deterministic_relational_reconstruction`. Lifted unchanged.
* Split (G6.1-G6.3) — `probe_split_id_overlap` (lead/account/contact
  cross-split overlap), `probe_split_near_duplicates` (deterministic
  rounded-vector hashing — no sklearn dep, no O(N²)),
  `probe_split_label_drift` (opt-in).
* Model-realism (G5.1-G5.4) — `probe_bonus_model_auc` (lifted, opt-in),
  new opt-in `probe_id_only_baseline` (G5.3, hashes IDs to bound
  feature width), `probe_feature_subset_baseline` (G5.1/G5.2 umbrella;
  HistGBM only — handles NaN natively, LR would not).

`PROBE_REGISTRY` is the single source of truth (probe → taxonomy /
opt-in flag); meta-test asserts every module-level `probe_*` is
registered, so a future "I added a probe but forgot to wire it"
regression fails loudly.

Two orchestrators: `run_all_probes` / `run_all_probes_on_dataframes`
(structural; signature unchanged so `validate_bundle` keeps working
without touching `bundle_checks.py` semantics) and new
`run_split_probes` (split-level over `{split_name: DataFrame}`).

Migration
---------

`leadforge/validation/relational_leakage.py` deleted; every internal
call site updated:

* `leadforge/validation/{bundle_checks,invariants}.py`
* `leadforge/render/{manifests,relational_snapshot_safe}.py`
* `leadforge/exposure/filters.py` (docstring only)
* `scripts/probe_relational_leakage.py`

`tests/validation/test_relational_leakage.py` renamed via `git mv`
to `test_leakage_probes.py`; existing 33 tests preserved verbatim
plus 24 new tests for the new probes + meta-coverage.

`RelationalLeakageError` retained (now spans every taxonomy in the
module, not just relational join reconstruction) with `LeakageError`
alias for the new umbrella name.

Acceptance
----------

* `BUNDLE_SCHEMA_VERSION` unchanged at "5" — PR 3.1 is purely additive
  on the validator side; on-disk bundle shape is unchanged.
* 1067/1067 tests pass; ruff + mypy clean.
* `scripts/probe_relational_leakage.py release/{intro,intermediate,advanced}
  --max-accuracy 0.65` exits 0 on every public tier (path E rate = 0.000).
* `scripts/verify_hash_determinism.py` PASS — 67/67 files identical
  across pinned-timestamp builds.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 5, 2026 20:50
@shaypal5 shaypal5 added type: feature New capability layer: validation validation/ invariants and checks labels May 5, 2026
@github-actions

This comment has been minimized.

`LeakageError = RelationalLeakageError` was exported in `__all__` but
referenced nowhere internally — exactly the kind of dead alias the
project guidance flags as backwards-compat noise. Drop it; callers
that want the new umbrella name can import `RelationalLeakageError`
directly (the docstring already documents that it spans every
taxonomy in the module, not just relational join reconstruction).

Also tidy the header comment above
`deterministic_relational_reconstruction` — the PR 2.1 → 3.1 hop made
"Lifted from PR 1.1's ``scripts/...``" historically muddled; the
script-side re-export is the load-bearing fact and was already noted.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions

This comment has been minimized.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR consolidates the prior relational_leakage validation logic into a new unified leadforge.validation.leakage_probes module, expanding coverage to a broader leakage taxonomy (direct, time-window, relational, split, and model-realism) and updating internal call sites/tests accordingly as part of the v1 release validation hardening sequence.

Changes:

  • Introduces leadforge/validation/leakage_probes.py implementing the unified probe taxonomy, probe registry, and bundle/split orchestrators.
  • Migrates existing imports/call sites from leadforge.validation.relational_leakage to the new module and deletes the old module.
  • Expands/renames validation tests to cover new split/model-realism probes and adds registry meta-tests.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
leadforge/validation/leakage_probes.py New unified leakage probe implementations + registry + orchestrators; replaces the deleted module.
leadforge/validation/bundle_checks.py Switches bundle validation wiring to import/run probes from leakage_probes.
leadforge/validation/invariants.py Updates snapshot-safe contract imports to the new module.
leadforge/render/relational_snapshot_safe.py Updates documentation/imports for snapshot-safe contract constants source.
leadforge/render/manifests.py Updates manifest redaction-constant imports to the new module.
leadforge/exposure/filters.py Updates docstring references to contract constants in the new module.
scripts/probe_relational_leakage.py Re-exports deterministic reconstruction from the new canonical module location.
tests/validation/test_leakage_probes.py Renamed/expanded tests to cover split/model-realism probes and registry meta-coverage.
tests/integration/test_snapshot_safe_bundle.py Updates imports to the new module constants.
tests/render/test_bundle_schema_v5_contract.py Updates reference in comments to the new module.
leadforge/validation/relational_leakage.py Deleted (superseded by leakage_probes).
.agent-plan.md Updates phase plan status to reflect PR 3.1 completion.
Comments suppressed due to low confidence (2)

tests/validation/test_leakage_probes.py:802

  • This test’s name/docstring says the probe “fires” / “must catch it”, but the assertions only check that it returns a list (and the comment says it won’t always fire). Either assert the expected finding (make the synthetic data reliably separable under the hashing approach) or rename/reword the test to reflect that it’s only a smoke test for wiring.
    tests/validation/test_leakage_probes.py:946
  • Docstring says “four documented taxonomies”, but the validation set includes five (direct/time_window/relational/split/model_realism). Update the docstring to match the actual taxonomy list.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread leadforge/validation/leakage_probes.py Outdated
Comment thread leadforge/validation/leakage_probes.py
Comment thread leadforge/validation/leakage_probes.py
Comment thread leadforge/validation/leakage_probes.py
Comment thread leadforge/validation/leakage_probes.py
Seven small fixes from the PR review:

leakage_probes.py
* Module docstring referenced a meta-test name that didn't exist
  (``test_probe_registry_covers_all_probes``); rename to the actual
  ``test_probe_registry_covers_every_module_level_probe``.
* `probe_split_id_overlap` and `probe_split_near_duplicates` were
  slicing a `set` before sorting (``list(shared)[:N]``); set iteration
  order is implementation-defined, so the resulting ``sample=...``
  message field would drift across runs.  Sort first, then slice.
* `probe_split_near_duplicates` claimed to skip when no requested
  columns were numeric, but actually coerced everything to NaN and
  then stringified — yielding a single saturating ``"nan|nan|..."``
  signature that collided as a false-positive across splits.  Drop
  rows whose entire signature is NaN before intersecting and update
  the docstring to match.
* `_hash_id_columns` docstring said "32-bit integers" but the dtype
  was int64; clarify as "32-bit hashes (stored as int64)".

test_leakage_probes.py
* `test_probe_id_only_baseline_fires_when_id_encodes_label` claimed
  the probe must catch the leak but only asserted ``isinstance(...,
  list)`` — a "fires" demonstration is structurally impossible from
  hashed lead_ids alone (independent train/test hashes ⇒ HistGBM can't
  generalise).  Rename to
  ``_runs_cleanly_on_partitioned_ids``, document why a positive-fire
  test isn't possible from this synthetic data, and point at the
  feature-subset baseline test for the actual fire path.
* Meta-test docstring said "four documented taxonomies" but the
  validation set has five (direct/time_window/relational/split/
  model_realism); fix the prose.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions

github-actions Bot commented May 5, 2026

Copy link
Copy Markdown

pr-agent-context report:

No unresolved review comments, failing checks, or actionable patch coverage gaps were found on PR #65 in repository https://github.com/leadforge-dev/leadforge. Treat this PR as all clear unless new signals appear.

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: commit pushed
Workflow run: 25402374833 attempt 1
Comment timestamp: 2026-05-05T21:07:49.110388+00:00
PR head commit: 8ad4465200d4ceb18600a94534ef7f5936962e07

@shaypal5 shaypal5 merged commit 8ad9260 into main May 5, 2026
8 checks passed
@shaypal5 shaypal5 deleted the feat/leakage-probes-module branch May 5, 2026 21:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

layer: validation validation/ invariants and checks type: feature New capability

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants