feat(scripts,validation): validate_release_candidate driver + acceptance bands resolved (PR 3.3) #67
Merged
Closes Phase 3 of the v1 dataset release sequence. New CLI driver
``scripts/validate_release_candidate.py`` orchestrates a cross-seed ×
cross-tier release-quality sweep, runs split-level leakage probes against
each tier's canonical seed, and gates the release on YAML-declared
acceptance bands. ``leadforge/validation/difficulty.py`` is extended with
the band-checker stack (BandSpec / TierBands / LeakageProbeBands /
AcceptanceBands / GateFailure + load_bands / check_release_bands).
Bands live in ``docs/release/v1_acceptance_gates_bands.yaml`` (calibrated
to the cross-seed median ± 2× max-min spread on the first authentic N=5
sweep over release/{intro,intermediate,advanced}/, seeds 42–46). The
human-readable contract in ``docs/release/v1_acceptance_gates.md`` no
longer carries any TBD-* placeholder; every gate records a concrete band
plus the median that produced it.
G7.4.4 (cross-tier GBM−LR positivity) is softened to follow the per-tier
gbm_minus_lr_auc bands rather than a hard-fail boolean: the v1 snapshot
is dominated by linear features (engagement aggregates + firmographics)
and HistGBM does not consistently beat a regularised LR (deltas
−0.0045 / −0.0072 / −0.0133 across tiers). The v1→v2 finding is
documented in the gates doc; v2 will inject non-linear interactions in
the simulator so the gate bites.
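To make the difference between the two gating styles concrete, here is a small sketch. The three deltas are the ones quoted above, assumed to be listed in intro / intermediate / advanced order. The band values are illustrative only (one band spanning the -0.06 to +0.05 range mentioned in the PR description), and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

# Deltas quoted in the commit message; the tier order is an assumption.
GBM_MINUS_LR_AUC: Dict[str, float] = {
    "intro": -0.0045,
    "intermediate": -0.0072,
    "advanced": -0.0133,
}

# Illustrative bands only; the real per-tier bands live in the YAML file.
ILLUSTRATIVE_BANDS: Dict[str, Tuple[float, float]] = {
    tier: (-0.06, 0.05) for tier in GBM_MINUS_LR_AUC
}


def hard_sign_gate(deltas: Dict[str, float]) -> bool:
    """Hard-fail style: pass only if GBM beats LR in every tier."""
    return all(delta > 0 for delta in deltas.values())


def soft_band_gate(
    deltas: Dict[str, float], bands: Dict[str, Tuple[float, float]]
) -> List[str]:
    """Band style: return the tiers whose delta falls outside its band."""
    return [
        tier
        for tier, delta in deltas.items()
        if not bands[tier][0] <= delta <= bands[tier][1]
    ]


print("hard gate passes:", hard_sign_gate(GBM_MINUS_LR_AUC))
print("tiers outside band:", soft_band_gate(GBM_MINUS_LR_AUC, ILLUSTRATIVE_BANDS))
```

Under these assumptions the sign check fails on every tier while the band check reports no violations, which is the behaviour the softened gate is meant to allow.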
Outputs: release/validation/validation_report.{json,md} + the seven
pinned figures (lift_curve_{intro,intermediate,advanced},
calibration_intermediate, leakage_delta, cohort_shift, value_capture).
release/_release_quality/ workdir gitignored.
Tests: 48 new across tests/validation/test_difficulty_bands.py (band
parsing, per-tier checks, cross-seed spread, cohort shift, cross-tier
ordering, leakage findings, GateFailure immutability) and
tests/scripts/test_validate_release_candidate.py (CLI helpers, mocked
pipeline, end-to-end --quick run). 1152/1152 tests pass; ruff + mypy
clean; scripts/probe_relational_leakage.py release/{intro,intermediate,advanced}
--max-accuracy 0.65 exits 0 on every tier; scripts/verify_hash_determinism.py
PASS 67/67. BUNDLE_SCHEMA_VERSION unchanged at 5 (purely additive).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
PR 3.3 self-review: the `if TYPE_CHECKING: pass` block in the driver was left over from an earlier iteration that forward-referenced some release_quality dataclasses; the final imports are eager, so the guard is dead code.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pull request overview
Adds a release-candidate validation driver that runs the cross-seed/cross-tier release-quality panel, executes split-level leakage probes, renders pinned validation artifacts, and gates the release using YAML-defined acceptance bands (with the human-readable contract updated accordingly).
Changes:
- Introduces the `scripts/validate_release_candidate.py` CLI driver to regenerate/load tier bundles across seeds, measure release quality, run leakage probes, render reports/figures, and return gate-based exit codes.
- Extends `leadforge.validation.difficulty` with YAML parsing + acceptance-band evaluation (`load_bands`, `check_release_bands`, and supporting dataclasses).
- Adds calibrated v1 acceptance bands YAML + updates the acceptance-gates markdown; commits pinned `release/validation/` report artifacts; adds comprehensive tests.
Reviewed changes
Copilot reviewed 9 out of 17 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| scripts/validate_release_candidate.py | New CLI driver orchestrating rebuild/measure/probe/render/gate flow for release candidates. |
| leadforge/validation/difficulty.py | Adds YAML-driven acceptance-band parsing and gate evaluation over ReleaseQualityReport + LeakageReports. |
| docs/release/v1_acceptance_gates_bands.yaml | New machine-readable source of truth for numeric acceptance bands. |
| docs/release/v1_acceptance_gates.md | Updates human-readable gate contract and resolves former TBD bands with calibrated values/rationale. |
| tests/validation/test_difficulty_bands.py | New unit tests for band parsing and gate evaluation behavior. |
| tests/scripts/test_validate_release_candidate.py | New tests for driver CLI helpers + mocked pipeline + --quick integration run. |
| release/validation/validation_report.{json,md} | Committed pinned baseline validation artifacts for v1. |
| .gitignore | Ignores release/_release_quality/ workdir cache. |
| .agent-plan.md | Marks Phase 3 PR 3.3 as completed and documents what shipped. |
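The band-parsing side of `leadforge/validation/difficulty.py` is described as frozen dataclasses plus a parser with strict shape validation. The sketch below shows one way that pattern can look. Only the `GateFailure(gate, tier, message)` field list comes from the PR; the `lo` / `hi` field names, the `contains` method, and `parse_band` are assumptions made for illustration, and the parser takes an already-loaded mapping rather than reading YAML.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class BandSpec:
    """Closed interval [lo, hi]; the field names are hypothetical."""

    lo: float
    hi: float

    def contains(self, value: float) -> bool:
        return self.lo <= value <= self.hi


@dataclass(frozen=True)
class GateFailure:
    gate: str
    tier: Optional[str]  # None for cross-tier gates
    message: str


def parse_band(raw: Any, *, where: str = "band") -> BandSpec:
    """Strictly validate one band mapping and reject anything unexpected."""
    if not isinstance(raw, Mapping):
        raise ValueError(f"{where}: expected a mapping, got {type(raw).__name__}")
    keys = set(raw)
    if keys != {"lo", "hi"}:
        raise ValueError(f"{where}: expected keys lo/hi, got {sorted(map(str, keys))}")
    for name in ("lo", "hi"):
        value = raw[name]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{where}.{name}: expected a number, got {value!r}")
    if raw["lo"] > raw["hi"]:
        raise ValueError(f"{where}: lo must not exceed hi")
    return BandSpec(float(raw["lo"]), float(raw["hi"]))


band = parse_band({"lo": 0.70, "hi": 0.90}, where="intro.auc")
failure = GateFailure(gate="G7.1.1", tier="intro", message="auc outside band")
```

Rejecting unknown keys up front means a typo in the bands file surfaces as a pre-flight error instead of a silently ignored gate.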
…67

Four substantive findings from the auto-reviewer; all fixed with regression tests.

1. Missing-tier gate id was double-prefixed as ``G7.G7.1`` because the ``_GATE_PREFIX_BY_TIER`` values already carry a ``G7.`` prefix and the missing-tier branch added another. Fix: use the prefix dict directly with a ``f"G7.{tier_name}"`` fallback for unknown tiers. Test pins the gate id and forbids ``G7.G7`` substrings.
2. ``_config_from_args`` claimed it fell back to the smallest seed in the sweep but used ``seeds[0]``, so equivalent invocations like ``--seeds 11 10`` and ``--seeds 10 11`` could produce different cohort/leakage results. Fix: sort + dedup the seed list at config-time and use ``min(seeds)`` for the fallback. Test asserts ascending and descending input produce the same canonical seed and that duplicates collapse.
3. ``split_label_drift`` findings were mapped to gate ``G6.4``, which the acceptance-gates doc reserves for the cohort/time-shift AUC degradation gate. Mapping both to G6.4 grouped unrelated failures under one id. Fix: drop the explicit mapping; the channel falls through to ``leakage:split_label_drift`` (v1 acceptance gates do not number per-split label-rate drift as a distinct gate). YAML comment updated to match. Test asserts no G6.4 collision.
4. ``format_failures`` docstring claimed "groups by gate id, then by tier", but the implementation only sorted by gate; within each gate the input order was preserved (which is itself non-stable because per-tier checks emit in YAML iteration order while cross-tier checks emit in code order). Fix: sort within each gate by ``(tier or "", message)``. Two tests pin the within-gate sort and the cross-tier-first ordering.

53/53 tests in the PR 3.3 suite pass; ruff + mypy clean; ``validate_release_candidate.py --no-rebuild`` still exits 0; the release/validation/ artefacts re-rendered for byte-stable EOF.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
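Fixes 1, 2, and 4 above are small enough to sketch in a few lines. The code below illustrates the described behaviour and is not the repository's implementation. The contents of `_GATE_PREFIX_BY_TIER` (intro → G7.1 and so on) are assumed, and `missing_tier_gate_id`, `canonicalise_seeds`, and `sort_failures` are hypothetical helper names.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple

# Assumed contents; the commit only states that the values already carry "G7.".
_GATE_PREFIX_BY_TIER = {"intro": "G7.1", "intermediate": "G7.2", "advanced": "G7.3"}


def missing_tier_gate_id(tier_name: str) -> str:
    """Fix 1: use the prefix as-is, and only build "G7.<tier>" for unknown tiers."""
    return _GATE_PREFIX_BY_TIER.get(tier_name, f"G7.{tier_name}")


def canonicalise_seeds(seeds: Iterable[int], requested: int) -> Tuple[List[int], int]:
    """Fix 2: sort + dedup, then fall back to the smallest seed if needed."""
    ordered = sorted(set(seeds))
    if not ordered:
        raise ValueError("seed list must not be empty")
    canonical = requested if requested in ordered else min(ordered)
    return ordered, canonical


@dataclass(frozen=True)
class GateFailure:
    gate: str
    tier: Optional[str]  # None marks a cross-tier failure
    message: str


def sort_failures(failures: Iterable[GateFailure]) -> List[GateFailure]:
    """Fix 4: order by gate, then (tier or "", message), so cross-tier sorts first.

    Gate ids are compared as plain strings here, which is a simplification.
    """
    return sorted(failures, key=lambda f: (f.gate, f.tier or "", f.message))
```

With this key, the output no longer depends on the order in which the individual checks emitted their failures.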
pr-agent-context report: No unresolved review comments, failing checks, or actionable patch coverage gaps were found on PR #67 in repository https://github.com/leadforge-dev/leadforge. Treat this PR as all clear unless new signals appear.
Summary
PR 3.3 of the v1 dataset release sequence; closes Phase 3 ("release validation hardening"). Wires together the PR 3.1 (`leakage_probes`) + PR 3.2 (`release_quality` + `reporting`) modules into a single CLI driver, calibrates every `TBD-*` numeric band in `v1_acceptance_gates.md` from the first authentic N=5 cross-seed sweep, and ships the resulting `release/validation/validation_report.{json,md}` + figure set.

What's new
`scripts/validate_release_candidate.py` (new)

The release-candidate driver. Orchestrates:
- `regenerate_tier_for_seeds(spec, seeds, workdir)` × N=5 (default seeds 42–46) per tier.
- `measure_release_quality(...)` → full G7.* / G8.* / G6.4 panel.
- `run_split_probes(...)` against each tier's canonical seed with the calibrated thresholds from the YAML bands (G5.*).
- `render_report(report, out_dir)` writes the pinned JSON / markdown / 7-figure contract.
- `check_release_bands(report, bands, leakage_reports=...)` evaluates every numeric band and returns `list[GateFailure]`.

CLI surface:
- `--release-dir`, `--workdir`, `--out-dir`, `--bands` (paths)
- `--seeds 42 43 44 45 46` (default; cross-seed sweep)
- `--cohort-canonical-seed 42` (default; falls back to smallest seed if outside sweep)
- `--tiers intro intermediate advanced` (default; subset selectable)
- `--quick`: N=2 with 500-lead populations; ~20s end-to-end smoke test
- `--no-rebuild`: reuse existing workdir bundles for fast band-tweak iteration

Exit codes: `0` pass / `1` gate failure / `2` pre-flight error (missing release dir, missing tier under `--no-rebuild`, malformed bands YAML).

Driver vs `leadforge validate` boundary documented in the script docstring: `leadforge validate` checks one bundle's structural+FK+leakage contract (seconds); this script runs the cross-seed × cross-tier release-quality panel (minutes). Complementary, not merged.

`leadforge/validation/difficulty.py` extended

New band-checker stack (purely additive; old `check_difficulty` / `check_difficulty_ordering` untouched):
- `BandSpec` / `TierBands` / `LeakageProbeBands` / `AcceptanceBands` frozen dataclasses for the parsed YAML.
- `GateFailure(gate, tier, message)` frozen dataclass.
- `load_bands(path) -> AcceptanceBands` parser with strict shape validation.
- `check_release_bands(report, bands, *, leakage_reports=...) -> list[GateFailure]` evaluates per-tier numeric bands (G7.1.* / G7.2.* / G7.3.*), cross-seed spread (G8.1), cohort-shift degradation (G6.4), cross-tier ordering (G7.4.1–.3 hard, G7.4.4 soft per below), and leakage-probe findings (G4.5 / G5.1 / G5.2 / G5.3 / G6.1 / G6.3 / G6.4).

`docs/release/v1_acceptance_gates_bands.yaml` (new)

Operational source of truth for numeric bands. Format follows the `recipes/.../difficulty_profiles.yaml` precedent. Tunable between releases without code review.

`docs/release/v1_acceptance_gates.md` updated

Every `TBD-*` placeholder replaced with a concrete band + the median that produced it. Document is now the human-readable contract; the YAML is the machine-readable one.

`release/validation/` artefacts committed

- `validation_report.json` (64 KB): full machine-readable report.
- `validation_report.md` (12 KB): human-readable, every metric carries a `$.tiers.<tier>....` JSON-path citation per G10.6.
- `figures/` (344 KB total): 7 pinned PNGs: lift_curve_{intro,intermediate,advanced}, calibration_intermediate, leakage_delta, cohort_shift, value_capture.

`release/_release_quality/` workdir gitignored

Cross-seed sweep cache. Bundles regenerate idempotently per `regenerate_tier_for_seeds`; the workdir is operator-local.

Calibration baseline (first authentic N=5 sweep)
Cross-tier AP / P@100 / conversion-rate ordering all hold (intro > intermediate > advanced).
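A strict ordering check of this kind can be sketched as below. The metric values are invented for illustration (the baseline numbers are not reproduced here), and `ordering_violations` is a hypothetical helper, not the function used by `check_release_bands`.

```python
from typing import Dict, List

TIER_ORDER = ("intro", "intermediate", "advanced")


def ordering_violations(metric_by_tier: Dict[str, float], metric_name: str) -> List[str]:
    """Return one message per adjacent tier pair that is not strictly decreasing."""
    violations = []
    for easier, harder in zip(TIER_ORDER, TIER_ORDER[1:]):
        if not metric_by_tier[easier] > metric_by_tier[harder]:
            violations.append(
                f"{metric_name}: {easier} ({metric_by_tier[easier]:.4f}) is not "
                f"above {harder} ({metric_by_tier[harder]:.4f})"
            )
    return violations


# Hypothetical average-precision medians, for illustration only.
ap_medians = {"intro": 0.62, "intermediate": 0.48, "advanced": 0.35}
print(ordering_violations(ap_medians, "average_precision"))
```

Comparing only adjacent pairs is enough, because a strictly decreasing chain implies intro > advanced as well.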
Known finding: G7.4.4 (GBM−LR positivity) — v1 → v2
GBM−LR delta is slightly negative in every tier. v1's snapshot is dominated by linear features (engagement aggregates + firmographics), so HistGBM does not consistently beat a regularised LR at this signal level. The driver gates on the per-tier `gbm_minus_lr_auc` bands (G7.1.4 / G7.2.4 / G7.3.4; bands fitted to data, ranging from -0.06 to +0.05) rather than the cross-tier sign boolean. The boolean is reported as informational. v2 will inject non-linear interactions (saturation curves, threshold effects) into the simulator so the gate bites; tracked in the post-v1 roadmap and in §"Cross-tier ordering" of `v1_acceptance_gates.md`.

Test plan
- `pytest`: 1152/1152 pass (48 new across `tests/validation/test_difficulty_bands.py` + `tests/scripts/test_validate_release_candidate.py`).
- `ruff check . && ruff format --check .`: clean.
- `mypy leadforge/ scripts/validate_release_candidate.py`: no issues.
- `python scripts/validate_release_candidate.py`: full N=5 sweep exits 0; report + figures regenerated under `release/validation/`.
- `python scripts/validate_release_candidate.py --quick`: N=2 × 500 leads × 3 tiers smoke run completes in ~20s.
- `python scripts/validate_release_candidate.py --no-rebuild`: reuses workdir, exits 0.
- `python scripts/probe_relational_leakage.py release/{intro,intermediate,advanced} --max-accuracy 0.65`: exits 0 on every tier.
- `python scripts/verify_hash_determinism.py`: PASS, 67/67 files identical.
- No `TBD-*` strings remain in `docs/release/v1_acceptance_gates.md`.
- `BUNDLE_SCHEMA_VERSION` unchanged at 5 (purely additive driver+gating layer).

🤖 Generated with Claude Code