feat(scripts,validation): validate_release_candidate driver + acceptance bands resolved (PR 3.3) by shaypal5 · Pull Request #67 · leadforge-dev/leadforge

shaypal5 · 2026-05-06T07:11:50Z

Summary

PR 3.3 of the v1 dataset release sequence — closes Phase 3 ("release validation hardening"). Wires together the PR 3.1 (leakage_probes) + PR 3.2 (release_quality + reporting) modules into a single CLI driver, calibrates every TBD-* numeric band in v1_acceptance_gates.md from the first authentic N=5 cross-seed sweep, and ships the resulting release/validation/validation_report.{json,md} + figure set.

What's new

`scripts/validate_release_candidate.py` (new)

The release-candidate driver. Orchestrates:

regenerate_tier_for_seeds(spec, seeds, workdir) × N=5 (default seeds 42–46) per tier.
measure_release_quality(...) → full G7.* / G8.* / G6.4 panel.
run_split_probes(...) against each tier's canonical seed with the calibrated thresholds from the YAML bands (G5.*).
render_report(report, out_dir) writes the pinned JSON / markdown / 7-figure contract.
check_release_bands(report, bands, leakage_reports=...) evaluates every numeric band and returns list[GateFailure].

CLI surface:

--release-dir, --workdir, --out-dir, --bands (paths)
--seeds 42 43 44 45 46 (default; cross-seed sweep)
--cohort-canonical-seed 42 (default; falls back to smallest seed if outside sweep)
--tiers intro intermediate advanced (default; subset selectable)
--quick — N=2 with 500-lead populations; ~20s end-to-end smoke test
--no-rebuild — reuse existing workdir bundles for fast band-tweak iteration

Exit codes: 0 pass / 1 gate failure / 2 pre-flight error (missing release dir, missing tier under --no-rebuild, malformed bands YAML).

Driver vs leadforge validate boundary documented in the script docstring: leadforge validate checks one bundle's structural+FK+leakage contract (seconds); this script runs the cross-seed × cross-tier release-quality panel (minutes). Complementary, not merged.

`leadforge/validation/difficulty.py` extended

New band-checker stack (purely additive — old check_difficulty / check_difficulty_ordering untouched):

BandSpec / TierBands / LeakageProbeBands / AcceptanceBands frozen dataclasses for the parsed YAML.
GateFailure(gate, tier, message) frozen dataclass.
load_bands(path) -> AcceptanceBands parser with strict shape validation.
check_release_bands(report, bands, *, leakage_reports=...) -> list[GateFailure] evaluates per-tier numeric bands (G7.1.* / G7.2.* / G7.3.*), cross-seed spread (G8.1), cohort-shift degradation (G6.4), cross-tier ordering (G7.4.1–.3 hard, G7.4.4 soft per below), and leakage-probe findings (G4.5 / G5.1 / G5.2 / G5.3 / G6.1 / G6.3 / G6.4).

`docs/release/v1_acceptance_gates_bands.yaml` (new)

Operational source of truth for numeric bands. Format follows the recipes/.../difficulty_profiles.yaml precedent. Tunable between releases without code review.

`docs/release/v1_acceptance_gates.md` updated

Every TBD-* placeholder replaced with a concrete band + the median that produced it. Document is now the human-readable contract; the YAML is the machine-readable one.

`release/validation/` artefacts committed

validation_report.json (64 KB) — full machine-readable report.
validation_report.md (12 KB) — human-readable, every metric carries a $.tiers.<tier>.... JSON-path citation per G10.6.
figures/ (344 KB total) — 7 pinned PNGs: lift_curve_{intro,intermediate,advanced}, calibration_intermediate, leakage_delta, cohort_shift, value_capture.

`release/_release_quality/` workdir gitignored

Cross-seed sweep cache. Bundles regenerate idempotently per regenerate_tier_for_seeds; the workdir is operator-local.

Calibration baseline (first authentic N=5 sweep)

Tier	Conv. rate	LR AUC	GBM AUC	GBM−LR	LR AP	Brier	P@100
intro	0.4267	0.8788	0.8729	-0.0045	0.7608	0.1301	0.80
intermediate	0.2160	0.8859	0.8755	-0.0072	0.5752	0.1096	0.59
advanced	0.0840	0.8861	0.8726	-0.0133	0.3514	0.0611	0.34

Cross-tier AP / P@100 / conversion-rate ordering all hold (intro > intermediate > advanced).

Known finding: G7.4.4 (GBM−LR positivity) — v1 → v2

GBM−LR delta is slightly negative in every tier. v1's snapshot is dominated by linear features (engagement aggregates + firmographics), so HistGBM does not consistently beat a regularised LR at this signal level. The driver gates on the per-tier gbm_minus_lr_auc bands (G7.1.4 / G7.2.4 / G7.3.4 — bands fitted to data, ranging from -0.06 to +0.05) rather than the cross-tier sign boolean. The boolean is reported as informational. v2 will inject non-linear interactions (saturation curves, threshold effects) into the simulator so the gate bites; tracked in the post-v1 roadmap and in §"Cross-tier ordering" of v1_acceptance_gates.md.

Test plan

🤖 Generated with Claude Code

…nce bands resolved (PR 3.3) Closes Phase 3 of the v1 dataset release sequence. New CLI driver ``scripts/validate_release_candidate.py`` orchestrates a cross-seed × cross-tier release-quality sweep, runs split-level leakage probes against each tier's canonical seed, and gates the release on YAML-declared acceptance bands. ``leadforge/validation/difficulty.py`` is extended with the band-checker stack (BandSpec / TierBands / LeakageProbeBands / AcceptanceBands / GateFailure + load_bands / check_release_bands). Bands live in ``docs/release/v1_acceptance_gates_bands.yaml`` (calibrated to the cross-seed median ± 2× max-min spread on the first authentic N=5 sweep over release/{intro,intermediate,advanced}/, seeds 42–46). The human-readable contract in ``docs/release/v1_acceptance_gates.md`` no longer carries any TBD-* placeholder; every gate records a concrete band plus the median that produced it. G7.4.4 (cross-tier GBM−LR positivity) is softened to follow the per-tier gbm_minus_lr_auc bands rather than a hard-fail boolean: the v1 snapshot is dominated by linear features (engagement aggregates + firmographics) and HistGBM does not consistently beat a regularised LR (deltas −0.0045 / −0.0072 / −0.0133 across tiers). The v1→v2 finding is documented in the gates doc; v2 will inject non-linear interactions in the simulator so the gate bites. Outputs: release/validation/validation_report.{json,md} + the seven pinned figures (lift_curve_{intro,intermediate,advanced}, calibration_intermediate, leakage_delta, cohort_shift, value_capture). release/_release_quality/ workdir gitignored. Tests: 48 new across tests/validation/test_difficulty_bands.py (band parsing, per-tier checks, cross-seed spread, cohort shift, cross-tier ordering, leakage findings, GateFailure immutability) and tests/scripts/test_validate_release_candidate.py (CLI helpers, mocked pipeline, end-to-end --quick run). 1152/1152 tests pass; ruff + mypy clean; scripts/probe_relational_leakage.py release/{intro,intermediate,advanced} --max-accuracy 0.65 exits 0 on every tier; scripts/verify_hash_determinism.py PASS 67/67. BUNDLE_SCHEMA_VERSION unchanged at 5 (purely additive). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

PR 3.3 self-review: the `if TYPE_CHECKING: pass` block in the driver was left over from an earlier iteration that did forward-reference some release_quality dataclasses; the final imports are eager so the guard is dead code. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot

Pull request overview

Adds a release-candidate validation driver that runs the cross-seed/cross-tier release-quality panel, executes split-level leakage probes, renders pinned validation artifacts, and gates the release using YAML-defined acceptance bands (with the human-readable contract updated accordingly).

Changes:

Introduces scripts/validate_release_candidate.py CLI driver to regenerate/load tier bundles across seeds, measure release quality, run leakage probes, render reports/figures, and return gate-based exit codes.
Extends leadforge.validation.difficulty with YAML parsing + acceptance-band evaluation (load_bands, check_release_bands, and supporting dataclasses).
Adds calibrated v1 acceptance bands YAML + updates the acceptance-gates markdown; commits pinned release/validation/ report artifacts; adds comprehensive tests.

Reviewed changes

Copilot reviewed 9 out of 17 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
`scripts/validate_release_candidate.py`	New CLI driver orchestrating rebuild/measure/probe/render/gate flow for release candidates.
`leadforge/validation/difficulty.py`	Adds YAML-driven acceptance-band parsing and gate evaluation over `ReleaseQualityReport` + `LeakageReport`s.
`docs/release/v1_acceptance_gates_bands.yaml`	New machine-readable source of truth for numeric acceptance bands.
`docs/release/v1_acceptance_gates.md`	Updates human-readable gate contract and resolves former TBD bands with calibrated values/rationale.
`tests/validation/test_difficulty_bands.py`	New unit tests for band parsing and gate evaluation behavior.
`tests/scripts/test_validate_release_candidate.py`	New tests for driver CLI helpers + mocked pipeline + `--quick` integration run.
`release/validation/validation_report.{json,md}`	Committed pinned baseline validation artifacts for v1.
`.gitignore`	Ignores `release/_release_quality/` workdir cache.
`.agent-plan.md`	Marks Phase 3 PR 3.3 as completed and documents what shipped.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…67 Four substantive findings from the auto-reviewer; all fixed with regression tests. 1. Missing-tier gate id was double-prefixed as ``G7.G7.1`` because the ``_GATE_PREFIX_BY_TIER`` values already carry a ``G7.`` prefix and the missing-tier branch added another. Fix: use the prefix dict directly with a ``f"G7.{tier_name}"`` fallback for unknown tiers. Test pins the gate id and forbids ``G7.G7`` substrings. 2. ``_config_from_args`` claimed it fell back to the smallest seed in the sweep but used ``seeds[0]``, so equivalent invocations like ``--seeds 11 10`` and ``--seeds 10 11`` could produce different cohort/leakage results. Fix: sort + dedup the seed list at config-time and use ``min(seeds)`` for the fallback. Test asserts ascending and descending input produce the same canonical seed and that duplicates collapse. 3. ``split_label_drift`` findings were mapped to gate ``G6.4``, which the acceptance-gates doc reserves for the cohort/time-shift AUC degradation gate. Mapping both to G6.4 grouped unrelated failures under one id. Fix: drop the explicit mapping; the channel falls through to ``leakage:split_label_drift`` (v1 acceptance gates do not number per-split label-rate drift as a distinct gate). YAML comment updated to match. Test asserts no G6.4 collision. 4. ``format_failures`` docstring claimed "groups by gate id, then by tier", but the implementation only sorted by gate; within each gate the input order was preserved (which is itself non-stable because per-tier checks emit in YAML iteration order while cross-tier checks emit in code order). Fix: sort within each gate by ``(tier or "", message)``. Two tests pin the within-gate sort and the cross-tier-first ordering. 53/53 tests in the PR 3.3 suite pass; ruff + mypy clean; ``validate_release_candidate.py --no-rebuild`` still exits 0; the release/validation/ artefacts re-rendered for byte-stable EOF. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

github-actions · 2026-05-06T07:40:33Z

pr-agent-context report:

No unresolved review comments, failing checks, or actionable patch coverage gaps were found on PR #67 in repository https://github.com/leadforge-dev/leadforge. Treat this PR as all clear unless new signals appear.

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: commit pushed
Workflow run: 25422617464 attempt 1
Comment timestamp: 2026-05-06T07:39:43.351154+00:00
PR head commit: 1b134c3647a1615cb84490d6fbf6cdbb61716f0e

Copilot AI review requested due to automatic review settings May 6, 2026 07:11

shaypal5 added this to the dataset: leadforge-lead-scoring-v1 milestone May 6, 2026

shaypal5 added type: feature New capability layer: validation validation/ invariants and checks layer: cli cli/ command-line interface labels May 6, 2026

Copilot started reviewing on behalf of shaypal5 May 6, 2026 07:12 View session

This comment has been minimized.

Sign in to view

Copilot AI reviewed May 6, 2026

View reviewed changes

Comment thread leadforge/validation/difficulty.py

Comment thread scripts/validate_release_candidate.py

Comment thread leadforge/validation/difficulty.py

Comment thread scripts/validate_release_candidate.py

shaypal5 merged commit 38bc373 into main May 6, 2026
8 checks passed

shaypal5 deleted the feat/validate-release-candidate branch May 6, 2026 07:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(scripts,validation): validate_release_candidate driver + acceptance bands resolved (PR 3.3)#67

feat(scripts,validation): validate_release_candidate driver + acceptance bands resolved (PR 3.3)#67
shaypal5 merged 3 commits into
mainfrom
feat/validate-release-candidate

shaypal5 commented May 6, 2026

Uh oh!

This comment has been minimized.

This comment has been minimized.

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented May 6, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

shaypal5 commented May 6, 2026

Summary

What's new

scripts/validate_release_candidate.py (new)

leadforge/validation/difficulty.py extended

docs/release/v1_acceptance_gates_bands.yaml (new)

docs/release/v1_acceptance_gates.md updated

release/validation/ artefacts committed

release/_release_quality/ workdir gitignored

Calibration baseline (first authentic N=5 sweep)

Known finding: G7.4.4 (GBM−LR positivity) — v1 → v2

Test plan

Uh oh!

This comment has been minimized.

This comment has been minimized.

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented May 6, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

`scripts/validate_release_candidate.py` (new)

`leadforge/validation/difficulty.py` extended

`docs/release/v1_acceptance_gates_bands.yaml` (new)

`docs/release/v1_acceptance_gates.md` updated

`release/validation/` artefacts committed

`release/_release_quality/` workdir gitignored