Skip to content

feat(scripts,validation): validate_release_candidate driver + acceptance bands resolved (PR 3.3)#67

Merged
shaypal5 merged 3 commits into
mainfrom
feat/validate-release-candidate
May 6, 2026
Merged

feat(scripts,validation): validate_release_candidate driver + acceptance bands resolved (PR 3.3)#67
shaypal5 merged 3 commits into
mainfrom
feat/validate-release-candidate

Conversation

@shaypal5

@shaypal5 shaypal5 commented May 6, 2026

Copy link
Copy Markdown
Contributor

Summary

PR 3.3 of the v1 dataset release sequence — closes Phase 3 ("release validation hardening"). Wires together the PR 3.1 (leakage_probes) + PR 3.2 (release_quality + reporting) modules into a single CLI driver, calibrates every TBD-* numeric band in v1_acceptance_gates.md from the first authentic N=5 cross-seed sweep, and ships the resulting release/validation/validation_report.{json,md} + figure set.

What's new

scripts/validate_release_candidate.py (new)

The release-candidate driver. Orchestrates:

  1. regenerate_tier_for_seeds(spec, seeds, workdir) × N=5 (default seeds 42–46) per tier.
  2. measure_release_quality(...) → full G7.* / G8.* / G6.4 panel.
  3. run_split_probes(...) against each tier's canonical seed with the calibrated thresholds from the YAML bands (G5.*).
  4. render_report(report, out_dir) writes the pinned JSON / markdown / 7-figure contract.
  5. check_release_bands(report, bands, leakage_reports=...) evaluates every numeric band and returns list[GateFailure].

CLI surface:

  • --release-dir, --workdir, --out-dir, --bands (paths)
  • --seeds 42 43 44 45 46 (default; cross-seed sweep)
  • --cohort-canonical-seed 42 (default; falls back to smallest seed if outside sweep)
  • --tiers intro intermediate advanced (default; subset selectable)
  • --quick — N=2 with 500-lead populations; ~20s end-to-end smoke test
  • --no-rebuild — reuse existing workdir bundles for fast band-tweak iteration

Exit codes: 0 pass / 1 gate failure / 2 pre-flight error (missing release dir, missing tier under --no-rebuild, malformed bands YAML).

Driver vs leadforge validate boundary documented in the script docstring: leadforge validate checks one bundle's structural+FK+leakage contract (seconds); this script runs the cross-seed × cross-tier release-quality panel (minutes). Complementary, not merged.

leadforge/validation/difficulty.py extended

New band-checker stack (purely additive — old check_difficulty / check_difficulty_ordering untouched):

  • BandSpec / TierBands / LeakageProbeBands / AcceptanceBands frozen dataclasses for the parsed YAML.
  • GateFailure(gate, tier, message) frozen dataclass.
  • load_bands(path) -> AcceptanceBands parser with strict shape validation.
  • check_release_bands(report, bands, *, leakage_reports=...) -> list[GateFailure] evaluates per-tier numeric bands (G7.1.* / G7.2.* / G7.3.*), cross-seed spread (G8.1), cohort-shift degradation (G6.4), cross-tier ordering (G7.4.1–.3 hard, G7.4.4 soft per below), and leakage-probe findings (G4.5 / G5.1 / G5.2 / G5.3 / G6.1 / G6.3 / G6.4).

docs/release/v1_acceptance_gates_bands.yaml (new)

Operational source of truth for numeric bands. Format follows the recipes/.../difficulty_profiles.yaml precedent. Tunable between releases without code review.

docs/release/v1_acceptance_gates.md updated

Every TBD-* placeholder replaced with a concrete band + the median that produced it. Document is now the human-readable contract; the YAML is the machine-readable one.

release/validation/ artefacts committed

  • validation_report.json (64 KB) — full machine-readable report.
  • validation_report.md (12 KB) — human-readable, every metric carries a $.tiers.<tier>.... JSON-path citation per G10.6.
  • figures/ (344 KB total) — 7 pinned PNGs: lift_curve_{intro,intermediate,advanced}, calibration_intermediate, leakage_delta, cohort_shift, value_capture.

release/_release_quality/ workdir gitignored

Cross-seed sweep cache. Bundles regenerate idempotently per regenerate_tier_for_seeds; the workdir is operator-local.

Calibration baseline (first authentic N=5 sweep)

Tier Conv. rate LR AUC GBM AUC GBM−LR LR AP Brier P@100
intro 0.4267 0.8788 0.8729 -0.0045 0.7608 0.1301 0.80
intermediate 0.2160 0.8859 0.8755 -0.0072 0.5752 0.1096 0.59
advanced 0.0840 0.8861 0.8726 -0.0133 0.3514 0.0611 0.34

Cross-tier AP / P@100 / conversion-rate ordering all hold (intro > intermediate > advanced).

Known finding: G7.4.4 (GBM−LR positivity) — v1 → v2

GBM−LR delta is slightly negative in every tier. v1's snapshot is dominated by linear features (engagement aggregates + firmographics), so HistGBM does not consistently beat a regularised LR at this signal level. The driver gates on the per-tier gbm_minus_lr_auc bands (G7.1.4 / G7.2.4 / G7.3.4 — bands fitted to data, ranging from -0.06 to +0.05) rather than the cross-tier sign boolean. The boolean is reported as informational. v2 will inject non-linear interactions (saturation curves, threshold effects) into the simulator so the gate bites; tracked in the post-v1 roadmap and in §"Cross-tier ordering" of v1_acceptance_gates.md.

Test plan

  • pytest — 1152/1152 pass (48 new across tests/validation/test_difficulty_bands.py + tests/scripts/test_validate_release_candidate.py).
  • ruff check . && ruff format --check . — clean.
  • mypy leadforge/ scripts/validate_release_candidate.py — no issues.
  • python scripts/validate_release_candidate.py — full N=5 sweep exits 0; report + figures regenerated under release/validation/.
  • python scripts/validate_release_candidate.py --quick — N=2 × 500 leads × 3 tiers smoke run completes in ~20s.
  • python scripts/validate_release_candidate.py --no-rebuild — reuses workdir, exits 0.
  • python scripts/probe_relational_leakage.py release/{intro,intermediate,advanced} --max-accuracy 0.65 — exits 0 on every tier.
  • python scripts/verify_hash_determinism.py — PASS, 67/67 files identical.
  • No TBD-* strings remain in docs/release/v1_acceptance_gates.md.
  • BUNDLE_SCHEMA_VERSION unchanged at 5 (purely additive driver+gating layer).

🤖 Generated with Claude Code

…nce bands resolved (PR 3.3)

Closes Phase 3 of the v1 dataset release sequence. New CLI driver
``scripts/validate_release_candidate.py`` orchestrates a cross-seed ×
cross-tier release-quality sweep, runs split-level leakage probes against
each tier's canonical seed, and gates the release on YAML-declared
acceptance bands. ``leadforge/validation/difficulty.py`` is extended with
the band-checker stack (BandSpec / TierBands / LeakageProbeBands /
AcceptanceBands / GateFailure + load_bands / check_release_bands).

Bands live in ``docs/release/v1_acceptance_gates_bands.yaml`` (calibrated
to the cross-seed median ± 2× max-min spread on the first authentic N=5
sweep over release/{intro,intermediate,advanced}/, seeds 42–46). The
human-readable contract in ``docs/release/v1_acceptance_gates.md`` no
longer carries any TBD-* placeholder; every gate records a concrete band
plus the median that produced it.

G7.4.4 (cross-tier GBM−LR positivity) is softened to follow the per-tier
gbm_minus_lr_auc bands rather than a hard-fail boolean: the v1 snapshot
is dominated by linear features (engagement aggregates + firmographics)
and HistGBM does not consistently beat a regularised LR (deltas
−0.0045 / −0.0072 / −0.0133 across tiers). The v1→v2 finding is
documented in the gates doc; v2 will inject non-linear interactions in
the simulator so the gate bites.

Outputs: release/validation/validation_report.{json,md} + the seven
pinned figures (lift_curve_{intro,intermediate,advanced},
calibration_intermediate, leakage_delta, cohort_shift, value_capture).
release/_release_quality/ workdir gitignored.

Tests: 48 new across tests/validation/test_difficulty_bands.py (band
parsing, per-tier checks, cross-seed spread, cohort shift, cross-tier
ordering, leakage findings, GateFailure immutability) and
tests/scripts/test_validate_release_candidate.py (CLI helpers, mocked
pipeline, end-to-end --quick run). 1152/1152 tests pass; ruff + mypy
clean; scripts/probe_relational_leakage.py release/{intro,intermediate,advanced}
--max-accuracy 0.65 exits 0 on every tier; scripts/verify_hash_determinism.py
PASS 67/67. BUNDLE_SCHEMA_VERSION unchanged at 5 (purely additive).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 6, 2026 07:11
@shaypal5 shaypal5 added type: feature New capability layer: validation validation/ invariants and checks layer: cli cli/ command-line interface labels May 6, 2026
@github-actions

This comment has been minimized.

PR 3.3 self-review: the `if TYPE_CHECKING: pass` block in the driver was
left over from an earlier iteration that did forward-reference some
release_quality dataclasses; the final imports are eager so the guard
is dead code.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions

This comment has been minimized.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a release-candidate validation driver that runs the cross-seed/cross-tier release-quality panel, executes split-level leakage probes, renders pinned validation artifacts, and gates the release using YAML-defined acceptance bands (with the human-readable contract updated accordingly).

Changes:

  • Introduces scripts/validate_release_candidate.py CLI driver to regenerate/load tier bundles across seeds, measure release quality, run leakage probes, render reports/figures, and return gate-based exit codes.
  • Extends leadforge.validation.difficulty with YAML parsing + acceptance-band evaluation (load_bands, check_release_bands, and supporting dataclasses).
  • Adds calibrated v1 acceptance bands YAML + updates the acceptance-gates markdown; commits pinned release/validation/ report artifacts; adds comprehensive tests.

Reviewed changes

Copilot reviewed 9 out of 17 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
scripts/validate_release_candidate.py New CLI driver orchestrating rebuild/measure/probe/render/gate flow for release candidates.
leadforge/validation/difficulty.py Adds YAML-driven acceptance-band parsing and gate evaluation over ReleaseQualityReport + LeakageReports.
docs/release/v1_acceptance_gates_bands.yaml New machine-readable source of truth for numeric acceptance bands.
docs/release/v1_acceptance_gates.md Updates human-readable gate contract and resolves former TBD bands with calibrated values/rationale.
tests/validation/test_difficulty_bands.py New unit tests for band parsing and gate evaluation behavior.
tests/scripts/test_validate_release_candidate.py New tests for driver CLI helpers + mocked pipeline + --quick integration run.
release/validation/validation_report.{json,md} Committed pinned baseline validation artifacts for v1.
.gitignore Ignores release/_release_quality/ workdir cache.
.agent-plan.md Marks Phase 3 PR 3.3 as completed and documents what shipped.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread leadforge/validation/difficulty.py
Comment thread scripts/validate_release_candidate.py
Comment thread leadforge/validation/difficulty.py
Comment thread scripts/validate_release_candidate.py
…67

Four substantive findings from the auto-reviewer; all fixed with
regression tests.

1. Missing-tier gate id was double-prefixed as ``G7.G7.1`` because the
   ``_GATE_PREFIX_BY_TIER`` values already carry a ``G7.`` prefix and
   the missing-tier branch added another.  Fix: use the prefix dict
   directly with a ``f"G7.{tier_name}"`` fallback for unknown tiers.
   Test pins the gate id and forbids ``G7.G7`` substrings.

2. ``_config_from_args`` claimed it fell back to the smallest seed in
   the sweep but used ``seeds[0]``, so equivalent invocations like
   ``--seeds 11 10`` and ``--seeds 10 11`` could produce different
   cohort/leakage results.  Fix: sort + dedup the seed list at
   config-time and use ``min(seeds)`` for the fallback.  Test asserts
   ascending and descending input produce the same canonical seed and
   that duplicates collapse.

3. ``split_label_drift`` findings were mapped to gate ``G6.4``, which
   the acceptance-gates doc reserves for the cohort/time-shift AUC
   degradation gate.  Mapping both to G6.4 grouped unrelated failures
   under one id.  Fix: drop the explicit mapping; the channel falls
   through to ``leakage:split_label_drift`` (v1 acceptance gates do
   not number per-split label-rate drift as a distinct gate).  YAML
   comment updated to match.  Test asserts no G6.4 collision.

4. ``format_failures`` docstring claimed "groups by gate id, then by
   tier", but the implementation only sorted by gate; within each
   gate the input order was preserved (which is itself non-stable
   because per-tier checks emit in YAML iteration order while
   cross-tier checks emit in code order).  Fix: sort within each
   gate by ``(tier or "", message)``.  Two tests pin the within-gate
   sort and the cross-tier-first ordering.

53/53 tests in the PR 3.3 suite pass; ruff + mypy clean;
``validate_release_candidate.py --no-rebuild`` still exits 0; the
release/validation/ artefacts re-rendered for byte-stable EOF.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions

github-actions Bot commented May 6, 2026

Copy link
Copy Markdown

pr-agent-context report:

No unresolved review comments, failing checks, or actionable patch coverage gaps were found on PR #67 in repository https://github.com/leadforge-dev/leadforge. Treat this PR as all clear unless new signals appear.

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: commit pushed
Workflow run: 25422617464 attempt 1
Comment timestamp: 2026-05-06T07:39:43.351154+00:00
PR head commit: 1b134c3647a1615cb84490d6fbf6cdbb61716f0e

@shaypal5 shaypal5 merged commit 38bc373 into main May 6, 2026
8 checks passed
@shaypal5 shaypal5 deleted the feat/validate-release-candidate branch May 6, 2026 07:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

layer: cli cli/ command-line interface layer: validation validation/ invariants and checks type: feature New capability

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants