feat(validation): release_quality + reporting modules (PR 3.2) by shaypal5 · Pull Request #66 · leadforge-dev/leadforge

shaypal5 · 2026-05-05T21:49:29Z

Summary

PR 3.2 of the v1 dataset release sequence (docs/release/v1_release_roadmap.md Phase 3).
Built on PR 3.1's leakage taxonomy. Adds the release-grade metric panel + report
renderer that PR 3.3's validate_release_candidate.py driver will consume.

Two new modules + 28 new tests; purely additive on the validator/reporting side
(no on-disk bundle shape change, BUNDLE_SCHEMA_VERSION stays at 5).

What's in `leadforge/validation/release_quality.py`

JSON-primitive dataclasses:

TierMetrics — full per-(tier, seed) panel: ROC-AUC, PR-AUC, log loss, Brier,
reliability bins + max-bin error, P@{50,100}, R@K, lift@{1,5,10}%, top-decile
rate, expected-ACV capture@K, LR-vs-HistGBM AUC delta, and baseline AUCs for
source_only / engagement_only / stage_only / post_snapshot_aggregates /
id_only (G5.1–G5.3 / G7.* of v1_acceptance_gates.md).
CrossSeedTierMetrics — per-tier seed sweep with median + (max-min) spread
on every headline field (G8.1).
CohortShiftMetrics — random-vs-chronological-cohort-split AUC degradation
(G6.4) using HistGBM, NaN-tolerant when lead_created_at is missing.
CrossTierOrdering — descending rankings + booleans for G7.4.* checks (AP,
P@100, GBM-vs-LR, conversion-rate orderings; "GBM beats LR in every tier"
flag).
ReleaseQualityReport — top-level container. Stable field order; new
metrics go at the bottom of TierMetrics so PR 3.3's gates don't drift.

Public functions:

measure_tier_from_bundle(bundle_dir, *, seed, tier_name) — single bundle
→ full TierMetrics. Reads the primary task's train.parquet /
test.parquet (the valid split is intentionally unused so model
selection stays separate from reporting). Surfaces a clear ValueError
if the train split has < 2 classes.
measure_cohort_shift_from_bundle(bundle_dir, *, seed, tier_name) — splits
the train+test pool chronologically by lead_created_at at
COHORT_TRAIN_FRAC = 0.85, fits HistGBM on the early half, scores the
late half.
regenerate_tier_for_seeds(spec, seeds, workdir) — idempotent helper that
calls Generator.from_recipe(...).generate().save(...) once per seed under
<workdir>/<tier>__seed{seed}/. PR 3.3's driver and the round-trip test
both consume this.
measure_release_quality(tier_bundles, *, cohort_canonical_seed, ...) —
top-level orchestrator returning the full ReleaseQualityReport.
report_to_dict / report_to_json — deterministic serialisation with
NaN/Inf coerced to JSON null (the cohort-shift NaN-on-missing-timestamp
case is the canonical example).

Reuse:

The canonical preprocessor (median-impute + standard-scale numeric;
most-frequent-impute + one-hot-encode categorical) mirrors
leadforge.pipelines.ml.build_preprocessor, but is rebuilt locally because
the bundle's task-split column set differs from the v6/v7 flat-CSV column set.
_partition_columns excludes ID / timestamp anchors deterministically rather
than relying on the v6/v7 hardcoded CAT_FEATURES / NUM_FEATURES lists.
_id_only_auc hashes IDs the same way leakage_probes._hash_id_columns does
so the leakage probe's id_only baseline and the release-quality id_only
baseline produce comparable numbers.

What's in `leadforge/validation/reporting.py`

render_report(report, output_dir) writes the pinned output contract from
docs/release/v1_release_design.md §"Output contract":

release/validation/
  validation_report.json
  validation_report.md
  figures/
    lift_curve_intro.png
    lift_curve_intermediate.png
    lift_curve_advanced.png
    calibration_intermediate.png
    leakage_delta.png
    cohort_shift.png
    value_capture.png

Filenames are exact constants in the module — renaming any of them is a
contract change.

JSON dump uses dataclasses.asdict + json.dumps(sort_keys=True, indent=2),
no custom encoders. Two consecutive renders of the same report produce
byte-identical JSON (asserted by test).
Markdown report carries a $.tiers.<tier>.medians.<field> JSON-path
citation on every metric cell so G10.6 ("every claim has a backing
reference") is satisfied at render time. NaN values render as _n/a_.
Matplotlib forced to Agg before pyplot imports so the renderer is
CI-safe. matplotlib>=3.7 added to the [scripts] and [dev] extras
with a corresponding mypy override.

Tests (28 new)

tests/validation/test_release_quality.py — metric primitives with
hand-built inputs (perfect ranker, known miscalibration on a 0.9-only
predictor, zero-base-rate edge case), JSON serialisation including NaN
coercion, cross-tier ordering, and bundle-level measurement against
synthetic mini-bundles (degenerate train surfaces a clear ValueError).
tests/validation/test_reporting.py — every contract file is written
non-empty, JSON well-formed, markdown cites JSON paths for every tier,
partial-release rendering skips missing tiers, byte-identical
determinism across two consecutive renders.
tests/integration/test_release_quality_round_trip.py — full
Generator.from_recipe(...).generate(_SMALL).save(...) →
measure_release_quality → render_report flow at N=2 seeds (the full
N=5 release-time sweep is PR 3.3's driver).

Acceptance

1095/1095 tests pass (was 1067 — 28 new).
ruff check + ruff format --check clean.
mypy leadforge/ clean.
BUNDLE_SCHEMA_VERSION unchanged (purely additive on the
validator/reporting side).
scripts/verify_hash_determinism.py PASS, 67/67 files identical.
scripts/probe_relational_leakage.py release/{intro,intermediate,advanced} --max-accuracy 0.65
exits 0 on every public tier.
validate_bundle() integration unchanged externally — public bundles
still validate clean.
Figures render via Agg backend without requiring a display server.

Not in scope (PR 3.3)

scripts/validate_release_candidate.py driver (calls these libraries).
Resolving TBD-* bands in v1_acceptance_gates.md.
Per-tier max_auc for the opt-in PR 3.1 leakage probes.
Updating leadforge/validation/difficulty.py with new band checks
(after the bands exist).

Test plan

Run pytest -q → 1095 passing locally.
Run ruff check leadforge tests && ruff format --check leadforge tests.
Run mypy leadforge/.
Run scripts/verify_hash_determinism.py and confirm 67/67 PASS.
Run scripts/probe_relational_leakage.py against each public tier.
Manually inspect a generated validation_report.{json,md} once PR 3.3's driver lands.

🤖 Generated with Claude Code

Adds the release-grade metric panel and renderer that v1's validate_release_candidate driver (PR 3.3) will consume. leadforge/validation/release_quality.py: - TierMetrics / CrossSeedTierMetrics / CohortShiftMetrics / CrossTierOrdering / ReleaseQualityReport dataclasses, JSON-primitive end-to-end so report_to_dict + json.dumps suffices. - measure_tier_from_bundle(): full G7.* panel — LR + HistGBM AUC, AP, log loss, Brier, calibration bins, P@{50,100}, R@K, lift@{1,5,10}%, top-decile rate, expected-ACV capture, GBM-vs-LR delta, plus source/engagement/stage/post-snapshot/ID-only baseline AUCs (G5.*). - measure_cohort_shift_from_bundle(): random-vs-chronological-cohort split AUC degradation (G6.4) using HistGBM (NaN-tolerant). - regenerate_tier_for_seeds() orchestrates cross-seed rebuilds via Generator.from_recipe; idempotent across re-runs. - measure_release_quality() aggregates per-(tier, seed) into the full ReleaseQualityReport with G8.1 cross-seed spreads and G7.4.* cross-tier ordering booleans. leadforge/validation/reporting.py: - render_report(report, output_dir) writes the pinned output contract: validation_report.json, validation_report.md, and figures/ lift_curve_{intro,intermediate,advanced}.png, calibration_intermediate.png, leakage_delta.png, cohort_shift.png, value_capture.png — under the Agg backend (CI-safe). - Markdown carries a $.tiers.<tier>.medians.<field> JSON-path citation on every metric cell per G10.6. - NaN floats serialise as JSON null and as _n/a_ in markdown. pyproject.toml: - matplotlib>=3.7 added to [scripts] and [dev] extras + mypy override. Tests (28 new): - tests/validation/test_release_quality.py: metric primitives with hand-built inputs (perfect ranker, known miscalibration, zero base rate), JSON serialisation including NaN coercion, cross-tier ordering, and bundle-level measurement against synthetic mini- bundles (degenerate train surfaces a clear ValueError). - tests/validation/test_reporting.py: every contract file is written non-empty, JSON well-formed, markdown cites JSON paths for every tier, partial-release rendering skips missing tiers, byte-identical determinism across two consecutive renders. - tests/integration/test_release_quality_round_trip.py: full Generator.from_recipe(_SMALL).save → measure_release_quality → render_report flow at N=2 seeds (the full N=5 sweep is PR 3.3's driver). Acceptance: - 1095/1095 tests pass (was 1067). - ruff check + format clean; mypy clean. - BUNDLE_SCHEMA_VERSION unchanged (purely additive). - scripts/verify_hash_determinism.py PASS, 67/67 files identical. - scripts/probe_relational_leakage.py release/{intro,intermediate, advanced} --max-accuracy 0.65 still exits 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Self-review pass on PR 3.2: - ``measure_cohort_shift_from_bundle`` was overriding the caller's ``seed`` with a hardcoded ``COHORT_SEED`` for the chronological-split HistGBM fit only, while the random-split fit used the caller's seed. That made the AUC pair non-comparable across reseeded calls and was surprising — drop the constant and use the caller's seed for both fits. - The four NaN-return branches duplicated the same ``CohortShiftMetrics`` literal with the ``label = tier_name or manifest.difficulty`` expansion inline. Factor into a local ``_no_cohort()`` closure; reduces the function from ~85 to ~65 lines without changing behaviour. Tests + ruff + mypy still clean (28/28 PR-3.2 tests pass). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot

Pull request overview

Adds the release-grade validation/reporting layer for the leadforge-lead-scoring-v1 dataset release sequence (Phase 3 / PR 3.2), producing a structured ReleaseQualityReport plus a renderer that writes the pinned JSON/markdown/figures contract for downstream release-candidate gating (PR 3.3).

Changes:

Introduces leadforge.validation.release_quality to measure per-tier, cross-seed, and cohort-shift metrics and serialize them deterministically.
Introduces leadforge.validation.reporting to render the report to validation_report.json, validation_report.md, and the pinned figure set under a headless matplotlib backend.
Adds new unit + integration tests and updates dev/scripts extras to include matplotlib (plus mypy override).

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
`leadforge/validation/release_quality.py`	New metric dataclasses, bundle measurement functions, cross-seed orchestration, and JSON serialization helpers.
`leadforge/validation/reporting.py`	New renderer that writes the release validation JSON/markdown and figure artefacts.
`tests/validation/test_release_quality.py`	Unit tests for metric primitives, serialization, ordering, and synthetic bundle measurement.
`tests/validation/test_reporting.py`	Tests for renderer output contract, citations, partial rendering, and determinism of text artefacts.
`tests/integration/test_release_quality_round_trip.py`	End-to-end integration tests: generate bundle → measure → render, plus seed-regeneration idempotency.
`pyproject.toml`	Adds `matplotlib>=3.7` to `[dev]` and `[scripts]` extras and ignores missing imports for matplotlib in mypy.
`.agent-plan.md`	Marks PR 3.2 as completed and documents the new modules/tests/deps at a high level.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Brutal self-review caught seven real issues; this commit addresses them all. C1. Lift curve was fabricated. ``_write_lift_curve`` plotted three measured lift@pct values (1/5/10%) connected by straight lines, then jumped to (100%, 1.0). Saturated ``min(1.0, lift × pct/100)`` lied for high-lift models. Fix: add ``cumulative_gains`` to ``TierMetrics``, sampled at CUMULATIVE_GAINS_PCTS = (0, 10, …, 100); the renderer plots the actual curve. New ``_cumulative_gains_curve`` helper + 2 unit tests (perfect-ranker, no-positives-NaN). C2. Orchestrator passed the bundle's *generation* seed as the model's ``random_state``. ``measure_tier_from_bundle.seed`` was double-dipped: stored on the result for traceability AND used as ``random_state`` for the LR / HistGBM fits. Cross-seed sweeps then confounded data variance with model-RNG variance. Fix: add ``model_random_state: int = DEFAULT_MODEL_RANDOM_STATE`` parameter; ``measure_release_quality`` passes a constant. New unit test asserts identical AUCs across two distinct ``seed`` values when ``model_random_state`` is held constant. C3. ``len(set(np.ndarray))`` replaced with ``np.unique(...).size`` at three call sites — idiomatic, type-safe under nullable dtypes. C4. Removed duplicate ``average_precision`` field on ``TierMetrics`` (it was an exact alias of ``lr_average_precision``). D1. ``CrossTierOrdering`` booleans were defaulting to ``True`` when tiers were missing — silently green-lit partial releases at the PR 3.3 gating layer. Fix: type as ``bool | None``; missing pair → ``None``. Test updated to assert ``None`` on partial / empty releases. D2. ``_HEADLINE_FIELDS`` was hand-maintained with no drift guard. Fix: add ``TestHeadlineFieldsRegistry`` meta-test (mirrors the ``PROBE_REGISTRY`` pattern in ``leakage_probes.py``). Asserts every entry is a scalar ``float`` field on ``TierMetrics`` and that the cross-seed aggregator emits a median + spread for each. D3. Cohort-split tie-breaking on duplicate timestamps was non- deterministic across pandas versions / concat orders. Fix: explicit secondary sort key (``lead_id``) when present. P1. Removed three redundant ``import numpy as np`` calls inside figure helpers (numpy is already imported at module top). P3. Markdown report now footnotes the gate references (G6.4 / G7.* / G7.4 / G8.1) so a reader without the acceptance-gates doc can decode the section headers. Acceptance: - 1101/1101 tests pass (was 1095 — 6 new tests). - ruff check + format clean; mypy clean. - ``BUNDLE_SCHEMA_VERSION`` unchanged. - ``probe_relational_leakage.py`` exits 0 on every public tier. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Triage of 6 Copilot comments: 3 real, 3 false-positives / already-treated. Real fixes: * COPILOT-2 — ``measure_cohort_shift_from_bundle`` now class-checks ``y_train`` / ``y_test`` and raises a clear ``ValueError`` on degeneracy, matching ``measure_tier_from_bundle``'s posture instead of letting sklearn's ``roc_auc_score`` raise from inside the call. NaN-return is still reserved for *missing* inputs (no timestamp, unparseable timestamps); a degenerate label is a structural bundle problem and fails loudly. * COPILOT-4 — markdown baseline-cell citations now use the ``$.tiers.<tier>.per_seed[<i>].baselines.<name>`` list-index form that matches the actual JSON shape. The previous ``[seed=<seed>]`` selector was invented and unverifiable. * COPILOT-5 — ``Path.write_text`` calls in ``render_report`` pin ``encoding="utf-8"``. The default ``locale.getpreferredencoding(False)`` would mojibake the markdown's em-dashes / minus signs on legacy ANSI-locale Windows systems and break the byte-identical-text-artefact promise. * COPILOT-6 — module docstring softened. We no longer claim figures are deterministic byte-for-byte across environments; the guarantee is now JSON+markdown only, and figures are stable within an environment. Pinning matplotlib rcParams + fonts for cross-environment determinism is fragile in practice (font availability / hinting / antialiasing) and out of scope for v1. Resolved without changes: * COPILOT-1 — ``isinstance(obj, list | tuple)`` is valid in Python 3.10+ (PEP 604); leadforge requires-python is ``>=3.11``. Code is correct; verified by every passing test. * COPILOT-3 — the ``COHORT_SEED`` / ``seed`` divergence was already fixed in the previous self-review commit (``6c0f835``). Copilot reviewed the merge-base diff which still showed the original code. New tests (3): - ``test_markdown_baseline_citations_use_list_index`` — citation format must match the actual JSON shape; the invented ``[seed=...]`` selector is gone. - ``test_text_outputs_are_utf8_encoded`` — UTF-8 round-trip on JSON+markdown including the em-dash. - ``test_cohort_shift_raises_on_degenerate_train`` — clear ``ValueError`` on degenerate task splits. Acceptance: 1104/1104 tests pass; ruff + mypy clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

github-actions · 2026-05-05T22:10:49Z

pr-agent-context report:

No unresolved review comments, failing checks, or actionable patch coverage gaps were found on PR #66 in repository https://github.com/leadforge-dev/leadforge. Treat this PR as all clear unless new signals appear.

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: commit pushed
Workflow run: 25405060704 attempt 1
Comment timestamp: 2026-05-05T22:10:00.292068+00:00
PR head commit: 286a633e90f77a4a4169f3f7bdeb39ea6a848237

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+            ``advanced`` names but tolerates their absence (the
+            corresponding ordering-bool fields default to ``True`` so a
+            partial release does not over-report ordering failures).


+    rand = [cohort[t].random_split_auc for t in tier_names]
+    coh = [
+        cohort[t].cohort_split_auc if not math.isnan(cohort[t].cohort_split_auc) else 0.0
+        for t in tier_names
+    ]


+        for tier_name in tier_names:
+            csm = tiers[tier_name]
+            seed_aucs: list[float] = [tm.baselines[bn] for tm in csm.per_seed if bn in tm.baselines]
+            ys.append(float(np.median(seed_aucs)) if seed_aucs else 0.0)


+import math
+from collections.abc import Mapping
+from pathlib import Path
+from typing import Any
+
+import matplotlib
+import numpy as np
+
+matplotlib.use("Agg")  # headless / deterministic; must precede pyplot import.
+
+import matplotlib.pyplot as plt  # noqa: E402
+


Copilot AI review requested due to automatic review settings May 5, 2026 21:49

shaypal5 added this to the dataset: leadforge-lead-scoring-v1 milestone May 5, 2026

shaypal5 added type: feature New capability layer: validation validation/ invariants and checks labels May 5, 2026

Copilot started reviewing on behalf of shaypal5 May 5, 2026 21:50 View session

This comment has been minimized.

Sign in to view

Copilot AI reviewed May 5, 2026

View reviewed changes

This comment has been minimized.

Sign in to view

Copilot AI review requested due to automatic review settings May 5, 2026 22:09

Copilot started reviewing on behalf of shaypal5 May 5, 2026 22:10 View session

Copilot AI reviewed May 5, 2026

View reviewed changes

shaypal5 merged commit 609f4a1 into main May 5, 2026
12 checks passed

shaypal5 deleted the feat/release-quality-reporting branch June 11, 2026 19:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(validation): release_quality + reporting modules (PR 3.2)#66

feat(validation): release_quality + reporting modules (PR 3.2)#66
shaypal5 merged 4 commits into
mainfrom
feat/release-quality-reporting

shaypal5 commented May 5, 2026

Uh oh!

This comment has been minimized.

This comment has been minimized.

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

This comment has been minimized.

github-actions Bot commented May 5, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

shaypal5 commented May 5, 2026

Summary

What's in leadforge/validation/release_quality.py

What's in leadforge/validation/reporting.py

Tests (28 new)

Acceptance

Not in scope (PR 3.3)

Test plan

Uh oh!

This comment has been minimized.

This comment has been minimized.

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

This comment has been minimized.

github-actions Bot commented May 5, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

What's in `leadforge/validation/release_quality.py`

What's in `leadforge/validation/reporting.py`