feat(validation): release_quality + reporting modules (PR 3.2)#66
Conversation
Adds the release-grade metric panel and renderer that v1's
validate_release_candidate driver (PR 3.3) will consume.
leadforge/validation/release_quality.py:
- TierMetrics / CrossSeedTierMetrics / CohortShiftMetrics /
CrossTierOrdering / ReleaseQualityReport dataclasses, JSON-primitive
end-to-end so report_to_dict + json.dumps suffices.
- measure_tier_from_bundle(): full G7.* panel — LR + HistGBM AUC, AP,
log loss, Brier, calibration bins, P@{50,100}, R@K, lift@{1,5,10}%,
top-decile rate, expected-ACV capture, GBM-vs-LR delta, plus
source/engagement/stage/post-snapshot/ID-only baseline AUCs (G5.*).
- measure_cohort_shift_from_bundle(): random-vs-chronological-cohort
split AUC degradation (G6.4) using HistGBM (NaN-tolerant).
- regenerate_tier_for_seeds() orchestrates cross-seed rebuilds via
Generator.from_recipe; idempotent across re-runs.
- measure_release_quality() aggregates per-(tier, seed) into the full
ReleaseQualityReport with G8.1 cross-seed spreads and G7.4.*
cross-tier ordering booleans.
leadforge/validation/reporting.py:
- render_report(report, output_dir) writes the pinned output contract:
validation_report.json, validation_report.md, and figures/
lift_curve_{intro,intermediate,advanced}.png,
calibration_intermediate.png, leakage_delta.png, cohort_shift.png,
value_capture.png — under the Agg backend (CI-safe).
- Markdown carries a $.tiers.<tier>.medians.<field> JSON-path citation
on every metric cell per G10.6.
- NaN floats serialise as JSON null and as _n/a_ in markdown.
pyproject.toml:
- matplotlib>=3.7 added to [scripts] and [dev] extras + mypy override.
Tests (28 new):
- tests/validation/test_release_quality.py: metric primitives with
hand-built inputs (perfect ranker, known miscalibration, zero
base rate), JSON serialisation including NaN coercion, cross-tier
ordering, and bundle-level measurement against synthetic mini-
bundles (degenerate train surfaces a clear ValueError).
- tests/validation/test_reporting.py: every contract file is written
non-empty, JSON well-formed, markdown cites JSON paths for every
tier, partial-release rendering skips missing tiers, byte-identical
determinism across two consecutive renders.
- tests/integration/test_release_quality_round_trip.py: full
Generator.from_recipe(_SMALL).save → measure_release_quality →
render_report flow at N=2 seeds (the full N=5 sweep is PR 3.3's
driver).
Acceptance:
- 1095/1095 tests pass (was 1067).
- ruff check + format clean; mypy clean.
- BUNDLE_SCHEMA_VERSION unchanged (purely additive).
- scripts/verify_hash_determinism.py PASS, 67/67 files identical.
- scripts/probe_relational_leakage.py release/{intro,intermediate,
advanced} --max-accuracy 0.65 still exits 0.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
Self-review pass on PR 3.2: - ``measure_cohort_shift_from_bundle`` was overriding the caller's ``seed`` with a hardcoded ``COHORT_SEED`` for the chronological-split HistGBM fit only, while the random-split fit used the caller's seed. That made the AUC pair non-comparable across reseeded calls and was surprising — drop the constant and use the caller's seed for both fits. - The four NaN-return branches duplicated the same ``CohortShiftMetrics`` literal with the ``label = tier_name or manifest.difficulty`` expansion inline. Factor into a local ``_no_cohort()`` closure; reduces the function from ~85 to ~65 lines without changing behaviour. Tests + ruff + mypy still clean (28/28 PR-3.2 tests pass). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Pull request overview
Adds the release-grade validation/reporting layer for the leadforge-lead-scoring-v1 dataset release sequence (Phase 3 / PR 3.2), producing a structured ReleaseQualityReport plus a renderer that writes the pinned JSON/markdown/figures contract for downstream release-candidate gating (PR 3.3).
Changes:
- Introduces
leadforge.validation.release_qualityto measure per-tier, cross-seed, and cohort-shift metrics and serialize them deterministically. - Introduces
leadforge.validation.reportingto render the report tovalidation_report.json,validation_report.md, and the pinned figure set under a headless matplotlib backend. - Adds new unit + integration tests and updates dev/scripts extras to include matplotlib (plus mypy override).
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
leadforge/validation/release_quality.py |
New metric dataclasses, bundle measurement functions, cross-seed orchestration, and JSON serialization helpers. |
leadforge/validation/reporting.py |
New renderer that writes the release validation JSON/markdown and figure artefacts. |
tests/validation/test_release_quality.py |
Unit tests for metric primitives, serialization, ordering, and synthetic bundle measurement. |
tests/validation/test_reporting.py |
Tests for renderer output contract, citations, partial rendering, and determinism of text artefacts. |
tests/integration/test_release_quality_round_trip.py |
End-to-end integration tests: generate bundle → measure → render, plus seed-regeneration idempotency. |
pyproject.toml |
Adds matplotlib>=3.7 to [dev] and [scripts] extras and ignores missing imports for matplotlib in mypy. |
.agent-plan.md |
Marks PR 3.2 as completed and documents the new modules/tests/deps at a high level. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Brutal self-review caught seven real issues; this commit addresses them all. C1. Lift curve was fabricated. ``_write_lift_curve`` plotted three measured lift@pct values (1/5/10%) connected by straight lines, then jumped to (100%, 1.0). Saturated ``min(1.0, lift × pct/100)`` lied for high-lift models. Fix: add ``cumulative_gains`` to ``TierMetrics``, sampled at CUMULATIVE_GAINS_PCTS = (0, 10, …, 100); the renderer plots the actual curve. New ``_cumulative_gains_curve`` helper + 2 unit tests (perfect-ranker, no-positives-NaN). C2. Orchestrator passed the bundle's *generation* seed as the model's ``random_state``. ``measure_tier_from_bundle.seed`` was double-dipped: stored on the result for traceability AND used as ``random_state`` for the LR / HistGBM fits. Cross-seed sweeps then confounded data variance with model-RNG variance. Fix: add ``model_random_state: int = DEFAULT_MODEL_RANDOM_STATE`` parameter; ``measure_release_quality`` passes a constant. New unit test asserts identical AUCs across two distinct ``seed`` values when ``model_random_state`` is held constant. C3. ``len(set(np.ndarray))`` replaced with ``np.unique(...).size`` at three call sites — idiomatic, type-safe under nullable dtypes. C4. Removed duplicate ``average_precision`` field on ``TierMetrics`` (it was an exact alias of ``lr_average_precision``). D1. ``CrossTierOrdering`` booleans were defaulting to ``True`` when tiers were missing — silently green-lit partial releases at the PR 3.3 gating layer. Fix: type as ``bool | None``; missing pair → ``None``. Test updated to assert ``None`` on partial / empty releases. D2. ``_HEADLINE_FIELDS`` was hand-maintained with no drift guard. Fix: add ``TestHeadlineFieldsRegistry`` meta-test (mirrors the ``PROBE_REGISTRY`` pattern in ``leakage_probes.py``). Asserts every entry is a scalar ``float`` field on ``TierMetrics`` and that the cross-seed aggregator emits a median + spread for each. D3. Cohort-split tie-breaking on duplicate timestamps was non- deterministic across pandas versions / concat orders. Fix: explicit secondary sort key (``lead_id``) when present. P1. Removed three redundant ``import numpy as np`` calls inside figure helpers (numpy is already imported at module top). P3. Markdown report now footnotes the gate references (G6.4 / G7.* / G7.4 / G8.1) so a reader without the acceptance-gates doc can decode the section headers. Acceptance: - 1101/1101 tests pass (was 1095 — 6 new tests). - ruff check + format clean; mypy clean. - ``BUNDLE_SCHEMA_VERSION`` unchanged. - ``probe_relational_leakage.py`` exits 0 on every public tier. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
Triage of 6 Copilot comments: 3 real, 3 false-positives / already-treated. Real fixes: * COPILOT-2 — ``measure_cohort_shift_from_bundle`` now class-checks ``y_train`` / ``y_test`` and raises a clear ``ValueError`` on degeneracy, matching ``measure_tier_from_bundle``'s posture instead of letting sklearn's ``roc_auc_score`` raise from inside the call. NaN-return is still reserved for *missing* inputs (no timestamp, unparseable timestamps); a degenerate label is a structural bundle problem and fails loudly. * COPILOT-4 — markdown baseline-cell citations now use the ``$.tiers.<tier>.per_seed[<i>].baselines.<name>`` list-index form that matches the actual JSON shape. The previous ``[seed=<seed>]`` selector was invented and unverifiable. * COPILOT-5 — ``Path.write_text`` calls in ``render_report`` pin ``encoding="utf-8"``. The default ``locale.getpreferredencoding(False)`` would mojibake the markdown's em-dashes / minus signs on legacy ANSI-locale Windows systems and break the byte-identical-text-artefact promise. * COPILOT-6 — module docstring softened. We no longer claim figures are deterministic byte-for-byte across environments; the guarantee is now JSON+markdown only, and figures are stable within an environment. Pinning matplotlib rcParams + fonts for cross-environment determinism is fragile in practice (font availability / hinting / antialiasing) and out of scope for v1. Resolved without changes: * COPILOT-1 — ``isinstance(obj, list | tuple)`` is valid in Python 3.10+ (PEP 604); leadforge requires-python is ``>=3.11``. Code is correct; verified by every passing test. * COPILOT-3 — the ``COHORT_SEED`` / ``seed`` divergence was already fixed in the previous self-review commit (``6c0f835``). Copilot reviewed the merge-base diff which still showed the original code. New tests (3): - ``test_markdown_baseline_citations_use_list_index`` — citation format must match the actual JSON shape; the invented ``[seed=...]`` selector is gone. - ``test_text_outputs_are_utf8_encoded`` — UTF-8 round-trip on JSON+markdown including the em-dash. - ``test_cohort_shift_raises_on_degenerate_train`` — clear ``ValueError`` on degenerate task splits. Acceptance: 1104/1104 tests pass; ruff + mypy clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
pr-agent-context report: No unresolved review comments, failing checks, or actionable patch coverage gaps were found on PR #66 in repository https://github.com/leadforge-dev/leadforge. Treat this PR as all clear unless new signals appear.Run metadata: |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| ``advanced`` names but tolerates their absence (the | ||
| corresponding ordering-bool fields default to ``True`` so a | ||
| partial release does not over-report ordering failures). |
| rand = [cohort[t].random_split_auc for t in tier_names] | ||
| coh = [ | ||
| cohort[t].cohort_split_auc if not math.isnan(cohort[t].cohort_split_auc) else 0.0 | ||
| for t in tier_names | ||
| ] |
| for tier_name in tier_names: | ||
| csm = tiers[tier_name] | ||
| seed_aucs: list[float] = [tm.baselines[bn] for tm in csm.per_seed if bn in tm.baselines] | ||
| ys.append(float(np.median(seed_aucs)) if seed_aucs else 0.0) |
| import math | ||
| from collections.abc import Mapping | ||
| from pathlib import Path | ||
| from typing import Any | ||
|
|
||
| import matplotlib | ||
| import numpy as np | ||
|
|
||
| matplotlib.use("Agg") # headless / deterministic; must precede pyplot import. | ||
|
|
||
| import matplotlib.pyplot as plt # noqa: E402 | ||
|
|
Summary
PR 3.2 of the v1 dataset release sequence (
docs/release/v1_release_roadmap.mdPhase 3).Built on PR 3.1's leakage taxonomy. Adds the release-grade metric panel + report
renderer that PR 3.3's
validate_release_candidate.pydriver will consume.Two new modules + 28 new tests; purely additive on the validator/reporting side
(no on-disk bundle shape change,
BUNDLE_SCHEMA_VERSIONstays at5).What's in
leadforge/validation/release_quality.pyJSON-primitive dataclasses:
TierMetrics— full per-(tier, seed) panel: ROC-AUC, PR-AUC, log loss, Brier,reliability bins + max-bin error, P@{50,100}, R@K, lift@{1,5,10}%, top-decile
rate, expected-ACV capture@K, LR-vs-HistGBM AUC delta, and baseline AUCs for
source_only/engagement_only/stage_only/post_snapshot_aggregates/id_only(G5.1–G5.3 / G7.* ofv1_acceptance_gates.md).CrossSeedTierMetrics— per-tier seed sweep with median + (max-min) spreadon every headline field (G8.1).
CohortShiftMetrics— random-vs-chronological-cohort-split AUC degradation(G6.4) using HistGBM, NaN-tolerant when
lead_created_atis missing.CrossTierOrdering— descending rankings + booleans for G7.4.* checks (AP,P@100, GBM-vs-LR, conversion-rate orderings; "GBM beats LR in every tier"
flag).
ReleaseQualityReport— top-level container. Stable field order; newmetrics go at the bottom of
TierMetricsso PR 3.3's gates don't drift.Public functions:
measure_tier_from_bundle(bundle_dir, *, seed, tier_name)— single bundle→ full
TierMetrics. Reads the primary task'strain.parquet/test.parquet(thevalidsplit is intentionally unused so modelselection stays separate from reporting). Surfaces a clear
ValueErrorif the train split has < 2 classes.
measure_cohort_shift_from_bundle(bundle_dir, *, seed, tier_name)— splitsthe train+test pool chronologically by
lead_created_atatCOHORT_TRAIN_FRAC = 0.85, fits HistGBM on the early half, scores thelate half.
regenerate_tier_for_seeds(spec, seeds, workdir)— idempotent helper thatcalls
Generator.from_recipe(...).generate().save(...)once per seed under<workdir>/<tier>__seed{seed}/. PR 3.3's driver and the round-trip testboth consume this.
measure_release_quality(tier_bundles, *, cohort_canonical_seed, ...)—top-level orchestrator returning the full
ReleaseQualityReport.report_to_dict/report_to_json— deterministic serialisation withNaN/Inf coerced to JSON
null(the cohort-shift NaN-on-missing-timestampcase is the canonical example).
Reuse:
most-frequent-impute + one-hot-encode categorical) mirrors
leadforge.pipelines.ml.build_preprocessor, but is rebuilt locally becausethe bundle's task-split column set differs from the v6/v7 flat-CSV column set.
_partition_columnsexcludes ID / timestamp anchors deterministically ratherthan relying on the v6/v7 hardcoded
CAT_FEATURES/NUM_FEATURESlists._id_only_auchashes IDs the same wayleakage_probes._hash_id_columnsdoesso the leakage probe's
id_onlybaseline and the release-qualityid_onlybaseline produce comparable numbers.
What's in
leadforge/validation/reporting.pyrender_report(report, output_dir)writes the pinned output contract fromdocs/release/v1_release_design.md§"Output contract":Filenames are exact constants in the module — renaming any of them is a
contract change.
dataclasses.asdict+json.dumps(sort_keys=True, indent=2),no custom encoders. Two consecutive renders of the same report produce
byte-identical JSON (asserted by test).
$.tiers.<tier>.medians.<field>JSON-pathcitation on every metric cell so G10.6 ("every claim has a backing
reference") is satisfied at render time. NaN values render as
_n/a_.Aggbeforepyplotimports so the renderer isCI-safe.
matplotlib>=3.7added to the[scripts]and[dev]extraswith a corresponding mypy override.
Tests (28 new)
tests/validation/test_release_quality.py— metric primitives withhand-built inputs (perfect ranker, known miscalibration on a 0.9-only
predictor, zero-base-rate edge case), JSON serialisation including NaN
coercion, cross-tier ordering, and bundle-level measurement against
synthetic mini-bundles (degenerate train surfaces a clear
ValueError).tests/validation/test_reporting.py— every contract file is writtennon-empty, JSON well-formed, markdown cites JSON paths for every tier,
partial-release rendering skips missing tiers, byte-identical
determinism across two consecutive renders.
tests/integration/test_release_quality_round_trip.py— fullGenerator.from_recipe(...).generate(_SMALL).save(...)→measure_release_quality→render_reportflow at N=2 seeds (the fullN=5 release-time sweep is PR 3.3's driver).
Acceptance
ruff check+ruff format --checkclean.mypy leadforge/clean.BUNDLE_SCHEMA_VERSIONunchanged (purely additive on thevalidator/reporting side).
scripts/verify_hash_determinism.pyPASS, 67/67 files identical.scripts/probe_relational_leakage.py release/{intro,intermediate,advanced} --max-accuracy 0.65exits 0 on every public tier.
validate_bundle()integration unchanged externally — public bundlesstill validate clean.
Not in scope (PR 3.3)
scripts/validate_release_candidate.pydriver (calls these libraries).TBD-*bands inv1_acceptance_gates.md.max_aucfor the opt-in PR 3.1 leakage probes.leadforge/validation/difficulty.pywith new band checks(after the bands exist).
Test plan
pytest -q→ 1095 passing locally.ruff check leadforge tests && ruff format --check leadforge tests.mypy leadforge/.scripts/verify_hash_determinism.pyand confirm 67/67 PASS.scripts/probe_relational_leakage.pyagainst each public tier.validation_report.{json,md}once PR 3.3's driver lands.🤖 Generated with Claude Code