Skip to content

feat(validation): release_quality + reporting modules (PR 3.2)#66

Merged
shaypal5 merged 4 commits into
mainfrom
feat/release-quality-reporting
May 5, 2026
Merged

feat(validation): release_quality + reporting modules (PR 3.2)#66
shaypal5 merged 4 commits into
mainfrom
feat/release-quality-reporting

Conversation

@shaypal5

@shaypal5 shaypal5 commented May 5, 2026

Copy link
Copy Markdown
Contributor

Summary

PR 3.2 of the v1 dataset release sequence (docs/release/v1_release_roadmap.md Phase 3).
Built on PR 3.1's leakage taxonomy. Adds the release-grade metric panel + report
renderer that PR 3.3's validate_release_candidate.py driver will consume.

Two new modules + 28 new tests; purely additive on the validator/reporting side
(no on-disk bundle shape change, BUNDLE_SCHEMA_VERSION stays at 5).

What's in leadforge/validation/release_quality.py

JSON-primitive dataclasses:

  • TierMetrics — full per-(tier, seed) panel: ROC-AUC, PR-AUC, log loss, Brier,
    reliability bins + max-bin error, P@{50,100}, R@K, lift@{1,5,10}%, top-decile
    rate, expected-ACV capture@K, LR-vs-HistGBM AUC delta, and baseline AUCs for
    source_only / engagement_only / stage_only / post_snapshot_aggregates /
    id_only (G5.1–G5.3 / G7.* of v1_acceptance_gates.md).
  • CrossSeedTierMetrics — per-tier seed sweep with median + (max-min) spread
    on every headline field (G8.1).
  • CohortShiftMetrics — random-vs-chronological-cohort-split AUC degradation
    (G6.4) using HistGBM, NaN-tolerant when lead_created_at is missing.
  • CrossTierOrdering — descending rankings + booleans for G7.4.* checks (AP,
    P@100, GBM-vs-LR, conversion-rate orderings; "GBM beats LR in every tier"
    flag).
  • ReleaseQualityReport — top-level container. Stable field order; new
    metrics go at the bottom of TierMetrics so PR 3.3's gates don't drift.

Public functions:

  • measure_tier_from_bundle(bundle_dir, *, seed, tier_name) — single bundle
    → full TierMetrics. Reads the primary task's train.parquet /
    test.parquet (the valid split is intentionally unused so model
    selection stays separate from reporting). Surfaces a clear ValueError
    if the train split has < 2 classes.
  • measure_cohort_shift_from_bundle(bundle_dir, *, seed, tier_name) — splits
    the train+test pool chronologically by lead_created_at at
    COHORT_TRAIN_FRAC = 0.85, fits HistGBM on the early half, scores the
    late half.
  • regenerate_tier_for_seeds(spec, seeds, workdir) — idempotent helper that
    calls Generator.from_recipe(...).generate().save(...) once per seed under
    <workdir>/<tier>__seed{seed}/. PR 3.3's driver and the round-trip test
    both consume this.
  • measure_release_quality(tier_bundles, *, cohort_canonical_seed, ...)
    top-level orchestrator returning the full ReleaseQualityReport.
  • report_to_dict / report_to_json — deterministic serialisation with
    NaN/Inf coerced to JSON null (the cohort-shift NaN-on-missing-timestamp
    case is the canonical example).

Reuse:

  • The canonical preprocessor (median-impute + standard-scale numeric;
    most-frequent-impute + one-hot-encode categorical) mirrors
    leadforge.pipelines.ml.build_preprocessor, but is rebuilt locally because
    the bundle's task-split column set differs from the v6/v7 flat-CSV column set.
  • _partition_columns excludes ID / timestamp anchors deterministically rather
    than relying on the v6/v7 hardcoded CAT_FEATURES / NUM_FEATURES lists.
  • _id_only_auc hashes IDs the same way leakage_probes._hash_id_columns does
    so the leakage probe's id_only baseline and the release-quality id_only
    baseline produce comparable numbers.

What's in leadforge/validation/reporting.py

render_report(report, output_dir) writes the pinned output contract from
docs/release/v1_release_design.md §"Output contract":

release/validation/
  validation_report.json
  validation_report.md
  figures/
    lift_curve_intro.png
    lift_curve_intermediate.png
    lift_curve_advanced.png
    calibration_intermediate.png
    leakage_delta.png
    cohort_shift.png
    value_capture.png

Filenames are exact constants in the module — renaming any of them is a
contract change.

  • JSON dump uses dataclasses.asdict + json.dumps(sort_keys=True, indent=2),
    no custom encoders. Two consecutive renders of the same report produce
    byte-identical JSON (asserted by test).
  • Markdown report carries a $.tiers.<tier>.medians.<field> JSON-path
    citation on every metric cell so G10.6 ("every claim has a backing
    reference") is satisfied at render time. NaN values render as _n/a_.
  • Matplotlib forced to Agg before pyplot imports so the renderer is
    CI-safe. matplotlib>=3.7 added to the [scripts] and [dev] extras
    with a corresponding mypy override.

Tests (28 new)

  • tests/validation/test_release_quality.py — metric primitives with
    hand-built inputs (perfect ranker, known miscalibration on a 0.9-only
    predictor, zero-base-rate edge case), JSON serialisation including NaN
    coercion, cross-tier ordering, and bundle-level measurement against
    synthetic mini-bundles (degenerate train surfaces a clear ValueError).
  • tests/validation/test_reporting.py — every contract file is written
    non-empty, JSON well-formed, markdown cites JSON paths for every tier,
    partial-release rendering skips missing tiers, byte-identical
    determinism across two consecutive renders.
  • tests/integration/test_release_quality_round_trip.py — full
    Generator.from_recipe(...).generate(_SMALL).save(...)
    measure_release_qualityrender_report flow at N=2 seeds (the full
    N=5 release-time sweep is PR 3.3's driver).

Acceptance

  • 1095/1095 tests pass (was 1067 — 28 new).
  • ruff check + ruff format --check clean.
  • mypy leadforge/ clean.
  • BUNDLE_SCHEMA_VERSION unchanged (purely additive on the
    validator/reporting side).
  • scripts/verify_hash_determinism.py PASS, 67/67 files identical.
  • scripts/probe_relational_leakage.py release/{intro,intermediate,advanced} --max-accuracy 0.65
    exits 0 on every public tier.
  • validate_bundle() integration unchanged externally — public bundles
    still validate clean.
  • Figures render via Agg backend without requiring a display server.

Not in scope (PR 3.3)

  • scripts/validate_release_candidate.py driver (calls these libraries).
  • Resolving TBD-* bands in v1_acceptance_gates.md.
  • Per-tier max_auc for the opt-in PR 3.1 leakage probes.
  • Updating leadforge/validation/difficulty.py with new band checks
    (after the bands exist).

Test plan

  • Run pytest -q → 1095 passing locally.
  • Run ruff check leadforge tests && ruff format --check leadforge tests.
  • Run mypy leadforge/.
  • Run scripts/verify_hash_determinism.py and confirm 67/67 PASS.
  • Run scripts/probe_relational_leakage.py against each public tier.
  • Manually inspect a generated validation_report.{json,md} once PR 3.3's driver lands.

🤖 Generated with Claude Code

Adds the release-grade metric panel and renderer that v1's
validate_release_candidate driver (PR 3.3) will consume.

leadforge/validation/release_quality.py:
- TierMetrics / CrossSeedTierMetrics / CohortShiftMetrics /
  CrossTierOrdering / ReleaseQualityReport dataclasses, JSON-primitive
  end-to-end so report_to_dict + json.dumps suffices.
- measure_tier_from_bundle(): full G7.* panel — LR + HistGBM AUC, AP,
  log loss, Brier, calibration bins, P@{50,100}, R@K, lift@{1,5,10}%,
  top-decile rate, expected-ACV capture, GBM-vs-LR delta, plus
  source/engagement/stage/post-snapshot/ID-only baseline AUCs (G5.*).
- measure_cohort_shift_from_bundle(): random-vs-chronological-cohort
  split AUC degradation (G6.4) using HistGBM (NaN-tolerant).
- regenerate_tier_for_seeds() orchestrates cross-seed rebuilds via
  Generator.from_recipe; idempotent across re-runs.
- measure_release_quality() aggregates per-(tier, seed) into the full
  ReleaseQualityReport with G8.1 cross-seed spreads and G7.4.*
  cross-tier ordering booleans.

leadforge/validation/reporting.py:
- render_report(report, output_dir) writes the pinned output contract:
  validation_report.json, validation_report.md, and figures/
  lift_curve_{intro,intermediate,advanced}.png,
  calibration_intermediate.png, leakage_delta.png, cohort_shift.png,
  value_capture.png — under the Agg backend (CI-safe).
- Markdown carries a $.tiers.<tier>.medians.<field> JSON-path citation
  on every metric cell per G10.6.
- NaN floats serialise as JSON null and as _n/a_ in markdown.

pyproject.toml:
- matplotlib>=3.7 added to [scripts] and [dev] extras + mypy override.

Tests (28 new):
- tests/validation/test_release_quality.py: metric primitives with
  hand-built inputs (perfect ranker, known miscalibration, zero
  base rate), JSON serialisation including NaN coercion, cross-tier
  ordering, and bundle-level measurement against synthetic mini-
  bundles (degenerate train surfaces a clear ValueError).
- tests/validation/test_reporting.py: every contract file is written
  non-empty, JSON well-formed, markdown cites JSON paths for every
  tier, partial-release rendering skips missing tiers, byte-identical
  determinism across two consecutive renders.
- tests/integration/test_release_quality_round_trip.py: full
  Generator.from_recipe(_SMALL).save → measure_release_quality →
  render_report flow at N=2 seeds (the full N=5 sweep is PR 3.3's
  driver).

Acceptance:
- 1095/1095 tests pass (was 1067).
- ruff check + format clean; mypy clean.
- BUNDLE_SCHEMA_VERSION unchanged (purely additive).
- scripts/verify_hash_determinism.py PASS, 67/67 files identical.
- scripts/probe_relational_leakage.py release/{intro,intermediate,
  advanced} --max-accuracy 0.65 still exits 0.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 5, 2026 21:49
@shaypal5 shaypal5 added type: feature New capability layer: validation validation/ invariants and checks labels May 5, 2026
@github-actions

This comment has been minimized.

Self-review pass on PR 3.2:

- ``measure_cohort_shift_from_bundle`` was overriding the caller's
  ``seed`` with a hardcoded ``COHORT_SEED`` for the chronological-split
  HistGBM fit only, while the random-split fit used the caller's seed.
  That made the AUC pair non-comparable across reseeded calls and was
  surprising — drop the constant and use the caller's seed for both
  fits.
- The four NaN-return branches duplicated the same ``CohortShiftMetrics``
  literal with the ``label = tier_name or manifest.difficulty``
  expansion inline.  Factor into a local ``_no_cohort()`` closure;
  reduces the function from ~85 to ~65 lines without changing behaviour.

Tests + ruff + mypy still clean (28/28 PR-3.2 tests pass).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions

This comment has been minimized.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds the release-grade validation/reporting layer for the leadforge-lead-scoring-v1 dataset release sequence (Phase 3 / PR 3.2), producing a structured ReleaseQualityReport plus a renderer that writes the pinned JSON/markdown/figures contract for downstream release-candidate gating (PR 3.3).

Changes:

  • Introduces leadforge.validation.release_quality to measure per-tier, cross-seed, and cohort-shift metrics and serialize them deterministically.
  • Introduces leadforge.validation.reporting to render the report to validation_report.json, validation_report.md, and the pinned figure set under a headless matplotlib backend.
  • Adds new unit + integration tests and updates dev/scripts extras to include matplotlib (plus mypy override).

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
leadforge/validation/release_quality.py New metric dataclasses, bundle measurement functions, cross-seed orchestration, and JSON serialization helpers.
leadforge/validation/reporting.py New renderer that writes the release validation JSON/markdown and figure artefacts.
tests/validation/test_release_quality.py Unit tests for metric primitives, serialization, ordering, and synthetic bundle measurement.
tests/validation/test_reporting.py Tests for renderer output contract, citations, partial rendering, and determinism of text artefacts.
tests/integration/test_release_quality_round_trip.py End-to-end integration tests: generate bundle → measure → render, plus seed-regeneration idempotency.
pyproject.toml Adds matplotlib>=3.7 to [dev] and [scripts] extras and ignores missing imports for matplotlib in mypy.
.agent-plan.md Marks PR 3.2 as completed and documents the new modules/tests/deps at a high level.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread leadforge/validation/release_quality.py
Comment thread leadforge/validation/release_quality.py
Comment thread leadforge/validation/release_quality.py
Comment thread leadforge/validation/reporting.py Outdated
Comment thread leadforge/validation/reporting.py Outdated
Comment thread leadforge/validation/reporting.py Outdated
Brutal self-review caught seven real issues; this commit addresses them all.

C1. Lift curve was fabricated.
  ``_write_lift_curve`` plotted three measured lift@pct values (1/5/10%)
  connected by straight lines, then jumped to (100%, 1.0).  Saturated
  ``min(1.0, lift × pct/100)`` lied for high-lift models.
  Fix: add ``cumulative_gains`` to ``TierMetrics``, sampled at
  CUMULATIVE_GAINS_PCTS = (0, 10, …, 100); the renderer plots the actual
  curve.  New ``_cumulative_gains_curve`` helper + 2 unit tests
  (perfect-ranker, no-positives-NaN).

C2. Orchestrator passed the bundle's *generation* seed as the model's
  ``random_state``.
  ``measure_tier_from_bundle.seed`` was double-dipped: stored on the
  result for traceability AND used as ``random_state`` for the LR /
  HistGBM fits.  Cross-seed sweeps then confounded data variance with
  model-RNG variance.
  Fix: add ``model_random_state: int = DEFAULT_MODEL_RANDOM_STATE``
  parameter; ``measure_release_quality`` passes a constant.  New unit
  test asserts identical AUCs across two distinct ``seed`` values when
  ``model_random_state`` is held constant.

C3. ``len(set(np.ndarray))`` replaced with ``np.unique(...).size`` at
  three call sites — idiomatic, type-safe under nullable dtypes.

C4. Removed duplicate ``average_precision`` field on ``TierMetrics``
  (it was an exact alias of ``lr_average_precision``).

D1. ``CrossTierOrdering`` booleans were defaulting to ``True`` when
  tiers were missing — silently green-lit partial releases at the PR
  3.3 gating layer.
  Fix: type as ``bool | None``; missing pair → ``None``.  Test updated
  to assert ``None`` on partial / empty releases.

D2. ``_HEADLINE_FIELDS`` was hand-maintained with no drift guard.
  Fix: add ``TestHeadlineFieldsRegistry`` meta-test (mirrors the
  ``PROBE_REGISTRY`` pattern in ``leakage_probes.py``).  Asserts every
  entry is a scalar ``float`` field on ``TierMetrics`` and that the
  cross-seed aggregator emits a median + spread for each.

D3. Cohort-split tie-breaking on duplicate timestamps was non-
  deterministic across pandas versions / concat orders.
  Fix: explicit secondary sort key (``lead_id``) when present.

P1. Removed three redundant ``import numpy as np`` calls inside figure
  helpers (numpy is already imported at module top).

P3. Markdown report now footnotes the gate references (G6.4 / G7.* /
  G7.4 / G8.1) so a reader without the acceptance-gates doc can decode
  the section headers.

Acceptance:
- 1101/1101 tests pass (was 1095 — 6 new tests).
- ruff check + format clean; mypy clean.
- ``BUNDLE_SCHEMA_VERSION`` unchanged.
- ``probe_relational_leakage.py`` exits 0 on every public tier.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions

This comment has been minimized.

Triage of 6 Copilot comments: 3 real, 3 false-positives / already-treated.

Real fixes:

* COPILOT-2 — ``measure_cohort_shift_from_bundle`` now class-checks
  ``y_train`` / ``y_test`` and raises a clear ``ValueError`` on
  degeneracy, matching ``measure_tier_from_bundle``'s posture instead
  of letting sklearn's ``roc_auc_score`` raise from inside the call.
  NaN-return is still reserved for *missing* inputs (no timestamp,
  unparseable timestamps); a degenerate label is a structural bundle
  problem and fails loudly.
* COPILOT-4 — markdown baseline-cell citations now use the
  ``$.tiers.<tier>.per_seed[<i>].baselines.<name>`` list-index form
  that matches the actual JSON shape.  The previous
  ``[seed=<seed>]`` selector was invented and unverifiable.
* COPILOT-5 — ``Path.write_text`` calls in ``render_report`` pin
  ``encoding="utf-8"``.  The default
  ``locale.getpreferredencoding(False)`` would mojibake the
  markdown's em-dashes / minus signs on legacy ANSI-locale Windows
  systems and break the byte-identical-text-artefact promise.
* COPILOT-6 — module docstring softened.  We no longer claim figures
  are deterministic byte-for-byte across environments; the
  guarantee is now JSON+markdown only, and figures are stable
  within an environment.  Pinning matplotlib rcParams + fonts for
  cross-environment determinism is fragile in practice (font
  availability / hinting / antialiasing) and out of scope for v1.

Resolved without changes:

* COPILOT-1 — ``isinstance(obj, list | tuple)`` is valid in Python
  3.10+ (PEP 604); leadforge requires-python is ``>=3.11``.  Code is
  correct; verified by every passing test.
* COPILOT-3 — the ``COHORT_SEED`` / ``seed`` divergence was already
  fixed in the previous self-review commit (``6c0f835``).  Copilot
  reviewed the merge-base diff which still showed the original code.

New tests (3):
- ``test_markdown_baseline_citations_use_list_index`` — citation
  format must match the actual JSON shape; the invented ``[seed=...]``
  selector is gone.
- ``test_text_outputs_are_utf8_encoded`` — UTF-8 round-trip on
  JSON+markdown including the em-dash.
- ``test_cohort_shift_raises_on_degenerate_train`` — clear
  ``ValueError`` on degenerate task splits.

Acceptance: 1104/1104 tests pass; ruff + mypy clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 5, 2026 22:09
@github-actions

github-actions Bot commented May 5, 2026

Copy link
Copy Markdown

pr-agent-context report:

No unresolved review comments, failing checks, or actionable patch coverage gaps were found on PR #66 in repository https://github.com/leadforge-dev/leadforge. Treat this PR as all clear unless new signals appear.

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: commit pushed
Workflow run: 25405060704 attempt 1
Comment timestamp: 2026-05-05T22:10:00.292068+00:00
PR head commit: 286a633e90f77a4a4169f3f7bdeb39ea6a848237

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +587 to +589
``advanced`` names but tolerates their absence (the
corresponding ordering-bool fields default to ``True`` so a
partial release does not over-report ordering failures).
Comment on lines +475 to +479
rand = [cohort[t].random_split_auc for t in tier_names]
coh = [
cohort[t].cohort_split_auc if not math.isnan(cohort[t].cohort_split_auc) else 0.0
for t in tier_names
]
for tier_name in tier_names:
csm = tiers[tier_name]
seed_aucs: list[float] = [tm.baselines[bn] for tm in csm.per_seed if bn in tm.baselines]
ys.append(float(np.median(seed_aucs)) if seed_aucs else 0.0)
Comment on lines +41 to +52
import math
from collections.abc import Mapping
from pathlib import Path
from typing import Any

import matplotlib
import numpy as np

matplotlib.use("Agg") # headless / deterministic; must precede pyplot import.

import matplotlib.pyplot as plt # noqa: E402

@shaypal5 shaypal5 merged commit 609f4a1 into main May 5, 2026
12 checks passed
@shaypal5 shaypal5 deleted the feat/release-quality-reporting branch June 11, 2026 19:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

layer: validation validation/ invariants and checks type: feature New capability

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants