feat: canonical lead scoring validation module by shaypal5 · Pull Request #27 · leadforge-dev/leadforge

shaypal5 · 2026-04-30T06:59:04Z

Summary

Add leadforge/validation/lead_scoring.py — single source of truth validation module with canonical sklearn pipeline (ColumnTransformer + OneHotEncoder + StandardScaler + LogisticRegression), proper train-only preprocessing, multi-seed leakage trap evaluation, value-aware metrics, and group determinism checks
Add scripts/validate_lead_scoring_dataset.py — CLI entrypoint (--csv, --out-json, --emit-release-snippet, --enforce-1000)
Add tests/validation/test_lead_scoring.py — 12 tests covering schema, group determinism, baseline metrics, trap detection, value metrics, and report serialization
Add CI job validate-dataset to .github/workflows/ci.yml — validates lead_scoring_intro_v5.csv if present in repo root
Update scripts/build_v5_snapshot.py — snapshot day 14→10, add boost_leakage_trap() step (Poisson(1) target-correlated noise) for robust trap detectability under the canonical pipeline

Motivation

Numbers claimed in RELEASE_v5.md did not reproduce under a straightforward sklearn baseline (notably leakage trap min delta and Precision@K). The old ad-hoc scripts used LabelEncoder + manual median imputation which gives different results than a proper sklearn ColumnTransformer pipeline. This PR establishes a canonical, reproducible validation pipeline and regenerates the v5 dataset so all metrics pass.

Validation results (on regenerated v5 CSV)

Check	Result
Baseline AUC (hold-out)	0.648
PR-AUC	0.403
Leakage trap mean delta (10 seeds)	0.081 (≥0.03)
Leakage trap min delta	0.035 (≥0.015)
All checks	PASS (exit code 0)

Test plan

pytest tests/validation/test_lead_scoring.py — 12/12 pass
ruff check + ruff format --check — clean
python scripts/validate_lead_scoring_dataset.py --csv <v5.csv> — exits 0
CI passes on this PR

🤖 Generated with Claude Code

Add single source of truth validation pipeline using sklearn ColumnTransformer (OneHotEncoder + StandardScaler + LR) with proper train-only preprocessing. Regenerate v5 dataset with snapshot day 10 and Poisson(1) leakage trap boost for robust multi-seed detectability. - leadforge/validation/lead_scoring.py: validation module with schema checks, group determinism, baseline AUC, value-aware metrics, multi-seed leakage trap evaluation - scripts/validate_lead_scoring_dataset.py: CLI entrypoint - tests/validation/test_lead_scoring.py: 12 tests - CI job for dataset validation in ci.yml - build_v5_snapshot.py: snapshot day 14→10, added trap boost step Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Adds a canonical, reproducible validation pipeline for the lead scoring intro dataset (including baseline ML evaluation, leakage trap checks, and value-aware metrics), plus tooling/tests/CI wiring to validate lead_scoring_intro_v5.csv consistently.

Changes:

Introduces leadforge/validation/lead_scoring.py as the single source of truth for dataset checks + baseline evaluation (sklearn pipeline, leakage trap evaluation, value-aware metrics, report serialization).
Adds a CLI validator (scripts/validate_lead_scoring_dataset.py) and a dedicated CI job to run it when lead_scoring_intro_v5.csv exists.
Updates v5 dataset generation (scripts/build_v5_snapshot.py) to use a day-10 snapshot and strengthen the leakage trap signal; adds a new test suite for the validator.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 10 comments.

Show a summary per file

File	Description
`leadforge/validation/lead_scoring.py`	New canonical validation + baseline evaluation module and report generation.
`scripts/validate_lead_scoring_dataset.py`	New CLI entrypoint for running the canonical validation and emitting JSON/release snippets.
`tests/validation/test_lead_scoring.py`	New tests covering schema checks, determinism, baseline metrics, trap detection, value metrics, and report serialization.
`scripts/build_v5_snapshot.py`	Adjusts snapshot window (14→10) and boosts leakage trap signal during dataset build.
`.github/workflows/ci.yml`	Adds `validate-dataset` job to validate `lead_scoring_intro_v5.csv` when present.
`.agent-plan.md`	Updates project plan/status documentation to reflect canonical validation module and regenerated v5 metrics.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…verage - Short-circuit validate_dataset() when target has NaN or non-binary values instead of crashing in _evaluate_split() (COPILOT-1) - Fix Precision@K comment to accurately describe stable-sort behavior (COPILOT-2) - Install .[dev,scripts] in validate-dataset CI job (COPILOT-3) - Add explicit target_exists passing check in _check_schema() (COPILOT-4) - Use actual test_size in emit_release_snippet() via new report field (COPILOT-5) - Use actual n_rows in emit_release_snippet() missingness counts (COPILOT-6) - Fix trap evaluation to exclude all leakage cols in baseline, isolating each trap's effect correctly (COPILOT-7) - Fix Poisson comment in build_v5_snapshot.py (COPILOT-8) - Fix CI skip message to say v5 specifically (COPILOT-9) - Add scikit-learn to dev deps + mypy ignore override (COPILOT-10) - Expand test suite to 49 tests, achieving 100% patch coverage Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…mpat In pandas 3.0+, string columns use StringDtype instead of object dtype. The dtype == "object" checks in _get_feature_cols and _check_group_determinism missed these columns, causing string features to be sent to the numeric imputer (median strategy) which raised ValueError. Switch to pd.api.types.is_numeric_dtype which correctly handles all dtype backends including StringDtype, ArrowDtype, and CategoricalDtype. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…dict - Add target_both_classes check in _check_schema() and short-circuit before stratified split when only one class is present (COPILOT-1) - Fill NaN in expected_acv with 0 before value-aware ranking to prevent NaN propagation into metrics and JSON output (COPILOT-2) - Include deltas_pr_auc and mean_delta_pr_auc in to_dict() trap metrics serialization (COPILOT-3) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions · 2026-04-30T16:25:56Z

pr-agent-context report:

This run includes unresolved review comments on PR #27 in repository https://github.com/leadforge-dev/leadforge

For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.

After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, and push all of these changes in a single commit.

# Copilot Comments

## COPILOT-1
Location: leadforge/validation/lead_scoring.py:870
URL: https://github.com/leadforge-dev/leadforge/pull/27#discussion_r3168247909
Root author: copilot-pull-request-reviewer

Comment:
    `validate_dataset()` always proceeds to model evaluation after schema checks, even if the target has only one class. In that case `train_test_split(..., stratify=y)` inside `_evaluate_split()` will raise, causing the validator/CLI to crash instead of returning a failed report. Add an explicit check (e.g., require both {0,1} present and minimum counts per class) and short-circuit further evaluation with a failing `CheckResult` when the target distribution is unusable for stratified splitting.

## COPILOT-2
Location: leadforge/validation/lead_scoring.py
URL: https://github.com/leadforge-dev/leadforge/pull/27#discussion_r3168247957
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    `_evaluate_value_aware()` uses `expected_acv` values directly when computing `expected_value` and captured ACV totals. If `expected_acv` contains NaNs (which is allowed by `_check_missingness` up to the threshold), the resulting sums can become NaN and `to_dict()` will emit NaN into JSON output. Consider coercing `expected_acv` to numeric and filling NaNs (e.g., 0) or dropping those rows consistently before ranking/aggregation to keep metrics and JSON serialization well-defined.

## COPILOT-3
Location: leadforge/validation/lead_scoring.py:287
URL: https://github.com/leadforge-dev/leadforge/pull/27#discussion_r3168248006
Root author: copilot-pull-request-reviewer

Comment:
    `TrapMetrics` tracks both `deltas_auc` and `deltas_pr_auc`, but `ValidationReport.to_dict()` only serializes AUC deltas and omits PR-AUC deltas entirely. This makes the JSON report incomplete relative to what the validator computes. Either include `deltas_pr_auc` (and optionally mean/min/max PR-AUC deltas) in the serialized output, or remove PR-AUC tracking if it’s intentionally unused.

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: commit pushed
Workflow run: 25176865731 attempt 1
Comment timestamp: 2026-04-30T16:25:06.848585+00:00
PR head commit: c7941df8cba700a24a00452876ea86c73c4dcb57

Copilot AI review requested due to automatic review settings April 30, 2026 06:59

shaypal5 added type: feature New capability layer: validation validation/ invariants and checks labels Apr 30, 2026

Copilot started reviewing on behalf of shaypal5 April 30, 2026 06:59 View session