Skip to content

feat: canonical lead scoring validation module#27

Merged
shaypal5 merged 4 commits into
mainfrom
feat/lead-scoring-validation-module
Apr 30, 2026
Merged

feat: canonical lead scoring validation module#27
shaypal5 merged 4 commits into
mainfrom
feat/lead-scoring-validation-module

Conversation

@shaypal5

Copy link
Copy Markdown
Contributor

Summary

  • Add leadforge/validation/lead_scoring.py — single source of truth validation module with canonical sklearn pipeline (ColumnTransformer + OneHotEncoder + StandardScaler + LogisticRegression), proper train-only preprocessing, multi-seed leakage trap evaluation, value-aware metrics, and group determinism checks
  • Add scripts/validate_lead_scoring_dataset.py — CLI entrypoint (--csv, --out-json, --emit-release-snippet, --enforce-1000)
  • Add tests/validation/test_lead_scoring.py — 12 tests covering schema, group determinism, baseline metrics, trap detection, value metrics, and report serialization
  • Add CI job validate-dataset to .github/workflows/ci.yml — validates lead_scoring_intro_v5.csv if present in repo root
  • Update scripts/build_v5_snapshot.py — snapshot day 14→10, add boost_leakage_trap() step (Poisson(1) target-correlated noise) for robust trap detectability under the canonical pipeline

Motivation

Numbers claimed in RELEASE_v5.md did not reproduce under a straightforward sklearn baseline (notably leakage trap min delta and Precision@K). The old ad-hoc scripts used LabelEncoder + manual median imputation which gives different results than a proper sklearn ColumnTransformer pipeline. This PR establishes a canonical, reproducible validation pipeline and regenerates the v5 dataset so all metrics pass.

Validation results (on regenerated v5 CSV)

Check Result
Baseline AUC (hold-out) 0.648
PR-AUC 0.403
Leakage trap mean delta (10 seeds) 0.081 (≥0.03)
Leakage trap min delta 0.035 (≥0.015)
All checks PASS (exit code 0)

Test plan

  • pytest tests/validation/test_lead_scoring.py — 12/12 pass
  • ruff check + ruff format --check — clean
  • python scripts/validate_lead_scoring_dataset.py --csv <v5.csv> — exits 0
  • CI passes on this PR

🤖 Generated with Claude Code

Add single source of truth validation pipeline using sklearn
ColumnTransformer (OneHotEncoder + StandardScaler + LR) with proper
train-only preprocessing. Regenerate v5 dataset with snapshot day 10
and Poisson(1) leakage trap boost for robust multi-seed detectability.

- leadforge/validation/lead_scoring.py: validation module with
  schema checks, group determinism, baseline AUC, value-aware metrics,
  multi-seed leakage trap evaluation
- scripts/validate_lead_scoring_dataset.py: CLI entrypoint
- tests/validation/test_lead_scoring.py: 12 tests
- CI job for dataset validation in ci.yml
- build_v5_snapshot.py: snapshot day 14→10, added trap boost step

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 30, 2026 06:59
@shaypal5 shaypal5 added type: feature New capability layer: validation validation/ invariants and checks labels Apr 30, 2026
@github-actions

This comment has been minimized.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a canonical, reproducible validation pipeline for the lead scoring intro dataset (including baseline ML evaluation, leakage trap checks, and value-aware metrics), plus tooling/tests/CI wiring to validate lead_scoring_intro_v5.csv consistently.

Changes:

  • Introduces leadforge/validation/lead_scoring.py as the single source of truth for dataset checks + baseline evaluation (sklearn pipeline, leakage trap evaluation, value-aware metrics, report serialization).
  • Adds a CLI validator (scripts/validate_lead_scoring_dataset.py) and a dedicated CI job to run it when lead_scoring_intro_v5.csv exists.
  • Updates v5 dataset generation (scripts/build_v5_snapshot.py) to use a day-10 snapshot and strengthen the leakage trap signal; adds a new test suite for the validator.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
leadforge/validation/lead_scoring.py New canonical validation + baseline evaluation module and report generation.
scripts/validate_lead_scoring_dataset.py New CLI entrypoint for running the canonical validation and emitting JSON/release snippets.
tests/validation/test_lead_scoring.py New tests covering schema checks, determinism, baseline metrics, trap detection, value metrics, and report serialization.
scripts/build_v5_snapshot.py Adjusts snapshot window (14→10) and boosts leakage trap signal during dataset build.
.github/workflows/ci.yml Adds validate-dataset job to validate lead_scoring_intro_v5.csv when present.
.agent-plan.md Updates project plan/status documentation to reflect canonical validation module and regenerated v5 metrics.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread leadforge/validation/lead_scoring.py
Comment thread leadforge/validation/lead_scoring.py Outdated
Comment thread .github/workflows/ci.yml Outdated
Comment thread leadforge/validation/lead_scoring.py
Comment thread leadforge/validation/lead_scoring.py Outdated
Comment thread leadforge/validation/lead_scoring.py Outdated
Comment thread leadforge/validation/lead_scoring.py
Comment thread scripts/build_v5_snapshot.py Outdated
Comment thread .github/workflows/ci.yml Outdated
Comment thread leadforge/validation/lead_scoring.py
@github-actions

This comment has been minimized.

…verage

- Short-circuit validate_dataset() when target has NaN or non-binary
  values instead of crashing in _evaluate_split() (COPILOT-1)
- Fix Precision@K comment to accurately describe stable-sort behavior (COPILOT-2)
- Install .[dev,scripts] in validate-dataset CI job (COPILOT-3)
- Add explicit target_exists passing check in _check_schema() (COPILOT-4)
- Use actual test_size in emit_release_snippet() via new report field (COPILOT-5)
- Use actual n_rows in emit_release_snippet() missingness counts (COPILOT-6)
- Fix trap evaluation to exclude all leakage cols in baseline, isolating
  each trap's effect correctly (COPILOT-7)
- Fix Poisson comment in build_v5_snapshot.py (COPILOT-8)
- Fix CI skip message to say v5 specifically (COPILOT-9)
- Add scikit-learn to dev deps + mypy ignore override (COPILOT-10)
- Expand test suite to 49 tests, achieving 100% patch coverage

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions

This comment has been minimized.

@shaypal5 shaypal5 self-assigned this Apr 30, 2026
@github-actions

This comment has been minimized.

…mpat

In pandas 3.0+, string columns use StringDtype instead of object dtype.
The dtype == "object" checks in _get_feature_cols and _check_group_determinism
missed these columns, causing string features to be sent to the numeric
imputer (median strategy) which raised ValueError.

Switch to pd.api.types.is_numeric_dtype which correctly handles all
dtype backends including StringDtype, ArrowDtype, and CategoricalDtype.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 30, 2026 13:22
@github-actions

This comment has been minimized.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread leadforge/validation/lead_scoring.py
Comment thread leadforge/validation/lead_scoring.py Outdated
Comment thread leadforge/validation/lead_scoring.py
@github-actions

This comment has been minimized.

…dict

- Add target_both_classes check in _check_schema() and short-circuit
  before stratified split when only one class is present (COPILOT-1)
- Fill NaN in expected_acv with 0 before value-aware ranking to prevent
  NaN propagation into metrics and JSON output (COPILOT-2)
- Include deltas_pr_auc and mean_delta_pr_auc in to_dict() trap metrics
  serialization (COPILOT-3)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown

pr-agent-context report:

This run includes unresolved review comments on PR #27 in repository https://github.com/leadforge-dev/leadforge

For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.

After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, and push all of these changes in a single commit.

# Copilot Comments

## COPILOT-1
Location: leadforge/validation/lead_scoring.py:870
URL: https://github.com/leadforge-dev/leadforge/pull/27#discussion_r3168247909
Root author: copilot-pull-request-reviewer

Comment:
    `validate_dataset()` always proceeds to model evaluation after schema checks, even if the target has only one class. In that case `train_test_split(..., stratify=y)` inside `_evaluate_split()` will raise, causing the validator/CLI to crash instead of returning a failed report. Add an explicit check (e.g., require both {0,1} present and minimum counts per class) and short-circuit further evaluation with a failing `CheckResult` when the target distribution is unusable for stratified splitting.

## COPILOT-2
Location: leadforge/validation/lead_scoring.py
URL: https://github.com/leadforge-dev/leadforge/pull/27#discussion_r3168247957
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    `_evaluate_value_aware()` uses `expected_acv` values directly when computing `expected_value` and captured ACV totals. If `expected_acv` contains NaNs (which is allowed by `_check_missingness` up to the threshold), the resulting sums can become NaN and `to_dict()` will emit NaN into JSON output. Consider coercing `expected_acv` to numeric and filling NaNs (e.g., 0) or dropping those rows consistently before ranking/aggregation to keep metrics and JSON serialization well-defined.

## COPILOT-3
Location: leadforge/validation/lead_scoring.py:287
URL: https://github.com/leadforge-dev/leadforge/pull/27#discussion_r3168248006
Root author: copilot-pull-request-reviewer

Comment:
    `TrapMetrics` tracks both `deltas_auc` and `deltas_pr_auc`, but `ValidationReport.to_dict()` only serializes AUC deltas and omits PR-AUC deltas entirely. This makes the JSON report incomplete relative to what the validator computes. Either include `deltas_pr_auc` (and optionally mean/min/max PR-AUC deltas) in the serialized output, or remove PR-AUC tracking if it’s intentionally unused.

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: commit pushed
Workflow run: 25176865731 attempt 1
Comment timestamp: 2026-04-30T16:25:06.848585+00:00
PR head commit: c7941df8cba700a24a00452876ea86c73c4dcb57

@shaypal5 shaypal5 merged commit d2d0294 into main Apr 30, 2026
6 checks passed
@shaypal5 shaypal5 deleted the feat/lead-scoring-validation-module branch April 30, 2026 16:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

layer: validation validation/ invariants and checks type: feature New capability

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants