feat: canonical lead scoring validation module #27
Merged
Add single source of truth validation pipeline using sklearn ColumnTransformer (OneHotEncoder + StandardScaler + LR) with proper train-only preprocessing. Regenerate v5 dataset with snapshot day 10 and Poisson(1) leakage trap boost for robust multi-seed detectability.

- leadforge/validation/lead_scoring.py: validation module with schema checks, group determinism, baseline AUC, value-aware metrics, multi-seed leakage trap evaluation
- scripts/validate_lead_scoring_dataset.py: CLI entrypoint
- tests/validation/test_lead_scoring.py: 12 tests
- CI job for dataset validation in ci.yml
- build_v5_snapshot.py: snapshot day 14→10, added trap boost step

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
Adds a canonical, reproducible validation pipeline for the lead scoring intro dataset (including baseline ML evaluation, leakage trap checks, and value-aware metrics), plus tooling/tests/CI wiring to validate lead_scoring_intro_v5.csv consistently.
Changes:
- Introduces `leadforge/validation/lead_scoring.py` as the single source of truth for dataset checks + baseline evaluation (sklearn pipeline, leakage trap evaluation, value-aware metrics, report serialization).
- Adds a CLI validator (`scripts/validate_lead_scoring_dataset.py`) and a dedicated CI job to run it when `lead_scoring_intro_v5.csv` exists.
- Updates v5 dataset generation (`scripts/build_v5_snapshot.py`) to use a day-10 snapshot and strengthen the leakage trap signal; adds a new test suite for the validator.
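For readers unfamiliar with the pattern, the baseline described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the function name `build_baseline`, the toy columns, and the hyperparameters are assumptions, and the median imputer is taken from the numeric-imputer detail mentioned elsewhere in this PR. The point it shows is that every preprocessing step lives inside the pipeline, so statistics are learned from the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_baseline(numeric_cols, categorical_cols):
    """Hypothetical helper: ColumnTransformer + LogisticRegression baseline."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),  # median learned on train only
        ("scale", StandardScaler()),
    ])
    preprocess = ColumnTransformer([
        ("num", numeric, numeric_cols),
        # Categories unseen during fit are encoded as all zeros at predict time.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ])
    return Pipeline([
        ("preprocess", preprocess),
        ("clf", LogisticRegression(max_iter=1000)),
    ])


# Synthetic toy data (assumption: this is NOT the v5 dataset).
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "visits": rng.poisson(3, n).astype(float),
    "channel": rng.choice(["ads", "organic", "referral"], n),
})
target = (toy["visits"] + rng.normal(0, 1.5, n) > 3).astype(int)
toy.loc[::25, "visits"] = np.nan  # inject some missing values

X_train, X_test, y_train, y_test = train_test_split(
    toy, target, test_size=0.25, stratify=target, random_state=0
)
model = build_baseline(["visits"], ["channel"])
model.fit(X_train, y_train)  # imputer, scaler, and encoder see X_train only
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
```

Fitting the whole `Pipeline` on `X_train` is what keeps test rows out of the imputation and scaling statistics; preprocessing the full frame before splitting would not give that guarantee.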
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| `leadforge/validation/lead_scoring.py` | New canonical validation + baseline evaluation module and report generation. |
| `scripts/validate_lead_scoring_dataset.py` | New CLI entrypoint for running the canonical validation and emitting JSON/release snippets. |
| `tests/validation/test_lead_scoring.py` | New tests covering schema checks, determinism, baseline metrics, trap detection, value metrics, and report serialization. |
| `scripts/build_v5_snapshot.py` | Adjusts snapshot window (14→10) and boosts leakage trap signal during dataset build. |
| `.github/workflows/ci.yml` | Adds `validate-dataset` job to validate `lead_scoring_intro_v5.csv` when present. |
| `.agent-plan.md` | Updates project plan/status documentation to reflect canonical validation module and regenerated v5 metrics. |
…verage

- Short-circuit validate_dataset() when target has NaN or non-binary values instead of crashing in _evaluate_split() (COPILOT-1)
- Fix Precision@K comment to accurately describe stable-sort behavior (COPILOT-2)
- Install .[dev,scripts] in validate-dataset CI job (COPILOT-3)
- Add explicit target_exists passing check in _check_schema() (COPILOT-4)
- Use actual test_size in emit_release_snippet() via new report field (COPILOT-5)
- Use actual n_rows in emit_release_snippet() missingness counts (COPILOT-6)
- Fix trap evaluation to exclude all leakage cols in baseline, isolating each trap's effect correctly (COPILOT-7)
- Fix Poisson comment in build_v5_snapshot.py (COPILOT-8)
- Fix CI skip message to say v5 specifically (COPILOT-9)
- Add scikit-learn to dev deps + mypy ignore override (COPILOT-10)
- Expand test suite to 49 tests, achieving 100% patch coverage

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
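The COPILOT-7 fix above (a baseline that excludes all leakage columns, with one trap added back at a time) might look roughly like the sketch below. The helper names `holdout_auc` and `trap_auc_deltas`, the synthetic columns, and the simplified numeric-only model are all assumptions made for illustration; the module's real implementation is not shown in this PR page.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def holdout_auc(df, feature_cols, target, seed):
    """Fit on the train split only and return ROC AUC on the held-out split."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        df[feature_cols], df[target],
        test_size=0.25, stratify=df[target], random_state=seed,
    )
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_tr, y_tr)
    return roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])


def trap_auc_deltas(df, target, leakage_cols, seeds):
    """Per-trap AUC deltas against a baseline with every leakage column removed."""
    clean_cols = [c for c in df.columns if c != target and c not in leakage_cols]
    deltas = {}
    for trap in leakage_cols:
        # Only this one trap is added back, so the delta isolates its effect.
        deltas[trap] = [
            holdout_auc(df, clean_cols + [trap], target, seed)
            - holdout_auc(df, clean_cols, target, seed)
            for seed in seeds
        ]
    return deltas


# Synthetic data (assumption: not the v5 dataset, and not its trap construction).
rng = np.random.default_rng(0)
n = 800
y = rng.binomial(1, 0.3, n)
toy = pd.DataFrame({
    "honest_signal": 0.5 * y + rng.normal(size=n),
    "trap_a": 2.0 * y + rng.normal(size=n),
    "trap_b": 1.5 * y + rng.normal(size=n),
    "converted": y,
})
deltas = trap_auc_deltas(toy, "converted", ["trap_a", "trap_b"], seeds=[0, 1, 2])
min_delta = {trap: min(values) for trap, values in deltas.items()}
```

If the baseline still contained `trap_b` while `trap_a` was being evaluated, the baseline AUC would already be inflated and the measured delta for `trap_a` would shrink, which is the distortion the fix is meant to avoid.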
…mpat

In pandas 3.0+, string columns use StringDtype instead of object dtype. The dtype == "object" checks in _get_feature_cols and _check_group_determinism missed these columns, causing string features to be sent to the numeric imputer (median strategy), which raised ValueError.

Switch to pd.api.types.is_numeric_dtype, which correctly handles all dtype backends including StringDtype, ArrowDtype, and CategoricalDtype.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
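To illustrate the dtype fix described in this commit, here is a small sketch of splitting columns with `pd.api.types.is_numeric_dtype` rather than comparing against `"object"`. The function `split_feature_cols` and the toy frame are hypothetical; `_get_feature_cols` itself is not reproduced here.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def split_feature_cols(df, target):
    """Hypothetical helper: partition feature columns into numeric and categorical."""
    numeric, categorical = [], []
    for col in df.columns:
        if col == target:
            continue
        # Note: is_numeric_dtype also returns True for bool columns.
        if is_numeric_dtype(df[col]):
            numeric.append(col)
        else:
            categorical.append(col)
    return numeric, categorical


toy = pd.DataFrame({
    "visits": [1.0, 2.0, None],
    # Explicit StringDtype, which a `dtype == "object"` check does not match.
    "channel": pd.array(["ads", "organic", None], dtype="string"),
    "industry": ["saas", "retail", "saas"],
    "segment": pd.Categorical(["smb", "ent", "smb"]),
    "converted": [0, 1, 0],
})
numeric_cols, categorical_cols = split_feature_cols(toy, "converted")
```

Testing for "is numeric" and treating everything else as categorical is more robust than enumerating the non-numeric dtypes, since new string backends fall into the categorical branch by default.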
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
…dict

- Add target_both_classes check in _check_schema() and short-circuit before stratified split when only one class is present (COPILOT-1)
- Fill NaN in expected_acv with 0 before value-aware ranking to prevent NaN propagation into metrics and JSON output (COPILOT-2)
- Include deltas_pr_auc and mean_delta_pr_auc in to_dict() trap metrics serialization (COPILOT-3)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pr-agent-context report: This run includes unresolved review comments on PR #27 in repository https://github.com/leadforge-dev/leadforge
For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.
After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, and push all of these changes in a single commit.
# Copilot Comments
## COPILOT-1
Location: leadforge/validation/lead_scoring.py:870
URL: https://github.com/leadforge-dev/leadforge/pull/27#discussion_r3168247909
Root author: copilot-pull-request-reviewer
Comment:
`validate_dataset()` always proceeds to model evaluation after schema checks, even if the target has only one class. In that case `train_test_split(..., stratify=y)` inside `_evaluate_split()` will raise, causing the validator/CLI to crash instead of returning a failed report. Add an explicit check (e.g., require both {0,1} present and minimum counts per class) and short-circuit further evaluation with a failing `CheckResult` when the target distribution is unusable for stratified splitting.
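A guard of the kind this comment asks for could be sketched as below. The `CheckResult` fields, the function `check_target_usable`, and the `min_per_class=2` default are assumptions for illustration (stratified `train_test_split` needs at least two members per class); only the check name `target_both_classes` comes from the follow-up commit.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class CheckResult:
    """Hypothetical stand-in for the module's CheckResult."""
    name: str
    passed: bool
    detail: str = ""


def check_target_usable(y, min_per_class=2):
    """Return a failing CheckResult instead of letting a stratified split raise."""
    name = "target_both_classes"
    if y.isna().any():
        return CheckResult(name, False, "target contains NaN")
    counts = y.value_counts()
    if set(counts.index) != {0, 1}:
        return CheckResult(name, False, f"expected classes 0 and 1, found {counts.index.tolist()}")
    if counts.min() < min_per_class:
        return CheckResult(name, False, f"smallest class has {int(counts.min())} rows")
    return CheckResult(name, True)


result = check_target_usable(pd.Series([1, 1, 1, 1]))
if not result.passed:
    # A caller would short-circuit here and report the failure
    # rather than proceeding to model evaluation.
    print(f"{result.name} failed: {result.detail}")
```

Returning a result object keeps the validator's contract intact: an unusable target becomes a failed report entry, not an uncaught exception in the CLI.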
## COPILOT-2
Location: leadforge/validation/lead_scoring.py
URL: https://github.com/leadforge-dev/leadforge/pull/27#discussion_r3168247957
Status: outdated
Root author: copilot-pull-request-reviewer
Comment:
`_evaluate_value_aware()` uses `expected_acv` values directly when computing `expected_value` and captured ACV totals. If `expected_acv` contains NaNs (which is allowed by `_check_missingness` up to the threshold), the resulting sums can become NaN and `to_dict()` will emit NaN into JSON output. Consider coercing `expected_acv` to numeric and filling NaNs (e.g., 0) or dropping those rows consistently before ranking/aggregation to keep metrics and JSON serialization well-defined.
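The NaN concern can be illustrated with a short sketch. The metric definition here (rank by `proba * expected_acv`, then sum ACV over converted leads in the top k) and the function name `captured_acv_at_k` are assumptions, since `_evaluate_value_aware()` is not shown on this page; the part taken from the comment is coercing `expected_acv` to numeric and filling NaN with 0 before ranking.

```python
import numpy as np
import pandas as pd


def captured_acv_at_k(proba, expected_acv, converted, k):
    """Hypothetical value-aware metric: ACV captured from converted leads in the top k."""
    # Coerce to numeric and fill missing ACV with 0 so sums stay finite.
    acv = pd.to_numeric(pd.Series(expected_acv), errors="coerce").fillna(0.0).to_numpy(dtype=float)
    expected_value = np.asarray(proba, dtype=float) * acv
    # Stable sort keeps the original row order among ties.
    top_k = np.argsort(-expected_value, kind="stable")[:k]
    won = np.asarray(converted, dtype=float)[top_k]
    return float((acv[top_k] * won).sum())


proba = [0.9, 0.8, 0.2, 0.6]
expected_acv = [100.0, np.nan, 500.0, 50.0]
converted = [1, 1, 0, 1]

captured = captured_acv_at_k(proba, expected_acv, converted, k=4)

# Without the fill, the same aggregation over all four rows is NaN.
unfilled = float((np.asarray(expected_acv) * np.asarray(converted)).sum())
```

A NaN total matters beyond the metric itself: Python's `json.dumps` writes it as the bare token `NaN`, which strict JSON parsers reject.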
## COPILOT-3
Location: leadforge/validation/lead_scoring.py:287
URL: https://github.com/leadforge-dev/leadforge/pull/27#discussion_r3168248006
Root author: copilot-pull-request-reviewer
Comment:
`TrapMetrics` tracks both `deltas_auc` and `deltas_pr_auc`, but `ValidationReport.to_dict()` only serializes AUC deltas and omits PR-AUC deltas entirely. This makes the JSON report incomplete relative to what the validator computes. Either include `deltas_pr_auc` (and optionally mean/min/max PR-AUC deltas) in the serialized output, or remove PR-AUC tracking if it’s intentionally unused.
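One way to serialize both delta families is sketched below. The dataclass shape is a guess: `deltas_auc`, `deltas_pr_auc`, and the output keys `deltas_pr_auc` and `mean_delta_pr_auc` appear in this PR, while the `column` field, the `mean_delta_auc` key, and the per-object `to_dict()` placement are assumptions.

```python
import json
from dataclasses import dataclass, field
from statistics import fmean


@dataclass
class TrapMetrics:
    """Hypothetical shape; the real class may carry more fields."""
    column: str
    deltas_auc: list = field(default_factory=list)
    deltas_pr_auc: list = field(default_factory=list)

    def to_dict(self):
        def mean_or_none(values):
            # An empty list has no mean; None serializes to JSON null.
            return fmean(values) if values else None

        return {
            "column": self.column,
            "deltas_auc": list(self.deltas_auc),
            "mean_delta_auc": mean_or_none(self.deltas_auc),
            "deltas_pr_auc": list(self.deltas_pr_auc),
            "mean_delta_pr_auc": mean_or_none(self.deltas_pr_auc),
        }


metrics = TrapMetrics("trap_a", deltas_auc=[0.5, 0.25], deltas_pr_auc=[0.25, 0.75])
payload = json.dumps(metrics.to_dict())
```

Emitting the PR-AUC deltas alongside the AUC deltas keeps the JSON report aligned with what the validator computes, so downstream consumers do not need to rerun the evaluation to see them.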
Summary
- `leadforge/validation/lead_scoring.py`: single source of truth validation module with canonical sklearn pipeline (`ColumnTransformer` + `OneHotEncoder` + `StandardScaler` + `LogisticRegression`), proper train-only preprocessing, multi-seed leakage trap evaluation, value-aware metrics, and group determinism checks
- `scripts/validate_lead_scoring_dataset.py`: CLI entrypoint (`--csv`, `--out-json`, `--emit-release-snippet`, `--enforce-1000`)
- `tests/validation/test_lead_scoring.py`: 12 tests covering schema, group determinism, baseline metrics, trap detection, value metrics, and report serialization
- `validate-dataset` job in `.github/workflows/ci.yml`: validates `lead_scoring_intro_v5.csv` if present in repo root
- `scripts/build_v5_snapshot.py`: snapshot day 14→10, add `boost_leakage_trap()` step (Poisson(1) target-correlated noise) for robust trap detectability under the canonical pipeline

Motivation
Numbers claimed in RELEASE_v5.md did not reproduce under a straightforward sklearn baseline (notably leakage trap min delta and Precision@K). The old ad-hoc scripts used LabelEncoder + manual median imputation, which gives different results than a proper sklearn `ColumnTransformer` pipeline. This PR establishes a canonical, reproducible validation pipeline and regenerates the v5 dataset so all metrics pass.

Validation results (on regenerated v5 CSV)

Test plan
- `pytest tests/validation/test_lead_scoring.py`: 12/12 pass
- `ruff check` + `ruff format --check`: clean
- `python scripts/validate_lead_scoring_dataset.py --csv <v5.csv>`: exits 0

🤖 Generated with Claude Code