feat: mid-project lead scoring — generation framework by shaypal5 · Pull Request #79 · leadforge-dev/leadforge

shaypal5 · 2026-05-13T08:15:30Z

Summary

Adds the reusable generation and validation framework for the mid-project lead scoring dataset. No dataset files are committed here (leadforge is a public repo). The generated CSV and release docs live in leadforge-datasets-private.

Files added

File	Purpose
`leadforge/pipelines/build_midproject.py`	Pipeline module (seed=100, SUBSAMPLE_N=1200)
`scripts/build_midproject_lead_scoring.py`	Build CLI — `python scripts/build_midproject_lead_scoring.py OUTPUT_DIR`
`scripts/validate_midproject_lead_scoring.py`	Validation script with all spec checks
`scripts/quick_baseline_eval_midproject.py`	Multi-model baseline eval script
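
The seed and subsample size listed for the pipeline module can be illustrated with a small, dependency-free sketch. How `build_midproject.py` actually consumes its seed is not shown in this PR, so `subsample_indices`, `POPULATION`, and the use of `random.Random` below are assumptions for illustration only; only the values 100, 1200, 42, and 1000 come from the description.

```python
import random

SEED = 100          # from the PR description
SUBSAMPLE_N = 1200  # from the PR description
POPULATION = 5000   # hypothetical simulator output size, not from the PR


def subsample_indices(population_size: int, n: int, seed: int) -> list[int]:
    """Return n distinct row indices drawn reproducibly from range(population_size)."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    return sorted(rng.sample(range(population_size), n))


mid = subsample_indices(POPULATION, SUBSAMPLE_N, SEED)
again = subsample_indices(POPULATION, SUBSAMPLE_N, SEED)
v7_like = subsample_indices(POPULATION, 1000, 42)  # v7's seed and row count

print(mid == again)    # same seed reproduces the same selection
print(mid == v7_like)  # a different seed and size gives a different selection
```

Using a local `random.Random(seed)` instance rather than the module-level functions keeps the draw reproducible even if other code touches the global generator.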

Dataset location

Dataset artifacts (CSV, docs, validation report) live in:
leadforge-datasets-private/lead_scoring_midproject/

Design

Same schema and missingness patterns as lead_scoring_intro_v7.csv
Different seed (100 vs. v7's 42) → different rows, so outputs cannot be copied from the class notebook
1,200 rows (vs v7's 1,000), 30% conversion rate, snapshot day 20
No __leakage__ columns
Validated LR AUC=0.703, P@25=0.520 (Lift 1.73x), EV uplift +35% at K=25
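
The headline numbers are mutually consistent: with a 30% base conversion rate, P@25=0.520 gives a lift of 0.520 / 0.30 ≈ 1.73. A minimal sketch of that arithmetic follows, assuming the usual definitions of precision-at-K and lift; the function names and the toy labels and scores are hypothetical and are not taken from the validation script.

```python
def precision_at_k(y_true, scores, k):
    """Fraction of positives among the k highest-scored items."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k


def lift_at_k(y_true, scores, k):
    """Precision-at-k divided by the overall positive rate."""
    base_rate = sum(y_true) / len(y_true)
    return precision_at_k(y_true, scores, k) / base_rate


# Toy data (hypothetical): 10 leads, 3 conversions, scores already descending.
y_true = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

toy_p_at_2 = precision_at_k(y_true, scores, 2)
toy_lift = lift_at_k(y_true, scores, 2)

# Reported figures from the PR description: P@25 = 0.520, base rate = 30%.
reported_lift = round(0.520 / 0.30, 2)
print(toy_p_at_2, toy_lift, reported_lift)
```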

Regenerate

python scripts/build_midproject_lead_scoring.py /path/to/output/
python scripts/validate_midproject_lead_scoring.py /path/to/output/lead_scoring_midproject.csv

Test plan

No dataset files in branch history (force-pushed clean branch from main)
Framework scripts are importable and lint-clean

🤖 Generated with Claude Code

Copilot

Pull request overview

Adds a new “mid-project” variant of the lead-scoring teaching dataset (student-safe single CSV) plus the build/validate/eval tooling to reproduce and sanity-check it, aligned to the existing v7 schema/narrative.

Changes:

Introduces lead_scoring_midproject.csv (1,200 rows) with accompanying BACKGROUND/RELEASE documentation and a saved validation report JSON.
Adds a midproject pipeline module (leadforge/pipelines/build_midproject.py) and build/validate/baseline-eval scripts under scripts/.
Updates .agent-plan.md with the completed work item entry for this dataset.

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
`scripts/validate_midproject_lead_scoring.py`	New validator for the midproject CSV (schema, missingness, baseline metrics, cohort/value-aware checks).
`scripts/quick_baseline_eval_midproject.py`	Quick baseline runner to reproduce headline metrics and feature importances.
`scripts/build_midproject_lead_scoring.py`	CLI to generate the midproject dataset CSV from the simulator/pipeline.
`leadforge/pipelines/build_midproject.py`	Midproject pipeline configuration (seed=100, subsample=1200) reusing common v6/v7 pipeline steps.
`lead_scoring_midproject/validation_midproject_report.json`	Persisted output from running the validator on the generated CSV.
`lead_scoring_midproject/lead_scoring_midproject.csv`	The generated student-safe dataset artifact.
`lead_scoring_midproject/lead_scoring_midproject_RELEASE.md`	Release documentation with metrics, teaching affordances, and provenance.
`lead_scoring_midproject/lead_scoring_midproject_BACKGROUND.md`	Student-facing narrative/background and feature descriptions.
`.agent-plan.md`	Tracks the completion of the midproject dataset work item.

Adds reusable pipeline + scripts for generating and validating the mid-project lead scoring dataset. Dataset artifacts (CSV, docs, validation report) live in leadforge-datasets-private.

- leadforge/pipelines/build_midproject.py: pipeline module (seed=100, SUBSAMPLE_N=1200, same schema/missingness as v7)
- scripts/build_midproject_lead_scoring.py: build CLI
- scripts/validate_midproject_lead_scoring.py: validation script
- scripts/quick_baseline_eval_midproject.py: baseline eval script

Regenerate the dataset:
python scripts/build_midproject_lead_scoring.py OUTPUT_DIR

Validate:
python scripts/validate_midproject_lead_scoring.py OUTPUT_DIR/lead_scoring_midproject.csv

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Remove EXPECTED_ROWS and ROW_TOLERANCE (unused dead constants; row-count policy already expressed correctly inline in check_basic)
- Guard df[TARGET] in validate() so a missing target column records an error cleanly instead of crashing with KeyError

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
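
The target-column guard described in this commit might look roughly like the sketch below. It is not the script's actual code: the real `validate()` apparently works on a DataFrame (`df[TARGET]`), whereas this version uses the standard-library `csv` module to stay dependency-free, and the column name `converted`, the report layout, and the error wording are all hypothetical.

```python
import csv
import io

TARGET = "converted"  # hypothetical target column name, not confirmed by the PR


def validate(csv_text: str) -> dict:
    """Check a CSV for the target column, recording an error instead of raising."""
    report = {"errors": [], "conversion_rate": None}
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = reader.fieldnames or []  # None for an empty file

    # Guard: record the problem and stop early rather than hit a KeyError below.
    if TARGET not in columns:
        report["errors"].append(f"missing target column: {TARGET!r}")
        return report

    if rows:
        report["conversion_rate"] = sum(int(r[TARGET]) for r in rows) / len(rows)
    return report


bad = validate("lead_id,score\n1,0.4\n")
good = validate("lead_id,converted\n1,1\n2,0\n")
print(bad["errors"], good["conversion_rate"])
```

Returning early with a populated `errors` list lets the caller write a complete report for a malformed file instead of losing it to a traceback.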

github-actions · 2026-05-13T11:21:20Z

pr-agent-context report:

No unresolved review comments, failing checks, or actionable patch coverage gaps were found on PR #79 in repository https://github.com/leadforge-dev/leadforge. Treat this PR as all clear unless new signals appear.

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: commit pushed
Workflow run: 25795848924 attempt 1
Comment timestamp: 2026-05-13T11:20:31.681850+00:00
PR head commit: cea57addb5cdf5127efd079e0f05aa79b0419453

Copilot AI review requested due to automatic review settings May 13, 2026 08:15

shaypal5 added this to the v1.0.0 — Polished OSS release milestone May 13, 2026

shaypal5 added the type: feature (New capability) and layer: recipes (recipes/ recipe assets and registry) labels May 13, 2026

Copilot started reviewing on behalf of shaypal5 May 13, 2026 08:16

Copilot AI reviewed May 13, 2026

Comment thread scripts/validate_midproject_lead_scoring.py Outdated

Comment thread scripts/validate_midproject_lead_scoring.py Outdated

Comment thread scripts/build_midproject_lead_scoring.py Outdated

shaypal5 force-pushed the midproject-lead-scoring-dataset branch from df5361a to 8c78568 May 13, 2026 08:30

shaypal5 changed the title ~~feat: mid-project lead scoring dataset (1200 rows, seed=100)~~ feat: mid-project lead scoring — generation framework May 13, 2026

shaypal5 merged commit f64224e into main May 13, 2026
10 checks passed

shaypal5 deleted the midproject-lead-scoring-dataset branch May 13, 2026 18:43

shaypal5 merged 2 commits into main from midproject-lead-scoring-dataset