feat: mid-project lead scoring — generation framework#79

Merged
shaypal5 merged 2 commits into main from midproject-lead-scoring-dataset
May 13, 2026

Conversation

@shaypal5

@shaypal5 shaypal5 commented May 13, 2026

Contributor

Summary

Adds the reusable generation and validation framework for the mid-project lead scoring dataset. No dataset files are committed here (leadforge is a public repo). The generated CSV and release docs live in leadforge-datasets-private.

Files added

| File | Purpose |
| --- | --- |
| leadforge/pipelines/build_midproject.py | Pipeline module (seed=100, SUBSAMPLE_N=1200) |
| scripts/build_midproject_lead_scoring.py | Build CLI: python scripts/build_midproject_lead_scoring.py OUTPUT_DIR |
| scripts/validate_midproject_lead_scoring.py | Validation script with all spec checks |
| scripts/quick_baseline_eval_midproject.py | Multi-model baseline eval script |
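
The pipeline's sampling code is not shown in this PR, so the following is only a minimal sketch of the idea behind a fixed seed and subsample size: the same seed reproduces the same rows, and a different seed draws different ones. The function name `subsample_ids` and the population size of 5,000 are hypothetical, and Python's standard-library `random` stands in for whatever generator the pipeline actually uses.

```python
import random

SEED = 100          # value stated in this PR
SUBSAMPLE_N = 1200  # value stated in this PR


def subsample_ids(population_size: int, n: int, seed: int) -> list[int]:
    """Pick n distinct row indices reproducibly from range(population_size)."""
    rng = random.Random(seed)  # local generator, global random state is untouched
    return sorted(rng.sample(range(population_size), n))


# Hypothetical population of 5,000 simulated leads, for illustration only.
midproject = subsample_ids(5000, SUBSAMPLE_N, SEED)
rerun = subsample_ids(5000, SUBSAMPLE_N, SEED)
other_seed = subsample_ids(5000, SUBSAMPLE_N, 42)

print(midproject == rerun)       # same seed, same rows
print(midproject == other_seed)  # different seed, different rows
```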

Dataset location

Dataset artifacts (CSV, docs, validation report) live in:
leadforge-datasets-private/lead_scoring_midproject/

Design

  • Same schema and missingness patterns as lead_scoring_intro_v7.csv
  • Different seed (100 vs v7's 42) → different rows, so students cannot copy class notebook outputs
  • 1,200 rows (vs v7's 1,000), 30% conversion rate, snapshot day 20
  • No __leakage__ columns
  • Validated LR AUC=0.703, P@25=0.520 (Lift 1.73x), EV uplift +35% at K=25
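
The lift figure above follows from the other two numbers: P@25 = 0.520 means 13 of the top 25 scored leads converted, and 0.520 / 0.30 ≈ 1.73. A minimal sketch of that arithmetic, assuming lift is defined as precision at K divided by the overall conversion rate (the validator's exact definition and evaluation split are not shown in this PR). The toy labels and scores below are made up to reproduce the ratio and are not taken from the dataset.

```python
def precision_at_k(y_true, scores, k):
    """Fraction of positives among the k highest-scored items."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k


def lift_at_k(precision, base_rate):
    """How many times better than random selection the top-k is."""
    return precision / base_rate


# Toy data (illustrative only): 100 leads, 30 conversions overall,
# 13 of which sit in the 25 highest-scored positions.
labels = [1] * 13 + [0] * 12 + [1] * 17 + [0] * 58
scores = [100 - i for i in range(100)]  # strictly decreasing, no ties

p_at_25 = precision_at_k(labels, scores, 25)
base_rate = sum(labels) / len(labels)
lift = lift_at_k(p_at_25, base_rate)

print(p_at_25, base_rate, round(lift, 2))
```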

Regenerate

python scripts/build_midproject_lead_scoring.py /path/to/output/
python scripts/validate_midproject_lead_scoring.py /path/to/output/lead_scoring_midproject.csv

Test plan

  • No dataset files in branch history (force-pushed clean branch from main)
  • Framework scripts are importable and lint-clean

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings May 13, 2026 08:15
@shaypal5 shaypal5 added the type: feature (New capability) and layer: recipes (recipes/ recipe assets and registry) labels May 13, 2026

Copilot AI left a comment

Pull request overview

Adds a new “mid-project” variant of the lead-scoring teaching dataset (student-safe single CSV) plus the build/validate/eval tooling to reproduce and sanity-check it, aligned to the existing v7 schema/narrative.

Changes:

  • Introduces lead_scoring_midproject.csv (1,200 rows) with accompanying BACKGROUND/RELEASE documentation and a saved validation report JSON.
  • Adds a midproject pipeline module (leadforge/pipelines/build_midproject.py) and build/validate/baseline-eval scripts under scripts/.
  • Updates .agent-plan.md with the completed work item entry for this dataset.

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 3 comments.

Show a summary per file
| File | Description |
| --- | --- |
| scripts/validate_midproject_lead_scoring.py | New validator for the midproject CSV (schema, missingness, baseline metrics, cohort/value-aware checks). |
| scripts/quick_baseline_eval_midproject.py | Quick baseline runner to reproduce headline metrics and feature importances. |
| scripts/build_midproject_lead_scoring.py | CLI to generate the midproject dataset CSV from the simulator/pipeline. |
| leadforge/pipelines/build_midproject.py | Midproject pipeline configuration (seed=100, subsample=1200) reusing common v6/v7 pipeline steps. |
| lead_scoring_midproject/validation_midproject_report.json | Persisted output from running the validator on the generated CSV. |
| lead_scoring_midproject/lead_scoring_midproject.csv | The generated student-safe dataset artifact. |
| lead_scoring_midproject/lead_scoring_midproject_RELEASE.md | Release documentation with metrics, teaching affordances, and provenance. |
| lead_scoring_midproject/lead_scoring_midproject_BACKGROUND.md | Student-facing narrative/background and feature descriptions. |
| .agent-plan.md | Tracks the completion of the midproject dataset work item. |
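
As a rough illustration of what a missingness check in a validator like this could look like, the sketch below computes per-column missing rates and compares them to expected rates within a tolerance. It is not the code in scripts/validate_midproject_lead_scoring.py: the function names, the column names, the expected rates, and the tolerance are all hypothetical, and the standard-library `csv` module is used here to keep the example dependency-free.

```python
import csv
import io


def missing_rates(csv_text: str) -> dict[str, float]:
    """Return the fraction of empty cells per column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = reader.fieldnames or []
    if not rows:
        return {col: 0.0 for col in columns}
    return {
        col: sum(1 for row in rows if (row[col] or "") == "") / len(rows)
        for col in columns
    }


def check_missingness(rates, expected, tolerance):
    """Collect one error string per column that is absent or out of tolerance."""
    errors = []
    for col, target in expected.items():
        if col not in rates:
            errors.append(f"{col}: column absent")
        elif abs(rates[col] - target) > tolerance:
            errors.append(f"{col}: missing rate {rates[col]:.2f}, expected {target:.2f}")
    return errors


# Hypothetical four-row CSV with half of the 'industry' values missing.
sample = "lead_id,industry\n1,saas\n2,\n3,fintech\n4,\n"
rates = missing_rates(sample)
print(check_missingness(rates, {"industry": 0.5}, tolerance=0.05))
print(check_missingness(rates, {"industry": 0.1}, tolerance=0.05))
```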


Comment thread scripts/validate_midproject_lead_scoring.py Outdated
Comment thread scripts/validate_midproject_lead_scoring.py Outdated
Comment thread scripts/build_midproject_lead_scoring.py Outdated
Adds reusable pipeline + scripts for generating and validating the
mid-project lead scoring dataset. Dataset artifacts (CSV, docs,
validation report) live in leadforge-datasets-private.

- leadforge/pipelines/build_midproject.py: pipeline module
  (seed=100, SUBSAMPLE_N=1200, same schema/missingness as v7)
- scripts/build_midproject_lead_scoring.py: build CLI
- scripts/validate_midproject_lead_scoring.py: validation script
- scripts/quick_baseline_eval_midproject.py: baseline eval script

Regenerate the dataset:
  python scripts/build_midproject_lead_scoring.py OUTPUT_DIR

Validate:
  python scripts/validate_midproject_lead_scoring.py OUTPUT_DIR/lead_scoring_midproject.csv

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@shaypal5 shaypal5 force-pushed the midproject-lead-scoring-dataset branch from df5361a to 8c78568 Compare May 13, 2026 08:30
@shaypal5 shaypal5 changed the title feat: mid-project lead scoring dataset (1200 rows, seed=100) feat: mid-project lead scoring — generation framework May 13, 2026

- Remove EXPECTED_ROWS and ROW_TOLERANCE (unused dead constants;
  row-count policy already expressed correctly inline in check_basic)
- Guard df[TARGET] in validate() so a missing target column records an
  error cleanly instead of crashing with KeyError

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
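
The second fix in the commit above, recording an error instead of crashing when the target column is absent, can be sketched roughly as follows. This is not the actual validate() from the script: the commit message refers to df[TARGET], which suggests a pandas DataFrame, whereas this sketch uses the standard-library `csv` module, and the column name "converted" and the error wording are assumptions made for illustration.

```python
import csv
import io

TARGET = "converted"  # hypothetical target column name


def validate(csv_text: str) -> list[str]:
    """Return a list of error messages; an empty list means the checks passed."""
    errors: list[str] = []
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []

    # Guard: check for the target before indexing it, so a missing column
    # becomes a recorded error rather than an unhandled KeyError.
    if TARGET not in columns:
        errors.append(f"missing target column: {TARGET!r}")
        return errors

    # Target-dependent checks only run once the column is known to exist.
    bad = [row[TARGET] for row in reader if row[TARGET] not in {"0", "1"}]
    if bad:
        errors.append(f"{len(bad)} non-binary target values")
    return errors


print(validate("lead_id,score\n1,0.4\n"))         # target column absent
print(validate("lead_id,converted\n1,0\n2,1\n"))  # clean input
```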
@github-actions


pr-agent-context report:

No unresolved review comments, failing checks, or actionable patch coverage gaps were found on PR #79 in repository https://github.com/leadforge-dev/leadforge. Treat this PR as all clear unless new signals appear.

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: commit pushed
Workflow run: 25795848924 attempt 1
Comment timestamp: 2026-05-13T11:20:31.681850+00:00
PR head commit: cea57addb5cdf5127efd079e0f05aa79b0419453

@shaypal5 shaypal5 merged commit f64224e into main May 13, 2026
10 checks passed
@shaypal5 shaypal5 deleted the midproject-lead-scoring-dataset branch May 13, 2026 18:43
