From d3192fa24ec4cf8ef2607d1a7e28017401397913 Mon Sep 17 00:00:00 2001 From: Shay Palachy Date: Thu, 7 May 2026 23:57:46 +0300 Subject: [PATCH 1/4] PR 6.3: break-me guide + issue templates + v2 decision log MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Phase 6 closer. Five new artefacts plus three follow-up syncs: - docs/release/break_me_guide.md — adversarial playbook. Meta-recipe (read the dictionary → ablate, don't just probe → check the time window → treat the train/test split as untrusted) plus 9 patterns grouped by category. Each pattern carries a "how to detect on any dataset" recipe and a worked-example pointer back into the v1 bundle (notebook §, validation_report JSON path, or feature_dictionary.csv field). Delivers the three explicit promises notebook 04 §10 made (target-encoding leakage, train-test contamination via account_id, cohort-by-segment) plus six others (naming smells, standalone-AUC vs tree-ablation gap, time-window violations, value-aware ranking inversions, threshold-vs-rank ties, calibration drift across segments). - docs/release/v2_decision_log.md — empty stub with the schema (7 columns: received_at / source / topic / severity / verdict / next_step / link) and verdict vocabulary documented in the preamble. - .github/ISSUE_TEMPLATE/dataset_breakage_report.yml — GitHub Issue Forms. Fields: tier, seed, bundle hash, suggested triage label, severity, summary, repro, expected-vs-actual, environment, two confirmation checkboxes. - .github/ISSUE_TEMPLATE/realism_feedback.yml — GitHub Issue Forms. Fields: aspect, tier(s)-affected, domain experience, claim, data observation, suggested fix, severity, two confirmations. - release/README.md — "Maintenance, adversarial framing, license" section now links to the break-me guide, both issue templates, and the v2 decision log. 
_release_common.py's existing relative-link rewriter handles the Kaggle/HF rendering automatically; the regenerated release/kaggle/dataset-metadata.json and release/huggingface/README.md sync are bundled in this commit and pass the audit-artifact-sync tests. Notebook 03 §7 and notebook 04 §10 forward-pointers upgraded from plain `docs/release/break_me_guide.md` text to Markdown links pointing at the GitHub blob URL — relative path would break on Kaggle/HF where notebooks ship without the docs/ tree, the blob URL works in both contexts. Net: 1260/1260 tests pass + 5 publish-extra-gated skips; ruff + mypy clean; leakage probes 0/3 on every public tier; hash determinism PASS 67/67; validate_release_candidate --no-rebuild exits 0; BUNDLE_SCHEMA_VERSION unchanged at 5 (this PR is documentation-only). Co-Authored-By: Claude Opus 4.7 --- .agent-plan.md | 3 +- .../dataset_breakage_report.yml | 140 +++++++ .github/ISSUE_TEMPLATE/realism_feedback.yml | 116 ++++++ docs/release/break_me_guide.md | 341 ++++++++++++++++++ docs/release/v2_decision_log.md | 38 ++ release/README.md | 18 +- release/huggingface/README.md | 18 +- release/kaggle/dataset-metadata.json | 2 +- .../03_leakage_and_time_windows.ipynb | 2 +- .../04_lift_calibration_value_ranking.ipynb | 2 +- scripts/build_release_notebook_03.py | 10 +- scripts/build_release_notebook_04.py | 10 +- 12 files changed, 676 insertions(+), 24 deletions(-) create mode 100644 .github/ISSUE_TEMPLATE/dataset_breakage_report.yml create mode 100644 .github/ISSUE_TEMPLATE/realism_feedback.yml create mode 100644 docs/release/break_me_guide.md create mode 100644 docs/release/v2_decision_log.md diff --git a/.agent-plan.md b/.agent-plan.md index 99821f3..33891c2 100644 --- a/.agent-plan.md +++ b/.agent-plan.md @@ -60,8 +60,7 @@ Goal: ship a best-in-class educational synthetic CRM lead-scoring dataset family ### Phase 6 — Notebook sequence + adversarial framing - [x] PR 6.1: `release/notebooks/01_baseline_lead_scoring.ipynb` refreshed and 
`release/notebooks/02_relational_feature_engineering.ipynb` added. Notebook 01 trains LR + HistGBM on the public `intermediate` bundle using the **same feature set as the validation report** (drops only IDs and the label, mirrors `release_quality._partition_columns`), so the G13.2 reproduction gate compares apples to apples. This means notebook 01 **keeps** `total_touches_all` (the documented leakage trap) — narrative cell calls it out explicitly and forward-points to notebook 03 (PR 6.2) which dissects what dropping the trap does to performance. Notebook 02 by contrast **drops** the trap from the flat baseline so the relational lift attribution stays clean (its goal is teaching feature engineering, not reproducing the report). Targets are loaded at runtime from `release/notebooks/_release_targets.json` (audit-synced against `release/validation/validation_report.json` by `tests/release/notebooks/test_release_targets_match_report.py`); per-metric tolerances replace the original flat ±0.05 (AUC/Brier ±0.02, AP / top-decile ±0.05). Notebook 02 loads the seven snapshot-safe public tables, asserts every event-table `timestamp <= lead_created_at + snapshot_day` inline (with real min-headroom-under-cutoff readings, not a hardcoded literal), demonstrates four legal joins (touch-channel breakdown, account-level density fit on **train leads only**, sales-activity recency, train-only industry target encoding), trains LR + GBM on flat-baseline-only and flat+relational features, prints a 4-row metric panel + delta panel, and pins the four model AUCs and the headline `GBM(eng) − GBM(flat)` lift via `assert_within_tolerance` (sign-aware `assert lift > 0` on top of the absolute tolerance). Honest takeaway cell frames the +0.0147 AUC lift as suggestive, not conclusive (the cross-seed `gbm_auc` spread on this bundle is ~0.027); seed-sweep harness lands in PR 6.2's notebook 04. 
Both notebooks ship inside the public release bundle alongside the parquet tables (Kaggle/HF consumers download them together) so they import a sibling `release/notebooks/_notebook_utils.py` rather than rely on the `leadforge` package — `precision_at_k` and `top_decile_rate` mirror `release_quality._precision_at_k` / `_top_decile_rate` (locked in by mirror tests), and `assert_within_tolerance` is hardened against silent passes on non-finite metrics or incomplete per-metric tolerance maps. G13.1 acceptance gate wired: new `[notebooks]` extra (`nbclient`, `nbformat`, `scikit-learn`, `matplotlib`) and a dedicated `notebooks` CI job that regenerates the intermediate bundle via `python scripts/build_public_release.py release --tier intermediate` (only tier the notebooks need) then nbclient-executes both notebooks end-to-end (`tests/release/notebooks/test_execute_notebooks.py`, parametrised, gated on bundles-present). G13.3 path discipline enforced inline: notebook 01 hard-codes `BUNDLE = Path("../intermediate")` and asserts `manifest.exposure_mode == "student_public"`; notebook 02 explicitly excludes `customers`/`subscriptions` per `BANNED_TABLES`. Builders (`scripts/build_release_notebook_{01,02}.py`, sharing `scripts/_release_notebook_common.py`) emit deterministic byte-for-byte notebook JSON via explicit `cell_NNN` IDs (audit-artifact-sync pattern from PR 4.1 / 5.1 / 5.2, locked in by `tests/scripts/test_release_notebook_builders.py` which builds twice into `tmp_path` via the new `--out PATH` flag and diffs against the committed file without ever touching the working tree) and shell out to `ruff format` on the emitted file so builder output and pre-commit hook agree. Net: 1250/1250 tests pass + 5 publish-extra-gated skips; ruff + mypy clean; leakage probes 0/3 on every tier; hash determinism PASS 67/67; `validate_release_candidate --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5. 
- [x] PR 6.2: `release/notebooks/03_leakage_and_time_windows.ipynb` and `release/notebooks/04_lift_calibration_value_ranking.ipynb` added. Notebook 03 turns the documented `total_touches_all` trap into a teaching moment: reads the trap label off `feature_dictionary.csv`, proves the trap by construction via a same-table comparison of `total_touches_all` (full-horizon) vs `touch_count` (snapshot-safe) — the post-snapshot delta sums to ~3.2 touches/lead and 82 % of leads have a positive delta — and then runs a standalone-AUC probe on the trap (~0.53 AUC, looks innocuous) followed by a side-by-side full-panel ± trap ablation that shows HistGBM extracts ~+0.032 AUC from the same column LR can only squeeze ~+0.009 from. The reframed pedagogy (vs the prompt's original "trap dominates a thin firmographic set" framing) is empirically driven: firmographic-only is at chance AUC even with the trap, but the GBM-vs-LR asymmetry on the strong panel is a real and useful finding — *standalone AUC probes undersell tree-friendly leakage*. Sign-aware tolerance gate pins each AUC ±0.02 and asserts `gbm_lift > 0.015` so a future regeneration that erases the trap or accidentally amplifies it breaks CI. 
Notebook 04 covers the four extra ranking lenses AUC alone misses: calibration / reliability diagram (max bin error ≈ 0.13), lift + cumulative gains (top-decile lift 2.75×), value-aware ranking via `expected_acv × P(convert)` (top-50 ACV-capture jumps from 0.16 to 0.40), threshold selection for fixed top-K capacity, cohort-shift evaluation (HistGBM on the first 85 % chronologically → score the last 15 %, mirrors `release_quality.measure_cohort_shift_from_bundle` down to `COHORT_TRAIN_FRAC=0.85` and `model_random_state=0`, **reproduces the report's `cohort_shift.intermediate` block exactly**: 0.8754 / 0.8908 / −0.0155), and a 200-iter bootstrap of the test-set AUC/AP as the within-bundle confidence band that public-bundle consumers (Kaggle / HF) can run without `leadforge` installed (the prompt's "seed-sweep harness" with bootstrap honestly acknowledged as the proxy for true cross-seed sweep, since rebuilding bundles isn't an option for downstream users). Cohort-shift values are pinned via a new `cohort_shift.intermediate` block in `release/notebooks/_release_targets.json` (audit-synced against `validation_report.cohort_shift.intermediate` by a new `test_cohort_shift_targets_match_validation_report` extension to the existing audit-sync test). Headline LR/GBM panel **drops** `total_touches_all` (matches notebook 02's posture, gives honest production numbers); cohort-shift section deliberately **keeps** the trap to reproduce the report's published cohort-shift numbers exactly — divergent posture explained inline. Both new builders (`scripts/build_release_notebook_{03,04}.py`) inherit the deterministic-cell-ID + `--out` byte-stability pattern from PR 6.1 and are added to `_BUILDERS` / `_NOTEBOOKS` in `tests/scripts/test_release_notebook_builders.py` and `tests/release/notebooks/test_execute_notebooks.py`. 
Both notebooks execute end-to-end in <10s each (well under G13.1's 3-min budget), assert `manifest.exposure_mode == "student_public"` (G13.3), and load only from `release/intermediate/`. Forward-pointer to `docs/release/break_me_guide.md` left as plain backtick-wrapped text — file lands in PR 6.3, no dead Markdown link. Net: 1260/1260 tests pass + 5 publish-extra-gated skips; ruff + mypy clean; leakage probes 0/3 on every tier; hash determinism PASS 67/67; `validate_release_candidate --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5. -- [ ] `.github/ISSUE_TEMPLATE/{dataset_breakage_report,realism_feedback}.yml` -- [ ] `docs/release/{break_me_guide,v2_decision_log}.md` +- [x] PR 6.3: adversarial framing landed. `docs/release/break_me_guide.md` (new) — meta-recipe playbook organised as a 4-step recipe (read the dictionary → ablate, don't just probe → check the time window → treat the train/test split as untrusted) + 9 patterns grouped by category (leakage / split discipline / metric and ranking traps / robustness and realism). Each pattern carries a "how to detect on any dataset" recipe and a "worked example" pointer back into the v1 bundle (notebook §, validation_report JSON path, or feature_dictionary.csv field), so the guide extends the notebooks rather than duplicating them. Three explicit promises notebook 04 §10 made are delivered: target-encoding leakage on test (pattern 4, anchored on NB02 §4.4), train-test contamination via `account_id` overlap (pattern 5, with the honest "v1 only checks `lead_id`, not `account_id`" caveat), cohort-by-segment evaluation (pattern 6, extends NB04 §7's tier-wide cohort-shift to per-segment using the actual segment columns: `industry`, `region`, `employee_band`, `estimated_revenue_band`). 
Other 6 patterns: naming smells, standalone-AUC vs tree-ablation gap (NB03 finding generalised), time-window violations on engineered features (with the `customers`-table example), value-aware ranking surprises (P × ACV vs P-only), threshold-vs-rank ties at the operating point (NB04 §6 finding), calibration drift across cohorts and segments. Triage-label table at the top (`critical-leakage` / `realism` / `difficulty` / `documentation` / `platform` / `notebook` / `pedagogy` / `v2-idea` / `out-of-scope-v1`) gives reporters a vocabulary; the same labels are auto-applied (`needs-triage`) by the issue templates. `docs/release/v2_decision_log.md` (new, empty stub) — schema documented in the file's preamble (7 columns: `received_at` / `source` / `topic` / `severity` / `verdict` / `next_step` / `link`; verdict vocabulary `accepted-for-v2` / `deferred` / `wont-fix` / `needs-investigation` with explicit semantics for each). `.github/ISSUE_TEMPLATE/dataset_breakage_report.yml` (new) and `.github/ISSUE_TEMPLATE/realism_feedback.yml` (new) — GitHub Issue Forms YAML, both carry the `dataset: leadforge-lead-scoring-v1` + `needs-triage` labels. Breakage report: tier dropdown (intro / intermediate / advanced / instructor / multiple), seed input (default 42), bundle hash field (validation: required), suggested triage label dropdown, severity dropdown (high/medium/low), summary, minimal repro, expected-vs-actual citing JSON paths, environment, two confirmation checkboxes (read break-me guide; reporting on as-shipped bundle). Realism feedback: aspect dropdown (industry mix / persona / funnel timing / channel / pricing / account-to-lead density / region / other), tier(s)-affected dropdown, domain-experience one-liner (required — helps weight findings), claim, data observation (with concrete pandas-snippet placeholder example), suggested fix (optional), severity, two confirmations (read README "Known limitations"; checked post_v1_roadmap + v2_decision_log). 
Notebook 03 §7 and notebook 04 §10 forward-pointers upgraded from plain `docs/release/break_me_guide.md` text to Markdown links pointing at the GitHub blob URL (`https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md`) — relative path would break on Kaggle/HF where notebooks ship without the `docs/` tree, the blob URL works in both contexts. `release/README.md` "Maintenance, adversarial framing, license" section rewritten: dead "(PR 6.3)" forward-pointers replaced with real Markdown links to the break-me guide, both issue templates, and the v2 decision log; `_release_common.py`'s existing `](../foo)` → GitHub-blob-URL rewriter handles the Kaggle/HF rendering automatically (verified by the regenerated `release/kaggle/dataset-metadata.json` and `release/huggingface/README.md` sync tests). Hostile-reviewer self-review caught two factual hallucinations in the first revision before they shipped: claimed "15 industries" for `industry` (actually 4: logistics / healthcare_non_clinical / manufacturing / professional_services) and used loose segment-column names ("employee tier", "ARR band") instead of the actual columns (`employee_band`, `estimated_revenue_band`); both fixed. Net: 1260/1260 tests pass + 5 publish-extra-gated skips; ruff + mypy clean; leakage probes 0/3 on every tier; hash determinism PASS 67/67; `validate_release_candidate --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5 (this PR is documentation-only). Phase 6 closed — Phase 7 (LLM critique + publish) is next. 
### Phase 7 — LLM critique + publish - [ ] `leadforge/validation/llm_critique.py` (single-provider, env-var creds, skips cleanly) diff --git a/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml b/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml new file mode 100644 index 0000000..0923386 --- /dev/null +++ b/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml @@ -0,0 +1,140 @@ +name: Dataset breakage report +description: I broke leadforge-lead-scoring-v1 — leakage, train/test contamination, ranking surprise, or a notebook that won't execute. See docs/release/break_me_guide.md for the playbook this template is built around. +title: "[breakage] " +labels: ["dataset: leadforge-lead-scoring-v1", "needs-triage"] +body: + - type: markdown + attributes: + value: | + Thank you for breaking the dataset on purpose. This template is for findings that affect *what's in the bundle* — leakage, split contamination, metric inversions, notebook failures. Distributional / realism critiques (e.g. "industry mix doesn't look like real procurement") belong in the [realism feedback template](realism_feedback.yml) instead. + + The [break-me guide](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md) catalogues the patterns this template is shaped around. + + - type: dropdown + id: tier + attributes: + label: Tier + description: Which bundle tier did you break? Pick `instructor` if the finding is on the `intermediate_instructor` companion. + options: + - intro + - intermediate + - advanced + - instructor + - multiple + validations: + required: true + + - type: input + id: seed + attributes: + label: Seed + description: Generation seed of the bundle you used. Default canonical seed is 42; the published cross-seed sweep covers 42–46. + placeholder: "42" + value: "42" + validations: + required: true + + - type: input + id: bundle_hash + attributes: + label: Bundle hash + description: Paste a hash that identifies the exact bundle. 
Either `manifest.json` → `bundle_hash` (if your build records one), or a `sha256sum tasks/converted_within_90_days/test.parquet` from your local checkout. This makes the report unambiguous if we regenerate. + placeholder: "sha256:abc123… (or manifest.json bundle_hash)" + validations: + required: true + + - type: dropdown + id: triage_label + attributes: + label: Suggested triage label + description: Best guess from the [triage table](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md#triage-labels) at the top of the break-me guide. The maintainer applies the final label. + options: + - critical-leakage + - difficulty + - documentation + - notebook + - pedagogy + - platform + - v2-idea + - out-of-scope-v1 + - I don't know + validations: + required: true + + - type: dropdown + id: severity + attributes: + label: Severity (your assessment) + description: | + - **high**: blocks v1 release or downstream usage (label reconstructed via undocumented path; notebook fails on a clean checkout). + - **medium**: meaningful for downstream users but workaround exists (segment-conditional miscalibration; ties at threshold inflate slate). + - **low**: cosmetic or pedagogical (documentation drift, missing footnote). + options: + - high + - medium + - low + validations: + required: true + + - type: textarea + id: summary + attributes: + label: Summary + description: One paragraph. What did you find? + placeholder: | + On the intermediate tier I trained a HistGBM with `account_avg_touches` joined in via account_id and got AUC 0.97 vs the reported 0.886. The split has account_id collisions across train/test, so the engineered feature is leaking the test labels through the account. + validations: + required: true + + - type: textarea + id: repro + attributes: + label: Minimal reproduction + description: | + The smallest code or command sequence that reproduces the finding. Prefer runnable Python / shell over screenshots. 
If it's a notebook reproduction, name the notebook and section. + placeholder: | + ```python + # Run from release/notebooks/ against intermediate/ + train = pd.read_parquet("../intermediate/tasks/converted_within_90_days/train.parquet") + test = pd.read_parquet("../intermediate/tasks/converted_within_90_days/test.parquet") + overlap = set(train["account_id"]) & set(test["account_id"]) + print(f"shared accounts: {len(overlap)}") # > 0 ⇒ this template applies + ``` + render: markdown + validations: + required: true + + - type: textarea + id: expected_actual + attributes: + label: Expected vs actual + description: What did you expect, and what did you observe? Cite a `validation_report.json` JSON path or a notebook tolerance gate if relevant. + placeholder: | + Expected: `tiers.intermediate.medians.gbm_auc` ≈ 0.876 (validation_report.json). + Actual: 0.97 with the engineered feature, dropping to 0.88 under GroupKFold(account_id). + validations: + required: true + + - type: textarea + id: environment + attributes: + label: Environment + description: Anything reproducibility-relevant — Python version, package versions, OS. `pip freeze | grep -E "leadforge|scikit|pandas"` is enough for most reports. + placeholder: | + leadforge==1.0.0 + scikit-learn==1.5.0 + pandas==2.2.2 + Python 3.11.9, macOS 14.5 + render: text + validations: + required: false + + - type: checkboxes + id: confirmations + attributes: + label: Confirmations + options: + - label: I read the [break-me guide](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md) and this finding isn't already an explicit v1 simplification documented there or in the dataset card. + required: true + - label: I'm reporting a finding on the as-shipped public bundle, not on a privately modified copy. (If you modified the bundle, open a `realism` issue or a discussion instead.) 
+ required: true diff --git a/.github/ISSUE_TEMPLATE/realism_feedback.yml b/.github/ISSUE_TEMPLATE/realism_feedback.yml new file mode 100644 index 0000000..07091f2 --- /dev/null +++ b/.github/ISSUE_TEMPLATE/realism_feedback.yml @@ -0,0 +1,116 @@ +name: Realism feedback +description: A modelled distribution in leadforge-lead-scoring-v1 doesn't match what a domain expert would expect — industry mix, persona behaviour, funnel timing, channel attribution, pricing, etc. For *bundle-level* findings (leakage, contamination, ranking inversions), use the breakage report instead. +title: "[realism] " +labels: ["dataset: leadforge-lead-scoring-v1", "realism", "needs-triage"] +body: + - type: markdown + attributes: + value: | + Realism feedback is one of the highest-value report types we can receive — the simulator can be calibrated against any concrete observation about the real B2B procurement world. Examples we welcome: "industry distribution overweights manufacturing", "VP Finance personas convert too fast", "expected_acv is uncorrelated with industry but should track it strongly". + + We're particularly interested in findings backed by domain experience or public data. The dataset card already documents some intentional simplifications; please skim [`release/README.md`](https://github.com/leadforge-dev/leadforge/blob/main/release/README.md) and [`docs/release/generation_method.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/generation_method.md) before filing — if the issue is already listed under "Known limitations" or "Out-of-scope uses", a thumbs-up on an existing issue is more useful than a new one. + + - type: dropdown + id: aspect + attributes: + label: Which aspect of the dataset? 
+ options: + - industry mix / firmographics + - persona behaviour / personographics + - funnel timing (snapshot day, conversion horizon, stage progression) + - channel attribution (lead_source, first_touch_channel, conversion-by-channel) + - pricing / ACV distribution + - account-to-lead density (multi-lead accounts, lead/account ratio) + - regional distribution (US/UK split, regional conversion differences) + - other (please describe in the claim) + validations: + required: true + + - type: dropdown + id: tier + attributes: + label: Which tier(s) does this affect? + description: Realism findings often hold across all tiers because tiers differ in *signal strength*, not in the underlying simulator. Pick `all` if you didn't specifically check one tier. + options: + - intro + - intermediate + - advanced + - instructor companion + - all (didn't check tier-specifically) + validations: + required: true + + - type: textarea + id: domain_experience + attributes: + label: Domain experience (one-line) + description: A short note on why your perspective is informed — "AE at a procurement SaaS for 6 years", "academic study of B2B funnel CR", "ex-RevOps at a $50M ARR mid-market vendor". This isn't gatekeeping; it helps the maintainer weight the finding when several land in parallel. + placeholder: "5y experience as RevOps lead at a procurement SaaS in the same ARR band as Veridian Procure." + validations: + required: true + + - type: textarea + id: claim + attributes: + label: Claim + description: What does the dataset get wrong, and what would you expect instead? One paragraph. + placeholder: | + The intermediate tier shows a 22% conversion rate on the manufacturing industry slice, identical (within noise) to the rate on healthcare and retail. In real procurement / AP automation, manufacturing typically converts ~1.5x healthcare because manufacturing already has discrete-item AP volume that benefits more from automation. 
The dataset's conversion rate should differ across industries by 1.3-2x, not be flat. + validations: + required: true + + - type: textarea + id: data_observation + attributes: + label: Data observation supporting the claim + description: | + Evidence from the bundle, public benchmarks, or your own data. A pandas snippet that reads the bundle and prints the relevant numbers is ideal. + placeholder: | + ```python + leads = pd.read_parquet("intermediate/tables/leads.parquet") + accts = pd.read_parquet("intermediate/tables/accounts.parquet") + tasks = pd.read_parquet("intermediate/tasks/converted_within_90_days/test.parquet") + joined = tasks.merge(leads[["lead_id", "account_id"]], on="lead_id").merge( + accts[["account_id", "industry"]], on="account_id" + ) + print(joined.groupby("industry")["converted_within_90_days"].mean()) + # All industries within 0.02 of 0.22 — flat across industry. + ``` + render: markdown + validations: + required: true + + - type: textarea + id: suggested_fix + attributes: + label: Suggested fix (optional) + description: How would you change the simulator to match? Even a rough direction ("make `industry` modulate `ConversionHazard` weights") helps. Leave blank if you'd rather flag the problem than prescribe the fix. + placeholder: | + Add a per-industry scalar to `ConversionHazard` in `leadforge/mechanisms/hazards.py` keyed off `accounts.industry`, so the manufacturing path sees ~1.5x base hazard relative to the modal industry. Acceptance band would live in `v1_acceptance_gates_bands.yaml`. + validations: + required: false + + - type: dropdown + id: severity + attributes: + label: Severity (your assessment) + description: | + - **high**: the dataset's pedagogy is materially misleading — students learn a wrong intuition about the real domain. + - **medium**: the gap is real but the lesson still transfers (e.g. flat per-industry rates teach baseline discipline even if production has more variance). + - **low**: cosmetic / footnote. 
+ options: + - high + - medium + - low + validations: + required: true + + - type: checkboxes + id: confirmations + attributes: + label: Confirmations + options: + - label: I checked [`release/README.md`](https://github.com/leadforge-dev/leadforge/blob/main/release/README.md) "Known limitations" / "Out-of-scope uses" and this finding isn't already explicitly documented. + required: true + - label: I checked [`docs/release/post_v1_roadmap.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/post_v1_roadmap.md) and the existing [v2 decision log](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v2_decision_log.md), and this isn't already an accepted v2 work item. + required: true diff --git a/docs/release/break_me_guide.md b/docs/release/break_me_guide.md new file mode 100644 index 0000000..3f8e844 --- /dev/null +++ b/docs/release/break_me_guide.md @@ -0,0 +1,341 @@ +# Break Me — adversarial playbook for `leadforge-lead-scoring-v1` + +We *want* this dataset to be broken on purpose. The notebooks +ship the headline walkthroughs (notebook 03 dissects the +documented `total_touches_all` trap; notebook 04 covers +calibration, value-aware ranking, and cohort shift). This guide +is the **meta-recipe**: the patterns to look for on any +synthetic teaching dataset, with worked-example pointers back +into the v1 bundle so each pattern is grounded in a number +you can reproduce. + +If you find one of these on `leadforge-lead-scoring-v1`, +file an issue using one of the templates in +[`.github/ISSUE_TEMPLATE/`](https://github.com/leadforge-dev/leadforge/tree/main/.github/ISSUE_TEMPLATE). +Accepted findings are logged in +[`docs/release/v2_decision_log.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v2_decision_log.md). + +## Triage labels + +When you file an issue, suggest one of these labels in the +title or body. The maintainer applies the final label. 
+ +| Label | When | +|---|---| +| `critical-leakage` | The dataset reconstructs the label via a path that wasn't documented. Highest priority — blocks v1 if reproducible on the as-shipped bundle. | +| `realism` | A modelled distribution disagrees with what a domain expert expects (industry mix, persona behaviour, funnel timing, channel attribution, pricing). Belongs in the realism issue template. | +| `difficulty` | A tier sits outside its declared band on a metric documented in `release/validation/validation_report.md`. Likely a band recalibration in v2. | +| `documentation` | A claim in the dataset card or notebooks doesn't match the artefact. Cheap to fix; please file. | +| `platform` | Kaggle / HF artefact issue (broken link, malformed YAML, schema mismatch). Phase 5 territory. | +| `notebook` | A notebook fails to execute, or its tolerance gate fires on a fresh checkout. | +| `pedagogy` | The teaching framing is misleading even though the artefact is technically correct. | +| `v2-idea` | A capability worth adding (cohort drift, channel-conditional probabilities, non-linear motifs). | +| `out-of-scope-v1` | True observation, but explicitly deferred — the dataset card already documents it as a v1 simplification. | + +## The meta-recipe + +Notebook 03 §7 introduces a three-step recipe (read the feature +dictionary → ablate, don't just probe → check the time window). +This guide extends it with one more step that the notebook +doesn't cover, then organises the patterns to apply each step +to. + +1. **Read the feature dictionary first.** Every public bundle + ships `feature_dictionary.csv` with a `leakage_risk` column. + Treat that as the primary leakage audit before any modelling. +2. **Ablate, don't just probe.** A standalone-AUC probe on a + single feature can rate a column as ~0.5 AUC while a tree + model extracts non-trivial lift from the same column once + it can combine it with the rest of the panel. 
Notebook 03 + §4–§5 demonstrate the gap on `total_touches_all` + (standalone 0.531 → GBM lift +0.032 vs LR lift +0.009). +3. **Check the time window.** If you have any event table + with timestamps, cross-check every aggregate feature against + `lead_created_at + snapshot_day`. The validation report's + `post_snapshot_aggregates` baseline (`$.tiers.intermediate.per_seed[*].baselines.post_snapshot_aggregates`) + bench-tests this same idea at scale. +4. **Treat the train/test split as untrusted.** The split file + says one thing; what the model sees during fitting is what + matters. Sections 5 and 6 below cover the most common ways + the two diverge. + +The pattern catalogue below maps each pattern to the recipe +step it operationalises. + +--- + +## Leakage patterns + +### 1. Naming smells the dictionary should already flag + +A column whose name mentions `total`, `all`, `lifetime`, +`final`, `outcome`, or any superlative that crosses the +prediction horizon is suspicious by default on a snapshot- +anchored task. `leadforge-lead-scoring-v1` ships exactly one +such column — `total_touches_all` — and the +`feature_dictionary.csv` row for it sets `leakage_risk=True` +and explains *why* in the description. + +**How to detect on any dataset.** Grep the column list for +`*_total`, `*_all`, `*_lifetime`, `*_final`, `*_outcome`, +`current_*`, `is_*` (especially `is_won`, `is_closed`). +Cross-check each hit against the dataset's stated prediction +horizon and snapshot anchor. If the column name implies a +window the snapshot can't have observed, the dictionary should +either flag it or rename it; if neither, that's a `documentation` +issue at minimum and probably `critical-leakage`. + +**Worked example.** Notebook 03 §2 shows the dictionary read +in three lines of pandas; the column it surfaces is +`total_touches_all`. + +### 2. 
The standalone-AUC undersell (tree-friendly leakage) + +A feature can score ~0.5 AUC as a single-column ranker and +still hand a tree model material lift once interactions with +other columns are available. The validation report's +`post_snapshot_aggregates` baseline (a fitted LR on the trap +column alone) gives ~0.55 AUC — the trap "looks" innocuous on +a standalone audit. Notebook 03 §5 then runs a full panel +ablation and HistGBM extracts +0.032 AUC; LR with the same +preprocessing only extracts +0.009 because it can't represent +the relevant interaction. + +**How to detect on any dataset.** Don't audit leakage with +single-feature AUC. For every column you flagged in pattern 1, +fit two tree models on the same train/test split — one with +the column, one without — and read the AUC delta. A delta +larger than your sampling noise is a flag, regardless of the +standalone number. + +**Worked example.** Notebook 03 §4 (standalone) and §5 +(ablation), with the side-by-side bar chart in §5.1. The +sign-aware tolerance gate in §6 (`MIN_GBM_LIFT = 0.015`) +formalises the asymmetry as a CI assertion. + +### 3. Time-window violations on engineered features + +The non-negotiable rule: no feature on a snapshot-anchored +task may use events later than `lead_created_at + snapshot_day`. +The public bundle's event tables (`touches`, `sessions`, +`sales_activities`, `opportunities`) are pre-filtered to +satisfy this rule (notebook 02 §3 verifies the contract on +the bundle as shipped, including a *minimum headroom under +cutoff* readout). The hazard you can still create yourself is +to engineer a feature that joins back to a non-event table +without filtering — for instance, joining `customers` (which +exists only for *converted* leads) into a feature panel. 
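
As a rough illustration of the rule above, the cutoff filter can be written once and reused for every per-lead aggregate. This is a minimal pandas sketch, not the bundle's own implementation: the `event_timestamp` column name, the helper name, and the toy frames are assumptions made for the example, and `snapshot_day` should be read from the bundle's `manifest.json` rather than hard-coded.

```python
import pandas as pd


def count_events_before_cutoff(leads, events, snapshot_day, ts_col="event_timestamp"):
    """Per-lead event count using only events at or before the snapshot cutoff.

    `ts_col` is a hypothetical column name; substitute the timestamp column
    of whichever event table you are aggregating.
    """
    joined = events.merge(
        leads[["lead_id", "lead_created_at"]], on="lead_id", how="inner"
    )
    # Cutoff is computed per lead, mirroring lead_created_at + snapshot_day.
    cutoff = joined["lead_created_at"] + pd.to_timedelta(snapshot_day, unit="D")
    in_window = joined[joined[ts_col] <= cutoff]
    counts = in_window.groupby("lead_id").size()
    # Leads with no in-window events get an explicit zero instead of vanishing.
    return counts.reindex(leads["lead_id"], fill_value=0).rename("events_in_window")


# Toy data (made up for the sketch): lead "a" has one event past the cutoff.
leads = pd.DataFrame(
    {
        "lead_id": ["a", "b", "c"],
        "lead_created_at": pd.to_datetime(["2025-01-01", "2025-01-10", "2025-01-15"]),
    }
)
events = pd.DataFrame(
    {
        "lead_id": ["a", "a", "b"],
        "event_timestamp": pd.to_datetime(["2025-01-05", "2025-03-01", "2025-01-12"]),
    }
)
counts = count_events_before_cutoff(leads, events, snapshot_day=30)
```

Running the same helper against a pre-filtered table and an unfiltered one and comparing the two results is one way to notice a dependence on post-snapshot rows.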
+ +**How to detect on any dataset.** For every per-lead +aggregate you build, write the query as `SELECT … WHERE +event.timestamp <= lead.created_at + INTERVAL ''` +explicitly, even when the underlying table is already filtered. +If the same SQL works against the instructor companion (full- +horizon tables) AND the public bundle, you'll catch +yourself if you accidentally rely on rows that exist only in +the unfiltered view. + +**Worked example.** Notebook 02 §3 implements the per-table +inline assertion. The validation report's +`$.tiers..per_seed[*].baselines.post_snapshot_aggregates` +HistGBM AUC documents what a model can recover when the rule +is intentionally violated. + +### 4. Target-encoding leakage on test + +Mean-target encoding of a categorical feature is a textbook +hazard: fit the encoding on the *full* train+test population +and you've leaked test labels into the feature. Notebook 02 +§4.4 demonstrates the train-only-fit posture on `industry` +(four industries — logistics, healthcare_non_clinical, +manufacturing, professional_services — encoded by their +training-split conversion rate, with a global-mean fallback +for industries not seen in train). The leakage version is a one-line change — using +`pd.concat([train, test]).groupby('industry')['target'].mean()` +instead — and we deliberately *don't* show that in the +notebook because the lesson is the discipline, not the trap. + +**How to detect on any dataset.** When mean-target encoding +shows up in a notebook or pipeline, check three things in +order: (a) the encoding's `.fit()` call sees only training +labels; (b) the same encoding is applied to test via merge +or join, never re-fitted; (c) categories present in test but +not train fall back to a deterministic value (global mean is +fine; computing a fallback from test is not). If the encoding +is fit on test labels even partially — including via a +"smoothed" encoder that uses pooled train+test counts — you +have target leakage. 
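
The three checks above can be sketched as a pair of small helpers. This is an illustrative sketch under stated assumptions, not the notebook's `attach_engineered` code: the function names and the toy frames are hypothetical, and the column is assumed to be called `target` only for the example.

```python
import pandas as pd


def fit_target_encoding(train, col, target):
    """Fit on training rows only: per-category mean plus a global fallback."""
    rates = train.groupby(col)[target].mean()
    fallback = float(train[target].mean())
    return rates, fallback


def apply_target_encoding(df, col, rates, fallback):
    """Apply a fitted encoding by lookup; never re-fit on the frame passed in."""
    # Categories unseen at fit time map to NaN, then to the train-only fallback.
    return df[col].map(rates).fillna(fallback).rename(f"{col}_te")


# Toy data (made up): "healthcare_non_clinical" appears only in test.
train = pd.DataFrame(
    {
        "industry": ["logistics", "logistics", "manufacturing", "manufacturing"],
        "target": [1, 0, 0, 0],
    }
)
test = pd.DataFrame(
    {
        "industry": ["logistics", "healthcare_non_clinical"],
        "target": [1, 1],
    }
)

rates, fallback = fit_target_encoding(train, "industry", "target")
test_encoded = apply_target_encoding(test, "industry", rates, fallback)
```

Because `apply_target_encoding` never reads the label column of the frame it is given, the test labels cannot influence the encoded values, which is the property checks (a) to (c) are meant to guarantee.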
+ +**Worked example.** Notebook 02 §4.4 (train-only fit) and +§4.5 (the merge that applies the encoding to test). The +fallback-to-train-mean handling is in `attach_engineered`. + +--- + +## Split discipline + +### 5. Train-test contamination + +The bundle ships a deterministic 70/15/15 split on `lead_id` +(see `tasks//task_manifest.json`). That guarantees +`lead_id` uniqueness across splits — but `account_id` is +*not* split on. Two leads in the same account can land in +train and test, and the model can ride strong account-level +signal across the split boundary in ways that don't generalise +to a fresh account. + +**How to detect on any dataset.** Compute the intersection +of `account_id` (or whatever the per-entity grouping key is) +between train and test. If it's non-empty *and* you've +engineered any account-level features, retrain with +account-level grouped splitting (e.g. `GroupKFold` on +`account_id`) and re-read the AUC delta. The delta is the +amount of "free" lift the random-split was buying you. The +right framing isn't "remove the leak"; it's *report both +numbers so the reader knows which is which.* + +**Worked example.** Notebook 02 §4.2 builds an account-level +density feature using *only* train leads' touches — a +defensive posture against this hazard. The +`tasks/converted_within_90_days/task_manifest.json` records +the split policy and is the right artefact to cite when filing +an issue under this label. A bundle-level `account_id` +overlap audit isn't included in v1 — the validation report's +split-leakage probe (`probe_split_id_overlap`) checks +`lead_id` only. + +### 6. Cohort-by-segment evaluation + +Notebook 04 §7 demonstrates **tier-wide** cohort shift — +sort leads chronologically, train on the first 85 %, score +the last 15 % — and finds intermediate cohort-split AUC +sits *higher* than random-split AUC by ~0.0155 (the v1 +simulator has no time drift baked in over the 90-day horizon). 
+The richer stress test is **per-segment** cohort shift: +chronological resplit *within* each industry, region, or +revenue tier, and read the same delta per segment. Segment- +conditional drift can hide inside a stable tier-wide number +— industry A drifting up by 0.04 cancels industry B drifting +down by 0.04 in the average. + +**How to detect on any dataset.** For each segment column +(`industry`, `region`, `employee_band`, +`estimated_revenue_band`), repeat the cohort-split protocol +from notebook 04 §7 conditioned on that segment. Report the +per-segment AUC degradation and the spread across segments. +A spread larger than your tier-wide cross-seed band +(`$.tiers..spreads.lr_auc`) is a realism flag — the +simulator is producing a homogeneous world that real +production cohorts wouldn't be. + +**Worked example.** Notebook 04 §7 (tier-wide, validator- +mirrored). The validation report's `cohort_shift..auc_degradation` +field gives the v1 baseline you're trying to refine. v1 +intentionally runs only the tier-wide check; the per-segment +audit is a `v2-idea` candidate. + +--- + +## Metric and ranking traps + +### 7. Value-aware ranking surprises + +P(convert) ranking and `P(convert) × expected_acv` ranking +are both reasonable depending on the operational question. +Notebook 04 §5 shows the gap on this bundle — at top-50, ACV +capture jumps from 0.16 (P-only) to 0.40 (P × ACV). The trap +is reaching for one metric when the operational question +demands the other and not noticing the inversion. AUC ranks +*everything* by P(convert); a salesperson with capacity for +50 leads cares about revenue-weighted top-50 capture. + +**How to detect on any dataset.** Compute both `precision_at_k` +and `expected_acv_capture_at_k` for the same top-K. If their +ranking of model variants disagrees, that's a finding — at +minimum a `pedagogy` issue, possibly `realism` if the gap is +so large it suggests the simulator's ACV column has unrealistic +correlation with P(convert). 
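
A small sketch of the side-by-side comparison follows. The metric definitions here are assumptions chosen for illustration (precision at K as the converted share of the top K, ACV capture at K as the share of all converted ACV that lands in the top K); the validator's own `precision_at_k` and `expected_acv_capture_at_k` may be defined differently, so check the validation report before comparing numbers.

```python
import numpy as np


def top_k_indices(scores, k):
    """Indices of the k highest scores; stable sort keeps ties in input order."""
    return np.argsort(-np.asarray(scores, dtype=float), kind="stable")[:k]


def precision_at_k(y_true, scores, k):
    """Share of the top-k leads that actually converted (assumed definition)."""
    idx = top_k_indices(scores, k)
    return float(np.asarray(y_true, dtype=float)[idx].mean())


def acv_capture_at_k(y_true, acv, scores, k):
    """Share of all converted ACV captured inside the top-k (assumed definition)."""
    value = np.asarray(y_true, dtype=float) * np.asarray(acv, dtype=float)
    total = value.sum()
    if total == 0:
        return 0.0
    idx = top_k_indices(scores, k)
    return float(value[idx].sum() / total)


# Toy data (made up): the largest converted deal has a low P(convert).
y = np.array([1, 0, 1, 0])
p = np.array([0.9, 0.8, 0.3, 0.1])
acv = np.array([20_000, 50_000, 100_000, 30_000])

precision_p_only = precision_at_k(y, p, k=2)
precision_p_times_acv = precision_at_k(y, p * acv, k=2)
capture_p_only = acv_capture_at_k(y, acv, p, k=2)
capture_p_times_acv = acv_capture_at_k(y, acv, p * acv, k=2)
```

In this toy case both rankings have the same precision at K while their ACV capture differs sharply, which is the kind of disagreement the paragraph above asks you to report.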
+ +**Worked example.** Notebook 04 §5 produces both curves +side-by-side; the validation report's +`$.tiers..per_seed[*].expected_acv_capture_at_k` +gives the canonical numbers across seeds. + +### 8. Threshold-vs-rank semantics + +A `precision >= threshold` operating point and a `top-K by +rank` operating point are not the same thing when probabilities +have ties. Notebook 04 §6 picks a threshold that "should" +admit 50 leads and reads back `actually_above` to surface +when ties at the operating point inflate the slate beyond +capacity. On a fresh seed this can quietly admit 70+ leads +into a 50-lead capacity plan if several ties sit at the +chosen probability. + +**How to detect on any dataset.** When you set a probability +threshold for a fixed-capacity decision, always log the +*realised* count above threshold, not just the threshold value. +If realised > capacity by more than a few percent, ties are +inflating the slate and you need either a finer probability +grid (less likely to help on a calibrated model) or a +secondary rank score to break ties. + +**Worked example.** Notebook 04 §6 prints +`capacity / threshold / actually_above / precision / recall` +and walks through the threshold sweep for context. The +calibration-bin output in §3 is the related receipt — a model +with poor bin-error is more likely to have ties at common +probabilities. + +--- + +## Robustness and realism + +### 9. Calibration drift across cohorts and segments + +The validation report tracks `calibration_max_bin_error` +per tier (`$.tiers..medians.calibration_max_bin_error`) +— intermediate ~0.25, intro ~0.25, advanced ~0.52. That's a +single number per tier on a single split; it can mask +segment-conditional miscalibration where the model is +well-calibrated overall but consistently over-predicts on +small-revenue accounts and under-predicts on large ones, or +drifts late-in-cohort vs early. 
Notebook 04 §3 shows the +tier-level reliability diagram on the public bundle; the +analogous per-segment diagram is the next stress test. + +**How to detect on any dataset.** Reproduce notebook 04 §3's +binning protocol *within* each segment column you care about +(`industry`, `region`, `employee_band`, +`estimated_revenue_band`). Report `max_bin_error` per segment +and the spread across segments. A segment whose max-bin-error +is materially worse than the tier-level number is a `realism` +finding — the world isn't producing the correlation structure +between segment and outcome that real production data would. + +**Worked example.** Notebook 04 §3 covers the tier-level +case end-to-end. The cohort-shift block in §7 is the +chronological analogue (calibration over time, in +expectation, via AUC degradation as a coarse summary). v1 +doesn't ship a per-segment calibration audit; it's a +`v2-idea`. + +--- + +## What to do when you find one + +1. Reproduce the finding from a clean checkout against the + as-shipped bundle. Note the seed, tier, and `manifest.json` + `bundle_hash` (or a freshly computed file hash if your + build doesn't expose one). +2. Pick the issue template that fits — leakage / contamination + / metric findings go in [`dataset_breakage_report.yml`](https://github.com/leadforge-dev/leadforge/blob/main/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml); + distributional / realism critiques go in + [`realism_feedback.yml`](https://github.com/leadforge-dev/leadforge/blob/main/.github/ISSUE_TEMPLATE/realism_feedback.yml). +3. Suggest a triage label from the table at the top of this + guide. The maintainer applies the final label. +4. Watch [`docs/release/v2_decision_log.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v2_decision_log.md) + for the disposition. Accepted findings get an entry with + a verdict (`accepted-for-v2`, `deferred`, `wont-fix`, + `needs-investigation`) and a pointer to the resulting v2 + work item. 
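
For step 1, when a build does not expose `bundle_hash`, a freshly computed file hash can stand in. The sketch below uses SHA-256 from the Python standard library; the choice of SHA-256 and the helper name are assumptions for illustration, since this guide does not state which algorithm `manifest.json` uses, so name the algorithm explicitly in the report.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hex SHA-256 digest of a file, read in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes at EOF.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Hashing the specific file the finding depends on (for example a task split parquet) is usually more useful to a maintainer than hashing an archive of the whole bundle.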
diff --git a/docs/release/v2_decision_log.md b/docs/release/v2_decision_log.md new file mode 100644 index 0000000..6590775 --- /dev/null +++ b/docs/release/v2_decision_log.md @@ -0,0 +1,38 @@ +# v2 Decision Log — `leadforge-lead-scoring-v2` + +This log tracks every external finding against +`leadforge-lead-scoring-v1` and the disposition the maintainer +took on each one. It exists so a contributor in 2027 can see +*why* a v2 design call was made (or why a v1 quirk was kept). + +The log starts empty. The first real entry will be added when +the first issue lands; the schema below is what that entry +will fill in. + +## Schema + +Each row is one disposition. Add new rows at the bottom; never +edit historical entries. + +| Field | Required | Format | Notes | +|---|---|---|---| +| `received_at` | yes | `YYYY-MM-DD` | Date the finding was received (issue opened / reviewer comment / direct message). Use the wall-clock date in the maintainer's timezone. | +| `source` | yes | one of `issue:#NNN`, `pr:#NNN`, `email`, `direct` | Where the finding came in. `issue` and `pr` link via the GitHub number. | +| `topic` | yes | one short phrase | What the finding is about — e.g. "expected_acv realism", "industry conversion rates", "cohort-by-segment drift". | +| `severity` | yes | `low` / `medium` / `high` | Reporter's claim, sanity-checked by the maintainer. `high` is the equivalent of the breakage-report `high` severity tier. | +| `verdict` | yes | one of `accepted-for-v2`, `deferred`, `wont-fix`, `needs-investigation` | See vocabulary below. | +| `next_step` | yes | one sentence | What concretely happens next (or has happened). Free-form but specific — "tracked in v2 milestone as #NNN", "documented as v1 simplification in dataset card", etc. | +| `link` | optional | URL or path | Pointer to the resulting commit, doc change, or v2 work item. Empty for `wont-fix` and `needs-investigation`. 
| + +### Verdict vocabulary + +| Verdict | When | +|---|---| +| `accepted-for-v2` | The finding is real and the fix lands in v2. There should be a linked v2 milestone work item. | +| `deferred` | The finding is real but the fix is post-v2 (or unsized). Counts as a backlog entry, not a v2 commitment. | +| `wont-fix` | The finding is correct but the design call is intentional. The dataset card or roadmap should already document it; if not, the entry should result in a doc update. | +| `needs-investigation` | The finding is plausible but not yet reproduced or scoped. Stays in this state for at most one cycle; the maintainer must promote it to one of the other three verdicts before declaring v2 ready. | + +## Log + +(no entries yet — first entry lands when the first external finding is received) diff --git a/release/README.md b/release/README.md index cb1329c..d27b608 100644 --- a/release/README.md +++ b/release/README.md @@ -211,11 +211,19 @@ intended difficulty axis (intro > intermediate > advanced). ## Maintenance, adversarial framing, license -We *want* the dataset to be broken. Issue templates ship under -`.github/ISSUE_TEMPLATE/` (Phase 6); the break-me guide lands as -`docs/release/break_me_guide.md` (PR 6.3). Once Phase 6 ships, -`docs/release/v2_decision_log.md` will track every accepted finding -and the design call that came from it. File issues at +We *want* the dataset to be broken. The +[break-me guide](../docs/release/break_me_guide.md) catalogues +nine adversarial patterns to look for (leakage, split +contamination, ranking inversions, calibration drift) with +worked-example pointers back into the notebooks. Issue +templates ship under `.github/ISSUE_TEMPLATE/`: a +[breakage report](../.github/ISSUE_TEMPLATE/dataset_breakage_report.yml) +form for findings on the bundle itself, and a +[realism feedback](../.github/ISSUE_TEMPLATE/realism_feedback.yml) +form for distributional critiques. 
Accepted findings are +logged in +[`docs/release/v2_decision_log.md`](../docs/release/v2_decision_log.md). +File issues at [leadforge-dev/leadforge](https://github.com/leadforge-dev/leadforge); PRs welcome. diff --git a/release/huggingface/README.md b/release/huggingface/README.md index 885e834..ca0ecd1 100644 --- a/release/huggingface/README.md +++ b/release/huggingface/README.md @@ -256,11 +256,19 @@ intended difficulty axis (intro > intermediate > advanced). ## Maintenance, adversarial framing, license -We *want* the dataset to be broken. Issue templates ship under -`.github/ISSUE_TEMPLATE/` (Phase 6); the break-me guide lands as -`docs/release/break_me_guide.md` (PR 6.3). Once Phase 6 ships, -`docs/release/v2_decision_log.md` will track every accepted finding -and the design call that came from it. File issues at +We *want* the dataset to be broken. The +[break-me guide](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md) catalogues +nine adversarial patterns to look for (leakage, split +contamination, ranking inversions, calibration drift) with +worked-example pointers back into the notebooks. Issue +templates ship under `.github/ISSUE_TEMPLATE/`: a +[breakage report](https://github.com/leadforge-dev/leadforge/blob/main/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml) +form for findings on the bundle itself, and a +[realism feedback](https://github.com/leadforge-dev/leadforge/blob/main/.github/ISSUE_TEMPLATE/realism_feedback.yml) +form for distributional critiques. Accepted findings are +logged in +[`docs/release/v2_decision_log.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v2_decision_log.md). +File issues at [leadforge-dev/leadforge](https://github.com/leadforge-dev/leadforge); PRs welcome. 
diff --git a/release/kaggle/dataset-metadata.json b/release/kaggle/dataset-metadata.json index 2f4b9b2..6f1dab4 100644 --- a/release/kaggle/dataset-metadata.json +++ b/release/kaggle/dataset-metadata.json @@ -1,6 +1,6 @@ { "collaborators": [], - "description": "# LeadForge: Synthetic B2B Lead Scoring Dataset (`leadforge-lead-scoring-v1`)\n\nA relational, reproducible, three-tier synthetic CRM dataset family for\nteaching lead scoring at scale. Generated by\n[leadforge](https://github.com/leadforge-dev/leadforge), an\nopen-source Python framework for synthetic CRM/funnel data. The\nframework version is decoupled from the dataset version: the package\nstays at `1.x`; the dataset is published under the explicit `…-v1`\ntag.\n\n## Why lead scoring matters in 2024–2026\n\nMid-market SaaS vendors entered 2024–2026 with growth slowing and\ncustomer-acquisition costs rising[^macro], so predicting *which* leads\nconvert within a fixed window has moved from a marketing nicety to a\nsurvival skill. 
This dataset teaches that skill on a relational\nsubstrate, with the realistic confusions (snapshot-window discipline,\nleakage traps, channel signal weaker than vendor blogs imply) that\nstudents will hit when they finally get hands on real CRM data.\n\n[^macro]: Macroeconomic framing summarised in\n[`docs/external_review/summaries/gemini_v2_summary.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/external_review/summaries/gemini_v2_summary.md)\n(median public-SaaS growth 30%→25% from 2023 to 2025; New CAC Ratio\nrose materially in 2024).\n\n## What's inside\n\n```\n.\n├── intro/ intermediate/ advanced/ # student_public bundles, one per difficulty tier\n│ ├── manifest.json # provenance + file hashes\n│ ├── dataset_card.md # auto-rendered per-bundle card\n│ ├── feature_dictionary.csv # authoritative column spec\n│ ├── lead_scoring.csv # flat convenience CSV (all splits)\n│ ├── tables/*.parquet # 7 snapshot-safe relational tables\n│ └── tasks/converted_within_90_days/{train,valid,test}.parquet\n├── dataset-metadata.json # Kaggle dataset metadata\n├── dataset-cover-image.png # Kaggle cover image\n├── README.md # Kaggle package README\n└── LICENSE\n```\n\n`student_public` bundles ship the snapshot-safe relational view;\n`research_instructor` companions ship the full-horizon view plus the\nhidden causal structure (DAG, latent registry, mechanism summary)\nunder `metadata/`. 
The full layout is documented in each bundle's\n`manifest.json`.\n\n## Quick start\n\n```python\n# Flat CSV\ndf = pd.read_csv(\"intermediate/lead_scoring.csv\")\n\n# Parquet task splits (recommended)\ntrain = pd.read_parquet(\"intermediate/tasks/converted_within_90_days/train.parquet\")\ntest = pd.read_parquet(\"intermediate/tasks/converted_within_90_days/test.parquet\")\n\n# Relational tables (feature engineering — example)\nleads = pd.read_parquet(\"intermediate/tables/leads.parquet\")\ntouches = pd.read_parquet(\"intermediate/tables/touches.parquet\")\nmy_touch_count = (\n touches.groupby(\"lead_id\").size().rename(\"my_touch_count\").reset_index()\n)\nfeatures = leads.merge(my_touch_count, on=\"lead_id\", how=\"left\")\n\n# Reproduce from source\n# pip install leadforge\n# leadforge generate --recipe b2b_saas_procurement_v1 --seed 42 \\\n# --mode student_public --difficulty intermediate --out my_bundle\n```\n\nThe label `converted_within_90_days` resolves over a 90-day window;\nengagement features (`touch_count`, `session_count`, etc.) are\ncomputed strictly over events on days `[0, 30]`. The deliberate\nexception is `total_touches_all`, the leakage trap — flagged\n`leakage_risk=True` in `feature_dictionary.csv`. Drop it from your\nfeature set unless you're demonstrating leakage detection.\n\n## Dataset summary\n\n| | Intro | Intermediate | Advanced |\n|---|---|---|---|\n| Leads | 5,000 | 5,000 | 5,000 |\n| Accounts | 1,500 | 1,500 | 1,500 |\n| Contacts | 4,200 | 4,200 | 4,200 |\n| Snapshot columns | 32 / 34* | 32 / 34* | 32 / 34* |\n| Target | `converted_within_90_days` | `converted_within_90_days` | `converted_within_90_days` |\n| Conversion rate (recipe band) | 24–61% | 12–31% | 4–12% |\n| Conversion rate (median, seeds 42–46) | 42.67% | 21.60% | 8.40% |\n| Signal strength | 0.90 | 0.70 | 0.50 |\n| Noise scale | 0.10 | 0.30 | 0.55 |\n| Missing rate | 2% | 8% | 18% |\n\n\\* `student_public` / `research_instructor`. 
Difficulty is modulated\nby the simulation engine — signal strength on latent-trait weights,\nGaussian noise on float features, MCAR missingness, outlier rate —\nnot post-hoc label flipping.\n\n## The scenario\n\n**Veridian Technologies** is a fictional Series B startup (Austin, US)\nselling **Veridian Procure**, a procurement / AP automation SaaS, to\nmid-market firms (200–2,000 employees) in the US and UK. The funnel\nruns through inbound marketing (45%), SDR outbound (35%), and\npartner referrals (20%); four personas drive deals (VP Finance, AP\nManager, IT Director, Procurement Manager). **Task:** predict whether\na lead converts (`closed_won`) within 90 days. ACV bands are\n$18k–$120k. See\n[`docs/release/generation_method.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/generation_method.md)\nfor the full DGP, and the deeper \"what's modelled / approximate / not\nmodelled\" breakdown that this README only summarises.\n\n## Public vs instructor: what's redacted\n\nFiltering happens **during rendering**, not during simulation. 
The\nredaction contract is single-sourced in\n[`leadforge/validation/leakage_probes.py`](https://github.com/leadforge-dev/leadforge/blob/main/leadforge/validation/leakage_probes.py);\nthe snapshot-safe writer and the validator import the same constants,\nso they cannot drift apart.\n\n| Source-of-truth constant | Public bundle treatment |\n|---|---|\n| `BANNED_LEAD_COLUMNS = (\"converted_within_90_days\", \"conversion_timestamp\")` | Dropped from `tables/leads.parquet` |\n| `BANNED_OPP_COLUMNS = (\"close_outcome\", \"closed_at\")` | Dropped from `tables/opportunities.parquet` |\n| `BANNED_TABLES = (\"customers\", \"subscriptions\")` | Omitted from public bundles |\n| `SNAPSHOT_FILTERED_TABLES` (touches, sessions, sales_activities, opportunities) | Filtered per-lead by `lead_created_at + snapshot_day` |\n| Snapshot redaction (`current_stage`, `is_sql`) | Stripped from `tasks/` splits and `tables/leads.parquet` |\n| `total_touches_all` (deliberate trap) | **Retained in both modes**; flagged `leakage_risk=True` |\n\nEach bundle's `manifest.json` records `relational_snapshot_safe`,\n`redacted_columns`, and `snapshot_day`, so the bundle is\nself-describing.\n\n## Calibration\n\nEvery realism / calibration / difficulty claim in this README is\nbacked by\n[`validation/validation_report.md`](https://github.com/leadforge-dev/leadforge/blob/main/release/validation/validation_report.md),\nregenerated by\n[`scripts/validate_release_candidate.py`](https://github.com/leadforge-dev/leadforge/blob/main/scripts/validate_release_candidate.py)\nwith bands declared in\n[`docs/release/v1_acceptance_gates_bands.yaml`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v1_acceptance_gates_bands.yaml).\nHeadline cross-seed medians (seeds 42–46):\n\n| Tier | LR AUC | AP | P@100 | Brier |\n|---|---|---|---|---|\n| intro | 0.879 | 0.761 | 0.80 | 0.130 |\n| intermediate | 0.886 | 0.575 | 0.59 | 0.110 |\n| advanced | 0.886 | 0.351 | 0.34 | 0.061 |\n\nAP, P@100, conversion-rate, 
and lift orderings hold across the\nintended difficulty axis (intro > intermediate > advanced).\n\n## Intended uses\n\n- Teaching baseline lead-scoring on a flat snapshot.\n- Teaching relational feature engineering against snapshot-safe tables.\n- Teaching leakage detection (the `total_touches_all` trap is\n designed to be discoverable).\n- Teaching calibration, lift, P@K, value-aware ranking\n (`expected_acv × P(convert)`), and cohort-shift evaluation.\n- Comparing model families under a controlled DGP.\n\n## Out-of-scope uses\n\n- **Production lead scoring.** The company, product, and customers are\n fictional.\n- **Vendor benchmarking / paper baselines.** Difficulty tiers are\n calibrated for pedagogy, not cross-paper comparability.\n- **Causal-inference research that requires recovery of the true DGP.**\n The instructor companion exposes the hidden graph for teaching, not\n designed counterfactuals.\n- **Demographic / fairness research.** v1 does not model protected\n attributes.\n\n## Known limitations\n\n- **Difficulty signal on raw AUC is flat.** LR AUC is ~0.88 across\n every tier. Difficulty is visible in AP, P@K, Brier, and value\n capture. Treat AUC as a sanity check, not a difficulty signal.\n- **GBM does not consistently beat LR (gate G7.4.4).** GBM−LR AUC delta\n is slightly negative in every tier (intro −0.0045, intermediate\n −0.0072, advanced −0.0133); v1's snapshot is dominated by linear\n features. v2 will inject non-linear interactions in the simulator.\n- **Channel signal is weak.** Per\n [`docs/release/channel_signal_audit.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/channel_signal_audit.md),\n out-of-sample univariate AUC of `lead_source` is ≈0.50–0.52 across\n all tiers and the per-channel rate spread is ≤0.05. 
The simulator\n does not encode channel-conditional probabilities; channel-conditional\n encoding is post-v1 work.\n- **Cohort-shift degradation is small.** v1 has no time-of-year drift\n baked in; the cohort-shift gate (G6.4) is informational and will\n bite in v2.\n\n## Composition\n\n- **Entities.** Accounts, contacts, leads, touches, sessions,\n sales_activities, opportunities (public); plus customers and\n subscriptions (instructor only). Per-row counts per bundle live in\n `manifest.json`.\n- **Features.** 32 public columns grouped by analytical role in\n [`docs/release/feature_dictionary.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/feature_dictionary.md);\n the per-bundle `feature_dictionary.csv` is the authoritative\n machine-readable spec.\n- **Label.** `converted_within_90_days` (boolean), event-derived from\n the simulator. Never sampled directly.\n- **Splits.** 70/15/15 train/valid/test, deterministic given seed;\n recorded in `tasks/converted_within_90_days/task_manifest.json`.\n- **Provenance.** Recipe `b2b_saas_procurement_v1`, seed 42, package\n version stamped in `manifest.json`.\n\n## Maintenance, adversarial framing, license\n\nWe *want* the dataset to be broken. Issue templates ship under\n`.github/ISSUE_TEMPLATE/` (Phase 6); the break-me guide lands as\n`docs/release/break_me_guide.md` (PR 6.3). Once Phase 6 ships,\n`docs/release/v2_decision_log.md` will track every accepted finding\nand the design call that came from it. 
File issues at\n[leadforge-dev/leadforge](https://github.com/leadforge-dev/leadforge);\nPRs welcome.\n\n| Field | Value |\n|---|---|\n| Generator | leadforge `1.0.0+` |\n| Recipe | `b2b_saas_procurement_v1` |\n| Canonical seed | 42 (cross-seed sweep: 42–46) |\n| Bundle schema version | 5 |\n| Format | Parquet (canonical) + CSV (convenience) |\n| License | MIT — see [LICENSE](LICENSE) |\n\nVerify integrity with `leadforge validate `; every file\nis hashed in `manifest.json`.\n", + "description": "# LeadForge: Synthetic B2B Lead Scoring Dataset (`leadforge-lead-scoring-v1`)\n\nA relational, reproducible, three-tier synthetic CRM dataset family for\nteaching lead scoring at scale. Generated by\n[leadforge](https://github.com/leadforge-dev/leadforge), an\nopen-source Python framework for synthetic CRM/funnel data. The\nframework version is decoupled from the dataset version: the package\nstays at `1.x`; the dataset is published under the explicit `…-v1`\ntag.\n\n## Why lead scoring matters in 2024–2026\n\nMid-market SaaS vendors entered 2024–2026 with growth slowing and\ncustomer-acquisition costs rising[^macro], so predicting *which* leads\nconvert within a fixed window has moved from a marketing nicety to a\nsurvival skill. 
This dataset teaches that skill on a relational\nsubstrate, with the realistic confusions (snapshot-window discipline,\nleakage traps, channel signal weaker than vendor blogs imply) that\nstudents will hit when they finally get hands on real CRM data.\n\n[^macro]: Macroeconomic framing summarised in\n[`docs/external_review/summaries/gemini_v2_summary.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/external_review/summaries/gemini_v2_summary.md)\n(median public-SaaS growth 30%→25% from 2023 to 2025; New CAC Ratio\nrose materially in 2024).\n\n## What's inside\n\n```\n.\n├── intro/ intermediate/ advanced/ # student_public bundles, one per difficulty tier\n│ ├── manifest.json # provenance + file hashes\n│ ├── dataset_card.md # auto-rendered per-bundle card\n│ ├── feature_dictionary.csv # authoritative column spec\n│ ├── lead_scoring.csv # flat convenience CSV (all splits)\n│ ├── tables/*.parquet # 7 snapshot-safe relational tables\n│ └── tasks/converted_within_90_days/{train,valid,test}.parquet\n├── dataset-metadata.json # Kaggle dataset metadata\n├── dataset-cover-image.png # Kaggle cover image\n├── README.md # Kaggle package README\n└── LICENSE\n```\n\n`student_public` bundles ship the snapshot-safe relational view;\n`research_instructor` companions ship the full-horizon view plus the\nhidden causal structure (DAG, latent registry, mechanism summary)\nunder `metadata/`. 
The full layout is documented in each bundle's\n`manifest.json`.\n\n## Quick start\n\n```python\n# Flat CSV\ndf = pd.read_csv(\"intermediate/lead_scoring.csv\")\n\n# Parquet task splits (recommended)\ntrain = pd.read_parquet(\"intermediate/tasks/converted_within_90_days/train.parquet\")\ntest = pd.read_parquet(\"intermediate/tasks/converted_within_90_days/test.parquet\")\n\n# Relational tables (feature engineering — example)\nleads = pd.read_parquet(\"intermediate/tables/leads.parquet\")\ntouches = pd.read_parquet(\"intermediate/tables/touches.parquet\")\nmy_touch_count = (\n touches.groupby(\"lead_id\").size().rename(\"my_touch_count\").reset_index()\n)\nfeatures = leads.merge(my_touch_count, on=\"lead_id\", how=\"left\")\n\n# Reproduce from source\n# pip install leadforge\n# leadforge generate --recipe b2b_saas_procurement_v1 --seed 42 \\\n# --mode student_public --difficulty intermediate --out my_bundle\n```\n\nThe label `converted_within_90_days` resolves over a 90-day window;\nengagement features (`touch_count`, `session_count`, etc.) are\ncomputed strictly over events on days `[0, 30]`. The deliberate\nexception is `total_touches_all`, the leakage trap — flagged\n`leakage_risk=True` in `feature_dictionary.csv`. Drop it from your\nfeature set unless you're demonstrating leakage detection.\n\n## Dataset summary\n\n| | Intro | Intermediate | Advanced |\n|---|---|---|---|\n| Leads | 5,000 | 5,000 | 5,000 |\n| Accounts | 1,500 | 1,500 | 1,500 |\n| Contacts | 4,200 | 4,200 | 4,200 |\n| Snapshot columns | 32 / 34* | 32 / 34* | 32 / 34* |\n| Target | `converted_within_90_days` | `converted_within_90_days` | `converted_within_90_days` |\n| Conversion rate (recipe band) | 24–61% | 12–31% | 4–12% |\n| Conversion rate (median, seeds 42–46) | 42.67% | 21.60% | 8.40% |\n| Signal strength | 0.90 | 0.70 | 0.50 |\n| Noise scale | 0.10 | 0.30 | 0.55 |\n| Missing rate | 2% | 8% | 18% |\n\n\\* `student_public` / `research_instructor`. 
Difficulty is modulated\nby the simulation engine — signal strength on latent-trait weights,\nGaussian noise on float features, MCAR missingness, outlier rate —\nnot post-hoc label flipping.\n\n## The scenario\n\n**Veridian Technologies** is a fictional Series B startup (Austin, US)\nselling **Veridian Procure**, a procurement / AP automation SaaS, to\nmid-market firms (200–2,000 employees) in the US and UK. The funnel\nruns through inbound marketing (45%), SDR outbound (35%), and\npartner referrals (20%); four personas drive deals (VP Finance, AP\nManager, IT Director, Procurement Manager). **Task:** predict whether\na lead converts (`closed_won`) within 90 days. ACV bands are\n$18k–$120k. See\n[`docs/release/generation_method.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/generation_method.md)\nfor the full DGP, and the deeper \"what's modelled / approximate / not\nmodelled\" breakdown that this README only summarises.\n\n## Public vs instructor: what's redacted\n\nFiltering happens **during rendering**, not during simulation. 
The\nredaction contract is single-sourced in\n[`leadforge/validation/leakage_probes.py`](https://github.com/leadforge-dev/leadforge/blob/main/leadforge/validation/leakage_probes.py);\nthe snapshot-safe writer and the validator import the same constants,\nso they cannot drift apart.\n\n| Source-of-truth constant | Public bundle treatment |\n|---|---|\n| `BANNED_LEAD_COLUMNS = (\"converted_within_90_days\", \"conversion_timestamp\")` | Dropped from `tables/leads.parquet` |\n| `BANNED_OPP_COLUMNS = (\"close_outcome\", \"closed_at\")` | Dropped from `tables/opportunities.parquet` |\n| `BANNED_TABLES = (\"customers\", \"subscriptions\")` | Omitted from public bundles |\n| `SNAPSHOT_FILTERED_TABLES` (touches, sessions, sales_activities, opportunities) | Filtered per-lead by `lead_created_at + snapshot_day` |\n| Snapshot redaction (`current_stage`, `is_sql`) | Stripped from `tasks/` splits and `tables/leads.parquet` |\n| `total_touches_all` (deliberate trap) | **Retained in both modes**; flagged `leakage_risk=True` |\n\nEach bundle's `manifest.json` records `relational_snapshot_safe`,\n`redacted_columns`, and `snapshot_day`, so the bundle is\nself-describing.\n\n## Calibration\n\nEvery realism / calibration / difficulty claim in this README is\nbacked by\n[`validation/validation_report.md`](https://github.com/leadforge-dev/leadforge/blob/main/release/validation/validation_report.md),\nregenerated by\n[`scripts/validate_release_candidate.py`](https://github.com/leadforge-dev/leadforge/blob/main/scripts/validate_release_candidate.py)\nwith bands declared in\n[`docs/release/v1_acceptance_gates_bands.yaml`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v1_acceptance_gates_bands.yaml).\nHeadline cross-seed medians (seeds 42–46):\n\n| Tier | LR AUC | AP | P@100 | Brier |\n|---|---|---|---|---|\n| intro | 0.879 | 0.761 | 0.80 | 0.130 |\n| intermediate | 0.886 | 0.575 | 0.59 | 0.110 |\n| advanced | 0.886 | 0.351 | 0.34 | 0.061 |\n\nAP, P@100, conversion-rate, 
and lift orderings hold across the\nintended difficulty axis (intro > intermediate > advanced).\n\n## Intended uses\n\n- Teaching baseline lead-scoring on a flat snapshot.\n- Teaching relational feature engineering against snapshot-safe tables.\n- Teaching leakage detection (the `total_touches_all` trap is\n designed to be discoverable).\n- Teaching calibration, lift, P@K, value-aware ranking\n (`expected_acv × P(convert)`), and cohort-shift evaluation.\n- Comparing model families under a controlled DGP.\n\n## Out-of-scope uses\n\n- **Production lead scoring.** The company, product, and customers are\n fictional.\n- **Vendor benchmarking / paper baselines.** Difficulty tiers are\n calibrated for pedagogy, not cross-paper comparability.\n- **Causal-inference research that requires recovery of the true DGP.**\n The instructor companion exposes the hidden graph for teaching, not\n designed counterfactuals.\n- **Demographic / fairness research.** v1 does not model protected\n attributes.\n\n## Known limitations\n\n- **Difficulty signal on raw AUC is flat.** LR AUC is ~0.88 across\n every tier. Difficulty is visible in AP, P@K, Brier, and value\n capture. Treat AUC as a sanity check, not a difficulty signal.\n- **GBM does not consistently beat LR (gate G7.4.4).** GBM−LR AUC delta\n is slightly negative in every tier (intro −0.0045, intermediate\n −0.0072, advanced −0.0133); v1's snapshot is dominated by linear\n features. v2 will inject non-linear interactions in the simulator.\n- **Channel signal is weak.** Per\n [`docs/release/channel_signal_audit.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/channel_signal_audit.md),\n out-of-sample univariate AUC of `lead_source` is ≈0.50–0.52 across\n all tiers and the per-channel rate spread is ≤0.05. 
The simulator\n does not encode channel-conditional probabilities; channel-conditional\n encoding is post-v1 work.\n- **Cohort-shift degradation is small.** v1 has no time-of-year drift\n baked in; the cohort-shift gate (G6.4) is informational and will\n bite in v2.\n\n## Composition\n\n- **Entities.** Accounts, contacts, leads, touches, sessions,\n sales_activities, opportunities (public); plus customers and\n subscriptions (instructor only). Per-row counts per bundle live in\n `manifest.json`.\n- **Features.** 32 public columns grouped by analytical role in\n [`docs/release/feature_dictionary.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/feature_dictionary.md);\n the per-bundle `feature_dictionary.csv` is the authoritative\n machine-readable spec.\n- **Label.** `converted_within_90_days` (boolean), event-derived from\n the simulator. Never sampled directly.\n- **Splits.** 70/15/15 train/valid/test, deterministic given seed;\n recorded in `tasks/converted_within_90_days/task_manifest.json`.\n- **Provenance.** Recipe `b2b_saas_procurement_v1`, seed 42, package\n version stamped in `manifest.json`.\n\n## Maintenance, adversarial framing, license\n\nWe *want* the dataset to be broken. The\n[break-me guide](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md) catalogues\nnine adversarial patterns to look for (leakage, split\ncontamination, ranking inversions, calibration drift) with\nworked-example pointers back into the notebooks. Issue\ntemplates ship under `.github/ISSUE_TEMPLATE/`: a\n[breakage report](https://github.com/leadforge-dev/leadforge/blob/main/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml)\nform for findings on the bundle itself, and a\n[realism feedback](https://github.com/leadforge-dev/leadforge/blob/main/.github/ISSUE_TEMPLATE/realism_feedback.yml)\nform for distributional critiques. 
Accepted findings are\nlogged in\n[`docs/release/v2_decision_log.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v2_decision_log.md).\nFile issues at\n[leadforge-dev/leadforge](https://github.com/leadforge-dev/leadforge);\nPRs welcome.\n\n| Field | Value |\n|---|---|\n| Generator | leadforge `1.0.0+` |\n| Recipe | `b2b_saas_procurement_v1` |\n| Canonical seed | 42 (cross-seed sweep: 42–46) |\n| Bundle schema version | 5 |\n| Format | Parquet (canonical) + CSV (convenience) |\n| License | MIT — see [LICENSE](LICENSE) |\n\nVerify integrity with `leadforge validate `; every file\nis hashed in `manifest.json`.\n", "expectedUpdateFrequency": "never", "id": "leadforge/leadforge-lead-scoring-v1", "image": "dataset-cover-image.png", diff --git a/release/notebooks/03_leakage_and_time_windows.ipynb b/release/notebooks/03_leakage_and_time_windows.ipynb index 2130369..3f81625 100644 --- a/release/notebooks/03_leakage_and_time_windows.ipynb +++ b/release/notebooks/03_leakage_and_time_windows.ipynb @@ -398,7 +398,7 @@ "cell_type": "markdown", "id": "cell_018", "metadata": {}, - "source": "## 7. A detection recipe you can run on any dataset\n\nThe trap was easy to spot here because the dataset\n*advertises* it. On a third-party dataset you don't get\nthat courtesy. The same recipe still works:\n\n1. **Read any feature dictionary you have.** Any column\n whose description references a window longer than the\n prediction horizon is suspicious. Even when no\n dictionary ships, an obvious naming smell (`*_total`,\n `*_all`, `*_lifetime`) on a 30-day-snapshot dataset is a\n flag.\n2. **Probe the standalone AUC** *and* **the contribution to\n a tree model.** A standalone probe alone undersells\n tree-friendly leakage (sections 4 and 5 demonstrate why\n on this dataset). Train a model with the column, train\n another without, and compare. The ablation captures\n interactions the standalone probe can't.\n3. 
**Inspect the time window.** Cross-check the suspect\n column against any time-stamped event tables. If the\n column's value can only be explained by events past the\n snapshot anchor, you've found a trap. Section 3 makes\n this concrete here — the same technique generalises\n anywhere there's an event table to corroborate.\n\nA walkthrough of additional detection patterns\n(column-name heuristics, isolation-via-residuals,\ntarget-encoding leakage on test) lives in\n`docs/release/break_me_guide.md` (coming in PR 6.3) — pair\nit with this notebook for a more complete playbook.\n\n## Next\n\n- **Notebook 04** — value-aware ranking\n (`expected_acv` × P(convert)), calibration plots,\n threshold selection for top-K capacity, and a\n cohort-shift / bootstrap robustness harness." + "source": "## 7. A detection recipe you can run on any dataset\n\nThe trap was easy to spot here because the dataset\n*advertises* it. On a third-party dataset you don't get\nthat courtesy. The same recipe still works:\n\n1. **Read any feature dictionary you have.** Any column\n whose description references a window longer than the\n prediction horizon is suspicious. Even when no\n dictionary ships, an obvious naming smell (`*_total`,\n `*_all`, `*_lifetime`) on a 30-day-snapshot dataset is a\n flag.\n2. **Probe the standalone AUC** *and* **the contribution to\n a tree model.** A standalone probe alone undersells\n tree-friendly leakage (sections 4 and 5 demonstrate why\n on this dataset). Train a model with the column, train\n another without, and compare. The ablation captures\n interactions the standalone probe can't.\n3. **Inspect the time window.** Cross-check the suspect\n column against any time-stamped event tables. If the\n column's value can only be explained by events past the\n snapshot anchor, you've found a trap. 
Section 3 makes\n this concrete here — the same technique generalises\n anywhere there's an event table to corroborate.\n\nA walkthrough of additional detection patterns\n(column-name heuristics, target-encoding leakage on\ntest, train-test contamination via account_id,\ncohort-by-segment evaluation) lives in\n[`docs/release/break_me_guide.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md) —\npair it with this notebook for a more complete\nplaybook.\n\n## Next\n\n- **Notebook 04** — value-aware ranking\n (`expected_acv` × P(convert)), calibration plots,\n threshold selection for top-K capacity, and a\n cohort-shift / bootstrap robustness harness." } ], "metadata": { diff --git a/release/notebooks/04_lift_calibration_value_ranking.ipynb b/release/notebooks/04_lift_calibration_value_ranking.ipynb index 25933be..43c1646 100644 --- a/release/notebooks/04_lift_calibration_value_ranking.ipynb +++ b/release/notebooks/04_lift_calibration_value_ranking.ipynb @@ -666,7 +666,7 @@ "cell_type": "markdown", "id": "cell_019", "metadata": {}, - "source": "## 10. Summary\n\n* The LR baseline is well-calibrated (max bin error ≈ 0.13\n on the trap-dropped headline panel, vs ~0.19 on the\n with-trap panel the validation report tracks) and lifts\n the top decile to ~2.75× the base rate.\n* Value-aware ranking (P × ACV) captures more revenue per\n top-K slot than P-only ranking — the gap depends on K\n but is positive across all sizes we tested.\n* Cohort shift is **negative** on the intermediate tier\n (the late cohort is *easier*, not harder); the report\n documents this, and the notebook reproduces it. 
The\n intro and advanced tiers show small positive\n degradations.\n* Bootstrap on the existing test split gives a within-\n bundle confidence band that's tighter than the cross-seed\n spread the validation report computes — useful for \"how\n confident is this single AUC\" questions, not for \"how\n much does the bundle move across seeds.\"\n\n## Where to go next\n\n1. Try cohort-shifted training in production: refit weekly\n on the trailing 60-day window, score the next 7 days.\n2. If you have real ACV data, swap the `expected_acv`\n heuristic for it and recompute section 5 — the revenue\n capture story should sharpen.\n3. The break-me playbook in `docs/release/break_me_guide.md`\n (coming in PR 6.3) catalogues additional stress tests\n (target-encoding leakage, train-test contamination,\n cohort-by-segment) and how to detect each from a\n single bundle." + "source": "## 10. Summary\n\n* The LR baseline is well-calibrated (max bin error ≈ 0.13\n on the trap-dropped headline panel, vs ~0.19 on the\n with-trap panel the validation report tracks) and lifts\n the top decile to ~2.75× the base rate.\n* Value-aware ranking (P × ACV) captures more revenue per\n top-K slot than P-only ranking — the gap depends on K\n but is positive across all sizes we tested.\n* Cohort shift is **negative** on the intermediate tier\n (the late cohort is *easier*, not harder); the report\n documents this, and the notebook reproduces it. The\n intro and advanced tiers show small positive\n degradations.\n* Bootstrap on the existing test split gives a within-\n bundle confidence band that's tighter than the cross-seed\n spread the validation report computes — useful for \"how\n confident is this single AUC\" questions, not for \"how\n much does the bundle move across seeds.\"\n\n## Where to go next\n\n1. Try cohort-shifted training in production: refit weekly\n on the trailing 60-day window, score the next 7 days.\n2. 
If you have real ACV data, swap the `expected_acv`\n heuristic for it and recompute section 5 — the revenue\n capture story should sharpen.\n3. The break-me playbook in\n [`docs/release/break_me_guide.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md)\n catalogues additional stress tests (target-encoding\n leakage, train-test contamination, cohort-by-segment)\n and how to detect each from a single bundle." } ], "metadata": { diff --git a/scripts/build_release_notebook_03.py b/scripts/build_release_notebook_03.py index e811af6..2593876 100644 --- a/scripts/build_release_notebook_03.py +++ b/scripts/build_release_notebook_03.py @@ -542,10 +542,12 @@ def fit_score(cols: list[str], *, model: str) -> np.ndarray: anywhere there's an event table to corroborate. A walkthrough of additional detection patterns - (column-name heuristics, isolation-via-residuals, - target-encoding leakage on test) lives in - `docs/release/break_me_guide.md` (coming in PR 6.3) — pair - it with this notebook for a more complete playbook. + (column-name heuristics, target-encoding leakage on + test, train-test contamination via account_id, + cohort-by-segment evaluation) lives in + [`docs/release/break_me_guide.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md) — + pair it with this notebook for a more complete + playbook. ## Next diff --git a/scripts/build_release_notebook_04.py b/scripts/build_release_notebook_04.py index f57bbf8..42cfd4c 100644 --- a/scripts/build_release_notebook_04.py +++ b/scripts/build_release_notebook_04.py @@ -843,11 +843,11 @@ def _summary(arr: np.ndarray, name: str) -> None: 2. If you have real ACV data, swap the `expected_acv` heuristic for it and recompute section 5 — the revenue capture story should sharpen. - 3. 
The break-me playbook in `docs/release/break_me_guide.md` - (coming in PR 6.3) catalogues additional stress tests - (target-encoding leakage, train-test contamination, - cohort-by-segment) and how to detect each from a - single bundle. + 3. The break-me playbook in + [`docs/release/break_me_guide.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md) + catalogues additional stress tests (target-encoding + leakage, train-test contamination, cohort-by-segment) + and how to detect each from a single bundle. """ ), ] From c17347873c92954f5ad2640ed44e942413360cfc Mon Sep 17 00:00:00 2001 From: Shay Palachy Date: Fri, 8 May 2026 00:11:04 +0300 Subject: [PATCH 2/4] PR 6.3 self-review: factual fixes + label creation MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Hostile-reviewer pass on the PR 6.3 diff caught seven issues. Fixes: 1. **post_snapshot_aggregates baseline misidentified.** The break-me guide patterns 2/3 called the validator's baseline "a fitted LR"; the actual implementation in leadforge/validation/release_quality.py is HistGBM. Fixed, with a link to the source so the model identity is verifiable. 2. **manifest.json `bundle_hash` is fictitious.** The real manifest has per-file sha256s nested under `tasks..test_sha256` and `tables..sha256`, no top-level `bundle_hash`. Fixed both the break-me guide ("What to do when you find one" §1) and the breakage issue template's bundle-hash field; placeholder now carries a real hash from the as-shipped intermediate bundle. 3. **Fabricated "70+ leads" in pattern 8.** The slate-inflation number was made up — the as-shipped intermediate bundle's actually_above readout matches capacity. Reframed as a defensive instrument, with a more honest LR-vs-GBM tie-rate observation. 4. **Misleading "instructor" tier dropdown.** Only `intermediate_instructor` ships in v1; the unqualified label suggested multiple instructor companions. 
Renamed to `intermediate_instructor`. 5. **Underspecified "~0.55 AUC".** Pinned to "intermediate, median across seeds 42–46" with the cross-tier range (0.52–0.61) so readers know the variance. 6. **Speculative pattern-9 calibration claim.** "Consistently over-predicts on small-revenue accounts" was hypothetical, not observed. Reframed as "in principle… whether v1 actually exhibits such drift is an open question". 7. **Issue templates referenced labels that didn't exist.** Created `dataset: leadforge-lead-scoring-v1`, `needs-triage`, `realism` so the templates' `labels:` fields actually fire on issue submission. (Triage labels in the dropdowns — critical-leakage etc. — stay as free-text choices the maintainer applies post-triage.) Net: 1260/1260 tests still pass; ruff + mypy clean; YAML still validates; no new bundle regeneration. BUNDLE_SCHEMA_VERSION unchanged at 5. Co-Authored-By: Claude Opus 4.7 --- .../dataset_breakage_report.yml | 14 ++++-- docs/release/break_me_guide.md | 47 +++++++++++-------- 2 files changed, 38 insertions(+), 23 deletions(-) diff --git a/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml b/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml index 0923386..611d0d0 100644 --- a/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml +++ b/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml @@ -14,12 +14,12 @@ body: id: tier attributes: label: Tier - description: Which bundle tier did you break? Pick `instructor` if the finding is on the `intermediate_instructor` companion. + description: Which bundle tier did you break? `intermediate_instructor` is the only instructor companion shipped in v1. options: - intro - intermediate - advanced - - instructor + - intermediate_instructor - multiple validations: required: true @@ -38,8 +38,14 @@ body: id: bundle_hash attributes: label: Bundle hash - description: Paste a hash that identifies the exact bundle. 
Either `manifest.json` → `bundle_hash` (if your build records one), or a `sha256sum tasks/converted_within_90_days/test.parquet` from your local checkout. This makes the report unambiguous if we regenerate. - placeholder: "sha256:abc123… (or manifest.json bundle_hash)" + description: | + Paste a hash that identifies the exact bundle. Two equivalent forms: + + - From `manifest.json` → `tasks.converted_within_90_days.test_sha256` (the test-split sha256 the bundle records; pinned at generation time). + - From a local `sha256sum tasks/converted_within_90_days/test.parquet`. + + This makes the report unambiguous if we regenerate. + placeholder: "d428c07decc2b1fdf8b5f56a1a63c65799897f1e22b61afd9f5d517f74593f09" validations: required: true diff --git a/docs/release/break_me_guide.md b/docs/release/break_me_guide.md index 3f8e844..1b326ea 100644 --- a/docs/release/break_me_guide.md +++ b/docs/release/break_me_guide.md @@ -94,12 +94,16 @@ in three lines of pandas; the column it surfaces is A feature can score ~0.5 AUC as a single-column ranker and still hand a tree model material lift once interactions with other columns are available. The validation report's -`post_snapshot_aggregates` baseline (a fitted LR on the trap -column alone) gives ~0.55 AUC — the trap "looks" innocuous on -a standalone audit. Notebook 03 §5 then runs a full panel -ablation and HistGBM extracts +0.032 AUC; LR with the same -preprocessing only extracts +0.009 because it can't represent -the relevant interaction. +`post_snapshot_aggregates` baseline (HistGBM on the trap +column alone, see +[`leadforge/validation/release_quality.py`](https://github.com/leadforge-dev/leadforge/blob/main/leadforge/validation/release_quality.py)) +gives ~0.55 AUC on intermediate (median across seeds 42–46; +0.52–0.61 across all tier × seed pairs) — the trap "looks" +innocuous even when scored by a tree model on its own. 
+Notebook 03 §5 then runs a full panel ablation and HistGBM +extracts +0.032 AUC; LR with the same preprocessing only +extracts +0.009 because it can't represent the relevant +interaction. **How to detect on any dataset.** Don't audit leakage with single-feature AUC. For every column you flagged in pattern 1, @@ -266,11 +270,13 @@ gives the canonical numbers across seeds. A `precision >= threshold` operating point and a `top-K by rank` operating point are not the same thing when probabilities have ties. Notebook 04 §6 picks a threshold that "should" -admit 50 leads and reads back `actually_above` to surface -when ties at the operating point inflate the slate beyond -capacity. On a fresh seed this can quietly admit 70+ leads -into a 50-lead capacity plan if several ties sit at the -chosen probability. +admit 50 leads and reads back `actually_above` as a defensive +instrument — on the as-shipped intermediate bundle the realised +count happens to match capacity, but the readout exists so a +seed where ties cluster at the operating probability fails +loud rather than silently inflating the slate. On a calibrated +LR with continuous scores, ties are rare; on a coarse-grained +GBM probability output they're routine. **How to detect on any dataset.** When you set a probability threshold for a fixed-capacity decision, always log the @@ -296,11 +302,10 @@ probabilities. The validation report tracks `calibration_max_bin_error` per tier (`$.tiers..medians.calibration_max_bin_error`) — intermediate ~0.25, intro ~0.25, advanced ~0.52. That's a -single number per tier on a single split; it can mask -segment-conditional miscalibration where the model is -well-calibrated overall but consistently over-predicts on -small-revenue accounts and under-predicts on large ones, or -drifts late-in-cohort vs early. Notebook 04 §3 shows the +single number per tier on a single split; in principle it can +mask segment-conditional miscalibration. 
Whether v1 actually +exhibits such drift is an open question — the per-segment +audit is the way to find out. Notebook 04 §3 shows the tier-level reliability diagram on the public bundle; the analogous per-segment diagram is the next stress test. @@ -325,9 +330,13 @@ doesn't ship a per-segment calibration audit; it's a ## What to do when you find one 1. Reproduce the finding from a clean checkout against the - as-shipped bundle. Note the seed, tier, and `manifest.json` - `bundle_hash` (or a freshly computed file hash if your - build doesn't expose one). + as-shipped bundle. Note the seed, tier, and the test-split + sha256 from `manifest.json` — under + `tasks.converted_within_90_days.test_sha256`. That single + hash uniquely identifies the bundle the finding was + reproduced on; the manifest also carries per-table hashes + under `tables..sha256` if a table-specific hash is + the right anchor for the finding. 2. Pick the issue template that fits — leakage / contamination / metric findings go in [`dataset_breakage_report.yml`](https://github.com/leadforge-dev/leadforge/blob/main/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml); distributional / realism critiques go in From 943b2f79a6d05a6c812d7065b12059724b40bb8e Mon Sep 17 00:00:00 2001 From: Shay Palachy Date: Fri, 8 May 2026 00:27:48 +0300 Subject: [PATCH 3/4] PR 6.3 self-review round 2 + planning: mock-page preview PR MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Hostile-reviewer pass on round 1 caught 8 more issues; one of them (SOURCE_TREE_BLOCK staleness) was an issue I'd already deferred once, which I'm calling a cop-out and fixing now. 
Also folds in a planning update for the v1 release roadmap: insert a new PR before the publish PR that lets the maintainer render local mock-page previews of the Kaggle and HuggingFace dataset pages from the *exact* upload artefacts before any platform upload — staging gate so styling, link, embed, and YAML-rendering issues are caught before they hit the live page where rollback is expensive. ### Round-2 fixes (8) 1. **Pattern 6 metric mismatch.** Cohort shift uses HistGBM (release_quality.py:463/514), so the cross-seed band to compare against is `gbm_auc.spread`, not `lr_auc.spread`. Fixed. 2. **Pattern 5 numerically anchored.** "Two leads in the same account can land in train and test" was vague; the empirical reality is **518 of 557 test accounts (93 %) appear in train** — and the same numbers hold across all three tiers because the splitter is `lead_id`-keyed. Added a copy-pasteable pandas snippet so reporters can verify in 4 lines. 3. **Pattern 4 self-contradiction.** Original text said "we deliberately don't show that" while showing the leakage one-liner inline. Reworded so the contrast is clear: notebook 02 doesn't show the leakage variant; the guide does, so reviewers recognise it in code. 4. **Cross-platform hash command.** Issue template said `sha256sum tasks/.../test.parquet` — that's GNU coreutils, not present on macOS. Added `shasum -a 256` (macOS) and a portable Python one-liner; led with "easiest source: copy from manifest.json". 5. **Relative links within docs/.** Round 1 used absolute GitHub URLs for every in-repo link in the break-me guide, breaking local-clone reading (clicks went to GitHub web instead of opening the local file). Switched to relative paths within the docs tree; kept absolute URLs only where they're load-bearing (notebook forward-pointers ship to Kaggle/HF without `docs/`; issue-template Markdown bodies render via GitHub Issues; README's `](../foo)` gets rewritten by `_release_common.py` for Kaggle/HF). 6. 
**`expected_acv_capture_at_k` JSON path lands on a dict.** `$.tiers..per_seed[*].expected_acv_capture_at_k` resolves to `{"50": …, "100": …}` keyed by string K. Pinned the path to `…expected_acv_capture_at_k.50` so a reader following the citation hits a scalar. 7. **Pattern 8 GBM-tie speculation.** "On a coarse-grained GBM probability output ties are routine" was over-stated — HistGBM's `predict_proba` is continuous via leaf-score sums. Dropped the LR-vs-GBM theorising; kept the empirical observation that on the as-shipped intermediate bundle the realised count matches capacity. 8. **SOURCE_TREE_BLOCK staleness.** Both `release/README.md` and `_release_common.py`'s `SOURCE_TREE_BLOCK` constant listed `notebooks/01_baseline_lead_scoring.ipynb` as the only notebook; four ship now. Updated to a one-liner that names all four. Audit-sync test passes. ### Planning update — local mock-page preview PR `docs/release/v1_release_roadmap.md` and `.agent-plan.md` updated: - Phase 7 grows from 2 PRs to 3. - New **PR 7.2** sits between the LLM critique PR (now 7.1) and the publish PR (now 7.3): `scripts/preview_kaggle_page.py` and `scripts/preview_hf_page.py` render offline HTML mocks from the *exact* upload artefacts (`release/kaggle/dataset-metadata.json` + inlined README + cover image; `release/huggingface/README.md` with frontmatter + body), serve over `localhost`, accept `--variant=public|instructor` (HF) and `--port` / `--open-browser` flags. Tests cover required-field presence, link resolution, schema-column listing, configs-block round-trip. - The publish PR's runbook (`v1_release_notes.md`) cites the preview commands as required pre-flight. - Phase summary table + total PR count (14 → 15) updated for consistency. Net: 1260/1260 tests pass + 5 publish-extra-gated skips; ruff + mypy clean; leakage probes 0/3 on every tier; hash determinism PASS 67/67; `validate_release_candidate --no-rebuild` exits 0; BUNDLE_SCHEMA_VERSION unchanged at 5. 
Co-Authored-By: Claude Opus 4.7 --- .agent-plan.md | 10 +-- .../dataset_breakage_report.yml | 11 ++- docs/release/break_me_guide.md | 87 +++++++++++-------- docs/release/v1_release_roadmap.md | 38 +++++--- release/README.md | 2 +- scripts/_release_common.py | 2 +- 6 files changed, 91 insertions(+), 59 deletions(-) diff --git a/.agent-plan.md b/.agent-plan.md index 33891c2..aceeaab 100644 --- a/.agent-plan.md +++ b/.agent-plan.md @@ -62,12 +62,10 @@ Goal: ship a best-in-class educational synthetic CRM lead-scoring dataset family - [x] PR 6.2: `release/notebooks/03_leakage_and_time_windows.ipynb` and `release/notebooks/04_lift_calibration_value_ranking.ipynb` added. Notebook 03 turns the documented `total_touches_all` trap into a teaching moment: reads the trap label off `feature_dictionary.csv`, proves the trap by construction via a same-table comparison of `total_touches_all` (full-horizon) vs `touch_count` (snapshot-safe) — the post-snapshot delta sums to ~3.2 touches/lead and 82 % of leads have a positive delta — and then runs a standalone-AUC probe on the trap (~0.53 AUC, looks innocuous) followed by a side-by-side full-panel ± trap ablation that shows HistGBM extracts ~+0.032 AUC from the same column LR can only squeeze ~+0.009 from. The reframed pedagogy (vs the prompt's original "trap dominates a thin firmographic set" framing) is empirically driven: firmographic-only is at chance AUC even with the trap, but the GBM-vs-LR asymmetry on the strong panel is a real and useful finding — *standalone AUC probes undersell tree-friendly leakage*. Sign-aware tolerance gate pins each AUC ±0.02 and asserts `gbm_lift > 0.015` so a future regeneration that erases the trap or accidentally amplifies it breaks CI. 
Notebook 04 covers the four extra ranking lenses AUC alone misses: calibration / reliability diagram (max bin error ≈ 0.13), lift + cumulative gains (top-decile lift 2.75×), value-aware ranking via `expected_acv × P(convert)` (top-50 ACV-capture jumps from 0.16 to 0.40), threshold selection for fixed top-K capacity, cohort-shift evaluation (HistGBM on the first 85 % chronologically → score the last 15 %, mirrors `release_quality.measure_cohort_shift_from_bundle` down to `COHORT_TRAIN_FRAC=0.85` and `model_random_state=0`, **reproduces the report's `cohort_shift.intermediate` block exactly**: 0.8754 / 0.8908 / −0.0155), and a 200-iter bootstrap of the test-set AUC/AP as the within-bundle confidence band that public-bundle consumers (Kaggle / HF) can run without `leadforge` installed (the prompt's "seed-sweep harness" with bootstrap honestly acknowledged as the proxy for true cross-seed sweep, since rebuilding bundles isn't an option for downstream users). Cohort-shift values are pinned via a new `cohort_shift.intermediate` block in `release/notebooks/_release_targets.json` (audit-synced against `validation_report.cohort_shift.intermediate` by a new `test_cohort_shift_targets_match_validation_report` extension to the existing audit-sync test). Headline LR/GBM panel **drops** `total_touches_all` (matches notebook 02's posture, gives honest production numbers); cohort-shift section deliberately **keeps** the trap to reproduce the report's published cohort-shift numbers exactly — divergent posture explained inline. Both new builders (`scripts/build_release_notebook_{03,04}.py`) inherit the deterministic-cell-ID + `--out` byte-stability pattern from PR 6.1 and are added to `_BUILDERS` / `_NOTEBOOKS` in `tests/scripts/test_release_notebook_builders.py` and `tests/release/notebooks/test_execute_notebooks.py`. 
Both notebooks execute end-to-end in <10s each (well under G13.1's 3-min budget), assert `manifest.exposure_mode == "student_public"` (G13.3), and load only from `release/intermediate/`. Forward-pointer to `docs/release/break_me_guide.md` left as plain backtick-wrapped text — file lands in PR 6.3, no dead Markdown link. Net: 1260/1260 tests pass + 5 publish-extra-gated skips; ruff + mypy clean; leakage probes 0/3 on every tier; hash determinism PASS 67/67; `validate_release_candidate --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5. - [x] PR 6.3: adversarial framing landed. `docs/release/break_me_guide.md` (new) — meta-recipe playbook organised as a 4-step recipe (read the dictionary → ablate, don't just probe → check the time window → treat the train/test split as untrusted) + 9 patterns grouped by category (leakage / split discipline / metric and ranking traps / robustness and realism). Each pattern carries a "how to detect on any dataset" recipe and a "worked example" pointer back into the v1 bundle (notebook §, validation_report JSON path, or feature_dictionary.csv field), so the guide extends the notebooks rather than duplicating them. Three explicit promises notebook 04 §10 made are delivered: target-encoding leakage on test (pattern 4, anchored on NB02 §4.4), train-test contamination via `account_id` overlap (pattern 5, with the honest "v1 only checks `lead_id`, not `account_id`" caveat), cohort-by-segment evaluation (pattern 6, extends NB04 §7's tier-wide cohort-shift to per-segment using the actual segment columns: `industry`, `region`, `employee_band`, `estimated_revenue_band`). Other 6 patterns: naming smells, standalone-AUC vs tree-ablation gap (NB03 finding generalised), time-window violations on engineered features (with the `customers`-table example), value-aware ranking surprises (P × ACV vs P-only), threshold-vs-rank ties at the operating point (NB04 §6 finding), calibration drift across cohorts and segments. 
Triage-label table at the top (`critical-leakage` / `realism` / `difficulty` / `documentation` / `platform` / `notebook` / `pedagogy` / `v2-idea` / `out-of-scope-v1`) gives reporters a vocabulary; the same labels are auto-applied (`needs-triage`) by the issue templates. `docs/release/v2_decision_log.md` (new, empty stub) — schema documented in the file's preamble (7 columns: `received_at` / `source` / `topic` / `severity` / `verdict` / `next_step` / `link`; verdict vocabulary `accepted-for-v2` / `deferred` / `wont-fix` / `needs-investigation` with explicit semantics for each). `.github/ISSUE_TEMPLATE/dataset_breakage_report.yml` (new) and `.github/ISSUE_TEMPLATE/realism_feedback.yml` (new) — GitHub Issue Forms YAML, both carry the `dataset: leadforge-lead-scoring-v1` + `needs-triage` labels. Breakage report: tier dropdown (intro / intermediate / advanced / instructor / multiple), seed input (default 42), bundle hash field (validation: required), suggested triage label dropdown, severity dropdown (high/medium/low), summary, minimal repro, expected-vs-actual citing JSON paths, environment, two confirmation checkboxes (read break-me guide; reporting on as-shipped bundle). Realism feedback: aspect dropdown (industry mix / persona / funnel timing / channel / pricing / account-to-lead density / region / other), tier(s)-affected dropdown, domain-experience one-liner (required — helps weight findings), claim, data observation (with concrete pandas-snippet placeholder example), suggested fix (optional), severity, two confirmations (read README "Known limitations"; checked post_v1_roadmap + v2_decision_log). Notebook 03 §7 and notebook 04 §10 forward-pointers upgraded from plain `docs/release/break_me_guide.md` text to Markdown links pointing at the GitHub blob URL (`https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md`) — relative path would break on Kaggle/HF where notebooks ship without the `docs/` tree, the blob URL works in both contexts. 
`release/README.md` "Maintenance, adversarial framing, license" section rewritten: dead "(PR 6.3)" forward-pointers replaced with real Markdown links to the break-me guide, both issue templates, and the v2 decision log; `_release_common.py`'s existing `](../foo)` → GitHub-blob-URL rewriter handles the Kaggle/HF rendering automatically (verified by the regenerated `release/kaggle/dataset-metadata.json` and `release/huggingface/README.md` sync tests). Hostile-reviewer self-review caught two factual hallucinations in the first revision before they shipped: claimed "15 industries" for `industry` (actually 4: logistics / healthcare_non_clinical / manufacturing / professional_services) and used loose segment-column names ("employee tier", "ARR band") instead of the actual columns (`employee_band`, `estimated_revenue_band`); both fixed. Net: 1260/1260 tests pass + 5 publish-extra-gated skips; ruff + mypy clean; leakage probes 0/3 on every tier; hash determinism PASS 67/67; `validate_release_candidate --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5 (this PR is documentation-only). Phase 6 closed — Phase 7 (LLM critique + publish) is next. -### Phase 7 — LLM critique + publish -- [ ] `leadforge/validation/llm_critique.py` (single-provider, env-var creds, skips cleanly) -- [ ] `docs/release/llm_critique_prompt.md` + `scripts/run_llm_critique.py` -- [ ] Adjudicate any high-severity findings (resolve in code or document in `v2_decision_log.md`) -- [ ] `scripts/{publish_kaggle,publish_hf}.py` (dry-run → private/draft → public) -- [ ] Tag `leadforge-lead-scoring-v1`; `docs/release/v1_release_notes.md` +### Phase 7 — LLM critique + publish (3 PRs) +- [ ] **PR 7.1** — `leadforge/validation/llm_critique.py` (single-provider, env-var creds, skips cleanly) + `docs/release/llm_critique_prompt.md` + `scripts/run_llm_critique.py`. Adjudicate any high-severity findings (resolve in code or document in `v2_decision_log.md`). 
+- [ ] **PR 7.2** — local Kaggle + HuggingFace mock-page preview tooling (must land before PR 7.3): `scripts/preview_kaggle_page.py` and `scripts/preview_hf_page.py` render offline HTML mocks of the public Kaggle and HF dataset pages from the *exact* upload artefacts (metadata JSON, README, cover image), serve over `localhost`, and let the maintainer click through both pages in a browser before any platform upload — catches styling / link / YAML-rendering issues before they hit cached previews on the live page. Tests cover required-field presence, link resolution, schema column listing, configs-block round-trip. +- [ ] **PR 7.3** — `scripts/{publish_kaggle,publish_hf}.py` (dry-run → local mock-page review → private/draft → public). Tag `leadforge-lead-scoring-v1`; `docs/release/v1_release_notes.md` (cites PR 7.2's preview commands as required pre-flight). --- diff --git a/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml b/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml index 611d0d0..1b7a2e1 100644 --- a/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml +++ b/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml @@ -39,10 +39,15 @@ body: attributes: label: Bundle hash description: | - Paste a hash that identifies the exact bundle. Two equivalent forms: + Paste a hash that identifies the exact bundle. Easiest source: + copy `tasks.converted_within_90_days.test_sha256` straight out + of `manifest.json` (it's pinned at bundle generation time, so + no local hashing is needed). If you've modified the file + locally, recompute via: - - From `manifest.json` → `tasks.converted_within_90_days.test_sha256` (the test-split sha256 the bundle records; pinned at generation time). - - From a local `sha256sum tasks/converted_within_90_days/test.parquet`. 
+ - **macOS:** `shasum -a 256 tasks/converted_within_90_days/test.parquet` + - **Linux:** `sha256sum tasks/converted_within_90_days/test.parquet` + - **Cross-platform:** `python -c "import hashlib,sys; print(hashlib.sha256(open(sys.argv[1],'rb').read()).hexdigest())" tasks/converted_within_90_days/test.parquet` This makes the report unambiguous if we regenerate. placeholder: "d428c07decc2b1fdf8b5f56a1a63c65799897f1e22b61afd9f5d517f74593f09" diff --git a/docs/release/break_me_guide.md b/docs/release/break_me_guide.md index 1b326ea..6548626 100644 --- a/docs/release/break_me_guide.md +++ b/docs/release/break_me_guide.md @@ -11,9 +11,9 @@ you can reproduce. If you find one of these on `leadforge-lead-scoring-v1`, file an issue using one of the templates in -[`.github/ISSUE_TEMPLATE/`](https://github.com/leadforge-dev/leadforge/tree/main/.github/ISSUE_TEMPLATE). +[`.github/ISSUE_TEMPLATE/`](../../.github/ISSUE_TEMPLATE). Accepted findings are logged in -[`docs/release/v2_decision_log.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v2_decision_log.md). +[`v2_decision_log.md`](v2_decision_log.md). ## Triage labels @@ -96,7 +96,7 @@ still hand a tree model material lift once interactions with other columns are available. The validation report's `post_snapshot_aggregates` baseline (HistGBM on the trap column alone, see -[`leadforge/validation/release_quality.py`](https://github.com/leadforge-dev/leadforge/blob/main/leadforge/validation/release_quality.py)) +[`leadforge/validation/release_quality.py`](../../leadforge/validation/release_quality.py)) gives ~0.55 AUC on intermediate (median across seeds 42–46; 0.52–0.61 across all tier × seed pairs) — the trap "looks" innocuous even when scored by a tree model on its own. @@ -154,10 +154,11 @@ and you've leaked test labels into the feature. 
Notebook 02 (four industries — logistics, healthcare_non_clinical, manufacturing, professional_services — encoded by their training-split conversion rate, with a global-mean fallback -for industries not seen in train). The leakage version is a one-line change — using -`pd.concat([train, test]).groupby('industry')['target'].mean()` -instead — and we deliberately *don't* show that in the -notebook because the lesson is the discipline, not the trap. +for industries not seen in train). The leakage variant is a +one-liner — `pd.concat([train, test]).groupby('industry')['target'].mean()` +— and the notebook deliberately doesn't show it, because the +lesson there is the discipline. This guide shows the leakage +form (above) so you recognise it during code review. **How to detect on any dataset.** When mean-target encoding shows up in a notebook or pipeline, check three things in @@ -183,20 +184,30 @@ fallback-to-train-mean handling is in `attach_engineered`. The bundle ships a deterministic 70/15/15 split on `lead_id` (see `tasks//task_manifest.json`). That guarantees `lead_id` uniqueness across splits — but `account_id` is -*not* split on. Two leads in the same account can land in -train and test, and the model can ride strong account-level -signal across the split boundary in ways that don't generalise -to a fresh account. - -**How to detect on any dataset.** Compute the intersection -of `account_id` (or whatever the per-entity grouping key is) -between train and test. If it's non-empty *and* you've -engineered any account-level features, retrain with -account-level grouped splitting (e.g. `GroupKFold` on -`account_id`) and re-read the AUC delta. The delta is the -amount of "free" lift the random-split was buying you. The -right framing isn't "remove the leak"; it's *report both -numbers so the reader knows which is which.* +*not* split on. 
On the as-shipped intermediate bundle, +**518 of 557 test accounts (93 %) also appear in train**; +the same numbers hold on intro and advanced because the +splitter is `lead_id`-keyed and tier-invariant. Models can +ride strong account-level signal across the split boundary +in ways that don't generalise to a fresh account. + +**How to detect on any dataset.** + +```python +import pandas as pd +train = pd.read_parquet("intermediate/tasks/converted_within_90_days/train.parquet") +test = pd.read_parquet("intermediate/tasks/converted_within_90_days/test.parquet") +overlap = set(train["account_id"]) & set(test["account_id"]) +print(f"shared accounts: {len(overlap)} / {test['account_id'].nunique()}") +``` + +If the overlap is non-empty *and* you've engineered any +account-level features, retrain with account-level grouped +splitting (e.g. `GroupKFold` on `account_id`) and re-read the +AUC delta. The delta is the amount of "free" lift the +random-split was buying you. The right framing isn't "remove +the leak"; it's *report both numbers so the reader knows +which is which.* **Worked example.** Notebook 02 §4.2 builds an account-level density feature using *only* train leads' touches — a @@ -227,10 +238,10 @@ down by 0.04 in the average. `estimated_revenue_band`), repeat the cohort-split protocol from notebook 04 §7 conditioned on that segment. Report the per-segment AUC degradation and the spread across segments. -A spread larger than your tier-wide cross-seed band -(`$.tiers..spreads.lr_auc`) is a realism flag — the -simulator is producing a homogeneous world that real -production cohorts wouldn't be. +A spread larger than the tier's cross-seed GBM-AUC band +(`$.tiers..spreads.gbm_auc` — same model the cohort-shift +block uses) is a realism flag: the simulator is producing a +homogeneous world that real production cohorts wouldn't be. **Worked example.** Notebook 04 §7 (tier-wide, validator- mirrored). 
The validation report's `cohort_shift..auc_degradation` @@ -261,9 +272,10 @@ so large it suggests the simulator's ACV column has unrealistic correlation with P(convert). **Worked example.** Notebook 04 §5 produces both curves -side-by-side; the validation report's -`$.tiers..per_seed[*].expected_acv_capture_at_k` -gives the canonical numbers across seeds. +side-by-side; the validation report's per-seed scalars live +under +`$.tiers..per_seed[*].expected_acv_capture_at_k.50` +(and `.100` for top-100), keyed by string K. ### 8. Threshold-vs-rank semantics @@ -272,11 +284,9 @@ rank` operating point are not the same thing when probabilities have ties. Notebook 04 §6 picks a threshold that "should" admit 50 leads and reads back `actually_above` as a defensive instrument — on the as-shipped intermediate bundle the realised -count happens to match capacity, but the readout exists so a -seed where ties cluster at the operating probability fails -loud rather than silently inflating the slate. On a calibrated -LR with continuous scores, ties are rare; on a coarse-grained -GBM probability output they're routine. +count matches capacity, but the readout exists so a seed where +ties cluster at the operating probability fails loud rather +than silently inflating the slate. **How to detect on any dataset.** When you set a probability threshold for a fixed-capacity decision, always log the @@ -338,13 +348,14 @@ doesn't ship a per-segment calibration audit; it's a under `tables..sha256` if a table-specific hash is the right anchor for the finding. 2. 
Pick the issue template that fits — leakage / contamination - / metric findings go in [`dataset_breakage_report.yml`](https://github.com/leadforge-dev/leadforge/blob/main/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml); + / metric findings go in + [`dataset_breakage_report.yml`](../../.github/ISSUE_TEMPLATE/dataset_breakage_report.yml); distributional / realism critiques go in - [`realism_feedback.yml`](https://github.com/leadforge-dev/leadforge/blob/main/.github/ISSUE_TEMPLATE/realism_feedback.yml). + [`realism_feedback.yml`](../../.github/ISSUE_TEMPLATE/realism_feedback.yml). 3. Suggest a triage label from the table at the top of this guide. The maintainer applies the final label. -4. Watch [`docs/release/v2_decision_log.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v2_decision_log.md) - for the disposition. Accepted findings get an entry with - a verdict (`accepted-for-v2`, `deferred`, `wont-fix`, +4. Watch [`v2_decision_log.md`](v2_decision_log.md) for the + disposition. Accepted findings get an entry with a verdict + (`accepted-for-v2`, `deferred`, `wont-fix`, `needs-investigation`) and a pointer to the resulting v2 work item. diff --git a/docs/release/v1_release_roadmap.md b/docs/release/v1_release_roadmap.md index e7b0bb2..520f042 100644 --- a/docs/release/v1_release_roadmap.md +++ b/docs/release/v1_release_roadmap.md @@ -44,13 +44,13 @@ A release candidate is v1-ready when **all** of the following hold. Concrete ban | 4 | Channel-signal audit + dataset card | M-S | 1 | 3 | not started | | 5 | Platform packaging | M | 2 | 4 | not started | | 6 | Notebook sequence + adversarial framing | M-L | 3 | 5 | not started | -| 7 | LLM critique + publish | M | 2 | 6 | not started | +| 7 | LLM critique + publish | M | 3 | 6 | not started | -**Total: 14 PRs.** Each PR follows the `CLAUDE.md` workflow: branch → commit → update `.agent-plan.md` → PR with type+layer labels → milestone assignment (`dataset: leadforge-lead-scoring-v1`). 
PR-level decomposition is in the **PR breakdown** section immediately below. +**Total: 15 PRs.** Each PR follows the `CLAUDE.md` workflow: branch → commit → update `.agent-plan.md` → PR with type+layer labels → milestone assignment (`dataset: leadforge-lead-scoring-v1`). PR-level decomposition is in the **PR breakdown** section immediately below. ## PR breakdown -First-cut decomposition of the 7 phases into ~14 PRs. The numbering `phase.seq` is a planning ID, not a GitHub PR number. Sizes are estimates; we may merge or split during implementation. Within a phase, PRs are typically sequential (later sub-PRs depend on earlier ones); cross-phase dependencies follow the phase summary above. +First-cut decomposition of the 7 phases into ~15 PRs. The numbering `phase.seq` is a planning ID, not a GitHub PR number. Sizes are estimates; we may merge or split during implementation. Within a phase, PRs are typically sequential (later sub-PRs depend on earlier ones); cross-phase dependencies follow the phase summary above. ### Phase 1 — Audit and naming (1 PR) @@ -154,7 +154,7 @@ First-cut decomposition of the 7 phases into ~14 PRs. The numbering `phase.seq` - Labels: `type: docs` - Size: S (~300 lines) -### Phase 7 — LLM critique + publish (2 PRs) +### Phase 7 — LLM critique + publish (3 PRs) - **PR 7.1** — `feat(validation): llm_critique module + prompt + driver` - `leadforge/validation/llm_critique.py` — single-provider, env-var creds, skip-cleanly without @@ -165,18 +165,29 @@ First-cut decomposition of the 7 phases into ~14 PRs. 
The numbering `phase.seq` - Labels: `type: feature`, `layer: validation` - Size: M (~500 lines) -- **PR 7.2** — `feat(scripts): publish_kaggle + publish_hf + tag v1 release` +- **PR 7.2** — `feat(scripts): local Kaggle + HuggingFace mock-page preview` ⚠️ **must land before PR 7.3** + - **Goal:** before any real Kaggle/HF publish, the maintainer can render a faithful local preview of how each platform will display the dataset and click through it in a browser. Catch styling, link, embed, and YAML-rendering issues *before* they land on the live page where rollback is expensive (Kaggle and HF both keep cached previews around). + - `scripts/preview_kaggle_page.py` — reads `release/kaggle/dataset-metadata.json` + the inlined README + the cover image, renders an offline HTML mock that mimics the public Kaggle dataset page (header, description, schema/columns table, file tree, license footer). Serves on `http://localhost:8765` via `python -m http.server` or a small Flask shim. + - `scripts/preview_hf_page.py` — reads `release/huggingface/README.md` (YAML frontmatter + body), renders an offline HTML mock that mimics the HF dataset page (frontmatter pills, configs dropdown, README body, file tree). Serves on `http://localhost:8766`. + - Both scripts: `--release-dir`, `--port`, `--variant=public|instructor` (HF only), `--open-browser`. Dry-run / no-network. + - Both must round-trip the *exact* artefacts the publish PR will upload — same metadata JSON, same README, same cover image — so the preview is faithful, not a sketch. + - Tests: `tests/scripts/test_preview_kaggle_page.py` + `tests/scripts/test_preview_hf_page.py`. Each renders the page once and asserts: required field labels appear, every Markdown link in the source resolves to a non-404 URL pattern, every config block is present, the Kaggle schema table lists every CSV/parquet column. + - Pedagogically: this is the staging gate. 
The release runbook (`docs/release/v1_release_notes.md` in PR 7.3) cites both preview commands as required steps before `kaggle datasets create` / `huggingface-cli upload`. + - Labels: `type: feature`, `layer: cli` + - Size: M (~600 lines — two HTML templates + two render scripts + two test files) + +- **PR 7.3** — `feat(scripts): publish_kaggle + publish_hf + tag v1 release` - `scripts/publish_kaggle.py` - `scripts/publish_hf.py` - `docs/release/v1_release_notes.md` - - Dry-run → private/draft → public publish (manual step performed by maintainer with credentials, within the PR or as a follow-up release tag) + - Dry-run → private/draft → public publish (manual step performed by maintainer with credentials, within the PR or as a follow-up release tag). The runbook references PR 7.2's preview commands as a required pre-flight. - Tag `leadforge-lead-scoring-v1` - Labels: `type: feature`, `layer: cli` - Size: S (~300 lines code + manual publish step) ## PR breakdown — totals -- **14 PRs** across 7 phases. +- **15 PRs** across 7 phases. - Estimated total LoC: ~6,500 (excluding regenerated parquet bundles and notebook JSON). - All 14 PRs target the `dataset: leadforge-lead-scoring-v1` GitHub milestone. - Calendar duration is not committed; depends on iteration cadence and review feedback. @@ -420,22 +431,29 @@ First-cut decomposition of the 7 phases into ~14 PRs. The numbering `phase.seq` - New `docs/release/llm_critique_prompt.md` — the rubric document, structured as the prompt the script feeds. - New `scripts/run_llm_critique.py` — driver: builds the input bundle (README.md, dataset card, generation method, manifest, feature dictionary, validation report, first 100 public rows, public/instructor diff summary, public-safe mechanism summary) → calls the critique → writes `release/validation/llm_critique_raw_*.json` and `release/validation/llm_critique_summary.md`. 
- Adjudicate any high-severity findings; resolve in code or document acknowledgment in `v2_decision_log.md` if intentional-and-accepted. +- **Local mock-page preview (PR 7.2 — must land before publish):** maintainer renders Kaggle and HF dataset pages locally from the actual upload artefacts (the same metadata JSON, README, cover image the publish PR will use) and clicks through them in a browser before any platform upload, so styling / link / YAML-rendering issues are caught before they hit cached previews on the live page. + - `scripts/preview_kaggle_page.py` — reads `release/kaggle/dataset-metadata.json` + inlined README + cover image, renders an offline HTML page that mimics the public Kaggle dataset view. + - `scripts/preview_hf_page.py` — reads `release/huggingface/README.md` (frontmatter + body), renders the analogous HF view. + - Both serve over `python -m http.server` (or a small Flask shim) and accept `--variant=public|instructor` (HF), `--port`, `--open-browser`. + - Tests: required field labels appear, every Markdown link resolves to a non-404 URL pattern, every config block is present, the Kaggle schema table lists every CSV/parquet column. - New `scripts/publish_kaggle.py` — uses `kagglehub.dataset_upload()` with `version_notes` containing the commit hash and tag. - New `scripts/publish_hf.py` — uses `huggingface_hub.HfApi().upload_folder()` with the dataset repo type. - Tag the release: `leadforge-lead-scoring-v1`. Tag the leadforge package release if a coordinated package version bump is needed (TBD — likely just a patch bump). -- `docs/release/v1_release_notes.md` — public-facing release notes. -- Both publish scripts exercised in **dry-run** before actual upload, then upload to **private/draft** repos for download smoke test, then promote to public. +- `docs/release/v1_release_notes.md` — public-facing release notes; references the PR 7.2 preview commands as a required pre-flight step. 
+- Both publish scripts exercised in **dry-run** before actual upload, **and the local mock-page previews from PR 7.2 reviewed in a browser**, then upload to **private/draft** repos for download smoke test, then promote to public. **Files touched:** - `leadforge/validation/llm_critique.py` (new) - `docs/release/llm_critique_prompt.md` (new) - `docs/release/v1_release_notes.md` (new) -- `scripts/run_llm_critique.py`, `scripts/publish_kaggle.py`, `scripts/publish_hf.py` (new) +- `scripts/run_llm_critique.py`, `scripts/preview_kaggle_page.py`, `scripts/preview_hf_page.py`, `scripts/publish_kaggle.py`, `scripts/publish_hf.py` (new) +- `tests/scripts/test_preview_kaggle_page.py`, `tests/scripts/test_preview_hf_page.py` (new) - `release/validation/llm_critique_raw_*.json`, `release/validation/llm_critique_summary.md` (output artifacts) **Acceptance:** - LLM critique runs successfully with credentials; produces structured findings. - No unresolved high-severity findings before tag. +- Local Kaggle and HF preview pages render against the as-shipped upload artefacts and are reviewed in a browser before any platform upload. - Both platform publishes succeed in dry-run. - Both private/draft uploads succeed; download smoke test passes from a clean environment. - Public Kaggle and HF pages render the dataset; `load_dataset()` from a clean env works. 
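Reviewer note on the PR 7.2 test plan above, which asks that every Markdown link in the source resolve to a non-404 URL pattern. The following is a minimal sketch of what such a check might look like, not the repository's implementation. `find_markdown_links` and `unresolved_relative_links` are hypothetical helper names, and treating any link target without a URL scheme as a path relative to the document's directory is an assumption.

```python
import re
from pathlib import Path

# Inline Markdown links of the form [text](target). Image links (![alt](src))
# and reference-style links are deliberately out of scope for this sketch.
_LINK_RE = re.compile(r"(?<!!)\[[^\]]*\]\(([^)\s]+)\)")

# Anything starting with "scheme:" (https:, mailto:, ...) counts as absolute.
_SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*:")


def find_markdown_links(markdown: str) -> list[str]:
    """Return the target of every inline [text](target) link, in order."""
    return _LINK_RE.findall(markdown)


def unresolved_relative_links(markdown: str, base_dir: Path) -> list[str]:
    """Return relative link targets that do not exist under base_dir.

    Absolute URLs and pure #anchors are skipped: checking them would need
    network access or heading parsing, which this sketch does not attempt.
    """
    missing = []
    for target in find_markdown_links(markdown):
        if _SCHEME_RE.match(target) or target.startswith("#"):
            continue
        path_part = target.split("#", 1)[0]  # drop any trailing #fragment
        if not (base_dir / path_part).exists():
            missing.append(target)
    return missing
```

Called with the text of `break_me_guide.md` and its parent directory, a check like this would flag a relative link such as `v2_decision_log.md` only if the file were missing on disk. Whether the real tests also validate absolute GitHub blob URLs is not stated in this patch.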
diff --git a/release/README.md b/release/README.md index d27b608..d548614 100644 --- a/release/README.md +++ b/release/README.md @@ -35,7 +35,7 @@ release/ │ ├── tables/*.parquet # 7 snapshot-safe relational tables │ └── tasks/converted_within_90_days/{train,valid,test}.parquet ├── intermediate_instructor/ # research companion: full-horizon tables + metadata/ -├── notebooks/01_baseline_lead_scoring.ipynb +├── notebooks/ # 01 baseline · 02 relational · 03 leakage · 04 calibration └── validation/ # validation_report.{json,md} + figures ``` diff --git a/scripts/_release_common.py b/scripts/_release_common.py index 3a59858..4d5f4d9 100644 --- a/scripts/_release_common.py +++ b/scripts/_release_common.py @@ -101,7 +101,7 @@ class ValidationError: │ ├── tables/*.parquet # 7 snapshot-safe relational tables │ └── tasks/converted_within_90_days/{train,valid,test}.parquet ├── intermediate_instructor/ # research companion: full-horizon tables + metadata/ -├── notebooks/01_baseline_lead_scoring.ipynb +├── notebooks/ # 01 baseline · 02 relational · 03 leakage · 04 calibration └── validation/ # validation_report.{json,md} + figures ```""" From 4f70a16527fd3f9f0864584a2e69214912eaecff Mon Sep 17 00:00:00 2001 From: Shay Palachy Date: Fri, 8 May 2026 00:33:17 +0300 Subject: [PATCH 4/4] PR 6.3: address Copilot review on issue templates MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Three Copilot comments on the issue templates, all accepted: 1. **COPILOT-1** (dataset_breakage_report.yml:9 → realism template link). The relative `realism_feedback.yml` link doesn't resolve when the description renders in the GitHub Issue Forms chooser. Replaced with the "open another template directly" URL form (`/issues/new?template=realism_feedback.yml`), which lands a misrouted reporter on the realism *form*, not on raw YAML. 2. **COPILOT-2** (dataset_breakage_report.yml:139 → `render: text`). 
`render: text` isn't a documented value for Issue Forms textareas (the supported set is language identifiers from GitHub's syntax highlighter, plus `markdown`). The intent here is plain-text formatting, which is the default when `render` is omitted. Dropped the `render` line entirely. 3. **COPILOT-3** (realism_feedback.yml:58 → placeholder mentions `retail`). Same hallucination class as the round-1 "15 industries" fix. The actual industries are logistics, healthcare_non_clinical, manufacturing, professional_services. Rewrote the placeholder to use the real names so a reporter glancing at the example doesn't think `retail` is a category they should be filing about. Net: 1260/1260 tests pass + 5 publish-extra-gated skips; ruff + mypy clean; YAML validates; BUNDLE_SCHEMA_VERSION unchanged at 5. Co-Authored-By: Claude Opus 4.7 --- .github/ISSUE_TEMPLATE/dataset_breakage_report.yml | 3 +-- .github/ISSUE_TEMPLATE/realism_feedback.yml | 2 +- 2 files changed, 2 insertions(+), 3 deletions(-) diff --git a/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml b/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml index 1b7a2e1..4809c9b 100644 --- a/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml +++ b/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml @@ -6,7 +6,7 @@ body: - type: markdown attributes: value: | - Thank you for breaking the dataset on purpose. This template is for findings that affect *what's in the bundle* — leakage, split contamination, metric inversions, notebook failures. Distributional / realism critiques (e.g. "industry mix doesn't look like real procurement") belong in the [realism feedback template](realism_feedback.yml) instead. + Thank you for breaking the dataset on purpose. This template is for findings that affect *what's in the bundle* — leakage, split contamination, metric inversions, notebook failures. Distributional / realism critiques (e.g. 
"industry mix doesn't look like real procurement") belong in the [realism feedback template](https://github.com/leadforge-dev/leadforge/issues/new?template=realism_feedback.yml) instead. The [break-me guide](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md) catalogues the patterns this template is shaped around. @@ -136,7 +136,6 @@ body: scikit-learn==1.5.0 pandas==2.2.2 Python 3.11.9, macOS 14.5 - render: text validations: required: false diff --git a/.github/ISSUE_TEMPLATE/realism_feedback.yml b/.github/ISSUE_TEMPLATE/realism_feedback.yml index 07091f2..a37aa66 100644 --- a/.github/ISSUE_TEMPLATE/realism_feedback.yml +++ b/.github/ISSUE_TEMPLATE/realism_feedback.yml @@ -55,7 +55,7 @@ body: label: Claim description: What does the dataset get wrong, and what would you expect instead? One paragraph. placeholder: | - The intermediate tier shows a 22% conversion rate on the manufacturing industry slice, identical (within noise) to the rate on healthcare and retail. In real procurement / AP automation, manufacturing typically converts ~1.5x healthcare because manufacturing already has discrete-item AP volume that benefits more from automation. The dataset's conversion rate should differ across industries by 1.3-2x, not be flat. + The intermediate tier shows a 22% conversion rate on the manufacturing industry slice, identical (within noise) to the rates on logistics, healthcare_non_clinical, and professional_services. In real procurement / AP automation, manufacturing typically converts ~1.5x healthcare_non_clinical because manufacturing already has discrete-item AP volume that benefits more from automation. The dataset's conversion rate should differ across industries by 1.3-2x, not be flat. validations: required: true