leadforge-dev · shaypal5 · May 7, 2026 · May 7, 2026 · May 7, 2026 · May 7, 2026
diff --git a/.agent-plan.md b/.agent-plan.md
@@ -60,15 +60,12 @@ Goal: ship a best-in-class educational synthetic CRM lead-scoring dataset family
 ### Phase 6 — Notebook sequence + adversarial framing
 - [x] PR 6.1: `release/notebooks/01_baseline_lead_scoring.ipynb` refreshed and `release/notebooks/02_relational_feature_engineering.ipynb` added.  Notebook 01 trains LR + HistGBM on the public `intermediate` bundle using the **same feature set as the validation report** (drops only IDs and the label, mirrors `release_quality._partition_columns`), so the G13.2 reproduction gate compares apples to apples.  This means notebook 01 **keeps** `total_touches_all` (the documented leakage trap) — narrative cell calls it out explicitly and forward-points to notebook 03 (PR 6.2) which dissects what dropping the trap does to performance.  Notebook 02 by contrast **drops** the trap from the flat baseline so the relational lift attribution stays clean (its goal is teaching feature engineering, not reproducing the report).  Targets are loaded at runtime from `release/notebooks/_release_targets.json` (audit-synced against `release/validation/validation_report.json` by `tests/release/notebooks/test_release_targets_match_report.py`); per-metric tolerances replace the original flat ±0.05 (AUC/Brier ±0.02, AP / top-decile ±0.05).  Notebook 02 loads the seven snapshot-safe public tables, asserts every event-table `timestamp <= lead_created_at + snapshot_day` inline (with real min-headroom-under-cutoff readings, not a hardcoded literal), demonstrates four legal joins (touch-channel breakdown, account-level density fit on **train leads only**, sales-activity recency, train-only industry target encoding), trains LR + GBM on flat-baseline-only and flat+relational features, prints a 4-row metric panel + delta panel, and pins the four model AUCs and the headline `GBM(eng) − GBM(flat)` lift via `assert_within_tolerance` (sign-aware `assert lift > 0` on top of the absolute tolerance).  Honest takeaway cell frames the +0.0147 AUC lift as suggestive, not conclusive (the cross-seed `gbm_auc` spread on this bundle is ~0.027); seed-sweep harness lands in PR 6.2's notebook 04.  Both notebooks ship inside the public release bundle alongside the parquet tables (Kaggle/HF consumers download them together) so they import a sibling `release/notebooks/_notebook_utils.py` rather than rely on the `leadforge` package — `precision_at_k` and `top_decile_rate` mirror `release_quality._precision_at_k` / `_top_decile_rate` (locked in by mirror tests), and `assert_within_tolerance` is hardened against silent passes on non-finite metrics or incomplete per-metric tolerance maps.  G13.1 acceptance gate wired: new `[notebooks]` extra (`nbclient`, `nbformat`, `scikit-learn`, `matplotlib`) and a dedicated `notebooks` CI job that regenerates the intermediate bundle via `python scripts/build_public_release.py release --tier intermediate` (only tier the notebooks need) then nbclient-executes both notebooks end-to-end (`tests/release/notebooks/test_execute_notebooks.py`, parametrised, gated on bundles-present).  G13.3 path discipline enforced inline: notebook 01 hard-codes `BUNDLE = Path("../intermediate")` and asserts `manifest.exposure_mode == "student_public"`; notebook 02 explicitly excludes `customers`/`subscriptions` per `BANNED_TABLES`.  Builders (`scripts/build_release_notebook_{01,02}.py`, sharing `scripts/_release_notebook_common.py`) emit deterministic byte-for-byte notebook JSON via explicit `cell_NNN` IDs (audit-artifact-sync pattern from PR 4.1 / 5.1 / 5.2, locked in by `tests/scripts/test_release_notebook_builders.py` which builds twice into `tmp_path` via the new `--out PATH` flag and diffs against the committed file without ever touching the working tree) and shell out to `ruff format` on the emitted file so builder output and pre-commit hook agree.  Net: 1250/1250 tests pass + 5 publish-extra-gated skips; ruff + mypy clean; leakage probes 0/3 on every tier; hash determinism PASS 67/67; `validate_release_candidate --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5.
 - [x] PR 6.2: `release/notebooks/03_leakage_and_time_windows.ipynb` and `release/notebooks/04_lift_calibration_value_ranking.ipynb` added.  Notebook 03 turns the documented `total_touches_all` trap into a teaching moment: reads the trap label off `feature_dictionary.csv`, proves the trap by construction via a same-table comparison of `total_touches_all` (full-horizon) vs `touch_count` (snapshot-safe) — the post-snapshot delta sums to ~3.2 touches/lead and 82 % of leads have a positive delta — and then runs a standalone-AUC probe on the trap (~0.53 AUC, looks innocuous) followed by a side-by-side full-panel ± trap ablation that shows HistGBM extracts ~+0.032 AUC from the same column LR can only squeeze ~+0.009 from.  The reframed pedagogy (vs the prompt's original "trap dominates a thin firmographic set" framing) is empirically driven: firmographic-only is at chance AUC even with the trap, but the GBM-vs-LR asymmetry on the strong panel is a real and useful finding — *standalone AUC probes undersell tree-friendly leakage*.  Sign-aware tolerance gate pins each AUC ±0.02 and asserts `gbm_lift > 0.015` so a future regeneration that erases the trap or accidentally amplifies it breaks CI.  Notebook 04 covers the four extra ranking lenses AUC alone misses: calibration / reliability diagram (max bin error ≈ 0.13), lift + cumulative gains (top-decile lift 2.75×), value-aware ranking via `expected_acv × P(convert)` (top-50 ACV-capture jumps from 0.16 to 0.40), threshold selection for fixed top-K capacity, cohort-shift evaluation (HistGBM on the first 85 % chronologically → score the last 15 %, mirrors `release_quality.measure_cohort_shift_from_bundle` down to `COHORT_TRAIN_FRAC=0.85` and `model_random_state=0`, **reproduces the report's `cohort_shift.intermediate` block exactly**: 0.8754 / 0.8908 / −0.0155), and a 200-iter bootstrap of the test-set AUC/AP as the within-bundle confidence band that public-bundle consumers (Kaggle / HF) can run without `leadforge` installed (the prompt's "seed-sweep harness" with bootstrap honestly acknowledged as the proxy for true cross-seed sweep, since rebuilding bundles isn't an option for downstream users).  Cohort-shift values are pinned via a new `cohort_shift.intermediate` block in `release/notebooks/_release_targets.json` (audit-synced against `validation_report.cohort_shift.intermediate` by a new `test_cohort_shift_targets_match_validation_report` extension to the existing audit-sync test).  Headline LR/GBM panel **drops** `total_touches_all` (matches notebook 02's posture, gives honest production numbers); cohort-shift section deliberately **keeps** the trap to reproduce the report's published cohort-shift numbers exactly — divergent posture explained inline.  Both new builders (`scripts/build_release_notebook_{03,04}.py`) inherit the deterministic-cell-ID + `--out` byte-stability pattern from PR 6.1 and are added to `_BUILDERS` / `_NOTEBOOKS` in `tests/scripts/test_release_notebook_builders.py` and `tests/release/notebooks/test_execute_notebooks.py`.  Both notebooks execute end-to-end in <10s each (well under G13.1's 3-min budget), assert `manifest.exposure_mode == "student_public"` (G13.3), and load only from `release/intermediate/`.  Forward-pointer to `docs/release/break_me_guide.md` left as plain backtick-wrapped text — file lands in PR 6.3, no dead Markdown link.  Net: 1260/1260 tests pass + 5 publish-extra-gated skips; ruff + mypy clean; leakage probes 0/3 on every tier; hash determinism PASS 67/67; `validate_release_candidate --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5.
-- [ ] `.github/ISSUE_TEMPLATE/{dataset_breakage_report,realism_feedback}.yml`
-- [ ] `docs/release/{break_me_guide,v2_decision_log}.md`
-
-### Phase 7 — LLM critique + publish
-- [ ] `leadforge/validation/llm_critique.py` (single-provider, env-var creds, skips cleanly)
-- [ ] `docs/release/llm_critique_prompt.md` + `scripts/run_llm_critique.py`
-- [ ] Adjudicate any high-severity findings (resolve in code or document in `v2_decision_log.md`)
-- [ ] `scripts/{publish_kaggle,publish_hf}.py` (dry-run → private/draft → public)
-- [ ] Tag `leadforge-lead-scoring-v1`; `docs/release/v1_release_notes.md`
+- [x] PR 6.3: adversarial framing landed.  `docs/release/break_me_guide.md` (new) — meta-recipe playbook organised as a 4-step recipe (read the dictionary → ablate, don't just probe → check the time window → treat the train/test split as untrusted) + 9 patterns grouped by category (leakage / split discipline / metric and ranking traps / robustness and realism).  Each pattern carries a "how to detect on any dataset" recipe and a "worked example" pointer back into the v1 bundle (notebook §, validation_report JSON path, or feature_dictionary.csv field), so the guide extends the notebooks rather than duplicating them.  Three explicit promises notebook 04 §10 made are delivered: target-encoding leakage on test (pattern 4, anchored on NB02 §4.4), train-test contamination via `account_id` overlap (pattern 5, with the honest "v1 only checks `lead_id`, not `account_id`" caveat), cohort-by-segment evaluation (pattern 6, extends NB04 §7's tier-wide cohort-shift to per-segment using the actual segment columns: `industry`, `region`, `employee_band`, `estimated_revenue_band`).  Other 6 patterns: naming smells, standalone-AUC vs tree-ablation gap (NB03 finding generalised), time-window violations on engineered features (with the `customers`-table example), value-aware ranking surprises (P × ACV vs P-only), threshold-vs-rank ties at the operating point (NB04 §6 finding), calibration drift across cohorts and segments.  Triage-label table at the top (`critical-leakage` / `realism` / `difficulty` / `documentation` / `platform` / `notebook` / `pedagogy` / `v2-idea` / `out-of-scope-v1`) gives reporters a vocabulary; the same labels are auto-applied (`needs-triage`) by the issue templates.  `docs/release/v2_decision_log.md` (new, empty stub) — schema documented in the file's preamble (7 columns: `received_at` / `source` / `topic` / `severity` / `verdict` / `next_step` / `link`; verdict vocabulary `accepted-for-v2` / `deferred` / `wont-fix` / `needs-investigation` with explicit semantics for each).  `.github/ISSUE_TEMPLATE/dataset_breakage_report.yml` (new) and `.github/ISSUE_TEMPLATE/realism_feedback.yml` (new) — GitHub Issue Forms YAML, both carry the `dataset: leadforge-lead-scoring-v1` + `needs-triage` labels.  Breakage report: tier dropdown (intro / intermediate / advanced / instructor / multiple), seed input (default 42), bundle hash field (validation: required), suggested triage label dropdown, severity dropdown (high/medium/low), summary, minimal repro, expected-vs-actual citing JSON paths, environment, two confirmation checkboxes (read break-me guide; reporting on as-shipped bundle).  Realism feedback: aspect dropdown (industry mix / persona / funnel timing / channel / pricing / account-to-lead density / region / other), tier(s)-affected dropdown, domain-experience one-liner (required — helps weight findings), claim, data observation (with concrete pandas-snippet placeholder example), suggested fix (optional), severity, two confirmations (read README "Known limitations"; checked post_v1_roadmap + v2_decision_log).  Notebook 03 §7 and notebook 04 §10 forward-pointers upgraded from plain `docs/release/break_me_guide.md` text to Markdown links pointing at the GitHub blob URL (`https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md`) — relative path would break on Kaggle/HF where notebooks ship without the `docs/` tree, the blob URL works in both contexts.  `release/README.md` "Maintenance, adversarial framing, license" section rewritten: dead "(PR 6.3)" forward-pointers replaced with real Markdown links to the break-me guide, both issue templates, and the v2 decision log; `_release_common.py`'s existing `](../foo)` → GitHub-blob-URL rewriter handles the Kaggle/HF rendering automatically (verified by the regenerated `release/kaggle/dataset-metadata.json` and `release/huggingface/README.md` sync tests).  Hostile-reviewer self-review caught two factual hallucinations in the first revision before they shipped: claimed "15 industries" for `industry` (actually 4: logistics / healthcare_non_clinical / manufacturing / professional_services) and used loose segment-column names ("employee tier", "ARR band") instead of the actual columns (`employee_band`, `estimated_revenue_band`); both fixed.  Net: 1260/1260 tests pass + 5 publish-extra-gated skips; ruff + mypy clean; leakage probes 0/3 on every tier; hash determinism PASS 67/67; `validate_release_candidate --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5 (this PR is documentation-only).  Phase 6 closed — Phase 7 (LLM critique + publish) is next.
+
+### Phase 7 — LLM critique + publish (3 PRs)
+- [ ] **PR 7.1** — `leadforge/validation/llm_critique.py` (single-provider, env-var creds, skips cleanly) + `docs/release/llm_critique_prompt.md` + `scripts/run_llm_critique.py`. Adjudicate any high-severity findings (resolve in code or document in `v2_decision_log.md`).
+- [ ] **PR 7.2** — local Kaggle + HuggingFace mock-page preview tooling (must land before PR 7.3): `scripts/preview_kaggle_page.py` and `scripts/preview_hf_page.py` render offline HTML mocks of the public Kaggle and HF dataset pages from the *exact* upload artefacts (metadata JSON, README, cover image), serve over `localhost`, and let the maintainer click through both pages in a browser before any platform upload — catches styling / link / YAML-rendering issues before they hit cached previews on the live page. Tests cover required-field presence, link resolution, schema column listing, configs-block round-trip.
+- [ ] **PR 7.3** — `scripts/{publish_kaggle,publish_hf}.py` (dry-run → local mock-page review → private/draft → public). Tag `leadforge-lead-scoring-v1`; `docs/release/v1_release_notes.md` (cites PR 7.2's preview commands as required pre-flight).
 
 ---