leadforge-dev · shaypal5 · May 6, 2026
diff --git a/.agent-plan.md b/.agent-plan.md
@@ -40,9 +40,10 @@ Goal: ship a best-in-class educational synthetic CRM lead-scoring dataset family
 - [x] PR 3.3: `scripts/validate_release_candidate.py` (new) — release-candidate driver. Orchestrates `regenerate_tier_for_seeds(spec, seeds, workdir)` × N=5 (default) per tier, calls `measure_release_quality`, runs `run_split_probes` against each tier's canonical seed, renders the JSON / markdown / figure contract via `render_report`, and gates on YAML-declared bands. Flags: `--release-dir`, `--workdir`, `--out-dir`, `--bands`, `--seeds`, `--cohort-canonical-seed`, `--tiers`, `--quick` (N=2 with 500-lead populations; ~20s end-to-end), `--no-rebuild` (reuses workdir for fast band-tweak iteration). Exit codes: 0 pass / 1 gate failure / 2 pre-flight error. Driver vs `leadforge validate` boundary documented in the script docstring (one-bundle structural contract vs. cross-seed × cross-tier release-readiness panel — complementary, not merged). `leadforge/validation/difficulty.py` extended with `BandSpec` / `TierBands` / `LeakageProbeBands` / `AcceptanceBands` / `GateFailure` dataclasses and `load_bands` / `check_release_bands` (consumes `ReleaseQualityReport` + per-tier `LeakageReport`s, returns `list[GateFailure]`). G7.4.4 (cross-tier GBM−LR positivity) softened to follow per-tier `gbm_minus_lr_auc` bands rather than hard-fail on the boolean — the v1 dataset's snapshot is dominated by linear features and HistGBM does not consistently beat LR; documented as a known v1→v2 finding with the cross-tier check tracked as informational. `docs/release/v1_acceptance_gates_bands.yaml` (new) is the operational source of truth for numeric bands; `docs/release/v1_acceptance_gates.md` updated to remove every `TBD-*` placeholder and to record medians + rationale per gate. `release/_release_quality/` workdir gitignored; `release/validation/` (validation_report.{json,md} + 7 pinned figures: lift_curve_{intro,intermediate,advanced}, calibration_intermediate, leakage_delta, cohort_shift, value_capture) committed. 
New tests: `tests/validation/test_difficulty_bands.py` (29 tests over band parsing / per-tier checks / cross-seed spread / cohort shift / cross-tier ordering / leakage findings / GateFailure immutability) and `tests/scripts/test_validate_release_candidate.py` (19 tests over CLI helpers, mocked pipeline, end-to-end --quick run); 1152/1152 tests pass; ruff + mypy clean; `scripts/probe_relational_leakage.py release/{intro,intermediate,advanced} --max-accuracy 0.65` exits 0 on every public tier; `scripts/verify_hash_determinism.py` PASS 67/67 files identical; `BUNDLE_SCHEMA_VERSION` unchanged at 5 (purely additive driver+gating layer). First authentic full-release run baseline (seeds 42–46): intro AP 0.7608 / LR AUC 0.879 / GBM AUC 0.873; intermediate AP 0.5752 / LR AUC 0.886 / GBM AUC 0.876; advanced AP 0.3514 / LR AUC 0.886 / GBM AUC 0.873; cross-tier AP / P@100 / conversion-rate ordering all hold; GBM−LR delta is slightly negative in every tier (−0.0045 / −0.0072 / −0.0133 — the v1→v2 finding above).
 
 ### Phase 4 — Channel-signal audit + dataset card hardening
-- [ ] `scripts/audit_channel_signal.py` → `docs/release/channel_signal_audit.md`
-- [ ] `release/README.md` rewrite (release-grade dataset card; macro-framing paragraph; simulation-simplifications section)
-- [ ] `docs/release/{generation_method,feature_dictionary}.md`
+- [x] PR 4.1: `scripts/audit_channel_signal.py` (new): analysis driver. For each tier (and each of `lead_source` / `first_touch_channel`), computes the per-channel conversion rate and a univariate AUC in which each lead is scored by the empirical positive rate of its channel (a 1-D Bayes classifier, equivalent to a saturated LR on one-hot channel features). Writes `docs/release/channel_signal_audit.{md,json}`. CLI: `--release-dir`, `--tier`, `--task`, `--channel-column`, `--out-md`, `--out-json`, `--print`. Determinism guarded by `tests/scripts/test_audit_channel_signal.py` (10 tests: per-channel rollup, closed-form univariate AUC, single-class fallback, missing-column error, build/render round-trip, byte-identical re-run against the committed `release/` bundles, error paths). Audit verdict on the canonical PR 2.2 bundles: **weak channel signal**. Across all three tiers and both channel columns the largest per-channel rate spread is 0.043 and the largest univariate AUC is 0.523 (out-of-sample, advanced tier; the largest in-sample value is 0.521). For context only, the G2 / Gemini v2 industry MQL→SQL rates span a far wider range (SEO ~51%, PPC ~26%, Email <1%), but they measure a single funnel step and are not directly comparable to v1's 90-day label. v1 drives conversion through motif-family hazards keyed off latent traits, not channel-conditional probabilities; channel-conditional encoding is tracked in `docs/release/post_v1_roadmap.md`.
+- [x] PR 4.1: `docs/release/generation_method.md` (new) — standalone DGP summary written for external readers (Kaggle/HF). Reads alone, references `docs/leadforge_architecture_spec.md`. Covers the five generation layers (motif families → mechanism layer → population → 90-day daily simulation → snapshot rendering), bundle output contract, public-vs-instructor split, calibration / validation, and an explicit "what this is not" boundary. Satisfies G10.2.
+- [x] PR 4.1: `docs/release/feature_dictionary.md` (new) — narrative companion to the per-bundle `feature_dictionary.csv`. Groups every public-mode column by analytical role (lead identity / firmographics / personographics / engagement / funnel / value / leakage trap / target), documents difficulty modulation parameters, modelling defaults, and the deliberate `total_touches_all` trap. Satisfies G10.3.
+- [x] PR 4.1: `release/README.md` (substantial rewrite) — release-grade dataset card per Datasheets-for-Datasets / Data Cards Playbook checklist (G10.1). New sections: macro framing paragraph (2024–2026 SaaS context, recommendation #19), simulation simplifications (modelled / approximate / not modelled, per chatgpt v2 §2.6), calibration documentation linking to `release/validation/validation_report.md`, public-vs-instructor redaction policy with concrete column lists citing `BANNED_LEAD_COLUMNS` / `BANNED_OPP_COLUMNS` / `BANNED_TABLES` / `SNAPSHOT_FILTERED_TABLES` from `leadforge/validation/leakage_probes.py`, intended-use vs out-of-scope-use, known limitations (G7.4.4 GBM−LR sign finding, weak channel signal from the Phase 4 audit, flat AUC across tiers, small cohort-shift gap), composition section per Datasheets format, adversarial-framing pointer (placeholder link to `docs/release/break_me_guide.md` that lands in PR 6.3), and a maintenance plan. Every realism / calibration / difficulty claim in the card is anchored to `validation_report.md` per G10.6. `BUNDLE_SCHEMA_VERSION` unchanged at 5 (documentation-only PR); 1167/1167 tests pass; ruff + mypy clean; `scripts/probe_relational_leakage.py release/{intro,intermediate,advanced} --max-accuracy 0.65` exits 0 on every public tier; `scripts/verify_hash_determinism.py` PASS 67/67; `scripts/validate_release_candidate.py --no-rebuild` exits 0.
 
 ### Phase 5 — Platform packaging
 - [ ] `scripts/package_kaggle_release.py` → `release/kaggle/dataset-metadata.json`

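Review note: the PR 3.3 entry above describes band gating (`load_bands` / `check_release_bands` returning `list[GateFailure]`, exit code 1 on any failure). The real dataclass fields are not shown in this diff, so the following is only a minimal sketch of that pattern. `Band`, `Failure`, `check_bands`, and `exit_code` are hypothetical stand-ins, and the band edges are invented for illustration; only the two metric values come from the seeds 42–46 baseline quoted in the plan.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Band:
    """Hypothetical stand-in for `BandSpec`: an inclusive [low, high] range."""

    low: float | None = None
    high: float | None = None

    def contains(self, value: float) -> bool:
        above = self.low is None or value >= self.low
        below = self.high is None or value <= self.high
        return above and below


@dataclass(frozen=True)
class Failure:
    """Hypothetical stand-in for `GateFailure` (immutable record of one miss)."""

    tier: str
    metric: str
    value: float
    band: Band


def check_bands(metrics: dict, bands: dict) -> list[Failure]:
    """Return one Failure per (tier, metric) whose value falls outside its band."""
    failures = []
    for tier, tier_bands in bands.items():
        for metric, band in tier_bands.items():
            value = metrics[tier][metric]
            if not band.contains(value):
                failures.append(Failure(tier, metric, value, band))
    return failures


def exit_code(failures: list[Failure]) -> int:
    """0 = pass, 1 = gate failure (the pre-flight code 2 is not modelled here)."""
    return 1 if failures else 0


# Metric values are the intermediate-tier baseline from the plan;
# the band edges below are invented, not the YAML-declared ones.
METRICS = {"intermediate": {"ap": 0.5752, "gbm_minus_lr_auc": -0.0072}}
BANDS = {"intermediate": {"ap": Band(0.45, 0.70), "gbm_minus_lr_auc": Band(-0.03, 0.03)}}
STRICT = {"intermediate": {"gbm_minus_lr_auc": Band(low=0.0)}}
```

With the symmetric band, the slightly negative GBM−LR delta passes; with the `STRICT` band (delta must be non-negative) it produces a single failure and a non-zero exit code, which is the behaviour the softened G7.4.4 gate is meant to avoid.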
diff --git a/docs/release/channel_signal_audit.json b/docs/release/channel_signal_audit.json
@@ -0,0 +1,241 @@
+{
+  "channel_columns": [
+    "lead_source",
+    "first_touch_channel"
+  ],
+  "industry_mql_to_sql_benchmarks": {
+    "Email": 0.005,
+    "PPC": 0.26,
+    "SEO": 0.51
+  },
+  "label_column": "converted_within_90_days",
+  "release_dir": "release",
+  "task": "converted_within_90_days",
+  "tiers": [
+    {
+      "columns": [
+        {
+          "channels": [
+            {
+              "conversion_rate": 0.43439490445859874,
+              "n": 1570,
+              "n_converted": 682,
+              "name": "inbound_marketing",
+              "share": 0.44857142857142857
+            },
+            {
+              "conversion_rate": 0.39111747851002865,
+              "n": 698,
+              "n_converted": 273,
+              "name": "partner_referral",
+              "share": 0.19942857142857143
+            },
+            {
+              "conversion_rate": 0.4025974025974026,
+              "n": 1232,
+              "n_converted": 496,
+              "name": "sdr_outbound",
+              "share": 0.352
+            }
+          ],
+          "column": "lead_source",
+          "n_test": 750,
+          "n_train": 3500,
+          "rate_spread": 0.04327742594857009,
+          "test_conversion_rate": 0.4266666666666667,
+          "train_conversion_rate": 0.4145714285714286,
+          "univariate_auc_in_sample": 0.5199794894149169,
+          "univariate_auc_out_of_sample": 0.5013517441860464
+        },
+        {
+          "channels": [
+            {
+              "conversion_rate": 0.43439490445859874,
+              "n": 1570,
+              "n_converted": 682,
+              "name": "inbound_marketing",
+              "share": 0.44857142857142857
+            },
+            {
+              "conversion_rate": 0.39111747851002865,
+              "n": 698,
+              "n_converted": 273,
+              "name": "partner_referral",
+              "share": 0.19942857142857143
+            },
+            {
+              "conversion_rate": 0.4025974025974026,
+              "n": 1232,
+              "n_converted": 496,
+              "name": "sdr_outbound",
+              "share": 0.352
+            }
+          ],
+          "column": "first_touch_channel",
+          "n_test": 750,
+          "n_train": 3500,
+          "rate_spread": 0.04327742594857009,
+          "test_conversion_rate": 0.4266666666666667,
+          "train_conversion_rate": 0.4145714285714286,
+          "univariate_auc_in_sample": 0.5199794894149169,
+          "univariate_auc_out_of_sample": 0.5013517441860464
+        }
+      ],
+      "n_test": 750,
+      "n_train": 3500,
+      "test_conversion_rate": 0.4266666666666667,
+      "tier": "intro",
+      "train_conversion_rate": 0.4145714285714286
+    },
+    {
+      "columns": [
+        {
+          "channels": [
+            {
+              "conversion_rate": 0.21273885350318472,
+              "n": 1570,
+              "n_converted": 334,
+              "name": "inbound_marketing",
+              "share": 0.44857142857142857
+            },
+            {
+              "conversion_rate": 0.17621776504297995,
+              "n": 698,
+              "n_converted": 123,
+              "name": "partner_referral",
+              "share": 0.19942857142857143
+            },
+            {
+              "conversion_rate": 0.2012987012987013,
+              "n": 1232,
+              "n_converted": 248,
+              "name": "sdr_outbound",
+              "share": 0.352
+            }
+          ],
+          "column": "lead_source",
+          "n_test": 750,
+          "n_train": 3500,
+          "rate_spread": 0.03652108846020477,
+          "test_conversion_rate": 0.22266666666666668,
+          "train_conversion_rate": 0.20142857142857143,
+          "univariate_auc_in_sample": 0.5212431012826857,
+          "univariate_auc_out_of_sample": 0.5139326835180411
+        },
+        {
+          "channels": [
+            {
+              "conversion_rate": 0.21273885350318472,
+              "n": 1570,
+              "n_converted": 334,
+              "name": "inbound_marketing",
+              "share": 0.44857142857142857
+            },
+            {
+              "conversion_rate": 0.17621776504297995,
+              "n": 698,
+              "n_converted": 123,
+              "name": "partner_referral",
+              "share": 0.19942857142857143
+            },
+            {
+              "conversion_rate": 0.2012987012987013,
+              "n": 1232,
+              "n_converted": 248,
+              "name": "sdr_outbound",
+              "share": 0.352
+            }
+          ],
+          "column": "first_touch_channel",
+          "n_test": 750,
+          "n_train": 3500,
+          "rate_spread": 0.03652108846020477,
+          "test_conversion_rate": 0.22266666666666668,
+          "train_conversion_rate": 0.20142857142857143,
+          "univariate_auc_in_sample": 0.5212431012826857,
+          "univariate_auc_out_of_sample": 0.5139326835180411
+        }
+      ],
+      "n_test": 750,
+      "n_train": 3500,
+      "test_conversion_rate": 0.22266666666666668,
+      "tier": "intermediate",
+      "train_conversion_rate": 0.20142857142857143
+    },
+    {
+      "columns": [
+        {
+          "channels": [
+            {
+              "conversion_rate": 0.08152866242038216,
+              "n": 1570,
+              "n_converted": 128,
+              "name": "inbound_marketing",
+              "share": 0.44857142857142857
+            },
+            {
+              "conversion_rate": 0.07593123209169055,
+              "n": 698,
+              "n_converted": 53,
+              "name": "partner_referral",
+              "share": 0.19942857142857143
+            },
+            {
+              "conversion_rate": 0.07792207792207792,
+              "n": 1232,
+              "n_converted": 96,
+              "name": "sdr_outbound",
+              "share": 0.352
+            }
+          ],
+          "column": "lead_source",
+          "n_test": 750,
+          "n_train": 3500,
+          "rate_spread": 0.005597430328691616,
+          "test_conversion_rate": 0.07866666666666666,
+          "train_conversion_rate": 0.07914285714285714,
+          "univariate_auc_in_sample": 0.5083011208921436,
+          "univariate_auc_out_of_sample": 0.5225784296892246
+        },
+        {
+          "channels": [
+            {
+              "conversion_rate": 0.08152866242038216,
+              "n": 1570,
+              "n_converted": 128,
+              "name": "inbound_marketing",
+              "share": 0.44857142857142857
+            },
+            {
+              "conversion_rate": 0.07593123209169055,
+              "n": 698,
+              "n_converted": 53,
+              "name": "partner_referral",
+              "share": 0.19942857142857143
+            },
+            {
+              "conversion_rate": 0.07792207792207792,
+              "n": 1232,
+              "n_converted": 96,
+              "name": "sdr_outbound",
+              "share": 0.352
+            }
+          ],
+          "column": "first_touch_channel",
+          "n_test": 750,
+          "n_train": 3500,
+          "rate_spread": 0.005597430328691616,
+          "test_conversion_rate": 0.07866666666666666,
+          "train_conversion_rate": 0.07914285714285714,
+          "univariate_auc_in_sample": 0.5083011208921436,
+          "univariate_auc_out_of_sample": 0.5225784296892246
+        }
+      ],
+      "n_test": 750,
+      "n_train": 3500,
+      "test_conversion_rate": 0.07866666666666666,
+      "tier": "advanced",
+      "train_conversion_rate": 0.07914285714285714
+    }
+  ]
+}
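Review note: the per-channel counts in the JSON above are sufficient to recompute `rate_spread` and `univariate_auc_in_sample` by hand, which is a quick sanity check on the committed file. The sketch below is not taken from `scripts/audit_channel_signal.py`; the function names are illustrative, and it assumes every channel has a distinct conversion rate (two channels with exactly equal rates would need to be merged first).

```python
# (n, n_converted) per channel, copied from the intro-tier `lead_source` block.
INTRO_COUNTS = {
    "inbound_marketing": (1570, 682),
    "partner_referral": (698, 273),
    "sdr_outbound": (1232, 496),
}


def rate_spread(counts):
    """Max minus min per-channel conversion rate."""
    rates = [pos / n for n, pos in counts.values()]
    return max(rates) - min(rates)


def in_sample_auc(counts):
    """Closed-form AUC when every lead is scored by its channel's own rate.

    A positive beats every negative in a lower-rate channel and ties
    (worth 0.5) with every negative in its own channel.
    """
    # (rate, positives, negatives), highest rate first.
    groups = sorted(
        ((pos / n, pos, n - pos) for n, pos in counts.values()),
        reverse=True,
    )
    total_pos = sum(pos for _, pos, _ in groups)
    total_neg = sum(neg for _, _, neg in groups)
    wins = 0.0
    neg_below = total_neg
    for _, pos, neg in groups:
        neg_below -= neg  # negatives in strictly lower-rate channels
        wins += pos * neg_below + 0.5 * pos * neg
    return wins / (total_pos * total_neg)


if __name__ == "__main__":
    print(round(rate_spread(INTRO_COUNTS), 4))
    print(round(in_sample_auc(INTRO_COUNTS), 4))
```

For the intro tier this reproduces the committed values to within floating-point tolerance (spread ≈ 0.0433, in-sample AUC ≈ 0.5200).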
diff --git a/docs/release/channel_signal_audit.md b/docs/release/channel_signal_audit.md
@@ -0,0 +1,66 @@
+# Channel-signal audit — leadforge-lead-scoring-v1
+
+Audit produced by `scripts/audit_channel_signal.py`; see `channel_signal_audit.json` for the machine-readable form.
+
+**Scope.** For every tier we compute per-channel conversion rates on the train split and the univariate AUC of channel against `converted_within_90_days`, scored as the empirical positive rate per channel (a 1-D Bayes classifier). Two AUCs are reported: an **in-sample** number (train rates → train labels — biased upward by construction) and an **out-of-sample** number (train rates → test labels — directly comparable to the `source_only` baselines in `release/validation/validation_report.json`).
+
+**Caveat on the industry benchmark.** The G2 / Gemini v2 numbers below are single-step **MQL→SQL** rates (recommendation #8 in `docs/external_review/summaries/recommendations_pass.md`). v1's label is **90-day closed-won**, which resolves the entire funnel rather than a single step. The two metrics are not directly comparable; the table is reproduced for context only.
+
+## Industry benchmark (context, not target)
+
+| Channel | MQL→SQL conversion rate |
+|---|---|
+| Email | 0.50% |
+| PPC | 26.00% |
+| SEO | 51.00% |
+
+## Tier: `intro`
+
+`n_train = 3500` (90-day conversion rate 41.46%); `n_test = 750` (rate 42.67%).
+
+### Columns: `lead_source`, `first_touch_channel` (audit values identical)
+
+Per-channel rate spread (max − min): **0.0433**  ·  In-sample univariate AUC: **0.5200**  ·  Out-of-sample univariate AUC: **0.5014**
+
+| Channel | n (train) | Share (train) | Converted (train) | Train rate |
+|---|---:|---:|---:|---:|
+| `inbound_marketing` | 1570 | 44.86% | 682 | 43.44% |
+| `partner_referral` | 698 | 19.94% | 273 | 39.11% |
+| `sdr_outbound` | 1232 | 35.20% | 496 | 40.26% |
+
+## Tier: `intermediate`
+
+`n_train = 3500` (90-day conversion rate 20.14%); `n_test = 750` (rate 22.27%).
+
+### Columns: `lead_source`, `first_touch_channel` (audit values identical)
+
+Per-channel rate spread (max − min): **0.0365**  ·  In-sample univariate AUC: **0.5212**  ·  Out-of-sample univariate AUC: **0.5139**
+
+| Channel | n (train) | Share (train) | Converted (train) | Train rate |
+|---|---:|---:|---:|---:|
+| `inbound_marketing` | 1570 | 44.86% | 334 | 21.27% |
+| `partner_referral` | 698 | 19.94% | 123 | 17.62% |
+| `sdr_outbound` | 1232 | 35.20% | 248 | 20.13% |
+
+## Tier: `advanced`
+
+`n_train = 3500` (90-day conversion rate 7.91%); `n_test = 750` (rate 7.87%).
+
+### Columns: `lead_source`, `first_touch_channel` (audit values identical)
+
+Per-channel rate spread (max − min): **0.0056**  ·  In-sample univariate AUC: **0.5083**  ·  Out-of-sample univariate AUC: **0.5226**
+
+| Channel | n (train) | Share (train) | Converted (train) | Train rate |
+|---|---:|---:|---:|---:|
+| `inbound_marketing` | 1570 | 44.86% | 128 | 8.15% |
+| `partner_referral` | 698 | 19.94% | 53 | 7.59% |
+| `sdr_outbound` | 1232 | 35.20% | 96 | 7.79% |
+
+## Discussion
+
+The numbers above answer one question: *how strongly does channel alone signal 90-day conversion in v1?* They do not answer *whether v1 matches industry channel performance*, since the benchmarks measure a different funnel transition (single MQL→SQL step) and v1 measures the entire funnel resolved over 90 days. Treat the v1 numbers as an internal description of the simulator's channel signal.
+
+Two empirical observations a reader can make from the numbers above:
+
+1. **The out-of-sample univariate AUC is the comparable number** for any external baseline. It scores held-out test labels with train-derived rates, the same setup as the `source_only` HistGBM baseline reported in `release/validation/validation_report.json`, which is built on the same task splits with `lead_source` + `first_touch_channel` as the only features. The in-sample number is biased upward in expectation and is reported for transparency rather than comparison. At v1's N that bias is small relative to sampling noise: in the `advanced` tier the out-of-sample AUC (0.5226) actually exceeds the in-sample one (0.5083).
+2. **The numerical conclusion is bundle-specific.** A small per-channel rate spread combined with an OOS univariate AUC close to chance means channel alone is a weak feature for the bundle the audit was run against. v1's bundles currently produce that outcome (spreads of 0.0056 to 0.0433 and OOS AUCs of 0.5014 to 0.5226 in the per-tier sections above), which is consistent with the design: the simulator drives conversion through motif-family hazards keyed off latent traits, not channel-conditional probabilities. Channel-conditional encoding is tracked as post-v1 work in `docs/release/post_v1_roadmap.md`.
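Review note: the out-of-sample number described in the audit ("train rates → test labels") can be sketched in a few lines of dependency-free Python. This is an illustration under stated assumptions, not the code in `scripts/audit_channel_signal.py`: the function names are made up, unseen test channels fall back to the overall train rate, and a single-class test split returns NaN (the audit's actual single-class fallback is not shown in this diff). The data is a toy example, not drawn from the release bundles.

```python
import math
from collections import defaultdict


def fit_channel_rates(channels, labels):
    """Empirical positive rate per channel on the train split."""
    n = defaultdict(int)
    pos = defaultdict(int)
    for channel, label in zip(channels, labels):
        n[channel] += 1
        pos[channel] += label
    rates = {channel: pos[channel] / n[channel] for channel in n}
    overall = sum(labels) / len(labels)
    return rates, overall


def auc(scores, labels):
    """Pairwise (Mann-Whitney) AUC; tied scores count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return math.nan  # assumption: undefined when only one class is present
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def univariate_auc(train_channels, train_labels, eval_channels, eval_labels):
    """Score eval rows with train-derived channel rates, then compute AUC."""
    rates, overall = fit_channel_rates(train_channels, train_labels)
    scores = [rates.get(channel, overall) for channel in eval_channels]
    return auc(scores, eval_labels)


# Toy data: channel "a" converts at 2/3 in train, channel "b" at 1/3.
TRAIN_X = ["a", "a", "a", "b", "b", "b"]
TRAIN_Y = [1, 1, 0, 0, 0, 1]
TEST_X = ["a", "a", "b", "b"]
TEST_Y = [1, 0, 0, 0]

IN_SAMPLE = univariate_auc(TRAIN_X, TRAIN_Y, TRAIN_X, TRAIN_Y)
OUT_OF_SAMPLE = univariate_auc(TRAIN_X, TRAIN_Y, TEST_X, TEST_Y)
```

Passing the train split as the evaluation set gives the in-sample figure; passing the test split gives the out-of-sample figure. The pairwise loop is quadratic, which is fine at 3500 / 750 rows but would warrant a rank-based formulation on larger data.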