Merged
7 changes: 4 additions & 3 deletions .agent-plan.md
@@ -40,9 +40,10 @@ Goal: ship a best-in-class educational synthetic CRM lead-scoring dataset family
- [x] PR 3.3: `scripts/validate_release_candidate.py` (new) — release-candidate driver. Orchestrates `regenerate_tier_for_seeds(spec, seeds, workdir)` × N=5 (default) per tier, calls `measure_release_quality`, runs `run_split_probes` against each tier's canonical seed, renders the JSON / markdown / figure contract via `render_report`, and gates on YAML-declared bands. Flags: `--release-dir`, `--workdir`, `--out-dir`, `--bands`, `--seeds`, `--cohort-canonical-seed`, `--tiers`, `--quick` (N=2 with 500-lead populations; ~20s end-to-end), `--no-rebuild` (reuses workdir for fast band-tweak iteration). Exit codes: 0 pass / 1 gate failure / 2 pre-flight error. Driver vs `leadforge validate` boundary documented in the script docstring (one-bundle structural contract vs. cross-seed × cross-tier release-readiness panel — complementary, not merged). `leadforge/validation/difficulty.py` extended with `BandSpec` / `TierBands` / `LeakageProbeBands` / `AcceptanceBands` / `GateFailure` dataclasses and `load_bands` / `check_release_bands` (consumes `ReleaseQualityReport` + per-tier `LeakageReport`s, returns `list[GateFailure]`). G7.4.4 (cross-tier GBM−LR positivity) softened to follow per-tier `gbm_minus_lr_auc` bands rather than hard-fail on the boolean — the v1 dataset's snapshot is dominated by linear features and HistGBM does not consistently beat LR; documented as a known v1→v2 finding with the cross-tier check tracked as informational. `docs/release/v1_acceptance_gates_bands.yaml` (new) is the operational source of truth for numeric bands; `docs/release/v1_acceptance_gates.md` updated to remove every `TBD-*` placeholder and to record medians + rationale per gate. `release/_release_quality/` workdir gitignored; `release/validation/` (validation_report.{json,md} + 7 pinned figures: lift_curve_{intro,intermediate,advanced}, calibration_intermediate, leakage_delta, cohort_shift, value_capture) committed. 
New tests: `tests/validation/test_difficulty_bands.py` (29 tests over band parsing / per-tier checks / cross-seed spread / cohort shift / cross-tier ordering / leakage findings / GateFailure immutability) and `tests/scripts/test_validate_release_candidate.py` (19 tests over CLI helpers, mocked pipeline, end-to-end --quick run); 1152/1152 tests pass; ruff + mypy clean; `scripts/probe_relational_leakage.py release/{intro,intermediate,advanced} --max-accuracy 0.65` exits 0 on every public tier; `scripts/verify_hash_determinism.py` PASS 67/67 files identical; `BUNDLE_SCHEMA_VERSION` unchanged at 5 (purely additive driver+gating layer). First authentic full-release run baseline (seeds 42–46): intro AP 0.7608 / LR AUC 0.879 / GBM AUC 0.873; intermediate AP 0.5752 / LR AUC 0.886 / GBM AUC 0.876; advanced AP 0.3514 / LR AUC 0.886 / GBM AUC 0.873; cross-tier AP / P@100 / conversion-rate ordering all hold; GBM−LR delta is slightly negative in every tier (−0.0045 / −0.0072 / −0.0133 — the v1→v2 finding above).

### Phase 4 — Channel-signal audit + dataset card hardening
- [ ] `scripts/audit_channel_signal.py` → `docs/release/channel_signal_audit.md`
- [ ] `release/README.md` rewrite (release-grade dataset card; macro-framing paragraph; simulation-simplifications section)
- [ ] `docs/release/{generation_method,feature_dictionary}.md`
- [x] PR 4.1: `scripts/audit_channel_signal.py` (new) — analysis driver. For each tier (and each of `lead_source` / `first_touch_channel`), computes per-channel conversion rate + univariate AUC scored as the empirical positive rate per channel (a 1-D Bayes classifier, equivalent to a saturated LR on one-hot channel features). Writes `docs/release/channel_signal_audit.{md,json}`. CLI: `--release-dir`, `--tier`, `--task`, `--channel-column`, `--out-md`, `--out-json`, `--print`. Determinism guarded by `tests/scripts/test_audit_channel_signal.py` (10 tests: per-channel rollup, closed-form univariate AUC, single-class fallback, missing-column error, build/render round-trip, byte-identical re-run against the committed `release/` bundles, error paths). Audit verdict on the canonical PR 2.2 bundles: **weak channel signal** — across all three tiers and both channel columns the largest per-channel rate spread is 0.043 and the largest univariate AUC is 0.521, well below the G2 / Gemini v2 industry MQL→SQL band (SEO ~51%, PPC ~26%, Email <1%). v1 drives conversion through motif-family hazards keyed off latent traits, not channel-conditional probabilities; channel-conditional encoding is tracked in `docs/release/post_v1_roadmap.md`.
- [x] PR 4.1: `docs/release/generation_method.md` (new) — standalone DGP summary written for external readers (Kaggle/HF). Reads alone, references `docs/leadforge_architecture_spec.md`. Covers the five generation layers (motif families → mechanism layer → population → 90-day daily simulation → snapshot rendering), bundle output contract, public-vs-instructor split, calibration / validation, and an explicit "what this is not" boundary. Satisfies G10.2.
- [x] PR 4.1: `docs/release/feature_dictionary.md` (new) — narrative companion to the per-bundle `feature_dictionary.csv`. Groups every public-mode column by analytical role (lead identity / firmographics / personographics / engagement / funnel / value / leakage trap / target), documents difficulty modulation parameters, modelling defaults, and the deliberate `total_touches_all` trap. Satisfies G10.3.
- [x] PR 4.1: `release/README.md` (substantial rewrite) — release-grade dataset card per Datasheets-for-Datasets / Data Cards Playbook checklist (G10.1). New sections: macro framing paragraph (2024–2026 SaaS context, recommendation #19), simulation simplifications (modelled / approximate / not modelled, per chatgpt v2 §2.6), calibration documentation linking to `release/validation/validation_report.md`, public-vs-instructor redaction policy with concrete column lists citing `BANNED_LEAD_COLUMNS` / `BANNED_OPP_COLUMNS` / `BANNED_TABLES` / `SNAPSHOT_FILTERED_TABLES` from `leadforge/validation/leakage_probes.py`, intended-use vs out-of-scope-use, known limitations (G7.4.4 GBM−LR sign finding, weak channel signal from the Phase 4 audit, flat AUC across tiers, small cohort-shift gap), composition section per Datasheets format, adversarial-framing pointer (placeholder link to `docs/release/break_me_guide.md` that lands in PR 6.3), and a maintenance plan. Every realism / calibration / difficulty claim in the card is anchored to `validation_report.md` per G10.6. `BUNDLE_SCHEMA_VERSION` unchanged at 5 (documentation-only PR); 1167/1167 tests pass; ruff + mypy clean; `scripts/probe_relational_leakage.py release/{intro,intermediate,advanced} --max-accuracy 0.65` exits 0 on every public tier; `scripts/verify_hash_determinism.py` PASS 67/67; `scripts/validate_release_candidate.py --no-rebuild` exits 0.

### Phase 5 — Platform packaging
- [ ] `scripts/package_kaggle_release.py` → `release/kaggle/dataset-metadata.json`
241 changes: 241 additions & 0 deletions docs/release/channel_signal_audit.json
@@ -0,0 +1,241 @@
{
"channel_columns": [
"lead_source",
"first_touch_channel"
],
"industry_mql_to_sql_benchmarks": {
"Email": 0.005,
"PPC": 0.26,
"SEO": 0.51
},
"label_column": "converted_within_90_days",
"release_dir": "release",
"task": "converted_within_90_days",
"tiers": [
{
"columns": [
{
"channels": [
{
"conversion_rate": 0.43439490445859874,
"n": 1570,
"n_converted": 682,
"name": "inbound_marketing",
"share": 0.44857142857142857
},
{
"conversion_rate": 0.39111747851002865,
"n": 698,
"n_converted": 273,
"name": "partner_referral",
"share": 0.19942857142857143
},
{
"conversion_rate": 0.4025974025974026,
"n": 1232,
"n_converted": 496,
"name": "sdr_outbound",
"share": 0.352
}
],
"column": "lead_source",
"n_test": 750,
"n_train": 3500,
"rate_spread": 0.04327742594857009,
"test_conversion_rate": 0.4266666666666667,
"train_conversion_rate": 0.4145714285714286,
"univariate_auc_in_sample": 0.5199794894149169,
"univariate_auc_out_of_sample": 0.5013517441860464
},
{
"channels": [
{
"conversion_rate": 0.43439490445859874,
"n": 1570,
"n_converted": 682,
"name": "inbound_marketing",
"share": 0.44857142857142857
},
{
"conversion_rate": 0.39111747851002865,
"n": 698,
"n_converted": 273,
"name": "partner_referral",
"share": 0.19942857142857143
},
{
"conversion_rate": 0.4025974025974026,
"n": 1232,
"n_converted": 496,
"name": "sdr_outbound",
"share": 0.352
}
],
"column": "first_touch_channel",
"n_test": 750,
"n_train": 3500,
"rate_spread": 0.04327742594857009,
"test_conversion_rate": 0.4266666666666667,
"train_conversion_rate": 0.4145714285714286,
"univariate_auc_in_sample": 0.5199794894149169,
"univariate_auc_out_of_sample": 0.5013517441860464
}
],
"n_test": 750,
"n_train": 3500,
"test_conversion_rate": 0.4266666666666667,
"tier": "intro",
"train_conversion_rate": 0.4145714285714286
},
{
"columns": [
{
"channels": [
{
"conversion_rate": 0.21273885350318472,
"n": 1570,
"n_converted": 334,
"name": "inbound_marketing",
"share": 0.44857142857142857
},
{
"conversion_rate": 0.17621776504297995,
"n": 698,
"n_converted": 123,
"name": "partner_referral",
"share": 0.19942857142857143
},
{
"conversion_rate": 0.2012987012987013,
"n": 1232,
"n_converted": 248,
"name": "sdr_outbound",
"share": 0.352
}
],
"column": "lead_source",
"n_test": 750,
"n_train": 3500,
"rate_spread": 0.03652108846020477,
"test_conversion_rate": 0.22266666666666668,
"train_conversion_rate": 0.20142857142857143,
"univariate_auc_in_sample": 0.5212431012826857,
"univariate_auc_out_of_sample": 0.5139326835180411
},
{
"channels": [
{
"conversion_rate": 0.21273885350318472,
"n": 1570,
"n_converted": 334,
"name": "inbound_marketing",
"share": 0.44857142857142857
},
{
"conversion_rate": 0.17621776504297995,
"n": 698,
"n_converted": 123,
"name": "partner_referral",
"share": 0.19942857142857143
},
{
"conversion_rate": 0.2012987012987013,
"n": 1232,
"n_converted": 248,
"name": "sdr_outbound",
"share": 0.352
}
],
"column": "first_touch_channel",
"n_test": 750,
"n_train": 3500,
"rate_spread": 0.03652108846020477,
"test_conversion_rate": 0.22266666666666668,
"train_conversion_rate": 0.20142857142857143,
"univariate_auc_in_sample": 0.5212431012826857,
"univariate_auc_out_of_sample": 0.5139326835180411
}
],
"n_test": 750,
"n_train": 3500,
"test_conversion_rate": 0.22266666666666668,
"tier": "intermediate",
"train_conversion_rate": 0.20142857142857143
},
{
"columns": [
{
"channels": [
{
"conversion_rate": 0.08152866242038216,
"n": 1570,
"n_converted": 128,
"name": "inbound_marketing",
"share": 0.44857142857142857
},
{
"conversion_rate": 0.07593123209169055,
"n": 698,
"n_converted": 53,
"name": "partner_referral",
"share": 0.19942857142857143
},
{
"conversion_rate": 0.07792207792207792,
"n": 1232,
"n_converted": 96,
"name": "sdr_outbound",
"share": 0.352
}
],
"column": "lead_source",
"n_test": 750,
"n_train": 3500,
"rate_spread": 0.005597430328691616,
"test_conversion_rate": 0.07866666666666666,
"train_conversion_rate": 0.07914285714285714,
"univariate_auc_in_sample": 0.5083011208921436,
"univariate_auc_out_of_sample": 0.5225784296892246
},
{
"channels": [
{
"conversion_rate": 0.08152866242038216,
"n": 1570,
"n_converted": 128,
"name": "inbound_marketing",
"share": 0.44857142857142857
},
{
"conversion_rate": 0.07593123209169055,
"n": 698,
"n_converted": 53,
"name": "partner_referral",
"share": 0.19942857142857143
},
{
"conversion_rate": 0.07792207792207792,
"n": 1232,
"n_converted": 96,
"name": "sdr_outbound",
"share": 0.352
}
],
"column": "first_touch_channel",
"n_test": 750,
"n_train": 3500,
"rate_spread": 0.005597430328691616,
"test_conversion_rate": 0.07866666666666666,
"train_conversion_rate": 0.07914285714285714,
"univariate_auc_in_sample": 0.5083011208921436,
"univariate_auc_out_of_sample": 0.5225784296892246
}
],
"n_test": 750,
"n_train": 3500,
"test_conversion_rate": 0.07866666666666666,
"tier": "advanced",
"train_conversion_rate": 0.07914285714285714
}
]
}
66 changes: 66 additions & 0 deletions docs/release/channel_signal_audit.md
@@ -0,0 +1,66 @@
# Channel-signal audit — leadforge-lead-scoring-v1

Audit produced by `scripts/audit_channel_signal.py`; see `channel_signal_audit.json` for the machine-readable form.
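
As a quick consistency check on the machine-readable form, the reported `rate_spread` of each column can be recomputed from its per-channel `conversion_rate` values. The sketch below is illustrative only: `check_rate_spreads` is a hypothetical helper, not part of `scripts/audit_channel_signal.py`, and it assumes only the key names visible in `channel_signal_audit.json`.

```python
import math


def check_rate_spreads(audit, tol=1e-9):
    """Return (tier, column, recomputed, reported) for every mismatch.

    `audit` is the parsed JSON document. The key names used here
    (tiers, columns, channels, conversion_rate, rate_spread) are the
    ones that appear in channel_signal_audit.json.
    """
    mismatches = []
    for tier in audit["tiers"]:
        for col in tier["columns"]:
            rates = [ch["conversion_rate"] for ch in col["channels"]]
            spread = max(rates) - min(rates)  # max minus min, as in the report
            if not math.isclose(spread, col["rate_spread"], abs_tol=tol):
                mismatches.append(
                    (tier["tier"], col["column"], spread, col["rate_spread"])
                )
    return mismatches


# Usage (path taken from this PR; adjust to your checkout):
#   import json
#   with open("docs/release/channel_signal_audit.json") as fh:
#       print(check_rate_spreads(json.load(fh)))  # empty list when consistent
```

An empty list means every reported spread agrees with its per-channel rates to within `tol`.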

**Scope.** For every tier we compute per-channel conversion rates on the train split and the univariate AUC of channel against `converted_within_90_days`, scored as the empirical positive rate per channel (a 1-D Bayes classifier). Two AUCs are reported: an **in-sample** number (train rates → train labels — biased upward by construction) and an **out-of-sample** number (train rates → test labels — directly comparable to the `source_only` baselines in `release/validation/validation_report.json`).
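
Because the score takes one value per channel, the in-sample AUC can be computed in closed form from the per-channel counts alone: a positive beats every negative in a lower-rate channel and ties (counted as one half) with every negative in its own channel. The sketch below is a minimal reconstruction under that reading; `grouped_auc` is a hypothetical name and this is not necessarily how the audit script implements it.

```python
from collections import defaultdict
from fractions import Fraction


def grouped_auc(counts):
    """In-sample AUC of a per-group-rate score, from (n, n_converted) pairs.

    Groups with exactly equal rates are merged so their rows count as ties.
    """
    by_rate = defaultdict(lambda: [0, 0])  # rate -> [positives, negatives]
    for n, k in counts:
        rate = Fraction(k, n)  # exact, so equal rates compare equal
        by_rate[rate][0] += k
        by_rate[rate][1] += n - k

    total_pos = sum(p for p, _ in by_rate.values())
    total_neg = sum(q for _, q in by_rate.values())
    if total_pos == 0 or total_neg == 0:
        raise ValueError("AUC is undefined when only one class is present")

    wins = 0.0
    neg_below = 0  # negatives in strictly lower-rate groups seen so far
    for rate in sorted(by_rate):
        pos, neg = by_rate[rate]
        wins += pos * neg_below + 0.5 * pos * neg  # strict wins + half-credit ties
        neg_below += neg
    return wins / (total_pos * total_neg)


# (n, n_converted) for the intro tier's train split, from the table below.
intro = [(1570, 682), (698, 273), (1232, 496)]
print(round(grouped_auc(intro), 5))  # 0.51998, the reported in-sample AUC
```

The out-of-sample number needs the test rows as well, since it scores test labels with the train-derived rates.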

**Caveat on the industry benchmark.** The G2 / Gemini v2 numbers below are single-step **MQL→SQL** rates (recommendation #8 in `docs/external_review/summaries/recommendations_pass.md`). v1's label is **90-day closed-won**, which reflects the outcome of the entire funnel. The two metrics are not directly comparable; the table is reproduced for context only.

## Industry benchmark (context, not target)

| Channel | MQL→SQL conversion rate |
|---|---|
| Email | 0.50% |
| PPC | 26.00% |
| SEO | 51.00% |

## Tier: `intro`

`n_train = 3500` (90-day conversion rate 41.46%); `n_test = 750` (rate 42.67%).

### Columns: `lead_source`, `first_touch_channel` (audit values identical)

Per-channel rate spread (max − min): **0.0433** · In-sample univariate AUC: **0.5200** · Out-of-sample univariate AUC: **0.5014**

| Channel | n (train) | Share (train) | Converted (train) | Train rate |
|---|---:|---:|---:|---:|
| `inbound_marketing` | 1570 | 44.86% | 682 | 43.44% |
| `partner_referral` | 698 | 19.94% | 273 | 39.11% |
| `sdr_outbound` | 1232 | 35.20% | 496 | 40.26% |

## Tier: `intermediate`

`n_train = 3500` (90-day conversion rate 20.14%); `n_test = 750` (rate 22.27%).

### Columns: `lead_source`, `first_touch_channel` (audit values identical)

Per-channel rate spread (max − min): **0.0365** · In-sample univariate AUC: **0.5212** · Out-of-sample univariate AUC: **0.5139**

| Channel | n (train) | Share (train) | Converted (train) | Train rate |
|---|---:|---:|---:|---:|
| `inbound_marketing` | 1570 | 44.86% | 334 | 21.27% |
| `partner_referral` | 698 | 19.94% | 123 | 17.62% |
| `sdr_outbound` | 1232 | 35.20% | 248 | 20.13% |

## Tier: `advanced`

`n_train = 3500` (90-day conversion rate 7.91%); `n_test = 750` (rate 7.87%).

### Columns: `lead_source`, `first_touch_channel` (audit values identical)

Per-channel rate spread (max − min): **0.0056** · In-sample univariate AUC: **0.5083** · Out-of-sample univariate AUC: **0.5226**

| Channel | n (train) | Share (train) | Converted (train) | Train rate |
|---|---:|---:|---:|---:|
| `inbound_marketing` | 1570 | 44.86% | 128 | 8.15% |
| `partner_referral` | 698 | 19.94% | 53 | 7.59% |
| `sdr_outbound` | 1232 | 35.20% | 96 | 7.79% |

## Discussion

The numbers above answer one question: *how strongly does channel alone signal 90-day conversion in v1?* They do not answer *whether v1 matches industry channel performance*, since the benchmarks measure a different funnel transition (a single MQL→SQL step) while v1 measures the outcome of the entire funnel over 90 days. Treat the v1 numbers as an internal description of the simulator's channel signal.

Two empirical observations a reader can make from the numbers above:

1. **The out-of-sample univariate AUC is the comparable number** for any external baseline. It uses train-derived rates scored against held-out test labels, the same shape as the `source_only` HistGBM baseline reported in `release/validation/validation_report.json`, which is built on the same task splits with `lead_source` + `first_touch_channel` as the only features. The in-sample number is biased upward in expectation; at v1's N the effect is small enough that split noise can outweigh it (in the `advanced` tier the out-of-sample AUC, 0.5226, exceeds the in-sample AUC, 0.5083), so it is reported here for transparency rather than comparison.
2. **The numerical conclusion is bundle-specific.** When the per-channel rate spread is small and the OOS univariate AUC is close to chance, channel alone is a weak feature for the bundle this audit was run against. v1's bundles currently produce that outcome (see the per-tier sections above) — consistent with the design: the simulator drives conversion through motif-family hazards keyed off latent traits, not channel-conditional probabilities. Channel-conditional encoding is tracked as post-v1 work in `docs/release/post_v1_roadmap.md`.
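
The out-of-sample scoring described in the first observation can be sketched as follows: fit per-channel rates on train, map each test row to its channel's train rate, and compute a tie-aware AUC against the test labels. This is a toy illustration, not the audit script's code. All function names are hypothetical, and two behaviours are assumptions: the fallback to the overall train rate for channels unseen in train, and the 0.5 returned when the test labels contain a single class.

```python
from collections import defaultdict


def fit_channel_rates(channels, labels):
    """Empirical positive rate per channel on the training rows."""
    n = defaultdict(int)
    pos = defaultdict(int)
    for ch, y in zip(channels, labels):
        n[ch] += 1
        pos[ch] += y
    return {ch: pos[ch] / n[ch] for ch in n}


def auc_with_ties(scores, labels):
    """Pairwise AUC; a tied positive/negative pair counts one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return 0.5  # assumption: chance-level fallback for single-class input
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def oos_univariate_auc(train_ch, train_y, test_ch, test_y):
    """Train rates scored against test labels (the out-of-sample number)."""
    rates = fit_channel_rates(train_ch, train_y)
    prior = sum(train_y) / len(train_y)  # assumption: fallback for unseen channels
    scores = [rates.get(ch, prior) for ch in test_ch]
    return auc_with_ties(scores, test_y)


# Toy data (not from the release bundles): channel "a" converts more often.
train_ch = ["a", "a", "a", "b", "b", "b"]
train_y = [1, 1, 0, 0, 0, 1]
print(oos_univariate_auc(train_ch, train_y, ["a", "a", "b", "b"], [1, 0, 0, 1]))  # 0.5
```

The pairwise loop is quadratic in the number of test rows, which is fine at `n_test = 750`; a rank-based formulation would be the usual choice for larger splits.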