Merged
21 changes: 11 additions & 10 deletions .agent-plan.md
@@ -81,16 +81,17 @@ _Source: `docs/external_review/summaries/v1_release_review_synthesis.md` — cro
- **Label window: `<` → `<=`** (MEDIUM): Fixed in `engine.py`; test updated to use inclusive assertion.
- **Regenerated all three public tier bundles**: intro 41.5% conv rate, intermediate 20.1%, advanced 7.9%. Validation: PASS — 3 tiers, 5 seeds, 0 leakage findings. 1439 tests pass.

- [ ] **PR 8.2** — `docs(release): difficulty-axis reframe + disclosure hardening`
- **Reframe difficulty axis throughout all copy** (HIGH): README, dataset card, Kaggle/HF metadata, tier table, notebook headers. Change from "Intro / Intermediate / Advanced" framing of modelling difficulty to explicit prevalence/noise tier framing. Recommended: "Intro = high-prevalence classroom warm-up; Intermediate = default benchmark; Advanced = low-prevalence, calibration, and noise-handling exercise." AUC is flat across tiers; the "three difficulty tiers" framing is misleading to anyone who reads it as model complexity.
- **Add `calibration_max_bin_error` to README calibration table** (HIGH): the advanced tier is at 0.52 max-bin error; the current table shows only Brier, which *improves* with prevalence and actively misleads. One row added.
- **Clarify acceptance bands are descriptive regression fences, not realism thresholds** (HIGH): `docs/release/v1_acceptance_gates.md` — the YAML inline comments already say this; the README does not. Small doc edit, large trust impact.
- **Fix `isPrivate: true`** (HIGH): `release/kaggle/dataset-metadata.json` — one character; absolute publish blocker.
- **Change HF default config to `intro`** (MEDIUM): `release/huggingface/README.md` YAML — `default: true` on the `intermediate` config means `load_dataset("leadforge/...")` with no arguments skips the intro tier. Students should land in the easiest tier by default.
- **Remove `intermediate_instructor/` from public README tree** (MEDIUM): the instructor bundle reconstructs the label by construction; listing it in the public-facing "what's inside" tree is a redaction bypass risk. Verify gating is correct and scrub from public copy.
- **Elevate 93% account overlap to primary evaluation warning** (MEDIUM): move above the tier table in README; add: "headline metrics are random-split; for production-representative evaluation use `GroupKFold(account_id)` — see Notebook 02."
- **Add "non-physical values" to known limitations** (MEDIUM): one bullet: "Advanced-tier noise can make some bounded/time/count-like proxies non-physical (e.g. negative duration values); treat these as synthetic distortion artifacts."
- **Reconcile CLAUDE.md canonical package layout** (LOW): delete or annotate aspirational modules that don't exist; add modules that do but are missing from the layout doc.
- [x] **PR 8.2** — `docs(release): difficulty-axis reframe + disclosure hardening`
- **Reframe difficulty axis throughout all copy** (HIGH): README, dataset card, Kaggle/HF metadata, tier table, notebook headers. Tiers now explicitly framed as prevalence/noise axes (Intro=high-prevalence, Intermediate=default benchmark, Advanced=low-prevalence + calibration challenge). Added "Reading this table" note explaining flat AUC. AUC is flat across tiers by design; the "difficulty" framing is gone.
- **Add `calibration_max_bin_error` to README calibration table** (HIGH): Advanced tier at 0.52 max-bin error now visible alongside Brier score.
- **Clarify acceptance bands are descriptive regression fences, not realism thresholds** (HIGH): added blockquote to `docs/release/v1_acceptance_gates.md` under Performance gates heading.
- **Fix `isPrivate: true`** (HIGH): `release/kaggle/dataset-metadata.json` — `isPrivate: false` now; absolute publish blocker resolved.
- **Change HF default config to `intro`** (MEDIUM): `DEFAULT_DEFAULT_CONFIG = "intro"` in `scripts/package_hf_release.py`; `release/huggingface/README.md` regenerated with `intro` as default. Students landing on `load_dataset("leadforge/...")` with no args now get the easiest tier.
- **Remove `intermediate_instructor/` from public README tree** (MEDIUM): scrubbed from `release/README.md`, `release/huggingface/README.md`, and `SOURCE_TREE_BLOCK` in `scripts/_release_common.py`.
- **Elevate 93% account overlap to primary evaluation warning** (MEDIUM): "Evaluation note — account overlap" section added BEFORE tier table in both READMEs.
- **Add "non-physical values" to known limitations** (MEDIUM): bullet added to Known limitations section.
- **Reconcile CLAUDE.md canonical package layout** (LOW): deferred — not blocking publish.
- Preview committed samples updated: `release/_preview_committed/{kaggle,huggingface_public,huggingface_instructor}.html`
- Labels: `type: docs`, `layer: render`, `layer: validation`
- Size: M (~250 lines across multiple docs)

9 changes: 9 additions & 0 deletions docs/release/v1_acceptance_gates.md
@@ -66,6 +66,15 @@ Bands fitted to the PR 3.3 N=5 sweep on `release/{intro,intermediate,advanced}/`
All numeric bands live in `v1_acceptance_gates_bands.yaml`; medians and
rationale follow.

> **These bands are regression fences, not realism thresholds.**
> They are calibrated to the observed five-seed spread for this DGP and
> recipe configuration. A band being "wide" does not mean any value within
> it is equally realistic — it means the validator will not flag a new
> bundle as broken unless a metric drifts *outside* that window. The medians
> in each gate note are the meaningful targets; bands only fire on
> substantial unintended regressions. Tightening the bands is expected work
> when the DGP is redesigned for v2.
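
The fence-versus-target distinction can be sketched in a few lines of Python. This is a minimal illustration only: the `Band` class and `failed_gates` helper are hypothetical names, not the project's validator, and treating the band endpoints as inclusive is an assumption. The numbers are the Intro-tier bands and medians for G7.1.1 and G7.1.2 quoted in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Band:
    """Regression fence for one gate (endpoints assumed inclusive)."""
    gate: str
    low: float
    high: float

    def check(self, value: float) -> bool:
        return self.low <= value <= self.high


# Intro-tier bands as quoted in this document.
INTRO_BANDS = {
    "conversion_rate": Band("G7.1.1", 0.24, 0.61),
    "lr_auc": Band("G7.1.2", 0.82, 0.94),
}


def failed_gates(metrics: dict[str, float], bands: dict[str, Band]) -> list[str]:
    """Return the gate IDs whose metric falls outside its band."""
    return [
        band.gate
        for name, band in bands.items()
        if name in metrics and not band.check(metrics[name])
    ]


# The reported medians pass both gates.
print(failed_gates({"conversion_rate": 0.4267, "lr_auc": 0.8788}, INTRO_BANDS))  # prints []
# A 0.30 conversion rate is far from the 0.4267 median yet still inside the fence;
# only the AUC, which left its band, fires.
print(failed_gates({"conversion_rate": 0.30, "lr_auc": 0.80}, INTRO_BANDS))  # prints ['G7.1.2']
```

The second call shows why a passing gate says little about realism: the fence only reports that a value has not drifted outside the observed spread.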

### Intro tier
- **G7.1.1** Conversion rate within **[0.24, 0.61]**. Median 0.4267.
- **G7.1.2** LR AUC within **[0.82, 0.94]**. Median 0.8788.
76 changes: 55 additions & 21 deletions release/README.md
@@ -35,7 +35,6 @@ release/
│ ├── lead_scoring.csv # flat convenience CSV (all splits)
│ ├── tables/*.parquet # 7 snapshot-safe relational tables
│ └── tasks/converted_within_90_days/{train,valid,test}.parquet
├── intermediate_instructor/ # research companion: full-horizon tables + metadata/
├── docs/ # vendored DGP / leakage / break-me docs (agent-readable)
├── notebooks/ # 01 baseline · 02 relational · 03 leakage · 04 calibration
├── metrics.json # top-level cross-tier metrics summary
@@ -109,14 +108,33 @@ exception is `total_touches_all`, the leakage trap — flagged
`leakage_risk=True` in `feature_dictionary.csv`. Drop it from your
feature set unless you're demonstrating leakage detection.

## Evaluation note — account and contact overlap

**518 of 557 test accounts (≈93 %) appear in train** on the intermediate
bundle; the other tiers are similar. Contact-level overlap is comparable
in magnitude: most test contacts also have activity in the training set.
The random-split headline metrics therefore ride both account-level and
contact-level signal across the split boundary and over-estimate
generalisation to unseen accounts and contacts. For a faithful
out-of-sample number, retrain with `GroupKFold(account_id)` and report
both metrics. Notebook 02 demonstrates the detection recipe;
[`break_me_guide.md`](../docs/release/break_me_guide.md) §5 gives
the worked example.
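
The overlap figure can be reproduced with a simple set comparison. The sketch below is not the Notebook 02 recipe; `group_overlap` and the toy account IDs are hypothetical, and it assumes you have one `account_id` value per lead for each split.

```python
def group_overlap(train_groups, test_groups):
    """Count distinct test groups that also occur in train.

    Returns (shared, total_test_groups, shared / total_test_groups).
    """
    train_set = set(train_groups)
    test_set = set(test_groups)
    if not test_set:
        raise ValueError("test_groups is empty")
    shared = test_set & train_set
    return len(shared), len(test_set), len(shared) / len(test_set)


# Toy data only: hypothetical account IDs, one entry per lead.
train_accounts = ["a1", "a1", "a2", "a3", "a4"]
test_accounts = ["a1", "a3", "a5", "a5"]

shared, total, frac = group_overlap(train_accounts, test_accounts)
print(f"{shared} of {total} test accounts appear in train ({frac:.0%})")
```

Run against the real train and test splits, the same function should return the 518-of-557 count quoted above. Once the overlap is confirmed, scikit-learn's `GroupKFold` keeps each account inside a single fold, which is what removes the shared signal.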

## Dataset summary

**Tiers are prevalence and noise axes, not modelling-complexity axes.**
LR AUC is ~0.88 in every tier by design. The tiers differ in conversion
rate, missingness, and noise — not rank discrimination. Choose a tier
based on the teaching exercise, not on expected AUC:

| | Intro | Intermediate | Advanced |
|---|---|---|---|
| **Tier purpose** | High-prevalence warm-up | Default benchmark | Low-prevalence · calibration · noise exercise |
| Leads | 5,000 | 5,000 | 5,000 |
| Accounts | 1,500 | 1,500 | 1,500 |
| Contacts | 4,200 | 4,200 | 4,200 |
| Snapshot columns | 32 / 34* | 32 / 34* | 32 / 34* |
| Snapshot columns | 31 / 34* | 31 / 34* | 31 / 34* |
| Target | `converted_within_90_days` | `converted_within_90_days` | `converted_within_90_days` |
| Conversion rate (acceptance band, gate G7.\*) | 24–61% | 12–31% | 4–12% |
| Conversion rate (observed median, seeds 42–46) | 42.67% | 21.60% | 8.40% |
@@ -178,14 +196,22 @@ with bands declared in
[`docs/release/v1_acceptance_gates_bands.yaml`](../docs/release/v1_acceptance_gates_bands.yaml).
Headline cross-seed medians (seeds 42–46):

| Tier | LR AUC | AP | P@100 | Brier |
|---|---|---|---|---|
| intro | 0.879 | 0.761 | 0.80 | 0.130 |
| intermediate | 0.886 | 0.575 | 0.59 | 0.110 |
| advanced | 0.886 | 0.351 | 0.34 | 0.061 |
| Tier | LR AUC | AP | P@100 | Brier | `calibration_max_bin_error` |
|---|---|---|---|---|---|
| intro | 0.879 | 0.761 | 0.80 | 0.130 | 0.25 |
| intermediate | 0.886 | 0.575 | 0.59 | 0.110 | 0.25 |
| advanced | 0.886 | 0.351 | 0.34 | 0.061 | **0.52** |

**Reading this table:** LR AUC is flat across tiers by design — the
tiers are a prevalence / noise axis, not a rank-discrimination axis.
Brier score *improves* as prevalence falls (a prevalence effect, not
better calibration); use `calibration_max_bin_error` to assess
calibration quality. Advanced's 0.52 max-bin error means the model's
predicted probabilities are materially mis-scaled against actual
conversion rates — a realistic miscalibration exercise.
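
The prevalence effect on Brier score can be checked with a constant predictor that always outputs the base rate. The sketch below is illustrative only: the binning in `max_bin_error` (ten equal-width bins, largest absolute gap between mean predicted probability and observed rate) is an assumed definition and may differ from how the project computes `calibration_max_bin_error`. The conversion rates are the observed medians from the dataset summary.

```python
def brier(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def max_bin_error(y_true, y_prob, n_bins=10):
    """Largest |mean predicted - observed rate| over non-empty equal-width bins.

    Assumed definition, for illustration only.
    """
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))
    errors = []
    for members in bins:
        if members:
            mean_pred = sum(p for _, p in members) / len(members)
            observed = sum(y for y, _ in members) / len(members)
            errors.append(abs(mean_pred - observed))
    return max(errors)


# An uninformative predictor: every lead gets the tier's base rate.
n = 10_000
for tier, rate in [("intro", 0.4267), ("intermediate", 0.2160), ("advanced", 0.0840)]:
    positives = round(rate * n)
    y = [1] * positives + [0] * (n - positives)
    probs = [rate] * n
    print(tier, round(brier(y, probs), 4), round(max_bin_error(y, probs), 4))
```

The constant predictor's Brier score equals `rate * (1 - rate)`, so it falls as the tier gets rarer even though the predictor has learned nothing, while its max-bin error stays near zero in every tier. That is why the two columns have to be read together.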

AP, P@100, conversion-rate, and lift orderings hold across the
intended difficulty axis (intro > intermediate > advanced).
intended prevalence axis (intro > intermediate > advanced).

## Intended uses

@@ -211,9 +237,16 @@ intended difficulty axis (intro > intermediate > advanced).

## Known limitations

- **Difficulty signal on raw AUC is flat.** LR AUC is ~0.88 across
every tier. Difficulty is visible in AP, P@K, Brier, and value
capture. Treat AUC as a sanity check, not a difficulty signal.
- **Tiers are a prevalence / noise axis, not a modelling-complexity
axis.** LR AUC is ~0.88 in every tier; the three tiers differ in
conversion rate (43% / 22% / 8%), noise scale, and missingness,
not in rank discrimination. Use AP, P@K, and calibration metrics
to see the gradient across tiers; AUC alone will not show it.
- **≈93% account overlap across train / test splits, with contact
overlap of similar magnitude.** Random splits are keyed on lead ID;
most test accounts and contacts also appear in train. Headline
metrics over-state generalisation to unseen accounts and contacts.
Use `GroupKFold(account_id)` for a faithful estimate.
- **GBM does not consistently beat LR (gate G7.4.4).** GBM−LR AUC delta
is slightly negative in every tier (intro −0.0045, intermediate
−0.0072, advanced −0.0133); v1's snapshot is dominated by linear
@@ -227,30 +260,31 @@ intended difficulty axis (intro > intermediate > advanced).
- **Cohort-shift degradation is small.** v1 has no time-of-year drift
baked in; the cohort-shift gate (G6.4) is informational and will
bite in v2.
- **Advanced-tier noise can produce artifact zeros in count and duration
columns.** Gaussian noise is applied before MCAR missingness; the
snapshot builder clamps results below zero to zero. What users observe
is therefore not negative values but zeros that may be noise artifacts
rather than true zero values — e.g. `days_since_last_touch = 0` might
mean "noised below zero, clamped" rather than "touched today". Treat
suspicious zero clusters in the Advanced tier as intentional
data-cleaning exercise material.
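
The clamping artifact described in the last bullet is easy to reproduce. The sketch below is a toy model, not the snapshot builder: `noisy_clamped`, the true value of 3 days, and the noise scale of 4 are all hypothetical choices made so that a visible share of rows is pushed below zero.

```python
import random


def noisy_clamped(values, sigma, seed=0):
    """Add Gaussian noise to each value, then clamp negatives to zero."""
    rng = random.Random(seed)
    return [max(0.0, v + rng.gauss(0.0, sigma)) for v in values]


# Hypothetical column: every lead was truly last touched 3 days ago,
# so any zero in the output is a noise artifact, not a real "today".
true_days = [3.0] * 1_000
observed = noisy_clamped(true_days, sigma=4.0)

zeros = sum(1 for v in observed if v == 0.0)
print(f"{zeros} of {len(observed)} rows are exactly zero; minimum is {min(observed)}")
```

Because the clamp maps every negative draw to the same value, the artifacts show up as a spike at exactly zero rather than as a smooth left tail, which is the signature to look for in the Advanced tier.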

## Composition

- **Entities.** Accounts, contacts, leads, touches, sessions,
sales_activities, opportunities (public); plus customers and
subscriptions (instructor only). Per-row counts per bundle live in
`manifest.json`.
- **Features.** 32 public columns grouped by analytical role in
- **Features.** 31 public columns grouped by analytical role in
[`docs/release/feature_dictionary.md`](../docs/release/feature_dictionary.md);
the per-bundle `feature_dictionary.csv` is the authoritative
machine-readable spec.
- **Label.** `converted_within_90_days` (boolean), event-derived from
the simulator. Never sampled directly.
- **Splits.** 70/15/15 train/valid/test, deterministic given seed;
recorded in `tasks/converted_within_90_days/task_manifest.json`.
**Group-leakage warning:** the splitter is keyed on `lead_id` only,
not on `account_id` or `contact_id`. On the as-shipped intermediate
bundle, **518 of 557 test accounts (≈93 %) also appear in train**;
the contact-level overlap is similar in magnitude. A flat baseline
trained on the random split rides account-level signal across the
split boundary. For a generalisation-faithful number, retrain with
`GroupKFold(account_id)` (or `contact_id`) and report both — see
[`break_me_guide.md`](../docs/release/break_me_guide.md) §5 for the
detection recipe.
Splits are keyed on `lead_id`; see the *Evaluation note* above for
the account-overlap caveat.
- **Provenance.** Recipe `b2b_saas_procurement_v1`, seed 42, package
version stamped in `manifest.json`.
