Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion .agent-plan.md

Large diffs are not rendered by default.

52 changes: 30 additions & 22 deletions docs/release/break_me_guide.md
Original file line number Diff line number Diff line change
Expand Up @@ -183,41 +183,49 @@ fallback-to-train-mean handling is in `attach_engineered`.

The bundle ships a deterministic 70/15/15 split on `lead_id`
(see `tasks/<task>/task_manifest.json`). That guarantees
`lead_id` uniqueness across splits — but `account_id` is
*not* split on. On the as-shipped intermediate bundle,
**518 of 557 test accounts (93 %) also appear in train**;
the same numbers hold on intro and advanced because the
splitter is `lead_id`-keyed and tier-invariant. Models can
ride strong account-level signal across the split boundary
in ways that don't generalise to a fresh account.

**How to detect on any dataset.**
`lead_id` uniqueness across splits — but `account_id` and
`contact_id` are *not* split on. On the as-shipped intermediate
bundle, **518 of 557 test accounts (93 %) also appear in train**,
and the contact-level overlap is similar in magnitude (the
split is `lead_id`-keyed and `account_id` / `contact_id` are
shared foreign keys); the same proportions hold on intro and
advanced because the splitter is tier-invariant. Models can
ride account- or contact-level signal across the split boundary
in ways that don't generalise to a fresh account or fresh
contact.

**How to detect on any dataset.** Repeat the snippet below per
group key — every reusable foreign-key column the dataset
exposes (`account_id`, `contact_id`, and any derived strata
like `industry × region` you bake into engineered features) is
a separate group-leakage axis.

```python
import pandas as pd
train = pd.read_parquet("intermediate/tasks/converted_within_90_days/train.parquet")
test = pd.read_parquet("intermediate/tasks/converted_within_90_days/test.parquet")
overlap = set(train["account_id"]) & set(test["account_id"])
print(f"shared accounts: {len(overlap)} / {test['account_id'].nunique()}")
for key in ("account_id", "contact_id"):
overlap = set(train[key]) & set(test[key])
print(f"shared {key}: {len(overlap)} / {test[key].nunique()}")
```

If the overlap is non-empty *and* you've engineered any
account-level features, retrain with account-level grouped
splitting (e.g. `GroupKFold` on `account_id`) and re-read the
AUC delta. The delta is the amount of "free" lift the
random-split was buying you. The right framing isn't "remove
the leak"; it's *report both numbers so the reader knows
which is which.*
If any overlap is non-empty *and* you've engineered any
group-level features, retrain with group-aware splitting
(e.g. `GroupKFold` on the relevant key) and re-read the AUC
delta. The delta is the amount of "free" lift the random-split
was buying you. The right framing isn't "remove the leak"; it's
*report both numbers so the reader knows which is which.*

**Worked example.** Notebook 02 §4.2 builds an account-level
density feature using *only* train leads' touches — a
defensive posture against this hazard. The
`tasks/converted_within_90_days/task_manifest.json` records
the split policy and is the right artefact to cite when filing
an issue under this label. A bundle-level `account_id`
overlap audit isn't included in v1 — the validation report's
split-leakage probe (`probe_split_id_overlap`) checks
`lead_id` only.
an issue under this label. A bundle-level group-overlap audit
isn't included in v1 — the validation report's split-leakage
probe (`probe_split_id_overlap`) checks `lead_id` only;
extending it to enumerate `account_id` and `contact_id`
overlap is a `v2-idea` candidate.

### 6. Cohort-by-segment evaluation

Expand Down
29 changes: 29 additions & 0 deletions docs/release/llm_critique_design.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,29 @@
# PR 7.1 — `llm_critique` design notes

Working notes for the LLM critique module
(`leadforge/validation/llm_critique.py`), its rubric prompt
(`docs/release/llm_critique_prompt.md`), and its driver
(`scripts/run_llm_critique.py`). Captured before implementation; kept
short on purpose.

## Decisions

| # | Decision | Why |
|---|---|---|
| 1 | Single-provider (Anthropic Claude) via an `LLMCritiqueClient` protocol; no preemptive OpenAI / Gemini stubs. | Multi-provider is post-v1 (`post_v1_roadmap.md`). The protocol gives a future provider a seam without paying for it now. |
| 2 | `ANTHROPIC_API_KEY` env var. "Absent" = unset OR empty after `.strip()`. On absent: skip cleanly, exit 0, no I/O. `--require-execute` flag converts the skip into exit 2 for release-readiness CI. | Roadmap acceptance criterion: live API not required to pass `pytest`. Empty-after-strip handles `env -i` / stale `.envrc`. The CI gate needs an opt-in to fail loud. |
| 3 | Model `claude-opus-4-7`, `thinking={"type": "adaptive", "display": "summarized"}`, `effort="high"`, `messages.create()` with explicit 600s timeout, single prompt-cache breakpoint at end of input bundle. | Adaptive is the only mode on Opus 4.7 (manual `budget_tokens` 400s). `summarized` so the Markdown summary can quote reasoning. `high` is the recommended minimum for intelligence-sensitive work. One breakpoint suffices: system content sits inside the cached prefix anyway, and any rubric edit invalidates the bundle cache, so a second breakpoint buys nothing and burns a slot. |
| 4 | Frozen-dataclass schema (no pydantic). `category` vocabulary lifted **verbatim** from `break_me_guide.md` (the nine triage labels). `rubric_dimension` (D1–D13) required on every finding. Strict `release_id` equality check. Provenance triple (`model` / `effort` / `thinking_mode`) plus per-source-file `bundle_hashes` and assembled `input_bundle_sha256` carried for audit. | Matches the rest of the codebase (no pydantic anywhere). Locked vocabulary = findings route to existing labels without translation. Requiring `rubric_dimension` lets reviewers audit clustering. Strict `release_id` so silent drift can't defeat the audit gate. |
| 5 | Eleven-block input bundle, intermediate tier only: README, per-tier dataset card, generation method, manifest, feature dictionary, validation report `.{md,json}`, test-split `df.describe()` + 20-row head, public/instructor diff (live-derived from `BANNED_*` constants in `leakage_probes.py`), public-safe mechanism summary (motif family names + difficulty knob *names*, no values), break-me guide verbatim. | Each block earns its place. Live-derived diff = single source of truth, sync-tested. Mechanism summary names-only matches the `student_public` redaction posture. `df.describe()` carries the per-column statistics raw rows can't. All-three-tiers would triple context for marginal value (cross-tier spread is in the validation report already). |
| 6 | No fake determinism (Opus 4.7 doesn't accept `temperature`). Provenance instead: model + effort + thinking + bundle hashes recorded on every result. Timestamped raw JSON accumulates per run; canonical Markdown summary overwrites in place. | Reviewer concern is "could a different maintainer get a different result" — yes. Mitigation is provenance, not fake `temperature=0`. |
| 7 | CLI mirrors `scripts/validate_release_candidate.py`: free-function `parse_args`, frozen `DriverConfig`, `run_critique(config) -> DriverResult`, `main(argv) -> int`. Exit codes 0 / 1 / 2. Three modes alongside the live path: `--dry-run` writes the input bundle for inspection (no API call); `--no-execute` validates SDK + creds and exits (CI smoke gate, fails loud on absent creds); `--out-tag` suffixes both raw JSON *and* summary filenames for adjudication re-runs. | Maintainer muscle memory + small surface. `--out-tag` suffixes both files because the summary is the at-a-glance entry point — clobbering the canonical run's summary on adjudication is the bug. |
| 8 | Tests: no live API. Mocked `LLMCritiqueClient` protocol with a small in-process canned-response fake. Sync tests pin (a) every `VALID_CATEGORIES` entry appears in `break_me_guide.md`, (b) `VALID_RUBRIC_DIMENSIONS` is exactly D1–D14, (c) the live-derived public/instructor diff names every banned-column / banned-table constant. Smoke test exercises `build_input_bundle` against the real `release/intermediate/` artefacts when present. | Roadmap acceptance: live API not required. Sync tests are the cheap-but-load-bearing guards against vocabulary drift. |
| 9 | First live run is maintainer-driven. Outputs land at `release/validation/llm_critique_raw_<UTC-iso>.json` + `release/validation/llm_critique_summary.md`. Hand-adjudicate: resolve high-severity findings in code OR log to `docs/release/v2_decision_log.md` with verdict (`accepted-for-v2` / `deferred` / `wont-fix` / `needs-investigation`). | Adjudication is human work. The next critique's exit code is the gate. |

## What this PR does not touch

- `BUNDLE_SCHEMA_VERSION` stays at 5.
- `release/validation/validation_report.{json,md}` does not regenerate.
- PR 7.2 (Kaggle/HF mock-page preview) and PR 7.3 (publish + tag) are separate PRs.
- Multi-provider abstraction beyond the protocol seam.
- CI integration of the critique gate (post-v1 unless `--require-execute` lands in a workflow this PR or later).
Loading
Loading