PR 6.3: break-me guide + issue templates + v2 decision log#75
Merged
Conversation
Phase 6 closer. Five new artefacts plus three follow-up syncs: - docs/release/break_me_guide.md — adversarial playbook. Meta-recipe (read the dictionary → ablate, don't just probe → check the time window → treat the train/test split as untrusted) plus 9 patterns grouped by category. Each pattern carries a "how to detect on any dataset" recipe and a worked-example pointer back into the v1 bundle (notebook §, validation_report JSON path, or feature_dictionary.csv field). Delivers the three explicit promises notebook 04 §10 made (target-encoding leakage, train-test contamination via account_id, cohort-by-segment) plus six others (naming smells, standalone-AUC vs tree-ablation gap, time-window violations, value-aware ranking inversions, threshold-vs-rank ties, calibration drift across segments). - docs/release/v2_decision_log.md — empty stub with the schema (7 columns: received_at / source / topic / severity / verdict / next_step / link) and verdict vocabulary documented in the preamble. - .github/ISSUE_TEMPLATE/dataset_breakage_report.yml — GitHub Issue Forms. Fields: tier, seed, bundle hash, suggested triage label, severity, summary, repro, expected-vs-actual, environment, two confirmation checkboxes. - .github/ISSUE_TEMPLATE/realism_feedback.yml — GitHub Issue Forms. Fields: aspect, tier(s)-affected, domain experience, claim, data observation, suggested fix, severity, two confirmations. - release/README.md — "Maintenance, adversarial framing, license" section now links to the break-me guide, both issue templates, and the v2 decision log. _release_common.py's existing relative-link rewriter handles the Kaggle/HF rendering automatically; the regenerated release/kaggle/dataset-metadata.json and release/huggingface/README.md sync are bundled in this commit and pass the audit-artifact-sync tests. Notebook 03 §7 and notebook 04 §10 forward-pointers upgraded from plain `docs/release/break_me_guide.md` text to Markdown links pointing at the GitHub blob URL — relative path would break on Kaggle/HF where notebooks ship without the docs/ tree, the blob URL works in both contexts. Net: 1260/1260 tests pass + 5 publish-extra-gated skips; ruff + mypy clean; leakage probes 0/3 on every public tier; hash determinism PASS 67/67; validate_release_candidate --no-rebuild exits 0; BUNDLE_SCHEMA_VERSION unchanged at 5 (this PR is documentation-only). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Pull request overview
Adds Phase 6.3 “adversarial framing” artifacts to the release docs/tooling so external reviewers can report dataset breakage/realism issues consistently, and so v2 dispositions can be tracked over time.
Changes:
- Introduces a break-me guide (9 adversarial patterns) and an empty-but-schematized v2 decision log.
- Adds two GitHub Issue Forms (breakage + realism feedback) to standardize incoming reports.
- Updates release/readme + notebook forward-pointers and re-syncs Kaggle/HF rendered artifacts to use GitHub blob URLs.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
scripts/build_release_notebook_04.py |
Updates notebook 04 builder text to link to the break-me guide via GitHub blob URL. |
scripts/build_release_notebook_03.py |
Updates notebook 03 builder text to link to the break-me guide via GitHub blob URL. |
release/README.md |
Rewrites “Maintenance…” section to link to break-me guide, issue templates, and v2 decision log. |
release/notebooks/04_lift_calibration_value_ranking.ipynb |
Updates the rendered notebook markdown to include the break-me guide link. |
release/notebooks/03_leakage_and_time_windows.ipynb |
Updates the rendered notebook markdown to include the break-me guide link. |
release/kaggle/dataset-metadata.json |
Regenerates Kaggle metadata description with updated “Maintenance…” section links. |
release/huggingface/README.md |
Regenerates HF README with updated “Maintenance…” section links. |
docs/release/v2_decision_log.md |
Adds an empty decision log stub with schema/vocabulary documented. |
docs/release/break_me_guide.md |
Adds the adversarial “break-me” playbook with patterns + worked pointers. |
.github/ISSUE_TEMPLATE/realism_feedback.yml |
Adds a GitHub Issue Form for distributional/realism critiques. |
.github/ISSUE_TEMPLATE/dataset_breakage_report.yml |
Adds a GitHub Issue Form for bundle-level breakage findings. |
.agent-plan.md |
Marks Phase 6.3 deliverables as complete and documents what landed. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Hostile-reviewer pass on the PR 6.3 diff caught seven issues. Fixes:
1. **post_snapshot_aggregates baseline misidentified.** The break-me
guide patterns 2/3 called the validator's baseline "a fitted LR";
the actual implementation in
leadforge/validation/release_quality.py is HistGBM. Fixed, with a
link to the source so the model identity is verifiable.
2. **manifest.json `bundle_hash` is fictitious.** The real manifest
has per-file sha256s nested under `tasks.<task>.test_sha256` and
`tables.<name>.sha256`, no top-level `bundle_hash`. Fixed both
the break-me guide ("What to do when you find one" §1) and the
breakage issue template's bundle-hash field; placeholder now
carries a real hash from the as-shipped intermediate bundle.
3. **Fabricated "70+ leads" in pattern 8.** The slate-inflation
number was made up — the as-shipped intermediate bundle's
actually_above readout matches capacity. Reframed as a defensive
instrument, with a more honest LR-vs-GBM tie-rate observation.
4. **Misleading "instructor" tier dropdown.** Only
`intermediate_instructor` ships in v1; the unqualified label
suggested multiple instructor companions. Renamed to
`intermediate_instructor`.
5. **Underspecified "~0.55 AUC".** Pinned to "intermediate, median
across seeds 42–46" with the cross-tier range (0.52–0.61) so
readers know the variance.
6. **Speculative pattern-9 calibration claim.** "Consistently
over-predicts on small-revenue accounts" was hypothetical, not
observed. Reframed as "in principle… whether v1 actually
exhibits such drift is an open question".
7. **Issue templates referenced labels that didn't exist.** Created
`dataset: leadforge-lead-scoring-v1`, `needs-triage`, `realism`
so the templates' `labels:` fields actually fire on issue
submission. (Triage labels in the dropdowns — critical-leakage
etc. — stay as free-text choices the maintainer applies
post-triage.)
Net: 1260/1260 tests still pass; ruff + mypy clean; YAML still
validates; no new bundle regeneration. BUNDLE_SCHEMA_VERSION
unchanged at 5.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
Hostile-reviewer pass on round 1 caught 8 more issues; one of them
(SOURCE_TREE_BLOCK staleness) was an issue I'd already deferred once,
which I'm calling a cop-out and fixing now. Also folds in a planning
update for the v1 release roadmap: insert a new PR before the publish
PR that lets the maintainer render local mock-page previews of the
Kaggle and HuggingFace dataset pages from the *exact* upload artefacts
before any platform upload — staging gate so styling, link, embed,
and YAML-rendering issues are caught before they hit the live page
where rollback is expensive.
### Round-2 fixes (8)
1. **Pattern 6 metric mismatch.** Cohort shift uses HistGBM (release_quality.py:463/514), so the cross-seed band to compare against is `gbm_auc.spread`, not `lr_auc.spread`. Fixed.
2. **Pattern 5 numerically anchored.** "Two leads in the same account can land in train and test" was vague; the empirical reality is **518 of 557 test accounts (93 %) appear in train** — and the same numbers hold across all three tiers because the splitter is `lead_id`-keyed. Added a copy-pasteable pandas snippet so reporters can verify in 4 lines.
3. **Pattern 4 self-contradiction.** Original text said "we deliberately don't show that" while showing the leakage one-liner inline. Reworded so the contrast is clear: notebook 02 doesn't show the leakage variant; the guide does, so reviewers recognise it in code.
4. **Cross-platform hash command.** Issue template said `sha256sum tasks/.../test.parquet` — that's GNU coreutils, not present on macOS. Added `shasum -a 256` (macOS) and a portable Python one-liner; led with "easiest source: copy from manifest.json".
5. **Relative links within docs/.** Round 1 used absolute GitHub URLs for every in-repo link in the break-me guide, breaking local-clone reading (clicks went to GitHub web instead of opening the local file). Switched to relative paths within the docs tree; kept absolute URLs only where they're load-bearing (notebook forward-pointers ship to Kaggle/HF without `docs/`; issue-template Markdown bodies render via GitHub Issues; README's `](../foo)` gets rewritten by `_release_common.py` for Kaggle/HF).
6. **`expected_acv_capture_at_k` JSON path lands on a dict.** `$.tiers.<tier>.per_seed[*].expected_acv_capture_at_k` resolves to `{"50": …, "100": …}` keyed by string K. Pinned the path to `…expected_acv_capture_at_k.50` so a reader following the citation hits a scalar.
7. **Pattern 8 GBM-tie speculation.** "On a coarse-grained GBM probability output ties are routine" was over-stated — HistGBM's `predict_proba` is continuous via leaf-score sums. Dropped the LR-vs-GBM theorising; kept the empirical observation that on the as-shipped intermediate bundle the realised count matches capacity.
8. **SOURCE_TREE_BLOCK staleness.** Both `release/README.md` and `_release_common.py`'s `SOURCE_TREE_BLOCK` constant listed `notebooks/01_baseline_lead_scoring.ipynb` as the only notebook; four ship now. Updated to a one-liner that names all four. Audit-sync test passes.
### Planning update — local mock-page preview PR
`docs/release/v1_release_roadmap.md` and `.agent-plan.md` updated:
- Phase 7 grows from 2 PRs to 3.
- New **PR 7.2** sits between the LLM critique PR (now 7.1) and the publish PR (now 7.3): `scripts/preview_kaggle_page.py` and `scripts/preview_hf_page.py` render offline HTML mocks from the *exact* upload artefacts (`release/kaggle/dataset-metadata.json` + inlined README + cover image; `release/huggingface/README.md` with frontmatter + body), serve over `localhost`, accept `--variant=public|instructor` (HF) and `--port` / `--open-browser` flags. Tests cover required-field presence, link resolution, schema-column listing, configs-block round-trip.
- The publish PR's runbook (`v1_release_notes.md`) cites the preview commands as required pre-flight.
- Phase summary table + total PR count (14 → 15) updated for consistency.
Net: 1260/1260 tests pass + 5 publish-extra-gated skips; ruff + mypy
clean; leakage probes 0/3 on every tier; hash determinism PASS 67/67;
`validate_release_candidate --no-rebuild` exits 0;
BUNDLE_SCHEMA_VERSION unchanged at 5.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
Three Copilot comments on the issue templates, all accepted: 1. **COPILOT-1** (dataset_breakage_report.yml:9 → realism template link). The relative `realism_feedback.yml` link doesn't resolve when the description renders in the GitHub Issue Forms chooser. Replaced with the "open another template directly" URL form (`/issues/new?template=realism_feedback.yml`), which lands a misrouted reporter on the realism *form*, not on raw YAML. 2. **COPILOT-2** (dataset_breakage_report.yml:139 → `render: text`). `render: text` isn't a documented value for Issue Forms textareas (the supported set is language identifiers from GitHub's syntax highlighter, plus `markdown`). The intent here is plain-text formatting, which is the default when `render` is omitted. Dropped the `render` line entirely. 3. **COPILOT-3** (realism_feedback.yml:58 → placeholder mentions `retail`). Same hallucination class as the round-1 "15 industries" fix. The actual industries are logistics, healthcare_non_clinical, manufacturing, professional_services. Rewrote the placeholder to use the real names so a reporter glancing at the example doesn't think `retail` is a category they should be filing about. Net: 1260/1260 tests pass + 5 publish-extra-gated skips; ruff + mypy clean; YAML validates; BUNDLE_SCHEMA_VERSION unchanged at 5. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
pr-agent-context report: No unresolved review comments, failing checks, or actionable patch coverage gaps were found on PR #75 in repository https://github.com/leadforge-dev/leadforge. Treat this PR as all clear unless new signals appear.Run metadata: |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Phase 6 closer. Five new artefacts plus three follow-up syncs:
docs/release/break_me_guide.md— adversarial playbook.Meta-recipe (read the dictionary → ablate, don't just probe → check
the time window → treat the train/test split as untrusted) plus
9 patterns grouped by category (leakage / split discipline /
metric and ranking traps / robustness and realism). Each pattern
carries a how to detect on any dataset recipe and a worked
example pointer back into the v1 bundle (notebook section,
validation_report.jsonJSON path, orfeature_dictionary.csvfield). Delivers the three explicit promises notebook 04 §10 made
(target-encoding leakage, train-test contamination via
account_id,cohort-by-segment) plus six others.
docs/release/v2_decision_log.md— empty stub with the schemadocumented in the preamble. Seven columns
(
received_at/source/topic/severity/verdict/next_step/link) and verdict vocabulary(
accepted-for-v2/deferred/wont-fix/needs-investigation)defined upfront so the first real entry doesn't trigger a refactor.
.github/ISSUE_TEMPLATE/dataset_breakage_report.yml— GitHubIssue Forms. Bundle-level findings (leakage, contamination, ranking
surprises, notebook failures). Fields: tier dropdown, seed input,
bundle hash, suggested triage label, severity, summary, repro,
expected-vs-actual citing JSON paths, environment, two confirmation
checkboxes.
.github/ISSUE_TEMPLATE/realism_feedback.yml— GitHub IssueForms. Distributional / realism critiques. Fields: aspect, tier(s)
affected, domain experience one-liner (helps weight findings), claim,
data observation with pandas-snippet placeholder, suggested fix,
severity, two confirmations.
release/README.md— "Maintenance, adversarial framing, license"section rewritten. Dead
(PR 6.3)forward-pointer replaced withreal Markdown links to the break-me guide, both issue templates, and
the v2 decision log.
_release_common.py's existing](../foo)→ GitHub-blob-URL rewriter handles the Kaggle/HFrendering automatically; the regenerated
release/kaggle/dataset-metadata.jsonandrelease/huggingface/README.mdare bundled in this commit andpass the audit-artifact-sync tests.
Notebook 03 §7 and notebook 04 §10 forward-pointers upgraded from
plain
docs/release/break_me_guide.mdtext to Markdown linkspointing at the GitHub blob URL — a relative path would break on
Kaggle/HF where notebooks ship without the
docs/tree; the blobURL works in both contexts.
Hostile-reviewer self-review pass
shipping: claimed "15 industries" for
industry(actually 4:logistics / healthcare_non_clinical / manufacturing /
professional_services) and used loose segment-column names
("employee tier", "ARR band") instead of the actual columns
(
employee_band,estimated_revenue_band). Both fixed.Out of scope (follow-ups)
templates (
needs-triage,dataset: leadforge-lead-scoring-v1,realism,critical-leakage, etc.) are not yet defined in therepo's GitHub label catalogue. Issue Forms fail-open on missing
labels — the form still renders, the label just isn't applied — so
this isn't a merge blocker, but creating these labels is the natural
follow-up admin task before the dataset goes public.
Test plan
ruff check ./ruff format --check .— All checks passedmypy leadforge/— Success: no issues found in 82 source filespytest -q— 1260 passed, 5 skipped (publish-extra-gated)python scripts/probe_relational_leakage.py release/{intro,intermediate,advanced} --max-accuracy 0.65— exits 0; 0/3 path prediction rates on every public tierpython scripts/verify_hash_determinism.py— PASS 67/67 files identicalpython scripts/validate_release_candidate.py --no-rebuild— exits 0BUNDLE_SCHEMA_VERSIONunchanged at 5 (documentation-only PR).ipynbfiles;tests/scripts/test_release_notebook_builders.py+tests/release/notebooks/greenyaml.safe_loadround-trip)_release_common.pyGITHUB_BLOB_BASEconstant)🤖 Generated with Claude Code