Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions .agent-plan.md
Original file line number Diff line number Diff line change
Expand Up @@ -58,8 +58,8 @@ Goal: ship a best-in-class educational synthetic CRM lead-scoring dataset family
- [x] PR 5.2 (self-review pass): brutal review of the first revision caught real bugs the reviewer would otherwise have to call out. Fixes folded into the PR before review: (#1) `run_packager` validate→write order — both packagers were writing the README/metadata even when validation failed; the validation gate now early-returns with `errors` populated and zero artifacts on disk; new tests exercise the no-write path on both sides. (#2) Instructor README was inlining the public 3-tier README for a 1-tier dataset; replaced with a dedicated `INSTRUCTOR_BODY` constant that opens by linking to the public dataset, describes only the instructor-specific additions (full-horizon tables, hidden DAG, latent registry, mechanism summary), and uses the single-tier tree block. (#3) `validate_upload_dir_safe` now also rejects strict descendants of `release_dir` (e.g. `--huggingface-dir release/intro` would otherwise rmtree the intro bundle); allow-list keeps the canonical `release/{kaggle,huggingface,huggingface-instructor}` direct-children. (#4) `[publish]` extra in `pyproject.toml` (`datasets>=2.14`, `kaggle>=1.6`) makes the gated `load_dataset()` / Kaggle-CLI tests installable in a single command — closes the "G12.3/G12.4 untested in CI" gap to a one-line install. (#5) Shared-primitives extraction finished: `SOURCE_TREE_BLOCK`, `validate_readme_substitution`, `replace_file`, `replace_dir`, `load_manifest` all moved to `scripts/_release_common.py`; both packagers reduced to imports. (#6) Hand-rolled YAML renderer (60 lines + brittle quoting heuristic) replaced with `yaml.safe_dump` + a 4-line `_IndentedDumper` subclass that forces indent-2 on top-level sequences. (#7) Dead `--owner` / `--dataset-slug` CLI flags removed (PR 7.2 will add them when actually needed). (#8) `assemble_upload_dir` now takes `rendered_readme` as a parameter and writes it itself; the public name no longer lies about producing a complete tree. (#9) `build_config_for_tier` made pure (no I/O); `_assert_tier_dir_exists` does the cheap manifest-stat preflight. (#10) `--default-config` with `--variant=instructor` now errors instead of silently ignoring. (#11) Instructor tree-diagram drops the hardcoded "9 tables" claim. (#13–#16) Visual cleanups (duplicate divider, ruff-split imports, `COVER_IMAGE_FILENAME`-vs-`Path.name` redundancy, speculative comment about HF split rename). (#17) Test cruft removed (unused `tmp_path`, dead `tag_lines`); em-dash YAML round-trip parametrised for the instructor `pretty_name`. Net: 1223/1223 tests pass + 5 gated skips (4 `datasets`-SDK round-trip + 1 Kaggle `kaggle`-SDK from PR 5.1); ruff + mypy clean; `scripts/probe_relational_leakage.py release/{intro,intermediate,advanced} --max-accuracy 0.65` exits 0 on every public tier; `scripts/verify_hash_determinism.py` PASS 67/67; `scripts/validate_release_candidate.py --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5 (this PR doesn't touch the bundle shape).

### Phase 6 — Notebook sequence + adversarial framing
- [ ] `release/notebooks/{02_relational_feature_engineering,03_leakage_and_time_windows,04_lift_calibration_value_ranking}.ipynb`
- [ ] Update `01_baseline_lead_scoring.ipynb` to reproduce validation report metrics
- [x] PR 6.1: `release/notebooks/01_baseline_lead_scoring.ipynb` refreshed and `release/notebooks/02_relational_feature_engineering.ipynb` added. Notebook 01 trains LR + HistGBM on the public `intermediate` bundle using the **same feature set as the validation report** (drops only IDs and the label, mirrors `release_quality._partition_columns`), so the G13.2 reproduction gate compares apples to apples. This means notebook 01 **keeps** `total_touches_all` (the documented leakage trap) — narrative cell calls it out explicitly and forward-points to notebook 03 (PR 6.2) which dissects what dropping the trap does to performance. Notebook 02 by contrast **drops** the trap from the flat baseline so the relational lift attribution stays clean (its goal is teaching feature engineering, not reproducing the report). Targets are loaded at runtime from `release/notebooks/_release_targets.json` (audit-synced against `release/validation/validation_report.json` by `tests/release/notebooks/test_release_targets_match_report.py`); per-metric tolerances replace the original flat ±0.05 (AUC/Brier ±0.02, AP / top-decile ±0.05). Notebook 02 loads the seven snapshot-safe public tables, asserts every event-table `timestamp <= lead_created_at + snapshot_day` inline (with real min-headroom-under-cutoff readings, not a hardcoded literal), demonstrates four legal joins (touch-channel breakdown, account-level density fit on **train leads only**, sales-activity recency, train-only industry target encoding), trains LR + GBM on flat-baseline-only and flat+relational features, prints a 4-row metric panel + delta panel, and pins the four model AUCs and the headline `GBM(eng) − GBM(flat)` lift via `assert_within_tolerance` (sign-aware `assert lift > 0` on top of the absolute tolerance). Honest takeaway cell frames the +0.0147 AUC lift as suggestive, not conclusive (the cross-seed `gbm_auc` spread on this bundle is ~0.027); seed-sweep harness lands in PR 6.2's notebook 04. Both notebooks ship inside the public release bundle alongside the parquet tables (Kaggle/HF consumers download them together) so they import a sibling `release/notebooks/_notebook_utils.py` rather than rely on the `leadforge` package — `precision_at_k` and `top_decile_rate` mirror `release_quality._precision_at_k` / `_top_decile_rate` (locked in by mirror tests), and `assert_within_tolerance` is hardened against silent passes on non-finite metrics or incomplete per-metric tolerance maps. G13.1 acceptance gate wired: new `[notebooks]` extra (`nbclient`, `nbformat`, `scikit-learn`, `matplotlib`) and a dedicated `notebooks` CI job that regenerates the intermediate bundle via `python scripts/build_public_release.py release --tier intermediate` (only tier the notebooks need) then nbclient-executes both notebooks end-to-end (`tests/release/notebooks/test_execute_notebooks.py`, parametrised, gated on bundles-present). G13.3 path discipline enforced inline: notebook 01 hard-codes `BUNDLE = Path("../intermediate")` and asserts `manifest.exposure_mode == "student_public"`; notebook 02 explicitly excludes `customers`/`subscriptions` per `BANNED_TABLES`. Builders (`scripts/build_release_notebook_{01,02}.py`, sharing `scripts/_release_notebook_common.py`) emit deterministic byte-for-byte notebook JSON via explicit `cell_NNN` IDs (audit-artifact-sync pattern from PR 4.1 / 5.1 / 5.2, locked in by `tests/scripts/test_release_notebook_builders.py` which builds twice into `tmp_path` via the new `--out PATH` flag and diffs against the committed file without ever touching the working tree) and shell out to `ruff format` on the emitted file so builder output and pre-commit hook agree. Net: 1250/1250 tests pass + 5 publish-extra-gated skips; ruff + mypy clean; leakage probes 0/3 on every tier; hash determinism PASS 67/67; `validate_release_candidate --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5.
- [ ] `release/notebooks/{03_leakage_and_time_windows,04_lift_calibration_value_ranking}.ipynb`
- [ ] `.github/ISSUE_TEMPLATE/{dataset_breakage_report,realism_feedback}.yml`
- [ ] `docs/release/{break_me_guide,v2_decision_log}.md`

Expand Down
25 changes: 24 additions & 1 deletion .github/workflows/ci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,11 @@ jobs:
- uses: actions/setup-python@v5
with:
python-version: "3.12"
- run: pip install ruff
# Pin ruff so notebook .ipynb formatting (which the lint job is
# strict about) doesn't drift between contributor laptops and CI
# the moment a new ruff release ships. Bump in lock-step with
# any local re-runs of ``scripts/build_release_notebook_*.py``.
- run: pip install 'ruff==0.15.12'
- run: ruff check .
- run: ruff format --check .

Expand Down Expand Up @@ -138,3 +142,22 @@ jobs:
- name: Skip v7 (no dataset)
if: steps.check-v7.outputs.found != 'true'
run: echo "No v7 datasets found — skipping v7 validation"

notebooks:
name: Execute release notebooks (G13.1)
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: "3.12"
- run: pip install -e ".[dev,scripts,notebooks]"
- name: Register python3 kernelspec for nbclient
run: python -m ipykernel install --user --name python3
- name: Build the intermediate public bundle (only tier the notebooks need)
run: python scripts/build_public_release.py release --tier intermediate
- name: Execute release notebooks end-to-end + builder byte-stability
run: |
pytest tests/release/notebooks/test_execute_notebooks.py \
tests/scripts/test_release_notebook_builders.py \
-v
29 changes: 29 additions & 0 deletions pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -54,6 +54,25 @@ publish = [
"datasets>=2.14",
"kaggle>=1.6",
]
# Optional dependencies for executing the public release notebooks.
# Installing this extra (``pip install -e ".[notebooks]"``) enables the
# G13.1 acceptance gate: the CI ``notebooks`` job nbclient-executes
# ``release/notebooks/*.ipynb`` end-to-end against a freshly built
# public bundle, asserting the notebooks reproduce validation_report.md
# metrics within ±0.05 (G13.2) and never load instructor artefacts
# (G13.3).
notebooks = [
"nbclient>=0.10",
"nbformat>=5.10",
# ``ipykernel`` provides the ``python3`` kernelspec that
# ``nbclient.NotebookClient(..., kernel_name="python3")`` looks up.
# Without it CI fails with ``NoSuchKernel: No such kernel named
# python3`` because the GitHub-hosted runner has no kernelspecs
# registered out of the box (local dev environments usually do).
"ipykernel>=6.0",
"scikit-learn>=1.3",
"matplotlib>=3.7",
]

[project.scripts]
leadforge = "leadforge.cli.main:app"
Expand All @@ -74,6 +93,16 @@ select = ["E", "F", "I", "N", "W", "UP", "B", "C4", "PT", "S"]

[tool.ruff.lint.per-file-ignores]
"tests/**/*" = ["S101", "S108"]
# Release notebooks deliberately use ``assert`` for contract checks
# (path discipline, snapshot-safe joins, G13.2 tolerance gate). ``assert``
# is the conventional notebook idiom and these are exactly the cells we
# want to fail loud on regression.
"release/notebooks/**/*.ipynb" = ["S101"]
# Notebook builders emit dedented heredocs whose lines render as
# markdown tables and print-statement output inside the notebook.
# Line length is a property of the rendered cell, not the .py source,
# so 100c is the wrong yardstick here.
"scripts/build_release_notebook_*.py" = ["E501"]

[tool.mypy]
python_version = "3.11"
Expand Down
Loading
Loading