From 1a756462ef100626d1365c3b849064197d3b41ca Mon Sep 17 00:00:00 2001 From: Shay Palachy Date: Sun, 24 May 2026 22:59:47 +0300 Subject: [PATCH 1/4] feat(scripts): ShmuggingFace preview site builder + Cloudflare Pages deploy MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - scripts/build_shmuggingface_site.py (new): reads the three public release tiers, renders release/README.md → HTML via markdown-it-py (linkify disabled), loads per-tier manifest/metrics/feature-dict/sample rows, emits a shmuggingface.config.mjs and drives ShmuggingFaceCore to produce a HuggingFace+Kaggle mock static site, then deploys via wrangler pages deploy. - ShmuggingFaceCore is auto-cloned to /tmp/shmuggingface-core on first run and git-pulled on subsequent runs; no npm dep installation required. - Config includes descriptionHtml (full README as HTML), coverImage, splits/subsets arrays, files[].about descriptions, and 8 sample rows. All file references use relative sourcePath so ShmuggingFaceCore copies real release files into the dist. - Cloudflare Pages project 'leadforge-lead-scoring-v1-preview' created on the adanim account; live at: https://leadforge-lead-scoring-v1-preview.pages.dev - pyproject.toml: per-file-ignores for S603/S607/S108/E501 on the script (subprocess calls with controlled inputs; long data strings). - .gitignore: add release/_shmuggingface/ and .wrangler/ - .agent-plan.md: mark PR 7.2.2 complete; update PR 7.3 to cite preview site Co-Authored-By: Claude Sonnet 4.6 --- .agent-plan.md | 3 +- .gitignore | 5 + pyproject.toml | 4 + scripts/build_shmuggingface_site.py | 478 ++++++++++++++++++++++++++++ 4 files changed, 489 insertions(+), 1 deletion(-) create mode 100644 scripts/build_shmuggingface_site.py diff --git a/.agent-plan.md b/.agent-plan.md index 0a55f90..718daf8 100644 --- a/.agent-plan.md +++ b/.agent-plan.md @@ -66,7 +66,8 @@ Goal: ship a best-in-class educational synthetic CRM lead-scoring dataset family - [x] PR 7.1: LLM critique module + prompt + driver landed. `leadforge/validation/llm_critique.py` (new) — single-provider Anthropic critique core via an `LLMCritiqueClient` protocol (no preemptive OpenAI/Gemini stubs); `_AnthropicCritiqueClient` lazy-imports the SDK so the module imports cleanly even on machines without `anthropic` installed (the skip-cleanly path needs to work without the SDK). `has_anthropic_credentials` / `api_key_or_skip` treat unset and empty-after-strip identically as "absent", explicitly to handle the `env -i` / stale `.envrc` case where the shell sets `ANTHROPIC_API_KEY=""` and the SDK would otherwise 401 instead of cleanly skipping. Default model `claude-opus-4-7` with `thinking={"type": "adaptive", "display": "summarized"}` (only mode supported on Opus 4.7 — manual `budget_tokens` 400s) and `output_config={"effort": "high"}` (recommended minimum for intelligence-sensitive work per the `claude-api` skill); two prompt-cache breakpoints (rubric + input bundle) per the design doc's caching strategy so the common adjudication-loop workflow hits cache on both layers; streamed via `messages.stream(...).get_final_message()` to dodge the 10-min idle-connection timeout on long adaptive-thinking responses. `build_input_bundle` is pure (same `release_dir` → byte-identical bytes → identical `sha256`) and assembles eleven blocks: `release/README.md`, per-tier `dataset_card.md`, `docs/release/generation_method.md`, `manifest.json`, `feature_dictionary.csv`, `validation_report.{md,json}`, the first 100 test-split rows rendered as deterministic CSV, the public/instructor diff summary (live-derived from the `BANNED_LEAD_COLUMNS` / `BANNED_OPP_COLUMNS` / `BANNED_TABLES` / `SNAPSHOT_FILTERED_TABLES` constants in `leakage_probes.py` — single source of truth, auto-stays-in-sync, sync-tested), the public-safe mechanism summary (motif family **names** + difficulty knob **names**, never values — same redaction posture as `student_public`), and the break-me guide verbatim ("avoid re-deriving" the existing nine patterns). `parse_critique_response` schema-validator pins eleven malformations (missing required field, wrong severity, wrong category, wrong rubric dimension, finding-id collision, findings non-list, top-level non-object, non-JSON, score out of range, defensive code-fence stripping, empty findings list valid) and returns every problem in one error rather than the first one. Output schema is a frozen dataclass (no pydantic dependency) with the nine-value `category` vocabulary lifted **verbatim** from `break_me_guide.md` so findings route to existing issue-template labels without translation; `rubric_dimension: str` is required on every finding (D1-D14) so reviewers can audit clustering. Provenance triple (`model` / `effort` / `thinking_mode`) plus per-source-file `bundle_hashes` and the assembled `input_bundle_sha256` are carried on every result for audit-artifact-sync — re-runs on the same RC produce the same bundle hashes. `docs/release/llm_critique_prompt.md` (new) — the rubric document the driver feeds to Claude, parseable via `` / `` section markers with surrounding prose ignored; fourteen rubric dimensions (D1 documentation truthfulness · D2 leakage discipline · D3 realism vs disclosure · D4 difficulty signal · D5 calibration / value-aware ranking · D6 cohort/time-window discipline · D7 notebook integrity · D8 platform packaging hygiene · D9 adversarial-framing completeness · D10 pedagogy of the documented `total_touches_all` trap · D11 effective semantic diversity per recommendation #12 v1 scope · D12 Datasheets-for-Datasets composition · D13 manifest/provenance integrity · D14 out-of-scope guard). Severity calibration explicitly written to discourage padding the report with low-severity nits and to surface "no high-severity findings" as a positive signal vs "the critique didn't surface any". `scripts/run_llm_critique.py` (new) — driver mirroring `validate_release_candidate.py`'s posture (free-function `parse_args`, frozen `DriverConfig`, `run_critique(config) -> DriverResult`, `main(argv)` returning an exit code). Skip-cleanly path triggers BEFORE any I/O — no rubric read, no bundle build, no out-dir creation; tested explicitly with `not (tmp_path / "out").exists()` after the skip. Three modes alongside the live path: `--dry-run` writes the rendered input bundle to `/llm_critique_input_.md` for human inspection (different filename from the real raw JSON, can't be confused); `--no-execute` calls `api_key_or_skip` + `build_anthropic_client()` to prove the SDK is installed and creds are present without burning an API call (CI smoke); `--out-tag` suffixes the raw filename so adjudication re-runs don't shadow the canonical run. Outputs: timestamped `llm_critique_raw_.json` (accumulates per run, no clobber) + canonical `llm_critique_summary.md` (overwritten in place so dataset-card links don't rot). Exit codes mirror `validate_release_candidate.py`: 0 pass (skip-cleanly counts as pass), 1 high-severity surfaced and unresolved, 2 pre-flight error or schema-validation failure (every problem rendered to stderr, not just the first). Adjudication is **maintainer-driven** post-exit — resolve in code OR log to `v2_decision_log.md`, then re-run; the next critique's exit code is the gate. Tests: 61 cases across `tests/validation/test_llm_critique.py` (48) and `tests/scripts/test_run_llm_critique.py` (13), no live API; the protocol is exercised via a small in-process `_CannedClient` fake. Sync tests pin: every `VALID_CATEGORIES` entry appears in `break_me_guide.md` (vocabulary doesn't drift), `VALID_RUBRIC_DIMENSIONS` is exactly D1-D14, the live-derived public/instructor diff names every banned-column/banned-table constant (live reference, not duplicated string). Audit-artifact-sync smoke test (`test_real_release_dir_smoke`) builds the input bundle against the actual `release/intermediate/` artefacts and pins determinism on the real input, skipping cleanly when bundles aren't present. `docs/release/llm_critique_design.md` (new) records the nine load-bearing design calls before implementation so a reviewer can audit the choice (provider abstraction, skip-cleanly, model+caching+thinking, output schema, input-bundle composition, determinism via provenance, CLI flags, test posture, first-run adjudication workflow). Live first-run deferred to maintainer (no `ANTHROPIC_API_KEY` available to the agent); the dry-run path was exercised against the real release dir end-to-end, producing a 148KB byte-stable input bundle from the actual artefacts. Hostile self-review pass before requesting review caught and folded back twelve findings against the diff, including two BLOCKERs (`--no-execute` was performing pre-flight I/O before the credentials check, contradicting the design doc; raw-output filename collision at second-precision contradicted the "append-only history" promise — fixed with microsecond precision and a pinning test) and five HIGHs (silent `release_id` default that defeated the audit-artifact-sync gate; design-doc lies about a never-existing `temperature` field and "malformed timestamp" malformation that's driver-generated; dead `if/else` branches in `_safe_difficulty_knobs`; greedy regex for the rubric section markers so the prompt-injection warning paragraph that legitimately references `` doesn't break the parser). Prompt-injection mitigation added to the rubric (treat-input-as-data preamble) since the input bundle inlines user-authored content (dataset_card.md, break_me_guide.md). Schema validator hardened against silent `str()` coercion of finding prose fields (an int "claim" would have landed on disk as the string "5" — now rejected). Net: 1321/1321 tests pass + 5 publish-extra-gated skips; ruff + mypy clean (83 source files); leakage probes 0/3 on every tier; hash determinism PASS 67/67; `validate_release_candidate --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5; validation_report timestamp drift reverted before commit per the brief. Second senior-dev review pass after PR #76 was opened caught and folded back 9 more issues, several of which were real bugs the first hostile pass missed: (B1) `--out-tag` suffixed only the raw JSON, leaving `llm_critique_summary.md` clobbered on adjudication runs — fix suffixes both files (`summary_output_path` now takes `tag`); (B2) skip-cleanly silently passed a release-readiness gate, contradicting `v1_release_roadmap.md`'s line-35 acceptance criterion that the critique must actually run — added `--require-execute` flag (default off; release-readiness CI sets it) that converts the skip path into `MissingCredentialsError` exit 2, plus a loud `WARNING — release-readiness gate has NOT been evaluated` stderr line on the regular skip path; (A2) two prompt-cache breakpoints cut to one — system content already sits inside the cached prefix on `messages.create` (system → messages render order), so the second breakpoint bought nothing and burned a slot; (M1) design doc cut from 394 lines to 73 — the 9-decision table replaces the multi-paragraph rationale-per-call shape that read as documentation theater; (M2) rubric cut from 420 lines to ~210 — each dimension now one paragraph instead of 3-6, dropped D14 ("out-of-scope guard") which was meta-instruction not a rubric dimension, made it a "What is NOT yours to audit" appendix at the end; rubric is now D1-D13 and `VALID_RUBRIC_DIMENSIONS` updated in lockstep; (M3) test-split sample replaced 100 raw rows of CSV with `df.describe(include="all")` per-column statistics + a 20-row head — distributional conclusions need statistics not raw rows, and the rendered input bundle dropped from 148KB to 128KB; (M5) streaming-via-`messages.stream` replaced with `messages.create(timeout=600.0)` — no stream events were processed anyway, the contract is just "don't time out on long adaptive-thinking responses" and an explicit timeout is the right way to spell that; (M6) `render_input_bundle_text` free function moved to `InputBundle.render()` method — leaky abstraction; the audit-artifact-sync framing was misleading (no committed-artefact diff) and was renamed to "smoke test against the real release dir" / "staleness check vs committed result" throughout the module and design doc. Net after the second pass: 1323/1323 tests pass + 5 publish-extra-gated skips; ruff + mypy clean; leakage probes 0/3 on every tier; hash determinism PASS 67/67; `validate_release_candidate --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5; validation_report timestamp drift reverted again before this commit. First live critique run executed by the maintainer with a dedicated Anthropic project key (`leadforge-llm-critique-v1-prod`): score 7/10, six findings (1 high, 4 medium, 1 low), exit code 1 as designed for unresolved high-severity findings. Adjudication: F001 high-severity (93 % `account_id` overlap between train/test documented only in break_me_guide §5, missing from README/dataset_card) — **resolved in code** by adding a "Group-leakage warning" paragraph to `release/README.md` "Splits" subsection citing the 518/557 figure and a `GroupKFold(account_id)` recipe; the parallel disclosure on the auto-rendered `dataset_card.md` is logged as `accepted-for-v2` because the renderer change is out of scope for PR 7.1's no-bundle-regen rule. F004 medium (break_me_guide pattern 5 covered `account_id` but not `contact_id`, despite contacts being shared across the lead-keyed split at the same magnitude) — **resolved in code** by extending §5 to enumerate both keys and any reusable foreign-key column as group-leakage axes. F006 low (README "Conversion rate (recipe band)" column header didn't make clear it was a recipe-acceptance window not an observed range) — **resolved in code** by renaming to "(acceptance band, gate G7.\*)" and adding a one-sentence note that observed five-seed spreads sit comfortably inside the band. F002 medium (Gaussian noise produces non-physical values: negative ACV, negative day-deltas, day-deltas > snapshot_day=30, undisclosed in dataset card) — `accepted-for-v2`; requires `leadforge/narrative/dataset_card.py` change. F003 medium (`](../foo)` relative links would 404 on Kaggle/HF) — `wont-fix`: already treated by `scripts/_release_common.py::rewrite_release_links()` which both platform packagers (PR 5.1, 5.2) call at packaging time; the LLM didn't have visibility into the platform packagers and made a wrong inference. F005 medium (advanced-tier `calibration_max_bin_error = 0.5234` driven by an n=2 high-probability bin, no minimum-bin-count footnote) — `accepted-for-v2`; not a 1-line change, touches `release_quality.py` metric definition and would require regenerating `validation_report.{json,md}` which PR 7.1's brief explicitly forbids. Three missing-section callouts (Datasheets §Biases, §Privacy, per-bundle group-split warning) and three maintainer questions (noise/windowing interaction, `top_decile_rate` naming, Kaggle/HF docs subtree) all logged to `docs/release/v2_decision_log.md`. README edits cascaded into the platform packager artefacts; `release/kaggle/dataset-metadata.json` and `release/huggingface/README.md` regenerated cleanly via the existing packagers (`scripts/package_{kaggle,hf}_release.py`). Critique run output committed to `release/validation/llm_critique_raw_20260508T204359.124834Z.json` + `release/validation/llm_critique_summary.md`. Final net: 1325/1325 tests pass + 5 publish-extra-gated skips; ruff + mypy clean (83 source files); leakage probes 0/3 on every tier; hash determinism PASS 67/67; `validate_release_candidate --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5. Phase 7 PR 7.1 closed; PR 7.2 (local Kaggle/HF mock-page preview) is next. - [x] PR 7.2: local Kaggle + HuggingFace mock-page preview tooling landed. `scripts/preview_kaggle_page.py` (new) — reads the *exact* artefacts the publish PR will upload (`release/kaggle/dataset-metadata.json` + the inlined README body + the cover image, prefer `release/kaggle/dataset-cover-image.png` then fall back to the gitignore-resilient `release/dataset-cover-image.png` master copy) and renders an offline HTML page mocking the public Kaggle dataset view: header (title / subtitle / id pill / licence / update-frequency / visibility), cover image, rendered description (the inlined README body), file tree of declared resources grouped by tier with per-tier counts, schema/columns table for every tabular resource (`resources[].schema.fields[].name/type/description`) with per-table column counts in the heading, user-specified-sources block (rendered only when present), keywords + licence footer. Serves on `http://localhost:8765` via stdlib `http.server.ThreadingHTTPServer` (the threading variant inherits `allow_reuse_address=True` from `HTTPServer`, so Ctrl-C → re-run within ~60s does not raise `OSError [Errno 48] Address already in use` while the socket sits in TIME_WAIT — caught and folded back in self-review pass 1, the initial draft used `socketserver.ThreadingTCPServer` which defaults to `False`). `--no-serve` builds the HTML and exits (CI / inspection mode); `--open-browser` pops a tab on startup; `--port` / `--release-dir` / `--out-dir` round out the surface. `scripts/preview_hf_page.py` (new) — reads `release/huggingface/README.md` (or `release/huggingface-instructor/README.md` per `--variant=public|instructor`) and parses YAML frontmatter + Markdown body via a single anchored regex (`r"\A---\n(?P.*?)\n---\n(?P.*)\Z"` with `re.DOTALL`); renders the analogous HF view: header pills (pretty_name + license + task_categories + size_categories + language), tag chips, configs dropdown (one details-block per `configs[]` entry with the default config flagged via a single `badge--default` instance, data_files split→path table per config), file tree of declared YAML paths bucketed by config, README body, footer carrying the variant for human visual confirmation. `--variant` defaults `--out-dir` to `release/_preview/huggingface/` (public) or `release/_preview/huggingface-instructor/` (instructor); the instructor path also reads its README from a different location (`huggingface-instructor/README.md`) and looks for the cover under the variant directory first. Both scripts share the validation discipline from the Phase 5 packagers: build → validate → write; pre-flight failures (missing metadata, malformed JSON / YAML, unknown variant, missing cover) raise and the CLI converts to rc=2 without touching disk; runtime success exits 0. Markdown rendering via `markdown-it-py` in `gfm-like` preset (tables / fenced code / strikethrough on; `linkify` explicitly disabled so the optional `linkify-it-py` transitive dep is not required); the dep is added to the `[publish]` extra alongside `datasets` / `kaggle` (mirrors the PR 5.1 / 5.2 gating posture for publish-pipeline tooling), and absent imports raise a clean `ImportError` pointing at `pip install -e ".[publish]"` instead of a cryptic stdlib `ModuleNotFoundError`. Both renderers are pure: same `(metadata|doc, cover_filename|variant)` → byte-identical HTML (no `now()`, no random, no clock). Output landing at `release/_preview//index.html` is gitignored (`.gitignore` adds `release/_preview/`); the audit-artefact-sync gate lives at `release/_preview_committed/{kaggle,huggingface_public,huggingface_instructor}.html` (committed alongside the scripts, mirrors the PR 4.1 / 5.1 / 5.2 / 7.1 audit-sync pattern). HTML is wrapped in a single self-contained file (CSS inlined, no external stylesheet) so each committed sample is human-inspectable directly from `git show` or a browser without a server. XSS-safety: every user-controlled string passes through a hand-rolled `_escape` (`&`, `<`, `>`, `"`, `'`); kept hand-rolled rather than `html.escape` so the committed samples' `'` (decimal) escapes don't churn against `html.escape`'s `'` (hex) entity. Tests: 48 cases across `tests/scripts/test_preview_kaggle_page.py` (20) and `tests/scripts/test_preview_hf_page.py` (28); no live HTTP, no network, no socket open. The four roadmap-mandated checks per script: required field labels appear in rendered HTML (Kaggle: title / subtitle / id / license / file count / schema column count; HF: pretty_name / license / configs / tags); every Markdown link in the source resolves to a non-allowlisted URL pattern fails the test (allow-list: `https://github.com/leadforge-dev/leadforge`, `https://huggingface.co/datasets/leadforge`, sibling-relative `LICENSE`, in-document `#` anchors — anything else is a 404 risk on the live page); the Kaggle schema table lists every column declared in `resources[].schema.fields` (iterates the committed metadata, asserts each `{name}` appears); every `configs[]` block in the HF YAML round-trips into the rendered dropdown. Determinism is double-tested: `test_render_is_byte_deterministic` runs two passes against the real release artefact and pins equality; `test_committed_*_sample_matches_fresh_regeneration` pins the committed HTML against fresh regeneration byte-for-byte (the audit-sync gate). Pre-flight error paths exercised end-to-end: missing artefact (`FileNotFoundError`), malformed JSON / YAML (`ValueError`), unknown variant, missing cover image — all return rc=2 via `main()` with informative stderr. HTML escape coverage: `test_render_escapes_html_in_field_values` asserts a `