feat(scripts): ShmuggingFace preview site builder + Cloudflare Pages deploy (PR 7.2.2)#80
shaypal5 wants to merge 4 commits into
…deploy

- scripts/build_shmuggingface_site.py (new): reads the three public release tiers, renders release/README.md → HTML via markdown-it-py (linkify disabled), loads per-tier manifest/metrics/feature-dict/sample rows, emits a shmuggingface.config.mjs and drives ShmuggingFaceCore to produce a HuggingFace+Kaggle mock static site, then deploys via wrangler pages deploy.
- ShmuggingFaceCore is auto-cloned to /tmp/shmuggingface-core on first run and git-pulled on subsequent runs; no npm dep installation required.
- Config includes descriptionHtml (full README as HTML), coverImage, splits/subsets arrays, files[].about descriptions, and 8 sample rows. All file references use relative sourcePath so ShmuggingFaceCore copies real release files into the dist.
- Cloudflare Pages project 'leadforge-lead-scoring-v1-preview' created on the adanim account; live at: https://leadforge-lead-scoring-v1-preview.pages.dev
- pyproject.toml: per-file-ignores for S603/S607/S108/E501 on the script (subprocess calls with controlled inputs; long data strings).
- .gitignore: add release/_shmuggingface/ and .wrangler/
- .agent-plan.md: mark PR 7.2.2 complete; update PR 7.3 to cite preview site

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…re Pages

Without --branch main, wrangler derives the branch name from the git checkout (feat/shmuggingface-preview-site) and deploys to a preview slot, leaving the root pages.dev URL serving Cloudflare's 'Nothing is here yet' placeholder.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
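A minimal sketch of the fix this commit describes, assuming the deploy step shells out to `wrangler pages deploy`. The helper name `build_deploy_cmd` and its parameters are hypothetical and are not taken from the script.

```python
from pathlib import Path


def build_deploy_cmd(dist_dir: Path, project_name: str, branch: str = "main") -> list[str]:
    """Build the argv for a production deploy (hypothetical helper)."""
    return [
        "wrangler", "pages", "deploy", str(dist_dir),
        "--project-name", project_name,
        # Pin the branch explicitly so wrangler does not infer it from the git checkout.
        "--branch", branch,
    ]


cmd = build_deploy_cmd(Path("release/_shmuggingface/dist"), "leadforge-lead-scoring-v1-preview")
print(" ".join(cmd))
```

Passing the branch as a parameter with a `main` default keeps the production deploy the default path while still allowing a deliberate preview deploy.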
Pull request overview
Adds a new one-command script that builds and (optionally) deploys a live ShmuggingFace/Cloudflare Pages preview minisite from the leadforge release artifacts, to support pre-publish review of how the dataset will look on Kaggle and Hugging Face. This complements the local PR 7.2 preview tooling with an external, browser-shareable preview ahead of the PR 7.3 publish work.
Changes:
- New `scripts/build_shmuggingface_site.py`: renders README, loads per-tier manifest/metrics/feature dictionary/sample rows, emits `shmuggingface.config.mjs`, drives ShmuggingFaceCore (auto-cloned to `/tmp/shmuggingface-core`), and optionally deploys to Cloudflare Pages via `wrangler`.
- `.gitignore` adds `release/_shmuggingface/` and `.wrangler/`; `pyproject.toml` adds Ruff per-file ignores (S603/S607/S108/E501) for the new script.
- `.agent-plan.md` marks PR 7.2.2 done and updates PR 7.3 to reference the preview site.
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| scripts/build_shmuggingface_site.py | New end-to-end build/deploy script for the ShmuggingFace preview site. |
| pyproject.toml | Ruff per-file ignores for the new script (subprocess + long data strings). |
| .gitignore | Ignore generated release/_shmuggingface/ output and .wrangler/. |
| .agent-plan.md | Marks PR 7.2.2 complete; updates PR 7.3 to cite the preview site. |
    _PARENT_LINK_RE = re.compile(r"\]\(\.\./([^)]+)\)")
    _VALIDATION_LINK_RE = re.compile(r"\]\(validation/validation_report\.md\)")


    def _rewrite_links(text: str) -> str:
        """Rewrite relative markdown links to GitHub blob URLs."""
        text = _PARENT_LINK_RE.sub(rf"]({GITHUB_BLOB_BASE}/\1)", text)
        text = _VALIDATION_LINK_RE.sub(
            f"]({GITHUB_BLOB_BASE}/release/validation/validation_report.md)", text
        )
        return text

    def load_tier(release_dir: Path, tier: str) -> dict:
        """Load manifest, metrics, feature dictionary, and sample rows for one tier."""
        tier_dir = release_dir / tier
        manifest = json.loads((tier_dir / "manifest.json").read_text())
        metrics = json.loads((tier_dir / "metrics.json").read_text())

        fd = pd.read_csv(tier_dir / "feature_dictionary.csv")
        columns = list(fd["name"])

        df = pd.read_csv(tier_dir / "lead_scoring.csv")

        "task": "tabular-classification",
        "language": "English",
        "rowCount": n_leads,
        "splits": ["train", "valid", "test"],

    def kb(path: Path) -> str:
        return f"{max(1, path.stat().st_size // 1024)} KB"

    def ensure_smf_core(smf_core: Path | None) -> Path:
        """Return path to a working ShmuggingFaceCore checkout, cloning if needed."""
        if smf_core is not None:
            entry = smf_core / "bin/shmuggingface.mjs"
            if not entry.exists():
                sys.exit(f"ShmuggingFaceCore entry point not found at {entry}")
            return smf_core

        entry = SMF_CORE_CACHE / "bin/shmuggingface.mjs"
        if SMF_CORE_CACHE.exists() and entry.exists():
            print(f" Updating ShmuggingFaceCore cache at {SMF_CORE_CACHE}", file=sys.stderr)
            subprocess.run(
                ["git", "-C", str(SMF_CORE_CACHE), "pull", "--quiet"],
                check=False,
            )
        else:
            print(f" Cloning ShmuggingFaceCore → {SMF_CORE_CACHE}", file=sys.stderr)
            subprocess.run(
                ["git", "clone", "--depth=1", SMF_CORE_REPO, str(SMF_CORE_CACHE)],
                check=True,
            )
        return SMF_CORE_CACHE

            print(f" Updating ShmuggingFaceCore cache at {SMF_CORE_CACHE}", file=sys.stderr)
            subprocess.run(
                ["git", "-C", str(SMF_CORE_CACHE), "pull", "--quiet"],
                check=False,

    def _load_cf_env(cf_env_path: Path) -> dict:
        """Parse a shell env file and return a dict of variable overrides."""
        env = os.environ.copy()
        for raw_line in cf_env_path.read_text().splitlines():
            line = raw_line.strip()
            if line.startswith("#") or not line:
                continue
            if line.startswith("export "):
                line = line[len("export ") :]
            if "=" in line:
                key, _, val = line.partition("=")
                env[key.strip()] = val.strip().strip("'\"")
        return env

            f"\n Live at: https://{project_name}.pages.dev",
            file=sys.stderr,
        )


    # ---------------------------------------------------------------------------
    # CLI
    # ---------------------------------------------------------------------------


    def main() -> None:
        import argparse

        parser = argparse.ArgumentParser(
            description="Build (and optionally deploy) the ShmuggingFace review minisite.",
            formatter_class=argparse.RawDescriptionHelpFormatter,
        )
        parser.add_argument(
            "--release-dir",
            default="release",
            type=Path,
            metavar="PATH",
            help="Root of the release directory (default: release/)",
        )
        parser.add_argument(
            "--out-dir",
            type=Path,
            metavar="PATH",
            help="Output directory for the static site (default: release/_shmuggingface/dist)",
        )
        parser.add_argument(
            "--smf-core",
            type=Path,
            default=None,
            metavar="PATH",
            help="Path to a local ShmuggingFaceCore checkout (auto-cloned if absent)",
        )
        parser.add_argument(
            "--deploy",
            action="store_true",
            help="Deploy to Cloudflare Pages after building",
        )
        parser.add_argument(
            "--cf-env",
            type=Path,
            default=DEFAULT_CF_ENV,

…back
- package.json (new): declares @shmuggingface/core via the GitHub release tag
"github:ShmuggingFace/ShmuggingFaceCore#v1.0.0"
- package-lock.json (new): lockfile pinning the resolved SHA
- scripts/build_shmuggingface_site.py: ensure_smf_core() now resolves via
node_modules/@shmuggingface/core (npm install path) as the canonical source;
--smf-core PATH override kept for local dev; git-clone-to-/tmp fallback
removed in favour of a clear error pointing at npm install.
Logs 'Using npm-installed @shmuggingface/core vX.Y.Z' for traceability.
- .gitignore: add node_modules/ (package.json + lock are committed, not the tree)
Site rebuilt from v1.0.0 and redeployed to:
https://leadforge-lead-scoring-v1-preview.pages.dev
Both HF-style (12 pages) and Kaggle-style (12 pages) mocks confirmed present.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
subsets was hardcoded to 'leadforge-lead-scoring-v1' for all three tiers, making the HF Dataset Viewer's Subset dropdown show the same name regardless of which tier page you were on. Now each tier gets its own suffixed name:

- leadforge-lead-scoring-v1-intro
- leadforge-lead-scoring-v1-intermediate
- leadforge-lead-scoring-v1-advanced

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
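The per-tier naming in this commit can be sketched as follows. This is only an illustration under assumptions: `BASE_NAME`, `TIERS`, and `subset_name` are hypothetical names, and the real config may hold more than one subset per tier.

```python
BASE_NAME = "leadforge-lead-scoring-v1"
TIERS = ("intro", "intermediate", "advanced")


def subset_name(tier: str) -> str:
    """Return the tier-specific subset label (hypothetical helper)."""
    if tier not in TIERS:
        raise ValueError(f"unknown tier: {tier!r}")
    return f"{BASE_NAME}-{tier}"


# One subsets array per tier, so each tier page shows its own name.
subsets_by_tier = {tier: [subset_name(tier)] for tier in TIERS}
print(subsets_by_tier)
```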
pr-agent-context report: This run includes unresolved review comments on PR #80 in repository https://github.com/leadforge-dev/leadforge
For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.
After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, and push all of these changes in a single commit.
# Copilot Comments
## COPILOT-1
Location: scripts/build_shmuggingface_site.py:89
URL: https://github.com/leadforge-dev/leadforge/pull/80#discussion_r3295231107
Root author: copilot-pull-request-reviewer
Comment:
The README link rewriting here duplicates the canonical implementation in `scripts/_release_common.py::rewrite_release_links` (which exposes `GITHUB_BLOB_BASE`, `PARENT_RELATIVE_LINK_RE`, and the validation-report rewrite). Importing and reusing `rewrite_release_links` would single-source the rule so the preview site cannot drift from what the Kaggle/HF packagers already rewrite at packaging time.
## COPILOT-2
Location: scripts/build_shmuggingface_site.py:114
URL: https://github.com/leadforge-dev/leadforge/pull/80#discussion_r3295231118
Root author: copilot-pull-request-reviewer
Comment:
`load_tier` will raise a bare `FileNotFoundError` if a tier directory hasn't been materialized (the `release/{intro,intermediate,advanced}/` trees are gitignored per `.gitignore:212-217` and only exist after running `scripts/build_public_release.py`). Other release tooling — e.g. `scripts/verify_claims_register.py` — soft-skips missing bundle dirs and surfaces an actionable message. Consider a pre-flight check that emits a clear "run scripts/build_public_release.py first" error instead of a stack trace on a fresh checkout.
## COPILOT-3
Location: scripts/build_shmuggingface_site.py:237
URL: https://github.com/leadforge-dev/leadforge/pull/80#discussion_r3295231123
Root author: copilot-pull-request-reviewer
Comment:
The HuggingFace packager uses the HF-canonical split label `validation` (see `scripts/package_hf_release.py:124-132` and the committed `release/huggingface/README.md` frontmatter), but the preview config advertises `"valid"` here. Since this site is explicitly meant to mirror how the dataset will look on Hugging Face, exposing `valid` in `splits` makes the preview disagree with what HF reviewers will actually see. Consider using `["train", "validation", "test"]` for the user-facing labels.
## COPILOT-4
Location: scripts/build_shmuggingface_site.py:164
URL: https://github.com/leadforge-dev/leadforge/pull/80#discussion_r3295231124
Root author: copilot-pull-request-reviewer
Comment:
`kb()` reports `path.stat().st_size // 1024` KB, which truncates anything <1024 bytes to 0 (then bumped to 1 by `max(1, ...)`) and rounds down for everything else. For sub-MB CSVs/parquets the rounding bias can be material in a "human-readable size" column. Consider rounding to nearest and/or switching to MB above some threshold, similar to a `humanize`-style formatter.
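A sketch of a formatter along the lines suggested here, assuming binary units and a switch to MB near 1 MiB. `human_size` is a hypothetical replacement for `kb()`, taking a byte count rather than a path.

```python
def human_size(num_bytes: int) -> str:
    """Format a byte count as B, KB (nearest whole), or MB (one decimal)."""
    if num_bytes < 0:
        raise ValueError("size cannot be negative")
    if num_bytes < 1024:
        return f"{num_bytes} B"
    kib = num_bytes / 1024
    # Stay in KB only while the rounded value is below 1024, so "1024 KB" never appears.
    if kib < 1023.5:
        # The .0f format rounds to nearest; exact halves round to the even neighbour.
        return f"{kib:.0f} KB"
    return f"{num_bytes / (1024 * 1024):.1f} MB"


for size in (500, 1600, 5 * 1024 * 1024):
    print(size, "->", human_size(size))
```

Calling it as `human_size(path.stat().st_size)` would keep the call sites close to the existing `kb(path)` usage.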
## COPILOT-5
Location: scripts/build_shmuggingface_site.py
URL: https://github.com/leadforge-dev/leadforge/pull/80#discussion_r3295231131
Status: outdated
Root author: copilot-pull-request-reviewer
Comment:
`ensure_smf_core` updates a cache under `/tmp/shmuggingface-core` without pinning a commit or tag: a `git pull` (or fresh `--depth=1` clone) will silently pick up whatever is on `main` upstream. Because the generated site's structure depends on `bin/shmuggingface.mjs`, an unrelated change in ShmuggingFaceCore can quietly alter the preview output or break this script with no local code change. Consider pinning a specific commit/tag (e.g. via a `SMF_CORE_REF` constant and `git -C ... checkout <ref>` after fetch) for reproducibility.
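The pinning idea could be sketched like this. It only builds the git argument lists and does not run them; `SMF_CORE_REF`, `pinned_checkout_cmds`, and the choice of a tag as the pin are assumptions for illustration.

```python
from pathlib import Path

SMF_CORE_REF = "v1.0.0"  # assumed pin; a tag name works for both command paths below


def pinned_checkout_cmds(repo_url: str, cache: Path, ref: str = SMF_CORE_REF) -> list[list[str]]:
    """Return the git commands that leave `cache` checked out at `ref`."""
    if (cache / ".git").is_dir():
        # Existing cache: fetch only the pinned ref, then detach HEAD at it.
        return [
            ["git", "-C", str(cache), "fetch", "--depth=1", "origin", ref],
            ["git", "-C", str(cache), "checkout", "--detach", "FETCH_HEAD"],
        ]
    # Fresh clone: --branch also accepts a tag name.
    return [["git", "clone", "--depth=1", "--branch", ref, repo_url, str(cache)]]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` so that a failed fetch or checkout stops the build instead of continuing on an unknown revision.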
## COPILOT-6
Location: scripts/build_shmuggingface_site.py
URL: https://github.com/leadforge-dev/leadforge/pull/80#discussion_r3295231138
Status: outdated
Root author: copilot-pull-request-reviewer
Comment:
The `git pull` invocation uses `check=False`, so a failed update (network glitch, dirty cache, force-push upstream) is silently ignored and the script continues with a stale checkout — there's no warning logged either. Either set `check=True` (consistent with the clone branch on the next lines) or inspect `returncode` and emit a warning so users know the cache wasn't refreshed.
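A sketch of the second option, inspecting the return code and warning. `run_with_warning` is a hypothetical helper and the warning text is an assumption.

```python
import subprocess
import sys


def run_with_warning(cmd: list[str], what: str) -> bool:
    """Run cmd; on failure print a warning to stderr and return False."""
    try:
        result = subprocess.run(cmd, check=False, capture_output=True, text=True)
    except FileNotFoundError:
        print(f"  WARNING: {what} skipped, {cmd[0]!r} not found", file=sys.stderr)
        return False
    if result.returncode != 0:
        detail = result.stderr.strip()
        print(
            f"  WARNING: {what} failed (exit {result.returncode}): {detail}",
            file=sys.stderr,
        )
        return False
    return True
```

A call such as `run_with_warning(["git", "-C", str(cache), "pull", "--quiet"], "cache update")` would keep the build going on a stale cache while making that visible in the log.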
## COPILOT-7
Location: scripts/build_shmuggingface_site.py:347
URL: https://github.com/leadforge-dev/leadforge/pull/80#discussion_r3295231144
Root author: copilot-pull-request-reviewer
Comment:
`_load_cf_env`'s quote stripping uses `val.strip().strip("'\"")`, which strips any combination of leading/trailing single or double quotes rather than matching pairs. A value like `"it's"` would become `it's` correctly, but `'mixed"` would have both ends stripped, and a value containing an `=` sign is preserved (good — `partition` handles that). The bigger issue is that values containing `#` after the value (a trailing inline comment in some `.env` styles) are not handled, but more importantly any inline whitespace inside the value is preserved while surrounding whitespace is stripped — which differs from how a shell would `source` the file. Consider using `shlex.split` or `python-dotenv` if available to get shell-compatible parsing.
## COPILOT-8
Location: scripts/build_shmuggingface_site.py:435
URL: https://github.com/leadforge-dev/leadforge/pull/80#discussion_r3295231147
Root author: copilot-pull-request-reviewer
Comment:
Nearby release/preview scripts (`scripts/preview_kaggle_page.py`, `scripts/preview_hf_page.py`, `scripts/validate_release_candidate.py`) all use a testable driver pattern: free-function `parse_args(argv)`, a frozen config dataclass, a `run_*(config) -> Outcome`, and `main(argv) -> int` returning an exit code (with `sys.exit(main())` at the bottom). This script uses `main() -> None` with `argparse` instantiated inline and scatters `sys.exit(...)` calls inside helpers (lines 44, 284, 344, 365, 427). That makes it harder to unit-test (no `argv` injection, exits short-circuit assertions) and breaks consistency with the existing PR 7.2 preview scripts that this PR explicitly cites as siblings. Consider refactoring to match the established pattern so the script becomes test-friendly.
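The driver pattern described in this comment might look roughly like the sketch below. The sibling scripts are not shown here, so `SiteConfig`, `run_build`, the reduced option set, and the use of a plain `int` in place of an `Outcome` type are all assumptions.

```python
from __future__ import annotations

import argparse
import sys
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class SiteConfig:
    """Immutable settings for one build (hypothetical, reduced to two fields)."""

    release_dir: Path
    deploy: bool


def parse_args(argv: list[str] | None = None) -> SiteConfig:
    """Parse argv (or sys.argv when None) into a SiteConfig."""
    parser = argparse.ArgumentParser(description="Build the preview site (sketch).")
    parser.add_argument("--release-dir", type=Path, default=Path("release"))
    parser.add_argument("--deploy", action="store_true")
    ns = parser.parse_args(argv)
    return SiteConfig(release_dir=ns.release_dir, deploy=ns.deploy)


def run_build(config: SiteConfig) -> int:
    """Return an exit code instead of calling sys.exit() inside helpers."""
    if not config.release_dir.is_dir():
        print(f"release dir not found: {config.release_dir}", file=sys.stderr)
        return 2
    # The real build and optional deploy steps would run here.
    return 0


def main(argv: list[str] | None = None) -> int:
    return run_build(parse_args(argv))


# In the script itself, the last line would be: sys.exit(main())
```

With this shape a test can call `main(["--release-dir", "some/dir"])` directly and assert on the returned code.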
Superseded by PR #86 (PR 8.4), which brings this script to main with all review issues fixed:
This branch also had merge conflicts with main that would have needed resolving. Closing in favour of the clean PR #86.
Summary

Adds `scripts/build_shmuggingface_site.py`, a one-command script that builds a live HuggingFace + Kaggle mock review site from the release artifacts and deploys it to Cloudflare Pages.

Live preview site: https://leadforge-lead-scoring-v1-preview.pages.dev

What this PR does

`scripts/build_shmuggingface_site.py` (new)

Full pipeline in a single script:

- Renders `release/README.md` → HTML via `markdown-it-py` (`gfm-like` preset, linkify disabled), with `](../foo)` → GitHub blob URL rewriting.
- Loads `manifest.json`, `metrics.json`, `feature_dictionary.csv`, and 8 sample rows from `lead_scoring.csv` for each of intro / intermediate / advanced.
- Emits `release/_shmuggingface/shmuggingface.config.mjs` with:
  - `descriptionHtml`: full rendered README injected without escaping
  - `coverImage`: relative path to `release/dataset-cover-image.png`
  - `splits` / `subsets`: Dataset Viewer menus (train/valid/test split sizes, per tier)
  - `files[].about`: human-readable per-file descriptions for all 6 files per tier
  - `rows`: sample rows per tier
  - `sourcePath` entries use `os.path.relpath()` (config dir and tier dirs are siblings, so `Path.relative_to()` would raise)
- Auto-clones ShmuggingFaceCore to `/tmp/shmuggingface-core` (or `git pull` on subsequent runs), then runs `node bin/shmuggingface.mjs build` → 48 static files.
- Deploys with `wrangler pages deploy release/_shmuggingface/dist --project-name leadforge-lead-scoring-v1-preview` using the adanim Cloudflare account credentials from `~/.config/adanim/cloudflare_api_token.env` (parsed in Python since subprocess can't `source` a shell file).

CLI

Options: `--release-dir`, `--out-dir`, `--smf-core`, `--deploy`, `--cf-env`, `--project-name`.

Other changes

- `.gitignore`: add `release/_shmuggingface/` (config + dist) and `.wrangler/`
- `pyproject.toml`: `per-file-ignores` for `S603`/`S607`/`S108`/`E501` on the new script (subprocess calls with repo-controlled inputs; long data strings are page content, not source)
- `.agent-plan.md`: mark PR 7.2.2 ✓; update PR 7.3 description to cite the preview site as a pre-flight step

Cloudflare Pages setup

Project `leadforge-lead-scoring-v1-preview` was created on the adanim account and the first deployment pushed 48 files. Subsequent `--deploy` runs will push only changed files.

🤖 Generated with Claude Code
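
The note in the description about `sourcePath` and sibling directories can be checked with a short sketch. The two paths are taken from the description; the variable names are illustrative only.

```python
import os
from pathlib import Path

config_dir = Path("release/_shmuggingface")
tier_file = Path("release/intro/lead_scoring.csv")

# Path.relative_to() only works when the target sits under the base directory,
# so it raises ValueError for a sibling directory (with default arguments).
try:
    tier_file.relative_to(config_dir)
    raised = False
except ValueError:
    raised = True

# os.path.relpath() walks up with ".." as needed.
source_path = os.path.relpath(tier_file, start=config_dir)
print(raised, source_path)
```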