feat(scripts): ShmuggingFace preview site builder + Cloudflare Pages deploy (PR 7.2.2) #80

Closed
shaypal5 wants to merge 4 commits into main from feat/shmuggingface-preview-site

Conversation

@shaypal5

Contributor

Summary

Adds scripts/build_shmuggingface_site.py — a one-command script that builds a live HuggingFace + Kaggle mock review site from the release artifacts and deploys it to Cloudflare Pages.

Live preview site: https://leadforge-lead-scoring-v1-preview.pages.dev


What this PR does

scripts/build_shmuggingface_site.py (new)

Full pipeline in a single script:

  1. Render README — release/README.md → HTML via markdown-it-py (gfm-like preset, linkify disabled), with ](../foo) → GitHub blob URL rewriting.
  2. Load tiers — per-tier manifest.json, metrics.json, feature_dictionary.csv, and 8 sample rows from lead_scoring.csv for each of intro / intermediate / advanced.
  3. Build config — emits release/_shmuggingface/shmuggingface.config.mjs with:
    • descriptionHtml — full rendered README injected without escaping
    • coverImage — relative path to release/dataset-cover-image.png
    • splits / subsets — Dataset Viewer menus (train/valid/test split sizes, per tier)
    • files[].about — human-readable per-file descriptions for all 6 files per tier
    • 8 sample rows per tier
    • All sourcePath entries use os.path.relpath() (config dir and tier dirs are siblings, so Path.relative_to() would raise)
  4. Build site — auto-clones ShmuggingFaceCore to /tmp/shmuggingface-core (or git pull on subsequent runs), then runs node bin/shmuggingface.mjs build → 48 static files.
  5. Deploy — wrangler pages deploy release/_shmuggingface/dist --project-name leadforge-lead-scoring-v1-preview using the adanim Cloudflare account credentials from ~/.config/adanim/cloudflare_api_token.env (parsed in Python since subprocess can't source a shell file).
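
The sourcePath note in step 3 can be illustrated with a minimal sketch. The two paths below are hypothetical stand-ins for the config dir and a tier file; only the use of os.path.relpath() in place of Path.relative_to() comes from the description above.

```python
import os
from pathlib import Path

# Hypothetical layout: the config dir and the tier dirs are siblings under release/.
config_dir = Path("release/_shmuggingface")
tier_file = Path("release/intro/lead_scoring.csv")

# Path.relative_to() only succeeds when the target lies inside the base path,
# so it raises ValueError for a file in a sibling directory.
try:
    tier_file.relative_to(config_dir)
    relative_to_failed = False
except ValueError:
    relative_to_failed = True

# os.path.relpath() works lexically and walks up with ".." when needed.
source_path = Path(os.path.relpath(tier_file, start=config_dir)).as_posix()
print(relative_to_failed, source_path)
```

The resulting value is what a sourcePath entry would look like relative to the config file, assuming the layout above.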

CLI

# Build only (inspect locally)
python scripts/build_shmuggingface_site.py --release-dir release

# Build + deploy
python scripts/build_shmuggingface_site.py --release-dir release --deploy

Options: --release-dir, --out-dir, --smf-core, --deploy, --cf-env, --project-name.

Other changes

  • .gitignore: add release/_shmuggingface/ (config + dist) and .wrangler/
  • pyproject.toml: per-file-ignores for S603/S607/S108/E501 on the new script (subprocess calls with repo-controlled inputs; long data strings are page content, not source)
  • .agent-plan.md: mark PR 7.2.2 ✓; update PR 7.3 description to cite the preview site as a pre-flight step

Cloudflare Pages setup

Project leadforge-lead-scoring-v1-preview was created on the adanim account and the first deployment pushed 48 files. Subsequent --deploy runs will push only changed files.

🤖 Generated with Claude Code

…deploy

- scripts/build_shmuggingface_site.py (new): reads the three public release
  tiers, renders release/README.md → HTML via markdown-it-py (linkify
  disabled), loads per-tier manifest/metrics/feature-dict/sample rows, emits
  a shmuggingface.config.mjs and drives ShmuggingFaceCore to produce a
  HuggingFace+Kaggle mock static site, then deploys via wrangler pages deploy.

- ShmuggingFaceCore is auto-cloned to /tmp/shmuggingface-core on first run
  and git-pulled on subsequent runs; no npm dep installation required.

- Config includes descriptionHtml (full README as HTML), coverImage,
  splits/subsets arrays, files[].about descriptions, and 8 sample rows.
  All file references use relative sourcePath so ShmuggingFaceCore copies
  real release files into the dist.

- Cloudflare Pages project 'leadforge-lead-scoring-v1-preview' created on
  the adanim account; live at:
  https://leadforge-lead-scoring-v1-preview.pages.dev

- pyproject.toml: per-file-ignores for S603/S607/S108/E501 on the script
  (subprocess calls with controlled inputs; long data strings).
- .gitignore: add release/_shmuggingface/ and .wrangler/
- .agent-plan.md: mark PR 7.2.2 complete; update PR 7.3 to cite preview site

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 24, 2026 20:00
@shaypal5 added the type: feature (New capability) and layer: cli (cli/ command-line interface) labels May 24, 2026

…re Pages

Without --branch main wrangler derives the branch name from the git checkout
(feat/shmuggingface-preview-site) and deploys to a preview slot, leaving the
root pages.dev URL serving Cloudflare's 'Nothing is here yet' placeholder.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
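
A minimal sketch of how the deploy command might be assembled so the branch is always explicit. The helper name build_deploy_cmd is hypothetical and not taken from the script; the --project-name and --branch flags are the ones named in this PR.

```python
from pathlib import Path


def build_deploy_cmd(dist_dir: Path, project_name: str, branch: str = "main") -> list[str]:
    """Return the argv for `wrangler pages deploy` with an explicit branch.

    Passing --branch stops wrangler from deriving the branch from the
    current git checkout, which would otherwise target a preview slot.
    """
    return [
        "wrangler", "pages", "deploy", str(dist_dir),
        "--project-name", project_name,
        "--branch", branch,
    ]


cmd = build_deploy_cmd(
    Path("release/_shmuggingface/dist"),
    "leadforge-lead-scoring-v1-preview",
)
print(" ".join(cmd))
```

Returning the argv as a list keeps the command easy to assert on in a test before it is handed to subprocess.run.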

Copilot AI left a comment

Pull request overview

Adds a new one-command script that builds and (optionally) deploys a live ShmuggingFace/Cloudflare Pages preview minisite from the leadforge release artifacts, to support pre-publish review of how the dataset will look on Kaggle and Hugging Face. This complements the local PR 7.2 preview tooling with an external, browser-shareable preview ahead of the PR 7.3 publish work.

Changes:

  • New scripts/build_shmuggingface_site.py: renders README, loads per-tier manifest/metrics/feature dictionary/sample rows, emits shmuggingface.config.mjs, drives ShmuggingFaceCore (auto-cloned to /tmp/shmuggingface-core), and optionally deploys to Cloudflare Pages via wrangler.
  • .gitignore adds release/_shmuggingface/ and .wrangler/; pyproject.toml adds Ruff per-file ignores (S603/S607/S108/E501) for the new script.
  • .agent-plan.md marks PR 7.2.2 done and updates PR 7.3 to reference the preview site.

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| scripts/build_shmuggingface_site.py | New end-to-end build/deploy script for the ShmuggingFace preview site. |
| pyproject.toml | Ruff per-file ignores for the new script (subprocess + long data strings). |
| .gitignore | Ignore generated release/_shmuggingface/ output and .wrangler/. |
| .agent-plan.md | Marks PR 7.2.2 complete; updates PR 7.3 to cite the preview site. |

Comment on lines +78 to +88
_PARENT_LINK_RE = re.compile(r"\]\(\.\./([^)]+)\)")
_VALIDATION_LINK_RE = re.compile(r"\]\(validation/validation_report\.md\)")


def _rewrite_links(text: str) -> str:
    """Rewrite relative markdown links to GitHub blob URLs."""
    text = _PARENT_LINK_RE.sub(rf"]({GITHUB_BLOB_BASE}/\1)", text)
    text = _VALIDATION_LINK_RE.sub(
        f"]({GITHUB_BLOB_BASE}/release/validation/validation_report.md)", text
    )
    return text
Comment on lines +104 to +113
def load_tier(release_dir: Path, tier: str) -> dict:
    """Load manifest, metrics, feature dictionary, and sample rows for one tier."""
    tier_dir = release_dir / tier
    manifest = json.loads((tier_dir / "manifest.json").read_text())
    metrics = json.loads((tier_dir / "metrics.json").read_text())

    fd = pd.read_csv(tier_dir / "feature_dictionary.csv")
    columns = list(fd["name"])

    df = pd.read_csv(tier_dir / "lead_scoring.csv")

    "task": "tabular-classification",
    "language": "English",
    "rowCount": n_leads,
    "splits": ["train", "valid", "test"],
Comment on lines +162 to +163
def kb(path: Path) -> str:
    return f"{max(1, path.stat().st_size // 1024)} KB"
Comment thread scripts/build_shmuggingface_site.py Outdated
Comment on lines +279 to +300
def ensure_smf_core(smf_core: Path | None) -> Path:
    """Return path to a working ShmuggingFaceCore checkout, cloning if needed."""
    if smf_core is not None:
        entry = smf_core / "bin/shmuggingface.mjs"
        if not entry.exists():
            sys.exit(f"ShmuggingFaceCore entry point not found at {entry}")
        return smf_core

    entry = SMF_CORE_CACHE / "bin/shmuggingface.mjs"
    if SMF_CORE_CACHE.exists() and entry.exists():
        print(f"  Updating ShmuggingFaceCore cache at {SMF_CORE_CACHE}", file=sys.stderr)
        subprocess.run(
            ["git", "-C", str(SMF_CORE_CACHE), "pull", "--quiet"],
            check=False,
        )
    else:
        print(f"  Cloning ShmuggingFaceCore → {SMF_CORE_CACHE}", file=sys.stderr)
        subprocess.run(
            ["git", "clone", "--depth=1", SMF_CORE_REPO, str(SMF_CORE_CACHE)],
            check=True,
        )
    return SMF_CORE_CACHE
Comment thread scripts/build_shmuggingface_site.py Outdated
        print(f"  Updating ShmuggingFaceCore cache at {SMF_CORE_CACHE}", file=sys.stderr)
        subprocess.run(
            ["git", "-C", str(SMF_CORE_CACHE), "pull", "--quiet"],
            check=False,
Comment on lines +326 to +338
def _load_cf_env(cf_env_path: Path) -> dict:
    """Parse a shell env file and return a dict of variable overrides."""
    env = os.environ.copy()
    for raw_line in cf_env_path.read_text().splitlines():
        line = raw_line.strip()
        if line.startswith("#") or not line:
            continue
        if line.startswith("export "):
            line = line[len("export ") :]
        if "=" in line:
            key, _, val = line.partition("=")
            env[key.strip()] = val.strip().strip("'\"")
    return env
Comment on lines +378 to +423
        f"\n  Live at: https://{project_name}.pages.dev",
        file=sys.stderr,
    )


# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------


def main() -> None:
    import argparse

    parser = argparse.ArgumentParser(
        description="Build (and optionally deploy) the ShmuggingFace review minisite.",
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    parser.add_argument(
        "--release-dir",
        default="release",
        type=Path,
        metavar="PATH",
        help="Root of the release directory (default: release/)",
    )
    parser.add_argument(
        "--out-dir",
        type=Path,
        metavar="PATH",
        help="Output directory for the static site (default: release/_shmuggingface/dist)",
    )
    parser.add_argument(
        "--smf-core",
        type=Path,
        default=None,
        metavar="PATH",
        help="Path to a local ShmuggingFaceCore checkout (auto-cloned if absent)",
    )
    parser.add_argument(
        "--deploy",
        action="store_true",
        help="Deploy to Cloudflare Pages after building",
    )
    parser.add_argument(
        "--cf-env",
        type=Path,
        default=DEFAULT_CF_ENV,

…back

- package.json (new): declares @shmuggingface/core via the GitHub release tag
    "github:ShmuggingFace/ShmuggingFaceCore#v1.0.0"
- package-lock.json (new): lockfile pinning the resolved SHA
- scripts/build_shmuggingface_site.py: ensure_smf_core() now resolves via
  node_modules/@shmuggingface/core (npm install path) as the canonical source;
  --smf-core PATH override kept for local dev; git-clone-to-/tmp fallback
  removed in favour of a clear error pointing at npm install.
  Logs 'Using npm-installed @shmuggingface/core vX.Y.Z' for traceability.
- .gitignore: add node_modules/ (package.json + lock are committed, not the tree)

Site rebuilt from v1.0.0 and redeployed to:
  https://leadforge-lead-scoring-v1-preview.pages.dev
Both HF-style (12 pages) and Kaggle-style (12 pages) mocks confirmed present.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

subsets was hardcoded to 'leadforge-lead-scoring-v1' for all three tiers,
making the HF Dataset Viewer's Subset dropdown show the same name regardless
of which tier page you were on.  Now each tier gets its own suffixed name:
  leadforge-lead-scoring-v1-intro
  leadforge-lead-scoring-v1-intermediate
  leadforge-lead-scoring-v1-advanced

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions

pr-agent-context report:

This run includes unresolved review comments on PR #80 in repository https://github.com/leadforge-dev/leadforge

For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.

After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, and push all of these changes in a single commit.

# Copilot Comments

## COPILOT-1
Location: scripts/build_shmuggingface_site.py:89
URL: https://github.com/leadforge-dev/leadforge/pull/80#discussion_r3295231107
Root author: copilot-pull-request-reviewer

Comment:
    The README link rewriting here duplicates the canonical implementation in `scripts/_release_common.py::rewrite_release_links` (which exposes `GITHUB_BLOB_BASE`, `PARENT_RELATIVE_LINK_RE`, and the validation-report rewrite). Importing and reusing `rewrite_release_links` would single-source the rule so the preview site cannot drift from what the Kaggle/HF packagers already rewrite at packaging time.

## COPILOT-2
Location: scripts/build_shmuggingface_site.py:114
URL: https://github.com/leadforge-dev/leadforge/pull/80#discussion_r3295231118
Root author: copilot-pull-request-reviewer

Comment:
    `load_tier` will raise a bare `FileNotFoundError` if a tier directory hasn't been materialized (the `release/{intro,intermediate,advanced}/` trees are gitignored per `.gitignore:212-217` and only exist after running `scripts/build_public_release.py`). Other release tooling — e.g. `scripts/verify_claims_register.py` — soft-skips missing bundle dirs and surfaces an actionable message. Consider a pre-flight check that emits a clear "run scripts/build_public_release.py first" error instead of a stack trace on a fresh checkout.
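
    One way the suggested pre-flight check could look, as a sketch only. The function names and the exact message are hypothetical; the tier names and file names are the ones listed in the PR description.

```python
from pathlib import Path
from typing import Optional

TIERS = ("intro", "intermediate", "advanced")
# Per-tier inputs named in the PR description; the check itself is an assumption.
REQUIRED_FILES = (
    "manifest.json",
    "metrics.json",
    "feature_dictionary.csv",
    "lead_scoring.csv",
)


def missing_tier_files(release_dir: Path) -> list[Path]:
    """Return every required per-tier file that does not exist yet."""
    return [
        release_dir / tier / name
        for tier in TIERS
        for name in REQUIRED_FILES
        if not (release_dir / tier / name).is_file()
    ]


def preflight_message(release_dir: Path) -> Optional[str]:
    """Return an actionable error message, or None when all inputs exist."""
    missing = missing_tier_files(release_dir)
    if not missing:
        return None
    listing = "\n".join(f"  - {path}" for path in missing)
    return (
        f"Release tiers are not materialized under {release_dir}:\n{listing}\n"
        "Run scripts/build_public_release.py first."
    )
```

    Returning the message rather than exiting inside the helper leaves the caller free to print it and choose the exit code.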

## COPILOT-3
Location: scripts/build_shmuggingface_site.py:237
URL: https://github.com/leadforge-dev/leadforge/pull/80#discussion_r3295231123
Root author: copilot-pull-request-reviewer

Comment:
    The HuggingFace packager uses the HF-canonical split label `validation` (see `scripts/package_hf_release.py:124-132` and the committed `release/huggingface/README.md` frontmatter), but the preview config advertises `"valid"` here. Since this site is explicitly meant to mirror how the dataset will look on Hugging Face, exposing `valid` in `splits` makes the preview disagree with what HF reviewers will actually see. Consider using `["train", "validation", "test"]` for the user-facing labels.

## COPILOT-4
Location: scripts/build_shmuggingface_site.py:164
URL: https://github.com/leadforge-dev/leadforge/pull/80#discussion_r3295231124
Root author: copilot-pull-request-reviewer

Comment:
    `kb()` reports `path.stat().st_size // 1024` KB, which truncates anything <1024 bytes to 0 (then bumped to 1 by `max(1, ...)`) and rounds down for everything else. For sub-MB CSVs/parquets the rounding bias can be material in a "human-readable size" column. Consider rounding to nearest and/or switching to MB above some threshold, similar to a `humanize`-style formatter.
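
    A minimal sketch of a formatter along these lines. The name human_size, the one-decimal precision, and the unit thresholds are all assumptions rather than anything the script defines.

```python
def human_size(num_bytes: int) -> str:
    """Format a byte count, rounding to one decimal instead of truncating."""
    if num_bytes < 1024:
        # Small files are reported exactly rather than bumped up to "1 KB".
        return f"{num_bytes} B"
    kib = num_bytes / 1024
    if kib < 1024:
        return f"{kib:.1f} KB"
    return f"{kib / 1024:.1f} MB"


for size in (512, 1536, 5 * 1024 * 1024):
    print(size, "->", human_size(size))
```

    A value just under a unit boundary can still render as "1024.0 KB"; a stricter formatter would promote it to the next unit.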

## COPILOT-5
Location: scripts/build_shmuggingface_site.py
URL: https://github.com/leadforge-dev/leadforge/pull/80#discussion_r3295231131
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    `ensure_smf_core` updates a cache under `/tmp/shmuggingface-core` without pinning a commit or tag: a `git pull` (or fresh `--depth=1` clone) will silently pick up whatever is on `main` upstream. Because the generated site's structure depends on `bin/shmuggingface.mjs`, an unrelated change in ShmuggingFaceCore can quietly alter the preview output or break this script with no local code change. Consider pinning a specific commit/tag (e.g. via a `SMF_CORE_REF` constant and `git -C ... checkout <ref>` after fetch) for reproducibility.

## COPILOT-6
Location: scripts/build_shmuggingface_site.py
URL: https://github.com/leadforge-dev/leadforge/pull/80#discussion_r3295231138
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    The `git pull` invocation uses `check=False`, so a failed update (network glitch, dirty cache, force-push upstream) is silently ignored and the script continues with a stale checkout — there's no warning logged either. Either set `check=True` (consistent with the clone branch on the next lines) or inspect `returncode` and emit a warning so users know the cache wasn't refreshed.
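
    The second option (inspect `returncode` and warn) might look like the sketch below. The helper name run_with_warning and the warning wording are hypothetical.

```python
import subprocess
import sys


def run_with_warning(cmd: list[str], what: str) -> bool:
    """Run cmd; return True on success, warn on stderr and return False otherwise."""
    result = subprocess.run(cmd, check=False, capture_output=True, text=True)
    if result.returncode != 0:
        print(
            f"  WARNING: {what} failed (exit {result.returncode}); "
            f"continuing with the existing checkout.\n{result.stderr.strip()}",
            file=sys.stderr,
        )
        return False
    return True
```

    A call such as `run_with_warning(["git", "-C", str(cache_dir), "pull", "--quiet"], "git pull")` would keep the non-fatal behaviour while making a stale cache visible; `cache_dir` here stands in for the script's cache path.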

## COPILOT-7
Location: scripts/build_shmuggingface_site.py:347
URL: https://github.com/leadforge-dev/leadforge/pull/80#discussion_r3295231144
Root author: copilot-pull-request-reviewer

Comment:
    `_load_cf_env`'s quote stripping uses `val.strip().strip("'\"")`, which strips any combination of leading/trailing single or double quotes rather than matching pairs. A value like `"it's"` would become `it's` correctly, but `'mixed"` would have both ends stripped, and a value containing an `=` sign is preserved (good — `partition` handles that). The bigger issue is that values containing `#` after the value (a trailing inline comment in some `.env` styles) are not handled, but more importantly any inline whitespace inside the value is preserved while surrounding whitespace is stripped — which differs from how a shell would `source` the file. Consider using `shlex.split` or `python-dotenv` if available to get shell-compatible parsing.
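
    A sketch of the `shlex`-based variant, under the assumption that the env file contains only simple `KEY=value` or `export KEY=value` lines. The function name parse_env_text is hypothetical, and unlike `_load_cf_env` it returns only the parsed variables.

```python
import shlex


def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=value lines using shell-style quoting and comments."""
    parsed: dict[str, str] = {}
    for raw_line in text.splitlines():
        # comments=True drops everything from an unquoted '#' to end of line.
        # Caveat: unlike a real shell, shlex also treats a '#' in the middle
        # of an unquoted word as the start of a comment.
        # Unbalanced quotes raise ValueError instead of being passed through.
        tokens = shlex.split(raw_line, comments=True)
        if tokens and tokens[0] == "export":
            tokens = tokens[1:]
        for token in tokens:
            key, sep, value = token.partition("=")
            if sep and key.isidentifier():
                parsed[key] = value
    return parsed
```

    The caller could then merge the result over a copy of the process environment, for example `{**os.environ, **parse_env_text(path.read_text())}`.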

## COPILOT-8
Location: scripts/build_shmuggingface_site.py:435
URL: https://github.com/leadforge-dev/leadforge/pull/80#discussion_r3295231147
Root author: copilot-pull-request-reviewer

Comment:
    Nearby release/preview scripts (`scripts/preview_kaggle_page.py`, `scripts/preview_hf_page.py`, `scripts/validate_release_candidate.py`) all use a testable driver pattern: free-function `parse_args(argv)`, a frozen config dataclass, a `run_*(config) -> Outcome`, and `main(argv) -> int` returning an exit code (with `sys.exit(main())` at the bottom). This script uses `main() -> None` with `argparse` instantiated inline and scatters `sys.exit(...)` calls inside helpers (lines 44, 284, 344, 365, 427). That makes it harder to unit-test (no `argv` injection, exits short-circuit assertions) and breaks consistency with the existing PR 7.2 preview scripts that this PR explicitly cites as siblings. Consider refactoring to match the established pattern so the script becomes test-friendly.
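
    The driver pattern described here could be sketched roughly as follows. SiteConfig, run_build, and the reduced option set are illustrative assumptions, not the sibling scripts' actual code.

```python
import argparse
import sys
from dataclasses import dataclass
from pathlib import Path
from typing import Optional, Sequence


@dataclass(frozen=True)
class SiteConfig:
    """Immutable run configuration (field set is illustrative)."""

    release_dir: Path
    deploy: bool


def parse_args(argv: Optional[Sequence[str]] = None) -> SiteConfig:
    """Parse argv (or sys.argv[1:] when None) into a frozen config."""
    parser = argparse.ArgumentParser(description="Build the preview minisite.")
    parser.add_argument("--release-dir", type=Path, default=Path("release"))
    parser.add_argument("--deploy", action="store_true")
    ns = parser.parse_args(argv)
    return SiteConfig(release_dir=ns.release_dir, deploy=ns.deploy)


def run_build(config: SiteConfig) -> int:
    """Return an exit code instead of calling sys.exit() inside helpers."""
    if not config.release_dir.is_dir():
        print(f"release dir not found: {config.release_dir}", file=sys.stderr)
        return 1
    # Building (and optionally deploying) the site would happen here.
    return 0


def main(argv: Optional[Sequence[str]] = None) -> int:
    return run_build(parse_args(argv))
```

    With this shape a test can call `main(["--release-dir", "some/dir"])` and assert on the returned code, and the script itself would end with `sys.exit(main())` under the usual `__main__` guard.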

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: commit pushed
Workflow run: 26372736212 attempt 1
Comment timestamp: 2026-05-24T21:05:09.251422+00:00
PR head commit: 0a19970cfa940040cc464d3dc7c02ef80352455d

@shaypal5

Contributor Author

Superseded by PR #86 (PR 8.4), which brings this script to main with all review issues fixed:

  • TIER_USABILITY / TIER_MEDAL constants removed (fabricated values)
  • Raises on missing manifest fields instead of silently defaulting
  • Per-tier dataset_card.md as description body (not global README × 3)
  • --branch preview default; --production flag required for live deploys
  • Bare relative link rewriting (LICENSE etc.)
  • Bumped to ShmuggingFaceCore v1.0.2 (upstream implemented all requested fixes)
  • package-lock.json regenerated via HTTPS tarball (no SSH keys needed)
  • 22 smoke tests added

This branch also had merge conflicts with main that would have needed resolving. Closing in favour of the clean PR 86.

@shaypal5 shaypal5 closed this May 27, 2026
@shaypal5 shaypal5 deleted the feat/shmuggingface-preview-site branch June 11, 2026 19:44