SMF-PR5: Add canonical platform metadata lint gate#85

Merged
shaypal5 merged 1 commit into main from codex/pr5-canonical-metadata
May 26, 2026
@shaypal5 shaypal5 commented May 25, 2026

Contributor

Planning notation

SMF-PR5 / PR 8.4a: canonical platform metadata lint gate

Parent milestone: dataset: leadforge-lead-scoring-v1

Plan source:

  • /Users/shaypalachy/agents/environments/opensource/projects/shmuggingface/view_only_clones/ShmuggingFaceCore/docs/next_10_review_prs.md
  • /Users/shaypalachy/agents/handoffs/leadforge-v1-review/leadforge_shmuggingface_integration_issues.md
  • docs/external_review/summaries/v1_release_review_synthesis.md

What changed

Adds scripts/lint_platform_metadata.py, a canonical metadata diff/lint gate over the actual publication artifacts:

  • release/kaggle/dataset-metadata.json
  • release/huggingface/README.md

The lint fails on:

  • Kaggle isPrivate drifting away from false
  • Kaggle/HF license mismatches
  • missing HF tabular-classification task category
  • drift from the exact platform tag vocabulary (Kaggle keywords and HF tags)
  • HF config/split declarations that do not exactly match the canonical train/validation/test layout
  • HF data files absent from Kaggle resources
  • task-split schema drift against the flat CSV schema minus split
  • metadata schema drift against actual CSV/parquet files when bundle files are materialized
  • missing root and per-tier agent-reviewable resources in the canonical Kaggle file list
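
To make the first three checks concrete, here is a minimal sketch. It is not the code in scripts/lint_platform_metadata.py. It assumes the Kaggle metadata and the HF README frontmatter have already been parsed into dicts; the function name `lint_core_fields` and the case-insensitive license comparison are hypothetical choices made for illustration.

```python
from typing import Any


def lint_core_fields(kaggle: dict[str, Any], hf: dict[str, Any]) -> list[str]:
    """Return human-readable problems; an empty list means these checks pass."""
    problems: list[str] = []

    # Privacy: the value must be exactly False, not merely falsy or absent.
    if kaggle.get("isPrivate") is not False:
        problems.append(f"kaggle isPrivate must be false, got {kaggle.get('isPrivate')!r}")

    # License: Kaggle lists licenses as [{"name": ...}], HF uses a single id.
    # Comparing case-insensitively is an assumption of this sketch.
    kaggle_licenses = {str(item.get("name", "")).lower() for item in kaggle.get("licenses", [])}
    hf_license = str(hf.get("license", "")).lower()
    if kaggle_licenses != {hf_license}:
        problems.append(
            f"license mismatch: kaggle={sorted(kaggle_licenses)} hf={hf_license!r}"
        )

    # Task category: the HF card must declare tabular-classification.
    if "tabular-classification" not in hf.get("task_categories", []):
        problems.append("hf task_categories is missing tabular-classification")

    return problems
```

Returning a list of problems rather than raising on the first one lets a single run report every mismatch at once, which suits a CI gate.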

The existing preview renderers already consume the canonical artifacts directly; this PR makes that contract explicit and CI-enforced. It also exposes --strict-files for release-readiness runs where missing tier CSV/parquet files should fail instead of being soft-skipped on fresh checkouts.
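
The soft-skip versus strict behaviour, together with the "flat schema minus split" rule, could look roughly like the sketch below. The helper names `expected_split_columns` and `check_csv_schema` are hypothetical, and the real script may structure this differently; only the idea of a `--strict-files` switch comes from this PR.

```python
import csv
from pathlib import Path


def expected_split_columns(flat_columns: list[str]) -> list[str]:
    """Task-split files carry the flat schema without the 'split' column."""
    return [name for name in flat_columns if name != "split"]


def check_csv_schema(path: Path, expected_columns: list[str], strict: bool) -> list[str]:
    """Compare a CSV header with the declared schema.

    A missing file is a problem only when strict is True; otherwise it is
    skipped, which keeps the lint usable on fresh checkouts.
    """
    if not path.exists():
        return [f"missing bundle file: {path}"] if strict else []

    # Only the header row is needed to detect column drift.
    with path.open(newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), [])

    if header != expected_columns:
        return [f"{path}: columns {header} do not match metadata {expected_columns}"]
    return []
```

In this sketch the strict flag changes only how missing files are treated; a file that exists is always compared against the declared columns.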

Why

HIGH-LF1 / HIGH-I1 said the preview path could miss platform metadata bugs such as privacy, tags, task, license, split, and schema mismatches. This adds a focused failing gate before preview/publish so those bugs are caught without relying on visual review.

Tests

  • python scripts/lint_platform_metadata.py
  • python scripts/sync_release_docs.py --check && python scripts/build_release_metrics.py --check && python scripts/build_claims_register.py --check && python scripts/verify_claims_register.py && python scripts/lint_platform_metadata.py
  • python -m pytest tests/scripts/test_lint_platform_metadata.py -q
  • python -m pytest tests/scripts/test_lint_platform_metadata.py tests/scripts/test_preview_kaggle_page.py tests/scripts/test_preview_hf_page.py -q
  • ruff check scripts/lint_platform_metadata.py tests/scripts/test_lint_platform_metadata.py
  • ruff format --check scripts/lint_platform_metadata.py tests/scripts/test_lint_platform_metadata.py
  • python -m mypy scripts/lint_platform_metadata.py

Note: plain pytest on this machine resolves to /Library/Frameworks/Python.framework/... and lacks repo deps; validation used python -m pytest from the pyenv Python where pandas/pyarrow are installed.

Follow-up

Remaining preview hardening items in .agent-plan.md stay scoped to later work: dependency pin cleanup, any reintroduced ShmuggingFaceCore builder path, deploy defaults, and broader link-rewrite cleanup.

Copilot AI review requested due to automatic review settings May 25, 2026 21:14
@shaypal5 added labels on May 25, 2026:

  • type: test (Test additions or fixes)
  • type: ci (CI/CD pipeline changes)
  • layer: validation (validation/ invariants and checks)
  • layer: render (render/ bundle and artifact output)
  • dataset: leadforge-lead-scoring-v1 (Issue/PR scoped to the leadforge-lead-scoring-v1 dataset release)

@shaypal5 added the status: needs review (Ready for review) label on May 25, 2026
Copilot AI left a comment

Pull request overview

Adds a CI-enforced lint gate that diffs/validates the canonical publication metadata artifacts for Kaggle and Hugging Face, aiming to prevent preview/publish drift (privacy, license, tags/tasks, split paths, and required resource coverage) from landing unnoticed.
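
The resource-coverage part of that contract can be sketched as a set difference. This is an illustrative sketch only: it assumes HF `configs` entries shaped like `{"config_name": ..., "data_files": [{"split": ..., "path": ...}]}` with string paths, and Kaggle `resources` entries that each carry a `path`. The function name and the example paths in the usage note are hypothetical.

```python
from typing import Any


def hf_paths_missing_from_kaggle(
    hf_configs: list[dict[str, Any]],
    kaggle_resources: list[dict[str, Any]],
) -> list[str]:
    """List HF data file paths that the Kaggle resources do not declare."""
    kaggle_paths = {resource.get("path") for resource in kaggle_resources}
    hf_paths = {
        entry["path"]
        for config in hf_configs
        for entry in config.get("data_files", [])
    }
    # Sorted so the lint output is stable across runs.
    return sorted(hf_paths - kaggle_paths)
```

For example, if the HF card declares `intro/train.parquet` and `intro/test.parquet` but the Kaggle file list only contains the first, the function returns `["intro/test.parquet"]`.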

Changes:

  • Introduce scripts/lint_platform_metadata.py to lint Kaggle dataset-metadata.json vs HF README.md frontmatter and enforce a canonical contract.
  • Add focused unit tests covering expected pass/fail cases and asserting committed release artifacts pass the lint.
  • Wire the lint script into the release-artifacts-sync GitHub Actions job.
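
Linting the README frontmatter first requires separating it from the Markdown body. A minimal standard-library sketch of that step follows; the function name `split_frontmatter` is hypothetical and this is not necessarily how the script in this PR does it.

```python
def split_frontmatter(readme_text: str) -> tuple[str, str]:
    """Split a dataset card into (frontmatter, body).

    The frontmatter is the YAML block between the leading '---' line and
    the next '---' line.
    """
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("README does not start with a '---' frontmatter delimiter")

    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            frontmatter = "\n".join(lines[1:index])
            body = "\n".join(lines[index + 1:])
            return frontmatter, body

    raise ValueError("frontmatter is not closed by a second '---' line")
```

The returned frontmatter string would then be handed to a YAML parser (for example PyYAML's `yaml.safe_load`, if that dependency is available) to obtain the fields the lint inspects.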

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| scripts/lint_platform_metadata.py | Implements the canonical Kaggle/HF metadata lint checks and CLI entrypoint. |
| tests/scripts/test_lint_platform_metadata.py | Adds unit tests for the lint gate plus a “committed artifacts pass” test. |
| .github/workflows/ci.yml | Runs the new metadata lint as part of the release artifact sync job. |
| .agent-plan.md | Marks SMF-PR5 planned work item as completed and documents the gate. |

Comment on lines +469 to +475
parser.add_argument(
    "--tier",
    action="append",
    dest="tiers",
    default=None,
    help="tier/config to validate (repeatable; default: intro/intermediate/advanced)",
)
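
For readers unfamiliar with the pattern in this snippet: with `action="append"` and `default=None`, argparse leaves `tiers` as `None` when the flag is never passed, so the caller has to substitute the defaults. The sketch below shows that fallback in isolation. `DEFAULT_TIERS` and `parse_tiers` are hypothetical names; the tier values are taken from the help text.

```python
import argparse

# Tier names copied from the help text; the constant name is an assumption.
DEFAULT_TIERS = ["intro", "intermediate", "advanced"]


def parse_tiers(argv: list[str]) -> list[str]:
    parser = argparse.ArgumentParser()
    parser.add_argument("--tier", action="append", dest="tiers", default=None)
    args = parser.parse_args(argv)

    # tiers is None when --tier was not given; fall back to a copy of the defaults.
    return args.tiers if args.tiers is not None else list(DEFAULT_TIERS)
```

Using `default=None` instead of a list default matters here: argparse appends user values to whatever the default list is, so a list default would silently keep all three tiers even when only one is requested.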
@shaypal5 force-pushed the codex/pr5-canonical-metadata branch from 2035975 to e5e907f on May 25, 2026 21:26
@github-actions

pr-agent-context report:

This run includes an unresolved review comment on PR #85 in repository https://github.com/leadforge-dev/leadforge

For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.

After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, and push all of these changes in a single commit.

# Copilot Comments

## COPILOT-1
Location: scripts/lint_platform_metadata.py:648
URL: https://github.com/leadforge-dev/leadforge/pull/85#discussion_r3299980392
Root author: copilot-pull-request-reviewer

Comment:
    The `--tier` CLI flag is described as selecting which tier/configs to validate, but `_lint_hf_configs()` currently requires the HF frontmatter `configs` list to match `tiers` exactly (same order, no extras). Passing `--tier intro` will fail against the canonical README (which includes 3 configs), so the flag can’t be used as advertised. Either (a) implement subset validation by filtering `configs` to the requested tiers and only checking those, or (b) change/remove the flag/help text to reflect that the README is expected to contain exactly the specified configs.
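
Option (a) from the comment above could be sketched as follows. This is an illustration of subset validation, not the actual `_lint_hf_configs()` implementation, and it does not show which option was finally adopted. It assumes each frontmatter config is a dict with a `config_name` key; the helper name `select_configs` is hypothetical.

```python
from typing import Any


def select_configs(
    configs: list[dict[str, Any]],
    tiers: list[str],
) -> tuple[list[dict[str, Any]], list[str]]:
    """Pick the configs for the requested tiers.

    Returns (selected configs in the requested order, tiers with no config).
    Configs for tiers that were not requested are ignored rather than
    reported as extras.
    """
    by_name = {config.get("config_name"): config for config in configs}
    selected = [by_name[tier] for tier in tiers if tier in by_name]
    missing = [tier for tier in tiers if tier not in by_name]
    return selected, missing
```

With this shape, passing only `intro` against a README that declares three configs yields one selected config and no missing tiers, so the per-config checks run on the subset instead of failing on the two unrequested configs.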

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: commit pushed
Workflow run: 26420400486 attempt 1
Comment timestamp: 2026-05-25T21:26:20.445392+00:00
PR head commit: e5e907fea17609c884da8a9d031034050c8ac2b9

@shaypal5 merged commit 3c44d13 into main on May 26, 2026 (9 of 10 checks passed)
@shaypal5 deleted the codex/pr5-canonical-metadata branch on May 26, 2026 04:26