Skip to content

feat: public Kaggle/HuggingFace dataset release infrastructure#51

Merged
shaypal5 merged 6 commits into
mainfrom
public-release-dataset-card
May 3, 2026
Merged

feat: public Kaggle/HuggingFace dataset release infrastructure#51
shaypal5 merged 6 commits into
mainfrom
public-release-dataset-card

Conversation

@shaypal5

@shaypal5 shaypal5 commented May 3, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Dataset card improvement: render_dataset_card() now renders actual table inventory (row counts) and feature categories (from LEAD_SNAPSHOT_FEATURES) instead of placeholder stubs. write_bundle() passes table counts to the renderer.
  • Build script: scripts/build_public_release.py generates four bundles (3 difficulty tiers × student_public + 1 intermediate × research_instructor), validates each, and creates flat CSV convenience exports with a split column.
  • Platform documentation: release/README.md (landing page with quick-start snippets, directory structure, dataset summary) and release/HF_DATASET_CARD.md (YAML frontmatter with HuggingFace dataset configs for each difficulty tier).
  • Baseline notebook: release/notebooks/01_baseline_lead_scoring.ipynb — LR + GBM baselines with AUC, PR-AUC, Precision@K, value-aware ranking, and feature importance. Works from pre-generated Parquet files without requiring leadforge.
  • Leakage fix: current_stage at the 90-day horizon contains terminal stages (closed_won/closed_lost) that perfectly encode the label. Flat CSV exports drop it; notebook excludes it; README documents it.

Test plan

  • All 851 existing tests pass
  • 4 new dataset card tests (table inventory with/without counts, feature categories, leakage flags)
  • Build script generates all 4 bundles and all pass leadforge validate
  • Notebook modeling logic verified against generated data (LR AUC=0.895, GBM AUC=0.877 on intermediate)
  • Flat CSV confirmed to exclude current_stage
  • Run build script on clean checkout to verify determinism

🤖 Generated with Claude Code

shaypal5 and others added 5 commits May 3, 2026 23:39
Replace stub sections in render_dataset_card() with actual content:
- Table inventory renders row counts when table_counts dict is passed
- Feature categories always rendered from LEAD_SNAPSHOT_FEATURES with
  category counts, example columns, and leakage-flagged column callout
- write_bundle() now passes table_row_counts to the card renderer

Prepares dataset cards for public Kaggle/HuggingFace release.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
scripts/build_public_release.py generates four bundles:
- intro, intermediate, advanced (student_public)
- intermediate_instructor (research_instructor)

Each student_public bundle gets a flat CSV convenience export
(lead_scoring.csv) merging train/valid/test with a split column.
All bundles are validated after generation. Prints a summary table
with row counts and conversion rates.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ease

- release/README.md: landing page with directory structure, quick-start
  snippets (flat CSV, Parquet, relational, reproduce from source),
  dataset summary, feature dictionary overview, and provenance info
- release/HF_DATASET_CARD.md: HuggingFace-format card with YAML
  frontmatter defining three dataset configs (intro/intermediate/advanced)
  mapped to their respective Parquet task splits

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
current_stage at the 90-day horizon contains terminal stages
(closed_won/closed_lost) that perfectly encode the label. The build
script now drops it from the flat CSV convenience export, the baseline
notebook excludes it from modeling features, and the README documents
the issue.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add public Kaggle/HuggingFace release section covering phases 1-5.
Document the current_stage leakage issue at 90-day horizon.
Restructure to show teaching datasets as previous focus.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@shaypal5 shaypal5 added type: feature New capability type: docs Documentation or narrative changes layer: render render/ bundle and artifact output labels May 3, 2026
Copilot AI review requested due to automatic review settings May 3, 2026 20:59
@shaypal5 shaypal5 added type: docs Documentation or narrative changes layer: render render/ bundle and artifact output labels May 3, 2026
@github-actions

This comment has been minimized.

Issues found and fixed:

1. current_stage not flagged as leakage: marked leakage_risk=True in
   LEAD_SNAPSHOT_FEATURES with warning about terminal stages at full
   horizon. Now auto-detected by notebook and feature dictionary.

2. Function-body import in dataset_card.py: moved LEAD_SNAPSHOT_FEATURES
   import to module level.

3. table_counts truthiness bug: changed `if table_counts:` to
   `if table_counts is not None:` so empty dicts render the table header
   instead of the placeholder. Added test.

4. README corrupted Unicode: rewrote directory tree with clean ASCII.

5. README wrong feature count: corrected "35 features" to
   "34 features + 1 target" throughout.

6. README false advertising on difficulty: added Known Limitations section
   disclosing that difficulty tiers share the same conversion rate until
   the engine implements difficulty modulation.

7. Build script CSV re-read: compute conversion rate from train Parquet
   column read instead of re-reading the entire CSV.

8. Missing gitignore for release/ generated artifacts: added entries for
   intro/, intermediate/, advanced/, intermediate_instructor/, LICENSE.

9. Notebook hardcoded current_stage exclusion: removed manual exclusion
   now that current_stage is leakage_risk=True in the feature spec and
   auto-detected via feature dictionary.

10. HF card validation split: added note that split name is "validation"
    while the file on disk is valid.parquet.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@shaypal5

shaypal5 commented May 3, 2026

Copy link
Copy Markdown
Contributor Author

Self-review: 10 issues found and fixed

Pushed commit 9dab09c addressing all issues from a critical self-review:

# Issue Fix
1 current_stage ships as leaky column with only a prose warning Marked leakage_risk=True in LEAD_SNAPSHOT_FEATURES; now auto-flagged in feature dictionary and auto-excluded by notebook
2 Function-body import in dataset_card.py Moved LEAD_SNAPSHOT_FEATURES import to module level
3 if table_counts: falsely handles empty dicts Changed to if table_counts is not None:; added test
4 README directory tree has corrupted Unicode Rewrote with clean ASCII
5 README claims "35 features" (actually 34 + 1 target) Corrected to "34 features + 1 target" throughout
6 README claims different difficulty tiers without disclosing engine doesn't modulate yet Added "Known limitations" section
7 Build script re-reads entire CSV just for conversion rate Reads single column from train Parquet instead
8 No .gitignore for generated release artifacts Added entries for release/{intro,intermediate,advanced,intermediate_instructor}/
9 Notebook hardcodes current_stage exclusion instead of using feature dictionary Removed hardcode; now auto-detected via leakage_risk flag
10 HF card doesn't show validation split or explain naming Added valid to quick-start snippet with naming note

@github-actions

github-actions Bot commented May 3, 2026

Copy link
Copy Markdown

pr-agent-context report:

No unresolved review comments, failing checks, or actionable patch coverage gaps were found on PR #51 in repository https://github.com/leadforge-dev/leadforge. Treat this PR as all clear unless new signals appear.

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: commit pushed
Workflow run: 25290854962 attempt 1
Comment timestamp: 2026-05-03T21:08:44.124420+00:00
PR head commit: 9dab09c93ba8da99c2021479db28f3741c0d8c17

@shaypal5 shaypal5 merged commit dfd5fdf into main May 3, 2026
8 checks passed
@shaypal5 shaypal5 deleted the public-release-dataset-card branch May 3, 2026 21:11
@shaypal5 shaypal5 removed the request for review from Copilot May 3, 2026 21:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

layer: render render/ bundle and artifact output type: docs Documentation or narrative changes type: feature New capability

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant