feat: public Kaggle/HuggingFace dataset release infrastructure#51
Merged
Conversation
Replace stub sections in render_dataset_card() with actual content: - Table inventory renders row counts when table_counts dict is passed - Feature categories always rendered from LEAD_SNAPSHOT_FEATURES with category counts, example columns, and leakage-flagged column callout - write_bundle() now passes table_row_counts to the card renderer Prepares dataset cards for public Kaggle/HuggingFace release. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
scripts/build_public_release.py generates four bundles: - intro, intermediate, advanced (student_public) - intermediate_instructor (research_instructor) Each student_public bundle gets a flat CSV convenience export (lead_scoring.csv) merging train/valid/test with a split column. All bundles are validated after generation. Prints a summary table with row counts and conversion rates. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ease - release/README.md: landing page with directory structure, quick-start snippets (flat CSV, Parquet, relational, reproduce from source), dataset summary, feature dictionary overview, and provenance info - release/HF_DATASET_CARD.md: HuggingFace-format card with YAML frontmatter defining three dataset configs (intro/intermediate/advanced) mapped to their respective Parquet task splits Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
current_stage at the 90-day horizon contains terminal stages (closed_won/closed_lost) that perfectly encode the label. The build script now drops it from the flat CSV convenience export, the baseline notebook excludes it from modeling features, and the README documents the issue. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add public Kaggle/HuggingFace release section covering phases 1-5. Document the current_stage leakage issue at 90-day horizon. Restructure to show teaching datasets as previous focus. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
Issues found and fixed:
1. current_stage not flagged as leakage: marked leakage_risk=True in
LEAD_SNAPSHOT_FEATURES with warning about terminal stages at full
horizon. Now auto-detected by notebook and feature dictionary.
2. Function-body import in dataset_card.py: moved LEAD_SNAPSHOT_FEATURES
import to module level.
3. table_counts truthiness bug: changed `if table_counts:` to
`if table_counts is not None:` so empty dicts render the table header
instead of the placeholder. Added test.
4. README corrupted Unicode: rewrote directory tree with clean ASCII.
5. README wrong feature count: corrected "35 features" to
"34 features + 1 target" throughout.
6. README false advertising on difficulty: added Known Limitations section
disclosing that difficulty tiers share the same conversion rate until
the engine implements difficulty modulation.
7. Build script CSV re-read: compute conversion rate from train Parquet
column read instead of re-reading the entire CSV.
8. Missing gitignore for release/ generated artifacts: added entries for
intro/, intermediate/, advanced/, intermediate_instructor/, LICENSE.
9. Notebook hardcoded current_stage exclusion: removed manual exclusion
now that current_stage is leakage_risk=True in the feature spec and
auto-detected via feature dictionary.
10. HF card validation split: added note that split name is "validation"
while the file on disk is valid.parquet.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor
Author
Self-review: 10 issues found and fixedPushed commit
|
|
pr-agent-context report: No unresolved review comments, failing checks, or actionable patch coverage gaps were found on PR #51 in repository https://github.com/leadforge-dev/leadforge. Treat this PR as all clear unless new signals appear.Run metadata: |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
render_dataset_card()now renders actual table inventory (row counts) and feature categories (fromLEAD_SNAPSHOT_FEATURES) instead of placeholder stubs.write_bundle()passes table counts to the renderer.scripts/build_public_release.pygenerates four bundles (3 difficulty tiers × student_public + 1 intermediate × research_instructor), validates each, and creates flat CSV convenience exports with asplitcolumn.release/README.md(landing page with quick-start snippets, directory structure, dataset summary) andrelease/HF_DATASET_CARD.md(YAML frontmatter with HuggingFace dataset configs for each difficulty tier).release/notebooks/01_baseline_lead_scoring.ipynb— LR + GBM baselines with AUC, PR-AUC, Precision@K, value-aware ranking, and feature importance. Works from pre-generated Parquet files without requiring leadforge.current_stageat the 90-day horizon contains terminal stages (closed_won/closed_lost) that perfectly encode the label. Flat CSV exports drop it; notebook excludes it; README documents it.Test plan
leadforge validatecurrent_stage🤖 Generated with Claude Code