fix(narrative): rewrite dataset_card for zero-prior-knowledge readers#88

Merged
shaypal5 merged 2 commits into main from fix/dataset-card-rewrite on May 28, 2026
@shaypal5

Problem

The three tier dataset cards opened with a raw metadata table (Recipe, Exposure mode, Seed, Difficulty, Horizon...) with no preamble, followed by a 'Narrative summary' section that read like a real company prospectus without ever saying 'this is synthetic data'. Anyone browsing the ShmuggingFace or Kaggle preview with no prior leadforge knowledge would have no idea what they were looking at.

What changed

Generator (leadforge/narrative/dataset_card.py)

Complete redesign of render_dataset_card():

| Before | After |
| --- | --- |
| Opens with raw metadata table (Recipe, Exposure mode...) | Opens with 'This is a synthetic dataset for practicing B2B lead scoring, generated by leadforge...' |
| 'Narrative summary' section that reads like a real company | 'The simulated world' section, explicitly labelled fictional |
| No explanation of the prediction task | 'What you are predicting' paragraph + blockquote with label definition |
| No tier context | Per-tier callout with signal/noise/AUC/AP numbers + human-readable tier description |
| No code snippet | 'How to load' section with flat CSV + Parquet splits + relational tables |
| Metadata (Recipe, Seed, Package version) at the top | 'Reproducibility' section at the bottom with leadforge generate command |
| Suggested use cases (plain bullets) | 'Intended uses' section |
| Table inventory (counts only) | Table inventory with per-row descriptions |
| Persona role keys only (vp_finance) | Human title + role key ('VP Finance / vp_finance') |

Static release cards

release/{intro,intermediate,advanced}/dataset_card.md and their HF/Kaggle copies updated on disk (gitignored generated artifacts — not tracked). ShmuggingFace site rebuilt and redeployed to Cloudflare Pages.

Tier differences are now explicit:

  • intro: ~43% conversion, signal 0.90, LR AUC 0.671 — 'easiest; prototype your pipeline here'
  • intermediate: ~22% conversion, signal 0.70, LR AUC 0.662 — 'default benchmark; calibration matters'
  • advanced: ~8% conversion, signal 0.50, LR AUC 0.624 — 'rare-event / calibration exercise'

Tests (tests/narrative/test_dataset_card.py)

Updated 4 assertions that were checking for exact legacy Markdown strings:

  • test_card_contains_use_cases: accept 'intended' OR 'use cases'
  • test_card_feature_categories_rendered: case-insensitive category name check
  • test_card_leakage_flagged_columns: accept 'leakage' anywhere in card
  • test_card_with_narrative_contains_personas: added clarifying comment (assertion unchanged — role keys still appear)

All 1482 tests pass.
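
The relaxed assertions described above can be sketched as small substring checks that ignore case and exact Markdown formatting. This is an illustrative sketch only: the helper names and the sample card text are hypothetical and are not taken from tests/narrative/test_dataset_card.py.

```python
def card_mentions_use_cases(card: str) -> bool:
    """Accept either the new 'Intended uses' heading or the legacy 'use cases' wording."""
    lowered = card.lower()
    return "intended" in lowered or "use cases" in lowered


def card_mentions_leakage(card: str) -> bool:
    """Accept 'leakage' anywhere in the card, in any casing."""
    return "leakage" in card.lower()


def card_lists_categories(card: str, categories: list[str]) -> bool:
    """Case-insensitive check that every feature category name appears in the card."""
    lowered = card.lower()
    return all(category.lower() in lowered for category in categories)


# Hypothetical card fragment, used only to exercise the helpers.
sample_card = (
    "## Intended uses\n"
    "| Category | Count |\n"
    "| Firmographic | 12 |\n"
    "**Leakage-flagged columns:** none\n"
)
```

Checks of this kind survive heading renames and casing changes, at the cost of being less strict about the exact layout of the card.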

Preview

Live at leadforge-lead-scoring-v1-preview.pages.dev

🤖 Generated with Claude Code

…ge readers

Redesign render_dataset_card() so the generated card is immediately
useful to a data scientist with no prior leadforge knowledge:

- Open with plain-English 'what is this / what you are predicting'
  paragraph before any metadata tables
- Per-tier callout block (conversion rate, signal/noise knobs, AUC,
  AP, P@100) with a tier-specific description explaining when to use
  each tier
- 'The simulated world' section (clearly labelled fictional) replaces
  the jargon-heavy 'Narrative summary'
- 'How to load' Python snippet (flat CSV + Parquet splits + relational
  tables) added as a dedicated section
- 'Reproducibility' section with generate command moves metadata
  (recipe, seed, package version) to the bottom instead of the top
- 'Intended uses' section (was 'Suggested use cases') restored
- Table inventory gains one-line descriptions per table
- Feature category table keeps 'Count' header; leakage-flagged text
  keeps 'Leakage-flagged columns:' anchor for test compatibility
- Persona rendering includes human title alongside role key

Static release cards (release/{intro,intermediate,advanced}/dataset_card.md
and their HF/Kaggle copies) updated on disk (gitignored generated
artifacts). ShmuggingFace site rebuilt and redeployed to Cloudflare
Pages (leadforge-lead-scoring-v1-preview.pages.dev).

Tests: update test assertions that checked for exact Markdown
formatting strings that changed:
- test_card_contains_use_cases: accept 'intended' as well as 'cases'
- test_card_feature_categories_rendered: case-insensitive category check
- test_card_leakage_flagged_columns: accept 'leakage' in any case
- test_card_with_narrative_contains_personas: doc comment clarification

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 27, 2026 20:34
@shaypal5 added the labels type: bugfix (Fixes a bug) and layer: narrative (narrative/ vertical story layer) on May 27, 2026
Copilot AI left a comment

Pull request overview

This PR rewrites the generated dataset card to be understandable to readers with no prior leadforge context, making the synthetic nature, prediction task, and tier meaning explicit before diving into technical details.

Changes:

  • Redesign render_dataset_card() structure and copy to lead with “synthetic dataset” framing, task definition, tier callout, loading instructions, reproducibility details, and clearer world narrative.
  • Expand table inventory and feature sections (descriptions, leakage explanation, and clearer category labels).
  • Relax/update a few dataset-card tests to be robust to the new Markdown phrasing and casing.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| leadforge/narrative/dataset_card.py | Major rewrite of dataset card rendering (new sections, tier callout, table inventory descriptions, loading + reproducibility guidance). |
| tests/narrative/test_dataset_card.py | Updates assertions to tolerate the new card structure/phrasing while preserving key invariants. |

Comment on lines +150 to +152
f"| Signal strength | {cfg.signal_strength} / 1.0 |"
if hasattr(cfg, "signal_strength")
else "| Signal strength | see difficulty_profiles.yaml |",
Comment on lines +305 to +309
"# Flat CSV — all leads, all splits combined (convenient for exploration)",
'df = pd.read_csv("lead_scoring.csv")',
f'X = df.drop(columns=["{cfg.primary_task}"])',
f'y = df["{cfg.primary_task}"]',
"",
Comment on lines +323 to +326
"**Note on account overlap:** ~93% of test-set accounts also appear in the "
"training set (splits are keyed on `lead_id`). Headline AUC overstates "
"generalisation to *unseen* accounts. For a faithful out-of-sample estimate, "
'use `GroupKFold(groups=df["account_id"])`.',
Comment on lines +344 to +345
f"leadforge generate --recipe {cfg.recipe_id} --seed {cfg.seed} \\",
f" --mode student_public --difficulty {difficulty} --out my_bundle",
…l artifacts

- HuggingFace public README: add authors: [shaypal5] to YAML frontmatter
- HuggingFace instructor README: same
- Kaggle dataset-metadata.json: add derelictpanda as collaborator (role: writer)
- dataset_card.py generator: append '**Author:** Shay Palachy Affek' line
  with HF, Kaggle, and GitHub links to the Reproducibility section
- All 9 static dataset_card.md files updated on disk; site rebuilt and
  redeployed to leadforge-lead-scoring-v1-preview.pages.dev

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions

pr-agent-context report:

This run includes unresolved review comments on PR #88 in repository https://github.com/leadforge-dev/leadforge

For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.

After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, and push all of these changes in a single commit.

# Copilot Comments

## COPILOT-1
Location: leadforge/narrative/dataset_card.py:152
URL: https://github.com/leadforge-dev/leadforge/pull/88#discussion_r3313706799
Root author: copilot-pull-request-reviewer

Comment:
    The tier callout tries to read `cfg.signal_strength`, but `GenerationConfig` does not have that attribute (difficulty parameters live under `cfg.difficulty_params`). As written, this will always fall back to the YAML placeholder even when difficulty params are available, so the card won’t show the actual signal strength for the generated bundle.
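
One way to address this comment is to look the value up under `cfg.difficulty_params` and keep the YAML placeholder only as a last resort. The sketch below is a guess at the shape of the fix, not the merged code: the `GenerationConfig` stand-in, the `signal_strength_row` helper, and the assumption that `difficulty_params` is either a mapping or an object with a `signal_strength` attribute are all hypothetical.

```python
from collections.abc import Mapping
from dataclasses import dataclass, field
from typing import Any


@dataclass
class GenerationConfig:
    """Hypothetical stand-in for leadforge's config; only the field used here."""

    difficulty_params: Any = field(default_factory=dict)


def signal_strength_row(cfg: Any) -> str:
    """Render the 'Signal strength' table row, falling back to a placeholder."""
    params = getattr(cfg, "difficulty_params", None) or {}
    # Assumption: params may be mapping-style or attribute-style, so support both.
    if isinstance(params, Mapping):
        value = params.get("signal_strength")
    else:
        value = getattr(params, "signal_strength", None)
    if value is None:
        return "| Signal strength | see difficulty_profiles.yaml |"
    return f"| Signal strength | {value} / 1.0 |"
```

With this shape, the placeholder appears only when the bundle really carries no difficulty parameters.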

## COPILOT-2
Location: leadforge/narrative/dataset_card.py:309
URL: https://github.com/leadforge-dev/leadforge/pull/88#discussion_r3313706855
Root author: copilot-pull-request-reviewer

Comment:
    The “How to load” snippet treats `cfg.primary_task` as the label column name (`df[primary_task]` / `drop(columns=[primary_task])`), but the task ID and label column name can differ (the task directory can change while the label column remains `converted_within_90_days` / `task_manifest.label_column`). This will break the example for non-default tasks.
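
A possible fix is to pass the label column into the snippet renderer explicitly instead of reusing the task ID. The helper below is a hypothetical sketch: `render_load_snippet` is not a leadforge function, and the caller is assumed to obtain the label from something like `task_manifest.label_column`.

```python
def render_load_snippet(label_column: str, csv_name: str = "lead_scoring.csv") -> list[str]:
    """Build the lines of a 'How to load' example keyed on the label column name.

    Assumption: the caller supplies the real label column rather than the
    task ID, since the two can differ.
    """
    return [
        "import pandas as pd",
        "",
        "# Flat CSV: all leads, all splits combined (convenient for exploration)",
        f'df = pd.read_csv("{csv_name}")',
        f'X = df.drop(columns=["{label_column}"])',
        f'y = df["{label_column}"]',
    ]


snippet_lines = render_load_snippet("converted_within_90_days")
```

The rendered example then stays valid even when the task directory is renamed, because it no longer depends on the task ID.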

## COPILOT-3
Location: leadforge/narrative/dataset_card.py:326
URL: https://github.com/leadforge-dev/leadforge/pull/88#discussion_r3313706891
Root author: copilot-pull-request-reviewer

Comment:
    The `GroupKFold` example is not valid scikit-learn API (`GroupKFold` doesn’t accept a `groups=` argument in the constructor). This will mislead readers; consider showing `GroupKFold(n_splits=...)` and passing `groups=` to `split(...)` / `cross_val_score(...)`, or make this a prose note without code-like syntax.
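
For reference, the valid scikit-learn pattern constructs `GroupKFold(n_splits=...)` and passes `groups=` to `split(...)` or `cross_val_score(...)`. The sketch below uses randomly generated arrays as a stand-in for the real dataset; the feature matrix, labels, and `account_id` array are synthetic assumptions, not leadforge data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
n_leads = 400

# Synthetic stand-ins for df["account_id"], the feature matrix, and the label.
account_id = rng.integers(0, 80, size=n_leads)
X = rng.normal(size=(n_leads, 5))
y = (X[:, 0] + rng.normal(size=n_leads) > 0).astype(int)

# n_splits goes to the constructor; groups goes to the scoring/splitting call.
cv = GroupKFold(n_splits=5)
scores = cross_val_score(
    LogisticRegression(), X, y, groups=account_id, cv=cv, scoring="roc_auc"
)

# For each fold, collect accounts that appear on both sides of the split.
leaked = [
    set(account_id[train_idx]) & set(account_id[test_idx])
    for train_idx, test_idx in cv.split(X, y, groups=account_id)
]
```

Every entry in `leaked` should be empty, which is the property the account-overlap note is asking readers to enforce.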

## COPILOT-4
Location: leadforge/narrative/dataset_card.py:345
URL: https://github.com/leadforge-dev/leadforge/pull/88#discussion_r3313706930
Root author: copilot-pull-request-reviewer

Comment:
    The reproducibility command hard-codes `--mode student_public` instead of using the bundle’s actual `cfg.exposure_mode`. If someone renders a card for `research_instructor`, the command will be incorrect.
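
A fix along the lines suggested here would interpolate the bundle's exposure mode into the command rather than hard-coding it. The sketch below is an assumption about how that could look: `render_generate_command` is a hypothetical helper, and the recipe ID in the usage line is a placeholder. Only the CLI flags themselves come from the snippet under review.

```python
import shlex


def render_generate_command(
    recipe_id: str,
    seed: int,
    exposure_mode: str,
    difficulty: str,
    out_dir: str = "my_bundle",
) -> str:
    """Build the two-line reproducibility command from the bundle's own settings."""
    first = f"leadforge generate --recipe {shlex.quote(recipe_id)} --seed {seed} \\"
    second = (
        f"  --mode {shlex.quote(exposure_mode)}"
        f" --difficulty {shlex.quote(difficulty)}"
        f" --out {shlex.quote(out_dir)}"
    )
    return "\n".join([first, second])


# Placeholder recipe ID; the mode is taken from the config instead of being fixed.
command = render_generate_command("example_recipe", 42, "research_instructor", "advanced")
```

Quoting each value with `shlex.quote` is an extra precaution so the rendered command stays copy-pasteable if a value ever contains spaces.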

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: commit pushed
Workflow run: 26563872419 attempt 1
Comment timestamp: 2026-05-28T08:31:53.908932+00:00
PR head commit: d24d7e1e6e76e7da0d16758aceb175985e4bcd50

@shaypal5 merged commit a34c9f2 into main on May 28, 2026
10 checks passed
@shaypal5 deleted the fix/dataset-card-rewrite branch on May 28, 2026 20:12
Labels

layer: narrative (narrative/ vertical story layer)
type: bugfix (Fixes a bug)

2 participants