fix(release): change Kaggle owner slug to derelictpanda#98
Merged
Conversation
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Updates the Kaggle release packaging defaults and Kaggle dataset metadata to publish under a different Kaggle account and refresh the dataset description/credits.
Changes:
- Switch Kaggle username default used by the packaging script.
- Update
dataset-metadata.jsonownership (id) and rewrite the long-form description (including credits). - Remove the explicit collaborator entry from Kaggle metadata.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| scripts/package_kaggle_release.py | Changes the default Kaggle user slug used during release packaging. |
| release/kaggle/dataset-metadata.json | Updates Kaggle dataset owner/id and revises the dataset description/credits metadata. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # --------------------------------------------------------------------------- | ||
|
|
||
| DEFAULT_USER_SLUG: Final[str] = "leadforge" | ||
| DEFAULT_USER_SLUG: Final[str] = "derelictpanda" |
| "description": "# LeadForge: Synthetic B2B Lead Scoring Dataset (`leadforge-lead-scoring-v1`)\n\nA relational, reproducible, three-tier synthetic CRM dataset family for\nteaching lead scoring at scale. Created by\n[Shay Palachy Affek](https://www.shaypalachy.com/) and generated by\n[leadforge](https://github.com/leadforge-dev/leadforge), an\nopen-source Python framework for synthetic CRM/funnel data. The\nframework version is decoupled from the dataset version: the package\nstays at `1.x`; the dataset is published under the explicit `…-v1`\ntag.\n\n## Why lead scoring matters in 2024–2026\n\nMid-market SaaS vendors entered 2024–2026 with growth slowing and\ncustomer-acquisition costs rising (median public-SaaS growth 30%→25%\nfrom 2023 to 2025; New CAC Ratio rose materially in 2024), so\npredicting *which* leads convert within a fixed window has moved from\na marketing nicety to a survival skill. This dataset teaches that\nskill on a relational substrate, with the realistic confusions\n(snapshot-window discipline, leakage traps, channel signal weaker than\nvendor blogs imply) that students will hit when they finally get hands\non real CRM data.\n\n## What's inside\n\n```\n.\n├── intro/ intermediate/ advanced/ # student_public bundles, one per difficulty tier\n│ ├── manifest.json # provenance + file hashes\n│ ├── metrics.json # per-tier headline metrics (medians + spreads)\n│ ├── dataset_card.md # auto-rendered per-bundle card\n│ ├── feature_dictionary.csv # authoritative column spec\n│ ├── lead_scoring.csv # flat convenience CSV (all splits)\n│ ├── tables/*.parquet # 7 snapshot-safe relational tables\n│ └── tasks/converted_within_90_days/{train,valid,test}.parquet\n├── docs/ # vendored DGP / leakage / break-me docs (agent-readable)\n├── metrics.json # top-level cross-tier metrics summary\n├── claims_register.{md,json} # claims → backing-artifact map (agent-readable)\n├── dataset-metadata.json # Kaggle dataset metadata\n├── dataset-cover-image.png # Kaggle cover image\n├── README.md # Kaggle package README\n└── LICENSE\n```\n\n`student_public` bundles ship the snapshot-safe relational view;\n`research_instructor` companions ship the full-horizon view plus the\nhidden causal structure (DAG, latent registry, mechanism summary)\nunder `metadata/`. The full layout is documented in each bundle's\n`manifest.json`.\n\n### Agent-reviewable artifacts\n\nThe published bundle is self-contained for AI review and offline\nauditing — every numeric / structural claim on this page can be\nverified without following an external link:\n\n- **`metrics.json` (root) + `<tier>/metrics.json`** — deterministic\n JSON view of the headline LR AUC / AP / P@100 / Brier / conversion\n rate / cohort-shift / cross-tier-ordering medians, with JSON-path\n back-references to `validation/validation_report.json` (the\n source of truth).\n- **`claims_register.{md,json}`** — every numerical or structural\n claim on this page paired with the artifact and path that backs it.\n Rendered from `claims_register_source.yaml` by\n `scripts/build_claims_register.py`.\n- **`docs/`** — vendored copies of `generation_method.md`,\n `channel_signal_audit.md`, `break_me_guide.md`,\n `feature_dictionary.md`, `v1_acceptance_gates_bands.yaml`,\n `v2_decision_log.md`, plus a hand-authored\n `relational_table_schemas.csv` documenting every column of every\n relational table. These match the GitHub-blob links cited below but\n ship inside the bundle so a reviewer never needs network access.\n- **`<tier>/manifest.json`** — SHA-256 hash for every file plus the\n full redaction contract (`structural_redactions.columns`,\n `omitted_tables`, `relational_snapshot_safe`, `snapshot_day`).\n- Kaggle / HuggingFace preview pages additionally inject a\n `schema.org/Dataset` JSON-LD block in their `<head>` for agent\n ingestion without HTML parsing.\n\n## Quick start\n\n```python\n# Flat CSV\ndf = pd.read_csv(\"intermediate/lead_scoring.csv\")\n\n# Parquet task splits (recommended)\ntrain = pd.read_parquet(\"intermediate/tasks/converted_within_90_days/train.parquet\")\ntest = pd.read_parquet(\"intermediate/tasks/converted_within_90_days/test.parquet\")\n\n# Relational tables (feature engineering — example)\nleads = pd.read_parquet(\"intermediate/tables/leads.parquet\")\ntouches = pd.read_parquet(\"intermediate/tables/touches.parquet\")\nmy_touch_count = (\n touches.groupby(\"lead_id\").size().rename(\"my_touch_count\").reset_index()\n)\nfeatures = leads.merge(my_touch_count, on=\"lead_id\", how=\"left\")\n\n# Reproduce from source\n# pip install leadforge\n# leadforge generate --recipe b2b_saas_procurement_v1 --seed 42 \\\n# --mode student_public --difficulty intermediate --out my_bundle\n```\n\nThe label `converted_within_90_days` resolves over a 90-day window;\nengagement features (`touch_count`, `session_count`, etc.) are\ncomputed strictly over events on days `[0, 30]`. The deliberate\nexception is `total_touches_all`, the leakage trap — flagged\n`leakage_risk=True` in `feature_dictionary.csv`. Drop it from your\nfeature set unless you're demonstrating leakage detection.\n\n## Evaluation note — account and contact overlap\n\n**518 of 557 test accounts (≈93 %) appear in train** on the intermediate\nbundle; the other tiers are similar. Contact-level overlap is comparable\nin magnitude: most test contacts also have activity in the training set.\nThe random-split headline metrics therefore ride both account-level and\ncontact-level signal across the split boundary and over-estimate\ngeneralisation to unseen accounts and contacts. For a faithful\nout-of-sample number, retrain with `GroupKFold(account_id)` and report\nboth metrics. Notebook 02 demonstrates the detection recipe;\n[`break_me_guide.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md) §5 gives\nthe worked example.\n\n## Dataset summary\n\n**Tiers are prevalence and noise axes, not modelling-complexity axes.**\nLR AUC is ~0.88 in every tier by design. The tiers differ in conversion\nrate, missingness, and noise — not rank discrimination. Choose a tier\nbased on the teaching exercise, not on expected AUC:\n\n| | Intro | Intermediate | Advanced |\n|---|---|---|---|\n| **Tier purpose** | High-prevalence warm-up | Default benchmark | Low-prevalence · calibration · noise exercise |\n| Leads | 5,000 | 5,000 | 5,000 |\n| Accounts | 1,500 | 1,500 | 1,500 |\n| Contacts | 4,200 | 4,200 | 4,200 |\n| Snapshot columns | 31 / 34* | 31 / 34* | 31 / 34* |\n| Target | `converted_within_90_days` | `converted_within_90_days` | `converted_within_90_days` |\n| Conversion rate (acceptance band, gate G7.\\*) | 24–61% | 12–31% | 4–12% |\n| Conversion rate (observed median, seeds 42–46) | 42.67% | 21.60% | 8.40% |\n| Signal strength | 0.90 | 0.70 | 0.50 |\n| Noise scale | 0.10 | 0.30 | 0.55 |\n| Missing rate | 2% | 8% | 18% |\n\n\\* `student_public` / `research_instructor`. Difficulty is modulated\nby the simulation engine — signal strength on latent-trait weights,\nGaussian noise on float features, MCAR missingness, outlier rate —\nnot post-hoc label flipping. The acceptance band is the recipe\ngate's tolerance window (`v1_acceptance_gates_bands.yaml` G7.\\*),\nnot the achievable range — observed five-seed spreads sit\ncomfortably inside the band.\n\n## The scenario\n\n**Veridian Technologies** is a fictional Series B startup (Austin, US)\nselling **Veridian Procure**, a procurement / AP automation SaaS, to\nmid-market firms (200–2,000 employees) in the US and UK. The funnel\nruns through inbound marketing (45%), SDR outbound (35%), and\npartner referrals (20%); four personas drive deals (VP Finance, AP\nManager, IT Director, Procurement Manager). **Task:** predict whether\na lead converts (`closed_won`) within 90 days. ACV bands are\n$18k–$120k. See\n[`docs/release/generation_method.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/generation_method.md)\nfor the full DGP, and the deeper \"what's modelled / approximate / not\nmodelled\" breakdown that this README only summarises.\n\n## Public vs instructor: what's redacted\n\nFiltering happens **during rendering**, not during simulation. The\nredaction contract is single-sourced in\n[`leadforge/validation/leakage_probes.py`](https://github.com/leadforge-dev/leadforge/blob/main/leadforge/validation/leakage_probes.py);\nthe snapshot-safe writer and the validator import the same constants,\nso they cannot drift apart.\n\n| Source-of-truth constant | Public bundle treatment |\n|---|---|\n| `BANNED_LEAD_COLUMNS = (\"converted_within_90_days\", \"conversion_timestamp\")` | Dropped from `tables/leads.parquet` |\n| `BANNED_OPP_COLUMNS = (\"close_outcome\", \"closed_at\")` | Dropped from `tables/opportunities.parquet` |\n| `BANNED_TABLES = (\"customers\", \"subscriptions\")` | Omitted from public bundles |\n| `SNAPSHOT_FILTERED_TABLES` (touches, sessions, sales_activities, opportunities) | Filtered per-lead by `lead_created_at + snapshot_day` |\n| Snapshot redaction (`current_stage`, `is_sql`) | Stripped from `tasks/` splits and `tables/leads.parquet` |\n| `total_touches_all` (deliberate trap) | **Retained in both modes**; flagged `leakage_risk=True` |\n\nEach bundle's `manifest.json` records `relational_snapshot_safe`,\n`redacted_columns`, and `snapshot_day`, so the bundle is\nself-describing.\n\n## Calibration\n\nEvery realism / calibration / difficulty claim in this README is\nbacked by\n[`validation/validation_report.md`](https://github.com/leadforge-dev/leadforge/blob/main/release/validation/validation_report.md),\nregenerated by\n[`scripts/validate_release_candidate.py`](https://github.com/leadforge-dev/leadforge/blob/main/scripts/validate_release_candidate.py)\nwith bands declared in\n[`docs/release/v1_acceptance_gates_bands.yaml`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v1_acceptance_gates_bands.yaml).\nHeadline cross-seed medians (seeds 42–46):\n\n| Tier | LR AUC | AP | P@100 | Brier | `calibration_max_bin_error` |\n|---|---|---|---|---|---|\n| intro | 0.879 | 0.761 | 0.80 | 0.130 | 0.25 |\n| intermediate | 0.886 | 0.575 | 0.59 | 0.110 | 0.25 |\n| advanced | 0.886 | 0.351 | 0.34 | 0.061 | **0.52** |\n\n**Reading this table:** LR AUC is flat across tiers by design — the\ntiers are a prevalence / noise axis, not a rank-discrimination axis.\nBrier score *improves* as prevalence falls (a prevalence effect, not\nbetter calibration); use `calibration_max_bin_error` to assess\ncalibration quality. Advanced's 0.52 max-bin error means the model's\npredicted probabilities are materially mis-scaled against actual\nconversion rates — a realistic miscalibration exercise.\n\nAP, P@100, conversion-rate, and lift orderings hold across the\nintended prevalence axis (intro > intermediate > advanced).\n\n## Intended uses\n\n- Teaching baseline lead-scoring on a flat snapshot.\n- Teaching relational feature engineering against snapshot-safe tables.\n- Teaching leakage detection (the `total_touches_all` trap is\n designed to be discoverable).\n- Teaching calibration, lift, P@K, value-aware ranking\n (`expected_acv × P(convert)`), and cohort-shift evaluation.\n- Comparing model families under a controlled DGP.\n\n## Out-of-scope uses\n\n- **Production lead scoring.** The company, product, and customers are\n fictional.\n- **Vendor benchmarking / paper baselines.** Difficulty tiers are\n calibrated for pedagogy, not cross-paper comparability.\n- **Causal-inference research that requires recovery of the true DGP.**\n The instructor companion exposes the hidden graph for teaching, not\n designed counterfactuals.\n- **Demographic / fairness research.** v1 does not model protected\n attributes.\n\n## Known limitations\n\n- **Tiers are a prevalence / noise axis, not a modelling-complexity\n axis.** LR AUC is ~0.88 in every tier; the three tiers differ in\n conversion rate (43% / 22% / 8%), noise scale, and missingness —\n not in rank discrimination. Use AP, P@K, and calibration metrics\n to see the difficulty gradient; AUC alone will not show it.\n- **93% account and contact overlap across train / test splits.** Random\n splits are keyed on lead ID; most test accounts and contacts also\n appear in train. Headline metrics over-state generalisation to unseen\n accounts and contacts. Use `GroupKFold(account_id)` for a faithful\n estimate.\n- **GBM does not consistently beat LR (gate G7.4.4).** GBM−LR AUC delta\n is slightly negative in every tier (intro −0.0045, intermediate\n −0.0072, advanced −0.0133); v1's snapshot is dominated by linear\n features. v2 will inject non-linear interactions in the simulator.\n- **Channel signal is weak.** Per\n [`docs/release/channel_signal_audit.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/channel_signal_audit.md),\n out-of-sample univariate AUC of `lead_source` is ≈0.50–0.52 across\n all tiers and the per-channel rate spread is ≤0.05. The simulator\n does not encode channel-conditional probabilities; channel-conditional\n encoding is post-v1 work.\n- **Cohort-shift degradation is small.** v1 has no time-of-year drift\n baked in; the cohort-shift gate (G6.4) is informational and will\n bite in v2.\n- **Advanced-tier noise can produce artifact zeros in count and duration\n columns.** Gaussian noise is applied before MCAR missingness; the\n snapshot builder clamps results below zero to zero. What users observe\n is therefore not negative values but zeros that may be noise artifacts\n rather than true zero values — e.g. `days_since_last_touch = 0` might\n mean \"noised below zero, clamped\" rather than \"touched today\". Treat\n suspicious zero clusters in the Advanced tier as intentional\n data-cleaning exercise material.\n\n## Composition\n\n- **Entities.** Accounts, contacts, leads, touches, sessions,\n sales_activities, opportunities (public); plus customers and\n subscriptions (instructor only). Per-row counts per bundle live in\n `manifest.json`.\n- **Features.** 31 public columns grouped by analytical role in\n [`docs/release/feature_dictionary.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/feature_dictionary.md);\n the per-bundle `feature_dictionary.csv` is the authoritative\n machine-readable spec.\n- **Label.** `converted_within_90_days` (boolean), event-derived from\n the simulator. Never sampled directly.\n- **Splits.** 70/15/15 train/valid/test, deterministic given seed;\n recorded in `tasks/converted_within_90_days/task_manifest.json`.\n Splits are keyed on `lead_id`; see the *Evaluation note* above for\n the account-overlap caveat.\n- **Provenance.** Recipe `b2b_saas_procurement_v1`, seed 42, package\n version stamped in `manifest.json`.\n\n## Maintenance, adversarial framing, license\n\nWe *want* the dataset to be broken. The\n[break-me guide](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md) catalogues\nnine adversarial patterns to look for (leakage, split\ncontamination, ranking inversions, calibration drift) with\nworked-example pointers back into the notebooks. Issue\ntemplates ship under `.github/ISSUE_TEMPLATE/`: a\n[breakage report](https://github.com/leadforge-dev/leadforge/blob/main/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml)\nform for findings on the bundle itself, and a\n[realism feedback](https://github.com/leadforge-dev/leadforge/blob/main/.github/ISSUE_TEMPLATE/realism_feedback.yml)\nform for distributional critiques. Accepted findings are\nlogged in\n[`docs/release/v2_decision_log.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v2_decision_log.md).\nFile issues at\n[leadforge-dev/leadforge](https://github.com/leadforge-dev/leadforge);\nPRs welcome.\n\n| Field | Value |\n|---|---|\n| Generator | leadforge `1.0.0+` |\n| Recipe | `b2b_saas_procurement_v1` |\n| Canonical seed | 42 (cross-seed sweep: 42–46) |\n| Bundle schema version | 5 |\n| Format | Parquet (canonical) + CSV (convenience) |\n| License | MIT — see [LICENSE](LICENSE) |\n\nVerify integrity with `leadforge validate <bundle_dir>`; every file\nis hashed in `manifest.json`.\n\n## Credits\n\nCreated by [Shay Palachy Affek](https://www.shaypalachy.com/).\nDataset generated with [leadforge](https://github.com/leadforge-dev/leadforge) (MIT).\nProfiles: [HuggingFace](https://huggingface.co/shaypal5) · [Kaggle](https://www.kaggle.com/derelictpanda) · [GitHub](https://github.com/shaypalachy)\n", | ||
| "expectedUpdateFrequency": "never", | ||
| "id": "leadforge/leadforge-lead-scoring-v1", | ||
| "id": "derelictpanda/leadforge-lead-scoring-v1", |
|
pr-agent-context report: No unresolved review comments, failing checks, or actionable patch coverage gaps were found on PR #98 in repository https://github.com/leadforge-dev/leadforge. Treat this PR as all clear unless new signals appear.Run metadata: |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Kaggle org
leadforgedoesn't exist; publish underderelictpanda(personal account).DEFAULT_USER_SLUG:leadforge→derelictpandainpackage_kaggle_release.pyrelease/kaggle/dataset-metadata.json— also picks up author attribution fromrelease/README.md(chore(release): add author attribution to README #94–chore: add author attribution to repo README #96) that was missing from the stale metadataDry-run passes cleanly.
🤖 Generated with Claude Code