Plan v4 lead scoring dataset + leadforge engine roadmap by shaypal5 · Pull Request #19 · leadforge-dev/leadforge

shaypal5 · 2026-04-29T12:41:37Z

Summary

Comprehensive planning PR for the v4 lead scoring dataset — a pedagogically improved single-CSV dataset for an intro ML course. This PR contains only documentation and planning artifacts (no code changes, all 590 tests pass).

Why v4?

v1–v3 each fixed critical issues but revealed new ones:

v1: label leakage via funnel_stage containing closed_won/closed_lost
v2: 90-day snapshot = post-mortem classification (not prediction); deterministic binary proxies (reached_sql=0 → 0% conversion)
v3: day-21 snapshot + clean proxies, but AUC only 0.62 due to flat category effects at intro difficulty

v4 addresses all of these and adds:

Value-aware ranking (expected_acv feature)
Temporal momentum (touches_week_1)
Structured missingness (conditional on lead source, not just MCAR)
Deliberate leakage trap (total_touches_all — full 90-day window, for classroom discussion)
Stronger category signal via category_effect_scale in difficulty profiles

Milestones

| Milestone | Scope | Status |
|---|---|---|
| v4-M0 | Requirements, contract, agent instructions | ✅ This PR |
| v4-M1 | Engine: category signal tuning + windowed snapshots | ⬜ Next |
| v4-M2 | Build pipeline: snapshot builder + validation | ⬜ |
| v4-M3 | Documentation + dataset release | ⬜ |

Existing roadmap items — triage

| Item | Decision | Rationale |
|---|---|---|
| M12: CLI `--json`/`--strict` flags | Deferred | No consumer needs it; low priority vs dataset |
| M14: Sample bundle generation | Absorbed into v4-M3 | v4 dataset IS the sample |
| M14: Lead-scoring baseline notebook | Deferred | v4 validation script covers this |
| M14: Notebooks 3–4 (public/instructor, recipe customization) | Discarded | No current audience |
| M15: Docs polish + v1.0 RC | Deferred | Do after v4 ships |

v4 dataset acceptance criteria

1,000 rows × 18 columns (17 features + 1 target)
30% conversion rate
No deterministic groups (no group with n≥50 at a 0% or 100% conversion rate)
Baseline LR AUC 0.65–0.90 (without leakage trap)
Leakage trap boosts AUC by ≥0.03
Structured missingness (source-conditional, not just MCAR)
Reproducible with seed 42
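
The criteria above are mechanical enough to check in code. The following is a minimal sketch in plain Python, not the repo's validation script: the names `find_deterministic_groups` and `check_acceptance`, the list-of-dicts row format, and the `converted` target name are assumptions made for illustration. The default conversion-rate bounds follow the [15%, 40%] range quoted in the review thread below, and the two AUC values are expected to be computed elsewhere.

```python
from collections import defaultdict


def find_deterministic_groups(rows, column, target="converted", min_n=50):
    """Return values of `column` whose group has >= min_n rows and a 0% or 100% rate."""
    counts = defaultdict(lambda: [0, 0])  # value -> [n_rows, n_positive]
    for row in rows:
        stats = counts[row[column]]
        stats[0] += 1
        stats[1] += int(row[target])
    return sorted(
        (value for value, (n, pos) in counts.items() if n >= min_n and pos in (0, n)),
        key=str,
    )


def check_acceptance(rows, group_columns, auc_without_trap, auc_with_trap,
                     target="converted", rate_bounds=(0.15, 0.40)):
    """Collect human-readable failures; an empty list means every check passed."""
    failures = []
    if len(rows) != 1000:
        failures.append(f"row count is {len(rows)}, expected 1000")
    if rows and len(rows[0]) != 18:
        failures.append(f"column count is {len(rows[0])}, expected 18")
    if rows:
        rate = sum(int(r[target]) for r in rows) / len(rows)
        if not rate_bounds[0] <= rate <= rate_bounds[1]:
            failures.append(f"conversion rate {rate:.3f} outside {rate_bounds}")
    for column in group_columns:
        for value in find_deterministic_groups(rows, column, target):
            failures.append(f"deterministic group {column}={value!r}")
    if not 0.65 <= auc_without_trap <= 0.90:
        failures.append(f"baseline AUC {auc_without_trap:.3f} outside [0.65, 0.90]")
    if auc_with_trap - auc_without_trap < 0.03:
        failures.append("leakage trap boosts AUC by less than 0.03")
    return failures
```

Returning a list of failures rather than raising on the first one keeps the whole picture visible when several criteria fail at once.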

Files in this PR

docs/v4/lead_scoring_v4_requirements.md — 7 requirements with rationale
docs/v4/dataset_contract.md — schema contract, temporal gates, missingness patterns
docs/v4/engine_changes_spec.md — what changes in the engine and where
docs/v4/validation_spec.md — 8 mandatory + 3 warning checks
docs/v4/implementation_plan.md — milestone breakdown with acceptance criteria
Updated CLAUDE.md — repo map, generation workflow, student_public invariants
Updated AGENTS.md — v4 implementation guide, coding conventions, testing
Updated .agent-plan.md — v4 as next work, M12–M15 triage

What this PR does NOT do

No code changes to the engine or any Python module
No implementation of v4-M1 through v4-M3 (deferred to follow-up PRs)
No changes to the simulation loop
No new dependencies

🤖 Generated with Claude Code

Add comprehensive v4 planning docs, updated agent instructions, and revised project roadmap driven by dataset needs from v1–v3 iterations.

Docs added:
- docs/v4/lead_scoring_v4_requirements.md: 7 requirements (value features, temporal momentum, structured missingness, leakage trap, redundancy fix, stronger category signal, robust validation)
- docs/v4/dataset_contract.md: schema contract, temporal gates, missingness patterns, subsampling rules
- docs/v4/engine_changes_spec.md: category effect scaling, windowed snapshot builder, structured missingness, leakage trap feature
- docs/v4/validation_spec.md: 8 mandatory checks + 3 warning checks
- docs/v4/implementation_plan.md: 4 milestones (M0–M3) with acceptance criteria and explicit mapping from existing roadmap items

Updated:
- CLAUDE.md: added repo map, generation workflow, student_public invariants, feature addition guide, v4 plan pointers
- AGENTS.md: added v4 implementation guide, coding conventions, validation checklist, local testing commands
- .agent-plan.md: v4 milestones as next work; M12–M15 items explicitly triaged (deferred/absorbed/discarded)

No code changes. All 590 existing tests pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions · 2026-04-29T12:42:44Z

pr-agent-context report:

No unresolved review comments, failing checks, or actionable patch coverage gaps were found on PR #19. Treat this PR as all clear unless new signals appear.

Run metadata:

Tool ref: v4
Tool version: 4.0.20
Trigger: pull request opened
Workflow run: 25109426149 attempt 1
Comment timestamp: 2026-04-29T12:41:56.785853+00:00
PR head commit: 1f45df09976b1cf10e5cca6a0a366a3270ff4364

Copilot

Pull request overview

Planning and documentation for the v4 lead scoring “intro” dataset, including requirements, dataset contract, validation spec, and an engine/build roadmap to address leakage, temporal framing, and pedagogical signal strength.

Changes:

Added v4 planning docs: requirements, dataset contract, engine changes spec, validation spec, and milestone implementation plan.
Updated contributor-facing guidance (CLAUDE.md, AGENTS.md) with repo map, workflow/commands, and v4 implementation checklist.
Updated .agent-plan.md to prioritize v4 and triage older roadmap items.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 18 comments.


| File | Description |
|---|---|
| docs/v4/validation_spec.md | Defines mandatory + warning validation checks for the v4 single-CSV dataset. |
| docs/v4/lead_scoring_v4_requirements.md | Captures v4 goals, rationale from v1–v3, and the proposed v4 column set + acceptance criteria. |
| docs/v4/implementation_plan.md | Milestone breakdown (M0–M3) and deliverables/acceptance criteria for v4 delivery. |
| docs/v4/engine_changes_spec.md | Specifies intended engine changes (difficulty/category signal tuning, snapshot windowing) and build-script responsibilities. |
| docs/v4/dataset_contract.md | Defines snapshot/target temporal gates, missingness expectations, subsampling, and reproducibility contract. |
| CLAUDE.md | Adds repo map + generation/validation workflow documentation and links to v4 docs. |
| AGENTS.md | Adds v4 implementation guidance, conventions, and a validation checklist. |
| .agent-plan.md | Reorients “Next Up” to v4 milestones and defers/triages older roadmap items. |


Copilot · 2026-04-29T12:48:09Z

+| Conversion rate | In [15%, 40%] |
+| Baseline LR AUC | In [0.65, 0.90] (all features except leakage trap) |
+| Leakage trap AUC boost | AUC with trap > AUC without trap by ≥0.03 |
+| Missingness per column | Each column with nulls: 1–15% missing |


R7 specifies “Each column with nulls: 1–15% missing”, but validation_spec.md only enforces “no column > 20% missing” and dataset_contract.md includes structural missingness for days_since_last_touch (which may exceed 15% depending on touch sparsity). Consider relaxing/aligning this requirement (e.g., per-column max only, or explicit bounds per affected column).

Suggested change

-| Missingness per column | Each column with nulls: 1–15% missing |
+| Missingness per column | For nullable columns, no column may exceed 20% missing unless explicitly exempted below |
+| Missingness exceptions | `days_since_last_touch`: structural missingness allowed in [5%, 35%] because leads with no touches have natural NaN; `web_sessions`: overall missingness ≤20% |
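
The relaxed rule in the suggestion above (a default 20% ceiling plus explicit per-column bounds) could be checked along these lines. This is a hedged sketch: `check_missingness`, `EXEMPT_BOUNDS`, and the list-of-dicts row format are hypothetical names chosen for illustration, and only the bounds themselves come from the suggestion.

```python
import math

DEFAULT_MAX_MISSING = 0.20
# Per-column (low, high) bounds taken from the suggested exceptions row.
EXEMPT_BOUNDS = {"days_since_last_touch": (0.05, 0.35)}


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def missing_fraction(rows, column):
    return sum(is_missing(row.get(column)) for row in rows) / len(rows)


def check_missingness(rows, columns, default_max=DEFAULT_MAX_MISSING, exempt=EXEMPT_BOUNDS):
    """Return {column: observed_fraction} for every column outside its allowed range."""
    failures = {}
    for column in columns:
        fraction = missing_fraction(rows, column)
        low, high = exempt.get(column, (0.0, default_max))
        if not low <= fraction <= high:
            failures[column] = fraction
    return failures
```

Keeping the exemptions in one table makes it harder for the requirements doc and the validation script to drift apart, which is the inconsistency this thread points out.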

Copilot · 2026-04-29T12:48:10Z

+| `days_since_first_touch` | `snapshot_day - first_touch_day` (NaN if no touches) | Lead age signal |
+| `expected_acv` | Opportunity ACV if opp created by snapshot; else employee_band midpoint | Value feature |
+| `total_touches_all` | Count of ALL touches over full horizon (ignoring snapshot gate) | Leakage trap |
+
+### Files affected
+
+- `leadforge/render/snapshots.py` — add `snapshot_day` parameter, windowed filtering, new feature computations
+- `leadforge/schema/features.py` — add new `FeatureSpec` entries for the new columns


The “New features to compute” list includes days_since_first_touch, but docs/v4/lead_scoring_v4_requirements.md’s target column set does not include this feature (and fixes the dataset at 18 columns). Please reconcile whether days_since_first_touch is part of v4 (and update the schema/validation) or remove it here to keep the specs aligned.

Suggested change

-| `days_since_first_touch` | `snapshot_day - first_touch_day` (NaN if no touches) | Lead age signal |
-| `expected_acv` | Opportunity ACV if opp created by snapshot; else employee_band midpoint | Value feature |
-| `total_touches_all` | Count of ALL touches over full horizon (ignoring snapshot gate) | Leakage trap |
-
-### Files affected
-
-- `leadforge/render/snapshots.py` — add `snapshot_day` parameter, windowed filtering, new feature computations
-- `leadforge/schema/features.py` — add new `FeatureSpec` entries for the new columns
+| `expected_acv` | Opportunity ACV if opp created by snapshot; else employee_band midpoint | Value feature |
+| `total_touches_all` | Count of ALL touches over full horizon (ignoring snapshot gate) | Leakage trap |
+
+### Files affected
+
+- `leadforge/render/snapshots.py` — add `snapshot_day` parameter, windowed filtering, new feature computations
+- `leadforge/schema/features.py` — add new `FeatureSpec` entries for the added v4 columns
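
The difference between a snapshot-gated feature and the `total_touches_all` trap can be sketched in a few lines. This is illustrative only and does not reflect the contents of `leadforge/render/snapshots.py`: the function name `touch_features`, the `touches_by_snapshot` key, the integer day-offset representation, and the inclusive `<=` gate are all assumptions.

```python
def touch_features(touch_days, snapshot_day):
    """Compute one gated and one ungated touch count for a single lead.

    touch_days: integer day offsets (0 = lead creation) over the full horizon.
    snapshot_day: day on which the feature snapshot is taken.
    """
    # Gated feature: only touches observable at snapshot time (assumed inclusive).
    visible = [day for day in touch_days if day <= snapshot_day]
    return {
        "touches_by_snapshot": len(visible),
        # Leakage trap: counts every touch, including those after the snapshot.
        "total_touches_all": len(touch_days),
    }
```

For a lead touched on days 1, 5, 20, 40, and 80 with a day-21 snapshot, the gated count is 3 while the trap reports 5. The extra two touches happen after the prediction point, which is what makes the column a leak.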

Copilot · 2026-04-29T12:48:10Z

+- `is_sql` — engine invariant creates deterministic groups
+- `is_mql` — zero variance
+- `lead_created_at` — timestamp that could be used to reverse-engineer temporal info
+- Any column containing `_id` suffix (opaque identifiers, not features)


“Any column containing _id suffix” is ambiguous (contains vs endswith). If the intent is to ban opaque identifiers like lead_id/account_id, consider clarifying to “any column name ending with _id” to avoid mis-implementing the check.

Suggested change

-- Any column containing `_id` suffix (opaque identifiers, not features)
+- Any column name ending with `_id` (opaque identifiers, not features)
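
A small sketch of why the wording matters. The column names below are hypothetical, chosen so that one of them contains `_id` without ending in it.

```python
def columns_containing_id(columns):
    """Loose reading: flags any name with '_id' anywhere in it."""
    return [name for name in columns if "_id" in name]


def columns_ending_with_id(columns):
    """Strict reading: flags only names whose suffix is '_id'."""
    return [name for name in columns if name.endswith("_id")]


# Hypothetical column names for illustration.
columns = ["lead_id", "account_id", "paid_identity_score", "web_sessions"]
loose = columns_containing_id(columns)
strict = columns_ending_with_id(columns)
```

Here the loose check also flags `paid_identity_score`, a legitimate feature name under this example, while the strict check flags only the two identifier columns.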

Copilot · 2026-04-29T12:48:11Z

+**Problem:** v1–v3 inject missingness randomly (MCAR). Real CRM data has structured gaps.
+
+**Requirement:** Implement three missingness patterns:
+1. **Natural (structural):** `days_since_last_touch` is NaN when `total_touches == 0` (no touches recorded). Already exists but must be preserved.


Structured missingness refers to total_touches == 0, but R5 requires dropping total_touches from v4. To avoid confusion, define the condition in terms of retained columns (e.g., inbound_touches + outbound_touches == 0) or explicitly note that total_touches is computed internally and not included in the final CSV.

Suggested change

-1. **Natural (structural):** `days_since_last_touch` is NaN when `total_touches == 0` (no touches recorded). Already exists but must be preserved.
+1. **Natural (structural):** `days_since_last_touch` is NaN when `inbound_touches + outbound_touches == 0` (no touches recorded). Already exists but must be preserved.
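
Expressed in terms of retained columns, the invariant can be verified on the final CSV rows. The sketch below is an assumption-laden illustration: `structural_missingness_violations` is a hypothetical helper, rows are plain dicts, and it additionally assumes the converse (the value is present whenever at least one touch exists), which the requirement does not state explicitly.

```python
import math


def structural_missingness_violations(rows):
    """Return indices of rows where NaN-ness and zero touches disagree."""
    violations = []
    for index, row in enumerate(rows):
        no_touches = row["inbound_touches"] + row["outbound_touches"] == 0
        value = row["days_since_last_touch"]
        is_nan = value is None or (isinstance(value, float) and math.isnan(value))
        # Assumed two-way rule: NaN if and only if the lead has no touches.
        if no_touches != is_nan:
            violations.append(index)
    return violations
```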

Copilot · 2026-04-29T12:48:11Z

+
+| Column | Pattern | Rate | Condition |
+|---|---|---|---|
+| `days_since_last_touch` | Structural | Natural | NaN when `total touches == 0` by snapshot |


The missingness contract uses total touches == 0 as the condition for structural NaNs in days_since_last_touch, but v4 drops total_touches from the final CSV. It would be clearer to express the condition using columns that remain in v4 (e.g., inbound_touches + outbound_touches == 0) or explicitly state that the condition is evaluated pre-drop during build.

Suggested change

-| `days_since_last_touch` | Structural | Natural | NaN when `total touches == 0` by snapshot |
+| `days_since_last_touch` | Structural | Natural | NaN when `inbound_touches + outbound_touches == 0` by snapshot |

Copilot · 2026-04-29T12:48:13Z

+
+## Overview
+
+This plan implements the v4 lead scoring dataset in 4 milestones across 4–6 PRs. Each milestone produces testable artifacts and has explicit acceptance criteria.


This plan says v4 is delivered across “4–6 PRs”, but later the timeline section states “The work is 4 PRs”. Please make the expected PR count consistent so the milestone planning is unambiguous.

Suggested change

-This plan implements the v4 lead scoring dataset in 4 milestones across 4–6 PRs. Each milestone produces testable artifacts and has explicit acceptance criteria.
+This plan implements the v4 lead scoring dataset in 4 milestones across 4 PRs. Each milestone produces testable artifacts and has explicit acceptance criteria.

Copilot · 2026-04-29T12:48:14Z

+- `scripts/build_v4_snapshot.py` (new) — snapshot builder with missingness + leakage trap
+- `scripts/validate_v4_dataset.py` (new) — dataset-level validation
+- These live in the leadforge repo (not datasets-private)
+
+**v4-M3 (release):**
+- Work in `leadforge-datasets-private` repo


This section states the v4 build/validation scripts live in the leadforge repo (not datasets-private), but CLAUDE.md’s “Related repos” + workflow example place these scripts in leadforge-datasets-private. Please align the docs on a single canonical home (or document how scripts are shared) to avoid contributors building in the wrong repo.

Suggested change

-- `scripts/build_v4_snapshot.py` (new) — snapshot builder with missingness + leakage trap
-- `scripts/validate_v4_dataset.py` (new) — dataset-level validation
-- These live in the leadforge repo (not datasets-private)
-
-**v4-M3 (release):**
-- Work in `leadforge-datasets-private` repo
+- Work in `leadforge-datasets-private` repo
+- `scripts/build_v4_snapshot.py` (new) — snapshot builder with missingness + leakage trap
+- `scripts/validate_v4_dataset.py` (new) — dataset-level validation
+
+**v4-M3 (release):**
+- Continue in `leadforge-datasets-private` repo for dataset generation and release artifacts

Copilot · 2026-04-29T12:48:14Z

+### Implementation order
+
+```
+v4-M0 (planning PR — already done)


The implementation order diagram says “v4-M0 (planning PR — already done)”, but in this PR v4-M0 is the work being proposed/merged. Consider rephrasing to something time-invariant (e.g., “v4-M0: planning (this PR)”) to avoid the docs disagreeing with .agent-plan.md.

Suggested change

-v4-M0 (planning PR — already done)
+v4-M0: planning (this PR)

Copilot · 2026-04-29T12:48:14Z

+- Stratified subsampling to 1,000 rows at ~30% conversion rate.
+- All negatives retained (up to 700); positives downsampled.
+- Subsampling preserves within-class feature distributions.


“All negatives retained (up to 700); positives downsampled.” is a bit self-contradictory for a 1,000-row stratified sample (you can’t retain all negatives from a 5,000-lead bundle if there are more than 700). Consider rewording to specify the intended sampling policy (e.g., fixed 700 negatives sampled uniformly/stratified, and 300 positives downsampled as needed) and what happens if the source bundle’s base rate is below 30%.

Suggested change

-- Stratified subsampling to 1,000 rows at ~30% conversion rate.
-- All negatives retained (up to 700); positives downsampled.
-- Subsampling preserves within-class feature distributions.
+- Stratified subsampling produces 1,000 rows with a target mix of 700 negatives and 300 positives when the source bundle contains at least that many examples of each class.
+- Sampling policy: draw negatives uniformly at random without replacement up to 700 rows, and draw positives uniformly at random without replacement up to 300 rows.
+- If the source bundle has fewer than 300 positives, retain all available positives and fill the remaining rows with additional uniformly sampled negatives; if it has fewer than 700 negatives, retain all available negatives and fill the remaining rows with additional uniformly sampled positives.
+- Because sampling is uniform within each class, the subsample is intended to preserve within-class feature distributions up to normal sampling variation.
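
The sampling policy in this suggestion can be sketched with the standard library alone. This is one possible reading, not the build script's implementation: `stratified_subsample`, the `converted` target name, and the list-of-dicts row format are assumptions for illustration.

```python
import random


def stratified_subsample(rows, target="converted", n_total=1000, n_pos=300, seed=42):
    """Draw up to n_pos positives and n_total - n_pos negatives, filling any shortfall."""
    rng = random.Random(seed)  # local RNG so the result is reproducible per seed
    positives = [row for row in rows if row[target]]
    negatives = [row for row in rows if not row[target]]

    take_pos = min(n_pos, len(positives))
    take_neg = min(n_total - n_pos, len(negatives))

    # If one class is short, top up from the other class where possible.
    shortfall = n_total - take_pos - take_neg
    if shortfall > 0:
        if take_pos < n_pos:
            take_neg = min(len(negatives), take_neg + shortfall)
        else:
            take_pos = min(len(positives), take_pos + shortfall)

    # random.Random.sample draws uniformly without replacement.
    sample = rng.sample(positives, take_pos) + rng.sample(negatives, take_neg)
    rng.shuffle(sample)  # avoid leaving the rows ordered by class
    return sample
```

If both classes are short, the function returns fewer than `n_total` rows rather than sampling with replacement. Whether that case should instead fail validation is a decision the contract would need to state.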

Copilot · 2026-04-29T12:48:15Z

+2. **Conditional on source:** `web_sessions` is missing for ~15% of `sdr_outbound` leads (CRM tracking often not set up for outbound-sourced leads) but only ~2% of `inbound_marketing` leads.
+3. **Role data gap:** `seniority` is missing for ~8% of `partner_referral` leads (referral partners don't always provide full contact details).
+
+**Engine change needed:** Missingness injection in the snapshot builder, conditioned on feature values.


R3 says missingness injection belongs in the snapshot builder (engine), but docs/v4/engine_changes_spec.md specifies missingness should be injected in the v4 build script (post-snapshot) to keep engine output clean. Please align R3 with the chosen design so implementation guidance is consistent.

Suggested change

-**Engine change needed:** Missingness injection in the snapshot builder, conditioned on feature values.
+**Implementation note:** Preserve natural structural missingness in the engine output, but inject the conditional missingness patterns in the v4 build script **after** snapshot generation, conditioned on feature values, to keep the snapshot builder output clean and reusable.
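
A post-snapshot injection step of the kind this note describes might look like the sketch below. It is a hypothetical illustration: `inject_missingness`, `MISSINGNESS_RULES`, and the `lead_source` column name are assumptions, and any source value without a stated rate is assumed to get 0% injected missingness. Only the three rates come from the quoted requirement.

```python
import random

# (column to blank, conditioning column, {condition value: missing rate})
MISSINGNESS_RULES = [
    ("web_sessions", "lead_source", {"sdr_outbound": 0.15, "inbound_marketing": 0.02}),
    ("seniority", "lead_source", {"partner_referral": 0.08}),
]


def inject_missingness(rows, rules=MISSINGNESS_RULES, seed=42):
    """Return a copy of rows with values blanked at source-conditional rates."""
    rng = random.Random(seed)
    output = [dict(row) for row in rows]  # leave the snapshot rows untouched
    for column, condition_column, rates in rules:
        for row in output:
            # Assumption: unlisted condition values receive no injected missingness.
            rate = rates.get(row[condition_column], 0.0)
            if rng.random() < rate:
                row[column] = None
    return output
```

Working on a copy keeps the engine's snapshot output clean and reusable, as the note asks, and the seeded local RNG keeps the injected pattern reproducible.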

* refactor: consolidate v4 planning docs + spike experiment

  Self-review of PR #19 identified 8 issues. This commit addresses all of them:

  1. Ran spike experiment validating category signal approach: the spec's CategoricalInfluence scaling was wrong (not wired into simulation). Correct approach: correlate observables with latent traits in population.py. Scale 1.8 gives AUC 0.694, within [0.65, 0.90] target.
  2. Consolidated 5 overlapping spec docs into 2: design.md (single source of truth for requirements, contract, engine changes, plan) + validation_spec.md.
  3. Added "Known limitations" section (is_sql invariant, role_function gap).
  4. Added missingness rationale (detectability at n=1000, not arbitrary).
  5. Added tuning protocol decision table for when validation checks fail.
  6. Trimmed AGENTS.md to durable conventions + pointer to docs/v4/.
  7. Added explicit ACV band→midpoint mapping table and null-band behavior.
  8. Merged v4-M1 and v4-M2 into single milestone (engine + build pipeline can't be validated independently).

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* style: fix ruff formatting in spike script

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: bump ruff pre-commit hook v0.4.5 → v0.11.13

  Aligns the local pre-commit hook with CI's ruff version (unpinned, currently 0.11.x). The old v0.4.5 hook accepted formatting that the CI ruff rejects, allowing format violations to slip through.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>


* feat(scripts,docs): channel-signal audit (PR 4.1 deliverable 1+2)

  scripts/audit_channel_signal.py audits how strongly source channel signals conversion across the release tier family. For each tier we compute per-channel conversion rates and the univariate AUC of channel against converted_within_90_days (scored as the empirical positive rate per channel, a 1-D Bayes classifier equivalent to a saturated logistic regression on one-hot channel features). Outputs JSON + Markdown to docs/release/channel_signal_audit.{json,md}.

  Tests guard determinism against the committed release/ bundles (a double-run produces byte-identical output) plus per-channel rollup, univariate AUC closed-form, single-class fallback, error paths, and the CLI wiring.

  The audit confirms what the v1 DGP predicts: channel signal in v1 is weak. Across all three tiers the largest per-channel rate spread is 0.043 and the largest univariate AUC is 0.521, well below the G2 / Gemini v2 industry MQL→SQL band (SEO ~51% vs Email <1%). v1 drives conversion through motif-family hazards keyed off latent traits, not channel-conditional probabilities; channel-conditional encoding is tracked as post-v1 work in docs/release/post_v1_roadmap.md.

  Roadmap: docs/release/v1_release_roadmap.md §"Phase 4 - PR 4.1".

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(release): release-grade dataset card + generation method + feature dictionary (PR 4.1 deliverable 3-5)

  * docs/release/generation_method.md (new): standalone DGP summary for external readers. Reads alone, references the architecture spec. Covers the five generation layers (motif families → mechanism layer → population → simulation engine → snapshot rendering), the public-vs-instructor split, calibration / validation, and the explicit "what this is not" boundary.
  * docs/release/feature_dictionary.md (new): narrative companion to the per-bundle feature_dictionary.csv. Groups the 32 public columns by analytical role (lead identity / firmographics / personographics / engagement / funnel / value) plus the deliberate trap and the target. Documents difficulty modulation parameters, modelling defaults, and pedagogical caveats. Satisfies G10.3.
  * release/README.md (substantial rewrite): release-grade dataset card per Datasheets-for-Datasets / Data Cards Playbook checklist (G10.1):
    - macro framing paragraph (2024–2026 SaaS context, recommendation #19)
    - simulation simplifications section (chatgpt v2 §2.6: modelled / approximate / not modelled)
    - calibration documentation linking to validation_report.md
    - public-vs-instructor redaction policy with concrete column lists citing BANNED_LEAD_COLUMNS / BANNED_OPP_COLUMNS / BANNED_TABLES / SNAPSHOT_FILTERED_TABLES from leakage_probes.py
    - intended use vs out-of-scope use
    - known limitations including the G7.4.4 GBM-vs-LR finding and the weak channel signal from the Phase 4 audit
    - composition section (entities / features / label / splits / provenance) per Datasheets format
    - adversarial-framing pointer (placeholder link to break-me guide that lands in PR 6.3)
    - maintenance plan

  All claims about realism, calibration, or difficulty are anchored to release/validation/validation_report.md per G10.6.

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(plan): mark Phase 4 PR 4.1 complete in .agent-plan.md

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(release): cover lead_source / first_touch_channel in feature dictionary

  Self-review caught a gap: the prior commit grouped 30 of 32 public columns; lead_source and first_touch_channel were referenced in the "recommended modelling defaults" checklist but did not appear in any category table. Adds a "Lead source & channel" subsection that describes both columns, calls out that they're identical in v1, and cross-references the channel-signal audit so readers don't expect top-tier feature importance from these columns. Updates the summary table to reflect 32 documented columns.

  Also corrects two minor wording issues (firmographics "Six" → "Five", personographics "all four" → "all three", and a typo "bandage" → "discretisation").

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(scripts): channel audit: out-of-sample AUC, no verdict bands, group identical columns

  Self-review of the previous PR-4.1 commit surfaced four problems with audit_channel_signal.py:

  * The univariate AUC was computed in-sample (train rates → train labels), guaranteed >= 0.5 by construction and not directly comparable to the source_only baselines in release/validation/validation_report.json.
  * The "weak / moderate / strong" verdict made a hard comparison between v1's 90-day closed-won label and the G2 / Gemini v2 industry MQL→SQL benchmark band. The two metrics measure different funnel transitions; the comparison was a category error.
  * The verdict prose hard-coded a "50 percentage points" claim and a specific architectural narrative ("v1 drives conversion through motif-family hazards") inside the script. Both would silently drift from the data and the codebase over time.
  * lead_source and first_touch_channel produce byte-identical audits in v1 yet were rendered as two parallel tables per tier.

  Fixes:

  * audit_channel now takes both train and test DataFrames and returns univariate_auc_in_sample (the historical 1-D Bayes interpretation, retained for transparency) plus univariate_auc_out_of_sample (train rates scored against held-out test labels). The OOS numbers reproduce the source_only HistGBM baselines in validation_report.json for seed 42 cell-for-cell (intro 0.5014, intermediate 0.5139, advanced 0.5226).
  * Verdict bands and the _classify_signal / _verdict_paragraph helpers are gone. The markdown report now ends with a Discussion section written by hand around the actual numbers, with an explicit caveat that the industry benchmarks measure MQL→SQL (not 90-day closed-won) and are reproduced for context only.
  * INDUSTRY_MQL_TO_SQL_BENCHMARKS is now a tuple of pairs (genuinely immutable; matches dataclass(frozen=True) semantics). report_to_dict converts it back to a {name: rate} dict for the JSON output.
  * render_markdown groups channel columns whose audits are byte-identical into one section with a header listing all columns ("Columns: lead_source, first_touch_channel (audit values identical)"). The JSON keeps per-column entries.

  New tests in tests/scripts/test_audit_channel_signal.py:

  * OOS AUC == in-sample AUC when test=train (sanity check)
  * OOS AUC stays well-defined when the test split contains channels unseen on train (train-base-rate fallback)
  * render_markdown collapses two identical columns into one section AND keeps two distinct columns in two sections
  * test_lead_source_equals_first_touch_channel_in_v1 (parametrized over intro/intermediate/advanced): locks the feature-dictionary claim that the two channel columns are identical in v1. If the simulator ever diverges them, the doc must be updated.
  * test_committed_audit_artifacts_match_fresh_regeneration: re-runs the audit against the committed bundles and asserts byte-equality with the committed docs/release/channel_signal_audit.{md,json}. CI gate against bundles regenerated without re-running the audit.

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(release): self-review fixes: README trim, citations, feature-dict consistency

  * release/README.md (~434 → ~228 lines): trimmed to a release-grade landing card. The full DGP, motif families, simulation simplifications, and module map move to docs/release/generation_method.md (linked). Macro-framing claim now cites docs/external_review/summaries/gemini_v2_summary.md as the source of the 30%→25% growth and CAC-ratio numbers (previously presented as if primary research). Composition + maintenance sections compressed into the table at the bottom.
  * docs/release/generation_method.md: dropped the "Where the code lives" module table. This doc is for external readers; module paths belong in the developer-facing design doc and architecture spec. Ends with a single short pointer to those.
  * docs/release/feature_dictionary.md: fixed a factually wrong claim about the leakage trap (the per-bundle CSV has columns ``name,dtype,description,category,is_target,leakage_risk``; there is no ``is_leakage_trap`` column). Reworded the modelling-default checklist to acknowledge that the flat ``lead_scoring.csv`` and the Parquet task splits ship every column listed in the dictionary including the IDs; the recommendation says what to use as features, not what's in the file. Also notes that ``lead_source`` and ``first_touch_channel`` carry identical values in v1 (locked by the new test), so picking one is fine.

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(scripts,docs): address Copilot review threads on PR 4.1

  Six fixes from the Copilot reviews on PR #69:

  * scripts/audit_channel_signal.py: _label_to_int now uses pd.api.types.is_bool_dtype() so it explicitly handles pandas nullable BooleanDtype (the actual parquet dtype on the v1 bundles) alongside numpy bool. Previously it worked via a coincidental pd.to_numeric fallback, with a comment that misled future readers.
  * scripts/audit_channel_signal.py: render_markdown now takes both md_path and json_path and emits the JSON link as a relative path to the markdown's directory, so a `--out-md`/`--out-json` override produces a markdown report whose link target is correct. Defaults to the canonical "channel_signal_audit.json" basename when called without paths (the unit-test path).
  * scripts/audit_channel_signal.py: main() pins encoding="utf-8" on both write_text() calls so the audit output is byte-identical across operating systems and locale configurations.
  * scripts/audit_channel_signal.py: Discussion section is no longer bundle-specific. The previous prose claimed "for seed 42 the OOS numbers below match the report cell-for-cell", which is true for the committed bundle but wrong for any other --release-dir. The new prose talks about which AUC is comparable and what conclusion the numbers in the per-tier sections support, both bundle-agnostic.
  * release/README.md: fixed the relational-feature-engineering Quick start example. The previous snippet did `leads.merge(touch_counts, on="lead_id")` where touch_counts was a Series with lead_id in its index, not as a column, which would error in modern pandas. The new snippet uses .reset_index() and merges the resulting DataFrame.
  * docs/release/feature_dictionary.md: touches_week_1 documented as "days 0–7 inclusive" (8 day values) and touches_last_7_days qualified with "for snapshot_day=30, days 24–30 inclusive". Previously claimed "days 0–6" for week_1, which mismatched the snapshot builder's _day <= 7 window.

  Test changes:

  * test_release_audit_is_deterministic now writes both runs to the same path (back-to-back overwrite) instead of distinct tmp paths, so the relative-link rendering doesn't make the two outputs differ.
  * test_committed_audit_artifacts_match_fresh_regeneration uses the canonical "channel_signal_audit.{md,json}" basenames in tmp_path, so the relative link in the regenerated markdown matches the committed file's link.

  Two stale Copilot threads (firmographics "Six columns" and "bandage" typo) were already addressed in commit f6b274e during the first self-review pass.

  1175/1175 tests pass; ruff + mypy clean; the regenerated audit artifacts are byte-identical via the canonical-path mode.

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
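
The in-sample versus out-of-sample distinction described in these commit messages can be illustrated with a small standard-library sketch. It is not the code in `scripts/audit_channel_signal.py`: the function names, the list-based inputs, and returning `None` when only one class is present are assumptions. The train-base-rate fallback for unseen channels follows the test description above.

```python
from collections import defaultdict


def auc(scores, labels):
    """Rank-based AUC (ties count 0.5); returns None if only one class is present."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        return None  # assumed single-class behaviour
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def channel_rates(channels, labels):
    """Empirical positive rate per channel."""
    totals = defaultdict(lambda: [0, 0])  # channel -> [n_rows, n_positive]
    for channel, label in zip(channels, labels):
        totals[channel][0] += 1
        totals[channel][1] += int(label)
    return {channel: pos / n for channel, (n, pos) in totals.items()}


def univariate_channel_auc(train_channels, train_labels, test_channels, test_labels):
    """Score test rows by train-set channel rates; unseen channels get the train base rate."""
    rates = channel_rates(train_channels, train_labels)
    base_rate = sum(int(y) for y in train_labels) / len(train_labels)
    scores = [rates.get(channel, base_rate) for channel in test_channels]
    return auc(scores, test_labels)
```

Passing the same data as both train and test reproduces the in-sample figure, which is the sanity check the new tests describe. The pairwise AUC loop is quadratic, which is fine for a sketch but not for large splits.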

Copilot AI review requested due to automatic review settings April 29, 2026 12:41

shaypal5 added labels `type: docs` (documentation or narrative changes), `layer: render` (render/ bundle and artifact output), `layer: mechanisms` (mechanisms/ generators and transitions), and `layer: validation` (validation/ invariants and checks) on Apr 29, 2026

Copilot started reviewing on behalf of shaypal5 April 29, 2026 12:42

Copilot AI reviewed Apr 29, 2026

shaypal5 merged commit be68208 into main Apr 29, 2026
9 checks passed

shaypal5 deleted the plan/v4-lead-scoring-dataset branch April 29, 2026 13:02

shaypal5 mentioned this pull request Apr 29, 2026

refactor: consolidate v4 planning docs + spike experiment #20

Merged

4 tasks

shaypal5 mentioned this pull request May 6, 2026

PR 4.1: channel-signal audit + release-grade dataset card #69

Merged

11 tasks

shaypal5 merged 1 commit into main from plan/v4-lead-scoring-dataset


2 participants

-| Missingness per column | Each column with nulls: 1–15% missing |
+| Missingness per column | For nullable columns, no column may exceed 20% missing unless explicitly exempted below |
+| Missingness exceptions | `days_since_last_touch`: structural missingness allowed in [5%, 35%] because leads with no touches have natural NaN; `web_sessions`: overall missingness ≤20% |
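
The revised missingness rule lends itself to a small validation check. The sketch below is one possible reading, not the project's validator: the function name, the `EXCEPTIONS` mapping, and the choice to enforce the 5% lower bound for `days_since_last_touch` are assumptions; the 20% default cap and the [5%, 35%] range come from the suggested table rows.

```python
import numpy as np
import pandas as pd

DEFAULT_MAX_MISSING = 0.20
# Assumption: an exempted column must fall inside its whole range,
# including the lower bound.
EXCEPTIONS = {"days_since_last_touch": (0.05, 0.35)}


def missingness_violations(df):
    """Return the names of columns whose missing rate is out of bounds."""
    violations = []
    for column, rate in df.isna().mean().items():
        low, high = EXCEPTIONS.get(column, (0.0, DEFAULT_MAX_MISSING))
        if not (low <= rate <= high):
            violations.append(column)
    return violations


# Toy frame: 10 rows with hand-picked missing counts.
frame = pd.DataFrame(
    {
        "days_since_last_touch": [np.nan, np.nan] + [1.0] * 8,  # 20% missing
        "web_sessions": [np.nan] * 3 + [4.0] * 7,               # 30% missing
        "employee_count": [50.0] * 10,                          # 0% missing
    }
)
violations = missingness_violations(frame)
```

Here `web_sessions` is flagged because 30% exceeds the default 20% cap, while `days_since_last_touch` at 20% sits inside its exempted range. The `employee_count` column is a made-up filler.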

-- Any column containing `_id` suffix (opaque identifiers, not features)
+- Any column name ending with `_id` (opaque identifiers, not features)
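
The difference between "containing" and "ending with" matters in practice. A minimal sketch, with hypothetical column names, of how a suffix test keeps a feature that a substring test would wrongly drop:

```python
# Hypothetical column names for illustration only.
columns = ["lead_id", "account_id", "paid_idle_days", "web_sessions"]

# Suffix test: matches only names that end with "_id".
excluded_by_suffix = [c for c in columns if c.endswith("_id")]

# Substring test: also matches "paid_idle_days", which is a feature.
excluded_by_substring = [c for c in columns if "_id" in c]

feature_columns = [c for c in columns if not c.endswith("_id")]
```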

-1. Natural (structural): `days_since_last_touch` is NaN when `total_touches == 0` (no touches recorded). Already exists but must be preserved.
+1. Natural (structural): `days_since_last_touch` is NaN when `inbound_touches + outbound_touches == 0` (no touches recorded). Already exists but must be preserved.

-| `days_since_last_touch` | Structural | Natural | NaN when `total touches == 0` by snapshot |
+| `days_since_last_touch` | Structural | Natural | NaN when `inbound_touches + outbound_touches == 0` by snapshot |
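
The structural rule can be checked row by row. The sketch below assumes the rule holds in both directions (NaN exactly when there are no touches); the text only states the "no touches implies NaN" direction, so the converse is an assumption, as is the helper name.

```python
import numpy as np
import pandas as pd


def structural_missingness_mismatches(df):
    """Rows where NaN-ness of days_since_last_touch disagrees with touch counts."""
    no_touches = (df["inbound_touches"] + df["outbound_touches"]) == 0
    is_missing = df["days_since_last_touch"].isna()
    # Assumption: the rule is treated as an "if and only if".
    return df[no_touches != is_missing]


snapshot = pd.DataFrame(
    {
        "inbound_touches": [0, 2, 0, 1],
        "outbound_touches": [0, 1, 0, 0],
        "days_since_last_touch": [np.nan, 3.0, 5.0, np.nan],
    }
)
mismatches = structural_missingness_mismatches(snapshot)
```

In the toy frame, row 2 has a value despite zero touches and row 3 is NaN despite one touch, so both are reported.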


		## Overview

		This plan implements the v4 lead scoring dataset in 4 milestones across 4–6 PRs. Each milestone produces testable artifacts and has explicit acceptance criteria.

-v4-M0 (planning PR — already done)
+v4-M0: planning (this PR)

-- Stratified subsampling to 1,000 rows at ~30% conversion rate.
-- All negatives retained (up to 700); positives downsampled.
-- Subsampling preserves within-class feature distributions.
+- Stratified subsampling produces 1,000 rows with a target mix of 700 negatives and 300 positives when the source bundle contains at least that many examples of each class.
+- Sampling policy: draw negatives uniformly at random without replacement up to 700 rows, and draw positives uniformly at random without replacement up to 300 rows.
+- If the source bundle has fewer than 300 positives, retain all available positives and fill the remaining rows with additional uniformly sampled negatives; if it has fewer than 700 negatives, retain all available negatives and fill the remaining rows with additional uniformly sampled positives.
+- Because sampling is uniform within each class, the subsample is intended to preserve within-class feature distributions up to normal sampling variation.
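
The sampling policy in the suggested lines can be sketched as follows. This is an illustration under stated assumptions rather than the build script's code: the target column name `converted`, the function name, and the seed handling are hypothetical; the 700/300 mix and the fill-from-the-other-class fallback come from the text.

```python
import pandas as pd


def stratified_subsample(df, target="converted", n_total=1000, n_pos=300, seed=42):
    """Draw up to n_pos positives and n_total - n_pos negatives without
    replacement, filling any shortfall from the other class."""
    pos = df[df[target] == 1]
    neg = df[df[target] == 0]
    take_pos = min(n_pos, len(pos))
    take_neg = min(n_total - n_pos, len(neg))

    # If one class is short, top up with extra rows from the other class.
    shortfall = n_total - take_pos - take_neg
    extra_neg = min(shortfall, len(neg) - take_neg)
    take_neg += extra_neg
    take_pos += min(shortfall - extra_neg, len(pos) - take_pos)

    sample = pd.concat(
        [
            pos.sample(n=take_pos, random_state=seed),
            neg.sample(n=take_neg, random_state=seed),
        ]
    )
    # Shuffle so the classes are not stored in contiguous blocks.
    return sample.sample(frac=1, random_state=seed).reset_index(drop=True)


# Toy source bundles: one with enough positives, one with too few.
rich = pd.DataFrame({"converted": [1] * 500 + [0] * 1500})
poor = pd.DataFrame({"converted": [1] * 200 + [0] * 1800})
rich_sample = stratified_subsample(rich)
poor_sample = stratified_subsample(poor)
```

If both classes together hold fewer than `n_total` rows, this sketch simply returns every available row; the text does not say what the build script should do in that case.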

-Engine change needed: Missingness injection in the snapshot builder, conditioned on feature values.
+Implementation note: Preserve natural structural missingness in the engine output, but inject the conditional missingness patterns in the v4 build script after snapshot generation, conditioned on feature values, to keep the snapshot builder output clean and reusable.
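
A post-snapshot injection step along these lines might look like the sketch below. It is an assumption-laden illustration: the `lead_source` column name, the per-source rates, and the `inject_missingness` helper are all hypothetical. The only points taken from the text are that injection happens after snapshot generation, is conditioned on feature values, and leaves existing structural NaNs in place.

```python
import numpy as np
import pandas as pd


def inject_missingness(df, column, by, rates, seed=0):
    """Return a copy of df with `column` set to NaN with a probability
    that depends on the value of `by`. Existing NaNs are left untouched."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    # Values of `by` without a configured rate get probability 0.
    probability = out[by].map(rates).fillna(0.0).to_numpy()
    drop = rng.random(len(out)) < probability
    out[column] = out[column].astype("float64").mask(drop)
    return out


snapshot = pd.DataFrame(
    {
        "lead_source": ["event", "web", "event", "web"],
        "web_sessions": [3, 5, 2, 7],
    }
)
# Extreme rates make the toy example deterministic; real rates would be
# fractions chosen to satisfy the missingness acceptance criteria.
masked = inject_missingness(
    snapshot, "web_sessions", by="lead_source", rates={"event": 1.0, "web": 0.0}
)
```

Working on a copy keeps the snapshot builder's output unchanged, which matches the stated goal of a clean, reusable snapshot.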
