Plan v4 lead scoring dataset + leadforge engine roadmap#19

Merged
shaypal5 merged 1 commit into main from plan/v4-lead-scoring-dataset on Apr 29, 2026

Conversation

@shaypal5

Contributor

Summary

Comprehensive planning PR for the v4 lead scoring dataset — a pedagogically improved single-CSV dataset for an intro ML course. This PR contains only documentation and planning artifacts (no code changes, all 590 tests pass).

Why v4?

v1–v3 each fixed critical issues but revealed new ones:

  • v1: label leakage via funnel_stage containing closed_won/closed_lost
  • v2: 90-day snapshot = post-mortem classification (not prediction); deterministic binary proxies (reached_sql=0 → 0% conversion)
  • v3: day-21 snapshot + clean proxies, but AUC only 0.62 due to flat category effects at intro difficulty

v4 addresses all of these and adds:

  • Value-aware ranking (expected_acv feature)
  • Temporal momentum (touches_week_1)
  • Structured missingness (conditional on lead source, not just MCAR)
  • Deliberate leakage trap (total_touches_all — full 90-day window, for classroom discussion)
  • Stronger category signal via category_effect_scale in difficulty profiles
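
The difference between a snapshot-gated feature and the deliberate leakage trap can be sketched in a few lines. This is an illustration only, not the leadforge snapshot builder: the function name, the `touch_days` input, the `touches_by_snapshot` key, and the `snapshot_day=21` default (borrowed from the v3 description above) are all assumptions. The week-1 boundary of day 7 inclusive mirrors the `_day <= 7` window mentioned in a later commit message on this page.

```python
from typing import Dict, List


def touch_features(touch_days: List[int], snapshot_day: int = 21) -> Dict[str, int]:
    """Count touches inside and outside the snapshot window.

    touch_days holds the day offset (0 = lead creation) of every touch
    over the full horizon. All names here are hypothetical.
    """
    # Only touches on or before the snapshot day are known at prediction time.
    visible = [d for d in touch_days if d <= snapshot_day]
    return {
        # Temporal momentum: early activity only (days 0 to 7 inclusive).
        "touches_week_1": sum(1 for d in visible if d <= 7),
        # Honest feature: restricted to the snapshot window.
        "touches_by_snapshot": len(visible),
        # Leakage trap: deliberately ignores the snapshot gate.
        "total_touches_all": len(touch_days),
    }


features = touch_features([0, 3, 9, 20, 45, 80])
print(features)
```

In this toy input, two touches happen after the snapshot, so `total_touches_all` exceeds the gated count. That gap is the information a model should not have at prediction time.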

Milestones

| Milestone | Scope | Status |
|---|---|---|
| v4-M0 | Requirements, contract, agent instructions | ✅ This PR |
| v4-M1 | Engine: category signal tuning + windowed snapshots | ⬜ Next |
| v4-M2 | Build pipeline: snapshot builder + validation | |
| v4-M3 | Documentation + dataset release | |

Existing roadmap items — triage

| Item | Decision | Rationale |
|---|---|---|
| M12: CLI --json/--strict flags | Deferred | No consumer needs it; low priority vs dataset |
| M14: Sample bundle generation | Absorbed into v4-M3 | v4 dataset IS the sample |
| M14: Lead-scoring baseline notebook | Deferred | v4 validation script covers this |
| M14: Notebooks 3–4 (public/instructor, recipe customization) | Discarded | No current audience |
| M15: Docs polish + v1.0 RC | Deferred | Do after v4 ships |

v4 dataset acceptance criteria

  • 1,000 rows × 18 columns (17 features + 1 target)
  • 30% conversion rate
  • No deterministic groups (n≥50 at 0% or 100%)
  • Baseline LR AUC 0.65–0.90 (without leakage trap)
  • Leakage trap boosts AUC by ≥0.03
  • Structured missingness (source-conditional, not just MCAR)
  • Reproducible with seed 42
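
Two of these criteria, the conversion rate and the "no deterministic groups" rule, are simple enough to sketch with the standard library. The code below is a hedged illustration and not the planned validation script: the row representation, the `converted` target name, and both function names are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Dict[str, object]


def conversion_rate(rows: List[Row], target: str = "converted") -> float:
    """Share of rows whose target is truthy."""
    return sum(1 for r in rows if r[target]) / len(rows)


def deterministic_groups(
    rows: List[Row], column: str, target: str = "converted", min_n: int = 50
) -> List[Tuple[object, int, float]]:
    """Return (value, n, rate) for groups with n >= min_n at exactly 0% or 100%."""
    counts: Dict[object, List[int]] = defaultdict(lambda: [0, 0])  # [n, positives]
    for r in rows:
        counts[r[column]][0] += 1
        counts[r[column]][1] += int(bool(r[target]))
    flagged = []
    for value, (n, pos) in counts.items():
        rate = pos / n
        if n >= min_n and rate in (0.0, 1.0):
            flagged.append((value, n, rate))
    return flagged


# Toy data reproducing the v2 problem: reached_sql=0 never converts.
rows = [{"reached_sql": 0, "converted": 0} for _ in range(60)]
rows += [{"reached_sql": 1, "converted": int(i < 30)} for i in range(40)]

print(conversion_rate(rows))
print(deterministic_groups(rows, "reached_sql"))
```

The toy data has an acceptable overall conversion rate but still fails the group check, which is why the two criteria are listed separately.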

Files in this PR

  • docs/v4/lead_scoring_v4_requirements.md — 7 requirements with rationale
  • docs/v4/dataset_contract.md — schema contract, temporal gates, missingness patterns
  • docs/v4/engine_changes_spec.md — what changes in the engine and where
  • docs/v4/validation_spec.md — 8 mandatory + 3 warning checks
  • docs/v4/implementation_plan.md — milestone breakdown with acceptance criteria
  • Updated CLAUDE.md — repo map, generation workflow, student_public invariants
  • Updated AGENTS.md — v4 implementation guide, coding conventions, testing
  • Updated .agent-plan.md — v4 as next work, M12–M15 triage

What this PR does NOT do

  • No code changes to the engine or any Python module
  • No implementation of v4-M1 through v4-M3 (deferred to follow-up PRs)
  • No changes to the simulation loop
  • No new dependencies

🤖 Generated with Claude Code

Add comprehensive v4 planning docs, updated agent instructions, and
revised project roadmap driven by dataset needs from v1–v3 iterations.

Docs added:
- docs/v4/lead_scoring_v4_requirements.md — 7 requirements (value features,
  temporal momentum, structured missingness, leakage trap, redundancy fix,
  stronger category signal, robust validation)
- docs/v4/dataset_contract.md — schema contract, temporal gates, missingness
  patterns, subsampling rules
- docs/v4/engine_changes_spec.md — category effect scaling, windowed snapshot
  builder, structured missingness, leakage trap feature
- docs/v4/validation_spec.md — 8 mandatory checks + 3 warning checks
- docs/v4/implementation_plan.md — 4 milestones (M0–M3) with acceptance
  criteria and explicit mapping from existing roadmap items

Updated:
- CLAUDE.md — added repo map, generation workflow, student_public invariants,
  feature addition guide, v4 plan pointers
- AGENTS.md — added v4 implementation guide, coding conventions, validation
  checklist, local testing commands
- .agent-plan.md — v4 milestones as next work; M12–M15 items explicitly
  triaged (deferred/absorbed/discarded)

No code changes. All 590 existing tests pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 29, 2026 12:41
@shaypal5 shaypal5 added labels on Apr 29, 2026: type: docs (Documentation or narrative changes); layer: render (render/ bundle and artifact output); layer: mechanisms (mechanisms/ generators and transitions); layer: validation (validation/ invariants and checks)
@github-actions

pr-agent-context report:

No unresolved review comments, failing checks, or actionable patch coverage gaps were found on PR #19. Treat this PR as all clear unless new signals appear.

Run metadata:

Tool ref: v4
Tool version: 4.0.20
Trigger: pull request opened
Workflow run: 25109426149 attempt 1
Comment timestamp: 2026-04-29T12:41:56.785853+00:00
PR head commit: 1f45df09976b1cf10e5cca6a0a366a3270ff4364

Copilot AI left a comment

Pull request overview

Planning and documentation for the v4 lead scoring “intro” dataset, including requirements, dataset contract, validation spec, and an engine/build roadmap to address leakage, temporal framing, and pedagogical signal strength.

Changes:

  • Added v4 planning docs: requirements, dataset contract, engine changes spec, validation spec, and milestone implementation plan.
  • Updated contributor-facing guidance (CLAUDE.md, AGENTS.md) with repo map, workflow/commands, and v4 implementation checklist.
  • Updated .agent-plan.md to prioritize v4 and triage older roadmap items.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 18 comments.

| File | Description |
|---|---|
| docs/v4/validation_spec.md | Defines mandatory + warning validation checks for the v4 single-CSV dataset. |
| docs/v4/lead_scoring_v4_requirements.md | Captures v4 goals, rationale from v1–v3, and the proposed v4 column set + acceptance criteria. |
| docs/v4/implementation_plan.md | Milestone breakdown (M0–M3) and deliverables/acceptance criteria for v4 delivery. |
| docs/v4/engine_changes_spec.md | Specifies intended engine changes (difficulty/category signal tuning, snapshot windowing) and build-script responsibilities. |
| docs/v4/dataset_contract.md | Defines snapshot/target temporal gates, missingness expectations, subsampling, and reproducibility contract. |
| CLAUDE.md | Adds repo map + generation/validation workflow documentation and links to v4 docs. |
| AGENTS.md | Adds v4 implementation guidance, conventions, and a validation checklist. |
| .agent-plan.md | Reorients “Next Up” to v4 milestones and defers/triages older roadmap items. |

| Conversion rate | In [15%, 40%] |
| Baseline LR AUC | In [0.65, 0.90] (all features except leakage trap) |
| Leakage trap AUC boost | AUC with trap > AUC without trap by ≥0.03 |
| Missingness per column | Each column with nulls: 1–15% missing |

Copilot AI Apr 29, 2026

R7 specifies “Each column with nulls: 1–15% missing”, but validation_spec.md only enforces “no column > 20% missing” and dataset_contract.md includes structural missingness for days_since_last_touch (which may exceed 15% depending on touch sparsity). Consider relaxing/aligning this requirement (e.g., per-column max only, or explicit bounds per affected column).

Suggested change

Before:
| Missingness per column | Each column with nulls: 1–15% missing |

After:
| Missingness per column | For nullable columns, no column may exceed 20% missing unless explicitly exempted below |
| Missingness exceptions | `days_since_last_touch`: structural missingness allowed in [5%, 35%] because leads with no touches have natural NaN; `web_sessions`: overall missingness ≤20% |
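
A check along the lines of the suggested wording could look like the sketch below. It is an assumption-laden illustration, not the contents of `validation_spec.md`: the column-as-list representation and the function names are hypothetical, and the bounds are copied from the suggestion above.

```python
import math
from typing import Dict, List, Tuple

# Default cap and per-column override, taken from the suggested wording.
DEFAULT_MAX = 0.20
EXCEPTIONS: Dict[str, Tuple[float, float]] = {
    "days_since_last_touch": (0.05, 0.35),
}


def is_missing(value: object) -> bool:
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def missingness_violations(columns: Dict[str, List[object]]) -> Dict[str, float]:
    """Map each out-of-bounds column to its observed missing rate."""
    violations = {}
    for name, values in columns.items():
        rate = sum(is_missing(v) for v in values) / len(values)
        low, high = EXCEPTIONS.get(name, (0.0, DEFAULT_MAX))
        if not low <= rate <= high:
            violations[name] = rate
    return violations


columns = {
    "days_since_last_touch": [None] * 3 + [1.0] * 7,   # 30%: inside the exemption
    "web_sessions": [float("nan")] * 3 + [2.0] * 7,    # 30%: above the 20% cap
    "seniority": ["vp"] * 10,                          # 0%: fine
}
print(missingness_violations(columns))
```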

Copilot uses AI. Check for mistakes.
Comment on lines +81 to +88
| `days_since_first_touch` | `snapshot_day - first_touch_day` (NaN if no touches) | Lead age signal |
| `expected_acv` | Opportunity ACV if opp created by snapshot; else employee_band midpoint | Value feature |
| `total_touches_all` | Count of ALL touches over full horizon (ignoring snapshot gate) | Leakage trap |

### Files affected

- `leadforge/render/snapshots.py` — add `snapshot_day` parameter, windowed filtering, new feature computations
- `leadforge/schema/features.py` — add new `FeatureSpec` entries for the new columns

Copilot AI Apr 29, 2026

The “New features to compute” list includes days_since_first_touch, but docs/v4/lead_scoring_v4_requirements.md’s target column set does not include this feature (and fixes the dataset at 18 columns). Please reconcile whether days_since_first_touch is part of v4 (and update the schema/validation) or remove it here to keep the specs aligned.

Suggested change

Before:
| `days_since_first_touch` | `snapshot_day - first_touch_day` (NaN if no touches) | Lead age signal |
| `expected_acv` | Opportunity ACV if opp created by snapshot; else employee_band midpoint | Value feature |
| `total_touches_all` | Count of ALL touches over full horizon (ignoring snapshot gate) | Leakage trap |
### Files affected
- `leadforge/render/snapshots.py` — add `snapshot_day` parameter, windowed filtering, new feature computations
- `leadforge/schema/features.py` — add new `FeatureSpec` entries for the new columns

After:
| `expected_acv` | Opportunity ACV if opp created by snapshot; else employee_band midpoint | Value feature |
| `total_touches_all` | Count of ALL touches over full horizon (ignoring snapshot gate) | Leakage trap |
### Files affected
- `leadforge/render/snapshots.py` — add `snapshot_day` parameter, windowed filtering, new feature computations
- `leadforge/schema/features.py` — add new `FeatureSpec` entries for the added v4 columns

- `is_sql` — engine invariant creates deterministic groups
- `is_mql` — zero variance
- `lead_created_at` — timestamp that could be used to reverse-engineer temporal info
- Any column containing `_id` suffix (opaque identifiers, not features)

Copilot AI Apr 29, 2026

“Any column containing _id suffix” is ambiguous (contains vs endswith). If the intent is to ban opaque identifiers like lead_id/account_id, consider clarifying to “any column name ending with _id” to avoid mis-implementing the check.

Suggested change

Before:
- Any column containing `_id` suffix (opaque identifiers, not features)

After:
- Any column name ending with `_id` (opaque identifiers, not features)
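
The two readings give different results on some column names, which a short sketch makes concrete. The function name and the `bid_ideal_flag` column are invented for this illustration.

```python
from typing import List


def banned_id_columns(columns: List[str]) -> List[str]:
    """Columns treated as opaque identifiers: names ending with `_id`."""
    return [c for c in columns if c.endswith("_id")]


columns = ["lead_id", "account_id", "bid_ideal_flag", "industry"]

# "Contains" reading: also catches names that merely include the substring.
contains_match = [c for c in columns if "_id" in c]
# "Ends with" reading: only true identifier-style names.
suffix_match = banned_id_columns(columns)

print(contains_match)
print(suffix_match)
```

Under the "contains" reading, a legitimate feature such as the hypothetical `bid_ideal_flag` would be banned by accident. The suffix check avoids that.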

**Problem:** v1–v3 inject missingness randomly (MCAR). Real CRM data has structured gaps.

**Requirement:** Implement three missingness patterns:
1. **Natural (structural):** `days_since_last_touch` is NaN when `total_touches == 0` (no touches recorded). Already exists but must be preserved.

Copilot AI Apr 29, 2026

Structured missingness refers to total_touches == 0, but R5 requires dropping total_touches from v4. To avoid confusion, define the condition in terms of retained columns (e.g., inbound_touches + outbound_touches == 0) or explicitly note that total_touches is computed internally and not included in the final CSV.

Suggested change

Before:
1. **Natural (structural):** `days_since_last_touch` is NaN when `total_touches == 0` (no touches recorded). Already exists but must be preserved.

After:
1. **Natural (structural):** `days_since_last_touch` is NaN when `inbound_touches + outbound_touches == 0` (no touches recorded). Already exists but must be preserved.
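
Expressed over the retained columns, the structural rule might be sketched as follows. The function signature is hypothetical, and the `snapshot_day - last_touch_day` formula is an assumption modelled on the `days_since_first_touch` definition quoted in another thread of this review.

```python
import math
from typing import Optional


def days_since_last_touch(
    last_touch_day: Optional[int],
    inbound_touches: int,
    outbound_touches: int,
    snapshot_day: int,
) -> float:
    """Return NaN when the lead has no touches by the snapshot."""
    if inbound_touches + outbound_touches == 0 or last_touch_day is None:
        return math.nan  # structural missingness, not injected noise
    return float(snapshot_day - last_touch_day)


print(days_since_last_touch(18, 2, 1, snapshot_day=21))
print(days_since_last_touch(None, 0, 0, snapshot_day=21))
```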

| Column | Pattern | Rate | Condition |
|---|---|---|---|
| `days_since_last_touch` | Structural | Natural | NaN when `total touches == 0` by snapshot |

Copilot AI Apr 29, 2026

The missingness contract uses total touches == 0 as the condition for structural NaNs in days_since_last_touch, but v4 drops total_touches from the final CSV. It would be clearer to express the condition using columns that remain in v4 (e.g., inbound_touches + outbound_touches == 0) or explicitly state that the condition is evaluated pre-drop during build.

Suggested change

Before:
| `days_since_last_touch` | Structural | Natural | NaN when `total touches == 0` by snapshot |

After:
| `days_since_last_touch` | Structural | Natural | NaN when `inbound_touches + outbound_touches == 0` by snapshot |

## Overview

This plan implements the v4 lead scoring dataset in 4 milestones across 4–6 PRs. Each milestone produces testable artifacts and has explicit acceptance criteria.

Copilot AI Apr 29, 2026

This plan says v4 is delivered across “4–6 PRs”, but later the timeline section states “The work is 4 PRs”. Please make the expected PR count consistent so the milestone planning is unambiguous.

Suggested change

Before:
This plan implements the v4 lead scoring dataset in 4 milestones across 4–6 PRs. Each milestone produces testable artifacts and has explicit acceptance criteria.

After:
This plan implements the v4 lead scoring dataset in 4 milestones across 4 PRs. Each milestone produces testable artifacts and has explicit acceptance criteria.

Comment thread AGENTS.md
Comment on lines +56 to +61
- `scripts/build_v4_snapshot.py` (new) — snapshot builder with missingness + leakage trap
- `scripts/validate_v4_dataset.py` (new) — dataset-level validation
- These live in the leadforge repo (not datasets-private)

**v4-M3 (release):**
- Work in `leadforge-datasets-private` repo

Copilot AI Apr 29, 2026

This section states the v4 build/validation scripts live in the leadforge repo (not datasets-private), but CLAUDE.md’s “Related repos” + workflow example place these scripts in leadforge-datasets-private. Please align the docs on a single canonical home (or document how scripts are shared) to avoid contributors building in the wrong repo.

Suggested change

Before:
- `scripts/build_v4_snapshot.py` (new) — snapshot builder with missingness + leakage trap
- `scripts/validate_v4_dataset.py` (new) — dataset-level validation
- These live in the leadforge repo (not datasets-private)
**v4-M3 (release):**
- Work in `leadforge-datasets-private` repo

After:
- Work in `leadforge-datasets-private` repo
- `scripts/build_v4_snapshot.py` (new) — snapshot builder with missingness + leakage trap
- `scripts/validate_v4_dataset.py` (new) — dataset-level validation
**v4-M3 (release):**
- Continue in `leadforge-datasets-private` repo for dataset generation and release artifacts

Comment thread AGENTS.md
### Implementation order

```
v4-M0 (planning PR — already done)

Copilot AI Apr 29, 2026

The implementation order diagram says “v4-M0 (planning PR — already done)”, but in this PR v4-M0 is the work being proposed/merged. Consider rephrasing to something time-invariant (e.g., “v4-M0: planning (this PR)”) to avoid the docs disagreeing with .agent-plan.md.

Suggested change

Before:
v4-M0 (planning PR — already done)

After:
v4-M0: planning (this PR)

Comment on lines +64 to +66
- Stratified subsampling to 1,000 rows at ~30% conversion rate.
- All negatives retained (up to 700); positives downsampled.
- Subsampling preserves within-class feature distributions.

Copilot AI Apr 29, 2026

“All negatives retained (up to 700); positives downsampled.” is a bit self-contradictory for a 1,000-row stratified sample (you can’t retain all negatives from a 5,000-lead bundle if there are more than 700). Consider rewording to specify the intended sampling policy (e.g., fixed 700 negatives sampled uniformly/stratified, and 300 positives downsampled as needed) and what happens if the source bundle’s base rate is below 30%.

Suggested change

Before:
- Stratified subsampling to 1,000 rows at ~30% conversion rate.
- All negatives retained (up to 700); positives downsampled.
- Subsampling preserves within-class feature distributions.

After:
- Stratified subsampling produces 1,000 rows with a target mix of 700 negatives and 300 positives when the source bundle contains at least that many examples of each class.
- Sampling policy: draw negatives uniformly at random without replacement up to 700 rows, and draw positives uniformly at random without replacement up to 300 rows.
- If the source bundle has fewer than 300 positives, retain all available positives and fill the remaining rows with additional uniformly sampled negatives; if it has fewer than 700 negatives, retain all available negatives and fill the remaining rows with additional uniformly sampled positives.
- Because sampling is uniform within each class, the subsample is intended to preserve within-class feature distributions up to normal sampling variation.
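
The sampling policy in the suggestion can be sketched with the standard library alone. This is an illustration under assumptions: rows are represented by integer ids, the function name and parameters are hypothetical, and the shortfall-fill rule is one reading of the suggested wording rather than the project's build script.

```python
import random
from typing import List, Sequence, Tuple


def subsample(
    positives: Sequence[int],
    negatives: Sequence[int],
    n_rows: int = 1000,
    target_rate: float = 0.30,
    seed: int = 42,
) -> Tuple[List[int], List[int]]:
    """Draw row ids per class without replacement, filling any shortfall."""
    rng = random.Random(seed)  # local RNG keeps the draw reproducible
    want_pos = round(n_rows * target_rate)
    want_neg = n_rows - want_pos
    n_pos = min(want_pos, len(positives))
    n_neg = min(want_neg, len(negatives))
    # Rows the scarce class could not supply are requested from the other class.
    pos_shortfall = want_pos - n_pos
    neg_shortfall = want_neg - n_neg
    n_neg = min(len(negatives), n_neg + pos_shortfall)
    n_pos = min(len(positives), n_pos + neg_shortfall)
    return rng.sample(list(positives), n_pos), rng.sample(list(negatives), n_neg)


# Source bundle with a base rate below 30%: 200 positives, 4,800 negatives.
pos, neg = subsample(range(200), range(1000, 5800))
print(len(pos), len(neg))
```

With only 200 positives available, the sketch keeps all of them and draws 800 negatives, so the file still has 1,000 rows but a 20% conversion rate. Whether that outcome should pass validation is exactly the question the comment raises.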

2. **Conditional on source:** `web_sessions` is missing for ~15% of `sdr_outbound` leads (CRM tracking often not set up for outbound-sourced leads) but only ~2% of `inbound_marketing` leads.
3. **Role data gap:** `seniority` is missing for ~8% of `partner_referral` leads (referral partners don't always provide full contact details).

**Engine change needed:** Missingness injection in the snapshot builder, conditioned on feature values.

Copilot AI Apr 29, 2026

R3 says missingness injection belongs in the snapshot builder (engine), but docs/v4/engine_changes_spec.md specifies missingness should be injected in the v4 build script (post-snapshot) to keep engine output clean. Please align R3 with the chosen design so implementation guidance is consistent.

Suggested change

Before:
**Engine change needed:** Missingness injection in the snapshot builder, conditioned on feature values.

After:
**Implementation note:** Preserve natural structural missingness in the engine output, but inject the conditional missingness patterns in the v4 build script **after** snapshot generation, conditioned on feature values, to keep the snapshot builder output clean and reusable.
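
A post-snapshot injection step of the kind described could be sketched as below. The rates come from R3 as quoted above. Everything else is assumed for illustration: the `lead_source` column name, the dict-per-row representation, and the function name are not taken from the v4 build script.

```python
import random
from typing import Dict, List

# (column, lead_source) -> probability of blanking the value; rates from R3.
MISSING_RATES = {
    ("web_sessions", "sdr_outbound"): 0.15,
    ("web_sessions", "inbound_marketing"): 0.02,
    ("seniority", "partner_referral"): 0.08,
}


def inject_missingness(
    rows: List[Dict[str, object]], seed: int = 42
) -> List[Dict[str, object]]:
    """Return copies of rows with source-conditional values set to None."""
    rng = random.Random(seed)
    out = []
    for row in rows:
        new_row = dict(row)  # leave the snapshot output untouched
        for (column, source), rate in MISSING_RATES.items():
            if row["lead_source"] == source and rng.random() < rate:
                new_row[column] = None
        out.append(new_row)
    return out


snapshot = [
    {"lead_source": src, "web_sessions": 3, "seniority": "vp"}
    for src in ["sdr_outbound"] * 2000 + ["inbound_marketing"] * 2000
]
injected = inject_missingness(snapshot)


def missing_rate(rows, source, column):
    subset = [r for r in rows if r["lead_source"] == source]
    return sum(r[column] is None for r in subset) / len(subset)


print(missing_rate(injected, "sdr_outbound", "web_sessions"))
print(missing_rate(injected, "inbound_marketing", "web_sessions"))
```

Because the function copies each row, the engine's snapshot stays clean and the same snapshot can be reused with a different missingness configuration.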

@shaypal5 shaypal5 merged commit be68208 into main Apr 29, 2026
9 checks passed
@shaypal5 shaypal5 deleted the plan/v4-lead-scoring-dataset branch April 29, 2026 13:02
shaypal5 added a commit that referenced this pull request Apr 29, 2026
* refactor: consolidate v4 planning docs + spike experiment

Self-review of PR #19 identified 8 issues. This commit addresses all of them:

1. Ran spike experiment validating category signal approach — the spec's
   CategoricalInfluence scaling was wrong (not wired into simulation).
   Correct approach: correlate observables with latent traits in population.py.
   Scale 1.8 gives AUC 0.694, within [0.65, 0.90] target.

2. Consolidated 5 overlapping spec docs into 2: design.md (single source of
   truth for requirements, contract, engine changes, plan) + validation_spec.md.

3. Added "Known limitations" section (is_sql invariant, role_function gap).

4. Added missingness rationale (detectability at n=1000, not arbitrary).

5. Added tuning protocol decision table for when validation checks fail.

6. Trimmed AGENTS.md to durable conventions + pointer to docs/v4/.

7. Added explicit ACV band→midpoint mapping table and null-band behavior.

8. Merged v4-M1 and v4-M2 into single milestone (engine + build pipeline
   can't be validated independently).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* style: fix ruff formatting in spike script

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: bump ruff pre-commit hook v0.4.5 → v0.11.13

Aligns the local pre-commit hook with CI's ruff version (unpinned,
currently 0.11.x). The old v0.4.5 hook accepted formatting that
the CI ruff rejects, allowing format violations to slip through.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
shaypal5 added a commit that referenced this pull request May 6, 2026
…re dictionary (PR 4.1 deliverable 3-5)

* docs/release/generation_method.md (new) — standalone DGP summary for
  external readers. Reads alone, references the architecture spec.
  Covers the five generation layers (motif families → mechanism layer
  → population → simulation engine → snapshot rendering), the public-
  vs-instructor split, calibration / validation, and the explicit
  "what this is not" boundary.

* docs/release/feature_dictionary.md (new) — narrative companion to
  the per-bundle feature_dictionary.csv. Groups the 32 public columns
  by analytical role (lead identity / firmographics / personographics /
  engagement / funnel / value) plus the deliberate trap and the target.
  Documents difficulty modulation parameters, modelling defaults, and
  pedagogical caveats. Satisfies G10.3.

* release/README.md (substantial rewrite) — release-grade dataset card
  per Datasheets-for-Datasets / Data Cards Playbook checklist (G10.1):
  - macro framing paragraph (2024–2026 SaaS context, recommendation #19)
  - simulation simplifications section (chatgpt v2 §2.6 — modelled /
    approximate / not modelled)
  - calibration documentation linking to validation_report.md
  - public-vs-instructor redaction policy with concrete column lists
    citing BANNED_LEAD_COLUMNS / BANNED_OPP_COLUMNS / BANNED_TABLES /
    SNAPSHOT_FILTERED_TABLES from leakage_probes.py
  - intended use vs out-of-scope use
  - known limitations including the G7.4.4 GBM-vs-LR finding and the
    weak channel signal from the Phase 4 audit
  - composition section (entities / features / label / splits /
    provenance) per Datasheets format
  - adversarial-framing pointer (placeholder link to break-me guide
    that lands in PR 6.3)
  - maintenance plan

All claims about realism, calibration, or difficulty are anchored to
release/validation/validation_report.md per G10.6.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
shaypal5 added a commit that referenced this pull request May 6, 2026
* feat(scripts,docs): channel-signal audit (PR 4.1 deliverable 1+2)

scripts/audit_channel_signal.py audits how strongly source channel
signals conversion across the release tier family. For each tier we
compute per-channel conversion rates and the univariate AUC of channel
against converted_within_90_days (scored as the empirical positive rate
per channel — a 1-D Bayes classifier equivalent to a saturated logistic
regression on one-hot channel features). Outputs JSON + Markdown to
docs/release/channel_signal_audit.{json,md}.

Tests guard determinism against the committed release/ bundles (a
double-run produces byte-identical output) plus per-channel rollup,
univariate AUC closed-form, single-class fallback, error paths, and the
CLI wiring.

The audit confirms what the v1 DGP predicts: channel signal in v1 is
weak — across all three tiers the largest per-channel rate spread is
0.043 and the largest univariate AUC is 0.521, well below the G2 /
Gemini v2 industry MQL→SQL band (SEO ~51% vs Email <1%). v1 drives
conversion through motif-family hazards keyed off latent traits, not
channel-conditional probabilities; channel-conditional encoding is
tracked as post-v1 work in docs/release/post_v1_roadmap.md.

Roadmap: docs/release/v1_release_roadmap.md §"Phase 4 — PR 4.1".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(release): release-grade dataset card + generation method + feature dictionary (PR 4.1 deliverable 3-5)

* docs/release/generation_method.md (new) — standalone DGP summary for
  external readers. Reads alone, references the architecture spec.
  Covers the five generation layers (motif families → mechanism layer
  → population → simulation engine → snapshot rendering), the public-
  vs-instructor split, calibration / validation, and the explicit
  "what this is not" boundary.

* docs/release/feature_dictionary.md (new) — narrative companion to
  the per-bundle feature_dictionary.csv. Groups the 32 public columns
  by analytical role (lead identity / firmographics / personographics /
  engagement / funnel / value) plus the deliberate trap and the target.
  Documents difficulty modulation parameters, modelling defaults, and
  pedagogical caveats. Satisfies G10.3.

* release/README.md (substantial rewrite) — release-grade dataset card
  per Datasheets-for-Datasets / Data Cards Playbook checklist (G10.1):
  - macro framing paragraph (2024–2026 SaaS context, recommendation #19)
  - simulation simplifications section (chatgpt v2 §2.6 — modelled /
    approximate / not modelled)
  - calibration documentation linking to validation_report.md
  - public-vs-instructor redaction policy with concrete column lists
    citing BANNED_LEAD_COLUMNS / BANNED_OPP_COLUMNS / BANNED_TABLES /
    SNAPSHOT_FILTERED_TABLES from leakage_probes.py
  - intended use vs out-of-scope use
  - known limitations including the G7.4.4 GBM-vs-LR finding and the
    weak channel signal from the Phase 4 audit
  - composition section (entities / features / label / splits /
    provenance) per Datasheets format
  - adversarial-framing pointer (placeholder link to break-me guide
    that lands in PR 6.3)
  - maintenance plan

All claims about realism, calibration, or difficulty are anchored to
release/validation/validation_report.md per G10.6.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(plan): mark Phase 4 PR 4.1 complete in .agent-plan.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(release): cover lead_source / first_touch_channel in feature dictionary

Self-review caught a gap: the prior commit grouped 30 of 32 public
columns; lead_source and first_touch_channel were referenced in the
"recommended modelling defaults" checklist but did not appear in any
category table. Adds a "Lead source & channel" subsection that
describes both columns, calls out that they're identical in v1, and
cross-references the channel-signal audit so readers don't expect
top-tier feature importance from these columns. Updates the summary
table to reflect 32 documented columns. Also corrects two minor wording
issues (firmographics "Six" → "Five", personographics "all four" →
"all three", and a typo "bandage" → "discretisation").

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(scripts): channel audit — out-of-sample AUC, no verdict bands, group identical columns

Self-review of the previous PR-4.1 commit surfaced four problems with
audit_channel_signal.py:

* The univariate AUC was computed in-sample (train rates → train labels),
  guaranteed >= 0.5 by construction and not directly comparable to the
  source_only baselines in release/validation/validation_report.json.
* The "weak / moderate / strong" verdict made a hard comparison between
  v1's 90-day closed-won label and the G2 / Gemini v2 industry MQL→SQL
  benchmark band.  The two metrics measure different funnel transitions;
  the comparison was a category error.
* The verdict prose hard-coded a "50 percentage points" claim and a
  specific architectural narrative ("v1 drives conversion through
  motif-family hazards") inside the script — both would silently drift
  from the data and the codebase over time.
* lead_source and first_touch_channel produce byte-identical audits in
  v1 yet were rendered as two parallel tables per tier.

Fixes:

* audit_channel now takes both train and test DataFrames and returns
  univariate_auc_in_sample (the historical 1-D Bayes interpretation,
  retained for transparency) plus univariate_auc_out_of_sample (train
  rates scored against held-out test labels).  The OOS numbers
  reproduce the source_only HistGBM baselines in validation_report.json
  for seed 42 cell-for-cell (intro 0.5014, intermediate 0.5139,
  advanced 0.5226).
* Verdict bands and the _classify_signal / _verdict_paragraph helpers
  are gone.  The markdown report now ends with a Discussion section
  written by hand around the actual numbers, with an explicit caveat
  that the industry benchmarks measure MQL→SQL (not 90-day closed-won)
  and are reproduced for context only.
* INDUSTRY_MQL_TO_SQL_BENCHMARKS is now a tuple of pairs (genuinely
  immutable; matches dataclass(frozen=True) semantics).  report_to_dict
  converts it back to a {name: rate} dict for the JSON output.
* render_markdown groups channel columns whose audits are
  byte-identical into one section with a header listing all columns
  ("Columns: lead_source, first_touch_channel (audit values
  identical)").  The JSON keeps per-column entries.
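
The out-of-sample univariate AUC described in the first fix can be sketched without any third-party dependency. This is an illustration, not the code in audit_channel_signal.py: the function names are hypothetical, the AUC is the plain pairwise (Mann-Whitney) form, and returning NaN for a single-class split is an assumption.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def channel_rates(channels: Sequence[str], labels: Sequence[int]) -> Dict[str, float]:
    """Empirical positive rate per channel."""
    totals: Dict[str, List[int]] = defaultdict(lambda: [0, 0])  # [n, positives]
    for c, y in zip(channels, labels):
        totals[c][0] += 1
        totals[c][1] += y
    return {c: pos / n for c, (n, pos) in totals.items()}


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pairwise AUC: P(score_pos > score_neg) + 0.5 * P(tie)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return float("nan")  # single-class split: AUC undefined (assumed fallback)
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def oos_auc(
    train_ch: Sequence[str],
    train_y: Sequence[int],
    test_ch: Sequence[str],
    test_y: Sequence[int],
) -> float:
    """Score test rows with train-set channel rates, then compute AUC."""
    rates = channel_rates(train_ch, train_y)
    base = sum(train_y) / len(train_y)  # fallback for channels unseen on train
    scores = [rates.get(c, base) for c in test_ch]
    return auc(scores, test_y)


train_ch, train_y = ["seo", "seo", "email", "email"], [1, 1, 0, 1]
test_ch, test_y = ["seo", "email", "partner", "email"], [1, 0, 1, 0]
print(oos_auc(train_ch, train_y, test_ch, test_y))
print(oos_auc(train_ch, train_y, train_ch, train_y))  # equals the in-sample AUC
```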

New tests in tests/scripts/test_audit_channel_signal.py:

* OOS AUC == in-sample AUC when test=train (sanity check)
* OOS AUC stays well-defined when the test split contains channels
  unseen on train (train-base-rate fallback)
* render_markdown collapses two identical columns into one section
  AND keeps two distinct columns in two sections
* test_lead_source_equals_first_touch_channel_in_v1 (parametrized
  over intro/intermediate/advanced) — locks the feature-dictionary
  claim that the two channel columns are identical in v1.  If the
  simulator ever diverges them, the doc must be updated.
* test_committed_audit_artifacts_match_fresh_regeneration — re-runs
  the audit against the committed bundles and asserts byte-equality
  with the committed docs/release/channel_signal_audit.{md,json}.
  CI gate against bundles regenerated without re-running the audit.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(release): self-review fixes — README trim, citations, feature-dict consistency

* release/README.md (~434 → ~228 lines): trimmed to a release-grade
  landing card.  The full DGP, motif families, simulation
  simplifications, and module map move to docs/release/generation_method.md
  (linked).  Macro-framing claim now cites
  docs/external_review/summaries/gemini_v2_summary.md as the source of
  the 30%→25% growth and CAC-ratio numbers (previously presented as if
  primary research).  Composition + maintenance sections compressed
  into the table at the bottom.
* docs/release/generation_method.md: dropped the "Where the code
  lives" module table.  This doc is for external readers; module
  paths belong in the developer-facing design doc and architecture
  spec.  Ends with a single short pointer to those.
* docs/release/feature_dictionary.md: fixed a factually wrong claim
  about the leakage trap (the per-bundle CSV has columns
  ``name,dtype,description,category,is_target,leakage_risk`` — there
  is no ``is_leakage_trap`` column).  Reworded the modelling-default
  checklist to acknowledge that the flat ``lead_scoring.csv`` and the
  Parquet task splits ship every column listed in the dictionary
  including the IDs — the recommendation says what to use as features,
  not what's in the file.  Also notes that ``lead_source`` and
  ``first_touch_channel`` carry identical values in v1 (locked by the
  new test), so picking one is fine.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(scripts,docs): address Copilot review threads on PR 4.1

Six fixes from the Copilot reviews on PR #69:

* scripts/audit_channel_signal.py — _label_to_int now uses
  pd.api.types.is_bool_dtype() so it explicitly handles pandas
  nullable BooleanDtype (the actual parquet dtype on the v1 bundles)
  alongside numpy bool.  Previously it worked via a coincidental
  pd.to_numeric fallback, with a comment that misled future readers.
* scripts/audit_channel_signal.py — render_markdown now takes both
  md_path and json_path and emits the JSON link as a relative path
  to the markdown's directory, so a `--out-md`/`--out-json` override
  produces a markdown report whose link target is correct.  Defaults
  to the canonical "channel_signal_audit.json" basename when called
  without paths (the unit-test path).
* scripts/audit_channel_signal.py — main() pins encoding="utf-8" on
  both write_text() calls so the audit output is byte-identical
  across operating systems and locale configurations.
* scripts/audit_channel_signal.py — Discussion section is no longer
  bundle-specific.  The previous prose claimed "for seed 42 the OOS
  numbers below match the report cell-for-cell" — true for the
  committed bundle but wrong for any other --release-dir.  The new
  prose talks about which AUC is comparable and what conclusion the
  numbers in the per-tier sections support, both bundle-agnostic.
* release/README.md — fixed the relational-feature-engineering
  Quick start example.  The previous snippet did
  `leads.merge(touch_counts, on="lead_id")` where touch_counts was
  a Series with lead_id in its index, not as a column — would error
  in modern pandas.  The new snippet uses .reset_index() and merges
  the resulting DataFrame.
* docs/release/feature_dictionary.md — touches_week_1 documented as
  "days 0–7 inclusive" (8 day values) and touches_last_7_days
  qualified with "for snapshot_day=30, days 24–30 inclusive".
  Previously claimed "days 0–6" for week_1, which mismatched the
  snapshot builder's _day <= 7 window.
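
The relative-link and encoding fixes above can be sketched with `os.path.relpath` and `Path.write_text`. The function names and the link text are hypothetical; the actual render_markdown and main() signatures are not shown in this commit message.

```python
import os
from pathlib import Path


def relative_json_link(md_path: Path, json_path: Path) -> str:
    """Link target for json_path as seen from the markdown file's directory."""
    return Path(os.path.relpath(json_path, start=md_path.parent)).as_posix()


def write_report(md_path: Path, json_path: Path, body: str) -> None:
    """Write the markdown report with an explicit encoding."""
    link = relative_json_link(md_path, json_path)
    # Pinning UTF-8 avoids locale-dependent bytes in the output file.
    md_path.write_text(f"{body}\n\n[Raw JSON]({link})\n", encoding="utf-8")


same_dir = relative_json_link(
    Path("docs/release/channel_signal_audit.md"),
    Path("docs/release/channel_signal_audit.json"),
)
other_dir = relative_json_link(Path("out/md/audit.md"), Path("out/json/audit.json"))
print(same_dir)
print(other_dir)
```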

Test changes:

* test_release_audit_is_deterministic now writes both runs to the
  same path (back-to-back overwrite) instead of distinct tmp paths,
  so the relative-link rendering doesn't make the two outputs differ.
* test_committed_audit_artifacts_match_fresh_regeneration uses the
  canonical "channel_signal_audit.{md,json}" basenames in tmp_path,
  so the relative link in the regenerated markdown matches the
  committed file's link.

Two stale Copilot threads (firmographics "Six columns" and "bandage"
typo) were already addressed in commit f6b274e during the first
self-review pass.

1175/1175 tests pass; ruff + mypy clean; the regenerated audit
artifacts are byte-identical via the canonical-path mode.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Labels

layer: mechanisms (mechanisms/ generators and transitions); layer: render (render/ bundle and artifact output); layer: validation (validation/ invariants and checks); type: docs (Documentation or narrative changes)

2 participants