Plan v4 lead scoring dataset + leadforge engine roadmap #19
Add comprehensive v4 planning docs, updated agent instructions, and revised project roadmap driven by dataset needs from v1–v3 iterations.

Docs added:
- docs/v4/lead_scoring_v4_requirements.md — 7 requirements (value features, temporal momentum, structured missingness, leakage trap, redundancy fix, stronger category signal, robust validation)
- docs/v4/dataset_contract.md — schema contract, temporal gates, missingness patterns, subsampling rules
- docs/v4/engine_changes_spec.md — category effect scaling, windowed snapshot builder, structured missingness, leakage trap feature
- docs/v4/validation_spec.md — 8 mandatory checks + 3 warning checks
- docs/v4/implementation_plan.md — 4 milestones (M0–M3) with acceptance criteria and explicit mapping from existing roadmap items

Updated:
- CLAUDE.md — added repo map, generation workflow, student_public invariants, feature addition guide, v4 plan pointers
- AGENTS.md — added v4 implementation guide, coding conventions, validation checklist, local testing commands
- .agent-plan.md — v4 milestones as next work; M12–M15 items explicitly triaged (deferred/absorbed/discarded)

No code changes. All 590 existing tests pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pr-agent-context report: No unresolved review comments, failing checks, or actionable patch coverage gaps were found on PR #19. Treat this PR as all clear unless new signals appear.
Pull request overview
Planning and documentation for the v4 lead scoring “intro” dataset, including requirements, dataset contract, validation spec, and an engine/build roadmap to address leakage, temporal framing, and pedagogical signal strength.
Changes:
- Added v4 planning docs: requirements, dataset contract, engine changes spec, validation spec, and milestone implementation plan.
- Updated contributor-facing guidance (CLAUDE.md, AGENTS.md) with repo map, workflow/commands, and v4 implementation checklist.
- Updated .agent-plan.md to prioritize v4 and triage older roadmap items.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 18 comments.
Show a summary per file
| File | Description |
|---|---|
| docs/v4/validation_spec.md | Defines mandatory + warning validation checks for the v4 single-CSV dataset. |
| docs/v4/lead_scoring_v4_requirements.md | Captures v4 goals, rationale from v1–v3, and the proposed v4 column set + acceptance criteria. |
| docs/v4/implementation_plan.md | Milestone breakdown (M0–M3) and deliverables/acceptance criteria for v4 delivery. |
| docs/v4/engine_changes_spec.md | Specifies intended engine changes (difficulty/category signal tuning, snapshot windowing) and build-script responsibilities. |
| docs/v4/dataset_contract.md | Defines snapshot/target temporal gates, missingness expectations, subsampling, and reproducibility contract. |
| CLAUDE.md | Adds repo map + generation/validation workflow documentation and links to v4 docs. |
| AGENTS.md | Adds v4 implementation guidance, conventions, and a validation checklist. |
| .agent-plan.md | Reorients “Next Up” to v4 milestones and defers/triages older roadmap items. |
| | Conversion rate | In [15%, 40%] | | ||
| | Baseline LR AUC | In [0.65, 0.90] (all features except leakage trap) | | ||
| | Leakage trap AUC boost | AUC with trap > AUC without trap by ≥0.03 | | ||
| | Missingness per column | Each column with nulls: 1–15% missing | |
R7 specifies “Each column with nulls: 1–15% missing”, but validation_spec.md only enforces “no column > 20% missing” and dataset_contract.md includes structural missingness for days_since_last_touch (which may exceed 15% depending on touch sparsity). Consider relaxing/aligning this requirement (e.g., per-column max only, or explicit bounds per affected column).
| | Missingness per column | Each column with nulls: 1–15% missing | | |
| | Missingness per column | For nullable columns, no column may exceed 20% missing unless explicitly exempted below | | |
| | Missingness exceptions | `days_since_last_touch`: structural missingness allowed in [5%, 35%] because leads with no touches have natural NaN; `web_sessions`: overall missingness ≤20% | |
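The suggested wording above can be read as a per-column bound with named exemptions. The following is a minimal sketch of such a check, assuming rows are plain dictionaries and that missing values appear as `None` or `NaN`; the function names and the `MISSINGNESS_EXCEPTIONS` table are hypothetical and only mirror the bounds quoted in the suggestion.

```python
from math import isnan

# Hypothetical defaults that mirror the suggested wording above.
DEFAULT_MAX_MISSING = 0.20
MISSINGNESS_EXCEPTIONS = {
    # column -> (min_rate, max_rate); structural NaN for leads with no touches
    "days_since_last_touch": (0.05, 0.35),
}


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and isnan(value))


def missing_rates(rows):
    """Return the fraction of missing values per column."""
    n = len(rows)
    return {c: sum(is_missing(r[c]) for r in rows) / n for c in rows[0]}


def check_missingness(rows, default_max=DEFAULT_MAX_MISSING,
                      exceptions=MISSINGNESS_EXCEPTIONS):
    """Return one message per column whose missing rate is out of bounds."""
    failures = []
    for col, rate in missing_rates(rows).items():
        lo, hi = exceptions.get(col, (0.0, default_max))
        if not lo <= rate <= hi:
            failures.append(
                f"{col}: {rate:.1%} missing, expected [{lo:.0%}, {hi:.0%}]"
            )
    return failures
```

Keeping the exemptions in one table means the requirements doc and the validator can be compared line by line when the bounds change.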
| | `days_since_first_touch` | `snapshot_day - first_touch_day` (NaN if no touches) | Lead age signal | | ||
| | `expected_acv` | Opportunity ACV if opp created by snapshot; else employee_band midpoint | Value feature | | ||
| | `total_touches_all` | Count of ALL touches over full horizon (ignoring snapshot gate) | Leakage trap | | ||
| ### Files affected | ||
| - `leadforge/render/snapshots.py` — add `snapshot_day` parameter, windowed filtering, new feature computations | ||
| - `leadforge/schema/features.py` — add new `FeatureSpec` entries for the new columns |
The “New features to compute” list includes days_since_first_touch, but docs/v4/lead_scoring_v4_requirements.md’s target column set does not include this feature (and fixes the dataset at 18 columns). Please reconcile whether days_since_first_touch is part of v4 (and update the schema/validation) or remove it here to keep the specs aligned.
| | `days_since_first_touch` | `snapshot_day - first_touch_day` (NaN if no touches) | Lead age signal | | |
| | `expected_acv` | Opportunity ACV if opp created by snapshot; else employee_band midpoint | Value feature | | |
| | `total_touches_all` | Count of ALL touches over full horizon (ignoring snapshot gate) | Leakage trap | | |
| ### Files affected | |
| - `leadforge/render/snapshots.py` — add `snapshot_day` parameter, windowed filtering, new feature computations | |
| - `leadforge/schema/features.py` — add new `FeatureSpec` entries for the new columns | |
| | `expected_acv` | Opportunity ACV if opp created by snapshot; else employee_band midpoint | Value feature | | |
| | `total_touches_all` | Count of ALL touches over full horizon (ignoring snapshot gate) | Leakage trap | | |
| ### Files affected | |
| - `leadforge/render/snapshots.py` — add `snapshot_day` parameter, windowed filtering, new feature computations | |
| - `leadforge/schema/features.py` — add new `FeatureSpec` entries for the added v4 columns |
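To make the snapshot gate concrete, here is a rough sketch of how the three features in the table could be computed for one lead. It assumes a simplified `Lead` record and a made-up `BAND_MIDPOINT_ACV` mapping; neither is taken from the leadforge codebase, and the real band midpoints are defined in the design docs, not here.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical band -> midpoint ACV values, for illustration only.
BAND_MIDPOINT_ACV = {"1-50": 5_000.0, "51-200": 20_000.0, "201-1000": 60_000.0}


@dataclass
class Lead:
    """Simplified stand-in for a simulated lead (not the engine's type)."""
    employee_band: str
    touch_days: list[int]            # day offsets of every touch, full horizon
    opp_created_day: Optional[int]   # None if no opportunity was ever created
    opp_acv: Optional[float]


def snapshot_features(lead, snapshot_day):
    # Only touches on or before the snapshot are visible to gated features.
    visible = [d for d in lead.touch_days if d <= snapshot_day]
    first = min(visible) if visible else None
    opp_visible = (
        lead.opp_created_day is not None
        and lead.opp_created_day <= snapshot_day
    )
    return {
        # NaN when the lead has no touches by the snapshot.
        "days_since_first_touch": (
            float(snapshot_day - first) if first is not None else float("nan")
        ),
        # Opportunity ACV if the opp exists by the snapshot, else band midpoint.
        "expected_acv": (
            lead.opp_acv if opp_visible else BAND_MIDPOINT_ACV[lead.employee_band]
        ),
        # Leakage trap: deliberately ignores the snapshot gate.
        "total_touches_all": len(lead.touch_days),
    }
```

The contrast between `visible` and `lead.touch_days` is the whole point of the trap column: it counts events the model could not have seen at scoring time.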
| - `is_sql` — engine invariant creates deterministic groups | ||
| - `is_mql` — zero variance | ||
| - `lead_created_at` — timestamp that could be used to reverse-engineer temporal info | ||
| - Any column containing `_id` suffix (opaque identifiers, not features) |
“Any column containing _id suffix” is ambiguous (contains vs endswith). If the intent is to ban opaque identifiers like lead_id/account_id, consider clarifying to “any column name ending with _id” to avoid mis-implementing the check.
| - Any column containing `_id` suffix (opaque identifiers, not features) | |
| - Any column name ending with `_id` (opaque identifiers, not features) |
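The difference between "contains" and "ends with" matters in practice. A small sketch of the suffix-based reading, assuming the banned list is checked against plain column names (the helper name and the example columns are hypothetical):

```python
# Exact-name bans listed in the spec excerpt above.
BANNED_EXACT = {"is_sql", "is_mql", "lead_created_at"}


def banned_columns(columns):
    """Return columns that are banned by name or by the `_id` suffix rule."""
    return [c for c in columns if c in BANNED_EXACT or c.endswith("_id")]
```

With this reading, a column such as `valid_identity` (which contains `_id` but does not end with it) is not flagged, while `lead_id` and `account_id` are.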
| **Problem:** v1–v3 inject missingness randomly (MCAR). Real CRM data has structured gaps. | ||
| **Requirement:** Implement three missingness patterns: | ||
| 1. **Natural (structural):** `days_since_last_touch` is NaN when `total_touches == 0` (no touches recorded). Already exists but must be preserved. |
Structured missingness refers to total_touches == 0, but R5 requires dropping total_touches from v4. To avoid confusion, define the condition in terms of retained columns (e.g., inbound_touches + outbound_touches == 0) or explicitly note that total_touches is computed internally and not included in the final CSV.
| 1. **Natural (structural):** `days_since_last_touch` is NaN when `total_touches == 0` (no touches recorded). Already exists but must be preserved. | |
| 1. **Natural (structural):** `days_since_last_touch` is NaN when `inbound_touches + outbound_touches == 0` (no touches recorded). Already exists but must be preserved. |
| | Column | Pattern | Rate | Condition | | ||
| |---|---|---|---| | ||
| | `days_since_last_touch` | Structural | Natural | NaN when `total touches == 0` by snapshot | |
The missingness contract uses total touches == 0 as the condition for structural NaNs in days_since_last_touch, but v4 drops total_touches from the final CSV. It would be clearer to express the condition using columns that remain in v4 (e.g., inbound_touches + outbound_touches == 0) or explicitly state that the condition is evaluated pre-drop during build.
| | `days_since_last_touch` | Structural | Natural | NaN when `total touches == 0` by snapshot | | |
| | `days_since_last_touch` | Structural | Natural | NaN when `inbound_touches + outbound_touches == 0` by snapshot | |
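Expressing the condition with retained columns also makes it checkable on the final CSV. A minimal sketch follows, assuming rows are dictionaries and assuming the stricter "if and only if" reading (the contract itself only states the "NaN when no touches" direction):

```python
from math import isnan


def _is_nan(value):
    return value is None or (isinstance(value, float) and isnan(value))


def structural_missingness_violations(rows):
    """Return indices where NaN status and the no-touch condition disagree."""
    bad = []
    for i, r in enumerate(rows):
        no_touches = r["inbound_touches"] + r["outbound_touches"] == 0
        if no_touches != _is_nan(r["days_since_last_touch"]):
            bad.append(i)
    return bad
```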
| ## Overview | ||
| This plan implements the v4 lead scoring dataset in 4 milestones across 4–6 PRs. Each milestone produces testable artifacts and has explicit acceptance criteria. |
This plan says v4 is delivered across “4–6 PRs”, but later the timeline section states “The work is 4 PRs”. Please make the expected PR count consistent so the milestone planning is unambiguous.
| This plan implements the v4 lead scoring dataset in 4 milestones across 4–6 PRs. Each milestone produces testable artifacts and has explicit acceptance criteria. | |
| This plan implements the v4 lead scoring dataset in 4 milestones across 4 PRs. Each milestone produces testable artifacts and has explicit acceptance criteria. |
| - `scripts/build_v4_snapshot.py` (new) — snapshot builder with missingness + leakage trap | ||
| - `scripts/validate_v4_dataset.py` (new) — dataset-level validation | ||
| - These live in the leadforge repo (not datasets-private) | ||
| **v4-M3 (release):** | ||
| - Work in `leadforge-datasets-private` repo |
This section states the v4 build/validation scripts live in the leadforge repo (not datasets-private), but CLAUDE.md’s “Related repos” + workflow example place these scripts in leadforge-datasets-private. Please align the docs on a single canonical home (or document how scripts are shared) to avoid contributors building in the wrong repo.
| - `scripts/build_v4_snapshot.py` (new) — snapshot builder with missingness + leakage trap | |
| - `scripts/validate_v4_dataset.py` (new) — dataset-level validation | |
| - These live in the leadforge repo (not datasets-private) | |
| **v4-M3 (release):** | |
| - Work in `leadforge-datasets-private` repo | |
| - Work in `leadforge-datasets-private` repo | |
| - `scripts/build_v4_snapshot.py` (new) — snapshot builder with missingness + leakage trap | |
| - `scripts/validate_v4_dataset.py` (new) — dataset-level validation | |
| **v4-M3 (release):** | |
| - Continue in `leadforge-datasets-private` repo for dataset generation and release artifacts |
| ### Implementation order | ||
| ``` | ||
| v4-M0 (planning PR — already done) |
The implementation order diagram says “v4-M0 (planning PR — already done)”, but in this PR v4-M0 is the work being proposed/merged. Consider rephrasing to something time-invariant (e.g., “v4-M0: planning (this PR)”) to avoid the docs disagreeing with .agent-plan.md.
| v4-M0 (planning PR — already done) | |
| v4-M0: planning (this PR) |
| - Stratified subsampling to 1,000 rows at ~30% conversion rate. | ||
| - All negatives retained (up to 700); positives downsampled. | ||
| - Subsampling preserves within-class feature distributions. |
“All negatives retained (up to 700); positives downsampled.” is a bit self-contradictory for a 1,000-row stratified sample (you can’t retain all negatives from a 5,000-lead bundle if there are more than 700). Consider rewording to specify the intended sampling policy (e.g., fixed 700 negatives sampled uniformly/stratified, and 300 positives downsampled as needed) and what happens if the source bundle’s base rate is below 30%.
| - Stratified subsampling to 1,000 rows at ~30% conversion rate. | |
| - All negatives retained (up to 700); positives downsampled. | |
| - Subsampling preserves within-class feature distributions. | |
| - Stratified subsampling produces 1,000 rows with a target mix of 700 negatives and 300 positives when the source bundle contains at least that many examples of each class. | |
| - Sampling policy: draw negatives uniformly at random without replacement up to 700 rows, and draw positives uniformly at random without replacement up to 300 rows. | |
| - If the source bundle has fewer than 300 positives, retain all available positives and fill the remaining rows with additional uniformly sampled negatives; if it has fewer than 700 negatives, retain all available negatives and fill the remaining rows with additional uniformly sampled positives. | |
| - Because sampling is uniform within each class, the subsample is intended to preserve within-class feature distributions up to normal sampling variation. |
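The suggested sampling policy, including the fill rule for a short class, can be sketched as below. This is an illustration under assumptions: rows are dictionaries, the label column is taken to be `converted_within_90_days`, and the `subsample` helper and its `seed` parameter are hypothetical.

```python
import random


def subsample(rows, label_key="converted_within_90_days",
              n_total=1000, n_pos=300, seed=0):
    """Draw up to n_pos positives and n_total - n_pos negatives uniformly
    without replacement, filling any shortfall from the other class."""
    rng = random.Random(seed)
    pos = [r for r in rows if r[label_key]]
    neg = [r for r in rows if not r[label_key]]

    take_pos = min(n_pos, len(pos))
    take_neg = min(n_total - n_pos, len(neg))

    # If one class is short, top up from whatever the other class has left.
    shortfall = n_total - take_pos - take_neg
    if shortfall > 0:
        extra_pos = min(shortfall, len(pos) - take_pos)
        take_pos += extra_pos
        take_neg += min(shortfall - extra_pos, len(neg) - take_neg)

    sample = rng.sample(pos, take_pos) + rng.sample(neg, take_neg)
    rng.shuffle(sample)
    return sample
```

Note that when positives are scarce the realised conversion rate falls below 30%, so a validator enforcing the conversion-rate band would still need to run on the result.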
| 2. **Conditional on source:** `web_sessions` is missing for ~15% of `sdr_outbound` leads (CRM tracking often not set up for outbound-sourced leads) but only ~2% of `inbound_marketing` leads. | ||
| 3. **Role data gap:** `seniority` is missing for ~8% of `partner_referral` leads (referral partners don't always provide full contact details). | ||
| **Engine change needed:** Missingness injection in the snapshot builder, conditioned on feature values. |
R3 says missingness injection belongs in the snapshot builder (engine), but docs/v4/engine_changes_spec.md specifies missingness should be injected in the v4 build script (post-snapshot) to keep engine output clean. Please align R3 with the chosen design so implementation guidance is consistent.
| **Engine change needed:** Missingness injection in the snapshot builder, conditioned on feature values. | |
| **Implementation note:** Preserve natural structural missingness in the engine output, but inject the conditional missingness patterns in the v4 build script **after** snapshot generation, conditioned on feature values, to keep the snapshot builder output clean and reusable. |
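As an illustration of post-snapshot injection, the sketch below applies the two conditional patterns from R3 to a copy of the rows. It assumes the source channel is stored in a `lead_source` column and that a seeded generator gives reproducibility; the rule table layout and the function name are hypothetical, not the build script's actual interface.

```python
import random

# (column to blank, lead_source value, approximate missing rate) from R3.
CONDITIONAL_RULES = [
    ("web_sessions", "sdr_outbound", 0.15),
    ("web_sessions", "inbound_marketing", 0.02),
    ("seniority", "partner_referral", 0.08),
]


def inject_missingness(rows, rules=CONDITIONAL_RULES, seed=0,
                       source_key="lead_source"):
    """Return a copy of rows with values blanked according to the rules."""
    rng = random.Random(seed)
    out = [dict(r) for r in rows]  # leave the snapshot output untouched
    for column, source, rate in rules:
        for r in out:
            if r[source_key] == source and rng.random() < rate:
                r[column] = None
    return out
```

Working on a copy keeps the engine's snapshot reusable, which is the motivation given in the implementation note.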
* refactor: consolidate v4 planning docs + spike experiment
Self-review of PR #19 identified 8 issues. This commit addresses all of them:
1. Ran spike experiment validating category signal approach — the spec's CategoricalInfluence scaling was wrong (not wired into simulation). Correct approach: correlate observables with latent traits in population.py. Scale 1.8 gives AUC 0.694, within [0.65, 0.90] target.
2. Consolidated 5 overlapping spec docs into 2: design.md (single source of truth for requirements, contract, engine changes, plan) + validation_spec.md.
3. Added "Known limitations" section (is_sql invariant, role_function gap).
4. Added missingness rationale (detectability at n=1000, not arbitrary).
5. Added tuning protocol decision table for when validation checks fail.
6. Trimmed AGENTS.md to durable conventions + pointer to docs/v4/.
7. Added explicit ACV band→midpoint mapping table and null-band behavior.
8. Merged v4-M1 and v4-M2 into single milestone (engine + build pipeline can't be validated independently).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* style: fix ruff formatting in spike script
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* chore: bump ruff pre-commit hook v0.4.5 → v0.11.13
Aligns the local pre-commit hook with CI's ruff version (unpinned, currently 0.11.x). The old v0.4.5 hook accepted formatting that the CI ruff rejects, allowing format violations to slip through.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* feat(scripts,docs): channel-signal audit (PR 4.1 deliverable 1+2)
scripts/audit_channel_signal.py audits how strongly source channel
signals conversion across the release tier family. For each tier we
compute per-channel conversion rates and the univariate AUC of channel
against converted_within_90_days (scored as the empirical positive rate
per channel — a 1-D Bayes classifier equivalent to a saturated logistic
regression on one-hot channel features). Outputs JSON + Markdown to
docs/release/channel_signal_audit.{json,md}.
Tests guard determinism against the committed release/ bundles (a
double-run produces byte-identical output) plus per-channel rollup,
univariate AUC closed-form, single-class fallback, error paths, and the
CLI wiring.
The audit confirms what the v1 DGP predicts: channel signal in v1 is
weak — across all three tiers the largest per-channel rate spread is
0.043 and the largest univariate AUC is 0.521, well below the G2 /
Gemini v2 industry MQL→SQL band (SEO ~51% vs Email <1%). v1 drives
conversion through motif-family hazards keyed off latent traits, not
channel-conditional probabilities; channel-conditional encoding is
tracked as post-v1 work in docs/release/post_v1_roadmap.md.
Roadmap: docs/release/v1_release_roadmap.md §"Phase 4 — PR 4.1".
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* docs(release): release-grade dataset card + generation method + feature dictionary (PR 4.1 deliverable 3-5)
* docs/release/generation_method.md (new) — standalone DGP summary for
external readers. Reads alone, references the architecture spec.
Covers the five generation layers (motif families → mechanism layer
→ population → simulation engine → snapshot rendering), the public-
vs-instructor split, calibration / validation, and the explicit
"what this is not" boundary.
* docs/release/feature_dictionary.md (new) — narrative companion to
the per-bundle feature_dictionary.csv. Groups the 32 public columns
by analytical role (lead identity / firmographics / personographics /
engagement / funnel / value) plus the deliberate trap and the target.
Documents difficulty modulation parameters, modelling defaults, and
pedagogical caveats. Satisfies G10.3.
* release/README.md (substantial rewrite) — release-grade dataset card
per Datasheets-for-Datasets / Data Cards Playbook checklist (G10.1):
- macro framing paragraph (2024–2026 SaaS context, recommendation #19)
- simulation simplifications section (chatgpt v2 §2.6 — modelled /
approximate / not modelled)
- calibration documentation linking to validation_report.md
- public-vs-instructor redaction policy with concrete column lists
citing BANNED_LEAD_COLUMNS / BANNED_OPP_COLUMNS / BANNED_TABLES /
SNAPSHOT_FILTERED_TABLES from leakage_probes.py
- intended use vs out-of-scope use
- known limitations including the G7.4.4 GBM-vs-LR finding and the
weak channel signal from the Phase 4 audit
- composition section (entities / features / label / splits /
provenance) per Datasheets format
- adversarial-framing pointer (placeholder link to break-me guide
that lands in PR 6.3)
- maintenance plan
All claims about realism, calibration, or difficulty are anchored to
release/validation/validation_report.md per G10.6.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* docs(plan): mark Phase 4 PR 4.1 complete in .agent-plan.md
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* docs(release): cover lead_source / first_touch_channel in feature dictionary
Self-review caught a gap: the prior commit grouped 30 of 32 public
columns; lead_source and first_touch_channel were referenced in the
"recommended modelling defaults" checklist but did not appear in any
category table. Adds a "Lead source & channel" subsection that
describes both columns, calls out that they're identical in v1, and
cross-references the channel-signal audit so readers don't expect
top-tier feature importance from these columns. Updates the summary
table to reflect 32 documented columns. Also corrects two minor wording
issues (firmographics "Six" → "Five", personographics "all four" →
"all three", and a typo "bandage" → "discretisation").
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(scripts): channel audit — out-of-sample AUC, no verdict bands, group identical columns
Self-review of the previous PR-4.1 commit surfaced four problems with
audit_channel_signal.py:
* The univariate AUC was computed in-sample (train rates → train labels),
guaranteed >= 0.5 by construction and not directly comparable to the
source_only baselines in release/validation/validation_report.json.
* The "weak / moderate / strong" verdict made a hard comparison between
v1's 90-day closed-won label and the G2 / Gemini v2 industry MQL→SQL
benchmark band. The two metrics measure different funnel transitions;
the comparison was a category error.
* The verdict prose hard-coded a "50 percentage points" claim and a
specific architectural narrative ("v1 drives conversion through
motif-family hazards") inside the script — both would silently drift
from the data and the codebase over time.
* lead_source and first_touch_channel produce byte-identical audits in
v1 yet were rendered as two parallel tables per tier.
Fixes:
* audit_channel now takes both train and test DataFrames and returns
univariate_auc_in_sample (the historical 1-D Bayes interpretation,
retained for transparency) plus univariate_auc_out_of_sample (train
rates scored against held-out test labels). The OOS numbers
reproduce the source_only HistGBM baselines in validation_report.json
for seed 42 cell-for-cell (intro 0.5014, intermediate 0.5139,
advanced 0.5226).
* Verdict bands and the _classify_signal / _verdict_paragraph helpers
are gone. The markdown report now ends with a Discussion section
written by hand around the actual numbers, with an explicit caveat
that the industry benchmarks measure MQL→SQL (not 90-day closed-won)
and are reproduced for context only.
* INDUSTRY_MQL_TO_SQL_BENCHMARKS is now a tuple of pairs (genuinely
immutable; matches dataclass(frozen=True) semantics). report_to_dict
converts it back to a {name: rate} dict for the JSON output.
* render_markdown groups channel columns whose audits are
byte-identical into one section with a header listing all columns
("Columns: lead_source, first_touch_channel (audit values
identical)"). The JSON keeps per-column entries.
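The in-sample versus out-of-sample distinction in the fixes above can be sketched in a few lines. This is an illustrative reimplementation, not the script's code: per-channel positive rates are fitted on train, used as scores on test, and channels unseen on train fall back to the train base rate. The function names are hypothetical, and returning `None` for a single-class test split is an assumption about the fallback.

```python
from collections import defaultdict


def channel_rates(channels, labels):
    """Empirical positive rate per channel, plus the overall base rate."""
    pos, tot = defaultdict(int), defaultdict(int)
    for c, y in zip(channels, labels):
        tot[c] += 1
        pos[c] += int(y)
    base = sum(pos.values()) / len(labels)
    return {c: pos[c] / tot[c] for c in tot}, base


def auc(scores, labels):
    """Pairwise (Mann-Whitney) AUC; ties between a positive and a negative
    count as half a win. Quadratic, which is fine for a sketch."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        return None  # assumed single-class fallback
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def univariate_auc_oos(train_ch, train_y, test_ch, test_y):
    """Score held-out labels with rates learned on the training split."""
    rates, base = channel_rates(train_ch, train_y)
    scores = [rates.get(c, base) for c in test_ch]
    return auc(scores, test_y)
```

Passing the same split as both train and test reproduces the in-sample number, which is the sanity check listed among the new tests.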
New tests in tests/scripts/test_audit_channel_signal.py:
* OOS AUC == in-sample AUC when test=train (sanity check)
* OOS AUC stays well-defined when the test split contains channels
unseen on train (train-base-rate fallback)
* render_markdown collapses two identical columns into one section
AND keeps two distinct columns in two sections
* test_lead_source_equals_first_touch_channel_in_v1 (parametrized
over intro/intermediate/advanced) — locks the feature-dictionary
claim that the two channel columns are identical in v1. If the
simulator ever diverges them, the doc must be updated.
* test_committed_audit_artifacts_match_fresh_regeneration — re-runs
the audit against the committed bundles and asserts byte-equality
with the committed docs/release/channel_signal_audit.{md,json}.
CI gate against bundles regenerated without re-running the audit.
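The grouping behaviour that the `render_markdown` test above locks in can be approximated as follows. This is a sketch under assumptions: audits are JSON-serialisable dictionaries keyed by column name, and "byte-identical" is approximated by comparing their canonical JSON dumps. The helper name is hypothetical.

```python
import json


def group_identical_audits(audits):
    """Group column names whose audit payloads serialise identically.

    Groups keep the order in which their first column was seen.
    """
    groups = {}
    for column, audit in audits.items():
        key = json.dumps(audit, sort_keys=True)
        groups.setdefault(key, []).append(column)
    return list(groups.values())
```

A renderer could then emit one section per group and list every column of the group in the section header.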
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* docs(release): self-review fixes — README trim, citations, feature-dict consistency
* release/README.md (~434 → ~228 lines): trimmed to a release-grade
landing card. The full DGP, motif families, simulation
simplifications, and module map move to docs/release/generation_method.md
(linked). Macro-framing claim now cites
docs/external_review/summaries/gemini_v2_summary.md as the source of
the 30%→25% growth and CAC-ratio numbers (previously presented as if
primary research). Composition + maintenance sections compressed
into the table at the bottom.
* docs/release/generation_method.md: dropped the "Where the code
lives" module table. This doc is for external readers; module
paths belong in the developer-facing design doc and architecture
spec. Ends with a single short pointer to those.
* docs/release/feature_dictionary.md: fixed a factually wrong claim
about the leakage trap (the per-bundle CSV has columns
``name,dtype,description,category,is_target,leakage_risk`` — there
is no ``is_leakage_trap`` column). Reworded the modelling-default
checklist to acknowledge that the flat ``lead_scoring.csv`` and the
Parquet task splits ship every column listed in the dictionary
including the IDs — the recommendation says what to use as features,
not what's in the file. Also notes that ``lead_source`` and
``first_touch_channel`` carry identical values in v1 (locked by the
new test), so picking one is fine.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(scripts,docs): address Copilot review threads on PR 4.1
Six fixes from the Copilot reviews on PR #69:
* scripts/audit_channel_signal.py — _label_to_int now uses
pd.api.types.is_bool_dtype() so it explicitly handles pandas
nullable BooleanDtype (the actual parquet dtype on the v1 bundles)
alongside numpy bool. Previously it worked via a coincidental
pd.to_numeric fallback, with a comment that misled future readers.
* scripts/audit_channel_signal.py — render_markdown now takes both
md_path and json_path and emits the JSON link as a relative path
to the markdown's directory, so a `--out-md`/`--out-json` override
produces a markdown report whose link target is correct. Defaults
to the canonical "channel_signal_audit.json" basename when called
without paths (the unit-test path).
* scripts/audit_channel_signal.py — main() pins encoding="utf-8" on
both write_text() calls so the audit output is byte-identical
across operating systems and locale configurations.
* scripts/audit_channel_signal.py — Discussion section is no longer
bundle-specific. The previous prose claimed "for seed 42 the OOS
numbers below match the report cell-for-cell" — true for the
committed bundle but wrong for any other --release-dir. The new
prose talks about which AUC is comparable and what conclusion the
numbers in the per-tier sections support, both bundle-agnostic.
* release/README.md — fixed the relational-feature-engineering
Quick start example. The previous snippet did
`leads.merge(touch_counts, on="lead_id")` where touch_counts was
a Series with lead_id in its index, not as a column — would error
in modern pandas. The new snippet uses .reset_index() and merges
the resulting DataFrame.
* docs/release/feature_dictionary.md — touches_week_1 documented as
"days 0–7 inclusive" (8 day values) and touches_last_7_days
qualified with "for snapshot_day=30, days 24–30 inclusive".
Previously claimed "days 0–6" for week_1, which mismatched the
snapshot builder's _day <= 7 window.
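The window boundaries in the last bullet are easy to get off by one, so a small sketch may help. It assumes touches are given as integer day offsets from lead creation and that day 0 is the lower bound of the first window; the function names simply reuse the column names.

```python
def touches_week_1(touch_days):
    """Count touches on days 0 through 7 inclusive (8 day values)."""
    return sum(1 for d in touch_days if 0 <= d <= 7)


def touches_last_7_days(touch_days, snapshot_day=30):
    """Count touches in the 7 days ending at the snapshot, inclusive.

    For snapshot_day=30 this covers days 24 through 30.
    """
    return sum(1 for d in touch_days if snapshot_day - 6 <= d <= snapshot_day)
```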
Test changes:
* test_release_audit_is_deterministic now writes both runs to the
same path (back-to-back overwrite) instead of distinct tmp paths,
so the relative-link rendering doesn't make the two outputs differ.
* test_committed_audit_artifacts_match_fresh_regeneration uses the
canonical "channel_signal_audit.{md,json}" basenames in tmp_path,
so the relative link in the regenerated markdown matches the
committed file's link.
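The relative-link behaviour that these test changes accommodate can be sketched with the standard library. The helper below is hypothetical and only illustrates the idea: compute the JSON path relative to the markdown file's directory, and fall back to the canonical basename when no paths are supplied.

```python
import os
from pathlib import Path

CANONICAL_JSON_NAME = "channel_signal_audit.json"


def json_link_for(md_path=None, json_path=None):
    """Return the link target to embed in the markdown report."""
    if md_path is None or json_path is None:
        return CANONICAL_JSON_NAME
    rel = os.path.relpath(json_path, start=Path(md_path).parent)
    # Use forward slashes so the link renders the same on every platform.
    return Path(rel).as_posix()
```

Because the link depends on where both files are written, two runs only produce identical markdown when they use the same pair of paths, which is why the determinism test now overwrites a single location.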
Two stale Copilot threads (firmographics "Six columns" and "bandage"
typo) were already addressed in commit f6b274e during the first
self-review pass.
1175/1175 tests pass; ruff + mypy clean; the regenerated audit
artifacts are byte-identical via the canonical-path mode.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Comprehensive planning PR for the v4 lead scoring dataset — a pedagogically improved single-CSV dataset for an intro ML course. This PR contains only documentation and planning artifacts (no code changes, all 590 tests pass).
Why v4?
v1–v3 each fixed critical issues but revealed new ones:
funnel_stage containing closed_won/closed_lost
reached_sql=0 → 0% conversion)
v4 addresses all of these and adds:
expected_acv feature)
touches_week_1)
total_touches_all — full 90-day window, for classroom discussion)
category_effect_scale in difficulty profiles
Milestones
Existing roadmap items — triage
--json/--strict flags
v4 dataset acceptance criteria
Files in this PR
- docs/v4/lead_scoring_v4_requirements.md — 7 requirements with rationale
- docs/v4/dataset_contract.md — schema contract, temporal gates, missingness patterns
- docs/v4/engine_changes_spec.md — what changes in the engine and where
- docs/v4/validation_spec.md — 8 mandatory + 3 warning checks
- docs/v4/implementation_plan.md — milestone breakdown with acceptance criteria
- CLAUDE.md — repo map, generation workflow, student_public invariants
- AGENTS.md — v4 implementation guide, coding conventions, testing
- .agent-plan.md — v4 as next work, M12–M15 triage
What this PR does NOT do
🤖 Generated with Claude Code