refactor: consolidate v4 planning docs + spike experiment#20
Conversation
Self-review of PR #19 identified 8 issues. This commit addresses all of them: 1. Ran spike experiment validating category signal approach — the spec's CategoricalInfluence scaling was wrong (not wired into simulation). Correct approach: correlate observables with latent traits in population.py. Scale 1.8 gives AUC 0.694, within [0.65, 0.90] target. 2. Consolidated 5 overlapping spec docs into 2: design.md (single source of truth for requirements, contract, engine changes, plan) + validation_spec.md. 3. Added "Known limitations" section (is_sql invariant, role_function gap). 4. Added missingness rationale (detectability at n=1000, not arbitrary). 5. Added tuning protocol decision table for when validation checks fail. 6. Trimmed AGENTS.md to durable conventions + pointer to docs/v4/. 7. Added explicit ACV band→midpoint mapping table and null-band behavior. 8. Merged v4-M1 and v4-M2 into single milestone (engine + build pipeline can't be validated independently). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
Aligns the local pre-commit hook with CI's ruff version (unpinned, currently 0.11.x). The old v0.4.5 hook accepted formatting that the CI ruff rejects, allowing format violations to slip through. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
pr-agent-context report: No unresolved review comments, failing checks, or actionable patch coverage gaps were found on PR
#20. Treat this PR as all clear unless new signals appear.Run metadata: |
There was a problem hiding this comment.
Pull request overview
This PR follows up on the v4 planning work by consolidating multiple overlapping v4 spec documents into a single design doc + a companion validation spec, and adds a spike script to empirically validate the proposed “category signal” approach for v4.
Changes:
- Added
docs/v4/design.mdas the consolidated single source of truth (requirements + contract + engine changes + implementation plan) and added a companion note tovalidation_spec.md. - Added
scripts/spike_category_signal.pyto measure category spread + AUC under different population-level correlation scales. - Removed superseded v4 planning docs and updated repo pointers (
CLAUDE.md,AGENTS.md,.agent-plan.md) to reference the consolidated docs.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
scripts/spike_category_signal.py |
New spike experiment script for measuring category→conversion signal and LR AUC under different correlation scales. |
docs/v4/design.md |
New consolidated v4 design doc (requirements/contract/engine plan) intended as the single source of truth. |
docs/v4/validation_spec.md |
Adds a pointer clarifying this doc is the automated-check companion to design.md. |
docs/v4/planning_pr_review.md |
Adds the self-review writeup and treatment plan for PR #19 issues. |
CLAUDE.md |
Updates v4 doc references to point to docs/v4/design.md + related consolidated artifacts. |
AGENTS.md |
Removes v4-specific step-by-step guide and replaces with pointers to docs/v4/. |
.agent-plan.md |
Updates v4 milestone/checklist references to match the consolidated doc structure and spike script. |
docs/v4/lead_scoring_v4_requirements.md |
Deleted (content consolidated into docs/v4/design.md). |
docs/v4/dataset_contract.md |
Deleted (content consolidated into docs/v4/design.md). |
docs/v4/engine_changes_spec.md |
Deleted (content consolidated into docs/v4/design.md). |
docs/v4/implementation_plan.md |
Deleted (content consolidated into docs/v4/design.md). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| Include an `expected_acv` numeric feature so students can compute `P(conversion) × expected_acv` and practice value-aware ranking. | ||
|
|
||
| **ACV derivation (single source of truth):** | ||
|
|
||
| | Condition | Value | | ||
| |---|---| | ||
| | Opportunity created by snapshot day | Opportunity's `estimated_acv` | | ||
| | No opportunity, `estimated_revenue_band` known | Band midpoint (see table below) | | ||
| | No opportunity, band unknown | NaN | | ||
|
|
||
| **Revenue band → ACV midpoint mapping:** | ||
|
|
||
| | Band | Midpoint ($k) | | ||
| |---|---| | ||
| | $1M–$10M | 25 | | ||
| | $10M–$50M | 55 | | ||
| | $50M–$200M | 85 | | ||
| | $200M+ | 140 | | ||
|
|
||
| These midpoints are derived from the engine's `_EMPLOYEE_ACV_RANGES` in `simulation/engine.py`, which maps employee bands to ACV ranges. Since the dataset exposes `estimated_revenue_band` (not employee band), the midpoints approximate the overlap between revenue bands and the engine's ACV sampling. |
There was a problem hiding this comment.
The ACV midpoint mapping table labels values as "Midpoint ($k)", but the engine’s OpportunityRow.estimated_acv values are in raw USD (see _EMPLOYEE_ACV_RANGES in leadforge/simulation/engine.py). Please clarify the units for expected_acv (USD vs thousands) and ensure the midpoint values/implementation match that unit consistently.
| Include an `expected_acv` numeric feature so students can compute `P(conversion) × expected_acv` and practice value-aware ranking. | |
| **ACV derivation (single source of truth):** | |
| | Condition | Value | | |
| |---|---| | |
| | Opportunity created by snapshot day | Opportunity's `estimated_acv` | | |
| | No opportunity, `estimated_revenue_band` known | Band midpoint (see table below) | | |
| | No opportunity, band unknown | NaN | | |
| **Revenue band → ACV midpoint mapping:** | |
| | Band | Midpoint ($k) | | |
| |---|---| | |
| | $1M–$10M | 25 | | |
| | $10M–$50M | 55 | | |
| | $50M–$200M | 85 | | |
| | $200M+ | 140 | | |
| These midpoints are derived from the engine's `_EMPLOYEE_ACV_RANGES` in `simulation/engine.py`, which maps employee bands to ACV ranges. Since the dataset exposes `estimated_revenue_band` (not employee band), the midpoints approximate the overlap between revenue bands and the engine's ACV sampling. | |
| Include an `expected_acv` numeric feature in raw USD so students can compute `P(conversion) × expected_acv` and practice value-aware ranking. | |
| **ACV derivation (single source of truth):** | |
| | Condition | Value | | |
| |---|---| | |
| | Opportunity created by snapshot day | Opportunity's `estimated_acv` | | |
| | No opportunity, `estimated_revenue_band` known | Band midpoint in raw USD (see table below) | | |
| | No opportunity, band unknown | NaN | | |
| **Revenue band → ACV midpoint mapping:** | |
| | Band | Midpoint (USD) | | |
| |---|---| | |
| | $1M–$10M | 25000 | | |
| | $10M–$50M | 55000 | | |
| | $50M–$200M | 85000 | | |
| | $200M+ | 140000 | | |
| These midpoints are derived from the engine's `_EMPLOYEE_ACV_RANGES` in `simulation/engine.py`, which maps employee bands to ACV ranges. Since the dataset exposes `estimated_revenue_band` (not employee band), the midpoints approximate the overlap between revenue bands and the engine's ACV sampling. The fallback midpoint values are expressed in raw USD so they use the same unit as `OpportunityRow.estimated_acv` and keep `expected_acv` consistent regardless of derivation path. |
| ## Target column set | ||
|
|
||
| | # | Column | Type | Source | Notes | | ||
| |---|---|---|---|---| | ||
| | 1 | `industry` | categorical | account | 4 values | | ||
| | 2 | `region` | categorical | account | US, UK | | ||
| | 3 | `company_size` | categorical | account | 4 bands | | ||
| | 4 | `company_revenue` | categorical | account | 4 bands | | ||
| | 5 | `contact_role` | categorical | contact | 4 roles | | ||
| | 6 | `seniority` | categorical | contact | 5 levels (~8% missing for partner_referral) | | ||
| | 7 | `lead_source` | categorical | lead | 3 channels | | ||
| | 8 | `opportunity_created` | binary 0/1 | derived | Opp opened by snapshot day | | ||
| | 9 | `demo_completed` | binary 0/1 | derived | Demo done by snapshot day | | ||
| | 10 | `expected_acv` | numeric | derived | See R1 ACV derivation table | | ||
| | 11 | `inbound_touches` | integer | events ≤ snapshot | Inbound touchpoints | | ||
| | 12 | `outbound_touches` | integer | events ≤ snapshot | Outbound touchpoints | | ||
| | 13 | `touches_week_1` | integer | events ≤ day 7 | First-week touch intensity | | ||
| | 14 | `web_sessions` | integer | events ≤ snapshot | Sessions (~15% missing for outbound) | | ||
| | 15 | `sales_activities` | integer | events ≤ snapshot | Sales activities count | | ||
| | 16 | `days_since_last_touch` | float | events ≤ snapshot | Natural NaN when no touches | | ||
| | 17 | `total_touches_all` | integer | **ALL events** | Leakage trap — full 90-day window | | ||
| | 18 | `converted` | binary 0/1 | target | Converted within 90 days | |
There was a problem hiding this comment.
Within this doc, the target column set uses names like company_size, company_revenue, contact_role, and converted, but the engine-facing sections and spike refer to current bundle/snapshot names like employee_band, estimated_revenue_band, role_function, and converted_within_90_days. Since this file is intended as the single source of truth, add an explicit mapping (raw bundle column → final CSV column) or standardize on one naming scheme throughout to avoid implementation/validation mismatches.
| | $1M–$10M | 25 | | ||
| | $10M–$50M | 55 | | ||
| | $50M–$200M | 85 | |
There was a problem hiding this comment.
Several band labels in the mapping/correlation tables use an en-dash ($1M–$10M, $10M–$50M, etc.), but the engine’s estimated_revenue_band values use a plain hyphen ($1M-$10M, …). If these strings are meant to be used as literal keys (e.g., in YAML config / lookup tables), please normalize them to the exact engine labels to avoid hard-to-spot mismatches.
| | $1M–$10M | 25 | | |
| | $10M–$50M | 55 | | |
| | $50M–$200M | 85 | | |
| | $1M-$10M | 25 | | |
| | $10M-$50M | 55 | | |
| | $50M-$200M | 85 | |
| ) | ||
|
|
||
|
|
||
| def run_pipeline(label: str, gen: Generator, scale: float | None = None) -> None: |
There was a problem hiding this comment.
run_pipeline is annotated as returning None, but it returns auc (a float). Update the return annotation to float (or stop returning a value) so type checkers and readers match the actual behavior.
| def run_pipeline(label: str, gen: Generator, scale: float | None = None) -> None: | |
| def run_pipeline(label: str, gen: Generator, scale: float | None = None) -> float: |
| 3. Correlated observables (1.8x boost) — stronger correlation | ||
|
|
||
| Reports category spread (max - min conversion rate) per categorical feature | ||
| and logistic regression AUC at day-21 snapshot. |
There was a problem hiding this comment.
The script header/docstring says results are measured on a day-21 snapshot, but run_pipeline() calls build_snapshot(sim, pop) without any day-21 gating (the current build_snapshot aggregates over the full simulation output). Either update the description/output text to reflect the actual window used, or implement explicit day-21 filtering in this spike so the reported AUC/spreads correspond to the intended snapshot day.
| and logistic regression AUC at day-21 snapshot. | |
| and logistic regression AUC from the snapshot produced by the current | |
| simulation output aggregation. |
| config = gen._world_spec.config | ||
| narrative = gen._world_spec.narrative |
There was a problem hiding this comment.
This spike reaches into Generator private state (gen._world_spec.*). Since Generator exposes public config and world_spec properties, prefer those to avoid coupling the script to internal implementation details.
| config = gen._world_spec.config | |
| narrative = gen._world_spec.narrative | |
| world_spec = gen.world_spec | |
| config = gen.config | |
| narrative = world_spec.narrative |
| feature_cols = [c for c in df.columns if c != "converted_within_90_days"] | ||
| x_df = df[feature_cols].copy() | ||
| y = df["converted_within_90_days"].astype(int) |
There was a problem hiding this comment.
measure_auc() trains on feature_cols = [c for c in df.columns if c != 'converted_within_90_days'], which will include opaque identifiers and meta columns (e.g., lead_id, account_id, contact_id, lead_created_at, current_stage, is_sql). Including these can make the spike AUC hard to interpret/compare against the planned v4 column set (and can spuriously inflate train AUC). Consider restricting to the intended v4 feature subset (or explicitly dropping IDs/timestamps/outcome-adjacent columns) before fitting.
Summary
Follow-up to #19. Self-review identified 8 issues with the v4 planning PR — this addresses all of them:
scripts/spike_category_signal.py): validated thatCategoricalInfluencescaling was wrong (not wired into simulation). Correct approach: correlate observables with latent traits inpopulation.py. Scale 1.8 → AUC 0.694.design.md+validation_spec.md). Single source of truth for shared constants.is_sqlinvariant workaround +role_functiongap.docs/v4/.Test plan
🤖 Generated with Claude Code