Skip to content

refactor: consolidate v4 planning docs + spike experiment#20

Merged
shaypal5 merged 3 commits into
mainfrom
fix/v4-planning-review-fixes
Apr 29, 2026
Merged

refactor: consolidate v4 planning docs + spike experiment#20
shaypal5 merged 3 commits into
mainfrom
fix/v4-planning-review-fixes

Conversation

@shaypal5

Copy link
Copy Markdown
Contributor

Summary

Follow-up to #19. Self-review identified 8 issues with the v4 planning PR — this addresses all of them:

  • Spike experiment (scripts/spike_category_signal.py): validated that CategoricalInfluence scaling was wrong (not wired into simulation). Correct approach: correlate observables with latent traits in population.py. Scale 1.8 → AUC 0.694.
  • Doc consolidation: 5 overlapping docs → 2 (design.md + validation_spec.md). Single source of truth for shared constants.
  • Known limitations section: explicit is_sql invariant workaround + role_function gap.
  • Tuning protocol: decision table for when validation checks fail.
  • Trimmed AGENTS.md: v4-specific content replaced with pointer to docs/v4/.
  • ACV mapping: explicit band→midpoint table + null-band behavior.
  • Merged M1+M2: engine + build pipeline as single milestone (can't validate independently).

Test plan

  • 590 tests passing
  • Spike experiment runs and produces expected results
  • All file references in CLAUDE.md, AGENTS.md, .agent-plan.md updated
  • No stale references to deleted files

🤖 Generated with Claude Code

Self-review of PR #19 identified 8 issues. This commit addresses all of them:

1. Ran spike experiment validating category signal approach — the spec's
   CategoricalInfluence scaling was wrong (not wired into simulation).
   Correct approach: correlate observables with latent traits in population.py.
   Scale 1.8 gives AUC 0.694, within [0.65, 0.90] target.

2. Consolidated 5 overlapping spec docs into 2: design.md (single source of
   truth for requirements, contract, engine changes, plan) + validation_spec.md.

3. Added "Known limitations" section (is_sql invariant, role_function gap).

4. Added missingness rationale (detectability at n=1000, not arbitrary).

5. Added tuning protocol decision table for when validation checks fail.

6. Trimmed AGENTS.md to durable conventions + pointer to docs/v4/.

7. Added explicit ACV band→midpoint mapping table and null-band behavior.

8. Merged v4-M1 and v4-M2 into single milestone (engine + build pipeline
   can't be validated independently).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 29, 2026 13:06
@github-actions

This comment has been minimized.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions

This comment has been minimized.

Aligns the local pre-commit hook with CI's ruff version (unpinned,
currently 0.11.x). The old v0.4.5 hook accepted formatting that
the CI ruff rejects, allowing format violations to slip through.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown

pr-agent-context report:

No unresolved review comments, failing checks, or actionable patch coverage gaps were found on PR
#20. Treat this PR as all clear unless new signals appear.

Run metadata:

Tool ref: v4
Tool version: 4.0.20
Trigger: commit pushed
Workflow run: 25110782532 attempt 1
Comment timestamp: 2026-04-29T13:10:00.453021+00:00
PR head commit: 506ea24c83245374160bdcafd3f46e5dc1afb381

@shaypal5 shaypal5 merged commit eee5a86 into main Apr 29, 2026
5 checks passed
@shaypal5 shaypal5 deleted the fix/v4-planning-review-fixes branch April 29, 2026 13:11

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR follows up on the v4 planning work by consolidating multiple overlapping v4 spec documents into a single design doc + a companion validation spec, and adds a spike script to empirically validate the proposed “category signal” approach for v4.

Changes:

  • Added docs/v4/design.md as the consolidated single source of truth (requirements + contract + engine changes + implementation plan) and added a companion note to validation_spec.md.
  • Added scripts/spike_category_signal.py to measure category spread + AUC under different population-level correlation scales.
  • Removed superseded v4 planning docs and updated repo pointers (CLAUDE.md, AGENTS.md, .agent-plan.md) to reference the consolidated docs.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
scripts/spike_category_signal.py New spike experiment script for measuring category→conversion signal and LR AUC under different correlation scales.
docs/v4/design.md New consolidated v4 design doc (requirements/contract/engine plan) intended as the single source of truth.
docs/v4/validation_spec.md Adds a pointer clarifying this doc is the automated-check companion to design.md.
docs/v4/planning_pr_review.md Adds the self-review writeup and treatment plan for PR #19 issues.
CLAUDE.md Updates v4 doc references to point to docs/v4/design.md + related consolidated artifacts.
AGENTS.md Removes v4-specific step-by-step guide and replaces with pointers to docs/v4/.
.agent-plan.md Updates v4 milestone/checklist references to match the consolidated doc structure and spike script.
docs/v4/lead_scoring_v4_requirements.md Deleted (content consolidated into docs/v4/design.md).
docs/v4/dataset_contract.md Deleted (content consolidated into docs/v4/design.md).
docs/v4/engine_changes_spec.md Deleted (content consolidated into docs/v4/design.md).
docs/v4/implementation_plan.md Deleted (content consolidated into docs/v4/design.md).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread docs/v4/design.md
Comment on lines +23 to +42
Include an `expected_acv` numeric feature so students can compute `P(conversion) × expected_acv` and practice value-aware ranking.

**ACV derivation (single source of truth):**

| Condition | Value |
|---|---|
| Opportunity created by snapshot day | Opportunity's `estimated_acv` |
| No opportunity, `estimated_revenue_band` known | Band midpoint (see table below) |
| No opportunity, band unknown | NaN |

**Revenue band → ACV midpoint mapping:**

| Band | Midpoint ($k) |
|---|---|
| $1M–$10M | 25 |
| $10M–$50M | 55 |
| $50M–$200M | 85 |
| $200M+ | 140 |

These midpoints are derived from the engine's `_EMPLOYEE_ACV_RANGES` in `simulation/engine.py`, which maps employee bands to ACV ranges. Since the dataset exposes `estimated_revenue_band` (not employee band), the midpoints approximate the overlap between revenue bands and the engine's ACV sampling.

Copilot AI Apr 29, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ACV midpoint mapping table labels values as "Midpoint ($k)", but the engine’s OpportunityRow.estimated_acv values are in raw USD (see _EMPLOYEE_ACV_RANGES in leadforge/simulation/engine.py). Please clarify the units for expected_acv (USD vs thousands) and ensure the midpoint values/implementation match that unit consistently.

Suggested change
Include an `expected_acv` numeric feature so students can compute `P(conversion) × expected_acv` and practice value-aware ranking.
**ACV derivation (single source of truth):**
| Condition | Value |
|---|---|
| Opportunity created by snapshot day | Opportunity's `estimated_acv` |
| No opportunity, `estimated_revenue_band` known | Band midpoint (see table below) |
| No opportunity, band unknown | NaN |
**Revenue band → ACV midpoint mapping:**
| Band | Midpoint ($k) |
|---|---|
| $1M–$10M | 25 |
| $10M–$50M | 55 |
| $50M–$200M | 85 |
| $200M+ | 140 |
These midpoints are derived from the engine's `_EMPLOYEE_ACV_RANGES` in `simulation/engine.py`, which maps employee bands to ACV ranges. Since the dataset exposes `estimated_revenue_band` (not employee band), the midpoints approximate the overlap between revenue bands and the engine's ACV sampling.
Include an `expected_acv` numeric feature in raw USD so students can compute `P(conversion) × expected_acv` and practice value-aware ranking.
**ACV derivation (single source of truth):**
| Condition | Value |
|---|---|
| Opportunity created by snapshot day | Opportunity's `estimated_acv` |
| No opportunity, `estimated_revenue_band` known | Band midpoint in raw USD (see table below) |
| No opportunity, band unknown | NaN |
**Revenue band → ACV midpoint mapping:**
| Band | Midpoint (USD) |
|---|---|
| $1M–$10M | 25000 |
| $10M–$50M | 55000 |
| $50M–$200M | 85000 |
| $200M+ | 140000 |
These midpoints are derived from the engine's `_EMPLOYEE_ACV_RANGES` in `simulation/engine.py`, which maps employee bands to ACV ranges. Since the dataset exposes `estimated_revenue_band` (not employee band), the midpoints approximate the overlap between revenue bands and the engine's ACV sampling. The fallback midpoint values are expressed in raw USD so they use the same unit as `OpportunityRow.estimated_acv` and keep `expected_acv` consistent regardless of derivation path.

Copilot uses AI. Check for mistakes.
Comment thread docs/v4/design.md
Comment on lines +80 to +101
## Target column set

| # | Column | Type | Source | Notes |
|---|---|---|---|---|
| 1 | `industry` | categorical | account | 4 values |
| 2 | `region` | categorical | account | US, UK |
| 3 | `company_size` | categorical | account | 4 bands |
| 4 | `company_revenue` | categorical | account | 4 bands |
| 5 | `contact_role` | categorical | contact | 4 roles |
| 6 | `seniority` | categorical | contact | 5 levels (~8% missing for partner_referral) |
| 7 | `lead_source` | categorical | lead | 3 channels |
| 8 | `opportunity_created` | binary 0/1 | derived | Opp opened by snapshot day |
| 9 | `demo_completed` | binary 0/1 | derived | Demo done by snapshot day |
| 10 | `expected_acv` | numeric | derived | See R1 ACV derivation table |
| 11 | `inbound_touches` | integer | events ≤ snapshot | Inbound touchpoints |
| 12 | `outbound_touches` | integer | events ≤ snapshot | Outbound touchpoints |
| 13 | `touches_week_1` | integer | events ≤ day 7 | First-week touch intensity |
| 14 | `web_sessions` | integer | events ≤ snapshot | Sessions (~15% missing for outbound) |
| 15 | `sales_activities` | integer | events ≤ snapshot | Sales activities count |
| 16 | `days_since_last_touch` | float | events ≤ snapshot | Natural NaN when no touches |
| 17 | `total_touches_all` | integer | **ALL events** | Leakage trap — full 90-day window |
| 18 | `converted` | binary 0/1 | target | Converted within 90 days |

Copilot AI Apr 29, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Within this doc, the target column set uses names like company_size, company_revenue, contact_role, and converted, but the engine-facing sections and spike refer to current bundle/snapshot names like employee_band, estimated_revenue_band, role_function, and converted_within_90_days. Since this file is intended as the single source of truth, add an explicit mapping (raw bundle column → final CSV column) or standardize on one naming scheme throughout to avoid implementation/validation mismatches.

Copilot uses AI. Check for mistakes.
Comment thread docs/v4/design.md
Comment on lines +37 to +39
| $1M–$10M | 25 |
| $10M–$50M | 55 |
| $50M–$200M | 85 |

Copilot AI Apr 29, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several band labels in the mapping/correlation tables use an en-dash ($1M–$10M, $10M–$50M, etc.), but the engine’s estimated_revenue_band values use a plain hyphen ($1M-$10M, …). If these strings are meant to be used as literal keys (e.g., in YAML config / lookup tables), please normalize them to the exact engine labels to avoid hard-to-spot mismatches.

Suggested change
| $1M$10M | 25 |
| $10M$50M | 55 |
| $50M$200M | 85 |
| $1M-$10M | 25 |
| $10M-$50M | 55 |
| $50M-$200M | 85 |

Copilot uses AI. Check for mistakes.
)


def run_pipeline(label: str, gen: Generator, scale: float | None = None) -> None:

Copilot AI Apr 29, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

run_pipeline is annotated as returning None, but it returns auc (a float). Update the return annotation to float (or stop returning a value) so type checkers and readers match the actual behavior.

Suggested change
def run_pipeline(label: str, gen: Generator, scale: float | None = None) -> None:
def run_pipeline(label: str, gen: Generator, scale: float | None = None) -> float:

Copilot uses AI. Check for mistakes.
3. Correlated observables (1.8x boost) — stronger correlation

Reports category spread (max - min conversion rate) per categorical feature
and logistic regression AUC at day-21 snapshot.

Copilot AI Apr 29, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The script header/docstring says results are measured on a day-21 snapshot, but run_pipeline() calls build_snapshot(sim, pop) without any day-21 gating (the current build_snapshot aggregates over the full simulation output). Either update the description/output text to reflect the actual window used, or implement explicit day-21 filtering in this spike so the reported AUC/spreads correspond to the intended snapshot day.

Suggested change
and logistic regression AUC at day-21 snapshot.
and logistic regression AUC from the snapshot produced by the current
simulation output aggregation.

Copilot uses AI. Check for mistakes.
Comment on lines +160 to +161
config = gen._world_spec.config
narrative = gen._world_spec.narrative

Copilot AI Apr 29, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This spike reaches into Generator private state (gen._world_spec.*). Since Generator exposes public config and world_spec properties, prefer those to avoid coupling the script to internal implementation details.

Suggested change
config = gen._world_spec.config
narrative = gen._world_spec.narrative
world_spec = gen.world_spec
config = gen.config
narrative = world_spec.narrative

Copilot uses AI. Check for mistakes.
Comment on lines +106 to +108
feature_cols = [c for c in df.columns if c != "converted_within_90_days"]
x_df = df[feature_cols].copy()
y = df["converted_within_90_days"].astype(int)

Copilot AI Apr 29, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

measure_auc() trains on feature_cols = [c for c in df.columns if c != 'converted_within_90_days'], which will include opaque identifiers and meta columns (e.g., lead_id, account_id, contact_id, lead_created_at, current_stage, is_sql). Including these can make the spike AUC hard to interpret/compare against the planned v4 column set (and can spuriously inflate train AUC). Consider restricting to the intended v4 feature subset (or explicitly dropping IDs/timestamps/outcome-adjacent columns) before fitting.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants