Skip to content

feat: Milestone 7 — simulation engine (state, daily loop, conversion, events)#12

Merged
shaypal5 merged 2 commits into
mainfrom
feat/milestone-7-simulation-engine
Apr 28, 2026
Merged

feat: Milestone 7 — simulation engine (state, daily loop, conversion, events)#12
shaypal5 merged 2 commits into
mainfrom
feat/milestone-7-simulation-engine

Conversation

@shaypal5

Copy link
Copy Markdown
Contributor

Summary

  • simulation/state.py: LeadSimState — per-lead mutable state tracking current stage, dwell days, conversion/churn flags, and the first day reaching sql (used to anchor opportunity timestamps)
  • simulation/engine.py: simulate_world(config, population, world_graph) → SimulationResult — 90-day discrete-time loop. Each day per non-terminal lead: churn check (0.4%/day → closed_lost), stage advance via HazardTransition (mql → … → negotiation), final close via ConversionHazard (negotiation → closed_won), then touch/session/sales-activity emission. Post-loop creates opportunities (all sql+ leads), customers, and subscriptions (converted leads).
  • converted_within_90_days is event-derived — it is True only when the simulated trajectory reaches closed_won within the 90-day horizon (never directly sampled).
  • 45 new tests covering state transitions, FK integrity, determinism, motif-family variation, and conversion-rate ordering (fit-dominant > buying-committee-friction)
  • Total: 490 tests passing, ruff clean, mypy clean

Test plan

  • LeadSimState unit tests: advance, mark_converted, mark_churned, dwell reset, sql_day recording
  • SimulationResult type correctness: all lists contain correct row types
  • Determinism: same seed → identical outcome lists
  • Lead outcomes: all leads are MQL, converted leads have timestamps and are at closed_won, some leads convert and some don't
  • Opportunity / customer / subscription FK integrity and counts match conversion counts
  • Touch / session / activity lead FK and ID uniqueness
  • All 5 motif families run without error
  • Conversion-rate ordering (fit-dominant vs buying-committee-friction at n=300)
  • _plan_from_acv tier boundaries

🤖 Generated with Claude Code

…ion)

- simulation/state.py: LeadSimState tracks stage, dwell, converted/churned
  flags, conversion_day, and sql_day (for opportunity anchor)
- simulation/engine.py: simulate_world() runs 90-day daily-step loop;
  HazardTransition advances stages, ConversionHazard fires closed_won from
  negotiation, daily churn produces closed_lost; emits touches (Recency-
  DecayIntensity), sessions (30% of touch-days), and sales activities (20%
  of active-stage days); post-loop creates opportunity, customer, and
  subscription rows for qualifying leads
- 45 new tests covering state transitions, FK integrity, determinism,
  motif-family variation, and conversion-rate ordering
- Total: 490 tests passing

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 28, 2026 07:20
@shaypal5 shaypal5 added type: feature New capability layer: simulation simulation/ discrete-time engine labels Apr 28, 2026
@github-actions

This comment has been minimized.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds the Milestone 7 discrete-time (daily) simulation engine that evolves leads through funnel stages, emits CRM-like events, and materializes downstream entities (opportunities/customers/subscriptions), with a new state model and an accompanying test suite.

Changes:

  • Introduces LeadSimState for per-lead mutable simulation state (stage, dwell, conversion/churn flags, sql-day anchoring).
  • Implements simulate_world(...) 90-day loop that applies churn + stage/conversion hazards and emits touches/sessions/sales activities, then creates opportunities/customers/subscriptions.
  • Adds a comprehensive tests/simulation/test_engine.py covering determinism, FK integrity, motif-family behavior, and tier boundaries.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

File Description
leadforge/simulation/engine.py New simulation engine + post-sim entity creation + helper utilities
leadforge/simulation/state.py New LeadSimState dataclass used by the engine loop
tests/simulation/test_engine.py New unit/integration-style tests validating engine/state behavior
.agent-plan.md Updates milestone status and next-task plan

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread leadforge/simulation/engine.py Outdated
Comment thread tests/simulation/test_engine.py Outdated
Comment thread leadforge/simulation/engine.py Outdated
Comment thread leadforge/simulation/engine.py
Comment thread leadforge/simulation/engine.py Outdated
Comment thread leadforge/simulation/state.py
@github-actions

This comment has been minimized.

- Precompute lead_dates dict before main loop (avoid ISO parse on every day×lead)
- Split sim_rng into simulation_transitions / simulation_events / simulation_post_sim
  substreams so unrelated outputs are stable when logic changes (COPILOT-3)
- Narrow conv_ts to str (not str|None) inside state.converted block; remove
  type: ignore[arg-type] comments (COPILOT-4, COPILOT-5)
- state.py: broaden advance_stage() to set sql_day for any SQL-or-deeper stage
  (not just "sql"); fix docstring to match (COPILOT-6)
- test: replace flaky count comparison with lead-signature assertion for
  different-seeds-differ test (COPILOT-2)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown

pr-agent-context report:

This run includes unresolved review comments on PR #12.

For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.

After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, and push all of these changes in a single commit.

# Copilot Comments

## COPILOT-1
Location: leadforge/simulation/engine.py
URL: https://github.com/leadforge-dev/leadforge/pull/12#discussion_r3152275834
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    `date.fromisoformat(lead.lead_created_at)` is executed inside the inner (day × lead) loop, even though `lead_created_at` is constant per lead. For the default scale (e.g., 5k leads × 90 days) this adds avoidable parsing overhead. Consider precomputing `lead_created_date` once per lead before the main loop (or caching it alongside `states`) and then only doing the `timedelta(days=t)` addition per day.

## COPILOT-2
Location: tests/simulation/test_engine.py
URL: https://github.com/leadforge-dev/leadforge/pull/12#discussion_r3152275878
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    `test_different_seeds_differ` asserts different seeds produce different *counts* of touches/sessions, but it’s still possible (by chance) for both counts to match, making the test flaky. Consider asserting a stronger signal (e.g., compare a small slice of serialized rows, or check that at least one of: touch types, timestamps, conversion flags, etc. differs), or increase `n_leads` enough that identical counts become effectively impossible.
    ~~~suggestion
            r1 = _run_sim(seed=1, n_leads=200)
            r2 = _run_sim(seed=2, n_leads=200)
            lead_signature_1 = [
                (
                    row.lead_id,
                    row.converted_within_90_days,
                    row.current_stage,
                    row.conversion_timestamp,
                )
                for row in r1.leads
            ]
            lead_signature_2 = [
                (
                    row.lead_id,
                    row.converted_within_90_days,
                    row.current_stage,
                    row.conversion_timestamp,
                )
                for row in r2.leads
            ]
            assert (
                lead_signature_1 != lead_signature_2
                or len(r1.touches) != len(r2.touches)
                or len(r1.sessions) != len(r2.sessions)
                or len(r1.opportunities) != len(r2.opportunities)
            )
    ~~~

## COPILOT-3
Location: leadforge/simulation/engine.py
URL: https://github.com/leadforge-dev/leadforge/pull/12#discussion_r3152275907
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    `sim_rng` is used for *all* simulation randomness (daily events, stage hazards, and post-loop ACV sampling). This couples unrelated outputs: e.g., changing touch/session logic will shift the RNG state and therefore change ACV/opportunity values, producing large diffs and making components less stream-independent. Consider splitting into multiple `RNGRoot(config.seed).child(...)` substreams (e.g., transitions/churn, events, and post_sim/acv) to keep outputs stable when unrelated parts change (similar to `build_population()` using separate `population_*` streams).
    ~~~suggestion
        transition_rng = root.child("simulation/transitions")
        event_rng = root.child("simulation/events")
        post_sim_rng = root.child("simulation/post_sim")
        acv_rng = post_sim_rng.child("acv")
        sim_rng = event_rng
    ~~~

## COPILOT-4
Location: leadforge/simulation/engine.py:334
URL: https://github.com/leadforge-dev/leadforge/pull/12#discussion_r3152275932
Root author: copilot-pull-request-reviewer

Comment:
    `conv_ts` is computed from `state.conversion_day` but remains typed as `str | None`, and later gets passed into `CustomerRow.customer_start_at` / `SubscriptionRow.subscription_start_at` via `# type: ignore[arg-type]`. Since this block only runs when `state.converted` is true, consider narrowing `conv_ts` to a non-optional `str` (e.g., `assert conv_ts is not None` or a separate `conv_ts_str`) so the ignores can be removed and a future refactor can’t silently introduce `None` values.

## COPILOT-5
Location: leadforge/simulation/engine.py
URL: https://github.com/leadforge-dev/leadforge/pull/12#discussion_r3152275955
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    In the `SubscriptionRow(...)` call, `subscription_status` is mis-indented relative to the other keyword args. While it won’t change semantics inside parentheses, it hurts readability and is easy to miss in reviews; consider aligning it with the other fields (and, if you apply the `conv_ts` narrowing, drop the `# type: ignore[arg-type]` here as well).
    ~~~suggestion
                            subscription_status="active",
    ~~~

## COPILOT-6
Location: leadforge/simulation/state.py:80
URL: https://github.com/leadforge-dev/leadforge/pull/12#discussion_r3152275979
Root author: copilot-pull-request-reviewer

Comment:
    `LeadSimState.sql_day` docstring says it records the first day the lead reached the `sql` stage “(or equivalent deeper stage)”, but `advance_stage()` only sets `sql_day` when `new_stage == "sql"`. Either update the docstring to match the implementation, or broaden the condition to record the first transition into any SQL-or-deeper stage so opportunity timestamps remain correct even if the stage sequence changes later.

Run metadata:

Tool ref: v4
Tool version: 4.0.20
Trigger: commit pushed
Workflow run: 25040688823 attempt 1
Comment timestamp: 2026-04-28T07:50:13.541028+00:00
PR head commit: a1723babd2f66c8863f692a9357aebafce17cf8d

@shaypal5 shaypal5 merged commit 94fb07a into main Apr 28, 2026
5 checks passed
@shaypal5 shaypal5 deleted the feat/milestone-7-simulation-engine branch April 28, 2026 07:55
shaypal5 added a commit that referenced this pull request May 7, 2026
Drafts the rubric document the driver feeds to Claude. Structured as
a parseable file with <system_prompt> and <user_cue> section markers
the driver splits on; the input bundle is concatenated between them.

Fourteen rubric dimensions (D1-D14) covering documentation
truthfulness, leakage discipline, realism vs disclosure, difficulty
signal, calibration / value-aware ranking, cohort and time-window
discipline, notebook integrity, platform packaging hygiene,
adversarial-framing completeness, pedagogy of the documented trap,
effective semantic diversity (recommendation #12 v1 scope),
Datasheets-for-Datasets composition, manifest and provenance
integrity, and an out-of-scope guard. Every finding cites which
dimension surfaced it via rubric_dimension so reviewers can audit
clustering. Category vocabulary is locked to the nine break_me_guide
triage labels so findings route into existing labels without
translation. Severity calibration and style guide explicitly written
to discourage re-deriving the existing nine adversarial patterns and
to push for concrete, quotable evidence on every finding.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
shaypal5 added a commit that referenced this pull request May 9, 2026
* PR 7.1: design decisions for LLM critique module

Records the load-bearing design calls before any code lands so the
implementation, the rubric prompt, and the driver have one source of
truth. Covers the nine design questions from the PR brief: provider
abstraction (single, Anthropic only), skip-cleanly behavior on missing
ANTHROPIC_API_KEY, model + thinking + caching posture for Opus 4.7,
JSON output schema (frozen dataclass with 9-value category vocabulary
matching break_me_guide.md triage labels), input-bundle composition
(intermediate tier only, BANNED_* constants live-referenced for the
diff summary, no latent truth), determinism via provenance instead of
fake temperature=0, CLI flags mirroring validate_release_candidate.py,
test posture (no live API), and first-run adjudication workflow.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* PR 7.1: LLM critique rubric prompt

Drafts the rubric document the driver feeds to Claude. Structured as
a parseable file with <system_prompt> and <user_cue> section markers
the driver splits on; the input bundle is concatenated between them.

Fourteen rubric dimensions (D1-D14) covering documentation
truthfulness, leakage discipline, realism vs disclosure, difficulty
signal, calibration / value-aware ranking, cohort and time-window
discipline, notebook integrity, platform packaging hygiene,
adversarial-framing completeness, pedagogy of the documented trap,
effective semantic diversity (recommendation #12 v1 scope),
Datasheets-for-Datasets composition, manifest and provenance
integrity, and an out-of-scope guard. Every finding cites which
dimension surfaced it via rubric_dimension so reviewers can audit
clustering. Category vocabulary is locked to the nine break_me_guide
triage labels so findings route into existing labels without
translation. Severity calibration and style guide explicitly written
to discourage re-deriving the existing nine adversarial patterns and
to push for concrete, quotable evidence on every finding.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* PR 7.1: leadforge/validation/llm_critique.py — module core

The LLM-critique core. Sits one layer below the driver
(scripts/run_llm_critique.py, lands in the next commit) and provides:

- LLMCritiqueClient protocol + default Anthropic implementation. The
  Anthropic client lazy-imports the SDK so the module imports cleanly
  on machines without anthropic installed (skip-cleanly path needs to
  work even without the SDK).
- has_anthropic_credentials / api_key_or_skip — the env-var gate.
  Treats unset and empty-after-strip identically as "absent" since
  shells routinely set ANTHROPIC_API_KEY="" via env -i or stale
  .envrc files.
- parse_rubric_prompt — splits the rubric file on
  <system_prompt>/<user_cue> markers. Surrounding prose is ignored.
- build_input_bundle — assembles the eleven blocks the design doc
  pins (README, dataset_card, generation_method, manifest,
  feature_dictionary, validation_report.{md,json}, test-split sample
  rendered as CSV, public/instructor diff summary, public-safe
  mechanism summary, break-me guide). Public/instructor diff is
  live-derived from the BANNED_LEAD_COLUMNS / BANNED_OPP_COLUMNS /
  BANNED_TABLES / SNAPSHOT_FILTERED_TABLES constants in
  leakage_probes.py — single source of truth, auto-stays-in-sync. The
  mechanism summary names motif families and difficulty knobs *names*
  only, never values, matching the student_public redaction posture.
- Pure builder: same release_dir → byte-identical bytes → identical
  sha256. Per-source-file hashes carried for audit-artifact-sync.
- parse_critique_response — schema validator. Rejects malformed JSON,
  wrong types, severities outside {high,medium,low}, categories
  outside the nine break_me_guide labels, rubric dimensions outside
  D1-D14, finding-id collisions, missing required fields. Returns
  every problem in one error rather than the first one only.
- render_markdown_summary — the "latest run, at a glance" file
  (single canonical filename so dataset-card links don't rot).
- raw_output_path / summary_output_path — timestamped raw JSON
  accumulates per run, summary overwrites in place.

Default Anthropic call uses adaptive thinking with display=summarized
(only thinking mode supported on Opus 4.7), effort=high (recommended
minimum for intelligence-sensitive work per claude-api skill), two
prompt-cache breakpoints (rubric + input bundle, per the design
doc's caching strategy), and stream + get_final_message to dodge
the 10-min idle-connection timeout on long adaptive-thinking
responses.

ruff + mypy clean; full leadforge/ mypy pass: 83 files clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* PR 7.1: scripts/run_llm_critique.py — driver

CLI + filesystem glue around leadforge.validation.llm_critique. Mirrors
scripts/validate_release_candidate.py's posture: free-function
parse_args, frozen DriverConfig, run_critique(config) → DriverResult,
main(argv) returning an exit code.

Skip-cleanly path triggers BEFORE any I/O — no rubric read, no bundle
build, no out-dir creation. Tests can pin this via has_anthropic_credentials
returning False on a controlled env dict.

Three modes alongside the live path:
- --dry-run writes the rendered input bundle to
  <out-dir>/llm_critique_input_<ts>.md so a maintainer can inspect
  what gets sent to Claude. Different output filename from the real
  raw JSON, can't be confused.
- --no-execute calls api_key_or_skip + build_anthropic_client to prove
  the SDK is installed and creds are present, then exits without
  writing or calling the API. CI smoke gate.
- --out-tag suffixes the raw JSON filename so adjudication re-runs
  (re-run after fixing a finding) don't shadow the canonical run.

Exit codes:
- 0 — pass (skip-cleanly counts as pass; no high-severity findings)
- 1 — high-severity finding(s) present and unresolved
- 2 — pre-flight error or schema-validation failure on the LLM
      response (every problem rendered, not just the first)

Adjudication is the maintainer's responsibility *after* the driver
exits with code 1: resolve the finding in code OR log to
v2_decision_log.md, then re-run. The next critique's exit code is
the gate.

ruff + mypy clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* PR 7.1: tests for llm_critique module + driver (54 cases, no live API)

Two new files: tests/validation/test_llm_critique.py for the module,
tests/scripts/test_run_llm_critique.py for the driver. No live API
calls; the LLMCritiqueClient protocol is exercised via a small in-
process canned-response fake.

Module coverage (43 cases):
- has_anthropic_credentials / api_key_or_skip — unset, empty,
  whitespace-only, real value, strip semantics. Covers the
  shells-set-empty-string-via-env-i case explicitly.
- parse_rubric_prompt — both sections extracted, missing system
  prompt raises, missing user cue raises, plus a smoke test against
  the actual checked-in rubric file (skipped if absent).
- build_input_bundle — same release_dir → byte-identical bytes
  (sha256, bundle_hashes, rendered text); block order pinned to
  README first / break-me last with eleven blocks total; missing
  input raises FileNotFoundError; CSV-rendered test split has the
  expected row count; per-file hashes carry every input.
- Sync test: the live-derived public/instructor diff summary names
  every banned-column / banned-table constant from leakage_probes.py
  — guarantees the diff stays in sync if the leakage contract
  changes.
- parse_critique_response — eleven malformations pinned: missing
  required field, wrong severity, wrong category, wrong rubric
  dimension, finding-id collision, findings non-list, top-level
  non-object, non-JSON, code-fence stripping (defensive), score out
  of range, empty findings list valid.
- has_unresolved_high_severity — high flags, medium doesn't, no
  findings doesn't.
- Vocabulary alignment: every VALID_CATEGORIES entry appears in
  break_me_guide.md; rubric dimensions are exactly D1-D14;
  severities are exactly the three values.
- Round-tripping result_to_dict / result_to_json (stable, sorted).
- render_markdown_summary groups findings by severity, hashes table
  renders, no-findings placeholder shows.
- Output filenames: timestamped raw, --out-tag suffix, canonical
  summary.

Driver coverage (11 cases):
- Skip-cleanly path: env unset / empty → skipped, no I/O, no
  out-dir created.
- Live happy path with canned client: both files written, raw JSON
  parses back to the same overall_score, summary lands at canonical
  filename.
- High-severity finding still writes both files (so the maintainer
  can adjudicate); has_unresolved_high_severity flips True.
- --out-tag suffixes the raw filename.
- --dry-run writes only the input bundle, not the raw / summary.
- Schema-validation failure → main() returns 2, stderr says
  "schema-validation error".
- main() exit-code policy: pass returns 0, high-severity returns 1,
  skip-cleanly returns 0 with SKIPPED on stdout, missing
  --release-dir / --prompt returns 2 with "pre-flight" on stderr.

54/54 pass; ruff + mypy clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* PR 7.1: .agent-plan.md close-out narrative

Phase 7 PR 7.1 entry follows the Phase 6 PR 6.1/6.2/6.3 entry format:
dense paragraph with all the load-bearing decisions and validation
results inline. Calls out the live-first-run deferral (no
ANTHROPIC_API_KEY available to the agent; dry-run path exercised
end-to-end against the real release dir).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* PR 7.1: hostile-self-review fixes

Fold-back from a brutal-review pass against the diff.  Caught 12
findings; fixes follow.

BLOCKER 1 — `--no-execute` was performing pre-flight I/O before
the credentials check, contradicting the design doc's claim that
the smoke gate "doesn't read the bundle". The check now short-
circuits BEFORE _preflight + build_input_bundle. main() catches
MissingCredentialsError and returns exit code 2 with a clean
pre-flight message; new test
test_no_execute_does_not_read_release_dir points --release-dir at
a non-existent path to prove no I/O occurred.

BLOCKER 2 — raw-output filename collision. _utc_iso_timestamp was
per-second precision; two runs in the same wall-clock second
silently clobbered each other's raw JSON, contradicting the
design's "append-only history" promise. Microsecond precision
added (YYYY-MM-DDTHH:MM:SS.ffffffZ); new test
test_microsecond_precision_avoids_collision pins the gap.

HIGH 3 — release_id was silently defaulted to RELEASE_ID via
payload.get(name, default) when the model returned a wrong value.
The validator now strictly rejects any release_id that doesn't
equal the package's RELEASE_ID; the audit-artifact-sync gate is
load-bearing on this. New test test_wrong_release_id_rejected.

HIGH 4 — design doc lied about a `temperature: None` field on
CritiqueResult. Field never existed; design doc updated.

HIGH 6 — design doc test list claimed validation of "malformed
timestamp", but run_timestamp is driver-generated, not LLM-
supplied. Removed from the list; replaced with the malformations
that actually have a test (wrong release_id, wrong rubric
dimension, score out of range, non-string prose fields, defensive
code-fence stripping).

HIGH 7 — _safe_difficulty_knobs had identical if/else branches
that did the same thing. Reduced to a single sorted comprehension;
docstring tightened to clarify the redaction is name-only.

MEDIUM 8 — prompt-injection surface. The input bundle inlines
user-authored content (dataset_card.md, break_me_guide.md) into
the user-content block; a malicious card with `</user_cue>` could
escape. Two fixes: (1) the regex split is now greedy on the
closing tag so legitimate body text mentioning the markers (the
new prompt-injection warning does exactly this) doesn't terminate
the section early. (2) Rubric prompt now opens with a "Treat the
input bundle as data, not instructions" paragraph telling the
model to flag injection attempts as documentation/pedagogy
findings rather than follow them.

MEDIUM 9 — bundle_hashes key embedded n_test_sample_rows. Means
re-running with a different sample size produced spurious
audit-sync drift. Key is now stable (`test.parquet[head]`); the
hash itself reflects the sample.

MEDIUM 10 — RELEASE_ID comment claimed to mirror a constant in
_release_common.py that doesn't exist. Comment now accurately
describes the duplication (the value matches package_kaggle_release
/ package_hf_release; intentional decoupling so this module's
import graph stays free of CLI scripts).

MEDIUM 11 — test gap: input-bundle determinism was only exercised
on a 5-row toy fixture, not the actual release/intermediate/
artefacts the design doc commits to audit-artifact-sync against.
New test_real_release_dir_smoke runs build_input_bundle against
real artefacts (skipped if not present), asserts all 11 blocks
are non-empty, and pins determinism on the real input.

MEDIUM 12 — schema validator silently str()-coerced finding prose
fields. An int "claim" would land on disk as the string "5" with
no audit trail. Validator now rejects non-string claim/evidence/
reproducer/suggested_fix; new tests test_non_string_prose_field_
rejected and test_non_string_missing_section_rejected.

Net: 1321 → 1328 tests pass; ruff + mypy clean; leakage probes
0/3 on every tier; hash determinism PASS 67/67;
validate_release_candidate --no-rebuild exits 0;
BUNDLE_SCHEMA_VERSION unchanged at 5; validation_report timestamp
drift reverted before commit.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* PR 7.1: agent-plan close-out updated for self-review fold-back

Test count corrected to 1321 (was claiming 1314); paragraph extended
with the hostile self-review pass that caught and folded back twelve
findings against the diff (2 BLOCKERs, 5 HIGHs, 5 MEDIUMs) before
requesting review.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* PR 7.1: second senior-dev review fold-back (9 issues)

After PR #76 was opened, ran a second hostile-reviewer pass focused
on architectural choices the first pass missed.  Real bugs and
substantial cuts.

REAL BUGS (not nits):

B1. --out-tag suffixed only the raw JSON. The summary Markdown
(llm_critique_summary.md) was overwritten on adjudication runs,
clobbering the canonical run's at-a-glance summary. Fix:
summary_output_path takes a `tag` parameter and suffixes the
filename when set.  New test test_out_tag_suffixes_both_raw_and_summary
pins both files getting the suffix and the canonical summary
NOT being written.

B2. skip-cleanly silently passed the release-readiness gate.
v1_release_roadmap.md line 35 makes "no unresolved high-severity
findings" a hard acceptance criterion — but if ANTHROPIC_API_KEY
was unset, CI passed with no critique having run.  Added
--require-execute flag (default off; release-readiness CI sets
it) that converts the skip path into MissingCredentialsError →
exit 2.  Also added a loud stderr WARNING on the regular skip
path so a maintainer reading CI logs notices.  Two new tests:
test_require_execute_fails_loud_on_missing_key and
test_main_warns_loudly_when_skipping.

ARCHITECTURAL FIXES:

A1. The "audit-artifact-sync" framing in code and docs was wrong.
A real audit-artifact-sync (PR 4.1 / 5.1 / 5.2 pattern) commits a
frozen artefact and asserts byte-identity on rebuild. What I had
was just "build twice, assert hashes equal" — that's determinism,
not audit-sync. Renamed throughout to "smoke test against the
real release dir" / "staleness check vs committed result". The
test name (test_real_release_dir_smoke) was already correct from
the first pass; the docstrings and module comments were the
remaining surface.

A2. Two prompt-cache breakpoints cut to one. System content
sits inside the cached prefix on messages.create (render order:
system → messages). A second breakpoint at end-of-system bought
nothing and burned a cache_control slot. One breakpoint at end
of input bundle is correct; rubric edits and bundle edits both
invalidate the same slot, which is what we want.

CUTS:

M1. Design doc cut from 394 lines to 73. The 9-decision table
replaces the multi-paragraph rationale-per-call shape that read
as documentation theater. Working notes, not a maintained
document.

M2. Rubric cut from 420 lines to ~210. Each of the 13 dimensions
now one paragraph (3-5 sentences) instead of 3-6 paragraphs. D14
("out-of-scope guard") was meta-instruction not a real
dimension; converted to a "What is NOT yours to audit" appendix
at the end of the rubric. VALID_RUBRIC_DIMENSIONS updated to
D1-D13; sync test updated.

M3. Test-split sample: 100 raw rows of CSV replaced with
df.describe(include="all") per-column statistics + a 20-row
head. The model can't draw distributional conclusions from raw
rows; statistics carry the signal. Rendered input bundle dropped
from 148KB to 128KB.

M5. messages.stream(...).get_final_message() replaced with
messages.create(timeout=600.0). The streaming was defensive
theater — no stream events were processed. The actual contract
("don't time out on long adaptive-thinking responses") is
spelled correctly with an explicit timeout.

M6. render_input_bundle_text free function moved to
InputBundle.render() method. Leaky abstraction; the function
was just iterating over the bundle's blocks. Tests and driver
updated to call .render(); free function removed from __all__.

Won't-fix (recorded for completeness):
A3. MissingCredentialsError as a custom class — kept. Lets the
driver catch precisely "env-var missing" without filtering
RuntimeError by message string.
A4. result_to_dict / result_to_json moved to methods — kept as
free functions to match the existing release_quality.py pattern
(report_to_dict / report_to_json are also free functions there).
M4. Remove --out-tag — kept since adjudication re-runs benefit
from a stable suffix the maintainer chooses, not just a
microsecond timestamp.

Net: 1323/1323 tests pass + 5 publish-extra-gated skips; ruff
+ mypy clean; leakage probes 0/3 on every tier; hash determinism
PASS 67/67; validate_release_candidate --no-rebuild exits 0;
BUNDLE_SCHEMA_VERSION unchanged at 5; validation_report timestamp
drift reverted before commit. Dry-run smoke against real release/
produced 128KB byte-stable input bundle (down from 148KB).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* PR 7.1: address Copilot review (COPILOT-1, COPILOT-2 + TAGALONG-1)

Two real issues from copilot-pull-request-reviewer's inline review,
plus a tag-along inconsistency I found while verifying.

COPILOT-1 — _render_public_safe_mechanism_summary hardcoded
"(intermediate tier)" and called _safe_difficulty_knobs(..., "intermediate")
even though build_input_bundle accepts a tier parameter and the
driver exposes --tier. Running with --tier advanced produced an
"intermediate tier" header with the intermediate knob list, which
is wrong on two axes for the requested tier.

  Fix: thread tier through. _render_public_safe_mechanism_summary
  now takes tier as an argument, the header renders f"({tier} tier)",
  and _safe_difficulty_knobs gets the actual tier. New test
  test_mechanism_summary_tracks_requested_tier pins the behavior on
  both intermediate and advanced.

COPILOT-2 — env override was partially ineffective. run_critique
honored env for has_anthropic_credentials / api_key_or_skip but
build_anthropic_client() called anthropic.Anthropic() with no args,
which falls back to process-global os.environ. So a test that
passed env={"ANTHROPIC_API_KEY": "fake"} would have its env
silently ignored on the SDK side.

  Fix: build_anthropic_client(api_key=None) now accepts an explicit
  key. The driver resolves the key from env via api_key_or_skip and
  passes it through. Both the live path and --no-execute use the
  resolved key. New test test_env_override_is_passed_to_anthropic_client
  stubs build_anthropic_client and asserts the api_key argument
  matches the injected env, with the process env explicitly cleared
  so a leak would fail loud.

TAGALONG-1 — Decision-4 row in docs/release/llm_critique_design.md
still said "rubric_dimension (D1–D14)" but the second-pass cut
reduced the rubric to D1-D13 and updated VALID_RUBRIC_DIMENSIONS
in lockstep. One-character fix.

Resolved as outdated (no fix needed):

COPILOT-3, 4, 5 — three Copilot comments against the OLD 394-line
design doc that claimed: skip-cleanly prints to stderr; driver
prints a progress dot per streamed chunk; output schema sketch
shows bundle_hashes as dict[tier→sha]. The 73-line replacement
design doc (M1 in the second-pass cut) doesn't make any of those
claims. The current driver streams nothing (M5 fix:
messages.create with explicit timeout) and the schema sketch was
removed entirely. pr-agent-context already flagged all three as
Status: outdated; resolving the threads via GraphQL after this
push.

Net: 1325/1325 tests pass + 5 publish-extra-gated skips; ruff +
mypy clean; leakage probes 0/3 on every tier; hash determinism
PASS 67/67; validate_release_candidate --no-rebuild exits 0;
BUNDLE_SCHEMA_VERSION unchanged at 5; validation_report timestamp
drift reverted before commit.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* PR 7.1: mypy override for the lazy anthropic import (fixes CI type-check)

CI's Type-check job runs without the anthropic SDK installed (it's
not in the dev extras), so the lazy ``import anthropic`` inside
``build_anthropic_client`` was failing with ``import-not-found``.
Added a mypy override matching the existing pattern for pandas /
networkx / sklearn / matplotlib.  Local mypy still clean (83 source
files); CI Type-check job will now pass.

The runtime contract is enforced by tests via the LLMCritiqueClient
protocol — type-check coverage of the SDK methods isn't load-bearing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* PR 7.1: first live critique run + adjudication

Live critique executed against release/intermediate/ with a dedicated
Anthropic project key (`leadforge-llm-critique-v1-prod`).

Result: score 7/10, six findings (1 high, 4 medium, 1 low). Exit
code 1 as the design doc specifies for unresolved high-severity
findings. Outputs committed at:
- release/validation/llm_critique_raw_20260508T204359.124834Z.json
- release/validation/llm_critique_summary.md

ADJUDICATION

Resolved in code in this PR:

- F001 (HIGH, documentation, D6) — 93% account_id overlap between
  train and test was documented only in break_me_guide §5, missing
  from release/README.md and the per-tier dataset_card.md, so a
  baseline-notebook student would silently train an account-leaky
  model. Added a "Group-leakage warning" paragraph to the README's
  "Splits" subsection citing the 518/557 figure and a
  GroupKFold(account_id) recipe. The parallel disclosure on the
  auto-rendered dataset_card.md is logged as accepted-for-v2 because
  the renderer change is out of scope for PR 7.1's no-bundle-regen
  rule.

- F004 (MEDIUM, pedagogy, D9) — break_me_guide pattern 5 covered
  account_id but ignored the parallel hazard on contact_id, despite
  contacts being shared across the lead-keyed split at the same
  magnitude. Extended pattern 5 to enumerate account_id, contact_id,
  and any reusable foreign-key column as group-leakage axes; reused
  the same overlap-snippet template per key.

- F006 (LOW, documentation, D1) — README "Conversion rate (recipe
  band)" column header didn't make clear it was a recipe-acceptance
  window not the achievable range. Renamed to "(acceptance band,
  gate G7.*)" and added a one-sentence note that observed five-seed
  spreads sit comfortably inside the band.

Logged to docs/release/v2_decision_log.md (with verdict
+ next-step + audit link to the raw JSON):

- F002 (MEDIUM) — accepted-for-v2 — Gaussian noise produces
  non-physical values (negative ACV, negative day-deltas, day-deltas
  > snapshot_day=30); needs a "Noise artefacts" Caveats bullet on
  the auto-rendered dataset_card.md.

- F003 (MEDIUM) — wont-fix — already treated by
  scripts/_release_common.py::rewrite_release_links() which both
  platform packagers (PR 5.1, 5.2) call at packaging time. The LLM
  didn't have visibility into the platform packagers (intentional —
  they're not in the input bundle) and made a wrong inference.

- F005 (MEDIUM) — accepted-for-v2 — calibration_max_bin_error=0.5234
  on advanced tier is driven by an n=2 high-prob bin; needs a
  minimum-bin-count footnote or a metric redefinition. Touches
  release_quality.py and would force a validation_report regen,
  which the brief explicitly forbids in PR 7.1.

- Three missing-section callouts (Datasheets §Biases, §Privacy,
  per-bundle group-split warning) — all accepted-for-v2.

- Three maintainer questions (noise vs windowing, top_decile_rate
  naming, Kaggle/HF docs subtree inclusion) — answered inline with
  appropriate verdicts.

KNOCK-ON

The README edits cascaded into the platform packager artefacts
(both inline release/README.md). Regenerated cleanly via the
existing packagers:
- release/kaggle/dataset-metadata.json
- release/huggingface/README.md

The audit-sync tests
(test_committed_kaggle_metadata_matches_fresh_regeneration,
test_committed_hf_readme_matches_fresh_regeneration) flagged the
drift before the regen, exactly as designed.

Net: 1325/1325 tests pass + 5 publish-extra-gated skips; ruff +
mypy clean; leakage probes 0/3 on every tier; hash determinism
PASS 67/67; validate_release_candidate --no-rebuild exits 0;
BUNDLE_SCHEMA_VERSION unchanged at 5; validation_report timestamp
drift reverted before commit. Agent-plan close-out updated to
reflect the live-run + adjudication.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

layer: simulation simulation/ discrete-time engine type: feature New capability

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants