feat: Milestone 11 — validation harness (invariants, realism, difficulty, drift)#18
Conversation
…lty, drift) Add four specialised validation modules and wire them into the existing bundle_checks orchestrator: - invariants.py: determinism check (same seed → identical SHA-256 hashes) and exposure monotonicity (student_public ⊂ research_instructor) - realism.py: conversion-rate bounds, non-empty core tables, non-negative counts, valid booleans, multi-stage distribution diversity - difficulty.py: known-difficulty validation + ordering check (no-op until engine modulates by difficulty) - drift.py: cross-seed stability — conversion rate spread, degenerate seed detection, stage diversity bundle_checks.validate_bundle() gains an include_realism flag so the CLI validate command can optionally run distributional checks alongside the structural ones. 18 new validation tests; total 581 passing. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Pull request overview
Adds a new “validation harness” to the LeadForge codebase to assert key invariants (determinism, exposure monotonicity), distributional realism heuristics, difficulty manifest sanity checks, and cross-seed drift/stability checks; and wires these into the existing bundle validation entrypoint with accompanying tests.
Changes:
- Introduces new validation modules:
invariants,realism,difficulty, anddrift(plus corresponding tests). - Extends
validate_bundle()to optionally include realism + difficulty checks via aninclude_realismflag (defaulting to enabled). - Updates planning doc to mark Milestone 11 complete and outline Milestone 12.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
leadforge/validation/realism.py |
Adds conversion-rate bounds, feature-range checks, and non-degeneracy heuristics. |
leadforge/validation/invariants.py |
Adds determinism (hash parity) and exposure monotonicity checks. |
leadforge/validation/drift.py |
Adds cross-seed stability checks (rate spread, degenerate seeds, stage diversity). |
leadforge/validation/difficulty.py |
Adds manifest difficulty validation and a placeholder ordering check. |
leadforge/validation/bundle_checks.py |
Wires realism/difficulty into the main validation orchestrator behind a flag. |
tests/validation/test_realism.py |
Test coverage for realism checks (happy path + manifest corruption case). |
tests/validation/test_invariants.py |
Test coverage for determinism + exposure monotonicity. |
tests/validation/test_drift.py |
Test coverage for cross-seed stability behavior. |
tests/validation/test_difficulty.py |
Test coverage for known/unknown difficulty and ordering no-op behavior. |
.agent-plan.md |
Updates milestone status and next steps. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
This comment has been minimized.
This comment has been minimized.
1. Delete dead I/O in check_difficulty_ordering (was computing rates then
discarding them)
2. Pass manifest dict down to check_realism/check_difficulty instead of
re-reading manifest.json 3 times per validation pass
3. Remove unused manifest param from _check_conversion_rate
4. _check_table_nonempty now reads actual Parquet files via pyarrow
metadata instead of trusting manifest row counts
5. check_exposure_monotonicity now verifies content equality (SHA-256)
for shared tables, task splits, and feature_dictionary.csv
6. check_determinism now compares core non-Parquet files (manifest.json,
dataset_card.md, feature_dictionary.csv) in addition to Parquet files
7. (test perf — accepted as-is; module-scoped fixtures already amortize)
8. _check_feature_ranges now reads only needed columns via pyarrow schema
introspection instead of loading the entire Parquet file
9. Derive count/boolean column lists from LEAD_SNAPSHOT_FEATURES schema
instead of hardcoding — prevents silent drift
10. Add 4 negative-path tests for drift.py (0% rate, 100% rate, wide
spread, single-stage degeneracy)
11. Add 5 negative-path tests for realism.py (low rate, high rate,
negative counts, non-boolean values, single stage)
12. Use epsilon comparisons for float rate checks in drift.py
13. Move check_realism/check_difficulty imports to top level
Total: 590 tests passing.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
- COPILOT-3/7: exposure monotonicity now checks both directions for core files, tables, and task splits (flags extras in instructor that are missing from student, and vice versa) - COPILOT-4: boolean validation uses pd.api.types.is_bool_dtype() instead of value-set comparison, correctly detecting integer-coded booleans (0/1 == True/False in Python) - COPILOT-8: drift check now surfaces an explicit error when a seed's train.parquet is missing, instead of silently skipping Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
pr-agent-context report: This run includes unresolved review comments on PR #18.
For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.
After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, and push all of these changes in a single commit.
# Copilot Comments
## COPILOT-1
Location: leadforge/validation/difficulty.py
URL: https://github.com/leadforge-dev/leadforge/pull/18#discussion_r3159313768
Status: outdated
Root author: copilot-pull-request-reviewer
Comment:
`rates` is populated by reading Parquet files, but the collected values are never used (the function always returns an empty list). This adds unnecessary I/O and can confuse readers; either implement the intended ordering check / report the observed rates via the returned errors, or remove the unused computation until the feature is ready.
## COPILOT-2
Location: leadforge/validation/difficulty.py
URL: https://github.com/leadforge-dev/leadforge/pull/18#discussion_r3159313788
Status: outdated
Root author: copilot-pull-request-reviewer
Comment:
The TODO/comment claims “return the observed rates for informational purposes”, but this function only returns `errors` and currently always returns `[]`. Either include the observed rates in the returned messages (or change the API to return a structured result) or update the comment/docstring so it matches the actual behavior.
## COPILOT-3
Location: leadforge/validation/invariants.py
URL: https://github.com/leadforge-dev/leadforge/pull/18#discussion_r3159313807
Status: outdated
Root author: copilot-pull-request-reviewer
Comment:
This “same tables” check only enforces `student_tables ⊆ instructor_tables` (it doesn’t flag tables present only in the instructor bundle). If the intent is that *only* `metadata/` differs between exposure modes, consider also checking `instructor_tables - student_tables` (and similarly for task files) to enforce equality outside of metadata.
## COPILOT-4
Location: leadforge/validation/realism.py
URL: https://github.com/leadforge-dev/leadforge/pull/18#discussion_r3159313836
Status: outdated
Root author: copilot-pull-request-reviewer
Comment:
Boolean validation here can miss integer-coded booleans (0/1) because `0 == False` and `1 == True`, so `unique - {True, False}` will be empty even when the column is not actually boolean. Consider checking the dtype (e.g., `is_bool_dtype` / pandas BooleanDtype) and/or using identity-based checks (`v is True/False`) to reliably detect non-boolean encodings.
## COPILOT-5
Location: leadforge/validation/realism.py
URL: https://github.com/leadforge-dev/leadforge/pull/18#discussion_r3159313850
Status: outdated
Root author: copilot-pull-request-reviewer
Comment:
`pd.read_parquet(train_path)` loads the full task table even though only a handful of columns are checked below. To keep validation fast on large bundles, read only the needed columns (the union of the count + bool feature columns that actually exist).
## COPILOT-6
Location: leadforge/validation/realism.py
URL: https://github.com/leadforge-dev/leadforge/pull/18#discussion_r3159313874
Status: outdated
Root author: copilot-pull-request-reviewer
Comment:
This non-empty-table check relies only on `manifest["tables"][...]["row_count"]`. That can produce false positives/negatives if the manifest is wrong (e.g., row_count stale or corrupted) and won't catch an actually-empty parquet when row_count is nonzero/missing. Since this module already reads Parquet files for other checks, consider validating non-emptiness against the actual table files (accounts/contacts/leads) rather than the manifest metadata.
## COPILOT-7
Location: leadforge/validation/invariants.py
URL: https://github.com/leadforge-dev/leadforge/pull/18#discussion_r3159313886
Status: outdated
Root author: copilot-pull-request-reviewer
Comment:
Docstring says “Both must have the same core files”, but the logic only checks one direction (student has file && instructor missing). If the instructor has a core file that the student is missing, this will currently pass even though the bundles no longer match. Either enforce equality for these core files (or at least require they exist in both) or update the wording to reflect the intended one-way subset behavior.
## COPILOT-8
Location: leadforge/validation/drift.py:34
URL: https://github.com/leadforge-dev/leadforge/pull/18#discussion_r3159313906
Root author: copilot-pull-request-reviewer
Comment:
When `train.parquet` is missing for a seed, the code silently `continue`s, which can cause this check to return `[]` even though some inputs were invalid/incomplete. Consider appending an explicit error for missing required artifacts (or otherwise surfacing that a seed was skipped) so instability checks don’t silently degrade into a no-op.Run metadata: |
There was a problem hiding this comment.
Pull request overview
Adds a validation harness to the Leadforge bundle pipeline, introducing invariant checks (determinism + exposure monotonicity) and distributional “realism” / drift sanity checks, and wiring optional realism+difficulty validation into the existing bundle validator.
Changes:
- Introduces new validation modules: invariants (determinism, exposure monotonicity), realism (sanity bounds), drift (cross-seed stability), difficulty (manifest difficulty validation + placeholder ordering check).
- Adds new test coverage for the new validation modules under
tests/validation/. - Extends
validate_bundle()with aninclude_realismflag to optionally run realism+difficulty checks in addition to structural validation.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
leadforge/validation/realism.py |
Implements distributional sanity checks (conversion rate bounds, non-empty core tables, feature range/type checks, stage diversity). |
leadforge/validation/invariants.py |
Adds determinism comparison and exposure-mode monotonicity checks. |
leadforge/validation/drift.py |
Adds cross-seed stability checks to detect degenerate seeds / wide conversion-rate spread. |
leadforge/validation/difficulty.py |
Validates manifest difficulty field and provides a (currently no-op) ordering check. |
leadforge/validation/bundle_checks.py |
Wires realism+difficulty into the main bundle validator behind include_realism. |
tests/validation/test_realism.py |
Tests realism checks by mutating a generated bundle to trigger specific errors. |
tests/validation/test_invariants.py |
Tests determinism and exposure monotonicity validation behavior. |
tests/validation/test_drift.py |
Tests drift detection for degenerate conversion/stage distributions across seeds. |
tests/validation/test_difficulty.py |
Tests difficulty validation and asserts ordering check is a no-op for v1. |
.agent-plan.md |
Updates milestone tracking notes to mark validation harness as completed. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| rate = df["converted_within_90_days"].mean() | ||
|
|
| def _check_table_nonempty(root: Path, manifest: dict[str, Any]) -> list[str]: | ||
| """Core tables should have at least 1 row (verified from actual files).""" | ||
| errors: list[str] = [] | ||
| required_nonempty = {"accounts", "contacts", "leads"} | ||
|
|
||
| for table_name in required_nonempty: | ||
| parquet_path = root / f"tables/{table_name}.parquet" | ||
| if not parquet_path.exists(): | ||
| errors.append(f"Table '{table_name}' file missing") | ||
| else: | ||
| meta = pq.read_metadata(parquet_path) | ||
| if meta.num_rows == 0: | ||
| errors.append(f"Table '{table_name}' has 0 rows") | ||
|
|
||
| return errors |
| @@ -37,6 +44,11 @@ def validate_bundle(bundle_root: Path) -> list[str]: | |||
| errors.extend(_check_task_splits(bundle_root, manifest)) | |||
| errors.extend(_check_fk_integrity(tables)) | |||
| errors.extend(_check_leakage(bundle_root, manifest)) | |||
|
|
|||
| if include_realism: | |||
| errors.extend(check_realism(bundle_root, manifest)) | |||
| errors.extend(check_difficulty(manifest)) | |||
|
|
|||
| # Compare core non-Parquet files that must also be deterministic. | ||
| for fname in ("manifest.json", "dataset_card.md", "feature_dictionary.csv"): | ||
| fa = bundle_a / fname | ||
| fb = bundle_b / fname | ||
| if fa.exists() and fb.exists(): | ||
| if file_sha256(fa) != file_sha256(fb): | ||
| errors.append(f"Hash mismatch: {fname}") | ||
| elif fa.exists() != fb.exists(): | ||
| errors.append(f"File '{fname}' exists in one bundle but not the other") |
Summary
validation/invariants.py: determinism check (same seed → identical SHA-256 hashes across all Parquet files) and exposure monotonicity (student_public ⊂ research_instructor — metadata/, tables, tasks, core files)validation/realism.py: conversion-rate bounds (1%–95%), non-empty core tables, non-negative counts, valid booleans, multi-stage distribution diversityvalidation/difficulty.py: known-difficulty validation from manifest + difficulty ordering check (currently no-op until engine modulates by difficulty)validation/drift.py: cross-seed stability — conversion rate spread (max/min < 5×), degenerate seed detection (0% or 100%), stage diversity across seedsvalidation/bundle_checks.py:validate_bundle()gainsinclude_realismflag to optionally run realism + difficulty checks alongside structural validation18 new validation tests; total 581 passing. Ruff + mypy clean.
Test plan
pytest tests/validation/— 18 tests passpytest(full suite) — 581 tests passruff check .cleanmypy leadforge/clean🤖 Generated with Claude Code