Skip to content

feat: Milestone 11 — validation harness (invariants, realism, difficulty, drift)#18

Merged
shaypal5 merged 3 commits into
mainfrom
feat/milestone-11-validation-harness
Apr 29, 2026
Merged

feat: Milestone 11 — validation harness (invariants, realism, difficulty, drift)#18
shaypal5 merged 3 commits into
mainfrom
feat/milestone-11-validation-harness

Conversation

@shaypal5

Copy link
Copy Markdown
Contributor

Summary

  • validation/invariants.py: determinism check (same seed → identical SHA-256 hashes across all Parquet files) and exposure monotonicity (student_public ⊂ research_instructor — metadata/, tables, tasks, core files)
  • validation/realism.py: conversion-rate bounds (1%–95%), non-empty core tables, non-negative counts, valid booleans, multi-stage distribution diversity
  • validation/difficulty.py: known-difficulty validation from manifest + difficulty ordering check (currently no-op until engine modulates by difficulty)
  • validation/drift.py: cross-seed stability — conversion rate spread (max/min < 5×), degenerate seed detection (0% or 100%), stage diversity across seeds
  • validation/bundle_checks.py: validate_bundle() gains include_realism flag to optionally run realism + difficulty checks alongside structural validation

18 new validation tests; total 581 passing. Ruff + mypy clean.

Test plan

  • pytest tests/validation/ — 18 tests pass
  • pytest (full suite) — 581 tests pass
  • ruff check . clean
  • mypy leadforge/ clean
  • CI green

🤖 Generated with Claude Code

…lty, drift)

Add four specialised validation modules and wire them into the existing
bundle_checks orchestrator:

- invariants.py: determinism check (same seed → identical SHA-256 hashes)
  and exposure monotonicity (student_public ⊂ research_instructor)
- realism.py: conversion-rate bounds, non-empty core tables, non-negative
  counts, valid booleans, multi-stage distribution diversity
- difficulty.py: known-difficulty validation + ordering check (no-op until
  engine modulates by difficulty)
- drift.py: cross-seed stability — conversion rate spread, degenerate seed
  detection, stage diversity

bundle_checks.validate_bundle() gains an include_realism flag so the CLI
validate command can optionally run distributional checks alongside the
structural ones.

18 new validation tests; total 581 passing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 29, 2026 07:34
@shaypal5 shaypal5 added type: feature New capability layer: validation validation/ invariants and checks labels Apr 29, 2026
@github-actions

This comment has been minimized.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new “validation harness” to the LeadForge codebase to assert key invariants (determinism, exposure monotonicity), distributional realism heuristics, difficulty manifest sanity checks, and cross-seed drift/stability checks; and wires these into the existing bundle validation entrypoint with accompanying tests.

Changes:

  • Introduces new validation modules: invariants, realism, difficulty, and drift (plus corresponding tests).
  • Extends validate_bundle() to optionally include realism + difficulty checks via an include_realism flag (defaulting to enabled).
  • Updates planning doc to mark Milestone 11 complete and outline Milestone 12.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
leadforge/validation/realism.py Adds conversion-rate bounds, feature-range checks, and non-degeneracy heuristics.
leadforge/validation/invariants.py Adds determinism (hash parity) and exposure monotonicity checks.
leadforge/validation/drift.py Adds cross-seed stability checks (rate spread, degenerate seeds, stage diversity).
leadforge/validation/difficulty.py Adds manifest difficulty validation and a placeholder ordering check.
leadforge/validation/bundle_checks.py Wires realism/difficulty into the main validation orchestrator behind a flag.
tests/validation/test_realism.py Test coverage for realism checks (happy path + manifest corruption case).
tests/validation/test_invariants.py Test coverage for determinism + exposure monotonicity.
tests/validation/test_drift.py Test coverage for cross-seed stability behavior.
tests/validation/test_difficulty.py Test coverage for known/unknown difficulty and ordering no-op behavior.
.agent-plan.md Updates milestone status and next steps.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread leadforge/validation/difficulty.py Outdated
Comment thread leadforge/validation/difficulty.py Outdated
Comment thread leadforge/validation/invariants.py Outdated
Comment thread leadforge/validation/realism.py Outdated
Comment thread leadforge/validation/realism.py Outdated
Comment thread leadforge/validation/realism.py Outdated
Comment thread leadforge/validation/invariants.py Outdated
Comment thread leadforge/validation/drift.py
@github-actions

This comment has been minimized.

1. Delete dead I/O in check_difficulty_ordering (was computing rates then
   discarding them)
2. Pass manifest dict down to check_realism/check_difficulty instead of
   re-reading manifest.json 3 times per validation pass
3. Remove unused manifest param from _check_conversion_rate
4. _check_table_nonempty now reads actual Parquet files via pyarrow
   metadata instead of trusting manifest row counts
5. check_exposure_monotonicity now verifies content equality (SHA-256)
   for shared tables, task splits, and feature_dictionary.csv
6. check_determinism now compares core non-Parquet files (manifest.json,
   dataset_card.md, feature_dictionary.csv) in addition to Parquet files
7. (test perf — accepted as-is; module-scoped fixtures already amortize)
8. _check_feature_ranges now reads only needed columns via pyarrow schema
   introspection instead of loading the entire Parquet file
9. Derive count/boolean column lists from LEAD_SNAPSHOT_FEATURES schema
   instead of hardcoding — prevents silent drift
10. Add 4 negative-path tests for drift.py (0% rate, 100% rate, wide
    spread, single-stage degeneracy)
11. Add 5 negative-path tests for realism.py (low rate, high rate,
    negative counts, non-boolean values, single stage)
12. Use epsilon comparisons for float rate checks in drift.py
13. Move check_realism/check_difficulty imports to top level

Total: 590 tests passing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions

This comment has been minimized.

- COPILOT-3/7: exposure monotonicity now checks both directions for
  core files, tables, and task splits (flags extras in instructor that
  are missing from student, and vice versa)
- COPILOT-4: boolean validation uses pd.api.types.is_bool_dtype()
  instead of value-set comparison, correctly detecting integer-coded
  booleans (0/1 == True/False in Python)
- COPILOT-8: drift check now surfaces an explicit error when a seed's
  train.parquet is missing, instead of silently skipping

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 29, 2026 07:54
@github-actions

Copy link
Copy Markdown

pr-agent-context report:

This run includes unresolved review comments on PR #18.

For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.

After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, and push all of these changes in a single commit.

# Copilot Comments

## COPILOT-1
Location: leadforge/validation/difficulty.py
URL: https://github.com/leadforge-dev/leadforge/pull/18#discussion_r3159313768
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    `rates` is populated by reading Parquet files, but the collected values are never used (the function always returns an empty list). This adds unnecessary I/O and can confuse readers; either implement the intended ordering check / report the observed rates via the returned errors, or remove the unused computation until the feature is ready.

## COPILOT-2
Location: leadforge/validation/difficulty.py
URL: https://github.com/leadforge-dev/leadforge/pull/18#discussion_r3159313788
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    The TODO/comment claims “return the observed rates for informational purposes”, but this function only returns `errors` and currently always returns `[]`. Either include the observed rates in the returned messages (or change the API to return a structured result) or update the comment/docstring so it matches the actual behavior.

## COPILOT-3
Location: leadforge/validation/invariants.py
URL: https://github.com/leadforge-dev/leadforge/pull/18#discussion_r3159313807
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    This “same tables” check only enforces `student_tables ⊆ instructor_tables` (it doesn’t flag tables present only in the instructor bundle). If the intent is that *only* `metadata/` differs between exposure modes, consider also checking `instructor_tables - student_tables` (and similarly for task files) to enforce equality outside of metadata.

## COPILOT-4
Location: leadforge/validation/realism.py
URL: https://github.com/leadforge-dev/leadforge/pull/18#discussion_r3159313836
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    Boolean validation here can miss integer-coded booleans (0/1) because `0 == False` and `1 == True`, so `unique - {True, False}` will be empty even when the column is not actually boolean. Consider checking the dtype (e.g., `is_bool_dtype` / pandas BooleanDtype) and/or using identity-based checks (`v is True/False`) to reliably detect non-boolean encodings.

## COPILOT-5
Location: leadforge/validation/realism.py
URL: https://github.com/leadforge-dev/leadforge/pull/18#discussion_r3159313850
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    `pd.read_parquet(train_path)` loads the full task table even though only a handful of columns are checked below. To keep validation fast on large bundles, read only the needed columns (the union of the count + bool feature columns that actually exist).

## COPILOT-6
Location: leadforge/validation/realism.py
URL: https://github.com/leadforge-dev/leadforge/pull/18#discussion_r3159313874
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    This non-empty-table check relies only on `manifest["tables"][...]["row_count"]`. That can produce false positives/negatives if the manifest is wrong (e.g., row_count stale or corrupted) and won't catch an actually-empty parquet when row_count is nonzero/missing. Since this module already reads Parquet files for other checks, consider validating non-emptiness against the actual table files (accounts/contacts/leads) rather than the manifest metadata.

## COPILOT-7
Location: leadforge/validation/invariants.py
URL: https://github.com/leadforge-dev/leadforge/pull/18#discussion_r3159313886
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    Docstring says “Both must have the same core files”, but the logic only checks one direction (student has file && instructor missing). If the instructor has a core file that the student is missing, this will currently pass even though the bundles no longer match. Either enforce equality for these core files (or at least require they exist in both) or update the wording to reflect the intended one-way subset behavior.

## COPILOT-8
Location: leadforge/validation/drift.py:34
URL: https://github.com/leadforge-dev/leadforge/pull/18#discussion_r3159313906
Root author: copilot-pull-request-reviewer

Comment:
    When `train.parquet` is missing for a seed, the code silently `continue`s, which can cause this check to return `[]` even though some inputs were invalid/incomplete. Consider appending an explicit error for missing required artifacts (or otherwise surfacing that a seed was skipped) so instability checks don’t silently degrade into a no-op.

Run metadata:

Tool ref: v4
Tool version: 4.0.20
Trigger: commit pushed
Workflow run: 25097334293 attempt 1
Comment timestamp: 2026-04-29T07:55:04.881452+00:00
PR head commit: 8cc1378076b16e6ee2e1cfe074f124fc1ff18d02

@shaypal5 shaypal5 self-assigned this Apr 29, 2026
@shaypal5 shaypal5 merged commit 258e610 into main Apr 29, 2026
7 checks passed
@shaypal5 shaypal5 deleted the feat/milestone-11-validation-harness branch April 29, 2026 07:57

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a validation harness to the Leadforge bundle pipeline, introducing invariant checks (determinism + exposure monotonicity) and distributional “realism” / drift sanity checks, and wiring optional realism+difficulty validation into the existing bundle validator.

Changes:

  • Introduces new validation modules: invariants (determinism, exposure monotonicity), realism (sanity bounds), drift (cross-seed stability), difficulty (manifest difficulty validation + placeholder ordering check).
  • Adds new test coverage for the new validation modules under tests/validation/.
  • Extends validate_bundle() with an include_realism flag to optionally run realism+difficulty checks in addition to structural validation.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
leadforge/validation/realism.py Implements distributional sanity checks (conversion rate bounds, non-empty core tables, feature range/type checks, stage diversity).
leadforge/validation/invariants.py Adds determinism comparison and exposure-mode monotonicity checks.
leadforge/validation/drift.py Adds cross-seed stability checks to detect degenerate seeds / wide conversion-rate spread.
leadforge/validation/difficulty.py Validates manifest difficulty field and provides a (currently no-op) ordering check.
leadforge/validation/bundle_checks.py Wires realism+difficulty into the main bundle validator behind include_realism.
tests/validation/test_realism.py Tests realism checks by mutating a generated bundle to trigger specific errors.
tests/validation/test_invariants.py Tests determinism and exposure monotonicity validation behavior.
tests/validation/test_drift.py Tests drift detection for degenerate conversion/stage distributions across seeds.
tests/validation/test_difficulty.py Tests difficulty validation and asserts ordering check is a no-op for v1.
.agent-plan.md Updates milestone tracking notes to mark validation harness as completed.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +52 to +53
rate = df["converted_within_90_days"].mean()

Comment on lines +65 to +79
def _check_table_nonempty(root: Path, manifest: dict[str, Any]) -> list[str]:
"""Core tables should have at least 1 row (verified from actual files)."""
errors: list[str] = []
required_nonempty = {"accounts", "contacts", "leads"}

for table_name in required_nonempty:
parquet_path = root / f"tables/{table_name}.parquet"
if not parquet_path.exists():
errors.append(f"Table '{table_name}' file missing")
else:
meta = pq.read_metadata(parquet_path)
if meta.num_rows == 0:
errors.append(f"Table '{table_name}' has 0 rows")

return errors
Comment on lines 24 to +51
@@ -37,6 +44,11 @@ def validate_bundle(bundle_root: Path) -> list[str]:
errors.extend(_check_task_splits(bundle_root, manifest))
errors.extend(_check_fk_integrity(tables))
errors.extend(_check_leakage(bundle_root, manifest))

if include_realism:
errors.extend(check_realism(bundle_root, manifest))
errors.extend(check_difficulty(manifest))

Comment on lines +25 to +33
# Compare core non-Parquet files that must also be deterministic.
for fname in ("manifest.json", "dataset_card.md", "feature_dictionary.csv"):
fa = bundle_a / fname
fb = bundle_b / fname
if fa.exists() and fb.exists():
if file_sha256(fa) != file_sha256(fb):
errors.append(f"Hash mismatch: {fname}")
elif fa.exists() != fb.exists():
errors.append(f"File '{fname}' exists in one bundle but not the other")
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

layer: validation validation/ invariants and checks type: feature New capability

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants