12 changes: 7 additions & 5 deletions .agent-plan.md
@@ -24,13 +24,15 @@ Goal: ship a best-in-class educational synthetic CRM lead-scoring dataset family
- [x] Reproduce relational-leakage finding on alpha bundles → `docs/release/v1_current_state_audit.md` — all three tiers reconstruct `converted_within_90_days` at 100% via paths A–E; LR/HistGBM AUC = 1.000 on join-derived features. Probe script: `scripts/probe_relational_leakage.py` (function `deterministic_relational_reconstruction` designed to lift into PR 3.1's `leadforge/validation/leakage_probes.py`).
- [x] Lock dataset release name `leadforge-lead-scoring-v1` (already locked via PR #61's milestone rename + roadmap edits; G1.1 reaffirmed)

### Phase 2 — Snapshot-safe relational export
- [x] `leadforge/render/relational_snapshot_safe.py` (new) — PR 2.1: `to_dataframes_snapshot_safe(dfs, *, snapshot_day)` projects the full-horizon dict from `to_dataframes` onto the public-bundle shape (drops `BANNED_LEAD_COLUMNS` from leads, `BANNED_OPP_COLUMNS` from opportunities, filters event tables per-lead by `lead_created_at + snapshot_day`, omits `customers`/`subscriptions`, passes accounts/contacts unchanged).
- [x] `leadforge/validation/relational_leakage.py` (new) — PR 2.1: owns the snapshot-safe contract constants (`BANNED_LEAD_COLUMNS`, `BANNED_OPP_COLUMNS`, `BANNED_TABLES`, `SNAPSHOT_FILTERED_TABLES`); ships `LeakageFinding`/`LeakageReport`/`RelationalLeakageError(LeadforgeError)` plus five probes (`probe_banned_columns`, `probe_banned_tables`, `probe_deterministic_reconstruction`, `probe_snapshot_window`, opt-in `probe_bonus_model_auc`) and two orchestrators (`run_all_probes`, `run_all_probes_on_dataframes`). The bonus-model probe is opt-in: orchestrators skip it unless the caller passes `bonus_model_max_auc=...` (PR 3.3 will calibrate per-tier bands). `deterministic_relational_reconstruction` lifted from `scripts/probe_relational_leakage.py`; the script now re-exports it from the package.
- [ ] `BUNDLE_SCHEMA_VERSION` 4 → 5; manifest gains `relational_snapshot_safe` — PR 2.2.
- [ ] Wire `relational_snapshot_safe` through `leadforge/exposure/filters.py` and `leadforge/api/bundle.py`; plumb the leakage validator into `leadforge/validation/bundle_checks.py` — PR 2.2.
- [ ] Drop `converted_within_90_days` / `conversion_timestamp` from public `leads`; drop `close_outcome` / `closed_at` from public `opportunities`; omit `customers` / `subscriptions` from public bundles — PR 2.2 (the structural rules are already enforced module-side; PR 2.2 turns them on for actual writes).
- [ ] Hash-determinism preserved on regenerated bundles — PR 2.2.
- [x] PR 2.2: `BUNDLE_SCHEMA_VERSION` 4 → 5; manifest gains `relational_snapshot_safe: bool` (self-describes whether `tables/` is the snapshot-safe public shape or full-horizon instructor shape).
- [x] PR 2.2: `BundleFilter.relational_snapshot_safe` flag (student_public True; research_instructor False); `leadforge/api/bundle.py` calls `to_dataframes_snapshot_safe(dfs, snapshot_day=...)` for student_public after `to_dataframes(...)`. `leadforge/validation/bundle_checks.py` runs `run_all_probes` on public bundles (bonus probe stays off — PR 3.3 calibrates); `_check_fk_integrity` silently skips `BANNED_TABLES` for snapshot-safe bundles.
- [x] PR 2.2: public `leads.parquet` drops `converted_within_90_days` / `conversion_timestamp`; public `opportunities.parquet` drops `close_outcome` / `closed_at`; public bundles omit `customers` / `subscriptions`. Confirmed by regenerated `release/{intro,intermediate,advanced}/manifest.json` (`tables` key: 7 entries, no `customers` / `subscriptions`); `release/intermediate_instructor/` retains full-horizon (9 tables).
- [x] PR 2.2: `scripts/probe_relational_leakage.py release/{intro,intermediate,advanced} --max-accuracy 0.65` exits 0 on all three public tiers (was exiting 2 on alpha bundles); all path prediction rates A-E = 0.000.
- [x] PR 2.2: hash-determinism preserved on regenerated bundles (`scripts/verify_hash_determinism.py`: 67/67 files identical across pinned-timestamp runs; was 73/73 before — drop reflects the 2 omitted public tables × 3 public tiers).
- [x] PR 2.2: `check_exposure_monotonicity` updated for v5 — student is allowed to omit `BANNED_TABLES`, drop `BANNED_LEAD_COLUMNS` / `BANNED_OPP_COLUMNS`, and be a row-subset of instructor on snapshot-filtered event tables.
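
The snapshot-safe projection described in the Phase 2 items can be sketched in plain Python. This is only an illustration under stated assumptions, not the package's `to_dataframes_snapshot_safe`: it uses lists of row dicts instead of DataFrames, and the `touches` table, the `event_timestamp` column, the inclusive cutoff, and the helper names are all hypothetical. The banned column and table names are the ones the plan lists.

```python
from datetime import datetime, timedelta

# Names copied from the plan text; the real constants live in
# leadforge/validation/relational_leakage.py and may differ.
BANNED_LEAD_COLUMNS = frozenset({"converted_within_90_days", "conversion_timestamp"})
BANNED_OPP_COLUMNS = frozenset({"close_outcome", "closed_at"})
BANNED_TABLES = frozenset({"customers", "subscriptions"})
# Hypothetical: the plan does not list which event tables are snapshot-filtered.
SNAPSHOT_FILTERED_TABLES = frozenset({"touches"})


def _drop(rows, banned):
    """Return copies of the row dicts without the banned keys."""
    return [{k: v for k, v in row.items() if k not in banned} for row in rows]


def project_snapshot_safe(tables, *, snapshot_day):
    """Project a full-horizon {table: [row dicts]} mapping onto a public shape."""
    created = {row["lead_id"]: row["lead_created_at"] for row in tables["leads"]}
    window = timedelta(days=snapshot_day)
    public = {}
    for name, rows in tables.items():
        if name in BANNED_TABLES:
            continue  # omitted entirely from the public bundle
        if name == "leads":
            public[name] = _drop(rows, BANNED_LEAD_COLUMNS)
        elif name == "opportunities":
            public[name] = _drop(rows, BANNED_OPP_COLUMNS)
        elif name in SNAPSHOT_FILTERED_TABLES:
            # Assumed: inclusive cutoff on an assumed ``event_timestamp`` column.
            public[name] = [
                r for r in rows
                if r["event_timestamp"] <= created[r["lead_id"]] + window
            ]
        else:
            public[name] = list(rows)  # e.g. accounts / contacts pass through
    return public


t0 = datetime(2024, 1, 1)
full = {
    "leads": [{"lead_id": 1, "lead_created_at": t0, "converted_within_90_days": True}],
    "touches": [
        {"lead_id": 1, "event_timestamp": t0 + timedelta(days=5)},
        {"lead_id": 1, "event_timestamp": t0 + timedelta(days=60)},
    ],
    "customers": [{"lead_id": 1}],
    "accounts": [{"account_id": 7}],
}
public = project_snapshot_safe(full, snapshot_day=30)
print(sorted(public))
```

With `snapshot_day=30` the day-60 touch falls outside the per-lead window and is filtered out, while the full-horizon input is left unmodified.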

### Phase 3 — Release validation hardening
- [ ] `leadforge/validation/{release_quality,leakage_probes,reporting}.py` (new)
28 changes: 27 additions & 1 deletion leadforge/api/bundle.py
@@ -3,7 +3,8 @@
:func:`write_bundle` is called by :meth:`WorldBundle.save` and orchestrates
all rendering steps:

1. Write relational Parquet tables (``tables/``).
1. Project the relational dict (snapshot-safe for ``student_public``,
full-horizon for ``research_instructor``) and write ``tables/``.
2. Build the lead snapshot and write task splits (``tasks/``).
3. Write ``dataset_card.md`` and ``feature_dictionary.csv``.
4. Apply exposure filtering — write ``metadata/`` for ``research_instructor``
@@ -16,10 +17,12 @@
from pathlib import Path
from typing import TYPE_CHECKING

from leadforge.exposure.filters import get_filter
from leadforge.exposure.modes import apply_exposure
from leadforge.narrative.dataset_card import render_dataset_card
from leadforge.render.manifests import build_manifest, write_manifest
from leadforge.render.relational import to_dataframes
from leadforge.render.relational_snapshot_safe import to_dataframes_snapshot_safe
from leadforge.render.snapshots import build_snapshot
from leadforge.render.tasks import write_task_splits
from leadforge.schema.dictionaries import write_feature_dictionary
@@ -66,14 +69,36 @@ def write_bundle(
    # README's "Option 3") cannot trivially reintroduce a redacted
    # column by joining ``tables/leads.parquet`` to their feature set.
    redacted = redacted_columns_for(config.exposure_mode)
    bundle_filter = get_filter(config.exposure_mode)

    # ------------------------------------------------------------------
    # 1. Relational tables → tables/
    #
    # For ``student_public`` (``relational_snapshot_safe = True``) we
    # project the full-horizon dict onto the snapshot-safe shape:
    # ``BANNED_LEAD_COLUMNS`` / ``BANNED_OPP_COLUMNS`` are dropped, event
    # tables are filtered per-lead to ``lead_created_at + snapshot_day``,
    # and ``BANNED_TABLES`` (``customers`` / ``subscriptions``) are
    # omitted entirely. The feature-level redaction below still applies
    # on top — the two policies operate on disjoint columns
    # (snapshot-safe owns the structural reconstruction surface;
    # ``redacted_columns_for`` owns near-deterministic snapshot
    # features), so they neither double-emit nor overlap.
    # ------------------------------------------------------------------
    tables_dir = root / "tables"
    tables_dir.mkdir(exist_ok=True)

    dfs = to_dataframes(result, population)
    if bundle_filter.relational_snapshot_safe:
        if config.snapshot_day is None:
            raise ValueError(
                f"exposure_mode={config.exposure_mode.value!r} requires "
                "config.snapshot_day to be set (the snapshot-safe relational "
                "export filters event tables to lead_created_at + snapshot_day); "
                "got snapshot_day=None. Pin a snapshot_day on the recipe or "
                "pass it explicitly."
            )
        dfs = to_dataframes_snapshot_safe(dfs, snapshot_day=config.snapshot_day)
    table_row_counts: dict[str, int] = {}
    for table_name, df in dfs.items():
        if redacted:
@@ -136,5 +161,6 @@ def write_bundle(
        bundle_root=root,
        generation_timestamp=generation_timestamp,
        redacted_columns=sorted(redacted),
        relational_snapshot_safe=bundle_filter.relational_snapshot_safe,
    )
    write_manifest(manifest, root)
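
The comment in `write_bundle` states that the structural (snapshot-safe) policy and the feature-level redaction policy operate on disjoint columns. A minimal sketch of how that invariant could be asserted follows. The helper `check_policies_disjoint` and the placeholder feature names are hypothetical and not part of this PR; the structural column names are the ones listed in the plan.

```python
def check_policies_disjoint(structural, feature_redactions):
    """Raise ValueError if a column is claimed by both redaction policies."""
    overlap = set().union(*structural.values()) & set(feature_redactions)
    if overlap:
        raise ValueError(f"columns redacted by both policies: {sorted(overlap)}")


# Structural redactions per table, as named in the plan text.
structural = {
    "leads": {"converted_within_90_days", "conversion_timestamp"},
    "opportunities": {"close_outcome", "closed_at"},
}
# Hypothetical placeholder names; the real list comes from redacted_columns_for().
feature_redactions = {"example_feature_a", "example_feature_b"}

check_policies_disjoint(structural, feature_redactions)  # passes silently
```

A check like this would fail loudly if a later change added the same column to both policies, which is the situation the comment says cannot currently occur.
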
22 changes: 20 additions & 2 deletions leadforge/exposure/filters.py
@@ -28,15 +28,33 @@ class BundleFilter:
        write_metadata: Whether to create ``metadata/`` with hidden-truth
            files (``graph.json``, ``graph.graphml``, ``world_spec.json``,
            ``latent_registry.json``, ``mechanism_summary.json``).
        relational_snapshot_safe: Whether the relational ``tables/`` dict
            must be projected onto the snapshot-safe shape before being
            written. When ``True``, the bundle writer routes through
            :func:`leadforge.render.relational_snapshot_safe.to_dataframes_snapshot_safe`,
            which strips :data:`leadforge.validation.relational_leakage.BANNED_LEAD_COLUMNS`
            from ``leads``, :data:`~leadforge.validation.relational_leakage.BANNED_OPP_COLUMNS`
            from ``opportunities``, filters event tables per-lead by
            ``lead_created_at + snapshot_day``, and omits
            :data:`~leadforge.validation.relational_leakage.BANNED_TABLES`
            (``customers`` / ``subscriptions``) entirely. When ``False``,
            the writer emits the full-horizon export.
    """

    write_metadata: bool
    relational_snapshot_safe: bool


#: Canonical filter rules for every supported exposure mode.
FILTERS: dict[ExposureMode, BundleFilter] = {
    ExposureMode.student_public: BundleFilter(write_metadata=False),
    ExposureMode.research_instructor: BundleFilter(write_metadata=True),
    ExposureMode.student_public: BundleFilter(
        write_metadata=False,
        relational_snapshot_safe=True,
    ),
    ExposureMode.research_instructor: BundleFilter(
        write_metadata=True,
        relational_snapshot_safe=False,
    ),
}
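
For readers without the full file, the post-change shape of the filter table can be sketched as a self-contained module. Several details are assumptions: that `BundleFilter` is a frozen dataclass, the string values of `ExposureMode`, and the body of `get_filter` (this diff shows it only as an import in `bundle.py`).

```python
from dataclasses import dataclass
from enum import Enum


class ExposureMode(Enum):
    # Member names appear in the diff; the string values are an assumption.
    student_public = "student_public"
    research_instructor = "research_instructor"


@dataclass(frozen=True)
class BundleFilter:
    write_metadata: bool
    # No default: every exposure mode has to state its relational shape.
    relational_snapshot_safe: bool


FILTERS = {
    ExposureMode.student_public: BundleFilter(
        write_metadata=False,
        relational_snapshot_safe=True,
    ),
    ExposureMode.research_instructor: BundleFilter(
        write_metadata=True,
        relational_snapshot_safe=False,
    ),
}


def get_filter(mode):
    """Hypothetical lookup; the real get_filter body is not shown in this diff."""
    try:
        return FILTERS[mode]
    except KeyError:
        raise ValueError(f"no BundleFilter registered for {mode!r}") from None
```

Leaving `relational_snapshot_safe` without a default means any construction site that was not updated fails with a `TypeError` rather than silently falling back to the full-horizon export.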


51 changes: 50 additions & 1 deletion leadforge/render/manifests.py
@@ -14,6 +14,11 @@
from typing import TYPE_CHECKING, Any

from leadforge.core.hashing import file_sha256
from leadforge.validation.relational_leakage import (
    BANNED_LEAD_COLUMNS,
    BANNED_OPP_COLUMNS,
    BANNED_TABLES,
)

if TYPE_CHECKING:
    from leadforge.core.models import GenerationConfig
@@ -36,7 +41,19 @@
# assuming "features computed over full horizon" must update.
# ``manifest.snapshot_day`` recorded so the contract is
# self-describing (``null`` means full-horizon, legacy behaviour).
BUNDLE_SCHEMA_VERSION = "4"
# "5" — PR 2.2: ``student_public`` bundles route through the
# snapshot-safe relational export (
# :mod:`leadforge.render.relational_snapshot_safe`). Public
# ``leads`` drops ``converted_within_90_days`` /
# ``conversion_timestamp``; public ``opportunities`` drops
# ``close_outcome`` / ``closed_at``; public bundles omit
# ``customers`` / ``subscriptions``; event tables filtered
# per-lead to ``lead_created_at + snapshot_day``.
# ``manifest.relational_snapshot_safe`` records the contract so
# consumers / validators can tell from the bundle alone whether
# the tables are snapshot-safe. ``research_instructor`` bundles
# keep the full-horizon export (``relational_snapshot_safe = false``).
BUNDLE_SCHEMA_VERSION = "5"

# Manifest fields whose value is non-deterministic by design (wall-clock,
# host metadata, etc.). Determinism checks must ignore these fields when
@@ -52,6 +69,7 @@ def build_manifest(
    bundle_root: Path,
    generation_timestamp: str | None = None,
    redacted_columns: list[str] | None = None,
    relational_snapshot_safe: bool = False,
) -> dict[str, Any]:
    """Build the bundle manifest dict.

@@ -71,6 +89,13 @@ def build_manifest(
            this exposure mode. Recorded in the manifest so consumers
            (and the validator) can audit redaction without inspecting
            package internals. Defaults to ``[]`` (nothing redacted).
        relational_snapshot_safe: ``True`` if the relational ``tables/``
            were projected through
            :func:`leadforge.render.relational_snapshot_safe.to_dataframes_snapshot_safe`
            before being written. Recorded in the manifest so a tool
            reading a v5+ bundle can tell from the manifest alone whether
            ``tables/`` is the snapshot-safe (public) shape or the
            full-horizon (instructor) shape. Defaults to ``False``.

    Returns:
        A JSON-serialisable dict ready to be written as ``manifest.json``.
@@ -117,11 +142,35 @@ def build_manifest(
        "snapshot_day": config.snapshot_day,
        "motif_family": world_graph.motif_family,
        "redacted_columns": redacted_columns_list,
        "relational_snapshot_safe": bool(relational_snapshot_safe),
        "structural_redactions": _build_structural_redactions(bool(relational_snapshot_safe)),
        "tables": tables,
        "tasks": tasks,
    }


def _build_structural_redactions(relational_snapshot_safe: bool) -> dict[str, Any]:
    """Self-describing record of the table-level redactions applied at write.

    For snapshot-safe (public) bundles this enumerates every column the
    snapshot-safe export drops from ``leads`` / ``opportunities`` and the
    tables it omits entirely. For full-horizon (instructor) bundles
    every list is empty. Together with ``manifest.redacted_columns``
    (the snapshot-feature redactions) and ``manifest.relational_snapshot_safe``
    (the contract flag) the manifest fully describes what the writer
    dropped without the consumer needing to consult package internals.
    """
    if not relational_snapshot_safe:
        return {"columns": {}, "omitted_tables": []}
    return {
        "columns": {
            "leads": sorted(BANNED_LEAD_COLUMNS),
            "opportunities": sorted(BANNED_OPP_COLUMNS),
        },
        "omitted_tables": sorted(BANNED_TABLES),
    }
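
On the consumer side, the two new manifest keys could be read along these lines. This is a sketch only: `describe_tables` is a hypothetical helper, the example manifest contains just the two keys added in this PR, and its column lists assume the banned sets hold exactly the columns named in the plan. A manifest without the flag is treated as full-horizon.

```python
def describe_tables(manifest):
    """Summarise the relational shape of a bundle from its manifest dict."""
    snapshot_safe = bool(manifest.get("relational_snapshot_safe", False))
    redactions = manifest.get("structural_redactions") or {}
    return {
        "shape": "snapshot-safe" if snapshot_safe else "full-horizon",
        "dropped_columns": {
            table: list(cols) for table, cols in redactions.get("columns", {}).items()
        },
        "omitted_tables": list(redactions.get("omitted_tables", [])),
    }


# Assumed example of what a public v5 manifest might carry for these keys.
public_manifest = {
    "relational_snapshot_safe": True,
    "structural_redactions": {
        "columns": {
            "leads": ["conversion_timestamp", "converted_within_90_days"],
            "opportunities": ["close_outcome", "closed_at"],
        },
        "omitted_tables": ["customers", "subscriptions"],
    },
}
legacy_manifest = {}  # stands in for a manifest written before these keys existed

print(describe_tables(public_manifest)["shape"])
print(describe_tables(legacy_manifest)["shape"])
```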


def write_manifest(manifest: dict[str, Any], bundle_root: Path) -> Path:
    """Serialise *manifest* to ``bundle_root/manifest.json`` and return the path."""
    path = bundle_root / "manifest.json"