5 changes: 3 additions & 2 deletions .agent-plan.md
Original file line number Diff line number Diff line change
Expand Up @@ -46,6 +46,7 @@ First public dataset release: `leadforge-b2b-lead-scoring`. Three difficulty tie
- [x] Update release/HF_DATASET_CARD.md — add conversion rates to summary table
- [x] Verify SHA-256 hash determinism (re-run build, compare hashes) — `scripts/verify_hash_determinism.py`; 73/73 files identical across two `build_public_release.py` runs (modulo `manifest.json`'s wall-clock `generation_timestamp`)
- [x] Fix `current_stage` leakage in student_public bundles via exposure-layer redaction — `is_leakage_trap` flag distinguishes the pedagogical trap (`total_touches_all`) from true label leaks; `BundleFilter.redacted_columns` strips the latter; `validate_bundle()` enforces the invariant. 73/73 hash-determinism preserved.
- [x] Windowed snapshot for student_public bundles — `snapshot_day=30` pinned in recipe; event-aggregate features no longer share the 90-day label window; `BUNDLE_SCHEMA_VERSION` bumped to 4; 73/73 hash-determinism preserved; conversion rates unchanged (41.5% / 20.1% / 7.9%).
- [ ] Upload to Kaggle and HuggingFace
- [ ] Announce

Expand All @@ -66,11 +67,11 @@ First public dataset release: `leadforge-b2b-lead-scoring`. Three difficulty tie

Deterministic leak fixed via exposure-layer redaction. `FeatureSpec` now carries an explicit `redact_in_modes: frozenset[ExposureMode]` field — *prescriptive* — alongside the descriptive `leakage_risk` flag. `current_stage` is marked `redact_in_modes={ExposureMode.student_public}`; the writer queries `redacted_columns_for(mode)` and strips matching columns from the snapshot, task splits, and feature dictionary before they hit disk. The pedagogical trap `total_touches_all` is preserved in all modes (no entry in `redact_in_modes`). The manifest records `redacted_columns: [...]` so the bundle is self-describing. `validate_bundle()` cross-checks parquet schemas, feature dictionary, and the manifest's declared redaction set against `redacted_columns_for(mode)` derived independently from the feature spec. Hash-determinism preserved (73/73 identical across builds).
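
The redaction flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: `FeatureSpec`, `redact_in_modes`, `leakage_risk`, `ExposureMode.student_public`, and `redacted_columns_for` are names taken from the note above, while the `research_instructor` mode, the `FEATURES` tuple, and the `strip_redacted` helper are hypothetical and exist only to make the example self-contained.

```python
from dataclasses import dataclass
from enum import Enum


class ExposureMode(str, Enum):
    student_public = "student_public"
    # Hypothetical second mode, present only to give the example a contrast.
    research_instructor = "research_instructor"


@dataclass(frozen=True)
class FeatureSpec:
    name: str
    leakage_risk: bool = False  # descriptive: "this column is risky"
    redact_in_modes: frozenset[ExposureMode] = frozenset()  # prescriptive: "strip it here"


# Hypothetical feature list; the real one lives in the package.
FEATURES = (
    FeatureSpec("touch_count"),
    # Pedagogical trap: risky, but deliberately kept in every mode.
    FeatureSpec("total_touches_all", leakage_risk=True),
    FeatureSpec(
        "current_stage",
        leakage_risk=True,
        redact_in_modes=frozenset({ExposureMode.student_public}),
    ),
)


def redacted_columns_for(mode, features=FEATURES):
    """Return the sorted column names that must not be written in `mode`."""
    return sorted(f.name for f in features if mode in f.redact_in_modes)


def strip_redacted(rows, mode):
    """Drop redacted columns from a list of row dicts (stand-in for the writer)."""
    drop = set(redacted_columns_for(mode))
    return [{k: v for k, v in row.items() if k not in drop} for row in rows]


rows = [{"touch_count": 3, "total_touches_all": 5, "current_stage": "won"}]
public_rows = strip_redacted(rows, ExposureMode.student_public)
```

Keeping the descriptive flag and the prescriptive set separate is what lets `total_touches_all` stay flagged as risky without being stripped.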

### Follow-up: structural leakage in `student_public` bundles (issue #57)
### Follow-up: structural leakage in `student_public` bundles (issue #57) — fully resolved

Tracked in [GitHub issue #57](https://github.com/leadforge-dev/leadforge/issues/57).

1. **Event-aggregate features are computed over the label window.** `touch_count`, `session_count`, `pricing_page_views`, `expected_acv`, `days_since_last_touch`, etc. all aggregate events in `[lead_created_at, lead_created_at + 90d]`, the same window over which the label resolves. The structural fix is a windowed snapshot (`snapshot_day=N` with `N < label_window_days`), as v6/v7 datasets already do at day 14/20. **Open** — its own PR with documentation recalibration; will likely bump `BUNDLE_SCHEMA_VERSION` again.
1. ~~**Event-aggregate features are computed over the label window.** `touch_count`, `session_count`, `pricing_page_views`, `expected_acv`, `days_since_last_touch`, etc. all aggregate events in `[lead_created_at, lead_created_at + 90d]`, the same window over which the label resolves.~~ **Resolved** — windowed snapshot at `snapshot_day=30` (recipe default). `BUNDLE_SCHEMA_VERSION` bumped 3 → 4; `manifest.snapshot_day` records the contract. Conversion rates invariant (label is event-derived from `label_window_days`); trap gap (`total_touches_all − touch_count`) ~3 touches with 54–77% of leads showing divergence. Guarded by `test_bundle_schema_v4_contract.py` and `test_windowed_bundle_trap.py`.
2. ~~**`is_sql=False` is near-deterministic for non-conversion.** Measured on the regenerated bundle: P(converted | is_sql=False) = 0.038 (intro), 0.015 (intermediate), 0.006 (advanced).~~ **Resolved** — `is_sql` redacted in `student_public` mode by post-#57 PR (bundle schema v3).
3. ~~**`is_mql` is a constant `True`.** Zero variance feature in all three tiers.~~ **Resolved** — `is_mql` removed from the canonical feature list by post-#57 PR (bundle schema v3). Guarded by a new `test_no_zero_variance_features` check.

47 changes: 39 additions & 8 deletions CHANGELOG.md
Expand Up @@ -7,6 +7,45 @@ Format inspired by [Keep a Changelog](https://keepachangelog.com/).

## Unreleased

### Bundle schema v4

`bundle_schema_version` bumped from `"3"` to `"4"`. Closes the final
sub-item of issue #57: event-aggregate features are no longer computed
over the same 90-day window the label resolves in.

- **Windowed snapshot.** `GenerationConfig.snapshot_day` (also exposed
as a recipe-level field and an explicit kwarg on
`Generator.from_recipe()`) now controls the feature aggregation
window. When set, `build_snapshot()` filters touches, sessions,
sales activities, and opportunities to events with timestamp
≤ `lead_created_at + snapshot_day`. The
`b2b_saas_procurement_v1` recipe pins `snapshot_day: 30` —
measurements at seed 42, n_leads=5000 across all three difficulty
tiers showed day 30 keeps LR AUC in [0.85, 0.86] (challenging but
modelable) while preserving a meaningful trap gap of ~3 touches
with 54–77% of leads showing any divergence between
`total_touches_all` (full-horizon) and `touch_count` (windowed).
- **Conversion rates unchanged.** The label is event-derived from
`label_window_days` in the simulator and is independent of
`snapshot_day`, so the published rates stay at 41.5% / 20.1% / 7.9%
(intro / intermediate / advanced) — well inside the declared
`difficulty_profiles.yaml` ranges.
- **`manifest.snapshot_day` recorded.** The published bundle
declares its windowing contract; consumers can distinguish
full-horizon (legacy v2/v3) bundles from windowed (v4) bundles
without inspecting package internals. Column SET is unchanged
from v3, but column VALUES are no longer full-horizon — a contract
shift that v3 consumers would not detect from schema alone.
- **Schema contract test.** `tests/render/test_bundle_schema_v3_contract.py`
renamed to `test_bundle_schema_v4_contract.py` and gains a
`snapshot_day == 30` assertion alongside the existing column-set
pinning.
- **Trap invariant guard.** New `tests/render/test_windowed_bundle_trap.py`
asserts `total_touches_all >= touch_count` for every lead and
`>` for at least some — guarding against a future refactor that
silently widens `touch_count` back to the full horizon and
collapses the pedagogical gap.
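
As a rough illustration of the windowing rule and the trap invariant described in the bullets above, the sketch below counts touches for a single lead. It is an assumption-laden toy, not `build_snapshot()` itself: `count_touches` is a hypothetical helper, the dates are made up, and treating `snapshot_day=None` as "count everything" is a simplification of the full-horizon behaviour.

```python
from datetime import datetime, timedelta


def count_touches(lead_created_at, touch_timestamps, snapshot_day=None):
    """Count touches for one lead.

    With snapshot_day set, keep only events with
    timestamp <= lead_created_at + snapshot_day days (inclusive cutoff).
    With snapshot_day=None, count every touch supplied (full horizon).
    """
    if snapshot_day is None:
        return len(touch_timestamps)
    cutoff = lead_created_at + timedelta(days=snapshot_day)
    return sum(1 for ts in touch_timestamps if ts <= cutoff)


created = datetime(2026, 1, 1)
# Hypothetical touches on days 2, 10, 30, 45 and 80 after lead creation.
touches = [created + timedelta(days=d) for d in (2, 10, 30, 45, 80)]

touch_count = count_touches(created, touches, snapshot_day=30)  # windowed
total_touches_all = count_touches(created, touches)             # full horizon
trap_gap = total_touches_all - touch_count

# The invariant the trap test guards: the windowed count never exceeds
# the full-horizon count.
assert total_touches_all >= touch_count
```

For this toy lead the windowed count covers days 2, 10, and 30, so the gap comes from the two later touches.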

### Bundle schema v3

`bundle_schema_version` bumped from `"2"` to `"3"`. Three structural
Expand Down Expand Up @@ -53,14 +92,6 @@ changes follow up on PR #56 (issue #57):
columns (down from 35); **11** columns in `tables/leads.parquet`
(down from 12 — `is_mql` removed).

### Open follow-up

Issue #57 sub-item 1 remains open: event-aggregate features
(`touch_count`, `session_count`, `pricing_page_views`, ...) are still
computed over the same 90-day window the label resolves in. The
structural fix is a windowed snapshot rebuild and is deferred to its
own PR.

---

## v1.0.0 — 2026-05-02
1 change: 1 addition & 0 deletions leadforge/api/bundle.py
Expand Up @@ -94,6 +94,7 @@ def write_bundle(
result,
population,
horizon_days=config.horizon_days,
snapshot_day=config.snapshot_day,
difficulty_params=config.difficulty_params,
seed=config.seed,
)
6 changes: 6 additions & 0 deletions leadforge/api/generator.py
Expand Up @@ -54,6 +54,7 @@ def from_recipe(
horizon_days: int | None = None,
primary_task: str | None = None,
label_window_days: int | None = None,
snapshot_day: int | None = None,
output_path: str = _MISSING, # type: ignore[assignment]
override: dict[str, Any] | None = None,
) -> Generator:
Expand All @@ -76,6 +77,10 @@ def from_recipe(
directory name and manifest key.
label_window_days: Override recipe default label observation
window in days.
snapshot_day: Override recipe default snapshot day for windowed
feature aggregation. An integer ``N`` means features aggregate only
events with ``timestamp <= lead_created_at + N days``. ``None``
(the default) leaves the recipe or package default in place; if
neither sets a value, aggregation is full-horizon (legacy).
output_path: Directory where the bundle will be saved.
override: Optional dict of overrides (mirrors a ``--override`` file).
Applied after recipe defaults but before explicit kwargs.
Expand Down Expand Up @@ -105,6 +110,7 @@ def from_recipe(
horizon_days=horizon_days,
primary_task=primary_task,
label_window_days=label_window_days,
snapshot_day=snapshot_day,
output_path=output_path,
override=override,
)
23 changes: 23 additions & 0 deletions leadforge/api/recipes.py
Expand Up @@ -41,6 +41,7 @@ class Recipe:
default_population: dict[str, int]
horizon_days: int
label_window_days: int | None = None
snapshot_day: int | None = None

# ------------------------------------------------------------------ #
# Construction
Expand Down Expand Up @@ -105,6 +106,19 @@ def from_dict(cls, data: dict[str, Any]) -> Recipe:
raise InvalidRecipeError(f"'label_window_days' must be positive, got {raw_lwd}")
label_window_days = raw_lwd

snapshot_day: int | None = None
raw_sd = data.get("snapshot_day")
if raw_sd is not None:
if isinstance(raw_sd, bool) or not isinstance(raw_sd, int):
raise InvalidRecipeError(
f"'snapshot_day' must be a positive int or null, got {type(raw_sd).__name__!r}"
)
if raw_sd <= 0:
raise InvalidRecipeError(
f"'snapshot_day' must be a positive int or null, got {raw_sd}"
)
snapshot_day = raw_sd

return cls(
id=data["id"],
title=data["title"],
Expand All @@ -116,6 +130,7 @@ def from_dict(cls, data: dict[str, Any]) -> Recipe:
default_population=dict(pop),
horizon_days=horizon_days,
label_window_days=label_window_days,
snapshot_day=snapshot_day,
)

# ------------------------------------------------------------------ #
Expand All @@ -134,6 +149,7 @@ def resolve_config(
horizon_days: int | None = None,
primary_task: str | None = None,
label_window_days: int | None = None,
snapshot_day: int | None = None,
output_path: str = _MISSING, # type: ignore[assignment]
override: dict[str, Any] | None = None,
) -> GenerationConfig:
Expand Down Expand Up @@ -165,6 +181,7 @@ def resolve_config(
"horizon_days": pkg["horizon_days"],
"primary_task": pkg["primary_task"],
"label_window_days": pkg["label_window_days"],
"snapshot_day": pkg["snapshot_day"],
}

# Layer 3 — recipe defaults
Expand All @@ -176,6 +193,8 @@ def resolve_config(
resolved["primary_task"] = self.primary_task
if self.label_window_days is not None:
resolved["label_window_days"] = self.label_window_days
if self.snapshot_day is not None:
resolved["snapshot_day"] = self.snapshot_day

# Layer 2 — override dict (beats recipe/package defaults)
if override:
Expand All @@ -186,6 +205,7 @@ def resolve_config(
"horizon_days",
"primary_task",
"label_window_days",
"snapshot_day",
"seed",
"output_path",
"exposure_mode",
Expand Down Expand Up @@ -216,6 +236,8 @@ def resolve_config(
resolved["primary_task"] = primary_task
if label_window_days is not None:
resolved["label_window_days"] = label_window_days
if snapshot_day is not None:
resolved["snapshot_day"] = snapshot_day

try:
mode = ExposureMode(resolved["exposure_mode"])
Expand Down Expand Up @@ -254,6 +276,7 @@ def resolve_config(
horizon_days=resolved["horizon_days"],
primary_task=resolved["primary_task"],
label_window_days=resolved["label_window_days"],
snapshot_day=resolved["snapshot_day"],
output_path=resolved["output_path"],
)
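
The layering that `resolve_config` applies to `snapshot_day` can be summarised in a small sketch. This is only an approximation of the precedence visible in the hunks above (package default, then recipe default, then override dict, then explicit kwarg); `resolve_snapshot_day` is a hypothetical helper, and how the real override layer treats an explicit `None` is an assumption here.

```python
def resolve_snapshot_day(package_default, recipe_value=None, override=None, kwarg=None):
    """Resolve snapshot_day from lowest to highest precedence."""
    resolved = package_default            # package default (lowest)
    if recipe_value is not None:          # recipe default
        resolved = recipe_value
    if override and "snapshot_day" in override:  # override dict
        # Assumption: a key present in the override dict always wins here.
        resolved = override["snapshot_day"]
    if kwarg is not None:                 # explicit kwarg (highest)
        resolved = kwarg
    return resolved


# A recipe that pins 30 wins over a package default of None.
recipe_only = resolve_snapshot_day(None, recipe_value=30)
# An explicit kwarg beats both the recipe and the override dict.
explicit = resolve_snapshot_day(None, 30, {"snapshot_day": 14}, kwarg=20)
```

One consequence of the `is not None` guards is that passing `snapshot_day=None` as a kwarg cannot switch a recipe-pinned window back to full horizon.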

27 changes: 27 additions & 0 deletions leadforge/core/models.py
Expand Up @@ -64,6 +64,7 @@ class GenerationConfig:
horizon_days: int = 90
primary_task: str = "converted_within_90_days"
label_window_days: int = 90
snapshot_day: int | None = None
output_path: str = "./out"
package_version: str = field(default_factory=lambda: __version__)
difficulty_params: DifficultyParams | None = None
Expand All @@ -87,6 +88,32 @@ def __post_init__(self) -> None:
f"label_window_days ({self.label_window_days}) must not exceed "
f"horizon_days ({self.horizon_days})"
)
if self.snapshot_day is not None:
if isinstance(self.snapshot_day, bool) or not isinstance(self.snapshot_day, int):
raise InvalidConfigError(
f"snapshot_day must be a positive int or None, "
f"got {type(self.snapshot_day).__name__!r}"
)
if self.snapshot_day <= 0:
raise InvalidConfigError(
f"snapshot_day must be a positive int or None, got {self.snapshot_day}"
)
if self.snapshot_day > self.horizon_days:
raise InvalidConfigError(
f"snapshot_day ({self.snapshot_day}) must not exceed "
f"horizon_days ({self.horizon_days})"
)
# A snapshot anchored after the label closes would let features
# observe events that occur beyond the label-scoring window —
# exactly the structural leakage the windowed snapshot is here
# to prevent. Reject at config time.
if self.snapshot_day > self.label_window_days:
raise InvalidConfigError(
f"snapshot_day ({self.snapshot_day}) must not exceed "
f"label_window_days ({self.label_window_days}); a snapshot "
f"anchored after the label closes would re-introduce "
f"structural leakage."
)
# Coerce string enums supplied as plain strings
if not isinstance(self.exposure_mode, ExposureMode):
try:
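
The `snapshot_day` checks added to `__post_init__` can be exercised in isolation with a standalone sketch like the one below. It mirrors the order of the checks in the hunk above, but it is not the project's code: `validate_snapshot_day` is a hypothetical free function, and `ValueError` stands in for `InvalidConfigError`.

```python
def validate_snapshot_day(snapshot_day, horizon_days=90, label_window_days=90):
    """Raise ValueError if snapshot_day is not None or a valid positive int."""
    if snapshot_day is None:
        return  # None means full-horizon aggregation; nothing to check
    # bool is a subclass of int, so reject it explicitly before the int check.
    if isinstance(snapshot_day, bool) or not isinstance(snapshot_day, int):
        raise ValueError(
            f"snapshot_day must be a positive int or None, "
            f"got {type(snapshot_day).__name__!r}"
        )
    if snapshot_day <= 0:
        raise ValueError(f"snapshot_day must be a positive int or None, got {snapshot_day}")
    if snapshot_day > horizon_days:
        raise ValueError(
            f"snapshot_day ({snapshot_day}) must not exceed horizon_days ({horizon_days})"
        )
    # Features must not observe events after the label window closes.
    if snapshot_day > label_window_days:
        raise ValueError(
            f"snapshot_day ({snapshot_day}) must not exceed "
            f"label_window_days ({label_window_days})"
        )


validate_snapshot_day(30)    # accepted: the recipe's pinned value
validate_snapshot_day(None)  # accepted: legacy full-horizon
```

The explicit `bool` check matters because `True` would otherwise pass as the integer 1.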
24 changes: 22 additions & 2 deletions leadforge/narrative/dataset_card.py
Expand Up @@ -54,6 +54,11 @@ def render_dataset_card(
# ------------------------------------------------------------------
# Header
# ------------------------------------------------------------------
snapshot_label = (
f"{cfg.snapshot_day} days (windowed)"
if cfg.snapshot_day is not None and cfg.snapshot_day < cfg.horizon_days
else f"{cfg.horizon_days} days (full horizon)"
)
lines += [
"# leadforge dataset card",
"",
Expand All @@ -65,6 +70,8 @@ def render_dataset_card(
f"| Exposure mode | `{cfg.exposure_mode}` |",
f"| Difficulty | `{cfg.difficulty}` |",
f"| Horizon | {cfg.horizon_days} days |",
f"| Label window | {cfg.label_window_days} days |",
f"| Feature snapshot window | {snapshot_label} |",
"",
]

Expand Down Expand Up @@ -188,14 +195,27 @@ def render_dataset_card(
# ------------------------------------------------------------------
# Caveats
# ------------------------------------------------------------------
if cfg.snapshot_day is not None and cfg.snapshot_day < cfg.horizon_days:
feature_window_caveat = (
f"- The label is evaluated over the full {cfg.label_window_days}-day "
f"window from lead creation; event-aggregate features (e.g. "
f"`touch_count`, `session_count`, `expected_acv`) observe only the "
f"first {cfg.snapshot_day} days of that window. The deliberate "
f"exception is `total_touches_all`, which counts touches over the "
f"full {cfg.horizon_days}-day horizon as a pedagogical leakage trap."
)
else:
feature_window_caveat = (
"- Features are anchored at the snapshot date. No post-anchor data is "
"included (leakage-free by construction)."
)
lines += [
"## Caveats",
"",
"- This is **synthetic** data. It does not represent any real company, product, or market.",
"- The hidden world structure varies by motif family and stochastic rewiring; "
"no two seeds produce the same DGP.",
"- Features are anchored at the snapshot date. No post-anchor data is "
"included (leakage-free by construction).",
feature_window_caveat,
"- In `student_public` mode, the latent world graph, mechanism summary, "
"and full world spec are withheld.",
"",
3 changes: 3 additions & 0 deletions leadforge/recipes/b2b_saas_procurement_v1/recipe.yaml
Expand Up @@ -18,3 +18,6 @@ default_population:
n_contacts: 4200
n_leads: 5000
horizon_days: 90
# Feature aggregation window in days; see CHANGELOG (bundle schema v4) and
# release/README.md for the rationale. Must satisfy snapshot_day <= label_window_days.
snapshot_day: 30
11 changes: 10 additions & 1 deletion leadforge/render/manifests.py
Expand Up @@ -28,7 +28,15 @@
# feature list (zero-variance); ``is_sql`` redacted in
# ``student_public`` mode (near-deterministic for non-conversion).
# ``manifest.redacted_columns`` was already added in PR #56.
BUNDLE_SCHEMA_VERSION = "3"
# "4" — issue #57 sub-item 1: windowed snapshot. Event-aggregate
# features (touch_count, session_count, expected_acv, ...) now
# aggregate only events within ``[lead_created_at, lead_created_at
# + snapshot_day]``. Column SET unchanged from v3, but column
# VALUES are no longer full-horizon — consumers pinning v3 and
# assuming "features computed over full horizon" must update.
# ``manifest.snapshot_day`` recorded so the contract is
# self-describing (``null`` means full-horizon, legacy behaviour).
BUNDLE_SCHEMA_VERSION = "4"

# Manifest fields whose value is non-deterministic by design (wall-clock,
# host metadata, etc.). Determinism checks must ignore these fields when
Expand Down Expand Up @@ -106,6 +114,7 @@ def build_manifest(
"horizon_days": config.horizon_days,
"primary_task": config.primary_task,
"label_window_days": config.label_window_days,
"snapshot_day": config.snapshot_day,
"motif_family": world_graph.motif_family,
"redacted_columns": redacted_columns_list,
"tables": tables,
Expand Down
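
On the consumer side, the recorded `snapshot_day` lets a reader tell windowed bundles from full-horizon ones using the manifest alone. The sketch below is a guess at how that check might look: `describe_windowing` is a hypothetical helper, the inline JSON is a hand-written fragment rather than a real manifest, and only the keys `snapshot_day`, `horizon_days`, and `label_window_days` are taken from the `build_manifest` hunk above.

```python
import json


def describe_windowing(manifest):
    """Classify a bundle's feature window from its manifest dict.

    A missing or null snapshot_day is treated as full-horizon (legacy);
    an integer means features were aggregated over that many days.
    """
    snapshot_day = manifest.get("snapshot_day")
    if snapshot_day is None:
        return "full-horizon"
    return f"windowed ({snapshot_day} days)"


# Hand-written manifest fragments for illustration only.
v4_manifest = json.loads(
    '{"snapshot_day": 30, "horizon_days": 90, "label_window_days": 90}'
)
legacy_manifest = json.loads('{"horizon_days": 90, "label_window_days": 90}')

v4_kind = describe_windowing(v4_manifest)
legacy_kind = describe_windowing(legacy_manifest)
```

Using `.get()` keeps the check tolerant of older manifests that predate the field.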