Merged
11 changes: 7 additions & 4 deletions .agent-plan.md
@@ -86,10 +86,13 @@ motif sampling + population + sim + `LifecycleArtifacts`; lifecycle relational
first on-disk lifecycle bundle: 6 relational tables + 8 task dirs (both
regimes) + lifecycle dataset card + manifest extra_fields + hidden-truth
metadata; difficulty_params threaded; student_public refused until 4c) opened
-as **#126**. Next: `Pn.4c` (student_public snapshot-safety + CLAUDE.md +
-recipe-driven difficulty resolution), `Pn.4d` (shared bundle orchestrator),
-`LTV-Po` (recipe). Note: `validate_bundle` is lead-scoring-coupled — scheme-
-aware validation is `LTV-Pp`.
+as **#126** (merged). `LTV-Pn.4c` (student_public snapshot-safety — public
+relational projection: event tables ≤ observation_date, subscriptions
+stateful/terminal columns dropped; manifest flags; CLAUDE.md clause;
+lead-scoring byte-identical) opened as **#127**. Next: `Pn.4d` (shared bundle
+orchestrator), `LTV-Po` (recipe; also recipe-driven difficulty resolution).
+Note: `validate_bundle` is lead-scoring-coupled — scheme-aware validation is
+`LTV-Pp`.

---

1 change: 1 addition & 0 deletions CLAUDE.md
@@ -212,6 +212,7 @@ Key abstractions: `Recipe`, `GenerationConfig`, `WorldSpec`, `WorldBundle`, `Exp
- Never use a single fixed hidden world (DGP must vary by motif family + rewiring).
- Never leak post-snapshot-anchor data into flat task features.
- **Never publish public relational tables that allow label reconstruction via joins.** Public relational exports must be snapshot-safe: every `*_timestamp` column in event tables (`touches.touch_timestamp`, `sessions.session_timestamp`, `sales_activities.activity_timestamp`) must satisfy `<= lead_created_at + snapshot_day`; `opportunities` must be filtered by `created_at <= lead_created_at + snapshot_day`; no terminal-state fields (`close_outcome`, `closed_at`, `converted_within_90_days`, `conversion_timestamp`) in public `leads`/`opportunities`; no conversion-conditional entities (`customers`, `subscriptions`) in public bundles.
+- **(lifecycle / `b2b_saas_ltv_v1` scheme)** The public relational export is snapshot-safe against the absolute `observation_date` cutoff: every timestamp column in the public event tables (`subscription_events.event_timestamp`, `health_signals.period_start`, `invoices.invoice_date`) must satisfy `<= observation_date`; the public `subscriptions` table drops all stateful/terminal columns (`subscription_status`, `current_mrr`, `renewal_count`, `expansion_count`, `subscription_end_at`, `churn_at`, `churn_reason`), keeping only the at-signing identity (`subscription_id`, `customer_id`, `plan_name`, `subscription_start_at`, `contract_term_months`); no pLTV target (`ltv_revenue_*`) or churn label appears in any public relational table. Each task split carries only its own target (no cross-target leakage); the `mrr_change_full_period` trap is deliberately retained in all modes. The early-pLTV (tenure-anchored) task family is **omitted from `student_public` bundles** — its forward window precedes `observation_date`, so its targets would be reconstructible by joining the public event tables; it ships in `research_instructor` only. The calendar-anchored family is published (its targets fall after `observation_date`).
- Never require external APIs for core generation.
- Never publish hidden truth in `student_public` mode.
- Never derive `converted_within_90_days` as a directly sampled label; it must emerge from simulated events.
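The lifecycle clause added in this hunk can be checked mechanically. The following is a minimal sketch under stated assumptions, not the project's validator: `find_snapshot_violations` and the rows-as-dicts table representation are hypothetical, while the table and column names are the ones listed in the clause.

```python
# Hypothetical checker: tables are modelled as {name: [row dicts]} for brevity.
EVENT_TIMESTAMP_COLUMNS = {
    "subscription_events": "event_timestamp",
    "health_signals": "period_start",
    "invoices": "invoice_date",
}
BANNED_SUBSCRIPTION_COLUMNS = (
    "subscription_status", "current_mrr", "renewal_count", "expansion_count",
    "subscription_end_at", "churn_at", "churn_reason",
)


def find_snapshot_violations(tables, observation_date):
    """Return readable violations of the public snapshot-safety clause."""
    violations = []
    # 1. No event row may be dated after the cutoff. This assumes zero-padded
    #    ISO date strings (YYYY-MM-DD), which sort chronologically as text.
    for table, ts_col in EVENT_TIMESTAMP_COLUMNS.items():
        for row in tables.get(table, []):
            if row[ts_col] > observation_date:
                violations.append(
                    f"{table}.{ts_col}={row[ts_col]} is after {observation_date}"
                )
    # 2. Inspect column names through the first row of each table.
    for table, rows in tables.items():
        for col in (rows[0] if rows else {}):
            if table == "subscriptions" and col in BANNED_SUBSCRIPTION_COLUMNS:
                violations.append(f"subscriptions.{col} is a stateful/terminal column")
            if col.startswith("ltv_revenue_"):
                violations.append(f"{table}.{col} is a pLTV target")
    return violations
```

An empty return value means none of the three rules fired; it does not prove the bundle is free of every possible join-based leak.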
33 changes: 26 additions & 7 deletions docs/ltv/roadmap.md
@@ -46,7 +46,7 @@ protocol + registry, with the package physically reorganized into
| `LTV-M3` | Customer population + lifecycle world | `LTV-Ph`, `LTV-Pi` | #113 (Ph) |
| `LTV-M4` | Lifecycle simulation engine | `LTV-Pj`, `LTV-Pk` | #117 (Pj), #118 (Pk) |
| `LTV-M5` | Customer snapshots + pLTV targets (both regimes) | `LTV-Pl`, `LTV-Pm` | #119 (Pl), #120 (Pm) |
-| `LTV-M6` | Register LifecycleScheme + recipe + manifest/version | `LTV-Pn.1…4`, `LTV-Po` | #121 (Pn.1), #122 (Pn.2), #124 (Pn.3), #125 (Pn.4a), #126 (Pn.4b) |
+| `LTV-M6` | Register LifecycleScheme + recipe + manifest/version | `LTV-Pn.1…4`, `LTV-Po` | #121 (Pn.1), #122 (Pn.2), #124 (Pn.3), #125 (Pn.4a), #126 (Pn.4b), #127 (Pn.4c) |
| `LTV-M7` | Validation + regression-metric calibration | `LTV-Pp` | |
| `LTV-M8` | CLI, notebooks, publish | `LTV-Pq`, `LTV-Pr`, `LTV-Ps` | |

@@ -342,12 +342,31 @@ methods, then public-safety, then the carried orchestrator cleanup:
coupled (applies lead-scoring FK/table/task checks) and errors on a
lifecycle bundle; scheme-aware validation is `LTV-Pp`.
- Labels: `type: feature`, `layer: api`, `layer: render`
-- [ ] **`LTV-Pn.4c`** — `feat(lifecycle): student_public snapshot-safety`.
-Public relational filtering (event tables ≤ cutoff; drop terminal
-`churn_at`/`churn_reason`/`subscription_end_at`; no target columns); the
-early-regime degenerate-column + dtype-preserving-missingness flags from
-LTV-Pm. Extend `CLAUDE.md` hard constraints with the lifecycle
-snapshot-safety clause + the `schemes/` layout.
+- [x] **`LTV-Pn.4c`** — `feat(lifecycle): student_public snapshot-safety`
+(**PR #127**). New `schemes/lifecycle/render/relational_snapshot_safe.py`
+projects the public relational tables: event tables filtered to
+`<= observation_date`; `subscriptions` drops its stateful/terminal columns.
+**Note:** design §5 named only the three terminal fields, but the four
+*stateful* columns (`subscription_status`/`current_mrr`/`renewal_count`/
+`expansion_count`) also hold end-of-sim values that leak the targets, so the
+banned set (`leakage_probes.LIFECYCLE_BANNED_SUBSCRIPTION_COLUMNS`) extends
+the spec. `write_bundle` drops the public guard and wires the projection;
+manifest records `relational_snapshot_safe` + `structural_redactions`
+(`build_manifest` gained a pass-through `structural_redactions` param — the
+last lead-scoring coupling in the manifest builder; lead-scoring byte-
+identical). `CLAUDE.md` gains the lifecycle snapshot-safety clause. The
+per-task single-target splits + cutoff-bounded features (LTV-Pn.4b) already
+satisfy public task safety; the early-regime degenerate-column flags are
+documented (LTV-Pm).
+- **Design decision (self-review):** the early-pLTV (tenure-anchored) task
+family is **omitted from `student_public` bundles** — its forward window
+precedes `observation_date`, so its target is exactly reconstructible by
+joining the public event tables (verified: 52/60 customers). One
+`observation_date`-anchored relational export cannot serve both regimes, so
+the early family is instructor-only for now. Revisit if public early-pLTV
+is wanted (would need per-regime relational exports or a relational-free
+public early task) — flag for `LTV-Po`/design-doc update; tension noted
+against D8's "first-class early-pLTV".
- Labels: `type: feature`, `layer: exposure`, `layer: render`, `layer: docs`
- [ ] **`LTV-Pn.4d`** — `refactor: shared bundle orchestrator`. With both
schemes' `write_bundle` in hand, lift the shared orchestrator (mkdir →
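The reconstruction risk behind the `LTV-Pn.4c` design decision can be shown with toy numbers. This is a hedged sketch: the invoice `amount` field, the dates, and the assumption that an early target is the sum of invoice amounts inside its forward window are illustrative and not taken from the codebase.

```python
from datetime import date, timedelta

observation_date = date(2024, 6, 30)  # public relational cutoff
early_cutoff = date(2024, 1, 29)      # hypothetical: start + early_tenure_weeks
window_days = 90                      # one hypothetical forward window

# Public invoices are filtered only to <= observation_date.
public_invoices = [
    {"customer_id": "c1", "invoice_date": date(2024, 1, 15), "amount": 100.0},
    {"customer_id": "c1", "invoice_date": date(2024, 2, 15), "amount": 100.0},
    {"customer_id": "c1", "invoice_date": date(2024, 3, 15), "amount": 120.0},
    {"customer_id": "c1", "invoice_date": date(2024, 5, 15), "amount": 120.0},
]

window_end = early_cutoff + timedelta(days=window_days)

# When the whole early window ends on or before observation_date, every
# invoice that makes up the early target is present in the public table.
window_fully_public = window_end <= observation_date

# A student can then rebuild the "hidden" early target with a plain filter.
reconstructed_target = sum(
    inv["amount"]
    for inv in public_invoices
    if early_cutoff < inv["invoice_date"] <= window_end
)
```

Under these assumptions the calendar-anchored family does not have this problem, because its window starts at `observation_date` and the matching invoices have already been filtered out of the public table.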
12 changes: 11 additions & 1 deletion leadforge/render/manifests.py
@@ -80,6 +80,7 @@ def build_manifest(
     relational_snapshot_safe: bool = False,
     motif_family: str | None = None,
     extra_fields: dict[str, Any] | None = None,
+    structural_redactions: dict[str, Any] | None = None,
 ) -> dict[str, Any]:
     """Build the bundle manifest dict.

@@ -116,6 +117,11 @@ def build_manifest(
         extra_fields: Optional scheme-specific top-level manifest keys merged
             into the result (e.g. the lifecycle scheme's ``observation_date``
             and forward windows). Must not collide with a core manifest key.
+        structural_redactions: Optional scheme-supplied table-level redaction
+            record (``{"columns": {...}, "omitted_tables": [...]}``). When
+            ``None`` the lead-scoring default is computed from the
+            snapshot-safe flag (back-compat); schemes with a different public
+            relational shape (e.g. lifecycle) pass their own.

     Returns:
         A JSON-serialisable dict ready to be written as ``manifest.json``.
@@ -164,7 +170,11 @@ def build_manifest(
         "motif_family": motif_family,
         "redacted_columns": redacted_columns_list,
         "relational_snapshot_safe": bool(relational_snapshot_safe),
-        "structural_redactions": _build_structural_redactions(bool(relational_snapshot_safe)),
+        "structural_redactions": (
+            structural_redactions
+            if structural_redactions is not None
+            else _build_structural_redactions(bool(relational_snapshot_safe))
+        ),
         "tables": tables,
         "tasks": tasks,
     }
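The `structural_redactions` change in `build_manifest` uses a `None` sentinel so that existing lead-scoring callers keep the computed default. A minimal sketch of that pattern follows; `resolve_structural_redactions` and `_default_redactions` are hypothetical stand-ins, and the default's contents are placeholders because the real `_build_structural_redactions` body is not shown in this diff.

```python
from typing import Any, Optional


def _default_redactions(snapshot_safe: bool) -> dict:
    # Placeholder for the lead-scoring default (real contents not shown here).
    omitted = ["placeholder_table"] if snapshot_safe else []
    return {"columns": {}, "omitted_tables": omitted}


def resolve_structural_redactions(
    override: Optional[dict],
    snapshot_safe: bool,
) -> dict:
    """Prefer the scheme-supplied record; fall back to the default on None."""
    # `is not None` (rather than truthiness) keeps an explicit empty record
    # from being silently replaced by the default.
    return override if override is not None else _default_redactions(snapshot_safe)


lifecycle_record: dict = {
    "columns": {"subscriptions": ["churn_at", "churn_reason"]},
    "omitted_tables": [],
}
```

Testing against `None` instead of `if structural_redactions:` matters for a scheme that redacts nothing: its empty record should reach the manifest unchanged.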
62 changes: 43 additions & 19 deletions leadforge/schemes/lifecycle/__init__.py
@@ -122,28 +122,30 @@ def write_bundle(
         path: str,
         generation_timestamp: str | None = None,
     ) -> None:
-        """Serialise a lifecycle *bundle* to *path* (instructor mode).
+        """Serialise a lifecycle *bundle* to *path*.

         Writes the six relational tables, both observation regimes' snapshots
         split into 8 task directories (3 pLTV regression + 1 churn
         classification per regime, the early regime prefixed ``early_``), a
         dataset card, the feature dictionary, the hidden-truth ``metadata/``
-        (via :meth:`write_metadata`), and the manifest (recording
-        ``generation_scheme`` + ``observation_date`` + the forward windows).
+        (instructor only, via :meth:`write_metadata`), and the manifest
+        (``generation_scheme`` + ``observation_date`` + forward windows).

         ``config.difficulty_params`` is threaded into both snapshot builders —
         when set (LTV-Po resolves it from the recipe profile), it drives the
         snapshot distortions.

-        Only ``research_instructor`` mode is supported here. The
-        ``student_public`` snapshot-safety projection (event-table cutoff
-        filtering, terminal-column drops, per-task target projection) lands in
-        LTV-Pn.4c; until then this refuses to write a public bundle rather than
-        emit one that is not snapshot-safe.
+        ``student_public`` bundles are projected snapshot-safe: the relational
+        event tables are filtered to ``<= observation_date`` and the
+        ``subscriptions`` table's stateful/terminal columns are dropped (see
+        :mod:`leadforge.schemes.lifecycle.render.relational_snapshot_safe`); no
+        ``metadata/`` is written; and the manifest records
+        ``relational_snapshot_safe`` + ``structural_redactions``. The per-task
+        splits are single-target and cutoff-bounded by construction.
         """
         from pathlib import Path

-        from leadforge.core.enums import ExposureMode
+        from leadforge.exposure.filters import get_filter
         from leadforge.exposure.modes import apply_exposure
         from leadforge.render.manifests import build_manifest, write_manifest
         from leadforge.render.relational_io import write_relational_tables
@@ -153,6 +155,10 @@ def write_bundle(
         from leadforge.schemes.lifecycle.features import CUSTOMER_SNAPSHOT_FEATURES
         from leadforge.schemes.lifecycle.render.dataset_card import render_lifecycle_dataset_card
         from leadforge.schemes.lifecycle.render.relational import to_dataframes
+        from leadforge.schemes.lifecycle.render.relational_snapshot_safe import (
+            LIFECYCLE_BANNED_SUBSCRIPTION_COLUMNS,
+            to_dataframes_snapshot_safe,
+        )
         from leadforge.schemes.lifecycle.snapshots import (
             FORWARD_WINDOWS_DAYS,
             build_customer_snapshot,
@@ -171,36 +177,52 @@ def write_bundle(
                 "Call Generator.generate() / build_world() first."
             )
         config = bundle.spec.config
-        if config.exposure_mode is not ExposureMode.research_instructor:
-            raise NotImplementedError(
-                f"lifecycle write_bundle currently supports only "
-                f"research_instructor; {config.exposure_mode.value!r} (snapshot-safe "
-                "public export) lands in LTV-Pn.4c"
-            )
+        bundle_filter = get_filter(config.exposure_mode)

         population = artifacts.population
         sim = artifacts.simulation_result
         root = Path(path)
         root.mkdir(parents=True, exist_ok=True)

         # 1. Relational tables → tables/
+        # student_public is projected snapshot-safe (event tables filtered to
+        # <= observation_date; subscriptions' stateful/terminal columns
+        # dropped). research_instructor keeps the full-horizon shape.
         dfs = to_dataframes(sim, population)
+        structural_redactions: dict[str, object] | None = None
+        if bundle_filter.relational_snapshot_safe:
+            dfs = to_dataframes_snapshot_safe(dfs, cutoff=population.observation_date)
+            structural_redactions = {
+                "columns": {"subscriptions": sorted(LIFECYCLE_BANNED_SUBSCRIPTION_COLUMNS)},
+                "omitted_tables": [],
+            }
         table_row_counts = write_relational_tables(dfs, root / "tables")

-        # 2. Both regime snapshots → 8 task directories.
+        # 2. Regime snapshots → task directories.
         # difficulty_params (None until LTV-Po resolves it) drives distortions.
+        #
+        # The early-pLTV (tenure-anchored) family is OMITTED from snapshot-safe
+        # public bundles: its forward window (start + early_tenure_weeks + Nd)
+        # precedes the relational cutoff (observation_date), so its targets are
+        # reconstructible by joining the public event tables (invoices between
+        # the early cutoff and observation_date *are* the early target window).
+        # One observation_date-anchored relational export cannot serve both
+        # regimes; the early family stays instructor-only. The calendar family
+        # is safe (its targets fall after observation_date, absent from the
+        # public relational tables).
         snapshots = {
             CALENDAR_REGIME: build_customer_snapshot(
                 population, sim, difficulty_params=config.difficulty_params, seed=config.seed
             ),
-            EARLY_REGIME: build_early_pltv_snapshot(
+        }
+        if not bundle_filter.relational_snapshot_safe:
+            snapshots[EARLY_REGIME] = build_early_pltv_snapshot(
                 population,
                 sim,
                 early_tenure_weeks=config.early_tenure_weeks,
                 difficulty_params=config.difficulty_params,
                 seed=config.seed,
-            ),
-        }
+            )
         # Each task is a standalone single-target split: drop every OTHER
         # target column so a task's parquet cannot leak the answer's siblings
         # (e.g. ltv_revenue_730d ⊇ ltv_revenue_90d). The deliberate
@@ -250,6 +272,8 @@ def write_bundle(
                 "forward_windows_days": list(FORWARD_WINDOWS_DAYS),
                 "early_tenure_weeks": config.early_tenure_weeks,
             },
+            relational_snapshot_safe=bundle_filter.relational_snapshot_safe,
+            structural_redactions=structural_redactions,
         )
         write_manifest(manifest, root)

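The single-target split described in the `write_bundle` comments (each task drops every other target column while the `mrr_change_full_period` trap stays) can be sketched as a pure column projection. This is an illustration under assumptions: `project_single_target` is hypothetical, and of the target names only `ltv_revenue_90d` and `ltv_revenue_730d` appear in the diff; `ltv_revenue_365d` and `churn_label` are made-up placeholders.

```python
# Placeholder target set: two names come from the diff, two are invented.
TARGET_COLUMNS = (
    "ltv_revenue_90d",
    "ltv_revenue_365d",  # hypothetical middle window
    "ltv_revenue_730d",
    "churn_label",       # hypothetical churn target name
)


def project_single_target(columns, target, all_targets=TARGET_COLUMNS):
    """Keep features plus `target`; drop every sibling target column."""
    if target not in all_targets:
        raise ValueError(f"unknown target: {target!r}")
    siblings = set(all_targets) - {target}
    # Column order is preserved; the trap column is a feature, so it stays.
    return [c for c in columns if c not in siblings]


snapshot_columns = ["customer_id", "mrr_change_full_period", *TARGET_COLUMNS]
task_columns = project_single_target(snapshot_columns, "ltv_revenue_90d")
```

Dropping siblings matters because the windows nest: a longer-window revenue column bounds the shorter one, so shipping both in one task file would leak part of the answer.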
106 changes: 106 additions & 0 deletions leadforge/schemes/lifecycle/render/relational_snapshot_safe.py
@@ -0,0 +1,106 @@
+"""Snapshot-safe relational export for ``student_public`` lifecycle bundles.
+
+:func:`to_dataframes_snapshot_safe` projects the full-horizon dict from
+:func:`leadforge.schemes.lifecycle.render.relational.to_dataframes` onto the
+shape published in public bundles, enforcing the design.md §5 contract against
+the absolute calendar ``cutoff`` (the world ``observation_date``):
+
+* event tables (``subscription_events`` / ``health_signals`` / ``invoices``)
+  are row-filtered to ``timestamp <= cutoff`` — no post-cutoff events;
+* ``subscriptions`` drops its stateful/terminal columns
+  (:data:`LIFECYCLE_BANNED_SUBSCRIPTION_COLUMNS`), keeping only the at-signing
+  identity (plan, term, start) — current MRR / status / counts / churn fields
+  all hold end-of-simulation values that leak the pLTV / churn targets;
+* ``accounts`` / ``customers`` pass through (firmographic / at-signing, no
+  post-cutoff state).
+
+The public **task** parquets are already snapshot-safe by construction (their
+features are computed at/before the cutoff and each carries only its own
+target); this module only governs the relational ``tables/``.
+
+The cutoff is the calendar regime's ``observation_date``. The early-pLTV
+(tenure-anchored) task family is therefore **omitted from public bundles**
+(``LifecycleScheme.write_bundle``): its forward window precedes
+``observation_date``, so its targets would be reconstructible by joining the
+public event tables (the invoices between the early cutoff and
+``observation_date`` *are* the early target window). A single
+``observation_date``-anchored relational export cannot serve both regimes; the
+early family stays instructor-only.
+
+``research_instructor`` keeps the full-horizon
+:func:`~leadforge.schemes.lifecycle.render.relational.to_dataframes`.
+"""
+
+from __future__ import annotations
+
+from typing import TYPE_CHECKING
+
+from leadforge.validation.leakage_probes import (
+    LIFECYCLE_BANNED_SUBSCRIPTION_COLUMNS,
+    LIFECYCLE_SNAPSHOT_FILTERED_TABLES,
+)
+
+if TYPE_CHECKING:
+    from collections.abc import Mapping
+
+    import pandas as pd
+
+__all__ = [
+    "LIFECYCLE_BANNED_SUBSCRIPTION_COLUMNS",
+    "LIFECYCLE_SNAPSHOT_FILTERED_TABLES",
+    "to_dataframes_snapshot_safe",
+]
+
+# Canonical output order (parity with the full-horizon to_dataframes).
+_OUTPUT_ORDER = (
+    "accounts",
+    "customers",
+    "subscriptions",
+    "subscription_events",
+    "health_signals",
+    "invoices",
+)
+
+
+def to_dataframes_snapshot_safe(
+    dfs: Mapping[str, pd.DataFrame],
+    *,
+    cutoff: str,
+) -> dict[str, pd.DataFrame]:
+    """Project the full-horizon lifecycle relational dict to its public shape.
+
+    Args:
+        dfs: Output of
+            :func:`leadforge.schemes.lifecycle.render.relational.to_dataframes`.
+            Input frames are never mutated.
+        cutoff: Absolute ISO date (the world ``observation_date``); event rows
+            with a timestamp strictly after it are dropped.
+
+    Returns:
+        A new dict in canonical order. ``subscriptions`` has its
+        stateful/terminal columns removed; the event tables are row-filtered to
+        ``<= cutoff``; ``accounts`` / ``customers`` pass through.
+
+    Raises:
+        ValueError: if *cutoff* is empty.
+    """
+    if not cutoff:
+        raise ValueError("cutoff (observation_date) must be a non-empty ISO date string")
+
+    filtered_tables = dict(LIFECYCLE_SNAPSHOT_FILTERED_TABLES)
+    banned = set(LIFECYCLE_BANNED_SUBSCRIPTION_COLUMNS)
+
+    out: dict[str, pd.DataFrame] = {}
+    for name in _OUTPUT_ORDER:
+        if name not in dfs:
+            continue
+        df = dfs[name]
+        if name == "subscriptions":
+            out[name] = df.drop(columns=[c for c in banned if c in df.columns])
+        elif name in filtered_tables:
+            ts_col = filtered_tables[name]
+            # ISO date strings compare correctly lexicographically.
+            out[name] = df[df[ts_col] <= cutoff].reset_index(drop=True)
+        else:
+            out[name] = df
+    return out
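The row filter in `to_dataframes_snapshot_safe` relies on ISO strings sorting chronologically as text. The sketch below, using only the standard library, illustrates when that holds and one edge case worth knowing. Whether the lifecycle tables store date-only or date-time strings is not shown in this diff, so the date-time case is a hypothetical, and `on_or_before` is an illustrative helper rather than part of the module.

```python
from datetime import date, datetime

cutoff = "2024-06-30"

# Zero-padded, same-format ISO dates sort chronologically as plain strings.
earlier_kept = "2024-06-29" <= cutoff
same_day_kept = "2024-06-30" <= cutoff
later_kept = "2024-07-01" <= cutoff

# Edge case: a date-time string on the cutoff day is longer than the date-only
# cutoff and therefore compares as greater, so a string filter would drop it.
same_day_with_time = "2024-06-30T09:15:00"
string_compare_keeps_it = same_day_with_time <= cutoff


def on_or_before(timestamp: str, cutoff_date: str) -> bool:
    """Compare by calendar day after parsing, independent of string length."""
    return datetime.fromisoformat(timestamp).date() <= date.fromisoformat(cutoff_date)


parsed_compare_keeps_it = on_or_before(same_day_with_time, cutoff)
```

If every timestamp column and the cutoff share one format, the string comparison in the module is sufficient; parsing is only needed when formats are mixed.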