Merged
12 changes: 8 additions & 4 deletions .agent-plan.md
@@ -82,10 +82,14 @@ split writer + `schemes/lifecycle/tasks.py` task families; discharges the
`LTV-Pc` regression-task-spec leftover) opened as **#124** (merged). `LTV-Pn.4` split into four (build → write → public-safety
→ orchestrator): `LTV-Pn.4a` (`LifecycleScheme.build_world` — deterministic
motif sampling + population + sim + `LifecycleArtifacts`; lifecycle relational
`to_dataframes`; consumes the Pn.3 config fields) opened as **#125**. Next:
`Pn.4b` (instructor `write_bundle` + tasks), `Pn.4c` (student_public
snapshot-safety + CLAUDE.md), `Pn.4d` (shared bundle orchestrator), `LTV-Po`
(recipe).
`to_dataframes`; consumes the Pn.3 config fields) opened as **#125** (merged). `LTV-Pn.4b` (instructor-mode `write_bundle` —
first on-disk lifecycle bundle: 6 relational tables + 8 task dirs (both
regimes) + lifecycle dataset card + manifest extra_fields + hidden-truth
metadata; difficulty_params threaded; student_public refused until 4c) opened
as **#126**. Next: `Pn.4c` (student_public snapshot-safety + CLAUDE.md +
recipe-driven difficulty resolution), `Pn.4d` (shared bundle orchestrator),
`LTV-Po` (recipe). Note: `validate_bundle` is lead-scoring-coupled;
scheme-aware validation is `LTV-Pp`.
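
The "refused until 4c" behaviour described above can be sketched as a small guard. This is a minimal illustration, not the project's code: the `ExposureMode` enum below is a local stand-in for `leadforge.core.enums.ExposureMode` (its member values are assumed), and `ensure_instructor_mode` is a hypothetical helper name.

```python
from enum import Enum


class ExposureMode(Enum):
    # Stand-in for leadforge.core.enums.ExposureMode; member values are assumed.
    research_instructor = "research_instructor"
    student_public = "student_public"


def ensure_instructor_mode(mode: ExposureMode) -> None:
    """Raise rather than write a bundle that is not snapshot-safe."""
    if mode is not ExposureMode.research_instructor:
        raise NotImplementedError(
            f"only research_instructor is supported; {mode.value!r} lands in LTV-Pn.4c"
        )


# The instructor mode passes silently.
ensure_instructor_mode(ExposureMode.research_instructor)

# The public mode is refused before anything is written.
try:
    ensure_instructor_mode(ExposureMode.student_public)
except NotImplementedError as exc:
    refusal = str(exc)
```

Failing before any file is written means a half-built, unsafe public bundle cannot be left on disk.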

---

27 changes: 17 additions & 10 deletions docs/ltv/roadmap.md
@@ -46,7 +46,7 @@ protocol + registry, with the package physically reorganized into
| `LTV-M3` | Customer population + lifecycle world | `LTV-Ph`, `LTV-Pi` | #113 (Ph) |
| `LTV-M4` | Lifecycle simulation engine | `LTV-Pj`, `LTV-Pk` | #117 (Pj), #118 (Pk) |
| `LTV-M5` | Customer snapshots + pLTV targets (both regimes) | `LTV-Pl`, `LTV-Pm` | #119 (Pl), #120 (Pm) |
| `LTV-M6` | Register LifecycleScheme + recipe + manifest/version | `LTV-Pn.1…4`, `LTV-Po` | #121 (Pn.1), #122 (Pn.2), #124 (Pn.3), #125 (Pn.4a) |
| `LTV-M6` | Register LifecycleScheme + recipe + manifest/version | `LTV-Pn.1…4`, `LTV-Po` | #121 (Pn.1), #122 (Pn.2), #124 (Pn.3), #125 (Pn.4a), #126 (Pn.4b) |
| `LTV-M7` | Validation + regression-metric calibration | `LTV-Pp` | |
| `LTV-M8` | CLI, notebooks, publish | `LTV-Pq`, `LTV-Pr`, `LTV-Ps` | |

@@ -325,15 +325,22 @@ methods, then public-safety, then the carried orchestrator cleanup:
`write_bundle` still stubbed.
- Tests: determinism, cross-seed motif variability, FK integrity, table shapes.
- Labels: `type: feature`, `layer: api`, `layer: render`
- [ ] **`LTV-Pn.4b`** — `feat(lifecycle): write_bundle (instructor) + tasks`.
Instructor-mode `write_bundle`: relational tables; both regime snapshots →
8 task dirs (3 pLTV regression + churn, × 2 regimes) via the shared writer;
dataset card; feature dictionary; manifest with `generation_scheme` +
`observation_date` + windows (`extra_fields`); lifecycle `write_metadata`
hidden-truth hook (latent registry + mechanism summary). First on-disk
lifecycle bundle. **Must resolve `difficulty_params` from the active profile
and thread it into `build_customer_snapshot` (Pn.4a's `build_world` does not —
without this the snapshot distortions never fire and every tier is identical).**
- [x] **`LTV-Pn.4b`** — `feat(lifecycle): write_bundle (instructor) + tasks`
(**PR #126**). Instructor-mode `write_bundle` produces the first on-disk
lifecycle bundle: six relational tables; both regime snapshots → 8 task dirs
(3 pLTV regression + churn, × 2 regimes) via the shared writer; a lifecycle
dataset card (`render/dataset_card.py` — the lead-scoring card is too
coupled to reuse); feature dictionary; manifest with `generation_scheme` +
`observation_date` + `forward_windows_days` (`extra_fields`); lifecycle
`write_metadata` hidden-truth hook (latent registry + mechanism summary;
no graph). `config.difficulty_params` is **threaded** into both snapshot
builders (tested), so recipe-resolved difficulty will drive distortions;
recipe-driven *resolution* of `difficulty_params` lands in `LTV-Po`.
`student_public` is **refused** (raises) until `LTV-Pn.4c` adds the
snapshot-safe export — never emit an unsafe public bundle.
- **Flagged:** `validation.bundle_checks.validate_bundle` is
lead-scoring-coupled (applies lead-scoring FK/table/task checks) and errors
on a lifecycle bundle; scheme-aware validation is `LTV-Pp`.
- Labels: `type: feature`, `layer: api`, `layer: render`
- [ ] **`LTV-Pn.4c`** — `feat(lifecycle): student_public snapshot-safety`.
Public relational filtering (event tables ≤ cutoff; drop terminal
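
The "8 task dirs (3 pLTV regression + churn, × 2 regimes)" count in the `LTV-Pn.4b` entry can be sanity-checked with a short enumeration. This is a sketch under stated assumptions: the window values, the `pltv_revenue_{w}d` id pattern, and the regime names are illustrative guesses (only the 90d and 730d targets, `churned_within_180d`, and the `early_` prefix appear in this PR's text).

```python
# ASSUMPTION: illustrative windows; the real FORWARD_WINDOWS_DAYS is not shown here.
ASSUMED_WINDOWS_DAYS = (90, 365, 730)

# ASSUMPTION: regime names are placeholders; only the "early_" prefix is from the PR.
REGIME_PREFIXES = {"calendar": "", "early": "early_"}


def sketch_task_ids() -> list[str]:
    """Enumerate 3 regression tasks + 1 churn task for each regime."""
    ids: list[str] = []
    for prefix in REGIME_PREFIXES.values():
        ids += [f"{prefix}pltv_revenue_{w}d" for w in ASSUMED_WINDOWS_DAYS]
        ids.append(f"{prefix}churned_within_180d")
    return ids


task_ids = sketch_task_ids()
```

Each id would correspond to one directory under `tasks/` in the written bundle.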
183 changes: 173 additions & 10 deletions leadforge/schemes/lifecycle/__init__.py
@@ -2,9 +2,9 @@

The second peer scheme alongside ``lead_scoring``. Its entity rows and FK
constraints live here (``entities`` / ``relationships``); the snapshot, feature,
and task definitions live in sibling modules. :meth:`LifecycleScheme.build_world`
is implemented (LTV-Pn.4a); :meth:`write_bundle` / :meth:`write_metadata` are
built out in LTV-Pn.4b–c and currently raise :class:`NotImplementedError`.
and task definitions live in sibling modules. ``build_world`` (LTV-Pn.4a) and
the instructor-mode ``write_bundle`` / ``write_metadata`` (LTV-Pn.4b) are
implemented; the ``student_public`` snapshot-safe export lands in LTV-Pn.4c.
"""

from __future__ import annotations
@@ -20,11 +20,6 @@
from leadforge.core.models import GenerationConfig, WorldBundle
from leadforge.narrative.spec import NarrativeSpec

_NOT_IMPLEMENTED = (
"the lifecycle (b2b_saas_ltv_v1) write path is not implemented yet; "
"it is built across LTV-Pn.4b–c"
)


def _sample_motif_family(rng: random.Random) -> str:
"""Deterministically pick a retention motif family for this world.
@@ -74,11 +69,26 @@ def build_world(
``narrative.yaml`` will not drive them until ``LTV-Po`` decides
whether the lifecycle scheme should consume the narrative spec.
"""
from leadforge.core.exceptions import InvalidConfigError
from leadforge.core.models import WorldBundle, WorldSpec
from leadforge.core.rng import RNGRoot
from leadforge.schemes.lifecycle.artifacts import LifecycleArtifacts
from leadforge.schemes.lifecycle.engine import simulate_lifecycle
from leadforge.schemes.lifecycle.population import build_customer_population
from leadforge.schemes.lifecycle.snapshots import FORWARD_WINDOWS_DAYS

# config.forward_windows_days is not yet threaded into the snapshot
# builder, which exports the fixed FORWARD_WINDOWS_DAYS targets. Reject
# an override now (clear, early) rather than emit a bundle whose manifest
# disagrees with its task dirs, or under-simulate and fail opaquely later.
# Threading config-driven windows through is tracked for a later step.
if tuple(config.forward_windows_days) != tuple(FORWARD_WINDOWS_DAYS):
raise InvalidConfigError(
f"config.forward_windows_days={tuple(config.forward_windows_days)} differs "
f"from the lifecycle scheme's exported windows {tuple(FORWARD_WINDOWS_DAYS)}; "
"config-driven forward windows are not yet supported (the snapshot builder "
"exports the fixed set). Use the default until that wiring lands."
)

motif_rng = RNGRoot(config.seed).child("lifecycle_motif")
motif_family = _sample_motif_family(motif_rng)
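
The early-rejection guard added in this hunk can be exercised in isolation. The following is a minimal sketch, not the project's code: `SketchConfig` and `check_windows` are hypothetical names, `InvalidConfigError` is a local stand-in for `leadforge.core.exceptions.InvalidConfigError`, and the window values are placeholders because the real `FORWARD_WINDOWS_DAYS` is not shown in the diff.

```python
from __future__ import annotations

from dataclasses import dataclass

# ASSUMPTION: placeholder values standing in for FORWARD_WINDOWS_DAYS.
EXPORTED_WINDOWS_DAYS = (90, 365, 730)


class InvalidConfigError(ValueError):
    """Local stand-in for leadforge.core.exceptions.InvalidConfigError."""


@dataclass(frozen=True)
class SketchConfig:
    # Hypothetical config carrying only the field the guard inspects.
    forward_windows_days: tuple[int, ...] | list[int] = EXPORTED_WINDOWS_DAYS


def check_windows(config: SketchConfig) -> None:
    """Reject any override of the exported windows, clearly and early."""
    requested = tuple(config.forward_windows_days)  # normalise list vs tuple
    if requested != tuple(EXPORTED_WINDOWS_DAYS):
        raise InvalidConfigError(
            f"forward_windows_days={requested} differs from the exported "
            f"windows {tuple(EXPORTED_WINDOWS_DAYS)}"
        )


check_windows(SketchConfig())                                    # default passes
check_windows(SketchConfig(forward_windows_days=[90, 365, 730]))  # list form passes
```

Comparing after `tuple(...)` on both sides means a config loaded from YAML or JSON as a list is not rejected merely for its container type.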
@@ -112,10 +122,163 @@ def write_bundle(
path: str,
generation_timestamp: str | None = None,
) -> None:
raise NotImplementedError(_NOT_IMPLEMENTED)
"""Serialise a lifecycle *bundle* to *path* (instructor mode).

Writes the six relational tables, both observation regimes' snapshots
split into 8 task directories (3 pLTV regression + 1 churn
classification per regime, the early regime prefixed ``early_``), a
dataset card, the feature dictionary, the hidden-truth ``metadata/``
(via :meth:`write_metadata`), and the manifest (recording
``generation_scheme`` + ``observation_date`` + the forward windows).

``config.difficulty_params`` is threaded into both snapshot builders —
when set (LTV-Po resolves it from the recipe profile), it drives the
snapshot distortions.

Only ``research_instructor`` mode is supported here. The
``student_public`` snapshot-safety projection (event-table cutoff
filtering, terminal-column drops, per-task target projection) lands in
LTV-Pn.4c; until then this refuses to write a public bundle rather than
emit one that is not snapshot-safe.
"""
from pathlib import Path

from leadforge.core.enums import ExposureMode
from leadforge.exposure.modes import apply_exposure
from leadforge.render.manifests import build_manifest, write_manifest
from leadforge.render.relational_io import write_relational_tables
from leadforge.render.tasks import write_task_splits
from leadforge.schema.dictionaries import write_feature_dictionary
from leadforge.schemes.lifecycle.artifacts import LifecycleArtifacts
from leadforge.schemes.lifecycle.features import CUSTOMER_SNAPSHOT_FEATURES
from leadforge.schemes.lifecycle.render.dataset_card import render_lifecycle_dataset_card
from leadforge.schemes.lifecycle.render.relational import to_dataframes
from leadforge.schemes.lifecycle.snapshots import (
FORWARD_WINDOWS_DAYS,
build_customer_snapshot,
build_early_pltv_snapshot,
)
from leadforge.schemes.lifecycle.tasks import (
CALENDAR_REGIME,
EARLY_REGIME,
lifecycle_task_manifests,
)

artifacts = bundle.artifacts
if not isinstance(artifacts, LifecycleArtifacts):
raise RuntimeError(
"WorldBundle is not populated with lifecycle artifacts. "
"Call Generator.generate() / build_world() first."
)
config = bundle.spec.config
if config.exposure_mode is not ExposureMode.research_instructor:
raise NotImplementedError(
"lifecycle write_bundle currently supports only "
f"research_instructor; {config.exposure_mode.value!r} (snapshot-safe "
"public export) lands in LTV-Pn.4c"
)

population = artifacts.population
sim = artifacts.simulation_result
root = Path(path)
root.mkdir(parents=True, exist_ok=True)

# 1. Relational tables → tables/
dfs = to_dataframes(sim, population)
table_row_counts = write_relational_tables(dfs, root / "tables")

# 2. Both regime snapshots → 8 task directories.
# difficulty_params (None until LTV-Po resolves it) drives distortions.
snapshots = {
CALENDAR_REGIME: build_customer_snapshot(
population, sim, difficulty_params=config.difficulty_params, seed=config.seed
),
EARLY_REGIME: build_early_pltv_snapshot(
population,
sim,
early_tenure_weeks=config.early_tenure_weeks,
difficulty_params=config.difficulty_params,
seed=config.seed,
),
}
# Each task is a standalone single-target split: drop every OTHER
# target column so a task's parquet cannot leak the answer's siblings
# (e.g. ltv_revenue_730d ⊇ ltv_revenue_90d). The deliberate
# mrr_change_full_period trap (leakage_risk but not a target) is kept.
all_target_cols = {f.name for f in CUSTOMER_SNAPSHOT_FEATURES if f.is_target}
task_row_counts: dict[str, dict[str, int]] = {}
all_tasks = []
for regime, snapshot in snapshots.items():
for task in lifecycle_task_manifests(regime):
other_targets = [
c for c in all_target_cols - {task.label_column} if c in snapshot.columns
]
task_df = snapshot.drop(columns=other_targets)
counts = write_task_splits(task_df, root / "tasks", seed=config.seed, task=task)
task_row_counts[task.task_id] = counts
all_tasks.append(task)

# 3. Dataset card + feature dictionary
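(root / "dataset_card.md").write_text(
render_lifecycle_dataset_card(
bundle.spec,
table_counts=table_row_counts,
tasks=tuple(all_tasks),
observation_date=population.observation_date,
),
# The card contains non-ASCII punctuation; pin the encoding so the
# output does not depend on the platform default.
encoding="utf-8",
)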
write_feature_dictionary(
root / "feature_dictionary.csv", features=tuple(CUSTOMER_SNAPSHOT_FEATURES)
)

# 4. Exposure metadata (delegates hidden truth to write_metadata)
apply_exposure(bundle, root, config.exposure_mode)

# 5. Manifest
manifest = build_manifest(
config=config,
generation_scheme=self.name,
motif_family=artifacts.motif_family,
table_row_counts=table_row_counts,
task_row_counts=task_row_counts,
bundle_root=root,
generation_timestamp=generation_timestamp,
extra_fields={
"observation_date": population.observation_date,
# The actual exported target windows (source of truth), not
# config.forward_windows_days — build_world rejects any mismatch.
"forward_windows_days": list(FORWARD_WINDOWS_DAYS),
"early_tenure_weeks": config.early_tenure_weeks,
},
)
write_manifest(manifest, root)

def write_metadata(self, bundle: WorldBundle, meta_dir: Path) -> None:
raise NotImplementedError(_NOT_IMPLEMENTED)
"""Write the lifecycle hidden-truth files into *meta_dir*.

Called by :func:`leadforge.exposure.modes.apply_exposure` after the
shared ``world_spec.json``. The lifecycle scheme has no hidden graph;
its latent truth is the per-entity latent registry and the
motif-derived mechanism parameters.
"""
import json

from leadforge.schemes.lifecycle.artifacts import LifecycleArtifacts
from leadforge.schemes.lifecycle.render.metadata import (
latent_registry_dict,
mechanism_summary_dict,
)

artifacts = bundle.artifacts
if not isinstance(artifacts, LifecycleArtifacts):
raise RuntimeError("WorldBundle is not populated with lifecycle artifacts.")

(meta_dir / "latent_registry.json").write_text(
json.dumps(latent_registry_dict(artifacts.population.latent_state), indent=2)
)
(meta_dir / "mechanism_summary.json").write_text(
json.dumps(mechanism_summary_dict(artifacts.motif_family), indent=2)
)


LIFECYCLE_SCHEME = LifecycleScheme()
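
Step 2 of `write_bundle` above drops every other target column before writing a task split, so a task's file cannot leak sibling targets. A dependency-free sketch of that column selection follows; `columns_for_task` is a hypothetical helper, and the column names are illustrative (the churn label column name in particular is assumed to match its task id).

```python
def columns_for_task(
    snapshot_columns: list[str],
    target_columns: set[str],
    label_column: str,
) -> list[str]:
    """Keep features plus this task's own label; drop all sibling targets."""
    other_targets = target_columns - {label_column}
    # Preserve the snapshot's column order; non-target columns always survive.
    return [c for c in snapshot_columns if c not in other_targets]


# Illustrative snapshot schema (names partly assumed).
snapshot_columns = [
    "customer_id",
    "mrr_change_full_period",  # leakage trap: flagged risky, but not a target
    "ltv_revenue_90d",
    "ltv_revenue_730d",
    "churned_within_180d",
]
target_columns = {"ltv_revenue_90d", "ltv_revenue_730d", "churned_within_180d"}

kept = columns_for_task(snapshot_columns, target_columns, "ltv_revenue_90d")
```

The trap column stays in every task because it is not marked as a target, which matches the comment in the diff about keeping `mrr_change_full_period` deliberately.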
89 changes: 89 additions & 0 deletions leadforge/schemes/lifecycle/render/dataset_card.py
@@ -0,0 +1,89 @@
"""Dataset-card renderer for the lifecycle (pLTV) scheme.

The lead-scoring card (:func:`leadforge.narrative.dataset_card.render_dataset_card`)
is hard-coupled to the lead-scoring framing (binary conversion label, single
task, narrative-driven firmographics), so the lifecycle scheme renders its own.
Kept deliberately concise for LTV-Pn.4b; richer prose can follow.
"""

from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
from leadforge.core.models import WorldSpec
from leadforge.schema.tasks import TaskManifest

__all__ = ["render_lifecycle_dataset_card"]


def render_lifecycle_dataset_card(
world_spec: WorldSpec,
*,
table_counts: dict[str, int],
tasks: tuple[TaskManifest, ...],
observation_date: str,
) -> str:
"""Return a Markdown dataset card for a lifecycle (pLTV) bundle."""
cfg = world_spec.config
tier = (str(cfg.difficulty) if cfg.difficulty else "unknown").capitalize()

lines: list[str] = [
f"# B2B SaaS pLTV Dataset — {tier} Tier",
"",
"## What this is",
"",
"A synthetic B2B SaaS customer base simulated week by week from "
"acquisition through retention, expansion, and churn. The prediction "
"task is **predicted lifetime value (pLTV)**: a continuous, "
"zero-inflated, right-skewed regression target — forecast each "
"customer's future gross revenue over a fixed forward window. Customer "
"churn is provided as a secondary classification label.",
"",
"## Two observation regimes",
"",
"- **Calendar-anchored (standard)** — every customer observed at the "
f"fixed observation date (`{observation_date}`); tenure varies from "
"cold to mature. Task ids: `pltv_revenue_*`, `churned_within_180d`.",
"- **Tenure-anchored (early-pLTV)** — every customer observed at a "
f"fixed short tenure (`customer_start + {cfg.early_tenure_weeks}w`); the "
"genuine cold-start case. Task ids prefixed `early_`.",
"",
"## Tasks",
"",
"| task_id | type | target | window (days) |",
"|---|---|---|---|",
]
for t in tasks:
lines.append(
f"| `{t.task_id}` | {t.task_type} | `{t.label_column}` | {t.label_window_days} |"
)

lines += [
"",
"## Relational tables",
"",
"| table | rows |",
"|---|---|",
]
for name, count in table_counts.items():
lines.append(f"| `{name}` | {count} |")

lines += [
"",
"## Leakage trap",
"",
"`mrr_change_full_period` is a deliberate trap: it is computed through "
"the end of simulation, so post-cutoff expansions inflate it. Use "
"`mrr_change_at_snapshot` (computed strictly at the cutoff) instead.",
"",
"## Reproducibility",
"",
f"- Recipe: `{cfg.recipe_id}`",
f"- Seed: `{cfg.seed}`",
f"- Scheme: `{world_spec.scheme}`",
"",
"Deterministic given (recipe, config, seed, package version).",
"",
]
return "\n".join(lines)
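
The task table in the card is built by appending one Markdown row per task to a list of lines. A standalone sketch of that pattern is below; `SketchTask` is a hypothetical stand-in for `leadforge.schema.tasks.TaskManifest` carrying only the four attributes the renderer reads, and the example task values are assumed.

```python
from typing import NamedTuple


class SketchTask(NamedTuple):
    # Stand-in for TaskManifest; only the fields the table uses.
    task_id: str
    task_type: str
    label_column: str
    label_window_days: int


def render_task_table(tasks: tuple[SketchTask, ...]) -> str:
    """Render a Markdown table with one row per task."""
    lines = [
        "| task_id | type | target | window (days) |",
        "|---|---|---|---|",
    ]
    for t in tasks:
        lines.append(
            f"| `{t.task_id}` | {t.task_type} | `{t.label_column}` | {t.label_window_days} |"
        )
    return "\n".join(lines)


# Example values are illustrative, not taken from a real bundle.
table = render_task_table(
    (SketchTask("pltv_revenue_90d", "regression", "ltv_revenue_90d", 90),)
)
```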