diff --git a/.agent-plan.md b/.agent-plan.md index 24d737c..da85796 100644 --- a/.agent-plan.md +++ b/.agent-plan.md @@ -217,7 +217,7 @@ Documentation + CI: | M12: CLI `--strict` flag | Deferred | Per-check control is better than global flag | | M12: CLI help text polish | Deferred | Low priority vs dataset | | M14: Sample bundle commit | Absorbed into v4-M2 | v4 dataset IS the sample | -| M14: Notebook 1 (inspecting world) | Deferred | Do after v4 ships | +| M14: Notebook 1 (inspecting world) | **Done** | `leadforge/examples/notebooks/01_inspect_world.ipynb` | | M14: Notebook 2 (lead scoring baseline) | Deferred | v4 validation script covers this | | M14: Notebook 3 (public vs instructor) | Discarded | No current audience | | M14: Notebook 4 (recipe customization) | Discarded | Premature | diff --git a/leadforge/examples/notebooks/01_inspect_world.ipynb b/leadforge/examples/notebooks/01_inspect_world.ipynb new file mode 100644 index 0000000..52a037c --- /dev/null +++ b/leadforge/examples/notebooks/01_inspect_world.ipynb @@ -0,0 +1,273 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": "# Inspecting a Generated World\n\nThis notebook walks you through generating a synthetic CRM dataset with **leadforge** and exploring what's inside the output bundle.\n\n**Prerequisites:** `pip install -e \".[dev]\"` from the repo root, plus a Jupyter environment (`pip install notebook` or `pip install jupyterlab`).\n\nWe'll cover:\n1. Generating a bundle via the Python API\n2. Exploring `manifest.json` — provenance, row counts, file hashes\n3. Loading the relational tables and examining FK relationships\n4. Inspecting the task splits (train/valid/test)\n5. Reading the dataset card and feature dictionary" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. Generate a bundle\n", + "\n", + "We use `Generator.from_recipe()` to create a small world (500 leads) in `student_public` mode with `intro` difficulty. 
The bundle is written to a temporary directory so nothing lingers after the notebook." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "import atexit\nimport shutil\nimport tempfile\nfrom pathlib import Path\n\nfrom leadforge.api import Generator\n\ntmpdir = tempfile.mkdtemp(prefix=\"leadforge_demo_\")\natexit.register(shutil.rmtree, tmpdir, True) # cleanup even on kernel restart\nbundle_path = Path(tmpdir) / \"demo_bundle\"\n\ngen = Generator.from_recipe(\n \"b2b_saas_procurement_v1\",\n seed=42,\n exposure_mode=\"student_public\",\n difficulty=\"intro\",\n)\nbundle = gen.generate(n_leads=500)\nbundle.save(str(bundle_path))\n\nprint(f\"Bundle written to: {bundle_path}\")" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's see what files were created:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "for p in sorted(bundle_path.rglob(\"*\")):\n", + " if p.is_file():\n", + " size_kb = p.stat().st_size / 1024\n", + " print(f\" {p.relative_to(bundle_path)} ({size_kb:.1f} KB)\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Explore the manifest\n", + "\n", + "`manifest.json` is the bundle's provenance record. It captures the recipe, seed, package version, exposure mode, row counts, and SHA-256 hashes for every data file — everything you need to reproduce or verify the dataset." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "\n", + "with open(bundle_path / \"manifest.json\") as f:\n", + " manifest = json.load(f)\n", + "\n", + "# Top-level provenance fields\n", + "for key in [\n", + " \"package_version\",\n", + " \"recipe_id\",\n", + " \"seed\",\n", + " \"exposure_mode\",\n", + " \"difficulty\",\n", + " \"generation_timestamp\",\n", + " \"bundle_schema_version\",\n", + "]:\n", + " print(f\"{key}: {manifest.get(key)}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "# Table inventory: row counts and file hashes\nprint(\"Relational tables:\")\nfor name, info in manifest[\"tables\"].items():\n print(f\" {name:20s} {info['row_count']:>6,} rows sha256={info['sha256'][:12]}...\")\n\nprint(\"\\nTask splits:\")\nfor task_id, task_info in manifest[\"tasks\"].items():\n print(f\" {task_id}:\")\n for key in (\"train\", \"valid\", \"test\"):\n rows_key = f\"{key}_rows\"\n if rows_key in task_info:\n print(f\" {key:6s} {task_info[rows_key]:>5,} rows\")" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Relational tables\n", + "\n", + "The bundle contains 9 relational tables stored as Parquet files under `tables/`. These represent the full CRM world: accounts, contacts, leads, their interactions (touches, sessions, sales activities), and outcomes (opportunities, customers, subscriptions)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "\n", + "tables = {}\n", + "for parquet_file in sorted((bundle_path / \"tables\").glob(\"*.parquet\")):\n", + " name = parquet_file.stem\n", + " tables[name] = pd.read_parquet(parquet_file)\n", + "\n", + "# Summary of all tables\n", + "summary = pd.DataFrame(\n", + " [{\"table\": name, \"rows\": len(df), \"columns\": len(df.columns)} for name, df in tables.items()]\n", + ")\n", + "print(summary.to_string(index=False))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Sample rows from the leads table\n", + "tables[\"leads\"].head(3)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Sample rows from the touches table\n", + "tables[\"touches\"].head(3)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### FK relationships\n", + "\n", + "The tables are linked by foreign keys (e.g., every lead references an account and a contact). Let's verify one relationship and see how the tables connect." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "# Every lead's account_id should exist in the accounts table\nlead_account_ids = set(tables[\"leads\"][\"account_id\"])\naccount_ids = set(tables[\"accounts\"][\"account_id\"])\norphans = lead_account_ids - account_ids\nprint(f\"FK check: {len(orphans)} orphan account_ids (expect 0)\")\n\nprint(f\"Accounts: {len(account_ids)}\")\nprint(f\"Contacts: {len(tables['contacts'])}\")\nprint(f\"Leads: {len(tables['leads'])}\")\nprint(f\"Leads per account (mean): {len(tables['leads']) / len(account_ids):.1f}\")\nprint(f\"Touches per lead (mean): {len(tables['touches']) / len(tables['leads']):.1f}\")" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. 
Task splits\n", + "\n", + "The primary task (`converted_within_90_days`) is exported as train/valid/test Parquet splits under `tasks/`. Each row is a lead snapshot — a flat, ML-ready feature vector anchored at the snapshot date. No post-snapshot data leaks into these features." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "# Read task ID from the manifest rather than hardcoding\ntask_id = next(iter(manifest[\"tasks\"]))\ntask_dir = bundle_path / \"tasks\" / task_id\n\nsplits = {}\nfor split_file in sorted(task_dir.glob(\"*.parquet\")):\n splits[split_file.stem] = pd.read_parquet(split_file)\n\nfor name, df in splits.items():\n n_pos = df[task_id].sum()\n rate = n_pos / len(df) * 100\n print(f\"{name:6s}: {len(df):>4} rows, {n_pos:>3} converted ({rate:.1f}%)\")" + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "# Feature overview from the train split\ntrain = splits[\"train\"]\nprint(f\"Task: {task_id}\")\nprint(f\"Features: {len(train.columns)} columns\\n\")\ntrain.dtypes" + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Quick summary statistics for numeric features\n", + "train.describe().T" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Task manifest\n", + "\n", + "`task_manifest.json` records the split ratios and label column for reproducibility." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "with open(task_dir / \"task_manifest.json\") as f:\n", + " task_manifest = json.load(f)\n", + "\n", + "task_manifest" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5. 
Dataset card and feature dictionary\n", + "\n", + "Every bundle includes a human-readable dataset card (Markdown) and a machine-readable feature dictionary (CSV) describing each column in the task table." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Dataset card (first 40 lines)\n", + "card_text = (bundle_path / \"dataset_card.md\").read_text()\n", + "print(\"\\n\".join(card_text.splitlines()[:40]))\n", + "print(f\"\\n... ({len(card_text.splitlines())} lines total)\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Feature dictionary\n", + "feat_dict = pd.read_csv(bundle_path / \"feature_dictionary.csv\")\n", + "feat_dict" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## What's next?\n", + "\n", + "This bundle was generated in **`student_public`** mode, which excludes the hidden causal structure behind the data. leadforge also supports a **`research_instructor`** mode that includes the full world graph, latent variable registry, and mechanism summaries, useful for teaching causal inference or evaluating model interpretability. That's a topic for a future notebook.\n", + "\n", + "For now, you have everything you need to start building models on the task splits!" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Cleanup" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "# Explicit cleanup (the atexit hook is only a fallback for a normal interpreter exit; it does not run if the kernel is killed)\nshutil.rmtree(tmpdir, ignore_errors=True)\nprint(f\"Cleaned up {tmpdir}\")" + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.11.0" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}