
Commit 2e7bd79 (1 parent: aaf2440)

14 files changed: 572 additions & 347 deletions

AGENTS.md

Lines changed: 2 additions & 2 deletions
@@ -68,7 +68,7 @@ python3 -c "import py_compile; py_compile.compile('cgr.py', doraise=True)"
 # Pytest suite
 python3 -m pytest test_commandgraph.py -x -q

-# Validate all example files
+# Validate root feature-exercise example files
 cgr validate nginx_setup.cg && cgr validate nginx_setup.cgr
 cgr validate webserver.cg --repo ./repo && cgr validate webserver.cgr --repo ./repo
 cgr validate parallel_test.cgr && cgr validate multinode_test.cgr && cgr validate multinode_test.cg
@@ -90,6 +90,6 @@ testing-ssh/run-ssh-demos.sh # 5 SSH demos
 - If you touch apply output, keep `--output json` and `--report FILE.json` aligned by deriving both from the same execution/result data when possible.
 - If you touch state handling, inspect both resource-entry behavior and `_wave` / `_run` metric behavior.
 - If you touch sub-graph inclusion, preserve the contract that the child graph runs as a unit rather than as flattened parent resources.
-- Test against the example graphs after changes, not just unit tests.
+- Test against the root feature-exercise graphs after changes, not just unit tests.
 - Templates may exist in either `.cgr` or `.cg`, but the stdlib in `repo/` is `.cgr`, and the repo loader prefers `.cgr`.
 - Be cautious with state semantics, dependency resolution, and desugaring changes. Small regressions in those areas can silently break resume behavior or execution ordering.

CLAUDE.md

Lines changed: 2 additions & 2 deletions
@@ -74,7 +74,7 @@ python3 cgr.py apply build.cgr
 python3 -m pytest test_commandgraph.py -x -q
 python3 -m pytest test_modularization.py -x -q

-# Validate all example files
+# Validate root feature-exercise example files
 cgr validate nginx_setup.cg && cgr validate nginx_setup.cgr
 cgr validate webserver.cg --repo ./repo && cgr validate webserver.cgr --repo ./repo
 cgr validate parallel_test.cgr && cgr validate multinode_test.cgr && cgr validate multinode_test.cg
@@ -94,6 +94,6 @@ testing-ssh/run-ssh-demos.sh # 5 SSH demos
 - For composition, prefer sub-graph steps when the user wants another graph executed as one unit rather than flattened child resources.
 - When touching state code, account for both resource entries and `_wave` / `_run` metric records.
 - When touching apply output, keep `--report FILE.json` and `--output json` consistent by deriving both from the same result data when possible.
-- Test with all example files after any change.
+- Test with the root feature-exercise graphs after any change.
 - Templates can be `.cgr` or `.cg` — all 44 stdlib are `.cgr`. Repo loader prefers `.cgr`.
 - Reference `MANUAL.md` for syntax details, `.claude/docs/design_doc_*.md` for design rationale.

COOKBOOK.md

Lines changed: 73 additions & 0 deletions
@@ -534,6 +534,79 @@ cgr apply infra.cgr --tags security --skip-tags audit # Security, but not the a

 ---

+## Recipe 11: Full Production Rollout
+
+**When to use:** End-to-end release workflow — provision, configure, register, roll out with a canary gate, then wait for human approval before verifying the fleet.
+**Concepts:** `stage`/`phase`, `wait for webhook`, HTTP steps, auth tokens, `each`, `verify`.
+
+This is the kind of workflow most teams express today as a fragile combination of pipeline stages, shell scripts, approval toggles, and manual operator steps. Encoding it in one graph gives you a consistent execution model with crash recovery at every point.
+
+```
+--- Production rollout with gates ---
+
+set env = "prod"
+set deploy_id = "release-2026-04-11"
+
+target "control" local:
+
+[provision infra]:
+  run $ terraform apply -auto-approve -var="env=${env}"
+  timeout 15m
+
+[run base playbook]:
+  first [provision infra]
+  run $ ansible-playbook -i inventory/${env} playbooks/base.yml
+  timeout 20m
+
+[run app playbook]:
+  first [run base playbook]
+  run $ ansible-playbook -i inventory/${env} playbooks/app.yml --tags deploy
+  timeout 15m
+
+[register deploy]:
+  first [run app playbook]
+  post "https://deploy-api.example.net/releases"
+  auth bearer "${DEPLOY_API_TOKEN}"
+  body json '{"environment":"${env}","deploy_id":"${deploy_id}"}'
+  expect 200..299
+
+[roll out]:
+  first [register deploy]
+  stage "production":
+    phase "canary" 1 from "web-1,web-2,web-3,web-4":
+      [deploy ${server}]:
+        run $ ssh ${server} '/opt/myapp/activate.sh'
+
+      verify "canary healthy":
+        run $ curl -sf http://${server}:8080/health
+        retry 10x wait 3s
+
+    phase "rest" remaining from "web-1,web-2,web-3,web-4":
+      each server, 2 at a time:
+        [deploy ${server}]:
+          run $ ssh ${server} '/opt/myapp/activate.sh'
+
+[wait for approval]:
+  first [roll out]
+  wait for webhook "/approve/${deploy_id}"
+  timeout 2h
+
+verify "fleet healthy":
+  first [wait for approval]
+  run $ ansible -i inventory/${env} all -m shell -a 'systemctl is-active myapp'
+```
+
+**How it works:**
+- Terraform, Ansible, and the API registration run sequentially with explicit ordering
+- `[roll out]` deploys to one canary server first — if the verify fails, the rest of the fleet is never touched
+- `wait for webhook` pauses execution until `POST /approve/release-2026-04-11` is received (or times out after 2h)
+- If anything fails mid-run, `cgr apply` resumes from the failed step — no rerunning Terraform if it already completed
+- `auth bearer` tokens are automatically redacted from all output
+
+**Customization:** Override the environment with `--set env=staging`. Change the concurrency of the fleet rollout by editing `2 at a time`. Add `collect "activation"` to any step to capture its stdout in `cgr report`.
+
+---
+
 ## Feature Reference

 Quick lookup: which feature solves your problem?
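The canary gate in Recipe 11 above can be sketched in plain Python to make the control flow the `stage`/`phase` DSL encodes explicit. This is an illustrative sketch only -- `activate` and `healthy` are hypothetical callables standing in for the SSH activation and curl health check, not part of CommandGraph:

```python
def canary_rollout(servers, activate, healthy, batch=2):
    """Deploy to one canary, verify it, then roll out the rest in batches.

    Sketch of the canary-then-rest control flow; activate() and healthy()
    are hypothetical stand-ins for the per-server deploy and health check.
    """
    canary, rest = servers[0], servers[1:]
    activate(canary)
    if not healthy(canary):
        # Verify failed: the rest of the fleet is never touched.
        raise RuntimeError(f"canary {canary} unhealthy; aborting rollout")
    for i in range(0, len(rest), batch):
        for server in rest[i:i + batch]:  # "2 at a time" batches (sequential stand-in)
            activate(server)
    return [canary] + rest
```

In the graph this sequencing is declarative; the sketch just shows why a canary failure leaves the remaining servers untouched.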

README.md

Lines changed: 75 additions & 33 deletions
@@ -7,7 +7,7 @@ Write a plain-text file that reads like English (or use an agent on your behalf!
 One Python file with zero dependencies. No agents on your servers. No daemon. No database.

 ```python
---- Deploy my app ---
+--- Install nginx and basic app ---

 target "web" ssh deploy@10.0.1.5:

@@ -72,6 +72,12 @@ target "local" local:

 ---

+> **Ansible configures it. CommandGraph runs the operational story around it.**
+
+Hardware gets racked. Ansible configures it. But the workflow in between -- waiting for hosts to come up, running playbooks in the right order, verifying the result, gating the next step on a health check -- usually lives in runbooks, chat threads, and operator memory. That is the gap CommandGraph fills.
+
+---
+
 ## Web IDE

 `cgr serve FILE` launches a browser-based IDE with a live DAG visualization and execution panel. The left pane is an editor; the right pane shows the dependency graph updating in real time as you edit. Run `apply`, stream step output, inspect state and history, and view collected report data -- all from the browser.

@@ -137,28 +143,37 @@ Inside that shell, `cgr` is already on `PATH`, examples live in `/opt/cgr/exampl

 ---

-## Why not Ansible / Puppet / Chef / Salt?
+## The case for CommandGraph
+
+Most teams don't struggle because they lack tools. They struggle because the workflow *across* those tools is brittle.
+
+Some of these symptoms may sound familiar:

-CommandGraph is not trying to replace those tools. It solves a narrower problem: declare dependencies between operational steps, then let the engine plan order, maximize parallelism, and resume exactly from failure. Traditional config-management tools are still good at configuration management; CommandGraph is good at orchestration across shell commands, remote steps, API calls, wait gates, and sub-graphs.
+- Hardware is racked, but the handoff to Ansible lives in a runbook nobody reads until something breaks
+- One failed health check means rerunning broad chunks of work — no trustworthy resume point
+- Canary logic is implied by convention, not encoded in the workflow
+- Operators must remember which steps are safe to retry and which are not
+- Incident reviews have logs, but not a clean execution graph or machine-readable run record
+
+CommandGraph turns that glue layer into something explicit, resumable, and inspectable. It is not a replacement for Ansible — it is the orchestration layer around it.

 ### Running Ansible playbooks from CommandGraph

-Already have Ansible playbooks? Run them as steps inside a CommandGraph. This lets you sequence playbooks alongside shell commands, API calls, and other tools -- with crash recovery, dependency ordering, and parallel execution that Ansible alone can't express:
+Already have Ansible playbooks? Run them as steps inside a CommandGraph. The graph picks up right after hardware is racked -- waiting for hosts to respond, running playbooks in order, verifying the result -- with crash recovery, dependency ordering, and parallel execution that Ansible alone can't express:

 ```python
---- Provision and configure with Ansible ---
+--- Configure freshly racked servers with Ansible ---

 set env = "staging"

 target "control" local:

-[provision infra]:
-  run $ terraform apply -auto-approve -var="env=${env}"
-  timeout 10m
+[wait for hosts reachable]:
+  run $ ansible -i inventory/${env} all -m ping
+  retry 10x wait 30s

 [run base playbook]:
-  first [provision infra]
-  skip if $ ansible -i inventory/${env} all -m ping | grep -q SUCCESS
+  first [wait for hosts reachable]
   run $ ansible-playbook -i inventory/${env} playbooks/base.yml
   timeout 15m, retry 1x wait 30s

@@ -178,7 +193,9 @@ target "control" local:
   run $ ansible -i inventory/${env} all -m shell -a 'systemctl is-active myapp'
 ```

-Terraform provisions, Ansible configures, CommandGraph orchestrates -- with resume from any failure point. You can also use Ansible inventory files directly with `inventory "hosts.ini"` (see [Ansible inventory compatibility](#ansible-inventory-compatibility)).
+The hardware team racks the servers. CommandGraph picks up from there -- waiting, configuring, verifying -- so the handoff is encoded in the graph rather than a chat message. You can also use Ansible inventory files directly with `inventory "hosts.ini"` (see [Ansible inventory compatibility](#ansible-inventory-compatibility)).
+
+For a complete end-to-end production example -- canary rollout, API registration, and human approval gate in a single graph -- see [Cookbook Recipe 11: Full Production Rollout](COOKBOOK.md#recipe-11-full-production-rollout).

 ---

@@ -198,6 +215,8 @@ Steps with no dependency between them run in the same wave simultaneously. Steps

 Every completed step is written to a `.state` file atomically. Crash mid-run, fix the problem, run again. Completed steps skip from state without even SSHing to the server.

+Unlike rerunning an entire pipeline and hoping earlier steps are harmless, CommandGraph knows exactly what succeeded. Fix the problem and rerun -- the engine continues from the failed point, not from the top.
+
 Need isolated journals for concurrent parameterized runs? Use `cgr apply FILE --run-id canary` to salt the default state path, or `cgr apply FILE --state /path/to/run.state` to pin an explicit journal.
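The resume behavior this hunk describes rests on an append-only, fsync'd journal of completed steps. A minimal sketch of that general pattern (illustrative only -- not CommandGraph's actual `.state` format):

```python
import json
import os

class Journal:
    """Append-only run journal: one JSON line per completed step, fsync'd.

    Illustrative sketch of the crash-safe state pattern; the real .state
    file format used by cgr may differ.
    """
    def __init__(self, path):
        self.path = path

    def record(self, step, status="done"):
        line = json.dumps({"step": step, "status": status})
        with open(self.path, "a") as f:
            f.write(line + "\n")
            f.flush()
            os.fsync(f.fileno())  # a crash now loses at most this one line

    def completed(self):
        """Return the set of steps already done, for skip-on-resume."""
        done = set()
        try:
            with open(self.path) as f:
                for line in f:
                    entry = json.loads(line)
                    if entry["status"] == "done":
                        done.add(entry["step"])
        except FileNotFoundError:
            pass  # first run: nothing completed yet
        return done
```

On rerun, any step found in `completed()` is skipped, which is what lets execution continue from the failed point rather than from the top.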

 <p align="center">
@@ -288,6 +307,8 @@ Point a target at an SSH host and every command runs remotely. State stays on yo
 <img src="docs/img/ssh-execution.svg" alt="SSH execution: state local, commands remote" width="700"/>
 </p>

+There is no agent to distribute, no daemon to babysit, and no extra control plane to keep alive. For organizations cautious about adding resident components to hosts, this matters.
+
 Multiple targets in one file run in parallel. Steps with `as root` are automatically wrapped in `sudo` on the remote side.

 ---

@@ -527,20 +548,29 @@ each name, addr in ${webservers}:

 ---

-### Maintainer note
+## Where CommandGraph fits best

-The release artifact is a single `cgr.py`. Development now happens in `cgr_src/`, with `cgr_dev.py` as the thin dev entrypoint. For maintainers, rebuild after changing source modules, `ide.html`, or `visualize_template.py`:
+CommandGraph is especially well-suited for:

-```bash
-python3 cgr_dev.py version
-python3 cgr_dev.py plan build.cgr
-python3 cgr_dev.py apply build.cgr
-python3 cgr_dev.py apply build.cgr --no-resume
-```
+- staged application deploys across fleets with canary verification
+- maintenance workflows that sequence SSH, Ansible, API calls, and approval gates
+- operational runbooks that currently live as a mix of CI config and tribal knowledge
+- audits and inventory collection across many hosts
+- recovery-prone processes where restarting from scratch is expensive
+- any workflow that benefits from "run this in parallel, verify, then continue"
+
+## Getting started in your organization
+
+Don't start with a platform migration. Start with one painful workflow:
+
+- a release process that needs canary promotion and human approval gates
+- a node maintenance runbook you're tired of executing step by step
+- a provisioning-plus-configure-plus-verify sequence that spans three tools today
+- a fleet audit that currently requires too many moving parts

-`build.cgr` is the canonical rebuild path. `build_helpers.full_build()` uses the same assembly logic, but it is a lower-level helper, not the normal maintainer workflow.
+Use CommandGraph as the orchestration layer around the tools you already have. That is where it becomes persuasive quickly.

-Use [MODULE_MAP.md](MODULE_MAP.md) as the quick reference for where parser, resolver, executor, state, IDE, and CLI changes now live.
+For a complete end-to-end reference -- Ansible, canary rollout, API registration, and approval gate in a single graph -- see [Cookbook Recipe 11: Full Production Rollout](COOKBOOK.md#recipe-11-full-production-rollout).

 ---

@@ -585,6 +615,22 @@ cgr apply deploy.cgr # only the drifted step re-runs

 ---

+## Design principles
+
+**Files are the interface.** A `.cgr` file is a complete, portable, version-controllable description of your infrastructure. No web UI required, no database, no daemon.
+
+**Idempotent by default.** Every step has a `skip if` check. Run it 10 times, get the same result.
+
+**Crash-safe.** State is append-only with `fsync` after each write. A power failure loses at most one line.
+
+**Zero dependencies.** One Python file, stdlib only. Copy it to an air-gapped server and it works.
+
+**Human-readable.** The syntax reads like English: "First install nginx. Skip if already installed. Run apt-get install." No YAML indentation wars. No JSON escaping. No Jinja2 templating bugs.
+
+**Graphs, not lists.** You declare dependencies. The engine computes execution order and maximizes parallelism. Reorder your file however you want -- the result is the same.
+
+---
+
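The "Graphs, not lists" principle added above boils down to topological wave scheduling: steps whose prerequisites are all satisfied form the next wave and can run in parallel. A minimal sketch of the idea (illustrative only, not CommandGraph's actual planner):

```python
def waves(deps):
    """Group steps into waves from a step -> prerequisites mapping.

    Every step in a wave depends only on steps in earlier waves, so a
    wave's steps can all run in parallel. Illustrative sketch; not the
    actual cgr planner.
    """
    remaining = {step: set(reqs) for step, reqs in deps.items()}
    done, plan = set(), []
    while remaining:
        ready = sorted(s for s, reqs in remaining.items() if reqs <= done)
        if not ready:
            raise ValueError("dependency cycle detected")
        plan.append(ready)          # everything here is mutually independent
        done.update(ready)
        for s in ready:
            del remaining[s]
    return plan
```

Because the waves are computed from the dependency edges alone, reordering the steps in the file leaves the plan unchanged.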
 ## Testing

 ```bash
@@ -602,25 +648,21 @@ cd testing-ssh/ && ./run-ssh-demos.sh # 5 SSH demos
 |----------|----------|-------------|
 | [QUICKSTART.md](QUICKSTART.md) | New users | Zero to running in 5 minutes |
 | [TUTORIAL.md](TUTORIAL.md) | Beginners | 9 guided lessons, ~1 hour |
-| [COOKBOOK.md](COOKBOOK.md) | Operators | 10 real-world recipes |
+| [COOKBOOK.md](COOKBOOK.md) | Operators | 11 real-world recipes, including a full production rollout |
 | [MANUAL.md](MANUAL.md) | Reference | Complete syntax for `.cgr` and `.cg` |
 | [COMMANDGRAPH_SPEC.md](COMMANDGRAPH_SPEC.md) | Code generators | Formal PEG grammar |
-| [AGENTS.md](AGENTS.md) | Contributors | Architecture and internals |
+| [AGENTS.md](AGENTS.md) | Contributors | Architecture, internals, and build workflow |

 ---

-## Design principles
-
-**Files are the interface.** A `.cgr` file is a complete, portable, version-controllable description of your infrastructure. No web UI required, no database, no daemon.
-
-**Idempotent by default.** Every step has a `skip if` check. Run it 10 times, get the same result.
-
-**Crash-safe.** State is append-only with `fsync` after each write. A power failure loses at most one line.
+### Maintainer note

-**Zero dependencies.** One Python file, stdlib only. Copy it to an air-gapped server and it works.
+The release artifact is a single `cgr.py`. Development happens in `cgr_src/`, with `cgr_dev.py` as the thin dev entrypoint. Rebuild after changing source modules, `ide.html`, or `visualize_template.py`:

-**Human-readable.** The syntax reads like English: "First install nginx. Skip if already installed. Run apt-get install." No YAML indentation wars. No JSON escaping. No Jinja2 templating bugs.
+```bash
+python3 cgr_dev.py apply build.cgr
+```

-**Graphs, not lists.** You declare dependencies. The engine computes execution order and maximizes parallelism. Reorder your file however you want -- the result is the same.
+See [AGENTS.md](AGENTS.md) for the full build workflow and [MODULE_MAP.md](MODULE_MAP.md) for the quick reference on where parser, resolver, executor, state, IDE, and CLI changes live.

 ---

benchmarks/README.md

Lines changed: 11 additions & 6 deletions
@@ -5,7 +5,7 @@ Spawns N containers as deployment targets, then runs the same workload with diff

 ## Requirements

-- Podman (or Docker as fallback)
+- Podman recommended
 - ~500MB disk for container images
 - Approximately 1 minute for the default 8-target run

@@ -14,10 +14,13 @@ Spawns N containers as deployment targets, then runs the same workload with diff
 ```bash
 ./run-benchmark.sh # 8 targets (default)
 ./run-benchmark.sh 12 # 12 targets
+./run-benchmark.sh auto # podman only; auto-size target count from host capacity
 ./run-benchmark.sh shell # interactive shell with targets running
 ./run-benchmark.sh teardown # clean up everything
 ```

+`auto` / `max` mode prefers the highest target count the current host can sustain for this benchmark. It calculates an initial ceiling from live CPU and memory availability, then probes podman target startup until it finds the highest count that comes up cleanly over SSH.
+
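The ceiling calculation described in that added paragraph can be sketched as follows. The per-target costs and the fallback value here are assumptions chosen for illustration, and the real `auto` mode additionally probes actual podman target startup over SSH rather than trusting the estimate:

```python
import os

def estimate_target_ceiling(mem_per_target_mb=256, cpus_per_target=0.5):
    """Rough upper bound on benchmark targets this host might sustain.

    Hypothetical heuristic for illustration only; per-target costs and the
    fallback are assumptions, not values from run-benchmark.sh.
    """
    cpu_limit = int((os.cpu_count() or 1) / cpus_per_target)
    mem_mb = None
    try:
        with open("/proc/meminfo") as f:  # Linux-only source of MemAvailable
            for line in f:
                if line.startswith("MemAvailable:"):
                    mem_mb = int(line.split()[1]) // 1024  # kB -> MB
                    break
    except OSError:
        pass
    if mem_mb is None:
        mem_mb = 2048  # conservative fallback when /proc/meminfo is absent
    mem_limit = mem_mb // mem_per_target_mb
    return max(1, min(cpu_limit, mem_limit))
```

The ceiling is only a starting point: probing real container startup is what catches limits (open files, network namespaces, SSH daemons) that a CPU/memory estimate cannot see.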
 ## What it measures

 Five scenarios deploy the **same workload** to the **same N hosts**, varying only the concurrency:
@@ -84,9 +87,11 @@ The speedup is sub-linear because:

 The control container generates `.cgr` graph files dynamically based on the target count, then runs `cgr apply` for each scenario. State files are cleared between runs so every step executes fresh.

-## Team
+Generated graphs are preserved under `benchmarks/generated/latest/` and in a timestamped run directory, so you can inspect them directly:
+
+```bash
+python3 cgr.py serve benchmarks/generated/latest/bench-parN.cgr
+python3 cgr.py visualize benchmarks/generated/latest/bench-parN.cgr -o benchmarks/generated/latest/bench-parN.html
+```

-Benchmark designed by:
-- **Marcus Chen (SRE)** — infrastructure and benchmark harness
-- **Priya Kapoor (Systems SRE)** — `each`/`stage` scenario design
-- **Dr. Elena Voss (CS)** — measurement methodology, workload calibration
+The generated directory also includes a small `README.md` with the exact paths for that run.
