
Commit 2e7bd79 (1 parent: aaf2440)

14 files changed: 572 additions & 347 deletions

AGENTS.md

Lines changed: 2 additions & 2 deletions
@@ -68,7 +68,7 @@ python3 -c "import py_compile; py_compile.compile('cgr.py', doraise=True)"
 # Pytest suite
 python3 -m pytest test_commandgraph.py -x -q

-# Validate all example files
+# Validate root feature-exercise example files
 cgr validate nginx_setup.cg && cgr validate nginx_setup.cgr
 cgr validate webserver.cg --repo ./repo && cgr validate webserver.cgr --repo ./repo
 cgr validate parallel_test.cgr && cgr validate multinode_test.cgr && cgr validate multinode_test.cg
@@ -90,6 +90,6 @@ testing-ssh/run-ssh-demos.sh # 5 SSH demos
 - If you touch apply output, keep `--output json` and `--report FILE.json` aligned by deriving both from the same execution/result data when possible.
 - If you touch state handling, inspect both resource-entry behavior and `_wave` / `_run` metric behavior.
 - If you touch sub-graph inclusion, preserve the contract that the child graph runs as a unit rather than as flattened parent resources.
-- Test against the example graphs after changes, not just unit tests.
+- Test against the root feature-exercise graphs after changes, not just unit tests.
 - Templates may exist in either `.cgr` or `.cg`, but the stdlib in `repo/` is `.cgr`, and the repo loader prefers `.cgr`.
 - Be cautious with state semantics, dependency resolution, and desugaring changes. Small regressions in those areas can silently break resume behavior or execution ordering.

CLAUDE.md

Lines changed: 2 additions & 2 deletions
@@ -74,7 +74,7 @@ python3 cgr.py apply build.cgr
 python3 -m pytest test_commandgraph.py -x -q
 python3 -m pytest test_modularization.py -x -q

-# Validate all example files
+# Validate root feature-exercise example files
 cgr validate nginx_setup.cg && cgr validate nginx_setup.cgr
 cgr validate webserver.cg --repo ./repo && cgr validate webserver.cgr --repo ./repo
 cgr validate parallel_test.cgr && cgr validate multinode_test.cgr && cgr validate multinode_test.cg
@@ -94,6 +94,6 @@ testing-ssh/run-ssh-demos.sh # 5 SSH demos
 - For composition, prefer sub-graph steps when the user wants another graph executed as one unit rather than flattened child resources.
 - When touching state code, account for both resource entries and `_wave` / `_run` metric records.
 - When touching apply output, keep `--report FILE.json` and `--output json` consistent by deriving both from the same result data when possible.
-- Test with all example files after any change.
+- Test with the root feature-exercise graphs after any change.
 - Templates can be `.cgr` or `.cg` — all 44 stdlib are `.cgr`. Repo loader prefers `.cgr`.
 - Reference `MANUAL.md` for syntax details, `.claude/docs/design_doc_*.md` for design rationale.

COOKBOOK.md

Lines changed: 73 additions & 0 deletions
@@ -534,6 +534,79 @@ cgr apply infra.cgr --tags security --skip-tags audit # Security, but not the a

 ---

+## Recipe 11: Full Production Rollout
+
+**When to use:** End-to-end release workflow — provision, configure, register, roll out with a canary gate, then wait for human approval before verifying the fleet.
+**Concepts:** `stage`/`phase`, `wait for webhook`, HTTP steps, auth tokens, `each`, `verify`.
+
+This is the kind of workflow most teams express today as a fragile combination of pipeline stages, shell scripts, approval toggles, and manual operator steps. Encoding it in one graph gives you a consistent execution model with crash recovery at every point.
+
+```
+--- Production rollout with gates ---
+
+set env = "prod"
+set deploy_id = "release-2026-04-11"
+
+target "control" local:
+
+[provision infra]:
+  run $ terraform apply -auto-approve -var="env=${env}"
+  timeout 15m
+
+[run base playbook]:
+  first [provision infra]
+  run $ ansible-playbook -i inventory/${env} playbooks/base.yml
+  timeout 20m
+
+[run app playbook]:
+  first [run base playbook]
+  run $ ansible-playbook -i inventory/${env} playbooks/app.yml --tags deploy
+  timeout 15m
+
+[register deploy]:
+  first [run app playbook]
+  post "https://deploy-api.example.net/releases"
+  auth bearer "${DEPLOY_API_TOKEN}"
+  body json '{"environment":"${env}","deploy_id":"${deploy_id}"}'
+  expect 200..299
+
+[roll out]:
+  first [register deploy]
+  stage "production":
+    phase "canary" 1 from "web-1,web-2,web-3,web-4":
+      [deploy ${server}]:
+        run $ ssh ${server} '/opt/myapp/activate.sh'
+
+      verify "canary healthy":
+        run $ curl -sf http://${server}:8080/health
+        retry 10x wait 3s
+
+    phase "rest" remaining from "web-1,web-2,web-3,web-4":
+      each server, 2 at a time:
+        [deploy ${server}]:
+          run $ ssh ${server} '/opt/myapp/activate.sh'
+
+[wait for approval]:
+  first [roll out]
+  wait for webhook "/approve/${deploy_id}"
+  timeout 2h
+
+verify "fleet healthy":
+  first [wait for approval]
+  run $ ansible -i inventory/${env} all -m shell -a 'systemctl is-active myapp'
+```
+
+**How it works:**
+- Terraform, Ansible, and the API registration run sequentially with explicit ordering
+- `[roll out]` deploys to one canary server first — if the verify fails, the rest of the fleet is never touched
+- `wait for webhook` pauses execution until `POST /approve/release-2026-04-11` is received (or times out after 2h)
+- If anything fails mid-run, `cgr apply` resumes from the failed step — no rerunning Terraform if it already completed
+- `auth bearer` tokens are automatically redacted from all output
+
+**Customization:** Override the environment with `--set env=staging`. Change the concurrency of the fleet rollout by editing `2 at a time`. Add `collect "activation"` to any step to capture its stdout in `cgr report`.
+
+---
+
 ## Feature Reference

 Quick lookup: which feature solves your problem?
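The canary gate in Recipe 11 above can be sketched in plain Python to make the control flow the `stage`/`phase` DSL encodes explicit. This is an illustrative sketch only -- `activate` and `healthy` are hypothetical callables standing in for the SSH activation and curl health check, not part of CommandGraph:

```python
def canary_rollout(servers, activate, healthy, batch=2):
    """Deploy to one canary, verify it, then roll out the rest in batches.

    Sketch of the canary-then-rest control flow; activate() and healthy()
    are hypothetical stand-ins for the per-server deploy and health check.
    """
    canary, rest = servers[0], servers[1:]
    activate(canary)
    if not healthy(canary):
        # Verify failed: the rest of the fleet is never touched.
        raise RuntimeError(f"canary {canary} unhealthy; aborting rollout")
    for i in range(0, len(rest), batch):
        for server in rest[i:i + batch]:  # "2 at a time" batches (sequential stand-in)
            activate(server)
    return [canary] + rest
```

In the graph this sequencing is declarative; the sketch just shows why a canary failure leaves the remaining servers untouched.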

README.md

Lines changed: 75 additions & 33 deletions
@@ -7,7 +7,7 @@ Write a plain-text file that reads like English (or use an agent on your behalf!
 One Python file with zero dependencies. No agents on your servers. No daemon. No database.

 ```python
---- Deploy my app ---
+--- Install nginx and basic app ---

 target "web" ssh deploy@10.0.1.5:

@@ -72,6 +72,12 @@ target "local" local:

 ---

+> **Ansible configures it. CommandGraph runs the operational story around it.**
+
+Hardware gets racked. Ansible configures it. But the workflow in between -- waiting for hosts to come up, running playbooks in the right order, verifying the result, gating the next step on a health check -- usually lives in runbooks, chat threads, and operator memory. That is the gap CommandGraph fills.
+
+---
+
 ## Web IDE

 `cgr serve FILE` launches a browser-based IDE with a live DAG visualization and execution panel. The left pane is an editor; the right pane shows the dependency graph updating in real time as you edit. Run `apply`, stream step output, inspect state and history, and view collected report data -- all from the browser.

@@ -137,28 +143,37 @@ Inside that shell, `cgr` is already on `PATH`, examples live in `/opt/cgr/exampl

 ---

-## Why not Ansible / Puppet / Chef / Salt?
+## The case for CommandGraph
+
+Most teams don't struggle because they lack tools. They struggle because the workflow *across* those tools is brittle.
+
+Some of these symptoms may sound familiar:

-CommandGraph is not trying to replace those tools. It solves a narrower problem: declare dependencies between operational steps, then let the engine plan order, maximize parallelism, and resume exactly from failure. Traditional config-management tools are still good at configuration management; CommandGraph is good at orchestration across shell commands, remote steps, API calls, wait gates, and sub-graphs.
+- Hardware is racked, but the handoff to Ansible lives in a runbook nobody reads until something breaks
+- One failed health check means rerunning broad chunks of work — no trustworthy resume point
+- Canary logic is implied by convention, not encoded in the workflow
+- Operators must remember which steps are safe to retry and which are not
+- Incident reviews have logs, but not a clean execution graph or machine-readable run record
+
+CommandGraph turns that glue layer into something explicit, resumable, and inspectable. It is not a replacement for Ansible — it is the orchestration layer around it.

 ### Running Ansible playbooks from CommandGraph

-Already have Ansible playbooks? Run them as steps inside a CommandGraph. This lets you sequence playbooks alongside shell commands, API calls, and other tools -- with crash recovery, dependency ordering, and parallel execution that Ansible alone can't express:
+Already have Ansible playbooks? Run them as steps inside a CommandGraph. The graph picks up right after hardware is racked -- waiting for hosts to respond, running playbooks in order, verifying the result -- with crash recovery, dependency ordering, and parallel execution that Ansible alone can't express:

 ```python
---- Provision and configure with Ansible ---
+--- Configure freshly racked servers with Ansible ---

 set env = "staging"

 target "control" local:

-[provision infra]:
-  run $ terraform apply -auto-approve -var="env=${env}"
-  timeout 10m
+[wait for hosts reachable]:
+  run $ ansible -i inventory/${env} all -m ping
+  retry 10x wait 30s

 [run base playbook]:
-  first [provision infra]
-  skip if $ ansible -i inventory/${env} all -m ping | grep -q SUCCESS
+  first [wait for hosts reachable]
   run $ ansible-playbook -i inventory/${env} playbooks/base.yml
   timeout 15m, retry 1x wait 30s

@@ -178,7 +193,9 @@ target "control" local:
   run $ ansible -i inventory/${env} all -m shell -a 'systemctl is-active myapp'
 ```

-Terraform provisions, Ansible configures, CommandGraph orchestrates -- with resume from any failure point. You can also use Ansible inventory files directly with `inventory "hosts.ini"` (see [Ansible inventory compatibility](#ansible-inventory-compatibility)).
+The hardware team racks the servers. CommandGraph picks up from there -- waiting, configuring, verifying -- so the handoff is encoded in the graph rather than a chat message. You can also use Ansible inventory files directly with `inventory "hosts.ini"` (see [Ansible inventory compatibility](#ansible-inventory-compatibility)).
+
+For a complete end-to-end production example -- canary rollout, API registration, and human approval gate in a single graph -- see [Cookbook Recipe 11: Full Production Rollout](COOKBOOK.md#recipe-11-full-production-rollout).

 ---

@@ -198,6 +215,8 @@ Steps with no dependency between them run in the same wave simultaneously. Steps

 Every completed step is written to a `.state` file atomically. Crash mid-run, fix the problem, run again. Completed steps skip from state without even SSHing to the server.

+Unlike rerunning an entire pipeline and hoping earlier steps are harmless, CommandGraph knows exactly what succeeded. Fix the problem and rerun -- the engine continues from the failed point, not from the top.
+
 Need isolated journals for concurrent parameterized runs? Use `cgr apply FILE --run-id canary` to salt the default state path, or `cgr apply FILE --state /path/to/run.state` to pin an explicit journal.
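The resume behavior this hunk describes rests on an append-only, fsync'd journal of completed steps. A minimal sketch of that general pattern (illustrative only -- not CommandGraph's actual `.state` format):

```python
import json
import os

class Journal:
    """Append-only run journal: one JSON line per completed step, fsync'd.

    Illustrative sketch of the crash-safe state pattern; the real .state
    file format used by cgr may differ.
    """
    def __init__(self, path):
        self.path = path

    def record(self, step, status="done"):
        line = json.dumps({"step": step, "status": status})
        with open(self.path, "a") as f:
            f.write(line + "\n")
            f.flush()
            os.fsync(f.fileno())  # a crash now loses at most this one line

    def completed(self):
        """Return the set of steps already done, for skip-on-resume."""
        done = set()
        try:
            with open(self.path) as f:
                for line in f:
                    entry = json.loads(line)
                    if entry["status"] == "done":
                        done.add(entry["step"])
        except FileNotFoundError:
            pass  # first run: nothing completed yet
        return done
```

On rerun, any step found in `completed()` is skipped, which is what lets execution continue from the failed point rather than from the top.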

 <p align="center">
@@ -288,6 +307,8 @@ Point a target at an SSH host and every command runs remotely. State stays on yo
 <img src="docs/img/ssh-execution.svg" alt="SSH execution: state local, commands remote" width="700"/>
 </p>

+There is no agent to distribute, no daemon to babysit, and no extra control plane to keep alive. For organizations cautious about adding resident components to hosts, this matters.
+
 Multiple targets in one file run in parallel. Steps with `as root` are automatically wrapped in `sudo` on the remote side.

 ---

@@ -527,20 +548,29 @@ each name, addr in ${webservers}:

 ---

-### Maintainer note
+## Where CommandGraph fits best

-The release artifact is a single `cgr.py`. Development now happens in `cgr_src/`, with `cgr_dev.py` as the thin dev entrypoint. For maintainers, rebuild after changing source modules, `ide.html`, or `visualize_template.py`:
+CommandGraph is especially well-suited for:

-```bash
-python3 cgr_dev.py version
-python3 cgr_dev.py plan build.cgr
-python3 cgr_dev.py apply build.cgr
-python3 cgr_dev.py apply build.cgr --no-resume
-```
+- staged application deploys across fleets with canary verification
+- maintenance workflows that sequence SSH, Ansible, API calls, and approval gates
+- operational runbooks that currently live as a mix of CI config and tribal knowledge
+- audits and inventory collection across many hosts
+- recovery-prone processes where restarting from scratch is expensive
+- any workflow that benefits from "run this in parallel, verify, then continue"
+
+## Getting started in your organization
+
+Don't start with a platform migration. Start with one painful workflow:
+
+- a release process that needs canary promotion and human approval gates
+- a node maintenance runbook you're tired of executing step by step
+- a provisioning-plus-configure-plus-verify sequence that spans three tools today
+- a fleet audit that currently requires too many moving parts

-`build.cgr` is the canonical rebuild path. `build_helpers.full_build()` uses the same assembly logic, but it is a lower-level helper, not the normal maintainer workflow.
+Use CommandGraph as the orchestration layer around the tools you already have. That is where it becomes persuasive quickly.

-Use [MODULE_MAP.md](MODULE_MAP.md) as the quick reference for where parser, resolver, executor, state, IDE, and CLI changes now live.
+For a complete end-to-end reference -- Ansible, canary rollout, API registration, and approval gate in a single graph -- see [Cookbook Recipe 11: Full Production Rollout](COOKBOOK.md#recipe-11-full-production-rollout).

 ---

@@ -585,6 +615,22 @@ cgr apply deploy.cgr # only the drifted step re-runs

 ---

+## Design principles
+
+**Files are the interface.** A `.cgr` file is a complete, portable, version-controllable description of your infrastructure. No web UI required, no database, no daemon.
+
+**Idempotent by default.** Every step has a `skip if` check. Run it 10 times, get the same result.
+
+**Crash-safe.** State is append-only with `fsync` after each write. A power failure loses at most one line.
+
+**Zero dependencies.** One Python file, stdlib only. Copy it to an air-gapped server and it works.
+
+**Human-readable.** The syntax reads like English: "First install nginx. Skip if already installed. Run apt-get install." No YAML indentation wars. No JSON escaping. No Jinja2 templating bugs.
+
+**Graphs, not lists.** You declare dependencies. The engine computes execution order and maximizes parallelism. Reorder your file however you want -- the result is the same.
+
+---
+
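The "Graphs, not lists" principle added above boils down to topological wave scheduling: steps whose prerequisites are all satisfied form the next wave and can run in parallel. A minimal sketch of the idea (illustrative only, not CommandGraph's actual planner):

```python
def waves(deps):
    """Group steps into waves from a step -> prerequisites mapping.

    Every step in a wave depends only on steps in earlier waves, so a
    wave's steps can all run in parallel. Illustrative sketch; not the
    actual cgr planner.
    """
    remaining = {step: set(reqs) for step, reqs in deps.items()}
    done, plan = set(), []
    while remaining:
        ready = sorted(s for s, reqs in remaining.items() if reqs <= done)
        if not ready:
            raise ValueError("dependency cycle detected")
        plan.append(ready)          # everything here is mutually independent
        done.update(ready)
        for s in ready:
            del remaining[s]
    return plan
```

Because the waves are computed from the dependency edges alone, reordering the steps in the file leaves the plan unchanged.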
 ## Testing

 ```bash
@@ -602,25 +648,21 @@ cd testing-ssh/ && ./run-ssh-demos.sh # 5 SSH demos
 |----------|----------|-------------|
 | [QUICKSTART.md](QUICKSTART.md) | New users | Zero to running in 5 minutes |
 | [TUTORIAL.md](TUTORIAL.md) | Beginners | 9 guided lessons, ~1 hour |
-| [COOKBOOK.md](COOKBOOK.md) | Operators | 10 real-world recipes |
+| [COOKBOOK.md](COOKBOOK.md) | Operators | 11 real-world recipes, including a full production rollout |
 | [MANUAL.md](MANUAL.md) | Reference | Complete syntax for `.cgr` and `.cg` |
 | [COMMANDGRAPH_SPEC.md](COMMANDGRAPH_SPEC.md) | Code generators | Formal PEG grammar |
-| [AGENTS.md](AGENTS.md) | Contributors | Architecture and internals |
+| [AGENTS.md](AGENTS.md) | Contributors | Architecture, internals, and build workflow |

 ---

-## Design principles
-
-**Files are the interface.** A `.cgr` file is a complete, portable, version-controllable description of your infrastructure. No web UI required, no database, no daemon.
-
-**Idempotent by default.** Every step has a `skip if` check. Run it 10 times, get the same result.
-
-**Crash-safe.** State is append-only with `fsync` after each write. A power failure loses at most one line.
+### Maintainer note

-**Zero dependencies.** One Python file, stdlib only. Copy it to an air-gapped server and it works.
+The release artifact is a single `cgr.py`. Development happens in `cgr_src/`, with `cgr_dev.py` as the thin dev entrypoint. Rebuild after changing source modules, `ide.html`, or `visualize_template.py`:

-**Human-readable.** The syntax reads like English: "First install nginx. Skip if already installed. Run apt-get install." No YAML indentation wars. No JSON escaping. No Jinja2 templating bugs.
+```bash
+python3 cgr_dev.py apply build.cgr
+```

-**Graphs, not lists.** You declare dependencies. The engine computes execution order and maximizes parallelism. Reorder your file however you want -- the result is the same.
+See [AGENTS.md](AGENTS.md) for the full build workflow and [MODULE_MAP.md](MODULE_MAP.md) for the quick reference on where parser, resolver, executor, state, IDE, and CLI changes live.

 ---

benchmarks/README.md

Lines changed: 11 additions & 6 deletions
@@ -5,7 +5,7 @@ Spawns N containers as deployment targets, then runs the same workload with diff

 ## Requirements

-- Podman (or Docker as fallback)
+- Podman recommended
 - ~500MB disk for container images
 - Approximately 1 minute for the default 8-target run

@@ -14,10 +14,13 @@ Spawns N containers as deployment targets, then runs the same workload with diff
 ```bash
 ./run-benchmark.sh # 8 targets (default)
 ./run-benchmark.sh 12 # 12 targets
+./run-benchmark.sh auto # podman only; auto-size target count from host capacity
 ./run-benchmark.sh shell # interactive shell with targets running
 ./run-benchmark.sh teardown # clean up everything
 ```

+`auto` / `max` mode prefers the highest target count the current host can sustain for this benchmark. It calculates an initial ceiling from live CPU and memory availability, then probes podman target startup until it finds the highest count that comes up cleanly over SSH.
+
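The ceiling calculation described in that added paragraph can be sketched as follows. The per-target costs and the fallback value here are assumptions chosen for illustration, and the real `auto` mode additionally probes actual podman target startup over SSH rather than trusting the estimate:

```python
import os

def estimate_target_ceiling(mem_per_target_mb=256, cpus_per_target=0.5):
    """Rough upper bound on benchmark targets this host might sustain.

    Hypothetical heuristic for illustration only; per-target costs and the
    fallback are assumptions, not values from run-benchmark.sh.
    """
    cpu_limit = int((os.cpu_count() or 1) / cpus_per_target)
    mem_mb = None
    try:
        with open("/proc/meminfo") as f:  # Linux-only source of MemAvailable
            for line in f:
                if line.startswith("MemAvailable:"):
                    mem_mb = int(line.split()[1]) // 1024  # kB -> MB
                    break
    except OSError:
        pass
    if mem_mb is None:
        mem_mb = 2048  # conservative fallback when /proc/meminfo is absent
    mem_limit = mem_mb // mem_per_target_mb
    return max(1, min(cpu_limit, mem_limit))
```

The ceiling is only a starting point: probing real container startup is what catches limits (open files, network namespaces, SSH daemons) that a CPU/memory estimate cannot see.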
 ## What it measures

 Five scenarios deploy the **same workload** to the **same N hosts**, varying only the concurrency:
@@ -84,9 +87,11 @@ The speedup is sub-linear because:

 The control container generates `.cgr` graph files dynamically based on the target count, then runs `cgr apply` for each scenario. State files are cleared between runs so every step executes fresh.

-## Team
+Generated graphs are preserved under `benchmarks/generated/latest/` and in a timestamped run directory, so you can inspect them directly:
+
+```bash
+python3 cgr.py serve benchmarks/generated/latest/bench-parN.cgr
+python3 cgr.py visualize benchmarks/generated/latest/bench-parN.cgr -o benchmarks/generated/latest/bench-parN.html
+```

-Benchmark designed by:
-- **Marcus Chen (SRE)** — infrastructure and benchmark harness
-- **Priya Kapoor (Systems SRE)** — `each`/`stage` scenario design
-- **Dr. Elena Voss (CS)** — measurement methodology, workload calibration
+The generated directory also includes a small `README.md` with the exact paths for that run.
