- If you touch apply output, keep `--output json` and `--report FILE.json` aligned by deriving both from the same execution/result data when possible (see the example after this list).
- If you touch state handling, inspect both resource-entry behavior and `_wave`/`_run` metric behavior.
- If you touch sub-graph inclusion, preserve the contract that the child graph runs as a unit rather than as flattened parent resources.
- Test against the root feature-exercise graphs after changes, not just unit tests.
- Templates may exist as either `.cgr` or `.cg`, but the stdlib in `repo/` is `.cgr`, and the repo loader prefers `.cgr`.
- Be cautious with changes to state semantics, dependency resolution, and desugaring. Small regressions in those areas can silently break resume behavior or execution ordering.
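
For instance, a single invocation that exercises both outputs makes any drift between them visible immediately (the graph and report file names here are placeholders; both flags are documented):

```bash
# Both machine-readable outputs in one run: the JSON stream on stdout
# and the report file should describe the same execution.
cgr apply deploy.cgr --output json --report report.json
```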

**COOKBOOK.md** (+73 lines)
@@ -534,6 +534,79 @@

---
## Recipe 11: Full Production Rollout

**When to use:** End-to-end release workflow — provision, configure, register, roll out with a canary gate, then wait for human approval before verifying the fleet.

**Concepts:** `stage`/`phase`, `wait for webhook`, HTTP steps, auth tokens, `each`, `verify`.

This is the kind of workflow most teams express today as a fragile combination of pipeline stages, shell scripts, approval toggles, and manual operator steps. Encoding it in one graph gives you a consistent execution model with crash recovery at every point.

```
--- Production rollout with gates ---

set env = "prod"
set deploy_id = "release-2026-04-11"

target "control" local:

[provision infra]:
  run $ terraform apply -auto-approve -var="env=${env}"
  timeout 15m

[run base playbook]:
  first [provision infra]
  run $ ansible-playbook -i inventory/${env} playbooks/base.yml
  timeout 20m

[run app playbook]:
  first [run base playbook]
  run $ ansible-playbook -i inventory/${env} playbooks/app.yml --tags deploy
  timeout 15m

[register deploy]:
  first [run app playbook]
  post "https://deploy-api.example.net/releases"
  auth bearer "${DEPLOY_API_TOKEN}"
  body json '{"environment":"${env}","deploy_id":"${deploy_id}"}'
  expect 200..299

[roll out]:
  first [register deploy]
  stage "production":
    phase "canary" 1 from "web-1,web-2,web-3,web-4":
      [deploy ${server}]:
        run $ ssh ${server} '/opt/myapp/activate.sh'

      verify "canary healthy":
        run $ curl -sf http://${server}:8080/health
        retry 10x wait 3s

    phase "rest" remaining from "web-1,web-2,web-3,web-4":
      each server, 2 at a time:
        [deploy ${server}]:
          run $ ssh ${server} '/opt/myapp/activate.sh'

[wait for approval]:
  first [roll out]
  wait for webhook "/approve/${deploy_id}"
  timeout 2h

verify "fleet healthy":
  first [wait for approval]
  run $ ansible -i inventory/${env} all -m shell -a 'systemctl is-active myapp'
```

**How it works:**

- Terraform, Ansible, and the API registration run sequentially with explicit ordering
- `[roll out]` deploys to one canary server first — if the verify fails, the rest of the fleet is never touched
- `wait for webhook` pauses execution until `POST /approve/release-2026-04-11` is received (or times out after 2h) — an example call appears below
- If anything fails mid-run, `cgr apply` resumes from the failed step — no rerunning Terraform if it already completed
- `auth bearer` tokens are automatically redacted from all output

**Customization:** Override the environment with `--set env=staging`. Change the concurrency of the fleet rollout by editing `2 at a time`. Add `collect "activation"` to any step to capture its stdout in `cgr report`.
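
As a concrete illustration, approving the paused rollout is one HTTP call, and a staging rehearsal is one flag. The webhook path comes from the graph; the host and port where `cgr` listens for webhooks are assumptions here, as is the graph file name:

```bash
# Approve the run paused at [wait for approval]. Host/port are
# hypothetical -- use wherever your cgr webhook listener is reachable.
curl -X POST http://cgr-host:8080/approve/release-2026-04-11

# Rehearse the same graph against staging with the documented override.
cgr apply rollout.cgr --set env=staging
```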

**README.md** (+75, −33 lines)
@@ -7,7 +7,7 @@

Write a plain-text file that reads like English (or use an agent on your behalf!).

One Python file with zero dependencies. No agents on your servers. No daemon. No database.

```
--- Install nginx and basic app ---

target "web" ssh deploy@10.0.1.5:
  ...
```
@@ -72,6 +72,12 @@

---

> **Ansible configures it. CommandGraph runs the operational story around it.**

Hardware gets racked. Ansible configures it. But the workflow in between -- waiting for hosts to come up, running playbooks in the right order, verifying the result, gating the next step on a health check -- usually lives in runbooks, chat threads, and operator memory. That is the gap CommandGraph fills.

---
## Web IDE

`cgr serve FILE` launches a browser-based IDE with a live DAG visualization and execution panel. The left pane is an editor; the right pane shows the dependency graph updating in real time as you edit. Run `apply`, stream step output, inspect state and history, and view collected report data -- all from the browser.
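
For example (the graph file name is a placeholder):

```bash
# Open the browser IDE for a graph and edit it live.
cgr serve deploy.cgr
```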
@@ -137,28 +143,37 @@

---

## The case for CommandGraph

Most teams don't struggle because they lack tools. They struggle because the workflow *across* those tools is brittle.

Some of these symptoms may sound familiar:

- Hardware is racked, but the handoff to Ansible lives in a runbook nobody reads until something breaks
- One failed health check means rerunning broad chunks of work — no trustworthy resume point
- Canary logic is implied by convention, not encoded in the workflow
- Operators must remember which steps are safe to retry and which are not
- Incident reviews have logs, but not a clean execution graph or machine-readable run record

CommandGraph turns that glue layer into something explicit, resumable, and inspectable. It is not a replacement for Ansible — it is the orchestration layer around it.

### Running Ansible playbooks from CommandGraph

Already have Ansible playbooks? Run them as steps inside a CommandGraph. The graph picks up right after hardware is racked -- waiting for hosts to respond, running playbooks in order, verifying the result -- with crash recovery, dependency ordering, and parallel execution that Ansible alone can't express:

    run $ ansible-playbook -i inventory/${env} playbooks/base.yml
    timeout 15m, retry 1x wait 30s
@@ -178,7 +193,9 @@

    target "control" local:
    ...
    run $ ansible -i inventory/${env} all -m shell -a 'systemctl is-active myapp'

The hardware team racks the servers. CommandGraph picks up from there -- waiting, configuring, verifying -- so the handoff is encoded in the graph rather than a chat message. You can also use Ansible inventory files directly with `inventory "hosts.ini"` (see [Ansible inventory compatibility](#ansible-inventory-compatibility)).

For a complete end-to-end production example -- canary rollout, API registration, and human approval gate in a single graph -- see [Cookbook Recipe 11: Full Production Rollout](COOKBOOK.md#recipe-11-full-production-rollout).

---
@@ -198,6 +215,8 @@

Steps with no dependency between them run in the same wave simultaneously.

Every completed step is written to a `.state` file atomically. Crash mid-run, fix the problem, run again. Completed steps skip from state without even SSHing to the server.

Unlike rerunning an entire pipeline and hoping earlier steps are harmless, CommandGraph knows exactly what succeeded. Fix the problem and rerun -- the engine continues from the failed point, not from the top.

Need isolated journals for concurrent parameterized runs? Use `cgr apply FILE --run-id canary` to salt the default state path, or `cgr apply FILE --state /path/to/run.state` to pin an explicit journal.
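
A minimal sketch of two isolated, concurrent runs of the same graph (the graph file name and variable values are placeholders; the flags are the documented ones):

```bash
# Each run gets its own salted state journal, so the two executions
# never read or overwrite each other's progress.
cgr apply rollout.cgr --run-id canary-a --set env=staging &
cgr apply rollout.cgr --run-id canary-b --set env=prod &
wait
```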

@@ -288,6 +307,8 @@

Point a target at an SSH host and every command runs remotely. State stays on your machine.

<p align="center">
<img src="docs/img/ssh-execution.svg" alt="SSH execution: state local, commands remote" width="700"/>
</p>

There is no agent to distribute, no daemon to babysit, and no extra control plane to keep alive. For organizations cautious about adding resident components to hosts, this matters.

Multiple targets in one file run in parallel. Steps with `as root` are automatically wrapped in `sudo` on the remote side.

---
@@ -527,20 +548,29 @@

---

## Where CommandGraph fits best

CommandGraph is especially well-suited for:

- staged application deploys across fleets with canary verification
- maintenance workflows that sequence SSH, Ansible, API calls, and approval gates
- operational runbooks that currently live as a mix of CI config and tribal knowledge
- audits and inventory collection across many hosts
- recovery-prone processes where restarting from scratch is expensive
- any workflow that benefits from "run this in parallel, verify, then continue"

## Getting started in your organization

Don't start with a platform migration. Start with one painful workflow:

- a release process that needs canary promotion and human approval gates
- a node maintenance runbook you're tired of executing step by step
- a provisioning-plus-configure-plus-verify sequence that spans three tools today
- a fleet audit that currently requires too many moving parts

Use CommandGraph as the orchestration layer around the tools you already have. That is where it becomes persuasive quickly.

For a complete end-to-end reference -- Ansible, canary rollout, API registration, and approval gate in a single graph -- see [Cookbook Recipe 11: Full Production Rollout](COOKBOOK.md#recipe-11-full-production-rollout).

---
@@ -585,6 +615,22 @@

    cgr apply deploy.cgr   # only the drifted step re-runs

---

## Design principles

**Files are the interface.** A `.cgr` file is a complete, portable, version-controllable description of your infrastructure. No web UI required, no database, no daemon.

**Idempotent by default.** Every step has a `skip if` check. Run it 10 times, get the same result.

**Crash-safe.** State is append-only with `fsync` after each write. A power failure loses at most one line.

**Zero dependencies.** One Python file, stdlib only. Copy it to an air-gapped server and it works.

**Human-readable.** The syntax reads like English: "First install nginx. Skip if already installed. Run apt-get install." No YAML indentation wars. No JSON escaping. No Jinja2 templating bugs.

**Graphs, not lists.** You declare dependencies. The engine computes execution order and maximizes parallelism. Reorder your file however you want -- the result is the same.
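
A quick illustration of the idempotent and crash-safe principles together (the graph file name is a placeholder; both behaviors and the flag are documented):

```bash
cgr apply deploy.cgr               # first run executes every step
cgr apply deploy.cgr               # second run: completed steps skip from state
cgr apply deploy.cgr --no-resume   # ignore the journal and start fresh
```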

### Maintainer note

The release artifact is a single `cgr.py`. Development happens in `cgr_src/`, with `cgr_dev.py` as the thin dev entrypoint. Rebuild after changing source modules, `ide.html`, or `visualize_template.py`:

```bash
python3 cgr_dev.py apply build.cgr
```

See [AGENTS.md](AGENTS.md) for the full build workflow and [MODULE_MAP.md](MODULE_MAP.md) for the quick reference on where parser, resolver, executor, state, IDE, and CLI changes live.

**benchmarks/README.md** (+11, −6 lines)
@@ -5,7 +5,7 @@

Spawns N containers as deployment targets, then runs the same workload with different concurrency settings.

## Requirements

- Podman recommended
- ~500MB disk for container images
- Approximately 1 minute for the default 8-target run
@@ -14,10 +14,13 @@

```bash
./run-benchmark.sh            # 8 targets (default)
./run-benchmark.sh 12         # 12 targets
./run-benchmark.sh auto       # podman only; auto-size target count from host capacity
./run-benchmark.sh shell      # interactive shell with targets running
./run-benchmark.sh teardown   # clean up everything
```

`auto` / `max` mode prefers the highest target count the current host can sustain for this benchmark. It calculates an initial ceiling from live CPU and memory availability, then probes Podman target startup until it finds the highest count that comes up cleanly over SSH.

## What it measures

Five scenarios deploy the **same workload** to the **same N hosts**, varying only the concurrency:

@@ -84,9 +87,11 @@

The control container generates `.cgr` graph files dynamically based on the target count, then runs `cgr apply` for each scenario. State files are cleared between runs so every step executes fresh.

Generated graphs are preserved under `benchmarks/generated/latest/` and in a timestamped run directory, so you can inspect them directly:
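
For example (the generated file names vary per run, so `scenario.cgr` below is hypothetical):

```bash
ls benchmarks/generated/latest/                     # list the preserved graphs
cgr plan benchmarks/generated/latest/scenario.cgr   # dry-run one of them
```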