A structured TDD pipeline for Claude Code with adversarial sub-agent review gates.
This is a reference implementation, not a library. Read it, understand the patterns, then build your own version tuned to your stack and workflow. The code examples are from a real production setup (Go + Next.js monorepo, 16 parallel environments) — adapt them to your situation.
- The Problem
- The Solution
- Getting Started — Choose Your Tier
- Architecture
- Claude Code Hooks Primer
- Tiers
- How it prevents Claude from cheating
- Deep dives
- Porting guide
- FAQ
Claude Code is powerful but takes shortcuts:
- Skips testing ("the code looks correct, let me push")
- Skips review ("all tests pass, ship it")
- Jumps from plan to implementation without defining acceptance criteria
- Approves its own work (writes tests that test what it wrote, not what was specified)
- Anchors on its own decisions (reviews code it wrote and finds no issues)
Why not just add rules to CLAUDE.md? CLAUDE.md rules are suggestions; Claude follows them ~80% of the time. Hooks are enforcement; Claude cannot bypass a non-zero exit code. The ceremony uses hooks for enforcement and CLAUDE.md for guidance — together they're airtight.
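The enforcement side is mechanical: a hook is just a script whose exit code Claude Code obeys. A minimal sketch of the blocking decision — the helper and its `all_green` input are illustrative placeholders, not this repo's actual guard:

```python
import json
import sys

def should_block(hook_input: dict, all_green: bool) -> bool:
    """Block a `git push` unless the captured test artifact is green (sketch)."""
    command = hook_input.get("tool_input", {}).get("command", "")
    return "git push" in command and not all_green

if __name__ == "__main__":
    event = json.load(sys.stdin)                 # Claude Code sends hook JSON on stdin
    if should_block(event, all_green=False):     # all_green would come from an artifact
        print("Push blocked: run the tests first.", file=sys.stderr)
        sys.exit(2)                              # non-zero exit = the tool call is denied
```

Claude cannot talk its way past that `sys.exit(2)` — which is the whole point.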
LLMs perform dramatically better when they work through a problem in stages — from broad to specific — rather than jumping straight to code. When Claude goes directly from "build feature X" to writing code, it:
- Makes assumptions that were never validated
- Writes tests after the code (testing what it wrote, not what was specified)
- Misses edge cases that would have been obvious during planning
- Produces code that works but doesn't match what the user actually wanted
A structured pipeline forces Claude to think before coding: first understand the problem (plan), then define what "done" means (spec), then map how to verify each claim (test shape), then write failing tests (TDD), and only then write implementation code. Each stage produces an artifact that constrains the next stage. By the time Claude writes code, it has:
- A plan validated with the user's concerns
- A spec with explicit Must Do / Must NOT claims
- A test shape mapping every claim to concrete test cases
- Failing tests that define exactly what success looks like
This is test-driven development in the classical sense — define the acceptance criteria and write failing tests before writing a single line of implementation. The pipeline enforces this ordering because Claude will skip it if you let it.
Two mechanisms enforce the pipeline:
```
plan → spec → survey → shape-review(GATE) → tests → implement
     → diff-review(GATE) → verify → report-review(GATE) → ship → handoff(GATE)
```
A hook runs on every user message and injects the current phase as an instruction. Claude can't skip steps because each phase requires artifacts from the previous one — files on disk that prove a step was completed. No artifact = no advancement.
The pipeline progressively narrows scope: plan is broad ("what are we building and why?"), spec is precise ("what must be true when we're done?"), test shape is mechanical ("which test proves each claim?"), and implementation is focused ("make this specific failing test pass"). Each stage constrains the next, so by the time Claude writes code, the degrees of freedom are small and well-defined.
At review gates, 8+ sub-agents review Claude's work in parallel. Each has a narrow mandate (security, performance, regression, contracts, etc.) and — critically — zero conversation context. They weren't "in the room" when decisions were made, so they question everything. This eliminates the anchoring bias that makes self-review useless.
This is a choose-your-own-adventure. Each tier is self-contained and adds more enforcement. Read the tier descriptions below and pick where to start. You can always upgrade later.
```
┌─────────────────────────────────────────────────────────────────────────┐
│ Which tier do you need?                                                 │
│                                                                         │
│ Claude sometimes pushes untested code?                                  │
│   ──→ Tier 1: Guard Rails                                               │
│       Hooks block bad pushes. Zero workflow change.                     │
│                                                                         │
│ Claude tests, but builds the wrong thing?                               │
│   ──→ Tier 2: Phase Awareness                                           │
│       State machine forces plan → spec → test → implement order.        │
│       You approve at gates.                                             │
│                                                                         │
│ You want hands-off feature development?                                 │
│   ──→ Tier 3: Full Ceremony                                             │
│       Type /develop once. Get a shipped PR. You approve at gates.       │
│       8+ adversarial agents review every diff.                          │
└─────────────────────────────────────────────────────────────────────────┘
```
Once you've picked a tier, give this README to Claude and ask it to build your implementation. Here's the prompt:
```text
I want to implement Tier [1/2/3] of the develop-ceremony system
described in this README. My stack is [Go/Python/Rust/TypeScript/etc],
my test runner is [go test/pytest/cargo test/jest/etc], and my base
branch is [main/production/master].

Read the README for the full architecture, then build my hooks and
scripts incrementally:

1. Start with one hook at a time
2. Define a test scenario first (e.g., "I run go test, then git push
   — push should be blocked if tests failed")
3. Implement the hook
4. Have me verify it works by actually trying the scenario
5. Iterate to the next hook
```
A note on "TDD" — there are two layers here. The ceremony uses TDD for building features (plan → spec → failing tests → implementation). But when you're building the ceremony itself (the hooks, scripts, and skills), you should also work incrementally with defined scenarios. Don't try to build all hooks at once — build one, test it manually, then build the next. The ceremony is infrastructure that's hard to debug (hooks run in subprocesses with JSON on stdin), so incremental verification matters.
Claude will need to make decisions along the way — this is intentional. The README describes the patterns and trade-offs, and Claude adapts them to your specific setup:
Decisions Claude will help you make:
- Which test runners to capture — depends on your stack
- What to block vs warn on push — tests and lint should block; semgrep and style may be warnings
- Which convention checks to add — depends on your coding standards
- How to detect your project root — git root? monorepo subdirectory? workspace?
- Which agents to include (Tier 3) — start with 8 core, add stack-specific ones
- Which phases to include (Tier 3) — you may not need all 12
Each person builds their own version. This repo is a reference, not a dependency. Fork it, read the patterns, build what fits your workflow. Different team members may use different tiers — that's fine.
```
┌─────────────────────────────────────────────────────────────┐
│                       USER PROMPT                           │
│                           ↓                                 │
│  ┌──────────────────────────────────────────────────────┐   │
│  │ UserPromptSubmit Hook                                │   │
│  │ → compute-phase.py (reads artifacts, detects phase)  │   │
│  │ → injects phase instruction into conversation        │   │
│  └──────────────────────────────────────────────────────┘   │
│                           ↓                                 │
│  ┌──────────────────────────────────────────────────────┐   │
│  │ Claude follows phase instruction                     │   │
│  │ (write plan, write tests, spawn review agents, etc.) │   │
│  └──────────────────────────────────────────────────────┘   │
│                           ↓                                 │
│  ┌──────────────────────────────────────────────────────┐   │
│  │ PostToolUse Hooks                                    │   │
│  │ → capture-results.sh (go test → test-results.json)   │   │
│  │ → semgrep-check.sh (Edit/Write → guards + checks)    │   │
│  └──────────────────────────────────────────────────────┘   │
│                           ↓                                 │
│  ┌──────────────────────────────────────────────────────┐   │
│  │ PreToolUse Hook                                      │   │
│  │ → pre-push-guard.sh (blocks push if checks fail)     │   │
│  └──────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘
```
All artifacts live in develop/<branch>/ relative to the project root:
```
<project-root>/
├── develop/
│   └── feature/my-feature/
│       ├── plan.md                  # Phase 1 output
│       ├── spec.md                  # Phase 2 output
│       ├── test-shape.yaml          # Phase 3 output
│       ├── gate-shape.approved      # Written by /approve-shape (never by Claude)
│       ├── gate-report.approved     # Written by /approve-report
│       ├── test-results-*.json      # Auto-captured by hooks
│       ├── semgrep-results.json     # Auto-captured
│       ├── lint-results.json        # Auto-captured
│       ├── phase-override.json      # Written by /set-phase (24h expiry, auto-deleted)
│       └── agents/
│           ├── gate1/               # Shape review findings
│           └── gate2/
│               ├── round-1/         # First diff review (all agents)
│               ├── round-2/         # Re-review (only agents that found must-fix)
│               └── round-3/         # Final safety net (Contracts + Regression + Security only)
└── screenshots/
    ├── claim-do-1-*.png             # Evidence for verification (naming matters — see Tier 3)
    └── verification-report.pdf      # Generated report
```
If you haven't built Claude Code hooks before, here's what you need to know.
| Hook Type | When it fires | stdin (JSON) | How output works | Exit code |
|---|---|---|---|---|
| `PreToolUse` | Before Claude runs a tool | `{tool_name, tool_input, cwd}` | stdout/stderr shown to Claude as error | exit 0 = allow, exit 2 = block the tool |
| `PostToolUse` | After Claude runs a tool | `{tool_name, tool_input, tool_response, cwd}` | Must use `hookSpecificOutput` JSON (see below) | Exit code doesn't block (tool already ran) |
| `UserPromptSubmit` | Before every user message is processed | `{session_id, user_prompt, cwd}` | stdout is injected as a system-reminder | Exit code ignored |
| `SubagentStop` | When a sub-agent completes | `{last_assistant_message, agent_type, cwd}` | Same as PostToolUse | Exit code ignored |
Every hook receives JSON on stdin. Parse it with Python (more reliable than jq for nested fields):
```sh
INPUT=$(cat)
COMMAND=$(echo "$INPUT" | python3 -c "import json,sys; print(json.load(sys.stdin).get('tool_input',{}).get('command',''))")
CWD=$(echo "$INPUT" | python3 -c "import json,sys; print(json.load(sys.stdin).get('cwd',''))")
RESPONSE=$(echo "$INPUT" | python3 -c "
import json, sys
d = json.load(sys.stdin)
resp = d.get('tool_response', '')
if isinstance(resp, list):
    resp = ' '.join(str(r.get('text','') if isinstance(r,dict) else r) for r in resp)
print(str(resp)[:10000])  # cap at 10K chars to avoid memory issues
")
```

PostToolUse hooks can't block (the tool already ran), but they can inject warnings into Claude's context via structured JSON on stdout:
```sh
# Build the warning message
VIOLATIONS="You wrote a gate-approval file. Only the USER can approve gates."

# Output as hookSpecificOutput JSON
CONTEXT=$(echo "$VIOLATIONS" | python3 -c "import json,sys; print(json.dumps(sys.stdin.read().strip()))")
cat <<EOF
{
  "hookSpecificOutput": {
    "hookEventName": "PostToolUse",
    "additionalContext": ${CONTEXT}
  }
}
EOF
```

Claude sees the `additionalContext` as a system message and course-corrects.
Hooks are registered in your Claude Code settings file:

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [{ "type": "command", "command": "/path/to/pre-push-guard.sh" }]
      }
    ],
    "PostToolUse": [
      {
        "matcher": "Bash",
        "hooks": [{ "type": "command", "command": "/path/to/capture-results.sh" }]
      },
      {
        "matcher": "Edit|Write",
        "hooks": [{ "type": "command", "command": "/path/to/semgrep-check.sh" }]
      }
    ],
    "UserPromptSubmit": [
      {
        "hooks": [{ "type": "command", "command": "/path/to/user-prompt-submit.sh" }]
      }
    ],
    "SubagentStop": [
      {
        "hooks": [{ "type": "command", "command": "/path/to/capture-gate-results.sh" }]
      }
    ]
  }
}
```

Hooks are hard to debug because they run in a subprocess. Techniques:
- Log to a file: add `echo "$(date) fired with: $COMMAND" >> /tmp/hook-debug.log` at the top
- Test manually: save sample JSON to a file, then `cat sample.json | bash your-hook.sh`
- Check Claude's context: if a UserPromptSubmit hook outputs text, Claude will reference it. If it's not appearing, the hook is crashing silently.
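The manual-testing technique is worth making repeatable. A tiny harness (the `run_hook` name is mine, not this repo's) feeds a saved event to any hook script exactly the way Claude Code would:

```python
import json
import subprocess

def run_hook(hook_path: str, event: dict) -> subprocess.CompletedProcess:
    """Pipe a sample hook event (as JSON) to a hook script's stdin."""
    return subprocess.run(
        ["bash", hook_path],
        input=json.dumps(event),
        capture_output=True,
        text=True,
    )

# Example: replay a push event against a guard and inspect the verdict.
# result = run_hook("pre-push-guard.sh",
#                   {"tool_name": "Bash",
#                    "tool_input": {"command": "git push"}, "cwd": "."})
# print(result.returncode, result.stderr)
```

Keep a directory of captured sample events per hook type; re-running them after every hook change catches regressions in seconds.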
Skills are custom commands registered in `~/.claude/commands/`. Each is a directory containing a `SKILL.md` that defines what `/command-name` does. When a user types `/develop`, Claude Code loads `~/.claude/commands/develop/SKILL.md` and follows it.
The `Agent()` tool is a Claude Code built-in that spawns a sub-agent (subprocess) with its own context. `run_in_background=True` makes it non-blocking so you can launch multiple agents in parallel.
Each tier builds on the previous. Start with Tier 1 — it requires zero workflow changes and immediately prevents bad pushes.
When to upgrade:
- Stay at Tier 1 if Claude is writing good code but occasionally pushes without testing
- Move to Tier 2 when Claude is testing but building the wrong thing (testing what it wrote, not what was specified)
- Move to Tier 3 when you want hands-off feature development with human review at gates
What you get: Claude can't push broken code. Test/lint results are captured automatically on every run. Convention checks fire on every file edit.
What changes for you: Nothing. You code normally. The hooks silently capture results and block bad pushes.
Components:
| Hook | Event | Purpose |
|---|---|---|
| `capture-results.sh` | PostToolUse[Bash] | Parses test/lint/build output → writes JSON artifacts |
| `semgrep-check.sh` | PostToolUse[Edit\|Write] | Convention checks, gate-file protection, phase recomputation |
| `pre-push-guard.sh` | PreToolUse[Bash] | Reads artifacts, blocks `git push` if checks fail |
The hook receives every Bash command + its output as JSON on stdin. It pattern-matches the command to detect what ran:
```python
# After extracting cmd and resp from the JSON stdin:
if 'go test' in cmd:
    # Parse: count "ok" lines (passed) and "FAIL" lines (failed)
    # Extract individual test names from "--- PASS: TestFoo" / "--- FAIL: TestBar"
    passed = len(re.findall(r'^ok\s', resp, re.MULTILINE))
    failed = len(re.findall(r'^FAIL\s', resp, re.MULTILINE))
    test_names = {}
    for m in re.finditer(r'--- (PASS|FAIL|SKIP): (\S+)', resp):
        test_names[m.group(2)] = m.group(1).lower()
    write_json('test-results.json', {...})
elif 'vitest' in cmd:
    # IMPORTANT: strip ANSI color codes before parsing, or counts break
    clean = re.sub(r'\x1b\[[0-9;]*m', '', resp)
    # Parse: "Tests 5 passed | 2 failed"
    pass_match = re.search(r'(\d+)\s+passed', clean)
    fail_match = re.search(r'(\d+)\s+failed', clean)
    write_json('test-results-widget.json', {...})
elif 'golangci-lint' in cmd and 'run' in cmd:
    # Parse: count lines matching "file.go:N:N:"
    issues = re.findall(r'^\S+\.go:\d+:\d+:', resp, re.MULTILINE)
    write_json('lint-results.json', {...})
elif 'semgrep' in cmd and 'scan' in cmd:
    # Count findings
    write_json('semgrep-results.json', {...})
```

Key behaviors:
- Test result merging: If Claude runs `go test ./pkg/foo` then later `go test ./pkg/bar`, the hook merges results — `test_names` from both runs accumulate, and pass/fail counts are recalculated from the merged set. This means incremental test runs build up a complete picture.
- Commit pinning: Every artifact includes the current `HEAD` commit hash. If Claude makes code changes after running tests, HEAD changes and the artifact becomes stale. The pre-push guard catches this.
- Output truncation: `tool_response` is capped at 10,000 chars on input; `output_tail` in artifacts is capped at 2,000 chars. This prevents memory issues with massive test output.
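The merge-with-commit-pinning behavior can be sketched in a few lines (simplified — field names follow the artifact format used in this README, everything else is illustrative):

```python
import json
import os

def merge_test_results(path: str, new_names: dict, head: str) -> dict:
    """Merge a partial test run into the artifact; a new HEAD discards old results."""
    prior = {}
    if os.path.exists(path):
        with open(path) as f:
            prev = json.load(f)
        if prev.get("commit") == head:       # same commit → accumulate
            prior = prev.get("test_names", {})
    merged = {**prior, **new_names}          # re-runs overwrite same-named tests
    fails = sum(1 for status in merged.values() if status == "fail")
    artifact = {
        "commit": head,
        "test_names": merged,
        "tests_pass": len(merged) - fails,
        "tests_fail": fails,
        "all_green": fails == 0,
    }
    with open(path, "w") as f:
        json.dump(artifact, f, indent=2)
    return artifact
```

The commit check is what makes the merge safe: results never accumulate across code changes.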
Each artifact looks like:
```json
{
  "commit": "abc1234",
  "timestamp": "2026-04-06T...",
  "tests_pass": 42,
  "tests_fail": 0,
  "all_green": true,
  "test_names": {
    "TestFoo_HappyPath": "pass",
    "TestBar_EdgeCase": "pass"
  },
  "output_tail": "..."
}
```

Fires before every Bash command. Early-exits for non-`git push` commands (~0ms overhead).
The guard maintains two separate accumulators — this is an important design choice:
- FAILURES → `exit 2` (push blocked): stale tests, failing tests, missing lint, uncommitted changes, migration number conflicts
- WARNINGS → `exit 0` (push allowed, but issues shown): semgrep findings, Gate 2 staleness (review ran at different commit than HEAD), missing widget build
```sh
INPUT=$(cat)
COMMAND=$(echo "$INPUT" | python3 -c "import json,sys; print(json.load(sys.stdin).get('tool_input',{}).get('command',''))")
echo "$COMMAND" | grep -q "git push" || exit 0  # not a push? allow immediately

# Find artifact directory
CWD=$(echo "$INPUT" | python3 -c "import json,sys; print(json.load(sys.stdin).get('cwd',''))")
PROJECT_ROOT=$(git -C "$CWD" rev-parse --show-toplevel 2>/dev/null)
BRANCH=$(git -C "$PROJECT_ROOT" branch --show-current 2>/dev/null)
DEVELOP_DIR="$PROJECT_ROOT/develop/$BRANCH"
HEAD=$(git -C "$PROJECT_ROOT" rev-parse --short HEAD 2>/dev/null)

FAILURES=""
WARNINGS=""

# Check 1: Tests at HEAD
TEST_FILE="$DEVELOP_DIR/test-results.json"
if [ -f "$TEST_FILE" ]; then
  TEST_COMMIT=$(python3 -c "import json; print(json.load(open('$TEST_FILE')).get('commit','')[:7])")
  TEST_FAIL=$(python3 -c "import json; print(json.load(open('$TEST_FILE')).get('tests_fail', 0))")
  if [ "$TEST_COMMIT" != "$HEAD" ]; then
    FAILURES="${FAILURES}\n❌ Tests are stale (ran at $TEST_COMMIT, HEAD is $HEAD). Run: go test ./..."
  elif [ "$TEST_FAIL" != "0" ]; then
    FAILURES="${FAILURES}\n❌ Tests have $TEST_FAIL failures."
  fi
else
  FAILURES="${FAILURES}\n❌ No test results found. Run: go test ./..."
fi

# Check 2: Lint — file-type-aware staleness
# Only blocks if .go files changed between lint commit and HEAD.
# If only docs/config changed, lint results are still valid.
LINT_FILE="$DEVELOP_DIR/lint-results.json"
if [ -f "$LINT_FILE" ]; then
  LINT_COMMIT=$(python3 -c "import json; print(json.load(open('$LINT_FILE')).get('commit','')[:7])")
  if [ "$LINT_COMMIT" != "$HEAD" ]; then
    GO_CHANGES=$(git -C "$PROJECT_ROOT" diff --name-only "$LINT_COMMIT".."$HEAD" -- '*.go' | head -1)
    if [ -n "$GO_CHANGES" ]; then
      FAILURES="${FAILURES}\n❌ Lint stale — Go files changed since $LINT_COMMIT."
    fi
    # If only non-Go files changed, lint is still valid — no failure
  fi
fi

# Check 3: Semgrep (WARNING only, doesn't block)
# Check 4: Uncommitted changes
# Check 5: Migration number conflicts against origin/production

# Block or allow
if [ -n "$FAILURES" ]; then
  echo -e "🚨 PRE-PUSH BLOCKED:\n$FAILURES" >&2
  [ -n "$WARNINGS" ] && echo -e "\nAlso:\n$WARNINGS"
  exit 2
fi
[ -n "$WARNINGS" ] && echo -e "⚠️ Warnings (push allowed):\n$WARNINGS"
exit 0
```

Fires after every Edit/Write. Three responsibilities:
1. Fabrication guard — prevents Claude from cheating:
```sh
BASENAME=$(basename "$FILE")

# Block gate-file fabrication
if [[ "$BASENAME" == gate-*.approved ]]; then
  VIOLATIONS="STOP: You wrote a gate approval file. Only the USER can approve gates."
fi

# Block artifact fabrication
if [[ "$BASENAME" == *-results.json ]] && [[ "$FILE" == *develop/* ]] && [[ "$TOOL" == "Write" ]]; then
  VIOLATIONS="STOP: You wrote an artifact file directly. Run the actual command instead."
fi
```

2. Modular convention checks — sources scripts from a `checks/` directory:
```sh
CHECKS_DIR="$HOME/.claude/hooks/checks"

# Only scan CHANGED lines (via git diff), not the whole file
CHANGED=$(git diff -U0 -- "$FILE" | grep '^+[^+]' | sed 's/^+//')

# Dispatch by file type
if [[ "$FILE" == *.go ]] && [[ "$FILE" != *_test.go ]]; then
  [ -f "$CHECKS_DIR/go-resolvers.sh" ] && source "$CHECKS_DIR/go-resolvers.sh"
  [ -f "$CHECKS_DIR/go-layers.sh" ] && source "$CHECKS_DIR/go-layers.sh"
fi
if [[ "$FILE" == *.tsx ]] || [[ "$FILE" == *.ts ]]; then
  [ -f "$CHECKS_DIR/ts-components.sh" ] && source "$CHECKS_DIR/ts-components.sh"
fi
```

Each check script uses the exported `$CHANGED` and `$FILE`, and appends to `$VIOLATIONS`. This makes it easy to add project-specific checks without modifying the main hook.
Important: convention checks only scan added/changed lines (via `git diff -U0`), not the whole file. This avoids flagging pre-existing issues in untouched code. Files in `generated/`, `ent/`, `vendor/`, `node_modules/`, and `*_test.go` are skipped entirely.
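The changed-lines extraction is easy to get subtly wrong — the `+++` file header also starts with `+`. A sketch of the filtering step, separated from the git call so the parsing is testable on its own (function names are mine):

```python
import subprocess

def added_from_diff(diff_text: str) -> list[str]:
    """Keep only added lines from unified diff output, dropping the +++ header."""
    return [
        line[1:]
        for line in diff_text.splitlines()
        if line.startswith("+") and not line.startswith("+++")
    ]

def added_lines(path: str) -> list[str]:
    """Added lines for one file in the working tree (git diff -U0)."""
    diff = subprocess.run(
        ["git", "diff", "-U0", "--", path],
        capture_output=True, text=True,
    ).stdout
    return added_from_diff(diff)
```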
3. Phase recomputation on spec edits (Tier 2+):
```sh
# When spec.md or test-shape.yaml is edited, gates reset
if [[ "$BASENAME" == "spec.md" || "$BASENAME" == "test-shape.yaml" ]]; then
  # Re-run phase machine to detect gate reset
  python3 /path/to/compute-phase.py "$ENV_ROOT" 2>/dev/null
  # Auto-regenerate verification skeleton
  python3 /path/to/generate-verification-skeleton.py "$ENV_ROOT" 2>/dev/null
  VIOLATIONS="${VIOLATIONS}\n⚠️ ${BASENAME} modified — gates reset."
fi
```

This means editing the spec after approval automatically resets gates — intentionally forcing re-verification of everything.
What you get: Claude works through a structured pipeline that goes from broad to specific: plan (what and why) → spec (what "done" means) → test shape (how to verify each claim) → failing tests → implementation. Every message gets injected with the current phase and instruction. Claude follows the pipeline instead of freelancing.
What changes for you: Features now produce a develop/<branch>/ directory with plan.md, spec.md, and test-shape.yaml. Claude writes failing tests before implementation code. You approve at gates.
New components (in addition to Tier 1):
| Hook/Script | Purpose |
|---|---|
| `user-prompt-submit.sh` | UserPromptSubmit hook — runs phase machine, injects instruction + behavioral directives |
| `capture-gate-results.sh` | SubagentStop hook — captures sub-agent review findings |
| `compute-phase.py` | The state machine — reads artifacts, determines current phase |
| `phase_utils.py` | Helpers for file checks, JSON parsing, gate validation |
| `spec_validator.py` | Validates spec.md structure (Must Do / Must NOT claims) |

`compute-phase.py` takes a project root and checks phases top-down. First incomplete phase wins:
```python
def compute_phase(project_root):
    branch = get_current_branch(project_root)
    if not branch or branch == "production":
        return Phase("no-branch", "Create a feature branch: git checkout -b feature/<TICKET>")

    develop_dir = f"{project_root}/develop/{branch}"
    # Auto-create develop dir if missing; also walks develop/ tree as fallback
    os.makedirs(develop_dir, exist_ok=True)
    head = get_head_commit(project_root)

    # Check for phase override (/set-phase) — expires after 24h
    override = read_json(f"{develop_dir}/phase-override.json")
    if override and not expired(override, hours=24):
        return Phase(override["phase"], f"Override active. Expires 24h after {override['set_at']}")
    elif override:
        os.remove(f"{develop_dir}/phase-override.json")  # auto-cleanup

    # Phase 1: Plan
    if not exists(f"{develop_dir}/plan.md"):
        return Phase("plan", "Discuss scope and user concerns. Write plan.md.")
    # Hard gate: plan must have User Concerns section with content
    if not has_user_concerns(f"{develop_dir}/plan.md"):
        return Phase("plan", "plan.md missing User Concerns. Ask: 'What are you worried about?'")

    # Phase 2: Spec
    if not exists(f"{develop_dir}/spec.md"):
        return Phase("spec", "Define Must Do / Must NOT claims. Write spec.md.")

    # Phase 3: Survey
    if not exists(f"{develop_dir}/test-shape.yaml"):
        return Phase("survey", "Map claims to test cases. Write test-shape.yaml.")

    # Phase 4: Shape Review (GATE — requires user approval)
    if not exists(f"{develop_dir}/gate-shape.approved"):
        return Phase("shape-review", "BLOCKED — present spec + test shape for /approve-shape")

    # Phase 5: Tests
    test_results = read_json(f"{develop_dir}/test-results.json")
    if not test_results:
        return Phase("tests", "Write failing tests per test-shape.yaml. Run them.")

    # Phase 6: Implement — commit-pinned staleness check
    if test_results.get("commit", "")[:7] != head[:7]:
        return Phase("tests", "Test results are stale (HEAD changed). Re-run tests.")
    if not test_results.get("all_green"):
        return Phase("implement", f"Make failing tests pass. {test_results['tests_fail']} failures remain.")

    # Phase 7: Diff Review
    gate2 = read_json(f"{develop_dir}/gate2-results.json")
    if not gate2 or gate2.get("total_must_fix", 0) > 0:
        return Phase("diff-review", "Run adversarial review agents on the diff.")

    # Phase 8-11: verify → report-review → ship → handoff
    # ... same pattern: check artifact, check gate file, advance
```

Key behaviors not obvious from the pseudocode:
- User concerns as a hard gate: The plan phase blocks if `## User Concerns` is empty. This forces the "what are you worried about?" conversation to happen every time.
- Late artifact detection: If early artifacts (plan/spec) are missing but later ones exist (test results, gate files), the machine detects this mismatch and suggests `/set-phase` options. This handles migrating existing work into the ceremony.
- Git worktree support: The code handles `.git` being a file (worktree pointer) rather than a directory.
- Suggested commands: Each phase includes a `suggested_command` field with the exact shell command to run next.
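The 24-hour override expiry from the pseudocode, written out. The `set_at` field name comes from the snippet above; the ISO-timestamp assumption is mine:

```python
from datetime import datetime, timedelta, timezone

def override_active(override: dict, hours: int = 24) -> bool:
    """True while a /set-phase override is younger than its expiry window."""
    set_at = datetime.fromisoformat(override["set_at"])
    return datetime.now(timezone.utc) - set_at < timedelta(hours=hours)
```

The expiry matters: without it, a forgotten override would silently disable the phase machine forever.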
The phase machine outputs rich JSON, not just phase + instruction:
```json
{
  "phase": "implement",
  "stage": 6,
  "total_stages": 12,
  "progress": "6/12 (feature/ST-1234)",
  "instruction": "Make failing tests pass. 2 failures remain.",
  "suggested_command": "go test ./pkg/scoring/... -count=1 -v",
  "gates": {
    "shape": "approved",
    "report": "pending",
    "handoff": "pending"
  },
  "user_concerns": [
    "Performance with 500+ deals",
    "Migration backward compatibility"
  ],
  "custom_probes": [
    "Verify page load stays under 2s with 500 deals",
    "Verify migration rollback works"
  ],
  "user_visible_expectations": [
    "Score badge visible on deal cards",
    "Tooltip appears on hover"
  ],
  "stages": {
    "tests": {"status": "done", "commit": "abc1234"},
    "implement": {"status": "in_progress", "warning": "2 tests still failing"},
    "diff-review": {"status": "pending"}
  }
}
```

The UserPromptSubmit hook parses this and injects the relevant parts. User concerns and custom probes are carried through the entire pipeline — they originated in the plan phase and surface during diff review as custom agent mandates.
Beyond the phase instruction, the hook injects standing behavioral directives on every turn:
```sh
# Phase state
echo "Phase: $PHASE [$PROGRESS] | Gates: shape:$SHAPE report:$REPORT handoff:$HANDOFF"
echo "Instruction: $INSTRUCTION"

# Standing directives (always injected)
echo "PROCESS: Follow the phase instruction. If you skip ANY step, tell the user explicitly."
echo "GATE SAFETY: After shape gate is approved, do NOT edit spec.md or test-shape.yaml unless the user explicitly asks."
echo "BUG FIX PROTOCOL: If fixing a bug on current feature, ask: amend existing PR or new feature?"

# User concerns and probes (from plan.md, carried through pipeline)
echo "User Concerns: $CONCERNS"
echo "Custom Adversarial Probes: $PROBES"

# Checklist enforcement at ship/handoff phases
if [ "$PHASE" = "ship" ] || [ "$PHASE" = "handoff-review" ]; then
  if [ ! -f "$DEVELOP_DIR/checklist-ran.json" ] || is_stale "$DEVELOP_DIR/checklist-ran.json"; then
    echo "⛔ CHECKLIST NOT RUN. Run /checklist before proceeding."
  fi
fi
```

The gate safety directive prevents Claude from accidentally resetting gates by editing spec.md after approval. The bug fix protocol handles the common case where a user wants to fix something on the current feature — it asks whether to amend (resets gates intentionally) or start a new branch.
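A plausible shape for the `is_stale` helper used in the checklist check. The exact semantics are yours to define — the age window and commit pinning here are assumptions:

```python
import json
import os
import time

def is_stale(path: str, max_age_s: int = 3600, head: str = "") -> bool:
    """Stale if the artifact is missing, too old, or pinned to a different commit."""
    if not os.path.exists(path):
        return True
    if time.time() - os.path.getmtime(path) > max_age_s:
        return True
    if head:
        with open(path) as f:
            return json.load(f).get("commit") != head
    return False
```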
Three phases require your approval. Claude cannot approve its own work — the semgrep-check.sh hook blocks it from writing gate-*.approved files.
You approve via slash commands (skills):
- `/approve-shape` — after reviewing spec + test plan + Gate 1 agent findings
- `/approve-report` — after reviewing verification report with screenshots
- `/approve-handoff` — final sign-off before merge
Each approval skill writes a signed gate file:
```
approved 2026-04-06T15:30:00Z abc1234 sig=HASH
```
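One way such a line could be produced and checked — note the README only shows the line format, so this keyed-hash signature scheme is entirely an assumption:

```python
import hashlib
from datetime import datetime, timezone

def gate_line(commit: str, secret: str) -> str:
    """Produce one signed gate entry: timestamp + commit + keyed hash (sketch)."""
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    sig = hashlib.sha256(f"{ts} {commit} {secret}".encode()).hexdigest()[:12]
    return f"approved {ts} {commit} sig={sig}"

def gate_valid(line: str, secret: str) -> bool:
    """Re-derive the signature, so a forged file without the secret fails."""
    try:
        _, ts, commit, sig_field = line.split()
    except ValueError:
        return False
    expected = hashlib.sha256(f"{ts} {commit} {secret}".encode()).hexdigest()[:12]
    return sig_field == f"sig={expected}"
```

A secret kept outside the repo (and outside Claude's reach) means the fabrication guard has a cryptographic backstop, not just a filename check.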
When a sub-agent completes, this hook classifies its findings by scanning the output text for keywords:
```python
# Classification by keyword matching on agent output
if 'style guide' in msg.lower() or 'file size' in msg.lower():
    gate = 2; agent_name = 'style_guide'
elif 'contract' in msg.lower() or 'breaking change' in msg.lower():
    gate = 2; agent_name = 'contracts'
elif 'regression' in msg.lower() or 'untested' in msg.lower():
    gate = 2; agent_name = 'regression'
# ... etc

# Count severities by regex on the raw message
must_fix = len(re.findall(r'must.?fix|HIGH|CRITICAL', msg, re.IGNORECASE))
should_fix = len(re.findall(r'should.?fix|MEDIUM', msg, re.IGNORECASE))
```

This means agent output format matters: agents must use severity keywords (must-fix, should-fix, HIGH, CRITICAL, MEDIUM) that the regex can parse. The structured table format isn't cosmetic — it's machine-read.
Multi-round tracking: Results are stored in a rounds array. If the same agent name appears twice in the current round, a new round is automatically created. This drives the targeted re-review — only agents with must-fix findings from the previous round are re-run.
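A sketch of that round bookkeeping — the data shapes are illustrative, not this repo's actual schema:

```python
def record_finding(rounds: list[dict], agent: str, finding: dict) -> None:
    """Append a finding; if this agent already reported this round, open a new round."""
    if not rounds or agent in rounds[-1]:
        rounds.append({})
    rounds[-1][agent] = finding

def agents_to_rerun(rounds: list[dict]) -> list[str]:
    """Targeted re-review: only agents with must-fix findings from the last round."""
    if not rounds:
        return []
    return [a for a, f in rounds[-1].items() if f.get("must_fix", 0) > 0]
```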
Add these behavioral rules to your CLAUDE.md:
```markdown
## Development Ceremony

- The phase injection in your prompt IS the source of truth for what to do next.
  Follow it. Do NOT read STATE.md to determine phase — it may be stale.
- NEVER write gate-*.approved files. Only the user runs /approve-shape,
  /approve-report, /approve-handoff.
- NEVER write *-results.json files directly. Run the actual commands (go test,
  vitest, semgrep) — the capture hook writes artifacts automatically.
- If the phase says "BLOCKED", present the artifact to the user and STOP.
  Do not continue with other work — halt and wait for the user's decision.
- When spawning sub-agents you need results from, run them in FOREGROUND
  (the default) and wait for completion before responding. If you reply before
  agents finish, the stop hook fires prematurely.
```

What you get: The complete TDD pipeline driven by `/develop`. Type it once and the machine drives: plan → spec → tests → implement → adversarial review → verify → ship. You approve at gates and that's it. When `/develop` reaches Done, it automatically invokes `/ship` — truly end-to-end.
What changes for you: Features are fully managed by the pipeline. You review and approve at gates, handle Claude's questions during planning, and get a shipped PR at the end.
New components:
| Component | Purpose |
|---|---|
| `/develop` skill | The main workflow — 12 stages with full instructions per phase |
| `/ship` skill | Autonomous build → test → lint → commit → push → PR → CI poll |
| `/checklist` skill | Shows receipt for current gate (what's verified, what's missing) |
| `/approve-*` skills | Write signed gate files after your review |
| `agents.md` | Sub-agent prompt templates for Gate 1 (spec review) and Gate 2 (diff review) |
| `agent-gauntlet.md` | 22 agent specifications for comprehensive pre-handoff review |
| `conventions.md` | Your project's architecture rules (referenced by agents) |
| # | Phase | What Claude does | Artifact | Gate? |
|---|---|---|---|---|
| 1 | plan | Discusses scope, asks about concerns, spawns discovery agents | `plan.md` | — |
| 2 | spec | Defines Must Do / Must NOT claims, schema change declarations | `spec.md` | — |
| 3 | survey | Reads code, traces paths, maps claims → test cases, runs contract risk analysis | `test-shape.yaml` | — |
| 4 | shape-review | 2-3 agents review spec + tests (zero context) | `agents/gate1/*.json` | You: `/approve-shape` |
| 5 | tests | Writes failing tests per test-shape.yaml, runs them | Test files + `test-results.json` | — |
| 6 | implement | Makes tests pass, one at a time, semgrep after each fix | Code changes | — |
| 7 | diff-review | 8+ agents review diff (zero context, 3 rounds) | `agents/gate2/*.json` | Auto-fix loop |
| 8 | verify | E2E feature tests (dev-browser) + regression tests (Playwright) | `e2e-results.json` + screenshots | — |
| 9 | report-review | Presents verification report mapping every claim to evidence | — | You: `/approve-report` |
| 10 | ship | build → lint → test → commit → push → PR → CI poll | PR created | — |
| 11 | handoff-review | Final checklist + approval | — | You: `/approve-handoff` |
Plan phase is interactive — Claude asks questions, not just writes a doc. It explicitly asks:
- "What are you most worried about with this change?"
- "Are there specific things you want adversarial agents to check?"
- "Any areas of the codebase that are fragile?"
Answers become ## User Concerns and ## Custom Adversarial Probes in plan.md. These are tracked through the entire pipeline and injected into every phase instruction. The phase machine blocks if concerns are empty.
Claude also spawns discovery agents: an "analogical reasoning agent" that finds the most structurally similar existing feature, and a "reuse agent" that searches for existing code to leverage.
Spec phase includes a schema changes declaration in structured YAML:

```yaml
graphql:
  new_types: [ScoreBadge, ScoreBreakdown]
  new_mutations: [recalculateScore]
  modified_files: [schema.graphqls, scoring.graphqls]
migrations:
  new: ["20260406_add_score_column.sql"]
  tables_modified: [deals]
codegen:
  required: true
  widget_codegen: true
```

The verification report later checks each declared item exists in the generated code and flags undeclared schema changes.
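That later check can be sketched as a small function that takes the parsed declaration and the generated source and reports declared names that never appear. The function name and the flat string search are illustrative assumptions, not the real implementation:

```python
def find_missing_declarations(declaration: dict, generated_source: str) -> list[str]:
    """Return declared GraphQL names that do not appear in the generated code."""
    graphql = declaration.get("graphql", {})
    names = graphql.get("new_types", []) + graphql.get("new_mutations", [])
    return [name for name in names if name not in generated_source]

# Example: ScoreBreakdown was declared but never generated.
decl = {"graphql": {"new_types": ["ScoreBadge", "ScoreBreakdown"],
                    "new_mutations": ["recalculateScore"]}}
src = "type ScoreBadge { value: Float }\nmutation recalculateScore { ... }"
print(find_missing_declarations(decl, src))  # → ['ScoreBreakdown']
```

The inverse check (names present in generated code but absent from the declaration) is what flags undeclared schema changes.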
After spec is written, a verification skeleton PDF is auto-generated showing every claim with PENDING status and evidence slots. The user reviews this before approving — it previews what the final verification report will look like.
Survey phase includes contract risk analysis: for every modified function/interface, callers are grepped across all repos and a risk table is produced. High and medium risks must each map to a test. GraphQL value changes are flagged as "silent breakers" — codegen catches schema shape changes but not value changes.
Verify phase has two distinct steps:
- Feature validation via dev-browser — navigate the actual UI, exercise each test from the test shape, capture screenshots named by claim ID (`claim-do-1-description.png`, `claim-not-3-no-raw-floats.png`). The naming convention matters because the verification report matches by the `claim-{do|not}-{N}` prefix.
- Regression testing via Playwright QA suite — maps changed files to test suites and runs the relevant ones.
Gate 1 agents (3 agents, after spec/test-shape — review the PLAN, not the code):
| Agent | Receives | Checks |
|---|---|---|
| Spec Reviewer | Acceptance criteria, test plan, architecture file list from plan.md, repo root | Test plan gaps, untested callers, missing edge cases |
| Architecture Reviewer | Architecture file list, acceptance criteria, repo root + widget path | Data duplication, wrong layers, SSOT violations, cross-boundary coupling |
| Reuse Reviewer | Architecture file list, repo root | Missed existing hooks, components, API endpoints |
Gate 1 agents receive plan.md sections, NOT spec.md or a diff. They review the plan before any code is written.
Gate 2 agents (8+ agents, after implementation — review the CODE):
| Agent | What it checks | Why it catches what Claude misses |
|---|---|---|
| Style & Conventions | File size, naming, layer violations | Claude wrote the code — it thinks the structure is fine |
| Contracts & Interfaces | Breaking changes, backward compatibility | Claude focused on making tests pass, not on callers |
| Regression | Untested callers of changed functions | Claude tested what it changed, not what calls it |
| Security | Auth checks, injection, data exposure | Claude builds features, not attacks |
| Performance | N+1 queries, unbounded results | Claude optimizes for correctness, not performance |
| Scope | Unrelated changes, drive-by refactors | Claude "improves" code it wasn't asked to touch |
| Logging | Log format, error context, PII | Claude doesn't think about ops |
| Reuse & Patterns | Observability, pattern deviations | Claude writes new code instead of finding existing patterns |
Critical design detail: Gate 2 agents receive the repo path and base branch, NOT a pasted diff. They have tool access (Bash, Read, Grep, Glob) and actively explore the codebase — grepping for callers, reading style guides, checking file sizes. This is "zero context about the conversation" but "full access to the codebase."
Every agent prompt includes: "Do NOT write any code. Research only." — this keeps agents in adversarial reviewer mode rather than "fixing" things and losing their critical posture.
Custom probes from the plan phase become additional agents. Each probe is spawned as a Tier 2 agent with only the diff, the probe text, and the repo root. Output schema: {verdict: pass|fail, evidence, score: 0-5}.
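A probe agent's output under that schema might look like this (the values are illustrative):

```json
{
  "verdict": "fail",
  "evidence": "pkg/scoring/score.go:142 renders a raw float without rounding",
  "score": 2
}
```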
Agents are organized in three tiers, and the ordering is load-bearing:
| Tier | Agents | Scope | Why this order |
|---|---|---|---|
| Tier 1: Property | Style, Contracts, Regression, Security, Performance, Scope, Logging, Patterns | Individual files in the diff | Catches most issues cheaply |
| Tier 2: Cross-cutting | Widget→API data flow, API→DB data flow, Zero-value/nil safety, Error propagation, Temporal/async safety, State consistency | Data flow across boundaries | Catches bugs that single-file agents structurally cannot see |
| Tier 3: Meta | Fresh-eyes summary, Gap detection, Concern validator, Test adequacy, Cold diff review | The review itself | Reviews the review — finds what all other agents missed |
Tier 2 agents trace data across layers (frontend → API → DB → cache) and catch type mismatches, nil propagation, and consistency gaps. Tier 3's "Gap Detection" compares against analogous features — this only makes sense after Tier 1+2 have catalogued what's present.
The Cold Diff Review agent (Tier 3) is deliberately tool-less — it receives the full diff pre-loaded in its prompt and has NO file-reading tools. This forces pure diff-based reasoning, mimicking what a human reviewer actually sees on a PR.
| Round | Which agents | Purpose |
|---|---|---|
| Round 1 | All agents (8-22 depending on scope) | Comprehensive first pass |
| Round 2 | Only agents that found must-fix items | Verify fixes didn't introduce new issues |
| Round 3 | Always Contracts + Regression + Security | Fixed safety net — these 3 catch real production bugs |
Round 3 always runs even if Round 2 was clean. Exit criteria: Round 3 produces zero must-fix findings.
Weighted scoring: Each agent scores 0-5. Tier weights: Tier 1 = 1x, Tier 2 = 2x, Tier 3 = 1.5x. Cross-cutting agents are weighted double because their findings are harder to detect and more impactful in production.
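Under those weights, the aggregation can be sketched as a weighted average back onto the 0-5 scale (the function name and findings format are assumptions):

```python
TIER_WEIGHTS = {1: 1.0, 2: 2.0, 3: 1.5}  # Tier 1 = 1x, Tier 2 = 2x, Tier 3 = 1.5x

def weighted_score(findings: list[dict]) -> float:
    """Aggregate per-agent scores (0-5) into a tier-weighted average."""
    total = sum(TIER_WEIGHTS[f["tier"]] * f["score"] for f in findings)
    weight = sum(TIER_WEIGHTS[f["tier"]] for f in findings)
    return total / weight if weight else 0.0

# Two Tier 1 agents scoring 5 and one Tier 2 agent scoring 2:
print(weighted_score([
    {"tier": 1, "score": 5},
    {"tier": 1, "score": 5},
    {"tier": 2, "score": 2},
]))  # → 3.5
```

A single low Tier 2 score drags the aggregate down twice as hard as a Tier 1 score, which is the point of the weighting.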
| Problem | How Claude cheats | Prevention |
|---|---|---|
| Skipping tests | "The code is straightforward, let me push" | Phase machine won't advance without test artifacts at HEAD |
| Fake test results | Writes test-results.json directly | semgrep-check.sh detects Write to *-results.json in develop/ and warns |
| Self-approval | Writes gate-shape.approved | semgrep-check.sh blocks writing gate-*.approved files |
| Pushing broken code | git push without running tests | pre-push-guard.sh reads artifacts and blocks with exit 2 |
| Stale results | Runs tests, makes more changes, pushes | Artifacts are commit-pinned — HEAD change = stale |
| Anchoring bias | Reviews its own code and finds nothing | Gate 2 agents have zero conversation context |
| Scope creep | "While I'm here, let me refactor this" | Scope agent specifically checks for unrelated changes |
| Skipping edge cases | Tests happy path only | Spec reviewer checks gaps; custom probes test specific risks |
| Skipping user concerns | Ignores the planning conversation | Phase machine blocks if ## User Concerns is empty |
| Editing spec after approval | Changes the contract without re-review | File-edit hook auto-resets gates when spec.md is modified |
| Continuing past a gate | Does more work when it should wait for approval | CLAUDE.md rule + phase injection says "BLOCKED — halt" |
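The commit-pinning row above can be sketched as a freshness check: the artifact records the commit it was captured at, and the guard compares it to the current HEAD. The field name is an assumption, not the real artifact format:

```python
import json

def artifacts_fresh(results_json: str, head: str) -> bool:
    """True only if the recorded commit matches the current HEAD."""
    return json.loads(results_json).get("commit") == head

# Results captured at abc123, but HEAD has since moved to def456: stale.
print(artifacts_fresh('{"commit": "abc123", "all_green": true}', "def456"))  # → False
```

In the real guard, `head` would come from `git rev-parse HEAD`, so any commit after the capture automatically invalidates the results.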
When Claude edits spec.md after the shape gate was approved:
- `semgrep-check.sh` (PostToolUse[Edit]) detects the edit
- It re-runs `compute-phase.py`, which sees spec mtime > gate-shape.approved mtime
- The gate is invalidated — phase regresses to `shape-review`
- Verification skeleton is auto-regenerated from the new spec
- Claude gets a warning: "spec.md modified — gates reset"
- On the next user message, the `UserPromptSubmit` hook injects the regressed phase
This is intentional, even for bug fixes: editing the spec adds new claims, and everything downstream must be re-verified against the new claims.
`/set-phase` writes `phase-override.json`:

```json
{"phase": "diff-review", "set_at": "2026-04-06T15:30:00Z"}
```

The phase machine checks this before the normal waterfall. After 24 hours, the override auto-expires and is deleted from disk. This handles cases where the machine is wrong (e.g., you imported existing tests that didn't go through the ceremony).
When Claude runs `go test ./pkg/foo` then later `go test ./pkg/bar`:
- First run: writes `test_names: {TestFoo: "pass"}`
- Second run: reads the existing file, merges to `test_names: {TestFoo: "pass", TestBar: "pass"}`
- Recalculates `tests_pass`, `tests_fail` from the merged set
This means incremental test runs build up a complete picture. The pre-push guard sees the accumulated results, not just the last run.
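The merge step can be sketched like this, using the field names shown above (the file layout is otherwise an assumption):

```python
import json
import os

def merge_test_results(path: str, new_names: dict[str, str]) -> dict:
    """Merge one test run into the accumulated results file and recount."""
    results = {"test_names": {}}
    if os.path.exists(path):
        with open(path) as f:
            results = json.load(f)
    results["test_names"].update(new_names)  # union of all runs so far
    statuses = results["test_names"].values()
    results["tests_pass"] = sum(s == "pass" for s in statuses)
    results["tests_fail"] = sum(s == "fail" for s in statuses)
    with open(path, "w") as f:
        json.dump(results, f)
    return results

# First run records TestFoo; the second adds TestBar without losing TestFoo.
merge_test_results("test-results.json", {"TestFoo": "pass"})
merged = merge_test_results("test-results.json", {"TestBar": "pass"})
print(merged["test_names"])  # → {'TestFoo': 'pass', 'TestBar': 'pass'}
```

Re-running a test simply overwrites its entry, so a later failure of `TestFoo` flips the counts rather than duplicating the test.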
The reference implementation uses Go + Next.js. Here's how to adapt each piece.
| Your stack | Pattern to match | Output to parse |
|---|---|---|
| Python + pytest | `'pytest' in cmd` | `X passed, Y failed, Z error` |
| Rust + cargo | `'cargo test' in cmd` | `test result: ok. N passed; M failed` |
| Java + Maven | `'mvn test' in cmd` | Parse surefire XML reports |
| Ruby + rspec | `'rspec' in cmd` | `X examples, Y failures` |
| JavaScript + jest | `'jest' in cmd` | `Tests: X passed, Y failed` (strip ANSI first!) |
Always strip ANSI color codes before parsing: `re.sub(r'\x1b\[[0-9;]*m', '', resp)`. Most test runners emit colored output that breaks regex matching.
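For example, a jest-summary parser following that advice. The summary regex here is a sketch covering the common `Tests: N failed, M passed` shape, not every jest output variant:

```python
import re

ANSI = re.compile(r'\x1b\[[0-9;]*m')

def parse_jest_summary(raw: str) -> dict:
    """Strip ANSI color codes, then pull pass/fail counts from the summary line."""
    clean = ANSI.sub('', raw)
    m = re.search(r'Tests:\s+(?:(\d+) failed, )?(\d+) passed', clean)
    if not m:
        return {"parsed": False}
    return {"parsed": True,
            "failed": int(m.group(1) or 0),
            "passed": int(m.group(2))}

# Colored output that would break a naive regex:
raw = "Tests:       \x1b[31m1 failed\x1b[0m, \x1b[32m7 passed\x1b[0m, 8 total"
print(parse_jest_summary(raw))  # → {'parsed': True, 'failed': 1, 'passed': 7}
```

Returning `{"parsed": False}` instead of guessing lets the capture script warn rather than record bogus counts.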
| Your check | How to implement |
|---|---|
| Tests pass | Read your test-results JSON, check all_green |
| Lint clean | Read lint-results JSON, check clean |
| Type-check | Add tsc --noEmit capture to capture-results, check in guard |
| Formatting | Add prettier --check / black --check capture |
| Migration conflicts | Compare your migration numbering against base branch |
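A minimal guard covering the first two rows, assuming the JSON field names listed above:

```python
import json

def pre_push_guard(test_results_path: str, lint_results_path: str) -> int:
    """Return a hook exit code: 0 allows the push, 2 blocks it."""
    try:
        with open(test_results_path) as f:
            tests = json.load(f)
        with open(lint_results_path) as f:
            lint = json.load(f)
    except FileNotFoundError as err:
        print(f"BLOCKED: missing artifact {err.filename}; run the capture step first")
        return 2
    if not tests.get("all_green"):
        print("BLOCKED: tests are not green at HEAD")
        return 2
    if not lint.get("clean"):
        print("BLOCKED: lint findings present")
        return 2
    return 0
```

In a real pre-push hook you would `sys.exit()` with the returned code so git sees the non-zero status; note that a missing artifact blocks just like a failing one.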
| Go agent | Python equivalent | Rust equivalent |
|---|---|---|
| Go Idioms | Python Idioms (type hints, exception handling, async patterns) | Rust Safety (unsafe blocks, lifetime issues, error handling) |
| Style & Conventions | PEP 8, import ordering, docstrings | Clippy lints, naming conventions |
| Performance | N+1 ORM queries, missing pagination | Allocation patterns, clone vs borrow |
Can I use this with any language?
Yes. The phase machine is language-agnostic. capture-results.sh pattern-matches test output — add patterns for your test runner. The agent gauntlet specs are customizable per stack.
How long does each phase take? Plan: 5-10 min (interactive). Spec: 2-5 min. Survey: 3-5 min. Tests: 5-15 min. Implement: varies. Diff review: ~2 min (agents run in parallel). Ship: ~3 min. Total for a medium feature: 30-60 min with minimal human intervention.
Is this overkill for small fixes? For a one-line fix, yes. The ceremony is designed for features that take 30+ minutes. For small fixes, the guard rails (Tier 1) still help — you get artifact capture and push protection without the full pipeline.
What if Claude gets stuck in a phase?
Use /set-phase to override. It writes a phase-override.json that expires after 24 hours, then auto-deletes.
Do I need all 22 agents? No. Start with the 8 Tier 1 property agents. Add Tier 2 cross-cutting agents when you want data-flow analysis. Add Tier 3 meta agents when you want review-of-the-review.
What about CI — doesn't this duplicate CI checks? The pre-push guard runs locally (~50ms, reads JSON files). CI runs the full suite remotely. The guard prevents wasting CI time on obviously broken code. They're complementary.
What if I abort mid-pipeline?
Delete develop/<branch>/ or just switch branches. The ceremony state is entirely file-based — no external database or service needed.
How do I debug a hook that's not firing?
Add echo "$(date) fired" >> /tmp/hook-debug.log at the top. If the log doesn't grow, check that settings.json has the right matcher and command path. Test manually: cat sample-input.json | bash your-hook.sh.
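If the matcher is the problem, a minimal PostToolUse registration in settings.json looks roughly like this (the hook script path is illustrative; check the Claude Code hooks docs for the exact schema your version expects):

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          {"type": "command", "command": "bash .claude/hooks/semgrep-check.sh"}
        ]
      }
    ]
  }
}
```

A matcher typo (e.g. `edit` instead of `Edit`) silently matches nothing, which is the most common reason a hook never fires.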