develop-ceremony

A structured TDD pipeline for Claude Code with adversarial sub-agent review gates.

This is a reference implementation, not a library. Read it, understand the patterns, then build your own version tuned to your stack and workflow. The code examples are from a real production setup (Go + Next.js monorepo, 16 parallel environments) — adapt them to your situation.


The Problem

Claude Code is powerful but takes shortcuts:

  • Skips testing ("the code looks correct, let me push")
  • Skips review ("all tests pass, ship it")
  • Jumps from plan to implementation without defining acceptance criteria
  • Approves its own work (writes tests that test what it wrote, not what was specified)
  • Anchors on its own decisions (reviews code it wrote and finds no issues)

Why not just add rules to CLAUDE.md? CLAUDE.md rules are suggestions; Claude follows them ~80% of the time. Hooks are enforcement; Claude cannot bypass a non-zero exit code. The ceremony uses hooks for enforcement and CLAUDE.md for guidance — together they're airtight.

Why a structured pipeline?

LLMs perform dramatically better when they work through a problem in stages — from broad to specific — rather than jumping straight to code. When Claude goes directly from "build feature X" to writing code, it:

  • Makes assumptions that were never validated
  • Writes tests after the code (testing what it wrote, not what was specified)
  • Misses edge cases that would have been obvious during planning
  • Produces code that works but doesn't match what the user actually wanted

A structured pipeline forces Claude to think before coding: first understand the problem (plan), then define what "done" means (spec), then map how to verify each claim (test shape), then write failing tests (TDD), and only then write implementation code. Each stage produces an artifact that constrains the next stage. By the time Claude writes code, it has:

  • A plan validated with the user's concerns
  • A spec with explicit Must Do / Must NOT claims
  • A test shape mapping every claim to concrete test cases
  • Failing tests that define exactly what success looks like

This is test-driven development in the classical sense — define the acceptance criteria and write failing tests before writing a single line of implementation. The pipeline enforces this ordering because Claude will skip it if you let it.

The Solution

Two mechanisms enforce the pipeline:

1. A state machine that enforces step order

plan → spec → survey → shape-review(GATE) → tests → implement
→ diff-review(GATE) → verify → report-review(GATE) → ship → handoff(GATE)

A hook runs on every user message and injects the current phase as an instruction. Claude can't skip steps because each phase requires artifacts from the previous one — files on disk that prove a step was completed. No artifact = no advancement.

The pipeline progressively narrows scope: plan is broad ("what are we building and why?"), spec is precise ("what must be true when we're done?"), test shape is mechanical ("which test proves each claim?"), and implementation is focused ("make this specific failing test pass"). Each stage constrains the next, so by the time Claude writes code, the degrees of freedom are small and well-defined.

2. Adversarial sub-agents with zero context

At review gates, 8+ sub-agents review Claude's work in parallel. Each has a narrow mandate (security, performance, regression, contracts, etc.) and — critically — zero conversation context. They weren't "in the room" when decisions were made, so they question everything. This eliminates the anchoring bias that makes self-review useless.


Getting Started — Choose Your Tier

This is a choose-your-own-adventure. Each tier is self-contained and adds more enforcement. Read the tier descriptions below and pick where to start. You can always upgrade later.

┌─────────────────────────────────────────────────────────────────────────┐
│                        Which tier do you need?                         │
│                                                                        │
│  Claude sometimes pushes untested code?                                │
│  ──→ Tier 1: Guard Rails                                               │
│      Hooks block bad pushes. Zero workflow change.                     │
│                                                                        │
│  Claude tests, but builds the wrong thing?                             │
│  ──→ Tier 2: Phase Awareness                                           │
│      State machine forces plan → spec → test → implement order.        │
│      You approve at gates.                                             │
│                                                                        │
│  You want hands-off feature development?                               │
│  ──→ Tier 3: Full Ceremony                                             │
│      Type /develop once. Get a shipped PR. You approve at gates.       │
│      8+ adversarial agents review every diff.                          │
└─────────────────────────────────────────────────────────────────────────┘

How to implement your chosen tier

Once you've picked a tier, give this README to Claude and ask it to build your implementation. Here's the prompt:

I want to implement Tier [1/2/3] of the develop-ceremony system 
described in this README. My stack is [Go/Python/Rust/TypeScript/etc], 
my test runner is [go test/pytest/cargo test/jest/etc], and my base 
branch is [main/production/master].

Read the README for the full architecture, then build my hooks and 
scripts incrementally:
1. Start with one hook at a time
2. Define a test scenario first (e.g., "I run go test, then git push 
   — push should be blocked if tests failed")
3. Implement the hook
4. Have me verify it works by actually trying the scenario
5. Iterate to the next hook

A note on "TDD" — there are two layers here. The ceremony uses TDD for building features (plan → spec → failing tests → implementation). But when you're building the ceremony itself (the hooks, scripts, and skills), you should also work incrementally with defined scenarios. Don't try to build all hooks at once — build one, test it manually, then build the next. The ceremony is infrastructure that's hard to debug (hooks run in subprocesses with JSON on stdin), so incremental verification matters.

Claude will need to make decisions along the way — this is intentional. The README describes the patterns and trade-offs, and Claude adapts them to your specific setup:

Decisions Claude will help you make:

  • Which test runners to capture — depends on your stack
  • What to block vs warn on push — tests and lint should block; semgrep and style may be warnings
  • Which convention checks to add — depends on your coding standards
  • How to detect your project root — git root? monorepo subdirectory? workspace?
  • Which agents to include (Tier 3) — start with 8 core, add stack-specific ones
  • Which phases to include (Tier 3) — you may not need all 12

Sharing with your team

Each person builds their own version. This repo is a reference, not a dependency. Fork it, read the patterns, build what fits your workflow. Different team members may use different tiers — that's fine.


Architecture

┌─────────────────────────────────────────────────────────────┐
│                    USER PROMPT                              │
│                        ↓                                    │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  UserPromptSubmit Hook                               │   │
│  │  → compute-phase.py (reads artifacts, detects phase) │   │
│  │  → injects phase instruction into conversation       │   │
│  └──────────────────────────────────────────────────────┘   │
│                        ↓                                    │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  Claude follows phase instruction                    │   │
│  │  (write plan, write tests, spawn review agents, etc.)│   │
│  └──────────────────────────────────────────────────────┘   │
│                        ↓                                    │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  PostToolUse Hooks                                   │   │
│  │  → capture-results.sh (go test → test-results.json)  │   │
│  │  → semgrep-check.sh (Edit/Write → guards + checks)   │   │
│  └──────────────────────────────────────────────────────┘   │
│                        ↓                                    │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  PreToolUse Hook                                     │   │
│  │  → pre-push-guard.sh (blocks push if checks fail)    │   │
│  └──────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘

Directory convention

All artifacts live in develop/<branch>/ relative to the project root:

<project-root>/
├── develop/
│   └── feature/my-feature/
│       ├── plan.md                 # Phase 1 output
│       ├── spec.md                 # Phase 2 output
│       ├── test-shape.yaml         # Phase 3 output
│       ├── gate-shape.approved     # Written by /approve-shape (never by Claude)
│       ├── gate-report.approved    # Written by /approve-report
│       ├── test-results-*.json     # Auto-captured by hooks
│       ├── semgrep-results.json    # Auto-captured
│       ├── lint-results.json       # Auto-captured
│       ├── phase-override.json     # Written by /set-phase (24h expiry, auto-deleted)
│       └── agents/
│           ├── gate1/              # Shape review findings
│           └── gate2/
│               ├── round-1/        # First diff review (all agents)
│               ├── round-2/        # Re-review (only agents that found must-fix)
│               └── round-3/        # Final safety net (Contracts + Regression + Security only)
└── screenshots/
    ├── claim-do-1-*.png            # Evidence for verification (naming matters — see Tier 3)
    └── verification-report.pdf     # Generated report

Claude Code Hooks Primer

If you haven't built Claude Code hooks before, here's what you need to know.

Hook types and their contracts

| Hook type | When it fires | stdin (JSON) | How output works | Exit code |
|---|---|---|---|---|
| PreToolUse | Before Claude runs a tool | `{tool_name, tool_input, cwd}` | stdout/stderr shown to Claude as an error | exit 0 = allow, exit 2 = block the tool |
| PostToolUse | After Claude runs a tool | `{tool_name, tool_input, tool_response, cwd}` | Must use hookSpecificOutput JSON (see below) | Doesn't block (tool already ran) |
| UserPromptSubmit | Before every user message is processed | `{session_id, user_prompt, cwd}` | stdout is injected as a system-reminder | Ignored |
| SubagentStop | When a sub-agent completes | `{last_assistant_message, agent_type, cwd}` | Same as PostToolUse | Ignored |

Parsing stdin

Every hook receives JSON on stdin. Parse it with Python (more reliable than jq for nested fields):

INPUT=$(cat)
COMMAND=$(echo "$INPUT" | python3 -c "import json,sys; print(json.load(sys.stdin).get('tool_input',{}).get('command',''))")
CWD=$(echo "$INPUT" | python3 -c "import json,sys; print(json.load(sys.stdin).get('cwd',''))")
RESPONSE=$(echo "$INPUT" | python3 -c "
import json, sys
d = json.load(sys.stdin)
resp = d.get('tool_response', '')
if isinstance(resp, list):
    resp = ' '.join(str(r.get('text','') if isinstance(r,dict) else r) for r in resp)
print(str(resp)[:10000])  # cap at 10K chars to avoid memory issues
")

hookSpecificOutput (PostToolUse warnings)

PostToolUse hooks can't block (the tool already ran), but they can inject warnings into Claude's context via structured JSON on stdout:

# Build the warning message
VIOLATIONS="You wrote a gate-approval file. Only the USER can approve gates."

# Output as hookSpecificOutput JSON
CONTEXT=$(echo "$VIOLATIONS" | python3 -c "import json,sys; print(json.dumps(sys.stdin.read().strip()))")
cat <<EOF
{
  "hookSpecificOutput": {
    "hookEventName": "PostToolUse",
    "additionalContext": ${CONTEXT}
  }
}
EOF

Claude sees the additionalContext as a system message and course-corrects.

Registering hooks in settings.json

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [{ "type": "command", "command": "/path/to/pre-push-guard.sh" }]
      }
    ],
    "PostToolUse": [
      {
        "matcher": "Bash",
        "hooks": [{ "type": "command", "command": "/path/to/capture-results.sh" }]
      },
      {
        "matcher": "Edit|Write",
        "hooks": [{ "type": "command", "command": "/path/to/semgrep-check.sh" }]
      }
    ],
    "UserPromptSubmit": [
      {
        "hooks": [{ "type": "command", "command": "/path/to/user-prompt-submit.sh" }]
      }
    ],
    "SubagentStop": [
      {
        "hooks": [{ "type": "command", "command": "/path/to/capture-gate-results.sh" }]
      }
    ]
  }
}

Debugging hooks

Hooks are hard to debug because they run in a subprocess. Techniques:

  • Log to a file: Add echo "$(date) fired with: $COMMAND" >> /tmp/hook-debug.log at the top
  • Test manually: Save sample JSON to a file, then cat sample.json | bash your-hook.sh
  • Check Claude's context: If a UserPromptSubmit hook outputs text, Claude will reference it. If it's not appearing, the hook is crashing silently.
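For the manual-test technique above, a fixture payload can be built in Python. The field names follow the stdin contracts described earlier; the command value and cwd are just examples:

```python
import json
import subprocess
import sys

# Sample PreToolUse payload, matching the stdin contract.
payload = {
    "tool_name": "Bash",
    "tool_input": {"command": "git push origin feature/my-feature"},
    "cwd": "/tmp/example-project",
}

# The same one-liner the hooks use to pull the command out of stdin:
extract = "import json,sys; print(json.load(sys.stdin).get('tool_input',{}).get('command',''))"
result = subprocess.run(
    [sys.executable, "-c", extract],
    input=json.dumps(payload), capture_output=True, text=True,
)
print(result.stdout.strip())  # → git push origin feature/my-feature
```

Save `json.dumps(payload)` to a file and you have a reusable fixture for `cat sample.json | bash your-hook.sh`.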

Skills (slash commands)

Skills are custom commands registered in ~/.claude/commands/. Each is a directory containing a SKILL.md (or a single markdown file) that defines what /command-name does. When a user types /develop, Claude Code loads ~/.claude/commands/develop/SKILL.md and follows it.

The Agent() tool is a Claude Code built-in that spawns a sub-agent (subprocess) with its own context. run_in_background=True makes it non-blocking so you can launch multiple agents in parallel.


Tiers

Each tier builds on the previous. Start with Tier 1 — it requires zero workflow changes and immediately prevents bad pushes.

When to upgrade:

  • Stay at Tier 1 if Claude is writing good code but occasionally pushes without testing
  • Move to Tier 2 when Claude is testing but building the wrong thing (testing what it wrote, not what was specified)
  • Move to Tier 3 when you want hands-off feature development with human review at gates

Tier 1: Guard Rails

What you get: Claude can't push broken code. Test/lint results are captured automatically on every run. Convention checks fire on every file edit.

What changes for you: Nothing. You code normally. The hooks silently capture results and block bad pushes.

Components:

| Hook | Event | Purpose |
|---|---|---|
| capture-results.sh | PostToolUse[Bash] | Parses test/lint/build output → writes JSON artifacts |
| semgrep-check.sh | PostToolUse[Edit\|Write] | Convention checks, gate-file protection, phase recomputation |
| pre-push-guard.sh | PreToolUse[Bash] | Reads artifacts, blocks git push if checks fail |

How capture-results.sh works

The hook receives every Bash command + its output as JSON on stdin. It pattern-matches the command to detect what ran:

# After extracting cmd and resp from the JSON stdin:

if 'go test' in cmd:
    # Parse: count "ok" lines (passed) and "FAIL" lines (failed)
    # Extract individual test names from "--- PASS: TestFoo" / "--- FAIL: TestBar"
    passed = len(re.findall(r'^ok\s', resp, re.MULTILINE))
    failed = len(re.findall(r'^FAIL\s', resp, re.MULTILINE))
    test_names = {}
    for m in re.finditer(r'--- (PASS|FAIL|SKIP): (\S+)', resp):
        test_names[m.group(2)] = m.group(1).lower()
    write_json('test-results.json', {...})

elif 'vitest' in cmd:
    # IMPORTANT: strip ANSI color codes before parsing, or counts break
    clean = re.sub(r'\x1b\[[0-9;]*m', '', resp)
    # Parse: "Tests  5 passed | 2 failed"
    pass_match = re.search(r'(\d+)\s+passed', clean)
    fail_match = re.search(r'(\d+)\s+failed', clean)
    write_json('test-results-widget.json', {...})

elif 'golangci-lint' in cmd and 'run' in cmd:
    # Parse: count lines matching "file.go:N:N:"
    issues = re.findall(r'^\S+\.go:\d+:\d+:', resp, re.MULTILINE)
    write_json('lint-results.json', {...})

elif 'semgrep' in cmd and 'scan' in cmd:
    # Count findings
    write_json('semgrep-results.json', {...})

Key behaviors:

  • Test result merging: If Claude runs go test ./pkg/foo then later go test ./pkg/bar, the hook merges results — test_names from both runs accumulate, and pass/fail counts are recalculated from the merged set. This means incremental test runs build up a complete picture.

  • Commit pinning: Every artifact includes the current HEAD commit hash. If Claude makes code changes after running tests, HEAD changes and the artifact becomes stale. The pre-push guard catches this.

  • Output truncation: tool_response is capped at 10,000 chars on input; output_tail in artifacts is capped at 2,000 chars. This prevents memory issues with massive test output.
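The merge behavior can be sketched as a pure function. This is a simplified model of what the hook does, not its actual code:

```python
import re

def parse_go_test(output: str) -> dict:
    """Extract per-test statuses from `go test -v` output."""
    statuses = {}
    for m in re.finditer(r'--- (PASS|FAIL|SKIP): (\S+)', output):
        statuses[m.group(2)] = m.group(1).lower()
    return statuses

def merge_results(previous: dict, output: str, commit: str) -> dict:
    """Merge a new run into the artifact; recount from the merged set.
    A commit change discards old results (the commit-pinning rule)."""
    names = dict(previous.get("test_names", {})) if previous.get("commit") == commit else {}
    names.update(parse_go_test(output))  # later runs win for re-run tests
    return {
        "commit": commit,
        "test_names": names,
        "tests_pass": sum(1 for s in names.values() if s == "pass"),
        "tests_fail": sum(1 for s in names.values() if s == "fail"),
        "all_green": all(s != "fail" for s in names.values()),
    }

# Two incremental runs at the same commit accumulate:
run1 = "--- PASS: TestFoo\n--- FAIL: TestBar\nFAIL\n"
run2 = "--- PASS: TestBar\n--- PASS: TestBaz\nok\n"
art = merge_results({}, run1, "abc1234")
art = merge_results(art, run2, "abc1234")
print(art["tests_pass"], art["tests_fail"], art["all_green"])  # → 3 0 True
```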

Each artifact looks like:

{
  "commit": "abc1234",
  "timestamp": "2026-04-06T...",
  "tests_pass": 42,
  "tests_fail": 0,
  "all_green": true,
  "test_names": {
    "TestFoo_HappyPath": "pass",
    "TestBar_EdgeCase": "pass"
  },
  "output_tail": "..."
}

How pre-push-guard.sh works

Fires before every Bash command. Early-exits for non-git push commands (~0ms overhead).

The guard maintains two separate accumulators — this is an important design choice:

  • FAILURES → exit 2 (push blocked): stale tests, failing tests, missing lint, uncommitted changes, migration number conflicts
  • WARNINGS → exit 0 (push allowed, but issues shown): semgrep findings, Gate 2 staleness (review ran at a different commit than HEAD), missing widget build

INPUT=$(cat)
COMMAND=$(echo "$INPUT" | python3 -c "import json,sys; print(json.load(sys.stdin).get('tool_input',{}).get('command',''))")
echo "$COMMAND" | grep -q "git push" || exit 0   # not a push? allow immediately

# Find artifact directory
CWD=$(echo "$INPUT" | python3 -c "import json,sys; print(json.load(sys.stdin).get('cwd',''))")
PROJECT_ROOT=$(git -C "$CWD" rev-parse --show-toplevel 2>/dev/null)
BRANCH=$(git -C "$PROJECT_ROOT" branch --show-current 2>/dev/null)
DEVELOP_DIR="$PROJECT_ROOT/develop/$BRANCH"
HEAD=$(git -C "$PROJECT_ROOT" rev-parse --short HEAD 2>/dev/null)

FAILURES=""
WARNINGS=""

# Check 1: Tests at HEAD
TEST_FILE="$DEVELOP_DIR/test-results.json"
if [ -f "$TEST_FILE" ]; then
    TEST_COMMIT=$(python3 -c "import json; print(json.load(open('$TEST_FILE')).get('commit','')[:7])")
    TEST_FAIL=$(python3 -c "import json; print(json.load(open('$TEST_FILE')).get('tests_fail', 0))")
    if [ "$TEST_COMMIT" != "$HEAD" ]; then
        FAILURES="${FAILURES}\n❌ Tests are stale (ran at $TEST_COMMIT, HEAD is $HEAD). Run: go test ./..."
    elif [ "$TEST_FAIL" != "0" ]; then
        FAILURES="${FAILURES}\n❌ Tests have $TEST_FAIL failures."
    fi
else
    FAILURES="${FAILURES}\n❌ No test results found. Run: go test ./..."
fi

# Check 2: Lint — file-type-aware staleness
# Only blocks if .go files changed between lint commit and HEAD.
# If only docs/config changed, lint results are still valid.
LINT_FILE="$DEVELOP_DIR/lint-results.json"
if [ -f "$LINT_FILE" ]; then
    LINT_COMMIT=$(python3 -c "import json; print(json.load(open('$LINT_FILE')).get('commit','')[:7])")
    if [ "$LINT_COMMIT" != "$HEAD" ]; then
        GO_CHANGES=$(git -C "$PROJECT_ROOT" diff --name-only "$LINT_COMMIT".."$HEAD" -- '*.go' | head -1)
        if [ -n "$GO_CHANGES" ]; then
            FAILURES="${FAILURES}\n❌ Lint stale — Go files changed since $LINT_COMMIT."
        fi
        # If only non-Go files changed, lint is still valid — no failure
    fi
fi

# Check 3: Semgrep (WARNING only, doesn't block)
# Check 4: Uncommitted changes
# Check 5: Migration number conflicts against origin/production

# Block or allow
if [ -n "$FAILURES" ]; then
    echo -e "🚨 PRE-PUSH BLOCKED:\n$FAILURES" >&2
    [ -n "$WARNINGS" ] && echo -e "\nAlso:\n$WARNINGS" >&2
    exit 2
fi
[ -n "$WARNINGS" ] && echo -e "⚠️ Warnings (push allowed):\n$WARNINGS"
exit 0
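Check 5 is only stubbed above. A minimal sketch of migration-number conflict detection, assuming migrations carry a numeric prefix like `0042_add_scores.sql` (adjust the pattern to your migration tool; in the real hook the two file lists would come from git, e.g. the base branch tree and the branch diff):

```python
import re

def migration_conflicts(base_files, branch_files):
    """Return numeric prefixes used both on the base branch and in NEW
    migrations on this branch, which means two PRs grabbed the same number."""
    prefix = re.compile(r'(\d+)_')
    def numbers(files):
        return {m.group(1) for f in files if (m := prefix.match(f.rsplit("/", 1)[-1]))}
    new_on_branch = set(branch_files) - set(base_files)
    return sorted(numbers(base_files) & numbers(new_on_branch))

base = ["migrations/0041_init.sql", "migrations/0042_add_users.sql"]
branch = base + ["migrations/0042_add_scores.sql"]
print(migration_conflicts(base, branch))  # → ['0042']
```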

How semgrep-check.sh works

Fires after every Edit/Write. Three responsibilities:

1. Fabrication guard — prevents Claude from cheating:

BASENAME=$(basename "$FILE")

# Block gate-file fabrication
if [[ "$BASENAME" == gate-*.approved ]]; then
    VIOLATIONS="STOP: You wrote a gate approval file. Only the USER can approve gates."
fi

# Block artifact fabrication
if [[ "$BASENAME" == *-results.json ]] && [[ "$FILE" == *develop/* ]] && [[ "$TOOL" == "Write" ]]; then
    VIOLATIONS="STOP: You wrote an artifact file directly. Run the actual command instead."
fi

2. Modular convention checks — sources scripts from a checks/ directory:

CHECKS_DIR="$HOME/.claude/hooks/checks"

# Only scan CHANGED lines (via git diff), not the whole file
CHANGED=$(git diff -U0 -- "$FILE" | grep '^+[^+]' | sed 's/^+//')

# Dispatch by file type
if [[ "$FILE" == *.go ]] && [[ "$FILE" != *_test.go ]]; then
    [ -f "$CHECKS_DIR/go-resolvers.sh" ] && source "$CHECKS_DIR/go-resolvers.sh"
    [ -f "$CHECKS_DIR/go-layers.sh" ] && source "$CHECKS_DIR/go-layers.sh"
fi
if [[ "$FILE" == *.tsx ]] || [[ "$FILE" == *.ts ]]; then
    [ -f "$CHECKS_DIR/ts-components.sh" ] && source "$CHECKS_DIR/ts-components.sh"
fi

Each check script uses exported $CHANGED, $FILE, and appends to $VIOLATIONS. This makes it easy to add project-specific checks without modifying the main hook.

Important: convention checks only scan added/changed lines (via git diff -U0), not the whole file. This avoids flagging pre-existing issues in untouched code. Files in generated/, ent/, vendor/, node_modules/, and *_test.go are skipped entirely.
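The shape of an individual check can be modeled in a few lines. The rule here (no fmt.Println outside tests) is a hypothetical example, not one from this repo:

```python
import re

def check_changed_lines(changed: str, file: str) -> list:
    """Scan only the added lines (the $CHANGED diff output) for a
    convention violation; untouched code is never flagged."""
    violations = []
    if file.endswith(".go") and not file.endswith("_test.go"):
        for i, line in enumerate(changed.splitlines(), 1):
            if re.search(r'\bfmt\.Println\(', line):
                violations.append(f"{file}: added line {i} uses fmt.Println; use the project logger")
    return violations

changed = 'logger.Info("start")\nfmt.Println("debug")\n'
print(check_changed_lines(changed, "pkg/scoring/score.go"))
```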

3. Phase recomputation on spec edits (Tier 2+):

# When spec.md or test-shape.yaml is edited, gates reset
if [[ "$BASENAME" == "spec.md" || "$BASENAME" == "test-shape.yaml" ]]; then
    # Re-run phase machine to detect gate reset
    python3 /path/to/compute-phase.py "$ENV_ROOT" 2>/dev/null
    # Auto-regenerate verification skeleton
    python3 /path/to/generate-verification-skeleton.py "$ENV_ROOT" 2>/dev/null
    VIOLATIONS="${VIOLATIONS}\n⚠️ ${BASENAME} modified — gates reset."
fi

This means editing the spec after approval automatically resets gates — intentionally forcing re-verification of everything.


Tier 2: Phase Awareness

What you get: Claude works through a structured pipeline that goes from broad to specific: plan (what and why) → spec (what "done" means) → test shape (how to verify each claim) → failing tests → implementation. Every message gets injected with the current phase and instruction. Claude follows the pipeline instead of freelancing.

What changes for you: Features now produce a develop/<branch>/ directory with plan.md, spec.md, and test-shape.yaml. Claude writes failing tests before implementation code. You approve at gates.

New components (in addition to Tier 1):

| Hook/Script | Purpose |
|---|---|
| user-prompt-submit.sh | UserPromptSubmit hook — runs phase machine, injects instruction + behavioral directives |
| capture-gate-results.sh | SubagentStop hook — captures sub-agent review findings |
| compute-phase.py | The state machine — reads artifacts, determines current phase |
| phase_utils.py | Helpers for file checks, JSON parsing, gate validation |
| spec_validator.py | Validates spec.md structure (Must Do / Must NOT claims) |

How the phase machine works

compute-phase.py takes a project root and checks phases top-down. First incomplete phase wins:

def compute_phase(project_root):
    branch = get_current_branch(project_root)
    if not branch or branch == "production":
        return Phase("no-branch", "Create a feature branch: git checkout -b feature/<TICKET>")
    
    develop_dir = f"{project_root}/develop/{branch}"
    # Auto-create develop dir if missing; also walks develop/ tree as fallback
    os.makedirs(develop_dir, exist_ok=True)
    head = get_head_commit(project_root)
    
    # Check for phase override (/set-phase) — expires after 24h
    override = read_json(f"{develop_dir}/phase-override.json")
    if override and not expired(override, hours=24):
        return Phase(override["phase"], f"Override active. Expires 24h after {override['set_at']}.")
    elif override:
        os.remove(f"{develop_dir}/phase-override.json")  # auto-cleanup
    
    # Phase 1: Plan
    if not exists(f"{develop_dir}/plan.md"):
        return Phase("plan", "Discuss scope and user concerns. Write plan.md.")
    
    # Hard gate: plan must have User Concerns section with content
    if not has_user_concerns(f"{develop_dir}/plan.md"):
        return Phase("plan", "plan.md missing User Concerns. Ask: 'What are you worried about?'")
    
    # Phase 2: Spec
    if not exists(f"{develop_dir}/spec.md"):
        return Phase("spec", "Define Must Do / Must NOT claims. Write spec.md.")
    
    # Phase 3: Survey
    if not exists(f"{develop_dir}/test-shape.yaml"):
        return Phase("survey", "Map claims to test cases. Write test-shape.yaml.")
    
    # Phase 4: Shape Review (GATE — requires user approval)
    if not exists(f"{develop_dir}/gate-shape.approved"):
        return Phase("shape-review", "BLOCKED — present spec + test shape for /approve-shape")
    
    # Phase 5: Tests
    test_results = read_json(f"{develop_dir}/test-results.json")
    if not test_results:
        return Phase("tests", "Write failing tests per test-shape.yaml. Run them.")
    
    # Phase 6: Implement — commit-pinned staleness check
    if test_results.get("commit", "")[:7] != head[:7]:
        return Phase("tests", "Test results are stale (HEAD changed). Re-run tests.")
    if not test_results.get("all_green"):
        return Phase("implement", f"Make failing tests pass. {test_results['tests_fail']} failures remain.")
    
    # Phase 7: Diff Review
    gate2 = read_json(f"{develop_dir}/gate2-results.json")
    if not gate2 or gate2.get("total_must_fix", 0) > 0:
        return Phase("diff-review", "Run adversarial review agents on the diff.")
    
    # Phase 8-11: verify → report-review → ship → handoff
    # ... same pattern: check artifact, check gate file, advance

Key behaviors not obvious from the pseudocode:

  • User concerns as a hard gate: The plan phase blocks if ## User Concerns is empty. This forces the "what are you worried about?" conversation to happen every time.

  • Late artifact detection: If early artifacts (plan/spec) are missing but later ones exist (test results, gate files), the machine detects this mismatch and suggests /set-phase options. This handles migrating existing work into the ceremony.

  • Git worktree support: The code handles .git being a file (worktree pointer) rather than a directory.

  • Suggested commands: Each phase includes a suggested_command field with the exact shell command to run next.
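Two of the helpers referenced above are small enough to sketch. These are assumed implementations matching the described behavior, not this repo's code; treating set_at as a Unix timestamp is a guess (the real file may store ISO time):

```python
import re
import time

def has_user_concerns(plan_path: str) -> bool:
    """True if plan.md has a '## User Concerns' section with non-empty content."""
    try:
        text = open(plan_path).read()
    except OSError:
        return False
    m = re.search(r'^## User Concerns\s*\n(.*?)(?=^## |\Z)',
                  text, re.MULTILINE | re.DOTALL)
    return bool(m and m.group(1).strip())

def expired(override: dict, hours: int = 24) -> bool:
    """Phase overrides expire; here set_at is assumed to be a Unix timestamp."""
    return time.time() - override.get("set_at", 0) > hours * 3600

# An empty section does not satisfy the hard gate:
import os
import tempfile
f = tempfile.NamedTemporaryFile("w", suffix=".md", delete=False)
f.write("## Summary\nstuff\n## User Concerns\n\n## Next\n")
f.close()
print(has_user_concerns(f.name))  # → False
os.remove(f.name)
```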

Full phase JSON output

The phase machine outputs rich JSON, not just phase + instruction:

{
  "phase": "implement",
  "stage": 6,
  "total_stages": 12,
  "progress": "6/12 (feature/ST-1234)",
  "instruction": "Make failing tests pass. 2 failures remain.",
  "suggested_command": "go test ./pkg/scoring/... -count=1 -v",
  "gates": {
    "shape": "approved",
    "report": "pending",
    "handoff": "pending"
  },
  "user_concerns": [
    "Performance with 500+ deals",
    "Migration backward compatibility"
  ],
  "custom_probes": [
    "Verify page load stays under 2s with 500 deals",
    "Verify migration rollback works"
  ],
  "user_visible_expectations": [
    "Score badge visible on deal cards",
    "Tooltip appears on hover"
  ],
  "stages": {
    "tests": {"status": "done", "commit": "abc1234"},
    "implement": {"status": "in_progress", "warning": "2 tests still failing"},
    "diff-review": {"status": "pending"}
  }
}

The UserPromptSubmit hook parses this and injects the relevant parts. User concerns and custom probes are carried through the entire pipeline — they originated in the plan phase and surface during diff review as custom agent mandates.

What the UserPromptSubmit hook injects

Beyond the phase instruction, the hook injects standing behavioral directives on every turn:

# Phase state
echo "Phase: $PHASE [$PROGRESS] | Gates: shape:$SHAPE report:$REPORT handoff:$HANDOFF"
echo "Instruction: $INSTRUCTION"

# Standing directives (always injected)
echo "PROCESS: Follow the phase instruction. If you skip ANY step, tell the user explicitly."
echo "GATE SAFETY: After shape gate is approved, do NOT edit spec.md or test-shape.yaml unless the user explicitly asks."
echo "BUG FIX PROTOCOL: If fixing a bug on current feature, ask: amend existing PR or new feature?"

# User concerns and probes (from plan.md, carried through pipeline)
echo "User Concerns: $CONCERNS"
echo "Custom Adversarial Probes: $PROBES"

# Checklist enforcement at ship/handoff phases
if [ "$PHASE" = "ship" ] || [ "$PHASE" = "handoff-review" ]; then
    if [ ! -f "$DEVELOP_DIR/checklist-ran.json" ] || is_stale "$DEVELOP_DIR/checklist-ran.json"; then
        echo "⛔ CHECKLIST NOT RUN. Run /checklist before proceeding."
    fi
fi

The gate safety directive prevents Claude from accidentally resetting gates by editing spec.md after approval. The bug fix protocol handles the common case where a user wants to fix something on the current feature — it asks whether to amend (resets gates intentionally) or start a new branch.

Gate approvals

Three phases require your approval. Claude cannot approve its own work — the semgrep-check.sh hook blocks it from writing gate-*.approved files.

You approve via slash commands (skills):

  • /approve-shape — after reviewing spec + test plan + Gate 1 agent findings
  • /approve-report — after reviewing verification report with screenshots
  • /approve-handoff — final sign-off before merge

Each approval skill writes a signed gate file:

approved 2026-04-06T15:30:00Z abc1234 sig=HASH
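The sig=HASH scheme is left unspecified. One plausible sketch uses an HMAC over the payload with a secret only the approval skills can read; the key location and the 16-char truncation below are assumptions, not what this repo does:

```python
import hashlib
import hmac
import time

# Hypothetical: the real secret would live outside the repo, e.g. a file
# in your home directory that the approval skills read.
SECRET = b"example-secret-not-in-repo"

def write_gate(commit: str, now=None) -> str:
    ts = now or time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
    payload = f"approved {ts} {commit}"
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()[:16]
    return f"{payload} sig={sig}"

def verify_gate(line: str) -> bool:
    payload, _, sig = line.rpartition(" sig=")
    expect = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()[:16]
    return hmac.compare_digest(sig, expect)

line = write_gate("abc1234", "2026-04-06T15:30:00Z")
print(verify_gate(line), verify_gate(line.replace("abc1234", "def5678")))  # → True False
```

Because Claude could still fabricate the file through a plain Bash echo (the Edit/Write hook never sees Bash commands), the phase machine can re-verify the signature before honoring a gate.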

How capture-gate-results.sh works

When a sub-agent completes, this hook classifies its findings by scanning the output text for keywords:

# Classification by keyword matching on agent output
if 'style guide' in msg.lower() or 'file size' in msg.lower():
    gate = 2; agent_name = 'style_guide'
elif 'contract' in msg.lower() or 'breaking change' in msg.lower():
    gate = 2; agent_name = 'contracts'
elif 'regression' in msg.lower() or 'untested' in msg.lower():
    gate = 2; agent_name = 'regression'
# ... etc

# Count severities by regex on the raw message
must_fix = len(re.findall(r'must.?fix|HIGH|CRITICAL', msg, re.IGNORECASE))
should_fix = len(re.findall(r'should.?fix|MEDIUM', msg, re.IGNORECASE))

This means agent output format matters: agents must use severity keywords (must-fix, should-fix, HIGH, CRITICAL, MEDIUM) that the regex can parse. The structured table format isn't cosmetic — it's machine-read.

Multi-round tracking: Results are stored in a rounds array. If the same agent name appears twice in the current round, a new round is automatically created. This drives the targeted re-review — only agents with must-fix findings from the previous round are re-run.
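The round-rollover rule can be sketched as follows (simplified; real artifacts carry more fields per finding):

```python
def record_finding(results: dict, agent: str, must_fix: int) -> dict:
    """Append an agent's findings. If this agent already reported in the
    current round, that signals a re-review, so a new round begins."""
    rounds = results.setdefault("rounds", [{}])
    if agent in rounds[-1]:
        rounds.append({})
    rounds[-1][agent] = {"must_fix": must_fix}
    # total_must_fix reflects the latest round only
    results["total_must_fix"] = sum(f["must_fix"] for f in rounds[-1].values())
    return results

r = {}
record_finding(r, "security", 1)
record_finding(r, "contracts", 0)
record_finding(r, "security", 0)   # security re-reviewed → round 2 begins
print(len(r["rounds"]), r["total_must_fix"])  # → 2 0
```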

CLAUDE.md rules for Tier 2

Add these behavioral rules to your CLAUDE.md:

## Development Ceremony

- The phase injection in your prompt IS the source of truth for what to do next.
  Follow it. Do NOT read STATE.md to determine phase — it may be stale.
- NEVER write gate-*.approved files. Only the user runs /approve-shape, 
  /approve-report, /approve-handoff.
- NEVER write *-results.json files directly. Run the actual commands (go test, 
  vitest, semgrep) — the capture hook writes artifacts automatically.
- If the phase says "BLOCKED", present the artifact to the user and STOP.
  Do not continue with other work — halt and wait for the user's decision.
- When spawning sub-agents you need results from, run them in FOREGROUND 
  (the default) and wait for completion before responding. If you reply before
  agents finish, the stop hook fires prematurely.

Tier 3: Full Ceremony

What you get: The complete TDD pipeline driven by /develop. Type it once and the machine drives: plan → spec → tests → implement → adversarial review → verify → ship. You approve at gates and that's it. When /develop reaches Done, it automatically invokes /ship — truly end-to-end.

What changes for you: Features are fully managed by the pipeline. You review and approve at gates, handle Claude's questions during planning, and get a shipped PR at the end.

New components:

| Component | Purpose |
|---|---|
| /develop skill | The main workflow — 12 stages with full instructions per phase |
| /ship skill | Autonomous build → test → lint → commit → push → PR → CI poll |
| /checklist skill | Shows receipt for current gate (what's verified, what's missing) |
| /approve-* skills | Write signed gate files after your review |
| agents.md | Sub-agent prompt templates for Gate 1 (spec review) and Gate 2 (diff review) |
| agent-gauntlet.md | 22 agent specifications for comprehensive pre-handoff review |
| conventions.md | Your project's architecture rules (referenced by agents) |

The 12 stages

| # | Phase | What Claude does | Artifact | Gate? |
|---|---|---|---|---|
| 1 | plan | Discusses scope, asks about concerns, spawns discovery agents | plan.md | |
| 2 | spec | Defines Must Do / Must NOT claims, schema change declarations | spec.md | |
| 3 | survey | Reads code, traces paths, maps claims → test cases, runs contract risk analysis | test-shape.yaml | |
| 4 | shape-review | 2-3 agents review spec + tests (zero context) | agents/gate1/*.json | You: /approve-shape |
| 5 | tests | Writes failing tests per test-shape.yaml, runs them | Test files + test-results.json | |
| 6 | implement | Makes tests pass, one at a time, semgrep after each fix | Code changes | |
| 7 | diff-review | 8+ agents review diff (zero context, 3 rounds) | agents/gate2/*.json | Auto-fix loop |
| 8 | verify | E2E feature tests (dev-browser) + regression tests (Playwright) | e2e-results.json + screenshots | |
| 9 | report-review | Presents verification report mapping every claim to evidence | | You: /approve-report |
| 10 | ship | build → lint → test → commit → push → PR → CI poll | PR created | |
| 11 | handoff-review | Final checklist + approval | | You: /approve-handoff |

Phase details

Plan phase is interactive: Claude asks questions rather than just writing a doc. It explicitly asks:

  • "What are you most worried about with this change?"
  • "Are there specific things you want adversarial agents to check?"
  • "Any areas of the codebase that are fragile?"

Answers become ## User Concerns and ## Custom Adversarial Probes in plan.md. These are tracked through the entire pipeline and injected into every phase instruction. The phase machine blocks if concerns are empty.
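The empty-concerns block can be implemented with a simple markdown section check. A sketch (illustrative, not the actual compute-phase.py code):

```python
import re

def has_nonempty_section(markdown_text, heading):
    """Return True if `## <heading>` exists and has non-blank body text
    before the next `##` heading. Similar in spirit to the check the
    phase machine runs against plan.md."""
    pattern = rf"^##\s+{re.escape(heading)}\s*\n(.*?)(?=^##\s|\Z)"
    m = re.search(pattern, markdown_text, re.MULTILINE | re.DOTALL)
    return bool(m and m.group(1).strip())
```

If `has_nonempty_section(plan, "User Concerns")` is false, the machine refuses to advance past the plan phase.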

Claude also spawns discovery agents: an "analogical reasoning agent" that finds the most structurally similar existing feature, and a "reuse agent" that searches for existing code to leverage.

Spec phase includes a schema changes declaration in structured YAML:

graphql:
  new_types: [ScoreBadge, ScoreBreakdown]
  new_mutations: [recalculateScore]
  modified_files: [schema.graphqls, scoring.graphqls]
migrations:
  new: ["20260406_add_score_column.sql"]
  tables_modified: [deals]
codegen:
  required: true
  widget_codegen: true

The verification report later checks each declared item exists in the generated code and flags undeclared schema changes.
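One half of that check, verifying declared types exist, can be sketched like this (assuming the YAML declaration has already been parsed; names are illustrative):

```python
import re

def check_declared_types(declared, schema_text):
    """For each declared new GraphQL type, verify a `type <Name>`
    definition exists in the schema text. Returns declared names that
    are missing. Sketch only: the real report also diffs the schema
    to flag UNdeclared changes, the harder direction."""
    defined = set(re.findall(r"^type\s+(\w+)", schema_text, re.MULTILINE))
    return [name for name in declared if name not in defined]
```

The reverse direction (undeclared changes) requires diffing the schema against the base branch, not just scanning it.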

After spec is written, a verification skeleton PDF is auto-generated showing every claim with PENDING status and evidence slots. The user reviews this before approving — it previews what the final verification report will look like.

Survey phase includes contract risk analysis: for every modified function/interface, callers are grepped across all repos and a risk table is produced. High and medium risks must each map to a test. GraphQL value changes are flagged as "silent breakers" — codegen catches schema shape changes but not value changes.

Verify phase has two distinct steps:

  1. Feature validation via dev-browser — navigate the actual UI, exercise each test from the test shape, capture screenshots named by claim ID (claim-do-1-description.png, claim-not-3-no-raw-floats.png). The naming convention matters because the verification report matches by claim-{do|not}-{N} prefix.
  2. Regression testing via Playwright QA suite — maps changed files to test suites and runs the relevant ones.
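The prefix matching that the verification report performs can be sketched as follows (hypothetical function name, but the `claim-{do|not}-{N}-` convention is as described above):

```python
def match_screenshots_to_claims(claim_ids, filenames):
    """Map claim IDs like 'do-1' or 'not-3' to screenshots named
    claim-{do|not}-{N}-description.png. Returns (mapping, claims with
    no evidence). Sketch of the report's evidence matching."""
    mapping = {}
    for cid in claim_ids:
        prefix = f"claim-{cid}-"          # trailing '-' avoids do-1 matching do-10
        mapping[cid] = [f for f in filenames if f.startswith(prefix)]
    unmatched = [cid for cid, files in mapping.items() if not files]
    return mapping, unmatched
```

Any claim in the unmatched list shows up as PENDING in the report, which is exactly what the skeleton previewed at spec time.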

Adversarial sub-agents

Gate 1 agents (3 agents, after spec/test-shape — review the PLAN, not the code):

| Agent | Receives | Checks |
|---|---|---|
| Spec Reviewer | Acceptance criteria, test plan, architecture file list from plan.md, repo root | Test plan gaps, untested callers, missing edge cases |
| Architecture Reviewer | Architecture file list, acceptance criteria, repo root + widget path | Data duplication, wrong layers, SSOT violations, cross-boundary coupling |
| Reuse Reviewer | Architecture file list, repo root | Missed existing hooks, components, API endpoints |

Gate 1 agents receive plan.md sections, NOT spec.md or a diff. They review the plan before any code is written.

Gate 2 agents (8+ agents, after implementation — review the CODE):

| Agent | What it checks | Why it catches what Claude misses |
|---|---|---|
| Style & Conventions | File size, naming, layer violations | Claude wrote the code — it thinks the structure is fine |
| Contracts & Interfaces | Breaking changes, backward compatibility | Claude focused on making tests pass, not on callers |
| Regression | Untested callers of changed functions | Claude tested what it changed, not what calls it |
| Security | Auth checks, injection, data exposure | Claude builds features, not attacks |
| Performance | N+1 queries, unbounded results | Claude optimizes for correctness, not performance |
| Scope | Unrelated changes, drive-by refactors | Claude "improves" code it wasn't asked to touch |
| Logging | Log format, error context, PII | Claude doesn't think about ops |
| Reuse & Patterns | Observability, pattern deviations | Claude writes new code instead of finding existing patterns |

Critical design detail: Gate 2 agents receive the repo path and base branch, NOT a pasted diff. They have tool access (Bash, Read, Grep, Glob) and actively explore the codebase — grepping for callers, reading style guides, checking file sizes. This is "zero context about the conversation" but "full access to the codebase."

Every agent prompt includes: "Do NOT write any code. Research only." — this keeps agents in adversarial reviewer mode rather than "fixing" things and losing their critical posture.

Custom probes from the plan phase become additional agents. Each probe is spawned as a Tier 2 agent with only the diff, the probe text, and the repo root. Output schema: {verdict: pass|fail, evidence, score: 0-5}.
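Because that output schema is machine-consumed, probe results are worth validating before scoring. A minimal sketch (the real gate may be stricter):

```python
def validate_probe_output(obj):
    """Validate a custom-probe agent's result against the schema
    {verdict: pass|fail, evidence, score: 0-5}. Returns a list of
    problems; empty list means valid."""
    problems = []
    if obj.get("verdict") not in ("pass", "fail"):
        problems.append("verdict must be 'pass' or 'fail'")
    if not isinstance(obj.get("evidence"), str) or not obj["evidence"].strip():
        problems.append("evidence must be a non-empty string")
    score = obj.get("score")
    if not isinstance(score, int) or not 0 <= score <= 5:
        problems.append("score must be an integer 0-5")
    return problems
```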

Agent tiers — why the ordering matters

Agents are organized in three tiers, and the ordering is load-bearing:

| Tier | Agents | Scope | Why this order |
|---|---|---|---|
| Tier 1: Property | Style, Contracts, Regression, Security, Performance, Scope, Logging, Patterns | Individual files in the diff | Catches most issues cheaply |
| Tier 2: Cross-cutting | Widget→API data flow, API→DB data flow, Zero-value/nil safety, Error propagation, Temporal/async safety, State consistency | Data flow across boundaries | Catches bugs that single-file agents structurally cannot see |
| Tier 3: Meta | Fresh-eyes summary, Gap detection, Concern validator, Test adequacy, Cold diff review | The review itself | Reviews the review — finds what all other agents missed |

Tier 2 agents trace data across layers (frontend → API → DB → cache) and catch type mismatches, nil propagation, and consistency gaps. Tier 3's "Gap Detection" compares against analogous features — this only makes sense after Tier 1+2 have catalogued what's present.

The Cold Diff Review agent (Tier 3) is deliberately tool-less — it receives the full diff pre-loaded in its prompt and has NO file-reading tools. This forces pure diff-based reasoning, mimicking what a human reviewer actually sees on a PR.

Three-round diff review

| Round | Which agents | Purpose |
|---|---|---|
| Round 1 | All agents (8-22 depending on scope) | Comprehensive first pass |
| Round 2 | Only agents that found must-fix items | Verify fixes didn't introduce new issues |
| Round 3 | Always Contracts + Regression + Security | Fixed safety net — these 3 catch real production bugs |

Round 3 always runs even if Round 2 was clean. Exit criteria: Round 3 produces zero must-fix findings.

Weighted scoring: Each agent scores 0-5. Tier weights: Tier 1 = 1x, Tier 2 = 2x, Tier 3 = 1.5x. Cross-cutting agents are weighted double because their findings are harder to detect and more impactful in production.
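The weighting reduces to a weighted average. A sketch (representation of results is illustrative):

```python
# Tier weights as described above: cross-cutting findings count double.
TIER_WEIGHTS = {1: 1.0, 2: 2.0, 3: 1.5}

def weighted_score(results):
    """Combine per-agent 0-5 scores into a weighted average.
    `results` is a list of (tier, score) pairs."""
    total_weight = sum(TIER_WEIGHTS[tier] for tier, _ in results)
    if total_weight == 0:
        return 0.0
    return sum(TIER_WEIGHTS[tier] * score for tier, score in results) / total_weight
```

With this scheme a Tier 2 agent's score moves the aggregate twice as far as a Tier 1 agent's, which matches the rationale: cross-boundary bugs are harder to detect and costlier in production.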


How it prevents Claude from cheating

| Problem | How Claude cheats | Prevention |
|---|---|---|
| Skipping tests | "The code is straightforward, let me push" | Phase machine won't advance without test artifacts at HEAD |
| Fake test results | Writes test-results.json directly | semgrep-check.sh detects Write to *-results.json in develop/ and warns |
| Self-approval | Writes gate-shape.approved | semgrep-check.sh blocks writing gate-*.approved files |
| Pushing broken code | git push without running tests | pre-push-guard.sh reads artifacts and blocks with exit 2 |
| Stale results | Runs tests, makes more changes, pushes | Artifacts are commit-pinned — HEAD change = stale |
| Anchoring bias | Reviews its own code and finds nothing | Gate 2 agents have zero conversation context |
| Scope creep | "While I'm here, let me refactor this" | Scope agent specifically checks for unrelated changes |
| Skipping edge cases | Tests happy path only | Spec reviewer checks gaps; custom probes test specific risks |
| Skipping user concerns | Ignores the planning conversation | Phase machine blocks if ## User Concerns is empty |
| Editing spec after approval | Changes the contract without re-review | File-edit hook auto-resets gates when spec.md is modified |
| Continuing past a gate | Does more work when it should wait for approval | CLAUDE.md rule + phase injection says "BLOCKED — halt" |
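The stale-results prevention hinges on commit-pinning. A minimal sketch of the check, assuming the artifact records the commit it was captured at (field names are illustrative):

```python
def artifact_is_fresh(artifact, head_sha):
    """Trust test results only if they were captured at the current
    HEAD and the run was green. Sketch of what pre-push-guard.sh
    enforces; the actual field names may differ."""
    return artifact.get("commit") == head_sha and artifact.get("all_green", False)
```

Any commit after the test run changes HEAD, so the artifact silently goes stale and the push is blocked until tests are re-run.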

Deep dives

How editing spec.md cascades through the pipeline

When Claude edits spec.md after the shape gate was approved:

  1. semgrep-check.sh (PostToolUse[Edit]) detects the edit
  2. It re-runs compute-phase.py which sees spec mtime > gate-shape.approved mtime
  3. The gate is invalidated — phase regresses to shape-review
  4. Verification skeleton is auto-regenerated from the new spec
  5. Claude gets a warning: "spec.md modified — gates reset"
  6. On next user message, the UserPromptSubmit hook injects the regressed phase

This cascade is intentional: a bug fix that edits the spec adds new claims, and everything downstream must be re-verified against the new claims.
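Step 2's mtime comparison can be sketched as a single predicate (illustrative, not the actual compute-phase.py code):

```python
import os

def gate_is_valid(spec_path, gate_path):
    """A gate approval is valid only if it is at least as new as the
    spec it approved. If the spec is edited later, the approval is
    considered invalidated and the phase regresses."""
    if not os.path.exists(gate_path):
        return False                      # never approved
    if not os.path.exists(spec_path):
        return True                       # nothing to compare against
    return os.path.getmtime(gate_path) >= os.path.getmtime(spec_path)
```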

How phase override works

/set-phase writes phase-override.json:

{"phase": "diff-review", "set_at": "2026-04-06T15:30:00Z"}

The phase machine checks this before the normal waterfall. After 24 hours, the override auto-expires and is deleted from disk. This handles cases where the machine is wrong (e.g., you imported existing tests that didn't go through the ceremony).
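The expiry check is a timestamp comparison against the `set_at` field shown above. A sketch (the deletion-from-disk step is omitted):

```python
from datetime import datetime, timedelta, timezone

OVERRIDE_TTL = timedelta(hours=24)

def active_override(override, now):
    """Return the overridden phase if the override is still within its
    24-hour TTL, else None. `override` is the parsed
    phase-override.json; `now` must be timezone-aware."""
    set_at = datetime.fromisoformat(override["set_at"].replace("Z", "+00:00"))
    if now - set_at > OVERRIDE_TTL:
        return None                       # expired: fall back to the waterfall
    return override["phase"]
```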

How test results merge across runs

When Claude runs go test ./pkg/foo then later go test ./pkg/bar:

  1. First run: writes test_names: {TestFoo: "pass"}
  2. Second run: reads existing file, merges test_names: {TestFoo: "pass", TestBar: "pass"}
  3. Recalculates tests_pass, tests_fail from the merged set

This means incremental test runs build up a complete picture. The pre-push guard sees the accumulated results, not just the last run.
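The merge step can be sketched like this (field names mirror the ones described above; the real hook also handles file I/O and commit pinning):

```python
def merge_test_results(existing, new_run):
    """Merge per-test outcomes from a new run into the accumulated
    results and recompute the counters. Later runs override earlier
    outcomes for the same test name."""
    names = dict(existing.get("test_names", {}))
    names.update(new_run)
    merged = {"test_names": names}
    merged["tests_pass"] = sum(1 for v in names.values() if v == "pass")
    merged["tests_fail"] = sum(1 for v in names.values() if v == "fail")
    merged["all_green"] = merged["tests_fail"] == 0
    return merged
```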


Porting guide

The reference implementation uses Go + Next.js. Here's how to adapt each piece.

Capture hook — add your test runner

| Your stack | Pattern to match | Output to parse |
|---|---|---|
| Python + pytest | 'pytest' in cmd | X passed, Y failed, Z error |
| Rust + cargo | 'cargo test' in cmd | test result: ok. N passed; M failed |
| Java + Maven | 'mvn test' in cmd | Parse surefire XML reports |
| Ruby + rspec | 'rspec' in cmd | X examples, Y failures |
| JavaScript + jest | 'jest' in cmd | Tests: X passed, Y failed (strip ANSI first!) |

Always strip ANSI color codes before parsing: re.sub(r'\x1b\[[0-9;]*m', '', resp). Most test runners emit colored output that breaks regex matching.
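Putting the pytest row and the ANSI strip together, a capture-hook parser might look like this (a sketch, assuming the standard pytest summary line format):

```python
import re

ANSI = re.compile(r"\x1b\[[0-9;]*m")

def parse_pytest_summary(output):
    """Extract pass/fail counts from a pytest summary line such as
    '=== 3 passed, 1 failed in 0.12s ==='. ANSI codes are stripped
    before matching, as recommended above."""
    clean = ANSI.sub("", output)
    passed = re.search(r"(\d+) passed", clean)
    failed = re.search(r"(\d+) failed", clean)
    return {
        "tests_pass": int(passed.group(1)) if passed else 0,
        "tests_fail": int(failed.group(1)) if failed else 0,
    }
```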

Pre-push guard — adapt checks

| Your check | How to implement |
|---|---|
| Tests pass | Read your test-results JSON, check all_green |
| Lint clean | Read lint-results JSON, check clean |
| Type-check | Add tsc --noEmit capture to capture-results, check in guard |
| Formatting | Add prettier --check / black --check capture |
| Migration conflicts | Compare your migration numbering against base branch |

Agent gauntlet — adapt per stack

| Go agent | Python equivalent | Rust equivalent |
|---|---|---|
| Go Idioms | Python Idioms (type hints, exception handling, async patterns) | Rust Safety (unsafe blocks, lifetime issues, error handling) |
| Style & Conventions | PEP 8, import ordering, docstrings | Clippy lints, naming conventions |
| Performance | N+1 ORM queries, missing pagination | Allocation patterns, clone vs borrow |

FAQ

Can I use this with any language? Yes. The phase machine is language-agnostic. capture-results.sh pattern-matches test output — add patterns for your test runner. The agent gauntlet specs are customizable per stack.

How long does each phase take? Plan: 5-10 min (interactive). Spec: 2-5 min. Survey: 3-5 min. Tests: 5-15 min. Implement: varies. Diff review: ~2 min (agents run in parallel). Ship: ~3 min. Total for a medium feature: 30-60 min with minimal human intervention.

Is this overkill for small fixes? For a one-line fix, yes. The ceremony is designed for features that take 30+ minutes. For small fixes, the guard rails (Tier 1) still help — you get artifact capture and push protection without the full pipeline.

What if Claude gets stuck in a phase? Use /set-phase to override. It writes a phase-override.json that expires after 24 hours, then auto-deletes.

Do I need all 22 agents? No. Start with the 8 Tier 1 property agents. Add Tier 2 cross-cutting agents when you want data-flow analysis. Add Tier 3 meta agents when you want review-of-the-review.

What about CI — doesn't this duplicate CI checks? The pre-push guard runs locally (~50ms, reads JSON files). CI runs the full suite remotely. The guard prevents wasting CI time on obviously broken code. They're complementary.

What if I abort mid-pipeline? Delete develop/<branch>/ or just switch branches. The ceremony state is entirely file-based — no external database or service needed.

How do I debug a hook that's not firing? Add echo "$(date) fired" >> /tmp/hook-debug.log at the top. If the log doesn't grow, check that settings.json has the right matcher and command path. Test manually: cat sample-input.json | bash your-hook.sh.
