This document is the chronological build history of aaOS, phase by phase, verified against git commit timestamps and content. It stops at the end of Phase E + the AgentSkills integration — everything that constitutes "the runtime is feature-complete for its current goal."
Self-reflection runs (the system reading its own code, finding bugs, proposing features) are captured separately in reflection/ as an ongoing log. Cross-cutting lessons distilled from both the build history and the reflection log live in patterns.md.
aaOS is an agent runtime — a program that manages AI agents as first-class processes, with capability-based security replacing traditional permissions. It ships as a daemon (agentd) running inside Docker on Linux; the long-term plan is to migrate to a real capability-based microkernel (Redox OS or seL4). See README.md for the product pitch and docs/architecture.md for the layered design.
- Claude session — a single conversation with Anthropic's Claude model, driven by a human operator through a CLI. Most phases used two sessions running in parallel: one architecting, one implementing.
- Subagent — a fresh, isolated Claude session dispatched from an orchestrating session to do one task. Subagents start with an empty context window, receive a spec, produce artifacts, and exit. They are a build-time workflow pattern, not a runtime concept.
- Runtime agent — an autonomous process inside aaOS (a running `agentd` container). Runtime agents have capability tokens, message channels, and an LLM backing them. The Bootstrap Agent and its children are runtime agents.
- Bootstrap Agent — the single runtime agent that starts when `agentd` boots in bootstrap mode. It receives high-level goals over a Unix socket, decomposes them, spawns child runtime agents with narrower capabilities, and coordinates their work. Introduced in Phase D. Powered by Claude Sonnet in Phase D; switched to DeepSeek Reasoner after Phase E1 added OpenAI-compatible API support.
- Capability token — an unforgeable permission object issued by the kernel at agent spawn. Every tool call is validated against the caller's tokens. Tokens can be narrowed but never escalated. Children inherit a subset of their parent's capabilities.
- MCP — Model Context Protocol. A JSON-RPC 2.0 message format Anthropic published for agent-to-agent and agent-to-tool communication. aaOS uses MCP envelopes for inter-agent messaging.
- PID 1 — on Linux, process ID 1 is the container's init process. In aaOS, `agentd` runs as PID 1, meaning the daemon is the container — when `agentd` exits, the container exits.
- Skill — a named instruction bundle (per the AgentSkills open standard). A skill is a folder with `SKILL.md` holding YAML frontmatter and markdown workflow instructions. Runtime agents load skills on demand via the `skill_read` tool.
- Peer review — sending a design spec or plan to an external LLM CLI (Copilot CLI running GPT-5.4, or Qwen CLI) for independent critique before implementation. Started in Phase B.
aaOS switched LLM providers as requirements evolved:
| Phase | Orchestration model | Worker model | Why |
|---|---|---|---|
| A (build-time only) | Claude (Anthropic API) | — | Prototype runtime, no runtime agents yet |
| B–D | Claude Sonnet / Haiku | Haiku | First autonomous demos; Anthropic was the only supported API |
| E1+ | DeepSeek Reasoner | DeepSeek Chat | OpenAiCompatibleClient unlocked cheaper providers |
Anthropic remains supported as a fallback: the daemon checks `DEEPSEEK_API_KEY` first and falls back to `ANTHROPIC_API_KEY`.
Numbers verified from git as of the landmark commits listed. Two counts shown: prod = non-test code (what earlier docs called "~4,000 lines" and similar), total = prod + test code in the same files. The original retrospective's headline numbers (4K → 8K → 12K → 13K) were production-only counts; this table records both so the methodology is legible.
| Stage | Landmark commit | Crates | Prod lines | Total lines | #[test] markers |
|---|---|---|---|---|---|
| A (runtime prototype) | 029d90b | 6 | 3,917 | 6,151 | 111 |
| B (persistent agents) | 464d8fc | 6 | 5,160 | 8,042 | 141 |
| C (agent memory) | 8d2efba | 7 | 7,630 | 11,693 | 206 |
| D (self-bootstrapping) | db641dd | 7 | 7,926 | 12,041 | 210 |
| April 13 end (E + skills + revocation + constraints) | 66542bf | 7 | 9,451 | 14,458 | 258 |
| Current (after run 5 response) | 58930f0 | 7 | 9,675 | 14,797 | 264 |
Commit timestamps: Phase A released 2026-03-21 17:13 (first commit). Everything from Phase B through the April 13 end state landed in a single calendar day: 2026-04-13 10:47 → 22:37 (~11h 50m of activity). Subsequent tuning happened on 2026-04-14.
The original design brief estimated 3–5 people and 3 months to reach a working demo. What actually happened: one human operator, two Claude sessions, ~48 hours of concentrated work, landing as commit 029d90b on 2026-03-21 at 17:13.
The "48 hours" figure is the user's recalled wall-clock span (design through working prototype). Git only records the landing commit; the 48-hour count is load-bearing context, not a timestamp assertion.
A working agent runtime prototype: 6 Rust crates, 3,917 production lines (6,151 total with tests), 111 tests. Agent kernel with capability-based security, tool registry with two-level enforcement (tool access check + resource path check), LLM execution loop, agent orchestration with capability narrowing, human-in-the-loop approval queue, and MCP message routing. Verified end-to-end against the real Anthropic API.
The capability system shipped in Phase A and has anchored every later phase. An agent declares capabilities in its YAML manifest (e.g., file_read: /data/*). At spawn time, the runtime issues unforgeable tokens. Every tool invocation is checked against the caller's tokens before execution. Child agents inherit only a subset of what the parent holds. This is the foundation that makes later features — spawn_agent, workspace isolation, skill execution — secure.
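As a rough illustration of that flow — illustrative names only, not the actual aaOS types — the two-level check and the narrowing rule can be sketched as:

```rust
// Hypothetical sketch of the capability model described above: a token scopes
// one tool to a glob, checks run before every invocation, and a child token
// is only valid if a parent token already covers it.
#[derive(Clone, Debug)]
struct Capability {
    tool: String,  // e.g. "file_read"
    scope: String, // e.g. "/data/*"
}

/// Minimal glob: a trailing `*` matches any suffix; otherwise exact match.
fn scope_matches(scope: &str, path: &str) -> bool {
    match scope.strip_suffix('*') {
        Some(prefix) => path.starts_with(prefix),
        None => scope == path,
    }
}

/// Two-level enforcement: tool access check, then resource path check.
fn check(tokens: &[Capability], tool: &str, path: &str) -> bool {
    tokens.iter().any(|c| c.tool == tool && scope_matches(&c.scope, path))
}

/// Narrowing rule: a child capability is allowed only if some parent
/// capability already covers it — tokens can narrow, never escalate.
fn narrow_allowed(parent: &[Capability], child: &Capability) -> bool {
    parent
        .iter()
        .any(|p| p.tool == child.tool && scope_matches(&p.scope, &child.scope))
}

fn main() {
    let parent = vec![Capability { tool: "file_read".to_string(), scope: "/data/*".to_string() }];
    assert!(check(&parent, "file_read", "/data/report.txt")); // tool + path allowed
    assert!(!check(&parent, "file_write", "/data/report.txt")); // tool not granted
    let narrowed = Capability { tool: "file_read".to_string(), scope: "/data/reports/*".to_string() };
    assert!(narrow_allowed(&parent, &narrowed)); // narrowing is fine
    let escalated = Capability { tool: "file_read".to_string(), scope: "/etc/*".to_string() };
    assert!(!narrow_allowed(&parent, &escalated)); // escalation refused
    println!("capability checks passed");
}
```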
Phase A also introduced spawn_tool (the built-in tool that lets an agent create child agents) and the child ⊆ parent capability enforcement. Both became load-bearing in Phase D.
Continuous design–build–validate loop. No context-switching overhead between designing and implementing. One Claude session produced a spec, another Claude session built it, tests verified it, cycle repeated. Each iteration took 30–60 minutes, not days.
Two sessions checking each other's work. The architecture session caught design issues before they became code:
- Cap'n Proto dropped — MCP is JSON-RPC 2.0; using a different serialization format would fight the protocol stack. The original brief pattern-matched on "serious OS needs serious serialization" without thinking about the actual wire format.
- Firecracker deferred — can't meaningfully isolate agents that don't execute yet. Isolation is a Phase B concern.
- Circular dependency in `AgentServices` trait placement — putting it in `aaos-core` would have forced core to depend on `aaos-ipc` and `aaos-tools`, creating cycles. Fixed by using `serde_json::Value` for messaging and moving `ToolDefinition` to core.
- Approval queue dependency direction — the trait goes in core, the implementation in agentd. Caught before any code was written.
Human provides vision; the model does implementation. The human operator decided what to build and in what order. The Claude sessions designed the interfaces, wrote the specs, wrote the code, wrote the tests, and debugged the issues. The human reviewed designs and made judgment calls. This division eliminated the usual bottleneck: human creativity is slow but irreplaceable for direction; model-driven implementation is fast and consistent for execution.
Subagent-driven development. Each discrete task was dispatched to a fresh Claude subagent with an isolated context window. The orchestrating session coordinated without accumulating implementation details, which kept its context clean across 10+ tasks.
- Cap'n Proto in the original brief. Pattern-matched on convention without analyzing the actual protocol requirements.
- `AgentServices` trait placement. Initially placed in `aaos-core`, which would have created circular dependencies. Caught by the architecture session's review before implementation began.
- `ToolInvocation` context passing. The initial execution loop didn't include `InvocationContext` — path-based capability checking wasn't possible without it. Emerged during the tools brainstorm as a necessary addition.
- File write append flush. The append test failed because `tokio::fs::OpenOptions` didn't flush before the test read the file. One-line fix, caught by TDD.
- Unused imports. Subagents occasionally left module-level imports that were only used in tests. Caught by clippy in cleanup passes.
- Build sequencing. "Build the execution engine first, ship it, then open the socket for the external protocol." The model would have built both simultaneously.
- Recognizing when the brief was wrong. Dropping Cap'n Proto and deferring Firecracker required seeing that the original design document had made incorrect assumptions.
- Approval via Unix socket API, not stdout. The easy path was printing to stdout. The architecturally correct path was the same JSON-RPC API that a future dashboard would use. The human chose the path that wouldn't need replacement.
- Fire-and-forget messaging for Phase A. Request-response messaging requires persistent agents with message processing loops. Scoping messaging to fire-and-forget proved the IPC layer worked without requiring infrastructure that didn't exist yet.
- Excluding `shell_exec`. The model designed it as one of the initial tools. The human identified it as a capability escape hatch that bypasses every other constraint in the system, and excluded it from scope.
- Docker isolation. The development machine runs production systems. The human mandated Docker for all aaOS development after the first real API test ran on bare metal.
The 48-hour build wasn't fast because corners were cut. It was fast because the design–build–validate loop had no idle time. Every hour was either designing, implementing, testing, or reviewing. The human never waited for the model; the model never waited for a decision. Specs were written before code. Tests were written before implementation. Reviews happened after every task.
Phase B commits: f263e2b (design spec, 2026-04-13 10:47) → 464d8fc (Phase A+B E2E test, 2026-04-13 11:39). 52 minutes of landed commits. A single orchestrating Claude session produced the spec and plan, then dispatched implementation tasks to fresh subagents.
At the end of Phase B: 6 crates, 5,160 production lines (8,042 total), 141 tests.
Three sub-specs implemented:
- Persistent agent lifecycle. Agents declared as `lifecycle: persistent` run a tokio background task (`persistent_agent_loop`) that processes messages sequentially from a channel, maintains conversation history in memory, and responds via a pending-response map on the router. Pause/Resume/Stop commands work. The loop survives executor errors without crashing.
- Request-response IPC. `DashMap<Uuid, oneshot::Sender<McpResponse>>` on `MessageRouter`. Callers register a oneshot channel keyed by trace_id, route the message, and await the response with a timeout. `send_and_wait()` on the `AgentServices` trait with capability enforcement.
- Conversation persistence. `SessionStore` trait with `JsonlSessionStore` (one JSONL file per agent). History loaded once at loop startup, appended after each turn, compacted every 10 turns. `max_history_messages` config for trimming. `run_with_history()` on `AgentExecutor` accepts prior messages and returns a transcript delta for storage.
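The request-response correlation pattern is simple enough to sketch with std primitives. The real implementation uses `DashMap`, `Uuid` trace IDs, and tokio `oneshot` channels; this stand-in uses a mutexed `HashMap` and `mpsc` to show only the register → route → await → complete logic:

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

type TraceId = u64;

// Pending-response map: trace_id -> sender half of a one-shot channel.
// (Illustrative stand-in for DashMap<Uuid, oneshot::Sender<McpResponse>>.)
#[derive(Clone, Default)]
struct Router {
    pending: Arc<Mutex<HashMap<TraceId, mpsc::Sender<String>>>>,
}

impl Router {
    /// Register interest, deliver the request, await the reply with a timeout.
    fn send_and_wait(&self, trace_id: TraceId, deliver: impl FnOnce(TraceId)) -> Option<String> {
        let (tx, rx) = mpsc::channel();
        self.pending.lock().unwrap().insert(trace_id, tx);
        deliver(trace_id); // hand the message to the target agent's queue
        let reply = rx.recv_timeout(Duration::from_secs(5)).ok();
        self.pending.lock().unwrap().remove(&trace_id); // cleanup on timeout too
        reply
    }

    /// Called from the responding agent's loop: route the reply by trace_id.
    fn complete(&self, trace_id: TraceId, response: String) {
        if let Some(tx) = self.pending.lock().unwrap().remove(&trace_id) {
            let _ = tx.send(response);
        }
    }
}

fn main() {
    let router = Router::default();
    let r2 = router.clone();
    // Simulate the persistent agent loop answering on another thread.
    let reply = router.send_and_wait(42, move |id| {
        thread::spawn(move || r2.complete(id, "pong".to_string()));
    });
    assert_eq!(reply.as_deref(), Some("pong"));
    println!("reply = {:?}", reply);
}
```

The timeout path matters: if the responder never answers, `recv_timeout` returns an error and the pending entry is removed, so the map can't leak oneshot senders.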
- Persistent loop not wired to spawn. The plan had `start_persistent_loop()` as a separate method, but `spawn_from_yaml()` in the server never called it. Messages were delivered to the channel but nobody was consuming them. Session store was empty. Caught by the live API smoke test — the most important test run this phase.
- `&AgentId` vs `AgentId`. `AgentId` is `Copy`, but the persistent loop passed it by reference to `run_with_history()`, which takes it by value. One-character fix, caught by the compiler.
- Binary crate can't be imported by integration tests. `agentd` was a `[[bin]]`-only crate. Integration tests couldn't reference `agentd::server::Server`. Fixed by adding a `[lib]` target and `lib.rs` re-exporting the modules.
- Unused imports. Subagents left `AgentServices`, `SessionStore`, `Mutex` imports that weren't used in the final code. Cleaned up after first compilation.
- "Did you implement all 3 sub-specs?" The human asked for explicit verification against the design spec before accepting the work as done.
- End-to-end coverage. The orchestrating session tested Phase B in isolation. The human recognized that Phase A + Phase B integration hadn't been verified and asked for a combined test. This surfaced the "persistent loop never started" bug.
- Reusing the live-API test harness. The human pointed back at the test-with-real-Anthropic-API pattern from an earlier project, ensuring the same verification standard applied.
- Multi-model peer review before implementation. The human insisted on external review of the Phase B design before any code was written. Both Qwen CLI and Copilot CLI (GPT-5.4) caught real issues: don't embed oneshot in McpMessage, load history once not per-message, executor must return transcript delta.
Phase B introduced subagent-driven development as a formal pattern: each of ~10 tasks dispatched to a fresh subagent, results verified, orchestrator's context kept clean. Also introduced multi-model peer review of design specs before any implementation.
Phase C commits: 054548f (design specs, 2026-04-13 12:49) → 8d2efba (E2E tests, 2026-04-13 15:34) with documentation wrap-up at 3047e0c/5820130 by 16:19. ~3h 30m of commits. Single orchestrating session via subagent-driven development. Three spec documents (C1, C2, C3), each peer-reviewed by Qwen CLI and Copilot CLI. Two of three sub-projects shipped; the third deferred by design.
At the end of Phase C: 7 crates (added aaos-memory), 7,630 production lines (11,693 total), 206 tests.
C1: Managed context windows. ContextManager sits between the persistent agent loop and the executor, transparently summarizing old messages when the context fills. TokenBudget parses human-readable sizes ("128k"), estimates context with a chars/4 heuristic, and triggers summarization at a configurable threshold (default 70%). Summary messages fold into the system prompt prefix (not as User messages — that would break API turn alternation). Tool call/result pairs are kept atomic during selection. Archives preserve raw messages on disk; TTL-based pruning prevents unbounded growth. Fallback to hard truncation on LLM failure.
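The budget arithmetic is small enough to sketch. `TokenBudget` here is illustrative (not the actual aaOS type), and it assumes "k" means 1,024:

```rust
/// Sketch of the chars/4 token heuristic and the 70% summarization trigger
/// described above. Illustrative only; the real ContextManager also handles
/// tool-call pair atomicity, archiving, and truncation fallback.
struct TokenBudget {
    max_tokens: usize,
    trigger_ratio: f64, // default 0.7
}

impl TokenBudget {
    /// Parse human-readable sizes like "128k" or "8000".
    /// (Assumption: "k" = 1,024.)
    fn parse(s: &str) -> Option<usize> {
        let s = s.trim().to_lowercase();
        match s.strip_suffix('k') {
            Some(n) => n.parse::<usize>().ok().map(|n| n * 1024),
            None => s.parse::<usize>().ok(),
        }
    }

    /// chars/4 heuristic summed over all message contents.
    fn estimate_tokens(messages: &[String]) -> usize {
        let chars: usize = messages.iter().map(|m| m.len()).sum();
        chars / 4
    }

    /// Summarize once the estimate crosses the configurable threshold.
    fn needs_summarization(&self, messages: &[String]) -> bool {
        Self::estimate_tokens(messages) as f64 >= self.max_tokens as f64 * self.trigger_ratio
    }
}

fn main() {
    let budget = TokenBudget { max_tokens: TokenBudget::parse("1k").unwrap(), trigger_ratio: 0.7 };
    let msgs = vec!["x".repeat(3000)]; // ~750 estimated tokens vs 716.8 trigger
    assert!(budget.needs_summarization(&msgs));
    println!("summarization triggered as expected");
}
```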
C2: Episodic memory store. New aaos-memory crate. MemoryStore trait with InMemoryMemoryStore — cosine similarity search, agent isolation, LRU cap eviction, replaces/update semantics, dimension mismatch handling. EmbeddingSource trait with MockEmbeddingSource (testing) and OllamaEmbeddingSource (production — nomic-embed-text via local Ollama, 768 dims). Three new tools: memory_store, memory_query, memory_delete.
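A minimal sketch of the query path — cosine similarity with agent isolation and dimension filtering (the same silent dimension filter implicated in the model-name mismatch bug). Names are illustrative, not the actual `MemoryStore` API:

```rust
/// Illustrative in-memory episodic record.
struct MemoryRecord {
    agent_id: String,
    embedding: Vec<f32>,
    content: String,
}

/// Cosine similarity; zero vectors score 0 rather than dividing by zero.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Top-k query: other agents' memories and dimension mismatches are
/// silently skipped — which is why a wrong embedding model drops everything.
fn query<'a>(store: &'a [MemoryRecord], agent: &str, q: &[f32], k: usize) -> Vec<&'a MemoryRecord> {
    let mut hits: Vec<(&MemoryRecord, f32)> = store
        .iter()
        .filter(|r| r.agent_id == agent && r.embedding.len() == q.len())
        .map(|r| (r, cosine(&r.embedding, q)))
        .collect();
    hits.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    hits.into_iter().take(k).map(|(r, _)| r).collect()
}

fn main() {
    let store = vec![
        MemoryRecord { agent_id: "a1".into(), embedding: vec![1.0, 0.0], content: "north".into() },
        MemoryRecord { agent_id: "a1".into(), embedding: vec![0.0, 1.0], content: "east".into() },
        MemoryRecord { agent_id: "a2".into(), embedding: vec![1.0, 0.0], content: "other agent".into() },
        MemoryRecord { agent_id: "a1".into(), embedding: vec![1.0, 0.0, 0.0], content: "wrong dims".into() },
    ];
    let hits = query(&store, "a1", &[1.0, 0.0], 1);
    assert_eq!(hits.len(), 1);
    assert_eq!(hits[0].content, "north");
    println!("best match: {}", hits[0].content);
}
```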
C3: Shared knowledge graph. Cross-agent knowledge sharing, designed but deferred. Both peer reviewers confirmed the deferral: "building shared infrastructure without proven demand is a classic trap."
A SqliteMemoryStore was added later (commit b8b7af4, 2026-04-13 21:11) for persistence across container restarts — listed here because it's part of the memory system's natural evolution, though it landed after Phase D.
- Circular dependency in the plan. The C1 plan put `estimate_tokens(&[Message])` on `TokenBudget` in `aaos-core`, but `aaos-llm` already depends on `aaos-core` — circular. Both Qwen and Copilot caught this independently. Fixed by moving message-aware functions to `aaos-runtime::context`.
- Prepared context never passed to executor. The C1 plan called `prepare_context()` after the executor ran, meaning summarization output was discarded for the LLM call. Both reviewers flagged this as the #1 blocker. Fixed by adding `run_with_history_and_prompt()` and restructuring the persistent loop to call `prepare_context` before the executor.
- Model name mismatch dropped all query results. The in-memory store filters retrieved memories by embedding dimension. `Server::new()` created the store expecting model `"nomic-embed-text"` (768 dims) but wired up `MockEmbeddingSource`, which reports model `"mock-embed"` at different dimensions. The filter silently discarded every result. Caught during integration test implementation.
- LanceDB API instability. Both reviewers flagged that the `lancedb` Rust crate has unstable APIs. Copilot recommended skipping it entirely in favor of SQLite+sqlite-vec. Decision: ship with `InMemoryMemoryStore`, use Ollama for real embeddings, defer the persistent vector store.
- "Ask Qwen for a review. Then ask Copilot for what Qwen proposes." The human established the multi-model peer review pattern that caught the three biggest bugs in the plans before any code was written.
- "The only thing that matters is to see visually if it works as designed." The human clarified that engineering correctness was delegated to the model + peer review; the human's validation criterion was observing the system work end-to-end.
- Cumulative E2E testing. The human insisted on a Phase A + B + C1 test, not just isolated C1 — same pattern as Phase B.
Phase C added multi-model peer review of design specs as a first-class step. Routing architecture decisions through Copilot produced concrete recommendations (skip LanceDB, use Ollama) that the primary Claude session would not have generated alone. Cost per review was described at the time as "~$0.02" — see reflection/cost-bookkeeping.md about per-run cost estimates.
Phase D commits: 87ff99a (agent count limits, 2026-04-13 16:43) → db641dd (doc updates, 17:18). ~35 minutes of commits. The roadmap originally had Phase D as a web dashboard. After reviewing Phase C results, the human redirected to "an agent runtime that boots and runs itself autonomously."
At the end of Phase D: 7 crates, 7,926 production lines (12,041 total), 210 tests.
A Docker container where agentd runs as the main process (PID 1 — the container's init, which means when agentd exits, the container exits). At startup, agentd spawns the Bootstrap Agent — the first and only runtime agent at boot, powered by Claude Sonnet. The Bootstrap Agent accepts goals over a Unix socket, decomposes them, and spawns child runtime agents (Haiku) with narrowed capabilities to execute them.
Three OS-level features landed this phase:
- Persistent Bootstrap loop with Unix socket goal queue. Additional goals can be sent to the running container without restarting it.
- Workspace isolation per goal. Each goal gets `/data/workspace/{name}/`. Child agents write intermediate files there.
- Automatic retry of failed child agents.
Plus safety guardrails: agent count limit (100), spawn depth limit (5), StdoutAuditLog for observability via docker logs -f.
First successful run: goal "fetch HN and summarize the top 5 stories" → Bootstrap Agent spawned a Fetcher agent (Haiku) that called web_fetch, spawned a Writer agent (Haiku) that called file_write, and produced /output/summary.txt. The capability system worked as designed — the Bootstrap Agent couldn't read /output/* even though its child wrote there. Wall time ~75 seconds. Cost was recorded at the time as ~$0.03 against Anthropic, though per-run cost figures from this period are token-math estimates — the authoritative cumulative figures are in reflection/cost-bookkeeping.md.
- Copilot review missed existing features. Copilot couldn't read `spawn_tool.rs` during the feasibility review due to file permission errors, and concluded the spawn_agent tool and child ⊆ parent capability enforcement didn't exist. They did — both shipped in Phase A. This wasted planning time on "gaps" that were already filled. Lesson: external reviewers are only as good as the context they can access.
- Model ID wrong in Bootstrap manifest. Used `claude-sonnet-4-6-20250725` (a dated variant) instead of `claude-sonnet-4-6`. Two container launches failed before this was caught.
- "Let aaOS build itself on boot." The human's vision for Phase D. The Claude sessions (and Copilot's review) suggested safer, smaller demos ("fetch 3 URLs"). The human pushed for the maximally ambitious version — a self-organizing system from a single Bootstrap Agent.
- "Is that not drifting away?" When Copilot recommended `shell_exec` and a CI demo, the human recognized this would turn aaOS into a developer tool, not a runtime. The human separated OS-level work (goal queue, workspace isolation, supervisor) from application-level work (shell_exec, CI).
Total new code in Phase D: ~330 lines. Most of the session time was spent on vision, direction, and peer review — not coding.
Phase E commits: f6b62a6 (E1 OpenAI-compat, 2026-04-13 19:40) → 2c74c4e (launcher fix, 20:29) for E1+E2; E3 budget enforcement landed as d76f16c, also at 20:29. ~50 minutes of commits for E1–E3. The security-audit cleanup, SQLite memory, and AgentSkills integration followed immediately after (20:48 → 21:47).
At the end of April 13 (after Phase E + skills + revocation + constraints + Bootstrap-uses-skills fix): 7 crates, 9,451 production lines (14,458 total), 258 tests.
Why this existed. Running autonomous agents against Anthropic's API was tolerable for demos but expensive for long-running fleets. DeepSeek's reasoner is cheaper, and offers context caching that further discounts cache-hit input tokens to roughly 10% of the normal rate (a detail that broke naive token-math cost estimates in earlier versions of this retrospective — see the cost note in reflection/cost-bookkeeping.md). The roadmap had Phase E as "inference scheduling" with an implicit assumption of local models (Ollama, vLLM); the human rejected the local-model premise in favor of cheap cloud APIs.
What was built. OpenAiCompatibleClient in aaos-llm implements the LlmClient trait for any provider that speaks the OpenAI Chat Completions format. Request translation (system-as-first-message, tool_calls as function format, role:"tool" for results). Response translation (choices[0].message, finish_reason mapping, prompt_tokens/completion_tokens). OpenAiCompatConfig::deepseek_from_env() constructor. 15 unit tests. The daemon checks DEEPSEEK_API_KEY first, falls back to ANTHROPIC_API_KEY.
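The request-side translation can be sketched with illustrative types. The real client serializes full JSON with `tool_calls` payloads; this sketch shows only the role mapping and the system-first ordering:

```rust
/// Illustrative internal message model (not the actual aaos-llm types).
enum Message {
    System(String),
    User(String),
    Assistant(String),
    ToolResult { call_id: String, output: String },
}

/// Flatten into OpenAI Chat Completions (role, content) order: the system
/// prompt becomes the first message, and tool results become role:"tool".
/// The call-id prefix is a stand-in for the real tool_call_id field.
fn to_openai(system: &str, history: &[Message]) -> Vec<(String, String)> {
    let mut out = vec![("system".to_string(), system.to_string())];
    for m in history {
        out.push(match m {
            Message::System(s) => ("system".to_string(), s.clone()),
            Message::User(s) => ("user".to_string(), s.clone()),
            Message::Assistant(s) => ("assistant".to_string(), s.clone()),
            Message::ToolResult { call_id, output } => {
                ("tool".to_string(), format!("[{call_id}] {output}"))
            }
        });
    }
    out
}

fn main() {
    let history = vec![
        Message::User("fetch HN".to_string()),
        Message::Assistant("calling web_fetch".to_string()),
        Message::ToolResult { call_id: "t1".to_string(), output: "ok".to_string() },
    ];
    let req = to_openai("You are the Bootstrap Agent.", &history);
    assert_eq!(req[0].0, "system"); // system prompt always first
    assert_eq!(req[3].0, "tool");   // tool results map to role:"tool"
    println!("{} messages in OpenAI order", req.len());
}
```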
Bootstrap switched from Claude Sonnet to DeepSeek Reasoner. Worker children switched from Haiku to DeepSeek Chat. The rest of the runtime didn't change — the LlmClient trait was the clean abstraction Phase A had built.
Self-designing demo. With the new client online, the human gave Bootstrap the goal: "Read the aaOS architecture and roadmap, then design the next phase of inference scheduling." Bootstrap produced three spec files autonomously. That demo is documented in full in reflection/ under the rubric of earlier self-designing runs.
ScheduledLlmClient decorator wraps any LlmClient with a tokio::sync::Semaphore (default max 3 concurrent API calls). Optional rate smoothing via a configurable minimum delay between calls. Env vars: AAOS_MAX_CONCURRENT_INFERENCE, AAOS_MIN_INFERENCE_DELAY_MS. Both Bootstrap and normal daemon modes use the scheduler. 4 new tests.
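The at-most-N property is easy to demonstrate with a blocking stand-in for the semaphore (the real decorator uses `tokio::sync::Semaphore` in async code; this std-only sketch shows the same permit discipline):

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Blocking counting semaphore: at most `n` simulated API calls run
/// concurrently; the rest wait for a permit. Illustrative stand-in only.
struct Semaphore {
    permits: Mutex<usize>,
    cv: Condvar,
}

impl Semaphore {
    fn new(n: usize) -> Self {
        Semaphore { permits: Mutex::new(n), cv: Condvar::new() }
    }
    fn acquire(&self) {
        let mut p = self.permits.lock().unwrap();
        while *p == 0 {
            p = self.cv.wait(p).unwrap();
        }
        *p -= 1;
    }
    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.cv.notify_one();
    }
}

fn main() {
    let sem = Arc::new(Semaphore::new(3)); // like AAOS_MAX_CONCURRENT_INFERENCE=3
    let in_flight = Arc::new(Mutex::new((0usize, 0usize))); // (current, max seen)
    let handles: Vec<_> = (0..10)
        .map(|_| {
            let (sem, in_flight) = (sem.clone(), in_flight.clone());
            thread::spawn(move || {
                sem.acquire();
                {
                    let mut f = in_flight.lock().unwrap();
                    f.0 += 1;
                    f.1 = f.1.max(f.0);
                }
                thread::sleep(Duration::from_millis(10)); // simulated API call
                in_flight.lock().unwrap().0 -= 1;
                sem.release();
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let max_seen = in_flight.lock().unwrap().1;
    assert!(max_seen <= 3); // never more than 3 calls in flight
    println!("max concurrent calls observed: {}", max_seen);
}
```

The rate-smoothing delay (`AAOS_MIN_INFERENCE_DELAY_MS`) would be a sleep between acquire and the call; it is omitted here.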
Per-agent token budgets declared in manifests (budget_config: { max_tokens, reset_period_seconds }). BudgetTracker uses atomic CAS operations for lock-free tracking. Wired into InProcessAgentServices::report_usage() — agents exceeding their budget get BudgetExceeded errors. Optional: agents without budget_config have no enforcement. 5 new tests.
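The lock-free CAS pattern can be sketched as follows — illustrative, not the actual `BudgetTracker`:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Sketch of lock-free budget accounting via compare-and-swap, per the
/// description above. Time-based reset handling is reduced to a manual
/// `reset()` for illustration.
struct BudgetTracker {
    used: AtomicU64,
    max_tokens: u64,
}

#[derive(Debug, PartialEq)]
enum BudgetError {
    BudgetExceeded,
}

impl BudgetTracker {
    fn new(max_tokens: u64) -> Self {
        BudgetTracker { used: AtomicU64::new(0), max_tokens }
    }

    /// Reserve `tokens` atomically; fail with no side effects if the
    /// reservation would push usage past the cap.
    fn report_usage(&self, tokens: u64) -> Result<(), BudgetError> {
        let mut current = self.used.load(Ordering::Relaxed);
        loop {
            let next = current + tokens;
            if next > self.max_tokens {
                return Err(BudgetError::BudgetExceeded);
            }
            match self
                .used
                .compare_exchange_weak(current, next, Ordering::SeqCst, Ordering::Relaxed)
            {
                Ok(_) => return Ok(()),
                Err(actual) => current = actual, // another thread won; retry
            }
        }
    }

    /// Stand-in for the reset_period_seconds rollover.
    fn reset(&self) {
        self.used.store(0, Ordering::SeqCst);
    }
}

fn main() {
    let tracker = BudgetTracker::new(100);
    assert_eq!(tracker.report_usage(60), Ok(()));
    assert_eq!(tracker.report_usage(50), Err(BudgetError::BudgetExceeded));
    tracker.reset();
    assert_eq!(tracker.report_usage(50), Ok(()));
    println!("budget enforcement behaved as expected");
}
```

The CAS loop is what makes the check-then-add race-free without a mutex: two agents reporting usage simultaneously can't both sneak under the cap.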
Commit 82d19e9 (2026-04-13 20:52) — "security: fix 4 vulnerabilities found by self-audit" — marks the integration point for the security self-audit. The audit itself was a runtime agent run; the details of what the runtime found and how it was judged are in reflection/2026-04-13-run-1-security-self-audit.md. The four vulnerabilities fixed in-code were: path traversal in glob_matches, unknown tools receiving all capability tokens, child tokens ignoring parent constraints, and no path canonicalization in file tools.
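The path-traversal class of bug is worth a sketch: lexically normalize the requested path before any scope check, so `..` segments can't escape a granted prefix. This is an illustrative pattern under the assumption of simple prefix scopes, not the actual fixed code:

```rust
/// Lexically normalize an absolute path: drop "." and empty segments,
/// resolve ".." by popping. Popping past the root means the caller tried
/// to escape entirely, so we refuse (None).
fn normalize(path: &str) -> Option<String> {
    let mut parts: Vec<&str> = Vec::new();
    for seg in path.split('/') {
        match seg {
            "" | "." => {}
            ".." => {
                parts.pop()?; // traversal above root -> reject
            }
            s => parts.push(s),
        }
    }
    Some(format!("/{}", parts.join("/")))
}

/// Check the normalized path against a scope prefix — never the raw input.
fn scope_allows(scope_prefix: &str, requested: &str) -> bool {
    match normalize(requested) {
        Some(p) => p.starts_with(scope_prefix),
        None => false,
    }
}

fn main() {
    // The attack the audit described: "/data/../etc/passwd" against "/data/*".
    assert_eq!(normalize("/data/../etc/passwd").as_deref(), Some("/etc/passwd"));
    assert!(!scope_allows("/data/", "/data/../etc/passwd")); // blocked
    assert!(scope_allows("/data/", "/data/reports/q1.txt")); // legitimate
    assert_eq!(normalize("/../x"), None); // escaped above root entirely
    println!("traversal checks passed");
}
```

A filesystem-backed variant would canonicalize through the real path (symlinks included), which lexical normalization alone cannot do.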
Commits 5a0a42e (skill loader) → 5c023fd (bundle 21 skills) → 7bdb1bb (doc updates), 2026-04-13 21:28 → 21:40. ~12 minutes of commits. aaos-core::skill parses SKILL.md files per the AgentSkills specification. SkillRegistry manages loaded skills. skill_read tool serves full instructions and reference files with path traversal protection. Skill catalog injected into agent system prompts at spawn time (progressive disclosure tier 1). 21 production-grade skills bundled from addyosmani/agent-skills.
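The SKILL.md shape can be sketched as a frontmatter split on the `---` delimiters. Hand-rolled key/value parsing here is for illustration only; a real loader would use a YAML parser:

```rust
/// Split a SKILL.md file into (frontmatter key/value pairs, markdown body).
/// Assumes the "---\n...\n---\n" frontmatter convention; returns None for
/// files that don't follow it.
fn parse_skill(raw: &str) -> Option<(Vec<(String, String)>, String)> {
    let rest = raw.strip_prefix("---\n")?;
    let (front, body) = rest.split_once("\n---\n")?;
    let meta = front
        .lines()
        .filter_map(|l| l.split_once(':'))
        .map(|(k, v)| (k.trim().to_string(), v.trim().to_string()))
        .collect();
    Some((meta, body.trim_start().to_string()))
}

fn main() {
    // Hypothetical skill file, not one of the 21 bundled skills.
    let raw = "---\nname: changelog\ndescription: Maintain a changelog\n---\n# Workflow\nSteps...";
    let (meta, body) = parse_skill(raw).unwrap();
    assert_eq!(meta[0], ("name".to_string(), "changelog".to_string()));
    assert!(body.starts_with("# Workflow"));
    println!("loaded skill: {}", meta[0].1);
}
```

Progressive disclosure tier 1 corresponds to injecting only the frontmatter (name + description) into the system prompt; the body is served later by `skill_read` on demand.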
The "named agents after skills but never actually loaded them" bug — discovered in a self-reflection run later that same evening — was fixed in 66542bf (2026-04-13 22:37) by updating the Bootstrap manifest with explicit skill_read instructions.
Commits f1732d9 (revocation, 22:07) and f106d97 (max_invocations enforcement, 22:34). Both landed as integrations of findings from self-reflection runs. See reflection/ for the run-level detail of how those findings came about.
Phases B through E, plus security self-audit, SQLite memory, AgentSkills integration, revocation, constraint enforcement, and the Bootstrap-uses-skills fix all happened on a single calendar day (2026-04-13 10:47 → 22:37). That is not "48 hours like Phase A" or "a few hours per phase" — it is one long sustained day of design → implement → review → ship cycles, with multiple self-reflection runs woven through.
The build process itself went through phases:
- Phase A: Continuous design–build–validate loop across two Claude sessions.
- Phase B: Added subagent-driven development and multi-model peer review.
- Phase C: Formalized the multi-model peer review step with alternative-evaluation prompts.
- Phase D: Shifted from "human specifies, model implements" to "human casts vision, model + reviewers converge on scope." Peer review filtered by human judgment against the vision.
- Phase E and after: The runtime started reading its own code. Self-reflection runs began — documented in `reflection/`.
The remaining chronicle — Runs 1 through 5, what they found, what shipped, what we corrected about our own cost estimates along the way — lives in reflection/. Cross-cutting lessons that apply beyond any single run or phase are in patterns.md.