The first agent harness where long-term memory is the loop, not an afterthought.
Helios is a memory-first Rust agent harness built on Lunaris (bi-temporal memory), Moon (Redis-compatible MVCC data substrate), and CubeSandbox (KVM microVM isolation, < 60 ms cold start). Every thought, tool call, test result, and plan revision is a durable, time-stamped, cross-session-recallable Episode. Default configuration ships two-tier model orchestration (expensive planner + cheap executor) and a built-in benchmark harness targeting the SWE-Bench Pro / τ-Bench / GAIA leaderboards on Scale SEAL's standardized methodology — not vendor-reported numbers.
Not a Claude Code clone. Claude Code is referenced for specific correctness invariants (saw_tool_use_block dispatch, :ping keepalive, typestate session). Everything else is designed for the leaderboard.
Status: pre-alpha, foundation phase. Planning artifacts are complete (see .planning/PROJECT.md for thesis, .planning/REQUIREMENTS.md for 42 v1 requirements, .planning/ROADMAP.md for 9 phases). Code lands phase-by-phase via GSD-managed execution; Phase 1 (workspace + helios-core types) is the first executable phase.
Every existing agent harness hits the same wall: the harness is only as smart as the context it surfaces, and transcript management is lossy. When compaction happens, facts evaporate. When tools run on the wrong version of a file, they hallucinate. When the agent revisits the same subproblem, it re-solves it blind.
Helios's bet: memory is the loop. Five primitives no existing harness combines:
| # | Primitive | What Claude Code / Aider / Devin has | What Helios adds |
|---|---|---|---|
| 1 | Cross-session recall | Per-session transcript only | Hybrid RRF (Vector + BM25 + cross-encoder rerank) across all prior sessions + repo corpus |
| 2 | Time-travel replay | None | helios replay --session S --as-of T via Lunaris as_of(Hlc) + Moon TEMPORAL.SNAPSHOT_AT |
| 3 | Consolidation, not truncation | Lossy compaction | Facts promoted via Lunaris auto-consolidator pipeline (__lunaris_consolidate__ topic, 60 s debounce) |
| 4 | Grounded tools | Filesystem-first | Read / Grep / Edit hit Lunaris HeliosScratchpad first; filesystem fallback |
| 5 | Plan → Act → Verify → Reflect | Single-phase loop | Explicit four-phase loop with mandatory Verifier trait |
See .planning/PROJECT.md for the long-form argument.
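The fusion step behind primitive 1 can be illustrated with plain Reciprocal Rank Fusion. This is a minimal sketch, not the Helios or Lunaris implementation: the function name `rrf_fuse`, the episode ids, and the constant `k = 60.0` are assumptions made for illustration, and the cross-encoder rerank stage is left out.

```rust
use std::collections::HashMap;

/// Fuse several best-first ranked lists of episode ids with Reciprocal Rank
/// Fusion: each list adds 1 / (k + rank) to the score of every id it contains.
/// `rrf_fuse` is a hypothetical name, not an API from Helios or Lunaris.
pub fn rrf_fuse(rankings: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for ranking in rankings {
        for (idx, id) in ranking.iter().enumerate() {
            // Ranks are 1-based, so the top hit contributes 1 / (k + 1).
            let rank = (idx + 1) as f64;
            *scores.entry((*id).to_string()).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    // Highest score first; ties broken by id so the output is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}
```

Calling `rrf_fuse(&[vector_hits, bm25_hits], 60.0)` returns one merged list. An id that sits near the top of both inputs outranks an id that tops only one of them, which is the property that makes the fusion useful ahead of a rerank pass.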
helios/
├── crates/
│ ├── helios-core/ # Domain: pure types, Session<S> typestate, trait contracts
│ ├── helios-memory/ # Infra: LunarisPort wrapping HeliosScratchpad + hybrid RRF recall
│ ├── helios-provider/ # Infra: Model trait + Anthropic + OpenAI + MockModel + TwoTierModel<P,E>
│ ├── helios-tools/ # Infra: Read/Write/Edit/Grep (memory-first) + Bash + Verifier trait
│ ├── helios-engine/ # Application: Plan → Act → Verify → Reflect loop + policy + hooks + subagents
│ ├── helios-sandbox/ # Infra: CubeSandbox HTTP client + LocalPtySandbox fallback
│ ├── helios-eval/ # Infra: SWE-Bench-Pro + τ-Bench + GAIA runners with replay + cost tracking
│ ├── helios-cli/ # Binary: the `helios` command
│ ├── helios-tui/ # Binary surface: ratatui interactive mode with plan-tree visualization
│ └── helios-sdk-py/ # Binding: PyO3 Python SDK
├── benches/ # Criterion benchmarks (cross-crate)
├── xtask/ # Build automation (cargo xtask check-layers)
└── docs/adr/ # Architecture Decision Records
Layered architecture rules — cargo xtask check-layers enforces these:
| Crate | May import | Role |
|---|---|---|
| helios-core | std, serde, thiserror, uuid, chrono, async-trait, tokio-util | Domain types, errors, trait contracts |
| helios-engine | helios-core | Plan → Act → Verify → Reflect orchestration |
| helios-memory / -provider / -tools / -sandbox / -eval | helios-core (+ lunaris for -memory) | Trait implementations |
| helios-cli / -tui / -sdk-py | everything above | Wiring only; no business logic |
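The layering rules can be read as a small allowlist check. The sketch below is one possible shape for such a check, not the actual `cargo xtask check-layers` code: the names `Rule`, `rule_for`, and `layer_violations` are hypothetical, and it assumes that only workspace-internal edges (helios-* crates and lunaris) are policed while third-party crates are ignored.

```rust
/// What a crate is allowed to depend on inside the workspace (hypothetical type).
enum Rule {
    /// Only the listed workspace crates are allowed.
    Only(&'static [&'static str]),
    /// Wiring crates may depend on everything.
    Any,
}

/// Map a crate name to its rule, following the layering table.
fn rule_for(krate: &str) -> Rule {
    match krate {
        "helios-core" => Rule::Only(&[]),
        "helios-engine" => Rule::Only(&["helios-core"]),
        "helios-memory" => Rule::Only(&["helios-core", "lunaris"]),
        "helios-provider" | "helios-tools" | "helios-sandbox" | "helios-eval" => {
            Rule::Only(&["helios-core"])
        }
        "helios-cli" | "helios-tui" | "helios-sdk-py" => Rule::Any,
        // Assumption: unknown crates get the strictest rule in this sketch.
        _ => Rule::Only(&[]),
    }
}

/// Return one message per forbidden workspace-internal dependency.
/// Third-party crates (serde, tokio, ...) are skipped in this sketch.
pub fn layer_violations(krate: &str, deps: &[&str]) -> Vec<String> {
    let allowed = match rule_for(krate) {
        Rule::Any => return Vec::new(),
        Rule::Only(allowed) => allowed,
    };
    let mut violations = Vec::new();
    for dep in deps {
        let is_layered = dep.starts_with("helios-") || *dep == "lunaris";
        if is_layered && !allowed.contains(dep) {
            violations.push(format!("{krate} must not depend on {dep}"));
        }
    }
    violations
}
```

With these rules, `layer_violations("helios-engine", &["helios-core", "helios-tools"])` reports the engine reaching into an infrastructure crate, while the same dependency list on helios-cli passes.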
┌──────────┐ ┌────────┐ ┌──────────┐ ┌──────────┐
│ Plan │ → │ Act │ → │ Verify │ → │ Reflect │
│ │ │ │ │ │ │ │
│ planner │ │executor│ │ Verifier │ │ planner │
│ model │ │ model │ │ trait │ │ revises │
└──────────┘ └────────┘ └──────────┘ └──────────┘
↑ │
└──────── revision path (on verify fail) ──────┘
Each phase is a Session<S> typestate marker. Illegal transitions are compile errors. Plan and Reflect dispatch to the planner model (expensive, deep); Act dispatches to the executor (cheap, fast); Verify runs the Verifier trait (default TestRunnerVerifier runs tests in CubeSandbox).
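A typestate session of this kind might look like the following. It is a sketch under stated assumptions, since helios-core has not landed yet: the marker names, the `steps` trace field, and the method names (`act`, `verify`, `reflect`, `replan`) are hypothetical and exist only to show how illegal transitions are rejected at compile time.

```rust
use std::marker::PhantomData;

// Zero-sized phase markers (hypothetical names mirroring the four phases).
pub struct Plan;
pub struct Act;
pub struct Verify;
pub struct Reflect;

/// A session whose current phase is tracked in the type parameter `S`.
pub struct Session<S> {
    /// Trace of phases entered so far; kept only to make the sketch observable.
    pub steps: Vec<&'static str>,
    _phase: PhantomData<S>,
}

impl<S> Session<S> {
    // Private helper: record the phase being entered and change the marker type.
    fn advance<T>(mut self, entered: &'static str) -> Session<T> {
        self.steps.push(entered);
        Session { steps: self.steps, _phase: PhantomData }
    }
}

impl Session<Plan> {
    /// Every session starts in the Plan phase.
    pub fn new() -> Self {
        Session { steps: vec!["plan"], _phase: PhantomData }
    }
    pub fn act(self) -> Session<Act> {
        self.advance("act")
    }
}

impl Session<Act> {
    pub fn verify(self) -> Session<Verify> {
        self.advance("verify")
    }
}

impl Session<Verify> {
    pub fn reflect(self) -> Session<Reflect> {
        self.advance("reflect")
    }
}

impl Session<Reflect> {
    /// Revision path: go back to Plan after a failed verification.
    pub fn replan(self) -> Session<Plan> {
        self.advance("plan")
    }
}
```

Because each transition consumes `self` and exists only on the matching `Session<S>`, a call such as `Session::new().verify()` does not compile: `verify` is defined on `Session<Act>`, not on `Session<Plan>`.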
| Benchmark | Current SOTA (April 2026) | Helios v0.1.0-alpha target |
|---|---|---|
| SWE-Bench Pro (SEAL) | GPT-5.4 @ 59.1 %, Claude Opus 4.5 @ 45.9 % | ≥ 50 % |
| τ-Bench Retail | Claude 3.5 Sonnet @ ~81 % | ≥ 85 % |
| τ-Bench Airline | — | ≥ 70 % |
| GAIA Level-1 | — | ≥ 65 % |
| Agent-loop overhead | — | ≤ 15 ms p50 |
| Hybrid RRF recall p99 | — | ≤ 200 ms over 10 k-episode corpus |
| Sandbox cold start p95 | — | ≤ 100 ms end-to-end |
| Consolidation freshness | — | ≤ 60 s from ingest |
| Stream watchdog accuracy | — | 100 % on 10 k chaos runs |
| Subagent OOM isolation | — | 100 % on 10 k injection runs |
Note on SWE-Bench Verified (Anthropic-reported 79 % on Sonnet 4.6 / 80.8 % on Opus 4.6): OpenAI's 2026 audit found training contamination across every frontier model, so these numbers are inflated. Helios's helios eval emits per-instance contamination warnings per that methodology.
| Phase | Scope | Time |
|---|---|---|
| 1 | Core types + Session<S> typestate + trait contracts | 1 d |
| 2 | Memory via Lunaris (LunarisPort + HeliosScratchpad + RRF recall + auto-consolidation) | 1 d |
| 3 | Providers (Model + Anthropic + OpenAI + Mock + TwoTierModel<P,E>) | 2 d |
| 4 | Memory-first tools + Verifier trait + TestRunnerVerifier | 1.5 d |
| 5 | Agent engine (keystone): Plan → Act → Verify → Reflect loop | 4 d |
| 6 | Sandbox (CubeSandbox Day-0 + LocalPty fallback) | 1.5 d |
| 7 | helios-eval benchmark harness | 2 d |
| 8 | CLI + Python SDK | 1.5 d |
| 9 | TUI + alpha ship-gate benchmark run | 2.5 d |
Total: ~17 engineer-days, ~3–4 weeks with review and iteration.
Full plan: .planning/ROADMAP.md.
Deferred to v0.2.0: Claude Code one-way settings importer, MCP client + server, TypeScript SDK, control plane, additional sandbox backends (Firecracker / Docker / E2B), Wasm hooks, reactive compaction, CubeSandbox snapshot rollback integration.
Prerequisites:
- Rust 1.94 (rust-toolchain.toml pins this automatically)
- Linux x86_64 with KVM for production sandboxing (CubeSandbox daemon running locally on http://127.0.0.1:3000). macOS / non-KVM hosts fall back to LocalPtySandbox for dev.
- Sibling checkouts: lunaris and moon at ../lunaris and ../moon.
- For Python SDK: Python 3.9–3.13.
Build:
cargo build --workspace
cargo test --workspace --all-features
cargo clippy --workspace -- -D warnings
cargo xtask check-layers
First run (post-alpha, Phase 8 or later):
helios -p "summarize docs and find TODOs" # headless agent, two-tier default
helios eval swe-pro --subset 10 --two-tier # benchmark run, 10-task subset
helios replay --session <id> --as-of 2026-04-22T10:30:00Z # time-travel (requires live Lunaris)
helios # interactive TUI (Phase 9+)
During the build phase, use cargo check -p helios-core from Phase 1 onward; the helios binary doesn't exist until Phase 8.
Helios is informed by three sibling checkouts (read-only) plus one vendored HTTP dep:
- moon: Redis-compatible data substrate with production MVCC (TEMPORAL.SNAPSHOT_AT), vector HNSW + DiskANN, graph + Cypher. Single binary + library crate. Helios consumes it transitively via lunaris-storage-moon over the RESP wire protocol (not as a direct Rust dep).
- lunaris: bi-temporal memory layer built on Moon. Helios's LunarisPort wraps lunaris::Lunaris and lunaris::recipes::HeliosScratchpad (pre-built Helios integration).
- claude-code: reference TypeScript harness. Used for correctness invariants only (saw_tool_use_block dispatch, :ping keepalive, typestate session pattern). Not a wire-protocol / hook-envelope compatibility contract in v0.1.0-alpha.
- CubeSandbox: KVM microVM sandbox, Apache 2.0, E2B-compatible HTTP API. Consumed over HTTP (not as a Rust library). Production-validated at Tencent Cloud.
- Read CLAUDE.md before writing any code. It's the behavioral contract for contributors (and AI assistants) working in this repo.
- Before Phase 1, the repo has no crates/, no Cargo.toml, no src/. That's intentional; don't invent a skeleton ahead of the plan.
- Planning artifacts live under .planning/. Build via /gsd-plan-phase, execute via /gsd-execute-phase. Current state: roadmap complete, Phase 1 ready to plan.
- Commit format: <type>(<scope>): <summary> with body and author: Tin Dang footer. See CLAUDE.md for the full template.
Apache-2.0 OR MIT (dual license, contributor's choice). CubeSandbox dep is Apache 2.0, compatible.
- Current thesis: .planning/PROJECT.md (memory-first pivot, 2026-04-21)
- Requirements: .planning/REQUIREMENTS.md (42 v1 items, 100 % phase-mapped)
- Roadmap: .planning/ROADMAP.md (9 phases, ~17 engineer-days)
- Primary-source map: .planning/architect/REFERENCES.md (Lunaris / Moon / CubeSandbox / claude-code capability pointers)
- Historical snapshots: .planning/architect/blueprint.md + .planning/architect/baseplan/ (pre-pivot design; read for Rust patterns, not current scope)
- ADRs: docs/adr/ (created at Phase 1)
- Contributor guide: CLAUDE.md