Helios

The first agent harness where long-term memory is the loop, not an afterthought.

Helios is a memory-first Rust agent harness built on Lunaris (bi-temporal memory), Moon (Redis-compatible MVCC data substrate), and CubeSandbox (KVM microVM isolation, < 60 ms cold start). Every thought, tool call, test result, and plan revision is stored as a durable, time-stamped Episode that can be recalled across sessions. The default configuration ships two-tier model orchestration (expensive planner + cheap executor) and a built-in benchmark harness targeting the SWE-Bench Pro, τ-Bench, and GAIA leaderboards under Scale SEAL's standardized methodology rather than vendor-reported numbers.

Not a Claude Code clone. Claude Code is referenced for specific correctness invariants (saw_tool_use_block dispatch, :ping keepalive, typestate session). Everything else is designed for the leaderboard.

Status: pre-alpha, foundation phase. Planning artifacts are complete (see .planning/PROJECT.md for thesis, .planning/REQUIREMENTS.md for 42 v1 requirements, .planning/ROADMAP.md for 9 phases). Code lands phase-by-phase via GSD-managed execution — Phase 1 (workspace + helios-core types) is the first executable phase.

Why Helios

Every existing agent harness hits the same wall: the harness is only as smart as the context it surfaces, and transcript management is lossy. When compaction happens, facts evaporate. When tools run against a stale version of a file, the agent hallucinates. When the agent revisits the same subproblem, it re-solves it blind.

Helios's bet: memory is the loop. It rests on five primitives that no existing harness combines:

| # | Primitive | What Claude Code / Aider / Devin has | What Helios adds |
| --- | --- | --- | --- |
| 1 | Cross-session recall | Per-session transcript only | Hybrid RRF (Vector + BM25 + cross-encoder rerank) across all prior sessions + repo corpus |
| 2 | Time-travel replay | None | `helios replay --session S --as-of T` via Lunaris `as_of(Hlc)` + Moon `TEMPORAL.SNAPSHOT_AT` |
| 3 | Consolidation, not truncation | Lossy compaction | Facts promoted via Lunaris auto-consolidator pipeline (`__lunaris_consolidate__` topic, 60 s debounce) |
| 4 | Grounded tools | Filesystem-first | `Read` / `Grep` / `Edit` hit Lunaris `HeliosScratchpad` first; filesystem fallback |
| 5 | Plan → Act → Verify → Reflect | Single-phase loop | Explicit four-phase loop with mandatory `Verifier` trait |

See .planning/PROJECT.md for the long-form argument.
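
To make primitive 1 concrete, here is a minimal sketch of Reciprocal Rank Fusion over two ranked result lists (for example one from vector search and one from BM25). It is an illustration, not the Lunaris implementation: the function name `rrf_fuse`, the episode ids, and the constant `k = 60` (a value commonly used for RRF) are assumptions, and the cross-encoder rerank step is not shown.

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion: every ranked list contributes 1 / (k + rank)
/// to a document's score, where rank starts at 1 for the top hit.
/// `rrf_fuse` is a hypothetical name, not a Lunaris API.
pub fn rrf_fuse(rankings: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for ranking in rankings {
        for (idx, id) in ranking.iter().enumerate() {
            let rank = (idx + 1) as f64;
            *scores.entry((*id).to_string()).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    // Highest fused score first; ties are broken by id so the order is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}
```

An episode that appears near the top of both lists outranks one that tops only a single list, which is the property that lets lexical and semantic retrieval cover each other's blind spots.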

Architecture at a glance

helios/
├── crates/
│   ├── helios-core/        # Domain: pure types, Session<S> typestate, trait contracts
│   ├── helios-memory/      # Infra: LunarisPort wrapping HeliosScratchpad + hybrid RRF recall
│   ├── helios-provider/    # Infra: Model trait + Anthropic + OpenAI + MockModel + TwoTierModel<P,E>
│   ├── helios-tools/       # Infra: Read/Write/Edit/Grep (memory-first) + Bash + Verifier trait
│   ├── helios-engine/      # Application: Plan → Act → Verify → Reflect loop + policy + hooks + subagents
│   ├── helios-sandbox/     # Infra: CubeSandbox HTTP client + LocalPtySandbox fallback
│   ├── helios-eval/        # Infra: SWE-Bench-Pro + τ-Bench + GAIA runners with replay + cost tracking
│   ├── helios-cli/         # Binary: the `helios` command
│   ├── helios-tui/         # Binary surface: ratatui interactive mode with plan-tree visualization
│   └── helios-sdk-py/      # Binding: PyO3 Python SDK
├── benches/                # Criterion benchmarks (cross-crate)
├── xtask/                  # Build automation (cargo xtask check-layers)
└── docs/adr/               # Architecture Decision Records

Layered architecture rules — cargo xtask check-layers enforces these:

| Crate | May import | Role |
| --- | --- | --- |
| `helios-core` | std, `serde`, `thiserror`, `uuid`, `chrono`, `async-trait`, `tokio-util` | Domain types, errors, trait contracts |
| `helios-engine` | `helios-core` | Plan → Act → Verify → Reflect orchestration |
| `helios-memory` / `-provider` / `-tools` / `-sandbox` / `-eval` | `helios-core` (+ lunaris for -memory) | Trait implementations |
| `helios-cli` / `-tui` / `-sdk-py` | everything above | Wiring only, no business logic |
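
The rules in the table can be expressed as a small check over workspace dependency edges. The sketch below is a guess at the shape of such a check, not the actual `cargo xtask check-layers` code: the `Layer` enum, the function names, and the decision to ignore external crates (such as `serde` or lunaris) are all assumptions made for illustration.

```rust
/// Layer of a workspace crate, following the table above (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Layer {
    Core,
    Engine,
    Infra,
    Surface,
}

/// Map a crate name to its layer; unknown (external) crates return None.
pub fn layer_of(krate: &str) -> Option<Layer> {
    match krate {
        "helios-core" => Some(Layer::Core),
        "helios-engine" => Some(Layer::Engine),
        "helios-memory" | "helios-provider" | "helios-tools" | "helios-sandbox"
        | "helios-eval" => Some(Layer::Infra),
        "helios-cli" | "helios-tui" | "helios-sdk-py" => Some(Layer::Surface),
        _ => None,
    }
}

/// Whether a crate in layer `from` may depend on a workspace crate in layer `to`.
pub fn may_import(from: Layer, to: Layer) -> bool {
    match from {
        Layer::Core => false, // core depends on no other workspace crate
        Layer::Engine | Layer::Infra => to == Layer::Core,
        Layer::Surface => to != Layer::Surface, // "everything above"
    }
}

/// Return every `from -> to` edge that breaks the layering rules.
/// Edges touching crates outside the workspace are skipped in this sketch.
pub fn violations(edges: &[(&str, &str)]) -> Vec<String> {
    edges
        .iter()
        .filter_map(|(from, to)| {
            let (f, t) = (layer_of(from)?, layer_of(to)?);
            if may_import(f, t) {
                None
            } else {
                Some(format!("{from} -> {to}"))
            }
        })
        .collect()
}
```

A real check would presumably read the edges from `cargo metadata` output and also validate the external-crate allowlist for `helios-core`; both are left out here.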

The loop

┌──────────┐    ┌────────┐    ┌──────────┐    ┌──────────┐
│  Plan    │ →  │  Act   │ →  │  Verify  │ →  │ Reflect  │
│          │    │        │    │          │    │          │
│ planner  │    │executor│    │ Verifier │    │ planner  │
│  model   │    │ model  │    │  trait   │    │  revises │
└──────────┘    └────────┘    └──────────┘    └──────────┘
      ↑                                              │
      └──────── revision path (on verify fail) ──────┘

Each phase is a Session<S> typestate marker. Illegal transitions are compile errors. Plan and Reflect dispatch to the planner model (expensive, deep); Act dispatches to the executor (cheap, fast); Verify runs the Verifier trait (default TestRunnerVerifier runs tests in CubeSandbox).
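
A minimal sketch of how a typestate session can make illegal transitions fail to compile. Only `Session<S>` and the `Verifier` trait are named in this README; the field names, method names, the `Next` enum, and the `MinActs` verifier are hypothetical and the real `helios-core` signatures may differ. The sketch also assumes that Reflect either finishes the session or hands control back to Plan.

```rust
use std::marker::PhantomData;

// Phase markers: zero-sized types that exist only at compile time.
pub struct Plan;
pub struct Act;
pub struct Verify;
pub struct Reflect;

pub struct Session<S> {
    pub log: Vec<String>,
    passed: bool,
    _phase: PhantomData<S>,
}

pub trait Verifier {
    fn verify(&self, log: &[String]) -> bool;
}

/// Toy verifier (hypothetical): passes once the log holds at least N "act" entries.
pub struct MinActs(pub usize);

impl Verifier for MinActs {
    fn verify(&self, log: &[String]) -> bool {
        log.iter().filter(|e| e.as_str() == "act").count() >= self.0
    }
}

/// Outcome of the Reflect phase: finish, or loop back to Plan.
pub enum Next {
    Done(Vec<String>),
    Revise(Session<Plan>),
}

impl<S> Session<S> {
    // Record the phase that just finished and move to phase `T`.
    fn step<T>(mut self, note: &str) -> Session<T> {
        self.log.push(note.to_string());
        Session { log: self.log, passed: self.passed, _phase: PhantomData }
    }
}

impl Session<Plan> {
    pub fn new() -> Self {
        Session { log: Vec::new(), passed: false, _phase: PhantomData }
    }
    pub fn act(self) -> Session<Act> {
        self.step("plan")
    }
}

impl Session<Act> {
    pub fn verify(self) -> Session<Verify> {
        self.step("act")
    }
}

impl Session<Verify> {
    pub fn reflect(self, verifier: &dyn Verifier) -> Session<Reflect> {
        let ok = verifier.verify(&self.log);
        let mut next: Session<Reflect> =
            self.step(if ok { "verify: pass" } else { "verify: fail" });
        next.passed = ok;
        next
    }
}

impl Session<Reflect> {
    pub fn next(self) -> Next {
        if self.passed {
            Next::Done(self.log)
        } else {
            Next::Revise(self.step("reflect: revise"))
        }
    }
}
```

Because each transition consumes `self` and only exists on the matching phase type, a call such as `Session::<Plan>::new().verify()` is rejected by the compiler: `Session<Plan>` has no `verify` method.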

Benchmark targets (Scale SEAL standardized harness)

| Benchmark | Current SOTA (April 2026) | Helios v0.1.0-alpha target |
| --- | --- | --- |
| SWE-Bench Pro (SEAL) | GPT-5.4 @ 59.1 %, Claude Opus 4.5 @ 45.9 % | ≥ 50 % |
| τ-Bench Retail | Claude 3.5 Sonnet @ ~81 % | ≥ 85 % |
| τ-Bench Airline | n/a | ≥ 70 % |
| GAIA Level-1 | n/a | ≥ 65 % |
| Agent-loop overhead | n/a | ≤ 15 ms p50 |
| Hybrid RRF recall p99 | n/a | ≤ 200 ms over 10 k-episode corpus |
| Sandbox cold start p95 | n/a | ≤ 100 ms end-to-end |
| Consolidation freshness | n/a | ≤ 60 s from ingest |
| Stream watchdog accuracy | n/a | 100 % on 10 k chaos runs |
| Subagent OOM isolation | n/a | 100 % on 10 k injection runs |

Note on SWE-Bench Verified (Anthropic-reported 79 % on Sonnet 4.6 / 80.8 % on Opus 4.6): OpenAI's 2026 audit found training contamination across every frontier model, so these numbers are inflated. The helios eval command emits per-instance contamination warnings following that audit's methodology.

Build roadmap — 9 phases, ~17 engineer-days

| Phase | Scope | Time |
| --- | --- | --- |
| 1 | Core types + `Session<S>` typestate + trait contracts | 1 d |
| 2 | Memory via Lunaris (`LunarisPort` + `HeliosScratchpad` + RRF recall + auto-consolidation) | 1 d |
| 3 | Providers (`Model` + Anthropic + OpenAI + Mock + `TwoTierModel<P,E>`) | 2 d |
| 4 | Memory-first tools + `Verifier` trait + `TestRunnerVerifier` | 1.5 d |
| 5 | Agent engine (keystone): Plan → Act → Verify → Reflect loop | 4 d |
| 6 | Sandbox (CubeSandbox Day-0 + LocalPty fallback) | 1.5 d |
| 7 | `helios-eval` benchmark harness | 2 d |
| 8 | CLI + Python SDK | 1.5 d |
| 9 | TUI + alpha ship-gate benchmark run | 2.5 d |

Total: ~17 engineer-days, ~3–4 weeks with review and iteration.

Full plan: .planning/ROADMAP.md.

Deferred to v0.2.0: Claude Code one-way settings importer, MCP client + server, TypeScript SDK, control plane, additional sandbox backends (Firecracker / Docker / E2B), Wasm hooks, reactive compaction, CubeSandbox snapshot rollback integration.

Getting started

Prerequisites:

Rust 1.94 (rust-toolchain.toml pins this automatically)
Linux x86_64 with KVM for production sandboxing (CubeSandbox daemon running locally on http://127.0.0.1:3000). macOS / non-KVM hosts fall back to LocalPtySandbox for dev.
Sibling checkouts: lunaris and moon at ../lunaris and ../moon.
For Python SDK: Python 3.9–3.13.
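
The sandbox fallback described in the prerequisites could be decided along these lines. This is a sketch under assumptions: the `SandboxBackend` enum and both function names are hypothetical, KVM availability is approximated by the presence of `/dev/kvm`, and the sketch does not probe whether the CubeSandbox daemon is actually listening.

```rust
use std::path::Path;

/// Which sandbox the harness would talk to (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum SandboxBackend {
    CubeSandbox { endpoint: String },
    LocalPty,
}

/// Pure decision function: KVM-capable Linux x86_64 hosts get CubeSandbox,
/// everything else falls back to the local PTY sandbox.
pub fn choose_backend(os: &str, arch: &str, kvm_available: bool) -> SandboxBackend {
    if os == "linux" && arch == "x86_64" && kvm_available {
        SandboxBackend::CubeSandbox {
            // Default local daemon address from the prerequisites above.
            endpoint: "http://127.0.0.1:3000".to_string(),
        }
    } else {
        SandboxBackend::LocalPty
    }
}

/// Detect the current host. Treating "/dev/kvm exists" as "KVM is usable"
/// is an assumption of this sketch.
pub fn detect_backend() -> SandboxBackend {
    let kvm = Path::new("/dev/kvm").exists();
    choose_backend(std::env::consts::OS, std::env::consts::ARCH, kvm)
}
```

Keeping the decision in a pure function makes the fallback easy to unit-test on any host, including macOS development machines.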

Build:

cargo build --workspace
cargo test --workspace --all-features
cargo clippy --workspace -- -D warnings
cargo xtask check-layers

First run (Phase 8 or later):

helios -p "summarize docs and find TODOs"                         # headless agent, two-tier default
helios eval swe-pro --subset 10 --two-tier                        # benchmark run, 10-task subset
helios replay --session <id> --as-of 2026-04-22T10:30:00Z         # time-travel (requires live Lunaris)
helios                                                            # interactive TUI (Phase 9+)

The helios binary doesn't exist until Phase 8. From Phase 1 until then, the command to run is cargo check -p helios-core.
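
To illustrate what `--as-of` means for `helios replay`, the sketch below reconstructs the view of a session at a given point on the recorded-time axis. It is a toy in-memory model, not the Lunaris `as_of(Hlc)` API: the `Hlc` and `Episode` field layouts and the `as_of` function are assumptions, and the second (valid-time) axis of a bi-temporal store is not modelled.

```rust
use std::collections::BTreeMap;

/// Hybrid logical clock stamp (hypothetical layout): wall-clock milliseconds
/// plus a counter that orders events within the same millisecond.
/// The derived ordering compares `wall_ms` first, then `counter`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Hlc {
    pub wall_ms: u64,
    pub counter: u32,
}

/// One recorded fact: a key, its value, and when it was recorded.
#[derive(Debug, Clone, PartialEq)]
pub struct Episode {
    pub recorded_at: Hlc,
    pub key: String,
    pub value: String,
}

/// Return the latest value per key among episodes recorded at or before `t`.
pub fn as_of<'a>(log: &'a [Episode], t: Hlc) -> BTreeMap<&'a str, &'a str> {
    let mut visible: Vec<&Episode> = log.iter().filter(|e| e.recorded_at <= t).collect();
    // Apply episodes in recorded order so later writes overwrite earlier ones.
    visible.sort_by_key(|e| e.recorded_at);
    let mut view = BTreeMap::new();
    for e in visible {
        view.insert(e.key.as_str(), e.value.as_str());
    }
    view
}
```

Anything recorded after the requested stamp is invisible to the replayed view, which is what lets a replay see the file contents and plan state exactly as the agent saw them at that moment.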

Reference codebases

Helios is informed by three sibling checkouts (read-only) plus one vendored HTTP dep:

moon — Redis-compatible data substrate with production MVCC (TEMPORAL.SNAPSHOT_AT), vector HNSW + DiskANN, graph + Cypher. Single binary + library crate. Helios consumes transitively via lunaris-storage-moon over the RESP wire protocol (not as a direct Rust dep).
lunaris — bi-temporal memory layer built on Moon. Helios's LunarisPort wraps lunaris::Lunaris and lunaris::recipes::HeliosScratchpad (pre-built Helios integration).
claude-code — reference TypeScript harness. Used for correctness invariants only (saw_tool_use_block dispatch, :ping keepalive, typestate session pattern). Not a wire-protocol / hook-envelope compatibility contract in v0.1.0-alpha.
CubeSandbox — KVM microVM sandbox, Apache 2.0, E2B-compatible HTTP API. Consumed over HTTP (not as a Rust library). Production-validated at Tencent Cloud.

Repository conventions

Read CLAUDE.md before writing any code. It's the behavioral contract for contributors (and AI assistants) working in this repo.
Before Phase 1, the repo has no crates/, no Cargo.toml, no src/. That's intentional — don't invent a skeleton ahead of the plan.
Planning artifacts live under .planning/. Build via /gsd-plan-phase, execute via /gsd-execute-phase. Current state: roadmap complete, Phase 1 ready to plan.
Commit format: <type>(<scope>): <summary> with body and author: Tin Dang footer. See CLAUDE.md for the full template.
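
As a quick illustration of the `<type>(<scope>): <summary>` header shape, here is a small parser sketch. It checks only the first line; the body and footer rules live in CLAUDE.md and are not modelled. The `CommitHeader` struct, the `parse_header` function, and the example message are hypothetical.

```rust
/// Parsed commit header (hypothetical helper type).
#[derive(Debug, PartialEq, Eq)]
pub struct CommitHeader<'a> {
    pub kind: &'a str,
    pub scope: &'a str,
    pub summary: &'a str,
}

/// Parse `<type>(<scope>): <summary>`; returns None if any part is missing.
pub fn parse_header(line: &str) -> Option<CommitHeader<'_>> {
    let (head, summary) = line.split_once(": ")?;
    let (kind, rest) = head.split_once('(')?;
    let scope = rest.strip_suffix(')')?;
    if kind.is_empty() || scope.is_empty() || summary.trim().is_empty() {
        return None;
    }
    Some(CommitHeader { kind, scope, summary })
}
```

A check like this could run in a commit-msg hook so malformed headers are caught before review.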

License

Apache-2.0 OR MIT (dual license, contributor's choice). CubeSandbox dep is Apache 2.0, compatible.

Documentation

Current thesis — .planning/PROJECT.md (memory-first pivot, 2026-04-21)
Requirements — .planning/REQUIREMENTS.md (42 v1 items, 100 % phase-mapped)
Roadmap — .planning/ROADMAP.md (9 phases, ~17 engineer-days)
Primary-source map — .planning/architect/REFERENCES.md (Lunaris / Moon / CubeSandbox / claude-code capability pointers)
Historical snapshots — .planning/architect/blueprint.md + .planning/architect/baseplan/ (pre-pivot design; read for Rust patterns, not current scope)
ADRs — docs/adr/ (created at Phase 1)
Contributor guide — CLAUDE.md
