pilotspace/helios

Helios

The first agent harness where long-term memory is the loop, not an afterthought.

Helios is a memory-first Rust agent harness built on Lunaris (bi-temporal memory), Moon (Redis-compatible MVCC data substrate), and CubeSandbox (KVM microVM isolation, < 60 ms cold start). Every thought, tool call, test result, and plan revision is a durable, time-stamped, cross-session-recallable Episode. Default configuration ships two-tier model orchestration (expensive planner + cheap executor) and a built-in benchmark harness targeting the SWE-Bench Pro / τ-Bench / GAIA leaderboards on Scale SEAL's standardized methodology — not vendor-reported numbers.

Not a Claude Code clone. Claude Code is referenced for specific correctness invariants (saw_tool_use_block dispatch, :ping keepalive, typestate session). Everything else is designed for the leaderboard.

Status: pre-alpha, foundation phase. Planning artifacts are complete (see .planning/PROJECT.md for thesis, .planning/REQUIREMENTS.md for 42 v1 requirements, .planning/ROADMAP.md for 9 phases). Code lands phase-by-phase via GSD-managed execution — Phase 1 (workspace + helios-core types) is the first executable phase.


Why Helios

Every existing agent harness hits the same wall: the harness is only as smart as the context it surfaces, and transcript management is lossy. When compaction happens, facts evaporate. When tools run on the wrong version of a file, they hallucinate. When the agent revisits the same subproblem, it re-solves it blind.

Helios's bet: memory is the loop. It combines five primitives that no existing harness offers together:

| # | Primitive | What Claude Code / Aider / Devin has | What Helios adds |
|---|-----------|--------------------------------------|------------------|
| 1 | Cross-session recall | Per-session transcript only | Hybrid RRF (Vector + BM25 + cross-encoder rerank) across all prior sessions + repo corpus |
| 2 | Time-travel replay | None | helios replay --session S --as-of T via Lunaris as_of(Hlc) + Moon TEMPORAL.SNAPSHOT_AT |
| 3 | Consolidation, not truncation | Lossy compaction | Facts promoted via Lunaris auto-consolidator pipeline (__lunaris_consolidate__ topic, 60 s debounce) |
| 4 | Grounded tools | Filesystem-first | Read / Grep / Edit hit Lunaris HeliosScratchpad first; filesystem fallback |
| 5 | Plan → Act → Verify → Reflect | Single-phase loop | Explicit four-phase loop with mandatory Verifier trait |
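
To make the "Hybrid RRF" entry in row 1 concrete, here is a minimal sketch of reciprocal rank fusion over two ranked lists of episode IDs. It assumes the commonly used constant k = 60 and plain string IDs. The function name rrf_fuse is hypothetical and is not Lunaris or Helios API, and the cross-encoder rerank stage is left out.

```rust
use std::collections::HashMap;

/// Fuse several ranked lists of episode IDs with reciprocal rank fusion.
/// Each list contributes 1 / (k + rank) per item, with rank starting at 1.
/// Hypothetical helper for illustration; not part of Lunaris or Helios.
pub fn rrf_fuse(rankings: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for ranking in rankings {
        for (idx, id) in ranking.iter().enumerate() {
            let rank = (idx + 1) as f64;
            *scores.entry((*id).to_string()).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    // Highest score first; ties are broken by ID so the order is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}
```

Calling rrf_fuse(&[vector_hits, bm25_hits], 60.0) favors episodes that rank well in both lists over episodes that rank first in only one, which is the usual reason to fuse by rank rather than by raw score.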

See .planning/PROJECT.md for the long-form argument.

Architecture at a glance

helios/
├── crates/
│   ├── helios-core/        # Domain: pure types, Session<S> typestate, trait contracts
│   ├── helios-memory/      # Infra: LunarisPort wrapping HeliosScratchpad + hybrid RRF recall
│   ├── helios-provider/    # Infra: Model trait + Anthropic + OpenAI + MockModel + TwoTierModel<P,E>
│   ├── helios-tools/       # Infra: Read/Write/Edit/Grep (memory-first) + Bash + Verifier trait
│   ├── helios-engine/      # Application: Plan → Act → Verify → Reflect loop + policy + hooks + subagents
│   ├── helios-sandbox/     # Infra: CubeSandbox HTTP client + LocalPtySandbox fallback
│   ├── helios-eval/        # Infra: SWE-Bench-Pro + τ-Bench + GAIA runners with replay + cost tracking
│   ├── helios-cli/         # Binary: the `helios` command
│   ├── helios-tui/         # Binary surface: ratatui interactive mode with plan-tree visualization
│   └── helios-sdk-py/      # Binding: PyO3 Python SDK
├── benches/                # Criterion benchmarks (cross-crate)
├── xtask/                  # Build automation (cargo xtask check-layers)
└── docs/adr/               # Architecture Decision Records
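
The "memory-first" label on helios-tools means a read consults the scratchpad before touching disk. A minimal sketch of that lookup order follows. The Scratchpad trait and MapScratchpad type are hypothetical stand-ins invented for this example, since the real HeliosScratchpad API is not shown here.

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Where a read was served from.
#[derive(Debug, PartialEq, Eq)]
pub enum Source {
    Scratchpad,
    Filesystem,
}

/// Hypothetical stand-in for the memory layer; not the HeliosScratchpad API.
pub trait Scratchpad {
    fn get(&self, path: &Path) -> Option<String>;
}

/// Toy in-memory scratchpad used only for this sketch.
#[derive(Default)]
pub struct MapScratchpad(pub HashMap<PathBuf, String>);

impl Scratchpad for MapScratchpad {
    fn get(&self, path: &Path) -> Option<String> {
        self.0.get(path).cloned()
    }
}

/// Serve the read from memory when possible, otherwise fall back to disk.
pub fn read_memory_first(pad: &dyn Scratchpad, path: &Path) -> io::Result<(String, Source)> {
    if let Some(text) = pad.get(path) {
        return Ok((text, Source::Scratchpad));
    }
    fs::read_to_string(path).map(|text| (text, Source::Filesystem))
}
```

Returning the Source alongside the text lets a caller log how often the fallback path is taken.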

Layered architecture rules (cargo xtask check-layers enforces these):

| Crate | May import | Role |
|-------|------------|------|
| helios-core | std, serde, thiserror, uuid, chrono, async-trait, tokio-util | Domain types, errors, trait contracts |
| helios-engine | helios-core | Plan → Act → Verify → Reflect orchestration |
| helios-memory / -provider / -tools / -sandbox / -eval | helios-core (+ lunaris for -memory) | Trait implementations |
| helios-cli / -tui / -sdk-py | everything above | Wiring only; no business logic |
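
The rules above can be sketched as a small lookup over workspace crates. This is an illustration only, not the actual xtask implementation. The Layer enum and both function names are hypothetical, and the sketch covers only edges between helios-* crates, so external dependencies such as serde or lunaris are out of scope.

```rust
/// Architectural layer of a workspace crate, following the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Layer {
    Core,
    Engine,
    Infra,
    Surface,
}

/// Map a crate name to its layer. Returns None for non-workspace crates.
pub fn layer_of(krate: &str) -> Option<Layer> {
    match krate {
        "helios-core" => Some(Layer::Core),
        "helios-engine" => Some(Layer::Engine),
        "helios-memory" | "helios-provider" | "helios-tools" | "helios-sandbox"
        | "helios-eval" => Some(Layer::Infra),
        "helios-cli" | "helios-tui" | "helios-sdk-py" => Some(Layer::Surface),
        _ => None,
    }
}

/// True if workspace crate `from` may depend on workspace crate `to`.
/// Unknown (external) crates are not modeled and always yield false here.
pub fn may_import(from: &str, to: &str) -> bool {
    match (layer_of(from), layer_of(to)) {
        // Binaries and bindings may wire together everything above them.
        (Some(Layer::Surface), Some(target)) => target != Layer::Surface,
        // Engine and infrastructure crates see only the domain crate.
        (Some(Layer::Engine), Some(Layer::Core)) => true,
        (Some(Layer::Infra), Some(Layer::Core)) => true,
        _ => false,
    }
}
```

A real checker would presumably read the dependency graph from the workspace manifests and report each violating edge; that part is omitted here.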

The loop

┌──────────┐    ┌────────┐    ┌──────────┐    ┌──────────┐
│  Plan    │ →  │  Act   │ →  │  Verify  │ →  │ Reflect  │
│          │    │        │    │          │    │          │
│ planner  │    │executor│    │ Verifier │    │ planner  │
│  model   │    │ model  │    │  trait   │    │  revises │
└──────────┘    └────────┘    └──────────┘    └──────────┘
      ↑                                              │
      └──────── revision path (on verify fail) ──────┘

Each phase is a Session<S> typestate marker. Illegal transitions are compile errors. Plan and Reflect dispatch to the planner model (expensive, deep); Act dispatches to the executor (cheap, fast); Verify runs the Verifier trait (default TestRunnerVerifier runs tests in CubeSandbox).
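
A minimal sketch of how such a typestate could look, assuming zero-sized marker types and a PhantomData tag. Only the Session<S> name comes from this README. The marker names, the Handler enum, and the method names are hypothetical choices for illustration.

```rust
use std::marker::PhantomData;

// Zero-sized phase markers; names follow the diagram above.
pub struct Plan;
pub struct Act;
pub struct Verify;
pub struct Reflect;

/// Who handles a phase (hypothetical enum for this sketch).
#[derive(Debug, PartialEq, Eq)]
pub enum Handler {
    PlannerModel,
    ExecutorModel,
    Verifier,
}

/// A session tagged with its current phase at the type level.
pub struct Session<S> {
    /// Phases visited so far, recorded for illustration.
    pub history: Vec<&'static str>,
    _phase: PhantomData<S>,
}

impl<S> Session<S> {
    // Private helper: consume the session and re-tag it with the next phase.
    fn advance<T>(mut self, label: &'static str) -> Session<T> {
        self.history.push(label);
        Session { history: self.history, _phase: PhantomData }
    }
}

impl Session<Plan> {
    pub fn new() -> Self {
        Session { history: vec!["plan"], _phase: PhantomData }
    }
    pub fn handler(&self) -> Handler { Handler::PlannerModel }
    pub fn act(self) -> Session<Act> { self.advance("act") }
}

impl Session<Act> {
    pub fn handler(&self) -> Handler { Handler::ExecutorModel }
    pub fn verify(self) -> Session<Verify> { self.advance("verify") }
}

impl Session<Verify> {
    pub fn handler(&self) -> Handler { Handler::Verifier }
    pub fn reflect(self) -> Session<Reflect> { self.advance("reflect") }
}

impl Session<Reflect> {
    pub fn handler(&self) -> Handler { Handler::PlannerModel }
    /// Revision path: return to planning after a failed verification.
    pub fn revise(self) -> Session<Plan> { self.advance("plan") }
}

/// One pass through the loop followed by a revision.
pub fn one_cycle_with_revision() -> Vec<&'static str> {
    let session = Session::<Plan>::new().act().verify().reflect().revise();
    // session.verify(); // would not compile: Session<Plan> has no `verify`
    session.history
}
```

Because each transition consumes self and returns a differently tagged Session, skipping a phase or reusing a stale session is rejected by the compiler rather than caught at runtime.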

Benchmark targets (Scale SEAL standardized harness)

| Benchmark | Current SOTA (April 2026) | Helios v0.1.0-alpha target |
|-----------|---------------------------|----------------------------|
| SWE-Bench Pro (SEAL) | GPT-5.4 @ 59.1 %, Claude Opus 4.5 @ 45.9 % | ≥ 50 % |
| τ-Bench Retail | Claude 3.5 Sonnet @ ~81 % | ≥ 85 % |
| τ-Bench Airline | | ≥ 70 % |
| GAIA Level-1 | | ≥ 65 % |
| Agent-loop overhead | | ≤ 15 ms p50 |
| Hybrid RRF recall p99 | | ≤ 200 ms over 10 k-episode corpus |
| Sandbox cold start p95 | | ≤ 100 ms end-to-end |
| Consolidation freshness | | ≤ 60 s from ingest |
| Stream watchdog accuracy | | 100 % on 10 k chaos runs |
| Subagent OOM isolation | | 100 % on 10 k injection runs |
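
The latency rows are stated as percentiles (p50, p95, p99). As a rough sketch of how a measured sample set could be checked against such a budget, the code below uses the nearest-rank percentile definition. That choice is an assumption, since the table does not say which estimator helios-eval uses, and both function names are hypothetical.

```rust
use std::time::Duration;

/// Nearest-rank percentile: the smallest sample such that at least `p` percent
/// of all samples are less than or equal to it.
/// Returns None for empty input or when `p` is outside (0, 100].
pub fn percentile(samples: &[Duration], p: f64) -> Option<Duration> {
    if samples.is_empty() || !(p > 0.0 && p <= 100.0) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort();
    // 1-based rank, clamped to guard against floating-point rounding.
    let rank = (((p / 100.0) * sorted.len() as f64).ceil() as usize).clamp(1, sorted.len());
    Some(sorted[rank - 1])
}

/// True if the `p`-th percentile of `samples` is within `budget`.
pub fn meets_target(samples: &[Duration], p: f64, budget: Duration) -> bool {
    percentile(samples, p).map_or(false, |value| value <= budget)
}
```

For example, meets_target(&loop_overheads, 50.0, Duration::from_millis(15)) would express the agent-loop overhead row, given a slice of measured per-iteration overheads.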

Note on SWE-Bench Verified (Anthropic-reported 79 % on Sonnet 4.6 / 80.8 % on Opus 4.6): OpenAI's 2026 audit found training contamination across every frontier model, so these numbers are inflated. Helios's helios eval emits per-instance contamination warnings per that methodology.

Build roadmap — 9 phases, ~17 engineer-days

| Phase | Scope | Time |
|-------|-------|------|
| 1 | Core types + Session<S> typestate + trait contracts | 1 d |
| 2 | Memory via Lunaris (LunarisPort + HeliosScratchpad + RRF recall + auto-consolidation) | 1 d |
| 3 | Providers (Model + Anthropic + OpenAI + Mock + TwoTierModel<P,E>) | 2 d |
| 4 | Memory-first tools + Verifier trait + TestRunnerVerifier | 1.5 d |
| 5 | Agent engine (keystone): Plan → Act → Verify → Reflect loop | 4 d |
| 6 | Sandbox (CubeSandbox Day-0 + LocalPty fallback) | 1.5 d |
| 7 | helios-eval benchmark harness | 2 d |
| 8 | CLI + Python SDK | 1.5 d |
| 9 | TUI + alpha ship-gate benchmark run | 2.5 d |

Total: ~17 engineer-days, ~3–4 weeks with review and iteration.

Full plan: .planning/ROADMAP.md.

Deferred to v0.2.0: Claude Code one-way settings importer, MCP client + server, TypeScript SDK, control plane, additional sandbox backends (Firecracker / Docker / E2B), Wasm hooks, reactive compaction, CubeSandbox snapshot rollback integration.

Getting started

Prerequisites:

  • Rust 1.94 (rust-toolchain.toml pins this automatically)
  • Linux x86_64 with KVM for production sandboxing (CubeSandbox daemon running locally on http://127.0.0.1:3000). macOS / non-KVM hosts fall back to LocalPtySandbox for dev.
  • Sibling checkouts: lunaris and moon at ../lunaris and ../moon.
  • For Python SDK: Python 3.9–3.13.
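
The KVM versus fallback split in the prerequisites can be sketched as a backend selection step. This is a hedged illustration: the SandboxBackend enum and select_backend function are hypothetical, and approximating KVM availability by the presence of a device node (such as /dev/kvm) is an assumption of this sketch. It does not check whether the CubeSandbox daemon is actually reachable.

```rust
use std::path::Path;

/// Which sandbox implementation to use (hypothetical enum for this sketch).
#[derive(Debug, PartialEq, Eq)]
pub enum SandboxBackend {
    /// CubeSandbox daemon reached over HTTP at the given base URL.
    CubeSandbox { base_url: String },
    /// Local PTY fallback for development hosts without KVM.
    LocalPty,
}

/// Pick a backend based on whether the host appears to expose KVM.
/// Assumption: KVM availability is approximated by a device node existing
/// on a Linux host; daemon reachability is not verified here.
pub fn select_backend(kvm_device: &Path, base_url: &str) -> SandboxBackend {
    if cfg!(target_os = "linux") && kvm_device.exists() {
        SandboxBackend::CubeSandbox { base_url: base_url.to_string() }
    } else {
        SandboxBackend::LocalPty
    }
}
```

With the defaults listed above, a call might look like select_backend(Path::new("/dev/kvm"), "http://127.0.0.1:3000"). On macOS or a Linux host without the device node it yields LocalPty.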

Build:

cargo build --workspace
cargo test --workspace --all-features
cargo clippy --workspace -- -D warnings
cargo xtask check-layers

First run (post-alpha, Phase 8 or later):

helios -p "summarize docs and find TODOs"                         # headless agent, two-tier default
helios eval swe-pro --subset 10 --two-tier                        # benchmark run, 10-task subset
helios replay --session <id> --as-of 2026-04-22T10:30:00Z         # time-travel (requires live Lunaris)
helios                                                            # interactive TUI (Phase 9+)

During the build phase, use cargo check -p helios-core from Phase 1 onward; the helios binary doesn't exist until Phase 8.

Reference codebases

Helios is informed by three sibling checkouts (read-only) plus one vendored HTTP dep:

  • moon — Redis-compatible data substrate with production MVCC (TEMPORAL.SNAPSHOT_AT), vector HNSW + DiskANN, graph + Cypher. Single binary + library crate. Helios consumes transitively via lunaris-storage-moon over the RESP wire protocol (not as a direct Rust dep).
  • lunaris — bi-temporal memory layer built on Moon. Helios's LunarisPort wraps lunaris::Lunaris and lunaris::recipes::HeliosScratchpad (pre-built Helios integration).
  • claude-code — reference TypeScript harness. Used for correctness invariants only (saw_tool_use_block dispatch, :ping keepalive, typestate session pattern). Not a wire-protocol / hook-envelope compatibility contract in v0.1.0-alpha.
  • CubeSandbox — KVM microVM sandbox, Apache 2.0, E2B-compatible HTTP API. Consumed over HTTP (not as a Rust library). Production-validated at Tencent Cloud.

Repository conventions

  • Read CLAUDE.md before writing any code. It's the behavioral contract for contributors (and AI assistants) working in this repo.
  • Before Phase 1, the repo has no crates/, no Cargo.toml, no src/. That's intentional — don't invent a skeleton ahead of the plan.
  • Planning artifacts live under .planning/. Build via /gsd-plan-phase, execute via /gsd-execute-phase. Current state: roadmap complete, Phase 1 ready to plan.
  • Commit format: <type>(<scope>): <summary> with body and author: Tin Dang footer. See CLAUDE.md for the full template.

License

Apache-2.0 OR MIT (dual license, contributor's choice). CubeSandbox dep is Apache 2.0, compatible.

Documentation

  • Current thesis: .planning/PROJECT.md (memory-first pivot, 2026-04-21)
  • Requirements: .planning/REQUIREMENTS.md (42 v1 items, 100 % phase-mapped)
  • Roadmap: .planning/ROADMAP.md (9 phases, ~17 engineer-days)
  • Primary-source map: .planning/architect/REFERENCES.md (Lunaris / Moon / CubeSandbox / claude-code capability pointers)
  • Historical snapshots: .planning/architect/blueprint.md + .planning/architect/baseplan/ (pre-pivot design; read for Rust patterns, not current scope)
  • ADRs: docs/adr/ (created at Phase 1)
  • Contributor guide: CLAUDE.md
