ACP Stop Reasons and ATProto Image Posts
- ACP prompt stop reasons — v100 acp now returns spec-valid refusal stop reasons instead of error, preventing Zed and other typed ACP clients from failing to deserialize prompt responses.
- ATProto image posts — Added and registered atproto_upload_blob, fixed blob uploads to send raw image bytes with the image MIME type, and taught atproto_post to accept images[], image-only posts, and quote+image recordWithMedia embeds.
- ATProto image workflow tests — Added coverage for raw blob upload request shape, image-only posts, quote+image embeds, upload-tool registration, and upload-tool metadata.
Release Pipeline Hardening and CLI Prompt Cleanup
- Wrapped CLI prompt cleanup — Fixed interactive CLI prompt redraws so wrapped multiline input clears all visual rows before repainting, including submit, EOF, and interrupt cleanup paths.
- Cross-platform terminal sizing — Replaced Unix-specific prompt-width probing with Go's terminal size helper so release cross-compiles continue to work across Linux, macOS, and Windows targets.
- CI-equivalent test script — Added scripts/test.sh as the local release-test entrypoint, matching CI package scope, cache isolation, and GOWORK=off behavior.
- Release workflow test parity — Updated the GitHub Release workflow to call scripts/test.sh in both test gates.
- Stable UI snapshots — Made layout snapshot fixtures independent of the GitHub Actions checkout path.
ACP Server Mode and Interactive Model Switching
- ACP server mode — Added v100 acp, a headless Agent Client Protocol server for editor integrations such as Zed, with newline-delimited JSON-RPC over stdio, stdout protocol isolation, session lifecycle handling, cancellation, streamed agent updates, tool-call updates, and workspace-aware execution.
- Reusable run builder — Factored core run construction into shared runtime components so CLI, TUI, and ACP modes can initialize providers, policies, tools, budgets, traces, solvers, and sandbox sessions consistently.
- Interactive /model command — Added /model support across CLI, TUI, resume, and ACP sessions for listing providers/models and switching the active provider/model during a session, including numbered provider/model selection and optional immediate task execution.
- ACP slash-command advertisement — ACP sessions now advertise available commands including model, auto, and local via available_commands_update so compatible clients can expose slash commands.
- Model propagation — Added Loop.Model and wired it through solver, recovery, reflection, resume, TUI, CLI, and ACP paths so provider requests honor runtime model switches.
- ACP protocol hardening — Added content-block prompt handling, resource link text bridging, provider-aware image support, base64 image decoding, command-safe tool update serialization, larger stdio scan buffers, and safe session cleanup.
Alternate ATProto Accounts and TUI Localization
- Alternate ATProto account support — atproto_feed, atproto_notifications, and atproto_post now accept an account argument for selecting main or alt, with config support for [alt-atproto].
- Safer ATProto posting — Unknown ATProto account names are rejected instead of silently falling back to the main account.
- French TUI localization — Added LANG=fr / LANG=fr_CA.UTF-8 and V100_LANG=fr support for the main TUI chrome, status pane, dashboard, radio labels, confirmation dialog, detail pane, and rotating status messages.
- Tool output readability — Improved JSON-aware tool summaries and structured detail-pane rendering for scan-friendly TUI output.
- Workspace artifact cleanup — Ignored local module-cache and patch artifact files, and removed the tracked scratch blackboard.md.
GLM Stream Recovery, Detail Pane Highlighting, and Repo Cleanup
- GLM streamed-response recovery — Added a GLM-only streaming watchdog that detects partial-output stalls, cancels the dead stream, and retries the model turn once instead of hanging indefinitely.
- Tool detail pane JSON highlighting — The TUI detail pane and replay detail view now pretty-print and syntax-highlight JSON tool arguments and results for better readability.
- Release test cleanup — Updated stale config and UI test expectations, removed the accidental in-repo Go module cache tree, and restored plain go test ./... from repo root.
- Research layout cleanup — Moved train-loop assets under research/train-loop/, moved research configs under research/configs/, moved loose benchmark configs under benchmarks/, and updated docs/config references accordingly.
- Research results location — Default research results.tsv output now lives under the configured research workdir instead of repo root, and research/train-loop/results.tsv is no longer tracked.
- Repo root cleanup — Moved CLAUDE.md to docs/notes/CLAUDE.md and ignored additional local shell/Nix artifacts to keep the root less noisy.
Brave Web Search, Tool Defaults Normalization, and Cleanup
- web_search tool — Added a Brave Search-backed web search tool that returns ranked results with title, URL, and description. Enabled in the default tool registry and surfaced in the TUI with its own glyph. Requires BRAVE_SEARCH_API_KEY.
- Tool default normalization — Config loading now backfills missing default providers and tool registrations more consistently, including git_commit, web_search, fs_render_image, fingerprint, and sem_diff.
- Compression default — Default compression provider now targets glm.
- Artifact cleanup — .claude/ artifacts and *.settings.local.json are ignored, and stray .orig/.rej patch artifacts were removed from the repo.
- Docs cleanup — Removed the unused v100 dev entry from the changelog.
Context Intelligence, Self-Healing, Eval Pipeline, Dogfood Automation, and TUI Detail Pane
- Context Window Intelligence (CWI) — PressureMonitor PolicyHook tracks estimated token usage against the model's context window. At 70% saturation, injects a guidance message encouraging conciseness. At 80.5%, forces a replan that triggers compression. Configurable via Policy.PressureThreshold. Exposes ContextPressure and ContextWindowSize in LoopState for all hooks. Emits context.pressure trace events.
- Agent Self-Healing (ASH) — RecoveryHook PolicyHook detects stuck agents and intervenes: (1) stuck detection after 3+ consecutive tool failures with a reflective guidance message, (2) error pattern matching against 14 common failure signatures (missing files, permission denied, rate limits, bad patches, syntax errors) with targeted correction hints, (3) graceful degradation after 3+ failures of the same tool with optional tool disabling. Auto-wired into all runs via lazy hook initialization.
- Continuous Eval Pipeline (CEP) — v100 bench history <name> scans run directories for matching bench runs and prints a score history table. v100 bench trend <name> renders an ASCII sparkline of pass/fail scores with drift detection (10% regression alert). Both support the --runs flag for custom run directories. New internal/eval/history.go with LoadHistory, Sparkline, FormatHistoryTable, FormatTrendSummary.
- Dogfood Automation (DA) — v100 dogfood run [quest...] auto-discovers bench TOML files in dogfood/, runs them sequentially with git commit tagging, and prints a summary report. v100 dogfood report shows results from the last run. Includes regression detection comparing against previous runs. New internal/eval/dogfood.go with quest discovery, filtering, and report formatting.
- TUI Tool Detail Pane — Third column in the TUI showing full tool call details when a tool result is selected. Toggle with Ctrl+D or click on any tool result line. Scrollable viewport for long outputs. Three-column layout: transcript (35%) | detail (35%) | trace+metrics (30%), falls back to two-column on narrow terminals. Escape key dismisses. Includes inline image rendering with height capping and chunked Kitty graphics payloads.
- v100 bench history <bench-name> — show score history for a benchmark
- v100 bench trend <bench-name> — show ASCII sparkline trend with drift detection
- v100 dogfood run [quest...] — execute v100's self-test quest suite
- v100 dogfood report — show results from last dogfood run
- context.pressure — emitted when context pressure exceeds threshold, with ContextPressurePayload
- PressureThreshold float64 — context pressure ratio to trigger proactive compression (0 = disabled, default 0.70)
- PressureMonitor(threshold) PolicyHook — proactive context management
- RecoveryHook(config) PolicyHook — agent self-healing with stuck detection, error patterns, degradation
- Phase 400 roadmap complete: Provider Resilience, Context Intelligence, Eval Pipeline, Self-Healing, Dogfood Automation all shipped.
- Phase 300 roadmap complete: Reflective Scoring, Prompt Mutation, Synthetic Bootstrapping, Constraint-Gated Evolution all shipped.
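The context-pressure thresholds described in this release can be sketched as a small classifier. This is only an illustration: classifyPressure, the pressureAction type, and the 128,000-token window are hypothetical names and values, and the two threshold arguments simply reuse the 70% and 80.5% figures quoted above.

```go
package main

import "fmt"

// pressureAction is a hypothetical enum for what a hook might do.
type pressureAction int

const (
	actionNone   pressureAction = iota // below the guidance threshold
	actionGuide                        // inject a "be concise" message
	actionReplan                       // force a replan that triggers compression
)

// classifyPressure returns the estimated saturation ratio and the action
// for it. A guideAt of 0 disables the check, mirroring the documented
// "0 = disabled" behaviour of PressureThreshold.
func classifyPressure(usedTokens, windowSize int, guideAt, replanAt float64) (float64, pressureAction) {
	if windowSize <= 0 || guideAt <= 0 {
		return 0, actionNone
	}
	ratio := float64(usedTokens) / float64(windowSize)
	switch {
	case ratio >= replanAt:
		return ratio, actionReplan
	case ratio >= guideAt:
		return ratio, actionGuide
	default:
		return ratio, actionNone
	}
}

func main() {
	for _, used := range []int{50_000, 96_000, 115_000} {
		ratio, act := classifyPressure(used, 128_000, 0.70, 0.805)
		fmt.Printf("used=%d ratio=%.2f action=%d\n", used, ratio, act)
	}
}
```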
Provider Resilience, Voice I/O, Bench Bootstrap, and XML Leak Fix
- Resilient provider with health tracking — ResilientProvider wraps a primary provider with an ordered fallback chain and a per-provider HealthTracker (sliding-window error rate + cooldown-gated single-probe retry). Unhealthy primaries are short-circuited in favor of healthy fallbacks until one successful probe restores them. v100 providers health surfaces the live status, forwarded through RetryProvider.
- Voice input/output via --speak — run --speak voices assistant replies through espeak-ng (override with V100_TTS_CMD). At the CLI prompt, /voice captures a one-shot utterance and /voice interactive enters continuous voice mode (say "stop voice" to exit). TTS output is drained before the next mic capture to avoid feedback.
- bench bootstrap subcommand — Scaffolds a bench TOML from a short description using an LLM, optionally appending to an existing file. Refuses to overwrite without --force.
- compress --recompress — Squashes an existing compress.checkpoint.json further, instead of starting from the raw trace. The default path now hints when a checkpoint is already present.
- claude provider alias — Added as an alias of anthropic in defaults and /claude mode in interactive prompts. Default model is claude-opus-4-7.
- atproto_index deduplication — Records with URIs already present in the vector store are skipped; the tool now reports skipped alongside indexed.
- MiniMax XML tool-call leak — Anthropic-compatible providers (notably MiniMax) sometimes emit tool calls as raw <minimax:tool_call><invoke name="...">…</invoke></minimax:tool_call> XML inside text content blocks. ExtractTextualToolCalls now strips that markup from assistant text and promotes each <invoke> into a real ToolCall, in both the non-streaming (anthropicParseResponse) and streaming (react solver convergence) paths. The TUI transcript no longer shows XML bleed.
- HealthTracker probe gating — After the cooldown elapsed, IsHealthy returned true on every call, effectively disabling fallback. It now re-arms unhealthyAt on each probe-true so only one probe per cooldown window is allowed, and clears it on real recovery.
- FileSHA256 error check — The file.Close error return is now explicitly discarded (errcheck baseline).
- Rename — autoresearch → v100 train-loop across research/train-loop/prepare.py, pyproject.toml, research/configs/research.toml, research/train-loop/train.py, research/train-loop/program.md. Legacy ~/.cache/autoresearch/ is still read if present.
- Docs — README.md, docs/architecture.md, and new docs/workflows.md refreshed. CLI and research-command taglines updated.
- .gitignore — Ignore *.log.
CLI Confirm Fix, Continuous Mode, and ATProto Index Improvements
- CLI confirm freeze (root cause) — The escape-key listener goroutine raced with ConfirmTool on stdin. Both goroutines could end up blocked on the same fd simultaneously, causing the confirm prompt to freeze. Fixed by replacing the blocking os.Stdin.Read in the escape goroutine with a syscall.Select poll (50 ms timeout), ensuring the goroutine yields before ConfirmTool needs exclusive stdin access.
- confirmPlanExecution freeze — Plan approval prompt still used bufio.NewScanner, which deadlocks in raw terminal mode. Replaced with ui.ConfirmTool to use the same safe raw-mode read path.
- CLI confirm freeze (prior fix) — ConfirmTool was rewritten to use term.MakeRaw + direct byte reads instead of bufio.Scanner, fixing the original cooked-mode deadlock where keyboard input appeared frozen and Ctrl+C had no effect.
- --continuous flag on run and resume — Automatically continues to the next step after each agent turn without waiting for user input. Ctrl+C stops the loop. Useful for unattended multi-step runs.
- user_posts source in atproto_index — Direct PDS fetching for indexing a user's own posts without going through the feed API.
- Release workflow — Pinned softprops/action-gh-release to v2.3.2 to eliminate Node.js 20 deprecation warnings ahead of the June 2026 forced migration to Node.js 24.
ATProto RAG, Audio Fingerprinting, Social Graph Tools, and Compress Command
This release adds semantic search over Bluesky records via vector embeddings, an acoustic fingerprinting tool for identifying songs from audio streams, a social graph explorer for second-degree network discovery, and a standalone compress command for force-compressing run context.
- atproto_index tool — Fetches feed, notifications, or a user profile and embeds each record using a dedicated embedding provider, storing vectors in ~/.v100/atproto.vectors.json for persistence across runs and workspaces.
- atproto_recall tool — Semantic search over indexed ATProto records via cosine similarity. Accepts a natural language query, optional record_type filter (post, notification, profile), and returns scored results for use as RAG context.
- --embedding flag on run — Specifies a dedicated provider for embedding calls, independent of the chat provider (e.g. --embedding ollama). Defaults to the new [embedding] config section (provider = "ollama", model = "nomic-embed-text:latest").
- EmbedProvider on Loop and ToolCallContext — Embedding calls in tools now route through a separate provider field rather than the chat provider, allowing any model that supports embeddings to back the vector tools.
- NewNamedVectorStore — New constructor in internal/memory for named vector stores (<name>.vectors.json) separate from the blackboard store.
- UserDataDir() — New config helper returning ~/.v100/ for user-local persistent data.
- fingerprint tool — Identifies songs from an audio stream URL or local file using chromaprint (fpcalc) and the AcoustID API. Records a short sample, generates an acoustic fingerprint, and returns artist, title, and MusicBrainz recording ID. Requires fpcalc and ffmpeg.
- atproto_get_follows — Lists accounts followed by a given user.
- atproto_get_followers — Lists accounts following a given user.
- atproto_get_profile — Fetches a full Bluesky profile.
- atproto_graph_explorer — Maps second-degree network: surfaces accounts followed by people you follow that you don't yet follow yourself, ranked by mutual follow count.
- v100 compress <run_id> — Force-compresses the message history of any existing run and writes a compress.checkpoint.json to the run directory. Accepts --provider to select the compression model and --dry-run to preview token savings without writing.
- Checkpoint-based resume — v100 resume now detects compress.checkpoint.json and loads the compressed message history instead of replaying the full trace.
- Download spinner helpers — Added DownloadSpinner and SpinSlash animation helpers for TUI status indicators.
- Radio download animation — TUI radio mode ticks a download spinner while in downloading state.
Compression Hardening and Quebec News Defaults
This patch release makes GLM-backed context compression safer under provider limits, updates the default GLM model to GLM-5.1, and expands Quebec French news defaults.
- GLM compression hardened — Context compression now avoids bursty per-message calls on GLM, sanitizes malformed compression payloads before provider requests, and caps targeted compression for other providers.
- Default GLM model updated — Built-in GLM defaults now target GLM-5.1 across provider construction and config-backed tests.
- Quebec French news feeds expanded — news_fetch now includes TVA Nouvelles and L'Actualité in Quebec French defaults, with test coverage for the new routing behavior.
- Benchmark fixture added — Added a MiniMax vs GLM benchmark config under tests/benchmarks/ for repeatable provider comparisons.
GLM Defaults and Update Command Wiring
This patch release keeps default cloud runs on GLM for model calls, router cheap-tier calls, and context compression, while wiring the self-update command into the CLI surface.
- GLM default path hardened — Built-in defaults now use GLM for provider, smart_provider, cheap_provider, and compress_provider, avoiding accidental Ollama fallback during cloud runs.
- Router cheap provider respects config — Router and smartrouter construction now honor the configured cheap_provider before considering local fallbacks, so cheap_provider = "glm" stays on GLM.
- Cloud compression avoids local fallback — When no explicit compression provider is configured, cloud main providers now reuse the main provider instead of selecting a local Ollama backend.
- Update command registered — The root CLI now exposes v100 update, runs background update checks outside the update command itself, and includes a v100 version helper.
- Update tests tightened — Added update package coverage for semantic version comparison and platform-specific release asset naming.
MiniGLM Provider Switching
This patch release adds the MiniGLM solver and makes GLM the default provider path for normal runs.
- MiniGLM solver added — Added a MiniGLM solver for intelligent switching between MiniMax and GLM-backed work.
- GLM default provider — Default provider settings now prefer GLM for the main run path.
Update Version Comparison Fix
This patch release fixes update detection so multi-digit patch versions compare correctly.
- Semantic version comparison fixed — Update checks now compare semantic version components numerically instead of lexicographically, so versions like v0.2.10 sort after v0.2.9.
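The difference between lexicographic and numeric comparison can be shown with a short sketch. compareSemver and parseSemver are hypothetical helpers, not the functions in the v100 update package, and they assume plain vMAJOR.MINOR.PATCH strings without pre-release or build suffixes.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseSemver splits "vMAJOR.MINOR.PATCH" into three integers.
// Pre-release and build metadata are not handled in this sketch.
func parseSemver(v string) ([3]int, error) {
	var out [3]int
	parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("want MAJOR.MINOR.PATCH, got %q", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return out, fmt.Errorf("bad component %q in %q", p, v)
		}
		out[i] = n
	}
	return out, nil
}

// compareSemver returns -1, 0, or 1 when a is older than, equal to,
// or newer than b, comparing each component as a number.
func compareSemver(a, b string) (int, error) {
	pa, err := parseSemver(a)
	if err != nil {
		return 0, err
	}
	pb, err := parseSemver(b)
	if err != nil {
		return 0, err
	}
	for i := range pa {
		if pa[i] < pb[i] {
			return -1, nil
		}
		if pa[i] > pb[i] {
			return 1, nil
		}
	}
	return 0, nil
}

func main() {
	c, _ := compareSemver("v0.2.10", "v0.2.9")
	// Numeric comparison says newer (1); plain string comparison says
	// "v0.2.10" is not greater, because '1' sorts before '9'.
	fmt.Println(c, "v0.2.10" > "v0.2.9") // prints "1 false"
}
```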
Lint Cleanup
This patch release resolves lint issues introduced during the update and provider work.
- Lint issues resolved — Fixed golangci-lint findings across the current release branch.
- Update install hardening — Update application handles cross-device executable replacement more robustly.
Tag-Only Release Workflow
This patch release moves the release workflow to tag-only publishing after the multi-platform release pipeline changes.
- Tag-only releases — Release automation now runs from tags instead of branch pushes, reducing accidental release attempts.
Multi-Platform Release Flow
This patch release finishes the cross-platform release pipeline, ships platform-specific install scripts, and removes release-blocking platform dependencies from the build path.
- Multi-platform artifacts — Release builds now publish Linux, macOS, and Windows binaries for both amd64 and arm64 where applicable.
- Checksum-verified installers — The shell and PowerShell installers now download the exact release assets and verify them against checksums.txt.
- Release metadata aligned — The README now documents the shipped binaries and installer entry points so operators can install without guessing asset names.
- fs_outline portability — The semantic file outline tool now uses the Go AST on non-Windows platforms, removing the tree-sitter dependency from release builds.
- Windows CLI stubs — Windows-specific wake and UI stubs keep the command surface and package builds consistent across targets.
Structured News, Persistent Memory, and Interactive Diffing
This patch release adds a source-aware news retrieval tool, introduces categorized persistent memory with expiry, and ships a side-by-side trace diff TUI, while tightening watchdog discipline, trace analytics, and interactive budget behavior.
- news_fetch tool — Added a dedicated structured news retrieval tool with feed-first collection, source-aware extraction, normalized headline items, and explicit partial-failure reporting for blocked or thin outlets.
- Image-aware Codex runs — Codex provider flows now support image attachments, and policy defaults steer the agent toward direct image inspection when visual evidence is available.
- Shared blackboard state — Blackboard memory flows are more useful across runs, with category-aware storage and better review/search behavior.
- Categorized persistent memory — Durable memory now supports fact, preference, constraint, and note categories, plus note expiry/TTL and category-aware retrieval.
- Memory CLI and review upgrades — v100 memory gained better remember/list/review ergonomics, and expired notes are pruned consistently from retrieval and operator views.
- Wake goal scanning — Autonomous wake flows now mine TODOs, dirty files, recent failed runs, and failure artifacts to propose grounded next goals instead of relying on shallow workspace inspection.
- Synchronized trace diff model — Added an alignment-aware sync diff that can realign after mid-trace insertions or deletions, enabling reliable side-by-side comparison.
- Interactive v100 diff --tui — New Bubble Tea diff viewer renders synchronized transcript panes, keeps scrolling aligned, and jumps directly to the first divergence.
- Panelized TUI layout — Extracted panel rendering contracts and tightened pane sizing behavior, fixing status/trace allocation regressions and improving small-terminal behavior.
- Post-tool policy hooks — Threshold and deduplication hooks now trigger on actual tool results, preventing tool-free turns from consuming failure budget and making repeated tool misuse visible at the right time.
- Trace analytics accuracy — Stats and metrics now count executed tools from tool.result, classify tool-budget exhaustion more clearly, and avoid double-counting streamed tool-call placeholders.
- Budget continuation hardening — Interactive budget continuation and compression telemetry are more explicit, with better handling when runs approach or exhaust token budgets.
Autonomous Wake Hardening and Transcript Fixes
This patch release hardens the new wake issue-worker loop, restores missing user-message visibility in the UI, and tightens router escalation when cheap-tier models hallucinate tools.
- Wake issue-worker git safety — Autonomous issue-worker cycles now require a clean working tree before starting, require exactly one new commit, and only auto-push/close from the default branch.
- Wake sandbox fingerprint baseline — Sandboxed runs now persist the source-workspace fingerprint at run start, improving apply-back conflict detection and baseline tracking.
- Issue-worker watchdog handling — Headless wake issue-worker runs disable read-heavy watchdog interventions that were prematurely stopping autonomous inspection loops.
- CLI and TUI user messages restored — Submitted user messages now appear again in both the CLI transcript and TUI transcript instead of disappearing after the duplicate-echo workaround.
- CLI prompt echo cleanup — The terminal prompt line is cleared before event rendering so submitted messages are shown exactly once.
- Compact failure digest improvements — Failure digests are auto-printed at the end of failed runs with cleaner operator-facing summaries.
- Router cheap-tier escalation hardened — The router now escalates to the smart tier when the cheap model emits unknown or disabled tool names, while still allowing trivial safe mutations like fs_mkdir to stay cheap.
- Sandbox apply-back on prompt_exit — Non-interactive --exit runs now allow normal sandbox apply-back, matching the intended successful one-shot flow.
- MiniMax unresolved tool-call sanitization — Live and provider-facing history now quarantine unresolved tool calls more aggressively to avoid MiniMax request failures.
- Host network policy regression fixed — Host-mode sessions no longer bypass network_tier=off through the shell tool.
- Gemini embedding auth corrected — Gemini embeddings now use real API-key auth instead of the wrong subscription-token path.
MiniMax Default Upgrade and Docs Refresh
This patch release updates the built-in MiniMax default model to MiniMax-M2.7 and refreshes stale operator docs so the README and memory notes match current runtime behavior.
- MiniMax default model upgraded — Built-in config defaults, provider defaults, tests, and benchmark fixtures now use MiniMax-M2.7.
- Provider docs aligned — README examples and provider matrix now reflect MiniMax as the built-in default provider and MiniMax-M2.7 as the default model.
- README cleanup — Corrected the default provider guidance, solver count, Go version requirement, and tool-surface description.
- Compression notes refreshed — Refreshed compression notes to reflect the current two-pass compression flow with targeted compression before oldest-half fallback.
Harness Cleanup and Watchdog Hardening
This patch release tightens CLI ergonomics, hardens watchdog and tool-surface behavior, and reduces sandbox artifact noise ahead of the next push.
- CLI dangerous-tool confirmation no longer breaks interactive input — The Escape listener now backs off while confirmation prompts are active, preventing raw-mode input races during approval flows.
- CLI transcript readability cleanup — The transcript now uses plainer labels (me, agent, tool), separates spinner output from assistant text cleanly, and reduces decorative glyph noise.
- Styled digest output — v100 digest now renders a clearer operator-facing failure digest in the CLI while preserving JSON output for machine use.
- Tool-surface validation is enforced across commands — Enabled tools are now validated against the registered runtime surface in run, resume, eval/bench, and tools, with clearer reporting for invalid enabled entries.
- Registry surface validation — Enabled tools must now have non-empty descriptions and non-null input schemas, reducing prompt/runtime drift and malformed tool surfaces.
- Watchdog stop-tools behavior now matches policy — Inspection/read-heavy watchdogs now force a true final no-tools synthesis turn instead of silently allowing more tool use or terminating early.
- System interventions no longer masquerade as user input — Solver steering and watchdog messages are recorded as system messages, improving trace correctness and downstream analysis.
- Stats/digest tool-call dedupe is step-scoped — Tool calls are no longer undercounted when call IDs repeat across different steps.
- Core-size TUI snapshots — Added snapshot-style regression coverage for narrow, standard, and wide TUI layouts.
- TUI step interruption support — Active TUI steps can now be interrupted cleanly without leaving the run in a confused state.
- Apply-back skips more runtime byproducts — Sandbox apply-back now ignores more harness/runtime and package-manager noise, including exports/, .gocache/, .gomodcache/, .npm/, and node_modules/.
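The step-scoped dedupe fix listed above can be pictured as changing the dedupe key from the call ID alone to the pair of step and call ID. The toolCallEvent type and countToolCalls function below are hypothetical and only illustrate that idea.

```go
package main

import "fmt"

// toolCallEvent is a hypothetical, reduced trace event.
type toolCallEvent struct {
	Step   int
	CallID string
}

// countToolCalls counts distinct tool calls, treating the same call ID
// in different steps as different calls. Duplicates within one step
// (for example a streamed placeholder plus the final call) count once.
func countToolCalls(events []toolCallEvent) int {
	type key struct {
		step int
		id   string
	}
	seen := make(map[key]struct{})
	for _, e := range events {
		seen[key{step: e.Step, id: e.CallID}] = struct{}{}
	}
	return len(seen)
}

func main() {
	events := []toolCallEvent{
		{Step: 1, CallID: "call_0"},
		{Step: 1, CallID: "call_0"}, // duplicate inside step 1
		{Step: 2, CallID: "call_0"}, // same ID reused in step 2
	}
	fmt.Println(countToolCalls(events)) // prints 2
}
```

Keying on the call ID alone would report one call for this input, which is the undercount described in the entry.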
UX Research Round 2: Dogfooding Fixes
This release addresses 12 issues found during intensive dogfooding with Gemini and MiniMax providers across ~25 runs.
- Spinner no longer pollutes non-TTY output — Spinner frames (\r\033[K) are skipped entirely when stdout is redirected to a file or pipe, fixing garbled log captures.
- Spinner no longer interleaves with tool output — The model-call spinner is now stopped before rendering tool results, eliminating visual artifacts in live terminal output.
- resume --auto works — Added missing --unsafe and --yolo flags to the resume command, making resume --auto --unsafe and resume --yolo functional.
- resume no longer dumps usage on safety errors — Added SilenceUsage: true to the resume command for clean error messages.
- MiniMax context overflow — Error code 2013 with "context window exceeds limit" now shows a clear message instead of the misleading "message ordering bug" label.
- Gemini 429 shows human-readable message — Rate-limit errors now extract the message field from the JSON response (e.g., "You have exhausted your capacity on this model") instead of dumping raw JSON.
- Stats no longer show zeros for aborted runs — ComputeStats now infers TotalSteps=1 when no step.summary events were emitted but model calls occurred (e.g., budget-exceeded or error-aborted runs).
- Doctor warns instead of failing on unused providers — Only the default provider causes a failure; other configured-but-unauthenticated providers show warnings (⚠) instead of failures (✗).
- runs list hides sub-runs by default — Plan-execute sub-runs are filtered out unless --all is passed. Sub-runs display with ↳ prefix when shown.
- runs list filtering — New flags: --provider <name>, --failed (show only failed/errored runs), --all (include sub-runs).
- Schema-aware plan_execute planner — The planning phase now receives tool specifications so the planner knows which tools exist and their parameter schemas, reducing hallucinated tool names.
- Pre-step budget check — ReactSolver now checks remaining token budget before entering a step. If remaining tokens are below 5% of total budget, the run exits early with a clear error instead of failing mid-step.
- ParentRunID in run metadata — RunMeta now tracks parent-child relationships between runs for sub-run hierarchy.
Phase 300: Autonomous Optimization Foundation
This release introduces meta-cognitive tools for agent self-refinement, hardens the TUI layout engine, and enables streaming by default.
- reflect tool — Meta-cognitive self-critique: agents can pause to evaluate progress, plan correctness, and goal alignment. Returns a PASS/FAIL/PARTIAL verdict with reasoning and suggested pivot.
- v100 mutate command — Trace-driven prompt optimizer that analyzes both qualitative behavioral labels and quantitative failure signatures (step counts, tool error rates, context saturation) to suggest improved prompts.
- v100 digest command — Compact failure digest for completed runs, surfacing key failure points without the full trace.
- JSON Output — Added --format json to stats, metrics, analyze, digest, and diff commands for seamless integration with automation pipelines.
- Scoring Persistence — Benchmarks and experiments now save full LLM-graded reasoning to meta.json and a detailed evaluation.json artifact in the run directory.
- Streaming by Default — Token streaming is now enabled by default for all providers that support it.
- Compression Telemetry — Enhanced context compression events with token tracking for Anthropic and MiniMax providers.
- Dynamic TUI Layout — Proportional height allocation ensures perfect column alignment across all terminal sizes and pane combinations.
- Overflow-Safe Status Pane — Status pane text wrapping no longer breaks right-column height; trace pane absorbs the difference.
Phase 250: Harness Stabilization & Mission Control
This release focuses on operator experience, TUI aesthetics, and provider hardening to support long-horizon research.
- Mission Control TUI — Re-architected the right column to include three persistent panes: Trace, Visual Inspector, and Status.
- Visual Inspector — New gaming-inspired dashboard with real-time entropy gauges for token window saturation, step budget, and reasoning intensity (I/O ratio).
- Cognitive Heartbeat — Animated ASCII pulse indicating real-time agent cognitive activity.
- Radio Station Selector — Dedicated modal (Alt+R or /radio) for selecting ambient background stations by name. Renamed "Radiojar" to "Radio Al Hara".
- Typing Hygiene — Removed conflicting single-key radio shortcuts (n, p, 1) to prevent interference with text input.
- Layout Math — Refined vertical budgeting to ensure all panes fit perfectly across different terminal sizes.
- Non-Interactive Mode — New --exit flag for v100 run that executes the initial prompt and automatically finalizes the run without entering the interactive loop.
- MiniMax Hardening — Implemented contiguous tool-result ordering to fix Error 2013.
- Improved Diagnostics — Explicit logging for message ordering bugs and Gemini multi-tool desyncs.
- Expanded Quest Pack — Added DF-12 (Non-Interactive Smoke) and updated DF-07/DF-08 to include MiniMax as a standard benchmark provider.
Phase 100: Recursive Self-Evolution
This release introduces the first milestone of the self-evolution engine, allowing agents to distill their own trajectories and author new tools at runtime.
- Distill command — `v100 distill <run_id>` converts JSONL traces into ShareGPT-formatted datasets for model fine-tuning and DPO.
- Dynamic Tool Registry — Support for `RegisterAndEnable` at runtime, enabling agents to expand harness capabilities without re-compilation.
- Automatic Build Feedback — Modified `internal/core/loop.go` to trigger `go build ./...` after every workspace mutation, injecting compiler errors as a `SYSTEM ALERT` to enforce a reality-check loop.
- `sql_search` — Execute SQL queries against local SQLite databases with path sanitization.
- `graphviz` — Render DOT graph definitions into images (PNG/SVG) for architectural visualization.
- Dependency Tracking — Added `github.com/mattn/go-sqlite3` for local structured data operations.
- Documentation — New DF-11 quest in `dogfood/` for verifying self-evolution trajectories.
Initial release of v100, an experimental agent harness for studying long-horizon LLM behavior.
- Agent loop — ReAct-style tool-using agent loop with structured JSONL traces
- Budget enforcement — hard limits on steps, tokens, and cost (`--budget-steps`, `--budget-tokens`, `--budget-cost`)
- Context compression — automatic context window management with compression events
- Dangerous tool confirmation — CLI stdin prompt or TUI Ctrl+Y/Ctrl+N
- Codex — ChatGPT subscription via PKCE OAuth (`v100 login`)
- OpenAI — standard API access (`OPENAI_API_KEY`)
- Gemini — Google subscription via OAuth (`v100 login --provider gemini`)
- Ollama — fully local models, no API key required
- Anthropic — Claude API access (`ANTHROPIC_API_KEY` or `v100 login --provider anthropic`)
- Retry/backoff middleware — unified retry handling across providers for 429 and 5xx responses
- Model metadata discovery — providers expose context windows, pricing hints, and free/paid status to the harness
All providers support tool calling and generation parameters (temperature, top_p, top_k, max_tokens, seed).
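As a rough illustration of the retry/backoff entry above, the sketch below retries on 429 and 5xx statuses with capped exponential backoff. All names, the doubling schedule, and the injected `sleep` function are assumptions; the changelog does not describe the middleware's actual policy.

```go
package main

import (
	"fmt"
	"time"
)

// retryable reports whether an HTTP status should be retried: 429 or any 5xx.
func retryable(status int) bool {
	return status == 429 || (status >= 500 && status <= 599)
}

// backoff returns the delay before retry number attempt (0-based):
// base doubled once per attempt, capped at limit.
func backoff(attempt int, base, limit time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= limit {
			return limit
		}
	}
	if d > limit {
		return limit
	}
	return d
}

// withRetry calls do until it returns an error, a non-retryable status, or
// maxAttempts calls have been made. sleep is injected so callers can
// substitute a fake clock.
func withRetry(maxAttempts int, base, limit time.Duration,
	sleep func(time.Duration), do func() (int, error)) (int, error) {
	var status int
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		status, err = do()
		if err != nil || !retryable(status) {
			return status, err
		}
		if attempt < maxAttempts-1 {
			sleep(backoff(attempt, base, limit))
		}
	}
	return status, fmt.Errorf("giving up after %d attempts, last status %d", maxAttempts, status)
}

func main() {
	// Simulated provider responses: rate limited, server error, then success.
	statuses := []int{429, 503, 200}
	call := 0
	status, err := withRetry(5, 100*time.Millisecond, 2*time.Second,
		func(d time.Duration) { fmt.Println("waiting", d) }, // demo: log instead of sleeping
		func() (int, error) {
			s := statuses[call]
			call++
			return s, nil
		})
	fmt.Println(status, err)
}
```

A production version would normally add jitter and honor a `Retry-After` header when the provider sends one. Both are left out here to keep the sketch short.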
- ReactSolver — classic ReAct loop (default)
- PlanExecuteSolver — two-phase plan-then-execute with automatic replanning on failure (`--solver plan_execute`, `--max-replans`)
- Docker executor — isolated container execution with hardened security (seccomp, dropped capabilities, no-new-privileges, PID limits)
- Network policy — configurable network isolation (`off` or `open`)
- Snapshots — checkpoint and restore sandbox state during runs
- Apply-back — merge sandbox changes back to host workspace (`manual`, `on_success`, `never`)
fs_read, fs_write, fs_list, fs_mkdir, fs_outline, sh, git_status, git_diff, git_commit, git_push, sem_diff, sem_impact, sem_blame, patch_apply, project_search, curl_fetch, agent, dispatch, orchestrate, blackboard_read, blackboard_write, blackboard_store, blackboard_search
- Sub-agent delegation — `agent` tool spawns bounded child loops
- Named specialists — config-driven roles via `[agents.<name>]`
- Orchestration — `orchestrate` tool supports `fanout` and `pipeline` patterns
- Shared state — blackboard tools for cross-agent coordination with vectorized memory
- Reflection turn — internal confidence check before dangerous tool execution
- Run scoring — `v100 score <run_id> pass|fail|partial`
- Run statistics — `v100 stats`, `v100 metrics`, `v100 compare`
- Metadata-aware reporting — `meta.json`, `stats`, `compare`, and `query` surface model context/pricing metadata
- Batch benchmarks — `v100 bench <config.toml>` with provider/model/parameter variants
- Experiments — `v100 experiment create|run|results` for multi-variant statistical testing
- Behavioral analysis — `v100 analyze` with automatic failure classification
- Trace diffing — `v100 diff` to find divergence between runs
- Run querying — `v100 query --tag key=val --score pass`
- Pluggable scorers — exact_match, contains, regex, script, model_graded
- CLI — line-by-line streaming output (default)
- TUI — Bubble Tea 3-pane interface with transcript, trace, and input panes
- 21 structured event types covering run lifecycle, model calls, tool execution, solver planning, sandbox snapshots, agent delegation, and context compression
- Deterministic replay with `--replace-model` and `--inject-tool` for counterfactual analysis
- Run metadata with names, tags, and scores for later querying
- `v100 doctor` — health check for providers, tools, and configuration
- `v100 config init` — generates default config and OAuth credential templates
- CI — GitHub Actions with `go test -race`, `go vet`, pinned `golangci-lint`, and hardened semantic tool detection
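The three simplest pluggable scorers named earlier (exact_match, contains, regex) can be sketched as a small factory. The `Scorer` type and `newScorer` are hypothetical and do not reflect the harness's real scorer interface. The script and model_graded scorers are omitted because they need an external process or a model call.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Scorer is a hypothetical scorer signature: it reports whether a run's
// final output passes.
type Scorer func(output string) bool

// newScorer builds a scorer of the given kind from an expected value.
func newScorer(kind, expected string) (Scorer, error) {
	switch kind {
	case "exact_match":
		// Assumption: surrounding whitespace is ignored.
		return func(out string) bool { return strings.TrimSpace(out) == expected }, nil
	case "contains":
		return func(out string) bool { return strings.Contains(out, expected) }, nil
	case "regex":
		re, err := regexp.Compile(expected)
		if err != nil {
			return nil, fmt.Errorf("invalid regex %q: %w", expected, err)
		}
		return re.MatchString, nil
	default:
		return nil, fmt.Errorf("unsupported scorer kind %q", kind)
	}
}

func main() {
	score, err := newScorer("regex", `^answer: \d+$`)
	if err != nil {
		panic(err)
	}
	fmt.Println(score("answer: 42"), score("no answer"))
}
```

Compiling the regular expression once in the factory surfaces an invalid pattern when the scorer is built, before any run is scored.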