Voice in, voice out. Your own LLM. Your own tools. Your data never leaves the box.
HavenCore is a production-grade personal AI assistant I built to run entirely on my own hardware — no cloud inference, no data phoned home. It hears you through a wake-word device, transcribes with Whisper, reasons with a local 72B LLM (vLLM), calls tools over MCP (Home Assistant, Plex, web search, image gen, etc.), speaks back with Kokoro TTS, and runs proactively on its own schedule when you're not looking.
Everything is one `docker compose up -d` away. Twelve containers, one GPU fleet, one dashboard.
The assistant's name is Selene. She lives on four RTX 3090s in my shed.
Hardware you'll need: Linux host, recent NVIDIA driver + container toolkit, Docker Compose v2, and GPU VRAM for your chosen LLM. The default Qwen2.5-72B-AWQ stack wants ≥ 48 GB VRAM split across two cards; a single 24 GB card works if you swap in a smaller model. Plan on ~60 GB of disk for images + model weights on first build.
**Chat with tool visibility.** Every tool call is rendered inline — arguments, results, and per-turn timings (LLM / tools / total / iterations) — so you can actually see the agent think.

**Live Home Assistant state.** Real entity counts, automations, scenes — pulled straight from HA and grouped by domain. Selene controls all of it via MCP tool calls.

**Per-turn metrics, persisted.** Every LLM call, every tool invocation, every latency. Stored in Postgres. Charted over 14 days. p95 turn time, top tools, avg LLM latency — all queryable.

**Autonomous agenda.** Selene isn't purely reactive. A background engine fires scheduled briefings, anomaly sweeps over HA state, user-programmed reminders/watches/routines, and a nightly memory-consolidation pass. With a kill switch, rate limits, tier-gated tools, and quiet hours.

**Tiered semantic memory (L2 → L3 → L4).** A Qdrant-backed episodic store that a nightly LLM pass consolidates into summaries, promotes the important stuff into persistent context, and bounds to a fixed token budget injected into every prompt.

**System health & MCP fleet.** Every MCP tool server, every registered tool, live log stream, vLLM model info, DB status. The boring operator plane — but it's there.
More screenshots — conversation history, service playgrounds
Built-in playgrounds for TTS, STT, the vision model, and ComfyUI image generation — each proxied through the agent so there's zero CORS / network setup to test a model.
This is the part that's worth scrolling for.
The core loop (`orchestrator.py`) is a typed async generator that emits `THINKING` / `TOOL_CALL` / `TOOL_RESULT` / `METRIC` / `DONE` / `ERROR` events. The WebSocket handler streams them straight to the dashboard; the OpenAI-compatible endpoint assembles them into SSE. Same code path, three surfaces.
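A minimal sketch of such an event-emitting loop, with illustrative types and names (not HavenCore's actual code):

```python
# Hypothetical sketch: a typed async generator yields events, and each
# transport (WebSocket, SSE) decides its own wire format downstream.
import asyncio
from dataclasses import dataclass
from enum import Enum, auto
from typing import Any, AsyncIterator


class EventType(Enum):
    THINKING = auto()
    TOOL_CALL = auto()
    TOOL_RESULT = auto()
    METRIC = auto()
    DONE = auto()
    ERROR = auto()


@dataclass
class AgentEvent:
    type: EventType
    payload: Any = None


async def run_turn(user_text: str) -> AsyncIterator[AgentEvent]:
    """Yield typed events; the LLM + tool loop would sit between them."""
    yield AgentEvent(EventType.THINKING, {"text": user_text})
    # ... LLM call and tool iterations elided ...
    yield AgentEvent(EventType.METRIC, {"llm_ms": 0, "tools_ms": 0})
    yield AgentEvent(EventType.DONE, {"reply": "ok"})


async def main() -> list[EventType]:
    return [ev.type async for ev in run_turn("hi")]

print(asyncio.run(main()))
```

Because consumers only see the event stream, adding a fourth surface means writing one more serializer, not touching the loop.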
Concurrent users don't share state. A `SessionOrchestratorPool` keyed by `session_id` gives each conversation its own orchestrator, its own messages, its own `asyncio.Lock` — so two dashboard tabs, or a dashboard plus a voice puck, never race on each other's turns. The pool runs a 30-second idle sweep (flush timed-out sessions to Postgres, reinitialize in place), an LRU cap at 64 (evict + persist), and a shutdown flush (nothing lost on restart).

A stored `session_id` can be cold-resumed from the DB via `POST /api/conversations/{id}/resume` — the /history page's Resume button hydrates a past conversation straight into /chat and keeps going. `/v1/chat/completions` deliberately bypasses all of this: it's stateless by design, with an ephemeral orchestrator per request; the caller owns the history.
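The pool's LRU-with-eviction behavior can be sketched as follows; the class name and the cap of 64 come from the text above, while the method names and the `persist` stub are illustrative:

```python
# Illustrative per-session pool: OrderedDict as an LRU, evicted
# sessions are persisted rather than dropped.
import asyncio
from collections import OrderedDict


class Orchestrator:
    def __init__(self, session_id: str) -> None:
        self.session_id = session_id
        self.messages: list[dict] = []
        self.lock = asyncio.Lock()   # serializes turns within one session


class SessionOrchestratorPool:
    def __init__(self, max_sessions: int = 64) -> None:
        self.max_sessions = max_sessions
        self._pool: "OrderedDict[str, Orchestrator]" = OrderedDict()

    def get(self, session_id: str) -> Orchestrator:
        if session_id in self._pool:
            self._pool.move_to_end(session_id)   # mark recently used
            return self._pool[session_id]
        if len(self._pool) >= self.max_sessions:
            _, evicted = self._pool.popitem(last=False)
            self.persist(evicted)                # evict + persist, never lose
        orch = Orchestrator(session_id)
        self._pool[session_id] = orch
        return orch

    def persist(self, orch: Orchestrator) -> None:
        pass  # the real system would flush messages to Postgres here


pool = SessionOrchestratorPool(max_sessions=2)
pool.get("a"); pool.get("b")
pool.get("a")                # touch "a" so "b" is least recently used
pool.get("c")                # evicts (and persists) "b"
print(sorted(pool._pool))    # ['a', 'c']
```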
Tools aren't a hardcoded registry — they're discovered at startup by an `MCPClientManager` that spawns each tool server as a subprocess and speaks stdio JSON-RPC to it. A `UnifiedTool` abstraction converts MCP tool schemas to OpenAI function-calling format on the fly. Adding a tool server is a new folder with a `__main__.py`.
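The schema translation is essentially a field mapping. A hedged sketch, assuming the MCP side uses the spec's `inputSchema` field (the function name and the example tool are illustrative):

```python
# Illustrative MCP -> OpenAI function-calling conversion, in the spirit
# of the UnifiedTool abstraction described above.
def mcp_to_openai(mcp_tool: dict) -> dict:
    return {
        "type": "function",
        "function": {
            "name": mcp_tool["name"],
            "description": mcp_tool.get("description", ""),
            # MCP's JSON Schema carries over directly as "parameters"
            "parameters": mcp_tool.get(
                "inputSchema", {"type": "object", "properties": {}}
            ),
        },
    }


tool = {
    "name": "light_turn_on",
    "description": "Turn on a Home Assistant light.",
    "inputSchema": {
        "type": "object",
        "properties": {"entity_id": {"type": "string"}},
        "required": ["entity_id"],
    },
}
spec = mcp_to_openai(tool)
print(spec["function"]["name"])  # light_turn_on
```

Both sides speak JSON Schema for parameters, which is why the conversion can happen "on the fly" with no per-tool code.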
7 MCP servers, 41 tools live today: Home Assistant (18), Plex (5), Music Assistant (7), Qdrant memory (2), web/Wolfram/Wikipedia/Brave/weather/image-gen/Signal (7), MQTT cameras (1), HTTP fetch (1).
Selene runs an asyncio dispatcher in the same process that fires `briefing`, `anomaly_sweep`, and user-defined `reminder` / `watch` / `routine` / `memory_review` kinds. Each run:

- spins up a fresh orchestrator with its own `session_id` — never touches user chat state
- is tool-gated by tier (`observe` / `notify` / `speak` / `act`) with a hard deny-list on top
- honors a global hourly rate limit, per-signature cooldowns, and quiet hours
- emits a full audit trail (messages, tool calls, metrics) into `autonomy_runs`
- can be killed at runtime via `POST /api/autonomy/pause`
The first novel `act`-tier action for any signature requires explicit confirmation — once approved, it joins a per-item allow-list.
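The gating described above can be sketched as a pure function; the tier names and the confirmation rule come from the text, everything else (signatures, the function shape) is illustrative:

```python
# Illustrative tier gate: deny-list wins, the run's tier caps the tool's
# tier, and a novel act-tier signature needs prior confirmation.
TIERS = ["observe", "notify", "speak", "act"]

def is_allowed(tool_tier: str, run_tier: str, tool: str,
               deny_list: set[str], confirmed_signatures: set[str],
               signature: str) -> bool:
    if tool in deny_list:                              # hard deny-list on top
        return False
    if TIERS.index(tool_tier) > TIERS.index(run_tier):
        return False                                   # tool exceeds run's tier
    if tool_tier == "act" and signature not in confirmed_signatures:
        return False                                   # novel act: confirm first
    return True

# A notify-tier tool is fine inside a speak-tier run...
assert is_allowed("notify", "speak", "send_note", set(), set(), "sig1")
# ...but a never-confirmed act is blocked even in an act-tier run,
assert not is_allowed("act", "act", "unlock_door", set(), set(), "sig2")
# until its signature has been explicitly approved.
assert is_allowed("act", "act", "unlock_door", set(), {"sig2"}, "sig2")
```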
Not RAG-over-everything. A four-tier scheme:
- L1 — current conversation (session scope)
- L2 — episodic, per-turn, Qdrant embeddings
- L3 — consolidated summaries produced by a nightly LLM pass, with importance decay and rank boosts
- L4 — persistent facts promoted through a gated review process, injected into every system prompt within a bounded token budget
Search at query time blends tiers; the /memory page lets you inspect and edit each level.
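One way to bound L4 injection to a fixed budget is greedy selection by importance; this is a sketch, and the 4-characters-per-token estimate plus the budget value are assumptions, not HavenCore's numbers:

```python
# Illustrative bounded-budget selection of promoted facts for the
# system prompt: highest importance first, skip anything that would
# blow the token budget.
def select_facts(facts: list[tuple[float, str]], token_budget: int) -> list[str]:
    """facts: (importance, text) pairs; returns texts kept in budget."""
    chosen, used = [], 0
    for _importance, text in sorted(facts, reverse=True):
        cost = max(1, len(text) // 4)   # rough chars-per-token estimate
        if used + cost > token_budget:
            continue
        chosen.append(text)
        used += cost
    return chosen

facts = [
    (0.9, "User's name is Matt."),
    (0.2, "x" * 400),                 # low importance, far too long
    (0.7, "Shed has four 3090s."),
]
print(select_facts(facts, token_budget=20))  # keeps the two short facts
```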
Every agent turn writes a row to a turn_metrics Postgres table: LLM latency, per-tool latencies, total, iteration count, tool list. The dashboard renders p95s, top tools, 14-day activity — out of the same pool the orchestrator uses, no separate metrics service.
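From such a table, a p95 is one aggregate away. The `total_ms` column name here is an assumption; the SQL uses Postgres's standard `percentile_cont`, and the Python function mirrors its linear interpolation:

```python
# Sketch: p95 turn time over 14 days, as SQL (for Postgres) and as the
# equivalent pure-Python statistic. Column name total_ms is assumed.
SQL = """
SELECT percentile_cont(0.95) WITHIN GROUP (ORDER BY total_ms) AS p95
FROM turn_metrics
WHERE created_at > now() - interval '14 days';
"""

def p95(latencies_ms: list[float]) -> float:
    s = sorted(latencies_ms)
    # linear interpolation between closest ranks, like percentile_cont
    k = 0.95 * (len(s) - 1)
    lo, hi = int(k), min(int(k) + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (k - lo)

print(p95([100, 200, 300, 400, 1000]))
```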
Port 6002 serves:

- the static SvelteKit dashboard build (mounted from `/srv/agent-static` — outside `/app` so the dev volume mount doesn't shadow it)
- `/api/*` REST, `/ws/*` WebSocket, and `/v1/*` OpenAI-compatible chat completions (streaming SSE supported)
- service proxies (`/api/{tts,stt,vision,comfy}/*`) so the playground UIs work same-origin with zero CORS config
One uvicorn, one network surface. Nginx in front just does TLS termination and path routing.
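A hypothetical sketch of that nginx layer (paths, ports, and cert locations are illustrative, not HavenCore's actual config):

```nginx
# Illustrative only: TLS termination plus path routing to the agent.
server {
    listen 443 ssl;
    ssl_certificate     /etc/nginx/certs/haven.crt;   # assumed paths
    ssl_certificate_key /etc/nginx/certs/haven.key;

    location / {
        proxy_pass http://agent:6002;        # dashboard, /api, /v1
        proxy_set_header Host $host;
    }

    location /ws/ {
        proxy_pass http://agent:6002;        # WebSocket upgrade
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
```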
Wake-word + mic + speaker runs on an ESP32-S3-BOX-3 and talks to HavenCore over the OpenAI-compat API — on-device "Hey Selene" wake, touch-to-talk fallback, per-device X-Device-Name / X-Session-Id so the server can label rooms and keep room-scoped history. See ThatMattCat/havencore-satellite-firmware.
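A satellite request could look like the following; the `X-Device-Name` / `X-Session-Id` header names come from the text, and the payload shape follows the standard OpenAI chat-completions format (the model name is a placeholder):

```python
# Sketch of a room-labeled request to the OpenAI-compatible endpoint.
import json

def build_request(room: str, session: str, text: str) -> tuple[dict, bytes]:
    headers = {
        "Content-Type": "application/json",
        "X-Device-Name": room,      # lets the server label the room
        "X-Session-Id": session,    # keeps room-scoped history
    }
    body = json.dumps({
        "model": "selene",          # placeholder model name
        "stream": True,
        "messages": [{"role": "user", "content": text}],
    }).encode()
    return headers, body

headers, body = build_request("kitchen", "kitchen-1", "turn on the lights")
print(headers["X-Device-Name"])  # kitchen
```

Because the endpoint is OpenAI-compatible, any stock client library works too; the custom headers are the only satellite-specific part.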
┌──────────────────────────────────────────────────────────────────┐
│ Satellite (ESP32-S3-BOX-3) → wake-word, mic, speaker │
│ │ │
│ ▼ OpenAI-compatible HTTPS │
│ ┌──────────────── nginx (80) ──────────────────┐ │
│ │ │ │
│ │ agent (6002) ─┬─ orchestrator (events) ───────► vLLM (8000) │
│ │ FastAPI + │ │
│ │ SvelteKit ├─ MCP client manager ─► general_tools │
│ │ │ homeassistant │
│ │ │ plex / music_assistant│
│ │ │ qdrant_tools │
│ │ │ mqtt_tools │
│ │ │ │
│ │ ├─ conversation_db ───► postgres │
│ │ ├─ metrics_db (turn_metrics) │
│ │ └─ autonomy engine (asyncio) │
│ │ │
│ │ stt (6001) tts (6005) vision (8100) comfy (8188) │
│ │ embeddings (3000) qdrant (6333) mosquitto (1883) │
│ └───────────────────────────────────────────────────────────────┘
└──────────────────────────────────────────────────────────────────┘
Full diagrams and per-service docs: docs/architecture.md.
Everything below assumes a Linux box with NVIDIA GPUs, the container toolkit, and Docker Compose v2.
```shell
git clone https://github.com/ThatMattCat/havencore.git
cd havencore
cp .env.tmpl .env     # fill in HOST_IP_ADDRESS, HAOS_TOKEN, API keys
docker compose up -d  # first build: 60–90 min
                      # first model load: 10–15 min (Qwen2.5-72B-AWQ, ~35 GB pull)
open http://localhost # SvelteKit dashboard
```

Full walkthrough, hardware requirements (TL;DR: one 24 GB GPU works with a smaller model; the default 72B AWQ wants ≥ 48 GB split across two), NVIDIA driver pinning, and troubleshooting: docs/getting-started.md.
- Hot reload for Python: services mount their source; `docker compose restart agent` picks up edits.
- Hot reload for the dashboard: `cd services/agent/frontend && npm run dev` — proxies to `:6002`.
- Test an MCP server in isolation: `docker compose exec -T agent python -m selene_agent.modules.mcp_general_tools` and speak JSON-RPC over stdio — full guide in docs/services/agent/tools/development.md.
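The stdio exchange you would type into that session could start like this; it is a guess at a minimal MCP-style handshake (one JSON object per line), and the exact fields depend on the protocol version:

```python
# Print the first two JSON-RPC lines of an MCP-style stdio handshake,
# suitable for pasting into a tool server's stdin.
import json

initialize = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "initialize",
    "params": {"protocolVersion": "2024-11-05", "capabilities": {}},
}
list_tools = {"jsonrpc": "2.0", "id": 2, "method": "tools/list"}

for msg in (initialize, list_tools):
    print(json.dumps(msg))   # one request per line on the server's stdin
```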
havencore/
├── compose.yaml # 12 services
├── .env.tmpl # every config knob documented inline
├── services/
│ ├── agent/ # FastAPI + SvelteKit + MCP servers
│ │ ├── selene_agent/ # Python package: orchestrator, autonomy, api routers
│ │ │ ├── modules/ # MCP tool servers (general, homeassistant, qdrant, mqtt, plex, mass)
│ │ │ └── autonomy/ # background engine (schedule, turn, tool gating, notifiers)
│ │ └── frontend/ # SvelteKit dashboard (static adapter)
│ ├── speech-to-text/ # Faster-Whisper
│ ├── text-to-speech/ # Kokoro
│ ├── iav-to-text/ # Vision LLM (image/audio/video → text)
│ ├── text-to-image/ # ComfyUI
│ ├── vllm/ llamacpp/ # LLM backends (vLLM default)
│ ├── postgres/ qdrant/ embeddings/
│ ├── nginx/ mosquitto/
├── shared/ # shared_config.py, logger, trace_id
└── docs/ # deep-dive docs, per-service READMEs, integration guides
| Doc | Covers |
|---|---|
| Architecture | System diagrams, data flow, scaling notes |
| Getting started | Install, GPU setup, first-run walkthrough |
| Configuration | Every env var |
| API reference | REST, WebSocket, OpenAI-compat |
| Agent internals | Orchestrator, MCP manager, DB layer |
| Autonomy engine | v1–v4 design + guardrails |
| Memory tiers | L2/L3/L4 consolidation |
| Tool development | Adding an MCP server |
| Home Assistant integration | HA setup + tool reference |
| Media control | Plex / Music Assistant / TV wake |
| Troubleshooting | Common failures + fixes |
This is a real thing I use every day — not a weekend demo. It's also unapologetically bespoke: it runs on my hardware, against my Home Assistant, with my Signal account as the notification channel. The repo is public and the code is readable, but config portability is an ongoing effort and there are rough edges you'd hit trying to run it cold.
What I'd point employers at:
- `services/agent/selene_agent/orchestrator.py` — event-driven agent loop with per-turn metrics, safety limits, and session timeout handling.
- `services/agent/selene_agent/autonomy/` — autonomy engine: tier-gated tools, per-signature cooldowns, fresh-session invariant, full audit trail.
- `services/agent/selene_agent/utils/mcp_client_manager.py` — subprocess lifecycle + schema translation for MCP tool servers.
- `services/agent/frontend/` — SvelteKit dashboard, WebSocket chat store, streaming tool-call cards.
- `compose.yaml` — twelve-service topology, GPU device assignments, health checks.
- havencore-satellite-firmware — ESP32-S3-BOX-3 voice satellite firmware (wake-word, mic, speaker, on-device "Hey Selene").
LGPL v2.1. Do what you want, share improvements back.
Standing on the shoulders of vLLM, Kokoro, Faster-Whisper, Home Assistant, Qdrant, ComfyUI, MCP, and the Svelte team.
Built by Matt. If you're hiring and this looks like work you'd want someone to do for you, let's talk.