docs: update README and docs to reflect memory system implementation

hackertron · hackertron · commit 8a8b0d285951 · 2026-03-03T13:27:09.000+05:30
Update all documentation to cover the persistent memory system added in
Step 5 — SQLite storage, hybrid retrieval (vector + FTS with RRF),
OpenAI embeddings, rolling summarization, memory tools, and session store.
diff --git a/README.md b/README.md
@@ -20,14 +20,14 @@ Think of it as building your own Claude Code / Cursor agent from scratch.
 │           agent turn loop + session          │
 │     stream → think → act → observe → loop   │
 ├──────────────┬──────────────┬───────────────┤
-│   Provider   │    Tools     │    Memory     │
-│   Layer      │   System     │   (Step 5)    │
-│              │              │   (planned)   │
-│  OpenAI      │  read_file   │              │
-│  Anthropic   │  write_file  │              │
-│  Gemini      │  list_files  │              │
-│              │  shell_exec  │              │
-│              │  web_fetch   │              │
+│   Provider   │    Tools     │    Memory     │
+│   Layer      │   System     │   SQLite +    │
+│              │              │ Hybrid Search │
+│  OpenAI      │  read_file   │  store/recall │
+│  Anthropic   │  write_file  │  vector + FTS │
+│  Gemini      │  list_files  │  embeddings   │
+│              │  shell_exec  │  summaries    │
+│              │  web_fetch   │  sessions     │
 ├──────────────┴──────────────┴───────────────┤
 │                   Types                      │
 │          interfaces, contracts, config       │
@@ -81,7 +81,7 @@ internal/
     gemini.go         Google Gemini GenerateContent
     reliable.go       Retry wrapper with exponential backoff
   runtime/            Agent turn loop
-    session.go        In-memory conversation buffer
+    session.go        In-memory conversation buffer + compaction
     runtime.go        AgentRuntime, Run(), stream accumulation, tool dispatch
   tool/               Tool system
     schema.go         JSON Schema builder helpers
@@ -92,7 +92,16 @@ internal/
     list_files.go     list_files tool
     shell_exec.go     shell_exec tool
     web_fetch.go      web_fetch tool
+    memory_save.go    memory_save tool
+    memory_search.go  memory_search tool
     builtin.go        RegisterBuiltins() convenience
+  memory/             Persistent memory system
+    sqlite.go         SQLite database wrapper + schema migrations
+    store.go          Memory store (CRUD + hybrid retrieval)
+    retrieval.go      Vector search, FTS search, RRF fusion
+    session_store.go  Session lifecycle management
+    embedding.go      Embedding backend factory
+    embedding_openai.go  OpenAI embedding integration
 ```
 
 ## Configuration
@@ -117,13 +126,15 @@ All providers implement the same `Provider` interface. Swap between them with on
 
 ## Tools
 
-| Tool         | Safety tier    | Timeout | What it does                                |
-|--------------|---------------|---------|---------------------------------------------|
-| `read_file`  | ReadOnly      | 10s     | Read file with line numbers, offset, limit  |
-| `write_file` | SideEffecting | 10s     | Write/append to file, auto-creates dirs     |
-| `list_files` | ReadOnly      | 10s     | List directory, optional recursive          |
-| `shell_exec` | Privileged    | 60s     | Run shell command, capture stdout/stderr    |
-| `web_fetch`  | SideEffecting | 30s     | HTTP GET/POST, return status + body         |
+| Tool            | Safety tier    | Timeout | What it does                                |
+|-----------------|---------------|---------|---------------------------------------------|
+| `read_file`     | ReadOnly      | 10s     | Read file with line numbers, offset, limit  |
+| `write_file`    | SideEffecting | 10s     | Write/append to file, auto-creates dirs     |
+| `list_files`    | ReadOnly      | 10s     | List directory, optional recursive          |
+| `shell_exec`    | Privileged    | 60s     | Run shell command, capture stdout/stderr    |
+| `web_fetch`     | SideEffecting | 30s     | HTTP GET/POST, return status + body         |
+| `memory_save`   | SideEffecting | 15s     | Save knowledge to persistent memory         |
+| `memory_search` | ReadOnly      | 15s     | Search memory with hybrid retrieval         |
 
 ### Security
 
@@ -151,6 +162,25 @@ The runtime is the core agent loop that ties providers and tools together:
 
 The turn timeout covers both provider streaming and tool execution as a single budget. Ctrl-C (SIGINT/SIGTERM) propagates cleanly into the runtime via context cancellation.
 
+When the conversation approaches the context limit, the runtime triggers rolling summarization — it compacts older messages into a summary and keeps recent turns, so the agent can maintain context across long sessions.
+
+## Memory
+
+Persistent memory is backed by SQLite through `modernc.org/sqlite`, a pure-Go SQLite implementation that requires no CGO.
+
+**Hybrid retrieval** combines two search strategies:
+- **Vector search** — cosine similarity over OpenAI embeddings (semantic meaning)
+- **Full-text search** — SQLite FTS5 with BM25 ranking (exact keywords)
+
+Results from both strategies are merged using **Reciprocal Rank Fusion** with configurable weights (default 0.7 vector / 0.3 FTS).
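
The vector side of the search described above ranks chunks by cosine similarity. A minimal sketch of that measure follows; the function name `cosine` and its handling of mismatched or zero-length vectors are assumptions for illustration, not the project's actual code.

```go
package main

import (
	"fmt"
	"math"
)

// cosine returns the cosine similarity of a and b.
// Assumption: mismatched lengths, empty input, or a zero-norm vector yield 0.
func cosine(a, b []float32) float64 {
	if len(a) != len(b) || len(a) == 0 {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		x, y := float64(a[i]), float64(b[i])
		dot += x * y
		na += x * x
		nb += y * y
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	query := []float32{1, 0}
	fmt.Println(cosine(query, []float32{0, 1}))  // orthogonal vectors
	fmt.Println(cosine(query, []float32{-1, 0})) // opposite direction
}
```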
+
+**Graceful degradation** — if no `OPENAI_API_KEY` is set, the system falls back to FTS-only search. Memory is optional; the agent works without it.
+
+**What gets persisted:**
+- Memory chunks (knowledge stored by the agent or user)
+- Conversation events (every message in the turn loop)
+- Rolling summaries (context compaction across sessions)
+- Session scratchpad (key-value state per session)
+
 ## Tests
 
 ```bash
diff --git a/docs/architecture.md b/docs/architecture.md
@@ -39,7 +39,7 @@ Everything in Yantra exists to make this loop work well:
 - **Security** prevents the LLM from doing damage
 - **Config** makes it all customizable
 - **Runtime** runs the think → act → observe loop
-- **Memory** (planned) lets the agent remember across sessions
+- **Memory** lets the agent remember across sessions
 - **Gateway** (planned) lets you control it remotely
 
 ## Layer 1: Types (`internal/types/`)
@@ -341,7 +341,9 @@ Schema(
 
 This is a small quality-of-life builder. It outputs valid JSON Schema as `json.RawMessage`, which slots directly into `FunctionDecl.Parameters`.
 
-### The five built-in tools
+### The built-in tools
+
+Yantra ships with 7 built-in tools (5 core + 2 memory):
 
 **read_file** (ReadOnly, 10s timeout)
 - Reads a file with 6-digit line numbers: `     1\tpackage main`
@@ -459,9 +461,17 @@ The runtime classifies errors:
 - Max turns reached → `ErrMaxTurns`
 - Tool execution errors → placed in message content (the LLM sees them and can recover)
 
-### Context budget
+### Context budget and summarization
+
+After each tool dispatch, the runtime estimates token usage (`totalChars / 4`) and checks if the session is approaching the context limit (`TriggerRatio * MaxContextTokens`). When triggered:
+
+1. A `MinTurns` guard (default 6) prevents summarizing too-short conversations
+2. The runtime builds a summarization prompt including the existing summary (if any) and the messages to compact
+3. The LLM generates a rolling summary via a dedicated system prompt
+4. The summary is stored in the `session_summaries` table with an incrementing epoch
+5. `session.CompactWithSummary()` replaces older messages with a `[Conversation Summary]` pseudo-message, keeping the most recent turns
 
-After each tool dispatch, the runtime estimates token usage (chars/4) and logs a warning if the session is approaching the context limit (`TriggerRatio * MaxContextTokens`). Actual summarization is deferred to Step 5 (Memory).
+On session startup, if a prior summary exists, it is injected at the start of the message history so the agent has context from previous runs.
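
The trigger check and compaction step can be sketched as follows. This is an illustration under stated assumptions, not the runtime's code: the `Message` type and the functions `estimateTokens`, `shouldSummarize`, and `compact` are hypothetical, one message is counted as one turn, and the summary is given a `system` role. Only the `chars / 4` estimate, the `TriggerRatio * MaxContextTokens` threshold, the `MinTurns` guard, and the `[Conversation Summary]` marker come from the description above.

```go
package main

import "fmt"

// Message is a hypothetical stand-in for the session's message type.
type Message struct {
	Role    string
	Content string
}

// estimateTokens applies the chars/4 heuristic (byte length approximates chars).
func estimateTokens(msgs []Message) int {
	total := 0
	for _, m := range msgs {
		total += len(m.Content)
	}
	return total / 4
}

// shouldSummarize reports whether the estimate has reached
// triggerRatio * maxContextTokens, subject to the minTurns guard.
// Assumption: one message counts as one turn.
func shouldSummarize(msgs []Message, triggerRatio float64, maxContextTokens, minTurns int) bool {
	if len(msgs) < minTurns {
		return false
	}
	threshold := triggerRatio * float64(maxContextTokens)
	return float64(estimateTokens(msgs)) >= threshold
}

// compact replaces all but the last keepRecent messages with one summary pseudo-message.
func compact(msgs []Message, summary string, keepRecent int) []Message {
	if keepRecent < 0 {
		keepRecent = 0
	}
	if keepRecent > len(msgs) {
		keepRecent = len(msgs)
	}
	out := make([]Message, 0, keepRecent+1)
	// Assumption: the pseudo-message uses the "system" role.
	out = append(out, Message{Role: "system", Content: "[Conversation Summary]\n" + summary})
	return append(out, msgs[len(msgs)-keepRecent:]...)
}

func main() {
	msgs := make([]Message, 8)
	for i := range msgs {
		msgs[i] = Message{Role: "user", Content: "aaaaaaaa"}
	}
	if shouldSummarize(msgs, 0.5, 32, 6) {
		msgs = compact(msgs, "earlier discussion, condensed", 2)
	}
	fmt.Println(len(msgs), "messages after compaction")
}
```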
 
 ## How the pieces connect
 
@@ -482,10 +492,109 @@ yantra run "add error handling to server.go"
        └── loop until text-only response or MaxTurns
 ```
 
+## Layer 5: Memory (`internal/memory/`)
+
+Memory gives the agent persistence — it can store knowledge, recall it later, and maintain context across sessions.
+
+### Storage: SQLite (no CGO)
+
+The memory system uses `modernc.org/sqlite`, a pure-Go SQLite implementation. No CGO means the binary cross-compiles trivially. The database opens with **WAL mode** and a 5-second busy timeout for concurrent access safety.
+
+Schema (6 tables):
+
+```
+chunks              — memory fragments with optional embedding BLOBs
+chunks_fts          — FTS5 virtual table (porter stemmer + unicode61)
+sessions            — session lifecycle tracking
+conversation_events — per-session conversation log
+session_summaries   — rolling summary per session (with epoch counter)
+scratchpads         — key-value state per session
+```
+
+### Hybrid retrieval
+
+Memory search combines two strategies and merges them with Reciprocal Rank Fusion (RRF):
+
+```
+Query: "how does authentication work?"
+     │
+     ├── Vector Search (weight: 0.7)
+     │   Compute embedding → cosine similarity against all chunks
+     │   Returns: semantically similar results
+     │
+     ├── FTS Search (weight: 0.3)
+     │   SQLite FTS5 with BM25 ranking
+     │   Returns: keyword-matching results
+     │
+     └── Reciprocal Rank Fusion (k=60)
+         Merge + deduplicate by chunk ID
+         Score: sum over sources of weight / (60 + rank)
+         Return top K results
+```
+
+The system fetches `topK * 3` candidates from each source before fusion, so a chunk ranked outside the top K by one source can still reach the final list. Weights are configurable: a higher `VectorWeight` favors semantic matches, and a higher `FTSWeight` favors exact keyword matches.
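
A minimal sketch of weighted Reciprocal Rank Fusion as described above. The names `Fused` and `fuseRRF` are hypothetical, ranks are assumed to start at 1, and each input list is assumed to contain a given chunk ID at most once; the real `retrieval.go` may differ on all three points.

```go
package main

import (
	"fmt"
	"sort"
)

// Fused is a hypothetical result type: a chunk ID with its fused score.
type Fused struct {
	ID    string
	Score float64
}

// fuseRRF merges two ranked ID lists (best first). Each list contributes
// weight / (k + rank) for every ID it contains; contributions are summed,
// which also deduplicates IDs present in both lists.
func fuseRRF(vector, fts []string, vectorWeight, ftsWeight float64, k, topK int) []Fused {
	scores := make(map[string]float64)
	add := func(ids []string, w float64) {
		for i, id := range ids {
			scores[id] += w / float64(k+i+1) // rank is 1-based (assumption)
		}
	}
	add(vector, vectorWeight)
	add(fts, ftsWeight)

	out := make([]Fused, 0, len(scores))
	for id, s := range scores {
		out = append(out, Fused{ID: id, Score: s})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Score != out[j].Score {
			return out[i].Score > out[j].Score
		}
		return out[i].ID < out[j].ID // deterministic tie-break
	})
	if len(out) > topK {
		out = out[:topK]
	}
	return out
}

func main() {
	vector := []string{"a", "b", "c"}
	fts := []string{"b", "d"}
	for _, r := range fuseRRF(vector, fts, 0.7, 0.3, 60, 3) {
		fmt.Printf("%s %.5f\n", r.ID, r.Score)
	}
}
```

In this example chunk `b` ranks first because it appears in both lists, even though neither source ranked it above `a` on vector similarity alone.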
+
+**Graceful degradation:**
+- No `OPENAI_API_KEY` → FTS-only search (no embeddings)
+- FTS query fails (malformed syntax) → vector-only search
+- No memory DB → agent runs without memory, logs a warning
+
+### Embeddings
+
+Embeddings are computed via the OpenAI API and stored as compact little-endian binary BLOBs (4 bytes per float32 dimension), saving ~75% compared to JSON-encoded arrays.
+
+Supported models:
+
+| Model | Dimensions |
+|-------|-----------|
+| `text-embedding-3-small` (default) | 1536 |
+| `text-embedding-3-large` | 3072 |
+| `text-embedding-ada-002` | 1536 |
+
+The factory returns `nil` (not an error) when the API key is missing, so the system can always boot.
+
+### Conversation persistence
+
+Every message in the turn loop (user, assistant, tool results) is persisted via `StoreConversationEvent()`. Persistence uses the turn context and its deadline, so it respects the turn timeout. A failed write is logged as a warning but does not halt execution (fire-and-forget).
+
+### Memory tools
+
+Two tools expose memory to the agent:
+
+- **`memory_save`** (SideEffecting, 15s) — stores knowledge with optional tags
+- **`memory_search`** (ReadOnly, 15s) — hybrid search with ranked results
+
+These are registered only when a `MemoryRetrieval` instance is available.
+
+### Session store
+
+`SQLiteSessionStore` manages session lifecycle:
+
+| Operation | Details |
+|-----------|---------|
+| Create | Generates `ses_<32 hex chars>` ID |
+| Get | Single session by ID |
+| List | All sessions, ordered by `updated_at DESC`, optionally including archived |
+| Update | Name, message count, timestamps |
+| Archive | Soft-delete (sets `archived = 1`) |
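
One way to produce an ID of the `ses_<32 hex chars>` shape listed above is to hex-encode 16 random bytes. The sketch below assumes `crypto/rand` as the source and uses a hypothetical function name; the docs do not say how `SQLiteSessionStore` actually generates IDs.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// newSessionID returns "ses_" followed by 32 hex characters (16 random bytes).
// Assumption: randomness comes from crypto/rand.
func newSessionID() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	return "ses_" + hex.EncodeToString(b[:]), nil
}

func main() {
	id, err := newSessionID()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(id, len(id))
}
```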
+
+### Key patterns
+
+**Interface-driven design** — the runtime and tools depend on `types.MemoryRetrieval` and `types.SessionStore` interfaces, not concrete types. Compile-time checks enforce this:
+```go
+var _ types.MemoryRetrieval = (*Store)(nil)
+var _ types.SessionStore = (*SQLiteSessionStore)(nil)
+```
+
+**Transaction safety** — multi-table operations (Store, Forget, StoreConversationEvent) use explicit transactions with `defer tx.Rollback()`.
+
+**Binary embedding storage** — float32 slices are serialized as little-endian BLOBs via custom `encodeFloat32s`/`decodeFloat32s` helpers.
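
The helper names `encodeFloat32s` and `decodeFloat32s` come from the text above, but their signatures are not shown, so the following is only a plausible sketch of little-endian float32 packing using the standard library.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"math"
)

// encodeFloat32s packs each float32 as 4 little-endian bytes.
func encodeFloat32s(v []float32) []byte {
	buf := make([]byte, 4*len(v))
	for i, f := range v {
		binary.LittleEndian.PutUint32(buf[4*i:], math.Float32bits(f))
	}
	return buf
}

// decodeFloat32s reverses encodeFloat32s. It rejects BLOBs whose
// length is not a multiple of 4 (assumed error behavior).
func decodeFloat32s(b []byte) ([]float32, error) {
	if len(b)%4 != 0 {
		return nil, errors.New("embedding blob length is not a multiple of 4")
	}
	v := make([]float32, len(b)/4)
	for i := range v {
		v[i] = math.Float32frombits(binary.LittleEndian.Uint32(b[4*i:]))
	}
	return v, nil
}

func main() {
	blob := encodeFloat32s(make([]float32, 1536))
	fmt.Println(len(blob)) // 1536 dimensions * 4 bytes = 6144
	back, err := decodeFloat32s(blob)
	fmt.Println(len(back), err)
}
```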
+
 ## What's next
 
 | Step | What | Purpose |
 |------|------|---------|
-| 5 | Memory | Persistent vector DB for cross-session recall + rolling summarization |
 | 6 | Gateway | WebSocket server for remote control |
-| 7 | Multi-agent | Specialist subagents with delegation |
+| 7 | MCP | Model Context Protocol client for external tools |
+| 8 | TUI | Terminal UI with Bubble Tea |
+| 9 | Polish | Config scaffolding, cross-platform build, docs |
diff --git a/docs/config.md b/docs/config.md
@@ -131,36 +131,50 @@ When the conversation history approaches the context limit (trigger_ratio × max
 
 ### runtime.summarization
 
-Controls rolling summarization behavior.
+Controls rolling summarization behavior. When the context budget is exceeded, the runtime calls the LLM to generate a summary of older messages, stores it in the database, and compacts the session.
 
 ```toml
 [runtime.summarization]
 target_ratio = 0.5   # Aim to reduce context to 50% after summarization
 min_turns = 6        # Don't summarize conversations shorter than 6 turns
 ```
 
+**How it works:** When triggered, the runtime builds a summarization prompt with the existing summary (if any) and messages to compact. The LLM generates a rolling summary, which is stored in `session_summaries` with an incrementing epoch. Older messages are replaced with a `[Conversation Summary]` pseudo-message. On session restart, the prior summary is injected as the opening context.
+
 ### memory
 
-Persistent memory backed by a vector database.
+Persistent memory is backed by SQLite through `modernc.org/sqlite`, a pure-Go SQLite implementation that requires no CGO.
 
 ```toml
 [memory]
-enabled = true
-db_path = ".yantra/memory.db"          # SQLite/libSQL database path
-embedding_backend = "openai"           # "openai" or "ollama"
+enabled = true                         # Set false to disable memory entirely
+db_path = ".yantra/memory.db"          # SQLite database path (auto-created)
+embedding_backend = "openai"           # "openai" or "" (empty defaults to "openai")
 
 [memory.embedding]
 model = "text-embedding-3-small"       # OpenAI embedding model
-# ollama_url = "http://localhost:11434" # For ollama backend
-# ollama_model = "nomic-embed-text"    # For ollama backend
+# Supported models:
+#   text-embedding-3-small  (1536 dimensions, default)
+#   text-embedding-3-large  (3072 dimensions)
+#   text-embedding-ada-002  (1536 dimensions)
+# ollama_url = "http://localhost:11434" # For ollama backend (reserved, not yet implemented)
+# ollama_model = "nomic-embed-text"    # For ollama backend (reserved, not yet implemented)
 
 [memory.retrieval]
-top_k = 8              # Number of results to retrieve
+top_k = 8              # Number of results to retrieve per search
 vector_weight = 0.7    # Weight for vector similarity (0-1)
 fts_weight = 0.3       # Weight for full-text search (0-1)
 ```
 
-Memory uses hybrid retrieval — combining vector similarity search (semantic meaning) with full-text search (exact keyword matching). The weights control the balance. Higher vector_weight favors semantic matches; higher fts_weight favors exact matches.
+Memory uses **hybrid retrieval** — combining vector similarity search (semantic meaning) with full-text search (exact keyword matching via SQLite FTS5). Results from both sources are merged using **Reciprocal Rank Fusion** (RRF) with configurable weights. Higher `vector_weight` favors semantic matches; higher `fts_weight` favors exact keyword matches.
+
+**Graceful degradation:** If `OPENAI_API_KEY` is not set, the system falls back to FTS-only search (no vector embeddings). If the database fails to open, the agent continues without memory. Memory tools (`memory_save`, `memory_search`) are only registered when memory is available.
+
+**Storage details:**
+- Embeddings are stored as compact little-endian binary BLOBs (4 bytes per dimension)
+- Conversation events are persisted per-session for history recall
+- Rolling summaries are maintained with an epoch counter for context compaction
+- Session scratchpads provide per-session key-value storage
 
 ### tools
 
diff --git a/docs/tools.md b/docs/tools.md
@@ -375,6 +375,48 @@ status: 200
 - No cookie handling
 - No redirect following configuration (uses Go's default: follows up to 10 redirects)
 
+### memory_save
+
+**Purpose:** Persist knowledge for future recall across sessions.
+
+**Parameters:**
+| Name    | Type           | Required | Description |
+|---------|----------------|----------|-------------|
+| content | string         | yes      | The knowledge to store |
+| tags    | array (string) | no       | Optional tags for categorization |
+
+**Behavior:**
+- Calls `mem.Store()` with source `"user_saved"`
+- If an embedding backend is configured, computes and stores the embedding alongside the content
+- Returns: `"Saved to memory (id: <hex>)"`
+
+**Safety tier:** SideEffecting (15s timeout)
+
+### memory_search
+
+**Purpose:** Search persistent memory using hybrid retrieval (vector + full-text).
+
+**Parameters:**
+| Name  | Type    | Required | Default | Description |
+|-------|---------|----------|---------|-------------|
+| query | string  | yes      | —       | Search query (used for both semantic and keyword matching) |
+| top_k | integer | no       | 5       | Maximum number of results to return |
+
+**Output format:**
+```
+1. [score: 0.85] Content of the memory chunk
+   Tags: tag1, tag2
+
+2. [score: 0.72] Another memory chunk
+   Tags: general
+```
+
+Returns `"No matching memories found."` when no results match.
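
The output format above could be rendered along these lines. The `Result` type and `formatResults` function are hypothetical, and omitting the `Tags:` line for untagged chunks is an assumption; only the line layout and the empty-result message are taken from the documented output.

```go
package main

import (
	"fmt"
	"strings"
)

// Result is a hypothetical search hit.
type Result struct {
	Content string
	Tags    []string
	Score   float64
}

// formatResults renders hits as a numbered list with two-decimal scores,
// separating entries with a blank line.
func formatResults(results []Result) string {
	if len(results) == 0 {
		return "No matching memories found."
	}
	var sb strings.Builder
	for i, r := range results {
		if i > 0 {
			sb.WriteString("\n")
		}
		fmt.Fprintf(&sb, "%d. [score: %.2f] %s\n", i+1, r.Score, r.Content)
		if len(r.Tags) > 0 { // assumption: untagged chunks print no Tags line
			fmt.Fprintf(&sb, "   Tags: %s\n", strings.Join(r.Tags, ", "))
		}
	}
	return sb.String()
}

func main() {
	fmt.Print(formatResults([]Result{
		{Content: "Content of the memory chunk", Tags: []string{"tag1", "tag2"}, Score: 0.85},
		{Content: "Another memory chunk", Tags: []string{"general"}, Score: 0.72},
	}))
}
```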
+
+**Safety tier:** ReadOnly (15s timeout)
+
+**Note:** Both memory tools are conditionally registered — they only appear in the tool list when a `MemoryRetrieval` instance is provided to `RegisterBuiltins`. If memory is disabled or the database fails to open, the agent simply doesn't have these tools.
+
 ## Writing a custom tool
 
 To add a new tool, implement the `Tool` interface: