Commit 8491e0c

New more intelligent directory/file/buffer sync
1 parent e0b205c commit 8491e0c

4 files changed

Lines changed: 336 additions & 47 deletions

API.md

Lines changed: 61 additions & 22 deletions
@@ -5,6 +5,7 @@ A SQLite extension that provides semantic memory capabilities with hybrid search
55
## Table of Contents
66

77
- [Overview](#overview)
8+
- [Sync Behavior](#sync-behavior)
89
- [Loading the Extension](#loading-the-extension)
910
- [SQL Functions](#sql-functions)
1011
- [General Functions](#general-functions)
@@ -29,6 +30,31 @@ sqlite-memory enables semantic search over text content stored in SQLite. It:
2930

3031
---
3132

33+
## Sync Behavior
34+
35+
All `memory_sync_*` functions use **content-hash change detection** to avoid redundant embedding computation. Each piece of content is hashed before processing — if the hash already exists in the database, the content is skipped.
36+
37+
### Change Detection
38+
39+
| Scenario | Behavior |
40+
|----------|----------|
41+
| New content | Chunked, embedded, and indexed |
42+
| Unchanged content | Skipped (hash match) |
43+
| Modified file | Old entry atomically deleted, new content reindexed |
44+
| Deleted file | Entry removed during directory sync |
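
The skip-on-hash-match rule in the table can be illustrated with a minimal Python sketch. This is not the extension's implementation: the `demo_content` table, the `sync_text` helper, and the choice of SHA-256 are assumptions made for illustration only, since the diff does not state which hash function or schema the extension uses.

```python
import hashlib
import sqlite3


def content_hash(text: str) -> str:
    # SHA-256 is an assumption; the extension's actual hash is not documented here.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def sync_text(conn: sqlite3.Connection, text: str) -> bool:
    """Return True if the text was indexed, False if it was skipped."""
    digest = content_hash(text)
    row = conn.execute(
        "SELECT 1 FROM demo_content WHERE hash = ?", (digest,)
    ).fetchone()
    if row is not None:
        return False  # hash match: unchanged content, nothing to do
    # A real sync would chunk and embed here; the sketch only records the hash.
    conn.execute(
        "INSERT INTO demo_content (hash, body) VALUES (?, ?)", (digest, text)
    )
    return True


conn = sqlite3.connect(":memory:")
# Hypothetical table standing in for the extension's content store.
conn.execute("CREATE TABLE demo_content (hash TEXT PRIMARY KEY, body TEXT)")

first = sync_text(conn, "SQLite is a C-language library.")
second = sync_text(conn, "SQLite is a C-language library.")  # same content
third = sync_text(conn, "SQLite is a C-language library!")   # changed content
stored = conn.execute("SELECT COUNT(*) FROM demo_content").fetchone()[0]
```

The second call is a no-op because the digest already exists, while a one-character change produces a new digest and a new row.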
45+
46+
### Transactional Safety
47+
48+
Every sync operation is wrapped in a SQLite **SAVEPOINT** transaction. If any step fails (embedding error, disk issue, constraint violation), the entire operation rolls back. This guarantees:
49+
50+
- **No partially-indexed files** — content is either fully indexed or not at all
51+
- **No orphaned chunks** — embeddings and FTS entries are always consistent with `dbmem_content`
52+
- **Safe to retry** — a failed sync leaves the database in its previous valid state
53+
54+
This makes all sync functions idempotent and safe to call repeatedly (e.g., on a schedule or at application startup).
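
The all-or-nothing behavior described above can be sketched with plain SQLite savepoints from Python. The `demo_chunks` table, the `sync_with_savepoint` helper, and the `flaky_embed` function are hypothetical stand-ins; only `SAVEPOINT`, `ROLLBACK TO`, and `RELEASE` are standard SQLite statements.

```python
import sqlite3


def flaky_embed(chunk: str) -> bytes:
    # Hypothetical embedder that fails partway through a file.
    if chunk == "new 2":
        raise RuntimeError("embedding backend unavailable")
    return b"\x00"


def sync_with_savepoint(conn, path, chunks, embed) -> bool:
    """Replace all chunks for `path`, or leave the table untouched on failure."""
    conn.execute("SAVEPOINT sync_op")
    try:
        conn.execute("DELETE FROM demo_chunks WHERE path = ?", (path,))
        for chunk in chunks:
            conn.execute(
                "INSERT INTO demo_chunks (path, chunk, vec) VALUES (?, ?, ?)",
                (path, chunk, embed(chunk)),
            )
    except Exception:
        # Undo everything since the savepoint, then pop it off the stack.
        conn.execute("ROLLBACK TO sync_op")
        conn.execute("RELEASE sync_op")
        return False
    conn.execute("RELEASE sync_op")
    return True


# isolation_level=None lets the sketch control transactions explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE demo_chunks (path TEXT, chunk TEXT, vec BLOB)")

ok = sync_with_savepoint(conn, "/docs/a.md", ["old text"], lambda c: b"\x00")
failed = sync_with_savepoint(conn, "/docs/a.md", ["new 1", "new 2"], flaky_embed)
rows = conn.execute("SELECT chunk FROM demo_chunks").fetchall()
```

In the failed call the old row has already been deleted and one new row inserted when the error occurs, yet the rollback restores the previous state, which is what makes a retry safe.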
55+
56+
---
57+
3258
## Loading the Extension
3359

3460
### Dynamic Loading (Recommended)
@@ -174,9 +200,9 @@ SELECT memory_get_option('provider');
174200

175201
### Memory Management Functions
176202

177-
#### `memory_add_text(content TEXT [, context TEXT])`
203+
#### `memory_sync_text(content TEXT [, context TEXT])`
178204

179-
Adds text content to memory.
205+
Syncs text content to memory. Duplicate content (same hash) is skipped automatically.
180206

181207
**Parameters:**
182208
| Parameter | Type | Required | Description |
@@ -189,23 +215,24 @@ Adds text content to memory.
189215
**Notes:**
190216
- Content is chunked based on `max_tokens` and `overlay_tokens` settings
191217
- Each chunk is embedded and stored in `dbmem_vault`
192-
- Content hash prevents duplicate storage
218+
- Content hash prevents duplicate storage — calling with the same content is a no-op
219+
- Runs inside a SAVEPOINT transaction (see [Sync Behavior](#sync-behavior))
193220
- Sets `created_at` timestamp automatically
194221

195222
**Example:**
196223
```sql
197224
-- Add text without context
198-
SELECT memory_add_text('SQLite is a C-language library that implements a small, fast, self-contained SQL database engine.');
225+
SELECT memory_sync_text('SQLite is a C-language library that implements a small, fast, self-contained SQL database engine.');
199226

200227
-- Add text with context
201-
SELECT memory_add_text('Important meeting notes from 2024-01-15...', 'meetings');
228+
SELECT memory_sync_text('Important meeting notes from 2024-01-15...', 'meetings');
202229
```
203230

204231
---
205232

206-
#### `memory_add_file(path TEXT [, context TEXT])`
233+
#### `memory_sync_file(path TEXT [, context TEXT])`
207234

208-
Adds a file to memory.
235+
Syncs a file to memory. Unchanged files are skipped; modified files are atomically replaced.
209236

210237
**Parameters:**
211238
| Parameter | Type | Required | Description |
@@ -218,39 +245,51 @@ Adds a file to memory.
218245
**Notes:**
219246
- Only processes files matching configured extensions (default: `md,mdx`)
220247
- File path is stored in `dbmem_content.path`
248+
- If the file was previously indexed with different content, the old entry (chunks, embeddings, FTS) is deleted and new content is reindexed — all within a single SAVEPOINT transaction (see [Sync Behavior](#sync-behavior))
221249
- Not available when compiled with `DBMEM_OMIT_IO`
222250

223251
**Example:**
224252
```sql
225-
SELECT memory_add_file('/docs/readme.md');
226-
SELECT memory_add_file('/docs/api.md', 'documentation');
253+
SELECT memory_sync_file('/docs/readme.md');
254+
SELECT memory_sync_file('/docs/api.md', 'documentation');
227255
```
228256

229257
---
230258

231-
#### `memory_add_directory(path TEXT [, context TEXT])`
259+
#### `memory_sync_directory(path TEXT [, context TEXT])`
232260

233-
Recursively adds all matching files from a directory.
261+
Synchronizes a directory with memory. Adds new files, reindexes modified files, and removes entries for deleted files.
234262

235263
**Parameters:**
236264
| Parameter | Type | Required | Description |
237265
|-----------|------|----------|-------------|
238266
| `path` | TEXT | Yes | Full path to the directory |
239267
| `context` | TEXT | No | Optional context label applied to all files |
240268

241-
**Returns:** INTEGER - Number of files processed
269+
**Returns:** INTEGER - Number of new files processed
242270

243271
**Notes:**
244272
- Recursively scans subdirectories
245273
- Only processes files matching configured extensions
274+
- **Phase 1 — Cleanup**: Removes entries for files that no longer exist on disk
275+
- **Phase 2 — Scan**: Processes all matching files:
276+
- **New files** are chunked, embedded, and added to the index
277+
- **Unchanged files** are skipped (content hash match)
278+
- **Modified files** have their old entries atomically replaced with new content
279+
- Each file is processed inside its own SAVEPOINT transaction (see [Sync Behavior](#sync-behavior))
280+
- Safe to call repeatedly — only changed content triggers embedding computation
246281
- Not available when compiled with `DBMEM_OMIT_IO`
247282

248283
**Example:**
249284
```sql
250-
SELECT memory_add_directory('/path/to/docs');
251-
-- Returns: 42 (number of files added)
285+
SELECT memory_sync_directory('/path/to/docs');
286+
-- Returns: 42 (number of new files processed)
287+
288+
SELECT memory_sync_directory('/project/notes', 'project-notes');
252289

253-
SELECT memory_add_directory('/project/notes', 'project-notes');
290+
-- Safe to call again — unchanged files are skipped
291+
SELECT memory_sync_directory('/path/to/docs');
292+
-- Returns: 0 (nothing changed)
254293
```
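
The two phases listed in the notes can be approximated in a short Python sketch. It is an illustration under stated assumptions, not the extension's code: the `demo_files` table, the `sync_directory` helper, SHA-256 hashing, and the returned counter dictionary are all hypothetical, and the per-file SAVEPOINT wrapping is omitted for brevity.

```python
import hashlib
import os
import sqlite3
import tempfile


def file_hash(path: str) -> str:
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()  # hash choice is an assumption


def sync_directory(conn, root, exts=(".md", ".mdx")):
    counts = {"added": 0, "replaced": 0, "skipped": 0, "removed": 0}
    # Phase 1 (cleanup): drop entries under `root` whose file no longer exists.
    for (path,) in conn.execute("SELECT path FROM demo_files").fetchall():
        if path.startswith(root + os.sep) and not os.path.exists(path):
            conn.execute("DELETE FROM demo_files WHERE path = ?", (path,))
            counts["removed"] += 1
    # Phase 2 (scan): walk the tree and compare each file's hash with the stored one.
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            if not name.endswith(exts):
                continue  # extension filter, mirroring the default `md,mdx`
            path = os.path.join(dirpath, name)
            digest = file_hash(path)
            row = conn.execute(
                "SELECT hash FROM demo_files WHERE path = ?", (path,)
            ).fetchone()
            if row is None:
                conn.execute("INSERT INTO demo_files VALUES (?, ?)", (path, digest))
                counts["added"] += 1
            elif row[0] != digest:
                conn.execute(
                    "UPDATE demo_files SET hash = ? WHERE path = ?", (digest, path)
                )
                counts["replaced"] += 1
            else:
                counts["skipped"] += 1
    return counts


def write(path: str, text: str) -> None:
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(text)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_files (path TEXT PRIMARY KEY, hash TEXT)")

with tempfile.TemporaryDirectory() as root:
    write(os.path.join(root, "a.md"), "alpha")
    write(os.path.join(root, "b.md"), "beta")
    write(os.path.join(root, "c.txt"), "ignored")  # extension not matched
    first = sync_directory(conn, root)
    second = sync_directory(conn, root)            # nothing changed
    write(os.path.join(root, "a.md"), "alpha v2")  # modify one file
    os.remove(os.path.join(root, "b.md"))          # delete another
    third = sync_directory(conn, root)
```

The sketch returns per-category counts only to make each branch visible; the documented function returns a single INTEGER.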
255294

256295
---
@@ -404,7 +443,7 @@ The extension tracks two timestamps for each memory:
404443

405444
### `created_at`
406445

407-
- Set automatically when content is added via `memory_add_text`, `memory_add_file`, or `memory_add_directory`
446+
- Set automatically when content is added via `memory_sync_text`, `memory_sync_file`, or `memory_sync_directory`
408447
- Stored as Unix timestamp (seconds since 1970-01-01 00:00:00 UTC)
409448
- Never updated after initial creation
410449

@@ -445,8 +484,8 @@ SELECT memory_set_option('max_tokens', 512);
445484
SELECT memory_set_option('min_score', 0.75);
446485

447486
-- Add content
448-
SELECT memory_add_text('SQLite is a C library that provides a lightweight disk-based database.', 'sqlite-docs');
449-
SELECT memory_add_directory('/docs/sqlite', 'sqlite-docs');
487+
SELECT memory_sync_text('SQLite is a C library that provides a lightweight disk-based database.', 'sqlite-docs');
488+
SELECT memory_sync_directory('/docs/sqlite', 'sqlite-docs');
450489

451490
-- Search
452491
SELECT path, snippet, ranking
@@ -474,9 +513,9 @@ SELECT memory_clear();
474513

475514
```sql
476515
-- Add memories with different contexts
477-
SELECT memory_add_text('Meeting notes...', 'meetings');
478-
SELECT memory_add_text('API documentation...', 'api-docs');
479-
SELECT memory_add_text('Tutorial content...', 'tutorials');
516+
SELECT memory_sync_text('Meeting notes...', 'meetings');
517+
SELECT memory_sync_text('API documentation...', 'api-docs');
518+
SELECT memory_sync_text('Tutorial content...', 'tutorials');
480519

481520
-- Search within a context
482521
SELECT * FROM memory_search
@@ -546,6 +585,6 @@ Errors can be caught using standard SQLite error handling mechanisms.
546585

547586
```sql
548587
-- Example error handling in application code
549-
SELECT memory_add_text(123); -- Error: expects TEXT parameter
588+
SELECT memory_sync_text(123); -- Error: expects TEXT parameter
550589
SELECT memory_delete('abc'); -- Error: expects INTEGER parameter
551590
```

README.md

Lines changed: 22 additions & 6 deletions
@@ -33,6 +33,8 @@ sqlite-memory bridges these concepts, allowing any SQLite-powered application to
3333

3434
- **Hybrid Search**: Combines vector similarity (cosine distance) with FTS5 full-text search for superior retrieval
3535
- **Smart Chunking**: Markdown-aware parsing preserves semantic boundaries
36+
- **Intelligent Sync**: Content-hash change detection — unchanged files are skipped, modified files are atomically replaced, deleted files are cleaned up
37+
- **Transactional Safety**: Every sync operation runs inside a SAVEPOINT transaction — either fully succeeds or fully rolls back, no partially-indexed content
3638
- **Efficient Storage**: Binary embeddings with configurable dimensions
3739
- **Flexible Embedding**: Use local models (llama.cpp) or [vectors.space](https://vectors.space) remote API
3840

@@ -80,16 +82,16 @@ SELECT memory_set_model('local', '/path/to/nomic-embed-text-v1.5.Q8_0.gguf');
8082
-- SELECT memory_set_apikey('your-vectorspace-api-key');
8183

8284
-- Add some knowledge
83-
SELECT memory_add_text('SQLite is a C-language library that implements a small, fast,
85+
SELECT memory_sync_text('SQLite is a C-language library that implements a small, fast,
8486
self-contained, high-reliability, full-featured, SQL database engine. SQLite is the
8587
most used database engine in the world.', 'sqlite-docs');
8688

87-
SELECT memory_add_text('Vector databases store data as high-dimensional vectors,
89+
SELECT memory_sync_text('Vector databases store data as high-dimensional vectors,
8890
enabling similarity search. They are essential for semantic search, recommendation
8991
systems, and AI applications.', 'concepts');
9092

91-
-- Add an entire documentation directory
92-
SELECT memory_add_directory('/path/to/docs', 'project-docs');
93+
-- Sync an entire documentation directory
94+
SELECT memory_sync_directory('/path/to/docs', 'project-docs');
9395

9496
-- Search your memory semantically
9597
SELECT path, snippet, ranking
@@ -121,7 +123,7 @@ conn.execute("SELECT memory_set_model('local', './models/nomic-embed-text-v1.5.Q
121123

122124
# Store conversation context
123125
def remember(content, context="conversation"):
124-
conn.execute("SELECT memory_add_text(?, ?)", (content, context))
126+
conn.execute("SELECT memory_sync_text(?, ?)", (content, context))
125127
conn.commit()
126128

127129
# Retrieve relevant memories
@@ -142,6 +144,20 @@ memories = recall("what's the project timeline")
142144
# Returns relevant context about March 15th deadline
143145
```
144146

147+
## Intelligent Sync
148+
149+
All `memory_sync_*` functions use content-hash change detection to avoid redundant work:
150+
151+
- **`memory_sync_text`** — Computes a hash of the content. If the same content was already indexed, it is skipped entirely. No duplicate embeddings are ever created.
152+
- **`memory_sync_file`** — Reads the file and hashes its content. If the file was previously indexed with different content, the old entry (chunks, embeddings, FTS) is atomically replaced. Unchanged files are skipped.
153+
- **`memory_sync_directory`** — Performs a full two-phase sync:
154+
1. **Cleanup**: Removes database entries for files that no longer exist on disk
155+
2. **Scan**: Recursively processes all matching files — adding new ones, replacing modified ones, and skipping unchanged ones
156+
157+
Every sync operation is wrapped in a SQLite SAVEPOINT transaction. If anything fails mid-sync (embedding error, disk issue, etc.), the entire operation rolls back cleanly. There is no risk of partially-indexed files or orphaned entries.
158+
159+
This makes all sync functions safe to call repeatedly — for example, on a cron schedule or at agent startup — with minimal overhead.
160+
145161
## Use Cases
146162

147163
- **AI Assistants**: Maintain conversation history and user preferences
@@ -217,7 +233,7 @@ make test
217233

218234
- **Local Engine**: Built-in llama.cpp for on-device embeddings (requires GGUF model)
219235
- **Remote Engine**: [vectors.space](https://vectors.space) API for cloud embeddings (requires free API key)
220-
- **File I/O**: `memory_add_file` and `memory_add_directory` functions
236+
- **File I/O**: `memory_sync_file` and `memory_sync_directory` functions
221237

222238
You can also combine options manually:
223239
