Technical documentation of every custom algorithm in the Mnemosyne retrieval system. Formulas and parameters are extracted directly from source code.
Source: retrieval.py (_bm25_search, _escape_fts5)
Mnemosyne delegates lexical scoring to SQLite FTS5's built-in BM25 implementation, configured with the Porter stemmer and Unicode61 tokenizer.
Query preprocessing:
- Escape FTS5 special characters: `"` `'` `(` `)` `*` `:` `^`
- Filter stopwords (60 common English words) and tokens shorter than 2 characters.
- Join remaining terms with `OR` for broad recall (FTS5's default AND semantics are too strict for code search).
Score normalization:
FTS5 bm25() returns negative values (lower = better). Raw scores are converted to
positive magnitudes and normalized to [0, 1] relative to the top score:
normalized_score = abs(raw_score) / max(abs(raw_score) for all results)
Result limit: config.retrieval.max_results * 3 (default: 60 candidates).
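The score normalization step can be sketched as follows; `normalize_bm25` is an illustrative helper name, not the actual function in retrieval.py, and the negative inputs stand in for FTS5 `bm25()` output.

```python
def normalize_bm25(raw_scores):
    """Map FTS5 bm25() output (negative, lower = better) to [0, 1]."""
    if not raw_scores:
        return []
    # Flip sign via abs(): the most negative raw score becomes the largest magnitude.
    magnitudes = [abs(s) for s in raw_scores]
    top = max(magnitudes)
    # Normalize relative to the top score, so the best hit is always 1.0.
    return [m / top for m in magnitudes]
```

The best-ranked document always normalizes to 1.0, so downstream RRF fusion sees a consistent scale regardless of corpus statistics.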
Config parameters:
- `retrieval.bm25_weight = 0.4` (RRF fusion weight)
- `retrieval.max_results = 20`
Reference: Robertson, S.E. & Zaragoza, H. (2009). "The Probabilistic Relevance Framework: BM25 and Beyond." Foundations and Trends in Information Retrieval, 3(4), 333-389.
Source: embeddings/tfidf_backend.py (TFIDFBackend)
Pure-Python sparse vector backend with no external dependencies.
- Extract tokens matching `[a-zA-Z_][a-zA-Z0-9_]{2,}` (minimum 3 characters).
- Split camelCase: `getUserById` -> `get, user, by, id` (via `re.sub(r"([a-z])([A-Z])", r"\1_\2", tok)`).
- Split snake_case: handled by underscore splitting after camelCase insertion.
- Filter stopwords (60-word set, synchronized with `retrieval._STOPWORDS`).
- Minimum sub-token length: 2 characters (`_MIN_SUBTOKEN_LEN`).
- Both the original lowercased token and its sub-parts are retained, so exact-match queries still score highly.
The augmented term frequency prevents bias toward long documents while preserving the relative frequency signal:
tf(t, d) = 0.5 + 0.5 * (count(t, d) / max_count(d))
Where max_count(d) is the frequency of the most common term in document d.
The smoothed IDF avoids division by zero and gives non-zero weight to terms that appear in every document:
idf(t) = log((N + 1) / (df(t) + 1)) + 1.0
Where N = total documents, df(t) = number of documents containing term t.
w(t, d) = tf(t, d) * idf(t)
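Combining the three formulas, a per-document weight computation looks like this sketch (`tfidf_weights` and its argument names are illustrative, not the actual TFIDFBackend API):

```python
import math
from collections import Counter

def tfidf_weights(doc_tokens, df, n_docs):
    """w(t, d) = (0.5 + 0.5 * count/max_count) * (log((N+1)/(df+1)) + 1)."""
    counts = Counter(doc_tokens)
    max_count = max(counts.values())
    weights = {}
    for term, count in counts.items():
        tf = 0.5 + 0.5 * (count / max_count)   # augmented TF, always in [0.5, 1.0]
        idf = math.log((n_docs + 1) / (df.get(term, 0) + 1)) + 1.0
        weights[term] = tf * idf
    return weights
```

A term appearing in every document gets `idf = log(1) + 1 = 1.0`, so it still contributes to scores rather than vanishing.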
- Count document frequency (each term counted once per document).
- Filter: terms must appear in at least `min_df` documents.
- Cap at `max_features` terms, keeping the rarest terms (lowest df = highest IDF = most discriminative).
- Version tracking: tokenizer config is hashed (SHA-256 of sorted stopwords + regex + min sub-token length). A stale vocabulary triggers a rebuild warning.
The inverted index maps each vocabulary term to list[(chunk_id, weight)]. At query time,
only terms present in the query vector are consulted (sub-linear search). Scores are
accumulated partial dot products:
score(d) = sum(q_weight[t] * d_weight[t] for t in query_terms intersect doc_terms)
Query vector norm is constant across all candidates so it is omitted from ranking comparison.
Vectors are stored as JSON {term: weight} dicts. Cosine similarity uses standard
dot-product / (norm_a * norm_b) over the common-term intersection.
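The query-time accumulation over the inverted index can be sketched as follows (`sparse_scores` is an illustrative name; the data shapes mirror the `term -> list[(chunk_id, weight)]` mapping described above):

```python
def sparse_scores(query_vec, inverted_index):
    """Accumulate partial dot products over terms present in the query.

    Only postings lists for query terms are touched, so cost scales with
    query vocabulary, not corpus size.
    """
    scores = {}
    for term, q_weight in query_vec.items():
        for chunk_id, d_weight in inverted_index.get(term, []):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + q_weight * d_weight
    return scores
```

Since the query norm is identical for every candidate, dividing by it would not change the ranking order, which is why it is omitted.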
Config parameters:
- `embedding.backend = "tfidf"` (default; disables dense vectors entirely)
- `embedding.tfidf_max_features = 10000`
- `embedding.tfidf_min_df = 1`
- `retrieval.vector_weight = 0.4` (RRF fusion weight)
Reference: Salton, G. & Buckley, C. (1988). "Term-Weighting Approaches in Automatic Text Retrieval." Information Processing & Management, 24(5), 513-523.
Source: ranking.py (rrf_fuse)
Merges multiple ranked signal lists into a single fused ranking without requiring score calibration across sources.
rrf_score(d) = sum( weight_s / (k + rank_s(d)) ) for each source s
Where:
- `k` = 60 (smoothing constant)
- `rank_s(d)` = 1-based rank of document `d` in source `s`
- If `d` is absent from source `s`, penalty rank = `len(source_s) + 1`
- `weight_s` = per-source weighting factor
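The fusion formula can be sketched as a standalone function; `rrf_fuse` here is a simplified stand-in for the version in ranking.py.

```python
def rrf_fuse(sources, weights, k=60):
    """Fuse ranked lists. sources: {name: [doc_id, ...]} ordered best-first."""
    fused = {}
    all_docs = {d for ranking in sources.values() for d in ranking}
    for name, ranking in sources.items():
        weight = weights[name]
        penalty_rank = len(ranking) + 1  # rank assigned to absent documents
        for doc in all_docs:
            rank = ranking.index(doc) + 1 if doc in ranking else penalty_rank
            fused[doc] = fused.get(doc, 0.0) + weight / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)
```

Because each source contributes only reciprocal ranks, no score calibration between BM25, TF-IDF, and usage signals is needed.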
| Source | Weight | Description |
|---|---|---|
| `bm25` | 0.4 | Lexical BM25 via FTS5 |
| `vector` | 0.4 | TF-IDF sparse vector search |
| `usage` | 0.2 | Decay-weighted usage frequency |
| Source | Weight | Condition |
|---|---|---|
| `symbol` | `bm25_weight * 1.5` (= 0.6) | Symbol name matches found |
| `prefetch` | `max(bm25_weight, vector_weight)` (= 0.4) | Pre-fetch pattern hit |
Config parameters:
- `retrieval.bm25_weight = 0.4`
- `retrieval.vector_weight = 0.4`
- `retrieval.usage_weight = 0.2`
Reference: Cormack, G.V., Clarke, C.L.A. & Butt, S. (2009). "Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods." Proceedings of SIGIR 2009, 758-759.
Source: retrieval.py (_symbol_search, _filename_boost, _import_graph_boost, _file_level_filter)
Applied sequentially after RRF fusion to inject structural signals that pure term matching misses.
Chunks whose symbol_name matches a query term receive a 3x RRF score multiplier.
- Query terms are extracted as identifiers matching `[a-zA-Z_][a-zA-Z0-9_]{2,}`.
- camelCase is split into sub-parts for partial matching.
- Exact match: score 1.0. Partial/component overlap: score = `overlap_count / max(symbol_parts, search_terms) * 0.8`.
- Only chunks with a score >= 0.5 qualify for the 3x boost.
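One plausible reading of the scoring rules above, sketched as a standalone helper; the exact-match branch and the function name are assumptions, not the literal `_symbol_search` code.

```python
def symbol_match_score(symbol_parts, search_terms):
    """Score a symbol name (split into parts) against query identifier terms."""
    if set(symbol_parts) == set(search_terms):
        return 1.0  # exact match (assumed semantics)
    # Partial overlap, scaled down so partial hits never reach 1.0.
    overlap = len(set(symbol_parts) & set(search_terms))
    return overlap / max(len(symbol_parts), len(search_terms)) * 0.8
```

Under this reading, a single-part overlap between two-part names scores 0.4, below the 0.5 threshold, so weak partial matches never trigger the 3x boost.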
All chunks from files whose basename matches query terms receive a 1.5x RRF score multiplier.
Matching rules:
- Query terms must be at least 5 characters and not stopwords.
- Filename components are split on `[-_.]` separators after removing the extension.
- Stem-prefix matching: a query term and a filename component must share a common prefix of >= 5 characters.
- The prefix must cover at least 60% of the shorter string's length.
- Only files already present in fused results are boosted (no injection of new files).
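The matching rules above can be sketched as a predicate; `stem_prefix_match` is an illustrative name, and the stopword check is omitted for brevity.

```python
import os
import re

def stem_prefix_match(query_term, filename, min_prefix=5, coverage=0.6):
    """True if query_term shares a >=5-char prefix with a filename component
    that covers >=60% of the shorter string."""
    if len(query_term) < min_prefix:
        return False  # query terms under 5 characters never boost
    stem = os.path.splitext(os.path.basename(filename))[0]
    for comp in re.split(r"[-_.]", stem):
        # Length of the common prefix between term and component.
        n = 0
        for a, b in zip(query_term, comp):
            if a != b:
                break
            n += 1
        shorter = min(len(query_term), len(comp))
        if n >= min_prefix and shorter and n / shorter >= coverage:
            return True
    return False
```

The 5-character floor is what rules out spurious 3-4 character prefix collisions noted in the design rationale section.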
Injects files connected to already-retrieved files via dependency edges. Runs after the file-level filter so keyword-matched results are never displaced.
Three reference types scanned:
- ES6 imports / CommonJS `require` (static module references).
- Runtime namespace access (`Namespace.Module` patterns mapped to indexed basenames).
- Path references (quoted strings matching indexed file paths).
Injection rules:
- Minimum 2 references required (`config.retrieval.import_inject_threshold`).
- Maximum 2 files injected per query (dynamic cap: `min(2, max(0, 10 - current_file_count))`).
- Source files are prioritized over test files.
- High-confidence injections (3+ references): score = `top_score * 0.9`.
- Standard injections (2 references): score = `p75_score * 0.7`.
Retains only chunks from the top-scoring files by aggregate RRF score.
max_files = min(max(3, n_unique_files // 3), 8)
Scales from 3 files (small projects) to 8 files (large corpora). Overridable via
config.retrieval.max_files (positive value = hard cap; 0 = adaptive).
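The adaptive limit is a one-liner; this sketch folds in the config override behavior described above (function and argument names are illustrative).

```python
def adaptive_max_files(n_unique_files, override=0):
    """File-count cap: config override wins; otherwise scale with corpus size."""
    if override > 0:
        return override  # positive config value = hard cap
    # Adaptive: one third of unique files, clamped to [3, 8].
    return min(max(3, n_unique_files // 3), 8)
```

For example, a 9-file project keeps 3 files, a 15-file project keeps 5, and anything beyond 24 files is clamped at 8.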
Config parameters:
- `retrieval.max_files = 0` (adaptive)
- `retrieval.import_inject_threshold = 2`
Source: ranking.py (cost_model_score), retrieval.py (_cost_model_rank)
Re-ranks fused results by information value per token, demoting bulky low-value chunks.
relevance = rrf_score * (1.0 - 0.5 * boilerplate_ratio)
if is_structured_code:
    relevance *= 2.0
density = relevance / (1.0 + log1p(max(1, token_count)))
Where:
- `boilerplate_ratio` = fraction of lines detected as boilerplate + `repetition_score * 0.3`, capped at 1.0
- `is_structured_code` = True when `symbol_name is not None` or `chunk_type in ("function", "class", "imports")`
- `log1p(x)` = `log(1 + x)` (logarithmic dampening so chunk size influences ranking gently)
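The density computation above, assembled into one function (an illustrative sketch, not the literal `cost_model_score` signature):

```python
import math

def cost_model_density(rrf_score, boilerplate_ratio, is_structured_code, token_count):
    """Information value per token: relevance discounted by boilerplate,
    doubled for structured code, dampened by log of chunk size."""
    relevance = rrf_score * (1.0 - 0.5 * boilerplate_ratio)
    if is_structured_code:
        relevance *= 2.0
    return relevance / (1.0 + math.log1p(max(1, token_count)))
```

The logarithmic denominator means a 10x larger chunk is penalized far less than 10x, so large but highly relevant chunks can still win.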
| File type / path pattern | Minimum boilerplate_ratio |
|---|---|
| `.html`, `.css`, `.md`, `.txt` (non-code chunks) | 0.85 |
| `tests/` or `test/` prefix | 0.50 |
Named symbol chunks (functions, classes) receive a 2x relevance multiplier, since they carry concrete code semantics that generic/block chunks lack.
Source: ranking.py (budget_cut)
Greedy token-budget selection. Chunks are consumed in pre-ranked value-density order.
- Iterate candidates in order from `_cost_model_rank` (already sorted by density).
- If `tokens <= remaining_budget`: include the chunk, subtract its tokens.
- If a chunk exceeds the remaining budget and a compressor is available:
  - Named symbol chunks use `strict=True` compression (skip TF-IDF thinning).
  - If the compressed version fits: include it with `compression_ratio = len(compressed) / len(original)`.
  - If the compressed version still exceeds the budget: skip the chunk.
- Stop when the budget is exhausted.
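The greedy loop above can be sketched as follows; the chunk tuple layout and the `compressor` callback signature are illustrative assumptions.

```python
def budget_cut(ranked_chunks, token_budget, compressor=None):
    """Greedy selection in value-density order.

    ranked_chunks: [(text, token_count, is_named_symbol), ...] pre-sorted
    by density. compressor(text, strict) -> (compressed_text, new_tokens).
    """
    selected, remaining = [], token_budget
    for text, tokens, is_symbol in ranked_chunks:
        if tokens <= remaining:
            selected.append(text)
            remaining -= tokens
        elif compressor is not None:
            # Named symbol chunks compress in strict mode (no IDF thinning).
            compressed, new_tokens = compressor(text, strict=is_symbol)
            if new_tokens <= remaining:
                selected.append(compressed)
                remaining -= new_tokens
        if remaining <= 0:
            break
    return selected
```

Note that a skipped oversized chunk does not stop the loop: later, smaller chunks can still fill the remaining budget.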
Config parameter:
- `retrieval.token_budget = 8000`
Source: cache.py (ARCCache)
Self-tuning cache that balances recency and frequency without manual parameter tuning.
| Tier | Contents | Purpose |
|---|---|---|
| T1 | Recently-inserted items (seen once) | Recency signal |
| T2 | Frequently-accessed items (seen 2+ times) | Frequency signal |
| B1 | Ghost keys evicted from T1 (no data) | Recency history |
| B2 | Ghost keys evicted from T2 (no data) | Frequency history |
p = target size of T1 (how much of total capacity is allocated to recency).
B1 ghost hit (recency was underrepresented):
delta = max(1, len(B2) // max(1, len(B1)))
p = min(capacity, p + delta)
B2 ghost hit (frequency was underrepresented):
delta = max(1, len(B1) // max(1, len(B2)))
p = max(0, p - delta)
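The two adaptation rules combine into one update step; `adapt_p` is an illustrative helper, not the actual ARCCache method.

```python
def adapt_p(p, capacity, len_b1, len_b2, hit_in_b1):
    """Shift the T1 target size p after a ghost-list hit."""
    if hit_in_b1:
        # Recency was underrepresented: grow the T1 target.
        delta = max(1, len_b2 // max(1, len_b1))
        return min(capacity, p + delta)
    # Frequency was underrepresented: shrink the T1 target.
    delta = max(1, len_b1 // max(1, len_b2))
    return max(0, p - delta)
```

The asymmetric `delta` means the smaller a ghost list is relative to the other, the larger the correction a hit in it triggers, which is what makes ARC self-tuning.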
When len(T1) + len(T2) >= capacity:
- If `len(T1) > p` (or T2 is empty): evict the LRU of T1, promote its key to B1.
- Else: evict the LRU of T2, promote its key to B2.
- T1 hit: move item to MRU end of T2 (first re-access promotes to frequent).
- T2 hit: move item to MRU end of T2 (refresh LRU clock).
Combined B1 + B2 capped at ghost_capacity. When exceeded, the oldest entry from the
larger ghost list is evicted first.
Config parameters:
- `cache.capacity = 500` (max live entries in T1 + T2)
- `cache.ghost_capacity = 1000` (max ghost entries in B1 + B2)
Reference: Megiddo, N. & Modha, D.S. (2003). "ARC: A Self-Tuning, Low Overhead Replacement Cache." Proceedings of USENIX FAST 2003, 115-130.
Source: bloom.py (BloomFilter, _double_hash, _optimal_size, _optimal_hash_count)
Probabilistic set membership using the Kirsch-Mitzenmacher double hashing construction.
Two independent hash functions simulate k hash functions:
h1(x) = first 8 bytes of MD5(x) as uint64 LE
h2(x) = first 8 bytes of SHA-1(x) as uint64 LE
h_i(x) = (h1(x) + i * h2(x)) mod m for i in 0..k-1
Bit-array size for target false-positive rate:
m = ceil(-n * ln(p) / (ln 2)^2)
Optimal hash function count:
k = round((m / n) * ln 2)
Where n = expected capacity, p = target false-positive rate.
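The sizing formulas and the double-hashing construction fit in a short sketch; function names mirror the source identifiers, but the bodies are illustrative.

```python
import hashlib
import math

def optimal_size(n, p):
    """Bit-array size m for capacity n at false-positive rate p."""
    return math.ceil(-n * math.log(p) / (math.log(2) ** 2))

def optimal_hash_count(m, n):
    """Optimal number of simulated hash functions k."""
    return max(1, round((m / n) * math.log(2)))

def bit_indexes(item, m, k):
    """Kirsch-Mitzenmacher: h1 + i*h2 mod m simulates k independent hashes."""
    data = item.encode("utf-8")
    h1 = int.from_bytes(hashlib.md5(data).digest()[:8], "little")
    h2 = int.from_bytes(hashlib.sha1(data).digest()[:8], "little")
    return [(h1 + i * h2) % m for i in range(k)]
```

At the default parameters this yields m = 1,437,759 bits and k = 10, matching the ~175 KB memory example below.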
Bit array stored as bytearray (1 byte per 8 bits). Binary persistence format:
- 32-byte header: capacity (uint64 LE), fp_rate (float64 LE), size_bits (uint64 LE), hash_count (uint64 LE).
- Followed by raw bit array bytes.
Default parameters:

- `capacity = 100,000`
- `fp_rate = 0.001` (0.1%)
Memory example: 100K capacity at 0.1% FP rate -> m = -100000 * ln(0.001) / ln(2)^2 = ~1,437,759 bits = ~175 KB.
Reference: Kirsch, A. & Mitzenmacher, M. (2006). "Less Hashing, Same Performance: Building a Better Bloom Filter." Proceedings of ESA 2006, LNCS 4168, 456-467.
Source: compress.py (Compressor)
Reduces chunk token count for LLM context injection while preserving structural information.
Identifies lines that must never be removed. Detected via regex patterns:
| Pattern | Regex |
|---|---|
| Function/class signatures | `^\s*(async\s+)?def\s+\w+` or `^\s*class\s+\w+` |
| Return/raise/yield | `^\s*(return\|raise\|yield)\b` |
| Assert | `^\s*assert\b` |
| Control flow | `^\s*(if\|elif\|else\s*:\|for\|while\|try\s*:\|except\|finally\s*:\|with\|switch\|case)\b` |
| Docstrings | Lines inside `"""` or `'''` blocks (when `preserve_docstrings=True`) |
| Annotated comments | `#\s*(TODO\|FIXME\|HACK\|NOTE\|XXX)\b` |
Replaces repetitive low-value patterns with compact inline summaries:
| Pattern | Threshold | Summary format |
|---|---|---|
| Import blocks | > 3 consecutive | `# [N imports: a, b, ...]` |
| `self.x = y` assignments | > 2 consecutive | `# [N assignments: x, y, ...]` |
| Logging/print calls | > 1 consecutive | `# [N log statements]` |
| Consecutive blank lines | > 1 | Collapsed to a single blank |
Removes non-preserved lines with the lowest IDF-weighted scores until the target token ratio is reached.
- Score each removable line: `importance = sum(idf[term] for term in line_tokens)`.
- Sort ascending (lowest importance = removed first).
- Remove lines until `remaining_tokens <= total_tokens * target_ratio`.
- Hard cap: never remove more than `max_prune_ratio` of removable lines.
Accepts optional context_idf override for query-aware pruning where terms rare
in the current query context are weighted more heavily.
Strict mode: When strict=True (used for named symbol chunks), Stage 3 is skipped
entirely. Semantically dense code is not thinned by IDF.
- Collapse consecutive blank lines to at most one.
- Compare each non-preserved line against the previous 20 lines using `SequenceMatcher`.
- Remove a line if `similarity_ratio > 0.85` with any recent line.
- Preserved lines are always kept regardless of similarity.
Config parameters:
- `compression.target_ratio = 0.4` (target token reduction)
- `compression.preserve_signatures = true`
- `compression.preserve_docstrings = true`
- `compression.collapse_imports = true`
- `compression.collapse_boilerplate = true`
- `compression.max_prune_ratio = 0.7` (max 70% of removable lines pruned)
- `compression.strict_for_symbols = true`
Source: analytics.py (Analytics)
Exponential time-decay model that makes recently-used chunks rank higher in retrieval.
contribution(event) = 2 ^ (-age_days / halflife)
score(chunk) = sum(contribution(e) for e in qualifying_events(chunk))
Where:
- `age_days` = `(now_utc - event_timestamp).total_seconds() / 86400`
- `halflife` = `config.analytics.decay_halflife_days` (default: 7 days)
- Qualifying event types: `'selected'` and `'used'` only; `'retrieved'` and `'discarded'` events are stored but do not contribute to scoring.
| Age (days) | Contribution (halflife=7) |
|---|---|
| 0 | 1.000 |
| 7 | 0.500 |
| 14 | 0.250 |
| 28 | 0.063 |
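The decay table above follows directly from the contribution formula; this sketch reproduces it (`usage_contribution` is an illustrative name).

```python
def usage_contribution(age_days, halflife=7.0):
    """Exponential half-life decay: a use halflife days ago counts half."""
    return 2 ** (-age_days / halflife)

def usage_score(event_ages_days, halflife=7.0):
    """Sum of decayed contributions from qualifying events."""
    return sum(usage_contribution(age, halflife) for age in event_ages_days)
```

Two week-old uses thus weigh exactly as much as one use today, which is the trade-off the half-life parameter tunes.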
Config parameters:
- `analytics.decay_halflife_days = 7`
- `analytics.session_timeout_minutes = 30`
- `retrieval.usage_weight = 0.2` (RRF fusion weight)
Design rationale for the structural precision fixes applied after initial evaluation.
Problem: Short prefix matches (3-4 chars) caused false-positive file boosting.
Fix: Minimum 5-character shared prefix with 60% coverage requirement. Both query term and filename component must be >= 5 characters. Only files already in fused results are boosted (no injection of files missed by BM25/TF-IDF).
Problem: Single-reference imports injected marginally relevant files that consumed budget.
Fix: Minimum 2 references required (import_inject_threshold), maximum 2 injections per
query. Source files prioritized over test files. Scoring differentiated by reference count
(3+ refs = 90% of top score; 2 refs = 70% of p75 score).
Problem: Fixed file limits are suboptimal across project sizes.
Fix: max_files = min(max(3, n_files // 3), 8). Scales from 3 (small) to 8 (large).
Config override available via retrieval.max_files.
Rationale: These are structural precision fixes that improve retrieval quality without algorithmic changes. They address false positives from overly broad matching rather than modifying the core ranking algorithms.
Storage per chunk: JSON dict of {term: weight} for vocabulary terms present in that chunk.
Total footprint: O(vocabulary_size * avg_terms_per_chunk).
| Corpus size | Typical vocabulary | Avg terms/chunk | Estimated size |
|---|---|---|---|
| 5K chunks | 5,000-10,000 | 20-50 | 2-10 MB |
| 50K chunks | 10,000 (capped) | 20-50 | 20-100 MB |
When enabled (requires switching embedding.backend from "tfidf"):
O(chunks * 384 * 4 bytes) for 384-dimensional float32 vectors.
| Corpus size | Estimated size |
|---|---|
| 5K chunks | ~7 MB |
| 50K chunks | ~73 MB |
O(capacity * avg_chunk_size) for live entries. Ghost lists store keys only (negligible).
Default: 500 live entries + 1000 ghost keys.
Optimal sizing at default parameters (100K capacity, 0.1% FP):
m = ceil(-100000 * ln(0.001) / ln(2)^2) = ~1,437,759 bits = ~175 KB.
- WAL mode (write-ahead logging) for concurrent read/write access.
- 64 MB page cache (`PRAGMA cache_size`).
- 256 MB memory-mapped I/O (`PRAGMA mmap_size`).
embedding.backend = "tfidf" (default) keeps the system sparse-only, disabling dense
vector allocation entirely. This is the recommended configuration for codebases under
50K chunks where TF-IDF recall is sufficient.