Description
The current FTS5 tables use the built-in trigram tokenizer, which works well for Latin scripts but has limitations for Chinese text:
- 2-char Chinese words (早报, 你好, 配置) fall below the trigram window (3-char minimum) and cannot be matched by FTS5
- Mixed ASCII+CJK tokens (C盘, API配置) fail to split correctly in
- No semantic segmentation — trigram treats text as character sequences, missing word boundaries (e.g., 早报 = early+report, not random bigrams)
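The first limitation can be reproduced with a minimal sketch of trigram extraction. The helper name trigrams is hypothetical and is not the plugin's actual code; it only assumes that a trigram index is built from sliding 3-character windows.

```typescript
// Hypothetical helper (not from the plugin): sliding 3-character windows.
// Array.from splits by Unicode code point, so CJK characters stay whole.
function trigrams(text: string): string[] {
  const chars = Array.from(text);
  const out: string[] = [];
  for (let i = 0; i + 3 <= chars.length; i++) {
    out.push(chars.slice(i, i + 3).join(""));
  }
  return out;
}

console.log(trigrams("早报"));    // [] : a 2-char word has no 3-char window
console.log(trigrams("API配置")); // ["API", "PI配", "I配置"] : windows straddle the ASCII/CJK boundary
```

A 2-character query therefore yields no terms to match, and mixed tokens produce windows that correspond to no actual word.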
Current Workaround
We modified to relax minimum token length:
// Changed from TRIGRAM_MIN(3) to allow 1-char CJK
if (r.length >= 1) expanded.push(r);
This works short-term but introduces noise: high-frequency single characters such as 的/是/了 enter the FTS index.
Proposed Solution
Add support for configurable FTS5 tokenizers, specifically:
- jieba-based tokenizer: Chinese word segmentation before FTS5 insertion
- Config option: storage.ftsTokenizer: 'trigram' | 'jieba' | 'custom'
- Write-time segmentation: pre-segment text with jieba before inserting into FTS5 tables
- Query-time segmentation: apply the same segmentation to search queries in prepareFtsMatch()
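A rough sketch of the write-time and query-time symmetry follows. To stay self-contained it does not call jieba; makeDictSegmenter is a hypothetical stand-in that does greedy forward maximum matching over a tiny word list, which is far simpler than jieba's real segmentation. segmentForIndex and segmentForQuery are also illustrative names. The sketch assumes the FTS5 table would use a tokenizer that splits on whitespace, so that space-joined segments are indexed as whole words.

```typescript
// A segmenter turns raw text into index/query terms.
type Segment = (text: string) => string[];

// Stand-in for jieba (assumption): greedy forward maximum matching
// over a small dictionary. ASCII runs are kept as single tokens.
function makeDictSegmenter(words: string[], maxLen = 4): Segment {
  const dict = new Set(words);
  return (text: string): string[] => {
    const out: string[] = [];
    // Split into ASCII word runs and non-ASCII runs; whitespace is dropped.
    const runs = text.match(/[A-Za-z0-9_]+|[^A-Za-z0-9_\s]+/g) ?? [];
    for (const run of runs) {
      if (/^[A-Za-z0-9_]+$/.test(run)) {
        out.push(run);
        continue;
      }
      const chars = Array.from(run);
      let i = 0;
      while (i < chars.length) {
        // Try the longest candidate first; fall back to one character.
        let len = Math.min(maxLen, chars.length - i);
        while (len > 1 && !dict.has(chars.slice(i, i + len).join(""))) len--;
        out.push(chars.slice(i, i + len).join(""));
        i += len;
      }
    }
    return out;
  };
}

// Write-time: store space-joined segments in the FTS5 column.
function segmentForIndex(text: string, segment: Segment): string {
  return segment(text).join(" ");
}

// Query-time: run the same segmenter so query terms line up with indexed terms.
function segmentForQuery(query: string, segment: Segment): string[] {
  return segment(query);
}

const segment = makeDictSegmenter(["早报", "配置"]);
console.log(segmentForIndex("今天的早报 API配置", segment)); // "今 天 的 早报 API 配置"
console.log(segmentForQuery("C盘", segment));                // ["C", "盘"]
```

Using one segmenter for both paths is the point of the design: if indexing and querying segment differently, a word indexed as 早报 can never be hit by a query that was split another way.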
Implementation Sketch
# config.yaml
storage:
  ftsTokenizer: jieba  # or 'trigram' (default)

// keyword.ts: prepareFtsMatch()
if (config.storage.ftsTokenizer === 'jieba') {
  tokens = jieba.cut(query); // word-level segmentation
} else {
  tokens = trigramExtract(query); // current behavior
}
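A slightly fuller version of this dispatch might look like the sketch below. The signatures of prepareFtsMatch and trigramExtract are guesses, since the issue only names them. FtsConfig, quoteFtsTerm and the injected segmenters map are hypothetical, and joining terms with OR is an assumption about the desired matching behavior. The segmenter is passed in as a plain function, so no particular jieba binding is implied.

```typescript
type FtsTokenizer = "trigram" | "jieba" | "custom";
type Segment = (text: string) => string[];

// Hypothetical config shape mirroring storage.ftsTokenizer.
interface FtsConfig {
  storage: { ftsTokenizer: FtsTokenizer };
}

// FTS5 string literal: wrap in double quotes and double any embedded quote.
function quoteFtsTerm(term: string): string {
  return '"' + term.replace(/"/g, '""') + '"';
}

// Assumed shape of the current behavior: sliding 3-character windows.
function trigramExtract(query: string): string[] {
  const chars = Array.from(query);
  const out: string[] = [];
  for (let i = 0; i + 3 <= chars.length; i++) {
    out.push(chars.slice(i, i + 3).join(""));
  }
  return out;
}

// Builds an FTS5 MATCH expression from the configured tokenizer.
function prepareFtsMatch(
  query: string,
  config: FtsConfig,
  segmenters: Partial<Record<"jieba" | "custom", Segment>> = {},
): string {
  const mode = config.storage.ftsTokenizer;
  let tokens: string[];
  if (mode === "trigram") {
    tokens = trigramExtract(query);
  } else {
    const segment = segmenters[mode];
    if (!segment) throw new Error(`No segmenter registered for '${mode}'`);
    tokens = segment(query);
  }
  // Drop empty terms, quote the rest, and OR them together (assumption).
  return tokens
    .filter((t) => t.trim() !== "")
    .map(quoteFtsTerm)
    .join(" OR ");
}

// Usage with a trivial whitespace splitter standing in for jieba.
const demoConfig: FtsConfig = { storage: { ftsTokenizer: "jieba" } };
const demoMatch = prepareFtsMatch("早报 配置", demoConfig, {
  jieba: (q) => q.split(/\s+/),
});
console.log(demoMatch); // "早报" OR "配置"
```

Quoting each term keeps characters that FTS5 would otherwise read as query syntax from breaking the MATCH expression, and throwing on a missing segmenter makes a misconfigured 'custom' mode fail loudly instead of silently returning no results.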
Related Issues
Environment
- MemOS: v2.0.20
- OS: Windows 11
- Language: TypeScript (memos-local-plugin)
- Agent: Hermes Agent
Willingness to Implement