feat: support configurable FTS5 tokenizer for better CJK search #1961

## Description

The current FTS5 tables use the built-in `trigram` tokenizer, which works well for Latin scripts but has limitations for Chinese text:

1. **2-char Chinese words** (早报, 你好, 配置) fall below the trigram window (3-char minimum) and cannot be matched by FTS5
2. **Mixed ASCII+CJK tokens** (C盘, API配置) fail to split correctly in 
3. **No semantic segmentation** — trigram treats text as character sequences, missing word boundaries (e.g., 早报 = early+report, not random bigrams)
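
To make the first limitation concrete, here is a minimal sketch of character-level trigram windowing. It is a simplified model for illustration, not the plugin's or SQLite's actual code, and the `trigrams` helper is a hypothetical name:

```typescript
// Simplified model of trigram windowing (hypothetical helper, for illustration only).
// Array.from splits by code point, so each CJK character counts as one unit.
function trigrams(text: string): string[] {
  const chars = Array.from(text);
  const windows: string[] = [];
  for (let i = 0; i + 3 <= chars.length; i++) {
    windows.push(chars.slice(i, i + 3).join(""));
  }
  return windows;
}

const twoCharWord = trigrams("早报");   // [] : nothing to index or match
const mixedToken = trigrams("API配置"); // ["API", "PI配", "I配置"]
```

A 2-character word yields no window at all, and the windows for a mixed token straddle the ASCII/CJK boundary rather than following word boundaries.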

## Current Workaround

We modified  to relax minimum token length:
```javascript
// Changed from TRIGRAM_MIN(3) to allow 1-char CJK
if (r.length >= 1) expanded.push(r);
```

This works in the short term but introduces noise: high-frequency single characters such as 的/是/了 enter the FTS index.

## Proposed Solution

Add support for configurable FTS5 tokenizers, specifically:

1. **jieba-based tokenizer** — Chinese word segmentation before FTS5 insertion
2. **Config option** — `storage.ftsTokenizer: 'trigram' | 'jieba' | 'custom'`
3. **Write-time segmentation** — Pre-segment text with jieba before inserting into FTS5 tables
4. **Query-time segmentation** — Apply same segmentation to search queries in `prepareFtsMatch()`
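
The key constraint in points 3 and 4 is that indexing and querying go through the same segmenter. A rough sketch of that symmetry follows. Everything here is an assumption for illustration: the `Segmenter` type, `segmentForIndex`, and `segmentForMatch` are hypothetical names, and a toy forward-maximum-matching dictionary segmenter stands in for jieba. It also assumes the FTS5 table would use a tokenizer that splits on whitespace (such as `unicode61`) once the text is pre-segmented.

```typescript
// Hypothetical interface: any word segmenter (jieba in the proposal) fits this shape.
type Segmenter = (text: string) => string[];

// Toy stand-in for jieba: forward maximum matching against a tiny dictionary.
// Characters not covered by the dictionary (including ASCII) come out one by one.
function makeDictSegmenter(dict: Set<string>, maxLen = 4): Segmenter {
  return (text) => {
    const chars = Array.from(text);
    const out: string[] = [];
    let i = 0;
    while (i < chars.length) {
      let matched = chars[i]; // fall back to a single character
      for (let len = Math.min(maxLen, chars.length - i); len > 1; len--) {
        const candidate = chars.slice(i, i + len).join("");
        if (dict.has(candidate)) {
          matched = candidate;
          break;
        }
      }
      if (matched.trim() !== "") out.push(matched); // drop whitespace
      i += Array.from(matched).length;
    }
    return out;
  };
}

// Write side: join segments with spaces so each word becomes one FTS5 token.
function segmentForIndex(text: string, seg: Segmenter): string {
  return seg(text).join(" ");
}

// Query side: same segmenter; each term is quoted as an FTS5 string
// (embedded double quotes doubled), and FTS5 ANDs the terms implicitly.
function segmentForMatch(query: string, seg: Segmenter): string {
  return seg(query)
    .map((term) => `"${term.replace(/"/g, '""')}"`)
    .join(" ");
}

const seg = makeDictSegmenter(new Set(["早报", "配置", "你好"]));
const indexed = segmentForIndex("你好早报配置", seg); // "你好 早报 配置"
const match = segmentForMatch("早报配置", seg);        // "\"早报\" \"配置\""
```

Because both sides share one segmenter, a 2-character word such as 早报 is stored and searched as the same token, which the trigram window cannot do.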

## Implementation Sketch

```typescript
// config.yaml (YAML, shown here as a comment):
//   storage:
//     ftsTokenizer: jieba   # or 'trigram' (default)

// keyword.ts: prepareFtsMatch()
let tokens: string[];
if (config.storage.ftsTokenizer === 'jieba') {
  tokens = jieba.cut(query);  // word-level segmentation
} else {
  tokens = trigramExtract(query);  // current behavior
}
```
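
For the config option itself, one possible way to validate the value and fall back to the current behavior is sketched below. The `FtsTokenizer` type and `resolveFtsTokenizer` function are hypothetical names, not existing plugin code; only the three allowed values and the `trigram` default come from the proposal above.

```typescript
// Allowed values from the proposed `storage.ftsTokenizer` option.
type FtsTokenizer = "trigram" | "jieba" | "custom";

const FTS_TOKENIZERS: readonly string[] = ["trigram", "jieba", "custom"];

// Hypothetical helper: accept the raw config value, default to 'trigram'
// when the option is absent, and reject anything unrecognized.
function resolveFtsTokenizer(raw: unknown): FtsTokenizer {
  if (raw === undefined || raw === null) {
    return "trigram"; // keep current behavior for existing configs
  }
  if (typeof raw === "string" && FTS_TOKENIZERS.indexOf(raw) !== -1) {
    return raw as FtsTokenizer;
  }
  throw new Error(`Unsupported storage.ftsTokenizer: ${String(raw)}`);
}
```

Defaulting to `trigram` means existing installations keep their current index and query behavior unless they opt in.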

## Related Issues

- #1904 — CJK text corruption on Windows (GBK encoding)
- #1595 — FTS5 query failures with long instructional prompts
- #130 — Chinese input handling (closed)

## Environment

- MemOS: v2.0.20
- OS: Windows 11
- Language: TypeScript (memos-local-plugin)
- Agent: Hermes Agent

## Willingness to Implement

- [x] I'm willing to submit a PR if the maintainers are interested in this direction
