feat: support configurable FTS5 tokenizer for better CJK search #1961

Description

@7889545

The current FTS5 tables use the built-in trigram tokenizer, which works well for Latin scripts but has limitations for Chinese text:

  1. 2-char Chinese words (早报, 你好, 配置) fall below the trigram window (3-char minimum) and cannot be matched by FTS5
  2. Mixed ASCII+CJK tokens (C盘, API配置) fail to split correctly
  3. No semantic segmentation — trigram treats text as character sequences, missing word boundaries (e.g., 早报 = early+report, not random bigrams)
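
To make the first limitation concrete, here is a minimal sketch of sliding-window trigram extraction. The `trigrams` helper is hypothetical and written only for illustration; it is not the plugin's actual extraction code. It shows why a 2-character word yields no trigrams at all, and why a mixed token such as API配置 is cut across the ASCII/CJK boundary.

```typescript
// Hypothetical helper (not from the MemOS codebase): sliding-window
// trigrams over Unicode code points.
function trigrams(text: string): string[] {
  // Array.from splits by code point, so CJK characters are never cut in half.
  const chars = Array.from(text);
  const out: string[] = [];
  for (let i = 0; i + 3 <= chars.length; i++) {
    out.push(chars.slice(i, i + 3).join(""));
  }
  return out;
}

// A 2-character word is shorter than the window, so nothing is produced.
console.log(trigrams("早报"));    // []

// A mixed token is sliced without regard to the script boundary.
console.log(trigrams("API配置")); // ["API", "PI配", "I配置"]
```

With an empty trigram list there is nothing to send to MATCH, which is consistent with the behavior described in item 1.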

Current Workaround

We relaxed the minimum token length:

// Changed from TRIGRAM_MIN(3) to allow 1-char CJK
if (r.length >= 1) expanded.push(r);

This works short-term but introduces noise (high-frequency single characters like 的/是/了 enter the FTS index).

Proposed Solution

Add support for configurable FTS5 tokenizers, specifically:

  1. jieba-based tokenizer — Chinese word segmentation before FTS5 insertion
  2. Config option: storage.ftsTokenizer: 'trigram' | 'jieba' | 'custom'
  3. Write-time segmentation — Pre-segment text with jieba before inserting into FTS5 tables
  4. Query-time segmentation — Apply same segmentation to search queries in prepareFtsMatch()
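
The write-time step (item 3) could look roughly like the sketch below. Everything in it is an assumption made for illustration: `segment` is a tiny forward-maximum-matching segmenter over a hard-coded dictionary, standing in for a real jieba binding, and `toFtsText` is a hypothetical helper name. The idea is that text is stored as space-separated words, assuming the FTS5 table then uses a tokenizer that splits on whitespace (for example unicode61) rather than trigram.

```typescript
// Stand-in for jieba (assumption): forward maximum matching over a tiny
// hard-coded dictionary. A real implementation would call a jieba binding.
const DICT = new Set(["早报", "你好", "配置"]);
const MAX_WORD_LEN = 2; // longest dictionary entry, in characters

function segment(text: string): string[] {
  const chars = Array.from(text);
  const tokens: string[] = [];
  let i = 0;
  while (i < chars.length) {
    const ch = chars[i];
    if (/\s/.test(ch)) { i++; continue; } // skip whitespace
    if (/[A-Za-z0-9]/.test(ch)) {
      // Keep a run of ASCII letters/digits together as one token.
      let j = i;
      while (j < chars.length && /[A-Za-z0-9]/.test(chars[j])) j++;
      tokens.push(chars.slice(i, j).join(""));
      i = j;
      continue;
    }
    // Try the longest dictionary word first; fall back to a single character.
    let matched = ch;
    for (let len = Math.min(MAX_WORD_LEN, chars.length - i); len > 1; len--) {
      const candidate = chars.slice(i, i + len).join("");
      if (DICT.has(candidate)) { matched = candidate; break; }
    }
    tokens.push(matched);
    i += Array.from(matched).length;
  }
  return tokens;
}

// Hypothetical write-time helper: the string to insert into the FTS5 column.
function toFtsText(text: string): string {
  return segment(text).join(" ");
}

console.log(segment("API配置"));    // ["API", "配置"]
console.log(toFtsText("今日早报")); // "今 日 早报"
```

Because the same `segment` function would be applied to queries (item 4), a search for 早报 produces the token 早报 on both sides and can match. Characters outside the dictionary still fall back to single-character tokens, so the noise concern from the workaround only shrinks as dictionary coverage grows.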

Implementation Sketch

# config.yaml
storage:
  ftsTokenizer: jieba  # or 'trigram' (default)

// keyword.ts — prepareFtsMatch()
if (config.storage.ftsTokenizer === 'jieba') {
  tokens = jieba.cut(query);  // word-level segmentation
} else {
  tokens = trigramExtract(query);  // current behavior
}
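
A slightly fuller version of the query-time branch might look like the following sketch. The signature of `prepareFtsMatch` here is an assumption (the real function in keyword.ts may take different arguments), and `Segmenter`, `quoteFtsToken`, and the `segmenters` map are hypothetical names. Segmenters are passed in so that the same function can be shared with the write path. Each token is wrapped in double quotes, with embedded quotes doubled, which is how FTS5 query syntax expresses a literal string; space-separated terms are implicitly ANDed.

```typescript
type FtsTokenizer = "trigram" | "jieba" | "custom";
type Segmenter = (text: string) => string[];

// Quote a token as an FTS5 string literal: wrap in double quotes and
// double any embedded double quote.
function quoteFtsToken(token: string): string {
  return '"' + token.replace(/"/g, '""') + '"';
}

// Hypothetical signature: pick the segmenter named by the config value and
// build a MATCH expression from its tokens.
function prepareFtsMatch(
  query: string,
  tokenizer: FtsTokenizer,
  segmenters: Record<FtsTokenizer, Segmenter>,
): string {
  const tokens = segmenters[tokenizer](query)
    .filter((t) => t.trim().length > 0); // drop empty tokens
  return tokens.map(quoteFtsToken).join(" ");
}

// Usage with placeholder segmenters (real ones would wrap jieba / trigramExtract).
const demoSegmenters: Record<FtsTokenizer, Segmenter> = {
  trigram: (q) => [q],
  jieba: (q) => q.split("|"), // fake segmentation, for demonstration only
  custom: (q) => [q],
};
console.log(prepareFtsMatch("早报|API", "jieba", demoSegmenters)); // "早报" "API"
```

When segmentation yields no tokens the function returns an empty string; callers would likely want to skip the FTS query in that case rather than pass an empty expression to MATCH.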

Related Issues

Environment

  • MemOS: v2.0.20
  • OS: Windows 11
  • Language: TypeScript (memos-local-plugin)
  • Agent: Hermes Agent

Willingness to Implement

  • I'm willing to submit a PR if the maintainers are interested in this direction

Metadata

    Labels

    enhancement: New feature or improvement
    plugin: Plugin/adapter/bridge layer (apps/ directory)
