feat(chunking_utils): token-aware chunking — prevent silent data loss by luojiyin1987 · Pull Request #375 · StarTrail-org/LEANN

luojiyin1987 · 2026-06-13T09:43:45Z

Closes #374

Problem

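chunk_size is measured in characters, but embedding models enforce token limits (e.g. 512, 2048, or 8192). The character-to-token ratio varies by language and content; code can reach 1.2 tokens per character. Chunks that exceed the limit are silently truncated at embedding time, so the truncated portion is never embedded.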
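To make the failure mode concrete, here is a minimal sketch. It assumes a toy regex tokenizer and a hypothetical embed_input() that truncates the way an embedding backend might; neither is LEANN code, and real tokenizers will produce different counts.

```python
import re


def toy_tokenize(text: str) -> list:
    # Hypothetical stand-in for a real tokenizer:
    # each word and each punctuation mark counts as one token.
    return re.findall(r"\w+|[^\w\s]", text)


def embed_input(text: str, token_limit: int) -> list:
    # Mimics a backend that silently drops every token past the limit.
    return toy_tokenize(text)[:token_limit]


# Dense code-like text: every character becomes its own token here.
chunk = "x=a[i]+b[j];" * 20
tokens = toy_tokenize(chunk)
kept = embed_input(chunk, token_limit=64)
dropped = len(tokens) - len(kept)

print(f"{len(chunk)} chars -> {len(tokens)} tokens, {dropped} dropped without any error")
```

The chunk looks small when measured in characters, yet most of it never reaches the model, and nothing reports the loss.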

Solution

An opt-in max_tokens_per_chunk param wires together existing infrastructure (all 3 helpers were already present but never connected):

# Before: chunk_size in chars, truncation happens later silently
chunks = create_text_chunks(docs, chunk_size=256)

# After: auto-scales chunk_size using calculate_safe_chunk_size()
limit = get_model_token_limit("nomic-embed-text")
chunks = create_text_chunks(docs, chunk_size=256, max_tokens_per_chunk=limit)
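The scaling idea can be sketched as follows. This is an illustration only: scale_chunk_size() and its tokens_per_char parameter are hypothetical names, and the actual calculate_safe_chunk_size() in the repository may use a different formula or signature. The 1.2 tokens/char default is the worst-case figure quoted in the Problem section.

```python
import math
from typing import Optional


def scale_chunk_size(
    chunk_size: int,
    max_tokens: Optional[int],
    tokens_per_char: float = 1.2,  # assumed worst-case ratio, not a measured value
) -> int:
    """Shrink a character-based chunk_size so it should fit in max_tokens."""
    if max_tokens is None:
        # Opt-in behaviour: no limit given, so leave chunk_size untouched.
        return chunk_size
    max_chars = math.floor(max_tokens / tokens_per_char)
    # Only ever shrink; never grow past what the caller asked for.
    return max(1, min(chunk_size, max_chars))


print(scale_chunk_size(256, None))    # unchanged when no limit is passed
print(scale_chunk_size(256, 512))     # already small enough
print(scale_chunk_size(4000, 2048))   # shrunk to fit the limit
```

Clamping with min() keeps the change safe for callers who already chose a small chunk_size: the limit only takes effect when the requested size would overflow.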

Changes (2 files, +113/-14)

chunking_utils.py: auto-scale chunk_size, auto-scale AST chunk_size, revalidate overlap, post-validate tokens
base_rag_example.py: _resolve_chunk_token_limit() helper + build-time chunk validation

Backward compat: with max_tokens_per_chunk=None (the default), behavior is identical to before.

- Add max_tokens_per_chunk param to create_traditional_chunks() and create_text_chunks(): auto-scales chunk_size using existing calculate_safe_chunk_size() and validates with validate_chunk_token_limits()
- Also scales AST chunk_size when token limit is given
- Revalidate chunk_overlap after scaling to avoid SentenceSplitter errors
- Add BaseRAGExample._resolve_chunk_token_limit() helper to resolve the embedding model's token limit from CLI args
- build_index() now warns if existing chunks exceed the model's token limit

Backward compat: default max_tokens_per_chunk=None (no-op, identical behavior)
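Two of the steps in this commit, revalidating the overlap after chunk_size shrinks and checking chunks after splitting, can be sketched roughly as below. clamp_overlap() and find_oversized() are hypothetical names, not the PR's functions, and the token counter is passed in so the sketch does not assume any particular tokenizer.

```python
from typing import Callable, List, Sequence


def clamp_overlap(chunk_size: int, chunk_overlap: int) -> int:
    # After chunk_size is scaled down, an overlap that used to be valid may
    # now be as large as the chunk itself. Keep it strictly smaller.
    return max(0, min(chunk_overlap, chunk_size - 1))


def find_oversized(
    chunks: Sequence[str],
    max_tokens: int,
    count_tokens: Callable[[str], int],
) -> List[int]:
    # Post-validation: return indices of chunks still above the token limit,
    # so the caller can warn instead of letting them be truncated silently.
    return [i for i, chunk in enumerate(chunks) if count_tokens(chunk) > max_tokens]


overlap = clamp_overlap(chunk_size=100, chunk_overlap=128)
oversized = find_oversized(
    ["one two three", "one"],
    max_tokens=2,
    count_tokens=lambda s: len(s.split()),  # toy whitespace counter
)
print(overlap, oversized)
```

Returning indices rather than raising keeps the check advisory, which matches the description of build_index() warning rather than failing.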

- Delete _traditional_chunks_as_dicts(): it was a pure alias for create_traditional_chunks(). Replace all 6 call sites with direct calls.
- Extract _parse_ast_chunk_output(): normalizes the 4 different chunk output shapes (object, str, dict with content, dict with text) into a uniform (text, metadata) tuple. Removes 20 lines of inline dispatch from create_ast_chunks().
- Net: +28/−35 lines, no behaviour change
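For readers unfamiliar with the four shapes, a normalizer along these lines is one way to do it. This is a sketch rather than the PR's _parse_ast_chunk_output(): the function name, the attribute names on the object shape, the "metadata" key, and the precedence of "content" over "text" are all assumptions.

```python
from types import SimpleNamespace
from typing import Any, Dict, Tuple


def parse_chunk_output(chunk: Any) -> Tuple[str, Dict[str, Any]]:
    """Normalize str / dict / object chunk shapes into (text, metadata)."""
    if isinstance(chunk, str):
        return chunk, {}
    if isinstance(chunk, dict):
        # Assumed precedence: "content" first, then "text".
        text = chunk.get("content", chunk.get("text", ""))
        return str(text), dict(chunk.get("metadata") or {})
    # Fallback: an object exposing .text or .content (assumed attribute names).
    text = getattr(chunk, "text", None)
    if text is None:
        text = getattr(chunk, "content", "")
    return str(text), dict(getattr(chunk, "metadata", None) or {})


samples = [
    "plain string",
    {"content": "from content", "metadata": {"lang": "py"}},
    {"text": "from text"},
    SimpleNamespace(text="from object", metadata={"line": 3}),
]
for sample in samples:
    print(parse_chunk_output(sample))
```

Funnelling every shape through one function means the caller only ever handles a (text, metadata) pair, which is what lets the inline dispatch be removed.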

luojiyin1987 added 3 commits June 13, 2026 17:43

style: apply ruff format to satisfy CI pre-commit checks

7c3a86c

luojiyin1987 wants to merge 3 commits into StarTrail-org:main from luojiyin1987:feat/token-aware-chunking
