fix(embed): truncate oversized inputs to the embedding token limit by winlp4ever · Pull Request #143 · vcmf/dim0

winlp4ever · 2026-06-21T13:16:02Z

Problem

text-embedding-3-small rejects any single input over 8191 tokens. When a long assistant message (deep research, a sheet, mini-app JSX in the reasoning trace) or a long/non-English note was persisted and embedded, the API returned a 400 (Invalid 'input[0]': maximum input length is 8192 tokens), which 500'd the whole chat turn (observed in chats.send_message → qdrant.store.add → embed).

The embed path had no token awareness, so oversized content went straight to the API and failed.

Fix

Clamp every embedding input to 8000 tokens (headroom under 8191) at the single chokepoint — OpenAIEmbedder._embed_batch. Because chat messages, notes, and file chunks all funnel through here, this covers every caller, not just the chat path that happened to trip first.

Token-accurate via tiktoken (cl100k_base, the encoding the model actually uses) — a flat char cap is unsafe because density varies ~4× (English ~4 chars/tok vs CJK/code ~1–2.5), so a "safe" char count would still 400 for non-English users.
O(1) fast path: a token spans ≥1 UTF-8 byte and a char is ≤4 bytes, so any string of at most max_tokens // 4 chars provably can't overflow and skips the synchronous, CPU-bound BPE encode entirely, which keeps the encode off the event loop's critical path for short inputs.
Lazy singleton encoder: the ~1.5s vocab load happens once per process, on first use.
Observability: logs a warning when truncation actually clips, so we can see how often it fires and on what.
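
A minimal sketch of the clamp described above, not the PR's actual code. The names truncate_to_tokens, get_encoder, and ByteEncoder are hypothetical. ByteEncoder is a stand-in that emits one token per UTF-8 byte so the sketch runs without tiktoken or a network; the real path would presumably load tiktoken.get_encoding("cl100k_base"), which exposes the same encode/decode pair. The fast path relies on tokens ≤ UTF-8 bytes ≤ 4 × chars, so len(text) ≤ max_tokens // 4 implies the text fits.

```python
from __future__ import annotations

import logging
from functools import lru_cache
from typing import Optional, Protocol

logger = logging.getLogger(__name__)

MAX_TOKENS = 8000  # headroom under the model's 8191-token per-input limit


class Encoder(Protocol):
    def encode(self, text: str) -> list[int]: ...
    def decode(self, tokens: list[int]) -> str: ...


class ByteEncoder:
    """Stand-in encoder (assumption): one token per UTF-8 byte, for offline runs."""

    def __init__(self) -> None:
        self.calls = 0  # lets a test check that short text never reaches encode()

    def encode(self, text: str) -> list[int]:
        self.calls += 1
        return list(text.encode("utf-8"))

    def decode(self, tokens: list[int]) -> str:
        # Drop a multi-byte character that truncation cut in half.
        return bytes(tokens).decode("utf-8", errors="ignore")


@lru_cache(maxsize=1)
def get_encoder() -> Encoder:
    """Lazy singleton: built once per process, on first use."""
    return ByteEncoder()  # real code would load the cl100k_base encoding here


def truncate_to_tokens(
    text: str,
    max_tokens: int = MAX_TOKENS,
    encoder: Optional[Encoder] = None,
) -> str:
    # Fast path: tokens <= UTF-8 bytes <= 4 * chars, so this text cannot overflow.
    if len(text) <= max_tokens // 4:
        return text
    enc = encoder if encoder is not None else get_encoder()
    tokens = enc.encode(text)
    if len(tokens) <= max_tokens:
        return text
    logger.warning("embedding input truncated: %d -> %d tokens", len(tokens), max_tokens)
    return enc.decode(tokens[:max_tokens])
```

With a real BPE encoder the decoded prefix should be re-encoded in tests to confirm it stays within the cap, which is what the test_truncate.py cases described below check.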

Offline / prod

tiktoken was only a transitive dependency, so it is now declared as a direct one. It downloads cl100k_base from an Azure Blob Storage URL (openaipublic.blob.core.windows.net) on first use, which fails in egress-restricted prod even when api.openai.com is reachable (and the lazy load surfaces the failure on the first embed, not at boot). The Dockerfile now pre-fetches the vocab at build time into a baked TIKTOKEN_CACHE_DIR, verified to load with the network fully blocked.

Tests

test_truncate.py — short/empty/at-limit passthrough; long English / code / CJK all re-encode to ≤ cap (CJK is the case a char cap misses); reproduces the original >8191-token overflow; asserts short text never touches the encoder.
test_embed.py — with a mocked client, asserts an oversized input reaches the API clamped, empty → $ placeholder preserved, short text sent verbatim.

Known follow-up (out of scope, pre-existing)

embed() packs up to 1000 inputs per request and can exceed OpenAI's per-request token cap (~300k) — a separate latent 400. This change only shrinks per-input size and doesn't worsen it; worth a dedicated follow-up for token-aware batching.
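
One possible shape for that follow-up, sketched under assumptions: batch_by_tokens, count_tokens, and both limits are hypothetical names and values, and the 300k figure is the approximate cap quoted above, not a verified API limit. The idea is to close a batch as soon as either the input count or the running token total would be exceeded.

```python
from __future__ import annotations

from typing import Callable, Iterable, Iterator

MAX_INPUTS_PER_REQUEST = 1000     # current batch size, per the note above
MAX_TOKENS_PER_REQUEST = 300_000  # approximate per-request cap (assumption)


def batch_by_tokens(
    texts: Iterable[str],
    count_tokens: Callable[[str], int],
    max_inputs: int = MAX_INPUTS_PER_REQUEST,
    max_tokens: int = MAX_TOKENS_PER_REQUEST,
) -> Iterator[list[str]]:
    """Greedily pack texts into batches bounded by input count and token total."""
    batch: list[str] = []
    batch_tokens = 0
    for text in texts:
        n = count_tokens(text)
        # Close the current batch if adding this text would break either limit.
        if batch and (len(batch) >= max_inputs or batch_tokens + n > max_tokens):
            yield batch
            batch, batch_tokens = [], 0
        batch.append(text)
        batch_tokens += n
    if batch:
        yield batch
```

A single input larger than max_tokens still gets a batch of its own here, so this sketch assumes the per-input clamp has already run.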

text-embedding-3-small rejects any single input over 8191 tokens, which 500'd the whole chat turn when a long assistant message (or sheet / mini-app / non-English note) was persisted and embedded. Clamp every embedding input to 8000 tokens at the embedder chokepoint, so the fix covers chat messages, notes, and file chunks alike. Truncation is token-accurate via tiktoken (cl100k_base), with an O(1) char fast-path that skips the BPE encode for short text to keep bulk embedding off the event loop's critical path. A build-time warm of the tiktoken vocab makes the embedder fully offline in prod and removes the first-call download.

The per-input token clamp runs a synchronous, CPU-bound BPE encode for any text long enough to possibly overflow. On the bulk path (batch_size up to 1000) that ran inline on the asyncio event loop, stalling every concurrent request until the batch finished encoding. tiktoken releases the GIL during encode, so offloading the fit to a thread genuinely frees the loop. Offload only when a batch actually contains encode-worthy text; all-short batches (e.g. chat turns) stay inline and skip the thread hop, since the O(1) fast path means they never encode.
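
A minimal sketch of that conditional offload, assuming hypothetical names (fit_batch, needs_encode, clamp). The clamp here is a stand-in that slices characters where the real code would run the tiktoken-based truncation, and asyncio.to_thread is one stdlib way to do the hop; the commit may use a different executor mechanism.

```python
from __future__ import annotations

import asyncio

MAX_TOKENS = 8000
FAST_PATH_CHARS = MAX_TOKENS // 4  # at or below this length, text cannot overflow


def needs_encode(text: str) -> bool:
    """True when the text is long enough that the clamp would have to encode it."""
    return len(text) > FAST_PATH_CHARS


def clamp(text: str) -> str:
    """Stand-in for the token clamp (assumption): real code would BPE-encode here."""
    return text[:MAX_TOKENS] if needs_encode(text) else text


def _clamp_all(texts: list[str]) -> list[str]:
    return [clamp(t) for t in texts]


async def fit_batch(texts: list[str]) -> list[str]:
    if any(needs_encode(t) for t in texts):
        # At least one input may need encoding: run the batch off the event loop.
        return await asyncio.to_thread(_clamp_all, texts)
    # All-short batch: nothing will be encoded, so skip the thread hop.
    return _clamp_all(texts)
```

The any() scan only compares lengths, so deciding whether to offload stays cheap even for a 1000-input batch.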

winlp4ever added 2 commits June 21, 2026 15:15

winlp4ever merged commit 5c871bb into main Jun 21, 2026

winlp4ever merged 2 commits into main from fix/embed-token-truncation
