fix(embed): truncate oversized inputs to the embedding token limit #143

Merged
winlp4ever merged 2 commits into main from fix/embed-token-truncation
Jun 21, 2026
@winlp4ever

Contributor

Problem

text-embedding-3-small rejects any single input over 8191 tokens. When a long assistant message (deep research, a sheet, mini-app JSX in the reasoning trace) or a long or non-English note was persisted and embedded, the API returned 400 (Invalid 'input[0]': maximum input length is 8192 tokens), which 500'd the whole chat turn (observed in chats.send_message → qdrant.store.add → embed).

The embed path had no token awareness, so oversized content went straight to the API and failed.

Fix

Clamp every embedding input to 8000 tokens (headroom under 8191) at the single chokepoint — OpenAIEmbedder._embed_batch. Because chat messages, notes, and file chunks all funnel through here, this covers every caller, not just the chat path that happened to trip first.

  • Token-accurate via tiktoken (cl100k_base, the encoding the model actually uses) — a flat char cap is unsafe because density varies ~4× (English ~4 chars/tok vs CJK/code ~1–2.5), so a "safe" char count would still 400 for non-English users.
  • O(1) fast path: a token spans ≥1 UTF-8 byte and a char is ≤4 bytes, so tokens ≤ bytes ≤ 4 × chars. Any string within max_tokens // 4 chars provably can't overflow and skips the synchronous BPE encode entirely, which keeps bulk embedding off the event loop's critical path.
  • Lazy singleton encoder: the ~1.5s vocab load happens once per process, on first use.
  • Observability: logs a warning when truncation actually clips, so we can see how often it fires and on what.
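The four points above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code merged in this PR: `fit_to_token_limit`, `_get_encoder`, and `MAX_EMBED_TOKENS` are hypothetical names, and the `get_encoder` parameter exists only so the fast path can be exercised without loading tiktoken.

```python
import logging
from functools import lru_cache

logger = logging.getLogger(__name__)

MAX_EMBED_TOKENS = 8000  # headroom under the model's 8191-token input limit


@lru_cache(maxsize=1)
def _get_encoder():
    """Load cl100k_base once per process, on first use (lazy singleton)."""
    import tiktoken  # deferred so importing this module does not load the vocab

    return tiktoken.get_encoding("cl100k_base")


def fit_to_token_limit(text, max_tokens=MAX_EMBED_TOKENS, get_encoder=_get_encoder):
    """Return text unchanged if it fits in max_tokens, else a truncated copy."""
    # Fast path: tokens <= UTF-8 bytes <= 4 * chars, so a string of at most
    # max_tokens // 4 chars cannot overflow. len() on a str is O(1).
    if len(text) <= max_tokens // 4:
        return text

    enc = get_encoder()
    # disallowed_special=() makes special-token-looking text encode as plain text
    # instead of raising.
    tokens = enc.encode(text, disallowed_special=())
    if len(tokens) <= max_tokens:
        return text

    logger.warning(
        "embedding input truncated: %d -> %d tokens", len(tokens), max_tokens
    )
    # Assumption: if the cut lands inside a multi-byte character, decode() emits a
    # replacement character; the headroom under 8191 is assumed to absorb any
    # small drift when the API re-tokenizes the result.
    return enc.decode(tokens[:max_tokens])
```

With the default cap, anything up to 2000 characters returns immediately; only longer strings pay for the encode.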

Offline / prod

tiktoken was only a transitive dep — promoted it to a direct dependency. It downloads cl100k_base from a Microsoft blob URL on first use, which fails in egress-restricted prod even when api.openai.com is reachable (and the lazy load surfaces it on the first embed, not at boot). The Dockerfile now pre-fetches the vocab at build time into a baked TIKTOKEN_CACHE_DIR, verified to load with the network fully blocked.
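A build-time warm step along these lines could be what the Dockerfile invokes. This is a sketch, not the script from this PR: `warm_tiktoken_cache` and its `load` parameter are hypothetical (the latter only allows a dry run without network access), while `TIKTOKEN_CACHE_DIR` and `tiktoken.get_encoding` are the real tiktoken hooks.

```python
import os


def warm_tiktoken_cache(cache_dir, encoding_name="cl100k_base", load=None):
    """Load an encoding once so its vocab file is written into cache_dir."""
    os.makedirs(cache_dir, exist_ok=True)
    # tiktoken consults TIKTOKEN_CACHE_DIR when it loads an encoding, so the
    # variable must be set before the load below.
    os.environ["TIKTOKEN_CACHE_DIR"] = cache_dir
    if load is None:
        import tiktoken  # deferred: only needed for a real warm-up

        load = tiktoken.get_encoding
    # With the real loader, the first call downloads the vocab and caches it.
    return load(encoding_name)
```

The runtime image would need the same `TIKTOKEN_CACHE_DIR` value set in its environment for the baked file to be found.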

Tests

  • test_truncate.py — short/empty/at-limit passthrough; long English / code / CJK all re-encode to ≤ cap (CJK is the case a char cap misses); reproduces the original >8191-token overflow; asserts short text never touches the encoder.
  • test_embed.py — with a mocked client, asserts an oversized input reaches the API clamped, empty → $ placeholder preserved, short text sent verbatim.

Known follow-up (out of scope, pre-existing)

embed() packs up to 1000 inputs per request and can exceed OpenAI's per-request token cap (~300k) — a separate latent 400. This change only shrinks per-input size and doesn't worsen it; worth a dedicated follow-up for token-aware batching.
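One possible shape for that follow-up, offered only as a sketch: greedily pack inputs so each request respects both the input-count cap and a request-level token budget. `pack_batches` is a hypothetical helper, and the 300,000 default mirrors the approximate figure above rather than a verified limit.

```python
def pack_batches(token_counts, max_inputs=1000, max_request_tokens=300_000):
    """Group input indices into batches that respect both caps.

    token_counts[i] is the (already clamped) token count of input i.
    """
    batches, current, current_tokens = [], [], 0
    for i, n in enumerate(token_counts):
        # Flush the open batch if adding this input would break either cap.
        if current and (
            len(current) >= max_inputs or current_tokens + n > max_request_tokens
        ):
            batches.append(current)
            current, current_tokens = [], 0
        current.append(i)
        current_tokens += n
    if current:
        batches.append(current)
    return batches
```

Because each input is already clamped to 8000 tokens, a single input can never exceed the request budget on its own, so every batch holds at least one input.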

text-embedding-3-small rejects any single input over 8191 tokens, which
500'd the whole chat turn when a long assistant message (or sheet /
mini-app / non-English note) was persisted and embedded.

Clamp every embedding input to 8000 tokens at the embedder chokepoint, so
the fix covers chat messages, notes, and file chunks alike. Truncation is
token-accurate via tiktoken (cl100k_base), with an O(1) char fast-path that
skips the BPE encode for short text to keep bulk embedding off the event
loop's critical path. A build-time warm of the tiktoken vocab makes the
embedder fully offline in prod and removes the first-call download.

The per-input token clamp runs a synchronous BPE encode for
any text long enough to possibly overflow. On the bulk path (batch_size
up to 1000) that ran inline on the asyncio event loop, stalling every
concurrent request until the batch finished encoding.

tiktoken releases the GIL during encode, so offloading the fit to a thread
genuinely frees the loop. Offload only when a batch actually contains
encode-worthy text; all-short batches (e.g. chat turns) stay inline and
skip the thread hop, since the O(1) fast path means they never encode.
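
The conditional offload described in this commit message might look roughly like the sketch below. `needs_encode` and `fit_batch` are hypothetical names, and `fit` stands in for whatever per-input clamp the embedder uses; only `asyncio.to_thread` is a real standard-library call (Python 3.9+).

```python
import asyncio

MAX_EMBED_TOKENS = 8000


def needs_encode(text, max_tokens=MAX_EMBED_TOKENS):
    """True when text is too long for the O(1) char fast path."""
    return len(text) > max_tokens // 4


async def fit_batch(texts, fit, max_tokens=MAX_EMBED_TOKENS):
    """Clamp every text, leaving the event loop only when an encode may run."""
    if not any(needs_encode(t, max_tokens) for t in texts):
        # All-short batch: fit() returns via the fast path, so stay inline
        # and skip the thread hop.
        return [fit(t) for t in texts]
    # At least one text may hit the BPE encode: run the whole batch in a
    # worker thread so concurrent requests keep being served.
    return await asyncio.to_thread(lambda: [fit(t) for t in texts])
```

The check costs one `len()` per input, so a batch of short chat turns pays almost nothing for it.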

winlp4ever merged commit 5c871bb into main on Jun 21, 2026