fix(embed): truncate oversized inputs to the embedding token limit #143
Merged
text-embedding-3-small rejects any single input over 8191 tokens, which 500'd the whole chat turn when a long assistant message (or sheet / mini-app / non-English note) was persisted and embedded. Clamp every embedding input to 8000 tokens at the embedder chokepoint, so the fix covers chat messages, notes, and file chunks alike. Truncation is token-accurate via tiktoken (cl100k_base), with an O(1) char fast-path that skips the BPE encode for short text to keep bulk embedding off the event loop's critical path. A build-time warm of the tiktoken vocab makes the embedder fully offline in prod and removes the first-call download.
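The clamp described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `clamp_to_tokens`, `byte_encode`, and `byte_decode` are hypothetical names, and the tokenizer is passed in as a pair of callables so the sketch runs without tiktoken installed. The fast-path bound rests on the assumption that every token of a byte-level BPE vocabulary spans at least one UTF-8 byte and every code point is at most four bytes, so `max_tokens // 4` characters cannot produce more than `max_tokens` tokens.

```python
from typing import Callable, List

MAX_TOKENS = 8000  # cap quoted in the PR text (headroom under the 8191 limit)


def clamp_to_tokens(
    text: str,
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
    max_tokens: int = MAX_TOKENS,
) -> str:
    """Return text unchanged if it fits, else only its first max_tokens tokens."""
    # O(1) fast path: tokens <= UTF-8 bytes <= 4 * characters, so text this
    # short cannot overflow and the encoder is never called.
    if len(text) <= max_tokens // 4:
        return text
    tokens = encode(text)
    if len(tokens) <= max_tokens:
        return text
    return decode(tokens[:max_tokens])


def byte_encode(text: str) -> List[int]:
    """Stand-in tokenizer (one token per UTF-8 byte), for demonstration only."""
    return list(text.encode("utf-8"))


def byte_decode(tokens: List[int]) -> str:
    """Inverse of byte_encode; drops a character cut in half by truncation."""
    return bytes(tokens).decode("utf-8", errors="ignore")


if __name__ == "__main__":
    clamped = clamp_to_tokens("x" * 20_000, byte_encode, byte_decode)
    print(len(byte_encode(clamped)))
```

With tiktoken, the two callables would presumably be `enc.encode` and `enc.decode` from `tiktoken.get_encoding("cl100k_base")`. Note that `enc.encode` raises by default when the text contains special-token strings, so a real embedder may need to pass `disallowed_special=()`.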
The per-input token clamp runs a synchronous, CPU-bound BPE encode for any text long enough to possibly overflow. On the bulk path (batch_size up to 1000) that ran inline on the asyncio event loop, stalling every concurrent request until the batch finished encoding. tiktoken releases the GIL during encode, so offloading the fit to a thread genuinely frees the loop. Offload only when a batch actually contains encode-worthy text; all-short batches (e.g. chat turns) stay inline and skip the thread hop, since the O(1) fast path means they never encode.
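The conditional offload described above might look something like the sketch below. The names `fit_batch`, `needs_encode`, and `_clamp_all` are hypothetical, and the per-text clamp is passed in as a plain callable; only `asyncio.to_thread` (Python 3.9+) is a real library call. The thread hop only helps if the encoder releases the GIL, which the commit message says tiktoken does.

```python
import asyncio
from typing import Callable, List

MAX_TOKENS = 8000
FAST_PATH_CHARS = MAX_TOKENS // 4  # texts this short never reach the encoder


def needs_encode(text: str) -> bool:
    """True when the text is long enough that the clamp must tokenize it."""
    return len(text) > FAST_PATH_CHARS


def _clamp_all(texts: List[str], clamp: Callable[[str], str]) -> List[str]:
    return [clamp(t) for t in texts]


async def fit_batch(texts: List[str], clamp: Callable[[str], str]) -> List[str]:
    """Clamp a batch, leaving the event loop only when encoding is required."""
    if not any(needs_encode(t) for t in texts):
        # All-short batch (e.g. a chat turn): every clamp call takes the O(1)
        # fast path, so running inline is cheaper than a thread hop.
        return _clamp_all(texts, clamp)
    # At least one text will be BPE-encoded: run the whole batch in a worker
    # thread so other coroutines keep making progress meanwhile.
    return await asyncio.to_thread(_clamp_all, texts, clamp)
```

Checking the batch up front costs one `len()` per input, which is negligible next to a single encode of a long text.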
Problem
text-embedding-3-small rejects any single input over 8191 tokens. When a long assistant message (deep research, a sheet, mini-app JSX in the reasoning trace) or a long/non-English note was persisted and embedded, the API returned "400 - Invalid 'input[0]': maximum input length is 8192 tokens", which 500'd the whole chat turn (observed in chats.send_message → qdrant.store.add → embed).
The embed path had no token awareness, so oversized content went straight to the API and failed.
Fix
Clamp every embedding input to 8000 tokens (headroom under 8191) at the single chokepoint, OpenAIEmbedder._embed_batch. Because chat messages, notes, and file chunks all funnel through here, this covers every caller, not just the chat path that happened to trip first.
Truncation is token-accurate via tiktoken (cl100k_base, the encoding the model actually uses). A flat char cap is unsafe because density varies ~4× (English ~4 chars/tok vs CJK/code ~1–2.5), so a "safe" char count would still 400 for non-English users.
An O(1) fast path handles short text: anything up to max_tokens // 4 chars provably can't overflow and skips the (synchronous, CPU-bound) BPE encode entirely, which keeps bulk embedding off the event loop's critical path.
Offline / prod
tiktoken was only a transitive dep, so it is now promoted to a direct dependency. It downloads cl100k_base from a Microsoft blob URL on first use, which fails in egress-restricted prod even when api.openai.com is reachable (and the lazy load surfaces it on the first embed, not at boot). The Dockerfile now pre-fetches the vocab at build time into a baked TIKTOKEN_CACHE_DIR, verified to load with the network fully blocked.
Tests
test_truncate.py: short/empty/at-limit passthrough; long English / code / CJK all re-encode to ≤ cap (CJK is the case a char cap misses); reproduces the original >8191-token overflow; asserts short text never touches the encoder.
test_embed.py: with a mocked client, asserts an oversized input reaches the API clamped, empty → $ placeholder preserved, short text sent verbatim.
Known follow-up (out of scope, pre-existing)
embed() packs up to 1000 inputs per request and can exceed OpenAI's per-request token cap (~300k), a separate latent 400. This change only shrinks per-input size and doesn't worsen it; worth a dedicated follow-up for token-aware batching.
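For illustration only, token-aware batching could be sketched as a greedy packer that closes a batch when either the input count or the token budget would be exceeded. Nothing here is part of this PR: `pack_batches` is a hypothetical helper, the 300,000 budget is the approximate figure quoted above, and token counts are assumed to be supplied by the caller (exact counts or a safe upper bound).

```python
from typing import Iterable, Iterator, List, Tuple


def pack_batches(
    items: Iterable[Tuple[str, int]],
    max_inputs: int = 1000,
    max_batch_tokens: int = 300_000,
) -> Iterator[List[str]]:
    """Yield batches of texts that respect both the count and token budgets.

    Each item is (text, token_count). An item that alone exceeds the token
    budget is still emitted, in a batch of its own.
    """
    batch: List[str] = []
    used = 0
    for text, n_tokens in items:
        over_count = len(batch) >= max_inputs
        over_tokens = used + n_tokens > max_batch_tokens
        if batch and (over_count or over_tokens):
            yield batch
            batch, used = [], 0
        batch.append(text)
        used += n_tokens
    if batch:
        yield batch
```

Under these assumptions, inputs clamped to 8000 tokens would pack at most 37 per request in the worst case (37 × 8000 = 296,000), far below the 1000-input count limit.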