Commit 4b03fff

fix: batch OpenAI embeddings to respect 2048 input limit (#39)
Schemas with more than 2048 columns fail with OpenAI embeddings API error: "Invalid input: array length must be 2048 or less." Batch input in chunks of 2000 before sending to the API. Closes #39
1 parent 67537b3 commit 4b03fff
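
The batching described in the commit message can be sketched in isolation as follows. This is a minimal illustration, not the adapter's code: `iter_batches`, `batch_count`, and the 4500-column schema are hypothetical names and data introduced here, and no API call is made. Only the 2048 limit and the batch size of 2000 come from the commit itself.

```python
from typing import Iterator, List, Sequence

OPENAI_MAX_INPUTS = 2048  # per-request array limit quoted in the error message
BATCH_SIZE = 2000         # value chosen by this commit, leaving some headroom


def iter_batches(items: Sequence[str], batch_size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start : start + batch_size])


def batch_count(n_items: int, batch_size: int = BATCH_SIZE) -> int:
    """Ceiling division, same expression as the `batches=` field logged in the diff.

    Python floor division makes this return 0 when n_items is 0.
    """
    return ((n_items - 1) // batch_size) + 1


# Hypothetical schema with 4500 columns, purely for illustration.
columns = [f"col_{i}" for i in range(4500)]
sizes = [len(batch) for batch in iter_batches(columns)]
print(sizes)  # [2000, 2000, 500]
```

Because slicing preserves order and `extend` appends batch results in sequence, the combined embeddings stay aligned with the input texts, assuming each response returns its items in request order.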

1 file changed

Lines changed: 10 additions & 7 deletions

src/nlp2sql/adapters/openai_embedding_adapter.py

@@ -79,24 +79,27 @@ async def encode(self, texts: List[str]) -> np.ndarray:
             processed_texts.append(text)
 
         try:
-            response = await self.client.embeddings.create(model=self.model, input=processed_texts)
+            # OpenAI API limits input array to 2048 items per request
+            batch_size = 2000
+            all_embeddings: list[list[float]] = []
 
-            # Extract embeddings from response
-            embeddings = [item.embedding for item in response.data]
-            embeddings_array = np.array(embeddings)
+            for i in range(0, len(processed_texts), batch_size):
+                batch = processed_texts[i : i + batch_size]
+                response = await self.client.embeddings.create(model=self.model, input=batch)
+                all_embeddings.extend(item.embedding for item in response.data)
+
+            embeddings_array = np.array(all_embeddings)
 
             # Normalize embeddings for cosine similarity with FAISS IndexFlatIP
-            # This is critical: FAISS IndexFlatIP uses inner product which only works
-            # as cosine similarity when vectors are normalized to unit length
             norms = np.linalg.norm(embeddings_array, axis=1, keepdims=True)
-            # Avoid division by zero (though rare for real embeddings)
             norms = np.where(norms == 0, 1, norms)
             normalized_embeddings = embeddings_array / norms
 
             logger.debug(
                 "OpenAI embeddings generated and normalized",
                 model=self.model,
                 texts_count=len(processed_texts),
+                batches=((len(processed_texts) - 1) // batch_size) + 1,
                 dimension=normalized_embeddings.shape[1],
             )

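The normalization step kept by this change (and explained in the comments it removes) can be checked on its own. The sketch below is a standalone illustration that assumes NumPy is installed: `normalize_rows` is a hypothetical helper name, and the 2-D vectors are made up rather than real embeddings. It shows that after scaling rows to unit length, a plain inner product matches cosine similarity, which is what an inner-product index such as FAISS `IndexFlatIP` relies on.

```python
import numpy as np


def normalize_rows(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 norm; all-zero rows are left unchanged."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms = np.where(norms == 0, 1, norms)  # avoid division by zero
    return vectors / norms


# Made-up 2-D vectors for illustration, not real embeddings.
raw = np.array([[3.0, 4.0], [1.0, 0.0], [0.0, 0.0]])
unit = normalize_rows(raw)

# Inner product of the unit vectors ...
inner = float(unit[0] @ unit[1])
# ... compared with cosine similarity computed from the raw vectors.
cosine = float(raw[0] @ raw[1] / (np.linalg.norm(raw[0]) * np.linalg.norm(raw[1])))
print(inner, cosine)  # both approximately 0.6
```

The zero row stays at zero instead of producing NaN, which is the case the `np.where(norms == 0, 1, norms)` guard covers.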