Commit 73545e5
fix: Handle duplicate texts correctly in embed_stream
Addresses Copilot review comment: Duplicate texts cause incorrect embedding
index assignment.
Previously, when batch_texts contained duplicate texts, all embeddings for
those duplicates would be assigned the same index (the index of the first
occurrence) because list.index() always returns the first match.
Now tracks used indices and assigns each embedding to the next unused
occurrence of its text in the batch, ensuring correct index assignment
even with duplicate texts.
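
The behavior described above can be illustrated with a minimal sketch. This is not the code from the commit: the function names `assign_indices_naive` and `assign_indices_tracked`, and the idea that each returned embedding is matched back to the batch by its text, are assumptions made for illustration only.

```python
from typing import List, Set


def assign_indices_naive(batch_texts: List[str], returned_texts: List[str]) -> List[int]:
    """Hypothetical 'before' behavior: look each text up with list.index()."""
    # list.index() always returns the first match, so duplicates collide.
    return [batch_texts.index(text) for text in returned_texts]


def assign_indices_tracked(batch_texts: List[str], returned_texts: List[str]) -> List[int]:
    """Hypothetical 'after' behavior: claim the next unused occurrence."""
    used: Set[int] = set()
    indices: List[int] = []
    for text in returned_texts:
        # Find the first position holding this text that is not yet claimed.
        for i, candidate in enumerate(batch_texts):
            if candidate == text and i not in used:
                used.add(i)
                indices.append(i)
                break
        else:
            # Reached only if every occurrence of this text is already used.
            raise ValueError(f"no unused occurrence of {text!r} in batch")
    return indices


texts = ["hello", "world", "hello"]
naive = assign_indices_naive(texts, texts)
tracked = assign_indices_tracked(texts, texts)
print(naive)    # [0, 1, 0]
print(tracked)  # [0, 1, 2]
```

The inner scan makes this sketch quadratic in the batch size; for large batches, a per-text queue of positions would give the same result in linear time.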
Example:
texts = ['hello', 'world', 'hello']
Before: indices would be [0, 1, 0] - WRONG
After: indices are [0, 1, 2] - CORRECT
1 parent 7c198ea commit 73545e5
1 file changed
Lines changed: 16 additions & 4 deletions