Commit c41cae6
fix: Handle duplicate texts correctly in embed_stream
Addresses Copilot review comment: Duplicate texts cause incorrect embedding
index assignment.
Previously, when batch_texts contained duplicate texts, all embeddings for
those duplicates would be assigned the same index (the index of the first
occurrence) because list.index() always returns the first match.
Now tracks used indices and assigns each embedding to the next unused
occurrence of its text in the batch, ensuring correct index assignment
even with duplicate texts.
Example:
texts = ['hello', 'world', 'hello']
Before: indices would be [0, 1, 0] - WRONG
After: indices are [0, 1, 2] - CORRECT
1 parent 2e0ed46 commit c41cae6
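The before/after behavior described in the commit message can be reproduced with a minimal sketch. The diff body is not shown here, so this is not the code from the commit: the function names `indices_first_match` and `indices_next_unused` are hypothetical, and the sketch assumes each returned embedding carries the text it was computed for.

```python
from typing import List


def indices_first_match(batch_texts: List[str], returned_texts: List[str]) -> List[int]:
    """Old behavior (illustrative): list.index() always returns the first match."""
    return [batch_texts.index(text) for text in returned_texts]


def indices_next_unused(batch_texts: List[str], returned_texts: List[str]) -> List[int]:
    """New behavior (illustrative): track used indices and pick the next unused occurrence."""
    used = set()
    result = []
    for text in returned_texts:
        for i, candidate in enumerate(batch_texts):
            if candidate == text and i not in used:
                used.add(i)
                result.append(i)
                break
        else:
            # Hypothetical handling: the commit message does not say what
            # happens when a returned text has no unused match in the batch.
            raise ValueError(f"no unused occurrence of {text!r} in batch")
    return result


texts = ["hello", "world", "hello"]
print(indices_first_match(texts, texts))  # [0, 1, 0]
print(indices_next_unused(texts, texts))  # [0, 1, 2]
```

The inner scan is linear in the batch size for each text, which is simple and likely adequate for small batches; a dictionary mapping each text to a queue of its positions would avoid the repeated scan if batches were large.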
1 file changed
Lines changed: 16 additions & 4 deletions