Update existing row when re-embedding the same content with different store/metadata#1420
Open
alvinttang wants to merge 1 commit into
Open
Conversation
…imonw#224) When llm.Collection.embed() or embed_multi*() were called a second time with the same content but a different --store or metadata setting, the existing row was silently kept untouched (because content_hash matched an existing record). This made it impossible to add stored content or new metadata to a previously embedded record without first deleting the collection. Now, when an existing row is found by content_hash: - skip the (expensive) embedding call as before - update the row's content / content_blob / metadata / updated columns to reflect the latest store= and metadata= arguments This matches Simon's stated intent in the issue: "It should update the database (while still avoiding calculating embeddings) for records that already exist but where the metadata or content --store option has changed." Includes regression tests for both the single embed() and the embed_multi*() code paths.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #224.
Collection.embed()andembed_multi_with_metadata()short-circuit when a row with the samecontent_hashalready exists, never updatingcontent/content_blob/metadatato reflect newstore=andmetadata=arguments. Switching fromstore=Falsetostore=True, or adding metadata, was silently ignored.This matches the intent stated in the issue: "It should update the database for records that already exist but where the metadata or content --store option has changed."
Fix
llm/embeddings.py:embed(): on existing hash, perform an UPDATE ofcontent/content_blob/metadata/updatedinstead of returning early.embed_multi_with_metadata(): split batch into existing-hash items (UPDATE in place via SQL) and new items (still go throughembed_multi+ insert), avoiding re-running the model on duplicates while honouring the new options.Test
tests/test_embed.py:test_embed_updates_store_and_metadata_on_existing_hashtest_embed_multi_updates_store_and_metadata_on_existing_hashpytest tests/→ 492 passed (full suite).ruff checkclean. Production diff: ~80 LOC.Risk notes
embed()now writes (clears or setscontent/metadata, refreshesupdated). This is the documented intent.Refs #224