
Update existing row when re-embedding the same content with different store/metadata#1420

Open
alvinttang wants to merge 1 commit into
simonw:mainfrom
alvinttang:fix/embed-update-store-metadata-on-existing-hash


Summary

Fixes #224. Collection.embed() and embed_multi_with_metadata() short-circuit when a row with the same content_hash already exists, so content / content_blob / metadata are never updated to reflect new store= and metadata= arguments. As a result, switching from store=False to store=True, or adding metadata, is silently ignored.

This matches the intent stated in the issue: "It should update the database (while still avoiding calculating embeddings) for records that already exist but where the metadata or content --store option has changed."

Fix

llm/embeddings.py:

  • embed(): when a row with the same hash already exists, perform an UPDATE of content / content_blob / metadata / updated instead of returning early.
  • embed_multi_with_metadata(): split the batch into existing-hash items (updated in place via SQL) and new items (which still go through embed_multi + insert). This avoids re-running the model on duplicates while honouring the new options.
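
For readers who want the shape of the change without opening the diff, here is a minimal sketch of the two paths described above. It is not the code in llm/embeddings.py: the table is a simplified single-collection stand-in, `fake_model`, `hash_content`, `stored_columns` and `model_calls` are hypothetical names, MD5 and lookup by content_hash alone are assumptions of the sketch, and content_blob is left out for brevity.

```python
import hashlib
import json
import sqlite3
import time

# Simplified, single-collection stand-in for the embeddings table (assumption).
SCHEMA = """
CREATE TABLE embeddings (
    id TEXT PRIMARY KEY,
    embedding TEXT,
    content TEXT,
    content_hash TEXT,
    metadata TEXT,
    updated INTEGER
)
"""

model_calls = []  # every text passed to the stand-in model, for inspection


def fake_model(texts):
    """Hypothetical embedding model: logs its inputs, returns one-number vectors."""
    model_calls.extend(texts)
    return [[float(len(text))] for text in texts]


def hash_content(text):
    # MD5 is an assumption of this sketch.
    return hashlib.md5(text.encode("utf8")).hexdigest()


def stored_columns(text, metadata, store):
    """Values for the content / metadata / updated columns under the given options."""
    return (
        text if store else None,
        json.dumps(metadata) if metadata else None,
        int(time.time()),
    )


def embed_multi_with_metadata(db, entries, store=False):
    """entries: iterable of (id, text, metadata) tuples."""
    new_items = []
    for id_, text, metadata in entries:
        hash_ = hash_content(text)
        found = db.execute(
            "SELECT 1 FROM embeddings WHERE content_hash = ?", (hash_,)
        ).fetchone()
        if found:
            # Existing hash: keep the vector, rewrite only the option-dependent columns.
            db.execute(
                "UPDATE embeddings SET content = ?, metadata = ?, updated = ? "
                "WHERE content_hash = ?",
                stored_columns(text, metadata, store) + (hash_,),
            )
        else:
            new_items.append((id_, text, metadata, hash_))
    if not new_items:
        return
    # Only unseen content reaches the model.
    vectors = fake_model([text for _, text, _, _ in new_items])
    for (id_, text, metadata, hash_), vector in zip(new_items, vectors):
        content, meta_json, updated = stored_columns(text, metadata, store)
        db.execute(
            "INSERT OR REPLACE INTO embeddings "
            "(id, embedding, content, content_hash, metadata, updated) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            (id_, json.dumps(vector), content, hash_, meta_json, updated),
        )


def embed(db, id_, text, metadata=None, store=False):
    """Single-item path, expressed here as a batch of one."""
    embed_multi_with_metadata(db, [(id_, text, metadata)], store=store)
```

In this sketch, calling embed(db, "a", "hello") and then embed(db, "a", "hello", store=True) leaves `model_calls` with a single entry while the content column of the row goes from NULL to "hello".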

Test

tests/test_embed.py:

  • test_embed_updates_store_and_metadata_on_existing_hash
  • test_embed_multi_updates_store_and_metadata_on_existing_hash

pytest tests/ → 492 passed (full suite). ruff check clean. Production diff: ~80 LOC.

Risk notes

  • Behaviour change: a previously no-op repeat embed() now writes (clears or sets content / metadata, refreshes updated). This is the documented intent.
  • The cross-collection embedding-reuse discussion in the thread (your model_id check that got punted from 0.10) is intentionally out of scope.
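
The first risk is easiest to see in the clearing direction: a repeat call that relies on the defaults now erases what an earlier call stored. A small sketch, assuming a reduced table and a hypothetical `refresh_existing` helper (neither is the PR's actual code, and the updated column is omitted):

```python
import hashlib
import json
import sqlite3


def refresh_existing(db, text, store=False, metadata=None):
    """Hypothetical helper: rewrite the stored columns of the row matching text's hash.

    Returns the number of rows written (0 means the content was never embedded).
    """
    hash_ = hashlib.md5(text.encode("utf8")).hexdigest()
    cursor = db.execute(
        "UPDATE embeddings SET content = ?, metadata = ? WHERE content_hash = ?",
        (text if store else None, json.dumps(metadata) if metadata else None, hash_),
    )
    return cursor.rowcount


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE embeddings "
    "(id TEXT PRIMARY KEY, content TEXT, content_hash TEXT, metadata TEXT)"
)
# A row that an earlier call stored with store=True and some metadata.
db.execute(
    "INSERT INTO embeddings VALUES (?, ?, ?, ?)",
    ("doc1", "hello", hashlib.md5(b"hello").hexdigest(), '{"k": 1}'),
)

# A repeat call with the defaults (store=False, no metadata) clears both columns.
written = refresh_existing(db, "hello")
row = db.execute(
    "SELECT content, metadata FROM embeddings WHERE id = 'doc1'"
).fetchone()
```

Callers that re-embed content on a schedule and want to keep stored text would need to pass the same store= and metadata= values every time.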

Refs #224

…imonw#224)

When llm.Collection.embed() or embed_multi*() were called a second time
with the same content but a different --store or metadata setting, the
existing row was silently kept untouched (because content_hash matched
an existing record). This made it impossible to add stored content or
new metadata to a previously embedded record without first deleting the
collection.

Now, when an existing row is found by content_hash:
- skip the (expensive) embedding call as before
- update the row's content / content_blob / metadata / updated columns
  to reflect the latest store= and metadata= arguments

This matches Simon's stated intent in the issue: "It should update the
database (while still avoiding calculating embeddings) for records that
already exist but where the metadata or content --store option has
changed."

Includes regression tests for both the single embed() and the
embed_multi*() code paths.


Development

Successfully merging this pull request may close these issues.

Calling llm embed with the same content but different --store settings does not update the table
