[codex] Generate embeddings for semantic retrieval #44

Merged
cleak merged 5 commits into master from codex/semantic-render-embeddings
May 2, 2026
Conversation

@cleak cleak commented May 2, 2026

Summary

  • add a shared semantic search engine that populates missing graph embeddings before vector or hybrid retrieval
  • wire on-demand embedding generation into CLI vsearch, context, ask, render tdd, and MCP graph_context / graph_render
  • replace the render semantic placeholder with provider-backed semantic sections that fail clearly when no provider is configured
  • update docs, TDD template output, and health messaging for on-demand embedding population

Validation

  • cargo fmt --check
  • cargo clippy --workspace --all-targets -- -D warnings
  • cargo test --workspace --locked -- --format terse
  • git diff --check

Summary by CodeRabbit

  • New Features

    • CLI search/context/vsearch now use integrated semantic search and surface vector similarity (vec/vector_score) in JSON and console outputs.
    • Renderer supports semantic-search-driven template sections (can include full node bodies) and accepts a semantic-search provider for rendering.
    • Embedding cache now detects provider changes and refreshes embeddings automatically.
  • Documentation

    • CLI spec and template examples updated to note embeddings are populated before vector retrieval and show body inclusion.

coderabbitai Bot commented May 2, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6ae77267-b540-428e-9dfa-3380785580c4

📥 Commits

Reviewing files that changed from the base of the PR and between 95c1ad5 and bd95ece.

📒 Files selected for processing (1)
  • crates/tempyr-index/src/embeddings.rs

📝 Walkthrough

Walkthrough

Adds a semantic-search engine plus synchronous runtime wrappers; the CLI, MCP handler, and renderer are refactored to call the runtime for vector and hybrid retrieval. The renderer gains a SemanticSearchProvider trait and RenderOptions, and collectors support semantic-search sections; embedding caching is made provider-aware.

Changes

Semantic search core + CLI/MCP + render wiring

| Layer | File(s) | Summary |
|---|---|---|
| Core Engine | crates/tempyr-index/src/semantic.rs | Adds SemanticSearchEngine: ensures graph embeddings, embeds queries, supports async vector_search and hybrid_retrieve, and includes unit tests verifying embedding and vector-score behavior. |
| Embedding caching | crates/tempyr-index/src/embeddings.rs | Adds EmbeddingProvider::fingerprint(), embedding_store_meta metadata, EmbeddingStore::ensure_provider_fingerprint(), and tests that cached embeddings are cleared/replaced when provider fingerprint changes. |
| Public export | crates/tempyr-index/src/lib.rs | Exports new semantic module (pub mod semantic;). |
| Runtime wrapper (CLI) | crates/tempyr-cli/src/commands/semantic.rs | Adds SemanticSearchRuntime holding a SemanticSearchEngine and a Tokio Runtime; provides synchronous vector_search and hybrid_retrieve via runtime.block_on(...). |
| CLI command wiring | crates/tempyr-cli/src/commands/{ask.rs,context.rs,vsearch.rs} | Replace in-file index/embedding wiring with SemanticSearchRuntime; ask/context call hybrid_retrieve, vsearch calls vector_search; context output adds vector_score; legacy embedding-selection helper/tests removed. |
| CLI module registry | crates/tempyr-cli/src/commands/mod.rs | Registers new semantic submodule (pub mod semantic;). |
| Render API & types | crates/tempyr-render/src/lib.rs | Adds SemanticSearchRequest, SemanticSearchHit, SemanticSearchProvider trait, and RenderOptions<'a>; exposes render_with_options / render_from_str_with_options and collect_sections branching on presence of a provider. |
| Collector refactor | crates/tempyr-render/src/collector.rs | collect_section now accepts semantic_search: Option<&mut dyn SemanticSearchProvider> and returns Result<SectionData>; adds collect_section_with_semantic_search, collect_semantic_section, semantic_query, and matches_status_filter; implements request construction, hit filtering, optional body inclusion, and tests (including error when provider missing). |
| Render command wiring | crates/tempyr-cli/src/commands/render_cmd.rs | Adds RenderSemanticSearch provider that lazily constructs SemanticSearchRuntime, implements SemanticSearchProvider, and passes it via RenderOptions into rendering calls (render_with_options/render_from_str_with_options). |
| MCP integration | crates/tempyr-mcp/src/handler.rs | Adds McpSemanticSearch / McpSemanticSearchRuntime wrappers for MCP mode; updates graph_context and graph_render to use the runtime and options-based rendering. |
| Templates, docs & minor tweaks | templates/tdd.toml, docs/graphspec.md, crates/tempyr-index/src/hybrid.rs, crates/tempyr-index/src/health.rs, crates/tempyr-render/src/template.rs | Template now sets include_body = true for the sample section; docs note embeddings are populated before retrieval; small doc/warning formatting tweaks and a test assertion added. |
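
The collector row above describes a semantic section that takes an optional provider and fails clearly when none is configured. A minimal sketch of that shape follows, assuming simplified types: the type and function names come from the table, but every field, the `search` method, the `String` error type, and the `FixedProvider` test double are hypothetical and not taken from the PR.

```rust
/// Hypothetical request shape; the real struct's fields are not shown in the PR page.
pub struct SemanticSearchRequest {
    pub query: String,
    pub limit: usize,
}

/// Hypothetical hit shape.
pub struct SemanticSearchHit {
    pub node_id: String,
    pub score: f32,
}

/// Assumed trait surface: a single `search` method (the real trait may differ).
pub trait SemanticSearchProvider {
    fn search(&mut self, request: &SemanticSearchRequest) -> Result<Vec<SemanticSearchHit>, String>;
}

/// Collects hits for a semantic section, returning an error instead of
/// silently producing an empty section when no provider is configured.
pub fn collect_semantic_section(
    provider: Option<&mut dyn SemanticSearchProvider>,
    query: &str,
    limit: usize,
) -> Result<Vec<SemanticSearchHit>, String> {
    let provider = provider.ok_or_else(|| {
        "semantic section requires a configured semantic search provider".to_string()
    })?;
    provider.search(&SemanticSearchRequest {
        query: query.to_string(),
        limit,
    })
}

/// Test double that returns a fixed list of node ids, truncated to the limit.
pub struct FixedProvider {
    pub node_ids: Vec<String>,
}

impl SemanticSearchProvider for FixedProvider {
    fn search(&mut self, request: &SemanticSearchRequest) -> Result<Vec<SemanticSearchHit>, String> {
        Ok(self
            .node_ids
            .iter()
            .take(request.limit)
            .map(|id| SemanticSearchHit { node_id: id.clone(), score: 1.0 })
            .collect())
    }
}
```

Passing `None` yields the error path, while passing `Some(&mut provider)` forwards the request; taking the provider as `Option<&mut dyn ...>` keeps the renderer usable without any embedding backend for templates that have no semantic sections.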

Sequence Diagram

sequenceDiagram
    participant CLI as CLI Command
    participant Runtime as SemanticSearchRuntime
    participant Engine as SemanticSearchEngine
    participant Provider as EmbeddingProvider
    participant Store as EmbeddingStore/Index

    CLI->>Runtime: new(ctx)
    Runtime->>Engine: new(index, store, provider)

    CLI->>Runtime: vector_search / hybrid_retrieve(graph, query, ...)
    Runtime->>Engine: block_on(vector_search/hybrid_retrieve(...))

    Engine->>Engine: ensure_embeddings(graph)
    Engine->>Provider: embed_documents(graph_nodes)
    Provider-->>Engine: document_vectors
    Engine->>Store: persist/index document vectors

    Engine->>Provider: embed_query(query)
    Provider-->>Engine: query_vector
    Engine->>Store: vector_search / hybrid scoring
    Store-->>Engine: Vector/Hybrid results (with vector_score)

    Engine-->>Runtime: results
    Runtime-->>CLI: results
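
The order of operations in the diagram (populate missing document embeddings, embed the query, then score against the store) can be sketched synchronously with the standard library alone. This is an illustration under stated assumptions, not the PR's implementation: `LetterCounts` is a toy provider, the store is keyed by node id rather than by content hash, the graph is a slice of `(id, body)` pairs, and cosine similarity stands in for whatever scoring the real index uses.

```rust
use std::collections::HashMap;

/// Assumed minimal provider surface (the real trait is async and batch-oriented).
pub trait EmbeddingProvider {
    fn embed(&self, text: &str) -> Vec<f32>;
}

/// Toy provider: a 2-d vector counting the letters 'a' and 'b'.
pub struct LetterCounts;

impl EmbeddingProvider for LetterCounts {
    fn embed(&self, text: &str) -> Vec<f32> {
        let count = |c: char| text.chars().filter(|&x| x == c).count() as f32;
        vec![count('a'), count('b')]
    }
}

/// Cosine similarity; returns 0.0 when either vector has zero length.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm = |v: &[f32]| v.iter().map(|x| x * x).sum::<f32>().sqrt();
    let denom = norm(a) * norm(b);
    if denom == 0.0 { 0.0 } else { dot / denom }
}

pub struct Engine<P: EmbeddingProvider> {
    provider: P,
    store: HashMap<String, Vec<f32>>,
}

impl<P: EmbeddingProvider> Engine<P> {
    pub fn new(provider: P) -> Self {
        Self { provider, store: HashMap::new() }
    }

    /// Embeds only the nodes that are not yet in the store.
    fn ensure_embeddings(&mut self, graph: &[(&str, &str)]) {
        for (id, body) in graph {
            if !self.store.contains_key(*id) {
                let vector = self.provider.embed(body);
                self.store.insert(id.to_string(), vector);
            }
        }
    }

    /// Populates embeddings first, then ranks stored vectors against the query.
    pub fn vector_search(&mut self, graph: &[(&str, &str)], query: &str, limit: usize) -> Vec<(String, f32)> {
        self.ensure_embeddings(graph);
        let query_vector = self.provider.embed(query);
        let mut hits: Vec<(String, f32)> = self
            .store
            .iter()
            .map(|(id, vector)| (id.clone(), cosine(&query_vector, vector)))
            .collect();
        hits.sort_by(|a, b| b.1.total_cmp(&a.1)); // highest score first
        hits.truncate(limit);
        hits
    }
}
```

Because population happens inside `vector_search`, callers never need a separate indexing step, which matches the on-demand behavior described in the summary.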

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I hop the index, sniff the vector trail,
I spin a runtime bridge so queries never fail,
Providers hum their secret scores in tune,
Templates stitch the insights by the moon,
A rabbit cheers, "Semantic search — hooray, let's code!"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 43.16% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main change: adding embedding generation for semantic retrieval, which is reflected across CLI commands, MCP endpoints, and render integration in the changeset. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/tempyr-index/src/semantic.rs`:
- Line 14: The field embeddings_ready currently short-circuits embedding
population (referencing embeddings_ready) so once true future graph changes
won’t trigger re-embedding; update the logic in the embedding-population routine
(the function that checks embeddings_ready / populates embeddings) to either
remove the one-time short-circuit or make embeddings_ready depend on graph
mutation state (e.g., compare a graph version/timestamp or reset
embeddings_ready when nodes are added/updated/deleted), and ensure any code
paths that modify the graph (node add/update/delete methods) reset or advance
that version so the populate routine will re-run and produce fresh embeddings.
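
One of the two options named in this comment, tying the readiness check to graph mutation state, can be sketched as follows. Everything here is hypothetical: the `Graph` and `Engine` stand-ins, the `version` counter, and the `embed_runs` field exist only to make the behavior observable, and the sketch assumes a single graph instance whose version only ever advances.

```rust
/// Hypothetical stand-in for the graph; `version` advances on every mutation.
#[derive(Default)]
pub struct Graph {
    nodes: Vec<String>,
    version: u64,
}

impl Graph {
    pub fn add_node(&mut self, body: &str) {
        self.nodes.push(body.to_string());
        self.version += 1; // any add/update/delete path would bump this
    }

    pub fn version(&self) -> u64 {
        self.version
    }
}

/// Hypothetical engine that remembers which graph version it last embedded,
/// instead of a one-time `embeddings_ready` flag.
#[derive(Default)]
pub struct Engine {
    embedded_version: Option<u64>,
    pub embed_runs: usize, // counts population passes, for illustration only
}

impl Engine {
    /// Re-runs population whenever the graph version differs from the last one seen.
    pub fn ensure_embeddings(&mut self, graph: &Graph) {
        if self.embedded_version == Some(graph.version()) {
            return; // nothing changed since the last pass
        }
        // Real code would call the embedding provider here.
        self.embed_runs += 1;
        self.embedded_version = Some(graph.version());
    }
}
```

Calling `ensure_embeddings` twice on an unchanged graph runs one pass; adding a node and calling it again runs a second. The other option from the comment, dropping the short-circuit entirely and relying on a content-hash-aware cache to skip unchanged nodes, avoids the version bookkeeping at the cost of a cache lookup per node on every call.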
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 90448b3e-638d-49dd-823f-30b51eece276

📥 Commits

Reviewing files that changed from the base of the PR and between 6dd93b5 and 15ff8af.

📒 Files selected for processing (16)
  • crates/tempyr-cli/src/commands/ask.rs
  • crates/tempyr-cli/src/commands/context.rs
  • crates/tempyr-cli/src/commands/mod.rs
  • crates/tempyr-cli/src/commands/render_cmd.rs
  • crates/tempyr-cli/src/commands/semantic.rs
  • crates/tempyr-cli/src/commands/vsearch.rs
  • crates/tempyr-index/src/health.rs
  • crates/tempyr-index/src/hybrid.rs
  • crates/tempyr-index/src/lib.rs
  • crates/tempyr-index/src/semantic.rs
  • crates/tempyr-mcp/src/handler.rs
  • crates/tempyr-render/src/collector.rs
  • crates/tempyr-render/src/lib.rs
  • crates/tempyr-render/src/template.rs
  • docs/graphspec.md
  • templates/tdd.toml

Comment thread crates/tempyr-index/src/semantic.rs Outdated
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/tempyr-index/src/semantic.rs`:
- Around line 70-72: The code currently takes the first element of
query_embeddings silently; change this to enforce exactly one embedding: in the
embed_query flow, collect query_embeddings into a Vec (or check its length) and
if vec.len() != 1 return Err(IndexError::General("Embedding provider returned
wrong number of vectors for the query; expected exactly 1".to_string())),
otherwise use vec.into_iter().next().unwrap(); reference symbols:
query_embeddings and IndexError::General (and the embed_query function) when
making this replacement.
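
The exactly-one check requested above can be sketched without an intermediate length test by pulling two items from the iterator. The `IndexError` enum below is a local stand-in that borrows only the `General` variant name from the comment, and `single_query_embedding` is a hypothetical helper name.

```rust
/// Local stand-in for the crate's error type; only the variant name comes from the review.
#[derive(Debug, PartialEq)]
pub enum IndexError {
    General(String),
}

/// Returns the query vector only if the provider produced exactly one embedding.
pub fn single_query_embedding(query_embeddings: Vec<Vec<f32>>) -> Result<Vec<f32>, IndexError> {
    let mut iter = query_embeddings.into_iter();
    // First item must exist and second must not; anything else is an error.
    match (iter.next(), iter.next()) {
        (Some(vector), None) => Ok(vector),
        _ => Err(IndexError::General(
            "Embedding provider returned wrong number of vectors for the query; expected exactly 1"
                .to_string(),
        )),
    }
}
```

This form avoids the `unwrap()` suggested in the prompt while rejecting both the empty and the multi-vector case.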
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7c0e7f81-e867-4629-a411-91973306e6f9

📥 Commits

Reviewing files that changed from the base of the PR and between 15ff8af and a4ab88f.

📒 Files selected for processing (1)
  • crates/tempyr-index/src/semantic.rs

Comment thread crates/tempyr-index/src/semantic.rs Outdated
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/tempyr-index/src/semantic.rs`:
- Around line 25-30: ensure_embeddings calls embeddings::embed_graph which keys
cached embeddings by content hash only, allowing different embedding
providers/models to reuse incompatible vectors; update the logic to include the
provider fingerprint (e.g. provider id/name, model/version/tokenizer dims) when
storing and looking up embeddings in the cache so embeddings are scoped to
provider identity and vector dimensionality, or persist a provider fingerprint
alongside cached entries and, in ensure_embeddings (or in
embeddings::embed_graph), detect mismatches and invalidate or re-embed affected
documents via self.provider before returning; reference ensure_embeddings,
embeddings::embed_graph, self.provider, and the cache/store layer to implement
this change.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 40610a3a-9376-4f52-87d8-a370164686ed

📥 Commits

Reviewing files that changed from the base of the PR and between a4ab88f and d37b13e.

📒 Files selected for processing (1)
  • crates/tempyr-index/src/semantic.rs

Comment on lines +25 to +30
pub async fn ensure_embeddings(&mut self, graph: &Graph) -> Result<()> {
    // embed_graph is content-hash aware and skips cached entries, so keep
    // checking the current graph instead of assuming a long-lived engine has
    // already seen every future graph mutation.
    embeddings::embed_graph(&self.store, graph, self.provider.as_ref()).await?;
    Ok(())
⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Provider/model drift is not invalidating cached document embeddings.

At Line 29, embedding reuse is keyed by content hash only (via embed_graph), so changing embedding provider/model can silently mix vector spaces and degrade ranking correctness. Scope cache entries by provider identity/version (and dimensions), or persist/validate provider fingerprint and force re-embed on mismatch.
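
The first remedy mentioned here, scoping cache entries by provider identity and dimensions, could look roughly like the sketch below. It is an in-memory illustration only: `CacheKey`, `EmbeddingCache`, and the `provider:model:dims` fingerprint format are assumptions, not the store layout used in this repository.

```rust
use std::collections::HashMap;

/// Hypothetical cache key: content hash plus provider fingerprint, so vectors
/// from different providers or models never satisfy each other's lookups.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct CacheKey {
    pub content_hash: u64,
    pub provider_fingerprint: String,
}

#[derive(Default)]
pub struct EmbeddingCache {
    entries: HashMap<CacheKey, Vec<f32>>,
}

impl EmbeddingCache {
    pub fn get(&self, content_hash: u64, fingerprint: &str) -> Option<&Vec<f32>> {
        self.entries.get(&CacheKey {
            content_hash,
            provider_fingerprint: fingerprint.to_string(),
        })
    }

    pub fn put(&mut self, content_hash: u64, fingerprint: &str, vector: Vec<f32>) {
        self.entries.insert(
            CacheKey { content_hash, provider_fingerprint: fingerprint.to_string() },
            vector,
        );
    }
}

/// Assumed fingerprint format combining provider name, model, and dimensions.
pub fn fingerprint(provider: &str, model: &str, dims: usize) -> String {
    format!("{provider}:{model}:{dims}")
}
```

With this keying, a lookup for the same content hash under a new fingerprint misses and forces a re-embed, so vectors from two embedding spaces are never ranked together.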

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/tempyr-index/src/embeddings.rs`:
- Around line 721-735: The current ensure_provider_fingerprint implementation
clears all embeddings when meta is missing, causing unnecessary invalidation;
modify the None branch so that if meta_value(Self::PROVIDER_FINGERPRINT_KEY)
returns None you do NOT call clear_embeddings() by default — instead directly
set_meta_value(Self::PROVIDER_FINGERPRINT_KEY, &fingerprint) to seed legacy
stores with the current fingerprint; reserve calling clear_embeddings() for an
explicit incompatible-schema path (e.g., a new function or an explicit check) so
only ensure_provider_fingerprint, PROVIDER_FINGERPRINT_KEY, clear_embeddings,
set_meta_value, count and meta_value are touched to implement this
non-destructive migration.
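
A sketch of the non-destructive behavior this comment asks for, using in-memory maps in place of the real store. The method and constant names mirror those listed in the comment, but their signatures, the key string, the `insert` helper, and the boolean return value (true when the cache was cleared) are assumptions made for illustration.

```rust
use std::collections::HashMap;

/// In-memory stand-in for the embedding store and its metadata table.
#[derive(Default)]
pub struct EmbeddingStore {
    meta: HashMap<String, String>,
    embeddings: HashMap<String, Vec<f32>>,
}

impl EmbeddingStore {
    /// Assumed key name; the real constant's value is not shown in the PR page.
    const PROVIDER_FINGERPRINT_KEY: &'static str = "provider_fingerprint";

    pub fn count(&self) -> usize {
        self.embeddings.len()
    }

    pub fn insert(&mut self, node_id: &str, vector: Vec<f32>) {
        self.embeddings.insert(node_id.to_string(), vector);
    }

    fn meta_value(&self, key: &str) -> Option<String> {
        self.meta.get(key).cloned()
    }

    fn set_meta_value(&mut self, key: &str, value: &str) {
        self.meta.insert(key.to_string(), value.to_string());
    }

    fn clear_embeddings(&mut self) {
        self.embeddings.clear();
    }

    /// Returns true only when cached embeddings were cleared.
    pub fn ensure_provider_fingerprint(&mut self, fingerprint: &str) -> bool {
        let stored = self.meta_value(Self::PROVIDER_FINGERPRINT_KEY);
        match stored.as_deref() {
            // Same provider: keep everything.
            Some(existing) if existing == fingerprint => false,
            // Known, different provider: vectors are incompatible, so clear.
            Some(_) => {
                self.clear_embeddings();
                self.set_meta_value(Self::PROVIDER_FINGERPRINT_KEY, fingerprint);
                true
            }
            // Legacy store without metadata: seed the fingerprint, keep vectors.
            None => {
                self.set_meta_value(Self::PROVIDER_FINGERPRINT_KEY, fingerprint);
                false
            }
        }
    }
}
```

The trade-off is that seeding a legacy store assumes its existing vectors were produced by the current provider, which the store cannot verify; clearing on missing metadata is the conservative alternative.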
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7d1a9104-e26d-480f-9aa1-289014466ab9

📥 Commits

Reviewing files that changed from the base of the PR and between d37b13e and 95c1ad5.

📒 Files selected for processing (1)
  • crates/tempyr-index/src/embeddings.rs

Comment thread crates/tempyr-index/src/embeddings.rs
@cleak cleak merged commit 65c62e0 into master May 2, 2026
5 checks passed
@cleak cleak deleted the codex/semantic-render-embeddings branch May 2, 2026 06:07