Feature/llm cache redis semantic issue 362 #417

Open
Francis6-git wants to merge 3 commits into Traqora:main from Francis6-git:feature/llm-cache-redis-semantic-issue-362

@Francis6-git

Description

Implements a provider-agnostic semantic caching layer for LLM responses (#362). To meet the sub-50 ms lookup target without adding an external vector database dependency, lookups compute cosine similarity inline in Python over a bounded window of recent candidates tracked in Redis Sorted Sets (ZSETs).
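A minimal sketch of the inline similarity sweep described above, not the code in this PR. The names `cosine_similarity` and `best_match`, and the `(key, embedding)` candidate shape, are assumptions made for illustration; the `0.88` default mirrors the threshold mentioned later in this description.

```python
import math
from typing import Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("embedding dimensions differ")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(
    query: Sequence[float],
    candidates: Sequence[Tuple[str, Sequence[float]]],
    threshold: float = 0.88,
) -> Optional[Tuple[str, float]]:
    """Return the (key, score) of the most similar candidate at or above the threshold."""
    best: Optional[Tuple[str, float]] = None
    for key, embedding in candidates:  # linear scan over the bounded window
        score = cosine_similarity(query, embedding)
        if score >= threshold and (best is None or score > best[1]):
            best = (key, score)
    return best
```

Because the scan is linear in the number of candidates, its cost is controlled entirely by how many candidates are passed in.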

Technical Architecture

  • Time-Ordered Candidate Bucketing (astroml/cache/llm_semantic_cache.py): Candidates are bucketed per target LLM model in a ZSET keyed {namespace}:idx:{model}:all. Each new completion is added with its creation time in epoch milliseconds as the score.
  • Soft Capacity Caps: The index prunes itself via zremrangebyrank to a soft cap of $\max(100,\ 10 \times K)$ entries, where $K$ is the lookback size. This bounds memory use and keeps index maintenance cheap: ZADD is O(log(N)), and ZREMRANGEBYRANK is O(log(N) + M) for M removed members.
  • Provider-Decoupled Core Layer (astroml/llm/llm_cached_client.py): Exposes an extensible, structurally typed LLMProvider protocol and an LLMEmbeddingProvider constructor, so embedding models can be swapped without changing calling code.
  • Telemetry Metrics Integration: Tracks running hit and miss counters and lookup latency measured with time.perf_counter(), written through Redis pipelines, and derives the hit rate and average lookup latency from them.
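The index write and soft-cap trim could look roughly like the sketch below. It is an illustration under assumptions, not the PR's implementation: `soft_cap`, `index_key`, and `record_entry` are hypothetical names, the lookback size K is assumed to be the `candidate_top_k` setting, and `client` is assumed to be a redis-py client (only its `zadd` and `zremrangebyrank` methods are used).

```python
import time
from typing import Optional


def soft_cap(candidate_top_k: int, floor: int = 100) -> int:
    """Soft capacity: 10 x lookback K, never below the floor of 100."""
    return max(floor, 10 * candidate_top_k)


def index_key(namespace: str, model: str) -> str:
    """Per-model ZSET key in the {namespace}:idx:{model}:all layout."""
    return f"{namespace}:idx:{model}:all"


def record_entry(
    client,  # assumed: a redis-py client, e.g. redis.Redis(...)
    namespace: str,
    model: str,
    entry_key: str,
    candidate_top_k: int,
    now_ms: Optional[int] = None,
) -> None:
    """Add an entry scored by epoch milliseconds, then trim the oldest overflow."""
    key = index_key(namespace, model)
    score = int(time.time() * 1000) if now_ms is None else now_ms
    client.zadd(key, {entry_key: score})
    # Rank 0 is the oldest entry (lowest score). A stop index of -(cap + 1)
    # removes everything except the newest `cap` members; it is a no-op
    # while the set holds `cap` members or fewer.
    client.zremrangebyrank(key, 0, -(soft_cap(candidate_top_k) + 1))
```

With `candidate_top_k=5` the cap stays at the floor of 100; with `candidate_top_k=20` it grows to 200.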

Validation Status against Acceptance Criteria

1. Lookup Latency SLA (< 50ms)

  • Capping the candidate set at a tunable candidate_top_k bounds the cost of the linear _cosine_similarity scan.
  • Redis operations are batched with pipeline() to minimize network round trips.
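As a rough illustration of the two points above, the sketch below reads the newest `candidate_top_k` index members and fetches their payloads in one pipelined batch, timing the whole lookup with `time.perf_counter()`. The function name `fetch_recent_candidates` and the storage layout (one JSON string per entry key) are assumptions; `client` is assumed to be a redis-py client, of which only `zrevrange` and `pipeline` are used.

```python
import json
import time
from typing import Any, List, Tuple


def fetch_recent_candidates(
    client,  # assumed: a redis-py client
    index_key: str,
    candidate_top_k: int,
) -> Tuple[List[Tuple[str, Any]], float]:
    """Return ([(entry_key, payload), ...], elapsed_ms) for the newest entries."""
    if candidate_top_k <= 0:
        return [], 0.0
    started = time.perf_counter()
    # Highest scores first, i.e. the most recently written entries.
    entry_keys = client.zrevrange(index_key, 0, candidate_top_k - 1)
    pipe = client.pipeline()
    for entry_key in entry_keys:
        pipe.get(entry_key)  # queued locally, sent together on execute()
    payloads = pipe.execute()
    candidates = [
        (entry_key, json.loads(payload))
        for entry_key, payload in zip(entry_keys, payloads)
        if payload is not None  # skip entries that expired after being indexed
    ]
    elapsed_ms = (time.perf_counter() - started) * 1000.0
    return candidates, elapsed_ms
```

The lookup then costs two round trips regardless of `candidate_top_k`: one for the index read and one for the batched payload reads.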

2. Targeted Cache Hit Rate (> 40%)

  • Adds a configurable LLM_CACHE_SIMILARITY_THRESHOLD (default 0.88) so the match threshold can be tuned at runtime for different kinds of prompt text.
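One way the threshold and the hit-rate target could be wired up is sketched below. It assumes `LLM_CACHE_SIMILARITY_THRESHOLD` is read from the environment, which this description does not state explicitly; `load_similarity_threshold` and `hit_rate` are hypothetical helper names.

```python
import os
from typing import Mapping, Optional

DEFAULT_SIMILARITY_THRESHOLD = 0.88


def load_similarity_threshold(env: Optional[Mapping[str, str]] = None) -> float:
    """Read the threshold from the environment, falling back to the default."""
    source = os.environ if env is None else env
    raw = source.get("LLM_CACHE_SIMILARITY_THRESHOLD")
    if raw is None or raw.strip() == "":
        return DEFAULT_SIMILARITY_THRESHOLD
    value = float(raw)
    if not 0.0 < value <= 1.0:
        raise ValueError("LLM_CACHE_SIMILARITY_THRESHOLD must be in (0, 1]")
    return value


def hit_rate(hits: int, misses: int) -> float:
    """Fraction of lookups served from cache; 0.0 before any lookups."""
    total = hits + misses
    return hits / total if total else 0.0
```

Lowering the threshold trades answer fidelity for a higher hit rate, so the 40% target is best checked against real traffic with `hit_rate` rather than assumed from the default.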

Closes #362

@drips-wave

drips-wave Bot commented Jun 27, 2026

@Francis6-git Great news! 🎉 Based on an automated assessment of this PR, the linked Wave issue(s) no longer count against your application limits.

You can now apply to more issues while waiting for this PR to be reviewed. Keep up the great work! 🚀

Development

Successfully merging this pull request may close these issues.

[LLM] Implement caching layer for LLM responses
