Reliability hardening suite#261
Conversation
Add typed timeout errors for LLM and embedding calls and wrap async provider operations with asyncio.wait_for. Cover base, LiteLLM, and OpenRouter async paths with regression tests.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add typed wrapping and error-level logging around high-risk broad exception paths while preserving debug tracebacks. Cover connection, provider, loader, pipeline, retrieval, and history validation error behavior.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add regression tests for async context manager cleanup, close failure propagation, and inner-exception preservation.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add a typed latency budget error and enforce Context budgets before retrieval phases, helper I/O, graph config probes, Cypher calls, and completion LLM calls. Cover propagation and phase gating with regression tests.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add an explicit integration marker, run marked real-FalkorDB tests in CI, document docker-compose usage, and expose the FalkorDB browser port in local compose.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Validate Python distributions before trusted PyPI publishing, upload release artifacts, enable manual docs deploys, and add Dependabot coverage for actions and Python dependencies.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds latency-budget enforcement and typed timeouts across providers and retrievals, threads ctx and timeouts through GraphRAG facade and strategies, introduces timeout helpers and new exception types, tightens history validation and logging, updates CI/workflows and docker/dev docs, and adds extensive tests for budgets, timeouts, and logging. ChangesLatency Budget Enforcement System
Sequence Diagram(s)sequenceDiagram
participant Client
participant GraphRAG
participant Context
participant Retrieval
participant Embedder
participant LLM
Client->>GraphRAG: retrieve(query, ctx)
GraphRAG->>Context: ensure_budget(graph-config)
GraphRAG->>GraphRAG: _validate_graph_config(ctx)
GraphRAG->>Context: ensure_budget(completion retrieval)
GraphRAG->>Retrieval: search(ctx)
Retrieval->>Context: ensure_budget(query embedding)
Retrieval->>Embedder: aembed_query(text, timeout=ctx.remaining_budget_seconds)
Retrieval->>Context: ensure_budget(chunk search)
Retrieval->>Retrieval: vector_store.search_chunks(...)
Retrieval->>Context: ensure_budget(rerank/embed)
Retrieval->>Embedder: aembed_documents(..., timeout=ctx.remaining_budget_seconds)
GraphRAG->>Context: ensure_budget(completion LLM call)
GraphRAG->>LLM: ainvoke_messages(messages, timeout=ctx.remaining_budget_seconds)
LLM-->>GraphRAG: response
GraphRAG-->>Client: result
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related issues
Possibly related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Deep-dive review of the reliability hardening suite. Solid intent and real value, but three load-bearing issues need addressing before merge. Posted as line comments — summary below.
Blocking (Critical):
- C1:
connection.pysilently wraps every DB error inDatabaseError— breaking change for any downstreamexcept falkordb.<Specific>Errorclause, andFalkorDBConnectionis a re-exported public class. - C2: the latency budget is never wired into provider
timeout=. Once a slow LLM call is in flight, the budget can fly past the threshold and nothing aborts it. - C3:
aembed_documents(..., timeout=10)applies the timeout per batch (and per binary-split sub-call), not as an overall deadline — the public contract reads as 10s but wall-clock total can be 50s+.
Wins worth keeping: the MENTIONED_IN cosine-rank merge from #259, the history-validation refactor in api/main.py (error messages were wrong before, are right now), the secret-leakage test in test_providers.py, the PdfLoader open/finally fix. Tests pass cleanly (393/393 on touched files).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
graphrag_sdk/src/graphrag_sdk/core/providers/openrouter.py (1)
161-166:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUse an overall deadline for retried LLM calls.
Like the LiteLLM path, this treats
timeoutas per-attempt, not per-operation. With retries enabled, the request can exceed the caller's remaining latency budget by the backoff plus extra attempts. Recompute a remaining timeout from a monotonic deadline on each loop iteration so the wholeainvoke*call stays within the passed budget.Also applies to: 216-221
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@graphrag_sdk/src/graphrag_sdk/core/providers/openrouter.py` around lines 161 - 166, The current call to wait_for_provider_call(client.chat.completions.create(**create_kwargs), timeout=timeout, ...) treats timeout as per-attempt; change the retry loop to compute a monotonic deadline once at the start and, before each attempt, recompute remaining_timeout = max(0, deadline - time.monotonic()) and pass that remaining_timeout into wait_for_provider_call (and into any backoff/wait logic) so the whole ainvoke/chat completion operation respects the original budget; apply the same deadline-based timeout recomputation to the other OpenRouter completion call site that uses client.chat.completions.create and model_name.graphrag_sdk/src/graphrag_sdk/core/providers/litellm.py (1)
138-143:⚠️ Potential issue | 🟠 Major | ⚡ Quick winMake
timeouta deadline across retries.These calls reuse the same
timeouton every retry attempt, soainvoke(..., timeout=2, max_retries=3)can still run well past 2 seconds once retry backoff is included. That breaks the new latency-budget plumbing, because callers now passctx.remaining_budget_secondsexpecting the whole provider operation to stay inside the remaining budget. Mirror thedeadline/remaining_timeout()pattern already used inaembed_documentsso retries and sleeps consume the same budget.Also applies to: 227-232
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@graphrag_sdk/src/graphrag_sdk/core/providers/litellm.py` around lines 138 - 143, The wait_for_provider_call invocation in LiteLLM methods (e.g., the acompletion call that uses self._completion_kwargs and model_name) currently reuses the same timeout for every retry, which can exceed a caller's overall latency budget; change the logic to compute a deadline/remaining timeout per attempt (mirror the remaining_timeout()/deadline pattern used in aembed_documents) so each retry and any backoff/sleep consumes from the same budget and you pass the updated remaining timeout into wait_for_provider_call (and any sleep/backoff) rather than the original static timeout; update both the completion path around acompletion and the other occurrence noted (lines ~227-232) to use this per-attempt remaining timeout.
🧹 Nitpick comments (1)
graphrag_sdk/tests/test_connection.py (1)
117-117: ⚡ Quick winTighten exception type assertions to match the “typed” intent.
Using
pytest.raises(Exception)is too broad and may pass on unintended exception paths. Use explicit exception classes inside_effectand assert those exact classes are propagated.✅ Suggested test hardening
- mock_graph.query = AsyncMock(side_effect=Exception("always fails")) + class AlwaysFailsError(Exception): + pass + mock_graph.query = AsyncMock(side_effect=AlwaysFailsError("always fails")) @@ - with pytest.raises(Exception, match="always fails"): + with pytest.raises(AlwaysFailsError, match="always fails"): await conn.query("MATCH (n) RETURN n") @@ - mock_graph.query = AsyncMock(side_effect=Exception("already indexed")) + class AlreadyIndexedError(Exception): + pass + mock_graph.query = AsyncMock(side_effect=AlreadyIndexedError("already indexed")) @@ - with pytest.raises(Exception, match="already indexed"): + with pytest.raises(AlreadyIndexedError, match="already indexed"): await conn.query("CREATE INDEX idx")Also applies to: 131-132
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@graphrag_sdk/tests/test_connection.py` at line 117, The tests use pytest.raises(Exception, match="always fails") which is too broad; change the mocked side_effect to raise a specific exception class (e.g., RuntimeError or the actual custom exception used in the code) and update the pytest.raises call(s) to assert that exact exception type instead of Exception for the occurrences where pytest.raises(Exception, match="always fails") appears; ensure both instances (the shown pytest.raises call and the other occurrence later in the file tied to the same mock/side_effect) use the same explicit exception class so the test verifies the intended error path.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@graphrag_sdk/src/graphrag_sdk/core/providers/_timeout.py`:
- Around line 18-27: The validate_timeout function currently raises a raw
ValueError when timeout <= 0 which bypasses the typed timeout error contract;
change validate_timeout(timeout: float | None) to raise the module's
timeout_error (not ValueError) when timeout is not None and timeout <= 0,
including the timeout value in the message (e.g. f"timed out after
{timeout:.3g}s") so callers and the except block keep receiving the typed
timeout error; update references to validate_timeout and keep the existing await
asyncio.wait_for/except behavior unchanged.
In `@graphrag_sdk/tests/test_providers.py`:
- Around line 729-734: The wall-clock assertion in the test around
embedder.aembed_documents is too strict and can flake; relax the timing ceiling
or mock the clock: change the assertion `assert time.monotonic() - started <
0.05` to a looser bound (for example `< 0.1` or `< 0.2`) or replace
time.monotonic with a mocked clock to avoid scheduler jitter, leaving the rest
of the test (the with pytest.raises(EmbeddingTimeoutError) block and the `assert
mock_litellm.aembedding.await_count == 2`) unchanged.
---
Outside diff comments:
In `@graphrag_sdk/src/graphrag_sdk/core/providers/litellm.py`:
- Around line 138-143: The wait_for_provider_call invocation in LiteLLM methods
(e.g., the acompletion call that uses self._completion_kwargs and model_name)
currently reuses the same timeout for every retry, which can exceed a caller's
overall latency budget; change the logic to compute a deadline/remaining timeout
per attempt (mirror the remaining_timeout()/deadline pattern used in
aembed_documents) so each retry and any backoff/sleep consumes from the same
budget and you pass the updated remaining timeout into wait_for_provider_call
(and any sleep/backoff) rather than the original static timeout; update both the
completion path around acompletion and the other occurrence noted (lines
~227-232) to use this per-attempt remaining timeout.
In `@graphrag_sdk/src/graphrag_sdk/core/providers/openrouter.py`:
- Around line 161-166: The current call to
wait_for_provider_call(client.chat.completions.create(**create_kwargs),
timeout=timeout, ...) treats timeout as per-attempt; change the retry loop to
compute a monotonic deadline once at the start and, before each attempt,
recompute remaining_timeout = max(0, deadline - time.monotonic()) and pass that
remaining_timeout into wait_for_provider_call (and into any backoff/wait logic)
so the whole ainvoke/chat completion operation respects the original budget;
apply the same deadline-based timeout recomputation to the other OpenRouter
completion call site that uses client.chat.completions.create and model_name.
---
Nitpick comments:
In `@graphrag_sdk/tests/test_connection.py`:
- Line 117: The tests use pytest.raises(Exception, match="always fails") which
is too broad; change the mocked side_effect to raise a specific exception class
(e.g., RuntimeError or the actual custom exception used in the code) and update
the pytest.raises call(s) to assert that exact exception type instead of
Exception for the occurrences where pytest.raises(Exception, match="always
fails") appears; ensure both instances (the shown pytest.raises call and the
other occurrence later in the file tied to the same mock/side_effect) use the
same explicit exception class so the test verifies the intended error path.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 46226887-d6df-4e6d-8508-2f1351babea6
📒 Files selected for processing (21)
docker-compose.ymlgraphrag_sdk/src/graphrag_sdk/api/main.pygraphrag_sdk/src/graphrag_sdk/core/connection.pygraphrag_sdk/src/graphrag_sdk/core/context.pygraphrag_sdk/src/graphrag_sdk/core/providers/_timeout.pygraphrag_sdk/src/graphrag_sdk/core/providers/base.pygraphrag_sdk/src/graphrag_sdk/core/providers/litellm.pygraphrag_sdk/src/graphrag_sdk/core/providers/openrouter.pygraphrag_sdk/src/graphrag_sdk/retrieval/reranking_strategies/cosine.pygraphrag_sdk/src/graphrag_sdk/retrieval/strategies/chunk_retrieval.pygraphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_generation.pygraphrag_sdk/src/graphrag_sdk/retrieval/strategies/entity_discovery.pygraphrag_sdk/src/graphrag_sdk/retrieval/strategies/local.pygraphrag_sdk/src/graphrag_sdk/retrieval/strategies/multi_path.pygraphrag_sdk/src/graphrag_sdk/retrieval/strategies/relationship_expansion.pygraphrag_sdk/src/graphrag_sdk/retrieval/strategies/result_assembly.pygraphrag_sdk/tests/test_connection.pygraphrag_sdk/tests/test_facade.pygraphrag_sdk/tests/test_multi_path_retrieval.pygraphrag_sdk/tests/test_providers.pygraphrag_sdk/tests/test_retrieval.py
✅ Files skipped from review due to trivial changes (1)
- docker-compose.yml
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@graphrag_sdk/src/graphrag_sdk/core/context.py`:
- Line 57: The method remaining_budget_seconds currently returns a tiny positive
1e-9 even when the budget is exhausted, which conflicts with budget_exceeded and
can allow a call to start with a non-zero timeout; update
remaining_budget_seconds (in the Context class/function named
remaining_budget_seconds) to return 0.0 when remaining milliseconds <= 0 (e.g.,
if remaining <= 0 return 0.0 else return remaining/1000.0) so an exhausted
budget yields exactly zero seconds and aligns with budget_exceeded semantics.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6f48d8e8-50d7-492e-8564-7ec08c41f346
📒 Files selected for processing (3)
graphrag_sdk/src/graphrag_sdk/core/context.pygraphrag_sdk/src/graphrag_sdk/core/providers/_timeout.pygraphrag_sdk/tests/test_providers.py
Co-Authored-By: Copilot <223556219+Copilot@users.noreply.github.com>
galshubeli
left a comment
There was a problem hiding this comment.
Overview
The PR bundles six themes — broadly cohesive under "reliability." Most of the surface area is consistent and well-tested:
- New
LatencyBudgetExceededError,LLMTimeoutError,EmbeddingTimeoutErrortypes with proper hierarchy. Context.ensure_budget(operation)+remaining_budget_secondsplumbed through retrieval, completion, validation, and provider calls.- Centralized
wait_for_provider_call/validate_timeouthelpers incore/providers/_timeout.py. - Error logging upgraded from silent broad-except to
logger.error+logger.debug(exc_info=...)in connection retry, retry helpers, loaders, pipeline, retrieval base. -m integrationpytest marker isolates real-FalkorDB tests; CI runs them; CONTRIBUTING documents the local invocation.
Issues
1. Duplicate comment lines in multi_path.py (merge artifact)
graphrag_sdk/src/graphrag_sdk/retrieval/strategies/multi_path.py:234-293 has each step comment doubled:
# 4. Entity discovery (2 paths) + merge rel_entities + cypher_entities
# 4. Entity discovery (2 paths) + merge rel_entities + cypher_entitiesSame for steps 5, 6, 7, 8. Looks like rebase debris — should be cleaned up before merge.
2. Shared mutable _UNBOUNDED = Context() module-level singleton
Defined in chunk_retrieval.py:13, entity_discovery.py:15, cypher_generation.py:19, relationship_expansion.py:13, result_assembly.py:17. Today this works (because latency_budget_ms=None makes ensure_budget a no-op), but it's a footgun:
started_atis set at import time;elapsed_msgrows forever.- Any caller mutating
_UNBOUNDED.trace_id,tenant_id,latency_budget_mspoisons every other caller. - It's the classic "mutable default argument" anti-pattern, just hoisted to module scope.
Cleaner: ctx: Context | None = None, then if ctx is not None: ctx.ensure_budget(...). Marginal verbosity is worth the safety.
3. Inconsistent timeout=0 behavior
Production path: validate_timeout(0) → ValueError("timeout must be > 0").
Direct internal path (tested in test_provider_wait_timeout_zero_raises_typed_error): wait_for_provider_call(..., timeout=0) → typed LLMTimeoutError/EmbeddingTimeoutError.
The if timeout <= 0 branch in _timeout.py:18-22 is unreachable from production callers because every entry point calls validate_timeout first. The test exercises dead-but-buggy code. Pick one contract:
- If 0 is invalid → drop the branch in
wait_for_provider_calland drop the test. - If 0 means "fail-fast typed timeout" → remove
validate_timeout's zero check and let the helper own the semantics.
4. remaining_budget_seconds floor of 1e-9
context.py:55 clamps the returned timeout to 1e-9 seconds. Combined with ensure_budget immediately before each call, this is mostly safe — but for sub-millisecond remaining budgets you'll pass a clearly-infeasible timeout into aembed_query/ainvoke instead of failing fast. Consider raising LatencyBudgetExceededError if remaining is below a small threshold (e.g. 1 ms) rather than handing the LLM a 1-nanosecond deadline.
5. wait_for_provider_call closes arbitrary awaitables
_timeout.py:18-22 does getattr(awaitable, "close", None) and calls it. Coroutines support close(), but generic Awaitable (futures, tasks) may not — and calling close() on a coroutine that hasn't been awaited prints a RuntimeWarning in some Python configurations. As noted above this branch is unreachable from production today, but if you keep it, narrow the type or use inspect.iscoroutine.
6. LiteLLM import: stray import time at top vs. lazy import litellm
providers/litellm.py:9 adds import time at module level — fine — but aembed_documents (deadline path) is the only consumer. Just an FYI for consistency with the existing lazy-import pattern; not a bug.
Smaller notes
pypi-publish.yaml:defaults.run.working-directory: graphrag_sdkapplies only torun:steps;upload-artifactandpypa/gh-action-pypi-publishsee paths from repo root.graphrag_sdk/dist/is correct from repo root, so this works — but it's brittle. A comment would help._validate_historyswap fromtry/exceptonChatMessage(...)to explicit type checks (api/main.py:1488-1503) is a good change — old code swallowed real bugs.core/connection.py:194-208: the `if last_exc is not None` block is written twice; the second one is needed but the first ends in `raise last_exc`. Reads slightly oddly — fine functionally.- Async context manager tests reference
__aenter__/__aexit__not in the diff. They exist inapi/main.py:213-237already — verified, no missing implementation. - Integration test discipline (
-m integrationmarker +RUN_INTEGRATION=1doc) is well executed.
Test coverage
Strong. New tests cover timeout propagation per provider, budget exhaustion at each retrieval phase, log-redaction sanity (SECRET_KEY_xyz/proxy= exclusion from error logs), context-manager close-failure behavior, and the aembed_documents overall-deadline guarantee.
Risk assessment
Medium-low. The semantic change with most blast radius is making every retrieval step budget-aware — well-behaved when no budget is set (no-op), but callers passing tight budgets will now see LatencyBudgetExceededError mid-flow rather than silent timeouts. This is intentional and an improvement, but is API-visible.
Recommendation
Request changes for items 1 (cleanup) and ideally 2 (sentinel pattern). Items 3–5 are worth addressing but non-blocking. The bulk of the work is solid.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
galshubeli
left a comment
There was a problem hiding this comment.
Follow-up review — commit 4fc166b
Verified each item from the previous review against the follow-up commit.
Resolved
| # | Issue | Status |
|---|---|---|
| 1 | Duplicate comment lines in multi_path.py |
✅ Removed at steps 4–8 |
| 2 | Shared mutable _UNBOUNDED = Context() singleton across 5 retrieval modules |
✅ Replaced with ctx: Context | None = None + per-call-site if ctx is not None: ctx.ensure_budget(...) guards |
| 3 | Inconsistent timeout=0 behavior between validate_timeout (ValueError) and wait_for_provider_call (typed timeout) |
✅ Dead if timeout <= 0 branch removed from _timeout.py; validate_timeout is now the single source of truth; dead-code test removed |
| 4 | remaining_budget_seconds 1e-9 clamp would hand the provider a sub-millisecond deadline |
✅ Clamp removed; new Context.provider_timeout_seconds(operation, min_remaining_ms=1.0) raises LatencyBudgetExceededError if remaining ≤ 1 ms, otherwise returns the seconds value. All budget-derived provider timeouts now route through it. |
| 5 | wait_for_provider_call calling close() on arbitrary awaitables |
✅ Eliminated with the timeout <= 0 branch |
| — | Doubled if last_exc is not None: block in connection.py |
✅ Consolidated into a single guard wrapping both logger.debug and raise last_exc |
New tests in test_context.py cover both the raise-when-low and return-remaining paths of provider_timeout_seconds. Good coverage.
Minor follow-up (non-blocking)
provider_timeout_seconds itself raises LatencyBudgetExceededError when remaining ≤ 1 ms, so the immediately-preceding ctx.ensure_budget(operation) call is redundant at every site that pairs them — e.g. local.py:55-58, cosine.py:43-46, api/main.py:1527-1532 and api/main.py:1656-1660. The two have slightly different thresholds (ensure_budget fires at remaining ≤ 0, provider_timeout_seconds at remaining ≤ 1 ms), but since the latter is stricter, the prior call adds nothing — it can never fire when the next call wouldn't.
Two ways to clean up:
- Drop the redundant
ensure_budget(...)and rely onprovider_timeout_seconds(...)raising. - Or keep
ensure_budgetonly for the call sites that don't pass a timeout (graph queries, vector searches), and drop it elsewhere.
Either way, current behavior is correct — just a small simplification opportunity.
Recommendation
LGTM to merge. Optional polish above can be done in a follow-up or skipped.
Summary
Review
Testing
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Chores
Tests