Reliability hardening suite by Naseem77 · Pull Request #261 · FalkorDB/GraphRAG-SDK

Naseem77 · 2026-05-18T16:54:49Z

Summary

enforce LLM and embedding timeouts with typed timeout errors
improve error visibility for broad exception handling paths
cover GraphRAG async context manager cleanup behavior
enforce latency budgets across retrieval and LLM phases
mark and wire real FalkorDB integration tests
tighten release automation for docs, PyPI checks, artifacts, and Dependabot

Review

ran the local review process after each task and only committed approved fixes

Testing

targeted provider, facade, retrieval, integration-marker, YAML, and build checks were run during each task

Summary by CodeRabbit

New Features
- Timeouts for LLM/embedding calls, a new LatencyBudgetExceededError, and latency-budget enforcement across retrieval and completion.
Bug Fixes
- Improved error logging and clearer, typed timeout/error propagation so failures are reported consistently.
Documentation
- Dev setup now uses Docker Compose; added instructions to run integration tests locally.
Chores
- CI/CD workflow tweaks and weekly Dependabot updates.
Tests
- Many new tests for timeouts, budget checkpoints, and propagation.

Add typed timeout errors for LLM and embedding calls and wrap async provider operations with asyncio.wait_for. Cover base, LiteLLM, and OpenRouter async paths with regression tests.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add typed wrapping and error-level logging around high-risk broad exception paths while preserving debug tracebacks. Cover connection, provider, loader, pipeline, retrieval, and history validation error behavior.

Add regression tests for async context manager cleanup, close failure propagation, and inner-exception preservation.

Add a typed latency budget error and enforce Context budgets before retrieval phases, helper I/O, graph config probes, Cypher calls, and completion LLM calls. Cover propagation and phase gating with regression tests.

Add an explicit integration marker, run marked real-FalkorDB tests in CI, document docker-compose usage, and expose the FalkorDB browser port in local compose.

Validate Python distributions before trusted PyPI publishing, upload release artifacts, enable manual docs deploys, and add Dependabot coverage for actions and Python dependencies.

coderabbitai · 2026-05-18T16:55:04Z

Walkthrough

Adds latency-budget enforcement and typed timeouts across providers and retrievals, threads ctx and timeouts through GraphRAG facade and strategies, introduces timeout helpers and new exception types, tightens history validation and logging, updates CI/workflows and docker/dev docs, and adds extensive tests for budgets, timeouts, and logging.

Changes

Latency Budget Enforcement System

| Layer | File(s) | Summary |
| --- | --- | --- |
| CI, workflows, docs, and repo config | `.github/dependabot.yml`, `.github/workflows/ci.yml`, `.github/workflows/docs.yml`, `.github/workflows/pypi-publish.yaml`, `docker-compose.yml`, `CONTRIBUTING.md`, `graphrag_sdk/pyproject.toml` | Dependabot enabled for Actions and Python; CI integration runs pytest -m integration; docs workflow adds manual dispatch; PyPI workflow runs twine check and uploads artifacts; docker-compose exposes UI port and adds healthcheck start_period; contributing and pytest marker updated. |
| Exception hierarchy and Context API | `graphrag_sdk/src/graphrag_sdk/core/exceptions.py`, `graphrag_sdk/src/graphrag_sdk/core/context.py`, `graphrag_sdk/src/graphrag_sdk/__init__.py` | Adds `LatencyBudgetExceededError`, `LLMTimeoutError`, `EmbeddingTimeoutError`; implements `Context.remaining_budget_seconds` and `Context.ensure_budget(operation)`; exports new exception via package `__all__`. |
| Provider timeout helpers | `graphrag_sdk/src/graphrag_sdk/core/providers/_timeout.py` | Adds `wait_for_provider_call` and `validate_timeout` to enforce timeouts and convert timeout events into typed timeout exceptions with operation-aware messages. |
| Provider ABCs: timeout wiring and retries | `graphrag_sdk/src/graphrag_sdk/core/providers/base.py`, `graphrag_sdk/src/graphrag_sdk/core/providers/_retry.py` | Adds `timeout` kwargs to Embedder and LLM interfaces, validates timeouts, wraps sync/async provider calls with `wait_for_provider_call`, forwards timeouts through batch/stream helpers, and treats `EmbeddingTimeoutError` as non-retriable. |
| Provider implementations (LiteLLM, OpenRouter) | `graphrag_sdk/src/graphrag_sdk/core/providers/litellm.py`, `graphrag_sdk/src/graphrag_sdk/core/providers/openrouter.py` | Implements timeout support end-to-end for async LLM and embedder methods, enforces per-call or shared-deadline timeouts for batched embeddings, maps timeout failures to typed errors, and consolidates final error logging after retries. |
| Retrieval strategies: ctx propagation & checkpoints | `graphrag_sdk/src/graphrag_sdk/retrieval/strategies/*.py` | Threads `ctx: Context` through chunk_retrieval, cypher_generation, entity_discovery, local, multi_path, relationship_expansion, result_assembly; inserts `ctx.ensure_budget(...)` before major async steps and re-raises `LatencyBudgetExceededError`. |
| MultiPath orchestration and keyword extraction | `graphrag_sdk/src/graphrag_sdk/retrieval/strategies/multi_path.py` | Adds `_unpack_gather_result` helper for parallel tasks, requires ctx in keyword extraction, propagates ctx into parallel Cypher/text retrieval paths, and checkpoints result assembly. |
| GraphRAG facade and completion flow | `graphrag_sdk/src/graphrag_sdk/api/main.py` | Passes `ctx` into graph-config validation, retrieval, and reranking; inserts budget checkpoints before question rewrite, retrieval, and final LLM call; tightens conversation-history dict validation; sets provider timeouts from ctx. |
| Connection, loaders, and pipeline logging | `graphrag_sdk/src/graphrag_sdk/core/connection.py`, `graphrag_sdk/src/graphrag_sdk/ingestion/loaders/*.py`, `graphrag_sdk/src/graphrag_sdk/ingestion/pipeline.py` | Adds structured error logging and debug tracebacks for non-transient FalkorDB failures, loader read failures, and ingestion pipeline unexpected exceptions; re-raises original/typed exceptions. |
| Reranking and cosine reranker | `graphrag_sdk/src/graphrag_sdk/retrieval/reranking_strategies/cosine.py`, `graphrag_sdk/src/graphrag_sdk/retrieval/strategies/result_assembly.py` | Enforces budget for reranking embedding fallback and passes timeout derived from ctx.remaining_budget_seconds. |
| Tests: budgets, timeouts, and logging | `graphrag_sdk/tests/*` | Extensive tests added/updated for Context.ensure_budget, provider timeout validation and behavior, LatencyBudgetExceededError propagation across retrieval and facade, loader/pipeline log assertions, and integration test tagging. |

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant GraphRAG
  participant Context
  participant Retrieval
  participant Embedder
  participant LLM
  Client->>GraphRAG: retrieve(query, ctx)
  GraphRAG->>Context: ensure_budget(graph-config)
  GraphRAG->>GraphRAG: _validate_graph_config(ctx)
  GraphRAG->>Context: ensure_budget(completion retrieval)
  GraphRAG->>Retrieval: search(ctx)
  Retrieval->>Context: ensure_budget(query embedding)
  Retrieval->>Embedder: aembed_query(text, timeout=ctx.remaining_budget_seconds)
  Retrieval->>Context: ensure_budget(chunk search)
  Retrieval->>Retrieval: vector_store.search_chunks(...)
  Retrieval->>Context: ensure_budget(rerank/embed)
  Retrieval->>Embedder: aembed_documents(..., timeout=ctx.remaining_budget_seconds)
  GraphRAG->>Context: ensure_budget(completion LLM call)
  GraphRAG->>LLM: ainvoke_messages(messages, timeout=ctx.remaining_budget_seconds)
  LLM-->>GraphRAG: response
  GraphRAG-->>Client: result

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Production Hardening & Release Readiness #257: Implements production-hardening items (typed timeouts, wait_for_provider_call, Context.ensure_budget, LatencyBudgetExceededError, improved logging, CI/workflow updates) similar to this PR.

Possibly related PRs

FalkorDB/GraphRAG-SDK#247: Related CI/integration-job adjustments for integration-marked tests.

"🐰 I counted ms like carrot crumbs,
Budgets checked where timeouts hum,
Retrieval hops pause, then spring,
Logs whisper truths; tests make me sing,
Hooray — the rabbit hops, 'All green, run!'"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 36.32% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Reliability hardening suite' directly reflects the primary objectives of the PR: enforcing timeouts, improving error visibility, adding budget enforcement, and enhancing reliability across the SDK. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

galshubeli

Deep-dive review of the reliability hardening suite. Solid intent and real value, but three load-bearing issues need addressing before merge. Posted as line comments — summary below.

Blocking (Critical):

C1: connection.py silently wraps every DB error in DatabaseError. This is a breaking change for any downstream `except falkordb.<Specific>Error` clause, and FalkorDBConnection is a re-exported public class.
C2: the latency budget is never wired into the provider `timeout=` argument. Once a slow LLM call is in flight, elapsed time can run past the budget and nothing aborts the call.
C3: aembed_documents(..., timeout=10) applies the timeout per batch (and per binary-split sub-call), not as an overall deadline — the public contract reads as 10s but wall-clock total can be 50s+.
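
To make C3 concrete, here is a minimal sketch of treating `timeout` as one overall deadline across batches rather than a per-batch allowance. It is not the SDK's implementation: `embed_in_batches`, `embed_batch`, and the local `EmbeddingTimeoutError` stand-in are hypothetical names used only for illustration.

```python
import asyncio
import time


class EmbeddingTimeoutError(Exception):
    """Stand-in for the SDK's typed embedding timeout (hypothetical definition)."""


async def embed_in_batches(texts, embed_batch, *, batch_size=2, timeout=None):
    """Embed texts batch by batch, treating timeout as one overall deadline.

    embed_batch is a hypothetical async callable that takes a list of texts
    and returns one vector per text.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    vectors = []
    for start in range(0, len(texts), batch_size):
        remaining = None
        if deadline is not None:
            # Each batch only gets what is left of the shared budget.
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                raise EmbeddingTimeoutError(f"embedding timed out after {timeout:.3g}s")
        batch = texts[start:start + batch_size]
        try:
            vectors.extend(await asyncio.wait_for(embed_batch(batch), timeout=remaining))
        except asyncio.TimeoutError as exc:
            raise EmbeddingTimeoutError(f"embedding timed out after {timeout:.3g}s") from exc
    return vectors
```

With a per-batch timeout, N batches can take up to N times the stated value; with the shared deadline above, the whole call is bounded by roughly the value the caller passed.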

Wins worth keeping: the MENTIONED_IN cosine-rank merge from #259, the history-validation refactor in api/main.py (error messages were wrong before, are right now), the secret-leakage test in test_providers.py, the PdfLoader open/finally fix. Tests pass cleanly (393/393 on touched files).

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

graphrag_sdk/src/graphrag_sdk/core/providers/openrouter.py (1)
161-166: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use an overall deadline for retried LLM calls.

Like the LiteLLM path, this treats timeout as per-attempt, not per-operation. With retries enabled, the request can exceed the caller's remaining latency budget by the backoff plus extra attempts. Recompute a remaining timeout from a monotonic deadline on each loop iteration so the whole ainvoke* call stays within the passed budget.

Also applies to: 216-221
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@graphrag_sdk/src/graphrag_sdk/core/providers/openrouter.py` around lines 161
- 166, The current call to
wait_for_provider_call(client.chat.completions.create(**create_kwargs),
timeout=timeout, ...) treats timeout as per-attempt; change the retry loop to
compute a monotonic deadline once at the start and, before each attempt,
recompute remaining_timeout = max(0, deadline - time.monotonic()) and pass that
remaining_timeout into wait_for_provider_call (and into any backoff/wait logic)
so the whole ainvoke/chat completion operation respects the original budget;
apply the same deadline-based timeout recomputation to the other OpenRouter
completion call site that uses client.chat.completions.create and model_name.
graphrag_sdk/src/graphrag_sdk/core/providers/litellm.py (1)
138-143: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Make timeout a deadline across retries.

These calls reuse the same timeout on every retry attempt, so ainvoke(..., timeout=2, max_retries=3) can still run well past 2 seconds once retry backoff is included. That breaks the new latency-budget plumbing, because callers now pass ctx.remaining_budget_seconds expecting the whole provider operation to stay inside the remaining budget. Mirror the deadline/remaining_timeout() pattern already used in aembed_documents so retries and sleeps consume the same budget.

Also applies to: 227-232
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@graphrag_sdk/src/graphrag_sdk/core/providers/litellm.py` around lines 138 -
143, The wait_for_provider_call invocation in LiteLLM methods (e.g., the
acompletion call that uses self._completion_kwargs and model_name) currently
reuses the same timeout for every retry, which can exceed a caller's overall
latency budget; change the logic to compute a deadline/remaining timeout per
attempt (mirror the remaining_timeout()/deadline pattern used in
aembed_documents) so each retry and any backoff/sleep consumes from the same
budget and you pass the updated remaining timeout into wait_for_provider_call
(and any sleep/backoff) rather than the original static timeout; update both the
completion path around acompletion and the other occurrence noted (lines
~227-232) to use this per-attempt remaining timeout.
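
The deadline pattern requested for both providers can be sketched as follows. This is an illustration under assumptions, not the SDK's retry code: `call_with_deadline`, `make_call`, and the local `LLMTimeoutError` stand-in are hypothetical, and the exponential backoff schedule is chosen only for the example.

```python
import asyncio
import time


class LLMTimeoutError(Exception):
    """Stand-in for the SDK's typed LLM timeout (hypothetical definition)."""


async def call_with_deadline(make_call, *, timeout, max_retries=3, backoff=0.01):
    """Retry make_call() so that attempts and backoff sleeps share one deadline.

    make_call is a hypothetical zero-argument function that returns a fresh
    awaitable for each attempt. max_retries is assumed to be >= 0.
    """
    deadline = time.monotonic() + timeout
    last_exc = None
    for attempt in range(max_retries + 1):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise LLMTimeoutError(f"LLM call timed out after {timeout:.3g}s") from last_exc
        try:
            # The attempt may use only what is left of the overall budget.
            return await asyncio.wait_for(make_call(), timeout=remaining)
        except asyncio.TimeoutError as exc:
            raise LLMTimeoutError(f"LLM call timed out after {timeout:.3g}s") from exc
        except Exception as exc:
            last_exc = exc  # treated as retriable in this sketch
        if attempt < max_retries:
            # Backoff is capped so the sleep cannot outlive the deadline.
            pause = backoff * (2 ** attempt)
            await asyncio.sleep(min(pause, max(0.0, deadline - time.monotonic())))
    raise last_exc
```

Because the remaining time is recomputed from a monotonic clock before every attempt and every sleep, a caller passing its remaining latency budget as `timeout` gets a bound on the whole operation, not on each attempt.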

🧹 Nitpick comments (1)

graphrag_sdk/tests/test_connection.py (1)

117-117: ⚡ Quick win

Tighten exception type assertions to match the “typed” intent.

Using pytest.raises(Exception) is too broad and may pass on unintended exception paths. Use explicit exception classes in side_effect and assert those exact classes are propagated.

✅ Suggested test hardening

-        mock_graph.query = AsyncMock(side_effect=Exception("always fails"))
+        class AlwaysFailsError(Exception):
+            pass
+        mock_graph.query = AsyncMock(side_effect=AlwaysFailsError("always fails"))
@@
-            with pytest.raises(Exception, match="always fails"):
+            with pytest.raises(AlwaysFailsError, match="always fails"):
                 await conn.query("MATCH (n) RETURN n")
@@
-        mock_graph.query = AsyncMock(side_effect=Exception("already indexed"))
+        class AlreadyIndexedError(Exception):
+            pass
+        mock_graph.query = AsyncMock(side_effect=AlreadyIndexedError("already indexed"))
@@
-            with pytest.raises(Exception, match="already indexed"):
+            with pytest.raises(AlreadyIndexedError, match="already indexed"):
                 await conn.query("CREATE INDEX idx")

Also applies to: 131-132

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@graphrag_sdk/tests/test_connection.py` at line 117, The tests use
pytest.raises(Exception, match="always fails") which is too broad; change the
mocked side_effect to raise a specific exception class (e.g., RuntimeError or
the actual custom exception used in the code) and update the pytest.raises
call(s) to assert that exact exception type instead of Exception for the
occurrences where pytest.raises(Exception, match="always fails") appears; ensure
both instances (the shown pytest.raises call and the other occurrence later in
the file tied to the same mock/side_effect) use the same explicit exception
class so the test verifies the intended error path.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@graphrag_sdk/src/graphrag_sdk/core/providers/_timeout.py`:
- Around line 18-27: The validate_timeout function currently raises a raw
ValueError when timeout <= 0 which bypasses the typed timeout error contract;
change validate_timeout(timeout: float | None) to raise the module's
timeout_error (not ValueError) when timeout is not None and timeout <= 0,
including the timeout value in the message (e.g. f"timed out after
{timeout:.3g}s") so callers and the except block keep receiving the typed
timeout error; update references to validate_timeout and keep the existing await
asyncio.wait_for/except behavior unchanged.

In `@graphrag_sdk/tests/test_providers.py`:
- Around line 729-734: The wall-clock assertion in the test around
embedder.aembed_documents is too strict and can flake; relax the timing ceiling
or mock the clock: change the assertion `assert time.monotonic() - started <
0.05` to a looser bound (for example `< 0.1` or `< 0.2`) or replace
time.monotonic with a mocked clock to avoid scheduler jitter, leaving the rest
of the test (the with pytest.raises(EmbeddingTimeoutError) block and the `assert
mock_litellm.aembedding.await_count == 2`) unchanged.

📥 Commits

Reviewing files that changed from the base of the PR and between 3eca0b4 and fea6b86.

📒 Files selected for processing (21)

docker-compose.yml
graphrag_sdk/src/graphrag_sdk/api/main.py
graphrag_sdk/src/graphrag_sdk/core/connection.py
graphrag_sdk/src/graphrag_sdk/core/context.py
graphrag_sdk/src/graphrag_sdk/core/providers/_timeout.py
graphrag_sdk/src/graphrag_sdk/core/providers/base.py
graphrag_sdk/src/graphrag_sdk/core/providers/litellm.py
graphrag_sdk/src/graphrag_sdk/core/providers/openrouter.py
graphrag_sdk/src/graphrag_sdk/retrieval/reranking_strategies/cosine.py
graphrag_sdk/src/graphrag_sdk/retrieval/strategies/chunk_retrieval.py
graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_generation.py
graphrag_sdk/src/graphrag_sdk/retrieval/strategies/entity_discovery.py
graphrag_sdk/src/graphrag_sdk/retrieval/strategies/local.py
graphrag_sdk/src/graphrag_sdk/retrieval/strategies/multi_path.py
graphrag_sdk/src/graphrag_sdk/retrieval/strategies/relationship_expansion.py
graphrag_sdk/src/graphrag_sdk/retrieval/strategies/result_assembly.py
graphrag_sdk/tests/test_connection.py
graphrag_sdk/tests/test_facade.py
graphrag_sdk/tests/test_multi_path_retrieval.py
graphrag_sdk/tests/test_providers.py
graphrag_sdk/tests/test_retrieval.py

✅ Files skipped from review due to trivial changes (1)

docker-compose.yml

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@graphrag_sdk/src/graphrag_sdk/core/context.py`:
- Line 57: The method remaining_budget_seconds currently returns a tiny positive
1e-9 even when the budget is exhausted, which conflicts with budget_exceeded and
can allow a call to start with a non-zero timeout; update
remaining_budget_seconds (in the Context class/function named
remaining_budget_seconds) to return 0.0 when remaining milliseconds <= 0 (e.g.,
if remaining <= 0 return 0.0 else return remaining/1000.0) so an exhausted
budget yields exactly zero seconds and aligns with budget_exceeded semantics.

📥 Commits

Reviewing files that changed from the base of the PR and between fea6b86 and d68c1c1.

📒 Files selected for processing (3)

graphrag_sdk/src/graphrag_sdk/core/context.py
graphrag_sdk/src/graphrag_sdk/core/providers/_timeout.py
graphrag_sdk/tests/test_providers.py

galshubeli

Overview

The PR bundles six themes — broadly cohesive under "reliability." Most of the surface area is consistent and well-tested:

New LatencyBudgetExceededError, LLMTimeoutError, EmbeddingTimeoutError types with proper hierarchy.
Context.ensure_budget(operation) + remaining_budget_seconds plumbed through retrieval, completion, validation, and provider calls.
Centralized wait_for_provider_call / validate_timeout helpers in core/providers/_timeout.py.
Error logging upgraded from silent broad-except to logger.error + logger.debug(exc_info=...) in connection retry, retry helpers, loaders, pipeline, retrieval base.
-m integration pytest marker isolates real-FalkorDB tests; CI runs them; CONTRIBUTING documents the local invocation.
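
For readers unfamiliar with the logging pattern named in the list above (a short `logger.error` line plus the traceback at debug level), here is a minimal sketch. The logger name, `LoaderError`, and `read_text` are hypothetical and are not taken from the SDK.

```python
import logging

logger = logging.getLogger("graphrag_sdk.example")  # hypothetical logger name


class LoaderError(Exception):
    """Hypothetical typed error used only in this sketch."""


def read_text(path):
    """Read a file, logging failures at error level and tracebacks at debug level."""
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    except OSError as exc:
        # One concise line that is always visible to operators.
        logger.error("Failed to read %s: %s", path, exc)
        # The full traceback is emitted only when debug logging is enabled.
        logger.debug("Traceback for failed read of %s", path, exc_info=True)
        raise LoaderError(f"could not read {path}") from exc
```

Raising with `from exc` keeps the original exception reachable as `__cause__`, so the typed wrapper does not hide the root failure.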

Issues

1. Duplicate comment lines in `multi_path.py` (merge artifact)

graphrag_sdk/src/graphrag_sdk/retrieval/strategies/multi_path.py:234-293 has each step comment doubled:

# 4. Entity discovery (2 paths) + merge rel_entities + cypher_entities
# 4. Entity discovery (2 paths) + merge rel_entities + cypher_entities

Same for steps 5, 6, 7, 8. Looks like rebase debris — should be cleaned up before merge.

2. Shared mutable `_UNBOUNDED = Context()` module-level singleton

Defined in chunk_retrieval.py:13, entity_discovery.py:15, cypher_generation.py:19, relationship_expansion.py:13, result_assembly.py:17. Today this works (because latency_budget_ms=None makes ensure_budget a no-op), but it's a footgun:

started_at is set at import time; elapsed_ms grows forever.
Any caller mutating _UNBOUNDED.trace_id, tenant_id, latency_budget_ms poisons every other caller.
It's the classic "mutable default argument" anti-pattern, just hoisted to module scope.

Cleaner: ctx: Context | None = None, then if ctx is not None: ctx.ensure_budget(...). Marginal verbosity is worth the safety.
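
A minimal sketch of the suggested `None` sentinel, assuming a simplified `Context`; the real class has more fields, and `search_chunks` is a hypothetical stand-in for a retrieval helper.

```python
import time
from dataclasses import dataclass, field
from typing import List, Optional


class LatencyBudgetExceededError(Exception):
    """Stand-in for the SDK's typed budget error (hypothetical definition)."""


@dataclass
class Context:
    """Simplified stand-in; only the fields needed for this sketch."""
    latency_budget_ms: Optional[float] = None
    started_at: float = field(default_factory=time.monotonic)

    def ensure_budget(self, operation: str) -> None:
        if self.latency_budget_ms is None:
            return  # no budget configured: no-op
        elapsed_ms = (time.monotonic() - self.started_at) * 1000.0
        if elapsed_ms >= self.latency_budget_ms:
            raise LatencyBudgetExceededError(f"latency budget exceeded before {operation}")


def search_chunks(query: str, ctx: Optional[Context] = None) -> List[str]:
    """Hypothetical retrieval helper: no shared default Context, just a None guard."""
    if ctx is not None:
        ctx.ensure_budget("chunk search")
    return [f"chunk for {query}"]
```

With `None` as the default there is no import-time `started_at` and no shared object for one caller to mutate on behalf of all the others.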

3. Inconsistent timeout=0 behavior

Production path: validate_timeout(0) → ValueError("timeout must be > 0").
Direct internal path (tested in test_provider_wait_timeout_zero_raises_typed_error): wait_for_provider_call(..., timeout=0) → typed LLMTimeoutError/EmbeddingTimeoutError.

The if timeout <= 0 branch in _timeout.py:18-22 is unreachable from production callers because every entry point calls validate_timeout first. The test exercises dead-but-buggy code. Pick one contract:

If 0 is invalid → drop the branch in wait_for_provider_call and drop the test.
If 0 means "fail-fast typed timeout" → remove validate_timeout's zero check and let the helper own the semantics.
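
The first option (0 is invalid, validation owns the check) could look like the sketch below. The helper names come from the PR, but the signatures, the `ainvoke` wrapper, and `_fake_completion` are assumptions made for illustration.

```python
import asyncio
from typing import Optional


class LLMTimeoutError(Exception):
    """Stand-in for the SDK's typed LLM timeout (hypothetical definition)."""


def validate_timeout(timeout: Optional[float]) -> Optional[float]:
    """Single source of truth: None means no timeout, anything <= 0 is a caller bug."""
    if timeout is not None and timeout <= 0:
        raise ValueError("timeout must be > 0")
    return timeout


async def wait_for_provider_call(awaitable, *, timeout, operation):
    """Assumed signature: await the call and convert a timeout into a typed error."""
    try:
        return await asyncio.wait_for(awaitable, timeout=timeout)
    except asyncio.TimeoutError as exc:
        raise LLMTimeoutError(f"{operation} timed out after {timeout:.3g}s") from exc


async def _fake_completion(prompt, delay=0.0):
    """Hypothetical provider call used only to exercise the helpers."""
    await asyncio.sleep(delay)
    return f"echo: {prompt}"


async def ainvoke(prompt, *, timeout=None, delay=0.0):
    # Validate before creating the coroutine so nothing is left un-awaited.
    timeout = validate_timeout(timeout)
    return await wait_for_provider_call(
        _fake_completion(prompt, delay), timeout=timeout, operation="LLM call"
    )
```

Under this contract a non-positive timeout is always a `ValueError` at the entry point, and the typed timeout error is reserved for calls that actually ran out of time.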

4. `remaining_budget_seconds` floor of `1e-9`

context.py:55 clamps the returned timeout to 1e-9 seconds. Combined with ensure_budget immediately before each call, this is mostly safe — but for sub-millisecond remaining budgets you'll pass a clearly-infeasible timeout into aembed_query/ainvoke instead of failing fast. Consider raising LatencyBudgetExceededError if remaining is below a small threshold (e.g. 1 ms) rather than handing the LLM a 1-nanosecond deadline.

5. `wait_for_provider_call` closes arbitrary awaitables

_timeout.py:18-22 does getattr(awaitable, "close", None) and calls it. Coroutines support close(), but generic Awaitable (futures, tasks) may not — and calling close() on a coroutine that hasn't been awaited prints a RuntimeWarning in some Python configurations. As noted above this branch is unreachable from production today, but if you keep it, narrow the type or use inspect.iscoroutine.
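
If the branch is kept, narrowing with `inspect.iscoroutine` might look like this minimal sketch; `discard_unawaited` is a hypothetical helper name, not something in the PR.

```python
import inspect


def discard_unawaited(awaitable) -> bool:
    """Close the object only if it is a coroutine object.

    Returns True when close() was called. Futures, tasks, and custom
    awaitables are left untouched because they do not share the coroutine
    close() contract.
    """
    if inspect.iscoroutine(awaitable):
        awaitable.close()
        return True
    return False
```

This replaces the duck-typed `getattr(awaitable, "close", None)` lookup with an explicit type check, so an unrelated object that happens to expose a `close` attribute is never touched.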

6. `LiteLLM` import: stray `import time` at top vs. lazy `import litellm`

providers/litellm.py:9 adds import time at module level — fine — but aembed_documents (deadline path) is the only consumer. Just an FYI for consistency with the existing lazy-import pattern; not a bug.

Smaller notes

pypi-publish.yaml: defaults.run.working-directory: graphrag_sdk applies only to run: steps; upload-artifact and pypa/gh-action-pypi-publish see paths from repo root. graphrag_sdk/dist/ is correct from repo root, so this works — but it's brittle. A comment would help.
_validate_history swap from try/except on ChatMessage(...) to explicit type checks (api/main.py:1488-1503) is a good change — old code swallowed real bugs.
core/connection.py:194-208: the `if last_exc is not None` block is written twice; the second one is needed but the first ends in `raise last_exc`. Reads slightly oddly — fine functionally.
Async context manager tests reference __aenter__/__aexit__ not in the diff. They exist in api/main.py:213-237 already — verified, no missing implementation.
Integration test discipline (-m integration marker + RUN_INTEGRATION=1 doc) is well executed.

Test coverage

Strong. New tests cover timeout propagation per provider, budget exhaustion at each retrieval phase, log-redaction sanity (SECRET_KEY_xyz/proxy= exclusion from error logs), context-manager close-failure behavior, and the aembed_documents overall-deadline guarantee.

Risk assessment

Medium-low. The semantic change with most blast radius is making every retrieval step budget-aware — well-behaved when no budget is set (no-op), but callers passing tight budgets will now see LatencyBudgetExceededError mid-flow rather than silent timeouts. This is intentional and an improvement, but is API-visible.

Recommendation

Request changes for items 1 (cleanup) and ideally 2 (sentinel pattern). Items 3–5 are worth addressing but non-blocking. The bulk of the work is solid.

galshubeli

Follow-up review — commit `4fc166b`

Verified each item from the previous review against the follow-up commit.

Resolved

| # | Issue | Status |
| --- | --- | --- |
| 1 | Duplicate comment lines in `multi_path.py` | ✅ Removed at steps 4–8 |
| 2 | Shared mutable `_UNBOUNDED = Context()` singleton across 5 retrieval modules | ✅ Replaced with `ctx: Context \| None = None` + per-call-site `if ctx is not None: ctx.ensure_budget(...)` guards |
| 3 | Inconsistent `timeout=0` behavior between `validate_timeout` (ValueError) and `wait_for_provider_call` (typed timeout) | ✅ Dead `if timeout <= 0` branch removed from `_timeout.py`; `validate_timeout` is now the single source of truth; dead-code test removed |
| 4 | `remaining_budget_seconds` 1e-9 clamp would hand the provider a sub-millisecond deadline | ✅ Clamp removed; new `Context.provider_timeout_seconds(operation, min_remaining_ms=1.0)` raises `LatencyBudgetExceededError` if remaining ≤ 1 ms, otherwise returns the seconds value. All budget-derived provider timeouts now route through it. |
| 5 | `wait_for_provider_call` calling `close()` on arbitrary awaitables | ✅ Eliminated with the `timeout <= 0` branch |
| - | Doubled `if last_exc is not None:` block in `connection.py` | ✅ Consolidated into a single guard wrapping both `logger.debug` and `raise last_exc` |

New tests in test_context.py cover both the raise-when-low and return-remaining paths of provider_timeout_seconds. Good coverage.
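
For reference, the behavior described for `provider_timeout_seconds` can be sketched as below. The method name and its `min_remaining_ms=1.0` default come from the table above; everything else (a simplified `Context`, `remaining_budget_seconds` as a property, returning `None` when no budget is set) is an assumption.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


class LatencyBudgetExceededError(Exception):
    """Stand-in for the SDK's typed budget error (hypothetical definition)."""


@dataclass
class Context:
    """Simplified stand-in for the SDK's Context."""
    latency_budget_ms: Optional[float] = None
    started_at: float = field(default_factory=time.monotonic)

    @property
    def remaining_budget_seconds(self) -> Optional[float]:
        """Seconds left in the budget, or None when no budget is configured."""
        if self.latency_budget_ms is None:
            return None
        elapsed_ms = (time.monotonic() - self.started_at) * 1000.0
        return (self.latency_budget_ms - elapsed_ms) / 1000.0

    def provider_timeout_seconds(
        self, operation: str, min_remaining_ms: float = 1.0
    ) -> Optional[float]:
        """Return a timeout for a provider call, failing fast if too little is left."""
        remaining = self.remaining_budget_seconds
        if remaining is None:
            return None  # unbounded: let the provider use its own default
        if remaining * 1000.0 <= min_remaining_ms:
            raise LatencyBudgetExceededError(
                f"latency budget exhausted before {operation}"
            )
        return remaining
```

A caller would then pass `timeout=ctx.provider_timeout_seconds("completion LLM call")` to the provider, so a nearly exhausted budget raises before the call starts instead of producing an infeasible deadline.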

Minor follow-up (non-blocking)

provider_timeout_seconds itself raises LatencyBudgetExceededError when remaining ≤ 1 ms, so the immediately-preceding ctx.ensure_budget(operation) call is redundant at every site that pairs them — e.g. local.py:55-58, cosine.py:43-46, api/main.py:1527-1532 and api/main.py:1656-1660. The two have slightly different thresholds (ensure_budget fires at remaining ≤ 0, provider_timeout_seconds at remaining ≤ 1 ms), but since the latter is stricter, the prior call adds nothing — it can never fire when the next call wouldn't.

Two ways to clean up:

Drop the redundant ensure_budget(...) and rely on provider_timeout_seconds(...) raising.
Or keep ensure_budget only for the call sites that don't pass a timeout (graph queries, vector searches), and drop it elsewhere.

Either way, current behavior is correct — just a small simplification opportunity.

Recommendation

LGTM to merge. Optional polish above can be done in a follow-up or skipped.

Naseem77 added 6 commits May 18, 2026 16:54

Enforce provider async timeouts

a6f7fe8

Add typed timeout errors for LLM and embedding calls and wrap async provider operations with asyncio.wait_for. Cover base, LiteLLM, and OpenRouter async paths with regression tests.

Cover GraphRAG async context cleanup

7e7ec1d

Add regression tests for async context manager cleanup, close failure propagation, and inner-exception preservation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
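
The kind of behavior such tests pin down can be sketched with a stand-in class. `FakeGraphRAG` below is hypothetical and only mimics an async context manager that closes on exit; it is not the SDK's facade, and the real tests in test_facade.py may assert more or differently.

```python
import asyncio


class FakeGraphRAG:
    """Hypothetical stand-in for an object that must be closed on exit."""

    def __init__(self, fail_on_close=False):
        self.closed = False
        self.fail_on_close = fail_on_close

    async def close(self):
        self.closed = True
        if self.fail_on_close:
            raise RuntimeError("close failed")

    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc, tb):
        await self.close()
        return False  # never swallow an exception raised in the body


async def check_cleanup():
    results = {}

    # An exception raised inside the block still reaches the caller,
    # and close() has run by the time it does.
    rag = FakeGraphRAG()
    try:
        async with rag:
            raise ValueError("inner")
    except ValueError as err:
        results["inner_preserved"] = str(err) == "inner" and rag.closed

    # A failure in close() on an otherwise clean exit is not hidden.
    failing = FakeGraphRAG(fail_on_close=True)
    try:
        async with failing:
            pass
    except RuntimeError as err:
        results["close_propagated"] = str(err) == "close failed"

    return results
```

The sketch deliberately leaves out the case where both the body and `close()` fail, since the commit message does not say which exception the SDK surfaces there.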

Mark real FalkorDB integration tests

cd760dc

Add an explicit integration marker, run marked real-FalkorDB tests in CI, document docker-compose usage, and expose the FalkorDB browser port in local compose.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Tighten release automation

8002ada

Validate Python distributions before trusted PyPI publishing, upload release artifacts, enable manual docs deploys, and add Dependabot coverage for actions and Python dependencies.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-code-quality Bot found potential problems May 18, 2026

Comment thread graphrag_sdk/tests/test_facade.py Fixed

Address facade context manager review warning

9dfa72b

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-code-quality Bot found potential problems May 18, 2026

Comment thread graphrag_sdk/tests/test_facade.py Fixed

Address remaining facade test reachability warning

3eca0b4

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Naseem77 requested a review from galshubeli May 18, 2026 17:19

Naseem77 linked an issue May 18, 2026 that may be closed by this pull request

Production Hardening & Release Readiness #257

Closed

6 tasks

Naseem77 mentioned this pull request May 18, 2026

Production Hardening & Release Readiness #257

Closed

6 tasks

galshubeli reviewed May 19, 2026

Naseem77 and others added 4 commits May 20, 2026 10:58

Preserve FalkorDB query exception types

3cc7e4e

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Enforce provider embedding deadlines

ada5b7e

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Wire latency budgets into retrieval calls

94d9984

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Bind FalkorDB browser to localhost

fea6b86

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

coderabbitai Bot reviewed May 20, 2026

Comment thread graphrag_sdk/src/graphrag_sdk/core/providers/_timeout.py Outdated

Comment thread graphrag_sdk/tests/test_providers.py

Naseem77 and others added 2 commits May 20, 2026 11:20

Raise typed errors for exhausted provider timeouts

d68c1c1

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Fix graph store lint warnings

3b662cb

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

coderabbitai Bot reviewed May 20, 2026

Comment thread graphrag_sdk/src/graphrag_sdk/core/context.py Outdated

Format LiteLLM provider

89509a2

Co-Authored-By: Copilot <223556219+Copilot@users.noreply.github.com>

Naseem77 requested a review from galshubeli May 20, 2026 08:57

galshubeli reviewed May 24, 2026

Address reliability review follow-ups

4fc166b

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Naseem77 requested a review from galshubeli May 25, 2026 08:10

galshubeli reviewed May 25, 2026

galshubeli approved these changes May 25, 2026

galshubeli merged commit 98c2eb4 into main May 25, 2026
10 checks passed

galshubeli deleted the reliability-hardening-suite branch May 25, 2026 08:38



