
feat: add retry backoff for transient failures #30

Merged
codebestia merged 2 commits into ShadeProtocol:main from G-ELM:feat/exponential-backoff on Jun 26, 2026

Conversation

@G-ELM G-ELM commented Jun 26, 2026

Contributor

Description

This PR adds automatic retry handling for transient network failures and server-side 5xx responses. The SDK now retries connection resets, timeouts, and 502/503/504 responses with exponential backoff plus jitter, while immediately surfacing non-retryable client errors such as 400s. Retry attempts are logged at DEBUG level, and exhausted retries raise NetworkError.
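For illustration, exponential backoff plus jitter could be computed roughly as below. This is a minimal sketch, not the code from this PR: the function name `backoff_delay` and the `base`, `cap`, and `jitter` parameters are assumptions chosen for the example.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0,
                  jitter: float = 0.5) -> float:
    """Return the wait in seconds before retry number `attempt` (0-based).

    Hypothetical helper: the exponential part doubles on each attempt and is
    capped at `cap`; a uniform random jitter in [0, jitter] is added on top so
    that many clients do not retry in lockstep.
    """
    exponential = min(cap, base * (2 ** attempt))
    return exponential + random.uniform(0.0, jitter)
```

With the assumed defaults, the first retry waits between 1.0 and 1.5 seconds, the second between 2.0 and 2.5 seconds, and so on until the cap is reached.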

Fixes #10

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

  • Integration test
  • Unit test

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Summary by CodeRabbit

  • Bug Fixes
    • Improved request handling for transient network issues and temporary server errors (502/503/504), with automatic jittered exponential backoff retries in both sync and async clients.
    • Updated error mapping for common HTTP responses: clearer invalid-request (400), authentication (401/403), not-found (404), and consistent behavior when retries are exhausted.
  • Tests
    • Added/expanded retry tests to verify recovery after transient 503s, immediate failure for invalid requests, and correct retry exhaustion behavior (including asserted backoff timing).

coderabbitai Bot commented Jun 26, 2026


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 862be7aa-63c7-4f90-921f-9d67223aac1a

📥 Commits

Reviewing files that changed from the base of the PR and between 9c6d49d and ef4062a.

📒 Files selected for processing (1)
  • src/shade/http.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/shade/http.py

📝 Walkthrough

Walkthrough

HTTP requests now retry transient transport failures and 502/503/504 responses with jittered exponential backoff. Status handling now raises specific SDK errors for 400, 401, 403, and 404. Sync retry behavior is covered by new tests.
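The status handling described above might look something like the following sketch. Only `NetworkError` and `NotFoundError` are named in this conversation; `ShadeError`, `InvalidRequestError`, `AuthenticationError`, and the `handle_status` function with its boolean return are hypothetical stand-ins, and the real `_raise_for_status` may be shaped differently.

```python
class ShadeError(Exception):
    """Hypothetical base class for SDK errors."""


class InvalidRequestError(ShadeError):
    """Hypothetical error for HTTP 400."""


class AuthenticationError(ShadeError):
    """Hypothetical error for HTTP 401/403."""


class NotFoundError(ShadeError):
    """Error for HTTP 404 (name taken from the review comments)."""


class NetworkError(ShadeError):
    """Error raised once retries are exhausted (name taken from the PR)."""


RETRYABLE_STATUSES = frozenset({502, 503, 504})


def handle_status(status: int, body: str, attempts_left: int) -> bool:
    """Return True if the caller should retry, False on success.

    Non-retryable statuses raise immediately; retryable ones raise
    NetworkError only when no attempts remain.
    """
    if 200 <= status < 300:
        return False
    if status == 400:
        raise InvalidRequestError(body)
    if status in (401, 403):
        raise AuthenticationError(body)
    if status == 404:
        # Pass the body through so details such as the resource can be parsed.
        raise NotFoundError(body)
    if status in RETRYABLE_STATUSES:
        if attempts_left > 0:
            return True
        raise NetworkError(f"HTTP {status} after retries were exhausted")
    raise ShadeError(f"unexpected HTTP {status}: {body}")
```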

Changes

HTTP retry and status mapping

| Layer | File(s) | Summary |
|---|---|---|
| Retry helpers and status mapping | src/shade/http.py | Imports backoff and retry dependencies, adds retry classification helpers, and changes _raise_for_status to map 400/401/403/404 to specific SDK errors and 502/503/504 to retry or NetworkError. |
| Sync retry loop and tests | src/shade/http.py, tests/test_rate_limit.py | SyncHTTPClient.request now retries retryable transport exceptions with exponential backoff and uses _raise_for_status; tests cover 503 success-after-retry, immediate 400 failure, and exhausted retries raising NetworkError. |
| Async retry loop | src/shade/http.py | AsyncHTTPClient.request now retries retryable transport exceptions with asyncio.sleep, raises NetworkError after the retry limit, and still defers status handling to _raise_for_status. |
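As a rough sketch of the async row above, a retry loop built on `asyncio.sleep` might be written as follows. The helper name `request_with_retry`, its parameters, the jitter range, and the `RETRYABLE` tuple are assumptions for the example; the `NetworkError` class here is a local stand-in for the SDK's, and status handling is deliberately left out.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class NetworkError(Exception):
    """Local stand-in for the SDK's NetworkError."""


# Assumed set of retryable transport exceptions for this sketch.
RETRYABLE = (ConnectionResetError, TimeoutError)


async def request_with_retry(send: Callable[[], Awaitable[T]],
                             max_retries: int = 3,
                             base: float = 1.0) -> T:
    """Await `send()` and retry retryable transport errors with backoff."""
    for attempt in range(max_retries + 1):
        try:
            return await send()
        except RETRYABLE as exc:
            if attempt == max_retries:
                raise NetworkError(
                    f"gave up after {attempt + 1} attempts") from exc
            # asyncio.sleep yields to the event loop instead of blocking it.
            delay = base * (2 ** attempt) + random.uniform(0.0, base / 2)
            await asyncio.sleep(delay)
    raise AssertionError("unreachable for max_retries >= 0")
```

Using `asyncio.sleep` rather than `time.sleep` keeps other coroutines running while a request waits out its backoff.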

Sequence Diagram(s)

sequenceDiagram
  participant Caller
  participant "SyncHTTPClient.request" as SyncHTTPClient_request
  participant "_execute" as execute
  participant "_raise_for_status" as raiseStatus
  participant "time.sleep" as sleep

  Caller->>SyncHTTPClient_request: request(...)
  SyncHTTPClient_request->>execute: send HTTP request
  execute-->>SyncHTTPClient_request: response or transport exception

  alt transport exception is retryable
    SyncHTTPClient_request->>sleep: backoff delay
    SyncHTTPClient_request->>execute: retry request
  else response status is 502/503/504 and attempts remain
    SyncHTTPClient_request->>raiseStatus: inspect status
    raiseStatus-->>SyncHTTPClient_request: wait interval
    SyncHTTPClient_request->>sleep: wait interval
  else response status is 400/401/403/404
    SyncHTTPClient_request->>raiseStatus: map status
    raiseStatus-->>Caller: specific SDK error
  else response is 2xx
    SyncHTTPClient_request-->>Caller: parsed JSON
  else retries exhausted
    SyncHTTPClient_request->>raiseStatus: exhausted retry
    raiseStatus-->>Caller: NetworkError
  end
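
The branches in the diagram can be approximated in plain Python as below. This is a sketch under assumptions, not the merged `SyncHTTPClient.request`: `request_sync`, its `execute` callable returning `(status, body)`, the injectable `sleep`, and the single `ClientError` class (standing in for the specific SDK errors) are all hypothetical.

```python
import random
import time
from typing import Callable, Tuple


class NetworkError(Exception):
    """Local stand-in for the SDK's NetworkError."""


class ClientError(Exception):
    """Hypothetical stand-in for the specific 400/401/403/404 errors."""


RETRYABLE_STATUSES = {502, 503, 504}
RETRYABLE_EXCEPTIONS = (ConnectionResetError, TimeoutError)


def request_sync(execute: Callable[[], Tuple[int, str]],
                 max_retries: int = 3, base: float = 1.0,
                 jitter: float = 0.5,
                 sleep: Callable[[float], None] = time.sleep) -> str:
    """Call `execute()` and retry transport errors and 502/503/504."""
    for attempt in range(max_retries + 1):
        attempts_remain = attempt < max_retries
        try:
            status, body = execute()
        except RETRYABLE_EXCEPTIONS as exc:
            if not attempts_remain:
                raise NetworkError("transport failure after retries") from exc
        else:
            if 200 <= status < 300:
                return body
            if status not in RETRYABLE_STATUSES:
                # Non-retryable statuses surface immediately.
                raise ClientError(f"HTTP {status}: {body}")
            if not attempts_remain:
                raise NetworkError(
                    f"HTTP {status} after {attempt + 1} attempts")
        # Reached only when a retry is due: back off, then loop again.
        sleep(base * (2 ** attempt) + random.uniform(0.0, jitter))
    raise AssertionError("unreachable for max_retries >= 0")
```

Passing `sleep` as a parameter is a choice made for this sketch so the backoff can be observed without waiting; the tests in this PR patch `time.sleep` instead.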

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • ShadeProtocol/shade-python#24: Updates to src/shade/http.py and tests/test_rate_limit.py directly extend the retry and error-mapping behavior introduced there.

Poem

🐇 I hopped through 503 and back again,
With jittered paws and a patient grin.
400 stayed firm, no retry in sight,
While backoff twinkled through the night.
Now requests bounce less, and code feels bright!

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Linked Issues check | ⚠️ Warning | The PR implements retries for transient failures, but it drops the separate 429 handling required by issue #10. | Restore separate 429 handling while keeping retries for ConnectError, TimeoutException, and 502/503/504, then re-run the retry tests. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 43.75%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title is concise and accurately summarizes the main change: retry backoff for transient failures. |
| Description check | ✅ Passed | The description includes the summary, issue link, type of change, testing, and checklist sections required by the template. |
| Out of Scope Changes check | ✅ Passed | The changes stay within HTTP retry/error handling and tests, with no clearly unrelated code paths introduced. |

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
tests/test_rate_limit.py (1)

270-330: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick win

Add coverage for transport-exception retries.

These tests only exercise status-code retries. Please add a sync test where _execute raises a retryable transport exception, then succeeds, so the new except retry path is protected.

Example test
+    def test_retries_on_transport_error_then_succeeds(self):
+        client = self._client(max_retries=2)
+        calls = 0
+
+        def fake_execute(req):
+            nonlocal calls
+            calls += 1
+            if calls <= 2:
+                raise TimeoutError("temporary timeout")
+            return 200, {}, _fake_200_body()
+
+        sleep_calls: List[float] = []
+        with patch.object(client, "_execute", side_effect=fake_execute), \
+             patch("time.sleep", side_effect=lambda s: sleep_calls.append(s)), \
+             patch("shade.http.random.uniform", side_effect=[0.0, 0.0]):
+            result = client.request("GET", "/payments")
+
+        assert result == {"id": "pay_123", "status": "ok"}
+        assert calls == 3
+        assert sleep_calls == [1.0, 2.0]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_rate_limit.py` around lines 270 - 330, Add a new synchronous test
in test_rate_limit.py alongside test_retries_on_transient_5xx_then_succeeds that
covers the retry path when _execute raises a retryable transport exception
instead of returning a 5xx. Patch client._execute to raise the transport
exception once or twice and then return a 200 response, and assert
client.request("GET", "/payments") succeeds and time.sleep is called with the
expected backoff values. Use the existing client.request, _execute, and sleep
patching pattern so the new except-based retry handling stays covered.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/shade/http.py`:
- Around line 182-183: The 404 handling in http.py is discarding the response
body before raising NotFoundError, so resource_type and resource_id cannot be
parsed. Update the status == 404 branch in the response handling logic to pass
through the original response body when constructing NotFoundError instead of
always using a generic message. Use the existing error-raising path in the HTTP
response handler so NotFoundError can extract the resource details.
- Around line 86-92: The retry classifier in _is_retryable_transport_error is
missing common aiohttp transport failures, so async connection hiccups can skip
retries. Update _is_retryable_transport_error to treat
aiohttp.ClientConnectionError and its common subclasses such as
ClientConnectorError, ClientOSError, and ServerDisconnectedError as retryable,
while keeping ServerTimeoutError covered via TimeoutError. Make sure the new
checks fit alongside the existing httpx, ConnectionResetError, TimeoutError, and
urllib.error.URLError handling.

---

Nitpick comments:
In `@tests/test_rate_limit.py`:
- Around line 270-330: Add a new synchronous test in test_rate_limit.py
alongside test_retries_on_transient_5xx_then_succeeds that covers the retry path
when _execute raises a retryable transport exception instead of returning a 5xx.
Patch client._execute to raise the transport exception once or twice and then
return a 200 response, and assert client.request("GET", "/payments") succeeds
and time.sleep is called with the expected backoff values. Use the existing
client.request, _execute, and sleep patching pattern so the new except-based
retry handling stays covered.
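
The aiohttp finding above could be addressed along these lines. This is a hedged sketch rather than the actual `_is_retryable_transport_error`: the httpx checks are omitted, aiohttp is imported optionally so the snippet runs without it, and excluding `urllib.error.HTTPError` is an assumption of this example.

```python
import urllib.error

try:
    import aiohttp  # optional; the check below is skipped when absent
except ImportError:
    aiohttp = None


def is_retryable_transport_error(exc: BaseException) -> bool:
    """Return True for connection-level failures that are worth retrying."""
    # HTTPError subclasses URLError but carries a real HTTP status, so this
    # sketch leaves it to status handling (an assumption, not from the PR).
    if isinstance(exc, urllib.error.HTTPError):
        return False
    if isinstance(exc, (ConnectionResetError, TimeoutError,
                        urllib.error.URLError)):
        return True
    # ClientConnectorError, ClientOSError, and ServerDisconnectedError all
    # derive from aiohttp.ClientConnectionError, so one check covers them.
    if aiohttp is not None and isinstance(exc, aiohttp.ClientConnectionError):
        return True
    return False
```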

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 5dabe333-6e35-47dc-98b6-5d6e73500046

📥 Commits

Reviewing files that changed from the base of the PR and between f6a4cff and 9c6d49d.

📒 Files selected for processing (2)
  • src/shade/http.py
  • tests/test_rate_limit.py


@codebestia codebestia left a comment

Contributor


LGTM!
Thank you for your contribution.

@codebestia codebestia merged commit ed3c131 into ShadeProtocol:main Jun 26, 2026
2 checks passed


Development

Successfully merging this pull request may close these issues.

Implement exponential backoff retry logic

2 participants