Skip to content

fix: improve leader failover reconnection and error mapping#479

Closed
w1am wants to merge 3 commits intomasterfrom
w1am/test-leader-failover
Closed

fix: improve leader failover reconnection and error mapping#479
w1am wants to merge 3 commits intomasterfrom
w1am/test-leader-failover

Conversation

@w1am
Copy link
Copy Markdown
Collaborator

@w1am w1am commented Apr 13, 2026

Summary

  • Add comprehensive reconnection tests for leader failover scenarios covering both gRPC write and Rust bridge read code paths, including concurrent operations and repeated reads
  • Map gRPC INTERNAL and DATA_LOSS status codes to UnavailableError so the client retries on these transient failures during leader failover
  • Improve bridge error conversion to handle UnavailableError, DeadlineExceededError, and UnknownError, use string literals for error name matching, and default unrecognized errors to UnknownError
  • Bump @kurrent/bridge from 0.1.3 to 0.1.5

Test plan

  • Verify new leader-failover.test.ts tests pass against a 3-node cluster
  • Confirm write operations reconnect after leader kill/resurrect
  • Confirm read operations (Rust bridge path) reconnect after NotLeaderError
  • Confirm concurrent mixed read/write operations recover after NotLeaderError
  • Verify existing reconnection tests still pass

w1am added 3 commits April 9, 2026 13:23
test: fix leader failover tests for dual-connection architecture

- Remove redundant write reconnection test (covered by UnavailableError.test.ts)
- Remove test.only marker
- Remove customer name references from comments
- Fix "readStream after kill" test with retry loop for Rust client stabilization
- Fix "mixed read/write" test to trigger NotLeader on both gRPC and bridge paths
Bump @kurrent/bridge from 0.1.3 to 0.1.5 in db-client and benchmark
packages to pick up improved error type support for leader failover
scenarios.
- Map gRPC INTERNAL and DATA_LOSS status codes to UnavailableError so
  the client retries on these transient failures during leader failover
- Add UnavailableError, DeadlineExceededError, and UnknownError handling
  to bridge error conversion
- Use string literals for error name matching instead of class .name
  references for reliability
- Remove stale commented-out code in convertBridgeError
- Default unrecognized bridge errors to UnknownError instead of passing
  through the raw error
@qodo-code-review
Copy link
Copy Markdown

Review Summary by Qodo

Add leader failover tests and improve error mapping for reconnection

🐞 Bug fix 🧪 Tests ✨ Enhancement

Grey Divider

Walkthroughs

Description
• Add comprehensive leader failover reconnection tests covering gRPC writes and Rust bridge reads
• Map gRPC INTERNAL and DATA_LOSS status codes to UnavailableError for automatic client retry
• Improve bridge error conversion with string literals and handle UnavailableError,
  DeadlineExceededError, UnknownError
• Bump @kurrent/bridge dependency from 0.1.3 to 0.1.5
Diagram
flowchart LR
  A["gRPC Status Codes<br/>INTERNAL, DATA_LOSS"] -->|"Map to UnavailableError"| B["Client Retry Logic"]
  C["Bridge Error Conversion<br/>String Literals"] -->|"Handle Error Types"| D["UnavailableError<br/>DeadlineExceededError<br/>UnknownError"]
  E["Leader Failover Tests<br/>Write/Read/Concurrent"] -->|"Verify Reconnection"| F["Cluster Stability"]
  G["@kurrent/bridge 0.1.5"] -->|"Improved Error Support"| B
Loading

Grey Divider

File Changes

1. packages/db-client/src/utils/CommandError.ts 🐞 Bug fix +2/-0

Map gRPC status codes to UnavailableError

• Map gRPC StatusCode.INTERNAL and StatusCode.DATA_LOSS to UnavailableError
• Enables automatic client retry on these transient failures during leader failover

packages/db-client/src/utils/CommandError.ts


2. packages/db-client/src/utils/convertBridgeError.ts ✨ Enhancement +14/-25

Improve bridge error conversion and handling

• Replace dynamic error class name references with string literals for reliability
• Add handling for UnavailableError, DeadlineExceededError, and UnknownError
• Default unrecognized errors to UnknownError instead of passing through raw error
• Remove stale commented-out code block

packages/db-client/src/utils/convertBridgeError.ts


3. packages/test/src/connection/reconnect/leader-failover.test.ts 🧪 Tests +441/-0

Add leader failover reconnection test suite

• Add comprehensive test suite for leader failover scenarios with 7 test cases
• Test write operations (gRPC path) with leader kill/resurrect cycles
• Test read operations (Rust bridge path) with NotLeader errors and reconnection
• Test concurrent mixed read/write operations during failover
• Include retry loops for Rust client stabilization after cluster changes

packages/test/src/connection/reconnect/leader-failover.test.ts


View more (2)
4. packages/benchmark/package.json Dependencies +1/-1

Update bridge dependency version

• Bump @kurrent/bridge from 0.1.3 to 0.1.5

packages/benchmark/package.json


5. packages/db-client/package.json Dependencies +1/-1

Update bridge dependency version

• Bump @kurrent/bridge from 0.1.3 to 0.1.5

packages/db-client/package.json


Grey Divider

Qodo Logo

@qodo-code-review
Copy link
Copy Markdown

qodo-code-review bot commented Apr 13, 2026

Code Review by Qodo

🐞 Bugs (3)   📘 Rule violations (0)   📎 Requirement gaps (0)   🖥 UI issues (0)   🎨 UX Issues (0)
🐞\ ≡ Correctness (1) ☼ Reliability (1) ⚙ Maintainability (1)

Grey Divider


Action required

1. Unknown errors get masked 🐞
Description
convertBridgeError now converts any unrecognized thrown value into UnknownError, so non-bridge
exceptions (e.g., bugs thrown while converting/yielding events) get misclassified and lose their
original error type/stack. This is especially problematic because readStream/readAll catch broadly
and route all exceptions through convertBridgeError.
Code

packages/db-client/src/utils/convertBridgeError.ts[R16-33]

  switch (error.name) {
-    case StreamNotFoundError.name:
+    case "StreamNotFoundError":
      return new StreamNotFoundError(serviceError, stream);
-    case StreamDeletedError.name:
+    case "StreamDeletedError":
      return StreamDeletedError.fromStreamName(stream);
-    case NotLeaderError.name:
+    case "NotLeaderError":
      return new NotLeaderError(serviceError);
-    case AccessDeniedError.name:
+    case "AccessDeniedError":
      return new AccessDeniedError(serviceError);
+    case "UnavailableError":
+      return new UnavailableError(serviceError);
+    case "DeadlineExceededError":
+      return new DeadlineExceededError(serviceError);
+    case "UnknownError":
+      return new UnknownError(serviceError);
    default:
-      return error;
+      return new UnknownError(serviceError);
  }
Evidence
readStream/readAll catch *any* exception from the async iteration (including conversion/yield code
paths) and rethrow convertBridgeError(...). With the new default branch returning UnknownError,
unexpected runtime errors are now always wrapped instead of propagating as-is.

packages/db-client/src/utils/convertBridgeError.ts[12-33]
packages/db-client/src/streams/readStream.ts[114-126]
packages/db-client/src/streams/readAll.ts[84-96]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`convertBridgeError` now wraps *all* unrecognized errors into `UnknownError`. Because `readStream` / `readAll` catch broadly around async iteration, this change masks non-bridge exceptions (e.g., `TypeError`, conversion bugs) and discards the original error identity/stack.

## Issue Context
The bridge conversion helper should only translate *known* bridge-originated errors. Unexpected errors should propagate unchanged (or be wrapped while preserving the original error as `cause` and keeping the original stack).

## Fix Focus Areas
- packages/db-client/src/utils/convertBridgeError.ts[12-33]
- packages/db-client/src/streams/readStream.ts[114-126]
- packages/db-client/src/streams/readAll.ts[84-96]

## Suggested approach
- In `convertBridgeError`, before switching/wrapping, detect whether the error is a bridge/grpc-like error (e.g., has `code` and/or `metadata`) or matches a known bridge error name.
- If it’s not a recognized bridge error, return/throw the original `error` unchanged.
- If you still want to map unknown *bridge* errors, wrap them but preserve the original error via `cause` (and/or attach the original error object on the wrapper) so debugging remains possible.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

2. INTERNAL treated unavailable 🐞
Description
convertToCommandError now maps gRPC INTERNAL and DATA_LOSS to UnavailableError, which the client
explicitly treats as “server unavailable” and uses to trigger channel teardown/reconnect. This
broadens reconnection/retry classification to errors that are not necessarily
connectivity/unavailability related.
Code

packages/db-client/src/utils/CommandError.ts[R640-645]

    case StatusCode.DEADLINE_EXCEEDED:
      return new DeadlineExceededError(error);
    case StatusCode.UNAVAILABLE:
+    case StatusCode.INTERNAL:
+    case StatusCode.DATA_LOSS:
      return new UnavailableError(error);
Evidence
convertToCommandError now returns UnavailableError for INTERNAL/DATA_LOSS. Client.shouldReconnect
interprets UnavailableError as “Server is unavailable to take request” and will reconnect/tear down
the channel when seen, changing behavior for these status codes across operations that call
convertToCommandError.

packages/db-client/src/utils/CommandError.ts[622-646]
packages/db-client/src/Client/index.ts[513-528]
packages/db-client/src/streams/appendToStream/append.ts[59-69]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`StatusCode.INTERNAL` and `StatusCode.DATA_LOSS` are now mapped to `UnavailableError`, which drives reconnection logic (channel teardown). This expands “unavailable” handling beyond the cases the code comment describes.

## Issue Context
`Client.shouldReconnect` explicitly treats `UnavailableError` as a reconnection signal. If INTERNAL/DATA_LOSS can represent non-unavailability failures in your system, this may cause unnecessary reconnections and retries.

## Fix Focus Areas
- packages/db-client/src/utils/CommandError.ts[622-646]
- packages/db-client/src/Client/index.ts[513-528]

## Suggested approach
- Add a guard so INTERNAL/DATA_LOSS are only mapped to `UnavailableError` when you can identify the specific transient failover signature (e.g., via details/metadata markers used during leader failover).
- Alternatively, consider mapping only one of the codes (or keeping them as `UnknownError`) and add targeted tests proving the intended leader-failover behavior for whichever code(s) you keep.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


3. Failover test lacks asserts 🐞
Description
Some steps labeled “should fail” swallow errors without asserting the failure happened, so the test
can pass even if the expected failure does not occur. This reduces the reliability of the new
reconnection coverage.
Code

packages/test/src/connection/reconnect/leader-failover.test.ts[R57-66]

+      // First operation should fail
+      try {
+        await client.appendToStream(
+          "resurrect-stream",
+          jsonEvent({ type: "should-fail", data: { message: "test" } }),
+          { credentials: { username: "admin", password: "changeit" } }
+        );
+      } catch (error) {
+        // Expected failure
+      }
Evidence
In the write failover test, the “First operation should fail” block has no assertion if the append
unexpectedly succeeds. Similarly, in the read failover test after killing the leader, the “Read
should fail” block has no ‘unreachable’ assertion and will silently continue if the read succeeds.

packages/test/src/connection/reconnect/leader-failover.test.ts[57-66]
packages/test/src/connection/reconnect/leader-failover.test.ts[220-230]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The new leader failover tests contain “should fail” steps that do not assert that a failure actually occurred, which can allow false positives.

## Issue Context
Try/catch with an empty catch block only verifies behavior if there is an explicit assertion in the try block (e.g., an “unreachable” expectation) or a flag checked after the catch.

## Fix Focus Areas
- packages/test/src/connection/reconnect/leader-failover.test.ts[57-66]
- packages/test/src/connection/reconnect/leader-failover.test.ts[220-230]

## Suggested approach
- Replace the try/catch blocks with `await expect(promise).rejects.toThrow()` (and ideally assert the error type).
- Or set a boolean like `let failed = false;` and assert `expect(failed).toBe(true)` after the try/catch.
- Consider asserting the error class (e.g., `UnavailableError`, `NotLeaderError`) to ensure you’re exercising the intended failure mode.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

ⓘ The new review experience is currently in Beta. Learn more

Grey Divider

Qodo Logo

@w1am w1am closed this Apr 13, 2026
@w1am w1am deleted the w1am/test-leader-failover branch April 13, 2026 10:25
Comment on lines 16 to 33
switch (error.name) {
case StreamNotFoundError.name:
case "StreamNotFoundError":
return new StreamNotFoundError(serviceError, stream);
case StreamDeletedError.name:
case "StreamDeletedError":
return StreamDeletedError.fromStreamName(stream);
case NotLeaderError.name:
case "NotLeaderError":
return new NotLeaderError(serviceError);
case AccessDeniedError.name:
case "AccessDeniedError":
return new AccessDeniedError(serviceError);
case "UnavailableError":
return new UnavailableError(serviceError);
case "DeadlineExceededError":
return new DeadlineExceededError(serviceError);
case "UnknownError":
return new UnknownError(serviceError);
default:
return error;
return new UnknownError(serviceError);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Action required

1. Unknown errors get masked 🐞 Bug ≡ Correctness

convertBridgeError now converts any unrecognized thrown value into UnknownError, so non-bridge
exceptions (e.g., bugs thrown while converting/yielding events) get misclassified and lose their
original error type/stack. This is especially problematic because readStream/readAll catch broadly
and route all exceptions through convertBridgeError.
Agent Prompt
## Issue description
`convertBridgeError` now wraps *all* unrecognized errors into `UnknownError`. Because `readStream` / `readAll` catch broadly around async iteration, this change masks non-bridge exceptions (e.g., `TypeError`, conversion bugs) and discards the original error identity/stack.

## Issue Context
The bridge conversion helper should only translate *known* bridge-originated errors. Unexpected errors should propagate unchanged (or be wrapped while preserving the original error as `cause` and keeping the original stack).

## Fix Focus Areas
- packages/db-client/src/utils/convertBridgeError.ts[12-33]
- packages/db-client/src/streams/readStream.ts[114-126]
- packages/db-client/src/streams/readAll.ts[84-96]

## Suggested approach
- In `convertBridgeError`, before switching/wrapping, detect whether the error is a bridge/grpc-like error (e.g., has `code` and/or `metadata`) or matches a known bridge error name.
- If it’s not a recognized bridge error, return/throw the original `error` unchanged.
- If you still want to map unknown *bridge* errors, wrap them but preserve the original error via `cause` (and/or attach the original error object on the wrapper) so debugging remains possible.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant