nri-kafka: fix sendSync hanging forever on delivery failure #159

Merged
omnibs merged 2 commits into trunk from fix-sendsync-delivery-failure
Jun 15, 2026

@omnibs omnibs commented Jun 13, 2026

Member

Problem

Note: We never actually observed this behavior, in production or anywhere else. It may simply be that we have never failed to write a synchronous message to Kafka 😅. Claude pointed out the issue during the sendSync performance work in #151, and since we'll rely on sendSync in the event platform, I thought we'd better have it covered.

Kafka.sendSync blocked on a TMVar that the delivery callback signaled only on the success branch. When a message was enqueued but the broker ultimately failed to deliver it (delivery.timeout.ms exceeded, retries exhausted, a non-retriable broker error, or no available partition leader), the TMVar was never written, so the calling request handler parked forever and the failure never surfaced in observability.

The synchronous ImmediateError enqueue-refused path already failed correctly; this was specifically the asynchronous, post-enqueue delivery-report path.
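
To make the failure mode concrete, here is a minimal sketch of the pre-fix shape. It is not the library's code: `Report`, `successOnlyCallback`, and `waitForSignal` are hypothetical names, the report type is simplified to two cases, and a 100 ms timeout stands in for "forever" so the example terminates.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.STM (TMVar, atomically, newEmptyTMVarIO, putTMVar, takeTMVar)
import System.Timeout (timeout)

-- Hypothetical, simplified stand-in for the client library's delivery report.
data Report = Delivered | Failed String
  deriving (Show, Eq)

-- Pre-fix shape: the callback writes the TMVar only on the success branch.
successOnlyCallback :: TMVar () -> Report -> IO ()
successOnlyCallback done report =
  case report of
    Delivered -> atomically (putTMVar done ())
    Failed _reason -> pure () -- nothing is signaled, so a waiter is never woken

-- Run the callback on another thread and wait for its signal.
-- The 100 ms timeout exists only so this sketch terminates;
-- Nothing means the waiter would otherwise have stayed parked.
waitForSignal :: Report -> IO (Maybe ())
waitForSignal report = do
  done <- newEmptyTMVarIO
  _ <- forkIO (successOnlyCallback done report)
  timeout 100000 (atomically (takeTMVar done))
```

In GHCi, `waitForSignal Delivered` returns `Just ()`, while `waitForSignal (Failed "delivery.timeout.ms exceeded")` returns `Nothing`: the callback ran, but nothing woke the waiter.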

Fix

  • Add a pure Internal.deliveryReportToResult :: DeliveryReport -> Result Error () that maps every delivery report to a result, and signal the terminator on all branches.
  • A failed delivery now surfaces as a descriptive Task.fail:
    • DeliveryFailed (ProducerRecord, KafkaError) — broker-side delivery failure (carries the record).
    • NoMessageDelivered KafkaError — the message-less NoMessageError report (librdkafka's delivery callback fired with a null message pointer; error read from errno).
  • sendHelperAsync's callback now receives the whole DeliveryReport and fires on every report (was success-only). sendAsync keeps its prior success-only callback contract.
  • The exactly-once-signal contract is preserved (librdkafka invokes the delivery callback exactly once per message).

Error lives in the unexposed Kafka.Internal module, so the new constructors are not a public-API change.
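
The shape of the fix can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: `Record`, `Offset`, and `KafkaError` are simplified stand-ins, the `DeliveryFailure` constructor name is assumed from the client library, `Either` stands in for `Result`, and `awaitDelivery` with its `produce` argument is a hypothetical helper. Only `deliveryReportToResult`, `DeliveryFailed`, and `NoMessageDelivered` are names taken from the description above.

```haskell
import Control.Concurrent.STM (atomically, newEmptyTMVarIO, putTMVar, takeTMVar)

-- Hypothetical, simplified stand-ins for the client library's types.
type Record = String
type Offset = Int
newtype KafkaError = KafkaError String
  deriving (Show, Eq)

-- Assumed shape of the delivery report: success, failure, message-less error.
data DeliveryReport
  = DeliverySuccess Record Offset
  | DeliveryFailure Record KafkaError
  | NoMessageError KafkaError

-- Only the two new constructors are shown; the rest of Error is omitted.
data Error
  = DeliveryFailed (Record, KafkaError)
  | NoMessageDelivered KafkaError
  deriving (Show, Eq)

-- Pure and total: every report maps to a result (Either stands in for Result).
deliveryReportToResult :: DeliveryReport -> Either Error ()
deliveryReportToResult report =
  case report of
    DeliverySuccess _record _offset -> Right ()
    DeliveryFailure record err -> Left (DeliveryFailed (record, err))
    NoMessageError err -> Left (NoMessageDelivered err)

-- The callback writes the terminator on every branch, so the waiter always
-- wakes once a report arrives. `produce` is a hypothetical function that
-- arranges for the callback to be invoked with the report.
awaitDelivery :: ((DeliveryReport -> IO ()) -> IO ()) -> IO (Either Error ())
awaitDelivery produce = do
  terminator <- newEmptyTMVarIO
  produce (\report -> atomically (putTMVar terminator (deliveryReportToResult report)))
  atomically (takeTMVar terminator)
```

For example, `awaitDelivery (\callback -> callback (DeliveryFailure "record" (KafkaError "timed out")))` returns a `Left` carrying `DeliveryFailed` instead of blocking. Keeping the mapping pure is what allows the branches to be unit-tested without a broker.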

Testing

  • New pure unit tests in test/Spec/Kafka.hs cover all three report branches (developed test-first: watched them fail against a stub, then pass).
  • Full worker integration suite passes against a live broker (12/12), confirming the rewiring doesn't regress sendSync's success path.
  • The broker-side delivery-failure path has no automated integration test by design — a deterministic delivery failure is inherently flaky to reproduce, which is why the dispatch logic was extracted into the pure, unit-tested function.

Docs

  • docs/known-issues.md entry marked resolved (original write-up kept for context).
  • CHANGELOG.md entry added under a new # Unreleased heading.

🤖 Generated with Claude Code

sendSync blocked on a TMVar that the delivery callback only signalled on
the success branch, so a broker-side delivery failure (delivery.timeout.ms
exceeded, retries exhausted, non-retriable error, no partition leader) left
the caller parked forever and muted the error from observability.

Add a pure Internal.deliveryReportToResult that maps every DeliveryReport to
a Result Error (), and signal the terminator on all branches. A failed
delivery now surfaces as a descriptive Task.fail: DeliveryFailed for a
broker-side failure, NoMessageDelivered for the message-less NoMessageError
report. sendAsync keeps its prior success-only callback contract.

The dispatch is unit-tested in test/Spec/Kafka.hs; the full worker
integration suite passes against a live broker.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 13, 2026 05:12

Copilot AI left a comment

Contributor

Pull request overview

Fixes Kafka.sendSync potentially hanging forever when a message is enqueued successfully but later fails delivery by ensuring the delivery callback always signals completion and propagates a failure result.

Changes:

  • Add Internal.deliveryReportToResult and new internal Error constructors to represent all delivery outcomes.
  • Rewire sendHelperAsync/sendSync to receive the full DeliveryReport and signal the TMVar on both success and failure.
  • Add unit tests for delivery-report mapping and update docs/changelog accordingly.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| nri-kafka/test/Spec/Kafka.hs | Adds pure unit tests for deliveryReportToResult. |
| nri-kafka/test/Main.hs | Registers the new Spec.Kafka tests in the test runner. |
| nri-kafka/src/Kafka/Internal.hs | Adds DeliveryFailed/NoMessageDelivered and implements deliveryReportToResult. |
| nri-kafka/src/Kafka.hs | Updates async/sync send logic so sync send terminates on all delivery reports; updates helper callback shape. |
| nri-kafka/nri-kafka.cabal | Adds Spec.Kafka to the test-suite module list. |
| nri-kafka/docs/known-issues.md | Marks the sendSync hang issue as resolved with an explanation. |
| nri-kafka/CHANGELOG.md | Adds an Unreleased entry describing the fix. |

Comment thread nri-kafka/test/Spec/Kafka.hs Outdated
Comment on lines 74 to 75
errorToText :: Error -> Text
errorToText err = Text.fromList (Prelude.show err)

Member Author

This was already the case for SendingFailed. I think this is better addressed by ensuring our messages wrap PII in e.g. Log.Secret, so they're still debuggable, than omitting their contents completely.
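
One way to read that suggestion, as a sketch only: if sensitive fields are wrapped in a type whose Show instance is redacted, then a show-based errorToText stays useful for debugging without printing the wrapped value. The `Secret` type, `unSecret`, and `Payload` below are hypothetical stand-ins written for this example; they are not the definitions behind Log.Secret.

```haskell
-- Hypothetical stand-in for a secret wrapper; not the library's definition.
newtype Secret a = Secret a

-- Show never reveals the wrapped value, so showing a record that
-- contains a Secret is safe to log.
instance Show (Secret a) where
  show _ = "Secret *****"

-- Explicit unwrapping for code that genuinely needs the value.
unSecret :: Secret a -> a
unSecret (Secret value) = value

-- Hypothetical message payload with one sensitive field.
data Payload = Payload
  { userId :: Int,
    email :: Secret String
  }
  deriving (Show)

-- Same shape as a show-based error renderer: the structure stays visible,
-- the sensitive field does not.
describePayload :: Payload -> String
describePayload payload = show payload
```

Here `describePayload (Payload 42 (Secret "user@example.com"))` renders the record with `userId = 42` visible and the email shown only as `Secret *****`.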

Match the rest of the repo (src/Kafka.hs, test/Helpers.hs) which builds
prTopic explicitly rather than relying on the IsString instance. Addresses
PR review feedback.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@omnibs omnibs enabled auto-merge June 15, 2026 14:52
Comment thread nri-kafka/src/Kafka.hs
Comment on lines +222 to +226
let onDeliveryReport deliveryReport =
      case deliveryReport of
        Producer.DeliverySuccess _producerRecord _offset -> onDeliveryCallback
        _ -> Task.succeed ()
sendHelperAsync producer doAnything onDeliveryReport msg'

@omnibs omnibs Jun 15, 2026

Member Author

Note that this is unchanged from trunk. We do nothing on failure in the async path.

Fixing this is a tad larger than what we're aiming for here, and I'm not 100% sure how it would work. I've tried and failed to have a LogHandler report stuff in code that runs in a separate thread in our deployer service, for instance.

Member

out of scope of this pr, agreed

@omnibs omnibs requested a review from jali-clarke June 15, 2026 15:58

@jali-clarke jali-clarke left a comment

Member

thank you!

Comment thread nri-kafka/src/Kafka.hs
Comment on lines +298 to +303
+ -- librdkafka invokes this callback exactly once per message, on
+ -- both success and failure, so handing it the whole delivery
+ -- report lets callers be notified of either outcome.
  ( \deliveryReport -> do
      log <- Platform.silentHandler
-     Task.perform log <|
-       case deliveryReport of
-         Producer.DeliverySuccess _producerRecord _offset -> onDeliveryCallback
-         _ -> Task.succeed ()
+     Task.perform log (onDeliveryReport deliveryReport)

Member

💜

@omnibs omnibs added this pull request to the merge queue Jun 15, 2026
Merged via the queue into trunk with commit 42b6edb Jun 15, 2026
2 checks passed
@omnibs omnibs deleted the fix-sendsync-delivery-failure branch June 15, 2026 21:10