Summary
Two consumer-side behaviours in the RabbitMQ retry path (verified on v1.3.1) can hurt throughput and stability under sustained retryable failures (e.g. a downstream API returning 429 in bursts):
RetryBackoff keeps the original delivery unacked during the in-process time.Sleep, so backing-off messages hold prefetch slots and can starve the consumer.
- The immediate-requeue path (
default → Nack(requeue=true)) has no retry cap and no backoff, so a poison message can hot-loop forever.
Both are in internal/events/rabbitmq/consumer/. Happy to send a PR if you agree with the direction.
1. RetryBackoff holds the delivery unacked during the sleep → prefetch starvation
retryBackoff sleeps for the backoff interval and only acks the original delivery afterwards, inside requeue():
// internal/events/rabbitmq/consumer/retryBackoff.go
setDeliveryCount(&msg, attempts+1)
time.Sleep(interval) // original delivery is still UNACKED here
err := requeue(ctx, r.chManager.Channel, r.config.QueueName, msg)
func requeue(...) error {
err := channel.PublishWithContext(...) // re-publish a copy
...
return msg.Ack(false) // ack the original only now
}
Because the original stays unacked for the whole sleep, it occupies a prefetch slot. Under a burst of retryable errors, up to PrefetchCount messages can be sleeping in backoff simultaneously, holding every slot — so the broker delivers no new messages until a sleep finishes. Effective throughput drops to roughly PrefetchCount / interval, and the interval grows exponentially, so a sustained retryable-error window throttles all traffic on the queue, not just the failing messages.
Suggested direction
Use broker-native delayed retry instead of an in-process sleep: ack immediately and park the message in a per-attempt delay queue (TTL + dead-letter back to the work queue), or via the delayed-message-exchange plugin. That frees the prefetch slot, survives consumer restarts, and holds nothing in memory during the wait.
2. Immediate requeue (default case) is uncapped and un-backed-off → poison-message hot-loop
The delivery counter (x-delivery-count) is incremented only in retryBackoff. The consumer switch:
// internal/events/rabbitmq/consumer/handler.go
switch res {
case events.Success: msg.Ack(false)
case events.DeadLetter: msg.Nack(false, false) // -> DLQ
case events.RetryBackoff: go r.retryBackoff(ctx, msg) // capped by MaxRetries
default: msg.Nack(false, true) // immediate requeue, no cap, no backoff
}
There is no explicit case for events.Retry, so it (and any unmapped response) falls into default → Nack(requeue=true). A persistent/poison error there is requeued immediately and forever — no backoff, no delivery-count increment, no DLQ escape — which pegs CPU and floods logs. This path is reachable from normal error handling (roxy.RequeueMessageAction via the non-backoff HandleError maps to Retry).
Suggested direction
- Make
default fail-safe → treat unknown responses as DeadLetter rather than infinite requeue, and/or
- Apply the
MaxRetries delivery-count cap to the immediate-requeue path too (escalate to DLQ after N attempts).
Minor / related observations
- Duplication window in
requeue: publish-then-ack is the right (loss-averse) order, but a crash between the publish and msg.Ack re-delivers the original and the republished copy. This is inherent to at-least-once; a stable message id on republish would make consumer-side dedup easier.
- No failure context on dead-lettered messages: attaching the last error / a reason header before dead-lettering would make DLQ triage possible without correlating logs.
- Log volume: the per-delivery
Info("consuming message") plus per-requeue/dead-letter info lines are noisy at high throughput; a debug level or sampling would help.
Versions checked: v1.3.0 and v1.3.1.
Summary
Two consumer-side behaviours in the RabbitMQ retry path (verified on
v1.3.1) can hurt throughput and stability under sustained retryable failures (e.g. a downstream API returning429in bursts):RetryBackoffkeeps the original delivery unacked during the in-processtime.Sleep, so backing-off messages hold prefetch slots and can starve the consumer.default → Nack(requeue=true)) has no retry cap and no backoff, so a poison message can hot-loop forever.Both are in
internal/events/rabbitmq/consumer/. Happy to send a PR if you agree with the direction.1.
RetryBackoffholds the delivery unacked during the sleep → prefetch starvationretryBackoffsleeps for the backoff interval and only acks the original delivery afterwards, insiderequeue():Because the original stays unacked for the whole sleep, it occupies a prefetch slot. Under a burst of retryable errors, up to
PrefetchCountmessages can be sleeping in backoff simultaneously, holding every slot — so the broker delivers no new messages until a sleep finishes. Effective throughput drops to roughlyPrefetchCount / interval, and the interval grows exponentially, so a sustained retryable-error window throttles all traffic on the queue, not just the failing messages.Suggested direction
Use broker-native delayed retry instead of an in-process sleep: ack immediately and park the message in a per-attempt delay queue (TTL + dead-letter back to the work queue), or via the delayed-message-exchange plugin. That frees the prefetch slot, survives consumer restarts, and holds nothing in memory during the wait.
2. Immediate requeue (
defaultcase) is uncapped and un-backed-off → poison-message hot-loopThe delivery counter (
x-delivery-count) is incremented only inretryBackoff. The consumer switch:There is no explicit
caseforevents.Retry, so it (and any unmapped response) falls intodefault → Nack(requeue=true). A persistent/poison error there is requeued immediately and forever — no backoff, no delivery-count increment, no DLQ escape — which pegs CPU and floods logs. This path is reachable from normal error handling (roxy.RequeueMessageActionvia the non-backoffHandleErrormaps toRetry).Suggested direction
defaultfail-safe → treat unknown responses asDeadLetterrather than infinite requeue, and/orMaxRetriesdelivery-count cap to the immediate-requeue path too (escalate to DLQ after N attempts).Minor / related observations
requeue: publish-then-ack is the right (loss-averse) order, but a crash between the publish andmsg.Ackre-delivers the original and the republished copy. This is inherent to at-least-once; a stable message id on republish would make consumer-side dedup easier.Info("consuming message")plus per-requeue/dead-letter info lines are noisy at high throughput; a debug level or sampling would help.Versions checked:
v1.3.0andv1.3.1.