Skip to content

Consumer retry: RetryBackoff holds delivery unacked during sleep (prefetch starvation) + uncapped immediate-requeue poison loop #126

@emmanuelramosdevops

Description

@emmanuelramosdevops

Summary

Two consumer-side behaviours in the RabbitMQ retry path (verified on v1.3.1) can hurt throughput and stability under sustained retryable failures (e.g. a downstream API returning 429 in bursts):

  1. RetryBackoff keeps the original delivery unacked during the in-process time.Sleep, so backing-off messages hold prefetch slots and can starve the consumer.
  2. The immediate-requeue path (default → Nack(requeue=true)) has no retry cap and no backoff, so a poison message can hot-loop forever.

Both are in internal/events/rabbitmq/consumer/. Happy to send a PR if you agree with the direction.


1. RetryBackoff holds the delivery unacked during the sleep → prefetch starvation

retryBackoff sleeps for the backoff interval and only acks the original delivery afterwards, inside requeue():

// internal/events/rabbitmq/consumer/retryBackoff.go
setDeliveryCount(&msg, attempts+1)
time.Sleep(interval)                 // original delivery is still UNACKED here
err := requeue(ctx, r.chManager.Channel, r.config.QueueName, msg)

func requeue(...) error {
    err := channel.PublishWithContext(...) // re-publish a copy
    ...
    return msg.Ack(false)                  // ack the original only now
}

Because the original stays unacked for the whole sleep, it occupies a prefetch slot. Under a burst of retryable errors, up to PrefetchCount messages can be sleeping in backoff simultaneously, holding every slot — so the broker delivers no new messages until a sleep finishes. Effective throughput drops to roughly PrefetchCount / interval, and the interval grows exponentially, so a sustained retryable-error window throttles all traffic on the queue, not just the failing messages.

Suggested direction

Use broker-native delayed retry instead of an in-process sleep: ack immediately and park the message in a per-attempt delay queue (TTL + dead-letter back to the work queue), or via the delayed-message-exchange plugin. That frees the prefetch slot, survives consumer restarts, and holds nothing in memory during the wait.


2. Immediate requeue (default case) is uncapped and un-backed-off → poison-message hot-loop

The delivery counter (x-delivery-count) is incremented only in retryBackoff. The consumer switch:

// internal/events/rabbitmq/consumer/handler.go
switch res {
case events.Success:      msg.Ack(false)
case events.DeadLetter:   msg.Nack(false, false)     // -> DLQ
case events.RetryBackoff: go r.retryBackoff(ctx, msg) // capped by MaxRetries
default:                  msg.Nack(false, true)      // immediate requeue, no cap, no backoff
}

There is no explicit case for events.Retry, so it (and any unmapped response) falls into default → Nack(requeue=true). A persistent/poison error there is requeued immediately and forever — no backoff, no delivery-count increment, no DLQ escape — which pegs CPU and floods logs. This path is reachable from normal error handling (roxy.RequeueMessageAction via the non-backoff HandleError maps to Retry).

Suggested direction

  • Make default fail-safe → treat unknown responses as DeadLetter rather than infinite requeue, and/or
  • Apply the MaxRetries delivery-count cap to the immediate-requeue path too (escalate to DLQ after N attempts).

Minor / related observations

  • Duplication window in requeue: publish-then-ack is the right (loss-averse) order, but a crash between the publish and msg.Ack re-delivers the original and the republished copy. This is inherent to at-least-once; a stable message id on republish would make consumer-side dedup easier.
  • No failure context on dead-lettered messages: attaching the last error / a reason header before dead-lettering would make DLQ triage possible without correlating logs.
  • Log volume: the per-delivery Info("consuming message") plus per-requeue/dead-letter info lines are noisy at high throughput; a debug level or sampling would help.

Versions checked: v1.3.0 and v1.3.1.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions