Consumer retry: RetryBackoff holds delivery unacked during sleep (prefetch starvation) + uncapped immediate-requeue poison loop

## Summary

Two consumer-side behaviours in the RabbitMQ retry path (verified on `v1.3.1`) can hurt throughput and stability under sustained retryable failures (e.g. a downstream API returning `429` in bursts):

1. `RetryBackoff` keeps the original delivery **unacked during the in-process `time.Sleep`**, so backing-off messages hold prefetch slots and can starve the consumer.
2. The immediate-requeue path (`default → Nack(requeue=true)`) has **no retry cap and no backoff**, so a poison message can hot-loop forever.

Both are in `internal/events/rabbitmq/consumer/`. Happy to send a PR if you agree with the direction.

---

## 1. `RetryBackoff` holds the delivery unacked during the sleep → prefetch starvation

`retryBackoff` sleeps for the backoff interval and only **acks the original delivery afterwards**, inside `requeue()`:

```go
// internal/events/rabbitmq/consumer/retryBackoff.go
setDeliveryCount(&msg, attempts+1)
time.Sleep(interval)                 // original delivery is still UNACKED here
err := requeue(ctx, r.chManager.Channel, r.config.QueueName, msg)

func requeue(...) error {
    err := channel.PublishWithContext(...) // re-publish a copy
    ...
    return msg.Ack(false)                  // ack the original only now
}
```

Because the original stays unacked for the whole sleep, it **occupies a prefetch slot**. Under a burst of retryable errors, up to `PrefetchCount` messages can be sleeping in backoff simultaneously, holding every slot — so the broker delivers no new messages until a sleep finishes. Effective throughput drops to roughly `PrefetchCount / interval`, and the interval grows exponentially, so a sustained retryable-error window throttles *all* traffic on the queue, not just the failing messages.

### Suggested direction
Use **broker-native delayed retry** instead of an in-process sleep: ack immediately and park the message in a per-attempt delay queue (TTL + dead-letter back to the work queue), or via the delayed-message-exchange plugin. That frees the prefetch slot, survives consumer restarts, and holds nothing in memory during the wait.

---

## 2. Immediate requeue (`default` case) is uncapped and un-backed-off → poison-message hot-loop

The delivery counter (`x-delivery-count`) is incremented **only** in `retryBackoff`. The consumer switch:

```go
// internal/events/rabbitmq/consumer/handler.go
switch res {
case events.Success:      msg.Ack(false)
case events.DeadLetter:   msg.Nack(false, false)     // -> DLQ
case events.RetryBackoff: go r.retryBackoff(ctx, msg) // capped by MaxRetries
default:                  msg.Nack(false, true)      // immediate requeue, no cap, no backoff
}
```

There is no explicit `case` for `events.Retry`, so it (and any unmapped response) falls into `default → Nack(requeue=true)`. A persistent/poison error there is requeued **immediately and forever** — no backoff, no delivery-count increment, no DLQ escape — which pegs CPU and floods logs. This path is reachable from normal error handling (`roxy.RequeueMessageAction` via the non-backoff `HandleError` maps to `Retry`).

### Suggested direction
- Make `default` fail-safe → treat unknown responses as `DeadLetter` rather than infinite requeue, and/or
- Apply the `MaxRetries` delivery-count cap to the immediate-requeue path too (escalate to DLQ after N attempts).

---

## Minor / related observations

- **Duplication window in `requeue`:** publish-then-ack is the right (loss-averse) order, but a crash between the publish and `msg.Ack` re-delivers the original *and* the republished copy. This is inherent to at-least-once; a stable message id on republish would make consumer-side dedup easier.
- **No failure context on dead-lettered messages:** attaching the last error / a reason header before dead-lettering would make DLQ triage possible without correlating logs.
- **Log volume:** the per-delivery `Info("consuming message")` plus per-requeue/dead-letter info lines are noisy at high throughput; a debug level or sampling would help.

Versions checked: `v1.3.0` and `v1.3.1`.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consumer retry: RetryBackoff holds delivery unacked during sleep (prefetch starvation) + uncapped immediate-requeue poison loop #126

Summary

1. `RetryBackoff` holds the delivery unacked during the sleep → prefetch starvation

Suggested direction

2. Immediate requeue (`default` case) is uncapped and un-backed-off → poison-message hot-loop

Suggested direction

Minor / related observations

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Consumer retry: RetryBackoff holds delivery unacked during sleep (prefetch starvation) + uncapped immediate-requeue poison loop #126

Description

Summary

1. RetryBackoff holds the delivery unacked during the sleep → prefetch starvation

Suggested direction

2. Immediate requeue (default case) is uncapped and un-backed-off → poison-message hot-loop

Suggested direction

Minor / related observations

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

1. `RetryBackoff` holds the delivery unacked during the sleep → prefetch starvation

2. Immediate requeue (`default` case) is uncapped and un-backed-off → poison-message hot-loop