fix(go): back off before re-polling failed batch #284

Draft
NikolayS wants to merge 1 commit into main from claude/fix-go-consumer-backoff-oqxpbr

@NikolayS

Bug

In clients/go/consumer.go, the consumer loop slept for pollInterval only after a receive error or an empty batch. When a batch was received but finalization failed, the loop re-polled immediately. There were two such paths: nackFailed was true (the loop hit continue with no sleep), or Ack returned an error (execution fell through to the top of the loop with no sleep).

Because the batch was never finished, pgque.next_batch returns the same batch at once. A persistent nack/ack failure (e.g. partial grants where receive works but nack/ack don't) therefore produced a tight loop at full speed: re-receive the same batch, re-run all handlers (duplicate side effects), fail again, repeat. Even a single transient ack failure re-executed the whole batch's handlers with zero delay.

Fix

Sleep for pollInterval before re-polling on both the nack-failure path and the Ack-error path. The sleep respects ctx cancellation, using the same select { case <-ctx.Done() ... case <-time.After(...) } pattern as the receive-error and empty-batch paths. The Ack n==0 stale/double-ack case stays warning-only and gains no sleep.

Red/green TDD: added TestConsumer_NackFailure_BacksOffBeforeRepoll and TestConsumer_AckFailure_BacksOffBeforeRepoll. Both use a stub backend (redeliverStubBackend) whose Receive always returns the same batch and whose Nack/Ack fail. Each test bounds the number of Receive calls made in a 400 ms window with a 50 ms poll interval.

Verification

Red (on unfixed code, go test -run BacksOffBeforeRepoll ./... in clients/go):

--- FAIL: TestConsumer_NackFailure_BacksOffBeforeRepoll (0.45s)
    Receive called 52141 times in 400ms with pollInterval 50ms — tight re-poll loop on nack failure (want <= 11)
--- FAIL: TestConsumer_AckFailure_BacksOffBeforeRepoll (0.58s)
    Receive called 236297 times in 400ms with pollInterval 50ms — tight re-poll loop on ack failure (want <= 11)

Green (with fix, in clients/go):

$ go test ./...
ok  	github.com/NikolayS/pgque-go	4.784s
?   	github.com/NikolayS/pgque-go/bench/coop_demo	[no test files]
$ go vet ./...
(clean, no output)

Addresses finding B1 (Go) of #283

https://claude.ai/code/session_01KAaEGkQZmey1D1xCsVGmqv


Generated by Claude Code

When a batch was received but finalization failed (a Nack failed, or
Ack returned an error), the consumer loop re-polled immediately. The
unfinished batch is redelivered by pgque.next_batch at once, so a
persistent nack/ack failure (e.g. partial grants) produced a tight
loop re-running every handler at full speed, and even one transient
ack failure re-executed the whole batch with zero delay.

Sleep pollInterval (respecting ctx cancellation) before re-polling on
both the nack-failure and ack-error paths. The Ack n==0 stale/double
ack case stays warning-only with no sleep.

Verified red/green: the new tests counted 52k/236k Receive calls in
400 ms on the unfixed code; with the fix the count stays within the
pollInterval bound.

https://claude.ai/code/session_01KAaEGkQZmey1D1xCsVGmqv