fix(ts): back off on nack/ack failure in consumer#289
Draft
NikolayS wants to merge 2 commits into
Draft
Conversation
The Consumer.start() loop slept pollInterval only on receive error or empty batch. When a batch was received but finalization failed (a nack failed so ack was skipped, or ack threw), the loop re-polled instantly; since the batch was never finished, pgque.next_batch returned the same batch immediately. A persistent nack/ack failure (e.g. partial grants where receive works but nack/ack do not) therefore hot-looped at full speed, re-running every handler with duplicate side effects, and even a single transient ack failure re-executed the batch with zero delay. Sleep pollIntervalMs (abort-aware) before re-polling on both paths. Ack returning 0 without throwing stays warning-only with no sleep. Red/green: the two new mock-based tests hot-loop so hard unfixed that the vitest worker dies of OOM recording mock calls; green after fix. Addresses finding B1 (TypeScript) of #283. https://claude.ai/code/session_01KAaEGkQZmey1D1xCsVGmqv
producer_bench.ts spliced the table name returned by pgque.current_event_table() into 'select count(*) from ...' via a raw template literal -- the only non-parameterized query in the repo. The value is self-generated and the script is dev-only, so this is not exploitable, but quote it properly anyway with pg's escapeIdentifier, part by part for the schema-qualified name. Also fix the result generic: the pgque pool parses int8 (OID 20) to BigInt, so count(*) arrives as bigint, not string; the old code only worked because Number.parseInt coerces its argument. Verified: bun run check, plus a live run against local PG 16 (PGQUE_BENCH_REPEATS=1 bun src/producer_bench.ts) where verifyCount passes for all batch sizes. Addresses the producer_bench informational note of #283. https://claude.ai/code/session_01KAaEGkQZmey1D1xCsVGmqv
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Bug
Consumer.start()inclients/typescript/src/consumer.tssleptpollIntervalMsonly on receive error or empty batch. When a batch was received but finalization failed — a nack failed so ack was intentionally skipped, orack()threw — the while loop re-polled immediately. Because the batch was never finished,pgque.next_batchreturns the same batch instantly, so:receiveworks butnack/ackdon't) becomes a tight loop at full speed: re-receive → re-run ALL handlers (duplicate side effects) → fail → repeat, hammering both the application and Postgres;The comment on the skip-ack path even claimed redelivery happens "on the next poll" — a poll interval that did not exist on that path.
Fix
await sleep(this.pollIntervalMs, signal)before re-polling on both theanyNackFailedpath and the ack-throw path. The existingsleephelper is abort-aware, so shutdown latency is unchanged.ackreturning 0 without throwing stays warning-only with no sleep (the batch is finished in that case; the loop should keep draining).Red/green TDD: two new mock-based tests (
pollInterval: 60_000,receivealways returns the same batch, nack/ack fail persistently) assertreceiveis called exactly once within a 300 ms observation window. Unfixed, the hot loop is so tight that the vitest worker dies of OOM recording mock calls — that was the red run. Green after the fix.Second commit (informational note from the same review):
src/producer_bench.tsspliced the table name frompgque.current_event_table()intoselect count(*) from ${table}via a template literal — not exploitable (self-generated value, dev-only script) but the only non-parameterized query in the repo. Now quoted part by part with pg'sescapeIdentifier. Also fixed the result generic: the pgque pool parses int8 (OID 20) toBigInt, socount(*)arrives asbigint, notstring; the old<{ count: string }>+Number.parseIntonly worked by coercion.Verification
All run in
clients/typescript/(deps viabun install):Red run (before the fix, same test file): the
consumer.test.tsworker crashed withFATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory— direct evidence of the unbounded hot loop.Addresses finding B1 (TypeScript) and the producer_bench informational note of #283
https://claude.ai/code/session_01KAaEGkQZmey1D1xCsVGmqv
Generated by Claude Code