fix(modregistry): eliminate parallelWorker deadlock on error path#5
Merged
fix(modregistry): eliminate parallelWorker deadlock on error path#5
Conversation
When a task returned an error, wait() could block forever:
1. All workers finish, watchdog goroutine fills cap-1 done channel.
2. Error arrives, main loop tries `done <- struct{}{}` to break out.
3. done is already full, second send has no receiver → deadlock.
Runtime reports: "fatal error: all goroutines are asleep - deadlock!"
at anduin_parallel.go:87 [chan send].
Rewrite wait() to use context.Cancel as the sole termination signal:
- cancel() on first error, then keep draining respCh/errCh until the
wg watcher closes a drain-only `done` channel (single owner).
- Worker goroutines guard their sends with <-w.ctx.Done() so they
can't leak if wait() stops reading after cancellation.
- Defer cancel+Wait+close on wait() return so channel close only
happens after every worker has left its select block (no panic
sending on closed channel).
- Drop the standalone close() method, now dead code.
Add anduin_parallel_test.go covering the regression: error-only,
all-success, and error-with-slow-siblings. Each test wraps wait()
in a 2-second watchdog so a regression fails the test instead of
hanging CI.
Validated with `go test -race -count=500` on the new tests.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fix a latent deadlock in
parallelWorker.wait()that fires whenever any task returns an error. Surfaced in production via the formModule Lambda:cue mod publish→parallelPushBlob→ blob push to OCI registry errors → goroutine hang atanduin_parallel.go:87→ Lambda timeout after 5 minutes → cascading test failures.The bug
Race:
errCh <- errw.cancel(), triesdone <- struct{}{}wg.Wait()unblocks (all workers done) and sendsdone <- struct{}{}, filling the cap-1 channeldone <- struct{}{}has no receiver — hangs foreverAdditionally, worker goroutines did
errCh <- err/respCh <- respunconditionally. Ifwait()had already returned (via the error path), those sends would block forever — leaked goroutines.The fix
done.close(done)s the channel (read-only signal, idempotent, can't be filled twice).done.select { case ch <- v: case <-ctx.Done(): }so cancellation unblocks them.deferinwait(): cancel +wg.Wait()+close(errCh/respCh)— channels closed only after every worker has left its send site, eliminating the send-on-closed panic risk.Tests
New regression tests in
anduin_parallel_test.gowith a 2-secondwaitWithTimeoutwatchdog so future regressions fail loud instead of hanging CI:TestParallelWorkerErrorNoDeadlock— minimal repro of the reported deadlockTestParallelWorkerMultipleSuccess— 16 concurrent successes, ordering preservedTestParallelWorkerErrorWithSlowSiblings— error racing against in-flight workersValidated with
go test -race -count=500 ./mod/modregistry/ -run TestParallelWorker.Test plan
go vet ./mod/modregistry/...go build ./...go test -race -count=500 ./mod/modregistry/ -run TestParallelWorker