fix(modregistry): eliminate parallelWorker deadlock on error path #5

Merged
ngbinh merged 1 commit into master from fix-parallel-worker-deadlock-v2
Apr 15, 2026
Conversation

@ngbinh ngbinh commented Apr 14, 2026

Summary

Fix a latent deadlock in parallelWorker.wait() that fires whenever any task returns an error. Surfaced in production via the formModule Lambda: cue mod publish → parallelPushBlob → blob push to OCI registry errors → goroutine hang at anduin_parallel.go:87 → Lambda timeout after 5 minutes → cascading test failures.

The bug

// Old wait() loop (simplified)
done := make(chan struct{}, 1)  // capacity 1
go func() { w.wg.Wait(); done <- struct{}{} }()  // watchdog: fires when all workers finish

for {
    select {
    case <-done:
        return ...
    case err = <-w.errCh:
        w.cancel()
        done <- struct{}{}  // ← race: blocks forever if the watchdog already filled the buffer
    }
}

Race:

  1. Task A errors → errCh <- err
  2. Main loop reads err, calls w.cancel(), tries done <- struct{}{}
  3. Meanwhile the watchdog goroutine's wg.Wait() unblocks (all workers done) and sends done <- struct{}{}, filling the cap-1 channel
  4. Main loop's done <- struct{}{} has no receiver — hangs forever
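The failure mode reduces to a property of buffered channels: a channel of capacity 1 accepts exactly one send without a receiver; a second send blocks until someone receives. A minimal standalone sketch (illustrative, not the project's code; it uses select/default to prove the second send would block without actually deadlocking the demo):

```go
package main

import "fmt"

func main() {
	done := make(chan struct{}, 1) // capacity 1, as in the old wait()

	done <- struct{}{} // the watchdog's send fills the buffer

	// The main loop's send: buffer full, no receiver. In the real
	// wait() this is a plain send and hangs forever; here we use
	// select/default just to observe that it cannot proceed.
	select {
	case done <- struct{}{}:
		fmt.Println("second send succeeded")
	default:
		fmt.Println("second send would block: buffer full, no receiver")
	}
}
```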

Additionally, worker goroutines did errCh <- err / respCh <- resp unconditionally. If wait() had already returned (via the error path), those sends would block forever — leaked goroutines.

The fix

  • Use context cancellation as the sole termination signal. Main loop no longer touches done.
  • The watchdog now closes done instead of sending on it: close is a broadcast, receive-only signal that every reader observes and that can't be "filled" twice.
  • On error: first error wins, cancel context, keep draining until watchdog closes done.
  • Worker sends are guarded with select { case ch <- v: case <-ctx.Done(): } so cancellation unblocks them.
  • defer in wait(): cancel + wg.Wait() + close(errCh/respCh) — channels closed only after every worker has left its send site, eliminating the send-on-closed panic risk.

Tests

New regression tests in anduin_parallel_test.go wrap wait() in a 2-second waitWithTimeout watchdog, so future regressions fail loudly instead of hanging CI:

  • TestParallelWorkerErrorNoDeadlock — minimal repro of the reported deadlock
  • TestParallelWorkerMultipleSuccess — 16 concurrent successes, ordering preserved
  • TestParallelWorkerErrorWithSlowSiblings — error racing against in-flight workers

Validated with go test -race -count=500 ./mod/modregistry/ -run TestParallelWorker.

Test plan

  • go vet ./mod/modregistry/...
  • go build ./...
  • go test -race -count=500 ./mod/modregistry/ -run TestParallelWorker
  • Rebuild & deploy formModule Lambda with this fix and verify stargazer CI no longer hangs

When a task returned an error, wait() could block forever:

  1. All workers finish, watchdog goroutine fills cap-1 done channel.
  2. Error arrives, main loop tries `done <- struct{}{}` to break out.
  3. done is already full, second send has no receiver → deadlock.

Runtime reports: "fatal error: all goroutines are asleep - deadlock!"
at anduin_parallel.go:87 [chan send].

Rewrite wait() to use context cancellation as the sole termination signal:
- cancel() on first error, then keep draining respCh/errCh until the
  wg watcher closes a drain-only `done` channel (single owner).
- Worker goroutines guard their sends with <-w.ctx.Done() so they
  can't leak if wait() stops reading after cancellation.
- Defer cancel+Wait+close on wait() return so channel close only
  happens after every worker has left its select block (no panic
  sending on closed channel).
- Drop the standalone close() method, now dead code.

Add anduin_parallel_test.go covering the regression: error-only,
all-success, and error-with-slow-siblings. Each test wraps wait()
in a 2-second watchdog so a regression fails the test instead of
hanging CI.

Validated with `go test -race -count=500` on the new tests.
@ngbinh ngbinh merged commit da20700 into master Apr 15, 2026
@ngbinh ngbinh deleted the fix-parallel-worker-deadlock-v2 branch April 15, 2026 01:32