gh-119592: gh-152967: Fix ProcessPoolExecutor stranding submitted work when a max_tasks_per_child worker exits #152978

Open
gpshead wants to merge 1 commit into python:main from gpshead:fix-gh-119592-worker-replacement

gpshead (Member) commented Jul 3, 2026

Worker replacement went through the executor object: the manager thread read executor attributes that shutdown(wait=False) clears concurrently, and could not replace workers at all once the executor was garbage collected. A worker exiting at its max_tasks_per_child limit in those states left the remaining submitted work permanently unexecuted and hung interpreter exit; the racing case could crash the manager thread.

Replace workers from the executor manager thread using its own state plus configuration read through the live executor weakref, which shutdown() never clears:


Drafted and investigated entirely by Claude Fable 5 based on the issues. I'm putting this up as a draft to iterate on review, to see what shape this should take, and to gauge how feasible backporting it further as a bugfix could even be. Edit: looks to be in good shape. Undrafting.
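
To make the weakref idea in the description concrete, here is a minimal sketch, not the code in this PR. `ToyExecutor`, `ToyManager`, `replace_worker`, and the exception text are hypothetical names invented for illustration; only `weakref.ref`, `Future`, and `BrokenProcessPool` are real. It assumes the manager keeps its own list of pending futures and reads configuration through a weak reference, failing the remaining work once the executor is gone.

```python
import gc
import weakref
from concurrent.futures import Future
from concurrent.futures.process import BrokenProcessPool


class ToyExecutor:
    """Hypothetical stand-in for the executor; it only holds configuration."""

    def __init__(self, max_workers):
        self.max_workers = max_workers


class ToyManager:
    """Hypothetical stand-in for the manager thread's state."""

    def __init__(self, executor):
        # Weak reference only: the manager must not keep the executor alive.
        self.executor_ref = weakref.ref(executor)
        # The manager's own state: futures for work already submitted.
        self.pending = []

    def replace_worker(self):
        executor = self.executor_ref()  # None once the executor is collected
        if executor is None:
            # Nothing left to read configuration from: fail the remaining
            # work instead of leaving its futures unresolved forever.
            for future in self.pending:
                future.set_exception(
                    BrokenProcessPool("executor was garbage collected")
                )
            self.pending.clear()
            return False
        # A real implementation would start a worker process here.
        return True


executor = ToyExecutor(max_workers=2)
manager = ToyManager(executor)
pending_future = Future()
manager.pending.append(pending_future)

replaced_while_alive = manager.replace_worker()

del executor
gc.collect()
replaced_after_gc = manager.replace_worker()
```

In this toy model, a caller blocked on `pending_future.result()` gets a `BrokenProcessPool` error rather than waiting forever, which is the behavior the description aims for once the executor has been collected.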

…n a max_tasks_per_child worker exits

Worker replacement went through the executor object: the manager thread
read executor attributes that shutdown(wait=False) clears concurrently,
and could not replace workers at all once the executor was garbage
collected. A worker exiting at its max_tasks_per_child limit in those
states left the remaining submitted work permanently unexecuted and hung
interpreter exit; the racing case could crash the manager thread.

Replace workers from the executor manager thread using its own state
plus configuration read through the live executor weakref, which
shutdown() never clears:

- After shutdown(wait=False) with the executor still referenced, a
  replacement is spawned and the remaining work is executed as
  documented.
- Once the executor has been garbage collected (pythongh-152967), or a
  replacement worker cannot be started and no workers remain, the
  remaining futures now fail with BrokenProcessPool instead of never
  resolving.
- A new _force_shutting_down flag stops both spawn paths from starting
  workers that would escape terminate_workers()/kill_workers().

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
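
The last bullet of the commit message can be sketched as a guard flag that is checked on every spawn. This is a toy model under stated assumptions, not the PR's implementation: `ToyPool`, `spawn_on_submit`, and `spawn_replacement` are hypothetical names, the two spawn paths are assumed to be "spawn when work is submitted" and "spawn a replacement for an exited worker", and the use of a lock is also an assumption.

```python
import threading


class ToyPool:
    """Hypothetical pool with two spawn paths and a forced shutdown."""

    def __init__(self):
        self._lock = threading.Lock()
        self._force_shutting_down = False
        self.workers = []

    def _spawn(self, name):
        # Both spawn paths funnel through this check. It runs under the lock
        # so that a spawn cannot slip in between the flag being set and the
        # worker list being snapshotted for termination.
        with self._lock:
            if self._force_shutting_down:
                return False
            self.workers.append(name)
            return True

    def spawn_on_submit(self):
        return self._spawn("on-submit")

    def spawn_replacement(self):
        return self._spawn("replacement")

    def terminate_workers(self):
        with self._lock:
            self._force_shutting_down = True
            doomed, self.workers = self.workers, []
        # A real pool would terminate each of these processes.
        return doomed


pool = ToyPool()
started_before = pool.spawn_on_submit()
terminated = pool.terminate_workers()
started_after = (pool.spawn_on_submit(), pool.spawn_replacement())
```

Setting the flag and snapshotting the worker list in the same critical section is what keeps a late-started worker from escaping the forced shutdown in this sketch.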
         self._join_executor_internals(broken=True)

-    def terminate_broken(self, cause):
+    def terminate_broken(self, cause, bpe_message=None):

Member Author

While this might look like a public API change in the diff, it is on _ExecutorManagerThread, an internal-use-only class. Fine to backport.
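
Separately from the internal-only point, a keyword parameter with a default leaves existing call sites working. The sketch below illustrates that with a free function, `toy_terminate_broken`, which is hypothetical and not the method in the diff. That `bpe_message` supplies the `BrokenProcessPool` message is an assumption drawn from its name, and the default text is invented.

```python
from concurrent.futures import Future
from concurrent.futures.process import BrokenProcessPool


def toy_terminate_broken(pending, cause, bpe_message=None):
    """Toy version: fail every pending future with BrokenProcessPool."""
    # Hypothetical default text; the real wording is not shown in the diff.
    message = bpe_message or "process pool is broken"
    for future in pending:
        exc = BrokenProcessPool(message)
        exc.__cause__ = cause
        future.set_exception(exc)


old_style, new_style = Future(), Future()
# Existing call shape: no bpe_message argument.
toy_terminate_broken([old_style], RuntimeError("worker died"))
# New call shape: the caller supplies a more specific message.
toy_terminate_broken([new_style], None, bpe_message="executor was collected")
```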

@gpshead gpshead marked this pull request as ready for review July 4, 2026 05:04
@gpshead gpshead added the needs backport to 3.14 bugs and security fixes label Jul 4, 2026
Labels

- awaiting core review
- needs backport to 3.14 (bugs and security fixes)
- needs backport to 3.15 (pre-release feature fixes, bugs and security fixes)
