We observed in production that Flower was showing 0 core tasks running for an extended period of time, yet the collection_status operations table had ~40 core tasks listed as Collecting (with UUIDs).
After confirming (by searching Flower for the repo URLs of those repos) that these tasks were not currently running and hadn't already completed, we concluded that they were stale: the DB's picture of collection status had gotten out of sync with Flower.
Changing a few of these tasks from Collecting (with a UUID) to Pending (and nulling out the UUID) seems to have partially cleared the block on core workers.
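For reference, the manual fix described above amounts to something like the following. This is only a sketch: it uses an in-memory SQLite table as a stand-in for the real operations table, and the column names (`repo_id`, `core_status`, `core_task_id`) and the function name `reset_stale_core_tasks` are assumptions for illustration, not taken from the codebase. The set of running task IDs would have to come from Flower or from Celery's inspect API.

```python
import sqlite3


def reset_stale_core_tasks(conn, running_task_ids):
    """Set Collecting rows whose task ID is not actually running back to Pending.

    Returns the repo IDs that were reset. Column names are hypothetical.
    """
    rows = conn.execute(
        "SELECT repo_id, core_task_id FROM collection_status "
        "WHERE core_status = 'Collecting' AND core_task_id IS NOT NULL "
        "ORDER BY repo_id"
    ).fetchall()
    # A row is stale if the DB says Collecting but no worker is running its task.
    stale = [repo_id for repo_id, task_id in rows if task_id not in running_task_ids]
    conn.executemany(
        "UPDATE collection_status "
        "SET core_status = 'Pending', core_task_id = NULL "
        "WHERE repo_id = ?",
        [(repo_id,) for repo_id in stale],
    )
    conn.commit()
    return stale


if __name__ == "__main__":
    # Stand-in table; the real schema will differ.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE collection_status ("
        "repo_id INTEGER PRIMARY KEY, core_status TEXT, core_task_id TEXT)"
    )
    conn.executemany(
        "INSERT INTO collection_status VALUES (?, ?, ?)",
        [(1, "Collecting", "uuid-a"), (2, "Collecting", "uuid-b")],
    )
    # Pretend only uuid-b is still visible as running in Flower.
    print(reset_stale_core_tasks(conn, {"uuid-b"}))
```

Passing the running IDs in as a plain set keeps the reset logic independent of how they are gathered, so the same function could back a one-off cleanup script or a periodic reconciliation job.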
When asked to identify bugs that could cause this desynchronization and the resulting stale tasks, generative AI identified four issues. I judged three of them to be irrelevant and/or intended behavior; the one promising finding was:
No on_revoke / task revocation handler
celery_app.py Lines 1-10
from celery.signals import worker_process_init, worker_process_shutdown
There's no task_revoked or task_prerun/task_postrun signal handler that would reset COLLECTING to ERROR/PENDING when tasks are revoked during worker shutdown. The on_failure on CoreRepoCollectionTask is only triggered by exceptions raised inside the task function — not by SIGKILL, OOM kills, or Celery task revocation.
This seems like a very plausible error chain:
- a Celery worker gets shut down or recreated for some reason
- we have no error handler to catch this when that worker's tasks are revoked
- we subsequently never change the status of those now-revoked tasks in the DB
- even if/when the new workers come back, the collection monitor doesn't pick up any more tasks because it thinks core collection is already running at full beans
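If this chain is right, one possible mitigation is a `task_revoked` signal receiver that frees the row when a task is revoked. The sketch below is an assumption-heavy illustration, not a tested fix: `release_collection_slot`, `connect_revocation_handler`, and the `get_conn` connection factory are hypothetical names, the column names are guesses, and SQLite stands in for the real database. I have not verified that `task_revoked` actually fires in the shutdown scenario we hit, and no handler can run if the worker is killed outright with SIGKILL.

```python
import sqlite3

try:
    from celery.signals import task_revoked
except ImportError:
    task_revoked = None  # lets the sketch run without Celery installed


def release_collection_slot(conn, task_id, new_status="Pending"):
    """Clear the row still pointing at task_id; return the number of rows changed.

    Table and column names are hypothetical.
    """
    cur = conn.execute(
        "UPDATE collection_status "
        "SET core_status = ?, core_task_id = NULL "
        "WHERE core_task_id = ? AND core_status = 'Collecting'",
        (new_status, task_id),
    )
    conn.commit()
    return cur.rowcount


def connect_revocation_handler(get_conn):
    """Register a task_revoked receiver; get_conn is a hypothetical DB connection factory."""
    if task_revoked is None:
        return None

    def on_task_revoked(sender=None, request=None, **kwargs):
        # Celery passes the revoked task's request context; request.id is the task UUID.
        if request is not None:
            release_collection_slot(get_conn(), request.id)

    # weak=False keeps the nested receiver alive after this function returns.
    task_revoked.connect(on_task_revoked, weak=False)
    return on_task_revoked
```

Because a signal receiver only covers revocations the worker gets to see, it would probably need to be paired with a periodic check that compares Collecting rows against the tasks Celery reports as active.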