fix(ha): don't wedge the app when scale-down lock can't be acquired by delgod · Pull Request #65 · canonical/valkey-operator

delgod · 2026-05-22T20:02:20Z

Removing the whole application fires storage-detaching on every unit at once. Each unit races for a single scale-down lock; the losers raised RequestingLockTimedOutError, which failed the hook and parked the unit in error state. Its agent is then torn down before juju can retry the hook, so the unit stays wedged and the application can never finish removing — juju.wait(APP not in status) times out (seen in k8s/test_scaling.py:: test_scale_down_remove_application).

ScaleDownLock.request_lock now genuinely retries under contention. The retry count was min(timeout // 5, 1) (always 1) and a contended SET NX returns falsy WITHOUT raising, so tenacity never re-polled. Replace it with an explicit poll loop using max(...) so the timeout is honored. On retry it re-resolves the primary with the resilient get_primary_ip_for_scale_down() (which rides out transient sentinel failures during teardown) and builds the ValkeyClient once.
_on_storage_detaching no longer raises when the lock is unavailable during full-app removal (planned_units() == 0): it goes away gracefully instead of wedging, logging at warning with the unit name. A partial single-unit scale-down still raises so juju retries the hook.
While acquiring the lock, ValkeyCannotGetPrimaryIPError and ValkeyWorkloadCommandError (e.g. from get_active_sentinel_ips) are tolerated the same way: go away on full-app removal, otherwise re-raise so juju retries — rather than silently skipping failover and the dataset save on a partial scale-down.

Covered by unit tests in tests/unit/test_locks.py and tests/unit/test_scaledown.py.

Have you updated relevant documentation?

Removing the whole application fires storage-detaching on every unit at once. Each unit races for a single scale-down lock; the losers raised RequestingLockTimedOutError, which failed the hook and parked the unit in error state. Its agent is then torn down before juju can retry the hook, so the unit stays wedged and the application can never finish removing — `juju.wait(APP not in status)` times out (seen in k8s/test_scaling.py:: test_scale_down_remove_application). - ScaleDownLock.request_lock now genuinely retries under contention. The retry count was `min(timeout // 5, 1)` (always 1) and a contended SET NX returns falsy WITHOUT raising, so tenacity never re-polled. Replace it with an explicit poll loop using `max(...)` so the timeout is honored. On retry it re-resolves the primary with the resilient get_primary_ip_for_scale_down() (which rides out transient sentinel failures during teardown) and builds the ValkeyClient once. - _on_storage_detaching no longer raises when the lock is unavailable during full-app removal (planned_units() == 0): it goes away gracefully instead of wedging, logging at warning with the unit name. A partial single-unit scale-down still raises so juju retries the hook. - While acquiring the lock, ValkeyCannotGetPrimaryIPError and ValkeyWorkloadCommandError (e.g. from get_active_sentinel_ips) are tolerated the same way: go away on full-app removal, otherwise re-raise so juju retries — rather than silently skipping failover and the dataset save on a partial scale-down. Covered by unit tests in tests/unit/test_locks.py and tests/unit/test_scaledown.py. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

skourta · 2026-05-25T13:39:50Z

+                    self.charm.state.unit_server.unit_name,
+                )
+                self._set_state_for_going_away()
+                return


If we return here we will not save the databse to disk before stopping the workload. The unit will just be killed by juju

skourta · 2026-05-25T13:40:22Z

+        # must re-attempt on a fixed interval until it is acquired or the timeout
+        # elapses. With no timeout we make a single attempt and let the caller
+        # decide how to wait (e.g. defer the hook so juju re-runs it later).
+        number_of_retries = max(timeout // 5, 1) if timeout else 1


This is a good catch. I thinkwe should still use tenacity to manage the retries though.

skourta · 2026-05-25T13:44:58Z

+                    e,
+                )
+                self._set_state_for_going_away()
+                return


On full app removal, we currently still orchestrate the shutdown to be executed one unit at a time but whther we change that to all units doing it at the same time or not, saving the DB to disk on line 636 should always be executed before the workload is shutdown.

skourta · 2026-05-25T13:46:42Z

+# How long a unit waits in-process to acquire the scale-down lock before either
+# deferring the storage-detaching hook (so juju re-runs it) or, on full-app
+# removal, going away. Kept well under juju's hook timeout.
+SCALE_DOWN_LOCK_TIMEOUT = 30


I think we should increase it to something like 300. This should enable most units to avoid erroring out in storage detaching

skourta reviewed May 25, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(ha): don't wedge the app when scale-down lock can't be acquired#65

fix(ha): don't wedge the app when scale-down lock can't be acquired#65
delgod wants to merge 1 commit into
9/edgefrom
scale-down-lock

delgod commented May 22, 2026

Uh oh!

skourta May 25, 2026

Uh oh!

skourta May 25, 2026

Uh oh!

skourta May 25, 2026

Uh oh!

skourta May 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

delgod commented May 22, 2026

Uh oh!

skourta May 25, 2026

Choose a reason for hiding this comment

Uh oh!

skourta May 25, 2026

Choose a reason for hiding this comment

Uh oh!

skourta May 25, 2026

Choose a reason for hiding this comment

Uh oh!

skourta May 25, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants