Skip to content

fix(ha): don't wedge the app when scale-down lock can't be acquired#65

Draft
delgod wants to merge 1 commit into
9/edgefrom
scale-down-lock
Draft

fix(ha): don't wedge the app when scale-down lock can't be acquired#65
delgod wants to merge 1 commit into
9/edgefrom
scale-down-lock

Conversation

@delgod
Copy link
Copy Markdown
Member

@delgod delgod commented May 22, 2026

Removing the whole application fires storage-detaching on every unit at once. Each unit races for a single scale-down lock; the losers raised RequestingLockTimedOutError, which failed the hook and parked the unit in error state. Its agent is then torn down before juju can retry the hook, so the unit stays wedged and the application can never finish removing — juju.wait(APP not in status) times out (seen in k8s/test_scaling.py:: test_scale_down_remove_application).

  • ScaleDownLock.request_lock now genuinely retries under contention. The retry count was min(timeout // 5, 1) (always 1) and a contended SET NX returns falsy WITHOUT raising, so tenacity never re-polled. Replace it with an explicit poll loop using max(...) so the timeout is honored. On retry it re-resolves the primary with the resilient get_primary_ip_for_scale_down() (which rides out transient sentinel failures during teardown) and builds the ValkeyClient once.

  • _on_storage_detaching no longer raises when the lock is unavailable during full-app removal (planned_units() == 0): it goes away gracefully instead of wedging, logging at warning with the unit name. A partial single-unit scale-down still raises so juju retries the hook.

  • While acquiring the lock, ValkeyCannotGetPrimaryIPError and ValkeyWorkloadCommandError (e.g. from get_active_sentinel_ips) are tolerated the same way: go away on full-app removal, otherwise re-raise so juju retries — rather than silently skipping failover and the dataset save on a partial scale-down.

Covered by unit tests in tests/unit/test_locks.py and tests/unit/test_scaledown.py.

  • Have you updated relevant documentation?

Removing the whole application fires storage-detaching on every unit at
once. Each unit races for a single scale-down lock; the losers raised
RequestingLockTimedOutError, which failed the hook and parked the unit in
error state. Its agent is then torn down before juju can retry the hook,
so the unit stays wedged and the application can never finish removing —
`juju.wait(APP not in status)` times out (seen in k8s/test_scaling.py::
test_scale_down_remove_application).

- ScaleDownLock.request_lock now genuinely retries under contention. The
  retry count was `min(timeout // 5, 1)` (always 1) and a contended SET NX
  returns falsy WITHOUT raising, so tenacity never re-polled. Replace it
  with an explicit poll loop using `max(...)` so the timeout is honored.
  On retry it re-resolves the primary with the resilient
  get_primary_ip_for_scale_down() (which rides out transient sentinel
  failures during teardown) and builds the ValkeyClient once.

- _on_storage_detaching no longer raises when the lock is unavailable
  during full-app removal (planned_units() == 0): it goes away gracefully
  instead of wedging, logging at warning with the unit name. A partial
  single-unit scale-down still raises so juju retries the hook.

- While acquiring the lock, ValkeyCannotGetPrimaryIPError and
  ValkeyWorkloadCommandError (e.g. from get_active_sentinel_ips) are
  tolerated the same way: go away on full-app removal, otherwise re-raise
  so juju retries — rather than silently skipping failover and the dataset
  save on a partial scale-down.

Covered by unit tests in tests/unit/test_locks.py and
tests/unit/test_scaledown.py.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Comment thread src/events/base_events.py
self.charm.state.unit_server.unit_name,
)
self._set_state_for_going_away()
return
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we return here we will not save the databse to disk before stopping the workload. The unit will just be killed by juju

Comment thread src/common/locks.py
# must re-attempt on a fixed interval until it is acquired or the timeout
# elapses. With no timeout we make a single attempt and let the caller
# decide how to wait (e.g. defer the hook so juju re-runs it later).
number_of_retries = max(timeout // 5, 1) if timeout else 1
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good catch. I thinkwe should still use tenacity to manage the retries though.

Comment thread src/events/base_events.py
e,
)
self._set_state_for_going_away()
return
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On full app removal, we currently still orchestrate the shutdown to be executed one unit at a time but whther we change that to all units doing it at the same time or not, saving the DB to disk on line 636 should always be executed before the workload is shutdown.

Comment thread src/literals.py
# How long a unit waits in-process to acquire the scale-down lock before either
# deferring the storage-detaching hook (so juju re-runs it) or, on full-app
# removal, going away. Kept well under juju's hook timeout.
SCALE_DOWN_LOCK_TIMEOUT = 30
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should increase it to something like 300. This should enable most units to avoid erroring out in storage detaching

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants