test(ha): harden flaky tests by delgod · Pull Request #64 · canonical/valkey-operator

delgod · 2026-05-22T11:27:38Z

test_network_cut_primary flakes in CI for two independent reasons:

- VM: the new-primary election wait was ~30s (stop_after_attempt(4) x wait_fixed(10)), but Sentinel's down-after-milliseconds is 30s (src/managers/config.py), so the final attempt landed exactly when failover begins -- before any replica is promoted. Widen the window to ~140s, still under the 180s failover-timeout. The loop breaks as soon as a new primary appears, so the happy path is unaffected.
- K8s: applying the Chaos Mesh NetworkChaos manifest races the mnetworkchaos.kb.io admission webhook coming up and fails with "connection refused". Retry the idempotent apply for a bounded window so the webhook has time to start serving.
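
The ~30s and ~140s figures for the VM case follow from how a fixed-wait retry loop spends its time: with N attempts there are N-1 sleeps between them. The sketch below only reproduces that arithmetic. The helper name `retry_window_seconds` is hypothetical, and the value 15 is the attempt count shown in the reviewed diff; the time each attempt itself takes is ignored.

```python
def retry_window_seconds(attempts: int, wait_seconds: float) -> float:
    """Time spent sleeping between attempts of a fixed-wait retry loop.

    With N attempts there are N - 1 waits; the duration of each attempt
    itself is ignored, so the real window is slightly longer.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    return (attempts - 1) * wait_seconds


# Values quoted in the PR description (Sentinel settings, in seconds).
DOWN_AFTER_SECONDS = 30
FAILOVER_TIMEOUT_SECONDS = 180

old_window = retry_window_seconds(4, 10)   # stop_after_attempt(4) x wait_fixed(10)
new_window = retry_window_seconds(15, 10)  # stop_after_attempt(15) x wait_fixed(10)

print(f"old window: {old_window}s, new window: {new_window}s")
```

The old window ends at the same moment Sentinel first marks the primary as down, while the new one leaves roughly 110s for quorum and promotion and still ends before failover-timeout.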
Have you updated relevant documentation?

test_network_cut_primary flakes in CI for two independent reasons:

- VM: the new-primary election wait was ~30s (stop_after_attempt(4) x wait_fixed(10)), but Sentinel's down-after-milliseconds is 30s (src/managers/config.py), so the final attempt landed exactly when failover begins -- before any replica is promoted. Widen the window to ~140s, still under the 180s failover-timeout. The loop breaks as soon as a new primary appears, so the happy path is unaffected.
- K8s: applying the Chaos Mesh NetworkChaos manifest races the mnetworkchaos.kb.io admission webhook coming up and fails with "connection refused". Retry the idempotent apply for a bounded window so the webhook has time to start serving.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

test_network_cut_primary on K8s races chaos-mesh startup. The chaos_mesh fixture is module-scoped, so deploy_chaos_mesh runs immediately before the test, and the script only did `helm install ... && sleep 10` without waiting for the components to be Ready. The test then hits two symptoms of the same cause:

- the mnetworkchaos.kb.io admission webhook refuses connections because the controller-manager pod is not ready, and
- once the apply lands, the loss:"100" rule is programmed late by a not-yet-ready chaos-daemon, so the first ping packet leaks through and the reachability check still sees the cut unit as reachable.

Add `--wait --timeout 5m0s` to the helm install so the controller-manager Deployment and chaos-daemon DaemonSet are Ready before any test applies a NetworkChaos. The apply retry added earlier stays as defense in depth for the brief webhook serving-cert propagation gap.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
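
The bounded apply retry mentioned above could look roughly like the following. This is a minimal sketch, not the code in this PR: the function name, the 60s/5s defaults, and the injectable `run`, `sleep`, and `clock` parameters are all assumptions chosen to keep the example self-contained and testable without a cluster. It is only safe for idempotent commands such as `kubectl apply`.

```python
import subprocess
import time
from typing import Callable, Sequence


def run_with_bounded_retry(
    cmd: Sequence[str],
    timeout_seconds: float = 60.0,   # assumed window, not taken from the PR
    interval_seconds: float = 5.0,   # assumed pause between attempts
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> subprocess.CompletedProcess:
    """Re-run an idempotent command until it exits 0 or the window closes."""
    deadline = clock() + timeout_seconds
    while True:
        result = run(list(cmd), capture_output=True, text=True)
        if result.returncode == 0:
            return result
        # Give up if the next attempt would start after the deadline.
        if clock() + interval_seconds > deadline:
            raise RuntimeError(
                f"{cmd[0]} still failing after {timeout_seconds}s: "
                f"{result.stderr.strip()}"
            )
        sleep(interval_seconds)
```

A call might look like `run_with_bounded_retry(["kubectl", "apply", "-f", "network-chaos.yaml"])`, where the manifest file name is hypothetical. Raising the last stderr on timeout keeps a genuine webhook failure visible instead of hiding it behind the retry.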

Multi-unit VM integration tests (test_scaling, test_storage_reuse) run each juju machine as an LXD instance under /var/snap/lxd/common/lxd on the root disk and exhausted it: LXD's unsquashfs failed with "No space left on device" unpacking a container rootfs after the runner hit "Free space left: 17 MB".

These GitHub VM runners are NVMe-backed with a single ~72G root volume (~17G free) and no separate /mnt SSD to offload to, so free the ~30G of unused preinstalled toolchains (android, dotnet, ghcup, CodeQL, Docker images) on / instead, gated to VM jobs, with a fail-loud guard if <25G remains after cleanup.

Also bump the spread backend's default VM disk 20G->40G for backends that allocate their own VMs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
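
The fail-loud guard described in this commit can be illustrated with a short check on free space. This is a sketch only and not the workflow step itself: the function name is hypothetical, the `usage` parameter exists solely so the check can be exercised without touching a real disk, and treating "25G" as 25 GiB is an assumption.

```python
import shutil

GIB = 1024 ** 3  # assumption: the commit's "G" is read as GiB


def require_free_space(path: str = "/", min_free_gib: float = 25.0,
                       usage=shutil.disk_usage) -> float:
    """Return free space on `path` in GiB, raising if it is below the minimum."""
    free_gib = usage(path).free / GIB
    if free_gib < min_free_gib:
        # Fail loudly so the job stops here rather than deep inside a test.
        raise RuntimeError(
            f"only {free_gib:.1f} GiB free on {path}, need {min_free_gib:.1f} GiB"
        )
    return free_gib
```

Running such a check right after the cleanup turns a late "No space left on device" inside LXD into an early, clearly labelled failure.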

Mehdi-Bendriss

Thanks, Mykola! Looks good; I only have one question.

Mehdi-Bendriss · 2026-05-25T15:05:31Z

+    # see src/managers/config.py) and then needs time to reach quorum and promote a
+    # replica, so a new primary cannot appear before ~30s. Wait well past that (but
+    # under the 180s `failover-timeout`) to absorb CI scheduling jitter.
+    for attempt in Retrying(stop=stop_after_attempt(15), wait=wait_fixed(10), reraise=True):


Shouldn't we assume that, for a small cluster (not tens of GB of data), failover should happen fast (right after down-after-milliseconds), and that if it takes longer it is abnormal and potentially a bug to investigate?

In my personal opinion, failover should happen within 1-3 seconds, with failover-timeout = 2 x down-after-milliseconds. But that is definitely a separate spec.

tracked here

skourta

Can we run the CI a few times to confirm the stability?

reneradoi

Just a small question from my side, other than that I'm fine with the changes.

reneradoi · 2026-05-26T06:19:51Z

        with:
          pattern: ${{ inputs.artifact-prefix }}-*
          merge-multiple: true
+      - name: Free disk space on /


Did you add this for a particular reason, or is it a leftover from before we moved the integration tests to GitHub-hosted runners? From what I can see, we usually have 80 to 90 GB of disk space available after the env setup, which should be enough for the tests we run.

delgod and others added 2 commits May 22, 2026 11:27

delgod force-pushed the harden-flaky-network-tests branch from 47d6e89 to 619e954 Compare May 22, 2026 13:17

delgod changed the title ~~test(ha): harden flaky network-cut failover checks~~ test(ha): harden flaky tests May 22, 2026

delgod force-pushed the harden-flaky-network-tests branch from 619e954 to 6ae5f49 Compare May 22, 2026 14:54

Mehdi-Bendriss reviewed May 25, 2026

skourta approved these changes May 26, 2026

skourta reviewed May 26, 2026

reneradoi reviewed May 26, 2026

Mehdi-Bendriss approved these changes May 26, 2026

delgod wants to merge 3 commits into 9/edge from harden-flaky-network-tests
