test(ha): harden flaky tests #64
test_network_cut_primary flakes in CI for two independent reasons:

- VM: the new-primary election wait was ~30s (stop_after_attempt(4) x wait_fixed(10)), but Sentinel's down-after-milliseconds is 30s (src/managers/config.py), so the final attempt landed exactly when failover begins -- before any replica is promoted. Widen the window to ~140s, still under the 180s failover-timeout. The loop breaks as soon as a new primary appears, so the happy path is unaffected.
- K8s: applying the Chaos Mesh NetworkChaos manifest races the mnetworkchaos.kb.io admission webhook coming up and fails with "connection refused". Retry the idempotent apply for a bounded window so the webhook has time to start serving.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
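The window arithmetic in this commit message can be checked with a short sketch. It assumes the retry loop sleeps only between attempts, so N attempts give N-1 waits, and it ignores the time each attempt itself takes. The helper name `max_wait_seconds` is hypothetical; the two Sentinel values are the ones quoted in the commit message.

```python
# Values quoted in the commit message (not read from src/managers/config.py here).
DOWN_AFTER_SECONDS = 30         # Sentinel down-after-milliseconds
FAILOVER_TIMEOUT_SECONDS = 180  # Sentinel failover-timeout


def max_wait_seconds(attempts: int, wait: int) -> int:
    """Offset of the last attempt from the first one, for a fixed wait.

    Hypothetical helper: a fixed-wait retry loop sleeps between attempts,
    so `attempts` tries are separated by `attempts - 1` waits.
    """
    return (attempts - 1) * wait


old_window = max_wait_seconds(4, 10)   # stop_after_attempt(4) x wait_fixed(10)
new_window = max_wait_seconds(15, 10)  # stop_after_attempt(15) x wait_fixed(10)

# The old window ends no later than the point where Sentinel only starts
# treating the primary as down; the new one ends well after it.
old_covers_failover = old_window > DOWN_AFTER_SECONDS
new_covers_failover = DOWN_AFTER_SECONDS < new_window < FAILOVER_TIMEOUT_SECONDS
```

Under these assumptions the old loop gives up at the moment failover can first begin, while the widened loop leaves roughly 110s of headroom after it.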
test_network_cut_primary on K8s races chaos-mesh startup. The chaos_mesh fixture is module-scoped, so deploy_chaos_mesh runs immediately before the test, and the script only did `helm install ... && sleep 10` without waiting for the components to be Ready. The test then hits two symptoms of the same cause:

- the mnetworkchaos.kb.io admission webhook refuses connections because the controller-manager pod is not ready, and
- once the apply lands, the loss:"100" rule is programmed late by a not-yet-ready chaos-daemon, so the first ping packet leaks through and the reachability check still sees the cut unit as reachable.

Add `--wait --timeout 5m0s` to the helm install so the controller-manager Deployment and chaos-daemon DaemonSet are Ready before any test applies a NetworkChaos. The apply retry added earlier stays as defense in depth for the brief webhook serving-cert propagation gap.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
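The bounded retry that stays as defense in depth could look roughly like the sketch below. This is not the code from the PR: `apply_with_retry`, its defaults, and the manifest path in the usage note are hypothetical, and it assumes the apply command is idempotent, as the commit message states.

```python
import subprocess
import time
from typing import Callable, Sequence


def apply_with_retry(
    cmd: Sequence[str],
    attempts: int = 6,
    delay: float = 5.0,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
    sleep: Callable[[float], None] = time.sleep,
) -> subprocess.CompletedProcess:
    """Run an idempotent command, retrying a bounded number of times.

    Hypothetical helper. `run` and `sleep` are injectable so the loop can be
    exercised without a cluster.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            # check=True turns a non-zero exit (e.g. "connection refused"
            # from the admission webhook) into CalledProcessError.
            return run(list(cmd), check=True, capture_output=True, text=True)
        except subprocess.CalledProcessError as err:
            last_error = err
            if attempt < attempts:
                sleep(delay)  # give the webhook time to start serving
    # Every attempt failed: surface the last error instead of hiding it.
    raise last_error
```

A call such as `apply_with_retry(["kubectl", "apply", "-f", "network-chaos.yaml"])` (the file name is a placeholder) would then retry for at most `(attempts - 1) * delay` seconds of sleep before failing with the last error.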
47d6e89 to 619e954
Multi-unit VM integration tests (test_scaling, test_storage_reuse) run each juju machine as an LXD instance under /var/snap/lxd/common/lxd on the root disk, and they exhausted it: LXD's unsquashfs failed with "No space left on device" while unpacking a container rootfs, after the runner reported "Free space left: 17 MB".

These GitHub VM runners are NVMe-backed with a single ~72G root volume (~17G free) and no separate /mnt SSD to offload to. Free the ~30G of unused preinstalled toolchains (android, dotnet, ghcup, CodeQL, Docker images) on / instead, gated to VM jobs, with a fail-loud guard if <25G remains after cleanup. Also bump the spread backend's default VM disk 20G->40G for backends that allocate their own VMs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
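The fail-loud guard described above amounts to a free-space check after cleanup. The workflow step itself is presumably a shell script; the Python sketch below only illustrates the guard logic. The function names are hypothetical, and treating "25G" as 25 GiB is an assumption.

```python
import shutil

MIN_FREE_GIB = 25  # threshold from the commit message, assumed to mean GiB


def free_gib(path: str = "/") -> float:
    """Free space on the filesystem holding `path`, in GiB."""
    return shutil.disk_usage(path).free / 2**30


def check_free_space(free: float, minimum: float = MIN_FREE_GIB) -> None:
    """Abort loudly if the cleanup did not free enough space.

    Hypothetical guard: failing here is cheaper than letting LXD run out
    of space halfway through unpacking a container rootfs.
    """
    if free < minimum:
        raise SystemExit(
            f"only {free:.1f} GiB free after cleanup, need at least {minimum} GiB"
        )
```

In a job this would be called as `check_free_space(free_gib("/"))` right after the cleanup step, so a runner image that changes its preinstalled toolchains fails at the guard rather than deep inside a test.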
619e954 to 6ae5f49
Mehdi-Bendriss left a comment
Thanks Mykola! Looks good, I only have 1 question
# see src/managers/config.py) and then needs time to reach quorum and promote a
# replica, so a new primary cannot appear before ~30s. Wait well past that (but
# under the 180s `failover-timeout`) to absorb CI scheduling jitter.
for attempt in Retrying(stop=stop_after_attempt(15), wait=wait_fixed(10), reraise=True):
Shouldn't we assume that for a small cluster (not tens of GB of data) failover should happen fast (right after down-after-milliseconds), and that if it takes longer it's abnormal and potentially a bug to be investigated?
In my personal opinion, failover should happen within 1-3 seconds, with failover-timeout = 2 x down-after-milliseconds. But that is definitely a separate spec.
skourta left a comment
Can we run the CI a few times to confirm the stability?
reneradoi left a comment
Just a small question from my side, other than that I'm fine with the changes.
with:
pattern: ${{ inputs.artifact-prefix }}-*
merge-multiple: true
- name: Free disk space on /
Did you add this for a particular reason, or is it a leftover from before we moved the integration tests to GitHub-hosted runners? From what I can see, we usually have 80 to 90 GB of disk space available after the env setup, which should be enough for the tests we run.
Have you updated relevant documentation?