test(ha): harden flaky tests #64

Open
delgod wants to merge 3 commits into 9/edge from harden-flaky-network-tests

Member

@delgod delgod commented May 22, 2026

test_network_cut_primary flakes in CI for two independent reasons:

  • VM: the new-primary election wait was ~30s (stop_after_attempt(4) x wait_fixed(10)), but Sentinel's down-after-milliseconds is 30s (src/managers/config.py), so the final attempt landed exactly when failover begins -- before any replica is promoted. Widen the window to ~140s, still under the 180s failover-timeout. The loop breaks as soon as a new primary appears, so the happy path is unaffected.

  • K8s: applying the Chaos Mesh NetworkChaos manifest races the mnetworkchaos.kb.io admission webhook coming up and fails with "connection refused". Retry the idempotent apply for a bounded window so the webhook has time to start serving.

  • Have you updated relevant documentation?
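
The VM timing in the description can be checked with a little arithmetic. This is a minimal sketch, assuming the retry loop sleeps only between attempts (so N attempts contain N-1 waits) and treating each attempt as instantaneous. `last_attempt_offset` is a hypothetical helper, not code from this PR; the 15-attempt figure is the value shown in the reviewed diff.

```python
DOWN_AFTER_S = 30         # Sentinel down-after-milliseconds, per the description
FAILOVER_TIMEOUT_S = 180  # Sentinel failover-timeout, per the description


def last_attempt_offset(attempts: int, wait_s: int) -> int:
    """Seconds from the first attempt to the last one.

    Assumes a fixed wait between attempts, no wait after the final
    attempt, and attempts that take no time themselves.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return (attempts - 1) * wait_s


old_window = last_attempt_offset(4, 10)
new_window = last_attempt_offset(15, 10)

# The old window ends right at down-after-milliseconds, leaving no time
# for Sentinel to promote a replica.
print(old_window, old_window <= DOWN_AFTER_S)
# The new window clears that threshold while staying under failover-timeout.
print(new_window, DOWN_AFTER_S < new_window < FAILOVER_TIMEOUT_S)
```

Real attempts take some time, so the actual windows are slightly longer than these figures.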

delgod and others added 2 commits May 22, 2026 11:27
test_network_cut_primary flakes in CI for two independent reasons:

- VM: the new-primary election wait was ~30s (stop_after_attempt(4) x
  wait_fixed(10)), but Sentinel's down-after-milliseconds is 30s
  (src/managers/config.py), so the final attempt landed exactly when
  failover begins -- before any replica is promoted. Widen the window
  to ~140s, still under the 180s failover-timeout. The loop breaks as
  soon as a new primary appears, so the happy path is unaffected.

- K8s: applying the Chaos Mesh NetworkChaos manifest races the
  mnetworkchaos.kb.io admission webhook coming up and fails with
  "connection refused". Retry the idempotent apply for a bounded window
  so the webhook has time to start serving.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
test_network_cut_primary on K8s races chaos-mesh startup. The chaos_mesh
fixture is module-scoped, so deploy_chaos_mesh runs immediately before the
test, and the script only did `helm install ... && sleep 10` without waiting
for the components to be Ready. The test then hits two symptoms of the same
cause:

- the mnetworkchaos.kb.io admission webhook refuses connections because the
  controller-manager pod is not ready, and
- once the apply lands, the loss:"100" rule is programmed late by a
  not-yet-ready chaos-daemon, so the first ping packet leaks through and the
  reachability check still sees the cut unit as reachable.

Add `--wait --timeout 5m0s` to the helm install so the controller-manager
Deployment and chaos-daemon DaemonSet are Ready before any test applies a
NetworkChaos. The apply retry added earlier stays as defense in depth for the
brief webhook serving-cert propagation gap.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
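
The bounded apply retry described in these commits could look roughly like the sketch below. It is an illustration under assumptions, not the code in this PR: `retry_for` and `apply_manifest` are hypothetical names, the clock and sleep functions are injectable only to make the loop easy to exercise, and the manifest path is a placeholder.

```python
import subprocess
import time
from typing import Callable


def retry_for(
    action: Callable[[], None],
    window_s: float,
    interval_s: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `action` until it succeeds or the window would be exceeded.

    Returns the number of attempts made. Re-raises the last error once
    another wait would run past the deadline, so a persistent failure
    still fails loudly.
    """
    deadline = clock() + window_s
    attempts = 0
    while True:
        attempts += 1
        try:
            action()
            return attempts
        except subprocess.CalledProcessError:
            if clock() + interval_s > deadline:
                raise
            sleep(interval_s)


def apply_manifest(path: str) -> None:
    # `kubectl apply` is idempotent, so repeating it after a refused
    # webhook connection is safe. `path` is a placeholder.
    subprocess.run(["kubectl", "apply", "-f", path], check=True)
```

A call such as `retry_for(lambda: apply_manifest("network-chaos.yaml"), window_s=60, interval_s=5)` would keep retrying for about a minute; both numbers are illustrative, not the values used in the test suite.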
@delgod delgod force-pushed the harden-flaky-network-tests branch from 47d6e89 to 619e954 Compare May 22, 2026 13:17
@delgod delgod changed the title from "test(ha): harden flaky network-cut failover checks" to "test(ha): harden flaky tests" May 22, 2026
Multi-unit VM integration tests (test_scaling, test_storage_reuse) run
each juju machine as an LXD instance under /var/snap/lxd/common/lxd on
the root disk and exhausted it: LXD's unsquashfs failed with "No space
left on device" unpacking a container rootfs after the runner hit
"Free space left: 17 MB".

These GitHub VM runners are nvme-backed with a single ~72G root volume
(~17G free) and no separate /mnt SSD to offload to, so free the ~30G of
unused preinstalled toolchains (android, dotnet, ghcup, CodeQL, Docker
images) on / instead, gated to VM jobs, with a fail-loud guard if <25G
remains after cleanup. Also bump the spread backend's default VM disk
20G->40G for backends that allocate their own VMs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
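
The fail-loud guard mentioned in this commit can be sketched as follows. This is an assumption-laden illustration rather than the workflow step itself: `require_free_space` is a hypothetical name, the threshold is interpreted as GiB, and the `usage` parameter exists only so the check can be exercised without touching a real disk.

```python
import shutil
from typing import Any, Callable

GIB = 1024 ** 3


def require_free_space(
    path: str,
    min_free_gib: float,
    usage: Callable[[str], Any] = shutil.disk_usage,
) -> float:
    """Return the free space on `path` in GiB, or raise if it is too low."""
    free_gib = usage(path).free / GIB
    if free_gib < min_free_gib:
        # Fail loudly so the job stops here instead of dying later
        # with "No space left on device" in the middle of a test.
        raise RuntimeError(
            f"only {free_gib:.1f} GiB free on {path}, need {min_free_gib} GiB"
        )
    return free_gib
```

Running `require_free_space("/", 25)` after the cleanup would abort the job early when the cleanup did not reclaim enough space.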
@delgod delgod force-pushed the harden-flaky-network-tests branch from 619e954 to 6ae5f49 Compare May 22, 2026 14:54
Contributor

@Mehdi-Bendriss Mehdi-Bendriss left a comment

Thanks Mykola! Looks good, I only have 1 question

# see src/managers/config.py) and then needs time to reach quorum and promote a
# replica, so a new primary cannot appear before ~30s. Wait well past that (but
# under the 180s `failover-timeout`) to absorb CI scheduling jitter.
for attempt in Retrying(stop=stop_after_attempt(15), wait=wait_fixed(10), reraise=True):
Contributor

@Mehdi-Bendriss Mehdi-Bendriss May 25, 2026

Shouldn't we assume that for a small cluster (not tens of GB of data) failover should happen fast (right after down-after-milliseconds), and that anything longer is abnormal and potentially a bug to investigate?

Member Author

In my personal opinion, failover should happen within 1-3 seconds, with failover-timeout = 2 x down-after-milliseconds. But that is definitely a separate spec.

Contributor

tracked here

Contributor

@skourta skourta left a comment

Can we run the CI a few times to confirm the stability?

Contributor

@reneradoi reneradoi left a comment

Just a small question from my side, other than that I'm fine with the changes.

  with:
    pattern: ${{ inputs.artifact-prefix }}-*
    merge-multiple: true
- name: Free disk space on /
Contributor

Did you add this for a particular reason, or is it a leftover from before we moved the integration tests to GitHub-hosted runners? From what I can see, we usually have 80 to 90 GB of disk space available after the env setup, which should be enough for the tests we run.

4 participants