Skip to content

Commit 1869645

Browse files
committed
Chaos Test: MySQL Chaos Test Blog Post
Signed-off-by: SK Ali Arman <arman@appscode.com>
1 parent 99d0a23 commit 1869645

1 file changed

Lines changed: 245 additions & 0 deletions

File tree

  • content/post/chaos-testing-mysql
Lines changed: 245 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,245 @@
1+
---
2+
title: "Chaos Engineering Results: KubeDB MySQL Group Replication Achieves Zero Data Loss Across 60 Experiments"
3+
date: "2026-04-08"
4+
weight: 14
5+
authors:
6+
- SK Ali Arman
7+
tags:
8+
- chaos-engineering
9+
- chaos-mesh
10+
- database
11+
- kubedb
12+
- kubernetes
13+
- mysql
14+
- group-replication
15+
- high-availability
16+
---
17+
18+
## Overview
19+
20+
We conducted **60 chaos experiments** across **4 MySQL versions** and **2 Group Replication topologies** (Single-Primary and Multi-Primary) on KubeDB-managed 3-node clusters. The goal: validate that KubeDB MySQL delivers **zero data loss**, **automatic failover**, and **self-healing recovery** under realistic failure conditions with production-level write loads.
21+
22+
**The result: every experiment on MySQL 8.0+ passed with zero data loss, zero split-brain, and zero errant GTIDs.**
23+
24+
This post summarizes the methodology, results, and key findings from the most comprehensive chaos testing effort we have run on KubeDB MySQL to date.
25+
26+
## Why Chaos Testing?
27+
28+
Running databases on Kubernetes introduces failure modes that traditional infrastructure does not have — pods can be evicted, nodes can go down, network policies can partition traffic, and resource limits can trigger OOMKills at any time. Chaos engineering deliberately injects these failures to verify that the system recovers correctly **before** they happen in production.
29+
30+
For a MySQL Group Replication cluster managed by KubeDB, we needed to answer:
31+
32+
- Does the cluster **lose data** when a primary is killed mid-transaction?
33+
- Does **automatic failover** work under network partitions?
34+
- Can the cluster **self-heal** after a full outage with no manual intervention?
35+
- Are **GTIDs consistent** across all nodes after recovery?
36+
- Does the cluster survive **combined failures** (CPU + memory + load simultaneously)?
37+
38+
## Test Environment
39+
40+
| Component | Details |
41+
|---|---|
42+
| Cluster Topology | 3-node Group Replication (Single-Primary & Multi-Primary) |
43+
| MySQL Versions | 5.7.44, 8.0.36, 8.4.8, 9.6.0 |
44+
| Storage | 2Gi PVC per node (Durable, ReadWriteOnce) |
45+
| Memory Limit | 1.5Gi per MySQL pod |
46+
| CPU Request | 500m per pod |
47+
| Chaos Engine | Chaos Mesh |
48+
| Load Generator | sysbench `oltp_write_only`, 4-12 tables, 4-16 threads |
49+
| Baseline TPS | ~2,400 (Single-Primary) / ~1,150 (Multi-Primary) |
50+
51+
All experiments were run under **sustained sysbench write load** to simulate production traffic during failures. The load generator ran as a Kubernetes Deployment inside the same namespace as the MySQL cluster.
52+
53+
## The 12-Experiment Matrix
54+
55+
Every MySQL version and topology was tested against the same 12-experiment matrix covering single-node failures, resource exhaustion, network degradation, and complex multi-fault scenarios:
56+
57+
| # | Experiment | Chaos Type | What It Tests |
58+
|---|---|---|---|
59+
| 1 | Pod Kill | PodChaos | Ungraceful termination (grace-period=0) |
60+
| 2 | OOMKill | StressChaos / Load | Memory exhaustion beyond pod limits |
61+
| 3 | Network Partition | NetworkChaos | Isolate a node from the cluster |
62+
| 4 | CPU Stress (98%) | StressChaos | Extreme CPU pressure on nodes |
63+
| 5 | IO Latency (100ms) | IOChaos | Disk I/O delays on a node |
64+
| 6 | Network Latency (1s) | NetworkChaos | Replication traffic delays |
65+
| 7 | Packet Loss (30%) | NetworkChaos | Unreliable network across cluster |
66+
| 8 | Combined Stress | StressChaos x3 | Memory + CPU + load simultaneously |
67+
| 9 | Full Cluster Kill | kubectl delete | All 3 pods deleted at once |
68+
| 10 | OOMKill Natural | Load | 128-thread queries to exhaust memory |
69+
| 11 | Scheduled Pod Kill | Schedule | Repeated kills every 30s-1min |
70+
| 12 | Degraded Failover | Workflow | IO latency + pod kill in sequence |
71+
72+
## Data Integrity Validation
73+
74+
Every experiment verified data integrity through **4 checks** across all 3 nodes:
75+
76+
1. **GTID Consistency**`SELECT @@gtid_executed` must match on all nodes after recovery
77+
2. **Checksum Verification**`CHECKSUM TABLE` on all sysbench tables must match across nodes
78+
3. **Row Count Validation** — Cumulative tracking table row counts must be preserved
79+
4. **Errant GTID Detection** — No local `server_uuid` GTIDs outside the group UUID
80+
81+
## Results — Single-Primary Mode
82+
83+
### MySQL 9.6.0 — All 12 PASSED
84+
85+
| # | Experiment | Failover | Data Loss | Errant GTIDs | Verdict |
86+
|---|---|---|---|---|---|
87+
| 1 | Pod Kill Primary | Yes | Zero | 0 | **PASS** |
88+
| 2 | OOMKill Natural | Yes | Zero | 0 | **PASS** |
89+
| 3 | Network Partition | Yes | Zero | 0 | **PASS** |
90+
| 4 | IO Latency (100ms) | No | Zero | 0 | **PASS** |
91+
| 5 | Network Latency (1s) | No | Zero | 0 | **PASS** |
92+
| 6 | CPU Stress (98%) | No | Zero | 0 | **PASS** |
93+
| 7 | Packet Loss (30%) | Yes | Zero | 0 | **PASS** |
94+
| 8 | Combined Stress | Yes (OOMKill) | Zero | 0 | **PASS** |
95+
| 9 | Full Cluster Kill | Yes | Zero | 0 | **PASS** |
96+
| 10 | OOMKill Retry | No (survived) | Zero | 0 | **PASS** |
97+
| 11 | Scheduled Replica Kill | Multiple | Zero | 0 | **PASS** |
98+
| 12 | Degraded Failover | Yes | Zero | 0 | **PASS** |
99+
100+
### MySQL 8.4.8 — All 12 PASSED
101+
102+
| # | Experiment | Failover | Data Loss | Errant GTIDs | Verdict |
103+
|---|---|---|---|---|---|
104+
| 1 | Pod Kill Primary | Yes | Zero | 0 | **PASS** |
105+
| 2 | OOMKill Stress | No (survived) | Zero | 0 | **PASS** |
106+
| 3 | Network Partition | Yes | Zero | 0 | **PASS** |
107+
| 4 | IO Latency (100ms) | No | Zero | 0 | **PASS** |
108+
| 5 | Network Latency (1s) | No | Zero | 0 | **PASS** |
109+
| 6 | CPU Stress (98%) | No | Zero | 0 | **PASS** |
110+
| 7 | Packet Loss (30%) | Yes | Zero | 0 | **PASS** |
111+
| 8 | Combined Stress | Yes (OOMKill) | Zero | 0 | **PASS** |
112+
| 9 | Full Cluster Kill | Yes | Zero | 0 | **PASS** |
113+
| 10 | OOMKill Natural | No (survived) | Zero | 0 | **PASS** |
114+
| 11 | Scheduled Replica Kill | Multiple | Zero | 0 | **PASS** |
115+
| 12 | Degraded Failover | Yes | Zero | 0 | **PASS** |
116+
117+
### MySQL 8.0.36 — All 12 PASSED
118+
119+
| # | Experiment | Failover | Data Loss | Errant GTIDs | Verdict |
120+
|---|---|---|---|---|---|
121+
| 1 | Pod Kill Primary | Yes | Zero | 0 | **PASS** |
122+
| 2 | OOMKill Natural | No (survived) | Zero | 0 | **PASS** |
123+
| 3 | Network Partition | Yes | Zero | 0 | **PASS** |
124+
| 4 | IO Latency (100ms) | No | Zero | 0 | **PASS** |
125+
| 5 | Network Latency (1s) | No | Zero | 0 | **PASS** |
126+
| 6 | CPU Stress (98%) | No | Zero | 0 | **PASS** |
127+
| 7 | Packet Loss (30%) | Yes | Zero | 0 | **PASS** |
128+
| 8 | Combined Stress | Yes (OOMKill) | Zero | 0 | **PASS** |
129+
| 9 | Full Cluster Kill | Yes | Zero | 0 | **PASS** |
130+
| 10 | OOMKill Natural (retry) | Yes | Zero | 0 | **PASS** |
131+
| 11 | Scheduled Replica Kill | Multiple | Zero | 0 | **PASS** |
132+
| 12 | Degraded Failover | Yes | Zero | 0 | **PASS** |
133+
134+
### MySQL 5.7.44 — 1 PASSED, 1 FAILED, 10 BLOCKED
135+
136+
| # | Experiment | Data Loss | Errant GTIDs | Verdict |
137+
|---|---|---|---|---|
138+
| 1 | Pod Kill Primary | Zero | 0 | **PASS** |
139+
| 2 | OOMKill Natural | Zero | **1 (persistent)** | **FAIL** |
140+
| 3-12 | All remaining ||| **BLOCKED** |
141+
142+
MySQL 5.7 does not support the CLONE plugin (requires 8.0.17+). After OOMKill, a persistent errant GTID prevented the node from rejoining. The cluster degraded to 2 nodes with no automatic recovery path. **Recommendation: upgrade to MySQL 8.0+.**
143+
144+
## Results — Multi-Primary Mode (MySQL 8.4.8)
145+
146+
In Multi-Primary mode, **all 3 nodes accept writes** — there is no primary/replica distinction. This changes the failure dynamics significantly: no failover election is needed, but Paxos consensus must be maintained across all writable nodes.
147+
148+
| # | Experiment | Data Loss | GTIDs | Checksums | Verdict |
149+
|---|---|---|---|---|---|
150+
| 1 | Pod Kill (random) | Zero | MATCH | MATCH | **PASS** |
151+
| 2 | OOMKill (1200MB stress) | Zero | MATCH | MATCH | **PASS** |
152+
| 3 | Network Partition (3 min) | Zero | MATCH | MATCH | **PASS** |
153+
| 4 | CPU Stress (98%, 3 min) | Zero | MATCH | MATCH | **PASS** |
154+
| 5 | IO Latency (100ms, 3 min) | Zero | MATCH | MATCH | **PASS** |
155+
| 6 | Network Latency (1s, 3 min) | Zero | MATCH | MATCH | **PASS** |
156+
| 7 | Packet Loss (30%, 3 min) | Zero | MATCH | MATCH | **PASS** |
157+
| 8 | Combined Stress (mem+cpu+load) | Zero | MATCH | MATCH | **PASS** |
158+
| 9 | Full Cluster Kill | Zero | MATCH | MATCH | **PASS** |
159+
| 10 | OOMKill Natural (90 JOINs) | Zero | MATCH | MATCH | **PASS** |
160+
| 11 | Scheduled Pod Kill (every 1 min) | Zero | MATCH | MATCH | **PASS** |
161+
| 12 | Degraded Failover (IO + Kill) | Zero | MATCH | MATCH | **PASS** |
162+
163+
**All 12 experiments PASSED with zero data loss.**
164+
165+
## Failover Performance (Single-Primary)
166+
167+
| Scenario | Failover Time | Full Recovery Time |
168+
|---|---|---|
169+
| Pod Kill Primary | ~2-3 seconds | ~30-33 seconds |
170+
| OOMKill Primary | ~2-3 seconds | ~30 seconds |
171+
| Network Partition | ~3 seconds | ~3 minutes |
172+
| Packet Loss (30%) | ~30 seconds | ~2 minutes |
173+
| Full Cluster Kill | ~10 seconds | ~1-2 minutes |
174+
| Combined Stress (OOMKill) | ~3 seconds | ~4 minutes |
175+
176+
## Performance Impact Under Chaos
177+
178+
### Single-Primary Mode
179+
180+
| Chaos Type | TPS During Chaos | Reduction from Baseline (~2,400) |
181+
|---|---|---|
182+
| IO Latency (100ms) | 2-3.5 | 99.9% |
183+
| Network Latency (1s) | 1.2-1.4 | 99.9% |
184+
| CPU Stress (98%) | 1,300-1,370 | ~46% |
185+
| Packet Loss (30%) | Variable | Triggers failover |
186+
187+
### Multi-Primary Mode
188+
189+
| Chaos Type | TPS During Chaos | Impact |
190+
|---|---|---|
191+
| IO Latency (100ms) | 272 | ~73% drop |
192+
| Network Latency (1s) | 1.57 | 99.9% drop |
193+
| CPU Stress (98%) | 0 (writes blocked) | Paxos consensus fails |
194+
| Packet Loss (30%) | 4.98 | 99.6% drop |
195+
| Combined Stress | ~530 then OOMKill | ~44% drop |
196+
197+
## Multi-Primary vs Single-Primary
198+
199+
| Aspect | Multi-Primary | Single-Primary |
200+
|---|---|---|
201+
| Failover needed | No (all primaries) | Yes (election ~2-3s) |
202+
| Write availability | All nodes writable | Only primary writable |
203+
| CPU stress 98% | All writes blocked (Paxos fails) | ~46% TPS reduction |
204+
| IO latency impact | ~73% TPS drop | ~99.9% TPS drop |
205+
| Packet loss 30% | 4.98 TPS (stayed ONLINE) | Triggers failover |
206+
| High concurrency | GR certification conflicts possible | No conflicts (single writer) |
207+
| Recovery mechanism | Rejoin as PRIMARY | Election + rejoin |
208+
209+
## Version Compatibility
210+
211+
| Capability | 5.7.44 | 8.0.36 | 8.4.8 | 9.6.0 |
212+
|---|---|---|---|---|
213+
| Pod Kill Recovery | Yes | Yes | Yes | Yes |
214+
| OOMKill Recovery | **No** | Yes | Yes | Yes |
215+
| Network Partition Recovery | Blocked | Yes | Yes | Yes |
216+
| CLONE Plugin | **No** | Yes | Yes | Yes |
217+
| Single-Primary (12 tests) | 1/12 | **12/12** | **12/12** | **12/12** |
218+
| Multi-Primary (12 tests) | Not tested | Not tested | **12/12** | Not tested |
219+
220+
## Key Takeaways
221+
222+
1. **KubeDB MySQL 8.0+ achieves zero data loss** across all 60 chaos experiments in both Single-Primary and Multi-Primary topologies.
223+
224+
2. **Automatic failover works reliably** — primary election completes in 2-3 seconds, full recovery in under 4 minutes for all scenarios.
225+
226+
3. **Multi-Primary mode is production-ready** — all 12 experiments passed on MySQL 8.4.8. Be aware that multi-primary has higher sensitivity to CPU stress and network issues due to Paxos consensus requirements on all writable nodes.
227+
228+
4. **Upgrade from MySQL 5.7** — it is EOL and lacks the CLONE plugin needed for automatic recovery from OOMKill and errant GTID scenarios.
229+
230+
5. **Transient GTID mismatches are normal** — brief mismatches (15-30 seconds) during recovery are expected and resolve automatically via GR distributed recovery.
231+
232+
## What's Next
233+
234+
- **Multi-Primary testing on additional MySQL versions** — extend chaos testing to MySQL 8.0.36 and 9.6.0 in Multi-Primary mode
235+
- **Long-duration soak testing** — extended chaos runs (hours/days) to validate stability under sustained failure injection
236+
237+
## Support
238+
239+
To speak with us, please leave a message on [our website](https://appscode.com/contact/).
240+
241+
To receive product announcements, follow us on [X](https://twitter.com/KubeDB).
242+
243+
To watch tutorials of various Production-Grade Kubernetes Tools Subscribe our [YouTube](https://youtube.com/@appscode) channel.
244+
245+
If you have found a bug with KubeDB or want to request for new features, please [file an issue](https://github.com/kubedb/project/issues/new).

0 commit comments

Comments
 (0)