
Commit 9c3c215 ("update 6 more test")

Signed-off-by: SK Ali Arman <arman@appscode.com>
1 parent: 38c43bf

1 file changed: 234 additions & 12 deletions

File tree

  • content/post/chaos-testing-mysql

content/post/chaos-testing-mysql/index.md

@@ -1,5 +1,5 @@
 ---
-title: "Chaos Engineering Results: KubeDB MySQL Group Replication Achieves Zero Data Loss Across 48 Experiments"
+title: "Chaos Engineering Results: KubeDB MySQL Group Replication Achieves Zero Data Loss Across 57 Experiments"
 date: "2026-04-08"
 weight: 14
 authors:
@@ -17,7 +17,7 @@ tags:
 
 ## Overview
 
-We conducted **48 chaos experiments** across **3 MySQL versions** (8.0.36, 8.4.8, 9.6.0) and **2 Group Replication topologies** (Single-Primary and Multi-Primary) on KubeDB-managed 3-node clusters. The goal: validate that KubeDB MySQL delivers **zero data loss**, **automatic failover**, and **self-healing recovery** under realistic failure conditions with production-level write loads.
+We conducted **57 chaos experiments** across **3 MySQL versions** (8.0.36, 8.4.8, 9.6.0) and **2 Group Replication topologies** (Single-Primary and Multi-Primary) on KubeDB-managed 3-node clusters. The goal: validate that KubeDB MySQL delivers **zero data loss**, **automatic failover**, and **self-healing recovery** under realistic failure conditions with production-level write loads.
 
 **The result: every experiment passed with zero data loss, zero split-brain, and zero errant GTIDs.**

@@ -385,9 +385,11 @@ for i in 0 1 2; do
 done
 ```
 
-## The 12-Experiment Matrix
+## The 21-Experiment Matrix
 
-Every MySQL version and topology was tested against the same 12-experiment matrix covering single-node failures, resource exhaustion, network degradation, and complex multi-fault scenarios:
+Every MySQL version and topology was tested against a comprehensive experiment matrix covering single-node failures, resource exhaustion, network degradation, I/O faults, multi-fault scenarios, and advanced recovery tests:
+
+### Core Experiments (1-12)
 
 | # | Experiment | Chaos Type | What It Tests |
 |---|---|---|---|
@@ -404,6 +406,20 @@ Every MySQL version and topology was tested against the same 12-experiment matri
 | 11 | Scheduled Pod Kill | Schedule | Repeated kills every 30s-1min |
 | 12 | Degraded Failover | Workflow | IO latency + pod kill in sequence |
 
+### Extended Experiments (13-21)
+
+| # | Experiment | Chaos Type | What It Tests |
+|---|---|---|---|
+| 13 | Double Primary Kill | kubectl delete x2 | Kill primary, then immediately kill newly elected primary |
+| 14 | Rolling Restart | kubectl delete x3 | Delete pods one at a time (0→1→2) under write load |
+| 15 | Coordinator Crash | kill PID 1 | Kill coordinator sidecar, MySQL process stays running |
+| 16 | Long Network Partition | NetworkChaos | 10-minute partition (5x longer than standard test) |
+| 17 | DNS Failure | DNSChaos | Block DNS resolution on primary for 3 minutes |
+| 18 | PVC Delete + Pod Kill | kubectl delete | Destroy pod + persistent storage, rebuild via CLONE |
+| 19 | IO Fault (EIO errors) | IOChaos | 50% of disk I/O operations return EIO errors |
+| 20 | Clock Skew (-5 min) | TimeChaos | Shift primary's system clock back 5 minutes |
+| 21 | Bandwidth Throttle (1mbps) | NetworkChaos | Limit primary's network bandwidth to 1mbps |
+
 ## Data Integrity Validation
 
 Every experiment verified data integrity through **4 checks** across all 3 nodes:
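The four checks themselves fall outside this hunk's context. As a rough sketch of the comparison logic — helper name and values here are illustrative, not the post's actual tooling — each check passes only when all three nodes report an identical value (row count, table checksum, `gtid_executed`, errant-GTID count):

```bash
# Illustrative per-node integrity comparison. In practice the three values
# would be collected from pods 0, 1, 2 (e.g. via kubectl exec + mysql -N -e).
check_equal() {
  label=$1; a=$2; b=$3; c=$4
  if [ "$a" = "$b" ] && [ "$b" = "$c" ]; then
    echo "$label: PASS"        # all 3 nodes agree
  else
    echo "$label: FAIL ($a / $b / $c)"
  fi
}

check_equal "row_count"      100000 100000 100000
check_equal "table_checksum" 987654321 987654321 987654321
check_equal "gtid_executed"  "uuid:1-500" "uuid:1-500" "uuid:1-500"
check_equal "errant_gtids"   0 0 0
```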
@@ -432,7 +448,7 @@ Every experiment verified data integrity through **4 checks** across all 3 nodes
 | 11 | Scheduled Replica Kill | Multiple | Zero | 0 | **PASS** |
 | 12 | Degraded Failover | Yes | Zero | 0 | **PASS** |
 
-### MySQL 8.4.8 — All 12 PASSED
+### MySQL 8.4.8 — All 21 PASSED (12 core + 9 extended)
 
 | # | Experiment | Failover | Data Loss | Errant GTIDs | Verdict |
 |---|---|---|---|---|---|
@@ -448,6 +464,15 @@ Every experiment verified data integrity through **4 checks** across all 3 nodes
 | 10 | OOMKill Natural | No (survived) | Zero | 0 | **PASS** |
 | 11 | Scheduled Replica Kill | Multiple | Zero | 0 | **PASS** |
 | 12 | Degraded Failover | Yes | Zero | 0 | **PASS** |
+| 13 | Double Primary Kill | Yes (x2) | Zero | 0 | **PASS** |
+| 14 | Rolling Restart (0→1→2) | Yes (x3) | Zero | 0 | **PASS** |
+| 15 | Coordinator Crash | No | Zero | 0 | **PASS** |
+| 16 | Long Network Partition (10 min) | Yes | Zero | 0 | **PASS** |
+| 17 | DNS Failure on Primary | No | Zero | 0 | **PASS** |
+| 18 | PVC Delete + Pod Kill | Yes | Zero | 0 | **PASS** |
+| 19 | IO Fault (EIO 50%) | Yes (crash) | Zero | 0 | **PASS** |
+| 20 | Clock Skew (-5 min) | No | Zero | 0 | **PASS** |
+| 21 | Bandwidth Throttle (1mbps) | No | Zero | 0 | **PASS** |
 
 ### MySQL 8.0.36 — All 12 PASSED

@@ -487,6 +512,190 @@ In Multi-Primary mode, **all 3 nodes accept writes** — there is no primary/rep
 
 **All 12 experiments PASSED with zero data loss.**
 
+## Extended Experiments — Details (MySQL 8.4.8 Single-Primary)
+
+### Exp 13: Double Primary Kill
+
+Kill the primary, wait for the new election, then immediately kill the new primary. This tests survival of two consecutive leader failures.
+
+```bash
+# Kill the first primary
+kubectl delete pod mysql-ha-cluster-2 -n demo --force --grace-period=0
+# Wait 15s for the new primary election, then kill the new primary
+sleep 15
+NEW_PRIMARY=$(kubectl get pods -n demo \
+  -l "app.kubernetes.io/instance=mysql-ha-cluster,kubedb.com/role=primary" \
+  -o jsonpath='{.items[0].metadata.name}')
+kubectl delete pod $NEW_PRIMARY -n demo --force --grace-period=0
+```
+
+**Result:** Pod-0 was elected as the third primary. The cluster recovered in ~90 seconds. Zero data loss.
+
+### Exp 14: Rolling Restart (0→1→2)
+
+Simulate a rolling upgrade — delete each pod sequentially with 40-second gaps under write load.
+
+```bash
+kubectl delete pod mysql-ha-cluster-0 -n demo --force --grace-period=0
+sleep 40
+kubectl delete pod mysql-ha-cluster-1 -n demo --force --grace-period=0
+sleep 40
+kubectl delete pod mysql-ha-cluster-2 -n demo --force --grace-period=0
+```
+
+**Result:** Each pod recovered and rejoined within ~30 seconds. Two failovers were triggered (whenever the current primary was deleted). Zero data loss.
+
+### Exp 15: Coordinator Crash
+
+Kill only the mysql-coordinator sidecar container, leaving MySQL running. This tests whether the cluster stays stable without the coordinator.
+
+```bash
+kubectl exec -n demo mysql-ha-cluster-1 -c mysql-coordinator -- kill 1
+```
+
+**Result:** Kubernetes auto-restarted the coordinator container. MySQL was completely unaffected — no failover, no write interruption, 728 TPS (normal). The coordinator is a management layer; MySQL GR operates independently of it.
+
+### Exp 16: Long Network Partition (10 min)
+
+Isolate the primary from the replicas for 10 minutes — 5x longer than the standard 2-minute test.
+
+```yaml
+apiVersion: chaos-mesh.org/v1alpha1
+kind: NetworkChaos
+metadata:
+  name: mysql-primary-network-partition-long
+  namespace: chaos-mesh
+spec:
+  action: partition
+  mode: one
+  selector:
+    namespaces: [demo]
+    labelSelectors:
+      "app.kubernetes.io/instance": "mysql-ha-cluster"
+      "kubedb.com/role": "primary"
+  target:
+    mode: all
+    selector:
+      namespaces: [demo]
+      labelSelectors:
+        "kubedb.com/role": "standby"
+  direction: both
+  duration: "10m"
+```
+
+**Result:** Failover triggered within seconds. After the 10-minute partition was removed, the isolated node rejoined cleanly via GR distributed recovery. All 3 nodes were ONLINE within ~2 minutes. Zero data loss.
+
+### Exp 17: DNS Failure on Primary
+
+Block all DNS resolution on the primary for 3 minutes. GR uses hostnames for inter-node communication, so this tests a critical infrastructure dependency.
+
+```yaml
+apiVersion: chaos-mesh.org/v1alpha1
+kind: DNSChaos
+metadata:
+  name: mysql-dns-error-primary
+  namespace: chaos-mesh
+spec:
+  action: error
+  mode: one
+  selector:
+    namespaces: [demo]
+    labelSelectors:
+      "app.kubernetes.io/instance": "mysql-ha-cluster"
+      "kubedb.com/role": "primary"
+  duration: "3m"
+```
+
+**Result:** The primary survived without failover. TPS dropped ~32% (497 vs the ~730 baseline) because DNS-dependent operations timed out. No errors, no data loss. Existing TCP connections between GR members stayed open.
+
+### Exp 18: PVC Delete + Pod Kill
+
+Completely destroy a node's data — delete both the pod and its PVC. The node must rebuild from scratch using the CLONE plugin.
+
+```bash
+kubectl delete pod mysql-ha-cluster-0 -n demo --force --grace-period=0
+kubectl delete pvc data-mysql-ha-cluster-0 -n demo
+```
+
+**Result:** The StatefulSet auto-created a new PVC. The CLONE plugin copied a full data snapshot from a donor node. The pod recovered and joined GR in ~90 seconds with identical GTIDs and checksums. This is the ultimate recovery test — MySQL 8.0+ handles it fully automatically.
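For background, a hedged sketch of what that donor copy amounts to when issued by hand — the donor address, user, and password below are placeholders, and KubeDB/GR runs the equivalent automatically on the empty joiner:

```sql
-- Illustrative only: not the post's tooling; hostnames and credentials are placeholders.
SET GLOBAL clone_valid_donor_list = 'mysql-ha-cluster-1.demo:3306';
CLONE INSTANCE FROM 'clone_user'@'mysql-ha-cluster-1.demo':3306
  IDENTIFIED BY 'clone_password';
```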
+
+### Exp 19: IO Fault (EIO Errors)
+
+Inject I/O read/write errors (errno 5 = EIO) on 50% of disk operations on the primary's data volume. Unlike IO latency, which merely slows operations down, IO faults make them fail outright — simulating a failing disk or storage system.
+
+```yaml
+apiVersion: chaos-mesh.org/v1alpha1
+kind: IOChaos
+metadata:
+  name: mysql-primary-io-fault
+  namespace: chaos-mesh
+spec:
+  action: fault
+  mode: one
+  selector:
+    namespaces: [demo]
+    labelSelectors:
+      "app.kubernetes.io/instance": "mysql-ha-cluster"
+      "kubedb.com/role": "primary"
+  volumePath: "/var/lib/mysql"
+  path: "/**"
+  errno: 5
+  percent: 50
+  duration: "3m"
+```
+
+**Result:** Initially 703 TPS (InnoDB retries absorb some errors), but the 50% EIO rate eventually crashed the MySQL process on the primary. Failover triggered to a secondary. After the chaos was removed and the pod force-restarted, InnoDB crash recovery repaired the data directory and the node rejoined GR cleanly. Zero data loss.
+
+### Exp 20: Clock Skew (-5 min)
+
+Shift the primary's system clock back by 5 minutes. GR uses timestamps for conflict detection, Paxos message ordering, and timeout calculations. This tests whether clock drift breaks consensus or causes split-brain.
+
+```yaml
+apiVersion: chaos-mesh.org/v1alpha1
+kind: TimeChaos
+metadata:
+  name: mysql-primary-clock-skew
+  namespace: chaos-mesh
+spec:
+  mode: one
+  selector:
+    namespaces: [demo]
+    labelSelectors:
+      "app.kubernetes.io/instance": "mysql-ha-cluster"
+      "kubedb.com/role": "primary"
+  timeOffset: "-5m"
+  duration: "3m"
+```
+
+**Result:** 404 TPS (~45% reduction from baseline). No failover triggered, no errors. GR's Paxos protocol is resilient to clock drift — it uses logical clocks for consensus, not wall-clock time. All 3 nodes stayed ONLINE throughout. Zero data loss.
+
+### Exp 21: Bandwidth Throttle (1mbps)
+
+Limit the primary's outbound network bandwidth to 1mbps. This simulates a degraded network in cross-AZ or cross-region deployments where bandwidth is limited but not broken.
+
+```yaml
+apiVersion: chaos-mesh.org/v1alpha1
+kind: NetworkChaos
+metadata:
+  name: mysql-bandwidth-throttle
+  namespace: chaos-mesh
+spec:
+  action: bandwidth
+  mode: one
+  selector:
+    namespaces: [demo]
+    labelSelectors:
+      "app.kubernetes.io/instance": "mysql-ha-cluster"
+      "kubedb.com/role": "primary"
+  bandwidth:
+    rate: "1mbps"
+    limit: 20971520
+    buffer: 10000
+  duration: "3m"
+```
+
+**Result:** 147 TPS (~80% reduction from baseline). The bandwidth limit heavily throttles GR's Paxos consensus traffic, but the cluster stays completely stable — no failover, no errors, no member state changes. All 3 nodes remained ONLINE. Zero data loss.
+
 ## Failover Performance (Single-Primary)
 
 | Scenario | Failover Time | Full Recovery Time |
@@ -502,12 +711,16 @@ In Multi-Primary mode, **all 3 nodes accept writes** — there is no primary/rep
 
 ### Single-Primary Mode
 
-| Chaos Type | TPS During Chaos | Reduction from Baseline (~2,400) |
+| Chaos Type | TPS During Chaos | Reduction from Baseline (~730) |
 |---|---|---|
-| IO Latency (100ms) | 2-3.5 | 99.9% |
-| Network Latency (1s) | 1.2-1.4 | 99.9% |
+| IO Latency (100ms) | 2-3.5 | 99.5% |
+| Network Latency (1s) | 1.2-1.4 | 99.8% |
 | CPU Stress (98%) | 1,300-1,370 | ~46% |
 | Packet Loss (30%) | Variable | Triggers failover |
+| IO Fault (EIO 50%) | 703 then crash | Failover triggered |
+| Clock Skew (-5 min) | 404 | ~45% |
+| Bandwidth Throttle (1mbps) | 147 | ~80% |
+| DNS Failure | 497 | ~32% |
 
 ### Multi-Primary Mode
 
@@ -539,18 +752,27 @@ In Multi-Primary mode, **all 3 nodes accept writes** — there is no primary/rep
 | OOMKill Recovery | Yes | Yes | Yes |
 | Network Partition Recovery | Yes | Yes | Yes |
 | CLONE Plugin | Yes | Yes | Yes |
-| Single-Primary (12 tests) | **12/12** | **12/12** | **12/12** |
+| Single-Primary (core 12) | **12/12** | **12/12** | **12/12** |
+| Single-Primary (extended 13-21) | Not tested | **9/9** | Not tested |
 | Multi-Primary (12 tests) | Not tested | **12/12** | Not tested |
 
 ## Key Takeaways
 
-1. **KubeDB MySQL achieves zero data loss** across all 48 chaos experiments in both Single-Primary and Multi-Primary topologies.
+1. **KubeDB MySQL achieves zero data loss** across all 57 chaos experiments in both Single-Primary and Multi-Primary topologies.
 
-2. **Automatic failover works reliably** — primary election completes in 2-3 seconds, full recovery in under 4 minutes for all scenarios.
+2. **Automatic failover works reliably** — primary election completes in 2-3 seconds, full recovery in under 4 minutes for all scenarios, including double primary kill and disk failure.
 
 3. **Multi-Primary mode is production-ready** — all 12 experiments passed on MySQL 8.4.8. Be aware that multi-primary has higher sensitivity to CPU stress and network issues due to Paxos consensus requirements on all writable nodes.
 
-4. **Transient GTID mismatches are normal** — brief mismatches (15-30 seconds) during recovery are expected and resolve automatically via GR distributed recovery.
+4. **Full data rebuild works automatically** — even after complete PVC deletion, the CLONE plugin rebuilds a node from scratch in ~90 seconds with zero manual intervention.
+
+5. **Coordinator crash has zero impact** — MySQL GR operates independently of the coordinator sidecar. Killing the coordinator does not trigger failover or interrupt writes.
+
+6. **Disk failures trigger safe failover** — 50% I/O error rate eventually crashes MySQL, but InnoDB crash recovery + GR distributed recovery handles it with zero data loss after pod restart.
+
+7. **Clock skew and bandwidth limits are tolerated** — GR's Paxos protocol is resilient to 5-minute clock drift (~45% TPS drop, no errors) and 1mbps bandwidth limits (~80% TPS drop, no errors).
+
+8. **Transient GTID mismatches are normal** — brief mismatches (15-30 seconds) during recovery are expected and resolve automatically via GR distributed recovery.
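A hedged way to check for errant GTIDs by hand uses MySQL's standard GTID set functions — the primary's set below is a placeholder that would be fetched from the primary in practice:

```sql
-- Run on each replica; a non-empty result means errant transactions
-- (illustrative, not the post's actual verification tooling).
SELECT GTID_SUBTRACT(@@GLOBAL.gtid_executed,
                     'primary_uuid:1-500') AS errant_gtids;
```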
 
 ## What's Next
