title: "Chaos Engineering Results: KubeDB MySQL Group Replication Achieves Zero Data Loss Across 57 Experiments"
date: "2026-04-08"
weight: 14
authors:
## Overview
We conducted **57 chaos experiments** across **3 MySQL versions** (8.0.36, 8.4.8, 9.6.0) and **2 Group Replication topologies** (Single-Primary and Multi-Primary) on KubeDB-managed 3-node clusters. The goal: validate that KubeDB MySQL delivers **zero data loss**, **automatic failover**, and **self-healing recovery** under realistic failure conditions with production-level write loads.
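
The total of 57 is not 3 × 2 × 12. Judging by the pass-rate summary later in the post, the core matrix ran on all three versions in Single-Primary, while the extended and Multi-Primary suites ran on one version; a quick arithmetic check of that inferred breakdown:

```python
# Experiment count breakdown (inferred from the pass-rate table in this post).
core_single_primary = 12 * 3  # 12 core experiments x 3 MySQL versions
extended_single_primary = 9   # experiments 13-21, Single-Primary, one version
multi_primary = 12            # 12 core experiments, Multi-Primary on 8.4.8

total = core_single_primary + extended_single_primary + multi_primary
print(total)  # 57
```
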
**The result: every experiment passed with zero data loss, zero split-brain, and zero errant GTIDs.**
done
```
## The 21-Experiment Matrix

Every MySQL version and topology was tested against a comprehensive experiment matrix covering single-node failures, resource exhaustion, network degradation, I/O faults, multi-fault scenarios, and advanced recovery tests:
### Core Experiments (1-12)

| # | Experiment | Chaos Type | What It Tests |
|---|---|---|---|
| 11 | Scheduled Pod Kill | Schedule | Repeated kills every 30s-1min |
| 12 | Degraded Failover | Workflow | IO latency + pod kill in sequence |

### Extended Experiments (13-21)

| # | Experiment | Chaos Type | What It Tests |
|---|---|---|---|
| 13 | Double Primary Kill | kubectl delete x2 | Kill primary, then immediately kill newly elected primary |
| 14 | Rolling Restart | kubectl delete x3 | Delete pods one at a time (0→1→2) under write load |
| 15 | Coordinator Crash | kill PID 1 | Kill coordinator sidecar, MySQL process stays running |
| 16 | Long Network Partition | NetworkChaos | 10-minute partition (5x longer than standard test) |
| 17 | DNS Failure | DNSChaos | Block DNS resolution on primary for 3 minutes |
| 18 | PVC Delete + Pod Kill | kubectl delete | Destroy pod + persistent storage, rebuild via CLONE |
| 19 | IO Fault (EIO errors) | IOChaos | 50% of disk I/O operations return EIO errors |
| 20 | Clock Skew (-5 min) | TimeChaos | Shift primary's system clock back 5 minutes |
| 21 | Bandwidth Throttle (1mbps) | NetworkChaos | Limit primary's outbound bandwidth to 1mbps |
### Exp 15: Coordinator Crash

**Result:** Kubernetes auto-restarted the coordinator container. MySQL was completely unaffected — no failover, no write interruption, 728 TPS (normal). The coordinator is a management layer; MySQL GR operates independently.
### Exp 16: Long Network Partition (10 min)
Isolate the primary from replicas for 10 minutes — 5x longer than the standard 2-minute test.
```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: mysql-primary-network-partition-long
  namespace: chaos-mesh
spec:
  action: partition
  mode: one
  selector:
    namespaces: [demo]
    labelSelectors:
      "app.kubernetes.io/instance": "mysql-ha-cluster"
      "kubedb.com/role": "primary"
  target:
    mode: all
    selector:
      namespaces: [demo]
      labelSelectors:
        "kubedb.com/role": "standby"
  direction: both
  duration: "10m"
```
**Result:** Failover triggered within seconds. After 10-minute partition removed, the isolated node rejoined cleanly via GR distributed recovery. All 3 nodes ONLINE within ~2 minutes. Zero data loss.
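
Results like "all 3 nodes ONLINE within ~2 minutes" boil down to polling the cluster until a condition holds. Below is a minimal poll-with-timeout sketch; the `all_members_online` probe is a stub, standing in for a real query against `performance_schema.replication_group_members` (e.g. via `kubectl exec`):

```python
import time

def wait_until(probe, timeout_s=300, interval_s=5):
    """Poll `probe` until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False

# Stub: stands in for querying performance_schema.replication_group_members.
states = iter([["ONLINE", "RECOVERING", "ONLINE"],
               ["ONLINE", "ONLINE", "ONLINE"]])

def all_members_online():
    return all(s == "ONLINE" for s in next(states))

print(wait_until(all_members_online, timeout_s=10, interval_s=0))  # True
```
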
### Exp 17: DNS Failure on Primary
Block all DNS resolution on the primary for 3 minutes. GR uses hostnames for inter-node communication, so this tests a critical infrastructure dependency.
```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: DNSChaos
metadata:
  name: mysql-dns-error-primary
  namespace: chaos-mesh
spec:
  action: error
  mode: one
  selector:
    namespaces: [demo]
    labelSelectors:
      "app.kubernetes.io/instance": "mysql-ha-cluster"
      "kubedb.com/role": "primary"
  duration: "3m"
```
**Result:** Primary survived without failover. TPS dropped ~32% (497 vs ~730 baseline) due to DNS-dependent operations timing out. No errors, no data loss. Existing TCP connections between GR members stayed open.
### Exp 18: PVC Delete + Pod Kill
Completely destroy a node's data — delete both the pod and its PVC. The node must rebuild from scratch using the CLONE plugin.
```bash
kubectl delete pod mysql-ha-cluster-0 -n demo --force --grace-period=0
```
**Result:** StatefulSet auto-created a new PVC. The CLONE plugin copied a full data snapshot from a donor node. Pod recovered and joined GR in ~90 seconds with identical GTIDs and checksums. This is the ultimate recovery test — MySQL 8.0+ handles it fully automatically.
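
"Identical GTIDs" means the rebuilt node's `gtid_executed` set equals the donor's. The toy comparison below illustrates the interval-set equality idea; real GTID sets can span multiple server UUIDs, and this is not how KubeDB performs the check:

```python
def parse_gtid_set(s):
    """Parse 'uuid:1-5:7,uuid2:1-3' into {uuid: frozenset(txn numbers)}."""
    out = {}
    for part in s.split(","):
        uuid, *ranges = part.strip().split(":")
        txns = set()
        for r in ranges:
            lo, _, hi = r.partition("-")
            txns.update(range(int(lo), int(hi or lo) + 1))
        out[uuid] = frozenset(txns)
    return out

# Hypothetical UUIDs, purely for illustration.
donor   = "3e11fa47-71ca-11e1-9e33-c80aa9429562:1-100"
rebuilt = "3e11fa47-71ca-11e1-9e33-c80aa9429562:1-90:91-100"
print(parse_gtid_set(donor) == parse_gtid_set(rebuilt))  # True
```
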
### Exp 19: IO Fault (EIO Errors)
Inject I/O read/write errors (errno 5 = EIO) on 50% of disk operations on the primary's data volume. Unlike IO latency which slows things down, IO faults cause actual operation failures — simulating a failing disk or storage system.
```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: mysql-primary-io-fault
  namespace: chaos-mesh
spec:
  action: fault
  mode: one
  selector:
    namespaces: [demo]
    labelSelectors:
      "app.kubernetes.io/instance": "mysql-ha-cluster"
      "kubedb.com/role": "primary"
  volumePath: "/var/lib/mysql"
  path: "/**"
  errno: 5
  percent: 50
  duration: "3m"
```
**Result:** Initially 703 TPS (InnoDB retries handle some errors), but the 50% EIO rate eventually crashed the MySQL process on the primary. Failover triggered to a secondary. After chaos removed and pod force-restarted, InnoDB crash recovery repaired the data directory and the node rejoined GR cleanly. Zero data loss.
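
The `errno: 5` in the spec above is the POSIX code for EIO. A quick way to confirm that, and to look up other errno values worth injecting, is Python's `errno` module:

```python
import errno
import os

# Confirm the errno used in the IOChaos spec, plus two other
# values commonly injected to simulate storage trouble.
for code in (errno.EIO, errno.ENOSPC, errno.EACCES):
    print(code, errno.errorcode[code], os.strerror(code))

assert errno.EIO == 5  # matches `errno: 5` in the spec
```
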
### Exp 20: Clock Skew (-5 min)
Shift the primary's system clock back by 5 minutes. GR uses timestamps for conflict detection, Paxos message ordering, and timeout calculations. Tests whether clock drift breaks consensus or causes split-brain.
```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
metadata:
  name: mysql-primary-clock-skew
  namespace: chaos-mesh
spec:
  mode: one
  selector:
    namespaces: [demo]
    labelSelectors:
      "app.kubernetes.io/instance": "mysql-ha-cluster"
      "kubedb.com/role": "primary"
  timeOffset: "-5m"
  duration: "3m"
```
**Result:** 404 TPS (~45% reduction from baseline). No failover triggered, no errors. GR's Paxos protocol is resilient to clock drift — it uses logical clocks for consensus, not wall-clock time. All 3 nodes stayed ONLINE throughout. Zero data loss.
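
The resilience to skew follows from the same principle as Lamport clocks: ordering comes from message counters, not wall-clock readings. The sketch below is a teaching illustration of logical clocks, not GR's actual XCom implementation; the skewed node still agrees on event order:

```python
class LamportNode:
    """Toy logical clock: ordering from counters, not wall-clock time."""
    def __init__(self, wall_clock_offset_s):
        self.offset = wall_clock_offset_s  # simulated skew (unused for ordering)
        self.clock = 0

    def send(self):
        self.clock += 1
        return self.clock  # logical timestamp carried on the message

    def receive(self, msg_ts):
        self.clock = max(self.clock, msg_ts) + 1
        return self.clock

primary = LamportNode(wall_clock_offset_s=-300)  # clock 5 minutes behind
replica = LamportNode(wall_clock_offset_s=0)

t_commit = primary.send()            # primary certifies a transaction
t_apply = replica.receive(t_commit)  # replica applies it

print(t_commit < t_apply)  # True: order preserved despite the skew
```
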
### Exp 21: Bandwidth Throttle (1mbps)
Limit the primary's outbound network bandwidth to 1mbps. Simulates degraded network in cross-AZ or cross-region deployments where bandwidth is limited but not broken.
```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: mysql-bandwidth-throttle
  namespace: chaos-mesh
spec:
  action: bandwidth
  mode: one
  selector:
    namespaces: [demo]
    labelSelectors:
      "app.kubernetes.io/instance": "mysql-ha-cluster"
      "kubedb.com/role": "primary"
  bandwidth:
    rate: "1mbps"
    limit: 20971520
    buffer: 10000
  duration: "3m"
```
**Result:** 147 TPS (~80% reduction from baseline). The bandwidth limit heavily throttles GR's Paxos consensus traffic, but the cluster stays completely stable — no failover, no errors, no member state changes. All 3 nodes remained ONLINE. Zero data loss.
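
As a cross-check, the reduction percentages quoted for experiments 17, 20, and 21 are all consistent with the ~730 TPS baseline used throughout:

```python
# Reported TPS under chaos vs the ~730 TPS steady-state baseline.
baseline = 730

observed = {"DNS failure (Exp 17)": 497,
            "Clock skew (Exp 20)": 404,
            "Bandwidth 1mbps (Exp 21)": 147}

for name, tps in observed.items():
    reduction = (1 - tps / baseline) * 100
    print(f"{name}: {reduction:.0f}% reduction")
```
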
## Failover Performance (Single-Primary)
| Scenario | Failover Time | Full Recovery Time |
### Single-Primary Mode
| Chaos Type | TPS During Chaos | Reduction from Baseline (~730) |
|---|---|---|
| IO Latency (100ms) | 2-3.5 | 99.5% |
| Network Latency (1s) | 1.2-1.4 | 99.8% |
| CPU Stress (98%) | 1,300-1,370 | ~46% |
| Packet Loss (30%) | Variable | Triggers failover |
| Single-Primary (extended 13-21) | Not tested | **9/9** | Not tested |
| Multi-Primary (12 tests) | Not tested | **12/12** | Not tested |
## Key Takeaways
1. **KubeDB MySQL achieves zero data loss** across all 57 chaos experiments in both Single-Primary and Multi-Primary topologies.
2. **Automatic failover works reliably** — primary election completes in 2-3 seconds, full recovery in under 4 minutes for all scenarios, including double primary kill and disk failure.
3. **Multi-Primary mode is production-ready** — all 12 experiments passed on MySQL 8.4.8. Be aware that multi-primary has higher sensitivity to CPU stress and network issues due to Paxos consensus requirements on all writable nodes.
4. **Full data rebuild works automatically** — even after complete PVC deletion, the CLONE plugin rebuilds a node from scratch in ~90 seconds with zero manual intervention.
5. **Coordinator crash has zero impact** — MySQL GR operates independently of the coordinator sidecar. Killing the coordinator does not trigger failover or interrupt writes.
6. **Disk failures trigger safe failover** — 50% I/O error rate eventually crashes MySQL, but InnoDB crash recovery + GR distributed recovery handles it with zero data loss after pod restart.
7. **Clock skew and bandwidth limits are tolerated** — GR's Paxos protocol is resilient to 5-minute clock drift (~45% TPS drop, no errors) and 1mbps bandwidth limits (~80% TPS drop, no errors).
8. **Transient GTID mismatches are normal** — brief mismatches (15-30 seconds) during recovery are expected and resolve automatically via GR distributed recovery.