---
title: "Chaos Engineering Results: KubeDB MySQL Group Replication Achieves Zero Data Loss Across 54 Experiments"
date: "2026-04-08"
weight: 14
authors:
## Overview
We conducted **54 chaos experiments** across **3 MySQL versions** (8.0.36, 8.4.8, 9.6.0) and **2 Group Replication topologies** (Single-Primary and Multi-Primary) on KubeDB-managed 3-node clusters. The goal: validate that KubeDB MySQL delivers **zero data loss**, **automatic failover**, and **self-healing recovery** under realistic failure conditions with production-level write loads.
**The result: every experiment passed with zero data loss, zero split-brain, and zero errant GTIDs.**
## The 18-Experiment Matrix
Every MySQL version and topology was tested against a comprehensive experiment matrix covering single-node failures, resource exhaustion, network degradation, multi-fault scenarios, and advanced recovery tests:
### Core Experiments (1-12)
| # | Experiment | Chaos Type | What It Tests |
|---|---|---|---|
| 11 | Scheduled Pod Kill | Schedule | Repeated kills every 30s-1min |
| 12 | Degraded Failover | Workflow | IO latency + pod kill in sequence |
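
For reference, the pod-kill style experiments at the top of this matrix can be expressed as a Chaos Mesh `PodChaos` resource along these lines (a minimal sketch; the metadata name is an assumption, while the selectors mirror those used in the manifests later in this post):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: mysql-primary-pod-kill
  namespace: chaos-mesh
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces: [demo]
    labelSelectors:
      "app.kubernetes.io/instance": "mysql-ha-cluster"
      "kubedb.com/role": "primary"
```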
### Extended Experiments (13-18)
| # | Experiment | Chaos Type | What It Tests |
|---|---|---|---|
| 13 | Double Primary Kill | kubectl delete x2 | Kill primary, then immediately kill newly elected primary |
| 14 | Rolling Restart | kubectl delete x3 | Delete pods one at a time (0→1→2) under write load |
| 15 | Coordinator Crash | kill PID 1 | Kill coordinator sidecar, MySQL process stays running |
| 16 | Long Network Partition | NetworkChaos | 10-minute partition (5x longer than standard test) |
| 17 | DNS Failure | DNSChaos | Block DNS resolution on primary for 3 minutes |
| 18 | PVC Delete + Pod Kill | kubectl delete | Destroy pod + persistent storage, rebuild via CLONE |
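
A rolling restart of the kind in Exp 14 can be driven with a short loop like this (a sketch; the 5-minute readiness timeout is an assumption):

```shell
# Exp 14 sketch: delete pods one at a time (0 -> 1 -> 2) under write load,
# waiting for each replacement to become Ready before moving on
for i in 0 1 2; do
  kubectl delete pod mysql-ha-cluster-$i -n demo
  kubectl wait --for=condition=Ready pod/mysql-ha-cluster-$i -n demo --timeout=5m
done
```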
## Data Integrity Validation
Every experiment verified data integrity through **4 checks** across all 3 nodes:
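
Checks of this kind can be gathered from each node with `kubectl exec`. The sketch below is illustrative, not the exact harness: the sysbench-style `sbtest.sbtest1` table name and the `$MYSQL_ROOT_PASSWORD` credential are assumptions.

```shell
# Collect comparable integrity signals from each of the 3 nodes
for i in 0 1 2; do
  echo "--- mysql-ha-cluster-$i ---"
  kubectl exec -n demo mysql-ha-cluster-$i -c mysql -- \
    mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -N -e "
      SELECT COUNT(*) FROM sbtest.sbtest1;   -- row count
      CHECKSUM TABLE sbtest.sbtest1;         -- table checksum
      SELECT @@global.gtid_executed;         -- executed GTID set
    "
done
```

A pass requires every node to report identical values for each signal.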

### Exp 15: Coordinator Crash

**Result:** Kubernetes auto-restarted the coordinator container. MySQL was completely unaffected — no failover, no write interruption, 728 TPS (normal). The coordinator is a management layer; MySQL GR operates independently.
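
The fault itself amounts to killing PID 1 inside the sidecar container, roughly as follows (the `mysql-coordinator` container name, and the availability of a `kill` binary in that image, are assumptions):

```shell
# Kill only the coordinator sidecar; the mysql container keeps serving writes
kubectl exec -n demo mysql-ha-cluster-0 -c mysql-coordinator -- kill 1
```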
### Exp 16: Long Network Partition (10 min)
Isolate the primary from replicas for 10 minutes — 5x longer than the standard 2-minute test.
```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: mysql-primary-network-partition-long
  namespace: chaos-mesh
spec:
  action: partition
  mode: one
  selector:
    namespaces: [demo]
    labelSelectors:
      "app.kubernetes.io/instance": "mysql-ha-cluster"
      "kubedb.com/role": "primary"
  target:
    mode: all
    selector:
      namespaces: [demo]
      labelSelectors:
        "kubedb.com/role": "standby"
  direction: both
  duration: "10m"
```
**Result:** Failover triggered within seconds. Once the 10-minute partition was lifted, the isolated node rejoined cleanly via GR distributed recovery. All 3 nodes were ONLINE within ~2 minutes. Zero data loss.
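
Rejoin progress can be watched from any ONLINE member using the standard Group Replication membership view (pod and credential names here are assumptions):

```shell
# MEMBER_STATE should converge to ONLINE for all 3 members after recovery
kubectl exec -n demo mysql-ha-cluster-1 -c mysql -- \
  mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -e \
  "SELECT MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE
     FROM performance_schema.replication_group_members;"
```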
### Exp 17: DNS Failure on Primary
Block all DNS resolution on the primary for 3 minutes. GR uses hostnames for inter-node communication, so this tests a critical infrastructure dependency.
```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: DNSChaos
metadata:
  name: mysql-dns-error-primary
  namespace: chaos-mesh
spec:
  action: error
  mode: one
  selector:
    namespaces: [demo]
    labelSelectors:
      "app.kubernetes.io/instance": "mysql-ha-cluster"
      "kubedb.com/role": "primary"
  duration: "3m"
```
**Result:** Primary survived without failover. TPS dropped ~32% (497 vs ~730 baseline) due to DNS-dependent operations timing out. No errors, no data loss. Existing TCP connections between GR members stayed open.
### Exp 18: PVC Delete + Pod Kill
Completely destroy a node's data — delete both the pod and its PVC. The node must rebuild from scratch using the CLONE plugin.
```bash
kubectl delete pod mysql-ha-cluster-0 -n demo --force --grace-period=0
# PVC name below assumes the default StatefulSet volume-claim naming; adjust to your cluster
kubectl delete pvc data-mysql-ha-cluster-0 -n demo
```

**Result:** StatefulSet auto-created a new PVC. The CLONE plugin copied a full data snapshot from a donor node. Pod recovered and joined GR in ~90 seconds with identical GTIDs and checksums. This is the ultimate recovery test — MySQL 8.0+ handles it fully automatically.
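
Clone progress on the rebuilt node can be confirmed through `performance_schema` (pod and credential names here are assumptions):

```shell
# STATE should read "Completed" once the donor snapshot has been applied
kubectl exec -n demo mysql-ha-cluster-0 -c mysql -- \
  mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -e \
  "SELECT STATE, BEGIN_TIME, END_TIME FROM performance_schema.clone_status;"
```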
## Failover Performance (Single-Primary)
| Scenario | Failover Time | Full Recovery Time |
| Single-Primary (extended 13-18) | Not tested | **6/6** | Not tested |
| Multi-Primary (12 tests) | Not tested | **12/12** | Not tested |
## Key Takeaways
1. **KubeDB MySQL achieves zero data loss** across all 54 chaos experiments in both Single-Primary and Multi-Primary topologies.
2. **Automatic failover works reliably** — primary election completes in 2-3 seconds, full recovery in under 4 minutes for all scenarios, including double primary kill.
3. **Multi-Primary mode is production-ready** — all 12 experiments passed on MySQL 8.4.8. Be aware that multi-primary has higher sensitivity to CPU stress and network issues due to Paxos consensus requirements on all writable nodes.
4. **Full data rebuild works automatically** — even after complete PVC deletion, the CLONE plugin rebuilds a node from scratch in ~90 seconds with zero manual intervention.
5. **Coordinator crash has zero impact** — MySQL GR operates independently of the coordinator sidecar. Killing the coordinator does not trigger failover or interrupt writes.
6. **Transient GTID mismatches are normal** — brief mismatches (15-30 seconds) during recovery are expected and resolve automatically via GR distributed recovery.
0 commit comments