
Commit 9c3c215 ("update 6 more test")

Signed-off-by: SK Ali Arman <arman@appscode.com>
1 parent: 38c43bf

1 file changed: 234 additions & 12 deletions

File tree

  • content/post/chaos-testing-mysql

content/post/chaos-testing-mysql/index.md

@@ -1,5 +1,5 @@
 ---
-title: "Chaos Engineering Results: KubeDB MySQL Group Replication Achieves Zero Data Loss Across 48 Experiments"
+title: "Chaos Engineering Results: KubeDB MySQL Group Replication Achieves Zero Data Loss Across 57 Experiments"
 date: "2026-04-08"
 weight: 14
 authors:
@@ -17,7 +17,7 @@ tags:
 
 ## Overview
 
-We conducted **48 chaos experiments** across **3 MySQL versions** (8.0.36, 8.4.8, 9.6.0) and **2 Group Replication topologies** (Single-Primary and Multi-Primary) on KubeDB-managed 3-node clusters. The goal: validate that KubeDB MySQL delivers **zero data loss**, **automatic failover**, and **self-healing recovery** under realistic failure conditions with production-level write loads.
+We conducted **57 chaos experiments** across **3 MySQL versions** (8.0.36, 8.4.8, 9.6.0) and **2 Group Replication topologies** (Single-Primary and Multi-Primary) on KubeDB-managed 3-node clusters. The goal: validate that KubeDB MySQL delivers **zero data loss**, **automatic failover**, and **self-healing recovery** under realistic failure conditions with production-level write loads.
 
 **The result: every experiment passed with zero data loss, zero split-brain, and zero errant GTIDs.**

@@ -385,9 +385,11 @@ for i in 0 1 2; do
 done
 ```
 
-## The 12-Experiment Matrix
+## The 21-Experiment Matrix
 
-Every MySQL version and topology was tested against the same 12-experiment matrix covering single-node failures, resource exhaustion, network degradation, and complex multi-fault scenarios:
+Every MySQL version and topology was tested against a comprehensive experiment matrix covering single-node failures, resource exhaustion, network degradation, I/O faults, multi-fault scenarios, and advanced recovery tests:
+
+### Core Experiments (1-12)
 
 | # | Experiment | Chaos Type | What It Tests |
 |---|---|---|---|
@@ -404,6 +406,20 @@ Every MySQL version and topology was tested against the same 12-experiment matri
 | 11 | Scheduled Pod Kill | Schedule | Repeated kills every 30s-1min |
 | 12 | Degraded Failover | Workflow | IO latency + pod kill in sequence |
 
+### Extended Experiments (13-21)
+
+| # | Experiment | Chaos Type | What It Tests |
+|---|---|---|---|
+| 13 | Double Primary Kill | kubectl delete x2 | Kill primary, then immediately kill newly elected primary |
+| 14 | Rolling Restart | kubectl delete x3 | Delete pods one at a time (0→1→2) under write load |
+| 15 | Coordinator Crash | kill PID 1 | Kill coordinator sidecar, MySQL process stays running |
+| 16 | Long Network Partition | NetworkChaos | 10-minute partition (5x longer than standard test) |
+| 17 | DNS Failure | DNSChaos | Block DNS resolution on primary for 3 minutes |
+| 18 | PVC Delete + Pod Kill | kubectl delete | Destroy pod + persistent storage, rebuild via CLONE |
+| 19 | IO Fault (EIO errors) | IOChaos | 50% of disk I/O operations return EIO errors |
+| 20 | Clock Skew (-5 min) | TimeChaos | Shift primary's system clock back 5 minutes |
+| 21 | Bandwidth Throttle (1mbps) | NetworkChaos | Limit primary's network bandwidth to 1mbps |
+
 ## Data Integrity Validation
 
 Every experiment verified data integrity through **4 checks** across all 3 nodes:
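The four checks themselves fall outside this hunk's context. As a rough sketch of the comparison logic — helper name and values here are illustrative, not the post's actual tooling — each check passes only when all three nodes report an identical value (row count, table checksum, `gtid_executed`, errant-GTID count):

```bash
# Illustrative per-node integrity comparison. In practice the three values
# would be collected from pods 0, 1, 2 (e.g. via kubectl exec + mysql -N -e).
check_equal() {
  label=$1; a=$2; b=$3; c=$4
  if [ "$a" = "$b" ] && [ "$b" = "$c" ]; then
    echo "$label: PASS"        # all 3 nodes agree
  else
    echo "$label: FAIL ($a / $b / $c)"
  fi
}

check_equal "row_count"      100000 100000 100000
check_equal "table_checksum" 987654321 987654321 987654321
check_equal "gtid_executed"  "uuid:1-500" "uuid:1-500" "uuid:1-500"
check_equal "errant_gtids"   0 0 0
```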
@@ -432,7 +448,7 @@ Every experiment verified data integrity through **4 checks** across all 3 nodes
 | 11 | Scheduled Replica Kill | Multiple | Zero | 0 | **PASS** |
 | 12 | Degraded Failover | Yes | Zero | 0 | **PASS** |
 
-### MySQL 8.4.8 — All 12 PASSED
+### MySQL 8.4.8 — All 21 PASSED (12 core + 9 extended)
 
 | # | Experiment | Failover | Data Loss | Errant GTIDs | Verdict |
 |---|---|---|---|---|---|
@@ -448,6 +464,15 @@ Every experiment verified data integrity through **4 checks** across all 3 nodes
 | 10 | OOMKill Natural | No (survived) | Zero | 0 | **PASS** |
 | 11 | Scheduled Replica Kill | Multiple | Zero | 0 | **PASS** |
 | 12 | Degraded Failover | Yes | Zero | 0 | **PASS** |
+| 13 | Double Primary Kill | Yes (x2) | Zero | 0 | **PASS** |
+| 14 | Rolling Restart (0→1→2) | Yes (x3) | Zero | 0 | **PASS** |
+| 15 | Coordinator Crash | No | Zero | 0 | **PASS** |
+| 16 | Long Network Partition (10 min) | Yes | Zero | 0 | **PASS** |
+| 17 | DNS Failure on Primary | No | Zero | 0 | **PASS** |
+| 18 | PVC Delete + Pod Kill | Yes | Zero | 0 | **PASS** |
+| 19 | IO Fault (EIO 50%) | Yes (crash) | Zero | 0 | **PASS** |
+| 20 | Clock Skew (-5 min) | No | Zero | 0 | **PASS** |
+| 21 | Bandwidth Throttle (1mbps) | No | Zero | 0 | **PASS** |
 
 ### MySQL 8.0.36 — All 12 PASSED

@@ -487,6 +512,190 @@ In Multi-Primary mode, **all 3 nodes accept writes** — there is no primary/rep
 
 **All 12 experiments PASSED with zero data loss.**
 
+## Extended Experiments — Details (MySQL 8.4.8 Single-Primary)
+
+### Exp 13: Double Primary Kill
+
+Kill the primary, wait for the new election, then immediately kill the new primary. This tests survival of two consecutive leader failures.
+
+```bash
+# Kill the first primary
+kubectl delete pod mysql-ha-cluster-2 -n demo --force --grace-period=0
+# Wait 15s for the new primary election, then kill the new primary
+sleep 15
+NEW_PRIMARY=$(kubectl get pods -n demo \
+  -l "app.kubernetes.io/instance=mysql-ha-cluster,kubedb.com/role=primary" \
+  -o jsonpath='{.items[0].metadata.name}')
+kubectl delete pod $NEW_PRIMARY -n demo --force --grace-period=0
+```
+
+**Result:** Pod-0 was elected as the third primary. The cluster recovered in ~90 seconds. Zero data loss.
+
+### Exp 14: Rolling Restart (0→1→2)
+
+Simulate a rolling upgrade — delete each pod sequentially with 40-second gaps under write load.
+
+```bash
+kubectl delete pod mysql-ha-cluster-0 -n demo --force --grace-period=0
+sleep 40
+kubectl delete pod mysql-ha-cluster-1 -n demo --force --grace-period=0
+sleep 40
+kubectl delete pod mysql-ha-cluster-2 -n demo --force --grace-period=0
+```
+
+**Result:** Each pod recovered and rejoined within ~30 seconds. Two failovers were triggered (whenever the current primary was deleted). Zero data loss.
+
+### Exp 15: Coordinator Crash
+
+Kill only the mysql-coordinator sidecar container, leaving MySQL running. This tests whether the cluster stays stable without the coordinator.
+
+```bash
+kubectl exec -n demo mysql-ha-cluster-1 -c mysql-coordinator -- kill 1
+```
+
+**Result:** Kubernetes auto-restarted the coordinator container. MySQL was completely unaffected — no failover, no write interruption, 728 TPS (normal). The coordinator is a management layer; MySQL GR operates independently of it.
+
+### Exp 16: Long Network Partition (10 min)
+
+Isolate the primary from the replicas for 10 minutes — 5x longer than the standard 2-minute test.
+
+```yaml
+apiVersion: chaos-mesh.org/v1alpha1
+kind: NetworkChaos
+metadata:
+  name: mysql-primary-network-partition-long
+  namespace: chaos-mesh
+spec:
+  action: partition
+  mode: one
+  selector:
+    namespaces: [demo]
+    labelSelectors:
+      "app.kubernetes.io/instance": "mysql-ha-cluster"
+      "kubedb.com/role": "primary"
+  target:
+    mode: all
+    selector:
+      namespaces: [demo]
+      labelSelectors:
+        "kubedb.com/role": "standby"
+  direction: both
+  duration: "10m"
+```
+
+**Result:** Failover triggered within seconds. After the 10-minute partition was removed, the isolated node rejoined cleanly via GR distributed recovery. All 3 nodes were ONLINE within ~2 minutes. Zero data loss.
+
+### Exp 17: DNS Failure on Primary
+
+Block all DNS resolution on the primary for 3 minutes. GR uses hostnames for inter-node communication, so this tests a critical infrastructure dependency.
+
+```yaml
+apiVersion: chaos-mesh.org/v1alpha1
+kind: DNSChaos
+metadata:
+  name: mysql-dns-error-primary
+  namespace: chaos-mesh
+spec:
+  action: error
+  mode: one
+  selector:
+    namespaces: [demo]
+    labelSelectors:
+      "app.kubernetes.io/instance": "mysql-ha-cluster"
+      "kubedb.com/role": "primary"
+  duration: "3m"
+```
+
+**Result:** The primary survived without failover. TPS dropped ~32% (497 vs the ~730 baseline) because DNS-dependent operations timed out. No errors, no data loss. Existing TCP connections between GR members stayed open.
+
+### Exp 18: PVC Delete + Pod Kill
+
+Completely destroy a node's data — delete both the pod and its PVC. The node must rebuild from scratch using the CLONE plugin.
+
+```bash
+kubectl delete pod mysql-ha-cluster-0 -n demo --force --grace-period=0
+kubectl delete pvc data-mysql-ha-cluster-0 -n demo
+```
+
+**Result:** The StatefulSet auto-created a new PVC. The CLONE plugin copied a full data snapshot from a donor node. The pod recovered and joined GR in ~90 seconds with identical GTIDs and checksums. This is the ultimate recovery test — MySQL 8.0+ handles it fully automatically.
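For background, a hedged sketch of what that donor copy amounts to when issued by hand — the donor address, user, and password below are placeholders, and KubeDB/GR runs the equivalent automatically on the empty joiner:

```sql
-- Illustrative only: not the post's tooling; hostnames and credentials are placeholders.
SET GLOBAL clone_valid_donor_list = 'mysql-ha-cluster-1.demo:3306';
CLONE INSTANCE FROM 'clone_user'@'mysql-ha-cluster-1.demo':3306
  IDENTIFIED BY 'clone_password';
```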
+
+### Exp 19: IO Fault (EIO Errors)
+
+Inject I/O read/write errors (errno 5 = EIO) on 50% of disk operations on the primary's data volume. Unlike IO latency, which merely slows operations down, IO faults make them fail outright — simulating a failing disk or storage system.
+
+```yaml
+apiVersion: chaos-mesh.org/v1alpha1
+kind: IOChaos
+metadata:
+  name: mysql-primary-io-fault
+  namespace: chaos-mesh
+spec:
+  action: fault
+  mode: one
+  selector:
+    namespaces: [demo]
+    labelSelectors:
+      "app.kubernetes.io/instance": "mysql-ha-cluster"
+      "kubedb.com/role": "primary"
+  volumePath: "/var/lib/mysql"
+  path: "/**"
+  errno: 5
+  percent: 50
+  duration: "3m"
+```
+
+**Result:** Initially 703 TPS (InnoDB retries absorb some errors), but the 50% EIO rate eventually crashed the MySQL process on the primary. Failover triggered to a secondary. After the chaos was removed and the pod force-restarted, InnoDB crash recovery repaired the data directory and the node rejoined GR cleanly. Zero data loss.
+
+### Exp 20: Clock Skew (-5 min)
+
+Shift the primary's system clock back by 5 minutes. GR uses timestamps for conflict detection, Paxos message ordering, and timeout calculations. This tests whether clock drift breaks consensus or causes split-brain.
+
+```yaml
+apiVersion: chaos-mesh.org/v1alpha1
+kind: TimeChaos
+metadata:
+  name: mysql-primary-clock-skew
+  namespace: chaos-mesh
+spec:
+  mode: one
+  selector:
+    namespaces: [demo]
+    labelSelectors:
+      "app.kubernetes.io/instance": "mysql-ha-cluster"
+      "kubedb.com/role": "primary"
+  timeOffset: "-5m"
+  duration: "3m"
+```
+
+**Result:** 404 TPS (~45% reduction from baseline). No failover triggered, no errors. GR's Paxos protocol is resilient to clock drift — it uses logical clocks for consensus, not wall-clock time. All 3 nodes stayed ONLINE throughout. Zero data loss.
+
+### Exp 21: Bandwidth Throttle (1mbps)
+
+Limit the primary's outbound network bandwidth to 1mbps. This simulates a degraded network in cross-AZ or cross-region deployments where bandwidth is limited but not broken.
+
+```yaml
+apiVersion: chaos-mesh.org/v1alpha1
+kind: NetworkChaos
+metadata:
+  name: mysql-bandwidth-throttle
+  namespace: chaos-mesh
+spec:
+  action: bandwidth
+  mode: one
+  selector:
+    namespaces: [demo]
+    labelSelectors:
+      "app.kubernetes.io/instance": "mysql-ha-cluster"
+      "kubedb.com/role": "primary"
+  bandwidth:
+    rate: "1mbps"
+    limit: 20971520
+    buffer: 10000
+  duration: "3m"
+```
+
+**Result:** 147 TPS (~80% reduction from baseline). The bandwidth limit heavily throttles GR's Paxos consensus traffic, but the cluster stays completely stable — no failover, no errors, no member state changes. All 3 nodes remained ONLINE. Zero data loss.
+
 ## Failover Performance (Single-Primary)
 
 | Scenario | Failover Time | Full Recovery Time |
@@ -502,12 +711,16 @@ In Multi-Primary mode, **all 3 nodes accept writes** — there is no primary/rep
 
 ### Single-Primary Mode
 
-| Chaos Type | TPS During Chaos | Reduction from Baseline (~2,400) |
+| Chaos Type | TPS During Chaos | Reduction from Baseline (~730) |
 |---|---|---|
-| IO Latency (100ms) | 2-3.5 | 99.9% |
-| Network Latency (1s) | 1.2-1.4 | 99.9% |
+| IO Latency (100ms) | 2-3.5 | 99.5% |
+| Network Latency (1s) | 1.2-1.4 | 99.8% |
 | CPU Stress (98%) | 1,300-1,370 | ~46% |
 | Packet Loss (30%) | Variable | Triggers failover |
+| IO Fault (EIO 50%) | 703 then crash | Failover triggered |
+| Clock Skew (-5 min) | 404 | ~45% |
+| Bandwidth Throttle (1mbps) | 147 | ~80% |
+| DNS Failure | 497 | ~32% |
 
 ### Multi-Primary Mode
 
@@ -539,18 +752,27 @@ In Multi-Primary mode, **all 3 nodes accept writes** — there is no primary/rep
 | OOMKill Recovery | Yes | Yes | Yes |
 | Network Partition Recovery | Yes | Yes | Yes |
 | CLONE Plugin | Yes | Yes | Yes |
-| Single-Primary (12 tests) | **12/12** | **12/12** | **12/12** |
+| Single-Primary (core 12) | **12/12** | **12/12** | **12/12** |
+| Single-Primary (extended 13-21) | Not tested | **9/9** | Not tested |
 | Multi-Primary (12 tests) | Not tested | **12/12** | Not tested |
 
 ## Key Takeaways
 
-1. **KubeDB MySQL achieves zero data loss** across all 48 chaos experiments in both Single-Primary and Multi-Primary topologies.
+1. **KubeDB MySQL achieves zero data loss** across all 57 chaos experiments in both Single-Primary and Multi-Primary topologies.
 
-2. **Automatic failover works reliably** — primary election completes in 2-3 seconds, full recovery in under 4 minutes for all scenarios.
+2. **Automatic failover works reliably** — primary election completes in 2-3 seconds, full recovery in under 4 minutes for all scenarios, including double primary kill and disk failure.
 
 3. **Multi-Primary mode is production-ready** — all 12 experiments passed on MySQL 8.4.8. Be aware that multi-primary has higher sensitivity to CPU stress and network issues due to Paxos consensus requirements on all writable nodes.
 
-4. **Transient GTID mismatches are normal** — brief mismatches (15-30 seconds) during recovery are expected and resolve automatically via GR distributed recovery.
+4. **Full data rebuild works automatically** — even after complete PVC deletion, the CLONE plugin rebuilds a node from scratch in ~90 seconds with zero manual intervention.
+
+5. **Coordinator crash has zero impact** — MySQL GR operates independently of the coordinator sidecar. Killing the coordinator does not trigger failover or interrupt writes.
+
+6. **Disk failures trigger safe failover** — 50% I/O error rate eventually crashes MySQL, but InnoDB crash recovery + GR distributed recovery handles it with zero data loss after pod restart.
+
+7. **Clock skew and bandwidth limits are tolerated** — GR's Paxos protocol is resilient to 5-minute clock drift (~45% TPS drop, no errors) and 1mbps bandwidth limits (~80% TPS drop, no errors).
+
+8. **Transient GTID mismatches are normal** — brief mismatches (15-30 seconds) during recovery are expected and resolve automatically via GR distributed recovery.
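A hedged way to check for errant GTIDs by hand uses MySQL's standard GTID set functions — the primary's set below is a placeholder that would be fetched from the primary in practice:

```sql
-- Run on each replica; a non-empty result means errant transactions
-- (illustrative, not the post's actual verification tooling).
SELECT GTID_SUBTRACT(@@GLOBAL.gtid_executed,
                     'primary_uuid:1-500') AS errant_gtids;
```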
 
 ## What's Next
