---
title: "Chaos Engineering Results: KubeDB MySQL Group Replication Achieves Zero Data Loss Across 54 Experiments"
date: "2026-04-08"
weight: 14
authors:
## Overview
We conducted **54 chaos experiments** across **3 MySQL versions** (8.0.36, 8.4.8, 9.6.0) and **2 Group Replication topologies** (Single-Primary and Multi-Primary) on KubeDB-managed 3-node clusters. The goal: validate that KubeDB MySQL delivers **zero data loss**, **automatic failover**, and **self-healing recovery** under realistic failure conditions with production-level write loads.
**The result: every experiment passed with zero data loss, zero split-brain, and zero errant GTIDs.**
## The 18-Experiment Matrix
Every MySQL version and topology was tested against a comprehensive experiment matrix covering single-node failures, resource exhaustion, network degradation, multi-fault scenarios, and advanced recovery tests:
### Core Experiments (1-12)
| # | Experiment | Chaos Type | What It Tests |
|---|---|---|---|
| 11 | Scheduled Pod Kill | Schedule | Repeated kills every 30s-1min |
| 12 | Degraded Failover | Workflow | IO latency + pod kill in sequence |
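
For reference, the pod-kill style experiments at the top of this matrix can be expressed as a Chaos Mesh `PodChaos` resource along these lines (a minimal sketch; the metadata name is an assumption, while the selectors mirror those used in the manifests later in this post):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: mysql-primary-pod-kill
  namespace: chaos-mesh
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces: [demo]
    labelSelectors:
      "app.kubernetes.io/instance": "mysql-ha-cluster"
      "kubedb.com/role": "primary"
```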
### Extended Experiments (13-18)
| # | Experiment | Chaos Type | What It Tests |
|---|---|---|---|
| 13 | Double Primary Kill | kubectl delete x2 | Kill primary, then immediately kill newly elected primary |
| 14 | Rolling Restart | kubectl delete x3 | Delete pods one at a time (0→1→2) under write load |
| 15 | Coordinator Crash | kill PID 1 | Kill coordinator sidecar, MySQL process stays running |
| 16 | Long Network Partition | NetworkChaos | 10-minute partition (5x longer than standard test) |
| 17 | DNS Failure | DNSChaos | Block DNS resolution on primary for 3 minutes |
| 18 | PVC Delete + Pod Kill | kubectl delete | Destroy pod + persistent storage, rebuild via CLONE |
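
A rolling restart of the kind in Exp 14 can be driven with a short loop like this (a sketch; the 5-minute readiness timeout is an assumption):

```shell
# Exp 14 sketch: delete pods one at a time (0 -> 1 -> 2) under write load,
# waiting for each replacement to become Ready before moving on
for i in 0 1 2; do
  kubectl delete pod mysql-ha-cluster-$i -n demo
  kubectl wait --for=condition=Ready pod/mysql-ha-cluster-$i -n demo --timeout=5m
done
```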
## Data Integrity Validation
Every experiment verified data integrity through **4 checks** across all 3 nodes:
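
Checks of this kind can be gathered from each node with `kubectl exec`. The sketch below is illustrative, not the exact harness: the sysbench-style `sbtest.sbtest1` table name and the `$MYSQL_ROOT_PASSWORD` credential are assumptions.

```shell
# Collect comparable integrity signals from each of the 3 nodes
for i in 0 1 2; do
  echo "--- mysql-ha-cluster-$i ---"
  kubectl exec -n demo mysql-ha-cluster-$i -c mysql -- \
    mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -N -e "
      SELECT COUNT(*) FROM sbtest.sbtest1;   -- row count
      CHECKSUM TABLE sbtest.sbtest1;         -- table checksum
      SELECT @@global.gtid_executed;         -- executed GTID set
    "
done
```

A pass requires every node to report identical values for each signal.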

### Exp 15: Coordinator Crash

**Result:** Kubernetes auto-restarted the coordinator container. MySQL was completely unaffected — no failover, no write interruption, 728 TPS (normal). The coordinator is a management layer; MySQL GR operates independently.
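
The fault itself amounts to killing PID 1 inside the sidecar container, roughly as follows (the `mysql-coordinator` container name, and the availability of a `kill` binary in that image, are assumptions):

```shell
# Kill only the coordinator sidecar; the mysql container keeps serving writes
kubectl exec -n demo mysql-ha-cluster-0 -c mysql-coordinator -- kill 1
```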
### Exp 16: Long Network Partition (10 min)
Isolate the primary from replicas for 10 minutes — 5x longer than the standard 2-minute test.
```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: mysql-primary-network-partition-long
  namespace: chaos-mesh
spec:
  action: partition
  mode: one
  selector:
    namespaces: [demo]
    labelSelectors:
      "app.kubernetes.io/instance": "mysql-ha-cluster"
      "kubedb.com/role": "primary"
  target:
    mode: all
    selector:
      namespaces: [demo]
      labelSelectors:
        "kubedb.com/role": "standby"
  direction: both
  duration: "10m"
```
**Result:** Failover triggered within seconds. Once the 10-minute partition was lifted, the isolated node rejoined cleanly via GR distributed recovery. All 3 nodes were ONLINE within ~2 minutes. Zero data loss.
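
Rejoin progress can be watched from any ONLINE member using the standard Group Replication membership view (pod and credential names here are assumptions):

```shell
# MEMBER_STATE should converge to ONLINE for all 3 members after recovery
kubectl exec -n demo mysql-ha-cluster-1 -c mysql -- \
  mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -e \
  "SELECT MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE
     FROM performance_schema.replication_group_members;"
```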
### Exp 17: DNS Failure on Primary
Block all DNS resolution on the primary for 3 minutes. GR uses hostnames for inter-node communication, so this tests a critical infrastructure dependency.
```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: DNSChaos
metadata:
  name: mysql-dns-error-primary
  namespace: chaos-mesh
spec:
  action: error
  mode: one
  selector:
    namespaces: [demo]
    labelSelectors:
      "app.kubernetes.io/instance": "mysql-ha-cluster"
      "kubedb.com/role": "primary"
  duration: "3m"
```
**Result:** Primary survived without failover. TPS dropped ~32% (497 vs ~730 baseline) due to DNS-dependent operations timing out. No errors, no data loss. Existing TCP connections between GR members stayed open.
### Exp 18: PVC Delete + Pod Kill
Completely destroy a node's data — delete both the pod and its PVC. The node must rebuild from scratch using the CLONE plugin.
```bash
kubectl delete pod mysql-ha-cluster-0 -n demo --force --grace-period=0
# PVC name below assumes the default StatefulSet volume-claim naming; adjust to your cluster
kubectl delete pvc data-mysql-ha-cluster-0 -n demo
```

**Result:** StatefulSet auto-created a new PVC. The CLONE plugin copied a full data snapshot from a donor node. Pod recovered and joined GR in ~90 seconds with identical GTIDs and checksums. This is the ultimate recovery test — MySQL 8.0+ handles it fully automatically.
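
Clone progress on the rebuilt node can be confirmed through `performance_schema` (pod and credential names here are assumptions):

```shell
# STATE should read "Completed" once the donor snapshot has been applied
kubectl exec -n demo mysql-ha-cluster-0 -c mysql -- \
  mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -e \
  "SELECT STATE, BEGIN_TIME, END_TIME FROM performance_schema.clone_status;"
```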
## Failover Performance (Single-Primary)
| Scenario | Failover Time | Full Recovery Time |
| Single-Primary (extended 13-18) | Not tested | **6/6** | Not tested |
| Multi-Primary (12 tests) | Not tested | **12/12** | Not tested |
## Key Takeaways
1. **KubeDB MySQL achieves zero data loss** across all 54 chaos experiments in both Single-Primary and Multi-Primary topologies.
2. **Automatic failover works reliably** — primary election completes in 2-3 seconds, full recovery in under 4 minutes for all scenarios, including double primary kill.
3. **Multi-Primary mode is production-ready** — all 12 experiments passed on MySQL 8.4.8. Be aware that multi-primary has higher sensitivity to CPU stress and network issues due to Paxos consensus requirements on all writable nodes.
4. **Full data rebuild works automatically** — even after complete PVC deletion, the CLONE plugin rebuilds a node from scratch in ~90 seconds with zero manual intervention.
5. **Coordinator crash has zero impact** — MySQL GR operates independently of the coordinator sidecar. Killing the coordinator does not trigger failover or interrupt writes.
6. **Transient GTID mismatches are normal** — brief mismatches (15-30 seconds) during recovery are expected and resolve automatically via GR distributed recovery.
0 commit comments