
Commit 8817f5b

update 6 more test
Signed-off-by: SK Ali Arman <arman@appscode.com>
1 parent 38c43bf

1 file changed: 140 additions & 9 deletions

File changed: content/post/chaos-testing-mysql/index.md
@@ -1,5 +1,5 @@
 ---
-title: "Chaos Engineering Results: KubeDB MySQL Group Replication Achieves Zero Data Loss Across 48 Experiments"
+title: "Chaos Engineering Results: KubeDB MySQL Group Replication Achieves Zero Data Loss Across 54 Experiments"
 date: "2026-04-08"
 weight: 14
 authors:
@@ -17,7 +17,7 @@ tags:
 
 ## Overview
 
-We conducted **48 chaos experiments** across **3 MySQL versions** (8.0.36, 8.4.8, 9.6.0) and **2 Group Replication topologies** (Single-Primary and Multi-Primary) on KubeDB-managed 3-node clusters. The goal: validate that KubeDB MySQL delivers **zero data loss**, **automatic failover**, and **self-healing recovery** under realistic failure conditions with production-level write loads.
+We conducted **54 chaos experiments** across **3 MySQL versions** (8.0.36, 8.4.8, 9.6.0) and **2 Group Replication topologies** (Single-Primary and Multi-Primary) on KubeDB-managed 3-node clusters. The goal: validate that KubeDB MySQL delivers **zero data loss**, **automatic failover**, and **self-healing recovery** under realistic failure conditions with production-level write loads.
 
 **The result: every experiment passed with zero data loss, zero split-brain, and zero errant GTIDs.**
 
@@ -385,9 +385,11 @@ for i in 0 1 2; do
 done
 ```
 
-## The 12-Experiment Matrix
+## The 18-Experiment Matrix
 
-Every MySQL version and topology was tested against the same 12-experiment matrix covering single-node failures, resource exhaustion, network degradation, and complex multi-fault scenarios:
+Every MySQL version and topology was tested against a comprehensive experiment matrix covering single-node failures, resource exhaustion, network degradation, multi-fault scenarios, and advanced recovery tests:
+
+### Core Experiments (1-12)
 
 | # | Experiment | Chaos Type | What It Tests |
 |---|---|---|---|
@@ -404,6 +406,17 @@
 | 11 | Scheduled Pod Kill | Schedule | Repeated kills every 30s-1min |
 | 12 | Degraded Failover | Workflow | IO latency + pod kill in sequence |
 
+### Extended Experiments (13-18)
+
+| # | Experiment | Chaos Type | What It Tests |
+|---|---|---|---|
+| 13 | Double Primary Kill | kubectl delete x2 | Kill primary, then immediately kill newly elected primary |
+| 14 | Rolling Restart | kubectl delete x3 | Delete pods one at a time (0→1→2) under write load |
+| 15 | Coordinator Crash | kill PID 1 | Kill coordinator sidecar, MySQL process stays running |
+| 16 | Long Network Partition | NetworkChaos | 10-minute partition (5x longer than standard test) |
+| 17 | DNS Failure | DNSChaos | Block DNS resolution on primary for 3 minutes |
+| 18 | PVC Delete + Pod Kill | kubectl delete | Destroy pod + persistent storage, rebuild via CLONE |
+
 ## Data Integrity Validation
 
 Every experiment verified data integrity through **4 checks** across all 3 nodes:
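The four checks themselves fall outside this hunk, but the pass/fail logic they imply can be sketched as below. This is an illustrative sketch only: the `rowcount:checksum:gtid` report format and the `verify_node_reports` helper are assumptions for illustration, not the post's actual harness.

```shell
# Sketch: each node emits one "rowcount:checksum:gtid_executed" report;
# an experiment passes only if all three node reports are identical.
verify_node_reports() {
  local first="$1" report
  for report in "$@"; do
    if [ "$report" != "$first" ]; then
      echo "FAIL: $report differs from $first"
      return 1
    fi
  done
  echo "PASS: all $# node reports identical"
}

# In the real test each report would be fetched from a pod, e.g. (assumed):
#   kubectl exec -n demo mysql-ha-cluster-0 -c mysql -- mysql -N -e "..."
verify_node_reports "100000:ab12cd:src:1-500" \
                    "100000:ab12cd:src:1-500" \
                    "100000:ab12cd:src:1-500"
# prints: PASS: all 3 node reports identical
```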
@@ -432,7 +445,7 @@ Every experiment verified data integrity through **4 checks** across all 3 nodes
 | 11 | Scheduled Replica Kill | Multiple | Zero | 0 | **PASS** |
 | 12 | Degraded Failover | Yes | Zero | 0 | **PASS** |
 
-### MySQL 8.4.8 — All 12 PASSED
+### MySQL 8.4.8 — All 18 PASSED (12 core + 6 extended)
 
 | # | Experiment | Failover | Data Loss | Errant GTIDs | Verdict |
 |---|---|---|---|---|---|
@@ -448,6 +461,12 @@
 | 10 | OOMKill Natural | No (survived) | Zero | 0 | **PASS** |
 | 11 | Scheduled Replica Kill | Multiple | Zero | 0 | **PASS** |
 | 12 | Degraded Failover | Yes | Zero | 0 | **PASS** |
+| 13 | Double Primary Kill | Yes (x2) | Zero | 0 | **PASS** |
+| 14 | Rolling Restart (0→1→2) | Yes (x3) | Zero | 0 | **PASS** |
+| 15 | Coordinator Crash | No | Zero | 0 | **PASS** |
+| 16 | Long Network Partition (10 min) | Yes | Zero | 0 | **PASS** |
+| 17 | DNS Failure on Primary | No | Zero | 0 | **PASS** |
+| 18 | PVC Delete + Pod Kill | Yes | Zero | 0 | **PASS** |
 
 ### MySQL 8.0.36 — All 12 PASSED
 
@@ -487,6 +506,113 @@
 
 **All 12 experiments PASSED with zero data loss.**
 
+## Extended Experiments — Details (MySQL 8.4.8 Single-Primary)
+
+### Exp 13: Double Primary Kill
+
+Kill the primary, wait for the new election, then immediately kill the new primary. Tests survival of two consecutive leader failures.
+
+```bash
+# Kill first primary
+kubectl delete pod mysql-ha-cluster-2 -n demo --force --grace-period=0
+# Wait 15s for new primary election, then kill the new primary
+sleep 15
+NEW_PRIMARY=$(kubectl get pods -n demo \
+  -l "app.kubernetes.io/instance=mysql-ha-cluster,kubedb.com/role=primary" \
+  -o jsonpath='{.items[0].metadata.name}')
+kubectl delete pod $NEW_PRIMARY -n demo --force --grace-period=0
+```
+
+**Result:** Pod-0 was elected as the third primary. The cluster recovered in ~90 seconds. Zero data loss.
+
+### Exp 14: Rolling Restart (0→1→2)
+
+Simulate a rolling upgrade — delete each pod sequentially with 40-second gaps under write load.
+
+```bash
+kubectl delete pod mysql-ha-cluster-0 -n demo --force --grace-period=0
+sleep 40
+kubectl delete pod mysql-ha-cluster-1 -n demo --force --grace-period=0
+sleep 40
+kubectl delete pod mysql-ha-cluster-2 -n demo --force --grace-period=0
+```
+
+**Result:** Each pod recovered and rejoined within ~30 seconds. Two failovers were triggered (when the primary was deleted). Zero data loss.
+
+### Exp 15: Coordinator Crash
+
+Kill only the mysql-coordinator sidecar container, leaving MySQL running. Tests whether the cluster stays stable without the coordinator.
+
+```bash
+kubectl exec -n demo mysql-ha-cluster-1 -c mysql-coordinator -- kill 1
+```
+
+**Result:** Kubernetes auto-restarted the coordinator container. MySQL was completely unaffected — no failover, no write interruption, 728 TPS (normal). The coordinator is a management layer; MySQL GR operates independently.
+
+### Exp 16: Long Network Partition (10 min)
+
+Isolate the primary from the replicas for 10 minutes — 5x longer than the standard 2-minute test.
+
+```yaml
+apiVersion: chaos-mesh.org/v1alpha1
+kind: NetworkChaos
+metadata:
+  name: mysql-primary-network-partition-long
+  namespace: chaos-mesh
+spec:
+  action: partition
+  mode: one
+  selector:
+    namespaces: [demo]
+    labelSelectors:
+      "app.kubernetes.io/instance": "mysql-ha-cluster"
+      "kubedb.com/role": "primary"
+  target:
+    mode: all
+    selector:
+      namespaces: [demo]
+      labelSelectors:
+        "kubedb.com/role": "standby"
+  direction: both
+  duration: "10m"
+```
+
+**Result:** Failover triggered within seconds. After the 10-minute partition was removed, the isolated node rejoined cleanly via GR distributed recovery. All 3 nodes were ONLINE within ~2 minutes. Zero data loss.
+
+### Exp 17: DNS Failure on Primary
+
+Block all DNS resolution on the primary for 3 minutes. GR uses hostnames for inter-node communication, so this tests a critical infrastructure dependency.
+
+```yaml
+apiVersion: chaos-mesh.org/v1alpha1
+kind: DNSChaos
+metadata:
+  name: mysql-dns-error-primary
+  namespace: chaos-mesh
+spec:
+  action: error
+  mode: one
+  selector:
+    namespaces: [demo]
+    labelSelectors:
+      "app.kubernetes.io/instance": "mysql-ha-cluster"
+      "kubedb.com/role": "primary"
+  duration: "3m"
+```
+
+**Result:** The primary survived without failover. TPS dropped ~32% (497 vs ~730 baseline) due to DNS-dependent operations timing out. No errors, no data loss. Existing TCP connections between GR members stayed open.
+
+### Exp 18: PVC Delete + Pod Kill
+
+Completely destroy a node's data — delete both the pod and its PVC. The node must rebuild from scratch using the CLONE plugin.
+
+```bash
+kubectl delete pod mysql-ha-cluster-0 -n demo --force --grace-period=0
+kubectl delete pvc data-mysql-ha-cluster-0 -n demo
+```
+
+**Result:** The StatefulSet auto-created a new PVC. The CLONE plugin copied a full data snapshot from a donor node. The pod recovered and joined GR in ~90 seconds with identical GTIDs and checksums. This is the ultimate recovery test — MySQL 8.0+ handles it fully automatically.
+
 ## Failover Performance (Single-Primary)
 
 | Scenario | Failover Time | Full Recovery Time |
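A recurring verification step behind the recovery claims in the extended experiments is confirming that every member returns to ONLINE in `performance_schema.replication_group_members`. The snippet below shows only that decision logic against sample output; the kubectl/mysql invocation in the comment is an assumption about how the query would be run in-cluster.

```shell
# Sketch: count ONLINE members from the output of
#   SELECT MEMBER_STATE FROM performance_schema.replication_group_members;
count_online() { grep -c '^ONLINE$'; }

# Sample query output for illustration; the real output would come via e.g.
#   kubectl exec -n demo mysql-ha-cluster-0 -c mysql -- mysql -N -e "..."
states="ONLINE
ONLINE
ONLINE"
if [ "$(printf '%s\n' "$states" | count_online)" -eq 3 ]; then
  echo "group healed: 3/3 members ONLINE"
fi
# prints: group healed: 3/3 members ONLINE
```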
@@ -539,18 +665,23 @@
 | OOMKill Recovery | Yes | Yes | Yes |
 | Network Partition Recovery | Yes | Yes | Yes |
 | CLONE Plugin | Yes | Yes | Yes |
-| Single-Primary (12 tests) | **12/12** | **12/12** | **12/12** |
+| Single-Primary (core 12) | **12/12** | **12/12** | **12/12** |
+| Single-Primary (extended 13-18) | Not tested | **6/6** | Not tested |
 | Multi-Primary (12 tests) | Not tested | **12/12** | Not tested |
 
 ## Key Takeaways
 
-1. **KubeDB MySQL achieves zero data loss** across all 48 chaos experiments in both Single-Primary and Multi-Primary topologies.
+1. **KubeDB MySQL achieves zero data loss** across all 54 chaos experiments in both Single-Primary and Multi-Primary topologies.
 
-2. **Automatic failover works reliably** — primary election completes in 2-3 seconds, full recovery in under 4 minutes for all scenarios.
+2. **Automatic failover works reliably** — primary election completes in 2-3 seconds, full recovery in under 4 minutes for all scenarios, including double primary kill.
 
 3. **Multi-Primary mode is production-ready** — all 12 experiments passed on MySQL 8.4.8. Be aware that multi-primary has higher sensitivity to CPU stress and network issues due to Paxos consensus requirements on all writable nodes.
 
-4. **Transient GTID mismatches are normal** — brief mismatches (15-30 seconds) during recovery are expected and resolve automatically via GR distributed recovery.
+4. **Full data rebuild works automatically** — even after complete PVC deletion, the CLONE plugin rebuilds a node from scratch in ~90 seconds with zero manual intervention.
+
+5. **Coordinator crash has zero impact** — MySQL GR operates independently of the coordinator sidecar. Killing the coordinator does not trigger failover or interrupt writes.
+
+6. **Transient GTID mismatches are normal** — brief mismatches (15-30 seconds) during recovery are expected and resolve automatically via GR distributed recovery.
 
 ## What's Next
 
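The automatic rebuild in takeaway 4 exposes its progress on the recipient node through `performance_schema.clone_status`, whose STATE column moves to "Completed" when the snapshot copy finishes. The gating logic could be sketched as below; the pod name, credentials, and the `clone_finished` helper are assumptions for illustration.

```shell
# Sketch: gate on the CLONE recipient's STATE column.
# STATE would normally be fetched with something like (assumed):
#   kubectl exec -n demo mysql-ha-cluster-0 -c mysql -- mysql -N -e \
#     "SELECT STATE FROM performance_schema.clone_status"
clone_finished() { [ "$1" = "Completed" ]; }

if clone_finished "Completed"; then
  echo "CLONE done; node ready to rejoin Group Replication"
fi
# prints: CLONE done; node ready to rejoin Group Replication
```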