diff --git a/docs/addon-vcluster-cluster-delete-convergence-guide.md b/docs/addon-vcluster-cluster-delete-convergence-guide.md
new file mode 100644
index 0000000..1693a31
--- /dev/null
+++ b/docs/addon-vcluster-cluster-delete-convergence-guide.md
@@ -0,0 +1,175 @@
+# vcluster KB Cluster Delete Convergence
+
+> **Audience**: addon test author writing per-run cleanup logic that deletes a KB Cluster CR and waits for namespace to be empty
+> **Status**: stable
+> **Applies to**: any KB addon test running inside vcluster (Loft / Sealos / hand-rolled) backed by a shared host K8s
+> **Companion docs**:
+> - [`addon-vcluster-bounded-convergence-window-guide.md`](addon-vcluster-bounded-convergence-window-guide.md) — chaos pod-delete + replacement window; this guide is the parallel for CR delete + finalizer chain
+> - [`addon-terminating-archive-before-force-finalizer-guide.md`](addon-terminating-archive-before-force-finalizer-guide.md) — never force / patch finalizer before evidence
+
+When a test deletes a KB `Cluster` CR (`kubectl delete cluster ` or
+`terminationPolicy: Delete`), inside vcluster the actual host-side teardown
+of pods + PVCs + InstanceSet + namespace mappings is much slower than the
+same operation in single-tier K8s. If your test's `cleanup wait` is too
+short, you will misclassify slow-but-converging cleanup as a finalizer
+deadlock or a KB controller bug.
+
+## Symptom
+
+After `kubectl delete cluster `:
+
+- Cluster object stays in `Deleting` with finalizer
+  `cluster.kubeblocks.io/finalizer`.
+- Component condition `wait for the workloads to be deleted` looping for
+  >3 min.
+- Pods phase `Failed/Error` with deletionTimestamp set but no finalizer
+  and not garbage-collected yet.
+- PVCs `Terminating` with `kubernetes.io/pvc-protection` finalizer.
+- InstanceSet `delete OK` in controller log but the object is still
+  visible from `kubectl get`.
+
+In single-tier k3d these typically clear within 2-3 min. In vcluster the
+same chain commonly takes 10-15 min, occasionally up to 25 min.
+
+## Mechanism (engine-neutral)
+
+```
+T0 : kubectl delete cluster X → Cluster has deletionTimestamp + cluster.kubeblocks.io/finalizer
+T0+: KB cluster-controller sees Cluster Deleting → cascades delete to Component → Component cascades to InstanceSet
+T0+: InstanceSet's owner-deletion handler issues delete for owned Pods, PVCs, headless Service
+T0+: vcluster syncer sees the vcluster-side Pod deletionTimestamp → propagates delete to host Pod
+T0+30s..3min: host kubelet runs preStop / SIGTERM / SIGKILL; host Pod enters Terminating
+T0+: host container runtime stops containers (engine flush, fsync); pod object GC
+T0+: vcluster syncer sees host Pod gone → propagates back to vcluster, removes vcluster Pod object
+T0+: vcluster InstanceSet's controller observes Pod gone → InstanceSet finalizer cleared → Component finalizer cleared → Cluster finalizer cleared
+T0+10..15min (typical): Cluster object actually disappears
+T0+25..30min (worst case): same but with engine taking longer to flush / fsync, or syncer GC running behind
+```
+
+Every "→" between vcluster and host is an async syncer hop with seconds-to-minutes
+of lag. The chain is long; the multiplier vs single-tier k3d is 4-8x.
+
+## Recommended cleanup-wait baseline
+
+| Test step | Single-tier k3d | vcluster |
+|---|---|---|
+| Per-run `cleanup wait` after `K delete cluster` | 180s | **≥1500s (25 min)** |
+| Cross-test namespace clean check | 300s | **≥1800s (30 min)** |
+| Soak teardown after EXIT trap | 600s | **≥1800s** |
+
+Concrete recommended snippet:
+
+```bash
+# After cleanup wait, check residual pods managed by KubeBlocks.
+# CRITICAL: preserve kubectl rc. `2>/dev/null | wc -l` would silently
+# convert an API timeout / RBAC denial / NotFound into "0 pods" and
+# misclassify env failure as a clean cleanup. Also: under `set -e`,
+# `stdout=$(K ...)` would exit before `pods_rc=$?` runs, so use the
+# `if cmd; then ok; else rc=$?; fi` form so the failing branch keeps
+# the rc. Use a per-iteration stderr file to avoid clobbering on retry.
+local cd=$((SECONDS + 1500))   # 25 min
+local iter=0
+local pods="" pods_rc=0
+local cleanup_evd_dir="${EVD:-/tmp}/cleanup-wait"
+mkdir -p "$cleanup_evd_dir"
+while [ "$SECONDS" -lt $cd ]; do
+  iter=$((iter + 1))
+  local stderr_file="$cleanup_evd_dir/get-pod-$(printf '%04d' "$iter").err"
+  local stdout
+  if stdout=$(K get pod -n "$NS" \
+      -l "app.kubernetes.io/instance=$cluster" --no-headers \
+      2>"$stderr_file"); then
+    pods_rc=0
+  else
+    pods_rc=$?
+  fi
+  if [ "$pods_rc" -ne 0 ]; then
+    # API error: do NOT treat as "clean". Wait and retry; per-iter file
+    # is kept for evidence (no clobbering across retries).
+    echo "cleanup wait iter=$iter: kubectl rc=$pods_rc stderr=$(head -c 200 "$stderr_file")"
+    sleep 15
+    continue
+  fi
+  pods=$(printf '%s' "$stdout" | grep -c . || true)
+  [ "$pods" = "0" ] && break
+  sleep 15
+done
+if [ "$pods_rc" -ne 0 ]; then
+  echo "cleanup wait: API not healthy at the end (rc=$pods_rc) — route as env"
+  return 2
+fi
+echo "cleanup wait done. residual pods=$pods (rc=0 verified, iters=$iter)"
+```
+
+Three-track verdict (rc + stderr + observed count) avoids the silent-fallback
+trap. If `pods_rc != 0` at the end of the window, treat as environment, not
+"clean". See `addon-kubectl-pipeline-evidence-integrity-guide.md` for the
+general principle.
+
+If `pods > 0` (with rc=0) after 25 min, route to DevOps with the residual
+object names
+and time window (per `addon-vcluster-bounded-convergence-window-guide.md`
+escalation path). Do NOT `--force --grace-period=0` and do NOT patch the
+finalizer — that loses the diagnostic evidence and may corrupt cluster
+state.
+
+## When NOT to assume cleanup is stuck
+
+The "stuck cleanup" pattern is convergent in vcluster. Before escalating:
+
+1. Run the cleanup wait for at least the recommended baseline.
+2. Check `kubectl get events` for `delete ... successful` lines from
+   InstanceSet — proves the controller side has fired.
+3. Check `kubectl logs deploy/kb-kubeblocks -n kb-system` for `wait for the
+   workloads to be deleted` looping repeatedly with no error — proves KB is
+   waiting on InstanceSet, not deadlocked.
+4. If all of the above are normal, **wait longer**. Cluster will converge.
+
+## When the cleanup IS stuck
+
+Stuck signals (must hold for >25 min after delete):
+
+- Cluster.deletionTimestamp + finalizer present
+- AND no progress from `kubectl get pod -n ` for >10 min straight
+- AND host-side Pod (resolvable via `-x--x-` mapping)
+  is also stuck (request DevOps to read-only check)
+- AND KB controller log shows no recent reconciliation activity for that
+  cluster
+
+In that case use the escalation packet:
+
+```text
+Environment blocker (vcluster cleanup stuck):
+- target: vcluster , ns , cluster
+- cluster delete time:
+- residual: pod-X phase Failed, PVC data-pod-X Terminating
+- finalizer present: cluster.kubeblocks.io/finalizer
+- controller log: "wait for the workloads to be deleted" looping >25 min
+- ruled out: cluster delete event propagated (instanceset DELETE OK in logs)
+- exact action: read-only check host-side pod state + node kubelet + container runtime; if stuck, restart kubelet on affected node (host-side) without force-deleting vcluster objects
+- work continuing: other lanes / other vcluster
+```
+
+## Source observations
+
+- 2026-05-15 Henry batch 4 self-induced kill of in-flight cluster (~25 min
+  observed) — initially misclassified as finalizer deadlock; Mason
+  read-only investigation
+  (`mysql-cleanup-residual-chaos-b4-73664-mason-readonly-20260515T114316Z.tar.gz`
+  sha `1e9d7fec0b904c1a797ee3684c4f179cc13d1f5f2c966b20ed3065c9932cb58b`)
+  showed the delete event DID propagate through syncer at T0+0,
+  host kubelet entered Killing path at T0+0, host Pod actually
+  GC'd at T0+4..13 min, vcluster mapping cleared at +13 min. No
+  deadlock, just slow convergence.
+- Henry's batch 4 / batch 5 / batch 6 / soak scripts updated to use 1500s
+  cleanup wait baseline after this finding.
+
+## Related skills / docs
+
+- `skills/soak-test-classification/SKILL.md` — classify long-run findings;
+  cleanup-stuck-but-eventually-converged is `external-environmental-cascade`,
+  not invariant break
+- `docs/addon-vcluster-bounded-convergence-window-guide.md` — chaos pod-delete
+  side of the same vcluster syncer multiplier story
+- `docs/addon-terminating-archive-before-force-finalizer-guide.md` — never
+  force-finalize before evidence
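
The verdict rules the guide describes (non-zero kubectl rc means environment, zero residual pods means clean, residual pods inside the 1500s window means still converging, residual pods past the window means escalate) can be condensed into one pure function. The following is a minimal sketch and not part of the patch: the function name `classify_cleanup_wait`, its argument order, and the verdict strings are hypothetical, chosen only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the patch): classify one cleanup-wait
# observation using the rules stated in the guide.
#   $1 = kubectl exit code for the residual-pod query
#   $2 = residual pod count (may be empty when $1 is non-zero)
#   $3 = seconds elapsed since the Cluster delete was issued
#   $4 = wait window in seconds (defaults to the 1500s vcluster baseline)
classify_cleanup_wait() {
  local rc="$1" residual="$2" elapsed="$3" window="${4:-1500}"
  if [ "$rc" -ne 0 ]; then
    # API error: never report "clean"; the count is not trustworthy.
    echo "env"
  elif [ "$residual" -eq 0 ]; then
    echo "clean"
  elif [ "$elapsed" -lt "$window" ]; then
    # Residual pods inside the window: slow convergence, keep waiting.
    echo "converging"
  else
    # Residual pods after the full window: hand over to DevOps.
    echo "escalate"
  fi
}
```

Called once per poll iteration, for example `classify_cleanup_wait "$pods_rc" "$pods" "$elapsed"`, it keeps the routing decision free of kubectl calls, so the decision table can be checked offline.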