From 3e842973b22af5cffc02beaaeac97175ed2fb347 Mon Sep 17 00:00:00 2001
From: Wei Cao
Date: Sat, 16 May 2026 03:54:14 +0800
Subject: [PATCH 1/3] docs: add vcluster KB Cluster delete convergence guide
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

vcluster syncer adds a multi-hop async chain between vcluster apiserver,
host kubelet, container runtime, and back. When a KB Cluster CR is
deleted, the full teardown (Cluster finalizer → Component → InstanceSet →
Pods → PVCs → vcluster mapping GC) typically takes 10-15 min in vcluster
vs 2-3 min in single-tier k3d, with 25 min worst case.

This guide:
- Documents the chain mechanism engine-neutral.
- Gives recommended cleanup-wait baselines (1500s per-run, 1800s
  cross-test) backed by Mason's host-side read-only investigation on
  2026-05-15.
- Lists signals for "still converging" vs "actually stuck".
- Provides the escalation packet shape for actual stuck cases.
- Warns against force-delete / patch-finalizer before evidence.

Companion to the existing vcluster bounded convergence window guide
which covers the parallel chaos pod-delete + replacement window.

Co-Authored-By: Claude Opus 4.7
---
 ...luster-cluster-delete-convergence-guide.md | 139 ++++++++++++++++++
 1 file changed, 139 insertions(+)
 create mode 100644 docs/addon-vcluster-cluster-delete-convergence-guide.md

diff --git a/docs/addon-vcluster-cluster-delete-convergence-guide.md b/docs/addon-vcluster-cluster-delete-convergence-guide.md
new file mode 100644
index 0000000..b26af02
--- /dev/null
+++ b/docs/addon-vcluster-cluster-delete-convergence-guide.md
@@ -0,0 +1,139 @@
+# vcluster KB Cluster Delete Convergence
+
+> **Audience**: addon test author writing per-run cleanup logic that deletes a KB Cluster CR and waits for namespace to be empty
+> **Status**: stable
+> **Applies to**: any KB addon test running inside vcluster (Loft / Sealos / hand-rolled) backed by a shared host K8s
+> **Companion docs**:
+> - [`addon-vcluster-bounded-convergence-window-guide.md`](addon-vcluster-bounded-convergence-window-guide.md) — chaos pod-delete + replacement window; this guide is the parallel for CR delete + finalizer chain
+> - [`addon-terminating-archive-before-force-finalizer-guide.md`](addon-terminating-archive-before-force-finalizer-guide.md) — never force / patch finalizer before evidence
+
+When a test deletes a KB `Cluster` CR (`kubectl delete cluster ` or
+`terminationPolicy: Delete`), inside vcluster the actual host-side teardown
+of pods + PVCs + InstanceSet + namespace mappings is much slower than the
+same operation in single-tier K8s. If your test's `cleanup wait` is too
+short, you will misclassify slow-but-converging cleanup as a finalizer
+deadlock or a KB controller bug.
+
+## Symptom
+
+After `kubectl delete cluster `:
+
+- Cluster object stays in `Deleting` with finalizer
+  `cluster.kubeblocks.io/finalizer`.
+- Component condition `wait for the workloads to be deleted` looping for
+  >3 min.
+- Pods phase `Failed/Error` with deletionTimestamp set but no finalizer
+  and not garbage-collected yet.
+- PVCs `Terminating` with `kubernetes.io/pvc-protection` finalizer.
+- InstanceSet `delete OK` in controller log but the object is still
+  visible from `kubectl get`.
+
+In single-tier k3d these typically clear within 2-3 min. In vcluster the
+same chain commonly takes 10-15 min, occasionally up to 25 min.
+
+## Mechanism (engine-neutral)
+
+```
+T0 : kubectl delete cluster X → Cluster has deletionTimestamp + cluster.kubeblocks.io/finalizer
+T0+: KB cluster-controller sees Cluster Deleting → cascades delete to Component → Component cascades to InstanceSet
+T0+: InstanceSet's owner-deletion handler issues delete for owned Pods, PVCs, headless Service
+T0+: vcluster syncer sees the vcluster-side Pod deletionTimestamp → propagates delete to host Pod
+T0+30s..3min: host kubelet runs preStop / SIGTERM / SIGKILL; host Pod enters Terminating
+T0+: host container runtime stops containers (engine flush, fsync); pod object GC
+T0+: vcluster syncer sees host Pod gone → propagates back to vcluster, removes vcluster Pod object
+T0+: vcluster InstanceSet's controller observes Pod gone → InstanceSet finalizer cleared → Component finalizer cleared → Cluster finalizer cleared
+T0+10..15min (typical): Cluster object actually disappears
+T0+25..30min (worst case): same but with engine taking longer to flush / fsync, or syncer GC running behind
+```
+
+Every "→" between vcluster and host is an async syncer hop with seconds-to-minutes
+of lag. The chain is long; the multiplier vs single-tier k3d is 4-8x.
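As a rough arithmetic check of the 4-8x multiplier, the sketch below scales the 2-3 min single-tier window quoted in this guide. The variable names are illustrative and not part of the guide; only the input figures come from it.

```shell
# Illustrative only: scale the single-tier k3d teardown window (2-3 min)
# by the 4-8x syncer multiplier quoted in this guide.
single_tier_min_s=120   # 2 min
single_tier_max_s=180   # 3 min

low=$((single_tier_min_s * 4))    # optimistic end of the range
high=$((single_tier_max_s * 8))   # pessimistic end of the range

echo "estimated vcluster teardown: ${low}s..${high}s"
```

The pessimistic end works out to 1440s (24 min), which sits close to the 25 min worst case observed and just under the 1500s per-run wait this guide recommends.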
+
+## Recommended cleanup-wait baseline
+
+| Test step | Single-tier k3d | vcluster |
+|---|---|---|
+| Per-run `cleanup wait` after `K delete cluster` | 180s | **≥1500s (25 min)** |
+| Cross-test namespace clean check | 300s | **≥1800s (30 min)** |
+| Soak teardown after EXIT trap | 600s | **≥1800s** |
+
+Concrete recommended snippet:
+
+```bash
+# After cleanup wait, check residual pods managed by KubeBlocks
+local cd=$((SECONDS + 1500))  # 25 min
+while [ "$SECONDS" -lt $cd ]; do
+  local pods
+  pods=$(K get pod -n "$NS" -l "app.kubernetes.io/instance=$cluster" --no-headers 2>/dev/null | wc -l | tr -d ' ')
+  [ "$pods" = "0" ] && break
+  sleep 15
+done
+echo "cleanup wait done. residual pods=$pods"
+```
+
+If `pods > 0` after 25 min, route to DevOps with the residual object names
+and time window (per `addon-vcluster-bounded-convergence-window-guide.md`
+escalation path). Do NOT `--force --grace-period=0` and do NOT patch the
+finalizer — that loses the diagnostic evidence and may corrupt cluster
+state.
+
+## When NOT to assume cleanup is stuck
+
+The "stuck cleanup" pattern is convergent in vcluster. Before escalating:
+
+1. Run the cleanup wait for at least the recommended baseline.
+2. Check `kubectl get events` for `delete ... successful` lines from
+   InstanceSet — proves the controller side has fired.
+3. Check `kubectl logs deploy/kb-kubeblocks -n kb-system` for `wait for the
+   workloads to be deleted` looping repeatedly with no error — proves KB is
+   waiting on InstanceSet, not deadlocked.
+4. If all of the above are normal, **wait longer**. Cluster will converge.
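Step 3 of the checklist can be sketched as a check over a saved controller log. This is a minimal sketch, not KubeBlocks tooling: the function name `kb_log_looks_converging`, the two-repeat threshold, and the case-insensitive match on `error` are all assumptions made for illustration.

```shell
# Hypothetical helper: classify a saved KB controller log excerpt.
# Returns 0 ("still converging") when the wait message repeats at least
# twice and no line mentions an error; 1 otherwise; 2 if the file is
# unreadable (so a missing log is never mistaken for a verdict).
kb_log_looks_converging() {
  kb_log_file=$1
  [ -r "$kb_log_file" ] || return 2
  # grep -c prints 0 and exits 1 on no match; `|| true` keeps errexit quiet.
  kb_waits=$(grep -c 'wait for the workloads to be deleted' "$kb_log_file" || true)
  kb_errors=$(grep -ci 'error' "$kb_log_file" || true)
  [ "$kb_waits" -ge 2 ] && [ "$kb_errors" -eq 0 ]
}

# Usage (the log path is hypothetical):
#   kubectl logs deploy/kb-kubeblocks -n kb-system > /tmp/kb.log
#   kb_log_looks_converging /tmp/kb.log && echo "still converging: wait longer"
```

Working from a saved file rather than a live pipe keeps the kubectl exit code and the log evidence separate from the classification step.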
+
+## When the cleanup IS stuck
+
+Stuck signals (must hold for >25 min after delete):
+
+- Cluster.deletionTimestamp + finalizer present
+- AND no progress from `kubectl get pod -n ` for >10 min straight
+- AND host-side Pod (resolvable via `-x--x-` mapping)
+  is also stuck (request DevOps to read-only check)
+- AND KB controller log shows no recent reconciliation activity for that
+  cluster
+
+In that case use the escalation packet:
+
+```text
+Environment blocker (vcluster cleanup stuck):
+- target: vcluster , ns , cluster
+- cluster delete time:
+- residual: pod-X phase Failed, PVC data-pod-X Terminating
+- finalizer present: cluster.kubeblocks.io/finalizer
+- controller log: "wait for the workloads to be deleted" looping >25 min
+- ruled out: cluster delete event propagated (instanceset DELETE OK in logs)
+- exact action: read-only check host-side pod state + node kubelet + container runtime; if stuck, restart kubelet on affected node (host-side) without force-deleting vcluster objects
+- work continuing: other lanes / other vcluster
+```
+
+## Source observations
+
+- 2026-05-15 Henry batch 4 self-induced kill of in-flight cluster (~25 min
+  observed) — initially misclassified as finalizer deadlock; Mason
+  read-only investigation
+  (`mysql-cleanup-residual-chaos-b4-73664-mason-readonly-20260515T114316Z.tar.gz`
+  sha `1e9d7fec0b904c1a797ee3684c4f179cc13d1f5f2c966b20ed3065c9932cb58b`)
+  showed the delete event DID propagate through syncer at T0+0,
+  host kubelet entered Killing path at T0+0, host Pod actually
+  GC'd at T0+4..13 min, vcluster mapping cleared at +13 min. No
+  deadlock, just slow convergence.
+- Henry's batch 4 / batch 5 / batch 6 / soak scripts updated to use 1500s
+  cleanup wait baseline after this finding.
+
+## Related skills / docs
+
+- `skills/soak-test-classification/SKILL.md` — classify long-run findings;
+  cleanup-stuck-but-eventually-converged is `external-environmental-cascade`,
+  not invariant break
+- `docs/addon-vcluster-bounded-convergence-window-guide.md` — chaos pod-delete
+  side of the same vcluster syncer multiplier story
+- `docs/addon-terminating-archive-before-force-finalizer-guide.md` — never
+  force-finalize before evidence

From 00b680da58305351cf2d0287a556744c3d264781 Mon Sep 17 00:00:00 2001
From: Wei Cao
Date: Sat, 16 May 2026 04:02:48 +0800
Subject: [PATCH 2/3] docs(vcluster-cluster-delete-convergence): preserve kubectl rc in cleanup snippet

Per William review on PR #149: the previous cleanup snippet used
`K get pod ... 2>/dev/null | wc -l`, which silently converts kubectl
API timeout / RBAC denial / NotFound into "0 pods" and misclassifies
env failure as a clean cleanup.

Reworked to:
- Capture kubectl rc separately
- Treat rc != 0 as "wait + retry + record" (not "clean")
- At the end of the window, if rc still != 0, return env-class signal
- Cross-reference addon-kubectl-pipeline-evidence-integrity-guide.md

Co-Authored-By: Claude Opus 4.7
---
 ...luster-cluster-delete-convergence-guide.md | 32 ++++++++++++++++---
 1 file changed, 27 insertions(+), 5 deletions(-)

diff --git a/docs/addon-vcluster-cluster-delete-convergence-guide.md b/docs/addon-vcluster-cluster-delete-convergence-guide.md
index b26af02..1749aa7 100644
--- a/docs/addon-vcluster-cluster-delete-convergence-guide.md
+++ b/docs/addon-vcluster-cluster-delete-convergence-guide.md
@@ -60,18 +60,40 @@ of lag. The chain is long; the multiplier vs single-tier k3d is 4-8x.
 Concrete recommended snippet:
 
 ```bash
-# After cleanup wait, check residual pods managed by KubeBlocks
+# After cleanup wait, check residual pods managed by KubeBlocks.
+# CRITICAL: preserve kubectl rc. `2>/dev/null | wc -l` would silently
+# convert an API timeout / RBAC denial / NotFound into "0 pods" and
+# misclassify env failure as a clean cleanup.
 local cd=$((SECONDS + 1500))  # 25 min
+local pods="" pods_rc=0
 while [ "$SECONDS" -lt $cd ]; do
-  local pods
-  pods=$(K get pod -n "$NS" -l "app.kubernetes.io/instance=$cluster" --no-headers 2>/dev/null | wc -l | tr -d ' ')
+  local stdout stderr
+  stdout=$(K get pod -n "$NS" -l "app.kubernetes.io/instance=$cluster" --no-headers 2>/tmp/cleanup.err)
+  pods_rc=$?
+  if [ "$pods_rc" -ne 0 ]; then
+    # API error: do NOT treat as "clean". Wait and retry; record for evidence.
+    echo "cleanup wait: kubectl rc=$pods_rc stderr=$(head -c 200 /tmp/cleanup.err)"
+    sleep 15
+    continue
+  fi
+  pods=$(echo "$stdout" | grep -c . || true)
   [ "$pods" = "0" ] && break
   sleep 15
 done
-echo "cleanup wait done. residual pods=$pods"
+if [ "$pods_rc" -ne 0 ]; then
+  echo "cleanup wait: API not healthy at the end (rc=$pods_rc) — route as env"
+  return 2
+fi
+echo "cleanup wait done. residual pods=$pods (rc=0 verified)"
 ```
 
-If `pods > 0` after 25 min, route to DevOps with the residual object names
+Three-track verdict (rc + stderr + observed count) avoids the silent-fallback
+trap. If `pods_rc != 0` at the end of the window, treat as environment, not
+"clean". See `addon-kubectl-pipeline-evidence-integrity-guide.md` for the
+general principle.
+
+If `pods > 0` (with rc=0) after 25 min, route to DevOps with the residual
+object names
 and time window (per `addon-vcluster-bounded-convergence-window-guide.md`
 escalation path). Do NOT `--force --grace-period=0` and do NOT patch the
 finalizer — that loses the diagnostic evidence and may corrupt cluster

From c7cc7003ca16f4e089ae11960df9d64dca01fac3 Mon Sep 17 00:00:00 2001
From: Wei Cao
Date: Sat, 16 May 2026 04:06:45 +0800
Subject: [PATCH 3/3] docs(vcluster-cluster-delete-convergence): fix set -e + per-iter stderr

Per William second-pass review on PR #149:

1. `stdout=$(K get ...)` under `set -e` exits before `pods_rc=$?` runs.
   Rewrote with `if cmd; then rc=0; else rc=$?; fi` so the rc survives
   in the failing branch under errexit.
2. Fixed `/tmp/cleanup.err` was overwritten every retry, losing evidence
   for non-final failures. Now per-iteration stderr file under
   `$EVD/cleanup-wait/get-pod-NNNN.err` keeps the full timeline.

Also switched `echo` of captured stdout to `printf '%s'` to avoid
backslash interpretation surprises.

Co-Authored-By: Claude Opus 4.7
---
 ...luster-cluster-delete-convergence-guide.md | 30 ++++++++++++++-----
 1 file changed, 22 insertions(+), 8 deletions(-)

diff --git a/docs/addon-vcluster-cluster-delete-convergence-guide.md b/docs/addon-vcluster-cluster-delete-convergence-guide.md
index 1749aa7..1693a31 100644
--- a/docs/addon-vcluster-cluster-delete-convergence-guide.md
+++ b/docs/addon-vcluster-cluster-delete-convergence-guide.md
@@ -63,20 +63,34 @@ Concrete recommended snippet:
 # After cleanup wait, check residual pods managed by KubeBlocks.
 # CRITICAL: preserve kubectl rc. `2>/dev/null | wc -l` would silently
 # convert an API timeout / RBAC denial / NotFound into "0 pods" and
-# misclassify env failure as a clean cleanup.
+# misclassify env failure as a clean cleanup. Also: under `set -e`,
+# `stdout=$(K ...)` would exit before `pods_rc=$?` runs, so use the
+# `if cmd; then ok; else rc=$?; fi` form so the failing branch keeps
+# the rc. Use a per-iteration stderr file to avoid clobbering on retry.
 local cd=$((SECONDS + 1500))  # 25 min
+local iter=0
 local pods="" pods_rc=0
+local cleanup_evd_dir="${EVD:-/tmp}/cleanup-wait"
+mkdir -p "$cleanup_evd_dir"
 while [ "$SECONDS" -lt $cd ]; do
-  local stdout stderr
-  stdout=$(K get pod -n "$NS" -l "app.kubernetes.io/instance=$cluster" --no-headers 2>/tmp/cleanup.err)
-  pods_rc=$?
+  iter=$((iter + 1))
+  local stderr_file="$cleanup_evd_dir/get-pod-$(printf '%04d' "$iter").err"
+  local stdout
+  if stdout=$(K get pod -n "$NS" \
+      -l "app.kubernetes.io/instance=$cluster" --no-headers \
+      2>"$stderr_file"); then
+    pods_rc=0
+  else
+    pods_rc=$?
+  fi
   if [ "$pods_rc" -ne 0 ]; then
-    # API error: do NOT treat as "clean". Wait and retry; record for evidence.
-    echo "cleanup wait: kubectl rc=$pods_rc stderr=$(head -c 200 /tmp/cleanup.err)"
+    # API error: do NOT treat as "clean". Wait and retry; per-iter file
+    # is kept for evidence (no clobbering across retries).
+    echo "cleanup wait iter=$iter: kubectl rc=$pods_rc stderr=$(head -c 200 "$stderr_file")"
     sleep 15
     continue
   fi
-  pods=$(echo "$stdout" | grep -c . || true)
+  pods=$(printf '%s' "$stdout" | grep -c . || true)
   [ "$pods" = "0" ] && break
   sleep 15
 done
@@ -84,7 +98,7 @@ if [ "$pods_rc" -ne 0 ]; then
   echo "cleanup wait: API not healthy at the end (rc=$pods_rc) — route as env"
   return 2
 fi
-echo "cleanup wait done. residual pods=$pods (rc=0 verified)"
+echo "cleanup wait done. residual pods=$pods (rc=0 verified, iters=$iter)"
 ```
 
 Three-track verdict (rc + stderr + observed count) avoids the silent-fallback