From 3e842973b22af5cffc02beaaeac97175ed2fb347 Mon Sep 17 00:00:00 2001
From: Wei Cao <cyg.cao@gmail.com>
Date: Sat, 16 May 2026 03:54:14 +0800
Subject: [PATCH 1/3] docs: add vcluster KB Cluster delete convergence guide
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

vcluster syncer adds a multi-hop async chain between vcluster apiserver,
host kubelet, container runtime, and back. When a KB Cluster CR is
deleted, the full teardown (Cluster finalizer → Component → InstanceSet
→ Pods → PVCs → vcluster mapping GC) typically takes 10-15 min in
vcluster vs 2-3 min in single-tier k3d, with 25 min worst case.

This guide:
- Documents the teardown chain mechanism in engine-neutral terms.
- Gives recommended cleanup-wait baselines (1500s per-run, 1800s
  cross-test) backed by Mason's host-side read-only investigation on
  2026-05-15.
- Lists signals for "still converging" vs "actually stuck".
- Provides the escalation packet shape for actual stuck cases.
- Warns against force-deleting or patching finalizers before collecting
  evidence.

Companion to the existing vcluster bounded convergence window guide
which covers the parallel chaos pod-delete + replacement window.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 ...luster-cluster-delete-convergence-guide.md | 139 ++++++++++++++++++
 1 file changed, 139 insertions(+)
 create mode 100644 docs/addon-vcluster-cluster-delete-convergence-guide.md
diff --git a/docs/addon-vcluster-cluster-delete-convergence-guide.md b/docs/addon-vcluster-cluster-delete-convergence-guide.md
new file mode 100644
index 0000000..b26af02
--- /dev/null
+++ b/docs/addon-vcluster-cluster-delete-convergence-guide.md
@@ -0,0 +1,139 @@
+# vcluster KB Cluster Delete Convergence
+
+> **Audience**: addon test author writing per-run cleanup logic that deletes a KB Cluster CR and waits for the namespace to be empty
+> **Status**: stable
+> **Applies to**: any KB addon test running inside vcluster (Loft / Sealos / hand-rolled) backed by a shared host K8s
+> **Companion docs**:
+> - [`addon-vcluster-bounded-convergence-window-guide.md`](addon-vcluster-bounded-convergence-window-guide.md): chaos pod-delete + replacement window; this guide covers the parallel CR delete + finalizer chain
+> - [`addon-terminating-archive-before-force-finalizer-guide.md`](addon-terminating-archive-before-force-finalizer-guide.md): never force-delete or patch finalizers before collecting evidence
+
+When a test deletes a KB `Cluster` CR (`kubectl delete cluster <name>`,
+typically with `terminationPolicy: Delete`), the teardown of Pods, PVCs,
+the InstanceSet, and the vcluster-to-host mappings is much slower inside
+vcluster than the same operation in single-tier K8s. If your test's cleanup
+wait is too short, you will misclassify slow-but-converging cleanup as a
+finalizer deadlock or a KB controller bug.
+
+## Symptom
+
+After `kubectl delete cluster <name>`:
+
+- Cluster object stays in `Deleting` with finalizer
+  `cluster.kubeblocks.io/finalizer`.
+- Component condition `wait for the workloads to be deleted` keeps looping
+  for more than 3 min.
+- Pods in phase `Failed` (kubectl shows `Error`) with `deletionTimestamp`
+  set and no finalizers, but not yet garbage-collected.
+- PVCs `Terminating` with `kubernetes.io/pvc-protection` finalizer.
+- InstanceSet `delete OK` in controller log but the object is still
+  visible from `kubectl get`.
+
+In single-tier k3d these typically clear within 2-3 min. In vcluster the
+same chain commonly takes 10-15 min, occasionally up to 25 min.
+
+## Mechanism (engine-neutral)
+
+```
+T0 : kubectl delete cluster X → Cluster has deletionTimestamp + cluster.kubeblocks.io/finalizer
+T0+: KB cluster-controller sees Cluster Deleting → cascades delete to Component → Component cascades to InstanceSet
+T0+: InstanceSet's owner-deletion handler issues delete for owned Pods, PVCs, headless Service
+T0+: vcluster syncer sees the vcluster-side Pod deletionTimestamp → propagates delete to host Pod
+T0+30s..3min: host kubelet runs preStop / SIGTERM / SIGKILL; host Pod enters Terminating
+T0+: host container runtime stops containers (engine flush, fsync); pod object GC
+T0+: vcluster syncer sees host Pod gone → propagates back to vcluster, removes vcluster Pod object
+T0+: vcluster InstanceSet's controller observes Pod gone → InstanceSet finalizer cleared → Component finalizer cleared → Cluster finalizer cleared
+T0+10..15min (typical): Cluster object actually disappears
+T0+25..30min (worst case): same but with engine taking longer to flush / fsync, or syncer GC running behind
+```
+
+Each vcluster ↔ host step in this chain is an async syncer hop with seconds-to-minutes
+of lag. The chain is long; the multiplier vs single-tier k3d is 4-8x.
+
+## Recommended cleanup-wait baseline
+
+| Test step | Single-tier k3d | vcluster |
+|---|---|---|
+| Per-run `cleanup wait` after `K delete cluster` | 180s | **≥1500s (25 min)** |
+| Cross-test namespace clean check | 300s | **≥1800s (30 min)** |
+| Soak teardown after EXIT trap | 600s | **≥1800s** |
+
+Concrete recommended snippet:
+
+```bash
+# After cleanup wait, check residual pods managed by KubeBlocks
+local cd=$((SECONDS + 1500))   # 25 min
+while [ "$SECONDS" -lt $cd ]; do
+  local pods
+  pods=$(K get pod -n "$NS" -l "app.kubernetes.io/instance=$cluster" --no-headers 2>/dev/null | wc -l | tr -d ' ')
+  [ "$pods" = "0" ] && break
+  sleep 15
+done
+echo "cleanup wait done. residual pods=$pods"
+```
+
+If `pods > 0` after 25 min, route to DevOps with the residual object names
+and time window (per `addon-vcluster-bounded-convergence-window-guide.md`
+escalation path). Do NOT `--force --grace-period=0` and do NOT patch the
+finalizer — that loses the diagnostic evidence and may corrupt cluster
+state.
+
+## When NOT to assume cleanup is stuck
+
+In vcluster, apparently stuck cleanup usually converges on its own. Before escalating:
+
+1. Run the cleanup wait for at least the recommended baseline.
+2. Check `kubectl get events` for `delete ... successful` lines from
+   InstanceSet; this shows the controller side has fired.
+3. Check `kubectl logs deploy/kb-kubeblocks -n kb-system` for `wait for the
+   workloads to be deleted` looping repeatedly with no error; this indicates
+   KB is waiting on the InstanceSet, not deadlocked.
+4. If all of the above look normal, **wait longer**: the Cluster is expected to converge.
+
+## When the cleanup IS stuck
+
+Stuck signals (all must hold more than 25 min after the delete):
+
+- Cluster.deletionTimestamp + finalizer present
+- AND no progress from `kubectl get pod -n <ns>` for >10 min straight
+- AND the host-side Pod (resolvable via the `<pod>-x-<vc-ns>-x-<vc-name>`
+  mapping) is also stuck (ask DevOps for a read-only check)
+- AND KB controller log shows no recent reconciliation activity for that
+  cluster
+
+In that case use the escalation packet:
+
+```text
+Environment blocker (vcluster cleanup stuck):
+- target: vcluster <name>, ns <ns>, cluster <cluster>
+- cluster delete time: <ts>
+- residual: pod-X phase Failed, PVC data-pod-X Terminating
+- finalizer present: cluster.kubeblocks.io/finalizer
+- controller log: "wait for the workloads to be deleted" looping >25 min
+- ruled out: delete not propagated (InstanceSet DELETE OK in logs)
+- exact action: read-only check of host-side pod state + node kubelet + container runtime; if stuck, restart kubelet on the affected node (host-side) without force-deleting vcluster objects
+- work continuing: other lanes / other vcluster
+```
+
+## Source observations
+
+- 2026-05-15, Henry's batch 4: a self-induced kill of an in-flight cluster
+  (~25 min observed) was initially misclassified as a finalizer deadlock.
+  Mason's read-only investigation
+  (`mysql-cleanup-residual-chaos-b4-73664-mason-readonly-20260515T114316Z.tar.gz`
+  sha `1e9d7fec0b904c1a797ee3684c4f179cc13d1f5f2c966b20ed3065c9932cb58b`)
+  showed the delete event DID propagate through syncer at T0+0,
+  host kubelet entered Killing path at T0+0, host Pod actually
+  GC'd at T0+4..13 min, vcluster mapping cleared at T0+13 min. No
+  deadlock, just slow convergence.
+- Henry's batch 4 / batch 5 / batch 6 / soak scripts updated to use 1500s
+  cleanup wait baseline after this finding.
+
+## Related skills / docs
+
+- `skills/soak-test-classification/SKILL.md`: classify long-run findings;
+  cleanup that looks stuck but eventually converges is
+  `external-environmental-cascade`, not an invariant break
+- `docs/addon-vcluster-bounded-convergence-window-guide.md`: the chaos
+  pod-delete side of the same vcluster syncer multiplier story
+- `docs/addon-terminating-archive-before-force-finalizer-guide.md`: never
+  force-finalize before collecting evidence

From 00b680da58305351cf2d0287a556744c3d264781 Mon Sep 17 00:00:00 2001
From: Wei Cao <cyg.cao@gmail.com>
Date: Sat, 16 May 2026 04:02:48 +0800
Subject: [PATCH 2/3] docs(vcluster-cluster-delete-convergence): preserve
 kubectl rc in cleanup snippet

Per William's review on PR #149: the previous cleanup snippet used
`K get pod ... 2>/dev/null | wc -l`, which silently converts kubectl
API timeout / RBAC denial / NotFound into "0 pods" and misclassifies
env failure as a clean cleanup.
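
The trap can be reproduced without a cluster. This is an illustrative
sketch only, assuming bash without `pipefail`. `fake_get_pod` is a
hypothetical stub standing in for `K get pod`; it fails the way an API
timeout would.

```bash
# Hypothetical stand-in for `K get pod` that fails the way an API timeout
# would: a message on stderr and a non-zero exit status.
fake_get_pod() {
  echo "simulated API timeout" >&2
  return 1
}

# Old pattern: stderr is discarded and, without pipefail, the pipeline's
# status is that of its last command (tr), so the failure reads as success.
old_count=$(fake_get_pod 2>/dev/null | wc -l | tr -d ' ')
old_rc=$?
echo "old pattern: count=$old_count rc=$old_rc"

# Reworked pattern: run the command on its own and branch on its status.
if out=$(fake_get_pod 2>/dev/null); then new_rc=0; else new_rc=$?; fi
echo "reworked pattern: rc=$new_rc"
```

The old pattern reports `count=0 rc=0`, which looks the same as a
namespace that really is empty. The reworked pattern surfaces the stub's
rc of 1.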

Reworked to:
- Capture kubectl rc separately
- Treat rc != 0 as "wait + retry + record" (not "clean")
- At the end of the window, if rc still != 0, return env-class signal
- Cross-reference addon-kubectl-pipeline-evidence-integrity-guide.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 ...luster-cluster-delete-convergence-guide.md | 32 ++++++++++++++++---
 1 file changed, 27 insertions(+), 5 deletions(-)

diff --git a/docs/addon-vcluster-cluster-delete-convergence-guide.md b/docs/addon-vcluster-cluster-delete-convergence-guide.md
index b26af02..1749aa7 100644
--- a/docs/addon-vcluster-cluster-delete-convergence-guide.md
+++ b/docs/addon-vcluster-cluster-delete-convergence-guide.md
@@ -60,18 +60,40 @@ of lag. The chain is long; the multiplier vs single-tier k3d is 4-8x.
 Concrete recommended snippet:
 
 ```bash
-# After cleanup wait, check residual pods managed by KubeBlocks
+# After cleanup wait, check residual pods managed by KubeBlocks.
+# CRITICAL: preserve kubectl rc. `2>/dev/null | wc -l` would silently
+# convert an API timeout / RBAC denial / NotFound into "0 pods" and
+# misclassify env failure as a clean cleanup.
 local cd=$((SECONDS + 1500))   # 25 min
+local pods="" pods_rc=0
 while [ "$SECONDS" -lt $cd ]; do
-  local pods
-  pods=$(K get pod -n "$NS" -l "app.kubernetes.io/instance=$cluster" --no-headers 2>/dev/null | wc -l | tr -d ' ')
+  local stdout stderr
+  stdout=$(K get pod -n "$NS" -l "app.kubernetes.io/instance=$cluster" --no-headers 2>/tmp/cleanup.err)
+  pods_rc=$?
+  if [ "$pods_rc" -ne 0 ]; then
+    # API error: do NOT treat as "clean". Wait and retry; record for evidence.
+    echo "cleanup wait: kubectl rc=$pods_rc stderr=$(head -c 200 /tmp/cleanup.err)"
+    sleep 15
+    continue
+  fi
+  pods=$(echo "$stdout" | grep -c . || true)
   [ "$pods" = "0" ] && break
   sleep 15
 done
-echo "cleanup wait done. residual pods=$pods"
+if [ "$pods_rc" -ne 0 ]; then
+  echo "cleanup wait: API not healthy at the end (rc=$pods_rc) — route as env"
+  return 2
+fi
+echo "cleanup wait done. residual pods=$pods (rc=0 verified)"
 ```
 
-If `pods > 0` after 25 min, route to DevOps with the residual object names
+Three-track verdict (rc + stderr + observed count) avoids the silent-fallback
+trap. If `pods_rc != 0` at the end of the window, treat it as an environment failure, not
+"clean". See `addon-kubectl-pipeline-evidence-integrity-guide.md` for the
+general principle.
+
+If `pods > 0` (with rc=0) after 25 min, route to DevOps with the residual
+object names
 and time window (per `addon-vcluster-bounded-convergence-window-guide.md`
 escalation path). Do NOT `--force --grace-period=0` and do NOT patch the
 finalizer — that loses the diagnostic evidence and may corrupt cluster

From c7cc7003ca16f4e089ae11960df9d64dca01fac3 Mon Sep 17 00:00:00 2001
From: Wei Cao <cyg.cao@gmail.com>
Date: Sat, 16 May 2026 04:06:45 +0800
Subject: [PATCH 3/3] docs(vcluster-cluster-delete-convergence): fix set -e +
 per-iter stderr

Per William's second-pass review on PR #149:

1. `stdout=$(K get ...)` under `set -e` exits before `pods_rc=$?` runs.
   Rewrote with `if cmd; then rc=0; else rc=$?; fi` so the rc survives
   in the failing branch under errexit.

2. The fixed `/tmp/cleanup.err` path was overwritten on every retry, losing
   evidence for non-final failures. A per-iteration stderr file under
   `$EVD/cleanup-wait/get-pod-NNNN.err` keeps the full timeline.
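
Both points can be exercised without a cluster. This is a minimal sketch,
assuming bash, and not the guide's snippet itself:

- `fake_get_pod` is a hypothetical stub for `K get pod` that fails on its
  first two attempts.
- A `mktemp -d` directory stands in for `$EVD`.

```bash
# Hypothetical stand-in for `K get pod`: fails on attempts 1 and 2 the
# way an API timeout would, then succeeds with empty output.
fake_get_pod() {
  if [ "$1" -lt 3 ]; then
    echo "attempt $1: simulated API timeout" >&2
    return 1
  fi
}

# Item 1, naive form: with errexit on inside the subshell, the failing
# assignment aborts it before `rc=$?` can run, so nothing is printed.
naive=$( set -e; out=$(fake_get_pod 1 2>/dev/null); rc=$?; echo "rc=$rc" )
naive_status=$?
echo "naive form: output='$naive' status=$naive_status"

set -e
# Items 1 + 2: the if/else form keeps the rc under errexit, and each
# attempt writes stderr to its own numbered file instead of one shared path.
evd_dir="$(mktemp -d)/cleanup-wait"   # stands in for $EVD/cleanup-wait
mkdir -p "$evd_dir"
iter=0
rc=1
while [ "$iter" -lt 5 ]; do
  iter=$((iter + 1))
  stderr_file="$evd_dir/get-pod-$(printf '%04d' "$iter").err"
  if out=$(fake_get_pod "$iter" 2>"$stderr_file"); then rc=0; else rc=$?; fi
  if [ "$rc" -eq 0 ]; then break; fi
done
echo "stopped at iter=$iter rc=$rc"
ls "$evd_dir"
```

The naive form prints nothing and exits non-zero. The guarded loop
survives two failures under `set -e`, stops at the third attempt, and
leaves one stderr file per attempt.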

Also switched `echo` of captured stdout to `printf '%s'` to avoid
backslash interpretation surprises.
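
The counting step can be checked in isolation. `count_lines` is a
hypothetical helper that mirrors the snippet's
`printf '%s' "$stdout" | grep -c . || true` pipeline; it is not a
function defined in the guide.

```bash
set -e

count_lines() {
  # printf '%s' prints the captured text verbatim (no escape processing).
  # grep -c . counts non-empty lines; it prints 0 and exits 1 when there
  # are none, and `|| true` keeps that exit status from tripping set -e.
  printf '%s' "$1" | grep -c . || true
}

empty=$(count_lines "")
two=$(count_lines "$(printf 'pod-a 1/1 Running\npod-b 0/1 Error\n')")
echo "empty=$empty two=$two"
```

Note that grep still prints `0` before exiting 1. The `|| true` only
replaces the exit status, so an empty namespace yields `pods=0` rather
than an empty string.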

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 ...luster-cluster-delete-convergence-guide.md | 30 ++++++++++++++-----
 1 file changed, 22 insertions(+), 8 deletions(-)

diff --git a/docs/addon-vcluster-cluster-delete-convergence-guide.md b/docs/addon-vcluster-cluster-delete-convergence-guide.md
index 1749aa7..1693a31 100644
--- a/docs/addon-vcluster-cluster-delete-convergence-guide.md
+++ b/docs/addon-vcluster-cluster-delete-convergence-guide.md
@@ -63,20 +63,34 @@ Concrete recommended snippet:
 # After cleanup wait, check residual pods managed by KubeBlocks.
 # CRITICAL: preserve kubectl rc. `2>/dev/null | wc -l` would silently
 # convert an API timeout / RBAC denial / NotFound into "0 pods" and
-# misclassify env failure as a clean cleanup.
+# misclassify env failure as a clean cleanup. Also: under `set -e`,
+# `stdout=$(K ...)` would exit before `pods_rc=$?` runs, so use the
+# `if cmd; then ok; else rc=$?; fi` form so the failing branch keeps
+# the rc. Use a per-iteration stderr file to avoid clobbering on retry.
 local cd=$((SECONDS + 1500))   # 25 min
+local iter=0
 local pods="" pods_rc=0
+local cleanup_evd_dir="${EVD:-/tmp}/cleanup-wait"
+mkdir -p "$cleanup_evd_dir"
 while [ "$SECONDS" -lt $cd ]; do
-  local stdout stderr
-  stdout=$(K get pod -n "$NS" -l "app.kubernetes.io/instance=$cluster" --no-headers 2>/tmp/cleanup.err)
-  pods_rc=$?
+  iter=$((iter + 1))
+  local stderr_file="$cleanup_evd_dir/get-pod-$(printf '%04d' "$iter").err"
+  local stdout
+  if stdout=$(K get pod -n "$NS" \
+        -l "app.kubernetes.io/instance=$cluster" --no-headers \
+        2>"$stderr_file"); then
+    pods_rc=0
+  else
+    pods_rc=$?
+  fi
   if [ "$pods_rc" -ne 0 ]; then
-    # API error: do NOT treat as "clean". Wait and retry; record for evidence.
-    echo "cleanup wait: kubectl rc=$pods_rc stderr=$(head -c 200 /tmp/cleanup.err)"
+    # API error: do NOT treat as "clean". Wait and retry; per-iter file
+    # is kept for evidence (no clobbering across retries).
+    echo "cleanup wait iter=$iter: kubectl rc=$pods_rc stderr=$(head -c 200 "$stderr_file")"
     sleep 15
     continue
   fi
-  pods=$(echo "$stdout" | grep -c . || true)
+  pods=$(printf '%s' "$stdout" | grep -c . || true)
   [ "$pods" = "0" ] && break
   sleep 15
 done
@@ -84,7 +98,7 @@ if [ "$pods_rc" -ne 0 ]; then
   echo "cleanup wait: API not healthy at the end (rc=$pods_rc) — route as env"
   return 2
 fi
-echo "cleanup wait done. residual pods=$pods (rc=0 verified)"
+echo "cleanup wait done. residual pods=$pods (rc=0 verified, iters=$iter)"
 ```
 
 Three-track verdict (rc + stderr + observed count) avoids the silent-fallback