175 changes: 175 additions & 0 deletions docs/addon-vcluster-cluster-delete-convergence-guide.md
@@ -0,0 +1,175 @@
# vcluster KB Cluster Delete Convergence

> **Audience**: addon test authors writing per-run cleanup logic that deletes a KB Cluster CR and waits for the namespace to be empty
> **Status**: stable
> **Applies to**: any KB addon test running inside vcluster (Loft / Sealos / hand-rolled) backed by a shared host K8s
> **Companion docs**:
> - [`addon-vcluster-bounded-convergence-window-guide.md`](addon-vcluster-bounded-convergence-window-guide.md) — chaos pod-delete + replacement window; this guide is the parallel for CR delete + finalizer chain
> - [`addon-terminating-archive-before-force-finalizer-guide.md`](addon-terminating-archive-before-force-finalizer-guide.md) — never force / patch finalizer before evidence

When a test deletes a KB `Cluster` CR (`kubectl delete cluster <name>` or
`terminationPolicy: Delete`), inside vcluster the actual host-side teardown
of pods + PVCs + InstanceSet + namespace mappings is much slower than the
same operation in single-tier K8s. If your test's `cleanup wait` is too
short, you will misclassify slow-but-converging cleanup as a finalizer
deadlock or a KB controller bug.

## Symptom

After `kubectl delete cluster <name>`:

- The Cluster object stays in `Deleting` with finalizer
  `cluster.kubeblocks.io/finalizer`.
- The Component condition `wait for the workloads to be deleted` loops for
  more than 3 min.
- Pods sit in phase `Failed/Error` with `deletionTimestamp` set and no
  finalizer, but are not yet garbage-collected.
- PVCs stay `Terminating` with the `kubernetes.io/pvc-protection` finalizer.
- The controller log reports `delete OK` for the InstanceSet, but the object
  is still visible from `kubectl get`.

In single-tier k3d these typically clear within 2-3 min. In vcluster the
same chain commonly takes 10-15 min, occasionally up to 25 min.

## Mechanism (engine-neutral)

```
T0 : kubectl delete cluster X → Cluster has deletionTimestamp + cluster.kubeblocks.io/finalizer
T0+: KB cluster-controller sees Cluster Deleting → cascades delete to Component → Component cascades to InstanceSet
T0+: InstanceSet's owner-deletion handler issues delete for owned Pods, PVCs, headless Service
T0+: vcluster syncer sees the vcluster-side Pod deletionTimestamp → propagates delete to host Pod
T0+30s..3min: host kubelet runs preStop / SIGTERM / SIGKILL; host Pod enters Terminating
T0+: host container runtime stops containers (engine flush, fsync); pod object GC
T0+: vcluster syncer sees host Pod gone → propagates back to vcluster, removes vcluster Pod object
T0+: vcluster InstanceSet's controller observes Pod gone → InstanceSet finalizer cleared → Component finalizer cleared → Cluster finalizer cleared
T0+10..15min (typical): Cluster object actually disappears
T0+25..30min (worst case): same but with engine taking longer to flush / fsync, or syncer GC running behind
```

Every "→" that crosses the vcluster/host boundary is an asynchronous syncer
hop with seconds to minutes of lag. Because the chain contains several such
hops, the end-to-end time is roughly 4-8x that of single-tier k3d.

## Recommended cleanup-wait baseline

| Test step | Single-tier k3d | vcluster |
|---|---|---|
| Per-run `cleanup wait` after `K delete cluster` | 180s | **≥1500s (25 min)** |
| Cross-test namespace clean check | 300s | **≥1800s (30 min)** |
| Soak teardown after EXIT trap | 600s | **≥1800s** |
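
The baselines in the table can be centralized so scripts do not hard-code a k3d-sized timeout. The following is a minimal sketch; `cleanup_wait_seconds` and the step and environment labels (`per-run`, `cross-test`, `soak`, `k3d`, `vcluster`) are hypothetical names introduced here, not part of any existing harness. For vcluster it returns the lower bound from the table, so callers may add headroom.

```bash
# Hypothetical helper: map (step, environment) to the baseline from the table.
# The labels are illustrative; adapt them to your own harness.
cleanup_wait_seconds() {
  local step="$1" env="$2"
  case "$env:$step" in
    k3d:per-run)         echo 180 ;;
    k3d:cross-test)      echo 300 ;;
    k3d:soak)            echo 600 ;;
    vcluster:per-run)    echo 1500 ;;   # 25 min, lower bound
    vcluster:cross-test) echo 1800 ;;   # 30 min, lower bound
    vcluster:soak)       echo 1800 ;;   # 30 min, lower bound
    *)
      echo "unknown step/env: $step/$env" >&2
      return 1
      ;;
  esac
}
```

A caller could then compute its deadline as `cd=$((SECONDS + $(cleanup_wait_seconds per-run vcluster)))`.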

Concrete recommended snippet:

```bash
# After cleanup wait, check residual pods managed by KubeBlocks.
# CRITICAL: preserve kubectl rc. `2>/dev/null | wc -l` would silently
# convert an API timeout / RBAC denial / NotFound into "0 pods" and
# misclassify env failure as a clean cleanup. Also: under `set -e`,
# `stdout=$(K ...)` would exit before `pods_rc=$?` runs, so use the
# `if cmd; then ok; else rc=$?; fi` form so the failing branch keeps
# the rc. Use a per-iteration stderr file to avoid clobbering on retry.
# Runs inside a function: `local` and `return` need function scope.
local cd=$((SECONDS + 1500)) # 25 min
local iter=0
local pods="" pods_rc=0
local cleanup_evd_dir="${EVD:-/tmp}/cleanup-wait"
mkdir -p "$cleanup_evd_dir"
while [ "$SECONDS" -lt "$cd" ]; do
  iter=$((iter + 1))
  local stderr_file="$cleanup_evd_dir/get-pod-$(printf '%04d' "$iter").err"
  local stdout
  if stdout=$(K get pod -n "$NS" \
      -l "app.kubernetes.io/instance=$cluster" --no-headers \
      2>"$stderr_file"); then
    pods_rc=0
  else
    pods_rc=$?
  fi
  if [ "$pods_rc" -ne 0 ]; then
    # API error: do NOT treat as "clean". Wait and retry; per-iter file
    # is kept for evidence (no clobbering across retries).
    echo "cleanup wait iter=$iter: kubectl rc=$pods_rc stderr=$(head -c 200 "$stderr_file")"
    sleep 15
    continue
  fi
  # Count non-empty lines; `|| true` keeps grep's rc=1 (zero matches)
  # from tripping `set -e`.
  pods=$(printf '%s' "$stdout" | grep -c . || true)
  [ "$pods" = "0" ] && break
  sleep 15
done
if [ "$pods_rc" -ne 0 ]; then
  echo "cleanup wait: API not healthy at the end (rc=$pods_rc): route as env"
  return 2
fi
echo "cleanup wait done. residual pods=$pods (rc=0 verified, iters=$iter)"
```

Three-track verdict (rc + stderr + observed count) avoids the silent-fallback
trap. If `pods_rc != 0` at the end of the window, treat as environment, not
"clean". See `addon-kubectl-pipeline-evidence-integrity-guide.md` for the
general principle.

If `pods > 0` (with rc=0) after 25 min, route to DevOps with the residual
object names and time window (per the
`addon-vcluster-bounded-convergence-window-guide.md` escalation path). Do NOT
`--force --grace-period=0` and do NOT patch the finalizer: either one destroys
the diagnostic evidence and may corrupt cluster state.
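
To hand DevOps the residual object names and the time window, something like the sketch below could collect them read-only. `residual_report` is a hypothetical helper. It assumes `K` is the kubectl wrapper used in the snippet above (falling back to plain `kubectl` if it is not defined), and that PVCs carry the same `app.kubernetes.io/instance` label as the pods, which may not hold for every addon.

```bash
# Assumption: K is the harness's kubectl wrapper; fall back to kubectl.
declare -F K >/dev/null || K() { kubectl "$@"; }

# Hypothetical helper: print residual pod/PVC names plus a UTC timestamp.
# Read-only: it only runs `get`, never delete or patch.
residual_report() {
  local ns="$1" cluster="$2" names
  # `if ! var=$(...)` keeps a failing kubectl from exiting under `set -e`
  # and from being mistaken for "no residuals". stderr is left untouched.
  if ! names=$(K get pod,pvc -n "$ns" \
      -l "app.kubernetes.io/instance=$cluster" -o name); then
    echo "residual_report: kubectl failed; route as env" >&2
    return 2
  fi
  printf 'window_end: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  printf 'namespace: %s\ncluster: %s\n' "$ns" "$cluster"
  if [ -z "$names" ]; then
    echo "residual: none"
  else
    printf '%s\n' "$names" | sed 's/^/residual: /'
  fi
}
```

The return code 2 on an API failure mirrors the "route as env" convention of the cleanup wait, so an unhealthy API is never reported as an empty residual list.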

## When NOT to assume cleanup is stuck

In vcluster, a cleanup that looks stuck usually converges on its own. Before escalating:

1. Run the cleanup wait for at least the recommended baseline.
2. Check `kubectl get events` for `delete ... successful` lines from
InstanceSet — proves the controller side has fired.
3. Check `kubectl logs deploy/kb-kubeblocks -n kb-system` for `wait for the
workloads to be deleted` looping repeatedly with no error — proves KB is
waiting on InstanceSet, not deadlocked.
4. If all of the above are normal, **wait longer**. Cluster will converge.
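
Check 3 can be partly automated by classifying a captured slice of the controller log. The sketch below is a rough heuristic under stated assumptions: `classify_kb_wait_log` is a hypothetical helper, the case-insensitive match on `error` is a crude stand-in for "error-level line" (the real log format may need a stricter pattern), and "repeats" is taken to mean at least two occurrences of the wait message.

```bash
# Hypothetical helper: reads controller log text on stdin and prints
#   errors     if any line mentions "error" (crude, case-insensitive)
#   waiting    if the wait message appears at least twice and no errors
#   no-signal  otherwise
classify_kb_wait_log() {
  local log waits errors
  log=$(cat)
  # grep -c prints the count even when it is 0 (rc=1), hence `|| true`.
  waits=$(printf '%s\n' "$log" \
    | grep -c 'wait for the workloads to be deleted' || true)
  errors=$(printf '%s\n' "$log" | grep -ci 'error' || true)
  if [ "$errors" -gt 0 ]; then
    echo "errors"
  elif [ "$waits" -ge 2 ]; then
    echo "waiting"
  else
    echo "no-signal"
  fi
}
```

One possible use is `kubectl logs deploy/kb-kubeblocks -n kb-system --since=10m | classify_kb_wait_log`. A result of `waiting` supports "wait longer"; `errors` or `no-signal` means a human should read the log before drawing any conclusion.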

## When the cleanup IS stuck

Stuck signals (must hold for >25 min after delete):

- Cluster.deletionTimestamp + finalizer present
- AND no progress from `kubectl get pod -n <ns>` for >10 min straight
- AND host-side Pod (resolvable via `<pod>-x-<vc-ns>-x-<vc-name>` mapping)
is also stuck (request DevOps to read-only check)
- AND KB controller log shows no recent reconciliation activity for that
cluster
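
When asking DevOps for the read-only host-side check, it helps to include the expected host Pod name. The sketch below simply composes the `<pod>-x-<vc-ns>-x-<vc-name>` pattern given above. `host_pod_name` is a hypothetical helper, and the 63-character guard is an assumption: vcluster may rewrite names that would exceed the Kubernetes label-length limit, so the sketch refuses to guess in that case.

```bash
# Hypothetical helper: compose the expected host-side Pod name from the
# <pod>-x-<vc-ns>-x-<vc-name> pattern. It does not query any cluster.
host_pod_name() {
  local pod="$1" vc_ns="$2" vc_name="$3"
  local name="${pod}-x-${vc_ns}-x-${vc_name}"
  # Assumption: over-long names may be rewritten by vcluster, so do not
  # print a name that could be wrong.
  if [ "${#name}" -gt 63 ]; then
    echo "host_pod_name: result exceeds 63 chars; ask DevOps for the real name" >&2
    return 1
  fi
  printf '%s\n' "$name"
}
```

For example, `host_pod_name mysql-0 demo vc1` prints `mysql-0-x-demo-x-vc1`.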

In that case use the escalation packet:

```text
Environment blocker (vcluster cleanup stuck):
- target: vcluster <name>, ns <ns>, cluster <cluster>
- cluster delete time: <ts>
- residual: pod-X phase Failed, PVC data-pod-X Terminating
- finalizer present: cluster.kubeblocks.io/finalizer
- controller log: "wait for the workloads to be deleted" looping >25 min
- ruled out: cluster delete event propagated (instanceset DELETE OK in logs)
- exact action: read-only check host-side pod state + node kubelet + container runtime; if stuck, restart kubelet on affected node (host-side) without force-deleting vcluster objects
- work continuing: other lanes / other vcluster
```

## Source observations

- 2026-05-15 Henry batch 4 self-induced kill of in-flight cluster (~25 min
observed) — initially misclassified as finalizer deadlock; Mason
read-only investigation
(`mysql-cleanup-residual-chaos-b4-73664-mason-readonly-20260515T114316Z.tar.gz`
sha `1e9d7fec0b904c1a797ee3684c4f179cc13d1f5f2c966b20ed3065c9932cb58b`)
showed the delete event DID propagate through syncer at T0+0,
host kubelet entered Killing path at T0+0, host Pod actually
GC'd at T0+4..13 min, vcluster mapping cleared at +13 min. No
deadlock, just slow convergence.
- Henry's batch 4 / batch 5 / batch 6 / soak scripts updated to use 1500s
cleanup wait baseline after this finding.

## Related skills / docs

- `skills/soak-test-classification/SKILL.md` — classify long-run findings;
cleanup-stuck-but-eventually-converged is `external-environmental-cascade`,
not invariant break
- `docs/addon-vcluster-bounded-convergence-window-guide.md` — chaos pod-delete
side of the same vcluster syncer multiplier story
- `docs/addon-terminating-archive-before-force-finalizer-guide.md` — never
force-finalize before evidence