docs: add vcluster KB Cluster delete convergence guide #149
vcluster syncer adds a multi-hop async chain between vcluster apiserver, host kubelet, container runtime, and back. When a KB Cluster CR is deleted, the full teardown (Cluster finalizer → Component → InstanceSet → Pods → PVCs → vcluster mapping GC) typically takes 10-15 min in vcluster vs 2-3 min in single-tier k3d, with 25 min worst case.

This guide:
- Documents the chain mechanism engine-neutral.
- Gives recommended cleanup-wait baselines (1500s per-run, 1800s cross-test) backed by Mason's host-side read-only investigation on 2026-05-15.
- Lists signals for "still converging" vs "actually stuck".
- Provides the escalation packet shape for actual stuck cases.
- Warns against force-delete / patch-finalizer before evidence.

Companion to the existing vcluster bounded convergence window guide which covers the parallel chaos pod-delete + replacement window.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
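As a rough illustration of how the cleanup-wait baseline could be applied, here is a minimal polling sketch. The names `CLEANUP_WAIT_S`, `POLL_S`, and `wait_until` are hypothetical and are not taken from the guide; only the 1500s per-run figure comes from the description above.

```shell
#!/usr/bin/env bash
# Sketch only: CLEANUP_WAIT_S, POLL_S and wait_until are illustrative names.
# 1500 is the per-run baseline quoted in the PR description.
CLEANUP_WAIT_S="${CLEANUP_WAIT_S:-1500}"
POLL_S="${POLL_S:-10}"

# wait_until CMD [ARGS...]
# Runs CMD until it succeeds or the window closes.
# Returns 0 on convergence, 1 when the window expires.
wait_until() {
  start=$(date +%s)
  deadline=$(( start + CLEANUP_WAIT_S ))
  while true; do
    # The probe runs inside `if`, so a non-zero rc is safe under `set -e`.
    if "$@"; then
      return 0
    fi
    now=$(date +%s)
    if [ "$now" -ge "$deadline" ]; then
      return 1
    fi
    sleep "$POLL_S"
  done
}
```

A caller would pass its own probe, for example a function that returns 0 only when the list call succeeded and reported no residual pods. An expired window is a reason to collect evidence, not by itself proof of a finalizer deadlock.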
weicao
left a comment
Blocking doc issue (cannot use GitHub "request changes" because this PR is under the same GitHub account): the recommended cleanup snippet can falsely declare cleanup complete when the API call fails. At docs/addon-vcluster-cluster-delete-convergence-guide.md:67, stderr is suppressed and the output is piped into wc -l; if K get pod ... times out or returns an auth/API error, the pipeline still produces 0, then line 68 breaks as if there are no residual pods. For a guide about vcluster convergence this is risky because API slowness is one of the exact cases we see in practice.
Please change the snippet to preserve the kubectl rc separately and only treat pods=0 as clean when the list command succeeded. On failure, print the rc/error and continue waiting or classify as API/environment evidence, not clean cleanup.
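To make the failure mode concrete, the sketch below reproduces it with a stub and then shows one way to classify a poll result. The `K` function here is a hypothetical stand-in that simulates an API timeout (it is not the guide's real wrapper), and `classify_poll` is an illustrative helper name.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the K wrapper: simulates an API timeout.
K() { echo "Unable to connect to the server: i/o timeout" >&2; return 1; }

# Failure mode from the review: stderr is discarded and the rc of K is lost
# behind the pipe, so a failed list is indistinguishable from "0 pods".
masked_count=$(K get pod --no-headers 2>/dev/null | wc -l | tr -d '[:space:]')
echo "masked count: $masked_count"

# classify_poll RC COUNT
# Only a successful list with zero pods counts as clean; any list failure is
# API/environment evidence, never a clean cleanup.
classify_poll() {
  if [ "$1" -ne 0 ]; then
    echo "api-error"
  elif [ "$2" -eq 0 ]; then
    echo "clean"
  else
    echo "residual"
  fi
}

classify_poll 1 "$masked_count"
```

Feeding the real rc into the classification is what prevents the masked zero from ending the wait loop early.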
…anup snippet

Per William review on PR #149: the previous cleanup snippet used `K get pod ... 2>/dev/null | wc -l`, which silently converts kubectl API timeout / RBAC denial / NotFound into "0 pods" and misclassifies env failure as a clean cleanup.

Reworked to:
- Capture kubectl rc separately
- Treat rc != 0 as "wait + retry + record" (not "clean")
- At the end of the window, if rc still != 0, return env-class signal
- Cross-reference addon-kubectl-pipeline-evidence-integrity-guide.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
weicao
left a comment
Second pass: this is closer, but the snippet still has one bash correctness issue. If the caller has set -e enabled, stdout=$(K get pod ... 2>/tmp/cleanup.err) will exit the script immediately on a non-zero kubectl rc before pods_rc=$? runs. Many addon harness scripts run with set -euo pipefail, so this example needs to be safe under that mode.
Please rewrite the command capture as an if assignment, e.g. if stdout=$(K get ... 2>"$errfile"); then pods_rc=0; else pods_rc=$?; fi, then handle pods_rc != 0. Also avoid fixed /tmp/cleanup.err if possible; use a per-run temp path so parallel cleanup loops do not stomp each other's stderr.
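A minimal sketch of the requested capture pattern, written to run under `set -euo pipefail`. The `K` function is again a hypothetical stub that always fails, and the `mktemp` template name is an assumption chosen for the demo rather than a path from the guide.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for the K wrapper: simulates a failing API call.
K() { echo "error: request timed out" >&2; return 1; }

# Per-run temp file so parallel cleanup loops do not share one stderr path.
errfile=$(mktemp "${TMPDIR:-/tmp}/cleanup-get-pod.XXXXXX")

# The assignment is the `if` condition, so errexit does not fire on failure
# and $? in the else branch is still the rc of the command substitution.
if stdout=$(K get pod --no-headers 2>"$errfile"); then
  pods_rc=0
else
  pods_rc=$?
fi

pods=0
if [ "$pods_rc" -ne 0 ]; then
  printf 'list failed: rc=%s err=%s\n' "$pods_rc" "$(cat "$errfile")" >&2
elif [ -n "$stdout" ]; then
  # Count lines only when the list succeeded and returned something.
  pods=$(printf '%s\n' "$stdout" | wc -l | tr -d '[:space:]')
fi
```

With the stub above the script keeps running past the failed call, `pods_rc` is 1, and the error text is preserved in `$errfile` for later classification.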
Per William second-pass review on PR #149:

1. `stdout=$(K get ...)` under `set -e` exits before `pods_rc=$?` runs. Rewrote with `if cmd; then rc=0; else rc=$?; fi` so the rc survives in the failing branch under errexit.
2. Fixed `/tmp/cleanup.err` was overwritten every retry, losing evidence for non-final failures. Now per-iteration stderr file under `$EVD/cleanup-wait/get-pod-NNNN.err` keeps the full timeline.

Also switched `echo` of captured stdout to `printf '%s'` to avoid backslash interpretation surprises.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
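The per-iteration evidence layout can be sketched as follows. This is an illustration under assumptions: `EVD` is set to a throwaway `mktemp -d` directory for the demo, the `K` stub fails on the first two attempts, and the loop is fixed at three iterations instead of a real wait window.

```shell
#!/usr/bin/env bash
set -euo pipefail

EVD=$(mktemp -d)   # hypothetical evidence root, demo only
mkdir -p "$EVD/cleanup-wait"

attempt=0
# Hypothetical stand-in for K: fails on attempts 1 and 2, then lists no pods.
K() {
  if [ "$attempt" -le 2 ]; then
    echo "attempt $attempt: connection refused" >&2
    return 1
  fi
}

for attempt in 1 2 3; do
  # One stderr file per iteration (get-pod-0001.err, get-pod-0002.err, ...)
  # so earlier failures are not overwritten by later retries.
  errfile=$(printf '%s/cleanup-wait/get-pod-%04d.err' "$EVD" "$attempt")
  if stdout=$(K get pod --no-headers 2>"$errfile"); then rc=0; else rc=$?; fi
  printf 'iter=%s rc=%s\n' "$attempt" "$rc"
  if [ "$rc" -eq 0 ]; then
    # printf '%s' emits the captured text as-is, with no escape processing.
    printf '%s' "$stdout"
    break
  fi
done
```

After the loop, the first two `.err` files hold the failure messages and the third is empty, which gives a timeline of when the API recovered.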
weicao
left a comment
Re-reviewed: the cleanup snippet now preserves kubectl rc under set -e and keeps per-iteration stderr evidence. This closes my blocking comments. Same GitHub account means I cannot use formal approve, but from my side PR #149 is OK.
weicao
left a comment
Corrected review wording: re-reviewed c7cc700. The cleanup snippet now preserves kubectl rc under set -e and keeps per-iteration stderr evidence. This closes my blocking comments. Same GitHub account means I cannot use formal approve, but from my side PR #149 is OK.
Final docs-gate check: good topic, but not mergeable yet. Blockers:
After those fixes, I can do a tighter line-level pass.
Summary
Add engine-neutral guide for KB Cluster CR delete convergence inside vcluster environments. Cleanup of a KB Cluster (Cluster → Component → InstanceSet → Pods → PVCs → vcluster mapping GC) takes 10-15 min typical / up to 25 min worst case in vcluster, vs 2-3 min in single-tier k3d. Tests that use shorter cleanup waits will misclassify slow-but-converging cleanup as a finalizer deadlock.
Why now
What this guide gives
Test plan
addon-vcluster-bounded-convergence-window-guide.md (covers chaos pod-delete + replacement window; this guide is its parallel for CR delete + finalizer chain).
$() background subshell stdout hang trap as a separate doc PR (different topic, one doc one topic per kubeblocks-addon-docs convention).

🤖 Generated with Claude Code