fix(cli): guard canonical mcp tree against poisoned-cache flag collisions#454
Conversation
…ions The canonical 'dws mcp' tree is built from cached catalog data before the legacy command build and before Cobra dispatches anything, so a pflag panic there (a tool schema property named after the reserved --params flag, as cached during the 1.0.32 incident) aborted every invocation -- including 'dws cache refresh' and 'dws upgrade' -- and sat outside all three poisoned-cache guards (#447/#449/#452). Two layers, mirroring the existing guards: - applyFlagSpecs now skips reserved (--json/--params), duplicate and alias-colliding flag names and sanitizes shorthands instead of letting pflag panic; the skipped property stays reachable through the reserved JSON payload flags (same degradation semantics as #449). - newMCPCommand wraps the build in the #452 recover -> quarantine -> retry-once -> degrade-to-stub sequence, so even an unforeseen panic class no longer locks the CLI out.
Verification evidence (real poisoned cache from the 2026-05-25 incident)The original poisoned snapshot is preserved on the incident machine ( 1. v1.0.31 (pre-#447, GitHub release binary) — locked out, repair commands included$ export DWS_CACHE_DIR=/tmp/dws-poison-verify-1036
$ /tmp/dws-v1031/dws --version
Error: internal panic: chat_permission_grant flag redefined: params
$ /tmp/dws-v1031/dws cache refresh
Error: internal panic: chat_permission_grant flag redefined: params
$ /tmp/dws-v1031/dws contact user get-self -f json
Error: internal panic: chat_permission_grant flag redefined: params2. v1.0.36 as released (official binary, 4bc4b60) — still locked out, no self-heal fires$ ~/.local/bin/dws --version
dws version v1.0.36 (4bc4b60, 2026-06-10T14:47:10Z) # sanity check, clean cache
$ export DWS_CACHE_DIR=/tmp/dws-poison-verify-1036
$ dws --version
Error: internal panic: chat_permission_grant flag redefined: params
$ dws cache refresh
Error: internal panic: chat_permission_grant flag redefined: params
$ dws doctor
Error: internal panic: chat_permission_grant flag redefined: paramsNo quarantine warning is printed and the cache directory is left untouched — the #452 self-heal never runs. Stack trace (temporary 3. This branch (def4b5b) — same poisoned cache, first run heals$ export DWS_CACHE_DIR=/tmp/dws-poison-verify-fixed # fresh copy of the same poison
$ /tmp/dws-fixed --version
dws version dev
$ /tmp/dws-fixed cache refresh
[OK] 缓存刷新完成:已刷新 22 个服务,失败 0 个
缓存目录: /tmp/dws-poison-verify-fixed
$ /tmp/dws-fixed contact user get-self -f json
{
"result": [ ... normal payload ... ]The poisoned tool itself stays usable — the colliding $ /tmp/dws-fixed mcp chat chat_permission_grant --help
Flags:
--agentCode string Agent 标识,默认传wukong
--grantType string 授权策略: once|session|timed|permanent
-h, --help help for chat_permission_grant
--json string Base JSON object payload for this tool invocation
--params string Additional JSON object payload merged after --json
--scope string 授权 scope
--ttl string timed 授权有效期,如 1h/4h/24h/7d4. Tests$ go test ./internal/cli/... ./internal/app/...
ok github.com/DingTalk-Real-AI/dingtalk-workspace-cli/internal/cli 0.639s
ok github.com/DingTalk-Real-AI/dingtalk-workspace-cli/internal/app 53.545s
$ go test ./internal/cache/... ./internal/compat/...
ok github.com/DingTalk-Real-AI/dingtalk-workspace-cli/internal/cache 0.345s
ok github.com/DingTalk-Real-AI/dingtalk-workspace-cli/internal/compat 0.379s
$ go vet ./internal/cli/... ./internal/app/... # clean |
Post-release verification (re-cut v1.0.36 as published)The re-cut release (tag moved to 60ac0b4, published 2026-06-11T01:20:16Z) was downloaded from GitHub releases and run against a fresh sandboxed copy of the same 2026-05-25 poisoned cache: $ /tmp/dws-v1036-recut/dws --version # DWS_CACHE_DIR = poisoned sandbox
dws version v1.0.36 (60ac0b4, 2026-06-11T01:18:30Z)
$ /tmp/dws-v1036-recut/dws cache refresh
[OK] 缓存刷新完成:已刷新 22 个服务,失败 0 个
$ /tmp/dws-v1036-recut/dws contact user get-self -f json
{ "result": [ ... normal payload ... ] }The published binary self-heals on first run; Note: the npm publish step of the release workflow failed with |
Problem
The canonical
dws mcptree is built from cached catalog data before the legacy command build and before Cobra dispatches anything. A pflag panic there — e.g. a tool schema property named after the reserved--paramsflag, which is exactly what the 2026-05-25 incident cache contained (chat_permission_grant) — aborts every invocation, includingdws cache refreshanddws upgrade.None of the three poisoned-cache guards covers this path: #447's degrade and #452's quarantine wrap only the legacy envelope build (which runs after the canonical build), and #449's
canRegisterFlaglives ininternal/compat, notinternal/cli.Verified against the preserved real poisoned cache from the 2026-05-25 incident: v1.0.36 as released locks out completely (
--version,cache refresh,doctorall panic, no quarantine fires), so "getting 1.0.36 onto the machine is enough to escape" does not hold for this class.Fix (two layers, mirroring the existing guards)
internal/cli/canonical.go—applyFlagSpecsnow skips reserved (--json/--params), duplicate, and alias-colliding flag names and sanitizes shorthands instead of letting pflag panic (canRegisterToolFlag/safeToolShorthand, mirroring fix(compat): guard envelope-driven flag registration against pflag panics #449). A skipped property stays reachable through the reserved JSON payload flags.internal/app/root.go—newMCPCommandwraps the build in the fix(cli): self-heal a poisoned discovery cache by quarantining it and rebuilding #452 recover → quarantine → retry-once → degrade-to-stub sequence (test seam:buildMCPCommandFn), so even an unforeseen panic class in the canonical build can no longer brick the CLI.Verification
Same poisoned cache (sandboxed via
DWS_CACHE_DIR):--versioninternal panic: chat_permission_grant flag redefined: paramscache refreshcontact user get-selfmcp chat chat_permission_grant --helpparamsproperty skipped, value still passable via reserved--paramsTests: 4 new flag-guard tests (
internal/cli/canonical_flag_guard_test.go), 4 new panic-fallback tests mirroringlegacy_panic_fallback_test.go(internal/app/canonical_panic_fallback_test.go).go test ./internal/cli/... ./internal/app/... ./internal/cache/... ./internal/compat/...andgo vetall green.