Skip to content

fix(cli): guard canonical mcp tree against poisoned-cache flag collisions#454

Merged
PeterGuy326 merged 2 commits into
mainfrom
fix/canonical-flag-collision-guard
Jun 11, 2026
Merged

fix(cli): guard canonical mcp tree against poisoned-cache flag collisions#454
PeterGuy326 merged 2 commits into
mainfrom
fix/canonical-flag-collision-guard

Conversation

@PeterGuy326

Copy link
Copy Markdown
Collaborator

Problem

The canonical dws mcp tree is built from cached catalog data before the legacy command build and before Cobra dispatches anything. A pflag panic there — e.g. a tool schema property named after the reserved --params flag, which is exactly what the 2026-05-25 incident cache contained (chat_permission_grant) — aborts every invocation, including dws cache refresh and dws upgrade.

None of the three poisoned-cache guards covers this path: #447's degrade and #452's quarantine wrap only the legacy envelope build (which runs after the canonical build), and #449's canRegisterFlag lives in internal/compat, not internal/cli.

Verified against the preserved real poisoned cache from the 2026-05-25 incident: v1.0.36 as released locks out completely (--version, cache refresh, doctor all panic, no quarantine fires), so "getting 1.0.36 onto the machine is enough to escape" does not hold for this class.

Fix (two layers, mirroring the existing guards)

Verification

Same poisoned cache (sandboxed via DWS_CACHE_DIR):

v1.0.36 as released this branch
--version internal panic: chat_permission_grant flag redefined: params works
cache refresh same panic works, rebuilds 22 servers (poison cleared)
contact user get-self same panic works
mcp chat chat_permission_grant --help same panic works; colliding params property skipped, value still passable via reserved --params

Tests: 4 new flag-guard tests (internal/cli/canonical_flag_guard_test.go), 4 new panic-fallback tests mirroring legacy_panic_fallback_test.go (internal/app/canonical_panic_fallback_test.go). go test ./internal/cli/... ./internal/app/... ./internal/cache/... ./internal/compat/... and go vet all green.

…ions

The canonical 'dws mcp' tree is built from cached catalog data before the
legacy command build and before Cobra dispatches anything, so a pflag
panic there (a tool schema property named after the reserved --params
flag, as cached during the 1.0.32 incident) aborted every invocation --
including 'dws cache refresh' and 'dws upgrade' -- and sat outside all
three poisoned-cache guards (#447/#449/#452).

Two layers, mirroring the existing guards:
- applyFlagSpecs now skips reserved (--json/--params), duplicate and
  alias-colliding flag names and sanitizes shorthands instead of letting
  pflag panic; the skipped property stays reachable through the reserved
  JSON payload flags (same degradation semantics as #449).
- newMCPCommand wraps the build in the #452 recover -> quarantine ->
  retry-once -> degrade-to-stub sequence, so even an unforeseen panic
  class no longer locks the CLI out.
@PeterGuy326 PeterGuy326 requested a review from wxianfeng June 10, 2026 16:13
@PeterGuy326

Copy link
Copy Markdown
Collaborator Author

Verification evidence (real poisoned cache from the 2026-05-25 incident)

The original poisoned snapshot is preserved on the incident machine (cache.bak-1032switch-2122); the chat_permission_grant tool in it declares a schema property literally named params, colliding with the reserved --params payload flag. All runs below use a sandboxed copy via DWS_CACHE_DIR — the live ~/.dws/cache is untouched.

1. v1.0.31 (pre-#447, GitHub release binary) — locked out, repair commands included

$ export DWS_CACHE_DIR=/tmp/dws-poison-verify-1036
$ /tmp/dws-v1031/dws --version
Error: internal panic: chat_permission_grant flag redefined: params
$ /tmp/dws-v1031/dws cache refresh
Error: internal panic: chat_permission_grant flag redefined: params
$ /tmp/dws-v1031/dws contact user get-self -f json
Error: internal panic: chat_permission_grant flag redefined: params

2. v1.0.36 as released (official binary, 4bc4b60) — still locked out, no self-heal fires

$ ~/.local/bin/dws --version
dws version v1.0.36 (4bc4b60, 2026-06-10T14:47:10Z)   # sanity check, clean cache

$ export DWS_CACHE_DIR=/tmp/dws-poison-verify-1036
$ dws --version
Error: internal panic: chat_permission_grant flag redefined: params
$ dws cache refresh
Error: internal panic: chat_permission_grant flag redefined: params
$ dws doctor
Error: internal panic: chat_permission_grant flag redefined: params

No quarantine warning is printed and the cache directory is left untouched — the #452 self-heal never runs. Stack trace (temporary debug.PrintStack() in the top-level recover) pins the panic to the canonical tree, built before the guarded legacy path:

panic: chat_permission_grant flag redefined: params
  github.com/spf13/pflag.(*FlagSet).AddFlag
  internal/cli.applyFlagSpecs        internal/cli/canonical.go:554
  internal/cli.newToolCommand        internal/cli/canonical.go:533
  internal/cli.newProductCommand     internal/cli/canonical.go:271
  internal/cli.NewMCPCommand         internal/cli/canonical.go:99
  internal/app.newMCPCommand         internal/app/root.go:690
  internal/app.NewRootCommandWithEngine  internal/app/root.go:328

3. This branch (def4b5b) — same poisoned cache, first run heals

$ export DWS_CACHE_DIR=/tmp/dws-poison-verify-fixed   # fresh copy of the same poison
$ /tmp/dws-fixed --version
dws version dev
$ /tmp/dws-fixed cache refresh
[OK] 缓存刷新完成:已刷新 22 个服务,失败 0 个
缓存目录: /tmp/dws-poison-verify-fixed
$ /tmp/dws-fixed contact user get-self -f json
{
  "result": [ ... normal payload ... ]

The poisoned tool itself stays usable — the colliding params property is skipped (value remains passable through the reserved --params JSON payload), everything else registers normally:

$ /tmp/dws-fixed mcp chat chat_permission_grant --help
Flags:
      --agentCode string   Agent 标识,默认传wukong
      --grantType string   授权策略: once|session|timed|permanent
  -h, --help               help for chat_permission_grant
      --json string        Base JSON object payload for this tool invocation
      --params string      Additional JSON object payload merged after --json
      --scope string       授权 scope
      --ttl string         timed 授权有效期,如 1h/4h/24h/7d

4. Tests

$ go test ./internal/cli/... ./internal/app/...
ok  github.com/DingTalk-Real-AI/dingtalk-workspace-cli/internal/cli   0.639s
ok  github.com/DingTalk-Real-AI/dingtalk-workspace-cli/internal/app   53.545s
$ go test ./internal/cache/... ./internal/compat/...
ok  github.com/DingTalk-Real-AI/dingtalk-workspace-cli/internal/cache   0.345s
ok  github.com/DingTalk-Real-AI/dingtalk-workspace-cli/internal/compat  0.379s
$ go vet ./internal/cli/... ./internal/app/...   # clean

@PeterGuy326 PeterGuy326 merged commit 60ac0b4 into main Jun 11, 2026
6 checks passed
@PeterGuy326 PeterGuy326 deleted the fix/canonical-flag-collision-guard branch June 11, 2026 01:16
@PeterGuy326

Copy link
Copy Markdown
Collaborator Author

Post-release verification (re-cut v1.0.36 as published)

The re-cut release (tag moved to 60ac0b4, published 2026-06-11T01:20:16Z) was downloaded from GitHub releases and run against a fresh sandboxed copy of the same 2026-05-25 poisoned cache:

$ /tmp/dws-v1036-recut/dws --version          # DWS_CACHE_DIR = poisoned sandbox
dws version v1.0.36 (60ac0b4, 2026-06-11T01:18:30Z)
$ /tmp/dws-v1036-recut/dws cache refresh
[OK] 缓存刷新完成:已刷新 22 个服务,失败 0 个
$ /tmp/dws-v1036-recut/dws contact user get-self -f json
{ "result": [ ... normal payload ... ] }

The published binary self-heals on first run; cache refresh clears the poison. dws upgrade and the install scripts both resolve GitHub releases/latest, so the primary distribution channel now serves the guarded build.

Note: the npm publish step of the release workflow failed with 403 You cannot publish over the previously published versions: 1.0.36 — npm version numbers are immutable by registry policy, so the npm package stays on the original cut until the next version. GitHub release assets, dws upgrade, and the install scripts are unaffected.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants