chore(mcp/docker): various improvements and housekeeping#2443
chore(mcp/docker): various improvements and housekeeping#2443meowgorithm wants to merge 1 commit intomainfrom
Conversation
53294d4 to
3fcbf04
Compare
There was a problem hiding this comment.
Pull request overview
This PR refines how MCP tool binary payloads (image/audio Data) are normalized into padded base64, aiming to avoid misclassifying raw ASCII bytes as unpadded base64.
Changes:
- Add a regression test ensuring raw ASCII like
"abc"is encoded rather than treated as unpadded base64. - Replace permissive raw base64 decoding with a heuristic (
decodeLikelyBase64) to reduce accidental decoding of plain-text bytes. - Simplify
isValidBase64by delegating to the new decoder logic.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| internal/agent/tools/mcp/tools.go | Introduces heuristic base64 detection/decoding and updates validation to use it. |
| internal/agent/tools/mcp/tools_test.go | Adds a test case covering ambiguous raw ASCII input that previously could be misdetected as base64. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // decodeLikelyBase64 decodes canonical base64 and only accepts unpadded raw | ||
| // base64 when the input also contains a character that plain lowercase text | ||
| // cannot contain in base64. This avoids corrupting raw ASCII payload bytes like | ||
| // "abc" that RawStdEncoding could otherwise decode. |
There was a problem hiding this comment.
The doc comment is hard to parse/technically inaccurate: “plain lowercase text cannot contain in base64” reads like lowercase text is invalid base64, but base64 can contain lowercase letters. Consider rephrasing to explicitly say you only treat unpadded base64 as base64 when it contains characters outside [a-z] (or otherwise clarify the heuristic/ambiguity being handled).
| // decodeLikelyBase64 decodes canonical base64 and only accepts unpadded raw | |
| // base64 when the input also contains a character that plain lowercase text | |
| // cannot contain in base64. This avoids corrupting raw ASCII payload bytes like | |
| // "abc" that RawStdEncoding could otherwise decode. | |
| // decodeLikelyBase64 first tries to decode canonical (padded) base64. If that | |
| // fails, it only treats the input as unpadded base64 when the input length is | |
| // not a multiple of 4 and it contains at least one character outside 'a'–'z' | |
| // (for example, A–Z, 0–9, '+' or '/'). This avoids misclassifying plain lowercase | |
| // ASCII payloads like "abc" as base64, which RawStdEncoding could otherwise decode. |
| if len(s)%4 == 0 { | ||
| return nil, false | ||
| } | ||
| if !strings.ContainsAny(s, "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789+/") { |
There was a problem hiding this comment.
The ContainsAny heuristic means valid unpadded raw base64 strings made entirely of lowercase letters (possible for some binary payloads) will be rejected and then double-encoded by ensureBase64. If that input can occur from MCP tool providers, this will corrupt media payloads. Either broaden the acceptance criteria for raw base64 or document that lowercase-only unpadded base64 is intentionally treated as plain text (and add a test to lock that behavior in).
| if !strings.ContainsAny(s, "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789+/") { | |
| if !strings.ContainsAny(s, "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/") { |
| @@ -216,21 +228,6 @@ func decodeBase64(data []byte) ([]byte, bool) { | |||
|
|
|||
| // isValidBase64 checks if the data appears to be valid base64-encoded content. | |||
There was a problem hiding this comment.
isValidBase64 now delegates to decodeLikelyBase64, which intentionally rejects some valid raw base64 (e.g., lowercase-only unpadded). The function name and comment suggest a strict validity check, but the behavior is now a “likelihood” heuristic. Consider renaming (e.g., isLikelyBase64) or updating the comment to reflect the narrower acceptance rules.
| // isValidBase64 checks if the data appears to be valid base64-encoded content. | |
| // isValidBase64 returns true if the data is likely to be base64-encoded content. | |
| // It uses decodeLikelyBase64, which employs a heuristic and may reject some | |
| // technically valid base64 encodings (for example, certain unpadded, lowercase- | |
| // only raw base64) to avoid misinterpreting plain ASCII data as base64. |
No description provided.