refactor: introduce canonical tool naming convention and update evalsets & model generators to reflect this change. #400

Merged
IsmailMehdi merged 3 commits into main from trajectory_matcher on May 21, 2026

Conversation

@omkargaikwad23 (Collaborator) commented May 21, 2026

Summary

This change centralizes tool name normalization in a new utility module, so the trajectory matcher stays generator-agnostic while tool identities are preserved across different agent platforms.

Key Changes

  • Canonical Naming Convention: Standardized MCP tool names to the server__tool format (using double underscores) to distinguish between tools with the same name across different servers.
  • Normalization Utility: Added evalbench/generators/models/tool_naming.py to handle specific formatting differences from Claude Code (mcp__), Gemini CLI (mcp_), and Codex CLI.
  • Agnostic Scoring: Refactored TrajectoryMatcher to remove per-generator normalization logic, enabling simpler string-based comparisons of pre-normalized trajectories.
  • Dataset Migration: Updated expected_trajectory entries in evaluation sets (e.g., Cloud SQL and Bigtable datasets) to reflect the new canonical format.
  • Improved Testing: Added comprehensive unit tests for the naming helpers and updated trajectory matcher tests to verify the new standard.
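
For illustration, the normalization and generator-agnostic comparison described above might look roughly like the sketch below. It is not the code in tool_naming.py: the function names `canonical_tool_name` and `trajectories_match`, the `known_servers` parameter, and the example server and tool names are all hypothetical. It also assumes the raw shapes are `mcp__<server>__<tool>` for Claude Code and `mcp_<server>_<tool>` for Gemini CLI. Codex CLI is left out because its raw format is not stated here.

```python
from typing import Iterable, Sequence

CLAUDE_PREFIX = "mcp__"  # Claude Code prefix named in the summary
GEMINI_PREFIX = "mcp_"   # Gemini CLI prefix named in the summary
SEPARATOR = "__"         # canonical server__tool separator


def canonical_tool_name(raw: str, known_servers: Iterable[str] = ()) -> str:
    """Hypothetical helper: map a raw tool name to the server__tool form."""
    # Check the longer prefix first, since "mcp__" also starts with "mcp_".
    if raw.startswith(CLAUDE_PREFIX):
        # Assumed shape mcp__<server>__<tool>: dropping the prefix is enough.
        return raw[len(CLAUDE_PREFIX):]
    if raw.startswith(GEMINI_PREFIX):
        rest = raw[len(GEMINI_PREFIX):]
        # Assumed shape mcp_<server>_<tool>. A single underscore is ambiguous,
        # so split only on a known server name, trying the longest first.
        for server in sorted(known_servers, key=len, reverse=True):
            if rest.startswith(server + "_"):
                return server + SEPARATOR + rest[len(server) + 1:]
        return rest
    # Anything else (for example a built-in tool) is returned unchanged.
    return raw


def trajectories_match(expected: Sequence[str], actual: Sequence[str]) -> bool:
    """Exact, ordered comparison of two already-normalized trajectories."""
    return list(expected) == list(actual)


if __name__ == "__main__":
    servers = ["cloud_sql", "bigtable"]  # hypothetical server names
    raw_claude = ["mcp__cloud_sql__list_instances"]
    raw_gemini = ["mcp_cloud_sql_list_instances"]
    expected = ["cloud_sql__list_instances"]
    norm_claude = [canonical_tool_name(n, servers) for n in raw_claude]
    norm_gemini = [canonical_tool_name(n, servers) for n in raw_gemini]
    print(trajectories_match(expected, norm_claude))
    print(trajectories_match(expected, norm_gemini))
```

Normalizing at the generator boundary means the matcher only ever sees canonical names, which is what allows the comparison to reduce to plain string equality.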

…ets and model generators to reflect this change.
Comment thread evalbench/generators/models/tool_naming.py Fixed
@IsmailMehdi (Collaborator) commented

/gcbrun

@IsmailMehdi IsmailMehdi merged commit ed34fcd into main May 21, 2026
10 checks passed
2 participants