refactor: filter native tools from trajectory #402

Open
omkargaikwad23 wants to merge 2 commits into main from filter_non_mcp_tools

@omkargaikwad23
Collaborator

Summary

  • Added is_canonical_mcp_name() helper in tool_naming.py that detects the canonical
    MCP name form (segments joined by the __ separator), so native/built-in harness
    tools (which lack the __ separator) can be cleanly excluded.
  • Applied the filter in extract_tools() across all three harness adapters (Gemini CLI,
    Claude Code, Codex), dropping harness-internal calls like update_topic,
    run_shell_command, Read, Bash, and Codex's shell from the trajectory.
  • Removed the now-redundant _CLAUDE_INFRA_TOOLS = {"ToolSearch"} carve-out — it's
    subsumed by the generic MCP-only filter.
  • Updated datasets/gemini-cli-tools/gemini-cli.evalset.json expected trajectories to
    reflect what the agent actually invokes (e.g. defensive get_instance before
    update_instance, and list_instances alone when it already returns state).
  • Added IsCanonicalMcpNameTest covering MCP form detection, native-tool rejection, and
    empty-segment edge cases.
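
For reviewers who want a picture of the filter without opening the diff, here is a minimal sketch of the idea. It is not the code in tool_naming.py: the rule "two or more non-empty segments joined by __", the filter_mcp_calls() helper, and the db server prefix are all assumptions made for illustration.

```python
from typing import Iterable, List


def is_canonical_mcp_name(name: str) -> bool:
    """Illustrative check (assumed rule, not the PR's implementation).

    A name counts as canonical here when it contains the "__" separator
    and every segment around the separator is non-empty.
    """
    if not isinstance(name, str) or "__" not in name:
        return False
    # Empty segments arise from leading, trailing, or doubled separators.
    return all(name.split("__"))


def filter_mcp_calls(tool_names: Iterable[str]) -> List[str]:
    """Hypothetical helper: keep only MCP-style names, preserving order."""
    return [n for n in tool_names if is_canonical_mcp_name(n)]


if __name__ == "__main__":
    # "db" is a made-up server prefix; the native names come from the summary.
    calls = ["update_topic", "db__get_instance", "Bash", "db__update_instance"]
    print(filter_mcp_calls(calls))
    # → ['db__get_instance', 'db__update_instance']
```

Under this rule ToolSearch has no separator and is dropped along with the other native tools, which is consistent with removing the dedicated carve-out.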

Motivation

The trajectory matcher uses Jaccard set similarity, and the analyzer counts a
scenario as correct only at score == 100. Native harness tools (update_topic,
run_shell_command, etc.) were leaking into the actual set, dragging Jaccard below 1.0
and producing 0/2 correctness even when the agent called every MCP tool we cared
about. Filtering at the boundary lets dataset authors keep expected_trajectory focused
on user-visible MCP intent and keeps the test stable across harnesses, since each
agent's internal planning tools differ.
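
A small worked example of the failure mode, as a sketch rather than the matcher's actual code. It assumes the score is the Jaccard similarity scaled to 0-100, and it uses a made-up db server prefix and a simplified separator check in place of the real helper.

```python
from typing import Iterable


def jaccard(expected: Iterable[str], actual: Iterable[str]) -> float:
    """Jaccard similarity |E ∩ A| / |E ∪ A|; two empty sets count as identical."""
    e, a = set(expected), set(actual)
    if not e and not a:
        return 1.0
    return len(e & a) / len(e | a)


def looks_like_mcp(name: str) -> bool:
    """Simplified stand-in for the PR's helper (assumption for this example)."""
    return "__" in name and all(name.split("__"))


# Hypothetical scenario: the agent calls both expected MCP tools,
# plus two native harness tools named in the summary.
expected = ["db__get_instance", "db__update_instance"]
actual_raw = ["update_topic", "db__get_instance",
              "run_shell_command", "db__update_instance"]
actual_filtered = [n for n in actual_raw if looks_like_mcp(n)]

if __name__ == "__main__":
    # Unfiltered: intersection 2, union 4.
    print(round(jaccard(expected, actual_raw) * 100))       # → 50
    # Filtered: both sets are identical.
    print(round(jaccard(expected, actual_filtered) * 100))  # → 100
```

With an exact-100 threshold, the unfiltered run is marked incorrect even though no expected MCP call is missing.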

Test plan

  • python -m unittest test.tool_naming_test -v — 22/22 pass
  • Re-run ./evalbench/run.sh against gemini-cli-tools and confirm trajectory_matcher:
    2/2 = 100%
  • Spot-check Claude Code and Codex evalsets to confirm native tools (Read/Bash/shell)
    no longer appear in accumulated trajectories
