* feat(core): expose {{ tool_calls }} template variable for LLM graders
Add a new `{{ tool_calls }}` template variable that provides LLM graders
with a formatted summary of tool calls from agent execution. Previously,
LLM graders were blind to tool call details; only the plain-text
`{{ output }}` variable was available.
The new variable formats each tool call as a compact line with the tool
name and key input fields (skill name for Skill, file_path for
Read/Write/Edit, command for Bash, pattern for Grep/Glob).
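The implementation itself is not shown in this log, but a minimal sketch of such a formatter might look like the following. The `ToolCall` shape and the exact key-field name for each tool (e.g. `name` for Skill) are assumptions for illustration, not the actual `formatToolCalls()` code:

```typescript
// Hypothetical sketch — field names are assumptions, not the real implementation.
interface ToolCall {
  name: string;
  input: Record<string, unknown>;
}

// Most informative input field per tool, per the commit description:
// skill name for Skill, file_path for Read/Write/Edit, command for Bash,
// pattern for Grep/Glob.
const KEY_FIELDS: Record<string, string> = {
  Skill: "name",
  Read: "file_path",
  Write: "file_path",
  Edit: "file_path",
  Bash: "command",
  Grep: "pattern",
  Glob: "pattern",
};

// Render one compact line per call: "ToolName: key input value",
// falling back to the bare tool name when no key field is known.
function formatToolCalls(calls: ToolCall[]): string {
  return calls
    .map((call) => {
      const field = KEY_FIELDS[call.name];
      const value = field ? call.input[field] : undefined;
      return value !== undefined ? `${call.name}: ${value}` : call.name;
    })
    .join("\n");
}
```

A call like `formatToolCalls([{ name: "Bash", input: { command: "npm test" } }])` would then yield the single line `Bash: npm test`.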
Changes:
- New `formatToolCalls()` utility in format-tool-calls.ts
- Add `toolCalls` field to EvaluationContext interface
- Add TOOL_CALLS to TEMPLATE_VARIABLES constants
- Thread toolCalls through orchestrator pipeline (~15 sites)
- Wire into all LLM grader prompt builders (~8 sites)
- Auto-append `[[ ## tool_calls ## ]]` section in default templates
- 12 new unit tests for formatToolCalls
- Update docs site and skill references
Closes #1121
* feat(examples): add tool-calls-template example for {{ tool_calls }} variable
Demonstrates using {{ tool_calls }} in LLM grader prompts to verify
skill invocation — an alternative to the deterministic skill-trigger
grader when LLM reasoning is needed.
Includes:
- Mock CLI agent returning Skill/Read/Edit/Bash tool calls
- LLM grader prompts using {{ tool_calls }} for positive/negative cases
- 3 test cases: deploy skill, review-pr skill, no-skill bugfix
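A grader prompt in this style might read as follows. This is an illustrative sketch only; the prompt wording and config keys here are assumptions, not the example's actual files:

```yaml
# Hypothetical LLM grader prompt fragment (illustrative).
# {{ tool_calls }} expands to the formatted tool-call summary at grading time.
prompt: |
  You are grading an agent transcript.

  Tool calls made by the agent:
  {{ tool_calls }}

  Pass only if the agent invoked the expected skill via a Skill tool call.
  Respond with PASS or FAIL.
```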
* fix(examples): use root targets.yaml and file:// prompt prefix for tool-calls example
Move mock_agent and openrouter_grader targets to root .agentv/targets.yaml
instead of a per-example targets file. Fix prompt references to use
file:// prefix so they're resolved as file paths rather than inline text.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* refactor(examples): use shared grader target and rename eval file
- Remove openrouter_grader target, use shared grader (via GRADER_TARGET)
- Rename dataset.eval.yaml to eval.yaml
- Verified with both mock_agent (3/3 pass) and copilot (tool_calls populated)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* refactor(examples): use workspace template with skills for copilot e2e
Replace mock CLI agent with real copilot-compatible workspace template
containing acme-deploy skill in all provider directories. Verified 3/3
pass with copilot target (skill triggered, rollback triggered, no skill
for unrelated). Remove mock_agent target from root targets.yaml.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* refactor(examples): single .agents skill + before_all hook + rubric assertions
- Keep only .agents/skills/acme-deploy/SKILL.md as single source of truth
- Add before_all hook to copy skills to .claude/skills/ in workspace
- Switch from llm-grader with custom prompts to rubric assertions
- Remove prompts/ directory and mock-agent.ts
- Remove mock_agent target from root targets.yaml
- Verified 3/3 pass with copilot at 100%
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* docs: add tool_calls context section to rubrics docs + flatten example assertions
Add "Context Available to Rubric Graders" section to rubrics.mdx
documenting that rubric assertions receive tool_calls and file_changes
context. Flatten example eval assertions from `type: rubrics` with
`criteria:` to plain string shorthand.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Write rubric criteria directly in `assertions`. If you want help choosing between plain assertions, deterministic graders, and rubric or LLM-based grading, use the `agentv-eval-writer` skill. Keep the grader choice driven by the criteria rather than one fixed recipe.
## Context Available to Rubric Graders

Rubric assertions automatically receive the full evaluation context, not just the agent's text answer. When present, the following are appended to the grader prompt:

- **`file_changes`** — unified diff of workspace file changes (when `workspace` is configured)
- **`tool_calls`** — formatted summary of tool calls from agent execution (tool name + key inputs)

This means rubric criteria can reason about *what the agent did*, not only what it said. For example, you can check whether an agent invoked a specific skill:

```yaml
assertions:
  - The agent invoked the acme-deploy skill
  - The agent used Read to inspect the config file before editing
```

This is a lightweight alternative to the `skill-trigger` evaluator when you want to check tool usage with natural-language criteria.
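For illustration, the appended `tool_calls` context is a compact per-call summary along these lines. The `[[ ## tool_calls ## ]]` section header comes from the default templates described above; the individual lines here are hypothetical values, and the exact formatting may differ:

```
[[ ## tool_calls ## ]]
Skill: acme-deploy
Read: config/deploy.yaml
Bash: npm run deploy
```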