
Commit 4038218

christsoclaude and Claude Opus 4.6 authored
feat(core): expose {{ tool_calls }} template variable for LLM graders (#1123)
* feat(core): expose {{ tool_calls }} template variable for LLM graders

  Add a new `{{ tool_calls }}` template variable that provides LLM graders with a formatted summary of tool calls from agent execution. Previously, LLM graders were blind to tool call details — only `{{ output }}` was available (plain text). The new variable formats each tool call as a compact line with the tool name and key input fields (skill name for Skill, file_path for Read/Write/Edit, command for Bash, pattern for Grep/Glob).

  Changes:
  - New `formatToolCalls()` utility in format-tool-calls.ts
  - Add `toolCalls` field to EvaluationContext interface
  - Add TOOL_CALLS to TEMPLATE_VARIABLES constants
  - Thread toolCalls through orchestrator pipeline (~15 sites)
  - Wire into all LLM grader prompt builders (~8 sites)
  - Auto-append `[[ ## tool_calls ## ]]` section in default templates
  - 12 new unit tests for formatToolCalls
  - Update docs site and skill references

  Closes #1121

* feat(examples): add tool-calls-template example for {{ tool_calls }} variable

  Demonstrates using {{ tool_calls }} in LLM grader prompts to verify skill invocation — an alternative to the deterministic skill-trigger grader when LLM reasoning is needed.

  Includes:
  - Mock CLI agent returning Skill/Read/Edit/Bash tool calls
  - LLM grader prompts using {{ tool_calls }} for positive/negative cases
  - 3 test cases: deploy skill, review-pr skill, no-skill bugfix

* fix(examples): use root targets.yaml and file:// prompt prefix for tool-calls example

  Move mock_agent and openrouter_grader targets to root .agentv/targets.yaml instead of a per-example targets file. Fix prompt references to use file:// prefix so they're resolved as file paths rather than inline text.

* refactor(examples): use shared grader target and rename eval file

  - Remove openrouter_grader target, use shared grader (via GRADER_TARGET)
  - Rename dataset.eval.yaml to eval.yaml
  - Verified with both mock_agent (3/3 pass) and copilot (tool_calls populated)

* refactor(examples): use workspace template with skills for copilot e2e

  Replace mock CLI agent with real copilot-compatible workspace template containing acme-deploy skill in all provider directories. Verified 3/3 pass with copilot target (skill triggered, rollback triggered, no skill for unrelated). Remove mock_agent target from root targets.yaml.

* refactor(examples): single .agents skill + before_all hook + rubric assertions

  - Keep only .agents/skills/acme-deploy/SKILL.md as single source of truth
  - Add before_all hook to copy skills to .claude/skills/ in workspace
  - Switch from llm-grader with custom prompts to rubric assertions
  - Remove prompts/ directory and mock-agent.ts
  - Remove mock_agent target from root targets.yaml
  - Verified 3/3 pass with copilot at 100%

* docs: add tool_calls context section to rubrics docs + flatten example assertions

  Add "Context Available to Rubric Graders" section to rubrics.mdx documenting that rubric assertions receive tool_calls and file_changes context. Flatten example eval assertions from `type: rubrics` with `criteria:` to plain string shorthand.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 9d016b6 commit 4038218

12 files changed

Lines changed: 390 additions & 5 deletions


apps/web/src/content/docs/docs/evaluation/rubrics.mdx

Lines changed: 17 additions & 0 deletions
@@ -122,6 +122,23 @@ score = sum(criterion_score / 10 * weight) / sum(total_weights)
 
 Write rubric criteria directly in `assertions`. If you want help choosing between plain assertions, deterministic graders, and rubric or LLM-based grading, use the `agentv-eval-writer` skill. Keep the grader choice driven by the criteria rather than one fixed recipe.
 
+## Context Available to Rubric Graders
+
+Rubric assertions automatically receive the full evaluation context, not just the agent's text answer. When present, the following are appended to the grader prompt:
+
+- **`file_changes`** — unified diff of workspace file changes (when `workspace` is configured)
+- **`tool_calls`** — formatted summary of tool calls from agent execution (tool name + key inputs)
+
+This means rubric criteria can reason about *what the agent did*, not only what it said. For example, you can check whether an agent invoked a specific skill:
+
+```yaml
+assertions:
+  - The agent invoked the acme-deploy skill
+  - The agent used Read to inspect the config file before editing
+```
+
+This is a lightweight alternative to the `skill-trigger` evaluator when you want to check tool usage with natural-language criteria.
+
 ## Combining with Other Graders
 
 Rubrics work alongside code and LLM graders:

apps/web/src/content/docs/docs/graders/llm-graders.mdx

Lines changed: 2 additions & 0 deletions
@@ -73,6 +73,7 @@ Score the response from 0.0 to 1.0 based on:
 | `expected_output` | Full resolved expected array, JSON-serialized |
 | `output` | Full provider output array, JSON-serialized |
 | `file_changes` | Unified diff of workspace file changes (populated when `workspace` is configured) |
+| `tool_calls` | Formatted summary of tool calls from agent execution (tool name + key inputs per call) |
 
 ## Per-Grader Target
 
@@ -228,6 +229,7 @@ Derived strings injected into grader prompts:
 | `expected_output` | Full resolved expected array, JSON-serialized |
 | `output` | Full provider output array, JSON-serialized |
 | `file_changes` | Unified diff of workspace file changes (populated when `workspace` is configured) |
+| `tool_calls` | Formatted summary of tool calls from agent execution (tool name + key inputs per call) |
 
 **Example flow:**
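The template-variable tables above describe `{{ tool_calls }}` as a string injected into grader prompts. A minimal sketch of how such `{{ var }}` placeholders could be substituted (an illustrative helper, not the library's actual implementation):

```typescript
// Replace each {{ name }} placeholder with its value, or '' when absent.
// Hypothetical helper for illustration only.
function substitute(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (_match, name: string) => vars[name] ?? '');
}

// Example: inject a formatted tool-call summary into a grader prompt.
substitute('Tools used:\n{{ tool_calls }}', { tool_calls: '- Skill: acme-deploy' });
// → 'Tools used:\n- Skill: acme-deploy'
```

Unknown variables resolving to an empty string matches the behavior described for `tool_calls` when no tool calls are present.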
examples/features/tool-calls-template/evals/eval.yaml

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+# Tool Calls Template Variable Demo
+#
+# Demonstrates using {{ tool_calls }} with rubric assertions to check
+# whether an agent invoked the right skills — without needing the
+# skill-trigger evaluator.
+#
+# Skills live in workspace/.agents/skills/. The before_all hook copies
+# them to .claude/skills/ so copilot and other providers can discover them.
+#
+# Run:
+#   bun agentv eval examples/features/tool-calls-template/evals/eval.yaml --target copilot
+
+name: tool-calls-template
+description: Rubric assertions with {{ tool_calls }} for skill verification
+
+workspace:
+  template: ../workspace/
+hooks:
+  before_all:
+    command:
+      - bash
+      - -c
+      - 'WS=$(python3 -c "import json,sys;print(json.load(sys.stdin)[\"workspace_path\"])") && mkdir -p "$WS/.claude" && cp -r "$WS/.agents/skills" "$WS/.claude/skills"'
+
+tests:
+  - id: deploy-skill-triggered
+    input: How do I deploy payments-api to production?
+    assertions:
+      - The agent invoked the acme-deploy skill
+
+  - id: rollback-skill-triggered
+    input: I need to roll back user-service in staging, what's the procedure?
+    assertions:
+      - The agent invoked the acme-deploy skill
+
+  - id: no-skill-for-unrelated
+    input: Write a Python function that parses JSON logs and extracts error messages.
+    assertions:
+      - The tool_calls section does not contain any entry starting with "Skill:" (file creation, Read, Edit, and Bash are fine)
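The `before_all` shell one-liner can be opaque at a glance. This TypeScript sketch shows what it computes: the hook payload's `workspace_path` field (taken from the command above) determines the source and destination of the skill copy. The helper name is hypothetical:

```typescript
// Mirrors the before_all hook: parse the JSON hook payload from stdin,
// extract workspace_path, and derive the cp -r source/destination paths.
function skillCopyPlan(hookPayloadJson: string): { from: string; to: string } {
  const ws = JSON.parse(hookPayloadJson).workspace_path as string;
  return { from: `${ws}/.agents/skills`, to: `${ws}/.claude/skills` };
}

skillCopyPlan('{"workspace_path":"/tmp/ws-1"}');
// → { from: '/tmp/ws-1/.agents/skills', to: '/tmp/ws-1/.claude/skills' }
```

The actual hook then runs `mkdir -p` on the `.claude` directory before copying, so the destination always exists.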
examples/features/tool-calls-template/workspace/.agents/skills/acme-deploy/SKILL.md

Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
+---
+name: acme-deploy
+description: Use when the user asks about deploying services, checking deployment status, rollback procedures, or release management at Acme Corp
+---
+
+# Acme Corp Deployment Procedures
+
+## Overview
+
+Internal deployment runbook for Acme Corp services. All deployments follow the Trident release pipeline.
+
+## Deployment Commands
+
+### Deploy to staging
+```bash
+trident push --env staging --service <service-name> --tag <git-sha>
+```
+
+### Promote to production
+```bash
+trident promote --from staging --to prod --service <service-name> --approval-ticket <JIRA-ID>
+```
+Production deploys require a JIRA approval ticket (prefix: DEPLOY-).
+
+### Rollback
+```bash
+trident rollback --env <env> --service <service-name> --to-version <previous-tag>
+```
+Rollbacks auto-notify #ops-alerts in Slack.
+
+### Check deployment status
+```bash
+trident status --env <env> --service <service-name>
+```
+
+## Service Registry
+
+| Service | Owner Team | Staging URL | Prod URL |
+|---------|-----------|-------------|----------|
+| payments-api | Platform | payments.staging.acme.internal | payments.acme.internal |
+| user-service | Identity | users.staging.acme.internal | users.acme.internal |
+| notifications | Engagement | notify.staging.acme.internal | notify.acme.internal |
+
+## Rules
+
+- All prod deploys require a DEPLOY- JIRA ticket
+- Staging deploys are auto-approved during business hours (9am-5pm PT)
+- Rollbacks bypass approval but require post-mortem within 48h
+- Deploy freezes are announced in #engineering-announcements
packages/core/src/evaluation/graders/format-tool-calls.ts

Lines changed: 79 additions & 0 deletions
@@ -0,0 +1,79 @@
+/**
+ * Formats tool calls from agent output messages into a human-readable summary.
+ *
+ * Used by `{{ tool_calls }}` template variable in LLM grader prompts.
+ * Extracts key input fields per tool to keep the summary compact:
+ * - Skill: `skill` arg
+ * - Read/Write/Edit: `file_path`
+ * - Bash: `command`
+ * - Grep/Glob: `pattern`
+ * - Other tools: first string-valued input field (if any)
+ *
+ * Returns empty string when there are no tool calls (template variable resolves to '').
+ */
+
+import type { Message } from '../providers/types.js';
+
+/**
+ * Key input fields to extract per tool name.
+ * Order matters — first matching field wins.
+ */
+const KEY_INPUT_FIELDS: ReadonlyMap<string, readonly string[]> = new Map([
+  ['Skill', ['skill']],
+  ['Read', ['file_path']],
+  ['Write', ['file_path']],
+  ['Edit', ['file_path']],
+  ['Bash', ['command']],
+  ['Grep', ['pattern']],
+  ['Glob', ['pattern']],
+]);
+
+/** Fallback: pick the first short string-valued field from input. */
+const MAX_FALLBACK_LENGTH = 120;
+
+export function formatToolCalls(output: readonly Message[] | undefined): string {
+  if (!output) return '';
+
+  const lines: string[] = [];
+
+  for (const message of output) {
+    if (!message.toolCalls) continue;
+    for (const call of message.toolCalls) {
+      const toolName = call.tool ?? 'unknown';
+      const detail = extractKeyDetail(toolName, call.input);
+      lines.push(detail ? `- ${toolName}: ${detail}` : `- ${toolName}`);
+    }
+  }
+
+  return lines.length > 0 ? lines.join('\n') : '';
+}
+
+function extractKeyDetail(toolName: string, input: unknown): string {
+  if (!input || typeof input !== 'object') return '';
+  const record = input as Record<string, unknown>;
+
+  // Try known key fields for this tool
+  const knownFields = KEY_INPUT_FIELDS.get(toolName);
+  if (knownFields) {
+    for (const field of knownFields) {
+      const value = record[field];
+      if (typeof value === 'string' && value.length > 0) {
+        return truncate(value);
+      }
+    }
+  }
+
+  // Fallback: first short string-valued field
+  for (const value of Object.values(record)) {
+    if (typeof value === 'string' && value.length > 0 && value.length <= MAX_FALLBACK_LENGTH) {
+      return truncate(value);
+    }
+  }
+
+  return '';
+}
+
+function truncate(value: string, maxLen = 120): string {
+  if (value.length <= maxLen) return value;
+  return `${value.slice(0, maxLen)}…`;
+}
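A condensed, self-contained sketch of the summary format this file produces — the `Message`/`ToolCall` shapes here are simplified assumptions (the real types come from `../providers/types.js`), and the fallback/truncation paths are omitted:

```typescript
// Simplified mirror of formatToolCalls for illustration; not the real module.
type ToolCall = { tool?: string; input?: Record<string, unknown> };
type Msg = { toolCalls?: ToolCall[] };

const KEY_FIELD: Record<string, string> = {
  Skill: 'skill', Read: 'file_path', Write: 'file_path', Edit: 'file_path',
  Bash: 'command', Grep: 'pattern', Glob: 'pattern',
};

function summarize(output: Msg[]): string {
  const lines: string[] = [];
  for (const msg of output) {
    for (const call of msg.toolCalls ?? []) {
      const name = call.tool ?? 'unknown';
      const value = call.input?.[KEY_FIELD[name] ?? ''];
      // One compact line per call: "- Tool: key input" or just "- Tool".
      lines.push(typeof value === 'string' && value ? `- ${name}: ${value}` : `- ${name}`);
    }
  }
  return lines.join('\n');
}

summarize([{ toolCalls: [
  { tool: 'Skill', input: { skill: 'acme-deploy' } },
  { tool: 'Bash', input: { command: 'trident status --env prod' } },
] }]);
// → '- Skill: acme-deploy\n- Bash: trident status --env prod'
```

This is the string a rubric criterion like "The agent invoked the acme-deploy skill" gets to reason over.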

packages/core/src/evaluation/graders/index.ts

Lines changed: 2 additions & 0 deletions
@@ -55,6 +55,8 @@ export {
 } from './llm-grader.js';
 export type { LlmGraderOptions } from './llm-grader.js';
 
+export { formatToolCalls } from './format-tool-calls.js';
+
 export { SkillTriggerGrader } from './skill-trigger.js';
 
 export { assembleLlmGraderPrompt } from './llm-grader-prompt.js';

packages/core/src/evaluation/graders/llm-grader-prompt.ts

Lines changed: 28 additions & 4 deletions
@@ -24,6 +24,7 @@ export function assembleLlmGraderPrompt(input: {
   evaluatorConfig?: LlmGraderConfig;
   output?: readonly Message[];
   fileChanges?: string;
+  toolCalls?: string;
   graderTemplateOverride?: string;
 }): LlmGraderPromptAssembly {
   const {
@@ -32,6 +33,7 @@ export function assembleLlmGraderPrompt(input: {
     promptInputs,
     evaluatorConfig,
     fileChanges,
+    toolCalls,
     graderTemplateOverride,
   } = input;
 
@@ -41,19 +43,27 @@ export function assembleLlmGraderPrompt(input: {
   if (rubrics && rubrics.length > 0) {
     const hasScoreRanges = rubrics.some((r) => r.score_ranges && r.score_ranges.length > 0);
     if (hasScoreRanges) {
-      return assembleScoreRange(evalCase, candidate, promptInputs, rubrics, fileChanges);
+      return assembleScoreRange(evalCase, candidate, promptInputs, rubrics, fileChanges, toolCalls);
     }
-    return assembleChecklist(evalCase, candidate, promptInputs, rubrics, fileChanges);
+    return assembleChecklist(evalCase, candidate, promptInputs, rubrics, fileChanges, toolCalls);
   }
 
-  return assembleFreeform(evalCase, candidate, promptInputs, fileChanges, graderTemplateOverride);
+  return assembleFreeform(
+    evalCase,
+    candidate,
+    promptInputs,
+    fileChanges,
+    toolCalls,
+    graderTemplateOverride,
+  );
 }
 
 function assembleFreeform(
   evalCase: EvalTest,
   candidate: string,
   promptInputs: PromptInputs,
   fileChanges?: string,
+  toolCalls?: string,
   graderTemplateOverride?: string,
 ): LlmGraderPromptAssembly {
   const formattedQuestion =
@@ -67,6 +77,7 @@ function assembleFreeform(
     [TEMPLATE_VARIABLES.EXPECTED_OUTPUT]: (evalCase.reference_answer ?? '').trim(),
     [TEMPLATE_VARIABLES.CRITERIA]: evalCase.criteria.trim(),
     [TEMPLATE_VARIABLES.FILE_CHANGES]: fileChanges ?? '',
+    [TEMPLATE_VARIABLES.TOOL_CALLS]: toolCalls ?? '',
     // Deprecated aliases
     [TEMPLATE_VARIABLES.INPUT_TEXT]: formattedQuestion.trim(),
     [TEMPLATE_VARIABLES.OUTPUT_TEXT]: candidate.trim(),
@@ -77,10 +88,13 @@ function assembleFreeform(
   const template = graderTemplateOverride ?? DEFAULT_GRADER_TEMPLATE;
   let userPrompt = substituteVariables(template, variables);
 
-  // Append file_changes section to default template only when present
+  // Append file_changes and tool_calls sections to default template only when present
   if (fileChanges && !graderTemplateOverride) {
     userPrompt += `\n\n[[ ## file_changes ## ]]\n${fileChanges}`;
   }
+  if (toolCalls && !graderTemplateOverride) {
+    userPrompt += `\n\n[[ ## tool_calls ## ]]\n${toolCalls}`;
+  }
 
   return {
     systemPrompt,
@@ -96,6 +110,7 @@ function assembleChecklist(
   promptInputs: PromptInputs,
   rubrics: readonly RubricItem[],
   fileChanges?: string,
+  toolCalls?: string,
 ): LlmGraderPromptAssembly {
   const formattedQuestion =
     promptInputs.question && promptInputs.question.trim().length > 0
@@ -123,6 +138,10 @@ function assembleChecklist(
     parts.push('[[ ## file_changes ## ]]', fileChanges, '');
   }
 
+  if (toolCalls) {
+    parts.push('[[ ## tool_calls ## ]]', toolCalls, '');
+  }
+
   parts.push('[[ ## rubrics ## ]]');
 
   for (const rubric of rubrics) {
@@ -150,6 +169,7 @@ function assembleScoreRange(
   promptInputs: PromptInputs,
   rubrics: readonly RubricItem[],
   fileChanges?: string,
+  toolCalls?: string,
 ): LlmGraderPromptAssembly {
   const formattedQuestion =
     promptInputs.question && promptInputs.question.trim().length > 0
@@ -178,6 +198,10 @@ function assembleScoreRange(
     parts.push('[[ ## file_changes ## ]]', fileChanges, '');
   }
 
+  if (toolCalls) {
+    parts.push('[[ ## tool_calls ## ]]', toolCalls, '');
+  }
+
   parts.push('[[ ## scoring_criteria ## ]]');
 
   for (const rubric of rubrics) {
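The freeform path above appends the new section only to the default template, never to a user override. That conditional-append step can be isolated as a small sketch (a hypothetical helper distilled from the diff, not the library's actual API):

```typescript
// Append optional context sections to a grader prompt, mirroring the
// [[ ## section ## ]] marker convention used in the diff above.
function appendContextSections(userPrompt: string, fileChanges?: string, toolCalls?: string): string {
  let prompt = userPrompt;
  if (fileChanges) prompt += `\n\n[[ ## file_changes ## ]]\n${fileChanges}`;
  if (toolCalls) prompt += `\n\n[[ ## tool_calls ## ]]\n${toolCalls}`;
  return prompt;
}

appendContextSections('Grade the response.', undefined, '- Skill: acme-deploy');
// → 'Grade the response.\n\n[[ ## tool_calls ## ]]\n- Skill: acme-deploy'
```

Keeping the append out of custom templates means an override that already places `{{ tool_calls }}` explicitly never gets a duplicate section.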
