
docs: add agent example implementations with Phoenix tracing #12369

Merged
ehutt merged 13 commits into main from ehutt/trajectory-evals
Apr 8, 2026

Conversation

@ehutt (Contributor) commented Mar 25, 2026

Summary

Adds examples/agents/ with three instrumented agent implementations that run benchmark tasks and export traces to Phoenix:

  • tau-bench + OpenAI Agents SDK — Multi-turn retail customer service agent with 16 tools, user simulation, and policy enforcement
  • tau-bench + LangGraph — Same retail tasks on LangGraph for framework comparison
  • TRAJECT-Bench + LangGraph — Single-turn parallel and sequential tool-calling tasks across multiple domains (e-commerce, travel, finance, etc.)

Each implementation uses standard OpenInference instrumentation to produce LLM and tool spans in Phoenix. A batch orchestrator (run_scaled.py) runs all three implementations with 20 tasks each.
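The internals of run_scaled.py are not shown in this PR description; the following is a minimal, hedged sketch of how such a batch orchestrator might be structured. The three implementation directory names match the PR, but the `run.py` entry points and the `--num-tasks` flag are assumptions for illustration only.

```python
# Hypothetical sketch of a batch orchestrator like run_scaled.py.
# The implementation directories are from the PR, but the run.py
# CLI shape (--num-tasks) is an assumption for illustration.
import subprocess
import sys

IMPLEMENTATIONS = [
    "tau_bench_openai_agents",
    "tau_bench_langgraph",
    "traject_bench_langgraph",
]

def build_commands(num_tasks: int = 20) -> list[list[str]]:
    """Build one run.py invocation per implementation."""
    return [
        [sys.executable, f"{impl}/run.py", "--num-tasks", str(num_tasks)]
        for impl in IMPLEMENTATIONS
    ]

def run_all(num_tasks: int = 20) -> None:
    """Run the three implementations sequentially, failing fast on error."""
    for cmd in build_commands(num_tasks):
        subprocess.run(cmd, check=True)
```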

These examples demonstrate how to instrument agent frameworks with Phoenix and provide real agent traces for testing trajectory evaluation features.
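For readers unfamiliar with OpenInference, the setup the summary refers to typically looks like the sketch below. It assumes a Phoenix instance running locally and the `arize-phoenix-otel` and `openinference-instrumentation-langchain` packages; the project name and endpoint are placeholder values, not taken from this PR.

```python
# Sketch of a standard OpenInference + Phoenix tracing setup (tracing
# configuration only; requires a running Phoenix instance). The project
# name and endpoint below are hypothetical placeholders.
from phoenix.otel import register
from openinference.instrumentation.langchain import LangChainInstrumentor

# Point the OTLP exporter at Phoenix and tag spans with a project name.
tracer_provider = register(
    project_name="agent-examples",  # hypothetical project name
    endpoint="http://localhost:6006/v1/traces",
)

# Auto-instrument LangChain/LangGraph so LLM and tool calls become spans.
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)
```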

ehutt added 10 commits March 25, 2026 10:43

- dataset exploration
- implement openai agents sdk for tau-bench
- use dev/train examples
- add taubench langraph example
- implement tau-bench LangGraph agent:
  Port the tau-bench retail customer service agent from the OpenAI Agents SDK
  to LangGraph. Uses StateGraph(MessagesState) with agent/tools nodes, a
  multi-turn conversation loop with a simulated user, and LangChain
  OpenInference instrumentation for Phoenix tracing.
- implement stage 7: scaled runs & failure mode analysis pipeline:
  Adds 20-task stratified selections for all three implementations, a batch
  runner, trace extraction, deterministic comparison, and LLM-powered error
  analysis with Claude synthesis.
- run full stage 7 pipeline: 60 tasks, comparison, and error analysis. Results:
  - tau-openai: 20/20 ran, 0% exact match (heavy over-calling with lookup tools)
  - tau-langgraph: 20/20 ran, 0% exact match (similar pattern, slightly better recall)
  - traject-langgraph: 20/20 ran, 80% exact match (tool selection accurate)
  Top failure modes: missing_tool (19), redundant_call (15), tool_error_mishandled (15).
  Top recommended evaluators: tool_selection, error_handling, parameter_accuracy.
- add deep trajectory analysis with Phoenix trace annotations:
  2,320 annotations written across 60 tasks (8 score types per root span).
  Catastrophic: 16/60 (27%), Suboptimal: 16/60 (27%), Clean: 28/60 (47%).
  Top failure modes: wrong_params (29), null_params_sent (22), missing_mutations (15).
- add comprehensive failure mode analysis report:
  Detailed analysis of 60 agent trajectories: 27% catastrophic, 27% suboptimal,
  47% clean success. Key finding: compounding error cascades (tool error →
  escalation → all downstream tools skipped) account for 56% of catastrophic
  failures. The report includes a cross-framework comparison, a failure taxonomy,
  and prioritized eval metric recommendations.
- clean up example for PR: add README, remove run artifacts:
  Add a comprehensive README.md with setup, usage, and architecture docs;
  remove results/ (generated artifacts, not source) and exploration/ (dev
  exploration scripts) from tracking; update .gitignore to cover results,
  vendor, __pycache__, .venv.
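The "multi-turn conversation loop with simulated user" mentioned in the commit above can be sketched without any framework: the agent and the user simulator alternate turns until the simulator signals a stop. This is a minimal, dependency-free illustration; the stub replies and the `STOP_TOKEN` value are hypothetical, and the real examples call LLMs for both roles (the simulator's stop check is named `is_stop`, as noted later in review).

```python
# Dependency-free sketch of a tau-bench-style multi-turn loop.
# Stub replies and STOP_TOKEN are hypothetical stand-ins for LLM calls.
STOP_TOKEN = "###STOP###"

def agent_respond(history: list[dict]) -> str:
    """Stand-in for the LLM agent (real code calls the model + tools)."""
    return "How can I help with your order?"

class UserSim:
    """Stand-in for the simulated user driving the conversation."""
    def __init__(self, max_turns: int = 3):
        self.turns = 0
        self.max_turns = max_turns

    def respond(self, agent_msg: str) -> str:
        self.turns += 1
        if self.turns >= self.max_turns:
            return STOP_TOKEN
        return "I want to return item #42."

    def is_stop(self, msg: str) -> bool:  # note: is_stop, not is_done
        return STOP_TOKEN in msg

def run_conversation(user: UserSim) -> list[dict]:
    """Alternate agent/user turns until the simulator stops."""
    history = [{"role": "user", "content": "Hi, I need help."}]
    while True:
        reply = agent_respond(history)
        history.append({"role": "assistant", "content": reply})
        user_msg = user.respond(reply)
        if user.is_stop(user_msg):
            break
        history.append({"role": "user", "content": user_msg})
    return history
```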
@ehutt ehutt requested review from a team as code owners March 25, 2026 18:09
@github-project-automation github-project-automation Bot moved this to 📘 Todo in phoenix Mar 25, 2026
@dosubot dosubot Bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Mar 25, 2026
mintlify Bot commented Mar 25, 2026

Preview deployment for your docs. Learn more about Mintlify Previews.

Project: arize-phoenix · Status: 🟢 Ready · Updated (UTC): Mar 25, 2026, 6:11 PM

rename to examples/agents, remove analysis scripts:

- Rename examples/agent-trajectory-evals/ -> examples/agents/
- Remove analysis/annotation scripts (analyze_errors, annotate_traces,
  compare_trajectories, extract_traces); keep only the agent implementations
- Remove exploration/ dev artifacts
- Rewrite the README to focus on the agent examples and how to run them
@ehutt ehutt changed the title from "Add agent trajectory evaluation example" to "docs: add agent example implementations with Phoenix tracing" Mar 25, 2026
claude Bot commented Mar 25, 2026

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

@ehutt ehutt linked an issue Mar 27, 2026 that may be closed by this pull request
docs: improve agent example READMEs for onboarding:

- Add READMEs for tau_bench_langgraph and traject_bench_langgraph
- Fix a stale path and a dead link in the tau_bench_openai_agents README
- Add the tau-bench git dependency to requirements.txt
- Add a uv setup option, a HuggingFace download note, and an
  "Exploring Traces in Phoenix" section to the top-level README
@anticorrelator (Contributor) left a comment


1. calculate tool trust boundary (tau_bench_langgraph/tools.py, tau_bench_openai_agents equivalent)
The calculate tool passes the LLM-supplied expression directly to Calculate.invoke(). Since example code gets copy-pasted into real apps, a brief comment noting the trust boundary (e.g., # Caution: in production, sanitize or sandbox LLM-generated expressions) would help prevent users from inadvertently introducing an LLM-controlled code execution vector.

2. Docstring says is_done but the method is is_stop (tau_bench_langgraph/user_sim.py, tau_bench_openai_agents/user_sim.py)
The docstring references is_done, but the actual method is is_stop, so a developer following the docstring will get an AttributeError. Quick fix.

3. Stale directory path in help text (all three run.py files)
The --output help text references examples/agent-trajectory-evals/results/ but the actual directory is examples/agents/. Present in all three run.py files.
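The trust-boundary concern in item 1 can be made concrete. The PR's actual fix is only a cautionary comment, but one way a downstream user could harden a calculate tool is to parse the LLM-supplied expression with Python's `ast` module and evaluate only arithmetic nodes, instead of passing it to an evaluator. This is a sketch under that assumption, not the code in the PR; `safe_calculate` is a hypothetical name.

```python
# Sketch of a hardened calculate tool: parse the LLM-supplied expression
# with ast and allow only arithmetic nodes, rejecting everything else.
# Hypothetical illustration; not the PR's actual fix.
import ast
import operator

_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def safe_calculate(expression: str) -> float:
    """Evaluate a pure-arithmetic expression, rejecting anything else."""
    def _eval(node: ast.AST) -> float:
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        # Names, calls, attributes, subscripts, etc. are all rejected,
        # which closes off the LLM-controlled code-execution vector.
        raise ValueError(f"disallowed expression element: {type(node).__name__}")
    return _eval(ast.parse(expression, mode="eval").body)
```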

@anticorrelator (Contributor) left a comment


lgtm, just left a few comments

@github-project-automation github-project-automation Bot moved this from 📘 Todo to 👍 Approved in phoenix Apr 4, 2026
fix: address PR review comments:

- Add a trust-boundary comment on the calculate tool (both implementations)
- Fix docstring: is_done -> is_stop in user_sim.py (both implementations)
- Fix the stale path in the --output help text in all three run.py files
@ehutt ehutt merged commit 3176c33 into main Apr 8, 2026
42 checks passed
@ehutt ehutt deleted the ehutt/trajectory-evals branch April 8, 2026 17:51
@github-project-automation github-project-automation Bot moved this from 👍 Approved to ✅ Done in phoenix Apr 8, 2026
mikeldking pushed a commit that referenced this pull request Apr 13, 2026

Labels

size:XXL This PR changes 1000+ lines, ignoring generated files.

Projects

Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

[trajectories] add python examples of agent experiments

3 participants