
docs: add agent example implementations with Phoenix tracing #12369

Merged
ehutt merged 13 commits into main from ehutt/trajectory-evals
Apr 8, 2026

Conversation

@ehutt (Contributor) commented Mar 25, 2026

Summary

Adds examples/agents/ with three instrumented agent implementations that run benchmark tasks and export traces to Phoenix:

  • tau-bench + OpenAI Agents SDK — Multi-turn retail customer service agent with 16 tools, user simulation, and policy enforcement
  • tau-bench + LangGraph — Same retail tasks on LangGraph for framework comparison
  • TRAJECT-Bench + LangGraph — Single-turn parallel and sequential tool-calling tasks across multiple domains (e-commerce, travel, finance, etc.)

Each implementation uses standard OpenInference instrumentation to produce LLM and tool spans in Phoenix. A batch orchestrator (run_scaled.py) runs all three implementations with 20 tasks each.
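The internals of run_scaled.py are not shown in this PR description; the following is a minimal, hedged sketch of how such a batch orchestrator might be structured. The three implementation directory names match the PR, but the `run.py` entry points and the `--num-tasks` flag are assumptions for illustration only.

```python
# Hypothetical sketch of a batch orchestrator like run_scaled.py.
# The implementation directories are from the PR, but the run.py
# CLI shape (--num-tasks) is an assumption for illustration.
import subprocess
import sys

IMPLEMENTATIONS = [
    "tau_bench_openai_agents",
    "tau_bench_langgraph",
    "traject_bench_langgraph",
]

def build_commands(num_tasks: int = 20) -> list[list[str]]:
    """Build one run.py invocation per implementation."""
    return [
        [sys.executable, f"{impl}/run.py", "--num-tasks", str(num_tasks)]
        for impl in IMPLEMENTATIONS
    ]

def run_all(num_tasks: int = 20) -> None:
    """Run the three implementations sequentially, failing fast on error."""
    for cmd in build_commands(num_tasks):
        subprocess.run(cmd, check=True)
```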

These examples demonstrate how to instrument agent frameworks with Phoenix and provide real agent traces for testing trajectory evaluation features.
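For readers unfamiliar with OpenInference, the setup the summary refers to typically looks like the sketch below. It assumes a Phoenix instance running locally and the `arize-phoenix-otel` and `openinference-instrumentation-langchain` packages; the project name and endpoint are placeholder values, not taken from this PR.

```python
# Sketch of a standard OpenInference + Phoenix tracing setup (tracing
# configuration only; requires a running Phoenix instance). The project
# name and endpoint below are hypothetical placeholders.
from phoenix.otel import register
from openinference.instrumentation.langchain import LangChainInstrumentor

# Point the OTLP exporter at Phoenix and tag spans with a project name.
tracer_provider = register(
    project_name="agent-examples",  # hypothetical project name
    endpoint="http://localhost:6006/v1/traces",
)

# Auto-instrument LangChain/LangGraph so LLM and tool calls become spans.
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)
```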

ehutt added 10 commits March 25, 2026 10:43

- dataset exploration
- implement openai agents sdk for tau-bench
- use dev/train examples
- add taubench langraph example
- implement tau-bench LangGraph agent:
  Port the tau-bench retail customer service agent from the OpenAI Agents SDK
  to LangGraph. Uses StateGraph(MessagesState) with agent/tools nodes, a
  multi-turn conversation loop with a simulated user, and LangChain
  OpenInference instrumentation for Phoenix tracing.
- implement stage 7: scaled runs & failure mode analysis pipeline:
  Adds 20-task stratified selections for all three implementations, a batch
  runner, trace extraction, deterministic comparison, and LLM-powered error
  analysis with Claude synthesis.
- run full stage 7 pipeline: 60 tasks, comparison, and error analysis. Results:
  - tau-openai: 20/20 ran, 0% exact match (heavy over-calling with lookup tools)
  - tau-langgraph: 20/20 ran, 0% exact match (similar pattern, slightly better recall)
  - traject-langgraph: 20/20 ran, 80% exact match (tool selection accurate)
  Top failure modes: missing_tool (19), redundant_call (15), tool_error_mishandled (15).
  Top recommended evaluators: tool_selection, error_handling, parameter_accuracy.
- add deep trajectory analysis with Phoenix trace annotations:
  2,320 annotations written across 60 tasks (8 score types per root span).
  Catastrophic: 16/60 (27%), Suboptimal: 16/60 (27%), Clean: 28/60 (47%).
  Top failure modes: wrong_params (29), null_params_sent (22), missing_mutations (15).
- add comprehensive failure mode analysis report:
  Detailed analysis of 60 agent trajectories: 27% catastrophic, 27% suboptimal,
  47% clean success. Key finding: compounding error cascades (tool error →
  escalation → all downstream tools skipped) account for 56% of catastrophic
  failures. The report includes a cross-framework comparison, a failure taxonomy,
  and prioritized eval metric recommendations.
- clean up example for PR: add README, remove run artifacts:
  Add a comprehensive README.md with setup, usage, and architecture docs;
  remove results/ (generated artifacts, not source) and exploration/ (dev
  exploration scripts) from tracking; update .gitignore to cover results,
  vendor, __pycache__, .venv.
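The "multi-turn conversation loop with simulated user" mentioned in the commit above can be sketched without any framework: the agent and the user simulator alternate turns until the simulator signals a stop. This is a minimal, dependency-free illustration; the stub replies and the `STOP_TOKEN` value are hypothetical, and the real examples call LLMs for both roles (the simulator's stop check is named `is_stop`, as noted later in review).

```python
# Dependency-free sketch of a tau-bench-style multi-turn loop.
# Stub replies and STOP_TOKEN are hypothetical stand-ins for LLM calls.
STOP_TOKEN = "###STOP###"

def agent_respond(history: list[dict]) -> str:
    """Stand-in for the LLM agent (real code calls the model + tools)."""
    return "How can I help with your order?"

class UserSim:
    """Stand-in for the simulated user driving the conversation."""
    def __init__(self, max_turns: int = 3):
        self.turns = 0
        self.max_turns = max_turns

    def respond(self, agent_msg: str) -> str:
        self.turns += 1
        if self.turns >= self.max_turns:
            return STOP_TOKEN
        return "I want to return item #42."

    def is_stop(self, msg: str) -> bool:  # note: is_stop, not is_done
        return STOP_TOKEN in msg

def run_conversation(user: UserSim) -> list[dict]:
    """Alternate agent/user turns until the simulator stops."""
    history = [{"role": "user", "content": "Hi, I need help."}]
    while True:
        reply = agent_respond(history)
        history.append({"role": "assistant", "content": reply})
        user_msg = user.respond(reply)
        if user.is_stop(user_msg):
            break
        history.append({"role": "user", "content": user_msg})
    return history
```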
@ehutt ehutt requested review from a team as code owners March 25, 2026 18:09
@github-project-automation github-project-automation Bot moved this to 📘 Todo in phoenix Mar 25, 2026
@dosubot dosubot Bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Mar 25, 2026
mintlify Bot commented Mar 25, 2026

Preview deployment for your docs. Learn more about Mintlify Previews.

Project: arize-phoenix · Status: 🟢 Ready · Updated (UTC): Mar 25, 2026, 6:11 PM

rename to examples/agents, remove analysis scripts:

- Rename examples/agent-trajectory-evals/ -> examples/agents/
- Remove analysis/annotation scripts (analyze_errors, annotate_traces,
  compare_trajectories, extract_traces); keep only the agent implementations
- Remove exploration/ dev artifacts
- Rewrite the README to focus on the agent examples and how to run them
@ehutt ehutt changed the title from "Add agent trajectory evaluation example" to "docs: add agent example implementations with Phoenix tracing" Mar 25, 2026
claude Bot commented Mar 25, 2026

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

@ehutt ehutt linked an issue Mar 27, 2026 that may be closed by this pull request
docs: improve agent example READMEs for onboarding:

- Add READMEs for tau_bench_langgraph and traject_bench_langgraph
- Fix a stale path and a dead link in the tau_bench_openai_agents README
- Add the tau-bench git dependency to requirements.txt
- Add a uv setup option, a HuggingFace download note, and an
  "Exploring Traces in Phoenix" section to the top-level README
@anticorrelator (Contributor) left a comment


1. calculate tool trust boundary (tau_bench_langgraph/tools.py, tau_bench_openai_agents equivalent)
The calculate tool passes the LLM-supplied expression directly to Calculate.invoke(). Since example code gets copy-pasted into real apps, a brief comment noting the trust boundary (e.g., # Caution: in production, sanitize or sandbox LLM-generated expressions) would help prevent users from inadvertently introducing an LLM-controlled code execution vector.

2. Docstring says is_done but the method is is_stop (tau_bench_langgraph/user_sim.py, tau_bench_openai_agents/user_sim.py)
The docstring references is_done, but the actual method is is_stop, so a developer following the docstring will get an AttributeError. Quick fix.

3. Stale directory path in help text (all three run.py files)
The --output help text references examples/agent-trajectory-evals/results/ but the actual directory is examples/agents/. Present in all three run.py files.
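The trust-boundary concern in item 1 can be made concrete. The PR's actual fix is only a cautionary comment, but one way a downstream user could harden a calculate tool is to parse the LLM-supplied expression with Python's `ast` module and evaluate only arithmetic nodes, instead of passing it to an evaluator. This is a sketch under that assumption, not the code in the PR; `safe_calculate` is a hypothetical name.

```python
# Sketch of a hardened calculate tool: parse the LLM-supplied expression
# with ast and allow only arithmetic nodes, rejecting everything else.
# Hypothetical illustration; not the PR's actual fix.
import ast
import operator

_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def safe_calculate(expression: str) -> float:
    """Evaluate a pure-arithmetic expression, rejecting anything else."""
    def _eval(node: ast.AST) -> float:
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        # Names, calls, attributes, subscripts, etc. are all rejected,
        # which closes off the LLM-controlled code-execution vector.
        raise ValueError(f"disallowed expression element: {type(node).__name__}")
    return _eval(ast.parse(expression, mode="eval").body)
```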

@anticorrelator (Contributor) left a comment


lgtm, just left a few comments

@github-project-automation github-project-automation Bot moved this from 📘 Todo to 👍 Approved in phoenix Apr 4, 2026
fix: address PR review comments:

- Add a trust-boundary comment on the calculate tool (both implementations)
- Fix docstring: is_done -> is_stop in user_sim.py (both implementations)
- Fix the stale path in the --output help text in all three run.py files
@ehutt ehutt merged commit 3176c33 into main Apr 8, 2026
42 checks passed
@ehutt ehutt deleted the ehutt/trajectory-evals branch April 8, 2026 17:51
@github-project-automation github-project-automation Bot moved this from 👍 Approved to ✅ Done in phoenix Apr 8, 2026
mikeldking pushed a commit that referenced this pull request Apr 13, 2026

Labels

size:XXL This PR changes 1000+ lines, ignoring generated files.

Projects

Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

[trajectories] add python examples of agent experiments

3 participants