docs: add agent example implementations with Phoenix tracing #12369
Conversation
Port the tau-bench retail customer service agent from OpenAI Agents SDK to LangGraph. Uses StateGraph(MessagesState) with agent/tools nodes, multi-turn conversation loop with simulated user, and LangChain OpenInference instrumentation for Phoenix tracing.
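The multi-turn loop alternates agent turns with the simulated user until the simulator signals a stop. A minimal stdlib-only sketch of that loop, where `SimulatedUser` and `agent_step` are hypothetical stand-ins for the real tau-bench user simulator and the LangGraph agent node:

```python
class SimulatedUser:
    """Hypothetical stand-in for the tau-bench user simulator."""

    STOP = "###STOP###"

    def __init__(self, scripted_turns):
        self._turns = iter(scripted_turns)

    def next_message(self, agent_reply):
        # Return the next scripted user turn, or the stop marker when done.
        return next(self._turns, self.STOP)

    def is_stop(self, message):
        # Note: the simulator's method is is_stop (not is_done).
        return message == self.STOP


def agent_step(messages):
    """Hypothetical agent node; the real one is a LangGraph node calling an LLM."""
    return f"agent reply to: {messages[-1]}"


def run_conversation(user, first_user_message, max_turns=10):
    messages = [first_user_message]
    for _ in range(max_turns):
        messages.append(agent_step(messages))
        user_msg = user.next_message(messages[-1])
        if user.is_stop(user_msg):
            break
        messages.append(user_msg)
    return messages


convo = run_conversation(SimulatedUser(["thanks", "that works"]), "hi, I need a refund")
```

In the real implementation the agent turn runs the whole StateGraph (agent node plus tool nodes) before control returns to the simulated user.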
Adds 20-task stratified selections for all three implementations, batch runner, trace extraction, deterministic comparison, and LLM-powered error analysis with Claude synthesis.
Results:
- tau-openai: 20/20 ran, 0% exact match (heavy over-calling with lookup tools)
- tau-langgraph: 20/20 ran, 0% exact match (similar pattern, slightly better recall)
- traject-langgraph: 20/20 ran, 80% exact match (tool selection accurate)

Top failure modes: missing_tool (19), redundant_call (15), tool_error_mishandled (15). Top recommended evaluators: tool_selection, error_handling, parameter_accuracy.
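The deterministic comparison step reduces to set arithmetic over expected versus actual tool calls, which is also where the `missing_tool` and `redundant_call` counts above come from. A hedged stdlib sketch (`compare_trajectory` is a hypothetical helper, not the removed script's actual API):

```python
from collections import Counter


def compare_trajectory(expected_calls, actual_calls):
    """Classify one task's differences between expected and actual tool calls.

    Hypothetical sketch of a deterministic trajectory comparison:
    multiset difference in each direction yields the two failure modes.
    """
    expected, actual = Counter(expected_calls), Counter(actual_calls)
    return {
        "exact_match": expected == actual,
        "missing_tool": sorted((expected - actual).elements()),
        "redundant_call": sorted((actual - expected).elements()),
    }


report = compare_trajectory(
    ["find_user", "get_order", "refund_order"],
    ["find_user", "find_user", "get_order"],
)
```

Here the agent over-called `find_user` and never reached `refund_order`, matching the over-calling pattern reported for the tau implementations.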
2,320 annotations written across 60 tasks (8 score types per root span). Catastrophic: 16/60 (27%), Suboptimal: 16/60 (27%), Clean: 28/60 (47%). Top failure modes: wrong_params (29), null_params_sent (22), missing_mutations (15).
Detailed analysis of 60 agent trajectories: 27% catastrophic, 27% suboptimal, 47% clean success. Key finding: compounding error cascades (tool error → escalation → all downstream tools skipped) account for 56% of catastrophic failures. Report includes cross-framework comparison, failure taxonomy, and prioritized eval metric recommendations.
- Add comprehensive README.md with setup, usage, architecture docs
- Remove results/ from tracking (generated artifacts, not source)
- Remove exploration/ from tracking (dev exploration scripts)
- Update .gitignore to cover results, vendor, __pycache__, .venv
- Rename examples/agent-trajectory-evals/ -> examples/agents/
- Remove analysis/annotation scripts (analyze_errors, annotate_traces, compare_trajectories, extract_traces); keep only agent implementations
- Remove exploration/ dev artifacts
- Rewrite README to focus on agent examples and how to run them
Code review: No issues found. Checked for bugs and CLAUDE.md compliance.
- Add READMEs for tau_bench_langgraph and traject_bench_langgraph
- Fix stale path and dead link in tau_bench_openai_agents README
- Add tau-bench git dependency to requirements.txt
- Add uv setup option, HuggingFace download note, and "Exploring Traces in Phoenix" section to top-level README
1. calculate tool trust boundary (tau_bench_langgraph/tools.py, tau_bench_openai_agents equivalent)
The calculate tool passes the LLM-supplied expression directly to Calculate.invoke(). Since example code gets copy-pasted into real apps, a brief comment noting the trust boundary (e.g., # Caution: in production, sanitize or sandbox LLM-generated expressions) would help prevent users from inadvertently introducing an LLM-controlled code execution vector.
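Beyond the comment, one way to make the trust boundary concrete is an allow-listed AST evaluator instead of passing the expression to `eval()`. A stdlib sketch (`safe_calculate` is a hypothetical helper, not the example's actual tool):

```python
import ast
import operator

# Caution: the LLM controls `expression`; never pass it to eval()/exec().
# Only arithmetic on numeric literals is permitted below.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.USub: operator.neg,
}


def safe_calculate(expression: str) -> float:
    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"disallowed expression: {expression!r}")

    return _eval(ast.parse(expression, mode="eval").body)
```

Function calls, attribute access, and names all fall through to the `ValueError`, so `__import__('os')`-style payloads are rejected rather than executed.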
2. Docstring says is_done but method is is_stop (tau_bench_langgraph/user_sim.py, tau_bench_openai_agents/user_sim.py)
The docstring references is_done but the actual method is is_stop. A developer following the docstring will get an AttributeError. Quick fix.
3. Stale directory path in help text (all three run.py files)
The --output help text references examples/agent-trajectory-evals/results/ but the actual directory is examples/agents/. Present in all three run.py files.
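The fix itself is a one-string change in each run.py's argparse setup; a hypothetical sketch (the flag's default value and exact wording here are illustrative, not the files' actual contents):

```python
import argparse

parser = argparse.ArgumentParser()
# Fixed: help text now references examples/agents/, not the stale
# examples/agent-trajectory-evals/results/ path.
parser.add_argument(
    "--output",
    default="results",
    help="Directory for run artifacts (relative to examples/agents/)",
)
args = parser.parse_args([])
```

Since the help string is documentation only, the stale path would not break runs, but it would mislead anyone following it to find their output files.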
- Add trust boundary comment on calculate tool (both implementations)
- Fix docstring: is_done -> is_stop in user_sim.py (both implementations)
- Fix stale path in --output help text in all three run.py files
Summary
Adds examples/agents/ with three instrumented agent implementations (tau_bench_openai_agents, tau_bench_langgraph, traject_bench_langgraph) that run benchmark tasks and export traces to Phoenix. Each implementation uses standard OpenInference instrumentation to produce LLM and tool spans in Phoenix. A batch orchestrator (run_scaled.py) runs all three implementations with 20 tasks each.

These examples demonstrate how to instrument agent frameworks with Phoenix and provide real agent traces for testing trajectory evaluation features.
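The orchestration pattern behind run_scaled.py can be sketched with the stdlib alone; `run_one` is a hypothetical stand-in for invoking an implementation's run.py on a single task, and the directory names follow the READMEs added in this PR:

```python
IMPLEMENTATIONS = [
    "tau_bench_openai_agents",
    "tau_bench_langgraph",
    "traject_bench_langgraph",
]


def run_one(impl: str, task_id: int) -> dict:
    """Hypothetical stand-in: the real orchestrator invokes each
    implementation's run.py on one benchmark task."""
    return {"impl": impl, "task": task_id, "status": "ok"}


def run_scaled(n_tasks: int = 20) -> list[dict]:
    # Run every implementation over the same stratified task selection
    # so the resulting Phoenix traces are directly comparable.
    return [
        run_one(impl, task_id)
        for impl in IMPLEMENTATIONS
        for task_id in range(n_tasks)
    ]


results = run_scaled()
```

With the PR's defaults (3 implementations x 20 tasks) this produces the 60 trajectories analyzed in the conversation above.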