feat: atif to trace trajectory conversion utility #12414
Merged
Adds helpers/atif module that validates ATIF trajectories (v1.0-v1.6), converts them to hierarchical OTel-compatible span trees, and uploads them via client.spans.log_spans(). Includes test fixtures and 46 tests.
Verified against the Harbor reference implementation (laude-institute/harbor):
- agent.model_name is now optional (was incorrectly required)
- step.timestamp is now optional (was incorrectly required)
- observation allowed on any step source (was restricted to agent-only)
- source_call_id and content are optional on ObservationResult
- Removed non-spec fields: environment on root, latency_ms on metrics, total_tool_calls/total_latency_ms on final_metrics
- Added missing spec fields: notes, continued_trajectory_ref, tool_definitions, reasoning_effort, is_copied_context, subagent_trajectory_ref
- Conversion handles missing timestamps with a fallback strategy
- Conversion handles multimodal message content (list[ContentPart])
- Fixed test fixtures to match real ATIF structure
- Added tests for optional fields and the real Harbor trajectory format
Directly read the Harbor Pydantic models from the installed package (harbor/models/trajectories/*.py) and the ATIF RFC from GitHub to verify every field definition. Fixes from this audit:
- Restore tool_calls and metrics to agent-only fields (per Step.validate_agent_only_fields)
- Handle "usage" field as fallback for "metrics" (real HuggingFace trajectories use it)
- Add real ATIF test fixture from obaydata/mcp-agent-trajectory-benchmark on HuggingFace
- Add tests for real HuggingFace accountant trajectory (12 steps, 4 tool calls)
- Add test for failed Harbor Claude Code trajectory (2 steps, 0 tokens)
- Verification script now uploads HuggingFace sample trajectories
The HuggingFace mcp-agent-trajectory-benchmark files fail Harbor's own Pydantic validation (step-level "usage" is not a spec field). Move test out of TestValidTrajectories into TestNonConformantTrajectories with clear docstrings explaining this is an adapter bug we handle gracefully, not valid ATIF we endorse.
The HuggingFace mcp-agent-trajectory-benchmark files use step-level "usage" instead of "metrics" — a bug in that dataset's adapter that fails Harbor's own Pydantic validation. We should not silently accept invalid ATIF. Removed:
- "usage" fallback in _convert.py (only reads the spec "metrics" field now)
- Non-conformant fixture and tests
- sample_trajectories/ directory
- HuggingFace section from the verification script

All remaining tests use spec-conformant ATIF data only (55 passing).
Phoenix stores attributes.metadata as a dict, not a serialized JSON string. Removes json.dumps() wrapping on metadata attributes so they render properly in the UI.
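A minimal sketch of the change described above; the surrounding attribute dict and metadata values are illustrative, not the module's actual span-building code:

```python
import json

# Hypothetical span metadata; only the shape of the "metadata" attribute matters here.
metadata = {"is_continuation": True, "schema_version": "ATIF v1.6"}

# Before the fix: metadata was serialized, so Phoenix rendered a raw JSON string.
stringified = {"metadata": json.dumps(metadata)}
# After the fix: metadata stays a dict and renders as structured data in the UI.
structured = {"metadata": metadata}

assert isinstance(stringified["metadata"], str)
assert isinstance(structured["metadata"], dict)
```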
Compared ATIF spans against tau-bench-openai traces and added:
- session.id on all spans (from ATIF session_id)
- input.value as JSON on LLM spans (matching real trace format)
- tool_call.id in output messages (from ATIF tool_call_id)
- llm.token_count.prompt_details.cache_read (from ATIF cached_tokens)
- llm.tools on LLM spans (from ATIF agent.tool_definitions, v1.5+)
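To make these additions concrete, a hypothetical flattened attribute dict for a single LLM span; the keys follow the OpenInference names mentioned in this PR, while every value is invented:

```python
# Invented example values; keys are the OpenInference attribute names named above.
llm_span_attributes = {
    "session.id": "sess-42",
    "llm.model_name": "gpt-4o",
    "llm.token_count.prompt": 1200,
    "llm.token_count.completion": 85,
    "llm.token_count.prompt_details.cache_read": 900,  # from ATIF cached_tokens
    "llm.output_messages.0.message.tool_calls.0.tool_call.id": "call_1",
}

# Every key lives under a known OpenInference namespace.
assert all(k.startswith(("llm.", "session.")) for k in llm_span_attributes)
```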
Matches the convention in real instrumented traces, where LLM spans have generic names. Step numbering was arbitrary and confusing, since gaps appeared when user/system steps were interleaved.
Previous approach only collected messages since the last agent step, which broke for consecutive agent steps (empty input) and lost context for non-consecutive ones. Now accumulates all prior messages (user, system, and assistant) as the conversation history, approximating what the LLM would have received as its prompt.
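A hypothetical sketch of the history-accumulation strategy described above; the step shape loosely mirrors ATIF, but the helper itself is illustrative, not the converter's actual code:

```python
from typing import Any, Dict, List

def collect_input_messages(steps: List[Dict[str, Any]], agent_index: int) -> List[Dict[str, Any]]:
    """Return all messages prior to the agent step at `agent_index`.

    Unlike a since-last-agent-step window, this keeps user, system, and
    earlier assistant messages, so consecutive agent steps still get input.
    """
    history: List[Dict[str, Any]] = []
    for step in steps[:agent_index]:
        message = step.get("message")
        if message is not None:
            history.append({"role": step.get("source", "user"), "content": message})
    return history

steps = [
    {"source": "system", "message": "You are a helpful agent."},
    {"source": "user", "message": "What is 2+2?"},
    {"source": "agent", "message": "4"},
    {"source": "agent", "message": "Anything else?"},
]
# The second agent step still sees the full prior conversation:
assert len(collect_input_messages(steps, 3)) == 3
```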
The conversation history now includes assistant messages with their tool_calls, and tool-role messages with the observation results matched by tool_call_id. This means the final LLM span's input messages are a complete record of the full conversation, usable for trajectory evals.
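A sketch of matching observation results to tool calls by tool_call_id, as described above; the field names (`source_call_id`, `tool_call_id`) come from this PR's ATIF notes, but the data and helper logic are illustrative:

```python
# Invented sample data shaped like the ATIF fields discussed in this PR.
tool_calls = [{"tool_call_id": "call_1", "tool_name": "ls"}]
observations = [{"source_call_id": "call_1", "content": "file.txt"}]

# Index observations by the call they answer, then emit tool-role messages.
by_id = {o["source_call_id"]: o for o in observations}
tool_messages = [
    {
        "role": "tool",
        "tool_call_id": tc["tool_call_id"],
        "content": by_id[tc["tool_call_id"]]["content"],
    }
    for tc in tool_calls
]

assert tool_messages[0]["content"] == "file.txt"
```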
Code review: No issues found. Checked for bugs and CLAUDE.md compliance.
Add targeted pyright rule suppressions for files that work with loosely-typed ATIF Dict[str, Any] data, which inherently produces partially-unknown types from .get() calls in strict mode.
mikeldking
reviewed
Mar 30, 2026
mikeldking
approved these changes
Mar 30, 2026
…den fixtures

Address PR review feedback and add new features:
- Fix ATIFContentPart to match Harbor spec (source object, not image_url)
- Fix model_name typing to Optional[str] on ATIFAgent
- Remove tool_calls/metrics from agent-only fields (allowed on any source)
- Pull observation validation out of the tool_calls block
- Add schema version warning (minor > 6) and rejection (major >= 2)
- Map cost_usd to llm.cost.total on LLM and root spans
- Add multimodal content handling (message.contents attributes, image parts)
- Add _build_subagent_ref_map for cross-trajectory parent-child linking
- Rename to batch API: upload_atif_trajectories_as_spans(client, [...], project_name=)
- Add 9 Harbor golden trajectory fixtures (OpenHands v1.5, Terminus-2 v1.6)
- Add synthetic fixtures for multimodal, parallel mixed results, subagent linking
- 109 tests covering all new features and Harbor golden files
Harbor's golden test files all use session_id="NORMALIZED_SESSION_ID" as a placeholder. Our deterministic ID generation (SHA-256 from session_id) produced identical trace/span IDs for all 6 files, causing them to merge into one malformed trace with overlapping spans. Replace with unique descriptive session IDs per fixture. Also consolidate verification script to upload everything into a single project, and improve the synthetic subagent fixture with realistic copied context (matching Harbor's is_copied_context pattern).
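A sketch of why the shared placeholder collided; the exact derivation in the helper may differ, but this shows the core property of hashing `session_id` with SHA-256:

```python
import hashlib

def trace_id_for(session_id: str) -> str:
    # Illustrative scheme: OTel trace IDs are 16 bytes (32 hex chars),
    # so take a SHA-256 prefix of the session_id.
    return hashlib.sha256(session_id.encode("utf-8")).hexdigest()[:32]

a = trace_id_for("NORMALIZED_SESSION_ID")
b = trace_id_for("NORMALIZED_SESSION_ID")
c = trace_id_for("openhands-golden-01")  # hypothetical descriptive fixture ID

assert a == b  # identical placeholders -> identical trace IDs (the bug)
assert a != c  # unique descriptive session IDs -> distinct traces
```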
Replace placeholder file paths with real public image URLs (Wikipedia dice PNG, Hokusai wave JPEG) so the multimodal content is visible when inspecting traces in the Phoenix UI.
…tions

Major converter redesign based on PR review feedback:
- Remove CHAIN spans for user/system messages. They now only appear as llm.input_messages on the LLM spans that follow them, matching how real instrumented traces work.
- Single-turn trajectories produce a flat hierarchy: AGENT (root) → LLM → TOOL
- Multi-turn trajectories get nested AGENT spans per user turn: AGENT (root) → AGENT (turn_1) → LLM → TOOL. The first turn includes everything up to the second user message, avoiding empty turns from leading system/context steps.
- Continuation trajectories (session_id ending in -cont-N) share the same trace_id as the original, so they merge into one trace. Span IDs remain distinct (derived from the full session_id).
- Fix continuation fixture session_id to use Harbor's -cont-N convention.
- Update docstrings and comments to reflect the new behavior.
- 124 tests passing.
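A sketch of the -cont-N convention: an assumed regex (the converter's actual parsing may differ) that strips the suffix to get the base session ID that continuations share their trace_id with:

```python
import re

_CONT_RE = re.compile(r"^(?P<base>.+)-cont-(?P<n>\d+)$")

def base_session_id(session_id: str) -> str:
    """session_id minus any -cont-N suffix. Continuations derive their
    trace_id from this base, while span IDs use the full session_id."""
    m = _CONT_RE.match(session_id)
    return m.group("base") if m else session_id

assert base_session_id("sess-42-cont-2") == "sess-42"
assert base_session_id("sess-42") == "sess-42"
```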
- Add is_continuation: true to root span metadata when the trajectory is a continuation (session_id ends in -cont-N)
- Add has_copied_context: true to LLM span metadata when any preceding steps have is_copied_context set (replayed context from summarization)
- 128 tests passing
Document all supported features: trace structure, multi-turn nesting, multi-agent/subagent handoffs, continuation merging, multimodal content, copied context flags, attribute mapping, and deterministic IDs.
The LLM requests tool calls but the agent runtime executes them — they are peers, not parent-child. This aligns with OpenAI, LangChain, and other framework instrumentations. Also adds 1ms timestamp offsets so tools sort after their LLM span in the trace waterfall, fixes pyright TypedDict access errors in tests, and makes the verify script's Phoenix URL configurable.
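A sketch of the 1ms sibling-ordering offset: TOOL spans are started slightly after their peer LLM span so the trace waterfall sorts them afterwards. The timestamps are invented; only the offset pattern reflects the change:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical LLM span start time.
llm_start = datetime(2026, 3, 30, 12, 0, 0, tzinfo=timezone.utc)

# Each sibling TOOL span starts i+1 milliseconds later, so tools sort
# after the LLM span (and after each other) in the waterfall view.
tool_starts = [llm_start + timedelta(milliseconds=i + 1) for i in range(3)]

assert all(t > llm_start for t in tool_starts)
assert tool_starts[0] - llm_start == timedelta(milliseconds=1)
```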
Summary
Add a new `phoenix.client.helpers.atif` module that converts ATIF (Agent Trajectory Interchange Format) trajectories into OpenTelemetry-compatible trace data for Phoenix ingestion. This lets Harbor users upload agent trajectories as JSON and visualize them as structured traces in the Phoenix UI.

Trace hierarchy

Each trajectory produces one trace with a semantically correct span tree. LLM and TOOL spans are siblings under the orchestrating AGENT — matching the causal model (the agent runtime executes tools, not the LLM) and aligning with existing framework instrumentations (OpenAI, LangChain, etc.): AGENT (root) → LLM → TOOL

Multi-turn conversations get nested per-turn AGENT spans: AGENT (root) → AGENT (turn_1) → LLM → TOOL
Multi-agent / subagent handoffs
Trajectories referencing each other via `subagent_trajectory_ref` are automatically linked — the child trajectory's spans nest under the parent's TOOL span within a single trace. Upload parent and child trajectories together in one batch call for linking to work.

Continuation merging

When Harbor splits a session across files (`session_id` ending in `-cont-N`), continuations are automatically detected and merged into the same trace. The full agent session appears as one unified trace.

Multimodal support (v1.6+)
Image content parts are written using the OpenInference `message.contents` array format, with image URLs stored in `message_content.image.image.url`.

Attribute mapping
Full mapping from ATIF fields to OpenInference attributes:
- metrics (`prompt_tokens`, `completion_tokens`, `cached_tokens`) → `llm.token_count.*`
- `cost_usd` → `llm.cost.total`
- `model_name` → `llm.model_name`
- `tool_definitions` → `llm.tools.{i}.tool.json_schema`
- reasoning content → `metadata.reasoning_content`
- `session_id` → `session.id` on all spans
- conversation history → `llm.input_messages`

Other details
- Deterministic trace/span IDs derived from `session_id` via SHA-256 (idempotent re-uploads)
- Batch `upload_atif_trajectories_as_spans(client, trajectories, project_name=...)` entry point

Tests
- Verification script (`verify_atif_upload.py`) uploads all fixtures to a local Phoenix instance for visual inspection

Closes #12170