Skip to content

feat: atif to trace trajectory conversion utility#12414

Merged
ehutt merged 26 commits intomainfrom
ehutt/trajectories-atif-to-trace-conversion
Apr 3, 2026
Merged

feat: atif to trace trajectory conversion utility#12414
ehutt merged 26 commits intomainfrom
ehutt/trajectories-atif-to-trace-conversion

Conversation

@ehutt
Copy link
Copy Markdown
Contributor

@ehutt ehutt commented Mar 27, 2026

Summary

Add a new phoenix.client.helpers.atif module that converts ATIF (Agent Trajectory Interchange Format) trajectories into OpenTelemetry-compatible trace data for Phoenix ingestion. This lets Harbor users upload agent trajectories as JSON and visualize them as structured traces in the Phoenix UI.

Trace hierarchy

Each trajectory produces one trace with a semantically correct span tree. LLM and TOOL spans are siblings under the orchestrating AGENT — matching the causal model (the agent runtime executes tools, not the LLM) and aligning with existing framework instrumentations (OpenAI, LangChain, etc.):

AGENT (root)
├── LLM          ← "I want to call search(q='test')"
├── TOOL         ← agent runtime executes search
├── LLM          ← "Here's what I found..."

Multi-turn conversations get nested per-turn AGENT spans:

AGENT (root)
├── AGENT turn_1
│   ├── LLM
│   ├── TOOL
│   └── LLM
├── AGENT turn_2
│   ├── LLM
│   └── LLM

Multi-agent / subagent handoffs

Trajectories referencing each other via subagent_trajectory_ref are automatically linked — the child trajectory's spans nest under the parent's TOOL span within a single trace. Upload parent and child trajectories together in one batch call for linking to work.

Continuation merging

When Harbor splits a session across files (session_id ending in -cont-N), continuations are automatically detected and merged into the same trace. The full agent session appears as one unified trace.

Multimodal support (v1.6+)

Image content parts are written using the OpenInference message.contents array format with image URLs stored in message_content.image.image.url.

Attribute mapping

Full mapping from ATIF fields to OpenInference attributes:

  • Token counts (prompt_tokens, completion_tokens, cached_tokens) → llm.token_count.*
  • Cost → llm.cost.total
  • Model name → llm.model_name
  • Tool definitions → llm.tools.{i}.tool.json_schema
  • Reasoning content → metadata.reasoning_content
  • Session ID → session.id on all spans
  • Full conversation history reconstructed as llm.input_messages

Other details

  • Deterministic IDs: trace/span IDs derived from session_id via SHA-256 (idempotent re-uploads)
  • Validation: strict validation against ATIF v1.0–v1.6 with clear error messages
  • Batch API: single upload_atif_trajectories_as_spans(client, trajectories, project_name=...) entry point
  • 128 tests covering conversion, validation, integration, real Harbor fixtures, and edge cases

Tests

  • Unit tests for ATIF validation (schema versions, required fields, step structure)
  • Unit tests for span conversion (hierarchy, attributes, timing, deterministic IDs)
  • Integration tests with mock transport (batch upload, subagent linking, error handling)
  • Tests against real Harbor production trajectories (OpenHands, Terminus-2 with subagents, continuations, timeouts)
  • E2E verification script (verify_atif_upload.py) uploads all fixtures to a local Phoenix instance for visual inspection
Screenshot 2026-04-02 at 1 24 49 PM

Closes #12170

ehutt added 13 commits March 26, 2026 16:51
Adds helpers/atif module that validates ATIF trajectories (v1.0-v1.6),
converts them to hierarchical OTel-compatible span trees, and uploads
them via client.spans.log_spans(). Includes test fixtures and 46 tests.
Verified against the Harbor reference implementation (laude-institute/harbor):
- agent.model_name is now optional (was incorrectly required)
- step.timestamp is now optional (was incorrectly required)
- observation allowed on any step source (was restricted to agent-only)
- source_call_id and content are optional on ObservationResult
- Removed non-spec fields: environment on root, latency_ms on metrics,
  total_tool_calls/total_latency_ms on final_metrics
- Added missing spec fields: notes, continued_trajectory_ref, tool_definitions,
  reasoning_effort, is_copied_context, subagent_trajectory_ref
- Conversion handles missing timestamps with fallback strategy
- Conversion handles multimodal message content (list[ContentPart])
- Fixed test fixtures to match real ATIF structure
- Added tests for optional fields and real Harbor trajectory format
Directly read the Harbor Pydantic models from the installed package
(harbor/models/trajectories/*.py) and the ATIF RFC from GitHub to
verify every field definition.

Fixes from this audit:
- Restore tool_calls and metrics to agent-only fields (per Step.validate_agent_only_fields)
- Handle "usage" field as fallback for "metrics" (real HuggingFace trajectories use it)
- Add real ATIF test fixture from obaydata/mcp-agent-trajectory-benchmark on HuggingFace
- Add tests for real HuggingFace accountant trajectory (12 steps, 4 tool calls)
- Add test for failed Harbor Claude Code trajectory (2 steps, 0 tokens)
- Verification script now uploads HuggingFace sample trajectories
The HuggingFace mcp-agent-trajectory-benchmark files fail Harbor's own
Pydantic validation (step-level "usage" is not a spec field). Move test
out of TestValidTrajectories into TestNonConformantTrajectories with
clear docstrings explaining this is an adapter bug we handle gracefully,
not valid ATIF we endorse.
The HuggingFace mcp-agent-trajectory-benchmark files use step-level
"usage" instead of "metrics" — a bug in that dataset's adapter that
fails Harbor's own Pydantic validation. We should not silently accept
invalid ATIF.

Removed:
- "usage" fallback in _convert.py (only reads spec "metrics" field now)
- Non-conformant fixture and tests
- sample_trajectories/ directory
- HuggingFace section from verification script

All remaining tests use spec-conformant ATIF data only (55 passing).
Phoenix stores attributes.metadata as a dict, not a serialized JSON
string. Removes json.dumps() wrapping on metadata attributes so they
render properly in the UI.
Compared ATIF spans against tau-bench-openai traces and added:
- session.id on all spans (from ATIF session_id)
- input.value as JSON on LLM spans (matching real trace format)
- tool_call.id in output messages (from ATIF tool_call_id)
- llm.token_count.prompt_details.cache_read (from ATIF cached_tokens)
- llm.tools on LLM spans (from ATIF agent.tool_definitions, v1.5+)
Matches convention in real instrumented traces where LLM spans have
generic names. Step numbering was arbitrary and confusing since gaps
appeared when user/system steps were interleaved.
Previous approach only collected messages since the last agent step,
which broke for consecutive agent steps (empty input) and lost context
for non-consecutive ones. Now accumulates all prior messages (user,
system, and assistant) as the conversation history, approximating
what the LLM would have received as its prompt.
The conversation history now includes assistant messages with their
tool_calls, and tool-role messages with the observation results matched
by tool_call_id. This means the final LLM span's input messages are a
complete record of the full conversation, usable for trajectory evals.
@ehutt ehutt requested a review from a team as a code owner March 27, 2026 23:00
@github-project-automation github-project-automation Bot moved this to 📘 Todo in phoenix Mar 27, 2026
@dosubot dosubot Bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Mar 27, 2026
@mintlify
Copy link
Copy Markdown
Contributor

mintlify Bot commented Mar 27, 2026

Preview deployment for your docs. Learn more about Mintlify Previews.

Project Status Preview Updated (UTC)
arize-phoenix 🟢 Ready View Preview Mar 27, 2026, 11:01 PM

Comment thread packages/phoenix-client/src/phoenix/client/helpers/atif/_convert.py Fixed
Comment thread packages/phoenix-client/src/phoenix/client/helpers/atif/_convert.py Fixed
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Mar 27, 2026

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

Add targeted pyright rule suppressions for files that work with
loosely-typed ATIF Dict[str, Any] data, which inherently produces
partially-unknown types from .get() calls in strict mode.
Comment thread packages/phoenix-client/src/phoenix/client/helpers/atif/_convert.py
Comment thread packages/phoenix-client/src/phoenix/client/helpers/atif/_convert.py
Comment thread packages/phoenix-client/src/phoenix/client/helpers/atif/_validate.py Outdated
Comment thread packages/phoenix-client/src/phoenix/client/helpers/atif/_convert.py
@github-project-automation github-project-automation Bot moved this from 📘 Todo to 👍 Approved in phoenix Mar 30, 2026
…den fixtures

Address PR review feedback and add new features:

- Fix ATIFContentPart to match Harbor spec (source object, not image_url)
- Fix model_name typing to Optional[str] on ATIFAgent
- Remove tool_calls/metrics from agent-only fields (allowed on any source)
- Pull observation validation out of tool_calls block
- Add schema version warning (minor > 6) and rejection (major >= 2)
- Map cost_usd to llm.cost.total on LLM and root spans
- Add multimodal content handling (message.contents attributes, image parts)
- Add _build_subagent_ref_map for cross-trajectory parent-child linking
- Rename to batch API: upload_atif_trajectories_as_spans(client, [...], project_name=)
- Add 9 Harbor golden trajectory fixtures (OpenHands v1.5, Terminus-2 v1.6)
- Add synthetic fixtures for multimodal, parallel mixed results, subagent linking
- 109 tests covering all new features and Harbor golden files
Harbor's golden test files all use session_id="NORMALIZED_SESSION_ID"
as a placeholder. Our deterministic ID generation (SHA-256 from
session_id) produced identical trace/span IDs for all 6 files, causing
them to merge into one malformed trace with overlapping spans.

Replace with unique descriptive session IDs per fixture. Also
consolidate verification script to upload everything into a single
project, and improve the synthetic subagent fixture with realistic
copied context (matching Harbor's is_copied_context pattern).
Replace placeholder file paths with real public image URLs (Wikipedia
dice PNG, Hokusai wave JPEG) so the multimodal content is visible
when inspecting traces in the Phoenix UI.
…tions

Major converter redesign based on PR review feedback:

- Remove CHAIN spans for user/system messages. They now only appear
  as llm.input_messages on the LLM spans that follow them, matching
  how real instrumented traces work.

- Single-turn trajectories produce flat hierarchy:
  AGENT (root) → LLM → TOOL

- Multi-turn trajectories get nested AGENT spans per user turn:
  AGENT (root) → AGENT (turn_1) → LLM → TOOL
  The first turn includes everything up to the second user message,
  avoiding empty turns from leading system/context steps.

- Continuation trajectories (session_id ending in -cont-N) share
  the same trace_id as the original, so they merge into one trace.
  Span IDs remain distinct (derived from full session_id).

- Fix continuation fixture session_id to use Harbor's -cont-N convention.
- Update docstrings and comments to reflect new behavior.
- 124 tests passing.
- Add is_continuation: true to root span metadata when the trajectory
  is a continuation (session_id ends in -cont-N)
- Add has_copied_context: true to LLM span metadata when any preceding
  steps have is_copied_context set (replayed context from summarization)
- 128 tests passing
Document all supported features: trace structure, multi-turn nesting,
multi-agent/subagent handoffs, continuation merging, multimodal content,
copied context flags, attribute mapping, and deterministic IDs.
The LLM requests tool calls but the agent runtime executes them —
they are peers, not parent-child. This aligns with OpenAI, LangChain,
and other framework instrumentations.

Also adds 1ms timestamp offsets so tools sort after their LLM span
in the trace waterfall, fixes pyright TypedDict access errors in
tests, and makes the verify script's Phoenix URL configurable.
@ehutt ehutt merged commit 28ecfe0 into main Apr 3, 2026
46 checks passed
@ehutt ehutt deleted the ehutt/trajectories-atif-to-trace-conversion branch April 3, 2026 00:47
@github-project-automation github-project-automation Bot moved this from 👍 Approved to ✅ Done in phoenix Apr 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:XXL This PR changes 1000+ lines, ignoring generated files.

Projects

Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

[trajectories] ATIF to trace conversion

3 participants