feat: atif to trace trajectory conversion utility #12414
Merged
Adds helpers/atif module that validates ATIF trajectories (v1.0-v1.6), converts them to hierarchical OTel-compatible span trees, and uploads them via client.spans.log_spans(). Includes test fixtures and 46 tests.
Verified against the Harbor reference implementation (laude-institute/harbor):
- agent.model_name is now optional (was incorrectly required)
- step.timestamp is now optional (was incorrectly required)
- observation allowed on any step source (was restricted to agent-only)
- source_call_id and content are optional on ObservationResult
- Removed non-spec fields: environment on root, latency_ms on metrics, total_tool_calls/total_latency_ms on final_metrics
- Added missing spec fields: notes, continued_trajectory_ref, tool_definitions, reasoning_effort, is_copied_context, subagent_trajectory_ref
- Conversion handles missing timestamps with a fallback strategy
- Conversion handles multimodal message content (list[ContentPart])
- Fixed test fixtures to match real ATIF structure
- Added tests for optional fields and the real Harbor trajectory format
Directly read the Harbor Pydantic models from the installed package (harbor/models/trajectories/*.py) and the ATIF RFC from GitHub to verify every field definition. Fixes from this audit:
- Restore tool_calls and metrics to agent-only fields (per Step.validate_agent_only_fields)
- Handle "usage" field as fallback for "metrics" (real HuggingFace trajectories use it)
- Add real ATIF test fixture from obaydata/mcp-agent-trajectory-benchmark on HuggingFace
- Add tests for real HuggingFace accountant trajectory (12 steps, 4 tool calls)
- Add test for failed Harbor Claude Code trajectory (2 steps, 0 tokens)
- Verification script now uploads HuggingFace sample trajectories
The HuggingFace mcp-agent-trajectory-benchmark files fail Harbor's own Pydantic validation (step-level "usage" is not a spec field). Move test out of TestValidTrajectories into TestNonConformantTrajectories with clear docstrings explaining this is an adapter bug we handle gracefully, not valid ATIF we endorse.
The HuggingFace mcp-agent-trajectory-benchmark files use step-level "usage" instead of "metrics" — a bug in that dataset's adapter that fails Harbor's own Pydantic validation. We should not silently accept invalid ATIF. Removed:
- "usage" fallback in _convert.py (only reads the spec "metrics" field now)
- Non-conformant fixture and tests
- sample_trajectories/ directory
- HuggingFace section from the verification script

All remaining tests use spec-conformant ATIF data only (55 passing).
Phoenix stores attributes.metadata as a dict, not a serialized JSON string. Removes json.dumps() wrapping on metadata attributes so they render properly in the UI.
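A minimal sketch of the change described above; the surrounding attribute dict and metadata values are illustrative, not the module's actual span-building code:

```python
import json

# Hypothetical span metadata; only the shape of the "metadata" attribute matters here.
metadata = {"is_continuation": True, "schema_version": "ATIF v1.6"}

# Before the fix: metadata was serialized, so Phoenix rendered a raw JSON string.
stringified = {"metadata": json.dumps(metadata)}
# After the fix: metadata stays a dict and renders as structured data in the UI.
structured = {"metadata": metadata}

assert isinstance(stringified["metadata"], str)
assert isinstance(structured["metadata"], dict)
```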
Compared ATIF spans against tau-bench-openai traces and added:
- session.id on all spans (from ATIF session_id)
- input.value as JSON on LLM spans (matching real trace format)
- tool_call.id in output messages (from ATIF tool_call_id)
- llm.token_count.prompt_details.cache_read (from ATIF cached_tokens)
- llm.tools on LLM spans (from ATIF agent.tool_definitions, v1.5+)
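To make these additions concrete, a hypothetical flattened attribute dict for a single LLM span; the keys follow the OpenInference names mentioned in this PR, while every value is invented:

```python
# Invented example values; keys are the OpenInference attribute names named above.
llm_span_attributes = {
    "session.id": "sess-42",
    "llm.model_name": "gpt-4o",
    "llm.token_count.prompt": 1200,
    "llm.token_count.completion": 85,
    "llm.token_count.prompt_details.cache_read": 900,  # from ATIF cached_tokens
    "llm.output_messages.0.message.tool_calls.0.tool_call.id": "call_1",
}

# Every key lives under a known OpenInference namespace.
assert all(k.startswith(("llm.", "session.")) for k in llm_span_attributes)
```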
Matches the convention in real instrumented traces, where LLM spans have generic names. Step numbering was arbitrary and confusing, since gaps appeared when user/system steps were interleaved.
Previous approach only collected messages since the last agent step, which broke for consecutive agent steps (empty input) and lost context for non-consecutive ones. Now accumulates all prior messages (user, system, and assistant) as the conversation history, approximating what the LLM would have received as its prompt.
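A hypothetical sketch of the history-accumulation strategy described above; the step shape loosely mirrors ATIF, but the helper itself is illustrative, not the converter's actual code:

```python
from typing import Any, Dict, List

def collect_input_messages(steps: List[Dict[str, Any]], agent_index: int) -> List[Dict[str, Any]]:
    """Return all messages prior to the agent step at `agent_index`.

    Unlike a since-last-agent-step window, this keeps user, system, and
    earlier assistant messages, so consecutive agent steps still get input.
    """
    history: List[Dict[str, Any]] = []
    for step in steps[:agent_index]:
        message = step.get("message")
        if message is not None:
            history.append({"role": step.get("source", "user"), "content": message})
    return history

steps = [
    {"source": "system", "message": "You are a helpful agent."},
    {"source": "user", "message": "What is 2+2?"},
    {"source": "agent", "message": "4"},
    {"source": "agent", "message": "Anything else?"},
]
# The second agent step still sees the full prior conversation:
assert len(collect_input_messages(steps, 3)) == 3
```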
The conversation history now includes assistant messages with their tool_calls, and tool-role messages with the observation results matched by tool_call_id. This means the final LLM span's input messages are a complete record of the full conversation, usable for trajectory evals.
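A sketch of matching observation results to tool calls by tool_call_id, as described above; the field names (`source_call_id`, `tool_call_id`) come from this PR's ATIF notes, but the data and helper logic are illustrative:

```python
# Invented sample data shaped like the ATIF fields discussed in this PR.
tool_calls = [{"tool_call_id": "call_1", "tool_name": "ls"}]
observations = [{"source_call_id": "call_1", "content": "file.txt"}]

# Index observations by the call they answer, then emit tool-role messages.
by_id = {o["source_call_id"]: o for o in observations}
tool_messages = [
    {
        "role": "tool",
        "tool_call_id": tc["tool_call_id"],
        "content": by_id[tc["tool_call_id"]]["content"],
    }
    for tc in tool_calls
]

assert tool_messages[0]["content"] == "file.txt"
```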
Code review: No issues found. Checked for bugs and CLAUDE.md compliance.
Add targeted pyright rule suppressions for files that work with loosely-typed ATIF Dict[str, Any] data, which inherently produces partially-unknown types from .get() calls in strict mode.
mikeldking
reviewed
Mar 30, 2026
mikeldking
approved these changes
Mar 30, 2026
…den fixtures

Address PR review feedback and add new features:
- Fix ATIFContentPart to match Harbor spec (source object, not image_url)
- Fix model_name typing to Optional[str] on ATIFAgent
- Remove tool_calls/metrics from agent-only fields (allowed on any source)
- Pull observation validation out of the tool_calls block
- Add schema version warning (minor > 6) and rejection (major >= 2)
- Map cost_usd to llm.cost.total on LLM and root spans
- Add multimodal content handling (message.contents attributes, image parts)
- Add _build_subagent_ref_map for cross-trajectory parent-child linking
- Rename to batch API: upload_atif_trajectories_as_spans(client, [...], project_name=)
- Add 9 Harbor golden trajectory fixtures (OpenHands v1.5, Terminus-2 v1.6)
- Add synthetic fixtures for multimodal, parallel mixed results, subagent linking
- 109 tests covering all new features and Harbor golden files
Harbor's golden test files all use session_id="NORMALIZED_SESSION_ID" as a placeholder. Our deterministic ID generation (SHA-256 from session_id) produced identical trace/span IDs for all 6 files, causing them to merge into one malformed trace with overlapping spans. Replace with unique descriptive session IDs per fixture. Also consolidate verification script to upload everything into a single project, and improve the synthetic subagent fixture with realistic copied context (matching Harbor's is_copied_context pattern).
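A sketch of why the shared placeholder collided; the exact derivation in the helper may differ, but this shows the core property of hashing `session_id` with SHA-256:

```python
import hashlib

def trace_id_for(session_id: str) -> str:
    # Illustrative scheme: OTel trace IDs are 16 bytes (32 hex chars),
    # so take a SHA-256 prefix of the session_id.
    return hashlib.sha256(session_id.encode("utf-8")).hexdigest()[:32]

a = trace_id_for("NORMALIZED_SESSION_ID")
b = trace_id_for("NORMALIZED_SESSION_ID")
c = trace_id_for("openhands-golden-01")  # hypothetical descriptive fixture ID

assert a == b  # identical placeholders -> identical trace IDs (the bug)
assert a != c  # unique descriptive session IDs -> distinct traces
```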
Replace placeholder file paths with real public image URLs (Wikipedia dice PNG, Hokusai wave JPEG) so the multimodal content is visible when inspecting traces in the Phoenix UI.
…tions

Major converter redesign based on PR review feedback:
- Remove CHAIN spans for user/system messages. They now only appear as llm.input_messages on the LLM spans that follow them, matching how real instrumented traces work.
- Single-turn trajectories produce a flat hierarchy: AGENT (root) → LLM → TOOL
- Multi-turn trajectories get nested AGENT spans per user turn: AGENT (root) → AGENT (turn_1) → LLM → TOOL. The first turn includes everything up to the second user message, avoiding empty turns from leading system/context steps.
- Continuation trajectories (session_id ending in -cont-N) share the same trace_id as the original, so they merge into one trace. Span IDs remain distinct (derived from the full session_id).
- Fix continuation fixture session_id to use Harbor's -cont-N convention.
- Update docstrings and comments to reflect the new behavior.
- 124 tests passing.
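A sketch of the -cont-N convention: an assumed regex (the converter's actual parsing may differ) that strips the suffix to get the base session ID that continuations share their trace_id with:

```python
import re

_CONT_RE = re.compile(r"^(?P<base>.+)-cont-(?P<n>\d+)$")

def base_session_id(session_id: str) -> str:
    """session_id minus any -cont-N suffix. Continuations derive their
    trace_id from this base, while span IDs use the full session_id."""
    m = _CONT_RE.match(session_id)
    return m.group("base") if m else session_id

assert base_session_id("sess-42-cont-2") == "sess-42"
assert base_session_id("sess-42") == "sess-42"
```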
- Add is_continuation: true to root span metadata when the trajectory is a continuation (session_id ends in -cont-N)
- Add has_copied_context: true to LLM span metadata when any preceding steps have is_copied_context set (replayed context from summarization)
- 128 tests passing
Document all supported features: trace structure, multi-turn nesting, multi-agent/subagent handoffs, continuation merging, multimodal content, copied context flags, attribute mapping, and deterministic IDs.
The LLM requests tool calls but the agent runtime executes them — they are peers, not parent-child. This aligns with OpenAI, LangChain, and other framework instrumentations. Also adds 1ms timestamp offsets so tools sort after their LLM span in the trace waterfall, fixes pyright TypedDict access errors in tests, and makes the verify script's Phoenix URL configurable.
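A sketch of the 1ms sibling-ordering offset: TOOL spans are started slightly after their peer LLM span so the trace waterfall sorts them afterwards. The timestamps are invented; only the offset pattern reflects the change:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical LLM span start time.
llm_start = datetime(2026, 3, 30, 12, 0, 0, tzinfo=timezone.utc)

# Each sibling TOOL span starts i+1 milliseconds later, so tools sort
# after the LLM span (and after each other) in the waterfall view.
tool_starts = [llm_start + timedelta(milliseconds=i + 1) for i in range(3)]

assert all(t > llm_start for t in tool_starts)
assert tool_starts[0] - llm_start == timedelta(milliseconds=1)
```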
Summary
Add a new `phoenix.client.helpers.atif` module that converts ATIF (Agent Trajectory Interchange Format) trajectories into OpenTelemetry-compatible trace data for Phoenix ingestion. This lets Harbor users upload agent trajectories as JSON and visualize them as structured traces in the Phoenix UI.

Trace hierarchy

Each trajectory produces one trace with a semantically correct span tree. LLM and TOOL spans are siblings under the orchestrating AGENT — matching the causal model (the agent runtime executes tools, not the LLM) and aligning with existing framework instrumentations (OpenAI, LangChain, etc.): AGENT (root) → LLM → TOOL

Multi-turn conversations get nested per-turn AGENT spans: AGENT (root) → AGENT (turn_1) → LLM → TOOL
Multi-agent / subagent handoffs
Trajectories referencing each other via `subagent_trajectory_ref` are automatically linked — the child trajectory's spans nest under the parent's TOOL span within a single trace. Upload parent and child trajectories together in one batch call for linking to work.

Continuation merging

When Harbor splits a session across files (`session_id` ending in `-cont-N`), continuations are automatically detected and merged into the same trace. The full agent session appears as one unified trace.

Multimodal support (v1.6+)
Image content parts are written using the OpenInference `message.contents` array format, with image URLs stored in `message_content.image.image.url`.

Attribute mapping
Full mapping from ATIF fields to OpenInference attributes:
- metrics (`prompt_tokens`, `completion_tokens`, `cached_tokens`) → `llm.token_count.*`
- `cost_usd` → `llm.cost.total`
- `model_name` → `llm.model_name`
- `tool_definitions` → `llm.tools.{i}.tool.json_schema`
- reasoning content → `metadata.reasoning_content`
- `session_id` → `session.id` on all spans
- conversation history → `llm.input_messages`

Other details
- Deterministic trace/span IDs derived from `session_id` via SHA-256 (idempotent re-uploads)
- Batch `upload_atif_trajectories_as_spans(client, trajectories, project_name=...)` entry point

Tests
- Verification script (`verify_atif_upload.py`) uploads all fixtures to a local Phoenix instance for visual inspection

Closes #12170