Skip to content

fix(js): prevent global tracer provider collisions across experiment phases#12303

Merged
mikeldking merged 9 commits intomainfrom
tracer-collision
Mar 30, 2026
Merged

fix(js): prevent global tracer provider collisions across experiment phases#12303
mikeldking merged 9 commits intomainfrom
tracer-collision

Conversation

@mikeldking
Copy link
Copy Markdown
Collaborator

@mikeldking mikeldking commented Mar 21, 2026

Summary

  • Problem: When runExperiment runs tasks then evaluators, each phase calls provider.register() to set itself as the global OTEL tracer provider. The second call silently replaces the first, causing task-phase spans to be orphaned and preventing clean shutdown of the first provider.
  • Adds attachGlobalTracerProvider / detachGlobalTracerProvider to phoenix-otel — these snapshot the current global OTEL state before mounting a new provider and restore it on detach, so multiple providers can safely take turns owning the global slot (stack-like semantics with out-of-order detach support).
  • Splits provider creation from global registration in experiment code (runExperiment, resumeExperiment, resumeEvaluation) — each phase now calls register({ global: false }) then explicitly attachGlobalTracerProvider(), and wraps its body in try/finally to guarantee forceFlush → shutdown → detach even on errors.
  • Makes phoenix-evals tracer dynamic — instead of capturing a static tracer at module load, the tracer now resolves from trace.getTracer() at call time so evaluator spans follow whichever provider is currently mounted as global.
  • Fixes pnpm proxy mismatchNodeTracerProvider.register() calls trace.setGlobalTracerProvider() using the SDK's internal import of @opentelemetry/api, which in pnpm workspaces can resolve to a different module instance (different symlink paths → different TraceAPI singletons). This caused snapshot/restore to operate on the wrong proxy, leaving the user's original provider unrestored. Fixed by routing all global state mutations through phoenix-otel's own trace/context/propagation imports via a new setGlobalProvider() helper.

Test plan

  • New tests in phoenix-otel verify sequential attach/detach, restore of external provider, and out-of-order detach
  • New test in phoenix-client (runExperimentTracing.test.ts) verifies task and eval phases get separate providers and both are properly cleaned up
  • New tests in phoenix-evals verify evaluator spans follow the current global provider across provider swaps (both createEvaluator and createClassificationEvaluator)
  • Existing experiment tests updated with attachGlobalTracerProvider mock and shutdown mock
  • Integration test (integration-tracer-collision.ts) validated against live Phoenix at localhost:6006 — creates dataset, runs experiment with 3 examples and 2 evaluators, verifies spans land in correct projects, and confirms the user's pre-existing global provider is properly restored after the experiment completes

@mikeldking mikeldking requested a review from a team as a code owner March 21, 2026 16:51
@github-project-automation github-project-automation Bot moved this to 📘 Todo in phoenix Mar 21, 2026
@mintlify
Copy link
Copy Markdown
Contributor

mintlify Bot commented Mar 21, 2026

Preview deployment for your docs. Learn more about Mintlify Previews.

Project Status Preview Updated (UTC)
arize-phoenix 🟢 Ready View Preview Mar 21, 2026, 4:52 PM

@dosubot dosubot Bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Mar 21, 2026
@pkg-pr-new
Copy link
Copy Markdown

pkg-pr-new Bot commented Mar 21, 2026

Open in StackBlitz

@arizeai/phoenix-cli

npm i https://pkg.pr.new/@arizeai/phoenix-cli@12303

@arizeai/phoenix-client

npm i https://pkg.pr.new/@arizeai/phoenix-client@12303

@arizeai/phoenix-evals

npm i https://pkg.pr.new/@arizeai/phoenix-evals@12303

@arizeai/phoenix-mcp

npm i https://pkg.pr.new/@arizeai/phoenix-mcp@12303

@arizeai/phoenix-otel

npm i https://pkg.pr.new/@arizeai/phoenix-otel@12303

commit: 5642cd4

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Mar 21, 2026

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

mikeldking and others added 6 commits March 27, 2026 15:47
…phases

When experiments run tasks then evaluators, both phases register their own
global tracer provider. The second `provider.register()` call silently
replaces the first, causing spans from the task phase to be lost and
preventing clean shutdown.

This change:
- Adds `attachGlobalTracerProvider` / `detachGlobalTracerProvider` to
  phoenix-otel that snapshot, swap, and restore the OTEL global state so
  multiple providers can take turns owning the global slot without
  collisions.
- Splits `register({ global: true })` into `register({ global: false })`
  + explicit `attachGlobalTracerProvider()` in experiment code so each
  phase manages its own lifecycle.
- Wraps experiment/evaluation bodies in try/finally to guarantee
  `forceFlush → shutdown → detach` even on errors.
- Makes phoenix-evals tracer dynamic (resolves from the global provider at
  call time) so evaluator spans follow whichever provider is currently
  mounted.
- Adds tests for provider swap, restore, out-of-order detach, and
  cross-provider span routing.
…smatch

NodeTracerProvider.register() calls trace.setGlobalTracerProvider(this)
using the SDK's internal import of @opentelemetry/api. In pnpm workspaces,
this can resolve to a different module instance than what phoenix-otel
imports, creating separate TraceAPI singletons with independent proxies.
The snapshot/restore mechanism then operates on the wrong proxy, leaving
the user's original provider unrestored after experiments complete.

Fix: replace provider.register() with setGlobalProvider() that explicitly
calls trace.setGlobalTracerProvider(), context.setGlobalContextManager(),
and propagation.setGlobalPropagator() through phoenix-otel's own imports
of @opentelemetry/api, ensuring all global state mutations go through
a single TraceAPI instance.

Also adds:
- @opentelemetry/context-async-hooks as a direct dep of phoenix-otel
- @opentelemetry/api + sdk-trace-node as devDeps of phoenix-client
- Integration test (tsx script) that validates the full experiment
  lifecycle against a live Phoenix server, including global provider
  restoration
…ix-client and phoenix-otel

Rename integration-tracer-collision.ts to integration-tracer-provider-lifecycle.ts,
strip verbose logging, and add .agents/skills with rules codifying the tracer provider
lifecycle patterns, experiment architecture, and testing conventions learned in this PR.
- Replace `arguments` with explicit rest args in phoenix-evals dynamic
  tracer proxy for clarity
- Add JSDoc explaining the lazy tracer pattern and the `as Tracer` cast
- Add comments on finally-block cleanup calls explaining they are
  safety nets for error paths (no-ops on the happy path)
@mikeldking
Copy link
Copy Markdown
Collaborator Author

Code Review

What this PR accomplishes

This PR fixes a global OTel tracer provider collision that occurs when runExperiment executes tasks then evaluators sequentially — each phase previously called provider.register(), silently replacing the first provider and orphaning task-phase spans.

Core changes:

  1. Stack-based mount system in phoenix-otelattachGlobalTracerProvider() / detachGlobalTracerProvider() snapshot the pre-existing global OTel state before mounting a new provider and restore it on detach. This allows multiple providers to safely take turns owning the global slot without losing the user's original provider.

  2. Explicit provider lifecycle in experiment code — All three entry points (runExperiment, resumeExperiment, resumeEvaluation) now create providers with global: false, attach/detach explicitly, and use try/finally to guarantee forceFlush → shutdown → detach even on errors.

  3. Dynamic tracer in phoenix-evals — The tracer now resolves from trace.getTracer() at call time instead of capturing a static tracer at module load, so evaluator spans follow whichever provider is currently mounted.

  4. pnpm module identity fix — All global state mutations route through phoenix-otel's own OTel API imports via setGlobalProvider(), avoiding the symlink-driven module identity divergence in pnpm workspaces.

Review updates (4c5b972)

  • Replaced arguments + unused named params in phoenix-evals dynamic tracer proxy with explicit ...args rest params for clarity
  • Added JSDoc on the lazy tracer factory explaining the pattern and the as Tracer cast
  • Added comments on the finally-block cleanup calls in runExperiment and resumeExperiment explaining they are safety nets for error paths (no-ops on the happy path)

Remaining notes for reviewers

  • Lock file churn: The pnpm-lock.yaml changes include version drifts (e.g., @types/node 25.5.0 → 25.2.3, oxlint 1.56.0 → 1.53.0). Verify these are intentional and won't cause issues on main.
  • Skill files: The .agents/skills/ documentation adds ~600 lines. Consider splitting into a separate PR if you want a lighter review.
  • Integration test timing: The 2-second setTimeout for span export could be flaky in CI — forceFlush() on the sentinel provider would be more reliable.

@github-project-automation github-project-automation Bot moved this from 🔍. Needs Review to 👍 Approved in phoenix Mar 30, 2026
@mikeldking mikeldking merged commit d233c11 into main Mar 30, 2026
44 checks passed
@mikeldking mikeldking deleted the tracer-collision branch March 30, 2026 21:18
@github-project-automation github-project-automation Bot moved this from 👍 Approved to ✅ Done in phoenix Mar 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:XXL This PR changes 1000+ lines, ignoring generated files.

Projects

Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

3 participants