fix(js): prevent global tracer provider collisions across experiment phases #12303
mikeldking merged 9 commits into main from
@arizeai/phoenix-cli
@arizeai/phoenix-client
@arizeai/phoenix-evals
@arizeai/phoenix-mcp
@arizeai/phoenix-otel
Code review: No issues found. Checked for bugs and CLAUDE.md compliance.
…phases
When an experiment runs its tasks and then its evaluators, each phase registers
its own global tracer provider. The second `provider.register()` call silently
replaces the first, so spans from the task phase are lost and the first
provider cannot be shut down cleanly.
This change:
- Adds `attachGlobalTracerProvider` / `detachGlobalTracerProvider` to
phoenix-otel that snapshot, swap, and restore the OTEL global state so
multiple providers can take turns owning the global slot without
collisions.
- Splits `register({ global: true })` into `register({ global: false })`
+ explicit `attachGlobalTracerProvider()` in experiment code so each
phase manages its own lifecycle.
- Wraps experiment/evaluation bodies in try/finally to guarantee
`forceFlush → shutdown → detach` even on errors.
- Makes phoenix-evals tracer dynamic (resolves from the global provider at
call time) so evaluator spans follow whichever provider is currently
mounted.
- Adds tests for provider swap, restore, out-of-order detach, and
cross-provider span routing.
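The attach/detach and try/finally lifecycle described in the list above can be sketched with a toy single-slot registry. This is a minimal sketch under stated assumptions, not phoenix-otel's implementation: `Provider`, `globalProvider`, `attachSketch`, `detachSketch`, and `runPhase` are hypothetical names, and a plain module variable stands in for the OTEL global state.

```typescript
// Toy model of snapshot/restore for a single global tracer-provider slot.
// All names here are hypothetical; this is not the phoenix-otel API.

interface Provider {
  name: string;
  forceFlush(): Promise<void>;
  shutdown(): Promise<void>;
}

// Stand-in for the OTEL global slot that trace.getTracerProvider() would read.
let globalProvider: Provider | undefined;

// Whatever owned the slot before the first attach, plus the attached providers in order.
let snapshot: Provider | undefined;
const attached: Provider[] = [];

function attachSketch(provider: Provider): void {
  if (attached.length === 0) {
    snapshot = globalProvider; // remember the pre-existing (user) provider, possibly undefined
  }
  attached.push(provider);
  globalProvider = provider;
}

function detachSketch(provider: Provider): void {
  const index = attached.lastIndexOf(provider);
  if (index === -1) return; // not attached or already detached: no-op
  attached.splice(index, 1);
  // The newest remaining provider keeps the slot; once nothing is attached,
  // the original snapshot is restored.
  globalProvider = attached.length > 0 ? attached[attached.length - 1] : snapshot;
}

// Own the global slot only while `body` runs, and always clean up afterwards.
async function runPhase<T>(provider: Provider, body: () => Promise<T>): Promise<T> {
  attachSketch(provider);
  try {
    return await body();
  } finally {
    try {
      await provider.forceFlush(); // export spans still buffered by this provider
      await provider.shutdown();
    } finally {
      detachSketch(provider); // release the slot even if flush or shutdown throws
    }
  }
}
```

Detaching by identity rather than popping the top entry is what makes out-of-order detach safe in this model: removing a provider that is no longer on top leaves the slot with the newest remaining provider, and the snapshot is only restored once every attached provider is gone.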
…smatch
`NodeTracerProvider.register()` calls `trace.setGlobalTracerProvider(this)` using the SDK's internal import of `@opentelemetry/api`. In pnpm workspaces, this can resolve to a different module instance than the one phoenix-otel imports, creating separate `TraceAPI` singletons with independent proxies. The snapshot/restore mechanism then operates on the wrong proxy, leaving the user's original provider unrestored after experiments complete.
Fix: replace `provider.register()` with `setGlobalProvider()`, which explicitly calls `trace.setGlobalTracerProvider()`, `context.setGlobalContextManager()`, and `propagation.setGlobalPropagator()` through phoenix-otel's own imports of `@opentelemetry/api`, ensuring all global state mutations go through a single `TraceAPI` instance.
Also adds:
- `@opentelemetry/context-async-hooks` as a direct dep of phoenix-otel
- `@opentelemetry/api` + `sdk-trace-node` as devDeps of phoenix-client
- Integration test (tsx script) that validates the full experiment lifecycle against a live Phoenix server, including global provider restoration
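A toy model makes this failure mode concrete. It is a simplified sketch, not the real `@opentelemetry/api` internals, and it ignores how the real package stores its globals. Two `ToyTraceApi` instances stand in for the two resolved copies of the module, each with its own proxy. `buggyRun`, `fixedRun`, and `setGlobalProviderSketch` are hypothetical names.

```typescript
// Toy model of two resolved copies of the same API module (hypothetical names throughout).

interface ToyTracerProvider {
  id: string;
}

class ToyTraceApi {
  // Each module copy keeps its own proxy delegate; in this toy, copies share nothing.
  private proxyDelegate: ToyTracerProvider | undefined;

  setGlobalTracerProvider(provider: ToyTracerProvider | undefined): void {
    this.proxyDelegate = provider;
  }

  getTracerProvider(): ToyTracerProvider | undefined {
    return this.proxyDelegate;
  }
}

// The copy phoenix-otel imports, and the copy the SDK resolves via another symlink path.
const phoenixOtelApi = new ToyTraceApi();
const sdkApi = new ToyTraceApi();

// Before the fix: snapshot/restore use phoenix-otel's copy, but registration
// (standing in for provider.register()) mutates the SDK's copy.
function buggyRun(experimentProvider: ToyTracerProvider, body: () => void): void {
  const saved = phoenixOtelApi.getTracerProvider();
  sdkApi.setGlobalTracerProvider(experimentProvider);
  try {
    body();
  } finally {
    phoenixOtelApi.setGlobalTracerProvider(saved); // "restores" a proxy that never changed
  }
}

// After the fix: a single helper performs every mutation through the same copy.
function setGlobalProviderSketch(provider: ToyTracerProvider | undefined): void {
  phoenixOtelApi.setGlobalTracerProvider(provider);
}

function fixedRun(experimentProvider: ToyTracerProvider, body: () => void): void {
  const saved = phoenixOtelApi.getTracerProvider();
  setGlobalProviderSketch(experimentProvider);
  try {
    body();
  } finally {
    setGlobalProviderSketch(saved); // snapshot and restore hit the same proxy
  }
}
```

In the buggy flow the SDK's copy is left pointing at the experiment provider after the run. When the snapshot, the mutation, and the restore all go through one copy, they stay consistent, which matches the motivation the commit message gives for `setGlobalProvider()`.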
…ix-client and phoenix-otel
Rename `integration-tracer-collision.ts` to `integration-tracer-provider-lifecycle.ts`, strip verbose logging, and add `.agents/skills` with rules codifying the tracer provider lifecycle patterns, experiment architecture, and testing conventions learned in this PR.
Force-pushed from c0e5f43 to b214b7c
- Replace `arguments` with explicit rest args in the phoenix-evals dynamic tracer proxy for clarity
- Add JSDoc explaining the lazy tracer pattern and the `as Tracer` cast
- Add comments on finally-block cleanup calls explaining that they are safety nets for error paths (no-ops on the happy path)
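The dynamic tracer proxy these commits refer to can be illustrated with a JavaScript `Proxy` whose methods look up the tracer on every call and forward explicit rest arguments. This is a simplified sketch, not the phoenix-evals source. `MiniSpan`, `MiniTracer`, `MiniTracerProvider`, `currentProvider`, `getTracerFromGlobal`, and `createLazyTracer` are hypothetical, and a module variable stands in for the global provider that `trace.getTracer()` would consult.

```typescript
// Hypothetical, trimmed-down shapes; the real OpenTelemetry Tracer interface is larger.
interface MiniSpan {
  providerId: string;
  spanName: string;
}

interface MiniTracer {
  startSpan(spanName: string): MiniSpan;
}

interface MiniTracerProvider {
  getTracer(tracerName: string): MiniTracer;
}

// Stand-in for whichever provider is currently mounted as global.
let currentProvider: MiniTracerProvider | undefined;

function getTracerFromGlobal(tracerName: string): MiniTracer {
  if (currentProvider === undefined) {
    throw new Error("no global tracer provider mounted"); // toy behavior only
  }
  return currentProvider.getTracer(tracerName);
}

// Every property access returns a forwarder that resolves the tracer at call time,
// so a tracer created once at module load follows later provider swaps.
function createLazyTracer(tracerName: string): MiniTracer {
  const lazy = new Proxy({} as object, {
    get(_target, prop) {
      // Explicit rest args instead of the implicit `arguments` object.
      return (...args: unknown[]): unknown => {
        const tracer = getTracerFromGlobal(tracerName);
        const method = Reflect.get(tracer, prop) as (...a: unknown[]) => unknown;
        return Reflect.apply(method, tracer, args);
      };
    },
  });
  // The proxy target is an empty object, so TypeScript needs a cast to the tracer type.
  return lazy as MiniTracer;
}
```

Because resolution happens inside each call rather than when `createLazyTracer` runs, swapping `currentProvider` between the task and evaluation phases redirects later spans without recreating the tracer. Unlike this toy, which throws when nothing is mounted, the real `@opentelemetry/api` falls back to a no-op tracer.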
Code Review
What this PR accomplishes
This PR fixes a global OTel tracer provider collision that occurs when …
Core changes:
Review updates (4c5b972)
Remaining notes for reviewers
Summary
When `runExperiment` runs tasks then evaluators, each phase calls `provider.register()` to set itself as the global OTEL tracer provider. The second call silently replaces the first, causing task-phase spans to be orphaned and preventing clean shutdown of the first provider.
- Add `attachGlobalTracerProvider` / `detachGlobalTracerProvider` to `phoenix-otel`: these snapshot the current global OTEL state before mounting a new provider and restore it on detach, so multiple providers can safely take turns owning the global slot (stack-like semantics with out-of-order detach support).
- In experiment code (`runExperiment`, `resumeExperiment`, `resumeEvaluation`), each phase now calls `register({ global: false })`, then explicitly calls `attachGlobalTracerProvider()`, and wraps its body in `try`/`finally` to guarantee `forceFlush → shutdown → detach` even on errors.
- Make the `phoenix-evals` tracer dynamic: instead of capturing a static tracer at module load, the tracer now resolves from `trace.getTracer()` at call time, so evaluator spans follow whichever provider is currently mounted as global.
- Fix a module instance mismatch in pnpm workspaces: `NodeTracerProvider.register()` calls `trace.setGlobalTracerProvider()` using the SDK's internal import of `@opentelemetry/api`, which in pnpm workspaces can resolve to a different module instance (different symlink paths → different `TraceAPI` singletons). This caused snapshot/restore to operate on the wrong proxy, leaving the user's original provider unrestored. Fixed by routing all global state mutations through phoenix-otel's own `trace`/`context`/`propagation` imports via a new `setGlobalProvider()` helper.
Test plan
- Tests in `phoenix-otel` verify sequential attach/detach, restore of an external provider, and out-of-order detach
- A test in `phoenix-client` (`runExperimentTracing.test.ts`) verifies that the task and eval phases get separate providers and that both are properly cleaned up
- Tests in `phoenix-evals` verify that evaluator spans follow the current global provider across provider swaps (both `createEvaluator` and `createClassificationEvaluator`)
- … `attachGlobalTracerProvider` mock and `shutdown` mock
- Integration test (`integration-tracer-collision.ts`) validated against live Phoenix at localhost:6006: creates a dataset, runs an experiment with 3 examples and 2 evaluators, verifies that spans land in the correct projects, and confirms that the user's pre-existing global provider is properly restored after the experiment completes