Fix DDTraceId/DD64bTraceId class-initialization deadlock#11509
Conversation
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 56ea720eb8
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
🟢 Java Benchmark SLOs — All performance SLOs passed
PR vs. master results
Commit: Load and DaCapo benchmarks can be triggered manually in the GitLab pipeline. Results will appear in the Benchmarking Platform UI after completion. |
dougqh
left a comment
There was a problem hiding this comment.
Overall, it looks good to me.
But before merging, double check that the Codex comment isn't a problem
DD64bTraceId is a subclass of DDTraceId, so the JVM must initialize DDTraceId before DD64bTraceId. DDTraceId.<clinit> in turn initialized DD64bTraceId by building its ZERO/ONE constants via DD64bTraceId.from(), a circular initialization dependency. When the two classes were first touched concurrently from opposite ends -- the service-discovery task (muteTracing() -> blackholeSpan() -> DDTraceId.ZERO) racing the application's first span (IdGenerationStrategy.generateTraceId() -> DD64bTraceId.from()) -- each thread held one class-init lock and waited for the other, hanging trace creation. This surfaced as recurring 30s LogInjectionSmokeTest timeouts in CI (latent since #9705 added the scheduled muteTracing task). Break the cycle at its source while keeping DDTraceId.ZERO/ONE as public fields (preserving the API restored in #5021): ZERO/ONE are now instances of a private DDTraceId subtype (a sibling of DD64bTraceId), so DDTraceId.<clinit> no longer references the subclass. Replace the fragile "== DDTraceId.ZERO" identity checks with a value-based DDTraceId.isZero(). Those identity checks relied on every zero id being the single ZERO instance; isZero() recognizes a zero id of any concrete type, so the factories need not route 0 to the singleton and the propagation codecs no longer mishandle a zero parsed via the direct 64-bit factories. Add a forked regression test that initializes the two classes concurrently from opposite ends (deadlocks without the fix), plus isZero() coverage across the DDTraceId subtypes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
b04a0d1 to
0e15d6c
Compare
Debugger benchmarksParameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 9 metrics, 6 unstable metrics. See unchanged results
Request duration reports for reportsgantt
title reports - request duration [CI 0.99] : candidate=None, baseline=None
dateFormat X
axisFormat %s
section baseline
noprobe (340.653 µs) : 311, 370
. : milestone, 341,
basic (298.082 µs) : 291, 305
. : milestone, 298,
loop (8.981 ms) : 8976, 8987
. : milestone, 8981,
section candidate
noprobe (341.882 µs) : 302, 382
. : milestone, 342,
basic (294.321 µs) : 288, 300
. : milestone, 294,
loop (8.982 ms) : 8977, 8988
. : milestone, 8982,
|
Kafka / producer-benchmarkParameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 3 metrics, 0 unstable metrics. See unchanged results
|
Kafka / consumer-benchmarkParameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 3 metrics, 0 unstable metrics. See unchanged results
|
mcculls
left a comment
There was a problem hiding this comment.
I'd prefer to have a method isValid rather than isZero (where isValid() should return true when the id is not zero)
Also I'm worried that we could now create lots of DD64bTraceId(0) instances with the change to that class.
Finally some of the non-test comments are too verbose and should be simplified
…constant
- Rename DDTraceId.isZero() to value-based isValid() (OTel-aligned; mcculls);
flip the boolean at all 7 production call sites.
- Restore a cached zero singleton in DD64bTraceId so from(0)/fromHex("0") do
not allocate per call (mcculls).
- Extract the ZERO/ONE backing type to a dedicated DDTraceIdConstant class
(PerfectSlayer); update the JaCoCo coverage exclusion.
- Preserve equals/hashCode: ZERO/ONE stay value-equal (both directions) to the
equivalent DD64bTraceId, matching pre-PR behavior.
- Trim verbose non-test comments (mcculls, PerfectSlayer).
- Parameterize and rename TraceIdIsZeroTest -> TraceIdIsValidTest (PerfectSlayer).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Trim the verbose DDTraceId/DD64bTraceId/DDTraceIdConstant/XRayHttpCodec comments to terse navigation aids (review feedback). - Suppress EQ_CHECK_FOR_OPERAND_NOT_COMPATIBLE_WITH_THIS on the two cross-type equals methods (intentional value-equality between ZERO/ONE and the equivalent DD64bTraceId); fixes the check_base SpotBugs failure. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
What Does This Do
Fixes a class-initialization deadlock between
DDTraceIdandDD64bTraceIdthat can hang trace creation at startup.DDTraceId.ZERO/ONEare now backed by a dedicated sibling type (DDTraceIdConstant) instead ofDD64bTraceId, soDDTraceId.<clinit>no longer initializes its own subclass. The publicDDTraceId.ZERO/ONEfields keep their type and stay value-equal to the equivalentDD64bTraceId(no binary- or behavior-incompatible change). A value-basedDDTraceId.isValid()replaces the== DDTraceId.ZEROsentinel checks.Motivation
DD64bTraceIdextendsDDTraceId, so the JVM initializesDDTraceIdfirst. ButDDTraceId.<clinit>built itsZERO/ONEconstants viaDD64bTraceId.from(...), which initializes the subclass while theDDTraceIdinit lock is held. When the two classes are first touched concurrently from opposite ends, each thread ends up holding one class-init lock and waiting for the other:dd-task-scheduler: the service-discovery task added in Add support for service discovery using JNA #9705 runsmuteTracing()->blackholeSpan()->DDTraceId.ZEROmain: the application's first span runsIdGenerationStrategy.generateTraceId()->DD64bTraceId.from()Trace creation then hangs. This surfaced as recurring ~30s
LogInjectionSmokeTesttimeouts on master (traceCount=0, process.alive=true, RC polls received: ~135). The forked-process thread dumps added in #11400 confirmed the cycle, and it reproduces deterministically.Additional Notes
ZERO/ONEstaypublic static final DDTraceIdfields (the surface deliberately restored in [6to7] Restore public DDTraceId class API #5021), but are now instances of a dedicated package-privateDDTraceIdsubtype,DDTraceIdConstant, that is a sibling ofDD64bTraceId. BecauseDDTraceId.<clinit>no longer references the subclass, the deadlock cannot happen regardless of timing.DDTraceId.isValid()(true for a non-zero id, matching OpenTelemetry) instead of== DDTraceId.ZERO. The identity checks assumed every zero id was the singleZEROinstance;isValid()recognizes a zero id of any concrete type, so a zero parsed via the direct 64-bit factories (DD64bTraceId.fromHexin the XRay/Haystack codecs) is handled correctly. It also recognizes an all-zero 128-bit id, which== ZEROsilently missed.DD64bTraceIdkeeps a cached zero singleton, sofrom(0)/fromHex("0")do not allocate a new instance per call.DDTraceId.ZERO/ONEstay value-equal in both directions (with matchinghashCode) to the equivalentDD64bTraceId, so existing equality behavior is preserved.DDTraceIdClinitDeadlockForkedTestruns in a fresh JVM and initializes the two classes concurrently from opposite ends; it deadlocks without the fix and passes with it.TraceIdIsValidTestandDDTraceIdConstantsTestcoverisValid()and the constants across theDDTraceIdsubtypes.muteTracing()task; it began manifesting recently as startup timing shifted.Contributor Checklist
type:andcomp:labels assignedJira ticket: N/A