Fix DDTraceId/DD64bTraceId class-initialization deadlock #11509

Open
bm1549 wants to merge 5 commits into master from brian.marks/fix-ddtraceid-clinit-deadlock

Conversation

@bm1549 bm1549 commented May 29, 2026

What Does This Do

Fixes a class-initialization deadlock between DDTraceId and DD64bTraceId that can hang trace creation at startup. DDTraceId.ZERO/ONE are now backed by a dedicated sibling type (DDTraceIdConstant) instead of DD64bTraceId, so DDTraceId.<clinit> no longer initializes its own subclass. The public DDTraceId.ZERO/ONE fields keep their type and stay value-equal to the equivalent DD64bTraceId (no binary- or behavior-incompatible change). A value-based DDTraceId.isValid() replaces the == DDTraceId.ZERO sentinel checks.

Motivation

DD64bTraceId extends DDTraceId, so the JVM initializes DDTraceId first. But DDTraceId.<clinit> built its ZERO/ONE constants via DD64bTraceId.from(...), which initializes the subclass while the DDTraceId init lock is held. When the two classes are first touched concurrently from opposite ends, each thread ends up holding one class-init lock and waiting for the other:

  • dd-task-scheduler: the service-discovery task added in #9705 ("Add support for service discovery using JNA") runs muteTracing() -> blackholeSpan() -> DDTraceId.ZERO
  • main: the application's first span runs IdGenerationStrategy.generateTraceId() -> DD64bTraceId.from()

Trace creation then hangs. This surfaced as recurring ~30s LogInjectionSmokeTest timeouts on master (traceCount=0, process.alive=true, RC polls received: ~135). The forked-process thread dumps added in #11400 confirmed the cycle, and it reproduces deterministically.
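
The cyclic shape described above can be reduced to two classes. The sketch below is only an illustration under assumptions: BaseId and SubId are hypothetical stand-ins for DDTraceId and DD64bTraceId, not the tracer's code. It runs single-threaded, where the cycle is harmless because the JVM lets the thread that is already initializing a class re-enter it; the hang needs two threads entering from opposite ends.

```java
// Hypothetical stand-ins for DDTraceId (BaseId) and DD64bTraceId (SubId).
final class ClinitCycleSketch {

  abstract static class BaseId {
    // BaseId.<clinit> calls into the subclass, so initializing BaseId forces
    // SubId to initialize while BaseId is still marked as being initialized.
    static final BaseId ZERO = SubId.from(0);

    abstract long toLong();
  }

  static final class SubId extends BaseId {
    private final long id;

    private SubId(long id) {
      this.id = id;
    }

    // Touching SubId first makes the JVM initialize its superclass BaseId,
    // whose initializer then calls back into SubId.from(...).
    static SubId from(long id) {
      return new SubId(id);
    }

    @Override
    long toLong() {
      return id;
    }
  }

  public static void main(String[] args) {
    // One thread: the recursive initialization request simply returns.
    // Two threads, one entering via BaseId.ZERO and one via SubId.from(...),
    // can each own one in-progress initialization and wait on the other.
    System.out.println(BaseId.ZERO.toLong()); // prints 0
  }
}
```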

Additional Notes

  • Approach: break the cycle at its source. ZERO/ONE stay public static final DDTraceId fields (the surface deliberately restored in #5021, "[6to7] Restore public DDTraceId class API"), but are now instances of a dedicated package-private DDTraceId subtype, DDTraceIdConstant, that is a sibling of DD64bTraceId. Because DDTraceId.<clinit> no longer references the subclass, the deadlock cannot happen regardless of timing.
  • Zero checks now use a value-based DDTraceId.isValid() (true for a non-zero id, matching OpenTelemetry) instead of == DDTraceId.ZERO. The identity checks assumed every zero id was the single ZERO instance; isValid() recognizes a zero id of any concrete type, so a zero parsed via the direct 64-bit factories (DD64bTraceId.fromHex in the XRay/Haystack codecs) is handled correctly. It also recognizes an all-zero 128-bit id, which == ZERO silently missed.
  • DD64bTraceId keeps a cached zero singleton, so from(0)/fromHex("0") do not allocate a new instance per call. DDTraceId.ZERO/ONE stay value-equal in both directions (with matching hashCode) to the equivalent DD64bTraceId, so existing equality behavior is preserved.
  • DDTraceIdClinitDeadlockForkedTest runs in a fresh JVM and initializes the two classes concurrently from opposite ends; it deadlocks without the fix and passes with it. TraceIdIsValidTest and DDTraceIdConstantsTest cover isValid() and the constants across the DDTraceId subtypes.
  • The deadlock has been latent since #9705 ("Add support for service discovery using JNA", Oct 2025) added the scheduled muteTracing() task; it began manifesting recently as startup timing shifted.
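
As a rough illustration of the approach and equality notes above, the sketch below backs the constants with a sibling type and keeps value equality in both directions. All names are hypothetical simplifications (TraceId for DDTraceId, Id64 for DD64bTraceId, IdConstant for DDTraceIdConstant), and the equals/hashCode layout is an assumption, not the PR's implementation.

```java
// Hypothetical, simplified stand-ins; this is not the PR's code.
final class SiblingConstantSketch {

  abstract static class TraceId {
    // Built from the sibling type, so TraceId.<clinit> never touches Id64.
    static final TraceId ZERO = new IdConstant(0);
    static final TraceId ONE = new IdConstant(1);

    abstract long toLong();

    // Shared by both subtypes so that equal values always hash alike.
    @Override
    public final int hashCode() {
      return Long.hashCode(toLong());
    }
  }

  static final class IdConstant extends TraceId {
    private final long id;

    // Private: in this sketch only TraceId's initializer creates instances.
    private IdConstant(long id) {
      this.id = id;
    }

    @Override
    long toLong() {
      return id;
    }

    @Override
    public boolean equals(Object o) {
      // Value equality across both concrete types.
      return (o instanceof IdConstant || o instanceof Id64) && ((TraceId) o).toLong() == id;
    }
  }

  static final class Id64 extends TraceId {
    private final long id;

    private Id64(long id) {
      this.id = id;
    }

    static Id64 from(long id) {
      return new Id64(id);
    }

    @Override
    long toLong() {
      return id;
    }

    @Override
    public boolean equals(Object o) {
      // Mirror of IdConstant.equals so the relation stays symmetric.
      return (o instanceof IdConstant || o instanceof Id64) && ((TraceId) o).toLong() == id;
    }
  }

  public static void main(String[] args) {
    TraceId zero64 = Id64.from(0);
    System.out.println(TraceId.ZERO.equals(zero64)); // prints true
    System.out.println(zero64.equals(TraceId.ZERO)); // prints true
  }
}
```

The sketch keeps the constant type's constructor private because a sibling only removes the cycle if nothing else initializes it ahead of the base class.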

Contributor Checklist

  • Title formatted per the contribution guidelines
  • type: and comp: labels assigned
  • No issue-linking keywords used
  • CODEOWNERS update not required — added/removed files fall under directories already covered
  • Public documentation update not required (no new configuration or behavior)

Jira ticket: N/A

@bm1549 bm1549 added the type: bug (Bug report and fix), comp: api (Tracer public API), and tag: ai generated (Largely based on code generated by an AI or LLM) labels May 29, 2026

datadog-prod-us1-3 Bot commented May 29, 2026

Pipelines

⚠️ Warnings

🚦 7 Pipeline jobs failed

DataDog/apm-reliability/dd-trace-java | check_smoke 2/4

🔄 Retry job. This looks flaky and may succeed on retry. Could not read workspace metadata from /go/src/github.com/DataDog/apm-reliability/dd-trace-java/.gradle/caches/8.14.5/groovy-dsl/022029af6e327cf6d0e836ddbc2ce0c1/metadata.bin

DataDog/apm-reliability/dd-trace-java | check_smoke 4/4

🔄 Retry job. This looks flaky and may succeed on retry. Could not read workspace metadata from /go/src/github.com/DataDog/apm-reliability/dd-trace-java/.gradle/caches/8.14.5/groovy-dsl/022029af6e327cf6d0e836ddbc2ce0c1/metadata.bin due to IOException: Buffer underflow.

DataDog/apm-reliability/dd-trace-java | test_smoke_graalvm: [graalvm21]

🔄 Retry job. This looks flaky and may succeed on retry. Could not read workspace metadata from '/go/src/github.com/DataDog/apm-reliability/dd-trace-java/.gradle/caches/8.14.5/groovy-dsl/00f1a51ce4c061d9042c287f4dd31629/metadata.bin'.

View all 7 failed jobs.

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: ec20933

@bm1549 bm1549 marked this pull request as ready for review May 29, 2026 19:13
@bm1549 bm1549 requested a review from a team as a code owner May 29, 2026 19:13
@bm1549 bm1549 requested a review from dougqh May 29, 2026 19:13

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 56ea720eb8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread dd-trace-api/src/main/java/datadog/trace/api/DD64bTraceId.java

dd-octo-sts Bot commented May 29, 2026

🟢 Java Benchmark SLOs — All performance SLOs passed

Suite Status
Startup 🟢 pass

SLO thresholds are defined here based on automatically generated metrics. A warning is raised when results are within 5% of the threshold.

PR vs. master results
Scenario Candidate master Δ (95% CI of mean)
startup:insecure-bank:iast:Agent 13.94 s 13.80 s [-0.0%; +2.1%] (no difference)
startup:insecure-bank:tracing:Agent 12.85 s 12.95 s [-2.0%; +0.5%] (no difference)
startup:petclinic:appsec:Agent 15.77 s 16.29 s [-11.9%; +5.5%] (unstable)
startup:petclinic:iast:Agent 16.42 s 15.86 s [-5.5%; +12.6%] (unstable)
startup:petclinic:profiling:Agent 16.23 s 16.37 s [-2.6%; +1.0%] (no difference)
startup:petclinic:tracing:Agent 15.13 s 14.97 s [-10.7%; +12.9%] (unstable)

Commit: ec20933f · CI Pipeline · Benchmarking Platform UI


Load and DaCapo benchmarks can be triggered manually in the GitLab pipeline. Results will appear in the Benchmarking Platform UI after completion.

@bm1549 bm1549 requested a review from a team as a code owner May 29, 2026 19:29
@bm1549 bm1549 requested review from mcculls and removed request for a team May 29, 2026 19:29

@dougqh dougqh left a comment

Overall, it looks good to me.
But before merging, double-check that the Codex comment isn't a problem.

DD64bTraceId is a subclass of DDTraceId, so the JVM must initialize
DDTraceId before DD64bTraceId. DDTraceId.<clinit> in turn initialized
DD64bTraceId by building its ZERO/ONE constants via DD64bTraceId.from(),
a circular initialization dependency. When the two classes were first
touched concurrently from opposite ends -- the service-discovery task
(muteTracing() -> blackholeSpan() -> DDTraceId.ZERO) racing the
application's first span (IdGenerationStrategy.generateTraceId() ->
DD64bTraceId.from()) -- each thread held one class-init lock and waited
for the other, hanging trace creation. This surfaced as recurring 30s
LogInjectionSmokeTest timeouts in CI (latent since #9705 added the
scheduled muteTracing task).

Break the cycle at its source while keeping DDTraceId.ZERO/ONE as public
fields (preserving the API restored in #5021): ZERO/ONE are now instances
of a private DDTraceId subtype (a sibling of DD64bTraceId), so
DDTraceId.<clinit> no longer references the subclass.

Replace the fragile "== DDTraceId.ZERO" identity checks with a
value-based DDTraceId.isZero(). Those identity checks relied on every
zero id being the single ZERO instance; isZero() recognizes a zero id of
any concrete type, so the factories need not route 0 to the singleton and
the propagation codecs no longer mishandle a zero parsed via the direct
64-bit factories.
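
To make the identity-versus-value distinction concrete, here is a minimal sketch. It uses the name isValid(), which the PR description gives for the final form of this check (the negation of the isZero() named in this commit message). The TraceId interface, Id64, Id128, and the high/low-bits accessors are assumptions made for the example, not the tracer's types.

```java
// Hypothetical types for illustration only.
final class ValidityCheckSketch {

  interface TraceId {
    long highBits();

    long lowBits();

    // Value-based: true for any non-zero id, whatever its concrete type.
    default boolean isValid() {
      return (highBits() | lowBits()) != 0;
    }
  }

  static final class Id64 implements TraceId {
    private final long id;

    Id64(long id) {
      this.id = id;
    }

    // A direct factory that always allocates, so a parsed zero is NOT the ZERO instance.
    static Id64 fromHex(String hex) {
      return new Id64(Long.parseUnsignedLong(hex, 16));
    }

    @Override
    public long highBits() {
      return 0;
    }

    @Override
    public long lowBits() {
      return id;
    }
  }

  static final class Id128 implements TraceId {
    private final long high;
    private final long low;

    Id128(long high, long low) {
      this.high = high;
      this.low = low;
    }

    @Override
    public long highBits() {
      return high;
    }

    @Override
    public long lowBits() {
      return low;
    }
  }

  static final TraceId ZERO = new Id64(0);

  public static void main(String[] args) {
    TraceId parsed = Id64.fromHex("0");
    System.out.println(parsed == ZERO);               // prints false: identity misses this zero
    System.out.println(parsed.isValid());             // prints false: value check catches it
    System.out.println(new Id128(0, 0).isValid());    // prints false: all-zero 128-bit id
  }
}
```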

Add a forked regression test that initializes the two classes
concurrently from opposite ends (deadlocks without the fix), plus
isZero() coverage across the DDTraceId subtypes.
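
The shape of such a regression check might look like the sketch below: two threads are released together, one entering through the base class and one through the subclass, and the caller waits with a timeout instead of hanging. The class names and the initializeFromBothEnds helper are hypothetical, and the classes already use the sibling-constant layout so the sketch completes. Class initialization happens only once per class loader, which is why the actual test is described as running in a fresh JVM.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-ins; the layout mirrors the fixed (non-cyclic) design.
final class ConcurrentInitSketch {

  abstract static class TraceId {
    static final TraceId ZERO = new IdConstant(0);

    abstract long toLong();
  }

  static final class IdConstant extends TraceId {
    private final long id;

    private IdConstant(long id) {
      this.id = id;
    }

    @Override
    long toLong() {
      return id;
    }
  }

  static final class Id64 extends TraceId {
    private final long id;

    private Id64(long id) {
      this.id = id;
    }

    static Id64 from(long id) {
      return new Id64(id);
    }

    @Override
    long toLong() {
      return id;
    }
  }

  // Returns true if both threads got past class initialization within the timeout.
  static boolean initializeFromBothEnds(long timeoutMillis) {
    CountDownLatch start = new CountDownLatch(1);
    CountDownLatch done = new CountDownLatch(2);
    // One thread enters via the base class, the other via the subclass.
    startDaemon(() -> { awaitQuietly(start); TraceId.ZERO.toLong(); done.countDown(); });
    startDaemon(() -> { awaitQuietly(start); Id64.from(42).toLong(); done.countDown(); });
    start.countDown(); // release both threads at once
    try {
      return done.await(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  // Daemon threads so a stuck initializer cannot keep the JVM alive.
  private static void startDaemon(Runnable task) {
    Thread thread = new Thread(task);
    thread.setDaemon(true);
    thread.start();
  }

  private static void awaitQuietly(CountDownLatch latch) {
    try {
      latch.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  public static void main(String[] args) {
    System.out.println(initializeFromBothEnds(5_000) ? "initialized" : "timed out");
  }
}
```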

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@bm1549 bm1549 force-pushed the brian.marks/fix-ddtraceid-clinit-deadlock branch from b04a0d1 to 0e15d6c Compare May 30, 2026 02:46
@bm1549 bm1549 requested a review from a team as a code owner May 30, 2026 02:46

pr-commenter Bot commented May 30, 2026

Debugger benchmarks

Parameters

Baseline Candidate
baseline_or_candidate baseline candidate
ci_job_date 1780150450 1780150796
end_time 2026-05-30T14:15:37 2026-05-30T14:21:22
git_branch master brian.marks/fix-ddtraceid-clinit-deadlock
git_commit_sha 064fda9 ded2c7c
start_time 2026-05-30T14:14:11 2026-05-30T14:19:57
See matching parameters
Baseline Candidate
ci_job_id 1726969699 1726969699
ci_pipeline_id 116055918 116055918
cpu_model Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
git_commit_date 1780109723 1780109723

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 9 metrics, 6 unstable metrics.

See unchanged results
| scenario | Δ mean agg_http_req_duration_min | Δ mean agg_http_req_duration_p50 | Δ mean agg_http_req_duration_p75 | Δ mean agg_http_req_duration_p99 | Δ mean throughput |
| --- | --- | --- | --- | --- | --- |
| scenario:noprobe | unstable [-21.542µs; +30.033µs] or [-7.329%; +10.218%] | unstable [-36.009µs; +38.467µs] or [-10.571%; +11.292%] | unstable [-46.994µs; +50.253µs] or [-13.168%; +14.081%] | unstable [-65.232µs; +141.764µs] or [-5.583%; +12.133%] | same |
| scenario:basic | unsure [-9.611µs; -1.405µs] or [-3.575%; -0.523%] | same | same | unstable [-200.000µs; +41.540µs] or [-18.139%; +3.767%] | unstable [-125.442op/s; +125.442op/s] or [-5.018%; +5.018%] |
| scenario:loop | same | same | same | same | same |
Request duration reports for reports
[Gantt chart: reports - request duration [CI 0.99], baseline vs. candidate for the noprobe, basic, and loop scenarios; the same values are listed in the results below.]
  • baseline results
Scenario Request median duration [CI 0.99]
noprobe 340.653 µs [311.05 µs, 370.257 µs]
basic 298.082 µs [290.833 µs, 305.331 µs]
loop 8.981 ms [8.976 ms, 8.987 ms]
  • candidate results
Scenario Request median duration [CI 0.99]
noprobe 341.882 µs [301.518 µs, 382.247 µs]
basic 294.321 µs [288.216 µs, 300.426 µs]
loop 8.982 ms [8.977 ms, 8.988 ms]

pr-commenter Bot commented May 30, 2026

Kafka / producer-benchmark

Parameters

Baseline Candidate
baseline_or_candidate baseline candidate
git_branch master brian.marks/fix-ddtraceid-clinit-deadlock
git_commit_date 1780097219 1780109723
git_commit_sha 194ee63 ded2c7c
See matching parameters
Baseline Candidate
ci_job_date 1780150961 1780150961
ci_job_id 1726969697 1726969697
ci_pipeline_id 116055918 116055918
cpu_model Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
jdkVersion 11.0.25 11.0.25
jmhVersion 1.36 1.36
jvm /usr/lib/jvm/java-11-openjdk-amd64/bin/java /usr/lib/jvm/java-11-openjdk-amd64/bin/java
jvmArgs -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/go/src/github.com/DataDog/apm-reliability/dd-trace-java/platform/src/producer-benchmark/build/tmp/jmh -Duser.country=US -Duser.language=en -Duser.variant -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/go/src/github.com/DataDog/apm-reliability/dd-trace-java/platform/src/producer-benchmark/build/tmp/jmh -Duser.country=US -Duser.language=en -Duser.variant
vmName OpenJDK 64-Bit Server VM OpenJDK 64-Bit Server VM
vmVersion 11.0.25+9-post-Ubuntu-1ubuntu122.04 11.0.25+9-post-Ubuntu-1ubuntu122.04

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 3 metrics, 0 unstable metrics.

See unchanged results
scenario Δ mean throughput
scenario:not-instrumented/KafkaProduceBenchmark.benchProduce same
scenario:only-tracing-dsm-disabled-benchmarks/KafkaProduceBenchmark.benchProduce same
scenario:only-tracing-dsm-enabled-benchmarks/KafkaProduceBenchmark.benchProduce same

pr-commenter Bot commented May 30, 2026

Kafka / consumer-benchmark

Parameters

Baseline Candidate
baseline_or_candidate baseline candidate
git_branch master brian.marks/fix-ddtraceid-clinit-deadlock
git_commit_date 1780097219 1780109723
git_commit_sha 194ee63 ded2c7c
See matching parameters
Baseline Candidate
ci_job_date 1780150992 1780150992
ci_job_id 1726969698 1726969698
ci_pipeline_id 116055918 116055918
cpu_model Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
jdkVersion 11.0.25 11.0.25
jmhVersion 1.36 1.36
jvm /usr/lib/jvm/java-11-openjdk-amd64/bin/java /usr/lib/jvm/java-11-openjdk-amd64/bin/java
jvmArgs -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/go/src/github.com/DataDog/apm-reliability/dd-trace-java/platform/src/consumer-benchmark/build/tmp/jmh -Duser.country=US -Duser.language=en -Duser.variant -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/go/src/github.com/DataDog/apm-reliability/dd-trace-java/platform/src/consumer-benchmark/build/tmp/jmh -Duser.country=US -Duser.language=en -Duser.variant
vmName OpenJDK 64-Bit Server VM OpenJDK 64-Bit Server VM
vmVersion 11.0.25+9-post-Ubuntu-1ubuntu122.04 11.0.25+9-post-Ubuntu-1ubuntu122.04

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 3 metrics, 0 unstable metrics.

See unchanged results
scenario Δ mean throughput
scenario:not-instrumented/KafkaConsumerBenchmark.benchConsume unsure [+2328.734op/s; +12483.668op/s] or [+0.799%; +4.286%]
scenario:only-tracing-dsm-disabled-benchmarks/KafkaConsumerBenchmark.benchConsume same
scenario:only-tracing-dsm-enabled-benchmarks/KafkaConsumerBenchmark.benchConsume same

Comment thread dd-trace-api/src/main/java/datadog/trace/api/DD64bTraceId.java Outdated
Comment thread dd-trace-api/src/main/java/datadog/trace/api/DD64bTraceId.java

@mcculls mcculls left a comment

I'd prefer to have a method isValid rather than isZero (where isValid() should return true when the id is not zero)

Also I'm worried that we could now create lots of DD64bTraceId(0) instances with the change to that class.

Finally some of the non-test comments are too verbose and should be simplified

@PerfectSlayer PerfectSlayer left a comment

Left early review comments

"Finally some of the non-test comments are too verbose and should be simplified"

Agreed, that hurts readability in core tracer parts.

Comment thread dd-trace-api/src/main/java/datadog/trace/api/DDTraceId.java Outdated
Comment thread dd-trace-core/src/test/java/datadog/trace/api/TraceIdIsZeroTest.java Outdated
bm1549 and others added 3 commits June 1, 2026 09:49
…constant

- Rename DDTraceId.isZero() to value-based isValid() (OTel-aligned; mcculls);
  flip the boolean at all 7 production call sites.
- Restore a cached zero singleton in DD64bTraceId so from(0)/fromHex("0") do
  not allocate per call (mcculls).
- Extract the ZERO/ONE backing type to a dedicated DDTraceIdConstant class
  (PerfectSlayer); update the JaCoCo coverage exclusion.
- Preserve equals/hashCode: ZERO/ONE stay value-equal (both directions) to the
  equivalent DD64bTraceId, matching pre-PR behavior.
- Trim verbose non-test comments (mcculls, PerfectSlayer).
- Parameterize and rename TraceIdIsZeroTest -> TraceIdIsValidTest (PerfectSlayer).
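
The cached-zero item in the list above can be pictured with the small sketch below. It is an assumption-laden illustration: Id64 is a hypothetical stand-in for DD64bTraceId, and the real factories may differ in parsing and validation.

```java
// Hypothetical stand-in for a 64-bit trace id with a cached zero instance.
final class CachedZeroSketch {

  static final class Id64 {
    // Single shared instance for the zero value, so repeated zero lookups do not allocate.
    private static final Id64 ZERO = new Id64(0);

    private final long id;

    private Id64(long id) {
      this.id = id;
    }

    static Id64 from(long id) {
      return id == 0 ? ZERO : new Id64(id);
    }

    // Routes through from(...) so a parsed zero also reuses the cached instance.
    static Id64 fromHex(String hex) {
      return from(Long.parseUnsignedLong(hex, 16));
    }

    long toLong() {
      return id;
    }
  }

  public static void main(String[] args) {
    System.out.println(Id64.from(0) == Id64.fromHex("0")); // prints true: same cached instance
    System.out.println(Id64.from(7) == Id64.from(7));      // prints false: non-zero ids allocate
  }
}
```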

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Trim the verbose DDTraceId/DD64bTraceId/DDTraceIdConstant/XRayHttpCodec
  comments to terse navigation aids (review feedback).
- Suppress EQ_CHECK_FOR_OPERAND_NOT_COMPATIBLE_WITH_THIS on the two cross-type
  equals methods (intentional value-equality between ZERO/ONE and the
  equivalent DD64bTraceId); fixes the check_base SpotBugs failure.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Labels

comp: api (Tracer public API), tag: ai generated (Largely based on code generated by an AI or LLM), type: bug (Bug report and fix)

4 participants