Update client-side stats to use light weight Hashtable by dougqh · Pull Request #11382 · DataDog/dd-trace-java

dougqh · 2026-05-15T19:11:06Z

What Does This Do

Replaces the MetricKey based HashMap with a new AggregateTable based on the light Hashtable

Motivation

By using the light Hashtable, I'm able to avoid the biggest source of allocation in client-side stats: MetricKey

Hashtable provides utilities for searching the entries without constructing a new composite key object

First, the components are hashed together to find the corresponding bucket
Then the bucket can be traversed to see if the entries match the key components

And the custom Entry can hold multiple fields that comprise the data / metadata needed for metric aggregation and eviction policy

The end result is that both MetricKey and AggregateMetric can be merged into a single class AggregateEntry that is only constructed when there's no existing matching entry

Additional Notes

Stacked on top of #11381 (dougqh/conflating-metrics-background-work). #11409 — the Hashtable utility this PR builds on — landed in master separately, so the diff shown here is only the AggregateTable / AggregateEntry / ClearSignal work plus follow-up cleanups.

Restructures the consumer-side aggregate store. Three logical commits, intended to be reviewed in order:

1. Add `AggregateTable` + `AggregateEntry` backed by `Hashtable`

Introduces a multi-key hash table that lets the consumer thread look up the {labels → counters} entry directly from a SpanSnapshot's raw fields — no MetricKey allocation per snapshot, no per-snapshot UTF8 cache lookups, no CHM operations. Hot-path lookup is keyHash compute → Hashtable.Support.bucket → bucket walk → matches(keyHash, snapshot) → returned entry has the counters to mutate in place.

This commit is standalone — no call sites yet, only the new classes + unit tests for hit/miss/cap-overrun/expunge/clear behavior.

2. Swap `Aggregator` to use `AggregateTable` + route `disable()` clear through a `ClearSignal`

Replaces LRUCache<MetricKey, AggregateMetric> with AggregateTable in Aggregator. Drops the AggregateExpiry listener — drop reporting (onStatsAggregateDropped) moves to the cap-overrun path inside Drainer.accept.

Threading fix bundled here: ConflatingMetricsAggregator.disable() used to call aggregator.clearAggregates() and inbox.clear() directly from the Sink's IO callback thread, racing with the aggregator thread. That race was tolerable for LinkedHashMap (worst case = corrupted internal state right before everything got cleared anyway); it's not tolerable for Hashtable (chain corruption can NPE or loop). disable() now offers a ClearSignal to the inbox so the aggregator thread itself performs the clear — preserves the single-writer invariant for AggregateTable end-to-end. The offer is best-effort; the system self-heals on a subsequent downgrade cycle if the inbox happens to be full (commented at the call site).

Cap-overrun semantic change: the old LRUCache evicted least-recently-used in O(1). AggregateTable instead scans for a hitCount==0 entry to recycle (O(N) worst-case), and drops the new key if none exists. Practical impact: in steady state, an unrelated burst of new keys gets dropped (and reported via onStatsAggregateDropped) rather than evicting established keys. The cost trade-off is commented at the eviction site — eviction is expected rare because the cap is sized to the working set; cursor-caching is the future option if a workload runs persistently at cap. The existing test that asserted "service0 evicted in favor of service10" is updated to assert the new semantics; the other cap-related test ("evicted entry was already flushed") still passes unchanged.

3. Fold `MetricKey` + `AggregateMetric` into `AggregateEntry`

MetricKey existed for two reasons — being the LRUCache key (replaced by AggregateTable's Hashtable mechanics) and being the labels arg to MetricWriter.add (the only thing left). AggregateMetric was the counter/histogram counterpart. Folds both onto a single AggregateEntry (10 UTF8 label fields + 3 primitives + counters + histograms), changes MetricWriter.add(MetricKey, AggregateMetric) → add(AggregateEntry), and deletes MetricKey.java + MetricKeys.java + AggregateMetric.java.

The 12 UTF8 caches that used to be split between MetricKey (9) and ConflatingMetricsAggregator (3, with overlap) are consolidated on AggregateEntry. One cache per field type now.

Latent bug fix: the prior matches(SpanSnapshot) used Objects.equals on raw fields. If the same logical key was delivered once as String and once as UTF8BytesString (different CharSequence impls of identical content), Objects.equals returned false and the table would split into two entries for the same key. The new matches uses content-equality (UTF8BytesString.toString() returns the underlying String in O(1)), collapsing them correctly.

Test impact: AggregateEntry.of(...) mirrors the prior new MetricKey(...) positional args, so test diffs are mostly mechanical. About 56 test sites migrated across ConflatingMetricAggregatorTest, SerializingMetricWriterTest, and MetricsIntegrationTest.

Review polish

Follow-up commits address review feedback:

Use Hashtable.Support.create(maxAggregates, Support.MAX_RATIO) + Support.bucket + Support.insertHeadEntry(buckets, keyHash, entry) + Support.mutatingTableIterator to delegate to the helpers added on Add Hashtable and LongHashingUtils utilities #11409 — drops ~50 lines of bespoke bucket-array code.
Inline the report-time forEach lambda instead of a static BiConsumer constant (the JIT reuses non-capturing lambdas).
AggregateEntry.matches(long keyHash, SpanSnapshot) overload that pre-checks the hash, so chain walks read as one call.
@Nullable (javax.annotation) annotations on the four nullable label fields + their getters + of(...) parameters.
Objects.equals import in AggregateEntry.equals() (no more fully-qualified refs).
Design-trade-off comments on evictOneStale (O(N) scan rationale) and disable() (best-effort offer rationale).

Additional cleanups

A second round of review surfacing landed these:

MetricsIntegrationTest .aggregate runtime bug — the legacy entry.aggregate.recordDurations(...) form compiled under Groovy's dynamic dispatch but would have thrown MissingPropertyException at runtime. Fixed to call recordDurations directly on the entry. (Bot-flagged.)
Spock >> { closure } no-op assertions in ConflatingMetricAggregatorTest — the >> operator stubs a return value, so the closures verifying e.getHitCount() == X && ... were being evaluated and discarded. Wrapped 31 sites in assert so Groovy power-assert surfaces mismatches. All 41 tests still pass, so the previously-unverified assertions happened to hold. (Bot-flagged.)
Drop dead recordDurations(int, AtomicLongArray) batch API — vestige of master's Batch design. Production now only calls recordOneDuration(long). Migrated the three remaining test callers (AggregateEntryTest, SerializingMetricWriterTest, MetricsIntegrationTest) to loops of recordOneDuration calls, then deleted the batched method and its AtomicLongArray imports.
AggregateEntry.of() colon-split Javadoc warning — the test factory recovers (name, value) pairs from "name:value" strings by splitting at the first :, which is brittle if a peer-tag value contains a colon (URLs, IPv6, service:env). Added an explicit warning so callers know to keep test data colon-free in values.
ConflatingMetricsAggregatorDisableTest — new JUnit 5 coverage for the disable() → ClearSignal threading routing. The test fires DOWNGRADED from the test thread, waits for the no-flush window, then publishes a marker span with a distinct resource name and asserts the next flush captures only the marker — proving CLEAR actually wiped the original entry from the table. Catches both the missing-clear regression and the bucket-chain-corruption regression that the original threading race could produce.

Benchmarks

Producer `publish()` latency (single-threaded, 2 forks × 5 iter × 15s)

	Prior commit (stacked base)	This PR
SimpleSpan bench	3.116 µs/op	3.123 µs/op
DDSpan bench	2.506 µs/op	2.412 µs/op

All within noise on the producer side — this PR is a consumer-side refactor, so producer publish() shouldn't move much. The structural wins (one less class, no per-snapshot MetricKey allocation on the consumer, no double-cache lookups, smaller per-entry footprint) only become visible when the consumer is hammered hard enough that snapshot processing rate matters. That's exactly what the adversarial bench measures.

`AdversarialMetricsBenchmark` (8 producer threads, 2×15s warmup + 5×15s, 1 fork)

Same benchmark used on #11381 (high-cardinality (service, operation, resource, peer.hostname) per op, random durations across 1ns–1s, random error/topLevel flags — designed to saturate every capacity bound at once).

	master	#11381 (parent)	this PR
Throughput avg (ops/s)	395,806 ± 2,619,133	4,889,660 ± 390,175	27,915,800 ± 1,219,470
Per-iteration progression (ops/s)	warmup 2,536,145 → 205,314 → 95,888 → 47,301 → 24,378	4,886,778 → 4,875,195 → 4,731,827 → 4,959,992 → 4,994,511	28,043,909 → 28,112,828 → 27,354,721 → 28,074,395 → 27,993,147
Stdev (ops/s)	680,180	101,327	316,692
`onStatsInboxFull` (drops at handoff)	n/a (no inbox on master)	199,862,634	2,893,855,052
`onStatsAggregateDropped`	11,642,039	84,002,323	16,301,696

~70× faster than master, ~5.7× faster than #11381 alone. The producer fast-path drop check (inbox.size() >= inbox.capacity()) returns immediately when the inbox is saturated, so when the consumer can drain ~5× faster (this PR's consumer-side win), the producer spins through publish() at the fast-path rate and inbox-full drops climb correspondingly (200M on #11381 → 2.9 B here). That's the design working as intended: under attack, load shedding happens at the cheapest place in the pipeline.

The aggregate-cache drop count actually fell (84M → 16M) because the faster consumer keeps the table from staying at cap as often — fewer snapshots arrive at a cap-overrun state with no stale entry to recycle.

Per-iteration shape stays flat (27.4M–28.1M ops/s, no warmup → measurement degradation), confirming the design holds steady-state under sustained load.

Net code delta: +1280 / −903 = +377 lines across 16 files. The growth is dominated by new test coverage (AggregateTableTest, AggregateEntryTest) plus the consolidated UTF8 caches landing on AggregateEntry; the production-code core (less MetricKey + MetricKeys + AggregateMetric minus AggregateEntry's additions) is roughly flat.

Test plan

./gradlew :dd-trace-core:test --tests 'datadog.trace.common.metrics.*' passes (incl. the new AggregateTableTest and AggregateEntryTest)
./gradlew :dd-trace-core:compileJava :dd-trace-core:compileTestGroovy :dd-trace-core:compileJmhJava :dd-trace-core:compileTraceAgentTestGroovy all green
./gradlew spotlessCheck clean
CI muzzle / integration suites
Validate stats.dropped_aggregates semantics at high cardinality (especially the new "drop new on cap overrun" path vs. the old "evict LRU" path)

🤖 Generated with Claude Code

Two general-purpose utilities used by the client-side stats aggregator work (PR #11382 and follow-ups), extracted into their own change so the metrics-specific PRs can build on a smaller, reviewable foundation. - Hashtable: a generic open-addressed-ish bucket table abstraction keyed by a 64-bit hash, with a public abstract Entry type so client code can subclass it for higher-arity keys. The metrics aggregator uses it to back its AggregateTable. - LongHashingUtils: chained 64-bit hash combiners with primitive overloads (boolean, short, int, long, Object). Used in place of varargs combiners to avoid Object[] allocation and boxing on the hot path. No callers within internal-api itself yet -- the metrics aggregator PR will introduce the first usages. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Standalone classes for swapping the consumer-side LRUCache<MetricKey, AggregateMetric> with a multi-key Hashtable in the next commit. No call sites use them yet. - AggregateEntry extends Hashtable.Entry, holds the canonical MetricKey, the mutable AggregateMetric, and copies of the 13 raw SpanSnapshot fields for matches(). The 64-bit lookup hash is computed via chained LongHashingUtils.addToHash calls (no varargs, no boxing of short/boolean). - AggregateTable wraps a Hashtable.Entry[] from Hashtable.Support.create. findOrInsert(SpanSnapshot) walks the bucket comparing raw fields, falling back to MetricKeys.fromSnapshot on a true miss. On cap overrun, it scans for an entry with hitCount==0 and unlinks it; if none, it returns null and the caller drops the data point. - MetricKeys.fromSnapshot extracts the canonicalization logic (DDCache lookups + UTF8 encoding) from Aggregator.buildMetricKey, so the helper can be called from AggregateTable on miss. This also commits Hashtable and LongHashingUtils (added earlier, previously uncommitted) and lifts Hashtable.Entry / Hashtable.Support visibility so client code outside datadog.trace.util can build higher-arity tables -- the case the javadoc describes but the original visibility didn't actually support. Specifically: Entry is now public abstract with a protected ctor; keyHash, next(), and setNext() are public; Support's create / clear / bucketIndex / bucketIterator / mutatingBucketIterator methods are public. Tests: AggregateTableTest covers hit, miss, distinct-by-spanKind, peer-tag identity (including null vs non-null), cap overrun with stale victim, cap overrun with no victim (returns null), expungeStaleAggregates, forEach, clear, and that the canonical MetricKey is built at insert. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace LRUCache<MetricKey, AggregateMetric> with the AggregateTable added in the prior commit. The hot path in Drainer.accept becomes: AggregateMetric aggregate = aggregates.findOrInsert(snapshot); if (aggregate != null) { aggregate.recordOneDuration(snapshot.tagAndDuration); dirty = true; } else { healthMetrics.onStatsAggregateDropped(); } On the steady-state hit path the lookup is a 64-bit hash compute + bucket walk + matches(snapshot) -- no MetricKey allocation, no SERVICE_NAMES / SPAN_KINDS / PEER_TAGS_CACHE lookups. The canonical MetricKey is now built once per unique key at insert time, in MetricKeys.fromSnapshot. Behavioral change in the cap-overrun path ----------------------------------------- The old LRUCache evicted least-recently-used: at cap, a new insert would push out the oldest entry regardless of whether it was live or stale. AggregateTable instead scans for a hitCount==0 entry to recycle, and drops the new key if none exists. Practical impact: in the common case where the table holds a stable set of recurring keys, an unrelated burst of new keys is dropped (and reported via onStatsAggregateDropped) rather than evicting the established keys. The existing test that asserted "service0 evicted in favor of service10" is updated to assert the new semantics. The other cap-related test ("should not report dropped aggregate when evicted entry was already flushed") still passes unchanged: after report() clears all entries to hitCount=0, the next wave of inserts recycles them. Threading fix ------------- ConflatingMetricsAggregator.disable() used to call aggregator.clearAggregates() and inbox.clear() directly from the Sink's IO event thread, racing with the aggregator thread mid-write. The race was tolerable for LinkedHashMap; it is not for AggregateTable (chain corruption can NPE or loop). disable() now offers a ClearSignal to the inbox so the aggregator thread itself performs the table clear and the inbox.clear(). Adds one SignalItem subclass + one branch in Drainer.accept; preserves the single-writer invariant for AggregateTable end-to-end. Removed: LRUCache import, AggregateExpiry inner class, the static buildMetricKey / materializePeerTags / encodePeerTag helpers (now in MetricKeys). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

MetricKey existed for two reasons -- the prior LRUCache key role (now handled by AggregateTable's Hashtable.Entry mechanics) and as the labels argument to MetricWriter.add. The first is gone; the second is the only thing keeping MetricKey alive. Fold its UTF8-encoded label fields onto AggregateEntry, change MetricWriter.add to take AggregateEntry directly, and delete MetricKey + MetricKeys. What AggregateEntry now holds ----------------------------- - 10 UTF8BytesString label fields (resource, service, operationName, serviceSource, type, spanKind, httpMethod, httpEndpoint, grpcStatusCode, and a List<UTF8BytesString> peerTags for serialization). - 3 primitives (httpStatusCode, synthetic, traceRoot). - AggregateMetric (the value being accumulated). - The raw String[] peerTagPairs is retained alongside the encoded peerTags -- matches() compares it positionally against the snapshot's pairs; the encoded form is only consumed by the writer. matches(SpanSnapshot) compares the entry's UTF8 forms to the snapshot's raw String / CharSequence fields via content-equality (UTF8BytesString.toString() returns the underlying String in O(1)). This closes a latent bug in the prior raw-vs-raw matches(): if one snapshot delivered a tag value as String and a later snapshot delivered the same content as UTF8BytesString, the old Objects.equals would return false and the table would split into two entries. Content-equality matching collapses them into one. Consolidated caches ------------------- The static UTF8 caches that used to live partly on MetricKey (RESOURCE_CACHE, OPERATION_CACHE, SERVICE_SOURCE_CACHE, TYPE_CACHE, KIND_CACHE, HTTP_METHOD_CACHE, HTTP_ENDPOINT_CACHE, GRPC_STATUS_CODE_CACHE, SERVICE_CACHE) and partly on ConflatingMetricsAggregator (SERVICE_NAMES, SPAN_KINDS, PEER_TAGS_CACHE) are all now on AggregateEntry. The split was duplicating work -- SERVICE_NAMES and SERVICE_CACHE both cached service-name to UTF8BytesString. One cache per field now. API change: MetricWriter.add ---------------------------- Was: add(MetricKey key, AggregateMetric aggregate) Now: add(AggregateEntry entry) The aggregate lives on the entry. Single-arg. SerializingMetricWriter reads the same UTF8 fields off AggregateEntry that it previously read off MetricKey; the wire format is byte-identical. Test impact ----------- AggregateEntry.of(...) takes the same 13 positional args new MetricKey(...) took, so test diffs are mostly mechanical: new MetricKey(args) -> AggregateEntry.of(args) writer.add(key, _) -> writer.add(entry) ValidatingSink in SerializingMetricWriterTest now iterates List<AggregateEntry> directly. ConflatingMetricAggregatorTest's Spock matchers (~36 sites) rely on AggregateEntry.equals comparing the 13 label fields (not the aggregate) so the mock matches by labels regardless of the aggregate state at call time; post-invocation closures verify aggregate state. Benchmarks (2 forks x 5 iter x 15s) ----------------------------------- The change is consumer-thread only; producer publish() is unchanged. SimpleSpan bench: 3.123 +- 0.025 us/op (prior: 3.119 +- 0.018) DDSpan bench: 2.412 +- 0.022 us/op (prior: 2.463 +- 0.041) Both within noise -- the win is structural (one less class, one less allocation per miss, one fewer cache layer) rather than benchmarked. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

LongHashingUtilsTest (14 cases): - hashCodeX null sentinel + non-null pass-through - all primitive hash() overloads match the boxed Java hashCodes - hash(Object...) 2/3/4/5-arg overloads match the chained addToHash formula they are documented to constant-fold to - addToHash(long, primitive) overloads match the Object-version - linear-accumulation invariant (31 * h + v) holds across a sequence - iterable / deprecated int[] / deprecated Object[] variants match chained addToHash - intHash treats null as 0 (observable via hash(null, "x")) HashtableTest (24 cases across 5 nested classes): - D1: insert/get/remove/insertOrReplace/clear/forEach, in-place value mutation, null-key handling, hash-collision chaining with disambig- uating equals, remove-from-collided-chain leaves siblings intact - D2: pair-key identity, remove(pair), insertOrReplace matches on both parts, forEach - Support: capacity rounds up to a power of two, bucketIndex stays in range across a wide hash sample, clear nulls every slot - BucketIterator: walks only matching-hash entries in a chain, throws NoSuchElementException when exhausted - MutatingBucketIterator: remove from head-of-chain unlinks, replace swaps the entry while preserving chain, remove() without prior next() throws IllegalStateException Tests live in internal-api/src/test/java/datadog/trace/util and use the already-present JUnit 5 setup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Bring the new util/ files in line with google-java-format (tabs → spaces, line wrapping, javadoc list markup) so spotlessCheck passes in CI. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@threads

Compares Hashtable.D1 and Hashtable.D2 against equivalent HashMap usage for add, update, and iterate operations. Each benchmark thread owns its own map (Scope.Thread), but @threads(8) is used so the allocation/GC pressure that Hashtable is designed to avoid surfaces in the throughput numbers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Guard Support.sizeFor against overflow and use Integer.highestOneBit; reject capacities above 1 << 30 instead of looping forever. - Add braces around single-statement while bodies in BucketIterator. - Split HashtableBenchmark into HashtableD1Benchmark / HashtableD2Benchmark. - Add regression tests for Support.sizeFor bounds. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The 5-arg Object overload was forwarding only obj0..obj3 to the int overload, silently dropping obj4. Also align LongHashingUtils.hash 3-arg signature with its 2/4/5-arg siblings (int parameters) and strengthen the 5-arg HashingUtilsTest to detect the missing-arg regression. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Split D1Tests and D2Tests into HashtableD1Test and HashtableD2Test; extract shared test entry classes into HashtableTestEntries. - Reduce visibility of LongHashingUtils.hash(int...) chaining overloads to package-private; they are internal building blocks. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The iterator tests need a populated Hashtable.Entry[] to drive Support.bucketIterator / mutatingBucketIterator. Relaxing D1.buckets from private to package-private lets the same-package tests read it directly, removing the reflection helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The label fields and the mutable counters/histograms are 1:1 with each entry; carrying them on a separate object meant one extra allocation per unique key plus an indirection on every hot-path update. Merging them puts the counters directly on AggregateEntry, drops the entry.aggregate hop, and consolidates ERROR_TAG / TOP_LEVEL_TAG onto the same class the consumer uses to decode them. AggregateTable.findOrInsert now returns AggregateEntry. Callers in Aggregator and SerializingMetricWriter updated. Migrated AggregateMetricTest.groovy to AggregateEntryTest.java per project policy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add a context-passing forEach(T, BiConsumer) overload to AggregateTable, mirroring TagMap's pattern. Aggregator.report now hands the writer in as context to a static BiConsumer so no fresh Consumer is allocated each report cycle. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Mirrors the TagMap pattern: pairs the existing forEach(Consumer) with a forEach(T context, BiConsumer<T, TEntry>) overload so callers can hand side-band state to a non-capturing lambda and avoid the fresh-Consumer-per-call allocation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

datadog-datadog-prod-us1-2 · 2026-05-19T17:52:05Z

✨ Fix all issues with BitsAI

⚠️ Warnings

🚦 2 Pipeline jobs failed

DataDog/apm-reliability/dd-trace-java | check_base

🔧 Fix in code (Fix with Cursor).
Execution failed for task ':dd-trace-core:spotbugsMain'. Verification failed: SpotBugs ended with exit code 1. See the report at: file:///go/src/github.com/DataDog/apm-reliability/dd-trace-java/workspace/dd-trace-core/build/reports/spotbugs/spotbugsMain.html

DataDog/apm-reliability/dd-trace-java | spotless

🔧 Fix in code (Fix with Cursor).
Spotless check failed due to code formatting issues in Java files, specifically in the documentation comments.

Useful? React with 👍 / 👎

_{This comment will be updated automatically if new data arrives.

🔗 Commit SHA: 200028d | Docs | Datadog PR Page | Give us feedback!}

Factors the unchecked (TEntry) cast out of D1.forEach / D2.forEach (and the BiConsumer variants) into Support.forEach(buckets, ...). The cast now lives in one place, mirroring how Entry.next() handles it, and the D1/D2 methods become one-liners. Downstream higher-arity tables built on Support gain the same helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Now that Hashtable.Support exposes the parameterized forEach helpers, AggregateTable's own forEach methods can drop their duplicated loop body and the (AggregateEntry) cast. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…sistency Addresses PR #11409 review comment #3276167001. The method parallels the primitive hash(boolean) / hash(int) / hash(long) / ... family, so naming it hash(Object) -- with null collapsing to Long.MIN_VALUE as a sentinel distinct from any real hashCode -- matches the rest of the public surface. Test call sites that pass a literal null now disambiguate against hash(int[]) / hash(Object[]) / hash(Iterable) via an (Object) cast. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ric-key

…optimize-metric-key

@threads

Add Hashtable and LongHashingUtils to datadog.trace.util Two general-purpose utilities used by the client-side stats aggregator work (PR #11382 and follow-ups), extracted into their own change so the metrics-specific PRs can build on a smaller, reviewable foundation. - Hashtable: a generic open-addressed-ish bucket table abstraction keyed by a 64-bit hash, with a public abstract Entry type so client code can subclass it for higher-arity keys. The metrics aggregator uses it to back its AggregateTable. - LongHashingUtils: chained 64-bit hash combiners with primitive overloads (boolean, short, int, long, Object). Used in place of varargs combiners to avoid Object[] allocation and boxing on the hot path. No callers within internal-api itself yet -- the metrics aggregator PR will introduce the first usages. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Add unit tests for Hashtable and LongHashingUtils LongHashingUtilsTest (14 cases): - hashCodeX null sentinel + non-null pass-through - all primitive hash() overloads match the boxed Java hashCodes - hash(Object...) 2/3/4/5-arg overloads match the chained addToHash formula they are documented to constant-fold to - addToHash(long, primitive) overloads match the Object-version - linear-accumulation invariant (31 * h + v) holds across a sequence - iterable / deprecated int[] / deprecated Object[] variants match chained addToHash - intHash treats null as 0 (observable via hash(null, "x")) HashtableTest (24 cases across 5 nested classes): - D1: insert/get/remove/insertOrReplace/clear/forEach, in-place value mutation, null-key handling, hash-collision chaining with disambig- uating equals, remove-from-collided-chain leaves siblings intact - D2: pair-key identity, remove(pair), insertOrReplace matches on both parts, forEach - Support: capacity rounds up to a power of two, bucketIndex stays in range across a wide hash sample, clear nulls every slot - BucketIterator: walks only matching-hash entries in a chain, throws NoSuchElementException when exhausted - MutatingBucketIterator: remove from head-of-chain unlinks, replace swaps the entry while preserving chain, remove() without prior next() throws IllegalStateException Tests live in internal-api/src/test/java/datadog/trace/util and use the already-present JUnit 5 setup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Apply spotless formatting to Hashtable and LongHashingUtils Bring the new util/ files in line with google-java-format (tabs → spaces, line wrapping, javadoc list markup) so spotlessCheck passes in CI. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Add JMH benchmarks for Hashtable.D1 and D2 Compares Hashtable.D1 and Hashtable.D2 against equivalent HashMap usage for add, update, and iterate operations. Each benchmark thread owns its own map (Scope.Thread), but @threads(8) is used so the allocation/GC pressure that Hashtable is designed to avoid surfaces in the throughput numbers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Add benchmark results to HashtableBenchmark header Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Address review feedback on Hashtable - Guard Support.sizeFor against overflow and use Integer.highestOneBit; reject capacities above 1 << 30 instead of looping forever. - Add braces around single-statement while bodies in BucketIterator. - Split HashtableBenchmark into HashtableD1Benchmark / HashtableD2Benchmark. - Add regression tests for Support.sizeFor bounds. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Fix dropped argument in HashingUtils 5-arg Object hash The 5-arg Object overload was forwarding only obj0..obj3 to the int overload, silently dropping obj4. Also align LongHashingUtils.hash 3-arg signature with its 2/4/5-arg siblings (int parameters) and strengthen the 5-arg HashingUtilsTest to detect the missing-arg regression. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Address review feedback on Hashtable - Split D1Tests and D2Tests into HashtableD1Test and HashtableD2Test; extract shared test entry classes into HashtableTestEntries. - Reduce visibility of LongHashingUtils.hash(int...) chaining overloads to package-private; they are internal building blocks. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Drop reflection in iterator tests via package-private D1.buckets The iterator tests need a populated Hashtable.Entry[] to drive Support.bucketIterator / mutatingBucketIterator. Relaxing D1.buckets from private to package-private lets the same-package tests read it directly, removing the reflection helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Add context-passing forEach to Hashtable.D1 and D2 Mirrors the TagMap pattern: pairs the existing forEach(Consumer) with a forEach(T context, BiConsumer<T, TEntry>) overload so callers can hand side-band state to a non-capturing lambda and avoid the fresh-Consumer-per-call allocation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Move forEach loop body to Support helper Factors the unchecked (TEntry) cast out of D1.forEach / D2.forEach (and the BiConsumer variants) into Support.forEach(buckets, ...). The cast now lives in one place, mirroring how Entry.next() handles it, and the D1/D2 methods become one-liners. Downstream higher-arity tables built on Support gain the same helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Move bucket-head cast to Support.bucket helper Adds Support.bucket(buckets, keyHash) which returns the bucket head already cast to the caller's concrete entry type. D1.get and D2.get now drop the raw-Entry intermediate variable and walk the chain via Entry.next() directly. The unchecked cast lives in one place, consistent with Entry.next() and Support.forEach. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Drop d1_/d2_ prefix from per-table benchmark methods Holdover from when both lived in a shared HashtableBenchmark; redundant now that each lives in its own class. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Add Hashtable.Support helpers: MAX_RATIO, insertHeadEntry, MutatingTableIterator Three consumer-facing helpers that callers building higher-arity tables on top of Hashtable.Support kept open-coding: - MAX_RATIO_NUMERATOR / _DENOMINATOR: the 4/3 multiplier for sizing a bucket array from a target working-set under a 75% load factor. - insertHeadEntry(buckets, bucketIndex, entry): the (setNext + array-store) pair for splicing a new entry at the head of a bucket chain. - MutatingTableIterator + Support.mutatingTableIterator(buckets): walks every entry in the table (not filtered by hash) with remove() support, for sweeps like eviction and expunge that aren't keyed to a specific hash. Sibling of MutatingBucketIterator. Tests cover the table-wide iterator at head-of-bucket and mid-chain removal, empty buckets between live entries, exhaustion, and remove-without-next. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Swap MAX_RATIO numerator/denominator pair for a single float + scaled create() Replace Support.MAX_RATIO_NUMERATOR / _DENOMINATOR with a single float MAX_RATIO constant, and add a Support.create(int, float) overload that takes a scale factor. Callers now write Support.create(n, MAX_RATIO) instead of stitching together the int arithmetic at the call site. The scaled size is truncated (not ceiled) before going through sizeFor. sizeFor already rounds up to the next power of two, so truncation just absorbs float fuzz that would otherwise push a result like 12 * 4/3 = 16.0000005f past 16 and double the bucket array size for no reason. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Tighten Hashtable docs + rename MAX_CAPACITY to MAX_BUCKETS Five small cleanups from a design re-review pass: 1. Support javadoc: drop the stale "methods are package-private" sentence; most of them were made public in earlier commits for higher-arity callers. Also drop the "nested BucketIterator" framing (iterators are peers of Support inside Hashtable, not nested inside Support). 2. MAX_RATIO javadoc: drop the Math.ceil recommendation; create(int, float) deliberately truncates and is the canonical pathway. 3. Document the null-hash treatment on D1.Entry.hash and D2.Entry.hash so the behavior difference is explicit: D1 uses Long.MIN_VALUE as a sentinel that's collision-free against any int-valued hashCode(); D2 has no such sentinel and relies on matches() to resolve null/null vs hash-0 collisions. 4. Rename Support.MAX_CAPACITY -> MAX_BUCKETS and sizeFor's parameter to requestedSize. The cap is on the bucket-array length, not entry count; the new name reflects that. Error messages updated to match. 5. Drop the `abstract` modifier on Hashtable in favor of `final` with a private constructor. Nothing actually subclasses Hashtable -- the abstract was a namespace device that read as "intended for extension." Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Dedupe chain-head splice in D1/D2 via keyHash insertHeadEntry overload - Add Support.insertHeadEntry(buckets, long keyHash, entry) overload that derives the bucket index itself. Callers that already have a hash but not the index (the common case) now avoid the redundant bucketIndex(...) hop. - D1.insert, D1.insertOrReplace, D2.insert, D2.insertOrReplace: use the new overload, drop the (thisBuckets local, bucketIndex compute, setNext, store) sequence at each call site. - D2.buckets: drop the `private` modifier to match D1.buckets. Both are package-private so iterator tests in the same package can drive Support.bucketIterator against the table's bucket array. Added a short comment on both fields documenting the rationale. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Tighten Entry.next encapsulation; doc hasNext; add D1/D2 getOrCreate Three follow-ups from the design review: - Make Hashtable.Entry.next private. All same-package readers (BucketIterator) already had a next() accessor; the leftover direct field reads now route through it. Closes the "mixed encapsulation" gap where some readers used the accessor and same-package ones reached for the field. - BucketIterator and MutatingBucketIterator now document that chain-walk work happens in next() (and the constructor for the first match); hasNext() is an O(1) field read. - Add D1.getOrCreate(K, Function) and D2.getOrCreate(K1, K2, BiFunction). Both reuse the lookup hash for the insert on miss, avoiding the double-hash that "get; if null then insert" callers would otherwise pay. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Hashtable: add missing braces and detach removed/replaced entries Addresses PR #11409 review comments: - #3267164119 / #3267165525: wrap every single-line if/break body in braces (7 sites across BucketIterator, MutatingBucketIterator, and the full-table Iterator). - #3275947761 / #3275948108 (sarahchen6): null out the removed/replaced entry's next pointer after splicing it out of the chain in MutatingBucketIterator.remove / .replace. Applied the same fix to the full-table Iterator.remove for consistency. Rationale: detaching prevents accidental traversal through a removed entry via a stale reference and lets the GC reclaim a chain tail that the removed entry was the last referrer to. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Rename LongHashingUtils.hashCodeX(Object) to hash(Object) for API consistency Addresses PR #11409 review comment #3276167001. The method parallels the primitive hash(boolean) / hash(int) / hash(long) / ... family, so naming it hash(Object) -- with null collapsing to Long.MIN_VALUE as a sentinel distinct from any real hashCode -- matches the rest of the public surface. Test call sites that pass a literal null now disambiguate against hash(int[]) / hash(Object[]) / hash(Iterable) via an (Object) cast. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Merge branch 'master' into dougqh/util-hashtable Co-authored-by: devflow.devflow-routing-intake <devflow.devflow-routing-intake@kubernetes.us1.ddbuild.io>

…optimize-metric-key

…bility The verify(writer).add(MetricKey, AggregateMetric) signature is unique to #11381; downstream branches use AggregateEntry. Switching to verify(writer, times(2)).finishBucket() keeps the same behavioral guarantee (both cycles flushed) across the stack. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…optimize-metric-key

#11382 collapses MetricWriter.add(MetricKey, AggregateMetric) into add(AggregateEntry). Re-target the captor and accessors on this branch so the test compiles and the same end-to-end peer-tag verification holds. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…optimize-metric-key # Conflicts: # dd-trace-core/src/main/java/datadog/trace/common/metrics/Aggregator.java

AggregateEntry consolidated MetricKey + AggregateMetric so recordDurations lives directly on AggregateEntry now. The previous entry1.aggregate. recordDurations(...) form compiles under Groovy's dynamic dispatch but would throw MissingPropertyException at runtime since there is no `aggregate` property. Resolves chatgpt-codex-connector's review comment. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The `1 * writer.add(value) >> { closure }` pattern treats the closure as a stubbed return value -- Spock evaluates it but discards the result, so `e.getHitCount() == X && ...` was a silent no-op across 31 occurrences. Wrapping the expression in `assert` makes Groovy's power-assert throw on mismatch, which Spock surfaces as a real failure. Resolves chatgpt-codex-connector's review comment. All 41 tests still pass, so the previously-unverified assertions happened to hold. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

This method was a vestige of master's Batch design where multiple producer threads wrote into an AtomicLongArray slot concurrently and the aggregator drained ~64 durations per Batch in one call. The new producer/consumer split publishes one SpanSnapshot per span, so production only ever calls recordOneDuration(long). Migrate the three remaining callers (AggregateEntryTest, SerializingMetricWriterTest, MetricsIntegrationTest) to a loop of recordOneDuration(long) calls, then delete the batched method and its AtomicLongArray imports. Drops the recordDurationsIgnoresTrailingZeros test -- that behavior was a specific quirk of the batched API (count parameter shorter than the array length) and doesn't apply to recordOneDuration. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The factory recovers (name, value) pairs from pre-encoded "name:value" strings by splitting at the FIRST colon. Test-only, but worth being explicit so callers don't hand it a peer-tag value containing a colon (URLs, IPv6, service:env) and get a silently wrong (name, value) pair. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The bundled fix in this PR routes the agent-downgrade clear through the inbox so the aggregator thread stays the sole writer to AggregateTable. Prior to this test, there was no regression coverage for that routing. The test fires DOWNGRADED from the test thread (production-like OkHttpSink callback path), waits for the immediate no-flush window, then publishes a marker span with a distinct resource name. The subsequent report's writer.add captor must see only the marker -- if CLEAR didn't actually wipe the original entry, the original "resource" would still be present and the assertion would catch it. Cannot directly verify thread identity of the clear from inside this test (CLEAR's inbox.clear() drops any latch signal we'd queue behind it), so this is an observable-contract test rather than a strict thread-id test. Still catches both the missing-clear regression and the bucket-chain-corruption regression that the original threading race could produce. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…optimize-metric-key

…dinality Resolved conflicts: - AggregateEntry.java: kept #11387's cardinality-handler constructor shape; dropped the dead recordDurations(int, AtomicLongArray) batch API and its AtomicLongArray import (the #11382 cleanup). The of() colon-split warning from #11382 doesn't apply because #11387's of() takes peerTags as a pre-built list (no colon-splitting on test reconstruction). - ClientStatsAggregator.java: kept #11387's PeerTagSchema.of(..., healthMetrics) 3-arg signature; preserved #11382's read-order race fix (lastTimeDiscovered read before peerTags); preserved the resetCardinalityHandlers method from HEAD. - ClientStatsAggregatorTest.groovy: kept HEAD's cardinality-focused tests; applied the #11382 Spock-assert fix (wrap '>> { closure }' body expressions in assert) to all 31 sites so power-assert surfaces mismatches. - ClientStatsAggregatorBootstrapTest.java: renamed the reconcileSwapsSchemaWhenTagSetChanges test's ConflatingMetricsAggregator references to ClientStatsAggregator. - ConflatingMetricsAggregatorDisableTest.java: renamed file + references to ClientStatsAggregatorDisableTest with corresponding class-name change. - AdversarialMetricsBenchmark.java: updated ConflatingMetricsAggregator + ConflatingMetricsAggregatorBenchmark references to ClientStatsAggregator / ClientStatsAggregatorBenchmark. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ory-efficiency Resolved conflicts: - AdversarialMetricsBenchmark.java: kept #11389's Javadoc (specific to AggregateTable + cardinality handlers), applied #11382's LongAdder + concise tearDown cleanup. - AggregateEntry.java: dropped recordDurations(int, AtomicLongArray) per the #11382 cleanup (no longer used after migrating the test callers to recordOneDuration). - MetricsIntegrationTest.groovy: migrated to per-duration recordOneDuration calls. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

sarahchen6 · 2026-05-22T00:24:30Z

+  /** Hot-path constructor for the producer/consumer flow. Builds UTF8 fields via the caches. */
+  private AggregateEntry(SpanSnapshot s, long keyHash) {
+    super(keyHash);
+    this.resource = canonicalize(RESOURCE_CACHE, s.resourceName);


it looks like null resources are "canonicalize"d to EMPTY on line 417, but the contentEquals on line 267 expects null to equal null. Could adjusting the contentEquals method to match null to EMPTY as well be enough to address this?

Yeah, I've gone back and forth a few times on null vs EMPTY. I'll double check that aspect of it.

dougqh added type: enhancement Enhancements and improvements comp: core Tracer core tag: performance Performance related changes tag: no release notes Changes to exclude from release notes comp: metrics Metrics tag: ai generated Largely based on code generated by an AI or LLM labels May 15, 2026

dougqh mentioned this pull request May 15, 2026

Per-component / tag cardinality limits in client-side stats #11387

Draft

3 tasks

dougqh mentioned this pull request May 18, 2026

Add Hashtable and LongHashingUtils utilities #11409

Merged

2 tasks

dougqh and others added 3 commits May 18, 2026 15:18

dougqh force-pushed the dougqh/optimize-metric-key branch from 050a998 to 3738c85 Compare May 18, 2026 19:20

dougqh changed the base branch from dougqh/conflating-metrics-background-work to dougqh/util-hashtable May 18, 2026 19:21

dougqh and others added 8 commits May 18, 2026 15:40

Apply spotless formatting to Hashtable and LongHashingUtils

031dc89

Bring the new util/ files in line with google-java-format (tabs → spaces, line wrapping, javadoc list markup) so spotlessCheck passes in CI. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add benchmark results to HashtableBenchmark header

a534e4f

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

dougqh commented May 19, 2026

View reviewed changes

Comment thread dd-trace-core/src/main/java/datadog/trace/common/metrics/Aggregator.java

dougqh and others added 3 commits May 19, 2026 13:41

dougqh and others added 3 commits May 19, 2026 13:58

Merge branch 'dougqh/util-hashtable' into dougqh/optimize-metric-key

7ab47f7

dougqh force-pushed the dougqh/util-hashtable branch from 66ec7f6 to e2642cd Compare May 20, 2026 18:05

dougqh and others added 3 commits May 20, 2026 14:19

Merge remote-tracking branch 'origin/master' into dougqh/optimize-met…

aab5b3b

…ric-key

Merge branch 'dougqh/util-hashtable' into dougqh/optimize-metric-key

88f169f

dougqh requested review from amarziali and sarahchen6 May 20, 2026 19:01

Merge branch 'dougqh/conflating-metrics-background-work' into dougqh/…

c2ff012

…optimize-metric-key

Base automatically changed from dougqh/util-hashtable to master May 20, 2026 21:47

dougqh mentioned this pull request May 21, 2026

Defer MetricKey construction to the aggregator thread #11381

Open

6 tasks

dougqh and others added 3 commits May 21, 2026 10:59

Merge branch 'dougqh/conflating-metrics-background-work' into dougqh/…

1d1ba4d

…optimize-metric-key

Merge branch 'dougqh/conflating-metrics-background-work' into dougqh/…

3d4f380

…optimize-metric-key

dougqh changed the base branch from master to dougqh/conflating-metrics-background-work May 21, 2026 17:55

dougqh and others added 13 commits May 21, 2026 13:58

Merge branch 'dougqh/conflating-metrics-background-work' into dougqh/…

ba0d390

…optimize-metric-key

Merge branch 'dougqh/conflating-metrics-background-work' into dougqh/…

f71384e

…optimize-metric-key

Merge branch 'dougqh/conflating-metrics-background-work' into dougqh/…

471f8f7

…optimize-metric-key

Merge branch 'dougqh/conflating-metrics-background-work' into dougqh/…

7a074e0

…optimize-metric-key

Merge branch 'dougqh/conflating-metrics-background-work' into dougqh/…

3fa7ac7

…optimize-metric-key

Merge branch 'dougqh/conflating-metrics-background-work' into dougqh/…

8ec956d

…optimize-metric-key # Conflicts: # dd-trace-core/src/main/java/datadog/trace/common/metrics/Aggregator.java

Merge branch 'dougqh/conflating-metrics-background-work' into dougqh/…

200028d

…optimize-metric-key

sarahchen6 reviewed May 22, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update client-side stats to use light weight Hashtable#11382

Update client-side stats to use light weight Hashtable#11382
dougqh wants to merge 78 commits into
dougqh/conflating-metrics-background-workfrom
dougqh/optimize-metric-key

dougqh commented May 15, 2026 •

edited

Loading

Uh oh!

Uh oh!

datadog-datadog-prod-us1-2 Bot commented May 19, 2026 •

edited by datadog-prod-us1-4 Bot

Loading

Uh oh!

sarahchen6 May 22, 2026

Uh oh!

dougqh May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

dougqh commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What Does This Do

Motivation

Additional Notes

1. Add AggregateTable + AggregateEntry backed by Hashtable

2. Swap Aggregator to use AggregateTable + route disable() clear through a ClearSignal

3. Fold MetricKey + AggregateMetric into AggregateEntry

Review polish

Additional cleanups

Benchmarks

Producer publish() latency (single-threaded, 2 forks × 5 iter × 15s)

AdversarialMetricsBenchmark (8 producer threads, 2×15s warmup + 5×15s, 1 fork)

Test plan

Uh oh!

Uh oh!

datadog-datadog-prod-us1-2 Bot commented May 19, 2026 • edited by datadog-prod-us1-4 Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

⚠️ Warnings

Uh oh!

sarahchen6 May 22, 2026

Choose a reason for hiding this comment

Uh oh!

dougqh May 22, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

dougqh commented May 15, 2026 •

edited

Loading

1. Add `AggregateTable` + `AggregateEntry` backed by `Hashtable`

2. Swap `Aggregator` to use `AggregateTable` + route `disable()` clear through a `ClearSignal`

3. Fold `MetricKey` + `AggregateMetric` into `AggregateEntry`

Producer `publish()` latency (single-threaded, 2 forks × 5 iter × 15s)

`AdversarialMetricsBenchmark` (8 producer threads, 2×15s warmup + 5×15s, 1 fork)

datadog-datadog-prod-us1-2 Bot commented May 19, 2026 •

edited by datadog-prod-us1-4 Bot

Loading