From 129ebd110a6b004ec951cf5bbda838c64f1c8962 Mon Sep 17 00:00:00 2001
From: Douglas Q Hawkins
Date: Fri, 15 May 2026 15:58:33 -0400
Subject: [PATCH 01/33] Cap per-field metric tag cardinality via Property/TagCardinalityHandler

Replaces the per-field DDCache layer inside AggregateEntry with the two new
cardinality handlers. Each per-field handler holds a small HashMap working
set; when its budget is exhausted, subsequent values collapse to a stable
"blocked_by_tracer" sentinel UTF8BytesString rather than growing without
bound. The handlers are reset on the aggregator thread at the end of each
report() cycle (10s default), so the cardinality budget refreshes per
reporting interval.

Caches replaced (limits preserved from the prior DDCache sizes):

  RESOURCE_HANDLER          32
  SERVICE_HANDLER           32
  OPERATION_HANDLER         64
  SERVICE_SOURCE_HANDLER    16
  TYPE_HANDLER              8
  SPAN_KIND_HANDLER         16
  HTTP_METHOD_HANDLER       8
  HTTP_ENDPOINT_HANDLER     32
  GRPC_STATUS_CODE_HANDLER  32
  PEER_TAG_HANDLERS         per-tag-name TagCardinalityHandler, each 512

Two production-only changes to the handlers as the user wrote them:

- Fixed import: datadog.collections.tagmap6lazy.TagMap doesn't exist;
  TagCardinalityHandler now imports datadog.trace.api.TagMap which has the
  Entry API the handler uses.
- Added TagCardinalityHandler.register(String) overload so AggregateEntry's
  peer-tag canonicalization doesn't have to allocate a TagMap.Entry per
  call -- the snapshot already carries peer-tag values as a flattened
  String[] {name, value, ...}.

AggregateEntry split into two construction paths:

- forSnapshot(snapshot, agg): the hot path; runs each field through the
  appropriate handler.
- of(...): test-only factory; bypasses the handlers and creates UTF8
  instances directly, so tests don't pollute static handler state.
  Content-equality on the resulting entry still matches the
  production-built one.

Thread-safety: handlers are HashMap-backed and not safe for concurrent
access. Both forSnapshot and resetCardinalityHandlers must be called from
the aggregator thread. After the prior commits that moved MetricKey
construction to the aggregator thread, this is the only thread that
canonicalizes; the test factory path runs on test threads but doesn't touch
the handlers.

Reset semantics: clearing the handler's working set drops the
{value -> UTF8BytesString} mapping but doesn't invalidate existing
AggregateEntry fields -- those keep their UTF8BytesString references alive
on their own. Subsequent snapshots with the same content still resolve to
the existing entries via content-equality matches(). New values after reset
get freshly allocated UTF8BytesStrings via the handler.

Known limitation (not fixed here): hashOf(SpanSnapshot) hashes from the raw
snapshot fields, not from the post-handler canonical form. So when
cardinality is exceeded, multiple distinct raw values that collapse to the
"blocked_by_tracer" sentinel still produce distinct hashes and land in
different AggregateEntry buckets -- the wire payload will carry multiple
rows that all label as blocked. This is the same behavior the prior
DDCache-based design would have had at capacity. Collapsing those into a
single sentinel entry would require canonicalizing before hashing and is a
follow-up.

Tests: new CardinalityHandlerTest covers PropertyCardinalityHandler and
TagCardinalityHandler in isolation (hit/miss, over-limit blocking, reset
behavior, sentinel stability). Existing ConflatingMetricAggregatorTest /
SerializingMetricWriterTest / AggregateTableTest all pass unchanged because
the test factory bypasses handlers.

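Illustration of the known limitation described above (a rough sketch, not
code from this patch): plain Strings stand in for UTF8BytesString, and the
names CardinalityCapSketch and distinctKeys are hypothetical. It mirrors
the order of checks in PropertyCardinalityHandler.register and counts how
many distinct table keys result when keying on raw values versus on the
post-handler canonical values.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for a per-field cardinality handler; not the patch's class. */
public class CardinalityCapSketch {
  public static final String BLOCKED = "blocked_by_tracer";

  private final int limit;
  private final Map<String, String> seen = new HashMap<>();

  public CardinalityCapSketch(int limit) {
    this.limit = limit;
  }

  /** Returns the canonical form of value, or the sentinel once the budget is used up. */
  public String register(String value) {
    // Same order as the patch: the budget check runs before the lookup, so once
    // the working set is full every call returns the sentinel until reset().
    if (seen.size() >= limit) {
      return BLOCKED;
    }
    String existing = seen.get(value);
    if (existing != null) {
      return existing;
    }
    seen.put(value, value);
    return value;
  }

  /** Clears the working set so the budget refreshes for the next interval. */
  public void reset() {
    seen.clear();
  }

  /** Returns {distinct raw keys, distinct canonical keys} for the given inputs. */
  public static int[] distinctKeys(int limit, String... rawValues) {
    CardinalityCapSketch handler = new CardinalityCapSketch(limit);
    Set<String> rawKeys = new HashSet<>();
    Set<String> canonicalKeys = new HashSet<>();
    for (String raw : rawValues) {
      rawKeys.add(raw); // keying on the raw field: one key per distinct value
      canonicalKeys.add(handler.register(raw)); // keying after canonicalization
    }
    return new int[] {rawKeys.size(), canonicalKeys.size()};
  }

  public static void main(String[] args) {
    int[] counts = distinctKeys(2, "a", "b", "c", "d");
    System.out.println("raw keys: " + counts[0] + ", canonical keys: " + counts[1]);
  }
}
```

With a limit of 2 and inputs a, b, c, d, keying on raw values yields four
keys while keying on canonical values yields three (a, b, and the
sentinel), which is the fragmentation the limitation refers to.
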
Benchmarks (2 forks x 5 iter x 15s) -- producer side unchanged because the handlers live on the consumer thread: SimpleSpan bench: 3.114 +- 0.045 us/op (prior: 3.123 +- 0.018) DDSpan bench: 2.364 +- 0.113 us/op (prior: 2.412 +- 0.022) Co-Authored-By: Claude Opus 4.7 (1M context) --- .../trace/common/metrics/AggregateEntry.java | 279 +++++++++++------- .../trace/common/metrics/Aggregator.java | 3 + .../metrics/PropertyCardinalityHandler.java | 45 +++ .../common/metrics/TagCardinalityHandler.java | 76 +++++ .../metrics/CardinalityHandlerTest.java | 88 ++++++ 5 files changed, 384 insertions(+), 107 deletions(-) create mode 100644 dd-trace-core/src/main/java/datadog/trace/common/metrics/PropertyCardinalityHandler.java create mode 100644 dd-trace-core/src/main/java/datadog/trace/common/metrics/TagCardinalityHandler.java create mode 100644 dd-trace-core/src/test/java/datadog/trace/common/metrics/CardinalityHandlerTest.java diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java index e2fda9fde47..55536b7a8f3 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java @@ -1,19 +1,15 @@ package datadog.trace.common.metrics; -import static datadog.trace.api.Functions.UTF8_ENCODE; -import static datadog.trace.bootstrap.instrumentation.api.UTF8BytesString.EMPTY; - -import datadog.trace.api.Pair; -import datadog.trace.api.cache.DDCache; -import datadog.trace.api.cache.DDCaches; import datadog.trace.bootstrap.instrumentation.api.UTF8BytesString; import datadog.trace.util.Hashtable; import datadog.trace.util.LongHashingUtils; import java.util.ArrayList; import java.util.Arrays; import java.util.Collections; +import java.util.HashMap; import java.util.List; -import java.util.function.Function; +import java.util.Map; +import java.util.Objects; /** * Hashtable 
entry for the consumer-side aggregator. Holds the UTF8-encoded label fields (the data @@ -24,45 +20,41 @@ * String} vs {@code UTF8BytesString} mixing on the same logical key collapses into one entry * instead of splitting. * - *
<p>The static UTF8 caches that used to live on {@code MetricKey} and {@code - * ConflatingMetricsAggregator} are consolidated here. + *
<p>UTF8 canonicalization runs through per-field {@link PropertyCardinalityHandler}s (and {@link + * TagCardinalityHandler}s for peer tags), so cardinality is capped per reporting interval and + * overflow values are bucketed into a {@code blocked_by_tracer} sentinel rather than allowed to + * grow without bound. The handlers are reset on the aggregator thread every reporting cycle via + * {@link #resetCardinalityHandlers()}. + * + *
<p>
Thread-safety: the cardinality handlers are not thread-safe. Only the aggregator thread + * may call {@link #forSnapshot} or {@link #resetCardinalityHandlers}. Test code uses {@link #of} + * which constructs entries without touching the handlers. */ final class AggregateEntry extends Hashtable.Entry { - // UTF8 caches consolidated from the previous MetricKey + ConflatingMetricsAggregator split. - private static final DDCache RESOURCE_CACHE = - DDCaches.newFixedSizeCache(32); - private static final DDCache SERVICE_CACHE = - DDCaches.newFixedSizeCache(32); - private static final DDCache OPERATION_CACHE = - DDCaches.newFixedSizeCache(64); - private static final DDCache SERVICE_SOURCE_CACHE = - DDCaches.newFixedSizeCache(16); - private static final DDCache TYPE_CACHE = DDCaches.newFixedSizeCache(8); - private static final DDCache SPAN_KIND_CACHE = - DDCaches.newFixedSizeCache(16); - private static final DDCache HTTP_METHOD_CACHE = - DDCaches.newFixedSizeCache(8); - private static final DDCache HTTP_ENDPOINT_CACHE = - DDCaches.newFixedSizeCache(32); - private static final DDCache GRPC_STATUS_CODE_CACHE = - DDCaches.newFixedSizeCache(32); - - /** - * Outer cache keyed by peer-tag name, with an inner per-name cache keyed by value. The inner - * cache produces the "name:value" encoded form the serializer writes. - */ - private static final DDCache< - String, Pair, Function>> - PEER_TAGS_CACHE = DDCaches.newFixedSizeCache(64); - - private static final Function< - String, Pair, Function>> - PEER_TAGS_CACHE_ADDER = - key -> - Pair.of( - DDCaches.newFixedSizeCache(512), - value -> UTF8BytesString.create(key + ":" + value)); + // Per-field cardinality limits. Identical to the prior DDCache sizes. 
+ private static final PropertyCardinalityHandler RESOURCE_HANDLER = + new PropertyCardinalityHandler(32); + private static final PropertyCardinalityHandler SERVICE_HANDLER = + new PropertyCardinalityHandler(32); + private static final PropertyCardinalityHandler OPERATION_HANDLER = + new PropertyCardinalityHandler(64); + private static final PropertyCardinalityHandler SERVICE_SOURCE_HANDLER = + new PropertyCardinalityHandler(16); + private static final PropertyCardinalityHandler TYPE_HANDLER = new PropertyCardinalityHandler(8); + private static final PropertyCardinalityHandler SPAN_KIND_HANDLER = + new PropertyCardinalityHandler(16); + private static final PropertyCardinalityHandler HTTP_METHOD_HANDLER = + new PropertyCardinalityHandler(8); + private static final PropertyCardinalityHandler HTTP_ENDPOINT_HANDLER = + new PropertyCardinalityHandler(32); + private static final PropertyCardinalityHandler GRPC_STATUS_CODE_HANDLER = + new PropertyCardinalityHandler(32); + + /** Per-peer-tag-name {@link TagCardinalityHandler}, each sized to 512 distinct values. */ + private static final Map PEER_TAG_HANDLERS = new HashMap<>(); + + private static final int PEER_TAG_VALUE_LIMIT = 512; private final UTF8BytesString resource; private final UTF8BytesString service; @@ -84,39 +76,79 @@ final class AggregateEntry extends Hashtable.Entry { final AggregateMetric aggregate; - /** Hot-path constructor for the producer/consumer flow. Builds UTF8 fields via the caches. */ - private AggregateEntry(SpanSnapshot s, long keyHash, AggregateMetric aggregate) { + /** Field-bearing constructor used by both the hot path and the test factory. 
*/ + private AggregateEntry( + long keyHash, + UTF8BytesString resource, + UTF8BytesString service, + UTF8BytesString operationName, + UTF8BytesString serviceSource, + UTF8BytesString type, + UTF8BytesString spanKind, + UTF8BytesString httpMethod, + UTF8BytesString httpEndpoint, + UTF8BytesString grpcStatusCode, + short httpStatusCode, + boolean synthetic, + boolean traceRoot, + String[] peerTagPairsRaw, + List peerTags, + AggregateMetric aggregate) { super(keyHash); - this.resource = canonicalize(RESOURCE_CACHE, s.resourceName); - this.service = SERVICE_CACHE.computeIfAbsent(s.serviceName, UTF8_ENCODE); - this.operationName = canonicalize(OPERATION_CACHE, s.operationName); - this.serviceSource = - s.serviceNameSource == null - ? null - : canonicalize(SERVICE_SOURCE_CACHE, s.serviceNameSource); - this.type = canonicalize(TYPE_CACHE, s.spanType); - this.spanKind = SPAN_KIND_CACHE.computeIfAbsent(s.spanKind, UTF8BytesString::create); - this.httpMethod = - s.httpMethod == null - ? null - : HTTP_METHOD_CACHE.computeIfAbsent(s.httpMethod, UTF8BytesString::create); - this.httpEndpoint = - s.httpEndpoint == null - ? null - : HTTP_ENDPOINT_CACHE.computeIfAbsent(s.httpEndpoint, UTF8BytesString::create); - this.grpcStatusCode = - s.grpcStatusCode == null - ? 
null - : GRPC_STATUS_CODE_CACHE.computeIfAbsent(s.grpcStatusCode, UTF8BytesString::create); - this.httpStatusCode = s.httpStatusCode; - this.synthetic = s.synthetic; - this.traceRoot = s.traceRoot; - this.peerTagPairsRaw = s.peerTagPairs; - this.peerTags = materializePeerTags(s.peerTagPairs); + this.resource = resource; + this.service = service; + this.operationName = operationName; + this.serviceSource = serviceSource; + this.type = type; + this.spanKind = spanKind; + this.httpMethod = httpMethod; + this.httpEndpoint = httpEndpoint; + this.grpcStatusCode = grpcStatusCode; + this.httpStatusCode = httpStatusCode; + this.synthetic = synthetic; + this.traceRoot = traceRoot; + this.peerTagPairsRaw = peerTagPairsRaw; + this.peerTags = peerTags; this.aggregate = aggregate; } - /** Test-friendly factory mirroring the prior {@code new MetricKey(...)} positional args. */ + /** + * Production hot path: canonicalize each snapshot field via the cardinality handlers. Must be + * called on the aggregator thread. Null-valued fields short-circuit to {@link + * UTF8BytesString#EMPTY} (or {@code null} for optional ones) so they don't consume a cardinality + * slot. + */ + static AggregateEntry forSnapshot(SpanSnapshot s, AggregateMetric aggregate) { + return new AggregateEntry( + hashOf(s), + registerOrEmpty(RESOURCE_HANDLER, s.resourceName), + registerOrEmpty(SERVICE_HANDLER, s.serviceName), + registerOrEmpty(OPERATION_HANDLER, s.operationName), + s.serviceNameSource == null ? null : SERVICE_SOURCE_HANDLER.register(s.serviceNameSource), + registerOrEmpty(TYPE_HANDLER, s.spanType), + registerOrEmpty(SPAN_KIND_HANDLER, s.spanKind), + s.httpMethod == null ? null : HTTP_METHOD_HANDLER.register(s.httpMethod), + s.httpEndpoint == null ? null : HTTP_ENDPOINT_HANDLER.register(s.httpEndpoint), + s.grpcStatusCode == null ? 
null : GRPC_STATUS_CODE_HANDLER.register(s.grpcStatusCode), + s.httpStatusCode, + s.synthetic, + s.traceRoot, + s.peerTagPairs, + canonicalizePeerTags(s.peerTagPairs), + aggregate); + } + + private static UTF8BytesString registerOrEmpty( + PropertyCardinalityHandler handler, CharSequence value) { + return value == null ? UTF8BytesString.EMPTY : handler.register(value); + } + + /** + * Test-friendly factory mirroring the prior {@code new MetricKey(...)} positional args. Bypasses + * the cardinality handlers so tests don't pollute their state -- {@link UTF8BytesString}s are + * created directly. Content-equality on the resulting entry still matches an entry built via + * {@link #forSnapshot} from a snapshot of the same shape. + */ static AggregateEntry of( CharSequence resource, CharSequence service, @@ -132,7 +164,7 @@ static AggregateEntry of( CharSequence httpEndpoint, CharSequence grpcStatusCode) { String[] rawPairs = peerTagsToRawPairs(peerTags); - SpanSnapshot synthetic_snapshot = + SpanSnapshot syntheticSnapshot = new SpanSnapshot( resource, service == null ? null : service.toString(), @@ -149,12 +181,43 @@ static AggregateEntry of( grpcStatusCode == null ? null : grpcStatusCode.toString(), 0L); return new AggregateEntry( - synthetic_snapshot, hashOf(synthetic_snapshot), new AggregateMetric()); + hashOf(syntheticSnapshot), + createUtf8(resource), + createUtf8(service), + createUtf8(operationName), + serviceSource == null ? null : createUtf8(serviceSource), + createUtf8(type), + createUtf8(spanKind), + httpMethod == null ? null : createUtf8(httpMethod), + httpEndpoint == null ? null : createUtf8(httpEndpoint), + grpcStatusCode == null ? null : createUtf8(grpcStatusCode), + (short) httpStatusCode, + synthetic, + traceRoot, + rawPairs, + peerTags == null ? Collections.emptyList() : peerTags, + new AggregateMetric()); } - /** Construct from a snapshot at consumer-thread miss time. 
*/ - static AggregateEntry forSnapshot(SpanSnapshot s, AggregateMetric aggregate) { - return new AggregateEntry(s, hashOf(s), aggregate); + /** + * Resets every cardinality handler's working set. Must be called on the aggregator thread. + * Existing entries continue to hold their previously-issued {@link UTF8BytesString} references; + * matches() uses content-equality so snapshots delivered after a reset still resolve to the + * existing entries. + */ + static void resetCardinalityHandlers() { + RESOURCE_HANDLER.reset(); + SERVICE_HANDLER.reset(); + OPERATION_HANDLER.reset(); + SERVICE_SOURCE_HANDLER.reset(); + TYPE_HANDLER.reset(); + SPAN_KIND_HANDLER.reset(); + HTTP_METHOD_HANDLER.reset(); + HTTP_ENDPOINT_HANDLER.reset(); + GRPC_STATUS_CODE_HANDLER.reset(); + for (TagCardinalityHandler h : PEER_TAG_HANDLERS.values()) { + h.reset(); + } } boolean matches(SpanSnapshot s) { @@ -175,12 +238,9 @@ && stringContentEquals(httpEndpoint, s.httpEndpoint) /** * Computes the 64-bit lookup hash for a {@link SpanSnapshot}. Chained per-field calls -- no - * varargs / Object[] allocation, no autoboxing on primitive overloads. The constructor's - * super({@code hashOf(s)}) call uses the same function so an entry built from a snapshot hashes - * to the same bucket the snapshot itself looks up. - * - *
<p>
Hashes are content-stable across {@code String} / {@code UTF8BytesString}: {@link - * UTF8BytesString#hashCode()} returns the underlying {@code String}'s hash. + * varargs / Object[] allocation, no autoboxing on primitive overloads. Hashes are content-stable + * across {@code String} / {@code UTF8BytesString} because {@link UTF8BytesString#hashCode()} + * returns the underlying {@code String}'s hash. */ static long hashOf(SpanSnapshot s) { long h = 0; @@ -270,16 +330,16 @@ public boolean equals(Object o) { return httpStatusCode == that.httpStatusCode && synthetic == that.synthetic && traceRoot == that.traceRoot - && java.util.Objects.equals(resource, that.resource) - && java.util.Objects.equals(service, that.service) - && java.util.Objects.equals(operationName, that.operationName) - && java.util.Objects.equals(serviceSource, that.serviceSource) - && java.util.Objects.equals(type, that.type) - && java.util.Objects.equals(spanKind, that.spanKind) + && Objects.equals(resource, that.resource) + && Objects.equals(service, that.service) + && Objects.equals(operationName, that.operationName) + && Objects.equals(serviceSource, that.serviceSource) + && Objects.equals(type, that.type) + && Objects.equals(spanKind, that.spanKind) && peerTags.equals(that.peerTags) - && java.util.Objects.equals(httpMethod, that.httpMethod) - && java.util.Objects.equals(httpEndpoint, that.httpEndpoint) - && java.util.Objects.equals(grpcStatusCode, that.grpcStatusCode); + && Objects.equals(httpMethod, that.httpMethod) + && Objects.equals(httpEndpoint, that.httpEndpoint) + && Objects.equals(grpcStatusCode, that.grpcStatusCode); } @Override @@ -289,15 +349,15 @@ public int hashCode() { // ----- helpers ----- - private static UTF8BytesString canonicalize( - DDCache cache, CharSequence charSeq) { - if (charSeq == null) { - return EMPTY; + /** Direct {@link UTF8BytesString} creation that bypasses the cardinality handlers. 
*/ + private static UTF8BytesString createUtf8(CharSequence cs) { + if (cs == null) { + return UTF8BytesString.EMPTY; } - if (charSeq instanceof UTF8BytesString) { - return (UTF8BytesString) charSeq; + if (cs instanceof UTF8BytesString) { + return (UTF8BytesString) cs; } - return cache.computeIfAbsent(charSeq.toString(), UTF8BytesString::create); + return UTF8BytesString.create(cs.toString()); } /** UTF8 vs raw CharSequence content-equality, no allocation in the common (String) case. */ @@ -326,28 +386,33 @@ private static boolean stringContentEquals(UTF8BytesString a, String b) { return b != null && a.toString().equals(b); } - private static List materializePeerTags(String[] pairs) { + /** Production-path peer-tag canonicalization via per-name {@link TagCardinalityHandler}. */ + private static List canonicalizePeerTags(String[] pairs) { if (pairs == null || pairs.length == 0) { return Collections.emptyList(); } if (pairs.length == 2) { - return Collections.singletonList(encodePeerTag(pairs[0], pairs[1])); + return Collections.singletonList(handlerFor(pairs[0]).register(pairs[1])); } List tags = new ArrayList<>(pairs.length / 2); for (int i = 0; i < pairs.length; i += 2) { - tags.add(encodePeerTag(pairs[i], pairs[i + 1])); + tags.add(handlerFor(pairs[i]).register(pairs[i + 1])); } return tags; } - private static UTF8BytesString encodePeerTag(String name, String value) { - final Pair, Function> - cacheAndCreator = PEER_TAGS_CACHE.computeIfAbsent(name, PEER_TAGS_CACHE_ADDER); - return cacheAndCreator.getLeft().computeIfAbsent(value, cacheAndCreator.getRight()); + private static TagCardinalityHandler handlerFor(String peerTagName) { + TagCardinalityHandler h = PEER_TAG_HANDLERS.get(peerTagName); + if (h != null) { + return h; + } + h = new TagCardinalityHandler(peerTagName, PEER_TAG_VALUE_LIMIT); + PEER_TAG_HANDLERS.put(peerTagName, h); + return h; } /** - * Inverse of {@link #materializePeerTags}: takes pre-encoded UTF8 peer tags and recovers the raw + * Inverse of 
{@link #canonicalizePeerTags}: takes pre-encoded UTF8 peer tags and recovers the raw * {@code [name0, value0, name1, value1, ...]} pairs. Used by the test factory {@link #of}, not by * the hot path. */ diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/Aggregator.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/Aggregator.java index b4fc59d5a1d..9bcd41f37e4 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/Aggregator.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/Aggregator.java @@ -149,6 +149,9 @@ private void report(long when, SignalItem signal) { } dirty = false; } + // Reset cardinality handlers each report cycle so the per-field budgets refresh. + // Safe to call on this (aggregator) thread; handlers are HashMap-based and not thread-safe. + AggregateEntry.resetCardinalityHandlers(); signal.complete(); if (skipped) { log.debug("skipped metrics reporting because no points have changed"); diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/PropertyCardinalityHandler.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/PropertyCardinalityHandler.java new file mode 100644 index 00000000000..61560a32a71 --- /dev/null +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/PropertyCardinalityHandler.java @@ -0,0 +1,45 @@ +package datadog.trace.common.metrics; + +import datadog.trace.bootstrap.instrumentation.api.UTF8BytesString; +import java.util.HashMap; + +public final class PropertyCardinalityHandler { + private final int cardinalityLimit; + + private final HashMap curUtf8s; + + private UTF8BytesString cacheBlocked = null; + + public PropertyCardinalityHandler(int cardinalityLimit) { + this.cardinalityLimit = cardinalityLimit; + + // pre-sizing properly to avoid rehashing + this.curUtf8s = new HashMap<>((int) Math.ceil(cardinalityLimit / 0.75) + 1); + } + + public UTF8BytesString register(CharSequence value) { + if (this.curUtf8s.size() >= 
this.cardinalityLimit) { + return this.blockedByTracer(); + } + + UTF8BytesString existingUtf8 = this.curUtf8s.get(value); + if (existingUtf8 != null) return existingUtf8; + + // TODO: maybe use a fallback cache to reduce allocations across reset cycles + UTF8BytesString newUtf8 = UTF8BytesString.create(value); + this.curUtf8s.put(value, newUtf8); + return newUtf8; + } + + private UTF8BytesString blockedByTracer() { + UTF8BytesString cacheBlocked = this.cacheBlocked; + if (cacheBlocked != null) return cacheBlocked; + + this.cacheBlocked = cacheBlocked = UTF8BytesString.create("blocked_by_tracer"); + return cacheBlocked; + } + + public void reset() { + this.curUtf8s.clear(); + } +} diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/TagCardinalityHandler.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/TagCardinalityHandler.java new file mode 100644 index 00000000000..eeac6caf817 --- /dev/null +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/TagCardinalityHandler.java @@ -0,0 +1,76 @@ +package datadog.trace.common.metrics; + +import datadog.trace.api.TagMap; +import datadog.trace.bootstrap.instrumentation.api.UTF8BytesString; +import java.util.HashMap; + +public final class TagCardinalityHandler { + private final String tag; + private final int cardinalityLimit; + + private final HashMap curUtf8Pairs; + + private UTF8BytesString cacheBlocked = null; + + public TagCardinalityHandler(String tag, int cardinalityLimit) { + this.tag = tag; + this.cardinalityLimit = cardinalityLimit; + + // pre-sizing properly to avoid rehashing + this.curUtf8Pairs = new HashMap<>((int) Math.ceil(cardinalityLimit / 0.75) + 1); + } + + public UTF8BytesString register(TagMap.Entry entry) { + if (this.curUtf8Pairs.size() >= this.cardinalityLimit) { + return this.blockedByTracer(); + } + + if (!isValidType(entry)) { + return this.blockedByTracer(); + } + + // NOTE: This could lead to boxing -- not ideal + Object cacheKey = entry.objectValue(); + 
UTF8BytesString existing = this.curUtf8Pairs.get(cacheKey); + if (existing != null) return existing; + + // TODO: maybe use a fallback cache to reduce allocations across reset cycles + UTF8BytesString newPair = UTF8BytesString.create(this.tag + ":" + entry.stringValue()); + this.curUtf8Pairs.put(cacheKey, newPair); + return newPair; + } + + /** + * String-keyed overload for callers that already hold a {@code (tag, value)} pair as Strings and + * would rather not allocate a {@link TagMap.Entry} per lookup -- e.g. the metrics aggregator's + * peer-tag flow, where peer-tag values are flattened into a {@code String[]} on the snapshot. + */ + public UTF8BytesString register(String value) { + if (this.curUtf8Pairs.size() >= this.cardinalityLimit) { + return this.blockedByTracer(); + } + + UTF8BytesString existing = this.curUtf8Pairs.get(value); + if (existing != null) return existing; + + UTF8BytesString newPair = UTF8BytesString.create(this.tag + ":" + value); + this.curUtf8Pairs.put(value, newPair); + return newPair; + } + + private static final boolean isValidType(TagMap.Entry entry) { + return entry.isNumericPrimitive() || entry.objectValue() instanceof CharSequence; + } + + private UTF8BytesString blockedByTracer() { + UTF8BytesString cacheBlocked = this.cacheBlocked; + if (cacheBlocked != null) return cacheBlocked; + + this.cacheBlocked = cacheBlocked = UTF8BytesString.create(this.tag + ":blocked_by_tracer"); + return cacheBlocked; + } + + public void reset() { + this.curUtf8Pairs.clear(); + } +} diff --git a/dd-trace-core/src/test/java/datadog/trace/common/metrics/CardinalityHandlerTest.java b/dd-trace-core/src/test/java/datadog/trace/common/metrics/CardinalityHandlerTest.java new file mode 100644 index 00000000000..bbdffb6061a --- /dev/null +++ b/dd-trace-core/src/test/java/datadog/trace/common/metrics/CardinalityHandlerTest.java @@ -0,0 +1,88 @@ +package datadog.trace.common.metrics; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static 
org.junit.jupiter.api.Assertions.assertNotSame; +import static org.junit.jupiter.api.Assertions.assertSame; + +import datadog.trace.bootstrap.instrumentation.api.UTF8BytesString; +import org.junit.jupiter.api.Test; + +class CardinalityHandlerTest { + + @Test + void propertyReturnsSameInstanceForRepeatedValueUntilLimit() { + PropertyCardinalityHandler h = new PropertyCardinalityHandler(3); + UTF8BytesString a1 = h.register("a"); + UTF8BytesString a2 = h.register("a"); + assertSame(a1, a2); + assertEquals("a", a1.toString()); + } + + @Test + void propertyOverLimitReturnsBlockedSentinel() { + PropertyCardinalityHandler h = new PropertyCardinalityHandler(2); + UTF8BytesString a = h.register("a"); + UTF8BytesString b = h.register("b"); + UTF8BytesString blocked1 = h.register("c"); + UTF8BytesString blocked2 = h.register("d"); + + assertEquals("blocked_by_tracer", blocked1.toString()); + assertSame(blocked1, blocked2); // same sentinel for all overflow values + assertNotSame(blocked1, a); + assertNotSame(blocked1, b); + } + + @Test + void propertyResetRefreshesBudget() { + PropertyCardinalityHandler h = new PropertyCardinalityHandler(2); + h.register("a"); + h.register("b"); + UTF8BytesString blocked = h.register("c"); + assertEquals("blocked_by_tracer", blocked.toString()); + + h.reset(); + + // After reset, three distinct values fit again, but the previous instances aren't reused. 
+ UTF8BytesString afterReset = h.register("a"); + assertEquals("a", afterReset.toString()); + UTF8BytesString c = h.register("c"); + assertEquals("c", c.toString()); + UTF8BytesString blockedAgain = h.register("d"); + UTF8BytesString blockedYetAgain = h.register("e"); + assertEquals("blocked_by_tracer", blockedAgain.toString()); + assertSame(blockedAgain, blockedYetAgain); + } + + @Test + void tagPrefixesValuesAndReusesUnderLimit() { + TagCardinalityHandler h = new TagCardinalityHandler("peer.hostname", 4); + UTF8BytesString first = h.register("host-a"); + UTF8BytesString second = h.register("host-a"); + UTF8BytesString other = h.register("host-b"); + + assertSame(first, second); + assertNotSame(first, other); + assertEquals("peer.hostname:host-a", first.toString()); + assertEquals("peer.hostname:host-b", other.toString()); + } + + @Test + void tagOverLimitReturnsTaggedSentinel() { + TagCardinalityHandler h = new TagCardinalityHandler("peer.service", 1); + h.register("svc-1"); + UTF8BytesString blocked = h.register("svc-2"); + assertEquals("peer.service:blocked_by_tracer", blocked.toString()); + } + + @Test + void tagResetRefreshesBudgetAndSentinelStaysStable() { + TagCardinalityHandler h = new TagCardinalityHandler("x", 1); + h.register("v1"); + UTF8BytesString blockedBefore = h.register("v2"); + h.reset(); + h.register("v1"); + UTF8BytesString blockedAfter = h.register("v2"); + // Both are the same sentinel instance (cacheBlocked is not cleared on reset). + assertSame(blockedBefore, blockedAfter); + } +} From 8aab88d3d6a6b4d47bf9fa2dfb2f34f704c1e171 Mon Sep 17 00:00:00 2001 From: Douglas Q Hawkins Date: Fri, 15 May 2026 16:14:35 -0400 Subject: [PATCH 02/33] Canonicalize SpanSnapshot before hashing so blocked values collapse The prior commit ran every snapshot through the cardinality handlers but still hashed the raw snapshot fields. 
When a field exceeded its cardinality budget the handlers collapsed many
distinct values to a single "blocked_by_tracer" sentinel, but the raw hashes
were still all different -- so the blocked entries fragmented across the
AggregateTable. This commit makes hash + match work off the canonical
(post-handler) UTF8BytesString fields, so blocked values land in the same
bucket and merge into one entry.

How the lookup path changes
---------------------------

A new package-private AggregateEntry.Canonical scratch buffer:

- holds the 10 canonical UTF8BytesString refs, primitives, peerTags list,
  and the precomputed keyHash;
- exposes populate(SpanSnapshot) which runs each field through the
  appropriate handler and computes the long hash from the canonical refs;
- exposes matches(AggregateEntry) for content-equality lookup;
- exposes toEntry(AggregateMetric) which copies its refs into a fresh
  AggregateEntry on miss.

AggregateTable holds one Canonical instance and reuses it per findOrInsert.
On a hit nothing is allocated -- the buffer's refs feed the bucket walk and
matches() directly. On a miss the refs are copied into the new entry and the
buffer is overwritten on the next call.

Hash function
-------------

hashOf now takes UTF8BytesString fields (plus primitives + peerTags list)
instead of raw CharSequence/String from the snapshot.
UTF8BytesString.hashCode returns the underlying String's hash, so:

- content-equal entries built via AggregateEntry.of(...) (test factory,
  bypasses handlers) produce the same hash as entries built via
  Canonical.toEntry(...) (production, via handlers);
- all values that collapsed to "blocked_by_tracer" share that sentinel
  instance and therefore that hashCode -- they land in the same bucket and
  merge into one entry.

Matches
-------

The SpanSnapshot-keyed matches() on AggregateEntry is gone.
Lookup goes through Canonical.matches(entry) which compares the buffer's UTF8 fields against the entry's UTF8 fields via Objects.equals (content equality on UTF8BytesString). This is needed because across handler resets the UTF8BytesString instance referenced by an existing entry differs from the freshly-issued instance for the same content -- content-equality lets the existing entry survive resets. The peerTagPairsRaw field on AggregateEntry was previously kept for matching against snapshot.peerTagPairs (the flat String[]). Canonical.matches uses List.equals on the encoded UTF8 peerTags directly, so peerTagPairsRaw is dropped. New test in AggregateTableTest -- cardinalityBlockedValuesCollapseIntoOneEntry inserts 50 distinct services into a table whose SERVICE_HANDLER has a cardinality limit of 32, and asserts the final size is 33 (the 32 in-budget services plus a single collapsed "blocked_by_tracer" entry, not 50 separate entries). Benchmarks (2 forks x 5 iter x 15s) -- producer side unchanged: SimpleSpan bench: 3.117 +- 0.026 us/op (prior: 3.114 +- 0.045) DDSpan bench: 2.344 +- 0.114 us/op (prior: 2.364 +- 0.113) Co-Authored-By: Claude Opus 4.7 (1M context) --- .../trace/common/metrics/AggregateEntry.java | 407 +++++++++--------- .../trace/common/metrics/AggregateTable.java | 21 +- .../common/metrics/AggregateTableTest.java | 21 + 3 files changed, 247 insertions(+), 202 deletions(-) diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java index 55536b7a8f3..c28bf5722f6 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java @@ -4,7 +4,6 @@ import datadog.trace.util.Hashtable; import datadog.trace.util.LongHashingUtils; import java.util.ArrayList; -import java.util.Arrays; import java.util.Collections; import java.util.HashMap; import 
java.util.List; @@ -15,40 +14,38 @@ * Hashtable entry for the consumer-side aggregator. Holds the UTF8-encoded label fields (the data * {@link SerializingMetricWriter} writes to the wire) plus the mutable {@link AggregateMetric}. * - *

{@link #matches(SpanSnapshot)} compares the entry's stored UTF8 forms against the snapshot's - * raw {@code CharSequence}/{@code String}/{@code String[]} fields via content-equality, so {@code - * String} vs {@code UTF8BytesString} mixing on the same logical key collapses into one entry - * instead of splitting. - * *

UTF8 canonicalization runs through per-field {@link PropertyCardinalityHandler}s (and {@link - * TagCardinalityHandler}s for peer tags), so cardinality is capped per reporting interval and - * overflow values are bucketed into a {@code blocked_by_tracer} sentinel rather than allowed to - * grow without bound. The handlers are reset on the aggregator thread every reporting cycle via - * {@link #resetCardinalityHandlers()}. + * TagCardinalityHandler}s for peer tags), so cardinality is capped per reporting interval. The + * critical property: hashing and matching happen after canonicalization, so when a field's + * cardinality budget is exhausted and overflow values collapse to a {@code blocked_by_tracer} + * sentinel, those values land in the same bucket and merge into a single entry rather than + * fragmenting. + * + *

The aggregator thread is the sole writer. {@link AggregateTable} holds a reusable {@link + * Canonical} scratch buffer so the canonicalization itself doesn't allocate per lookup; on a miss + * the buffer's references are copied into a fresh entry. On a hit nothing is allocated. * - *

Thread-safety: the cardinality handlers are not thread-safe. Only the aggregator thread - * may call {@link #forSnapshot} or {@link #resetCardinalityHandlers}. Test code uses {@link #of} - * which constructs entries without touching the handlers. + *

The handlers are reset on the aggregator thread every reporting cycle via {@link + * #resetCardinalityHandlers()}. + * + *

Thread-safety: the cardinality handlers and {@link Canonical} are not thread-safe. Only + * the aggregator thread may call {@link Canonical#populate} or {@link #resetCardinalityHandlers}. + * Test code uses {@link #of} which constructs entries without touching the handlers. */ final class AggregateEntry extends Hashtable.Entry { // Per-field cardinality limits. Identical to the prior DDCache sizes. - private static final PropertyCardinalityHandler RESOURCE_HANDLER = - new PropertyCardinalityHandler(32); - private static final PropertyCardinalityHandler SERVICE_HANDLER = - new PropertyCardinalityHandler(32); - private static final PropertyCardinalityHandler OPERATION_HANDLER = - new PropertyCardinalityHandler(64); - private static final PropertyCardinalityHandler SERVICE_SOURCE_HANDLER = - new PropertyCardinalityHandler(16); - private static final PropertyCardinalityHandler TYPE_HANDLER = new PropertyCardinalityHandler(8); - private static final PropertyCardinalityHandler SPAN_KIND_HANDLER = + static final PropertyCardinalityHandler RESOURCE_HANDLER = new PropertyCardinalityHandler(32); + static final PropertyCardinalityHandler SERVICE_HANDLER = new PropertyCardinalityHandler(32); + static final PropertyCardinalityHandler OPERATION_HANDLER = new PropertyCardinalityHandler(64); + static final PropertyCardinalityHandler SERVICE_SOURCE_HANDLER = new PropertyCardinalityHandler(16); - private static final PropertyCardinalityHandler HTTP_METHOD_HANDLER = - new PropertyCardinalityHandler(8); - private static final PropertyCardinalityHandler HTTP_ENDPOINT_HANDLER = + static final PropertyCardinalityHandler TYPE_HANDLER = new PropertyCardinalityHandler(8); + static final PropertyCardinalityHandler SPAN_KIND_HANDLER = new PropertyCardinalityHandler(16); + static final PropertyCardinalityHandler HTTP_METHOD_HANDLER = new PropertyCardinalityHandler(8); + static final PropertyCardinalityHandler HTTP_ENDPOINT_HANDLER = new PropertyCardinalityHandler(32); - private static final 
PropertyCardinalityHandler GRPC_STATUS_CODE_HANDLER = + static final PropertyCardinalityHandler GRPC_STATUS_CODE_HANDLER = new PropertyCardinalityHandler(32); /** Per-peer-tag-name {@link TagCardinalityHandler}, each sized to 512 distinct values. */ @@ -56,24 +53,19 @@ final class AggregateEntry extends Hashtable.Entry { private static final int PEER_TAG_VALUE_LIMIT = 512; - private final UTF8BytesString resource; - private final UTF8BytesString service; - private final UTF8BytesString operationName; - private final UTF8BytesString serviceSource; // nullable - private final UTF8BytesString type; - private final UTF8BytesString spanKind; - private final UTF8BytesString httpMethod; // nullable - private final UTF8BytesString httpEndpoint; // nullable - private final UTF8BytesString grpcStatusCode; // nullable - private final short httpStatusCode; - private final boolean synthetic; - private final boolean traceRoot; - - // Peer tags carried in two forms: raw String[] for matches() against the snapshot's pairs, - // and pre-encoded List ("name:value") for the serializer. - private final String[] peerTagPairsRaw; - private final List peerTags; - + final UTF8BytesString resource; + final UTF8BytesString service; + final UTF8BytesString operationName; + final UTF8BytesString serviceSource; // nullable + final UTF8BytesString type; + final UTF8BytesString spanKind; + final UTF8BytesString httpMethod; // nullable + final UTF8BytesString httpEndpoint; // nullable + final UTF8BytesString grpcStatusCode; // nullable + final short httpStatusCode; + final boolean synthetic; + final boolean traceRoot; + final List peerTags; final AggregateMetric aggregate; /** Field-bearing constructor used by both the hot path and the test factory. 
*/ @@ -91,7 +83,6 @@ private AggregateEntry( short httpStatusCode, boolean synthetic, boolean traceRoot, - String[] peerTagPairsRaw, List peerTags, AggregateMetric aggregate) { super(keyHash); @@ -107,47 +98,15 @@ private AggregateEntry( this.httpStatusCode = httpStatusCode; this.synthetic = synthetic; this.traceRoot = traceRoot; - this.peerTagPairsRaw = peerTagPairsRaw; this.peerTags = peerTags; this.aggregate = aggregate; } - /** - * Production hot path: canonicalize each snapshot field via the cardinality handlers. Must be - * called on the aggregator thread. Null-valued fields short-circuit to {@link - * UTF8BytesString#EMPTY} (or {@code null} for optional ones) so they don't consume a cardinality - * slot. - */ - static AggregateEntry forSnapshot(SpanSnapshot s, AggregateMetric aggregate) { - return new AggregateEntry( - hashOf(s), - registerOrEmpty(RESOURCE_HANDLER, s.resourceName), - registerOrEmpty(SERVICE_HANDLER, s.serviceName), - registerOrEmpty(OPERATION_HANDLER, s.operationName), - s.serviceNameSource == null ? null : SERVICE_SOURCE_HANDLER.register(s.serviceNameSource), - registerOrEmpty(TYPE_HANDLER, s.spanType), - registerOrEmpty(SPAN_KIND_HANDLER, s.spanKind), - s.httpMethod == null ? null : HTTP_METHOD_HANDLER.register(s.httpMethod), - s.httpEndpoint == null ? null : HTTP_ENDPOINT_HANDLER.register(s.httpEndpoint), - s.grpcStatusCode == null ? null : GRPC_STATUS_CODE_HANDLER.register(s.grpcStatusCode), - s.httpStatusCode, - s.synthetic, - s.traceRoot, - s.peerTagPairs, - canonicalizePeerTags(s.peerTagPairs), - aggregate); - } - - private static UTF8BytesString registerOrEmpty( - PropertyCardinalityHandler handler, CharSequence value) { - return value == null ? UTF8BytesString.EMPTY : handler.register(value); - } - /** * Test-friendly factory mirroring the prior {@code new MetricKey(...)} positional args. Bypasses * the cardinality handlers so tests don't pollute their state -- {@link UTF8BytesString}s are - * created directly. 
Content-equality on the resulting entry still matches an entry built via - * {@link #forSnapshot} from a snapshot of the same shape. + * created directly. Content-equal entries from {@link Canonical#toEntry} still {@link #equals} an + * entry built via {@code of(...)}. */ static AggregateEntry of( CharSequence resource, @@ -163,47 +122,54 @@ static AggregateEntry of( CharSequence httpMethod, CharSequence httpEndpoint, CharSequence grpcStatusCode) { - String[] rawPairs = peerTagsToRawPairs(peerTags); - SpanSnapshot syntheticSnapshot = - new SpanSnapshot( - resource, - service == null ? null : service.toString(), - operationName, - serviceSource, - type, + UTF8BytesString resourceUtf = createUtf8(resource); + UTF8BytesString serviceUtf = createUtf8(service); + UTF8BytesString operationNameUtf = createUtf8(operationName); + UTF8BytesString serviceSourceUtf = serviceSource == null ? null : createUtf8(serviceSource); + UTF8BytesString typeUtf = createUtf8(type); + UTF8BytesString spanKindUtf = createUtf8(spanKind); + UTF8BytesString httpMethodUtf = httpMethod == null ? null : createUtf8(httpMethod); + UTF8BytesString httpEndpointUtf = httpEndpoint == null ? null : createUtf8(httpEndpoint); + UTF8BytesString grpcUtf = grpcStatusCode == null ? null : createUtf8(grpcStatusCode); + List peerTagsList = peerTags == null ? Collections.emptyList() : peerTags; + long keyHash = + hashOf( + resourceUtf, + serviceUtf, + operationNameUtf, + serviceSourceUtf, + typeUtf, + spanKindUtf, + httpMethodUtf, + httpEndpointUtf, + grpcUtf, (short) httpStatusCode, synthetic, traceRoot, - spanKind == null ? null : spanKind.toString(), - rawPairs, - httpMethod == null ? null : httpMethod.toString(), - httpEndpoint == null ? null : httpEndpoint.toString(), - grpcStatusCode == null ? null : grpcStatusCode.toString(), - 0L); + peerTagsList); return new AggregateEntry( - hashOf(syntheticSnapshot), - createUtf8(resource), - createUtf8(service), - createUtf8(operationName), - serviceSource == null ? 
null : createUtf8(serviceSource), - createUtf8(type), - createUtf8(spanKind), - httpMethod == null ? null : createUtf8(httpMethod), - httpEndpoint == null ? null : createUtf8(httpEndpoint), - grpcStatusCode == null ? null : createUtf8(grpcStatusCode), + keyHash, + resourceUtf, + serviceUtf, + operationNameUtf, + serviceSourceUtf, + typeUtf, + spanKindUtf, + httpMethodUtf, + httpEndpointUtf, + grpcUtf, (short) httpStatusCode, synthetic, traceRoot, - rawPairs, - peerTags == null ? Collections.emptyList() : peerTags, + peerTagsList, new AggregateMetric()); } /** * Resets every cardinality handler's working set. Must be called on the aggregator thread. * Existing entries continue to hold their previously-issued {@link UTF8BytesString} references; - * matches() uses content-equality so snapshots delivered after a reset still resolve to the - * existing entries. + * matches via content-equality so snapshots delivered after a reset still resolve to the existing + * entries. */ static void resetCardinalityHandlers() { RESOURCE_HANDLER.reset(); @@ -220,47 +186,42 @@ static void resetCardinalityHandlers() { } } - boolean matches(SpanSnapshot s) { - return httpStatusCode == s.httpStatusCode - && synthetic == s.synthetic - && traceRoot == s.traceRoot - && contentEquals(resource, s.resourceName) - && stringContentEquals(service, s.serviceName) - && contentEquals(operationName, s.operationName) - && contentEquals(serviceSource, s.serviceNameSource) - && contentEquals(type, s.spanType) - && stringContentEquals(spanKind, s.spanKind) - && Arrays.equals(peerTagPairsRaw, s.peerTagPairs) - && stringContentEquals(httpMethod, s.httpMethod) - && stringContentEquals(httpEndpoint, s.httpEndpoint) - && stringContentEquals(grpcStatusCode, s.grpcStatusCode); - } - /** - * Computes the 64-bit lookup hash for a {@link SpanSnapshot}. Chained per-field calls -- no - * varargs / Object[] allocation, no autoboxing on primitive overloads. 
Hashes are content-stable - * across {@code String} / {@code UTF8BytesString} because {@link UTF8BytesString#hashCode()} - * returns the underlying {@code String}'s hash. + * 64-bit lookup hash, computed over UTF8-encoded fields so that cardinality-blocked values (which + * all canonicalize to the same sentinel {@link UTF8BytesString}) collide in the same bucket. + * {@link UTF8BytesString#hashCode()} returns the underlying String hash, so entries built via + * {@link #of} produce the same hash as entries built from a snapshot with matching content. */ - static long hashOf(SpanSnapshot s) { + static long hashOf( + UTF8BytesString resource, + UTF8BytesString service, + UTF8BytesString operationName, + UTF8BytesString serviceSource, + UTF8BytesString type, + UTF8BytesString spanKind, + UTF8BytesString httpMethod, + UTF8BytesString httpEndpoint, + UTF8BytesString grpcStatusCode, + short httpStatusCode, + boolean synthetic, + boolean traceRoot, + List peerTags) { long h = 0; - h = LongHashingUtils.addToHash(h, s.resourceName); - h = LongHashingUtils.addToHash(h, s.serviceName); - h = LongHashingUtils.addToHash(h, s.operationName); - h = LongHashingUtils.addToHash(h, s.serviceNameSource); - h = LongHashingUtils.addToHash(h, s.spanType); - h = LongHashingUtils.addToHash(h, s.httpStatusCode); - h = LongHashingUtils.addToHash(h, s.synthetic); - h = LongHashingUtils.addToHash(h, s.traceRoot); - h = LongHashingUtils.addToHash(h, s.spanKind); - if (s.peerTagPairs != null) { - for (String p : s.peerTagPairs) { - h = LongHashingUtils.addToHash(h, p); - } + h = LongHashingUtils.addToHash(h, resource); + h = LongHashingUtils.addToHash(h, service); + h = LongHashingUtils.addToHash(h, operationName); + h = LongHashingUtils.addToHash(h, serviceSource); + h = LongHashingUtils.addToHash(h, type); + h = LongHashingUtils.addToHash(h, httpStatusCode); + h = LongHashingUtils.addToHash(h, synthetic); + h = LongHashingUtils.addToHash(h, traceRoot); + h = LongHashingUtils.addToHash(h, 
spanKind); + for (UTF8BytesString p : peerTags) { + h = LongHashingUtils.addToHash(h, p); } - h = LongHashingUtils.addToHash(h, s.httpMethod); - h = LongHashingUtils.addToHash(h, s.httpEndpoint); - h = LongHashingUtils.addToHash(h, s.grpcStatusCode); + h = LongHashingUtils.addToHash(h, httpMethod); + h = LongHashingUtils.addToHash(h, httpEndpoint); + h = LongHashingUtils.addToHash(h, grpcStatusCode); return h; } @@ -319,8 +280,8 @@ List getPeerTags() { /** * Equality on the 13 label fields (not on the aggregate). Used only by test mock matchers; the - * {@link Hashtable} does its own bucketing via {@link #keyHash} + {@link #matches(SpanSnapshot)} - * and never calls {@code equals}. + * {@link Hashtable} does its own bucketing via {@link #keyHash} + {@link Canonical#matches} and + * never calls {@code equals}. */ @Override public boolean equals(Object o) { @@ -347,8 +308,114 @@ public int hashCode() { return (int) keyHash; } + /** + * Reusable scratch buffer for canonicalizing a {@link SpanSnapshot} into UTF8 fields, computing + * its lookup hash, comparing against existing entries, and building a fresh entry on miss. + * + *

One instance is held by an {@link AggregateTable} and reused on every {@code findOrInsert} + * call. Single-threaded use only. Fields are deliberately mutable -- this is a hot-path scratch + * area, not a value class. + */ + static final class Canonical { + UTF8BytesString resource; + UTF8BytesString service; + UTF8BytesString operationName; + UTF8BytesString serviceSource; // nullable + UTF8BytesString type; + UTF8BytesString spanKind; + UTF8BytesString httpMethod; // nullable + UTF8BytesString httpEndpoint; // nullable + UTF8BytesString grpcStatusCode; // nullable + short httpStatusCode; + boolean synthetic; + boolean traceRoot; + List peerTags; + long keyHash; + + /** Canonicalize all fields from {@code s} through the handlers into this buffer. */ + void populate(SpanSnapshot s) { + this.resource = registerOrEmpty(RESOURCE_HANDLER, s.resourceName); + this.service = registerOrEmpty(SERVICE_HANDLER, s.serviceName); + this.operationName = registerOrEmpty(OPERATION_HANDLER, s.operationName); + this.serviceSource = + s.serviceNameSource == null ? null : SERVICE_SOURCE_HANDLER.register(s.serviceNameSource); + this.type = registerOrEmpty(TYPE_HANDLER, s.spanType); + this.spanKind = registerOrEmpty(SPAN_KIND_HANDLER, s.spanKind); + this.httpMethod = s.httpMethod == null ? null : HTTP_METHOD_HANDLER.register(s.httpMethod); + this.httpEndpoint = + s.httpEndpoint == null ? null : HTTP_ENDPOINT_HANDLER.register(s.httpEndpoint); + this.grpcStatusCode = + s.grpcStatusCode == null ? 
null : GRPC_STATUS_CODE_HANDLER.register(s.grpcStatusCode); + this.httpStatusCode = s.httpStatusCode; + this.synthetic = s.synthetic; + this.traceRoot = s.traceRoot; + this.peerTags = canonicalizePeerTags(s.peerTagPairs); + this.keyHash = + hashOf( + resource, + service, + operationName, + serviceSource, + type, + spanKind, + httpMethod, + httpEndpoint, + grpcStatusCode, + httpStatusCode, + synthetic, + traceRoot, + peerTags); + } + + /** + * Whether this canonicalized snapshot matches the given entry. Compares UTF8 fields via + * content-equality (so an entry surviving a handler reset still matches a freshly-canonicalized + * snapshot of the same content). + */ + boolean matches(AggregateEntry e) { + return httpStatusCode == e.httpStatusCode + && synthetic == e.synthetic + && traceRoot == e.traceRoot + && Objects.equals(resource, e.resource) + && Objects.equals(service, e.service) + && Objects.equals(operationName, e.operationName) + && Objects.equals(serviceSource, e.serviceSource) + && Objects.equals(type, e.type) + && Objects.equals(spanKind, e.spanKind) + && peerTags.equals(e.peerTags) + && Objects.equals(httpMethod, e.httpMethod) + && Objects.equals(httpEndpoint, e.httpEndpoint) + && Objects.equals(grpcStatusCode, e.grpcStatusCode); + } + + /** Build a new entry from the currently-populated canonical fields. */ + AggregateEntry toEntry(AggregateMetric aggregate) { + return new AggregateEntry( + keyHash, + resource, + service, + operationName, + serviceSource, + type, + spanKind, + httpMethod, + httpEndpoint, + grpcStatusCode, + httpStatusCode, + synthetic, + traceRoot, + peerTags, + aggregate); + } + } + // ----- helpers ----- + private static UTF8BytesString registerOrEmpty( + PropertyCardinalityHandler handler, CharSequence value) { + return value == null ? UTF8BytesString.EMPTY : handler.register(value); + } + /** Direct {@link UTF8BytesString} creation that bypasses the cardinality handlers. 
*/ private static UTF8BytesString createUtf8(CharSequence cs) { if (cs == null) { @@ -360,32 +427,6 @@ private static UTF8BytesString createUtf8(CharSequence cs) { return UTF8BytesString.create(cs.toString()); } - /** UTF8 vs raw CharSequence content-equality, no allocation in the common (String) case. */ - private static boolean contentEquals(UTF8BytesString a, CharSequence b) { - if (a == null) { - return b == null; - } - if (b == null) { - return false; - } - // UTF8BytesString.toString() returns the underlying String -- O(1), no allocation. - String aStr = a.toString(); - if (b instanceof String) { - return aStr.equals(b); - } - if (b instanceof UTF8BytesString) { - return aStr.equals(b.toString()); - } - return aStr.contentEquals(b); - } - - private static boolean stringContentEquals(UTF8BytesString a, String b) { - if (a == null) { - return b == null; - } - return b != null && a.toString().equals(b); - } - /** Production-path peer-tag canonicalization via per-name {@link TagCardinalityHandler}. */ private static List canonicalizePeerTags(String[] pairs) { if (pairs == null || pairs.length == 0) { @@ -410,24 +451,4 @@ private static TagCardinalityHandler handlerFor(String peerTagName) { PEER_TAG_HANDLERS.put(peerTagName, h); return h; } - - /** - * Inverse of {@link #canonicalizePeerTags}: takes pre-encoded UTF8 peer tags and recovers the raw - * {@code [name0, value0, name1, value1, ...]} pairs. Used by the test factory {@link #of}, not by - * the hot path. - */ - private static String[] peerTagsToRawPairs(List peerTags) { - if (peerTags == null || peerTags.isEmpty()) { - return null; - } - String[] pairs = new String[peerTags.size() * 2]; - int i = 0; - for (UTF8BytesString peerTag : peerTags) { - String s = peerTag.toString(); - int colon = s.indexOf(':'); - pairs[i++] = colon < 0 ? s : s.substring(0, colon); - pairs[i++] = colon < 0 ? 
"" : s.substring(colon + 1); - } - return pairs; - } } diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateTable.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateTable.java index 08300eab296..38d45ef5e85 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateTable.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateTable.java @@ -4,13 +4,14 @@ import java.util.function.Consumer; /** - * Consumer-side {@link AggregateMetric} store, keyed on the raw fields of a {@link SpanSnapshot}. + * Consumer-side {@link AggregateMetric} store, keyed on the canonical UTF8-encoded labels of a + * {@link SpanSnapshot}. * - *

Replaces the prior {@code LRUCache}. The win is on the - * steady-state hit path: a snapshot lookup is a 64-bit hash compute + bucket walk + field-wise - * {@code matches}, with no per-snapshot {@link AggregateEntry} allocation and no UTF8 cache - * lookups. The UTF8-encoded forms (formerly held on {@code MetricKey}) live on the {@link - * AggregateEntry} itself and are built once per unique key at insert time. + *

{@link #findOrInsert} canonicalizes the snapshot's fields through the cardinality handlers (so + * cardinality-blocked values share a sentinel and collapse into one entry) and then computes the + * lookup hash from that canonical form. Canonicalization runs into a reusable {@link + * AggregateEntry.Canonical} scratch buffer; on a hit nothing is allocated, on a miss the buffer's + * references are copied into a fresh entry and the buffer is overwritten on the next call. * *

Not thread-safe. The aggregator thread is the sole writer; {@link #clear()} must be * routed through the inbox rather than called from arbitrary threads. @@ -19,6 +20,7 @@ final class AggregateTable { private final Hashtable.Entry[] buckets; private final int maxAggregates; + private final AggregateEntry.Canonical canonical = new AggregateEntry.Canonical(); private int size; AggregateTable(int maxAggregates) { @@ -40,12 +42,13 @@ boolean isEmpty() { * the caller should drop the data point in that case. */ AggregateMetric findOrInsert(SpanSnapshot snapshot) { - long keyHash = AggregateEntry.hashOf(snapshot); + canonical.populate(snapshot); + long keyHash = canonical.keyHash; int bucketIndex = Hashtable.Support.bucketIndex(buckets, keyHash); for (Hashtable.Entry e = buckets[bucketIndex]; e != null; e = e.next()) { if (e.keyHash == keyHash) { AggregateEntry candidate = (AggregateEntry) e; - if (candidate.matches(snapshot)) { + if (canonical.matches(candidate)) { return candidate.aggregate; } } @@ -53,7 +56,7 @@ AggregateMetric findOrInsert(SpanSnapshot snapshot) { if (size >= maxAggregates && !evictOneStale()) { return null; } - AggregateEntry entry = AggregateEntry.forSnapshot(snapshot, new AggregateMetric()); + AggregateEntry entry = canonical.toEntry(new AggregateMetric()); entry.setNext(buckets[bucketIndex]); buckets[bucketIndex] = entry; size++; diff --git a/dd-trace-core/src/test/java/datadog/trace/common/metrics/AggregateTableTest.java b/dd-trace-core/src/test/java/datadog/trace/common/metrics/AggregateTableTest.java index 44f2b36cb6b..b8bf8fd1a3b 100644 --- a/dd-trace-core/src/test/java/datadog/trace/common/metrics/AggregateTableTest.java +++ b/dd-trace-core/src/test/java/datadog/trace/common/metrics/AggregateTableTest.java @@ -87,6 +87,27 @@ void peerTagPairsParticipateInIdentity() { assertEquals(3, table.size()); } + @Test + void cardinalityBlockedValuesCollapseIntoOneEntry() { + // SERVICE_HANDLER has a cardinality limit of 32. 
With 50 distinct service names, services 33+
+    // canonicalize to the "blocked_by_tracer" sentinel. Because the table hashes from the canonical
+    // (post-handler) form, all blocked services land in the same bucket and merge into a single
+    // entry rather than fragmenting.
+    AggregateEntry.resetCardinalityHandlers();
+    AggregateTable table = new AggregateTable(128);
+
+    for (int i = 0; i < 50; i++) {
+      AggregateMetric agg = table.findOrInsert(snapshot("svc-" + i, "op", "client"));
+      assertNotNull(agg);
+      agg.recordOneDuration(1L);
+    }
+
+    // 32 in-budget services + 1 collapsed "blocked_by_tracer" entry = 33 total.
+    assertEquals(33, table.size());
+
+    AggregateEntry.resetCardinalityHandlers();
+  }
+
   @Test
   void capOverrunEvictsStaleEntry() {
     AggregateTable table = new AggregateTable(2);

From 9b70705fd9dd289d3186e8fa1c87b5b0d8e7515c Mon Sep 17 00:00:00 2001
From: Douglas Q Hawkins
Date: Fri, 15 May 2026 17:04:32 -0400
Subject: [PATCH 03/33] Defer peer-tag pair construction; capture values +
 canonicalize via schema-indexed handlers

Replaces the producer's early {@code (name, value)}-pair encoding with a
schema-based design: peer-tag values are captured into a parallel String
array, and the consumer applies the matching
{@link TagCardinalityHandler} by index using a {@link PeerTagSchema}'s
parallel name/handler arrays. This removes the {@code Map} the prior
commit left in {@code AggregateEntry} -- handler lookup is now a single
array dereference instead of a hashmap probe.

PeerTagSchema
-------------
New package-private class that holds:
  - {@code String[] names} -- peer-tag names in stable order
  - {@code TagCardinalityHandler[] handlers} -- parallel to names

Two schemas exist: a static singleton {@code INTERNAL} for the
internal-kind {@code base.service} case, and a {@code CURRENT} schema
for the peer-aggregation kinds (client/producer/consumer) that lazily
refreshes when {@code features.peerTags()} returns a different set of
names.
Each {@link SpanSnapshot} captures the schema reference it was built
against so producer and consumer agree on the indexing even if
{@code CURRENT} changes between capture and consumption.

A fast-path identity check (cached last input Set instance) keeps the
{@code currentSyncedTo} call cheap: when the producer hands in the same
Set instance as last time -- the steady-state case --
{@code currentSyncedTo} returns immediately without iterating names. The
{@code matches()} loop only runs when the Set instance changes, which in
production is rare (only on remote-config reconfiguration).

Snapshot shape
--------------
{@code SpanSnapshot.peerTagPairs} (a flat
{@code [name0, value0, name1, value1, ...]} array) is replaced by:
  - {@code PeerTagSchema peerTagSchema} -- nullable; schema for the
    values
  - {@code String[] peerTagValues} -- parallel to schema.names

The producer captures only values; the consumer constructs the encoded
{@code "name:value"} UTF8 forms via
{@code schema.handler(i).register(value)} on its own thread.

Consumer-side cleanups bundled in
---------------------------------
While here, also addresses the perf review items raised against the
prior commit:
  - {@code hashOf}'s peer-tag loop is now indexed iteration; no more
    iterator allocation per snapshot.
  - {@code Canonical} now owns a reusable {@code peerTagsBuffer}
    ArrayList that's cleared+refilled per {@code populate} call -- zero
    allocation on the hit path. The buffer is copied into an immutable
    list only on miss when the entry needs to own it long-term.
  - {@code Canonical.matches} uses indexed list comparison; no iterator
    alloc in {@code List.equals}.
  - The {@code HashMap PEER_TAG_HANDLERS} on {@code AggregateEntry} is
    gone, replaced by the {@link PeerTagSchema}'s parallel array layout.
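
Illustration (not part of the patch): a minimal, self-contained sketch
of the schema-indexed idea described above, assuming a simplified
string-based handler. SchemaSketch, CappedHandler, Schema and populate
are hypothetical names and do not mirror the real PeerTagSchema or
TagCardinalityHandler signatures; the "name:blocked_by_tracer" sentinel
form for tags is likewise an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; none of these types exist in dd-trace-java. */
final class SchemaSketch {
  static final String BLOCKED = "blocked_by_tracer";

  /** Caps distinct values for one tag name; overflow maps to one shared sentinel. */
  static final class CappedHandler {
    private final String name;
    private final int limit;
    private final String sentinel; // assumed form: "name:blocked_by_tracer"
    private final Map<String, String> seen = new HashMap<>();

    CappedHandler(String name, int limit) {
      this.name = name;
      this.limit = limit;
      this.sentinel = name + ":" + BLOCKED;
    }

    /** Returns the cached "name:value" form, or the sentinel once the budget is spent. */
    String register(String value) {
      String encoded = seen.get(value);
      if (encoded != null) {
        return encoded;
      }
      if (seen.size() >= limit) {
        return sentinel;
      }
      encoded = name + ":" + value;
      seen.put(value, encoded);
      return encoded;
    }

    /** Drops the working set so the budget refreshes. */
    void reset() {
      seen.clear();
    }
  }

  /** Parallel arrays: names[i] pairs with handlers[i]. */
  static final class Schema {
    final String[] names;
    final CappedHandler[] handlers;

    Schema(int limit, String... names) {
      this.names = names.clone();
      this.handlers = new CappedHandler[names.length];
      for (int i = 0; i < names.length; i++) {
        handlers[i] = new CappedHandler(names[i], limit);
      }
    }
  }

  /**
   * Clears and refills a reusable buffer. values is assumed to be parallel to schema.names;
   * null values are skipped. Handler lookup is an array index, not a map probe.
   */
  static void populate(Schema schema, String[] values, ArrayList<String> buffer) {
    buffer.clear();
    if (schema == null || values == null) {
      return;
    }
    for (int i = 0; i < schema.names.length; i++) {
      String v = values[i];
      if (v != null) {
        buffer.add(schema.handlers[i].register(v));
      }
    }
  }

  public static void main(String[] args) {
    Schema schema = new Schema(1, "peer.hostname", "db.instance");
    ArrayList<String> buffer = new ArrayList<>(4);
    populate(schema, new String[] {"host-a", null}, buffer);
    System.out.println(buffer);
    populate(schema, new String[] {"host-b", "orders"}, buffer);
    System.out.println(buffer);
  }
}
```

With a limit of 1 per name, the second distinct peer.hostname value in
the sketch collapses to the sentinel while db.instance still has budget.
The same buffer instance is reused across both calls, which is the
property the hit path relies on.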

Benchmark (2 forks x 5 iter x 15s)
----------------------------------
  SimpleSpan bench: 3.165 +- 0.032 us/op (prior: 3.117 +- 0.026)
  DDSpan bench:     2.727 +- 0.018 us/op (prior: 2.344 +- 0.114)

Some producer-side regression from the per-snapshot schema sync
(volatile read + identity check). The fast-path identity comparison
keeps it small; hoisting the sync out of the per-snapshot loop is
possible but would change behavior in the edge case where
{@code features.peerTags()} returns different Sets within a single trace
(covered by an existing test). Choosing correctness over the marginal
speedup.

Tests
-----
AggregateTableTest's snapshot builder is updated to construct a schema +
values via {@code PeerTagSchema.currentSyncedTo}, exercising the same
code path as production. Existing peer-tag test in
{@code ConflatingMetricAggregatorTest} still passes unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .../trace/common/metrics/AggregateEntry.java  | 107 +++++++++------
 .../metrics/ConflatingMetricsAggregator.java  |  72 ++++++-----
 .../trace/common/metrics/PeerTagSchema.java   | 122 ++++++++++++++++++
 .../trace/common/metrics/SpanSnapshot.java    |  20 ++-
 .../common/metrics/AggregateTableTest.java    |  24 +++-
 5 files changed, 264 insertions(+), 81 deletions(-)
 create mode 100644 dd-trace-core/src/main/java/datadog/trace/common/metrics/PeerTagSchema.java

diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java
index c28bf5722f6..225f03197e5 100644
--- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java
+++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java
@@ -5,9 +5,7 @@
 import datadog.trace.util.LongHashingUtils;
 import java.util.ArrayList;
 import java.util.Collections;
-import java.util.HashMap;
 import java.util.List;
-import java.util.Map;
 import java.util.Objects;
 
 /**
@@ -48,11 +46,6 @@ final class
AggregateEntry extends Hashtable.Entry { static final PropertyCardinalityHandler GRPC_STATUS_CODE_HANDLER = new PropertyCardinalityHandler(32); - /** Per-peer-tag-name {@link TagCardinalityHandler}, each sized to 512 distinct values. */ - private static final Map PEER_TAG_HANDLERS = new HashMap<>(); - - private static final int PEER_TAG_VALUE_LIMIT = 512; - final UTF8BytesString resource; final UTF8BytesString service; final UTF8BytesString operationName; @@ -181,9 +174,7 @@ static void resetCardinalityHandlers() { HTTP_METHOD_HANDLER.reset(); HTTP_ENDPOINT_HANDLER.reset(); GRPC_STATUS_CODE_HANDLER.reset(); - for (TagCardinalityHandler h : PEER_TAG_HANDLERS.values()) { - h.reset(); - } + PeerTagSchema.resetAll(); } /** @@ -216,8 +207,10 @@ static long hashOf( h = LongHashingUtils.addToHash(h, synthetic); h = LongHashingUtils.addToHash(h, traceRoot); h = LongHashingUtils.addToHash(h, spanKind); - for (UTF8BytesString p : peerTags) { - h = LongHashingUtils.addToHash(h, p); + // indexed iteration -- avoids the iterator allocation a for-each over a List would do + int peerTagCount = peerTags.size(); + for (int i = 0; i < peerTagCount; i++) { + h = LongHashingUtils.addToHash(h, peerTags.get(i)); } h = LongHashingUtils.addToHash(h, httpMethod); h = LongHashingUtils.addToHash(h, httpEndpoint); @@ -329,7 +322,14 @@ static final class Canonical { short httpStatusCode; boolean synthetic; boolean traceRoot; - List peerTags; + + /** + * Reusable buffer of canonicalized peer-tag UTF8 forms. Cleared and refilled in {@link + * #populate}; on miss, {@link #toEntry} copies it into an immutable list for the entry to own. + * Zero allocation on the hit path. + */ + final ArrayList peerTagsBuffer = new ArrayList<>(4); + long keyHash; /** Canonicalize all fields from {@code s} through the handlers into this buffer. 
*/ @@ -349,7 +349,7 @@ void populate(SpanSnapshot s) { this.httpStatusCode = s.httpStatusCode; this.synthetic = s.synthetic; this.traceRoot = s.traceRoot; - this.peerTags = canonicalizePeerTags(s.peerTagPairs); + populatePeerTags(s.peerTagSchema, s.peerTagValues); this.keyHash = hashOf( resource, @@ -364,7 +364,26 @@ void populate(SpanSnapshot s) { httpStatusCode, synthetic, traceRoot, - peerTags); + peerTagsBuffer); + } + + /** + * Fills {@link #peerTagsBuffer} with canonical UTF8 forms, applying {@code schema.handler(i)} + * to each non-null value at the same index. No allocation when the schema/values are absent or + * all values are null (buffer is just cleared). + */ + private void populatePeerTags(PeerTagSchema schema, String[] values) { + peerTagsBuffer.clear(); + if (schema == null || values == null) { + return; + } + int n = schema.size(); + for (int i = 0; i < n; i++) { + String v = values[i]; + if (v != null) { + peerTagsBuffer.add(schema.handler(i).register(v)); + } + } } /** @@ -382,14 +401,41 @@ boolean matches(AggregateEntry e) { && Objects.equals(serviceSource, e.serviceSource) && Objects.equals(type, e.type) && Objects.equals(spanKind, e.spanKind) - && peerTags.equals(e.peerTags) + && peerTagsEqual(peerTagsBuffer, e.peerTags) && Objects.equals(httpMethod, e.httpMethod) && Objects.equals(httpEndpoint, e.httpEndpoint) && Objects.equals(grpcStatusCode, e.grpcStatusCode); } - /** Build a new entry from the currently-populated canonical fields. */ + /** Indexed list comparison -- avoids the iterator a {@code List.equals} would allocate. */ + private static boolean peerTagsEqual(List a, List b) { + int n = a.size(); + if (n != b.size()) { + return false; + } + for (int i = 0; i < n; i++) { + if (!a.get(i).equals(b.get(i))) { + return false; + } + } + return true; + } + + /** + * Build a new entry from the currently-populated canonical fields. 
The peer-tag buffer is + * copied into an immutable list so the entry's reference stays stable across subsequent {@link + * #populate} calls. + */ AggregateEntry toEntry(AggregateMetric aggregate) { + List snapshottedPeerTags; + int n = peerTagsBuffer.size(); + if (n == 0) { + snapshottedPeerTags = Collections.emptyList(); + } else if (n == 1) { + snapshottedPeerTags = Collections.singletonList(peerTagsBuffer.get(0)); + } else { + snapshottedPeerTags = new ArrayList<>(peerTagsBuffer); + } return new AggregateEntry( keyHash, resource, @@ -404,7 +450,7 @@ AggregateEntry toEntry(AggregateMetric aggregate) { httpStatusCode, synthetic, traceRoot, - peerTags, + snapshottedPeerTags, aggregate); } } @@ -426,29 +472,4 @@ private static UTF8BytesString createUtf8(CharSequence cs) { } return UTF8BytesString.create(cs.toString()); } - - /** Production-path peer-tag canonicalization via per-name {@link TagCardinalityHandler}. */ - private static List canonicalizePeerTags(String[] pairs) { - if (pairs == null || pairs.length == 0) { - return Collections.emptyList(); - } - if (pairs.length == 2) { - return Collections.singletonList(handlerFor(pairs[0]).register(pairs[1])); - } - List tags = new ArrayList<>(pairs.length / 2); - for (int i = 0; i < pairs.length; i += 2) { - tags.add(handlerFor(pairs[i]).register(pairs[i + 1])); - } - return tags; - } - - private static TagCardinalityHandler handlerFor(String peerTagName) { - TagCardinalityHandler h = PEER_TAG_HANDLERS.get(peerTagName); - if (h != null) { - return h; - } - h = new TagCardinalityHandler(peerTagName, PEER_TAG_VALUE_LIMIT); - PEER_TAG_HANDLERS.put(peerTagName, h); - return h; - } } diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/ConflatingMetricsAggregator.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/ConflatingMetricsAggregator.java index c675fcb23c4..7497ed9a799 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/ConflatingMetricsAggregator.java +++ 
b/dd-trace-core/src/main/java/datadog/trace/common/metrics/ConflatingMetricsAggregator.java @@ -2,7 +2,6 @@ import static datadog.communication.ddagent.DDAgentFeaturesDiscovery.V06_METRICS_ENDPOINT; import static datadog.trace.api.DDSpanTypes.RPC; -import static datadog.trace.api.DDTags.BASE_SERVICE; import static datadog.trace.bootstrap.instrumentation.api.Tags.HTTP_ENDPOINT; import static datadog.trace.bootstrap.instrumentation.api.Tags.HTTP_METHOD; import static datadog.trace.bootstrap.instrumentation.api.Tags.SPAN_KIND; @@ -294,6 +293,15 @@ private boolean publish(CoreSpan span, boolean isTopLevel) { long tagAndDuration = span.getDurationNano() | (error ? ERROR_TAG : 0L) | (isTopLevel ? TOP_LEVEL_TAG : 0L); + PeerTagSchema peerTagSchema = peerTagSchemaFor(span); + String[] peerTagValues = + peerTagSchema == null ? null : capturePeerTagValues(span, peerTagSchema); + if (peerTagValues == null) { + // capture returned no non-null values -- drop the schema reference so the consumer doesn't + // bother iterating an all-null array. + peerTagSchema = null; + } + SpanSnapshot snapshot = new SpanSnapshot( span.getResourceName(), @@ -305,7 +313,8 @@ private boolean publish(CoreSpan span, boolean isTopLevel) { isSynthetic(span), span.getParentId() == 0, spanKind, - extractPeerTagPairs(span), + peerTagSchema, + peerTagValues, httpMethod, httpEndpoint, grpcStatusCode, @@ -317,41 +326,44 @@ private boolean publish(CoreSpan span, boolean isTopLevel) { return error; } - private String[] extractPeerTagPairs(CoreSpan span) { + /** + * Picks the peer-tag schema for a span. For peer-aggregation kinds, syncs the schema with + * {@code features.peerTags()} so producer and consumer share the same name/handler ordering. + * For internal-kind spans returns the static {@link PeerTagSchema#INTERNAL} schema. 
+ */ + private PeerTagSchema peerTagSchemaFor(CoreSpan span) { if (span.isKind(PEER_AGGREGATION_KINDS)) { - final Set eligiblePeerTags = features.peerTags(); - String[] pairs = null; - int count = 0; - for (String peerTag : eligiblePeerTags) { - Object value = span.unsafeGetTag(peerTag); - if (value != null) { - if (pairs == null) { - // pairs are flattened [name, value, ...]; size for worst case - pairs = new String[eligiblePeerTags.size() * 2]; - } - pairs[count++] = peerTag; - pairs[count++] = value.toString(); - } - } - if (pairs == null) { + Set eligible = features.peerTags(); + if (eligible == null || eligible.isEmpty()) { return null; } - if (count < pairs.length) { - String[] trimmed = new String[count]; - System.arraycopy(pairs, 0, trimmed, 0, count); - return trimmed; - } - return pairs; - } else if (span.isKind(INTERNAL_KIND)) { - // in this case only the base service should be aggregated if present - final Object baseService = span.unsafeGetTag(BASE_SERVICE); - if (baseService != null) { - return new String[] {BASE_SERVICE, baseService.toString()}; - } + return PeerTagSchema.currentSyncedTo(eligible); + } + if (span.isKind(INTERNAL_KIND)) { + return PeerTagSchema.INTERNAL; } return null; } + /** + * Captures the span's peer tag values into a {@code String[]} parallel to {@code schema.names}. + * Returns {@code null} when none of the configured peer tags are set on the span. 
+ */ + private static String[] capturePeerTagValues(CoreSpan span, PeerTagSchema schema) { + int n = schema.size(); + String[] values = null; + for (int i = 0; i < n; i++) { + Object v = span.unsafeGetTag(schema.name(i)); + if (v != null) { + if (values == null) { + values = new String[n]; + } + values[i] = v.toString(); + } + } + return values; + } + private static boolean isSynthetic(CoreSpan span) { return span.getOrigin() != null && SYNTHETICS_ORIGIN.equals(span.getOrigin().toString()); } diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/PeerTagSchema.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/PeerTagSchema.java new file mode 100644 index 00000000000..f41b2634da6 --- /dev/null +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/PeerTagSchema.java @@ -0,0 +1,122 @@ +package datadog.trace.common.metrics; + +import static datadog.trace.api.DDTags.BASE_SERVICE; + +import java.util.Set; + +/** + * Parallel arrays of peer-tag names and their {@link TagCardinalityHandler}s, indexed in lockstep. + * + *

<p>Replaces the previous {@code Map<String, TagCardinalityHandler>} lookup with positional array + * access: the producer captures span tag values into a {@code String[]} parallel to {@link #names}, + * and the consumer applies {@link #handler(int)} at the same index to canonicalize. + * + *

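<p>Two schemas exist: + * + * <ul> + * <li>{@link #INTERNAL}: the singleton schema for internal-kind spans, holding only the {@code BASE_SERVICE} tag. + * <li>{@code CURRENT}: the schema for peer-aggregation kinds, replaced when the peer tag names change. + * </ul> + * + *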

<p>Each {@link SpanSnapshot} captures its own schema reference so producer and consumer agree on + * the indexing even if the current schema is replaced between capture and consumption. + * + *

<p>Thread-safety: {@link #currentSyncedTo} may be called from producer threads; + * replacement of the volatile {@code CURRENT} reference is guarded by a lock. The {@link + * TagCardinalityHandler}s themselves are not thread-safe and must only be exercised on the + * aggregator thread (this is where the snapshot's schema is consumed). + */ +final class PeerTagSchema { + + private static final int VALUE_LIMIT_PER_TAG = 512; + + /** Singleton schema for internal-kind spans -- only {@code base.service}. */ + static final PeerTagSchema INTERNAL = new PeerTagSchema(new String[] {BASE_SERVICE}); + + /** Current schema for peer-aggregation kinds; replaced atomically when peer tag names change. */ + private static volatile PeerTagSchema CURRENT = new PeerTagSchema(new String[0]); + + /** + * Identity cache of the most recently observed {@code features.peerTags()} {@link Set} instance. + * The producer hot path checks this first and skips the {@code names}-vs-set comparison when the + * caller's set instance hasn't changed. In production this is the common case -- + * {@code DDAgentFeaturesDiscovery} returns the same Set instance until reconfiguration. + */ + private static volatile Set<String> LAST_SYNCED_INPUT; + + final String[] names; + final TagCardinalityHandler[] handlers; + + private PeerTagSchema(String[] names) { + this.names = names; + this.handlers = new TagCardinalityHandler[names.length]; + for (int i = 0; i < names.length; i++) { + this.handlers[i] = new TagCardinalityHandler(names[i], VALUE_LIMIT_PER_TAG); + } + } + + /** + * Returns the current peer-aggregation schema, lazily refreshing it if the supplied {@code + * peerTagNames} differ from the cached set. Designed to be called from the producer hot path: the + * common case is an identity check against the last synced {@link Set} instance; only a new + * instance falls through to the array-length / set-contains comparison.
+ */ + static PeerTagSchema currentSyncedTo(Set<String> peerTagNames) { + // Fast path: same Set instance as the last sync -> the cached schema is still valid, no + // matches() loop needed. In production this is the steady-state case. + if (peerTagNames == LAST_SYNCED_INPUT) { + return CURRENT; + } + PeerTagSchema cur = CURRENT; + if (matches(cur.names, peerTagNames)) { + LAST_SYNCED_INPUT = peerTagNames; + return cur; + } + synchronized (PeerTagSchema.class) { + cur = CURRENT; + if (!matches(cur.names, peerTagNames)) { + cur = new PeerTagSchema(peerTagNames.toArray(new String[0])); + CURRENT = cur; + } + LAST_SYNCED_INPUT = peerTagNames; + return cur; + } + } + + /** Resets the working sets of {@link #INTERNAL} and {@link #CURRENT}. */ + static void resetAll() { + PeerTagSchema cur = CURRENT; + for (TagCardinalityHandler h : cur.handlers) { + h.reset(); + } + for (TagCardinalityHandler h : INTERNAL.handlers) { + h.reset(); + } + } + + int size() { + return names.length; + } + + String name(int i) { + return names[i]; + } + + TagCardinalityHandler handler(int i) { + return handlers[i]; + } + + private static boolean matches(String[] cur, Set<String> set) { + if (cur.length != set.size()) { + return false; + } + for (String n : cur) { + if (!set.contains(n)) { + return false; + } + } + return true; + } +} diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/SpanSnapshot.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/SpanSnapshot.java index b7f81712945..5967c1302c7 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/SpanSnapshot.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/SpanSnapshot.java @@ -21,10 +21,18 @@ final class SpanSnapshot implements InboxItem { final String spanKind; /** - * Flattened name/value pairs of peer-tag matches: {@code [name0, value0, name1, value1, ...]}. - * {@code null} when there are no matches (the common case). + * Schema for {@link #peerTagValues}.
{@code null} when the span has no peer tags. The schema + * carries the names + {@link TagCardinalityHandler}s in parallel array form; {@code + * peerTagValues} holds the per-span tag values at the same indices. */ - final String[] peerTagPairs; + final PeerTagSchema peerTagSchema; + + /** + * Peer tag values captured from the span, parallel to {@code peerTagSchema.names}. A {@code null} + * entry means the span didn't have that peer tag set. {@code null} (the whole array) when {@link + * #peerTagSchema} is {@code null}. + */ + final String[] peerTagValues; final String httpMethod; final String httpEndpoint; @@ -43,7 +51,8 @@ final class SpanSnapshot implements InboxItem { boolean synthetic, boolean traceRoot, String spanKind, - String[] peerTagPairs, + PeerTagSchema peerTagSchema, + String[] peerTagValues, String httpMethod, String httpEndpoint, String grpcStatusCode, @@ -57,7 +66,8 @@ final class SpanSnapshot implements InboxItem { this.synthetic = synthetic; this.traceRoot = traceRoot; this.spanKind = spanKind; - this.peerTagPairs = peerTagPairs; + this.peerTagSchema = peerTagSchema; + this.peerTagValues = peerTagValues; this.httpMethod = httpMethod; this.httpEndpoint = httpEndpoint; this.grpcStatusCode = grpcStatusCode; diff --git a/dd-trace-core/src/test/java/datadog/trace/common/metrics/AggregateTableTest.java b/dd-trace-core/src/test/java/datadog/trace/common/metrics/AggregateTableTest.java index b8bf8fd1a3b..7a4f84c30dd 100644 --- a/dd-trace-core/src/test/java/datadog/trace/common/metrics/AggregateTableTest.java +++ b/dd-trace-core/src/test/java/datadog/trace/common/metrics/AggregateTableTest.java @@ -220,7 +220,8 @@ private static final class SnapshotBuilder { private final String service; private final String operation; private final String spanKind; - private String[] peerTagPairs; + private PeerTagSchema peerTagSchema; + private String[] peerTagValues; private long tagAndDuration = 0L; SnapshotBuilder(String service, String operation, String spanKind) { 
@@ -230,7 +231,23 @@ private static final class SnapshotBuilder { } SnapshotBuilder peerTags(String... namesAndValues) { - this.peerTagPairs = namesAndValues; + // Build a schema from the (name, value, name, value, ...) input. Synced through the + // production singleton so canonicalization actually goes through the same handlers the + // aggregator would use in production -- which is the surface the test wants to exercise. + java.util.LinkedHashSet names = new java.util.LinkedHashSet<>(); + for (int i = 0; i < namesAndValues.length; i += 2) { + names.add(namesAndValues[i]); + } + this.peerTagSchema = PeerTagSchema.currentSyncedTo(names); + this.peerTagValues = new String[peerTagSchema.size()]; + for (int i = 0; i < namesAndValues.length; i += 2) { + for (int j = 0; j < peerTagSchema.size(); j++) { + if (peerTagSchema.name(j).equals(namesAndValues[i])) { + peerTagValues[j] = namesAndValues[i + 1]; + break; + } + } + } return this; } @@ -245,7 +262,8 @@ SpanSnapshot build() { false, true, spanKind, - peerTagPairs, + peerTagSchema, + peerTagValues, null, null, null, From ceec2afefaafc714981f9b4ff1d02a876fd8d093 Mon Sep 17 00:00:00 2001 From: Douglas Q Hawkins Date: Fri, 15 May 2026 17:35:16 -0400 Subject: [PATCH 04/33] Rename ConflatingMetricsAggregator to ClientStatsAggregator The "Conflating" in the name dates from the prior design that used a Batch pool + pending map to conflate up to 64 hits per inbox slot. That mechanism is gone -- the producer now publishes one SpanSnapshot per span and the consumer's AggregateTable is the conflation point. The new name matches the existing protocol/metric terminology (HealthMetrics.onClientStat*, stats.flush_payloads, etc.). 
File renames: ConflatingMetricsAggregator.java -> ClientStatsAggregator.java ConflatingMetricAggregatorTest.groovy -> ClientStatsAggregatorTest.groovy ConflatingMetricsAggregatorBenchmark -> ClientStatsAggregatorBenchmark ConflatingMetricsAggregatorDDSpan* -> ClientStatsAggregatorDDSpan* Plus all symbol references in MetricsAggregatorFactory and the test fixtures that referenced the old class name. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) --- ...va => ClientStatsAggregatorBenchmark.java} | 6 +- ...ClientStatsAggregatorDDSpanBenchmark.java} | 14 ++--- ...egator.java => ClientStatsAggregator.java} | 27 ++++----- .../metrics/MetricsAggregatorFactory.java | 2 +- ...roovy => ClientStatsAggregatorTest.groovy} | 60 +++++++++---------- .../common/metrics/FootprintForkedTest.groovy | 2 +- .../MetricsAggregatorFactoryTest.groovy | 2 +- 7 files changed, 56 insertions(+), 57 deletions(-) rename dd-trace-core/src/jmh/java/datadog/trace/common/metrics/{ConflatingMetricsAggregatorBenchmark.java => ClientStatsAggregatorBenchmark.java} (95%) rename dd-trace-core/src/jmh/java/datadog/trace/common/metrics/{ConflatingMetricsAggregatorDDSpanBenchmark.java => ClientStatsAggregatorDDSpanBenchmark.java} (85%) rename dd-trace-core/src/main/java/datadog/trace/common/metrics/{ConflatingMetricsAggregator.java => ClientStatsAggregator.java} (94%) rename dd-trace-core/src/test/groovy/datadog/trace/common/metrics/{ConflatingMetricAggregatorTest.groovy => ClientStatsAggregatorTest.groovy} (95%) diff --git a/dd-trace-core/src/jmh/java/datadog/trace/common/metrics/ConflatingMetricsAggregatorBenchmark.java b/dd-trace-core/src/jmh/java/datadog/trace/common/metrics/ClientStatsAggregatorBenchmark.java similarity index 95% rename from dd-trace-core/src/jmh/java/datadog/trace/common/metrics/ConflatingMetricsAggregatorBenchmark.java rename to dd-trace-core/src/jmh/java/datadog/trace/common/metrics/ClientStatsAggregatorBenchmark.java index b9a2f7f8c54..b9d72eaf3ab 100644 --- 
a/dd-trace-core/src/jmh/java/datadog/trace/common/metrics/ConflatingMetricsAggregatorBenchmark.java +++ b/dd-trace-core/src/jmh/java/datadog/trace/common/metrics/ClientStatsAggregatorBenchmark.java @@ -34,12 +34,12 @@ @BenchmarkMode(Mode.AverageTime) @OutputTimeUnit(MICROSECONDS) @Fork(value = 1) -public class ConflatingMetricsAggregatorBenchmark { +public class ClientStatsAggregatorBenchmark { private final DDAgentFeaturesDiscovery featuresDiscovery = new FixedAgentFeaturesDiscovery( Collections.singleton("peer.hostname"), Collections.emptySet()); - private final ConflatingMetricsAggregator aggregator = - new ConflatingMetricsAggregator( + private final ClientStatsAggregator aggregator = + new ClientStatsAggregator( new WellKnownTags("", "", "", "", "", ""), Collections.emptySet(), featuresDiscovery, diff --git a/dd-trace-core/src/jmh/java/datadog/trace/common/metrics/ConflatingMetricsAggregatorDDSpanBenchmark.java b/dd-trace-core/src/jmh/java/datadog/trace/common/metrics/ClientStatsAggregatorDDSpanBenchmark.java similarity index 85% rename from dd-trace-core/src/jmh/java/datadog/trace/common/metrics/ConflatingMetricsAggregatorDDSpanBenchmark.java rename to dd-trace-core/src/jmh/java/datadog/trace/common/metrics/ClientStatsAggregatorDDSpanBenchmark.java index 02c6aaffc1a..06052c57ded 100644 --- a/dd-trace-core/src/jmh/java/datadog/trace/common/metrics/ConflatingMetricsAggregatorDDSpanBenchmark.java +++ b/dd-trace-core/src/jmh/java/datadog/trace/common/metrics/ClientStatsAggregatorDDSpanBenchmark.java @@ -28,8 +28,8 @@ import org.openjdk.jmh.infra.Blackhole; /** - * Parallels {@link ConflatingMetricsAggregatorBenchmark} but uses real {@link DDSpan} instances - * instead of the lightweight {@code SimpleSpan} mock, so the JIT exercises the production {@link + * Parallels {@link ClientStatsAggregatorBenchmark} but uses real {@link DDSpan} instances instead + * of the lightweight {@code SimpleSpan} mock, so the JIT exercises the production {@link * CoreSpan#isKind} 
path (cached span.kind ordinal + bit-test) rather than the groovy mock's * dispatch. */ @@ -39,21 +39,21 @@ @BenchmarkMode(Mode.AverageTime) @OutputTimeUnit(MICROSECONDS) @Fork(value = 1) -public class ConflatingMetricsAggregatorDDSpanBenchmark { +public class ClientStatsAggregatorDDSpanBenchmark { private static final CoreTracer TRACER = CoreTracer.builder().writer(new NoopWriter()).strictTraceWrites(false).build(); private final DDAgentFeaturesDiscovery featuresDiscovery = - new ConflatingMetricsAggregatorBenchmark.FixedAgentFeaturesDiscovery( + new ClientStatsAggregatorBenchmark.FixedAgentFeaturesDiscovery( Collections.singleton("peer.hostname"), Collections.emptySet()); - private final ConflatingMetricsAggregator aggregator = - new ConflatingMetricsAggregator( + private final ClientStatsAggregator aggregator = + new ClientStatsAggregator( new WellKnownTags("", "", "", "", "", ""), Collections.emptySet(), featuresDiscovery, HealthMetrics.NO_OP, - new ConflatingMetricsAggregatorBenchmark.NullSink(), + new ClientStatsAggregatorBenchmark.NullSink(), 2048, 2048, false); diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/ConflatingMetricsAggregator.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/ClientStatsAggregator.java similarity index 94% rename from dd-trace-core/src/main/java/datadog/trace/common/metrics/ConflatingMetricsAggregator.java rename to dd-trace-core/src/main/java/datadog/trace/common/metrics/ClientStatsAggregator.java index 7497ed9a799..1b1aeec402a 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/ConflatingMetricsAggregator.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/ClientStatsAggregator.java @@ -39,9 +39,9 @@ import org.slf4j.Logger; import org.slf4j.LoggerFactory; -public final class ConflatingMetricsAggregator implements MetricsAggregator, EventListener { +public final class ClientStatsAggregator implements MetricsAggregator, EventListener { - private static final 
Logger log = LoggerFactory.getLogger(ConflatingMetricsAggregator.class); + private static final Logger log = LoggerFactory.getLogger(ClientStatsAggregator.class); private static final Map DEFAULT_HEADERS = Collections.singletonMap(DDAgentApi.DATADOG_META_TRACER_VERSION, DDTraceCoreInfo.VERSION); @@ -75,7 +75,7 @@ public final class ConflatingMetricsAggregator implements MetricsAggregator, Eve private volatile AgentTaskScheduler.Scheduled cancellation; - public ConflatingMetricsAggregator( + public ClientStatsAggregator( Config config, SharedCommunicationObjects sharedCommunicationObjects, HealthMetrics healthMetrics) { @@ -96,7 +96,7 @@ public ConflatingMetricsAggregator( config.isTraceResourceRenamingEnabled()); } - ConflatingMetricsAggregator( + ClientStatsAggregator( WellKnownTags wellKnownTags, Set ignoredResources, DDAgentFeaturesDiscovery features, @@ -118,7 +118,7 @@ public ConflatingMetricsAggregator( includeEndpointInMetrics); } - ConflatingMetricsAggregator( + ClientStatsAggregator( WellKnownTags wellKnownTags, Set ignoredResources, DDAgentFeaturesDiscovery features, @@ -142,7 +142,7 @@ public ConflatingMetricsAggregator( includeEndpointInMetrics); } - ConflatingMetricsAggregator( + ClientStatsAggregator( Set ignoredResources, DDAgentFeaturesDiscovery features, HealthMetrics healthMetric, @@ -327,9 +327,9 @@ private boolean publish(CoreSpan span, boolean isTopLevel) { } /** - * Picks the peer-tag schema for a span. For peer-aggregation kinds, syncs the schema with - * {@code features.peerTags()} so producer and consumer share the same name/handler ordering. - * For internal-kind spans returns the static {@link PeerTagSchema#INTERNAL} schema. + * Picks the peer-tag schema for a span. For peer-aggregation kinds, syncs the schema with {@code + * features.peerTags()} so producer and consumer share the same name/handler ordering. For + * internal-kind spans returns the static {@link PeerTagSchema#INTERNAL} schema. 
*/ private PeerTagSchema peerTagSchemaFor(CoreSpan span) { if (span.isKind(PEER_AGGREGATION_KINDS)) { @@ -411,17 +411,16 @@ private void disable() { if (!features.supportsMetrics()) { log.debug("Disabling metric reporting because an agent downgrade was detected"); // Route the clear through the inbox so the aggregator thread is the only writer. - // AggregateTable is not thread-safe; calling clearAggregates() directly from this thread - // would race with Drainer.accept on the aggregator thread. + // AggregateTable is not thread-safe; clearing it directly from this thread would race + // with Drainer.accept on the aggregator thread. inbox.offer(CLEAR); } } - private static final class ReportTask - implements AgentTaskScheduler.Task { + private static final class ReportTask implements AgentTaskScheduler.Task { @Override - public void run(ConflatingMetricsAggregator target) { + public void run(ClientStatsAggregator target) { target.report(); } } diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/MetricsAggregatorFactory.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/MetricsAggregatorFactory.java index 09464310113..b9530871763 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/MetricsAggregatorFactory.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/MetricsAggregatorFactory.java @@ -15,7 +15,7 @@ public static MetricsAggregator createMetricsAggregator( HealthMetrics healthMetrics) { if (config.isTracerMetricsEnabled()) { log.debug("tracer metrics enabled"); - return new ConflatingMetricsAggregator(config, sharedCommunicationObjects, healthMetrics); + return new ClientStatsAggregator(config, sharedCommunicationObjects, healthMetrics); } log.debug("tracer metrics disabled"); return NoOpMetricsAggregator.INSTANCE; diff --git a/dd-trace-core/src/test/groovy/datadog/trace/common/metrics/ConflatingMetricAggregatorTest.groovy 
b/dd-trace-core/src/test/groovy/datadog/trace/common/metrics/ClientStatsAggregatorTest.groovy similarity index 95% rename from dd-trace-core/src/test/groovy/datadog/trace/common/metrics/ConflatingMetricAggregatorTest.groovy rename to dd-trace-core/src/test/groovy/datadog/trace/common/metrics/ClientStatsAggregatorTest.groovy index 4dd0155443a..1fbdd63dff3 100644 --- a/dd-trace-core/src/test/groovy/datadog/trace/common/metrics/ConflatingMetricAggregatorTest.groovy +++ b/dd-trace-core/src/test/groovy/datadog/trace/common/metrics/ClientStatsAggregatorTest.groovy @@ -18,7 +18,7 @@ import java.util.concurrent.TimeoutException import java.util.function.Supplier import spock.lang.Shared -class ConflatingMetricAggregatorTest extends DDSpecification { +class ClientStatsAggregatorTest extends DDSpecification { static Set empty = new HashSet<>() @@ -35,7 +35,7 @@ class ConflatingMetricAggregatorTest extends DDSpecification { DDAgentFeaturesDiscovery features = Mock(DDAgentFeaturesDiscovery) features.supportsMetrics() >> true WellKnownTags wellKnownTags = new WellKnownTags("runtimeid", "hostname", "env", "service", "version", "language") - ConflatingMetricsAggregator aggregator = new ConflatingMetricsAggregator( + ClientStatsAggregator aggregator = new ClientStatsAggregator( wellKnownTags, empty, features, @@ -65,7 +65,7 @@ class ConflatingMetricAggregatorTest extends DDSpecification { DDAgentFeaturesDiscovery features = Mock(DDAgentFeaturesDiscovery) features.supportsMetrics() >> true WellKnownTags wellKnownTags = new WellKnownTags("runtimeid", "hostname", "env", "service", "version", "language") - ConflatingMetricsAggregator aggregator = new ConflatingMetricsAggregator( + ClientStatsAggregator aggregator = new ClientStatsAggregator( wellKnownTags, [ignoredResourceName].toSet(), features, @@ -103,7 +103,7 @@ class ConflatingMetricAggregatorTest extends DDSpecification { DDAgentFeaturesDiscovery features = Mock(DDAgentFeaturesDiscovery) features.supportsMetrics() >> true 
features.peerTags() >> [] - ConflatingMetricsAggregator aggregator = new ConflatingMetricsAggregator(empty, + ClientStatsAggregator aggregator = new ClientStatsAggregator(empty, features, HealthMetrics.NO_OP, sink, writer, 10, queueSize, reportingInterval, SECONDS, false) aggregator.start() @@ -149,7 +149,7 @@ class ConflatingMetricAggregatorTest extends DDSpecification { DDAgentFeaturesDiscovery features = Mock(DDAgentFeaturesDiscovery) features.supportsMetrics() >> true features.peerTags() >> [] - ConflatingMetricsAggregator aggregator = new ConflatingMetricsAggregator(empty, + ClientStatsAggregator aggregator = new ClientStatsAggregator(empty, features, HealthMetrics.NO_OP, sink, writer, 10, queueSize, reportingInterval, SECONDS, false) aggregator.start() @@ -195,7 +195,7 @@ class ConflatingMetricAggregatorTest extends DDSpecification { DDAgentFeaturesDiscovery features = Mock(DDAgentFeaturesDiscovery) features.supportsMetrics() >> true features.peerTags() >> [] - ConflatingMetricsAggregator aggregator = new ConflatingMetricsAggregator(empty, + ClientStatsAggregator aggregator = new ClientStatsAggregator(empty, features, HealthMetrics.NO_OP, sink, writer, 10, queueSize, reportingInterval, SECONDS, true) aggregator.start() @@ -260,7 +260,7 @@ class ConflatingMetricAggregatorTest extends DDSpecification { DDAgentFeaturesDiscovery features = Mock(DDAgentFeaturesDiscovery) features.supportsMetrics() >> true features.peerTags() >>> [["country"], ["country", "georegion"],] - ConflatingMetricsAggregator aggregator = new ConflatingMetricsAggregator(empty, + ClientStatsAggregator aggregator = new ClientStatsAggregator(empty, features, HealthMetrics.NO_OP, sink, writer, 10, queueSize, reportingInterval, SECONDS, false) aggregator.start() @@ -327,7 +327,7 @@ class ConflatingMetricAggregatorTest extends DDSpecification { DDAgentFeaturesDiscovery features = Mock(DDAgentFeaturesDiscovery) features.supportsMetrics() >> true features.peerTags() >> ["peer.hostname", 
"_dd.base_service"] - ConflatingMetricsAggregator aggregator = new ConflatingMetricsAggregator(empty, + ClientStatsAggregator aggregator = new ClientStatsAggregator(empty, features, HealthMetrics.NO_OP, sink, writer, 10, queueSize, reportingInterval, SECONDS, false) aggregator.start() @@ -380,7 +380,7 @@ class ConflatingMetricAggregatorTest extends DDSpecification { DDAgentFeaturesDiscovery features = Mock(DDAgentFeaturesDiscovery) features.supportsMetrics() >> true features.peerTags() >> [] - ConflatingMetricsAggregator aggregator = new ConflatingMetricsAggregator(empty, features, HealthMetrics.NO_OP, + ClientStatsAggregator aggregator = new ClientStatsAggregator(empty, features, HealthMetrics.NO_OP, sink, writer, 10, queueSize, reportingInterval, SECONDS, false) aggregator.start() @@ -432,7 +432,7 @@ class ConflatingMetricAggregatorTest extends DDSpecification { DDAgentFeaturesDiscovery features = Mock(DDAgentFeaturesDiscovery) features.supportsMetrics() >> true features.peerTags() >> [] - ConflatingMetricsAggregator aggregator = new ConflatingMetricsAggregator(empty, + ClientStatsAggregator aggregator = new ClientStatsAggregator(empty, features, HealthMetrics.NO_OP, sink, writer, 10, queueSize, reportingInterval, SECONDS, false) long duration = 100 List trace = [ @@ -504,7 +504,7 @@ class ConflatingMetricAggregatorTest extends DDSpecification { DDAgentFeaturesDiscovery features = Mock(DDAgentFeaturesDiscovery) features.supportsMetrics() >> true features.peerTags() >> [] - ConflatingMetricsAggregator aggregator = new ConflatingMetricsAggregator(empty, + ClientStatsAggregator aggregator = new ClientStatsAggregator(empty, features, HealthMetrics.NO_OP, sink, writer, 10, queueSize, reportingInterval, SECONDS, true) aggregator.start() @@ -631,7 +631,7 @@ class ConflatingMetricAggregatorTest extends DDSpecification { DDAgentFeaturesDiscovery features = Mock(DDAgentFeaturesDiscovery) features.supportsMetrics() >> true features.peerTags() >> [] - 
ConflatingMetricsAggregator aggregator = new ConflatingMetricsAggregator(empty, + ClientStatsAggregator aggregator = new ClientStatsAggregator(empty, features, HealthMetrics.NO_OP, sink, writer, 10, queueSize, reportingInterval, SECONDS, true) aggregator.start() @@ -746,7 +746,7 @@ class ConflatingMetricAggregatorTest extends DDSpecification { DDAgentFeaturesDiscovery features = Mock(DDAgentFeaturesDiscovery) features.supportsMetrics() >> true features.peerTags() >> [] - ConflatingMetricsAggregator aggregator = new ConflatingMetricsAggregator(empty, + ClientStatsAggregator aggregator = new ClientStatsAggregator(empty, features, HealthMetrics.NO_OP, sink, writer, 10, queueSize, reportingInterval, SECONDS, true) aggregator.start() @@ -816,7 +816,7 @@ class ConflatingMetricAggregatorTest extends DDSpecification { DDAgentFeaturesDiscovery features = Mock(DDAgentFeaturesDiscovery) features.supportsMetrics() >> true features.peerTags() >> [] - ConflatingMetricsAggregator aggregator = new ConflatingMetricsAggregator(empty, + ClientStatsAggregator aggregator = new ClientStatsAggregator(empty, features, HealthMetrics.NO_OP, sink, writer, 10, queueSize, reportingInterval, SECONDS, false) aggregator.start() @@ -888,7 +888,7 @@ class ConflatingMetricAggregatorTest extends DDSpecification { DDAgentFeaturesDiscovery features = Mock(DDAgentFeaturesDiscovery) features.supportsMetrics() >> true features.peerTags() >> [] - ConflatingMetricsAggregator aggregator = new ConflatingMetricsAggregator(empty, + ClientStatsAggregator aggregator = new ClientStatsAggregator(empty, features, HealthMetrics.NO_OP, sink, writer, maxAggregates, queueSize, reportingInterval, SECONDS, false) long duration = 100 aggregator.start() @@ -956,7 +956,7 @@ class ConflatingMetricAggregatorTest extends DDSpecification { features.supportsMetrics() >> true features.peerTags() >> [] HealthMetrics healthMetrics = Mock(HealthMetrics) - ConflatingMetricsAggregator aggregator = new ConflatingMetricsAggregator(empty, 
+ ClientStatsAggregator aggregator = new ClientStatsAggregator(empty, features, healthMetrics, sink, writer, maxAggregates, queueSize, reportingInterval, SECONDS, false) long duration = 100 aggregator.start() @@ -990,7 +990,7 @@ class ConflatingMetricAggregatorTest extends DDSpecification { features.supportsMetrics() >> true features.peerTags() >> [] HealthMetrics healthMetrics = Mock(HealthMetrics) - ConflatingMetricsAggregator aggregator = new ConflatingMetricsAggregator(empty, + ClientStatsAggregator aggregator = new ClientStatsAggregator(empty, features, healthMetrics, sink, writer, maxAggregates, queueSize, reportingInterval, SECONDS, false) aggregator.start() @@ -1035,7 +1035,7 @@ class ConflatingMetricAggregatorTest extends DDSpecification { DDAgentFeaturesDiscovery features = Mock(DDAgentFeaturesDiscovery) features.supportsMetrics() >> true features.peerTags() >> [] - ConflatingMetricsAggregator aggregator = new ConflatingMetricsAggregator(empty, + ClientStatsAggregator aggregator = new ClientStatsAggregator(empty, features, HealthMetrics.NO_OP, sink, writer, maxAggregates, queueSize, reportingInterval, SECONDS, false) long duration = 100 aggregator.start() @@ -1137,7 +1137,7 @@ class ConflatingMetricAggregatorTest extends DDSpecification { DDAgentFeaturesDiscovery features = Mock(DDAgentFeaturesDiscovery) features.supportsMetrics() >> true features.peerTags() >> [] - ConflatingMetricsAggregator aggregator = new ConflatingMetricsAggregator(empty, + ClientStatsAggregator aggregator = new ClientStatsAggregator(empty, features, HealthMetrics.NO_OP, sink, writer, maxAggregates, queueSize, reportingInterval, SECONDS, false) long duration = 100 aggregator.start() @@ -1197,7 +1197,7 @@ class ConflatingMetricAggregatorTest extends DDSpecification { DDAgentFeaturesDiscovery features = Mock(DDAgentFeaturesDiscovery) features.supportsMetrics() >> true features.peerTags() >> [] - ConflatingMetricsAggregator aggregator = new ConflatingMetricsAggregator(empty, + 
ClientStatsAggregator aggregator = new ClientStatsAggregator(empty, features, HealthMetrics.NO_OP, sink, writer, maxAggregates, queueSize, 1, SECONDS, false) long duration = 100 aggregator.start() @@ -1248,7 +1248,7 @@ class ConflatingMetricAggregatorTest extends DDSpecification { DDAgentFeaturesDiscovery features = Mock(DDAgentFeaturesDiscovery) features.supportsMetrics() >> true features.peerTags() >> [] - ConflatingMetricsAggregator aggregator = new ConflatingMetricsAggregator(empty, + ClientStatsAggregator aggregator = new ClientStatsAggregator(empty, features, HealthMetrics.NO_OP, sink, writer, maxAggregates, queueSize, 1, SECONDS, false) long duration = 100 aggregator.start() @@ -1279,7 +1279,7 @@ class ConflatingMetricAggregatorTest extends DDSpecification { MetricWriter writer = Mock(MetricWriter) Sink sink = Stub(Sink) DDAgentFeaturesDiscovery features = Mock(DDAgentFeaturesDiscovery) - ConflatingMetricsAggregator aggregator = new ConflatingMetricsAggregator(empty, + ClientStatsAggregator aggregator = new ClientStatsAggregator(empty, features, HealthMetrics.NO_OP, sink, writer, maxAggregates, queueSize, 1, SECONDS, false) aggregator.start() @@ -1301,7 +1301,7 @@ class ConflatingMetricAggregatorTest extends DDSpecification { DDAgentFeaturesDiscovery features = Mock(DDAgentFeaturesDiscovery) features.supportsMetrics() >> false features.peerTags() >> [] - ConflatingMetricsAggregator aggregator = new ConflatingMetricsAggregator(empty, + ClientStatsAggregator aggregator = new ClientStatsAggregator(empty, features, HealthMetrics.NO_OP, sink, writer, 10, queueSize, 200, MILLISECONDS, false) final spans = [ new SimpleSpan("service", "operation", "resource", "type", false, true, false, 0, 10, HTTP_OK) @@ -1333,7 +1333,7 @@ class ConflatingMetricAggregatorTest extends DDSpecification { Sink sink = Stub(Sink) DDAgentFeaturesDiscovery features = Mock(DDAgentFeaturesDiscovery) features.supportsMetrics() >> true - ConflatingMetricsAggregator aggregator = new 
ConflatingMetricsAggregator(empty, + ClientStatsAggregator aggregator = new ClientStatsAggregator(empty, features, HealthMetrics.NO_OP, sink, writer, maxAggregates, queueSize, 1, SECONDS, false) when: @@ -1366,7 +1366,7 @@ class ConflatingMetricAggregatorTest extends DDSpecification { Sink sink = Stub(Sink) DDAgentFeaturesDiscovery features = Mock(DDAgentFeaturesDiscovery) features.supportsMetrics() >> true - ConflatingMetricsAggregator aggregator = new ConflatingMetricsAggregator(empty, + ClientStatsAggregator aggregator = new ClientStatsAggregator(empty, features, HealthMetrics.NO_OP, sink, writer, 10, queueSize, reportingInterval, SECONDS, false) aggregator.start() @@ -1413,7 +1413,7 @@ class ConflatingMetricAggregatorTest extends DDSpecification { DDAgentFeaturesDiscovery features = Mock(DDAgentFeaturesDiscovery) features.supportsMetrics() >> true features.peerTags() >> [] - ConflatingMetricsAggregator aggregator = new ConflatingMetricsAggregator(empty, + ClientStatsAggregator aggregator = new ClientStatsAggregator(empty, features, HealthMetrics.NO_OP, sink, writer, 10, queueSize, reportingInterval, SECONDS, false) aggregator.start() @@ -1468,7 +1468,7 @@ class ConflatingMetricAggregatorTest extends DDSpecification { DDAgentFeaturesDiscovery features = Mock(DDAgentFeaturesDiscovery) features.supportsMetrics() >> true features.peerTags() >> [] - ConflatingMetricsAggregator aggregator = new ConflatingMetricsAggregator(empty, + ClientStatsAggregator aggregator = new ClientStatsAggregator(empty, features, HealthMetrics.NO_OP, sink, writer, 10, queueSize, reportingInterval, SECONDS, true) aggregator.start() @@ -1559,7 +1559,7 @@ class ConflatingMetricAggregatorTest extends DDSpecification { DDAgentFeaturesDiscovery features = Mock(DDAgentFeaturesDiscovery) features.supportsMetrics() >> true features.peerTags() >> [] - ConflatingMetricsAggregator aggregator = new ConflatingMetricsAggregator(empty, + ClientStatsAggregator aggregator = new ClientStatsAggregator(empty, 
features, HealthMetrics.NO_OP, sink, writer, 10, queueSize, reportingInterval, SECONDS, false) aggregator.start() @@ -1632,14 +1632,14 @@ class ConflatingMetricAggregatorTest extends DDSpecification { aggregator.close() } - def reportAndWaitUntilEmpty(ConflatingMetricsAggregator aggregator) { + def reportAndWaitUntilEmpty(ClientStatsAggregator aggregator) { waitUntilEmpty(aggregator) aggregator.report() waitUntilEmpty(aggregator) } - def waitUntilEmpty(ConflatingMetricsAggregator aggregator) { + def waitUntilEmpty(ClientStatsAggregator aggregator) { int i = 0 while (!aggregator.inbox.isEmpty() && i++ < 100) { Thread.sleep(10) diff --git a/dd-trace-core/src/test/groovy/datadog/trace/common/metrics/FootprintForkedTest.groovy b/dd-trace-core/src/test/groovy/datadog/trace/common/metrics/FootprintForkedTest.groovy index eceedeb1935..86a91c23b3f 100644 --- a/dd-trace-core/src/test/groovy/datadog/trace/common/metrics/FootprintForkedTest.groovy +++ b/dd-trace-core/src/test/groovy/datadog/trace/common/metrics/FootprintForkedTest.groovy @@ -37,7 +37,7 @@ class FootprintForkedTest extends DDSpecification { it.supportsMetrics() >> true it.peerTags() >> [] } - ConflatingMetricsAggregator aggregator = new ConflatingMetricsAggregator( + ClientStatsAggregator aggregator = new ClientStatsAggregator( new WellKnownTags("runtimeid","hostname", "env", "service", "version","language"), [].toSet() as Set, features, diff --git a/dd-trace-core/src/test/groovy/datadog/trace/common/metrics/MetricsAggregatorFactoryTest.groovy b/dd-trace-core/src/test/groovy/datadog/trace/common/metrics/MetricsAggregatorFactoryTest.groovy index 07f246bf9a9..dc9eb86fde3 100644 --- a/dd-trace-core/src/test/groovy/datadog/trace/common/metrics/MetricsAggregatorFactoryTest.groovy +++ b/dd-trace-core/src/test/groovy/datadog/trace/common/metrics/MetricsAggregatorFactoryTest.groovy @@ -28,6 +28,6 @@ class MetricsAggregatorFactoryTest extends DDSpecification { expect: def aggregator = 
MetricsAggregatorFactory.createMetricsAggregator(config, sco, HealthMetrics.NO_OP, ) - assert aggregator instanceof ConflatingMetricsAggregator + assert aggregator instanceof ClientStatsAggregator } } From dd372e766d88510d4893c3e924ecd9ca7a3b918a Mon Sep 17 00:00:00 2001 From: Douglas Q Hawkins Date: Fri, 15 May 2026 17:35:34 -0400 Subject: [PATCH 05/33] Cleanups: fix previousCounts size, drop dead code Three small follow-ups carried over from a /techdebt pass: - TracerHealthMetrics: previousCounts array was sized 51, but the prior commits added a 52nd reporter (statsInboxFull). Without this fix the new counter's report() call would throw ArrayIndexOutOfBoundsException; the Flush task swallows that exception, so the failure would be silent (statsInboxFull would just never make it to statsd). - Aggregator: removes the now-dead public clearAggregates() method. The ClearSignal route from ClientStatsAggregator.disable() supplanted it several commits ago; the method had no remaining callers. - TagCardinalityHandler: removes the unused register(TagMap.Entry) overload and its isValidType helper. The String-keyed overload covers all current callers (AggregateEntry's peer-tag canonicalization). - PeerTagSchema: spotless-driven javadoc reflow only. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- .../trace/common/metrics/Aggregator.java | 4 --- .../trace/common/metrics/PeerTagSchema.java | 4 +-- .../common/metrics/TagCardinalityHandler.java | 32 +------------------ .../core/monitor/TracerHealthMetrics.java | 2 +- 4 files changed, 4 insertions(+), 38 deletions(-) diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/Aggregator.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/Aggregator.java index 9bcd41f37e4..8fe25288acd 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/Aggregator.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/Aggregator.java @@ -66,10 +66,6 @@ final class Aggregator implements Runnable { this.healthMetrics = healthMetrics; } - public void clearAggregates() { - this.aggregates.clear(); - } - @Override public void run() { Thread currentThread = Thread.currentThread(); diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/PeerTagSchema.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/PeerTagSchema.java index f41b2634da6..4efaec4a0a2 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/PeerTagSchema.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/PeerTagSchema.java @@ -42,8 +42,8 @@ final class PeerTagSchema { /** * Identity cache of the most recently observed {@code features.peerTags()} {@link Set} instance. * The producer hot path checks this first and skips the {@code names}-vs-set comparison when the - * caller's set instance hasn't changed. In production this is the common case -- - * {@code DDAgentFeaturesDiscovery} returns the same Set instance until reconfiguration. + * caller's set instance hasn't changed. In production this is the common case -- {@code + * DDAgentFeaturesDiscovery} returns the same Set instance until reconfiguration. 
*/ private static volatile Set LAST_SYNCED_INPUT; diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/TagCardinalityHandler.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/TagCardinalityHandler.java index eeac6caf817..1fdfed5c7c4 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/TagCardinalityHandler.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/TagCardinalityHandler.java @@ -1,6 +1,5 @@ package datadog.trace.common.metrics; -import datadog.trace.api.TagMap; import datadog.trace.bootstrap.instrumentation.api.UTF8BytesString; import java.util.HashMap; @@ -8,7 +7,7 @@ public final class TagCardinalityHandler { private final String tag; private final int cardinalityLimit; - private final HashMap curUtf8Pairs; + private final HashMap curUtf8Pairs; private UTF8BytesString cacheBlocked = null; @@ -20,31 +19,6 @@ public TagCardinalityHandler(String tag, int cardinalityLimit) { this.curUtf8Pairs = new HashMap<>((int) Math.ceil(cardinalityLimit / 0.75) + 1); } - public UTF8BytesString register(TagMap.Entry entry) { - if (this.curUtf8Pairs.size() >= this.cardinalityLimit) { - return this.blockedByTracer(); - } - - if (!isValidType(entry)) { - return this.blockedByTracer(); - } - - // NOTE: This could lead to boxing -- not ideal - Object cacheKey = entry.objectValue(); - UTF8BytesString existing = this.curUtf8Pairs.get(cacheKey); - if (existing != null) return existing; - - // TODO: maybe use a fallback cache to reduce allocations across reset cycles - UTF8BytesString newPair = UTF8BytesString.create(this.tag + ":" + entry.stringValue()); - this.curUtf8Pairs.put(cacheKey, newPair); - return newPair; - } - - /** - * String-keyed overload for callers that already hold a {@code (tag, value)} pair as Strings and - * would rather not allocate a {@link TagMap.Entry} per lookup -- e.g. the metrics aggregator's - * peer-tag flow, where peer-tag values are flattened into a {@code String[]} on the snapshot. 
- */ public UTF8BytesString register(String value) { if (this.curUtf8Pairs.size() >= this.cardinalityLimit) { return this.blockedByTracer(); @@ -58,10 +32,6 @@ public UTF8BytesString register(String value) { return newPair; } - private static final boolean isValidType(TagMap.Entry entry) { - return entry.isNumericPrimitive() || entry.objectValue() instanceof CharSequence; - } - private UTF8BytesString blockedByTracer() { UTF8BytesString cacheBlocked = this.cacheBlocked; if (cacheBlocked != null) return cacheBlocked; diff --git a/dd-trace-core/src/main/java/datadog/trace/core/monitor/TracerHealthMetrics.java b/dd-trace-core/src/main/java/datadog/trace/core/monitor/TracerHealthMetrics.java index 76051645fcb..db384a7e42e 100644 --- a/dd-trace-core/src/main/java/datadog/trace/core/monitor/TracerHealthMetrics.java +++ b/dd-trace-core/src/main/java/datadog/trace/core/monitor/TracerHealthMetrics.java @@ -382,7 +382,7 @@ private static class Flush implements AgentTaskScheduler.Task Date: Fri, 15 May 2026 17:49:48 -0400 Subject: [PATCH 06/33] Hoist peer-tag schema sync to once per trace ClientStatsAggregator.publish was calling features.peerTags() + PeerTagSchema.currentSyncedTo for every span. Peer-tag configuration is stable for the duration of a single trace publish in production -- DDAgentFeaturesDiscovery returns the same Set instance until remote-config reconfiguration -- so the per-snapshot sync is wasted work. Move the sync to once per publish(trace) and pass the resolved schema to the inner publish(span, isTopLevel, peerAggSchema). INTERNAL-kind spans still use the static PeerTagSchema.INTERNAL regardless. Behavior boundary ----------------- Schema changes from features.peerTags() now take effect at the next publish(trace) call rather than mid-trace. 
Production-equivalent (a trace takes microseconds to milliseconds; remote-config refreshes are seconds apart), but a Spock test that used `>>> [...]` to mock different peerTags() returns on successive calls within one trace no longer makes sense in the new model. That test is rewritten to assert the production-relevant case: peer-tag NAMES are stable, peer-tag VALUES vary per span, distinct value combinations produce distinct aggregate buckets. Benchmark (2 forks x 5 iter x 15s) ---------------------------------- SimpleSpan bench: 3.133 +- 0.057 us/op (prior: 3.165 +- 0.032) DDSpan bench: 2.454 +- 0.082 us/op (prior: 2.727 +- 0.018) Recovers ~270 ns/op on the DDSpan bench -- most of the regression introduced by the per-snapshot lookup. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../common/metrics/ClientStatsAggregator.java | 31 +++++++++++-------- .../metrics/ClientStatsAggregatorTest.groovy | 13 +++++--- 2 files changed, 26 insertions(+), 18 deletions(-) diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/ClientStatsAggregator.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/ClientStatsAggregator.java index 1b1aeec402a..c199dd2b403 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/ClientStatsAggregator.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/ClientStatsAggregator.java @@ -243,6 +243,14 @@ public boolean publish(List> trace) { boolean forceKeep = false; int counted = 0; if (features.supportsMetrics()) { + // Sync the peer-aggregation schema once per trace; peer-tag configuration is stable for + // the duration of a single trace publish in production (DDAgentFeaturesDiscovery returns + // the same Set instance until remote-config reconfiguration). + Set eligiblePeerTags = features.peerTags(); + PeerTagSchema peerAggSchema = + (eligiblePeerTags == null || eligiblePeerTags.isEmpty()) + ? 
null + : PeerTagSchema.currentSyncedTo(eligiblePeerTags); for (CoreSpan span : trace) { boolean isTopLevel = span.isTopLevel(); if (shouldComputeMetric(span, isTopLevel)) { @@ -253,7 +261,7 @@ public boolean publish(List> trace) { break; } counted++; - forceKeep |= publish(span, isTopLevel); + forceKeep |= publish(span, isTopLevel, peerAggSchema); } } healthMetrics.onClientStatTraceComputed(counted, trace.size(), !forceKeep); @@ -268,7 +276,7 @@ private boolean shouldComputeMetric(CoreSpan span, boolean isTopLevel) { && span.getDurationNano() > 0; } - private boolean publish(CoreSpan span, boolean isTopLevel) { + private boolean publish(CoreSpan span, boolean isTopLevel, PeerTagSchema peerAggSchema) { // Extract HTTP method and endpoint only if the feature is enabled String httpMethod = null; String httpEndpoint = null; @@ -293,7 +301,7 @@ private boolean publish(CoreSpan span, boolean isTopLevel) { long tagAndDuration = span.getDurationNano() | (error ? ERROR_TAG : 0L) | (isTopLevel ? TOP_LEVEL_TAG : 0L); - PeerTagSchema peerTagSchema = peerTagSchemaFor(span); + PeerTagSchema peerTagSchema = peerTagSchemaFor(span, peerAggSchema); String[] peerTagValues = peerTagSchema == null ? null : capturePeerTagValues(span, peerTagSchema); if (peerTagValues == null) { @@ -327,17 +335,14 @@ private boolean publish(CoreSpan span, boolean isTopLevel) { } /** - * Picks the peer-tag schema for a span. For peer-aggregation kinds, syncs the schema with {@code - * features.peerTags()} so producer and consumer share the same name/handler ordering. For - * internal-kind spans returns the static {@link PeerTagSchema#INTERNAL} schema. + * Picks the peer-tag schema for a span. The {@code peerAggSchema} argument is the per-trace + * cached schema (synced from {@code features.peerTags()} once in {@link #publish(List)}); it's + * {@code null} when no peer tags are configured. For internal-kind spans the static {@link + * PeerTagSchema#INTERNAL} schema is used regardless. 
*/ - private PeerTagSchema peerTagSchemaFor(CoreSpan span) { - if (span.isKind(PEER_AGGREGATION_KINDS)) { - Set eligible = features.peerTags(); - if (eligible == null || eligible.isEmpty()) { - return null; - } - return PeerTagSchema.currentSyncedTo(eligible); + private static PeerTagSchema peerTagSchemaFor(CoreSpan span, PeerTagSchema peerAggSchema) { + if (peerAggSchema != null && span.isKind(PEER_AGGREGATION_KINDS)) { + return peerAggSchema; } if (span.isKind(INTERNAL_KIND)) { return PeerTagSchema.INTERNAL; diff --git a/dd-trace-core/src/test/groovy/datadog/trace/common/metrics/ClientStatsAggregatorTest.groovy b/dd-trace-core/src/test/groovy/datadog/trace/common/metrics/ClientStatsAggregatorTest.groovy index 1fbdd63dff3..3cccc50c5a4 100644 --- a/dd-trace-core/src/test/groovy/datadog/trace/common/metrics/ClientStatsAggregatorTest.groovy +++ b/dd-trace-core/src/test/groovy/datadog/trace/common/metrics/ClientStatsAggregatorTest.groovy @@ -253,13 +253,16 @@ class ClientStatsAggregatorTest extends DDSpecification { "client" | "GET" | "/external/api" | true } - def "should create bucket for each set of peer tags"() { + def "should create separate buckets for distinct peer tag values"() { + // Peer-tag NAMES are configured per-tracer and stable for the duration of a trace publish; + // peer-tag VALUES vary per-span. Two spans with the same names but different values should + // produce two distinct aggregate buckets. 
setup: MetricWriter writer = Mock(MetricWriter) Sink sink = Stub(Sink) DDAgentFeaturesDiscovery features = Mock(DDAgentFeaturesDiscovery) features.supportsMetrics() >> true - features.peerTags() >>> [["country"], ["country", "georegion"],] + features.peerTags() >> ["country", "georegion"] ClientStatsAggregator aggregator = new ClientStatsAggregator(empty, features, HealthMetrics.NO_OP, sink, writer, 10, queueSize, reportingInterval, SECONDS, false) aggregator.start() @@ -270,7 +273,7 @@ class ClientStatsAggregatorTest extends DDSpecification { new SimpleSpan("service", "operation", "resource", "type", true, false, false, 0, 100, HTTP_OK) .setTag(SPAN_KIND, "client").setTag("country", "france").setTag("georegion", "europe"), new SimpleSpan("service", "operation", "resource", "type", true, false, false, 0, 100, HTTP_OK) - .setTag(SPAN_KIND, "client").setTag("country", "france").setTag("georegion", "europe") + .setTag(SPAN_KIND, "client").setTag("country", "germany").setTag("georegion", "europe") ]) aggregator.report() def latchTriggered = latch.await(2, SECONDS) @@ -289,7 +292,7 @@ class ClientStatsAggregatorTest extends DDSpecification { false, false, "client", - [UTF8BytesString.create("country:france")], + [UTF8BytesString.create("country:france"), UTF8BytesString.create("georegion:europe")], null, null, null @@ -307,7 +310,7 @@ class ClientStatsAggregatorTest extends DDSpecification { false, false, "client", - [UTF8BytesString.create("country:france"), UTF8BytesString.create("georegion:europe")], + [UTF8BytesString.create("country:germany"), UTF8BytesString.create("georegion:europe")], null, null, null From fb3236672dfc6244e85b382376ddf247ed5ee5a8 Mon Sep 17 00:00:00 2001 From: Douglas Q Hawkins Date: Fri, 15 May 2026 18:14:15 -0400 Subject: [PATCH 07/33] Use cached span.kind ordinal in metrics producer; drop tag-map lookup JFR profiling showed ~21% of producer CPU time spent in tag-map lookups during ClientStatsAggregator.publish. 
One of those lookups -- span.kind -- is redundant because DDSpanContext already caches the kind as a byte ordinal that resolves to a String via a small array. - Add CoreSpan.getSpanKindString() with a default that falls back to the tag map for non-DDSpan impls; DDSpan overrides to delegate to the context's cached resolution. - Hoist schema.names array out of the capturePeerTagValues loop. - Avoid an unnecessary toString() in isSynthetic by declaring SYNTHETICS_ORIGIN as String and using contentEquals. Benchmark (ClientStatsAggregatorDDSpanBenchmark): before: 2.410 us/op after: 1.995 us/op (~17% improvement) vs. master baseline (6.428 us/op): now ~3.2x faster. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../common/metrics/ClientStatsAggregator.java | 20 +++++++++++-------- .../java/datadog/trace/core/CoreSpan.java | 10 ++++++++++ .../main/java/datadog/trace/core/DDSpan.java | 5 +++++ 3 files changed, 27 insertions(+), 8 deletions(-) diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/ClientStatsAggregator.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/ClientStatsAggregator.java index c199dd2b403..d08ce611100 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/ClientStatsAggregator.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/ClientStatsAggregator.java @@ -4,7 +4,6 @@ import static datadog.trace.api.DDSpanTypes.RPC; import static datadog.trace.bootstrap.instrumentation.api.Tags.HTTP_ENDPOINT; import static datadog.trace.bootstrap.instrumentation.api.Tags.HTTP_METHOD; -import static datadog.trace.bootstrap.instrumentation.api.Tags.SPAN_KIND; import static datadog.trace.common.metrics.AggregateMetric.ERROR_TAG; import static datadog.trace.common.metrics.AggregateMetric.TOP_LEVEL_TAG; import static datadog.trace.common.metrics.SignalItem.ClearSignal.CLEAR; @@ -46,7 +45,7 @@ public final class ClientStatsAggregator implements MetricsAggregator, EventList private static final Map 
DEFAULT_HEADERS = Collections.singletonMap(DDAgentApi.DATADOG_META_TRACER_VERSION, DDTraceCoreInfo.VERSION); - private static final CharSequence SYNTHETICS_ORIGIN = "synthetics"; + private static final String SYNTHETICS_ORIGIN = "synthetics"; private static final SpanKindFilter METRICS_ELIGIBLE_KINDS = SpanKindFilter.builder() @@ -293,9 +292,12 @@ private boolean publish(CoreSpan span, boolean isTopLevel, PeerTagSchema peer Object grpcStatusObj = span.unsafeGetTag(InstrumentationTags.GRPC_STATUS_CODE); grpcStatusCode = grpcStatusObj != null ? grpcStatusObj.toString() : null; } - // CharSequence default keeps unsafeGetTag's generic at CharSequence so UTF8BytesString - // tag values don't trigger a ClassCastException on the String assignment. - final String spanKind = span.unsafeGetTag(SPAN_KIND, (CharSequence) "").toString(); + // DDSpan resolves this from a cached span.kind ordinal via a small lookup array, skipping a + // tag-map lookup. Other CoreSpan impls fall back to the tag map by default. + String spanKind = span.getSpanKindString(); + if (spanKind == null) { + spanKind = ""; + } boolean error = span.getError() > 0; long tagAndDuration = @@ -355,10 +357,11 @@ private static PeerTagSchema peerTagSchemaFor(CoreSpan span, PeerTagSchema pe * Returns {@code null} when none of the configured peer tags are set on the span. 
*/ private static String[] capturePeerTagValues(CoreSpan span, PeerTagSchema schema) { - int n = schema.size(); + String[] names = schema.names; + int n = names.length; String[] values = null; for (int i = 0; i < n; i++) { - Object v = span.unsafeGetTag(schema.name(i)); + Object v = span.unsafeGetTag(names[i]); if (v != null) { if (values == null) { values = new String[n]; @@ -370,7 +373,8 @@ private static String[] capturePeerTagValues(CoreSpan span, PeerTagSchema sch } private static boolean isSynthetic(CoreSpan span) { - return span.getOrigin() != null && SYNTHETICS_ORIGIN.equals(span.getOrigin().toString()); + CharSequence origin = span.getOrigin(); + return origin != null && SYNTHETICS_ORIGIN.contentEquals(origin); } public void stop() { diff --git a/dd-trace-core/src/main/java/datadog/trace/core/CoreSpan.java b/dd-trace-core/src/main/java/datadog/trace/core/CoreSpan.java index 7d183670883..810b13884de 100644 --- a/dd-trace-core/src/main/java/datadog/trace/core/CoreSpan.java +++ b/dd-trace-core/src/main/java/datadog/trace/core/CoreSpan.java @@ -82,6 +82,16 @@ default U unsafeGetTag(CharSequence name) { boolean isKind(SpanKindFilter filter); + /** + * Returns the {@code span.kind} tag value as a String, or {@code null} if not set. Default + * implementation reads the tag map; {@link DDSpan} overrides to use a cached ordinal that + * resolves via a small lookup array, skipping the tag-map lookup on the hot path. + */ + default String getSpanKindString() { + Object v = unsafeGetTag(datadog.trace.bootstrap.instrumentation.api.Tags.SPAN_KIND); + return v == null ? 
null : v.toString();
+  }
+
   CharSequence getType();
 
   /**
diff --git a/dd-trace-core/src/main/java/datadog/trace/core/DDSpan.java b/dd-trace-core/src/main/java/datadog/trace/core/DDSpan.java
index 4c438e1c915..943776e7577 100644
--- a/dd-trace-core/src/main/java/datadog/trace/core/DDSpan.java
+++ b/dd-trace-core/src/main/java/datadog/trace/core/DDSpan.java
@@ -963,6 +963,11 @@ public boolean isKind(SpanKindFilter filter) {
     return filter.matches(context.getSpanKindOrdinal());
   }
 
+  @Override
+  public String getSpanKindString() {
+    return context.getSpanKindString();
+  }
+
   @Override
   public void copyPropagationAndBaggage(final AgentSpan source) {
     if (source instanceof DDSpan) {

From 1221b2b4a8e83f1f674db41b16604b1afda684bf Mon Sep 17 00:00:00 2001
From: Douglas Q Hawkins
Date: Fri, 15 May 2026 18:53:46 -0400
Subject: [PATCH 08/33] Add client metrics pipeline design doc

Captures the producer/consumer split, the canonical-key trick that makes
cardinality-blocking actually save space, the once-per-trace peer-tag schema
sync, the role of each file in datadog.trace.common.metrics, and the rationale
behind the redesign from ConflatingMetricsAggregator.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 docs/client_metrics_design.md | 308 ++++++++++++++++++++++++++++++++++
 1 file changed, 308 insertions(+)
 create mode 100644 docs/client_metrics_design.md

diff --git a/docs/client_metrics_design.md b/docs/client_metrics_design.md
new file mode 100644
index 00000000000..489763fd413
--- /dev/null
+++ b/docs/client_metrics_design.md
@@ -0,0 +1,308 @@
+# Client-side metrics (stats aggregator) design
+
+This document describes the design of the **client-side metrics pipeline** that
+lives under `dd-trace-core/.../common/metrics/`.
The pipeline aggregates per-span
+duration / count / error statistics on the tracer and sends rolled-up "client
+stats" payloads to the Datadog Agent on a fixed reporting interval, so the agent
+does not have to sample every span to know request rates and latencies.
+
+Code lives in package `datadog.trace.common.metrics`.
+
+## High-level shape
+
+```
+   producer thread(s)                                      aggregator thread
+                                                   inbox
+   trace ─▶ ClientStatsAggregator.publish(trace) ──MPSC──▶ Aggregator.run
+            │                                              │
+            │ per metrics-eligible span                    │ Drainer.accept
+            │                                              │
+            │ allocates one SpanSnapshot                   ▼
+            │ (immutable, ~15 refs)                        AggregateTable.findOrInsert
+            │                                              │
+            │ inbox.offer(snapshot)                        │ canonicalize → hash
+            └────────────────────────────────────▶         │ → lookup or insert
+                                                           │
+                                scheduled REPORT signal ──▶│
+                                                           │ Aggregator.report
+                                                           │ → MetricWriter.add(entry)
+                                                           │ → OkHttpSink (HTTP POST)
+                                                           │ → reset cardinality handlers
+```
+
+Three rules govern the design:
+
+1. **The producer never touches shared state.** The hot path on the application
+   thread builds an immutable `SpanSnapshot` and offers it to a bounded MPSC
+   queue. No locks, no maps, no hashing of the metric key.
+2. **The aggregator thread is the sole writer of every shared structure.** The
+   aggregate table, the cardinality handlers, the metric writer state — all of
+   them are accessed only from that thread. Control operations (clear, report,
+   stop) are themselves enqueued as `SignalItem`s so they serialize with data.
+3. **Cardinality is bounded.** Per-field handlers cap the unique values; once a
+   field's budget is exhausted, overflow values collapse into a single
+   `blocked_by_tracer` sentinel so the aggregate table can't blow up.
+
+## Component map
+
+| Component | File | Role |
+|---|---|---|
+| `ClientStatsAggregator` | `ClientStatsAggregator.java` | Producer facade. Decides which spans are eligible, builds `SpanSnapshot`s, offers them to the inbox. Also owns the agent-feature check, the scheduled report timer, and the agent-downgrade handler.
|
+| `SpanSnapshot` | `SpanSnapshot.java` | Immutable, allocation-pooled-by-GC value posted from producer → aggregator. Carries raw label fields plus a duration word with `TOP_LEVEL` / `ERROR` bits OR-ed in. |
+| `PeerTagSchema` | `PeerTagSchema.java` | Parallel `String[] names` + `TagCardinalityHandler[] handlers` describing the peer-aggregation tags in effect. One singleton for internal-kind spans; one volatile "current" schema for client/producer/consumer spans, refreshed from `DDAgentFeaturesDiscovery.peerTags()`. |
+| `Aggregator` | `Aggregator.java` | Consumer thread `Runnable`. Drains the inbox; dispatches `SpanSnapshot`s into `AggregateTable`; processes signals (`REPORT`, `CLEAR`, `STOP`); calls the writer on report. |
+| `AggregateTable` | `AggregateTable.java` | Hashtable-backed store keyed on the canonicalized labels. Owns a single reusable `Canonical` scratch buffer. Handles cap-overflow by evicting one stale entry or rejecting new ones. |
+| `AggregateEntry` | `AggregateEntry.java` | `Hashtable.Entry` holding the 13 UTF8 label fields + the mutable `AggregateMetric`. Owns the static `PropertyCardinalityHandler`s for the fixed label fields, and `Canonical` for hot-path canonicalization. |
+| `AggregateMetric` | `AggregateMetric.java` | Per-bucket accumulator: hit count, error count, top-level count, duration sum, ok/error latency histograms. Single-threaded; cleared each report. |
+| `PropertyCardinalityHandler` | `PropertyCardinalityHandler.java` | Per-field UTF8 interner with a max-unique-values cap. Returns a `blocked_by_tracer` sentinel `UTF8BytesString` once the cap is hit. Reset by the aggregator each cycle. |
+| `TagCardinalityHandler` | `TagCardinalityHandler.java` | Same pattern as the property handler, but the cached UTF8 form is the full `tag:value` pair (peer tags are wire-encoded as `tag:value`, not just the value).
| +| `SerializingMetricWriter` / `OkHttpSink` | `SerializingMetricWriter.java`, `OkHttpSink.java` | Wire serialization (MessagePack) + HTTP POST to the agent's `/v0.6/stats` endpoint. | +| `MetricsAggregatorFactory` / `NoOpMetricsAggregator` | factory + no-op | Picks the real implementation when client stats are enabled and the agent supports the endpoint, no-op otherwise. | + +## Producer-side flow (`ClientStatsAggregator.publish`) + +The producer holds **no shared state**. Per trace it: + +1. Snapshots the current peer-aggregation schema **once per trace** (not per + span): + ```java + Set eligiblePeerTags = features.peerTags(); + PeerTagSchema peerAggSchema = + (eligiblePeerTags == null || eligiblePeerTags.isEmpty()) + ? null + : PeerTagSchema.currentSyncedTo(eligiblePeerTags); + ``` + `currentSyncedTo` has a fast path: identity-equal to the previously-synced + `Set` instance → return the cached schema (the common case, since + `DDAgentFeaturesDiscovery` returns the same `Set` until remote-config + reconfiguration). The cached schema is `volatile`; replacement is guarded by + a `synchronized` block. + +2. Iterates the trace; for each metrics-eligible span: + + - **Eligibility** (`shouldComputeMetric`): + ```java + (measured || isTopLevel || isKind(SERVER|CLIENT|PRODUCER|CONSUMER)) + && longRunningVersion <= 0 + && durationNano > 0 + ``` + `isMeasured` / `isTopLevel` are flag reads on `DDSpanContext`; `isKind` + reads the **cached `byte` span-kind ordinal** through a `SpanKindFilter` + bitmask test — no tag-map lookup. + + - **Resource-name ignore-list** breaks out of the trace early; the entire + trace is dropped on a match. + + - **Picks the peer-tag schema** (`peerTagSchemaFor`): for client/producer/ + consumer kinds → `peerAggSchema` (already synced for this trace); for + internal-kind spans → `PeerTagSchema.INTERNAL` (single `base.service` + entry); otherwise `null`. 
+
+   - **Captures peer-tag *values***, not pairs: walks `schema.names` and pulls
+     `unsafeGetTag(name)` for each, into a parallel `String[]`. Names + handlers
+     are the schema's job; the producer only carries raw values. Returns `null`
+     when no peer tags are set, in which case the schema reference is dropped
+     too so the consumer doesn't loop over an all-null array.
+
+   - **Builds and offers** a `SpanSnapshot` to the MPSC inbox. The span-kind
+     string is taken from `CoreSpan.getSpanKindString()`, which DDSpan
+     overrides to resolve via the cached byte ordinal through a small lookup
+     array, with **no tag-map lookup**. Origin equality uses `contentEquals`.
+     `httpMethod` / `httpEndpoint` are only fetched when
+     `traceClientStatsEndpoints=true`; `grpcStatusCode` only when span type is
+     `rpc`.
+
+   - On inbox-full: the snapshot is dropped and `healthMetrics.onStatsInboxFull()`
+     fires. The producer never blocks.
+
+3. Reports `healthMetrics.onClientStatTraceComputed(counted, total, dropped)`.
+
+   `forceKeep` is the only signal returned upward: `true` if any of the
+   trace's metrics-eligible spans had errors, so the trace writer keeps the
+   raw trace too.
+
+### Why the producer is lean
+
+The cumulative cost of running these checks on every finished span is the
+single biggest concern. The producer deliberately avoids:
+
+- locking or synchronization of any kind on the hot path,
+- hashing the metric key (deferred to the aggregator thread),
+- map / cache lookups for label canonicalization (deferred),
+- tag-map lookups when a span carries the relevant information on the context
+  itself (`span.kind` via the cached byte ordinal; `isMeasured`, `isTopLevel`
+  via flag reads),
+- allocation beyond the `SpanSnapshot` itself and a single `String[]` for peer
+  tag values when any are present.
+
+## Aggregator-side flow (`Aggregator.run`)
+
+A single agent thread runs the `Aggregator.run` loop. The thread drains the
+inbox via `inbox.drain(drainer)`; when the queue is empty it sleeps
+`DEFAULT_SLEEP_MILLIS` (10 ms) and retries. The Drainer dispatches by item
+type:
+
+- `SpanSnapshot` → `AggregateTable.findOrInsert(snapshot)` returns either an
+  existing or freshly-inserted `AggregateMetric`, then the snapshot's
+  `tagAndDuration` is recorded. If the table is at capacity and no stale entry
+  can be evicted, `healthMetrics.onStatsAggregateDropped()` fires.
+
+- `ReportSignal` → on the scheduled cadence (the default report interval is
+  10 s, set through the `reportingInterval` constructor argument),
+  `Aggregator.report`:
+  1. Expunges entries with `hitCount == 0` (stale).
+  2. If anything remains, opens a bucket via `MetricWriter.startBucket(...)`,
+     walks `AggregateTable.forEach`, writes each entry, clears its metric.
+  3. Calls `MetricWriter.finishBucket()` (which may do I/O and block).
+  4. **Resets all cardinality handlers** so the next interval starts with a
+     fresh budget. Existing entries keep their previously-issued UTF8
+     references, and matching is by content-equality, so canonicalizing a
+     post-reset snapshot against an existing entry still resolves to the
+     same bucket.
+
+- `ClearSignal` → drops the aggregate state. The downgrade handler
+  (`onEvent(DOWNGRADED, ...)`) offers `CLEAR` to the inbox rather than calling
+  `clearAggregates()` directly, so the aggregator thread remains the sole
+  writer of the table.
+
+- `StopSignal` → final report + thread exit.
+
+## The canonical-key trick (cardinality-safe deduplication)
+
+The lookup hash is computed from the **canonicalized** label fields, not the
+raw `SpanSnapshot` fields. This is the property that makes
+cardinality-blocking actually save space:
+
+```java
+// AggregateTable.findOrInsert
+canonical.populate(snapshot); // runs every field through its handler
+long keyHash = canonical.keyHash;
+int bucketIndex = Hashtable.Support.bucketIndex(buckets, keyHash);
+for (Hashtable.Entry e = buckets[bucketIndex]; e != null; e = e.next()) {
+  if (e.keyHash == keyHash) {
+    AggregateEntry candidate = (AggregateEntry) e;
+    if (canonical.matches(candidate)) {
+      return candidate.aggregate;
+    }
+  }
+}
+// miss → toEntry, splice into bucket head
+```
+
+`Canonical.populate` runs each label field through its
+`PropertyCardinalityHandler` (or `TagCardinalityHandler` for peer tags). Once a
+handler's working set is full, **every subsequent unique value resolves to the
+same `UTF8BytesString` sentinel**, so the hash computed from the canonical
+form is identical for all blocked values. They land in the same bucket and
+merge into one `AggregateEntry` rather than fragmenting into N entries.
+
+The `Canonical` scratch buffer is reused per `findOrInsert` call. On a hit,
+nothing is allocated. On a miss, `toEntry` snapshots the buffer's references
+into a fresh entry; the buffer is overwritten on the next call.
+
+### Hash chain (no varargs)
+
+`AggregateEntry.hashOf` uses chained primitive calls into
+`LongHashingUtils.addToHash(long, T)` rather than a varargs `addToHash(long,
+Object...)`. This avoids the `Object[]` allocation and boxing of the primitive
+fields (`httpStatusCode`, `synthetic`, `traceRoot`) that varargs would force.
+
+## Reporting cadence and cardinality reset
+
+Two distinct cadences:
+
+- **Reporting interval** (default 10 s): when the report timer fires,
+  `ReportTask` calls `report()` which `inbox.offer(REPORT)`. The aggregator
+  drains up to that signal, then writes the bucket and resets the cardinality
+  handlers. The handlers reset *every reporting cycle*, so the per-field
+  budgets refresh.
+
+- **Schema sync**: `PeerTagSchema.currentSyncedTo` runs on the producer thread
+  per trace, with an identity-check fast path. The schema reference is
+  replaced atomically when remote-config reconfigures the peer-tag set.
+
+## Memory and lifetime
+
+- `AggregateMetric` is **not thread-safe**. It is mutated only by the
+  aggregator thread.
+- `AggregateTable` is **not thread-safe**. All paths (producer-side `CLEAR`,
+  schedule-driven `REPORT`, drainer-driven inserts) route through the inbox.
+- `Canonical` and the cardinality handlers are aggregator-thread-only.
+- `PeerTagSchema.CURRENT` is `volatile` with `synchronized` replacement; the
+  schema's `TagCardinalityHandler`s themselves are aggregator-thread-only and
+  are reset alongside the property handlers each cycle.
+- Entries retain their `UTF8BytesString` references across handler resets;
+  matches via content-equality so post-reset snapshots still resolve.
+- Cap: `tracerMetricsMaxAggregates` bounds table size. Cap-overrun policy:
+  evict one stale entry (`hitCount == 0`) or drop the new data point.
+
+## Health metrics
+
+The producer reports per-trace stats via `HealthMetrics`:
+
+- `onClientStatTraceComputed(counted, totalSpans, dropped)`: per `publish`.
+- `onStatsInboxFull()`: when the MPSC queue rejects an offer.
+- `onClientStatPayloadSent()` / `onClientStatDowngraded()` /
+  `onClientStatErrorReceived()`: on agent-side outcomes.
+- `onStatsAggregateDropped()`: when the aggregator thread can't fit a new
+  entry.
+
+## Failure modes
+
+| Failure | Effect |
+|---|---|
+| Inbox full | Snapshot dropped, `onStatsInboxFull` increments, producer continues. |
+| Agent unavailable / errors | `OkHttpSink` reports `BAD_PAYLOAD` / `ERROR`; metric reporting continues. |
+| Agent downgrade (no /v0.6/stats) | `disable()` offers `CLEAR` to the inbox; the aggregator wipes its table. Producer's `features.supportsMetrics()` returns false on subsequent calls, so new snapshots are not built. |
+| Aggregate table full, no stale entry | New snapshot dropped, `onStatsAggregateDropped` increments. Existing entries continue to accumulate. |
+| Cardinality budget exhausted | Overflow values canonicalize to a `blocked_by_tracer` sentinel and merge into one bucket. Total entry count stays bounded by `maxAggregates`. |
+| Producer throws mid-trace | Caught by the writer's normal error path; `onClientStatTraceComputed` is not called for that trace. |
+
+## Why the redesign (history)
+
+The pipeline was previously `ConflatingMetricsAggregator` with:
+
+- producer-side `MetricKey` construction (string-canonicalization on the hot
+  path),
+- a `LRUCache` of `MetricKey → AggregateMetric`,
+- per-tag `DDCache` instances for canonicalization (one per label field),
+- early computation of `tag:value` peer pairs on the producer thread.
+
+The current `ClientStatsAggregator` shape was motivated by JMH benchmarks that
+showed the producer dominating CPU time. The major shifts:
+
+1. **Move all canonicalization off the producer.** Producer just shuffles
+   references into a `SpanSnapshot`.
+2. **Replace `MetricKey` with inlined fields on `AggregateEntry`.** Removes a
+   per-snapshot allocation; lets us own the hash code on the entry itself.
+3. **Replace the `LRUCache` with a `Hashtable`** keyed on canonicalized labels.
+   Hash is computed once per insert/lookup; chained primitive hashing avoids
+   boxing.
+4. **Replace per-tag `DDCache`s with per-field `PropertyCardinalityHandler`s**
+   that share a `blocked_by_tracer` sentinel for cardinality overflow. Reset
+   each reporting cycle.
+5. **Capture peer-tag values, not pairs.** Tag-name + handler live on
+   `PeerTagSchema`; the producer carries values in a parallel `String[]`. The
+   aggregator does the `tag:value` interning via `TagCardinalityHandler` on
+   its own thread.
+6. **Sync peer-tag schema once per trace.** `currentSyncedTo` has an
+   identity-check fast path; the steady-state cost is one volatile read.
+7. **Single owner of all shared state.** `disable()` routes through `CLEAR`
+   rather than mutating the aggregate table directly.
+
+### Benchmark summary
+
+`ClientStatsAggregatorDDSpanBenchmark` (64 client-kind DDSpans per op, single
+trace, real `CoreTracer` with a no-op writer):
+
+| Variant | µs/op |
+|---|---|
+| master (`ConflatingMetricsAggregator`, baseline) | 6.428 |
+| with `SpanSnapshot` + background aggregation | 2.454 |
+| with peer-tag schema hoist | 2.410 |
+| with cached span-kind ordinal + isSynthetic fix | 1.995 |
+
+The remaining producer-thread hotspots (from JFR sampling) are tag-map
+lookups for `peer.hostname` / other peer-tag values inside
+`capturePeerTagValues`. A bulk peer-tag accessor on `DDSpan` would crack that
+chunk further, but is a structural change beyond the current package.

From e72fd0110a2964654e3d5973c108c9d7f5cde43c Mon Sep 17 00:00:00 2001
From: Douglas Q Hawkins
Date: Tue, 19 May 2026 15:43:53 -0400
Subject: [PATCH 09/33] Add DDAgentFeaturesDiscovery.peerTagsRevision()

Monotonically increases each time the discovered peerTags Set differs from the
previous one. Lets callers detect peer-tag config changes with a long compare
instead of a Set.equals (or leaning on Set-identity, which was an
implementation accident, not part of the public contract).
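
As a rough illustration of the counter described above (a standalone sketch,
not the code in this patch): `RevisionedPeerTags` and its `update` method are
hypothetical names, and only `peerTags()` / `peerTagsRevision()` mirror the
accessors this patch touches. The sketch assumes the revision should move only
when the new Set is not `equals` to the previous one, so a freshly allocated
Set with identical content leaves it unchanged.

```java
import java.util.Collections;
import java.util.Set;

/** Hypothetical stand-in for the discovery state; not the real DDAgentFeaturesDiscovery. */
final class RevisionedPeerTags {

  /** Immutable snapshot, swapped wholesale so readers never see a half-updated pair. */
  private static final class State {
    final Set<String> peerTags;
    final long revision;

    State(Set<String> peerTags, long revision) {
      this.peerTags = peerTags;
      this.revision = revision;
    }
  }

  private volatile State state = new State(Collections.<String>emptySet(), 0L);

  /** Records a newly discovered set; the revision moves only when the content differs. */
  synchronized void update(Set<String> discovered) {
    State previous = state;
    long bump = discovered.equals(previous.peerTags) ? 0L : 1L;
    state = new State(discovered, previous.revision + bump);
  }

  Set<String> peerTags() {
    return state.peerTags;
  }

  long peerTagsRevision() {
    return state.revision;
  }
}
```

A caller would remember the last revision it saw and rebuild any derived state
only when `peerTagsRevision()` returns a different value.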
Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .../ddagent/DDAgentFeaturesDiscovery.java | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/communication/src/main/java/datadog/communication/ddagent/DDAgentFeaturesDiscovery.java b/communication/src/main/java/datadog/communication/ddagent/DDAgentFeaturesDiscovery.java
index 10c1e57efd7..387491a426a 100644
--- a/communication/src/main/java/datadog/communication/ddagent/DDAgentFeaturesDiscovery.java
+++ b/communication/src/main/java/datadog/communication/ddagent/DDAgentFeaturesDiscovery.java
@@ -101,6 +101,7 @@ private static class State {
     String version;
     String telemetryProxyEndpoint;
     Set<String> peerTags = emptySet();
+    long peerTagsRevision;
     long lastTimeDiscovered;
   }

@@ -138,11 +139,14 @@ protected long getFeaturesDiscoveryMinDelayMillis() {

   private synchronized void discoverIfOutdated(final long maxElapsedMs) {
     final long now = System.currentTimeMillis();
-    final long elapsed = now - discoveryState.lastTimeDiscovered;
+    final State previous = discoveryState;
+    final long elapsed = now - previous.lastTimeDiscovered;
     if (elapsed > maxElapsedMs) {
       final State newState = new State();
       doDiscovery(newState);
       newState.lastTimeDiscovered = now;
+      newState.peerTagsRevision =
+          previous.peerTagsRevision + (newState.peerTags.equals(previous.peerTags) ? 0L : 1L);
       // swap atomically states
       discoveryState = newState;
     }
@@ -403,6 +407,16 @@ public Set<String> peerTags() {
     return discoveryState.peerTags;
   }

+  /**
+   * Monotonically increasing counter bumped each time {@link #peerTags()} produces a Set that is
+   * not equal to the previous one. Callers can compare this against a cached snapshot to detect
+   * peer-tag config changes without re-comparing the Sets themselves -- e.g. the client-stats
+   * aggregator uses it to decide when to rebuild its {@code PeerTagSchema}.
+   */
+  public long peerTagsRevision() {
+    return discoveryState.peerTagsRevision;
+  }
+
   public String getMetricsEndpoint() {
     return discoveryState.metricsEndpoint;
   }

From dce4b2c72ae43cb289a7e7904256ed308d38fd59 Mon Sep 17 00:00:00 2001
From: Douglas Q Hawkins
Date: Tue, 19 May 2026 15:44:07 -0400
Subject: [PATCH 10/33] Move peer-tag schema cache from PeerTagSchema statics to ClientStatsAggregator

PeerTagSchema previously held its current schema + last-synced-Set in static
volatile fields with a synchronized rebuild. The "is it stale?" signal was an
identity check on the Set instance returned by features.peerTags() -- a correct
but indirect reading of a DDAgentFeaturesDiscovery invariant.

Replace that with:

- ClientStatsAggregator keeps its own (volatile PeerTagSchema, volatile long
  cachedPeerTagsRevision) cache pair, rebuilt under synchronized when the
  revision returned by features.peerTagsRevision() doesn't match.
- PeerTagSchema becomes a pure data holder: static factory PeerTagSchema.of,
  the INTERNAL singleton, and an instance resetCardinalityHandlers(). The
  static CURRENT, LAST_SYNCED_INPUT, and the synchronized rebuild block are
  gone.
- Aggregator gains a Runnable onResetCardinality hook fired right after
  AggregateEntry.resetCardinalityHandlers(). ClientStatsAggregator wires it to
  reset its cached schema's handlers each report cycle.
- AggregateEntry.resetCardinalityHandlers() resets PeerTagSchema.INTERNAL
  directly instead of the removed PeerTagSchema.resetAll().
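
The cache pair in the first bullet can be sketched in isolation as follows.
This is a simplified model under stated assumptions, not the patch itself:
`SchemaCache`, `schemaFor`, and the `rebuilds` counter are hypothetical, a
plain `String[]` stands in for `PeerTagSchema`, and real revisions are assumed
to be non-negative so that `-1` can mean "nothing cached yet".

```java
import java.util.Set;
import java.util.function.Supplier;

/** Hypothetical model of a revision-keyed cache; String[] stands in for PeerTagSchema. */
final class SchemaCache {
  private final Supplier<Set<String>> peerTags; // assumed source of the current tag names
  private volatile long cachedRevision = -1L; // assumes real revisions are never negative
  private volatile String[] cachedSchema;
  int rebuilds; // illustration only: counts slow-path rebuilds

  SchemaCache(Supplier<Set<String>> peerTags) {
    this.peerTags = peerTags;
  }

  /** Fast path: a volatile read and a long compare when the revision is unchanged. */
  String[] schemaFor(long revision) {
    if (revision == cachedRevision) {
      return cachedSchema;
    }
    return refresh(revision);
  }

  private synchronized String[] refresh(long revision) {
    // Double-checked: another thread may have rebuilt while this one waited on the monitor.
    if (revision == cachedRevision) {
      return cachedSchema;
    }
    Set<String> names = peerTags.get();
    String[] schema = (names == null || names.isEmpty()) ? null : names.toArray(new String[0]);
    cachedSchema = schema; // written first ...
    cachedRevision = revision; // ... so a reader that sees this revision also sees the schema
    rebuilds++;
    return schema;
  }
}
```

Repeated calls with the same revision return the same array instance without
taking the monitor; only a revision change pays for the rebuild.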
Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .../trace/common/metrics/AggregateEntry.java  |  2 +-
 .../trace/common/metrics/Aggregator.java      | 21 ++++-
 .../common/metrics/ClientStatsAggregator.java | 73 ++++++++++++++---
 .../trace/common/metrics/PeerTagSchema.java   | 79 ++++---------------
 .../common/metrics/AggregateTableTest.java    |  9 ++-
 docs/client_metrics_design.md                 | 43 +++++-----
 6 files changed, 127 insertions(+), 100 deletions(-)

diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java
index 225f03197e5..5c950fbb808 100644
--- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java
+++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java
@@ -174,7 +174,7 @@ static void resetCardinalityHandlers() {
     HTTP_METHOD_HANDLER.reset();
     HTTP_ENDPOINT_HANDLER.reset();
     GRPC_STATUS_CODE_HANDLER.reset();
-    PeerTagSchema.resetAll();
+    PeerTagSchema.INTERNAL.resetCardinalityHandlers();
   }

   /**
diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/Aggregator.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/Aggregator.java
index 8fe25288acd..3b0c8c20110 100644
--- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/Aggregator.java
+++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/Aggregator.java
@@ -28,6 +28,14 @@ final class Aggregator implements Runnable {

   private final long sleepMillis;

+  /**
+   * Per-cycle hook run on the aggregator thread right after {@link
+   * AggregateEntry#resetCardinalityHandlers()}. Used by {@link ClientStatsAggregator} to reset the
+   * peer-aggregation schema's handlers, which live outside {@link AggregateEntry}'s static set. May
+   * be {@code null}.
+   */
+  private final Runnable onResetCardinality;
+
   @SuppressFBWarnings(
       value = "AT_STALE_THREAD_WRITE_OF_PRIMITIVE",
       justification = "the field is confined to the agent thread running the Aggregator")
@@ -39,7 +47,8 @@ final class Aggregator implements Runnable {
       int maxAggregates,
       long reportingInterval,
       TimeUnit reportingIntervalTimeUnit,
-      HealthMetrics healthMetrics) {
+      HealthMetrics healthMetrics,
+      Runnable onResetCardinality) {
     this(
         writer,
         inbox,
@@ -47,7 +56,8 @@ final class Aggregator implements Runnable {
         reportingInterval,
         reportingIntervalTimeUnit,
         DEFAULT_SLEEP_MILLIS,
-        healthMetrics);
+        healthMetrics,
+        onResetCardinality);
   }

   Aggregator(
@@ -57,13 +67,15 @@ final class Aggregator implements Runnable {
       long reportingInterval,
       TimeUnit reportingIntervalTimeUnit,
       long sleepMillis,
-      HealthMetrics healthMetrics) {
+      HealthMetrics healthMetrics,
+      Runnable onResetCardinality) {
     this.writer = writer;
     this.inbox = inbox;
     this.aggregates = new AggregateTable(maxAggregates);
     this.reportingIntervalNanos = reportingIntervalTimeUnit.toNanos(reportingInterval);
     this.sleepMillis = sleepMillis;
     this.healthMetrics = healthMetrics;
+    this.onResetCardinality = onResetCardinality;
   }

   @Override
@@ -148,6 +160,9 @@ private void report(long when, SignalItem signal) {
     // Reset cardinality handlers each report cycle so the per-field budgets refresh.
     // Safe to call on this (aggregator) thread; handlers are HashMap-based and not thread-safe.
     AggregateEntry.resetCardinalityHandlers();
+    if (onResetCardinality != null) {
+      onResetCardinality.run();
+    }
     signal.complete();
     if (skipped) {
       log.debug("skipped metrics reporting because no points have changed");
diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/ClientStatsAggregator.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/ClientStatsAggregator.java
index d08ce611100..821a531e7b8 100644
--- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/ClientStatsAggregator.java
+++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/ClientStatsAggregator.java
@@ -72,6 +72,19 @@ public final class ClientStatsAggregator implements MetricsAggregator, EventList
   private final HealthMetrics healthMetrics;
   private final boolean includeEndpointInMetrics;

+  /**
+   * Cached peer-aggregation schema and the {@link DDAgentFeaturesDiscovery#peerTagsRevision()}
+   * value it was built from. The producer-side hot path in {@link #publish(List)} checks the
+   * current revision against {@code cachedPeerTagsRevision} and only rebuilds when they differ.
+   *
+   * <p>Both fields are {@code volatile} because {@code publish} is called on arbitrary producer
+   * threads. The reset hook ({@link #resetCachedPeerAggSchema()}) runs on the aggregator thread and
+   * only mutates the schema's internal handler state (not these fields).
+   */
+  private volatile long cachedPeerTagsRevision = -1L;
+
+  private volatile PeerTagSchema cachedPeerAggSchema;
+
   private volatile AgentTaskScheduler.Scheduled cancellation;

   public ClientStatsAggregator(
@@ -160,7 +173,13 @@ public ClientStatsAggregator(
     this.sink = sink;
     this.aggregator =
         new Aggregator(
-            metricWriter, inbox, maxAggregates, reportingInterval, timeUnit, healthMetric);
+            metricWriter,
+            inbox,
+            maxAggregates,
+            reportingInterval,
+            timeUnit,
+            healthMetric,
+            this::resetCachedPeerAggSchema);
     this.thread = newAgentThread(METRICS_AGGREGATOR, aggregator);
     this.reportingInterval = reportingInterval;
     this.reportingIntervalTimeUnit = timeUnit;
@@ -242,14 +261,10 @@ public boolean publish(List<? extends CoreSpan<?>> trace) {
     boolean forceKeep = false;
     int counted = 0;
     if (features.supportsMetrics()) {
-      // Sync the peer-aggregation schema once per trace; peer-tag configuration is stable for
-      // the duration of a single trace publish in production (DDAgentFeaturesDiscovery returns
-      // the same Set instance until remote-config reconfiguration).
-      Set<String> eligiblePeerTags = features.peerTags();
-      PeerTagSchema peerAggSchema =
-          (eligiblePeerTags == null || eligiblePeerTags.isEmpty())
-              ? null
-              : PeerTagSchema.currentSyncedTo(eligiblePeerTags);
+      // Sync the peer-aggregation schema once per trace. The cache is keyed on
+      // features.peerTagsRevision(), which only bumps when the agent's peer-tag set actually
+      // changes -- so the steady-state cost is a volatile read and a long compare.
+      PeerTagSchema peerAggSchema = peerAggSchema(features.peerTagsRevision());
       for (CoreSpan<?> span : trace) {
         boolean isTopLevel = span.isTopLevel();
         if (shouldComputeMetric(span, isTopLevel)) {
@@ -336,10 +351,46 @@ private boolean publish(CoreSpan<?> span, boolean isTopLevel, PeerTagSchema peer
     return error;
   }

+  /**
+   * Returns the peer-aggregation schema synced to the given revision, rebuilding it if the cached
+   * one is stale. Fast path: one volatile-read pair + a long compare. Rebuild is rare (peer-tag
+   * config changes), so the synchronization is only on the slow path.
+   */
+  private PeerTagSchema peerAggSchema(long revision) {
+    if (revision == cachedPeerTagsRevision) {
+      return cachedPeerAggSchema;
+    }
+    return refreshPeerAggSchema(revision);
+  }
+
+  private synchronized PeerTagSchema refreshPeerAggSchema(long revision) {
+    // Double-checked: another producer may have rebuilt while we were waiting on the monitor.
+    if (revision == cachedPeerTagsRevision) {
+      return cachedPeerAggSchema;
+    }
+    Set<String> names = features.peerTags();
+    PeerTagSchema schema = (names == null || names.isEmpty()) ? null : PeerTagSchema.of(names);
+    cachedPeerAggSchema = schema;
+    cachedPeerTagsRevision = revision;
+    return schema;
+  }
+
+  /**
+   * Reset hook invoked on the aggregator thread at the end of each report cycle. Resets the cached
+   * peer-aggregation schema's cardinality handlers so per-field budgets refresh in lockstep with
+   * {@link AggregateEntry#resetCardinalityHandlers()}.
+   */
+  private void resetCachedPeerAggSchema() {
+    PeerTagSchema schema = cachedPeerAggSchema;
+    if (schema != null) {
+      schema.resetCardinalityHandlers();
+    }
+  }
+
   /**
    * Picks the peer-tag schema for a span. The {@code peerAggSchema} argument is the per-trace
-   * cached schema (synced from {@code features.peerTags()} once in {@link #publish(List)}); it's
-   * {@code null} when no peer tags are configured. For internal-kind spans the static {@link
+   * cached schema (synced from {@code features.peerTagsRevision()} once in {@link #publish(List)});
+   * it's {@code null} when no peer tags are configured. For internal-kind spans the static {@link
    * PeerTagSchema#INTERNAL} schema is used regardless.
    */
   private static PeerTagSchema peerTagSchemaFor(CoreSpan<?> span, PeerTagSchema peerAggSchema) {
diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/PeerTagSchema.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/PeerTagSchema.java
index 4efaec4a0a2..6c80424e9d8 100644
--- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/PeerTagSchema.java
+++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/PeerTagSchema.java
@@ -14,20 +14,19 @@
  *

Two schemas exist: * *

* *

Each {@link SpanSnapshot} captures its own schema reference so producer and consumer agree on
  * the indexing even if the current schema is replaced between capture and consumption.
  *
- * <p><b>Thread-safety:</b> {@link #currentSyncedTo} may be called from producer threads;
- * replacement of the volatile {@code CURRENT} reference is guarded by a lock. The {@link
- * TagCardinalityHandler}s themselves are not thread-safe and must only be exercised on the
- * aggregator thread (this is where the snapshot's schema is consumed).
+ * <p><b>Thread-safety:</b> {@link TagCardinalityHandler}s are not thread-safe and must only be
+ * exercised on the aggregator thread. {@link #names} is final and safe to read from any thread.
  */
 final class PeerTagSchema {

@@ -36,20 +35,14 @@ final class PeerTagSchema {
   /** Singleton schema for internal-kind spans -- only {@code base.service}. */
   static final PeerTagSchema INTERNAL = new PeerTagSchema(new String[] {BASE_SERVICE});

-  /** Current schema for peer-aggregation kinds; replaced atomically when peer tag names change. */
-  private static volatile PeerTagSchema CURRENT = new PeerTagSchema(new String[0]);
-
-  /**
-   * Identity cache of the most recently observed {@code features.peerTags()} {@link Set} instance.
-   * The producer hot path checks this first and skips the {@code names}-vs-set comparison when the
-   * caller's set instance hasn't changed. In production this is the common case -- {@code
-   * DDAgentFeaturesDiscovery} returns the same Set instance until reconfiguration.
-   */
-  private static volatile Set<String> LAST_SYNCED_INPUT;
-
   final String[] names;
   final TagCardinalityHandler[] handlers;

+  /** Builds a schema for the given peer-tag names. Order is determined by the {@link Set}. */
+  static PeerTagSchema of(Set<String> names) {
+    return new PeerTagSchema(names.toArray(new String[0]));
+  }
+
   private PeerTagSchema(String[] names) {
     this.names = names;
     this.handlers = new TagCardinalityHandler[names.length];
@@ -59,39 +52,11 @@ private PeerTagSchema(String[] names) {
   }

   /**
-   * Returns the current peer-aggregation schema, lazily refreshing it if the supplied {@code
-   * peerTagNames} differ from the cached set. Designed to be called from the producer hot path: the
-   * common case is a single volatile read and an array-length / set-contains comparison.
+   * Resets every {@link TagCardinalityHandler}'s working set. Must be called on the aggregator
+   * thread; handlers are not thread-safe.
    */
-  static PeerTagSchema currentSyncedTo(Set<String> peerTagNames) {
-    // Fast path: same Set instance as the last sync -> the cached schema is still valid, no
-    // matches() loop needed. In production this is the steady-state case.
-    if (peerTagNames == LAST_SYNCED_INPUT) {
-      return CURRENT;
-    }
-    PeerTagSchema cur = CURRENT;
-    if (matches(cur.names, peerTagNames)) {
-      LAST_SYNCED_INPUT = peerTagNames;
-      return cur;
-    }
-    synchronized (PeerTagSchema.class) {
-      cur = CURRENT;
-      if (!matches(cur.names, peerTagNames)) {
-        cur = new PeerTagSchema(peerTagNames.toArray(new String[0]));
-        CURRENT = cur;
-      }
-      LAST_SYNCED_INPUT = peerTagNames;
-      return cur;
-    }
-  }
-
-  /** Resets the working sets of {@link #INTERNAL} and {@link #current()}. */
-  static void resetAll() {
-    PeerTagSchema cur = CURRENT;
-    for (TagCardinalityHandler h : cur.handlers) {
-      h.reset();
-    }
-    for (TagCardinalityHandler h : INTERNAL.handlers) {
+  void resetCardinalityHandlers() {
+    for (TagCardinalityHandler h : handlers) {
       h.reset();
     }
   }
@@ -107,16 +72,4 @@ String name(int i) {
   TagCardinalityHandler handler(int i) {
     return handlers[i];
   }
-
-  private static boolean matches(String[] cur, Set<String> set) {
-    if (cur.length != set.size()) {
-      return false;
-    }
-    for (String n : cur) {
-      if (!set.contains(n)) {
-        return false;
-      }
-    }
-    return true;
-  }
 }
diff --git a/dd-trace-core/src/test/java/datadog/trace/common/metrics/AggregateTableTest.java b/dd-trace-core/src/test/java/datadog/trace/common/metrics/AggregateTableTest.java
index 7a4f84c30dd..af63811df8c 100644
--- a/dd-trace-core/src/test/java/datadog/trace/common/metrics/AggregateTableTest.java
+++ b/dd-trace-core/src/test/java/datadog/trace/common/metrics/AggregateTableTest.java
@@ -231,14 +231,15 @@ private static final class SnapshotBuilder {
     }

     SnapshotBuilder peerTags(String... namesAndValues) {
-      // Build a schema from the (name, value, name, value, ...) input. Synced through the
-      // production singleton so canonicalization actually goes through the same handlers the
-      // aggregator would use in production -- which is the surface the test wants to exercise.
+      // Build a schema directly from the (name, value, name, value, ...) input. In production the
+      // cached schema is owned by ClientStatsAggregator; these tests exercise AggregateTable and
+      // can use a fresh per-snapshot schema -- canonicalization is content-based so cardinality
+      // collapse still works across snapshots even with different handler instances.
       java.util.LinkedHashSet<String> names = new java.util.LinkedHashSet<>();
       for (int i = 0; i < namesAndValues.length; i += 2) {
         names.add(namesAndValues[i]);
       }
-      this.peerTagSchema = PeerTagSchema.currentSyncedTo(names);
+      this.peerTagSchema = PeerTagSchema.of(names);
       this.peerTagValues = new String[peerTagSchema.size()];
       for (int i = 0; i < namesAndValues.length; i += 2) {
         for (int j = 0; j < peerTagSchema.size(); j++) {
diff --git a/docs/client_metrics_design.md b/docs/client_metrics_design.md
index 489763fd413..ca5f200c97f 100644
--- a/docs/client_metrics_design.md
+++ b/docs/client_metrics_design.md
@@ -66,17 +66,16 @@ The producer holds **no shared state**. Per trace it:
 1. Snapshots the current peer-aggregation schema **once per trace** (not per
    span):
    ```java
-   Set<String> eligiblePeerTags = features.peerTags();
-   PeerTagSchema peerAggSchema =
-       (eligiblePeerTags == null || eligiblePeerTags.isEmpty())
-           ? null
-           : PeerTagSchema.currentSyncedTo(eligiblePeerTags);
+   PeerTagSchema peerAggSchema = peerAggSchema(features.peerTagsRevision());
    ```
-   `currentSyncedTo` has a fast path: identity-equal to the previously-synced
-   `Set` instance → return the cached schema (the common case, since
-   `DDAgentFeaturesDiscovery` returns the same `Set` until remote-config
-   reconfiguration). The cached schema is `volatile`; replacement is guarded by
-   a `synchronized` block.
+   `peerAggSchema(...)` reads a `volatile long` revision held on the
+   aggregator and compares it to the value the cached `PeerTagSchema` was
+   built from. Match → return the cached schema (the common case, since
+   `peerTagsRevision()` only bumps when `DDAgentFeaturesDiscovery` observes a
+   peer-tag set that doesn't equal the previous one). Mismatch → take a
+   monitor on the aggregator, rebuild via `PeerTagSchema.of(names)`, and
+   publish the new schema + revision. The steady-state cost is one volatile
+   read + one long compare.

 2. Iterates the trace; for each metrics-eligible span:

@@ -217,9 +216,12 @@ Two distinct cadences:
   handlers. The handlers reset *every reporting cycle*, so the per-field
   budgets refresh.

-- **Schema sync**: `PeerTagSchema.currentSyncedTo` runs on the producer thread
-  per trace, with an identity-check fast path. The schema reference is
-  replaced atomically when remote-config reconfigures the peer-tag set.
+- **Schema sync**: `ClientStatsAggregator.peerAggSchema(long)` runs on the
+  producer thread per trace, keyed on `DDAgentFeaturesDiscovery.peerTagsRevision()`.
+  The cached schema is replaced when remote-config reconfigures the peer-tag
+  set (i.e., when the revision bumps). The schema's
+  `TagCardinalityHandler`s are reset on the aggregator thread each report
+  cycle via a hook passed into `Aggregator`.

 ## Memory and lifetime

@@ -228,9 +230,11 @@ Two distinct cadences:
 - `AggregateTable` is **not thread-safe**. All paths (producer-side `CLEAR`,
   schedule-driven `REPORT`, drainer-driven inserts) route through the inbox.
 - `Canonical` and the cardinality handlers are aggregator-thread-only.
-- `PeerTagSchema.CURRENT` is `volatile` with `synchronized` replacement; the
-  schema's `TagCardinalityHandler`s themselves are aggregator-thread-only and
-  are reset alongside the property handlers each cycle.
+- The cached `PeerTagSchema` lives on `ClientStatsAggregator` as a `volatile` + field paired with the `peerTagsRevision` it was built from; rebuild is + guarded by a monitor on the aggregator instance. The schema's + `TagCardinalityHandler`s themselves are aggregator-thread-only and are + reset alongside the property handlers each cycle. - Entries retain their `UTF8BytesString` references across handler resets; matches via content-equality so post-reset snapshots still resolve. - Cap: `tracerMetricsMaxAggregates` bounds table size. Cap-overrun policy: @@ -285,8 +289,11 @@ showed the producer dominating CPU time. The major shifts: `PeerTagSchema`; the producer carries values in a parallel `String[]`. The aggregator does the `tag:value` interning via `TagCardinalityHandler` on its own thread. -6. **Sync peer-tag schema once per trace.** `currentSyncedTo` has an - identity-check fast path; the steady-state cost is one volatile read. +6. **Sync peer-tag schema once per trace.** The producer reads + `features.peerTagsRevision()` and compares it to the revision the cached + `PeerTagSchema` was built from; the steady-state cost is one volatile read + and one long compare. The cache lives on `ClientStatsAggregator`, not as + static state on `PeerTagSchema`. 7. **Single owner of all shared state.** `disable()` routes through `CLEAR` rather than mutating the aggregate table directly. From 5a5262262b42ec72e54ca976f3685efebfccd858 Mon Sep 17 00:00:00 2001 From: Douglas Q Hawkins Date: Tue, 19 May 2026 16:54:56 -0400 Subject: [PATCH 11/33] Fold AggregateMetric into AggregateEntry Adopts the optimize-metric-key design choice: one entry type that holds both the canonical label fields and the counter / histogram state. The prior split (AggregateMetric for counters, AggregateEntry for labels) required every counter read to hop through entry.aggregate -- ~30 sites across SerializingMetricWriter, the Aggregator, and the test suites. 
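For readers unfamiliar with the tagged-duration encoding that the folded entry now decodes itself, here is a minimal standalone sketch. It assumes only the two bit constants that appear in the diff (`ERROR_TAG`, `TOP_LEVEL_TAG`). The class name `DurationTagSketch` and its helper methods are hypothetical and are not part of the tracer.

```java
// Hypothetical illustration; not part of dd-trace-java.
final class DurationTagSketch {
  // Same bit values as the constants moved onto AggregateEntry in this patch.
  static final long ERROR_TAG = 0x8000000000000000L;
  static final long TOP_LEVEL_TAG = 0x4000000000000000L;

  // Producer side (assumed shape): OR the flag bits into the duration nanos.
  static long tag(long durationNanos, boolean error, boolean topLevel) {
    long tagged = durationNanos;
    if (error) {
      tagged |= ERROR_TAG;
    }
    if (topLevel) {
      tagged |= TOP_LEVEL_TAG;
    }
    return tagged;
  }

  // Consumer side: test each flag bit, then mask both off to recover the duration.
  static boolean isError(long tagAndDuration) {
    return (tagAndDuration & ERROR_TAG) == ERROR_TAG;
  }

  static boolean isTopLevel(long tagAndDuration) {
    return (tagAndDuration & TOP_LEVEL_TAG) == TOP_LEVEL_TAG;
  }

  static long durationOf(long tagAndDuration) {
    return tagAndDuration & ~(ERROR_TAG | TOP_LEVEL_TAG);
  }
}
```

The sketch masks both bits in one step, whereas the patch clears each bit with XOR after testing it. The two give the same result for durations that fit in the low 62 bits.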
- AggregateEntry now owns ERROR_TAG, TOP_LEVEL_TAG, the okLatencies and errorLatencies histograms, hitCount/errorCount/topLevelCount/duration counters, and the recordOneDuration / recordDurations / clear methods that used to live on AggregateMetric. - AggregateMetric.java and AggregateMetricTest.groovy deleted. - AggregateTable.findOrInsert now returns AggregateEntry (not the inner AggregateMetric); Canonical.toEntry no longer takes an AggregateMetric arg. - Aggregator.Drainer reverts to AggregateEntry; the report lambda calls entry.clear() directly. - SerializingMetricWriter, ClientStatsAggregator imports, and all three test files updated to read counters from entry.* (not entry.aggregate.*). - AggregateEntryTest.java added with the recordOneDuration / recordDurations / clear coverage that AggregateMetricTest.groovy used to provide. Co-Authored-By: Claude Opus 4.7 (1M context) --- .claude/worktrees/agent-a2dfcea2 | 1 + .claude/worktrees/agent-adf53b58 | 1 + .../trace/common/metrics/AggregateEntry.java | 116 ++++++++++++++++-- .../trace/common/metrics/AggregateMetric.java | 103 ---------------- .../trace/common/metrics/AggregateTable.java | 24 ++-- .../trace/common/metrics/Aggregator.java | 8 +- .../common/metrics/ClientStatsAggregator.java | 4 +- .../trace/common/metrics/MetricWriter.java | 2 +- .../metrics/SerializingMetricWriter.java | 13 +- .../trace/common/metrics/SpanSnapshot.java | 4 +- .../common/metrics/AggregateMetricTest.groovy | 105 ---------------- .../metrics/ClientStatsAggregatorTest.groovy | 62 +++++----- .../SerializingMetricWriterTest.groovy | 12 +- .../common/metrics/AggregateEntryTest.java | 93 ++++++++++++++ .../common/metrics/AggregateTableTest.java | 47 ++++--- 15 files changed, 285 insertions(+), 310 deletions(-) create mode 160000 .claude/worktrees/agent-a2dfcea2 create mode 160000 .claude/worktrees/agent-adf53b58 delete mode 100644 dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateMetric.java delete mode 100644 
dd-trace-core/src/test/groovy/datadog/trace/common/metrics/AggregateMetricTest.groovy create mode 100644 dd-trace-core/src/test/java/datadog/trace/common/metrics/AggregateEntryTest.java diff --git a/.claude/worktrees/agent-a2dfcea2 b/.claude/worktrees/agent-a2dfcea2 new file mode 160000 index 00000000000..fc4b1a36cee --- /dev/null +++ b/.claude/worktrees/agent-a2dfcea2 @@ -0,0 +1 @@ +Subproject commit fc4b1a36ceef9c610441436e2003a0d31f94aeee diff --git a/.claude/worktrees/agent-adf53b58 b/.claude/worktrees/agent-adf53b58 new file mode 160000 index 00000000000..4666c89336e --- /dev/null +++ b/.claude/worktrees/agent-adf53b58 @@ -0,0 +1 @@ +Subproject commit 4666c89336ea288846835fcb0cbbf3698504c841 diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java index 5c950fbb808..2af174df521 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java @@ -1,16 +1,20 @@ package datadog.trace.common.metrics; +import datadog.metrics.api.Histogram; import datadog.trace.bootstrap.instrumentation.api.UTF8BytesString; import datadog.trace.util.Hashtable; import datadog.trace.util.LongHashingUtils; +import edu.umd.cs.findbugs.annotations.SuppressFBWarnings; import java.util.ArrayList; import java.util.Collections; import java.util.List; import java.util.Objects; +import java.util.concurrent.atomic.AtomicLongArray; /** * Hashtable entry for the consumer-side aggregator. Holds the UTF8-encoded label fields (the data - * {@link SerializingMetricWriter} writes to the wire) plus the mutable {@link AggregateMetric}. + * {@link SerializingMetricWriter} writes to the wire) plus the mutable counter / histogram state + * for the key. * *
<p>
UTF8 canonicalization runs through per-field {@link PropertyCardinalityHandler}s (and {@link * TagCardinalityHandler}s for peer tags), so cardinality is capped per reporting interval. The @@ -26,12 +30,20 @@ *
<p>
The handlers are reset on the aggregator thread every reporting cycle via {@link * #resetCardinalityHandlers()}. * - *
<p>
Thread-safety: the cardinality handlers and {@link Canonical} are not thread-safe. Only - * the aggregator thread may call {@link Canonical#populate} or {@link #resetCardinalityHandlers}. - * Test code uses {@link #of} which constructs entries without touching the handlers. + *
<p>
Thread-safety: not thread-safe. Counter and histogram updates, cardinality-handler + * registration, and {@link Canonical} use all run on the aggregator thread. Producer threads tag + * durations via {@link #ERROR_TAG} / {@link #TOP_LEVEL_TAG} bits and hand them off through the + * snapshot inbox. Test code uses {@link #of} which constructs entries without touching the + * cardinality handlers. */ +@SuppressFBWarnings( + value = {"AT_NONATOMIC_OPERATIONS_ON_SHARED_VARIABLE", "AT_STALE_THREAD_WRITE_OF_PRIMITIVE"}, + justification = "Explicitly not thread-safe. Accumulates counts and durations.") final class AggregateEntry extends Hashtable.Entry { + public static final long ERROR_TAG = 0x8000000000000000L; + public static final long TOP_LEVEL_TAG = 0x4000000000000000L; + // Per-field cardinality limits. Identical to the prior DDCache sizes. static final PropertyCardinalityHandler RESOURCE_HANDLER = new PropertyCardinalityHandler(32); static final PropertyCardinalityHandler SERVICE_HANDLER = new PropertyCardinalityHandler(32); @@ -59,7 +71,14 @@ final class AggregateEntry extends Hashtable.Entry { final boolean synthetic; final boolean traceRoot; final List peerTags; - final AggregateMetric aggregate; + + // Mutable aggregate state -- single-thread (aggregator) writer. + private final Histogram okLatencies = Histogram.newHistogram(); + private final Histogram errorLatencies = Histogram.newHistogram(); + private int errorCount; + private int hitCount; + private int topLevelCount; + private long duration; /** Field-bearing constructor used by both the hot path and the test factory. 
*/ private AggregateEntry( @@ -76,8 +95,7 @@ private AggregateEntry( short httpStatusCode, boolean synthetic, boolean traceRoot, - List peerTags, - AggregateMetric aggregate) { + List peerTags) { super(keyHash); this.resource = resource; this.service = service; @@ -92,7 +110,81 @@ private AggregateEntry( this.synthetic = synthetic; this.traceRoot = traceRoot; this.peerTags = peerTags; - this.aggregate = aggregate; + } + + AggregateEntry recordDurations(int count, AtomicLongArray durations) { + this.hitCount += count; + for (int i = 0; i < count && i < durations.length(); ++i) { + long duration = durations.getAndSet(i, 0); + if ((duration & TOP_LEVEL_TAG) == TOP_LEVEL_TAG) { + duration ^= TOP_LEVEL_TAG; + ++topLevelCount; + } + if ((duration & ERROR_TAG) == ERROR_TAG) { + duration ^= ERROR_TAG; + errorLatencies.accept(duration); + ++errorCount; + } else { + okLatencies.accept(duration); + } + this.duration += duration; + } + return this; + } + + /** + * Records a single hit. {@code tagAndDuration} carries the duration nanos with optional {@link + * #ERROR_TAG} / {@link #TOP_LEVEL_TAG} bits OR-ed in. 
+ */ + AggregateEntry recordOneDuration(long tagAndDuration) { + ++hitCount; + if ((tagAndDuration & TOP_LEVEL_TAG) == TOP_LEVEL_TAG) { + tagAndDuration ^= TOP_LEVEL_TAG; + ++topLevelCount; + } + if ((tagAndDuration & ERROR_TAG) == ERROR_TAG) { + tagAndDuration ^= ERROR_TAG; + errorLatencies.accept(tagAndDuration); + ++errorCount; + } else { + okLatencies.accept(tagAndDuration); + } + duration += tagAndDuration; + return this; + } + + int getErrorCount() { + return errorCount; + } + + int getHitCount() { + return hitCount; + } + + int getTopLevelCount() { + return topLevelCount; + } + + long getDuration() { + return duration; + } + + Histogram getOkLatencies() { + return okLatencies; + } + + Histogram getErrorLatencies() { + return errorLatencies; + } + + @SuppressFBWarnings("AT_NONATOMIC_64BIT_PRIMITIVE") + void clear() { + this.errorCount = 0; + this.hitCount = 0; + this.topLevelCount = 0; + this.duration = 0; + this.okLatencies.clear(); + this.errorLatencies.clear(); } /** @@ -154,8 +246,7 @@ static AggregateEntry of( (short) httpStatusCode, synthetic, traceRoot, - peerTagsList, - new AggregateMetric()); + peerTagsList); } /** @@ -426,7 +517,7 @@ private static boolean peerTagsEqual(List a, List snapshottedPeerTags; int n = peerTagsBuffer.size(); if (n == 0) { @@ -450,8 +541,7 @@ AggregateEntry toEntry(AggregateMetric aggregate) { httpStatusCode, synthetic, traceRoot, - snapshottedPeerTags, - aggregate); + snapshottedPeerTags); } } diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateMetric.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateMetric.java deleted file mode 100644 index dba66a5ab9c..00000000000 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateMetric.java +++ /dev/null @@ -1,103 +0,0 @@ -package datadog.trace.common.metrics; - -import datadog.metrics.api.Histogram; -import edu.umd.cs.findbugs.annotations.SuppressFBWarnings; -import java.util.concurrent.atomic.AtomicLongArray; - 
-/** Not thread-safe. Accumulates counts and durations. */ -@SuppressFBWarnings( - value = {"AT_NONATOMIC_OPERATIONS_ON_SHARED_VARIABLE", "AT_STALE_THREAD_WRITE_OF_PRIMITIVE"}, - justification = "Explicitly not thread-safe. Accumulates counts and durations.") -public final class AggregateMetric { - - static final long ERROR_TAG = 0x8000000000000000L; - static final long TOP_LEVEL_TAG = 0x4000000000000000L; - - private final Histogram okLatencies; - private final Histogram errorLatencies; - private int errorCount; - private int hitCount; - private int topLevelCount; - private long duration; - - public AggregateMetric() { - okLatencies = Histogram.newHistogram(); - errorLatencies = Histogram.newHistogram(); - } - - public AggregateMetric recordDurations(int count, AtomicLongArray durations) { - this.hitCount += count; - for (int i = 0; i < count && i < durations.length(); ++i) { - long duration = durations.getAndSet(i, 0); - if ((duration & TOP_LEVEL_TAG) == TOP_LEVEL_TAG) { - duration ^= TOP_LEVEL_TAG; - ++topLevelCount; - } - if ((duration & ERROR_TAG) == ERROR_TAG) { - // then it's an error - duration ^= ERROR_TAG; - errorLatencies.accept(duration); - ++errorCount; - } else { - okLatencies.accept(duration); - } - this.duration += duration; - } - return this; - } - - /** - * Records a single hit. {@code tagAndDuration} carries the duration nanos with optional {@link - * #ERROR_TAG} / {@link #TOP_LEVEL_TAG} bits OR-ed in. 
- */ - public AggregateMetric recordOneDuration(long tagAndDuration) { - ++hitCount; - if ((tagAndDuration & TOP_LEVEL_TAG) == TOP_LEVEL_TAG) { - tagAndDuration ^= TOP_LEVEL_TAG; - ++topLevelCount; - } - if ((tagAndDuration & ERROR_TAG) == ERROR_TAG) { - tagAndDuration ^= ERROR_TAG; - errorLatencies.accept(tagAndDuration); - ++errorCount; - } else { - okLatencies.accept(tagAndDuration); - } - duration += tagAndDuration; - return this; - } - - public int getErrorCount() { - return errorCount; - } - - public int getHitCount() { - return hitCount; - } - - public int getTopLevelCount() { - return topLevelCount; - } - - public long getDuration() { - return duration; - } - - public Histogram getOkLatencies() { - return okLatencies; - } - - public Histogram getErrorLatencies() { - return errorLatencies; - } - - @SuppressFBWarnings("AT_NONATOMIC_64BIT_PRIMITIVE") - public void clear() { - this.errorCount = 0; - this.hitCount = 0; - this.topLevelCount = 0; - this.duration = 0; - this.okLatencies.clear(); - this.errorLatencies.clear(); - } -} diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateTable.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateTable.java index 83813546a16..1f2421b35e1 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateTable.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateTable.java @@ -7,7 +7,7 @@ import java.util.function.Consumer; /** - * Consumer-side {@link AggregateMetric} store, keyed on the canonical UTF8-encoded labels of a + * Consumer-side {@link AggregateEntry} store, keyed on the canonical UTF8-encoded labels of a * {@link SpanSnapshot}. * *
<p>
{@link #findOrInsert} canonicalizes the snapshot's fields through the cardinality handlers (so @@ -42,35 +42,35 @@ boolean isEmpty() { } /** - * Returns the {@link AggregateMetric} to update for {@code snapshot}, lazily creating an entry on - * miss. Returns {@code null} when the table is at capacity and no stale entry can be evicted -- - * the caller should drop the data point in that case. + * Returns the {@link AggregateEntry} to update for {@code snapshot}, lazily creating one on miss. + * Returns {@code null} when the table is at capacity and no stale entry can be evicted -- the + * caller should drop the data point in that case. */ - AggregateMetric findOrInsert(SpanSnapshot snapshot) { + AggregateEntry findOrInsert(SpanSnapshot snapshot) { canonical.populate(snapshot); long keyHash = canonical.keyHash; for (AggregateEntry candidate = Support.bucket(buckets, keyHash); candidate != null; candidate = candidate.next()) { if (candidate.keyHash == keyHash && canonical.matches(candidate)) { - return candidate.aggregate; + return candidate; } } if (size >= maxAggregates && !evictOneStale()) { return null; } - AggregateEntry entry = canonical.toEntry(new AggregateMetric()); + AggregateEntry entry = canonical.toEntry(); Support.insertHeadEntry(buckets, keyHash, entry); size++; - return entry.aggregate; + return entry; } - /** Unlink the first entry whose {@code AggregateMetric.getHitCount() == 0}. */ + /** Unlink the first entry whose {@code getHitCount() == 0}. */ private boolean evictOneStale() { for (MutatingTableIterator iter = Support.mutatingTableIterator(buckets); iter.hasNext(); ) { AggregateEntry e = iter.next(); - if (e.aggregate.getHitCount() == 0) { + if (e.getHitCount() == 0) { iter.remove(); size--; return true; @@ -92,12 +92,12 @@ void forEach(T context, BiConsumer consumer) { Support.forEach(buckets, context, consumer); } - /** Removes entries whose {@code AggregateMetric.getHitCount() == 0}. */ + /** Removes entries whose {@code getHitCount() == 0}. 
*/ void expungeStaleAggregates() { for (MutatingTableIterator iter = Support.mutatingTableIterator(buckets); iter.hasNext(); ) { AggregateEntry e = iter.next(); - if (e.aggregate.getHitCount() == 0) { + if (e.getHitCount() == 0) { iter.remove(); size--; } diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/Aggregator.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/Aggregator.java index 466123c94ce..cdc90ac6725 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/Aggregator.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/Aggregator.java @@ -127,9 +127,9 @@ public void accept(InboxItem item) { } } else if (item instanceof SpanSnapshot && !stopped) { SpanSnapshot snapshot = (SpanSnapshot) item; - AggregateMetric aggregate = aggregates.findOrInsert(snapshot); - if (aggregate != null) { - aggregate.recordOneDuration(snapshot.tagAndDuration); + AggregateEntry entry = aggregates.findOrInsert(snapshot); + if (entry != null) { + entry.recordOneDuration(snapshot.tagAndDuration); dirty = true; } else { // table at cap with no stale entry available to evict @@ -151,7 +151,7 @@ private void report(long when, SignalItem signal) { writer, (w, entry) -> { w.add(entry); - entry.aggregate.clear(); + entry.clear(); }); // note that this may do IO and block writer.finishBucket(); diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/ClientStatsAggregator.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/ClientStatsAggregator.java index 821a531e7b8..3e7b79f0fb2 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/ClientStatsAggregator.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/ClientStatsAggregator.java @@ -4,8 +4,8 @@ import static datadog.trace.api.DDSpanTypes.RPC; import static datadog.trace.bootstrap.instrumentation.api.Tags.HTTP_ENDPOINT; import static datadog.trace.bootstrap.instrumentation.api.Tags.HTTP_METHOD; -import static 
datadog.trace.common.metrics.AggregateMetric.ERROR_TAG; -import static datadog.trace.common.metrics.AggregateMetric.TOP_LEVEL_TAG; +import static datadog.trace.common.metrics.AggregateEntry.ERROR_TAG; +import static datadog.trace.common.metrics.AggregateEntry.TOP_LEVEL_TAG; import static datadog.trace.common.metrics.SignalItem.ClearSignal.CLEAR; import static datadog.trace.common.metrics.SignalItem.ReportSignal.REPORT; import static datadog.trace.common.metrics.SignalItem.StopSignal.STOP; diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/MetricWriter.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/MetricWriter.java index c31825f6af8..905ba498760 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/MetricWriter.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/MetricWriter.java @@ -5,7 +5,7 @@ public interface MetricWriter { /** * Serialize one aggregate. The {@link AggregateEntry} carries both the label fields (resource, - * service, span.kind, peer tags, etc.) and the {@link AggregateMetric} counters being reported. + * service, span.kind, peer tags, etc.) and the counters being reported. 
*/ void add(AggregateEntry entry); diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/SerializingMetricWriter.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/SerializingMetricWriter.java index ba6ae6c2699..7644ebaf044 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/SerializingMetricWriter.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/SerializingMetricWriter.java @@ -143,7 +143,6 @@ public void startBucket(int metricCount, long start, long duration) { @Override public void add(AggregateEntry entry) { - final AggregateMetric aggregate = entry.aggregate; // Calculate dynamic map size based on optional fields final boolean hasHttpMethod = entry.getHttpMethod() != null; final boolean hasHttpEndpoint = entry.getHttpEndpoint() != null; @@ -213,22 +212,22 @@ public void add(AggregateEntry entry) { } writer.writeUTF8(HITS); - writer.writeInt(aggregate.getHitCount()); + writer.writeInt(entry.getHitCount()); writer.writeUTF8(ERRORS); - writer.writeInt(aggregate.getErrorCount()); + writer.writeInt(entry.getErrorCount()); writer.writeUTF8(TOP_LEVEL_HITS); - writer.writeInt(aggregate.getTopLevelCount()); + writer.writeInt(entry.getTopLevelCount()); writer.writeUTF8(DURATION); - writer.writeLong(aggregate.getDuration()); + writer.writeLong(entry.getDuration()); writer.writeUTF8(OK_SUMMARY); - writer.writeBinary(aggregate.getOkLatencies().serialize()); + writer.writeBinary(entry.getOkLatencies().serialize()); writer.writeUTF8(ERROR_SUMMARY); - writer.writeBinary(aggregate.getErrorLatencies().serialize()); + writer.writeBinary(entry.getErrorLatencies().serialize()); } @Override diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/SpanSnapshot.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/SpanSnapshot.java index 5967c1302c7..4fce49d0695 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/SpanSnapshot.java +++ 
b/dd-trace-core/src/main/java/datadog/trace/common/metrics/SpanSnapshot.java @@ -2,8 +2,8 @@ /** * Immutable per-span value posted from the producer to the aggregator thread. Carries the raw - * inputs the aggregator needs to build an {@link AggregateEntry} and update its {@link - * AggregateMetric}. + * inputs the aggregator needs to look up or build an {@link AggregateEntry} and update its + * counters. * *
<p>
All cache-canonicalization (service-name, span-kind, peer-tag string interning) happens on the * aggregator thread; the producer just shuffles references. diff --git a/dd-trace-core/src/test/groovy/datadog/trace/common/metrics/AggregateMetricTest.groovy b/dd-trace-core/src/test/groovy/datadog/trace/common/metrics/AggregateMetricTest.groovy deleted file mode 100644 index 140149d8324..00000000000 --- a/dd-trace-core/src/test/groovy/datadog/trace/common/metrics/AggregateMetricTest.groovy +++ /dev/null @@ -1,105 +0,0 @@ -package datadog.trace.common.metrics - -import datadog.metrics.agent.AgentMeter -import datadog.metrics.impl.DDSketchHistograms -import datadog.metrics.impl.MonitoringImpl -import datadog.metrics.api.statsd.StatsDClient -import datadog.trace.test.util.DDSpecification - -import java.util.concurrent.TimeUnit -import java.util.concurrent.atomic.AtomicLongArray - -import static datadog.trace.common.metrics.AggregateMetric.ERROR_TAG -import static datadog.trace.common.metrics.AggregateMetric.TOP_LEVEL_TAG - -class AggregateMetricTest extends DDSpecification { - - def setupSpec() { - // Initialize AgentMeter with monitoring - this is the standard mechanism used in production - def monitoring = new MonitoringImpl(StatsDClient.NO_OP, 1, TimeUnit.SECONDS) - AgentMeter.registerIfAbsent(StatsDClient.NO_OP, monitoring, DDSketchHistograms.FACTORY) - // Create a timer to trigger DDSketchHistograms loading and Factory registration - // This simulates what happens during CoreTracer initialization (traceWriteTimer) - monitoring.newTimer("test.init") - } - - def "record durations sums up to total"() { - given: - AggregateMetric aggregate = new AggregateMetric() - when: - aggregate.recordDurations(3, new AtomicLongArray(1, 2, 3)) - then: - aggregate.getDuration() == 6 - } - - def "total durations include errors"() { - given: - AggregateMetric aggregate = new AggregateMetric() - when: - aggregate.recordDurations(3, new AtomicLongArray(1, 2, 3)) - then: - 
aggregate.getDuration() == 6 - } - - def "clear"() { - given: - AggregateMetric aggregate = new AggregateMetric() - .recordDurations(3, new AtomicLongArray(5, ERROR_TAG | 6, TOP_LEVEL_TAG | 7)) - when: - aggregate.clear() - then: - aggregate.getDuration() == 0 - aggregate.getErrorCount() == 0 - aggregate.getTopLevelCount() == 0 - aggregate.getHitCount() == 0 - } - - def "recordOneDuration accumulates ok and error and top-level"() { - given: - AggregateMetric aggregate = new AggregateMetric() - .recordOneDuration(10L) - .recordOneDuration(10L | TOP_LEVEL_TAG) - .recordOneDuration(10L | ERROR_TAG) - - expect: - aggregate.getHitCount() == 3 - aggregate.getDuration() == 30 - aggregate.getErrorCount() == 1 - aggregate.getTopLevelCount() == 1 - } - - def "ignore trailing zeros"() { - given: - AggregateMetric aggregate = new AggregateMetric() - when: - aggregate.recordDurations(3, new AtomicLongArray(1, 2, 3, 0, 0, 0)) - then: - aggregate.getDuration() == 6 - aggregate.getHitCount() == 3 - aggregate.getErrorCount() == 0 - } - - def "hit count includes errors"() { - given: - AggregateMetric aggregate = new AggregateMetric() - when: - aggregate.recordDurations(3, new AtomicLongArray(1, 2, 3 | ERROR_TAG)) - then: - aggregate.getHitCount() == 3 - aggregate.getErrorCount() == 1 - } - - def "ok and error durations tracked separately"() { - given: - AggregateMetric aggregate = new AggregateMetric() - when: - aggregate.recordDurations(10, - new AtomicLongArray(1, 100 | ERROR_TAG, 2, 99 | ERROR_TAG, 3, - 98 | ERROR_TAG, 4, 97 | ERROR_TAG)) - then: - def errorLatencies = aggregate.getErrorLatencies() - def okLatencies = aggregate.getOkLatencies() - errorLatencies.getMaxValue() >= 99 - okLatencies.getMaxValue() <= 5 - } -} diff --git a/dd-trace-core/src/test/groovy/datadog/trace/common/metrics/ClientStatsAggregatorTest.groovy b/dd-trace-core/src/test/groovy/datadog/trace/common/metrics/ClientStatsAggregatorTest.groovy index 3cccc50c5a4..d8620e370f0 100644 --- 
a/dd-trace-core/src/test/groovy/datadog/trace/common/metrics/ClientStatsAggregatorTest.groovy +++ b/dd-trace-core/src/test/groovy/datadog/trace/common/metrics/ClientStatsAggregatorTest.groovy @@ -134,7 +134,7 @@ class ClientStatsAggregatorTest extends DDSpecification { null, null )) >> { AggregateEntry e -> - e.aggregate.getHitCount() == 1 && e.aggregate.getTopLevelCount() == 1 && e.aggregate.getDuration() == 100 + e.getHitCount() == 1 && e.getTopLevelCount() == 1 && e.getDuration() == 100 } 1 * writer.finishBucket() >> { latch.countDown() } @@ -180,7 +180,7 @@ class ClientStatsAggregatorTest extends DDSpecification { null, null )) >> { AggregateEntry e -> - e.aggregate.getHitCount() == 1 && e.aggregate.getTopLevelCount() == 1 && e.aggregate.getDuration() == 100 + e.getHitCount() == 1 && e.getTopLevelCount() == 1 && e.getDuration() == 100 } 1 * writer.finishBucket() >> { latch.countDown() } @@ -232,7 +232,7 @@ class ClientStatsAggregatorTest extends DDSpecification { httpEndpoint, null )) >> { AggregateEntry e -> - e.aggregate.getHitCount() == 1 && e.aggregate.getTopLevelCount() == 0 && e.aggregate.getDuration() == 100 + e.getHitCount() == 1 && e.getTopLevelCount() == 0 && e.getDuration() == 100 } (statsComputed ? 
1 : 0) * writer.finishBucket() >> { latch.countDown() } @@ -297,7 +297,7 @@ class ClientStatsAggregatorTest extends DDSpecification { null, null )) >> { AggregateEntry e -> - e.aggregate.getHitCount() == 1 && e.aggregate.getTopLevelCount() == 0 && e.aggregate.getDuration() == 100 + e.getHitCount() == 1 && e.getTopLevelCount() == 0 && e.getDuration() == 100 } 1 * writer.add( AggregateEntry.of( @@ -315,7 +315,7 @@ class ClientStatsAggregatorTest extends DDSpecification { null, null )) >> { AggregateEntry e -> - e.aggregate.getHitCount() == 1 && e.aggregate.getTopLevelCount() == 0 && e.aggregate.getDuration() == 100 + e.getHitCount() == 1 && e.getTopLevelCount() == 0 && e.getDuration() == 100 } 1 * writer.finishBucket() >> { latch.countDown() } @@ -362,7 +362,7 @@ class ClientStatsAggregatorTest extends DDSpecification { null, null )) >> { AggregateEntry e -> - e.aggregate.getHitCount() == 1 && e.aggregate.getTopLevelCount() == 0 && e.aggregate.getDuration() == 100 + e.getHitCount() == 1 && e.getTopLevelCount() == 0 && e.getDuration() == 100 } 1 * writer.finishBucket() >> { latch.countDown() } @@ -414,7 +414,7 @@ class ClientStatsAggregatorTest extends DDSpecification { null, null )) >> { AggregateEntry e -> - e.aggregate.getHitCount() == 1 && e.aggregate.getTopLevelCount() == topLevelCount && e.aggregate.getDuration() == 100 + e.getHitCount() == 1 && e.getTopLevelCount() == topLevelCount && e.getDuration() == 100 } 1 * writer.finishBucket() >> { latch.countDown() } @@ -473,7 +473,7 @@ class ClientStatsAggregatorTest extends DDSpecification { null, null )) >> { AggregateEntry e -> - e.aggregate.getHitCount() == count && e.aggregate.getDuration() == count * duration + e.getHitCount() == count && e.getDuration() == count * duration } 1 * writer.add(AggregateEntry.of( "resource2", @@ -490,7 +490,7 @@ class ClientStatsAggregatorTest extends DDSpecification { null, null )) >> { AggregateEntry e -> - e.aggregate.getHitCount() == count && e.aggregate.getDuration() == count * 
duration * 2 + e.getHitCount() == count && e.getDuration() == count * duration * 2 } cleanup: @@ -544,7 +544,7 @@ class ClientStatsAggregatorTest extends DDSpecification { "/api/users/:id", null )) >> { AggregateEntry e -> - e.aggregate.getHitCount() == count && e.aggregate.getDuration() == count * duration + e.getHitCount() == count && e.getDuration() == count * duration } 1 * writer.finishBucket() >> { latch.countDown() } @@ -585,7 +585,7 @@ class ClientStatsAggregatorTest extends DDSpecification { "/api/users/:id", null )) >> { AggregateEntry e -> - e.aggregate.getHitCount() == 1 && e.aggregate.getDuration() == duration + e.getHitCount() == 1 && e.getDuration() == duration } 1 * writer.add(AggregateEntry.of( "resource", @@ -602,7 +602,7 @@ class ClientStatsAggregatorTest extends DDSpecification { "/api/orders/:id", null )) >> { AggregateEntry e -> - e.aggregate.getHitCount() == 1 && e.aggregate.getDuration() == duration * 2 + e.getHitCount() == 1 && e.getDuration() == duration * 2 } 1 * writer.add(AggregateEntry.of( "resource", @@ -619,7 +619,7 @@ class ClientStatsAggregatorTest extends DDSpecification { "/api/users/:id", null )) >> { AggregateEntry e -> - e.aggregate.getHitCount() == 1 && e.aggregate.getDuration() == duration * 3 + e.getHitCount() == 1 && e.getDuration() == duration * 3 } 1 * writer.finishBucket() >> { latch2.countDown() } @@ -683,7 +683,7 @@ class ClientStatsAggregatorTest extends DDSpecification { "/api/users/:id", null )) >> { AggregateEntry e -> - e.aggregate.getHitCount() == 1 && e.aggregate.getDuration() == duration + e.getHitCount() == 1 && e.getDuration() == duration } 1 * writer.add(AggregateEntry.of( "resource", @@ -700,7 +700,7 @@ class ClientStatsAggregatorTest extends DDSpecification { "/api/users/:id", null )) >> { AggregateEntry e -> - e.aggregate.getHitCount() == 1 && e.aggregate.getDuration() == duration * 2 + e.getHitCount() == 1 && e.getDuration() == duration * 2 } 1 * writer.add(AggregateEntry.of( "resource", @@ -717,7 
+717,7 @@ class ClientStatsAggregatorTest extends DDSpecification { "/api/users/:id", null )) >> { AggregateEntry e -> - e.aggregate.getHitCount() == 1 && e.aggregate.getDuration() == duration * 3 + e.getHitCount() == 1 && e.getDuration() == duration * 3 } 1 * writer.add(AggregateEntry.of( "resource", @@ -734,7 +734,7 @@ class ClientStatsAggregatorTest extends DDSpecification { "/api/orders/:id", null )) >> { AggregateEntry e -> - e.aggregate.getHitCount() == 1 && e.aggregate.getDuration() == duration * 4 + e.getHitCount() == 1 && e.getDuration() == duration * 4 } 1 * writer.finishBucket() >> { latch.countDown() } @@ -787,7 +787,7 @@ class ClientStatsAggregatorTest extends DDSpecification { null, null )) >> { AggregateEntry e -> - e.aggregate.getHitCount() == 1 && e.aggregate.getDuration() == duration + e.getHitCount() == 1 && e.getDuration() == duration } 1 * writer.add(AggregateEntry.of( "resource", @@ -804,7 +804,7 @@ class ClientStatsAggregatorTest extends DDSpecification { "/api/users/:id", null )) >> { AggregateEntry e -> - e.aggregate.getHitCount() == 1 && e.aggregate.getDuration() == duration * 2 + e.getHitCount() == 1 && e.getDuration() == duration * 2 } 1 * writer.finishBucket() >> { latch.countDown() } @@ -855,7 +855,7 @@ class ClientStatsAggregatorTest extends DDSpecification { null, null )) >> { AggregateEntry e -> - e.aggregate.getHitCount() == 2 && e.aggregate.getDuration() == 2 * duration + e.getHitCount() == 2 && e.getDuration() == 2 * duration } 1 * writer.add(AggregateEntry.of( "resource", @@ -872,7 +872,7 @@ class ClientStatsAggregatorTest extends DDSpecification { null, null )) >> { AggregateEntry e -> - e.aggregate.getHitCount() == 1 && e.aggregate.getDuration() == duration + e.getHitCount() == 1 && e.getDuration() == duration } 1 * writer.finishBucket() >> { latch.countDown() } @@ -926,7 +926,7 @@ class ClientStatsAggregatorTest extends DDSpecification { null, null )) >> { AggregateEntry e -> - e.aggregate.getHitCount() == 1 && 
e.aggregate.getDuration() == duration + e.getHitCount() == 1 && e.getDuration() == duration } } 0 * writer.add(AggregateEntry.of( @@ -1073,7 +1073,7 @@ class ClientStatsAggregatorTest extends DDSpecification { null, null )) >> { AggregateEntry e -> - e.aggregate.getHitCount() == 1 && e.aggregate.getDuration() == duration + e.getHitCount() == 1 && e.getDuration() == duration } } 1 * writer.finishBucket() >> { latch.countDown() } @@ -1108,7 +1108,7 @@ class ClientStatsAggregatorTest extends DDSpecification { null, null )) >> { AggregateEntry e -> - e.aggregate.getHitCount() == 1 && e.aggregate.getDuration() == duration + e.getHitCount() == 1 && e.getDuration() == duration } } 0 * writer.add(AggregateEntry.of( @@ -1175,7 +1175,7 @@ class ClientStatsAggregatorTest extends DDSpecification { null, null )) >> { AggregateEntry e -> - e.aggregate.getHitCount() == 1 && e.aggregate.getDuration() == duration + e.getHitCount() == 1 && e.getDuration() == duration } } 1 * writer.finishBucket() >> { latch.countDown() } @@ -1234,7 +1234,7 @@ class ClientStatsAggregatorTest extends DDSpecification { null, null )) >> { AggregateEntry e -> - e.aggregate.getHitCount() == 1 && e.aggregate.getDuration() == duration + e.getHitCount() == 1 && e.getDuration() == duration } } 1 * writer.finishBucket() >> { latch.countDown() } @@ -1401,7 +1401,7 @@ class ClientStatsAggregatorTest extends DDSpecification { null, null )) >> { AggregateEntry e -> - e.aggregate.getHitCount() == 1 && e.aggregate.getTopLevelCount() == 1 && e.aggregate.getDuration() == 100 + e.getHitCount() == 1 && e.getTopLevelCount() == 1 && e.getDuration() == 100 } 1 * writer.finishBucket() >> { latch.countDown() } @@ -1456,7 +1456,7 @@ class ClientStatsAggregatorTest extends DDSpecification { null, null )) >> { AggregateEntry e -> - e.aggregate.getHitCount() == 3 && e.aggregate.getTopLevelCount() == 3 && e.aggregate.getDuration() == 450 + e.getHitCount() == 3 && e.getTopLevelCount() == 3 && e.getDuration() == 450 } 1 * 
writer.finishBucket() >> { latch.countDown() } @@ -1511,7 +1511,7 @@ class ClientStatsAggregatorTest extends DDSpecification { "/api/users/:id", null )) >> { AggregateEntry e -> - e.aggregate.getHitCount() == 1 && e.aggregate.getTopLevelCount() == 1 && e.aggregate.getDuration() == 100 + e.getHitCount() == 1 && e.getTopLevelCount() == 1 && e.getDuration() == 100 } 1 * writer.add( AggregateEntry.of( @@ -1529,7 +1529,7 @@ class ClientStatsAggregatorTest extends DDSpecification { "/api/orders", null )) >> { AggregateEntry e -> - e.aggregate.getHitCount() == 1 && e.aggregate.getTopLevelCount() == 1 && e.aggregate.getDuration() == 200 + e.getHitCount() == 1 && e.getTopLevelCount() == 1 && e.getDuration() == 200 } 1 * writer.add( AggregateEntry.of( @@ -1547,7 +1547,7 @@ class ClientStatsAggregatorTest extends DDSpecification { null, null )) >> { AggregateEntry e -> - e.aggregate.getHitCount() == 1 && e.aggregate.getTopLevelCount() == 1 && e.aggregate.getDuration() == 150 + e.getHitCount() == 1 && e.getTopLevelCount() == 1 && e.getDuration() == 150 } 1 * writer.finishBucket() >> { latch.countDown() } diff --git a/dd-trace-core/src/test/groovy/datadog/trace/common/metrics/SerializingMetricWriterTest.groovy b/dd-trace-core/src/test/groovy/datadog/trace/common/metrics/SerializingMetricWriterTest.groovy index 08f0f7cbb92..c4f20a1c210 100644 --- a/dd-trace-core/src/test/groovy/datadog/trace/common/metrics/SerializingMetricWriterTest.groovy +++ b/dd-trace-core/src/test/groovy/datadog/trace/common/metrics/SerializingMetricWriterTest.groovy @@ -45,7 +45,7 @@ class SerializingMetricWriterTest extends DDSpecification { resource, service, operationName, serviceSource, type, httpStatusCode, synthetic, traceRoot, spanKind, peerTags, httpMethod, httpEndpoint, grpcStatusCode) - e.aggregate.recordDurations(hitCount, new AtomicLongArray(1L)) + e.recordDurations(hitCount, new AtomicLongArray(1L)) return e } @@ -284,7 +284,7 @@ class SerializingMetricWriterTest extends DDSpecification { int 
statCount = unpacker.unpackArrayHeader() assert statCount == content.size() for (AggregateEntry entry : content) { - AggregateMetric value = entry.aggregate + // counters now live on AggregateEntry int metricMapSize = unpacker.unpackMapHeader() // Calculate expected map size based on optional fields boolean hasHttpMethod = entry.getHttpMethod() != null @@ -349,16 +349,16 @@ class SerializingMetricWriterTest extends DDSpecification { ++elementCount } assert unpacker.unpackString() == "Hits" - assert unpacker.unpackInt() == value.getHitCount() + assert unpacker.unpackInt() == entry.getHitCount() ++elementCount assert unpacker.unpackString() == "Errors" - assert unpacker.unpackInt() == value.getErrorCount() + assert unpacker.unpackInt() == entry.getErrorCount() ++elementCount assert unpacker.unpackString() == "TopLevelHits" - assert unpacker.unpackInt() == value.getTopLevelCount() + assert unpacker.unpackInt() == entry.getTopLevelCount() ++elementCount assert unpacker.unpackString() == "Duration" - assert unpacker.unpackLong() == value.getDuration() + assert unpacker.unpackLong() == entry.getDuration() ++elementCount assert unpacker.unpackString() == "OkSummary" validateSketch(unpacker) diff --git a/dd-trace-core/src/test/java/datadog/trace/common/metrics/AggregateEntryTest.java b/dd-trace-core/src/test/java/datadog/trace/common/metrics/AggregateEntryTest.java new file mode 100644 index 00000000000..25a08d94b23 --- /dev/null +++ b/dd-trace-core/src/test/java/datadog/trace/common/metrics/AggregateEntryTest.java @@ -0,0 +1,93 @@ +package datadog.trace.common.metrics; + +import static datadog.trace.common.metrics.AggregateEntry.ERROR_TAG; +import static datadog.trace.common.metrics.AggregateEntry.TOP_LEVEL_TAG; +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertTrue; + +import datadog.metrics.agent.AgentMeter; +import datadog.metrics.api.statsd.StatsDClient; +import datadog.metrics.impl.DDSketchHistograms; 
+import datadog.metrics.impl.MonitoringImpl; +import java.util.concurrent.TimeUnit; +import java.util.concurrent.atomic.AtomicLongArray; +import org.junit.jupiter.api.BeforeAll; +import org.junit.jupiter.api.Test; + +class AggregateEntryTest { + + @BeforeAll + static void initAgentMeter() { + // recordOneDuration -> Histogram.accept needs AgentMeter to be initialized. + MonitoringImpl monitoring = new MonitoringImpl(StatsDClient.NO_OP, 1, TimeUnit.SECONDS); + AgentMeter.registerIfAbsent(StatsDClient.NO_OP, monitoring, DDSketchHistograms.FACTORY); + monitoring.newTimer("test.init"); + } + + @Test + void recordDurationsSumsToTotal() { + AggregateEntry entry = newEntry(); + entry.recordDurations(3, new AtomicLongArray(new long[] {1L, 2L, 3L})); + assertEquals(6, entry.getDuration()); + } + + @Test + void clearResetsAllCounters() { + AggregateEntry entry = newEntry(); + entry.recordDurations( + 3, new AtomicLongArray(new long[] {5L, ERROR_TAG | 6L, TOP_LEVEL_TAG | 7L})); + entry.clear(); + assertEquals(0, entry.getDuration()); + assertEquals(0, entry.getErrorCount()); + assertEquals(0, entry.getTopLevelCount()); + assertEquals(0, entry.getHitCount()); + } + + @Test + void recordOneDurationAccumulatesOkErrorAndTopLevel() { + AggregateEntry entry = newEntry(); + entry.recordOneDuration(10L); + entry.recordOneDuration(10L | TOP_LEVEL_TAG); + entry.recordOneDuration(10L | ERROR_TAG); + + assertEquals(3, entry.getHitCount()); + assertEquals(30, entry.getDuration()); + assertEquals(1, entry.getErrorCount()); + assertEquals(1, entry.getTopLevelCount()); + } + + @Test + void recordDurationsIgnoresTrailingZeros() { + AggregateEntry entry = newEntry(); + entry.recordDurations(3, new AtomicLongArray(new long[] {1L, 2L, 3L, 0L, 0L, 0L})); + assertEquals(6, entry.getDuration()); + assertEquals(3, entry.getHitCount()); + assertEquals(0, entry.getErrorCount()); + } + + @Test + void hitCountIncludesErrors() { + AggregateEntry entry = newEntry(); + entry.recordDurations(3, new 
AtomicLongArray(new long[] {1L, 2L, 3L | ERROR_TAG})); + assertEquals(3, entry.getHitCount()); + assertEquals(1, entry.getErrorCount()); + } + + @Test + void okAndErrorLatenciesTrackedSeparately() { + AggregateEntry entry = newEntry(); + entry.recordDurations( + 10, + new AtomicLongArray( + new long[] { + 1L, 100L | ERROR_TAG, 2L, 99L | ERROR_TAG, 3L, 98L | ERROR_TAG, 4L, 97L | ERROR_TAG + })); + assertTrue(entry.getErrorLatencies().getMaxValue() >= 99); + assertTrue(entry.getOkLatencies().getMaxValue() <= 5); + } + + private static AggregateEntry newEntry() { + return AggregateEntry.of( + "resource", "svc", "op", null, "type", 200, false, true, "client", null, null, null, null); + } +} diff --git a/dd-trace-core/src/test/java/datadog/trace/common/metrics/AggregateTableTest.java b/dd-trace-core/src/test/java/datadog/trace/common/metrics/AggregateTableTest.java index af63811df8c..3c9e088b6c5 100644 --- a/dd-trace-core/src/test/java/datadog/trace/common/metrics/AggregateTableTest.java +++ b/dd-trace-core/src/test/java/datadog/trace/common/metrics/AggregateTableTest.java @@ -1,7 +1,7 @@ package datadog.trace.common.metrics; -import static datadog.trace.common.metrics.AggregateMetric.ERROR_TAG; -import static datadog.trace.common.metrics.AggregateMetric.TOP_LEVEL_TAG; +import static datadog.trace.common.metrics.AggregateEntry.ERROR_TAG; +import static datadog.trace.common.metrics.AggregateEntry.TOP_LEVEL_TAG; import static org.junit.jupiter.api.Assertions.assertEquals; import static org.junit.jupiter.api.Assertions.assertNotNull; import static org.junit.jupiter.api.Assertions.assertNotSame; @@ -25,8 +25,7 @@ class AggregateTableTest { @BeforeAll static void initAgentMeter() { - // AggregateMetric.recordOneDuration -> Histogram.accept needs AgentMeter to be initialized. - // Mirror what AggregateMetricTest does. + // AggregateEntry.recordOneDuration -> Histogram.accept needs AgentMeter to be initialized. 
MonitoringImpl monitoring = new MonitoringImpl(StatsDClient.NO_OP, 1, TimeUnit.SECONDS); AgentMeter.registerIfAbsent(StatsDClient.NO_OP, monitoring, DDSketchHistograms.FACTORY); monitoring.newTimer("test.init"); @@ -37,7 +36,7 @@ void insertOnMissReturnsNewAggregate() { AggregateTable table = new AggregateTable(8); SpanSnapshot s = snapshot("svc", "op", "client"); - AggregateMetric agg = table.findOrInsert(s); + AggregateEntry agg = table.findOrInsert(s); assertNotNull(agg); assertEquals(1, table.size()); @@ -50,8 +49,8 @@ void hitReturnsSameAggregateInstance() { SpanSnapshot s1 = snapshot("svc", "op", "client"); SpanSnapshot s2 = snapshot("svc", "op", "client"); - AggregateMetric first = table.findOrInsert(s1); - AggregateMetric second = table.findOrInsert(s2); + AggregateEntry first = table.findOrInsert(s1); + AggregateEntry second = table.findOrInsert(s2); assertSame(first, second); assertEquals(1, table.size()); @@ -61,8 +60,8 @@ void hitReturnsSameAggregateInstance() { void differentKindFieldsAreDistinct() { AggregateTable table = new AggregateTable(8); - AggregateMetric clientAgg = table.findOrInsert(snapshot("svc", "op", "client")); - AggregateMetric serverAgg = table.findOrInsert(snapshot("svc", "op", "server")); + AggregateEntry clientAgg = table.findOrInsert(snapshot("svc", "op", "client")); + AggregateEntry serverAgg = table.findOrInsert(snapshot("svc", "op", "server")); assertNotSame(clientAgg, serverAgg); assertEquals(2, table.size()); @@ -77,9 +76,9 @@ void peerTagPairsParticipateInIdentity() { builder("svc", "op", "client").peerTags("peer.hostname", "host-b").build(); SpanSnapshot noTags = builder("svc", "op", "client").build(); - AggregateMetric a = table.findOrInsert(withTags); - AggregateMetric b = table.findOrInsert(otherTags); - AggregateMetric c = table.findOrInsert(noTags); + AggregateEntry a = table.findOrInsert(withTags); + AggregateEntry b = table.findOrInsert(otherTags); + AggregateEntry c = table.findOrInsert(noTags); assertNotSame(a, b); 
assertNotSame(a, c); @@ -97,7 +96,7 @@ void cardinalityBlockedValuesCollapseIntoOneEntry() { AggregateTable table = new AggregateTable(128); for (int i = 0; i < 50; i++) { - AggregateMetric agg = table.findOrInsert(snapshot("svc-" + i, "op", "client")); + AggregateEntry agg = table.findOrInsert(snapshot("svc-" + i, "op", "client")); assertNotNull(agg); agg.recordOneDuration(1L); } @@ -112,19 +111,19 @@ void cardinalityBlockedValuesCollapseIntoOneEntry() { void capOverrunEvictsStaleEntry() { AggregateTable table = new AggregateTable(2); - AggregateMetric stale = table.findOrInsert(snapshot("svc-a", "op", "client")); + AggregateEntry stale = table.findOrInsert(snapshot("svc-a", "op", "client")); // do not record on stale -> hitCount stays at 0 - AggregateMetric live = table.findOrInsert(snapshot("svc-b", "op", "client")); + AggregateEntry live = table.findOrInsert(snapshot("svc-b", "op", "client")); live.recordOneDuration(10L | TOP_LEVEL_TAG); // hitCount=1, not evictable // table is full (size=2). Inserting a third should evict the stale one and succeed. 
- AggregateMetric newcomer = table.findOrInsert(snapshot("svc-c", "op", "client")); + AggregateEntry newcomer = table.findOrInsert(snapshot("svc-c", "op", "client")); assertNotNull(newcomer); assertEquals(2, table.size()); // re-inserting the stale snapshot should miss now (it was evicted) and produce a fresh entry - AggregateMetric staleAgain = table.findOrInsert(snapshot("svc-a", "op", "client")); + AggregateEntry staleAgain = table.findOrInsert(snapshot("svc-a", "op", "client")); assertNotSame(stale, staleAgain); } @@ -132,12 +131,12 @@ void capOverrunEvictsStaleEntry() { void capOverrunWithNoStaleReturnsNull() { AggregateTable table = new AggregateTable(2); - AggregateMetric a = table.findOrInsert(snapshot("svc-a", "op", "client")); - AggregateMetric b = table.findOrInsert(snapshot("svc-b", "op", "client")); + AggregateEntry a = table.findOrInsert(snapshot("svc-a", "op", "client")); + AggregateEntry b = table.findOrInsert(snapshot("svc-b", "op", "client")); a.recordOneDuration(10L); b.recordOneDuration(20L); - AggregateMetric c = table.findOrInsert(snapshot("svc-c", "op", "client")); + AggregateEntry c = table.findOrInsert(snapshot("svc-c", "op", "client")); assertNull(c); assertEquals(2, table.size()); } @@ -146,10 +145,10 @@ void capOverrunWithNoStaleReturnsNull() { void expungeStaleAggregatesRemovesZeroHitsOnly() { AggregateTable table = new AggregateTable(16); - AggregateMetric live = table.findOrInsert(snapshot("svc-live", "op", "client")); + AggregateEntry live = table.findOrInsert(snapshot("svc-live", "op", "client")); live.recordOneDuration(10L); - AggregateMetric stale1 = table.findOrInsert(snapshot("svc-stale1", "op", "client")); - AggregateMetric stale2 = table.findOrInsert(snapshot("svc-stale2", "op", "client")); + AggregateEntry stale1 = table.findOrInsert(snapshot("svc-stale1", "op", "client")); + AggregateEntry stale2 = table.findOrInsert(snapshot("svc-stale2", "op", "client")); assertEquals(3, table.size()); assertEquals(0, 
stale1.getHitCount()); assertEquals(0, stale2.getHitCount()); @@ -169,7 +168,7 @@ void forEachVisitsEveryEntry() { table.findOrInsert(snapshot("c", "op", "client")).recordOneDuration(3L | ERROR_TAG); Map visited = new HashMap<>(); - table.forEach(e -> visited.put(e.getService().toString(), e.aggregate.getDuration())); + table.forEach(e -> visited.put(e.getService().toString(), e.getDuration())); assertEquals(3, visited.size()); assertEquals(1L, visited.get("a")); From ece78c936768557204948ddf67ec2ac1404705f0 Mon Sep 17 00:00:00 2001 From: Douglas Q Hawkins Date: Tue, 19 May 2026 16:55:15 -0400 Subject: [PATCH 12/33] Remove accidentally-staged .claude/worktrees entries --- .claude/worktrees/agent-a2dfcea2 | 1 - .claude/worktrees/agent-adf53b58 | 1 - 2 files changed, 2 deletions(-) delete mode 160000 .claude/worktrees/agent-a2dfcea2 delete mode 160000 .claude/worktrees/agent-adf53b58 diff --git a/.claude/worktrees/agent-a2dfcea2 b/.claude/worktrees/agent-a2dfcea2 deleted file mode 160000 index fc4b1a36cee..00000000000 --- a/.claude/worktrees/agent-a2dfcea2 +++ /dev/null @@ -1 +0,0 @@ -Subproject commit fc4b1a36ceef9c610441436e2003a0d31f94aeee diff --git a/.claude/worktrees/agent-adf53b58 b/.claude/worktrees/agent-adf53b58 deleted file mode 160000 index 4666c89336e..00000000000 --- a/.claude/worktrees/agent-adf53b58 +++ /dev/null @@ -1 +0,0 @@ -Subproject commit 4666c89336ea288846835fcb0cbbf3698504c841 From b6c4f5fbd8c3cb1a0569bce066cd026b7fc590ff Mon Sep 17 00:00:00 2001 From: Douglas Q Hawkins Date: Tue, 19 May 2026 20:02:48 -0400 Subject: [PATCH 13/33] Address review on AggregateEntry nullables + PeerTagSchema revision - Replace `// nullable` comments on AggregateEntry's 4 nullable label fields (entry + Canonical scratch buffer) with `@Nullable` annotations. Also annotate the matching getters and of(...) factory parameters. - Move the cache revision into PeerTagSchema as a final field (peerTagsRevision), built via PeerTagSchema.of(names, revision). 
One field on the schema carries the cache key, so the hot path is a single volatile read + long compare against schema.peerTagsRevision -- no separate cachedPeerTagsRevision field on ClientStatsAggregator. When peer tags are unconfigured the cache stores an empty schema (size 0) carrying the revision rather than null, so subsequent publishes still short-circuit on the fast path. peerTagSchemaFor treats `schema.size() == 0` as "skip peer-agg processing" for client/producer/consumer kinds. INTERNAL is built with a -1L sentinel revision. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../trace/common/metrics/AggregateEntry.java | 31 +++++++------ .../common/metrics/ClientStatsAggregator.java | 44 ++++++++++--------- .../trace/common/metrics/PeerTagSchema.java | 31 +++++++++---- .../common/metrics/AggregateTableTest.java | 2 +- 4 files changed, 65 insertions(+), 43 deletions(-) diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java index 2af174df521..a2b679acdce 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java @@ -10,6 +10,7 @@ import java.util.List; import java.util.Objects; import java.util.concurrent.atomic.AtomicLongArray; +import javax.annotation.Nullable; /** * Hashtable entry for the consumer-side aggregator. 
Holds the UTF8-encoded label fields (the data @@ -61,12 +62,12 @@ final class AggregateEntry extends Hashtable.Entry { final UTF8BytesString resource; final UTF8BytesString service; final UTF8BytesString operationName; - final UTF8BytesString serviceSource; // nullable + @Nullable final UTF8BytesString serviceSource; final UTF8BytesString type; final UTF8BytesString spanKind; - final UTF8BytesString httpMethod; // nullable - final UTF8BytesString httpEndpoint; // nullable - final UTF8BytesString grpcStatusCode; // nullable + @Nullable final UTF8BytesString httpMethod; + @Nullable final UTF8BytesString httpEndpoint; + @Nullable final UTF8BytesString grpcStatusCode; final short httpStatusCode; final boolean synthetic; final boolean traceRoot; @@ -197,16 +198,16 @@ static AggregateEntry of( CharSequence resource, CharSequence service, CharSequence operationName, - CharSequence serviceSource, + @Nullable CharSequence serviceSource, CharSequence type, int httpStatusCode, boolean synthetic, boolean traceRoot, CharSequence spanKind, - List peerTags, - CharSequence httpMethod, - CharSequence httpEndpoint, - CharSequence grpcStatusCode) { + @Nullable List peerTags, + @Nullable CharSequence httpMethod, + @Nullable CharSequence httpEndpoint, + @Nullable CharSequence grpcStatusCode) { UTF8BytesString resourceUtf = createUtf8(resource); UTF8BytesString serviceUtf = createUtf8(service); UTF8BytesString operationNameUtf = createUtf8(operationName); @@ -322,6 +323,7 @@ UTF8BytesString getOperationName() { return operationName; } + @Nullable UTF8BytesString getServiceSource() { return serviceSource; } @@ -334,14 +336,17 @@ UTF8BytesString getSpanKind() { return spanKind; } + @Nullable UTF8BytesString getHttpMethod() { return httpMethod; } + @Nullable UTF8BytesString getHttpEndpoint() { return httpEndpoint; } + @Nullable UTF8BytesString getGrpcStatusCode() { return grpcStatusCode; } @@ -404,12 +409,12 @@ static final class Canonical { UTF8BytesString resource; UTF8BytesString 
service; UTF8BytesString operationName; - UTF8BytesString serviceSource; // nullable + @Nullable UTF8BytesString serviceSource; UTF8BytesString type; UTF8BytesString spanKind; - UTF8BytesString httpMethod; // nullable - UTF8BytesString httpEndpoint; // nullable - UTF8BytesString grpcStatusCode; // nullable + @Nullable UTF8BytesString httpMethod; + @Nullable UTF8BytesString httpEndpoint; + @Nullable UTF8BytesString grpcStatusCode; short httpStatusCode; boolean synthetic; boolean traceRoot; diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/ClientStatsAggregator.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/ClientStatsAggregator.java index 3e7b79f0fb2..9d2132165b5 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/ClientStatsAggregator.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/ClientStatsAggregator.java @@ -73,16 +73,16 @@ public final class ClientStatsAggregator implements MetricsAggregator, EventList private final boolean includeEndpointInMetrics; /** - * Cached peer-aggregation schema and the {@link DDAgentFeaturesDiscovery#peerTagsRevision()} - * value it was built from. The producer-side hot path in {@link #publish(List)} checks the - * current revision against {@code cachedPeerTagsRevision} and only rebuilds when they differ. + * Cached peer-aggregation schema. The schema carries its own {@link + * PeerTagSchema#peerTagsRevision} (the {@link DDAgentFeaturesDiscovery#peerTagsRevision()} value + * it was built from); {@link #publish(List)} compares that against the current revision and only + * rebuilds when they differ. An empty schema (size 0) represents the "peer tags unconfigured" + * state; {@code null} only on the bootstrap window before the first publish. * - *

<p>Both fields are {@code volatile} because {@code publish} is called on arbitrary producer - * threads. The reset hook ({@link #resetCachedPeerAggSchema()}) runs on the aggregator thread and - * only mutates the schema's internal handler state (not these fields). + *
<p>
{@code volatile} because {@code publish} is called on arbitrary producer threads. The reset + * hook ({@link #resetCachedPeerAggSchema()}) runs on the aggregator thread and only mutates the + * schema's internal handler state (not this field). */ - private volatile long cachedPeerTagsRevision = -1L; - private volatile PeerTagSchema cachedPeerAggSchema; private volatile AgentTaskScheduler.Scheduled cancellation; @@ -353,25 +353,29 @@ private boolean publish(CoreSpan span, boolean isTopLevel, PeerTagSchema peer /** * Returns the peer-aggregation schema synced to the given revision, rebuilding it if the cached - * one is stale. Fast path: one volatile-read pair + a long compare. Rebuild is rare (peer-tag - * config changes), so the synchronization is only on the slow path. + * one is stale. Fast path: one volatile read + a long compare against the schema's own embedded + * revision. Rebuild is rare (peer-tag config changes), so the synchronization is only on the slow + * path. Always returns non-null -- an empty schema (size 0) represents the "peer tags + * unconfigured" state so subsequent calls still short-circuit on the fast path. */ private PeerTagSchema peerAggSchema(long revision) { - if (revision == cachedPeerTagsRevision) { - return cachedPeerAggSchema; + PeerTagSchema cached = cachedPeerAggSchema; + if (cached != null && cached.peerTagsRevision == revision) { + return cached; } return refreshPeerAggSchema(revision); } private synchronized PeerTagSchema refreshPeerAggSchema(long revision) { // Double-checked: another producer may have rebuilt while we were waiting on the monitor. - if (revision == cachedPeerTagsRevision) { - return cachedPeerAggSchema; + PeerTagSchema cached = cachedPeerAggSchema; + if (cached != null && cached.peerTagsRevision == revision) { + return cached; } Set names = features.peerTags(); - PeerTagSchema schema = (names == null || names.isEmpty()) ? 
null : PeerTagSchema.of(names); + PeerTagSchema schema = + PeerTagSchema.of(names == null ? Collections.emptySet() : names, revision); cachedPeerAggSchema = schema; - cachedPeerTagsRevision = revision; return schema; } @@ -389,12 +393,12 @@ private void resetCachedPeerAggSchema() { /** * Picks the peer-tag schema for a span. The {@code peerAggSchema} argument is the per-trace - * cached schema (synced from {@code features.peerTagsRevision()} once in {@link #publish(List)}); - * it's {@code null} when no peer tags are configured. For internal-kind spans the static {@link - * PeerTagSchema#INTERNAL} schema is used regardless. + * cached schema (synced from {@code features.peerTagsRevision()} once in {@link #publish(List)}) + * -- always non-null but possibly empty when peer tags are unconfigured. For internal-kind spans + * the static {@link PeerTagSchema#INTERNAL} schema is used regardless. */ private static PeerTagSchema peerTagSchemaFor(CoreSpan span, PeerTagSchema peerAggSchema) { - if (peerAggSchema != null && span.isKind(PEER_AGGREGATION_KINDS)) { + if (peerAggSchema.size() > 0 && span.isKind(PEER_AGGREGATION_KINDS)) { return peerAggSchema; } if (span.isKind(INTERNAL_KIND)) { diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/PeerTagSchema.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/PeerTagSchema.java index 6c80424e9d8..533e69c847a 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/PeerTagSchema.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/PeerTagSchema.java @@ -16,35 +16,48 @@ *

<ul> *
  <li>{@link #INTERNAL} -- a singleton with one entry for {@code base.service}, used for * internal-kind spans where only the base service is aggregated. - *
  <li>A peer-aggregation schema built via {@link #of(Set)} for {@code client}/{@code - * producer}/{@code consumer} spans. Its lifecycle (including caching and rebuild on peer-tag - * config change) is owned by {@link ClientStatsAggregator}; this class is just the data - * holder. + *
  <li>A peer-aggregation schema built via {@link #of(Set, long)} for {@code client}/{@code + * producer}/{@code consumer} spans. {@link ClientStatsAggregator} caches the most recently + * built schema and compares its {@link #peerTagsRevision} against {@code + * DDAgentFeaturesDiscovery.peerTagsRevision()} to decide when to rebuild. *
</ul> * *
<p>Each {@link SpanSnapshot} captures its own schema reference so producer and consumer agree on * the indexing even if the current schema is replaced between capture and consumption. * *
<p>
Thread-safety: {@link TagCardinalityHandler}s are not thread-safe and must only be - * exercised on the aggregator thread. {@link #names} is final and safe to read from any thread. + * exercised on the aggregator thread. {@link #names} and {@link #peerTagsRevision} are final and + * safe to read from any thread. */ final class PeerTagSchema { private static final int VALUE_LIMIT_PER_TAG = 512; + /** Sentinel revision for {@link #INTERNAL} -- it never changes. */ + static final long INTERNAL_REVISION = -1L; + /** Singleton schema for internal-kind spans -- only {@code base.service}. */ - static final PeerTagSchema INTERNAL = new PeerTagSchema(new String[] {BASE_SERVICE}); + static final PeerTagSchema INTERNAL = + new PeerTagSchema(new String[] {BASE_SERVICE}, INTERNAL_REVISION); final String[] names; final TagCardinalityHandler[] handlers; + /** + * The {@code DDAgentFeaturesDiscovery.peerTagsRevision()} value this schema was built from. Cache + * callers ({@link ClientStatsAggregator}) compare this against the current revision to decide + * whether to rebuild -- one final long carries the cache key on the schema itself. + */ + final long peerTagsRevision; + /** Builds a schema for the given peer-tag names. Order is determined by the {@link Set}. 
*/ - static PeerTagSchema of(Set names) { - return new PeerTagSchema(names.toArray(new String[0])); + static PeerTagSchema of(Set names, long peerTagsRevision) { + return new PeerTagSchema(names.toArray(new String[0]), peerTagsRevision); } - private PeerTagSchema(String[] names) { + private PeerTagSchema(String[] names, long peerTagsRevision) { this.names = names; + this.peerTagsRevision = peerTagsRevision; this.handlers = new TagCardinalityHandler[names.length]; for (int i = 0; i < names.length; i++) { this.handlers[i] = new TagCardinalityHandler(names[i], VALUE_LIMIT_PER_TAG); diff --git a/dd-trace-core/src/test/java/datadog/trace/common/metrics/AggregateTableTest.java b/dd-trace-core/src/test/java/datadog/trace/common/metrics/AggregateTableTest.java index 3c9e088b6c5..57ac6ddef8b 100644 --- a/dd-trace-core/src/test/java/datadog/trace/common/metrics/AggregateTableTest.java +++ b/dd-trace-core/src/test/java/datadog/trace/common/metrics/AggregateTableTest.java @@ -238,7 +238,7 @@ SnapshotBuilder peerTags(String... namesAndValues) { for (int i = 0; i < namesAndValues.length; i += 2) { names.add(namesAndValues[i]); } - this.peerTagSchema = PeerTagSchema.of(names); + this.peerTagSchema = PeerTagSchema.of(names, 0L); this.peerTagValues = new String[peerTagSchema.size()]; for (int i = 0; i < namesAndValues.length; i += 2) { for (int j = 0; j < peerTagSchema.size(); j++) { From 14f7f58272230dbd271732d790e6f1cf6e4ee49d Mon Sep 17 00:00:00 2001 From: Douglas Q Hawkins Date: Tue, 19 May 2026 21:06:04 -0400 Subject: [PATCH 14/33] Consolidate cardinality-handler reset behind one entry point Reset was split between two owners: Aggregator.report called AggregateEntry.resetCardinalityHandlers (static handlers + INTERNAL) then ran a separate onResetCardinality callback that ClientStats wired up to reset its cached non-INTERNAL peer-agg schema. Anyone adding a new handler had to know which side to put it on. Make the callback the only entry point. ClientStatsAggregator. 
resetCardinalityHandlers (renamed from resetCachedPeerAggSchema) now calls AggregateEntry.resetCardinalityHandlers() itself plus the cached peer-agg schema reset. Aggregator.report just runs the callback -- it no longer knows about AggregateEntry's static state. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../datadog/trace/common/metrics/Aggregator.java | 7 ++++--- .../common/metrics/ClientStatsAggregator.java | 14 ++++++++------ 2 files changed, 12 insertions(+), 9 deletions(-) diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/Aggregator.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/Aggregator.java index cdc90ac6725..cf541121902 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/Aggregator.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/Aggregator.java @@ -162,9 +162,10 @@ private void report(long when, SignalItem signal) { } dirty = false; } - // Reset cardinality handlers each report cycle so the per-field budgets refresh. - // Safe to call on this (aggregator) thread; handlers are HashMap-based and not thread-safe. - AggregateEntry.resetCardinalityHandlers(); + // Reset cardinality handlers each report cycle so the per-field budgets refresh. Single hook + // owned by ClientStatsAggregator -- it covers both the static property handlers on + // AggregateEntry and the cached peer-agg schema. Safe on this (aggregator) thread; handlers + // are HashMap-based and not thread-safe. 
if (onResetCardinality != null) { onResetCardinality.run(); } diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/ClientStatsAggregator.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/ClientStatsAggregator.java index 9d2132165b5..eadef788bb0 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/ClientStatsAggregator.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/ClientStatsAggregator.java @@ -80,7 +80,7 @@ public final class ClientStatsAggregator implements MetricsAggregator, EventList * state; {@code null} only on the bootstrap window before the first publish. * *
<p>
{@code volatile} because {@code publish} is called on arbitrary producer threads. The reset - * hook ({@link #resetCachedPeerAggSchema()}) runs on the aggregator thread and only mutates the + * hook ({@link #resetCardinalityHandlers()}) runs on the aggregator thread and only mutates the * schema's internal handler state (not this field). */ private volatile PeerTagSchema cachedPeerAggSchema; @@ -179,7 +179,7 @@ public ClientStatsAggregator( reportingInterval, timeUnit, healthMetric, - this::resetCachedPeerAggSchema); + this::resetCardinalityHandlers); this.thread = newAgentThread(METRICS_AGGREGATOR, aggregator); this.reportingInterval = reportingInterval; this.reportingIntervalTimeUnit = timeUnit; @@ -380,11 +380,13 @@ private synchronized PeerTagSchema refreshPeerAggSchema(long revision) { } /** - * Reset hook invoked on the aggregator thread at the end of each report cycle. Resets the cached - * peer-aggregation schema's cardinality handlers so per-field budgets refresh in lockstep with - * {@link AggregateEntry#resetCardinalityHandlers()}. + * Single reset hook invoked on the aggregator thread at the end of each report cycle. Resets all + * cardinality state in lockstep: the static property handlers + {@code PeerTagSchema.INTERNAL} + * (via {@link AggregateEntry#resetCardinalityHandlers()}) and the cached peer-aggregation schema. + * New handlers added anywhere in this pipeline should be reset from here. 
*/ - private void resetCachedPeerAggSchema() { + private void resetCardinalityHandlers() { + AggregateEntry.resetCardinalityHandlers(); PeerTagSchema schema = cachedPeerAggSchema; if (schema != null) { schema.resetCardinalityHandlers(); From b953b3a15cd173b7fcbe021a53b2b765be886d8f Mon Sep 17 00:00:00 2001 From: Douglas Q Hawkins Date: Tue, 19 May 2026 21:13:58 -0400 Subject: [PATCH 15/33] Parameterize PropertyCardinalityHandler on T extends CharSequence Each handler is now typed to its SpanSnapshot field type, so the HashMap's key class has well-defined equals/hashCode rather than the abstract CharSequence interface. For String-typed fields (service, spanKind, httpMethod, httpEndpoint, grpcStatusCode) the cache hits reliably. For CharSequence-typed fields (resource, operationName, serviceSource, type) consistency still depends on the producer returning a single concrete class per field -- a pre-existing runtime contract -- but the type system now prevents call sites from accidentally passing a different shape. registerOrEmpty is now generic so it threads T through. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../trace/common/metrics/AggregateEntry.java | 38 +++++++++++-------- .../metrics/PropertyCardinalityHandler.java | 15 ++++++-- .../metrics/CardinalityHandlerTest.java | 6 +-- 3 files changed, 38 insertions(+), 21 deletions(-) diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java index a2b679acdce..862c31e77aa 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java @@ -45,19 +45,27 @@ final class AggregateEntry extends Hashtable.Entry { public static final long ERROR_TAG = 0x8000000000000000L; public static final long TOP_LEVEL_TAG = 0x4000000000000000L; - // Per-field cardinality limits. Identical to the prior DDCache sizes. 
- static final PropertyCardinalityHandler RESOURCE_HANDLER = new PropertyCardinalityHandler(32); - static final PropertyCardinalityHandler SERVICE_HANDLER = new PropertyCardinalityHandler(32); - static final PropertyCardinalityHandler OPERATION_HANDLER = new PropertyCardinalityHandler(64); - static final PropertyCardinalityHandler SERVICE_SOURCE_HANDLER = - new PropertyCardinalityHandler(16); - static final PropertyCardinalityHandler TYPE_HANDLER = new PropertyCardinalityHandler(8); - static final PropertyCardinalityHandler SPAN_KIND_HANDLER = new PropertyCardinalityHandler(16); - static final PropertyCardinalityHandler HTTP_METHOD_HANDLER = new PropertyCardinalityHandler(8); - static final PropertyCardinalityHandler HTTP_ENDPOINT_HANDLER = - new PropertyCardinalityHandler(32); - static final PropertyCardinalityHandler GRPC_STATUS_CODE_HANDLER = - new PropertyCardinalityHandler(32); + // Per-field cardinality limits. Identical to the prior DDCache sizes. Each handler's type + // parameter matches the corresponding SpanSnapshot field type so the cache map's key class has + // well-defined equals/hashCode. 
+ static final PropertyCardinalityHandler<CharSequence> RESOURCE_HANDLER = + new PropertyCardinalityHandler<>(32); + static final PropertyCardinalityHandler<String> SERVICE_HANDLER = + new PropertyCardinalityHandler<>(32); + static final PropertyCardinalityHandler<CharSequence> OPERATION_HANDLER = + new PropertyCardinalityHandler<>(64); + static final PropertyCardinalityHandler<CharSequence> SERVICE_SOURCE_HANDLER = + new PropertyCardinalityHandler<>(16); + static final PropertyCardinalityHandler<CharSequence> TYPE_HANDLER = + new PropertyCardinalityHandler<>(8); + static final PropertyCardinalityHandler<String> SPAN_KIND_HANDLER = + new PropertyCardinalityHandler<>(16); + static final PropertyCardinalityHandler<String> HTTP_METHOD_HANDLER = + new PropertyCardinalityHandler<>(8); + static final PropertyCardinalityHandler<String> HTTP_ENDPOINT_HANDLER = + new PropertyCardinalityHandler<>(32); + static final PropertyCardinalityHandler<String> GRPC_STATUS_CODE_HANDLER = + new PropertyCardinalityHandler<>(32); final UTF8BytesString resource; final UTF8BytesString service; @@ -552,8 +560,8 @@ AggregateEntry toEntry() { // ----- helpers ----- - private static UTF8BytesString registerOrEmpty( - PropertyCardinalityHandler handler, CharSequence value) { + private static <T extends CharSequence> UTF8BytesString registerOrEmpty( + PropertyCardinalityHandler<T> handler, T value) { return value == null ? UTF8BytesString.EMPTY : handler.register(value); } diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/PropertyCardinalityHandler.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/PropertyCardinalityHandler.java index 61560a32a71..a9dc4d5265e 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/PropertyCardinalityHandler.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/PropertyCardinalityHandler.java @@ -3,10 +3,19 @@ import datadog.trace.bootstrap.instrumentation.api.UTF8BytesString; import java.util.HashMap; -public final class PropertyCardinalityHandler { +/** + * Cardinality-capped UTF8 canonicalizer for one property field.
+ * + *

<p>The type parameter {@code T} pins the input type per handler so the {@link HashMap} cache key + * is a class with well-defined {@code equals}/{@code hashCode} (e.g. {@code String}) rather than + * the abstract {@code CharSequence} interface, where {@code "foo".equals(UTF8BytesString + * .create("foo"))} is {@code false}. Each call site uses the type its {@code SpanSnapshot} field + * carries; the compiler then enforces type consistency across calls to a given handler. + */ +public final class PropertyCardinalityHandler<T extends CharSequence> { private final int cardinalityLimit; - private final HashMap<CharSequence, UTF8BytesString> curUtf8s; + private final HashMap<T, UTF8BytesString> curUtf8s; private UTF8BytesString cacheBlocked = null; @@ -17,7 +26,7 @@ public PropertyCardinalityHandler(int cardinalityLimit) { this.curUtf8s = new HashMap<>((int) Math.ceil(cardinalityLimit / 0.75) + 1); } - public UTF8BytesString register(CharSequence value) { + public UTF8BytesString register(T value) { if (this.curUtf8s.size() >= this.cardinalityLimit) { return this.blockedByTracer(); } diff --git a/dd-trace-core/src/test/java/datadog/trace/common/metrics/CardinalityHandlerTest.java b/dd-trace-core/src/test/java/datadog/trace/common/metrics/CardinalityHandlerTest.java index bbdffb6061a..3ca8f51626e 100644 --- a/dd-trace-core/src/test/java/datadog/trace/common/metrics/CardinalityHandlerTest.java +++ b/dd-trace-core/src/test/java/datadog/trace/common/metrics/CardinalityHandlerTest.java @@ -11,7 +11,7 @@ class CardinalityHandlerTest { @Test void propertyReturnsSameInstanceForRepeatedValueUntilLimit() { - PropertyCardinalityHandler h = new PropertyCardinalityHandler(3); + PropertyCardinalityHandler<String> h = new PropertyCardinalityHandler<>(3); UTF8BytesString a1 = h.register("a"); UTF8BytesString a2 = h.register("a"); assertSame(a1, a2); @@ -20,7 +20,7 @@ void propertyReturnsSameInstanceForRepeatedValueUntilLimit() { @Test void propertyOverLimitReturnsBlockedSentinel() { - PropertyCardinalityHandler h = new PropertyCardinalityHandler(2); +
PropertyCardinalityHandler<String> h = new PropertyCardinalityHandler<>(2); UTF8BytesString a = h.register("a"); UTF8BytesString b = h.register("b"); UTF8BytesString blocked1 = h.register("c"); @@ -34,7 +34,7 @@ void propertyOverLimitReturnsBlockedSentinel() { @Test void propertyResetRefreshesBudget() { - PropertyCardinalityHandler h = new PropertyCardinalityHandler(2); + PropertyCardinalityHandler<String> h = new PropertyCardinalityHandler<>(2); h.register("a"); h.register("b"); UTF8BytesString blocked = h.register("c"); From 40e8cbd4b5a60727fb1704ed8da57a8a23321e11 Mon Sep 17 00:00:00 2001 From: Douglas Q Hawkins Date: Tue, 19 May 2026 21:22:19 -0400 Subject: [PATCH 16/33] Add long-lived LRU cache to PropertyCardinalityHandler Previously, reset() cleared the only cache, so every reporting cycle re-allocated UTF8BytesString instances for every property value seen again. That meant sustained allocations on the aggregator thread, proportional to the sum of the per-field cardinality limits (~bytes/sec), after every reset. Split the state in two: - seenThisCycle (HashSet): consumed-budget tracking, cleared on reset(). - utf8Cache (LinkedHashMap in access-order, 2x cardinalityLimit): long-lived; survives reset; LRU eviction once full. Workloads with stable value sets pay zero UTF8 allocations after the first cycle. The reused instances also short-circuit downstream equals to identity comparisons. Drops the TODO at the prior allocation site.
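For illustration only, the two-tier split can be sketched as below. This is a simplified stand-in, not the handler itself: TwoTierSketch is a hypothetical name, keys are plain String, and new String(value) stands in for UTF8BytesString.create(value) so that instance reuse is observable by identity.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of a per-cycle budget set plus a long-lived LRU cache. */
final class TwoTierSketch {
  static final String BLOCKED = "blocked_by_tracer";

  private final int limit;
  // Tier 1: values that consumed a budget slot this cycle; cleared on reset().
  private final HashSet<String> seenThisCycle = new HashSet<>();
  // Tier 2: access-ordered LRU cache of canonical instances; survives reset().
  private final LinkedHashMap<String, String> cache;

  TwoTierSketch(int limit) {
    this.limit = limit;
    final int cacheLimit = 2 * limit;
    this.cache =
        new LinkedHashMap<String, String>(16, 0.75f, true /* access-order */) {
          @Override
          protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
            return size() > cacheLimit;
          }
        };
  }

  String register(String value) {
    // Budget check: a first-time-this-cycle value needs a free slot.
    if (!seenThisCycle.contains(value)) {
      if (seenThisCycle.size() >= limit) {
        return BLOCKED;
      }
      seenThisCycle.add(value);
    }
    // Canonical lookup: reuse an instance built in this or an earlier cycle.
    String cached = cache.get(value);
    if (cached != null) {
      return cached;
    }
    String fresh = new String(value); // stand-in for UTF8BytesString.create(value)
    cache.put(value, fresh);
    return fresh;
  }

  void reset() {
    seenThisCycle.clear(); // the cache is deliberately left intact
  }
}
```

With a limit of 2, registering "a", "b", "c" blocks "c"; after reset(), "a" resolves to the same instance as before and "c" is admitted because the budget has refreshed.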
Co-Authored-By: Claude Opus 4.7 (1M context) --- .../metrics/PropertyCardinalityHandler.java | 73 ++++++++++++++----- 1 file changed, 56 insertions(+), 17 deletions(-) diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/PropertyCardinalityHandler.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/PropertyCardinalityHandler.java index a9dc4d5265e..f6d526deeee 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/PropertyCardinalityHandler.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/PropertyCardinalityHandler.java @@ -1,43 +1,81 @@ package datadog.trace.common.metrics; import datadog.trace.bootstrap.instrumentation.api.UTF8BytesString; -import java.util.HashMap; +import java.util.HashSet; +import java.util.LinkedHashMap; +import java.util.Map; /** * Cardinality-capped UTF8 canonicalizer for one property field. * - *

<p>The type parameter {@code T} pins the input type per handler so the {@link HashMap} cache key - * is a class with well-defined {@code equals}/{@code hashCode} (e.g. {@code String}) rather than - * the abstract {@code CharSequence} interface, where {@code "foo".equals(UTF8BytesString - * .create("foo"))} is {@code false}. Each call site uses the type its {@code SpanSnapshot} field - * carries; the compiler then enforces type consistency across calls to a given handler. + * <p>The type parameter {@code T} pins the input type per handler so the cache key is a class with + * well-defined {@code equals}/{@code hashCode} (e.g. {@code String}) rather than the abstract + * {@code CharSequence} interface, where {@code "foo".equals(UTF8BytesString.create("foo"))} is + * {@code false}. Each call site uses the type its {@code SpanSnapshot} field carries; the compiler + * then enforces type consistency across calls to a given handler. + * + * <p>Two tiers of state: + * + * <ul> + * <li>{@link #seenThisCycle} -- values that have consumed a slot of the cardinality budget this + * reporting cycle. Cleared on {@link #reset()}. + * <li>{@link #utf8Cache} -- LRU-bounded reuse cache of previously-built {@link UTF8BytesString} + * instances. Survives {@code reset()}, so a value seen across multiple cycles canonicalizes + * to the same instance and avoids re-allocation. Bounded at {@code 2 * cardinalityLimit}; + * once full, the eldest entry is evicted by {@link LinkedHashMap}'s access-order tracking. + * </ul> + * + * <p>
Reusing UTF8BytesString instances across cycles also benefits downstream identity-based + * comparisons: equality short-circuits to {@code ==} when both sides came from the cache. */ public final class PropertyCardinalityHandler<T extends CharSequence> { + /** Long-lived UTF8 cache holds this multiple of the per-cycle cardinality limit. */ + private static final int CACHE_MULTIPLIER = 2; + private final int cardinalityLimit; - private final HashMap<T, UTF8BytesString> curUtf8s; + /** Values that have consumed a slot of the cardinality budget this cycle. Cleared on reset. */ + private final HashSet<T> seenThisCycle; + + /** + * LRU UTF8 cache; survives reset. Eviction handled by {@link LinkedHashMap#removeEldestEntry}. + */ + private final LinkedHashMap<T, UTF8BytesString> utf8Cache; private UTF8BytesString cacheBlocked = null; public PropertyCardinalityHandler(int cardinalityLimit) { this.cardinalityLimit = cardinalityLimit; + final int cacheLimit = cardinalityLimit * CACHE_MULTIPLIER; // pre-sizing properly to avoid rehashing - this.curUtf8s = new HashMap<>((int) Math.ceil(cardinalityLimit / 0.75) + 1); + this.seenThisCycle = new HashSet<>((int) Math.ceil(cardinalityLimit / 0.75) + 1); + this.utf8Cache = + new LinkedHashMap<T, UTF8BytesString>( + (int) Math.ceil(cacheLimit / 0.75) + 1, 0.75f, true /* access-order */) { + @Override + protected boolean removeEldestEntry(Map.Entry<T, UTF8BytesString> eldest) { + return size() > cacheLimit; + } + }; } public UTF8BytesString register(T value) { - if (this.curUtf8s.size() >= this.cardinalityLimit) { - return this.blockedByTracer(); + // Cardinality budget: first-time-this-cycle values consume a slot; overflow returns sentinel. + if (!this.seenThisCycle.contains(value)) { + if (this.seenThisCycle.size() >= this.cardinalityLimit) { + return this.blockedByTracer(); + } + this.seenThisCycle.add(value); } - UTF8BytesString existingUtf8 = this.curUtf8s.get(value); - if (existingUtf8 != null) return existingUtf8; + // UTF8 lookup: long-lived cache reuses across cycles.
+ UTF8BytesString cached = this.utf8Cache.get(value); + if (cached != null) return cached; - // TODO: maybe use a fallback cache to reduce allocations across reset cycles - UTF8BytesString newUtf8 = UTF8BytesString.create(value); - this.curUtf8s.put(value, newUtf8); - return newUtf8; + UTF8BytesString fresh = UTF8BytesString.create(value); + this.utf8Cache.put(value, fresh); + return fresh; } private UTF8BytesString blockedByTracer() { @@ -49,6 +87,7 @@ private UTF8BytesString blockedByTracer() { } public void reset() { - this.curUtf8s.clear(); + this.seenThisCycle.clear(); + // utf8Cache deliberately not cleared -- cross-cycle reuse is the point. } } From d88a86346bd1853b2cd6b7228c6879910d14ed43 Mon Sep 17 00:00:00 2001 From: Douglas Q Hawkins Date: Tue, 19 May 2026 21:22:39 -0400 Subject: [PATCH 17/33] Centralize per-field cardinality limits in MetricCardinalityLimits The 9 property limits and the peer-tag value limit were sprinkled inline. Pull them into a single class with per-field javadoc so the sizing rationale lives in one place. Six values change from the DDCache-inherited defaults based on workload analysis: - RESOURCE 32 -> 128 (highest-cardinality field; tight today) - HTTP_ENDPOINT 32 -> 64 (same shape as RESOURCE for HTTP-heavy) - TYPE 8 -> 16 (DDSpanTypes catalogue is ~30) - HTTP_METHOD 8 -> 16 (WebDAV/custom verbs push past 8) - SPAN_KIND 16 -> 8 (OTel defines exactly 5 standard kinds) - GRPC_STATUS 32 -> 24 (gRPC spec has exactly 17 codes) SERVICE, OPERATION, SERVICE_SOURCE, and PEER_TAG_VALUE keep their current values. Net worst-case memory delta: roughly +90 KB driven by the RESOURCE and HTTP_ENDPOINT bumps. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- .../trace/common/metrics/AggregateEntry.java | 24 +++--- .../metrics/MetricCardinalityLimits.java | 73 +++++++++++++++++++ .../trace/common/metrics/PeerTagSchema.java | 5 +- 3 files changed, 87 insertions(+), 15 deletions(-) create mode 100644 dd-trace-core/src/main/java/datadog/trace/common/metrics/MetricCardinalityLimits.java diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java index 862c31e77aa..3fa64b89a6f 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java @@ -45,27 +45,27 @@ final class AggregateEntry extends Hashtable.Entry { public static final long ERROR_TAG = 0x8000000000000000L; public static final long TOP_LEVEL_TAG = 0x4000000000000000L; - // Per-field cardinality limits. Identical to the prior DDCache sizes. Each handler's type - // parameter matches the corresponding SpanSnapshot field type so the cache map's key class has - // well-defined equals/hashCode. + // Per-field cardinality handlers. Each handler's type parameter matches the corresponding + // SpanSnapshot field type so the cache key class has well-defined equals/hashCode. Limits live + // on MetricCardinalityLimits -- see that class for per-field rationale. 
static final PropertyCardinalityHandler RESOURCE_HANDLER = - new PropertyCardinalityHandler<>(32); + new PropertyCardinalityHandler<>(MetricCardinalityLimits.RESOURCE); static final PropertyCardinalityHandler SERVICE_HANDLER = - new PropertyCardinalityHandler<>(32); + new PropertyCardinalityHandler<>(MetricCardinalityLimits.SERVICE); static final PropertyCardinalityHandler OPERATION_HANDLER = - new PropertyCardinalityHandler<>(64); + new PropertyCardinalityHandler<>(MetricCardinalityLimits.OPERATION); static final PropertyCardinalityHandler SERVICE_SOURCE_HANDLER = - new PropertyCardinalityHandler<>(16); + new PropertyCardinalityHandler<>(MetricCardinalityLimits.SERVICE_SOURCE); static final PropertyCardinalityHandler TYPE_HANDLER = - new PropertyCardinalityHandler<>(8); + new PropertyCardinalityHandler<>(MetricCardinalityLimits.TYPE); static final PropertyCardinalityHandler SPAN_KIND_HANDLER = - new PropertyCardinalityHandler<>(16); + new PropertyCardinalityHandler<>(MetricCardinalityLimits.SPAN_KIND); static final PropertyCardinalityHandler HTTP_METHOD_HANDLER = - new PropertyCardinalityHandler<>(8); + new PropertyCardinalityHandler<>(MetricCardinalityLimits.HTTP_METHOD); static final PropertyCardinalityHandler HTTP_ENDPOINT_HANDLER = - new PropertyCardinalityHandler<>(32); + new PropertyCardinalityHandler<>(MetricCardinalityLimits.HTTP_ENDPOINT); static final PropertyCardinalityHandler GRPC_STATUS_CODE_HANDLER = - new PropertyCardinalityHandler<>(32); + new PropertyCardinalityHandler<>(MetricCardinalityLimits.GRPC_STATUS_CODE); final UTF8BytesString resource; final UTF8BytesString service; diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/MetricCardinalityLimits.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/MetricCardinalityLimits.java new file mode 100644 index 00000000000..f7d91343d4b --- /dev/null +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/MetricCardinalityLimits.java @@ -0,0 +1,73 @@ +package 
datadog.trace.common.metrics; + +/** + * Per-field caps on the number of distinct values canonicalized per reporting cycle. Overflow + * values collapse to a {@code blocked_by_tracer} sentinel so they merge into one aggregate row + * instead of fragmenting the table. + * + *
<p>
Values are sized to the typical-service workload with headroom; "typical" estimates are noted + * inline. Raise if a workload routinely hits the sentinel; lower carries proportional memory + * savings but risks suppressing legitimate distinctions. + */ +final class MetricCardinalityLimits { + private MetricCardinalityLimits() {} + + /** + * Distinct {@code resource.name} values per cycle. Highest-cardinality field by far: DB-query + * obfuscations, HTTP route templates, custom resources. Typical service: 30-200 unique. + */ + static final int RESOURCE = 128; + + /** + * Distinct {@code service.name} values per cycle. Local service plus downstream peer-service + * names. Microservice meshes typically reference 10-50 distinct services. + */ + static final int SERVICE = 32; + + /** + * Distinct {@code operation.name} values per cycle. Names like {@code http.request}, {@code + * db.query}, etc. Typical service: 10-30 across integrations. + */ + static final int OPERATION = 64; + + /** + * Distinct {@code _dd.base_service} override values per cycle. Used rarely; usually empty or one + * of a handful per service. + */ + static final int SERVICE_SOURCE = 16; + + /** + * Distinct {@code span.type} values per cycle. {@code DDSpanTypes} catalog is ~30; a single + * service usually spans 5-10 integration types. + */ + static final int TYPE = 16; + + /** + * Distinct {@code span.kind} values per cycle. OTel defines exactly 5 (server/client/producer/ + * consumer/internal); 8 still leaves 60% headroom in case a producer invents new kinds. + */ + static final int SPAN_KIND = 8; + + /** + * Distinct HTTP method values per cycle. Standard verbs are 7-9; WebDAV/custom adds a few more. + */ + static final int HTTP_METHOD = 16; + + /** + * Distinct {@code http.endpoint} values per cycle. Path templates -- same shape as {@code + * RESOURCE} for HTTP-heavy services. Only used when {@code includeEndpointInMetrics} is enabled. 
+ */ + static final int HTTP_ENDPOINT = 64; + + /** + * Distinct gRPC status code values per cycle. gRPC spec defines exactly 17 codes (0-16); 24 + * leaves headroom for unknown-code edge cases without wasting space. + */ + static final int GRPC_STATUS_CODE = 24; + + /** + * Distinct values per peer-tag name (e.g. distinct {@code peer.hostname} values). Each configured + * peer tag gets its own handler at this limit. + */ + static final int PEER_TAG_VALUE = 512; +} diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/PeerTagSchema.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/PeerTagSchema.java index 533e69c847a..0dc6e1c9e23 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/PeerTagSchema.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/PeerTagSchema.java @@ -31,8 +31,6 @@ */ final class PeerTagSchema { - private static final int VALUE_LIMIT_PER_TAG = 512; - /** Sentinel revision for {@link #INTERNAL} -- it never changes. */ static final long INTERNAL_REVISION = -1L; @@ -60,7 +58,8 @@ private PeerTagSchema(String[] names, long peerTagsRevision) { this.peerTagsRevision = peerTagsRevision; this.handlers = new TagCardinalityHandler[names.length]; for (int i = 0; i < names.length; i++) { - this.handlers[i] = new TagCardinalityHandler(names[i], VALUE_LIMIT_PER_TAG); + this.handlers[i] = + new TagCardinalityHandler(names[i], MetricCardinalityLimits.PEER_TAG_VALUE); } } From d01036fa9da214e0cbe6b99754888f25c2628326 Mon Sep 17 00:00:00 2001 From: Douglas Q Hawkins Date: Tue, 19 May 2026 21:28:50 -0400 Subject: [PATCH 18/33] Reimplement cardinality handlers as open-addressed flat arrays Replaces the previous LinkedHashMap-based design for PropertyCardinality Handler (and the HashMap-based TagCardinalityHandler) with parallel Object[] / UTF8BytesString[] arrays and linear-probing open addressing. 
Two tables per handler, "current cycle" and "prior cycle": - Capacity is the next power of two >= 2 * cardinalityLimit, so the linear-probing load factor stays <= 0.5 even when the budget is full. - Current tracks values that have consumed a slot of the cardinality budget this cycle. - Prior holds the just-completed cycle's entries verbatim. A first-time- this-cycle value that hits in prior reuses its UTF8BytesString instance -- no re-allocation. Implements the cross-reset reuse that the prior commit's LinkedHashMap LRU provided, with less overhead. Reset swaps the table pointers (just-completed cycle -> prior; the 2-cycles-ago tables get nulled out and become the new empty current). One O(capacity) pass, half the work of a copy-then-null. Wins: - No per-entry Node allocations (HashMap / LinkedHashMap) and no access-order linked-list maintenance per get. - Smaller working set: two Object[] + two UTF8BytesString[] per handler vs HashMap + HashSet + LinkedHashMap heap shapes. - Stable workloads pay zero UTF8BytesString allocations after the first cycle and produce identical references across cycles, so downstream equals short-circuits to ==. 
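A minimal sketch of the sizing and pointer-swap scheme described above, under stated assumptions: FlatTableSketch is a hypothetical name, keys are plain String, and new String(value) stands in for UTF8BytesString.create(value). It is meant to show the shape of the technique, not the handler's actual code.

```java
import java.util.Arrays;

/** Hypothetical sketch: two open-addressed tables (current and prior cycle), swapped on reset. */
final class FlatTableSketch {
  static final String BLOCKED = "blocked_by_tracer";

  private final int limit;
  private final int mask;
  private String[] curKeys;
  private String[] curValues;
  private String[] priorKeys;
  private String[] priorValues;
  private int curSize;

  FlatTableSketch(int limit) {
    if (limit <= 0) {
      throw new IllegalArgumentException("limit must be positive: " + limit);
    }
    this.limit = limit;
    final int capacity = capacityFor(limit);
    this.mask = capacity - 1;
    this.curKeys = new String[capacity];
    this.curValues = new String[capacity];
    this.priorKeys = new String[capacity];
    this.priorValues = new String[capacity];
  }

  /** Next power of two >= 2 * limit, so the load factor never exceeds 0.5. */
  static int capacityFor(int limit) {
    return Integer.highestOneBit(limit * 2 - 1) << 1;
  }

  String register(String value) {
    final int slot = probe(curKeys, value);
    if (curKeys[slot] != null) {
      return curValues[slot]; // already admitted this cycle
    }
    if (curSize >= limit) {
      return BLOCKED; // budget exhausted for first-time values
    }
    // Reuse the prior cycle's instance when there is one; otherwise allocate.
    final int priorSlot = probe(priorKeys, value);
    final String canonical =
        priorKeys[priorSlot] != null ? priorValues[priorSlot] : new String(value);
    curKeys[slot] = value;
    curValues[slot] = canonical;
    curSize++;
    return canonical;
  }

  /** Linear probe: index of the equal key, or of the first empty slot in the chain. */
  private int probe(String[] keys, String value) {
    int idx = value.hashCode() & mask;
    while (keys[idx] != null && !keys[idx].equals(value)) {
      idx = (idx + 1) & mask;
    }
    return idx;
  }

  void reset() {
    // Swap pointers: the finished cycle becomes prior, the old prior is recycled as current.
    final String[] tmpKeys = priorKeys;
    final String[] tmpValues = priorValues;
    priorKeys = curKeys;
    priorValues = curValues;
    curKeys = tmpKeys;
    curValues = tmpValues;
    Arrays.fill(curKeys, null);
    Arrays.fill(curValues, null);
    curSize = 0;
  }
}
```

capacityFor(32) and capacityFor(24) both give 64, and capacityFor(512) gives 1024. A value registered in consecutive cycles keeps its instance; one that is absent for a full cycle falls out of both tables and is re-allocated when it returns.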
Co-Authored-By: Claude Opus 4.7 (1M context) --- .../metrics/PropertyCardinalityHandler.java | 129 +++++++++++------- .../common/metrics/TagCardinalityHandler.java | 68 +++++++-- 2 files changed, 134 insertions(+), 63 deletions(-) diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/PropertyCardinalityHandler.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/PropertyCardinalityHandler.java index f6d526deeee..fbe55eaa680 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/PropertyCardinalityHandler.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/PropertyCardinalityHandler.java @@ -1,81 +1,99 @@ package datadog.trace.common.metrics; import datadog.trace.bootstrap.instrumentation.api.UTF8BytesString; -import java.util.HashSet; -import java.util.LinkedHashMap; -import java.util.Map; +import java.util.Arrays; /** * Cardinality-capped UTF8 canonicalizer for one property field. * - *

<p>The type parameter {@code T} pins the input type per handler so the cache key is a class with + * <p>The type parameter {@code T} pins the input type per handler so the cache key class has * well-defined {@code equals}/{@code hashCode} (e.g. {@code String}) rather than the abstract * {@code CharSequence} interface, where {@code "foo".equals(UTF8BytesString.create("foo"))} is * {@code false}. Each call site uses the type its {@code SpanSnapshot} field carries; the compiler * then enforces type consistency across calls to a given handler. * - * <p>Two tiers of state: + * <p>Storage: open-addressed flat arrays with linear probing. Two parallel tables -- + * "current cycle" and "prior cycle". Capacity is the next power of two {@code >= 2 * + * cardinalityLimit} so probes stay short even when the budget is full. * * <ul> - * <li>{@link #seenThisCycle} -- values that have consumed a slot of the cardinality budget this - * reporting cycle. Cleared on {@link #reset()}. - * <li>{@link #utf8Cache} -- LRU-bounded reuse cache of previously-built {@link UTF8BytesString} - * instances. Survives {@code reset()}, so a value seen across multiple cycles canonicalizes - * to the same instance and avoids re-allocation. Bounded at {@code 2 * cardinalityLimit}; - * once full, the eldest entry is evicted by {@link LinkedHashMap}'s access-order tracking. + * <li>The current table tracks which values have consumed a slot of the cardinality budget this + * reporting cycle. Once {@link #cardinalityLimit} distinct values are present, further + * first-time values get the {@code blocked_by_tracer} sentinel. + * <li>The prior table holds the previous cycle's entries verbatim. A first-time-this-cycle value + * that hits in the prior table reuses its {@link UTF8BytesString} instance -- no + * re-allocation -- and inserts a reference into the current table. * </ul> * - * <p>Reusing UTF8BytesString instances across cycles also benefits downstream identity-based - * comparisons: equality short-circuits to {@code ==} when both sides came from the cache. + * <p>
Reset: swap the current and prior pointers, then null the (now) current. This is one + * O(capacity) pass rather than the two passes a copy-then-null would need. Workloads with a stable + * value set across cycles pay zero UTF8 allocations after the first cycle; the reused instances + * also short-circuit downstream equality to identity comparisons. */ public final class PropertyCardinalityHandler { - /** Long-lived UTF8 cache holds this multiple of the per-cycle cardinality limit. */ - private static final int CACHE_MULTIPLIER = 2; - private final int cardinalityLimit; + private final int capacityMask; - /** Values that have consumed a slot of the cardinality budget this cycle. Cleared on reset. */ - private final HashSet seenThisCycle; - - /** - * LRU UTF8 cache; survives reset. Eviction handled by {@link LinkedHashMap#removeEldestEntry}. - */ - private final LinkedHashMap utf8Cache; + // Open-addressed parallel arrays. keys[i] == null means the slot is empty; otherwise + // values[i] holds the canonical UTF8 for keys[i]. Object[] rather than T[] so we can swap + // refs without unchecked-array-of-generic gymnastics. 
+ private Object[] curKeys; + private UTF8BytesString[] curValues; + private Object[] priorKeys; + private UTF8BytesString[] priorValues; + private int curSize; private UTF8BytesString cacheBlocked = null; public PropertyCardinalityHandler(int cardinalityLimit) { + if (cardinalityLimit <= 0) { + throw new IllegalArgumentException("cardinalityLimit must be positive: " + cardinalityLimit); + } this.cardinalityLimit = cardinalityLimit; - - final int cacheLimit = cardinalityLimit * CACHE_MULTIPLIER; - // pre-sizing properly to avoid rehashing - this.seenThisCycle = new HashSet<>((int) Math.ceil(cardinalityLimit / 0.75) + 1); - this.utf8Cache = - new LinkedHashMap( - (int) Math.ceil(cacheLimit / 0.75) + 1, 0.75f, true /* access-order */) { - @Override - protected boolean removeEldestEntry(Map.Entry eldest) { - return size() > cacheLimit; - } - }; + // Capacity = next power of two >= 2 * cardinalityLimit. Linear-probing load factor stays + // <= 0.5 even when the budget is full, which keeps probe chains short. + final int capacity = Integer.highestOneBit(cardinalityLimit * 2 - 1) << 1; + this.capacityMask = capacity - 1; + this.curKeys = new Object[capacity]; + this.curValues = new UTF8BytesString[capacity]; + this.priorKeys = new Object[capacity]; + this.priorValues = new UTF8BytesString[capacity]; } public UTF8BytesString register(T value) { - // Cardinality budget: first-time-this-cycle values consume a slot; overflow returns sentinel. - if (!this.seenThisCycle.contains(value)) { - if (this.seenThisCycle.size() >= this.cardinalityLimit) { - return this.blockedByTracer(); - } - this.seenThisCycle.add(value); + final int slot = probe(this.curKeys, value); + if (this.curKeys[slot] != null) { + // Already seen this cycle -- consumed a budget slot earlier; reuse the cached UTF8. + return this.curValues[slot]; } + if (this.curSize >= this.cardinalityLimit) { + return this.blockedByTracer(); + } + // First-time-this-cycle value. 
Reuse from the prior cycle if possible to avoid re-allocation. + UTF8BytesString utf8; + final int priorSlot = probe(this.priorKeys, value); + if (this.priorKeys[priorSlot] != null) { + utf8 = this.priorValues[priorSlot]; + } else { + utf8 = UTF8BytesString.create(value); + } + this.curKeys[slot] = value; + this.curValues[slot] = utf8; + this.curSize += 1; + return utf8; + } - // UTF8 lookup: long-lived cache reuses across cycles. - UTF8BytesString cached = this.utf8Cache.get(value); - if (cached != null) return cached; - - UTF8BytesString fresh = UTF8BytesString.create(value); - this.utf8Cache.put(value, fresh); - return fresh; + /** + * Linear-probe to find {@code value}'s slot: either the slot occupied by an equal key, or the + * first empty slot in the probe chain. Capacity is a power of two; mask with {@link + * #capacityMask}. + */ + private int probe(Object[] keys, T value) { + int idx = value.hashCode() & this.capacityMask; + while (keys[idx] != null && !keys[idx].equals(value)) { + idx = (idx + 1) & this.capacityMask; + } + return idx; } private UTF8BytesString blockedByTracer() { @@ -87,7 +105,18 @@ private UTF8BytesString blockedByTracer() { } public void reset() { - this.seenThisCycle.clear(); - // utf8Cache deliberately not cleared -- cross-cycle reuse is the point. + // Flip pointers: the just-completed cycle becomes prior; what was prior (2 cycles ago) is + // recycled into the new (empty) current. + final Object[] tmpKeys = this.priorKeys; + final UTF8BytesString[] tmpValues = this.priorValues; + this.priorKeys = this.curKeys; + this.priorValues = this.curValues; + this.curKeys = tmpKeys; + this.curValues = tmpValues; + // Null the new current. The values pulled out of prior are still reachable through any + // AggregateEntry rows they ended up populating; this just drops the handler's references. 
+ Arrays.fill(this.curKeys, null); + Arrays.fill(this.curValues, null); + this.curSize = 0; } } diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/TagCardinalityHandler.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/TagCardinalityHandler.java index 1fdfed5c7c4..f5fa3d2482f 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/TagCardinalityHandler.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/TagCardinalityHandler.java @@ -1,35 +1,69 @@ package datadog.trace.common.metrics; import datadog.trace.bootstrap.instrumentation.api.UTF8BytesString; -import java.util.HashMap; +import java.util.Arrays; +/** + * Cardinality-capped UTF8 canonicalizer for one peer-tag name. Output is the pre-encoded {@code + * "tag:value"} form the serializer writes. + * + *
<p>
Same open-addressed flat-array + prior-cycle reuse design as {@link + * PropertyCardinalityHandler} -- see that class for full description. + */ public final class TagCardinalityHandler { private final String tag; private final int cardinalityLimit; + private final int capacityMask; - private final HashMap curUtf8Pairs; + private Object[] curKeys; + private UTF8BytesString[] curValues; + private Object[] priorKeys; + private UTF8BytesString[] priorValues; + private int curSize; private UTF8BytesString cacheBlocked = null; public TagCardinalityHandler(String tag, int cardinalityLimit) { + if (cardinalityLimit <= 0) { + throw new IllegalArgumentException("cardinalityLimit must be positive: " + cardinalityLimit); + } this.tag = tag; this.cardinalityLimit = cardinalityLimit; - - // pre-sizing properly to avoid rehashing - this.curUtf8Pairs = new HashMap<>((int) Math.ceil(cardinalityLimit / 0.75) + 1); + final int capacity = Integer.highestOneBit(cardinalityLimit * 2 - 1) << 1; + this.capacityMask = capacity - 1; + this.curKeys = new Object[capacity]; + this.curValues = new UTF8BytesString[capacity]; + this.priorKeys = new Object[capacity]; + this.priorValues = new UTF8BytesString[capacity]; } public UTF8BytesString register(String value) { - if (this.curUtf8Pairs.size() >= this.cardinalityLimit) { + final int slot = probe(this.curKeys, value); + if (this.curKeys[slot] != null) { + return this.curValues[slot]; + } + if (this.curSize >= this.cardinalityLimit) { return this.blockedByTracer(); } + UTF8BytesString utf8; + final int priorSlot = probe(this.priorKeys, value); + if (this.priorKeys[priorSlot] != null) { + utf8 = this.priorValues[priorSlot]; + } else { + utf8 = UTF8BytesString.create(this.tag + ":" + value); + } + this.curKeys[slot] = value; + this.curValues[slot] = utf8; + this.curSize += 1; + return utf8; + } - UTF8BytesString existing = this.curUtf8Pairs.get(value); - if (existing != null) return existing; - - UTF8BytesString newPair = 
UTF8BytesString.create(this.tag + ":" + value); - this.curUtf8Pairs.put(value, newPair); - return newPair; + private int probe(Object[] keys, String value) { + int idx = value.hashCode() & this.capacityMask; + while (keys[idx] != null && !keys[idx].equals(value)) { + idx = (idx + 1) & this.capacityMask; + } + return idx; } private UTF8BytesString blockedByTracer() { @@ -41,6 +75,14 @@ private UTF8BytesString blockedByTracer() { } public void reset() { - this.curUtf8Pairs.clear(); + final Object[] tmpKeys = this.priorKeys; + final UTF8BytesString[] tmpValues = this.priorValues; + this.priorKeys = this.curKeys; + this.priorValues = this.curValues; + this.curKeys = tmpKeys; + this.curValues = tmpValues; + Arrays.fill(this.curKeys, null); + Arrays.fill(this.curValues, null); + this.curSize = 0; } } From 7b6c5f1e84bb31b7ddc516a29592f8d5a3df8c06 Mon Sep 17 00:00:00 2001 From: Douglas Q Hawkins Date: Tue, 19 May 2026 21:33:28 -0400 Subject: [PATCH 19/33] Drop parallel keys array in PropertyCardinalityHandler The stored UTF8BytesString can serve as the slot's identity on its own: its hashCode() returns the underlying String.hashCode (content-stable with whatever shape the input takes), and equality is checked via stored.toString().contentEquals(value) -- the JDK's content-equality routine that fast-paths to String.equals when the input is a String. Halves the per-handler array footprint: one UTF8BytesString[] per cycle (current + prior) instead of one Object[] + one UTF8BytesString[] per cycle. No behavior change. TagCardinalityHandler keeps the parallel-arrays shape because its stored UTF8 is "tag:value" and cannot be compared directly against the bare incoming value. 
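The single-array shape described above can be sketched roughly as follows. This is a minimal illustration, not the handler itself: the class name SlotIdentitySketch is hypothetical, and a plain String stands in for UTF8BytesString so the block compiles on its own. It assumes inputs whose hashCode() is content-stable (String here; an arbitrary CharSequence such as StringBuilder would not qualify).

```java
import java.util.Arrays;

/** Hypothetical sketch: the stored canonical value doubles as the slot key. */
final class SlotIdentitySketch {
  static final String BLOCKED = "blocked_by_tracer";

  private final int limit;
  private final int mask;
  private String[] cur;   // values that consumed budget this cycle
  private String[] prior; // previous cycle's values, kept for instance reuse
  private int curSize;

  SlotIdentitySketch(int limit) {
    if (limit <= 0 || limit > (1 << 29)) {
      throw new IllegalArgumentException("limit out of range: " + limit);
    }
    this.limit = limit;
    // Next power of two >= 2 * limit, so the load factor never exceeds 0.5.
    int capacity = Integer.highestOneBit(limit * 2 - 1) << 1;
    this.mask = capacity - 1;
    this.cur = new String[capacity];
    this.prior = new String[capacity];
  }

  String register(CharSequence value) {
    if (value == null) {
      return ""; // stand-in for the EMPTY sentinel
    }
    int slot = probe(cur, value);
    if (cur[slot] != null) {
      return cur[slot]; // already seen this cycle
    }
    if (curSize >= limit) {
      return BLOCKED; // budget exhausted
    }
    // First time this cycle: reuse the prior cycle's instance when present.
    int priorSlot = probe(prior, value);
    String canonical = prior[priorSlot] != null ? prior[priorSlot] : value.toString();
    cur[slot] = canonical;
    curSize++;
    return canonical;
  }

  /** Linear probe: returns the slot holding equal content, or the first empty slot. */
  private int probe(String[] table, CharSequence value) {
    int idx = value.hashCode() & mask; // assumes a content-stable hash
    while (table[idx] != null && !table[idx].contentEquals(value)) {
      idx = (idx + 1) & mask;
    }
    return idx;
  }

  /** Swap tables, then clear the new current one in a single pass. */
  void reset() {
    String[] tmp = prior;
    prior = cur;
    cur = tmp;
    Arrays.fill(cur, null);
    curSize = 0;
  }

  public static void main(String[] args) {
    SlotIdentitySketch h = new SlotIdentitySketch(2);
    String first = h.register("a");
    System.out.println(first == h.register(new String("a"))); // same slot, same instance
    h.register("b");
    System.out.println(h.register("c")); // over budget
    h.reset();
    System.out.println(first == h.register(new String("a"))); // reused from prior cycle
  }
}
```

Because the stored value is compared by content, no separate key array is needed; the trade-off is that the probe depends on the input's hashCode() agreeing with the stored value's hash.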
Co-Authored-By: Claude Opus 4.7 (1M context) --- .../metrics/PropertyCardinalityHandler.java | 73 +++++++++---------- 1 file changed, 35 insertions(+), 38 deletions(-) diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/PropertyCardinalityHandler.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/PropertyCardinalityHandler.java index fbe55eaa680..357c34617a0 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/PropertyCardinalityHandler.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/PropertyCardinalityHandler.java @@ -6,15 +6,19 @@ /** * Cardinality-capped UTF8 canonicalizer for one property field. * - *
<p>
The type parameter {@code T} pins the input type per handler so the cache key class has - * well-defined {@code equals}/{@code hashCode} (e.g. {@code String}) rather than the abstract - * {@code CharSequence} interface, where {@code "foo".equals(UTF8BytesString.create("foo"))} is - * {@code false}. Each call site uses the type its {@code SpanSnapshot} field carries; the compiler - * then enforces type consistency across calls to a given handler. + *
<p>
The type parameter {@code T} pins the input type per handler so the input class has + * well-defined, content-stable {@code hashCode}/{@code equals} (e.g. {@code String}) consistent + * with {@link UTF8BytesString#hashCode()} (which delegates to the underlying String). Each call + * site uses the type its {@code SpanSnapshot} field carries; the compiler then enforces type + * consistency across calls to a given handler. * - *
<p>
Storage: open-addressed flat arrays with linear probing. Two parallel tables -- - * "current cycle" and "prior cycle". Capacity is the next power of two {@code >= 2 * - * cardinalityLimit} so probes stay short even when the budget is full. + *
<p>
Storage: open-addressed flat arrays with linear probing. Two parallel {@code + * UTF8BytesString[]} tables -- "current cycle" and "prior cycle". Capacity is the next power of two + * {@code >= 2 * cardinalityLimit} so probes stay short even at the full budget. + * + *
<p>
The stored UTF8BytesString carries the slot's identity directly: probe equality is {@code + * stored.toString().contentEquals(value)}, which is the JDK's content-equality routine and + * fast-paths to {@code String.equals} when the input is a String. No parallel keys array needed. * *
<ul>
    *
  <li>The current table tracks which values have consumed a slot of the cardinality budget this @@ -22,24 +26,21 @@ * first-time values get the {@code blocked_by_tracer} sentinel. *
  <li>The prior table holds the previous cycle's entries verbatim. A first-time-this-cycle value * that hits in the prior table reuses its {@link UTF8BytesString} instance -- no - * re-allocation -- and inserts a reference into the current table. + * re-allocation -- and stores that reference in the current table. *
</ul> * - *
<p>
Reset: swap the current and prior pointers, then null the (now) current. This is one - * O(capacity) pass rather than the two passes a copy-then-null would need. Workloads with a stable - * value set across cycles pay zero UTF8 allocations after the first cycle; the reused instances - * also short-circuit downstream equality to identity comparisons. + *
<p>
Reset: swap the current and prior pointers, then null the (now) current. One + * O(capacity) pass; half the work of a copy-then-null. Workloads with a stable value set across + * cycles pay zero UTF8 allocations after the first cycle, and the reused instances also + * short-circuit downstream equality to identity comparisons. */ public final class PropertyCardinalityHandler { private final int cardinalityLimit; private final int capacityMask; - // Open-addressed parallel arrays. keys[i] == null means the slot is empty; otherwise - // values[i] holds the canonical UTF8 for keys[i]. Object[] rather than T[] so we can swap - // refs without unchecked-array-of-generic gymnastics. - private Object[] curKeys; + // Single open-addressed table per cycle. The stored UTF8BytesString IS the slot identity -- + // equality is checked by comparing its underlying String against the incoming CharSequence. private UTF8BytesString[] curValues; - private Object[] priorKeys; private UTF8BytesString[] priorValues; private int curSize; @@ -54,43 +55,43 @@ public PropertyCardinalityHandler(int cardinalityLimit) { // <= 0.5 even when the budget is full, which keeps probe chains short. final int capacity = Integer.highestOneBit(cardinalityLimit * 2 - 1) << 1; this.capacityMask = capacity - 1; - this.curKeys = new Object[capacity]; this.curValues = new UTF8BytesString[capacity]; - this.priorKeys = new Object[capacity]; this.priorValues = new UTF8BytesString[capacity]; } public UTF8BytesString register(T value) { - final int slot = probe(this.curKeys, value); - if (this.curKeys[slot] != null) { + final int slot = probe(this.curValues, value); + final UTF8BytesString existing = this.curValues[slot]; + if (existing != null) { // Already seen this cycle -- consumed a budget slot earlier; reuse the cached UTF8. - return this.curValues[slot]; + return existing; } if (this.curSize >= this.cardinalityLimit) { return this.blockedByTracer(); } // First-time-this-cycle value. 
Reuse from the prior cycle if possible to avoid re-allocation. UTF8BytesString utf8; - final int priorSlot = probe(this.priorKeys, value); - if (this.priorKeys[priorSlot] != null) { - utf8 = this.priorValues[priorSlot]; + final int priorSlot = probe(this.priorValues, value); + final UTF8BytesString priorMatch = this.priorValues[priorSlot]; + if (priorMatch != null) { + utf8 = priorMatch; } else { utf8 = UTF8BytesString.create(value); } - this.curKeys[slot] = value; this.curValues[slot] = utf8; this.curSize += 1; return utf8; } /** - * Linear-probe to find {@code value}'s slot: either the slot occupied by an equal key, or the - * first empty slot in the probe chain. Capacity is a power of two; mask with {@link - * #capacityMask}. + * Linear-probe to find {@code value}'s slot: either the slot occupied by a content-equal + * UTF8BytesString, or the first empty slot in the probe chain. {@link UTF8BytesString#hashCode} + * is content-stable with the underlying String, so the same content hashes to the same slot + * regardless of whether the input is a String or UTF8BytesString. */ - private int probe(Object[] keys, T value) { + private int probe(UTF8BytesString[] values, T value) { int idx = value.hashCode() & this.capacityMask; - while (keys[idx] != null && !keys[idx].equals(value)) { + while (values[idx] != null && !values[idx].toString().contentEquals(value)) { idx = (idx + 1) & this.capacityMask; } return idx; @@ -107,15 +108,11 @@ private UTF8BytesString blockedByTracer() { public void reset() { // Flip pointers: the just-completed cycle becomes prior; what was prior (2 cycles ago) is // recycled into the new (empty) current. - final Object[] tmpKeys = this.priorKeys; - final UTF8BytesString[] tmpValues = this.priorValues; - this.priorKeys = this.curKeys; + final UTF8BytesString[] tmp = this.priorValues; this.priorValues = this.curValues; - this.curKeys = tmpKeys; - this.curValues = tmpValues; + this.curValues = tmp; // Null the new current. 
The values pulled out of prior are still reachable through any // AggregateEntry rows they ended up populating; this just drops the handler's references. - Arrays.fill(this.curKeys, null); Arrays.fill(this.curValues, null); this.curSize = 0; } From 10ca111adb0670cea11edb4b911df8b4e96d3baf Mon Sep 17 00:00:00 2001 From: Douglas Q Hawkins Date: Tue, 19 May 2026 21:38:23 -0400 Subject: [PATCH 20/33] Drop type parameter from PropertyCardinalityHandler The type parameter was load-bearing when slot identity went through a parallel Object[] keys array (where T determined the runtime class whose equals/hashCode the HashMap used). The single-array shape probes via UTF8BytesString.hashCode() (content-stable with the underlying String) and stored.toString().contentEquals(value), so any CharSequence input -- String, UTF8BytesString, anything else with a content-stable hash -- collapses to the right slot. register(CharSequence value) is enough. AggregateEntry's 9 static handler declarations and the registerOrEmpty helper lose their type parameters too. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../trace/common/metrics/AggregateEntry.java | 45 +++++++++---------- .../metrics/PropertyCardinalityHandler.java | 23 +++++----- .../metrics/CardinalityHandlerTest.java | 6 +-- 3 files changed, 35 insertions(+), 39 deletions(-) diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java index 3fa64b89a6f..43cc8c0e7e3 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java @@ -45,27 +45,26 @@ final class AggregateEntry extends Hashtable.Entry { public static final long ERROR_TAG = 0x8000000000000000L; public static final long TOP_LEVEL_TAG = 0x4000000000000000L; - // Per-field cardinality handlers. 
Each handler's type parameter matches the corresponding - // SpanSnapshot field type so the cache key class has well-defined equals/hashCode. Limits live - // on MetricCardinalityLimits -- see that class for per-field rationale. - static final PropertyCardinalityHandler RESOURCE_HANDLER = - new PropertyCardinalityHandler<>(MetricCardinalityLimits.RESOURCE); - static final PropertyCardinalityHandler SERVICE_HANDLER = - new PropertyCardinalityHandler<>(MetricCardinalityLimits.SERVICE); - static final PropertyCardinalityHandler OPERATION_HANDLER = - new PropertyCardinalityHandler<>(MetricCardinalityLimits.OPERATION); - static final PropertyCardinalityHandler SERVICE_SOURCE_HANDLER = - new PropertyCardinalityHandler<>(MetricCardinalityLimits.SERVICE_SOURCE); - static final PropertyCardinalityHandler TYPE_HANDLER = - new PropertyCardinalityHandler<>(MetricCardinalityLimits.TYPE); - static final PropertyCardinalityHandler SPAN_KIND_HANDLER = - new PropertyCardinalityHandler<>(MetricCardinalityLimits.SPAN_KIND); - static final PropertyCardinalityHandler HTTP_METHOD_HANDLER = - new PropertyCardinalityHandler<>(MetricCardinalityLimits.HTTP_METHOD); - static final PropertyCardinalityHandler HTTP_ENDPOINT_HANDLER = - new PropertyCardinalityHandler<>(MetricCardinalityLimits.HTTP_ENDPOINT); - static final PropertyCardinalityHandler GRPC_STATUS_CODE_HANDLER = - new PropertyCardinalityHandler<>(MetricCardinalityLimits.GRPC_STATUS_CODE); + // Per-field cardinality handlers. Limits live on MetricCardinalityLimits -- see that class for + // per-field rationale. 
+ static final PropertyCardinalityHandler RESOURCE_HANDLER = + new PropertyCardinalityHandler(MetricCardinalityLimits.RESOURCE); + static final PropertyCardinalityHandler SERVICE_HANDLER = + new PropertyCardinalityHandler(MetricCardinalityLimits.SERVICE); + static final PropertyCardinalityHandler OPERATION_HANDLER = + new PropertyCardinalityHandler(MetricCardinalityLimits.OPERATION); + static final PropertyCardinalityHandler SERVICE_SOURCE_HANDLER = + new PropertyCardinalityHandler(MetricCardinalityLimits.SERVICE_SOURCE); + static final PropertyCardinalityHandler TYPE_HANDLER = + new PropertyCardinalityHandler(MetricCardinalityLimits.TYPE); + static final PropertyCardinalityHandler SPAN_KIND_HANDLER = + new PropertyCardinalityHandler(MetricCardinalityLimits.SPAN_KIND); + static final PropertyCardinalityHandler HTTP_METHOD_HANDLER = + new PropertyCardinalityHandler(MetricCardinalityLimits.HTTP_METHOD); + static final PropertyCardinalityHandler HTTP_ENDPOINT_HANDLER = + new PropertyCardinalityHandler(MetricCardinalityLimits.HTTP_ENDPOINT); + static final PropertyCardinalityHandler GRPC_STATUS_CODE_HANDLER = + new PropertyCardinalityHandler(MetricCardinalityLimits.GRPC_STATUS_CODE); final UTF8BytesString resource; final UTF8BytesString service; @@ -560,8 +559,8 @@ AggregateEntry toEntry() { // ----- helpers ----- - private static UTF8BytesString registerOrEmpty( - PropertyCardinalityHandler handler, T value) { + private static UTF8BytesString registerOrEmpty( + PropertyCardinalityHandler handler, CharSequence value) { return value == null ? 
UTF8BytesString.EMPTY : handler.register(value); } diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/PropertyCardinalityHandler.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/PropertyCardinalityHandler.java index 357c34617a0..1d5d9077ffc 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/PropertyCardinalityHandler.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/PropertyCardinalityHandler.java @@ -6,19 +6,16 @@ /** * Cardinality-capped UTF8 canonicalizer for one property field. * - *
<p>
The type parameter {@code T} pins the input type per handler so the input class has - * well-defined, content-stable {@code hashCode}/{@code equals} (e.g. {@code String}) consistent - * with {@link UTF8BytesString#hashCode()} (which delegates to the underlying String). Each call - * site uses the type its {@code SpanSnapshot} field carries; the compiler then enforces type - * consistency across calls to a given handler. + *
<p>
Accepts any {@link CharSequence} input -- mixed {@code String}/{@code UTF8BytesString} of the + * same content collapse to one slot because {@link UTF8BytesString#hashCode()} delegates to the + * underlying String's hash and probe equality is the content-based {@code + * stored.toString().contentEquals(value)} (which fast-paths to {@code String.equals} when the input + * is a String). * *
<p>
Storage: open-addressed flat arrays with linear probing. Two parallel {@code * UTF8BytesString[]} tables -- "current cycle" and "prior cycle". Capacity is the next power of two - * {@code >= 2 * cardinalityLimit} so probes stay short even at the full budget. - * - *
<p>
The stored UTF8BytesString carries the slot's identity directly: probe equality is {@code - * stored.toString().contentEquals(value)}, which is the JDK's content-equality routine and - * fast-paths to {@code String.equals} when the input is a String. No parallel keys array needed. + * {@code >= 2 * cardinalityLimit} so probes stay short even at the full budget. The stored + * UTF8BytesString carries the slot's identity directly; no parallel keys array needed. * *

    *
  • The current table tracks which values have consumed a slot of the cardinality budget this @@ -34,7 +31,7 @@ * cycles pay zero UTF8 allocations after the first cycle, and the reused instances also * short-circuit downstream equality to identity comparisons. */ -public final class PropertyCardinalityHandler { +public final class PropertyCardinalityHandler { private final int cardinalityLimit; private final int capacityMask; @@ -59,7 +56,7 @@ public PropertyCardinalityHandler(int cardinalityLimit) { this.priorValues = new UTF8BytesString[capacity]; } - public UTF8BytesString register(T value) { + public UTF8BytesString register(CharSequence value) { final int slot = probe(this.curValues, value); final UTF8BytesString existing = this.curValues[slot]; if (existing != null) { @@ -89,7 +86,7 @@ public UTF8BytesString register(T value) { * is content-stable with the underlying String, so the same content hashes to the same slot * regardless of whether the input is a String or UTF8BytesString. 
*/ - private int probe(UTF8BytesString[] values, T value) { + private int probe(UTF8BytesString[] values, CharSequence value) { int idx = value.hashCode() & this.capacityMask; while (values[idx] != null && !values[idx].toString().contentEquals(value)) { idx = (idx + 1) & this.capacityMask; diff --git a/dd-trace-core/src/test/java/datadog/trace/common/metrics/CardinalityHandlerTest.java b/dd-trace-core/src/test/java/datadog/trace/common/metrics/CardinalityHandlerTest.java index 3ca8f51626e..bbdffb6061a 100644 --- a/dd-trace-core/src/test/java/datadog/trace/common/metrics/CardinalityHandlerTest.java +++ b/dd-trace-core/src/test/java/datadog/trace/common/metrics/CardinalityHandlerTest.java @@ -11,7 +11,7 @@ class CardinalityHandlerTest { @Test void propertyReturnsSameInstanceForRepeatedValueUntilLimit() { - PropertyCardinalityHandler h = new PropertyCardinalityHandler<>(3); + PropertyCardinalityHandler h = new PropertyCardinalityHandler(3); UTF8BytesString a1 = h.register("a"); UTF8BytesString a2 = h.register("a"); assertSame(a1, a2); @@ -20,7 +20,7 @@ void propertyReturnsSameInstanceForRepeatedValueUntilLimit() { @Test void propertyOverLimitReturnsBlockedSentinel() { - PropertyCardinalityHandler h = new PropertyCardinalityHandler<>(2); + PropertyCardinalityHandler h = new PropertyCardinalityHandler(2); UTF8BytesString a = h.register("a"); UTF8BytesString b = h.register("b"); UTF8BytesString blocked1 = h.register("c"); @@ -34,7 +34,7 @@ void propertyOverLimitReturnsBlockedSentinel() { @Test void propertyResetRefreshesBudget() { - PropertyCardinalityHandler h = new PropertyCardinalityHandler<>(2); + PropertyCardinalityHandler h = new PropertyCardinalityHandler(2); h.register("a"); h.register("b"); UTF8BytesString blocked = h.register("c"); From 4610078e64ef980ff9aabee69825d609c70e2270 Mon Sep 17 00:00:00 2001 From: Douglas Q Hawkins Date: Tue, 19 May 2026 21:46:51 -0400 Subject: [PATCH 21/33] Guard cardinality-handler ctor against pathological inputs - Both handlers 
now reject cardinalityLimit > 2^29 to prevent overflow in the (cardinalityLimit * 2 - 1) capacity calc. Practical limits are 8..512 so this is well beyond any realistic configuration. - TagCardinalityHandler's keys array is now String[] (was Object[]) to match the actual contract -- minor clarity win. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../metrics/PropertyCardinalityHandler.java | 6 ++++++ .../common/metrics/TagCardinalityHandler.java | 17 +++++++++++------ 2 files changed, 17 insertions(+), 6 deletions(-) diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/PropertyCardinalityHandler.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/PropertyCardinalityHandler.java index 1d5d9077ffc..59361c10b37 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/PropertyCardinalityHandler.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/PropertyCardinalityHandler.java @@ -47,6 +47,12 @@ public PropertyCardinalityHandler(int cardinalityLimit) { if (cardinalityLimit <= 0) { throw new IllegalArgumentException("cardinalityLimit must be positive: " + cardinalityLimit); } + // Upper bound prevents overflow in the (cardinalityLimit * 2 - 1) capacity calc below. + // Practical limits are 8..512; this cap is well beyond any realistic configuration. + if (cardinalityLimit > (1 << 29)) { + throw new IllegalArgumentException( + "cardinalityLimit must be at most 2^29: " + cardinalityLimit); + } this.cardinalityLimit = cardinalityLimit; // Capacity = next power of two >= 2 * cardinalityLimit. Linear-probing load factor stays // <= 0.5 even when the budget is full, which keeps probe chains short. 
diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/TagCardinalityHandler.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/TagCardinalityHandler.java index f5fa3d2482f..d7c37d51570 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/TagCardinalityHandler.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/TagCardinalityHandler.java @@ -15,9 +15,9 @@ public final class TagCardinalityHandler { private final int cardinalityLimit; private final int capacityMask; - private Object[] curKeys; + private String[] curKeys; private UTF8BytesString[] curValues; - private Object[] priorKeys; + private String[] priorKeys; private UTF8BytesString[] priorValues; private int curSize; @@ -27,13 +27,18 @@ public TagCardinalityHandler(String tag, int cardinalityLimit) { if (cardinalityLimit <= 0) { throw new IllegalArgumentException("cardinalityLimit must be positive: " + cardinalityLimit); } + // Upper bound prevents overflow in the (cardinalityLimit * 2 - 1) capacity calc below. 
+ if (cardinalityLimit > (1 << 29)) { + throw new IllegalArgumentException( + "cardinalityLimit must be at most 2^29: " + cardinalityLimit); + } this.tag = tag; this.cardinalityLimit = cardinalityLimit; final int capacity = Integer.highestOneBit(cardinalityLimit * 2 - 1) << 1; this.capacityMask = capacity - 1; - this.curKeys = new Object[capacity]; + this.curKeys = new String[capacity]; this.curValues = new UTF8BytesString[capacity]; - this.priorKeys = new Object[capacity]; + this.priorKeys = new String[capacity]; this.priorValues = new UTF8BytesString[capacity]; } @@ -58,7 +63,7 @@ public UTF8BytesString register(String value) { return utf8; } - private int probe(Object[] keys, String value) { + private int probe(String[] keys, String value) { int idx = value.hashCode() & this.capacityMask; while (keys[idx] != null && !keys[idx].equals(value)) { idx = (idx + 1) & this.capacityMask; @@ -75,7 +80,7 @@ private UTF8BytesString blockedByTracer() { } public void reset() { - final Object[] tmpKeys = this.priorKeys; + final String[] tmpKeys = this.priorKeys; final UTF8BytesString[] tmpValues = this.priorValues; this.priorKeys = this.curKeys; this.priorValues = this.curValues; From 713aa3406016a79a4cd95d41a4e26e5489a54b58 Mon Sep 17 00:00:00 2001 From: Douglas Q Hawkins Date: Tue, 19 May 2026 21:56:18 -0400 Subject: [PATCH 22/33] Make EMPTY the universal absent sentinel for AggregateEntry UTF8 fields PropertyCardinalityHandler.register(null) now returns UTF8BytesString .EMPTY. All AggregateEntry UTF8 fields are non-null. Callers stop checking for null at every site. - AggregateEntry: drop @Nullable on serviceSource/httpMethod/ httpEndpoint/grpcStatusCode (both the entry fields and the Canonical scratch buffer). Drop @Nullable on getters and on the of factory parameters. Drop the unused registerOrEmpty helper. - Canonical.populate: each field is now this.field = HANDLER.register (s.field) -- no inline conditionals. - of() factory: drop the value == null ? 
null : createUtf8(value) pattern; createUtf8 already returns EMPTY on null. - SerializingMetricWriter: switch the four presence checks from != null to != EMPTY (identity comparison on the singleton). Net win: nine identically-shaped call sites in Canonical.populate and a smaller null surface across the package. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../trace/common/metrics/AggregateEntry.java | 68 ++++++++----------- .../metrics/PropertyCardinalityHandler.java | 7 ++ .../metrics/SerializingMetricWriter.java | 13 ++-- .../SerializingMetricWriterTest.groovy | 9 +-- 4 files changed, 49 insertions(+), 48 deletions(-) diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java index 43cc8c0e7e3..aa061b6e9f4 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java @@ -10,7 +10,6 @@ import java.util.List; import java.util.Objects; import java.util.concurrent.atomic.AtomicLongArray; -import javax.annotation.Nullable; /** * Hashtable entry for the consumer-side aggregator. Holds the UTF8-encoded label fields (the data @@ -69,12 +68,15 @@ final class AggregateEntry extends Hashtable.Entry { final UTF8BytesString resource; final UTF8BytesString service; final UTF8BytesString operationName; - @Nullable final UTF8BytesString serviceSource; + // Optional fields use UTF8BytesString.EMPTY as the "absent" sentinel rather than null. The + // cardinality handlers map null inputs to EMPTY, and createUtf8 does the same for the of(...) + // factory, so callers don't need to special-case absence. 
+ final UTF8BytesString serviceSource; final UTF8BytesString type; final UTF8BytesString spanKind; - @Nullable final UTF8BytesString httpMethod; - @Nullable final UTF8BytesString httpEndpoint; - @Nullable final UTF8BytesString grpcStatusCode; + final UTF8BytesString httpMethod; + final UTF8BytesString httpEndpoint; + final UTF8BytesString grpcStatusCode; final short httpStatusCode; final boolean synthetic; final boolean traceRoot; @@ -205,25 +207,25 @@ static AggregateEntry of( CharSequence resource, CharSequence service, CharSequence operationName, - @Nullable CharSequence serviceSource, + CharSequence serviceSource, CharSequence type, int httpStatusCode, boolean synthetic, boolean traceRoot, CharSequence spanKind, - @Nullable List peerTags, - @Nullable CharSequence httpMethod, - @Nullable CharSequence httpEndpoint, - @Nullable CharSequence grpcStatusCode) { + List peerTags, + CharSequence httpMethod, + CharSequence httpEndpoint, + CharSequence grpcStatusCode) { UTF8BytesString resourceUtf = createUtf8(resource); UTF8BytesString serviceUtf = createUtf8(service); UTF8BytesString operationNameUtf = createUtf8(operationName); - UTF8BytesString serviceSourceUtf = serviceSource == null ? null : createUtf8(serviceSource); + UTF8BytesString serviceSourceUtf = createUtf8(serviceSource); UTF8BytesString typeUtf = createUtf8(type); UTF8BytesString spanKindUtf = createUtf8(spanKind); - UTF8BytesString httpMethodUtf = httpMethod == null ? null : createUtf8(httpMethod); - UTF8BytesString httpEndpointUtf = httpEndpoint == null ? null : createUtf8(httpEndpoint); - UTF8BytesString grpcUtf = grpcStatusCode == null ? null : createUtf8(grpcStatusCode); + UTF8BytesString httpMethodUtf = createUtf8(httpMethod); + UTF8BytesString httpEndpointUtf = createUtf8(httpEndpoint); + UTF8BytesString grpcUtf = createUtf8(grpcStatusCode); List peerTagsList = peerTags == null ? 
Collections.emptyList() : peerTags; long keyHash = hashOf( @@ -330,7 +332,6 @@ UTF8BytesString getOperationName() { return operationName; } - @Nullable UTF8BytesString getServiceSource() { return serviceSource; } @@ -343,17 +344,14 @@ UTF8BytesString getSpanKind() { return spanKind; } - @Nullable UTF8BytesString getHttpMethod() { return httpMethod; } - @Nullable UTF8BytesString getHttpEndpoint() { return httpEndpoint; } - @Nullable UTF8BytesString getGrpcStatusCode() { return grpcStatusCode; } @@ -416,12 +414,12 @@ static final class Canonical { UTF8BytesString resource; UTF8BytesString service; UTF8BytesString operationName; - @Nullable UTF8BytesString serviceSource; + UTF8BytesString serviceSource; UTF8BytesString type; UTF8BytesString spanKind; - @Nullable UTF8BytesString httpMethod; - @Nullable UTF8BytesString httpEndpoint; - @Nullable UTF8BytesString grpcStatusCode; + UTF8BytesString httpMethod; + UTF8BytesString httpEndpoint; + UTF8BytesString grpcStatusCode; short httpStatusCode; boolean synthetic; boolean traceRoot; @@ -437,18 +435,15 @@ static final class Canonical { /** Canonicalize all fields from {@code s} through the handlers into this buffer. */ void populate(SpanSnapshot s) { - this.resource = registerOrEmpty(RESOURCE_HANDLER, s.resourceName); - this.service = registerOrEmpty(SERVICE_HANDLER, s.serviceName); - this.operationName = registerOrEmpty(OPERATION_HANDLER, s.operationName); - this.serviceSource = - s.serviceNameSource == null ? null : SERVICE_SOURCE_HANDLER.register(s.serviceNameSource); - this.type = registerOrEmpty(TYPE_HANDLER, s.spanType); - this.spanKind = registerOrEmpty(SPAN_KIND_HANDLER, s.spanKind); - this.httpMethod = s.httpMethod == null ? null : HTTP_METHOD_HANDLER.register(s.httpMethod); - this.httpEndpoint = - s.httpEndpoint == null ? null : HTTP_ENDPOINT_HANDLER.register(s.httpEndpoint); - this.grpcStatusCode = - s.grpcStatusCode == null ? 
null : GRPC_STATUS_CODE_HANDLER.register(s.grpcStatusCode); + this.resource = RESOURCE_HANDLER.register(s.resourceName); + this.service = SERVICE_HANDLER.register(s.serviceName); + this.operationName = OPERATION_HANDLER.register(s.operationName); + this.serviceSource = SERVICE_SOURCE_HANDLER.register(s.serviceNameSource); + this.type = TYPE_HANDLER.register(s.spanType); + this.spanKind = SPAN_KIND_HANDLER.register(s.spanKind); + this.httpMethod = HTTP_METHOD_HANDLER.register(s.httpMethod); + this.httpEndpoint = HTTP_ENDPOINT_HANDLER.register(s.httpEndpoint); + this.grpcStatusCode = GRPC_STATUS_CODE_HANDLER.register(s.grpcStatusCode); this.httpStatusCode = s.httpStatusCode; this.synthetic = s.synthetic; this.traceRoot = s.traceRoot; @@ -559,11 +554,6 @@ AggregateEntry toEntry() { // ----- helpers ----- - private static UTF8BytesString registerOrEmpty( - PropertyCardinalityHandler handler, CharSequence value) { - return value == null ? UTF8BytesString.EMPTY : handler.register(value); - } - /** Direct {@link UTF8BytesString} creation that bypasses the cardinality handlers. */ private static UTF8BytesString createUtf8(CharSequence cs) { if (cs == null) { diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/PropertyCardinalityHandler.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/PropertyCardinalityHandler.java index 59361c10b37..164ecffd05c 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/PropertyCardinalityHandler.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/PropertyCardinalityHandler.java @@ -62,7 +62,14 @@ public PropertyCardinalityHandler(int cardinalityLimit) { this.priorValues = new UTF8BytesString[capacity]; } + /** + * Canonicalizes {@code value} through the cardinality budget and per-cycle reuse cache. Null + * inputs map to {@link UTF8BytesString#EMPTY} -- callers don't need to pre-check. 
+ */ public UTF8BytesString register(CharSequence value) { + if (value == null) { + return UTF8BytesString.EMPTY; + } final int slot = probe(this.curValues, value); final UTF8BytesString existing = this.curValues[slot]; if (existing != null) { diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/SerializingMetricWriter.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/SerializingMetricWriter.java index 7644ebaf044..f592dfe26f6 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/SerializingMetricWriter.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/SerializingMetricWriter.java @@ -143,11 +143,14 @@ public void startBucket(int metricCount, long start, long duration) { @Override public void add(AggregateEntry entry) { - // Calculate dynamic map size based on optional fields - final boolean hasHttpMethod = entry.getHttpMethod() != null; - final boolean hasHttpEndpoint = entry.getHttpEndpoint() != null; - final boolean hasServiceSource = entry.getServiceSource() != null; - final boolean hasGrpcStatusCode = entry.getGrpcStatusCode() != null; + // Calculate dynamic map size based on optional fields. AggregateEntry uses + // UTF8BytesString.EMPTY + // as the "absent" sentinel for these optional fields (see AggregateEntry); identity comparison + // against the singleton. + final boolean hasHttpMethod = entry.getHttpMethod() != EMPTY; + final boolean hasHttpEndpoint = entry.getHttpEndpoint() != EMPTY; + final boolean hasServiceSource = entry.getServiceSource() != EMPTY; + final boolean hasGrpcStatusCode = entry.getGrpcStatusCode() != EMPTY; final int mapSize = 15 + (hasServiceSource ? 
1 : 0) diff --git a/dd-trace-core/src/test/groovy/datadog/trace/common/metrics/SerializingMetricWriterTest.groovy b/dd-trace-core/src/test/groovy/datadog/trace/common/metrics/SerializingMetricWriterTest.groovy index c4f20a1c210..1e5f21e13e0 100644 --- a/dd-trace-core/src/test/groovy/datadog/trace/common/metrics/SerializingMetricWriterTest.groovy +++ b/dd-trace-core/src/test/groovy/datadog/trace/common/metrics/SerializingMetricWriterTest.groovy @@ -1,6 +1,7 @@ package datadog.trace.common.metrics import static datadog.trace.api.config.GeneralConfig.EXPERIMENTAL_PROPAGATE_PROCESS_TAGS_ENABLED +import static datadog.trace.bootstrap.instrumentation.api.UTF8BytesString.EMPTY import static java.util.concurrent.TimeUnit.MILLISECONDS import static java.util.concurrent.TimeUnit.SECONDS @@ -287,10 +288,10 @@ class SerializingMetricWriterTest extends DDSpecification { // counters now live on AggregateEntry int metricMapSize = unpacker.unpackMapHeader() // Calculate expected map size based on optional fields - boolean hasHttpMethod = entry.getHttpMethod() != null - boolean hasHttpEndpoint = entry.getHttpEndpoint() != null - boolean hasServiceSource = entry.getServiceSource() != null - boolean hasGrpcStatusCode = entry.getGrpcStatusCode() != null + boolean hasHttpMethod = entry.getHttpMethod() != EMPTY + boolean hasHttpEndpoint = entry.getHttpEndpoint() != EMPTY + boolean hasServiceSource = entry.getServiceSource() != EMPTY + boolean hasGrpcStatusCode = entry.getGrpcStatusCode() != EMPTY int expectedMapSize = 15 + (hasServiceSource ? 1 : 0) + (hasHttpMethod ? 1 : 0) + (hasHttpEndpoint ? 1 : 0) + (hasGrpcStatusCode ? 
1 : 0) assert metricMapSize == expectedMapSize int elementCount = 0 From 617bc51e011448c566dc1ae433ebda4b84e358e3 Mon Sep 17 00:00:00 2001 From: Douglas Q Hawkins Date: Tue, 19 May 2026 22:04:52 -0400 Subject: [PATCH 23/33] Use EMPTY consistently for absent values in peer-tag canonicalization - TagCardinalityHandler.register now mirrors PropertyCardinalityHandler: null input returns UTF8BytesString.EMPTY. - Canonical.populatePeerTags now calls register for every schema slot and tests the result against EMPTY rather than the input against null. The wire-format buffer still holds only present peer tags (EMPTY is elided), but the check is now consistent with how AggregateEntry's scalar UTF8 fields handle absence. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../datadog/trace/common/metrics/AggregateEntry.java | 11 ++++++----- .../trace/common/metrics/TagCardinalityHandler.java | 7 +++++++ 2 files changed, 13 insertions(+), 5 deletions(-) diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java index aa061b6e9f4..91202db20a3 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java @@ -467,8 +467,9 @@ void populate(SpanSnapshot s) { /** * Fills {@link #peerTagsBuffer} with canonical UTF8 forms, applying {@code schema.handler(i)} - * to each non-null value at the same index. No allocation when the schema/values are absent or - * all values are null (buffer is just cleared). + * to each value at the same index. Handler returns {@code EMPTY} for null inputs; we elide + * those from the buffer so the wire-format list-of-pairs only contains present peer tags. No + * allocation when the schema/values are absent or all values are null (buffer is just cleared). 
*/ private void populatePeerTags(PeerTagSchema schema, String[] values) { peerTagsBuffer.clear(); @@ -477,9 +478,9 @@ private void populatePeerTags(PeerTagSchema schema, String[] values) { } int n = schema.size(); for (int i = 0; i < n; i++) { - String v = values[i]; - if (v != null) { - peerTagsBuffer.add(schema.handler(i).register(v)); + UTF8BytesString utf8 = schema.handler(i).register(values[i]); + if (utf8 != UTF8BytesString.EMPTY) { + peerTagsBuffer.add(utf8); } } } diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/TagCardinalityHandler.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/TagCardinalityHandler.java index d7c37d51570..2f0e7dbaa4d 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/TagCardinalityHandler.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/TagCardinalityHandler.java @@ -42,7 +42,14 @@ public TagCardinalityHandler(String tag, int cardinalityLimit) { this.priorValues = new UTF8BytesString[capacity]; } + /** + * Canonicalizes {@code value} through the cardinality budget and per-cycle reuse cache. Null + * inputs map to {@link UTF8BytesString#EMPTY} -- callers don't need to pre-check. + */ public UTF8BytesString register(String value) { + if (value == null) { + return UTF8BytesString.EMPTY; + } final int slot = probe(this.curKeys, value); if (this.curKeys[slot] != null) { return this.curValues[slot]; From e1dec836a22f0e6fab6067c17303f7a26ecee219 Mon Sep 17 00:00:00 2001 From: Douglas Q Hawkins Date: Tue, 19 May 2026 22:16:05 -0400 Subject: [PATCH 24/33] Tighten handler visibility + add tests for EMPTY-on-null contract #4: PropertyCardinalityHandler and TagCardinalityHandler are only consumed within this package; drop `public` from the class declarations, constructors, and methods. They're package-private now. 
#6: Add tests that lock down the EMPTY-on-null contract that the rest of the package depends on: - CardinalityHandlerTest covers both handlers: register(null) -> EMPTY, and registering null repeatedly doesn't consume the cardinality budget. - AggregateEntryTest covers the entry shape: optional fields built from a snapshot with null inputs resolve to EMPTY; populated optional fields carry their value. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../metrics/PropertyCardinalityHandler.java | 8 ++-- .../common/metrics/TagCardinalityHandler.java | 8 ++-- .../common/metrics/AggregateEntryTest.java | 42 +++++++++++++++++++ .../metrics/CardinalityHandlerTest.java | 29 +++++++++++++ 4 files changed, 79 insertions(+), 8 deletions(-) diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/PropertyCardinalityHandler.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/PropertyCardinalityHandler.java index 164ecffd05c..f43d1864fc8 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/PropertyCardinalityHandler.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/PropertyCardinalityHandler.java @@ -31,7 +31,7 @@ * cycles pay zero UTF8 allocations after the first cycle, and the reused instances also * short-circuit downstream equality to identity comparisons. 
*/ -public final class PropertyCardinalityHandler { +final class PropertyCardinalityHandler { private final int cardinalityLimit; private final int capacityMask; @@ -43,7 +43,7 @@ public final class PropertyCardinalityHandler { private UTF8BytesString cacheBlocked = null; - public PropertyCardinalityHandler(int cardinalityLimit) { + PropertyCardinalityHandler(int cardinalityLimit) { if (cardinalityLimit <= 0) { throw new IllegalArgumentException("cardinalityLimit must be positive: " + cardinalityLimit); } @@ -66,7 +66,7 @@ public PropertyCardinalityHandler(int cardinalityLimit) { * Canonicalizes {@code value} through the cardinality budget and per-cycle reuse cache. Null * inputs map to {@link UTF8BytesString#EMPTY} -- callers don't need to pre-check. */ - public UTF8BytesString register(CharSequence value) { + UTF8BytesString register(CharSequence value) { if (value == null) { return UTF8BytesString.EMPTY; } @@ -115,7 +115,7 @@ private UTF8BytesString blockedByTracer() { return cacheBlocked; } - public void reset() { + void reset() { // Flip pointers: the just-completed cycle becomes prior; what was prior (2 cycles ago) is // recycled into the new (empty) current. final UTF8BytesString[] tmp = this.priorValues; diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/TagCardinalityHandler.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/TagCardinalityHandler.java index 2f0e7dbaa4d..c8a0b8779e3 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/TagCardinalityHandler.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/TagCardinalityHandler.java @@ -10,7 +10,7 @@ *
<p>
    Same open-addressed flat-array + prior-cycle reuse design as {@link * PropertyCardinalityHandler} -- see that class for full description. */ -public final class TagCardinalityHandler { +final class TagCardinalityHandler { private final String tag; private final int cardinalityLimit; private final int capacityMask; @@ -23,7 +23,7 @@ public final class TagCardinalityHandler { private UTF8BytesString cacheBlocked = null; - public TagCardinalityHandler(String tag, int cardinalityLimit) { + TagCardinalityHandler(String tag, int cardinalityLimit) { if (cardinalityLimit <= 0) { throw new IllegalArgumentException("cardinalityLimit must be positive: " + cardinalityLimit); } @@ -46,7 +46,7 @@ public TagCardinalityHandler(String tag, int cardinalityLimit) { * Canonicalizes {@code value} through the cardinality budget and per-cycle reuse cache. Null * inputs map to {@link UTF8BytesString#EMPTY} -- callers don't need to pre-check. */ - public UTF8BytesString register(String value) { + UTF8BytesString register(String value) { if (value == null) { return UTF8BytesString.EMPTY; } @@ -86,7 +86,7 @@ private UTF8BytesString blockedByTracer() { return cacheBlocked; } - public void reset() { + void reset() { final String[] tmpKeys = this.priorKeys; final UTF8BytesString[] tmpValues = this.priorValues; this.priorKeys = this.curKeys; diff --git a/dd-trace-core/src/test/java/datadog/trace/common/metrics/AggregateEntryTest.java b/dd-trace-core/src/test/java/datadog/trace/common/metrics/AggregateEntryTest.java index 25a08d94b23..057478d46a4 100644 --- a/dd-trace-core/src/test/java/datadog/trace/common/metrics/AggregateEntryTest.java +++ b/dd-trace-core/src/test/java/datadog/trace/common/metrics/AggregateEntryTest.java @@ -1,8 +1,11 @@ package datadog.trace.common.metrics; +import static datadog.trace.bootstrap.instrumentation.api.UTF8BytesString.EMPTY; import static datadog.trace.common.metrics.AggregateEntry.ERROR_TAG; import static 
datadog.trace.common.metrics.AggregateEntry.TOP_LEVEL_TAG; import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertNotSame; +import static org.junit.jupiter.api.Assertions.assertSame; import static org.junit.jupiter.api.Assertions.assertTrue; import datadog.metrics.agent.AgentMeter; @@ -86,6 +89,45 @@ void okAndErrorLatenciesTrackedSeparately() { assertTrue(entry.getOkLatencies().getMaxValue() <= 5); } + @Test + void absentOptionalFieldsResolveToEmptySentinel() { + // serviceSource / httpMethod / httpEndpoint / grpcStatusCode = null on input -> EMPTY on the + // entry. EMPTY is the universal "absent" sentinel; SerializingMetricWriter and equality use + // identity comparison against it. + AggregateEntry entry = newEntry(); + assertSame(EMPTY, entry.getServiceSource()); + assertSame(EMPTY, entry.getHttpMethod()); + assertSame(EMPTY, entry.getHttpEndpoint()); + assertSame(EMPTY, entry.getGrpcStatusCode()); + } + + @Test + void presentOptionalFieldsCarryTheirValue() { + AggregateEntry entry = + AggregateEntry.of( + "resource", + "svc", + "op", + "src", + "type", + 200, + false, + true, + "client", + null, + "GET", + "/api/v1/foo", + "0"); + assertNotSame(EMPTY, entry.getServiceSource()); + assertNotSame(EMPTY, entry.getHttpMethod()); + assertNotSame(EMPTY, entry.getHttpEndpoint()); + assertNotSame(EMPTY, entry.getGrpcStatusCode()); + assertEquals("src", entry.getServiceSource().toString()); + assertEquals("GET", entry.getHttpMethod().toString()); + assertEquals("/api/v1/foo", entry.getHttpEndpoint().toString()); + assertEquals("0", entry.getGrpcStatusCode().toString()); + } + private static AggregateEntry newEntry() { return AggregateEntry.of( "resource", "svc", "op", null, "type", 200, false, true, "client", null, null, null, null); diff --git a/dd-trace-core/src/test/java/datadog/trace/common/metrics/CardinalityHandlerTest.java 
b/dd-trace-core/src/test/java/datadog/trace/common/metrics/CardinalityHandlerTest.java index bbdffb6061a..b6b3a216e5a 100644 --- a/dd-trace-core/src/test/java/datadog/trace/common/metrics/CardinalityHandlerTest.java +++ b/dd-trace-core/src/test/java/datadog/trace/common/metrics/CardinalityHandlerTest.java @@ -85,4 +85,33 @@ void tagResetRefreshesBudgetAndSentinelStaysStable() { // Both are the same sentinel instance (cacheBlocked is not cleared on reset). assertSame(blockedBefore, blockedAfter); } + + @Test + void propertyRegisterOfNullReturnsEmpty() { + PropertyCardinalityHandler h = new PropertyCardinalityHandler(4); + // Null input short-circuits to UTF8BytesString.EMPTY -- the universal "absent" sentinel that + // AggregateEntry's optional UTF8 fields use in place of null. + assertSame(UTF8BytesString.EMPTY, h.register(null)); + } + + @Test + void propertyRegisterOfNullDoesNotConsumeBudget() { + PropertyCardinalityHandler h = new PropertyCardinalityHandler(2); + h.register(null); + h.register(null); + h.register(null); + // Three null registrations didn't consume the budget; two real values still fit. + assertEquals("a", h.register("a").toString()); + assertEquals("b", h.register("b").toString()); + // Third real value spills to the blocked sentinel (limit = 2). + assertEquals("blocked_by_tracer", h.register("c").toString()); + } + + @Test + void tagRegisterOfNullReturnsEmpty() { + TagCardinalityHandler h = new TagCardinalityHandler("peer.hostname", 4); + // Null returns EMPTY (no "tag:" prefix applied -- the sentinel is the same EMPTY singleton + // every handler returns for null input). 
+ assertSame(UTF8BytesString.EMPTY, h.register(null)); + } } From 2336bb5aa5da26dfa0c1b5fa72d6c5496507435e Mon Sep 17 00:00:00 2001 From: Douglas Q Hawkins Date: Tue, 19 May 2026 23:36:38 -0400 Subject: [PATCH 25/33] Notify on peer-tag cardinality blocks Adds a per-cycle one-shot warn log + HealthMetrics counter (`stats.tag_cardinality_blocked` with `tag:`) when a peer-tag value gets collapsed to the `blocked_by_tracer` sentinel because its cardinality budget is exhausted. Implemented as a `register(int i, String value)` method on `PeerTagSchema` that does the post-block notification work; `TagCardinalityHandler` exposes `blockedSentinel()` so the schema can identity-compare and stays free of logger / health metric coupling. Warn-once gating uses a `Set` of names seen this cycle, cleared by `resetCardinalityHandlers()`. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../trace/common/metrics/AggregateEntry.java | 11 ++- .../common/metrics/ClientStatsAggregator.java | 3 +- .../trace/common/metrics/PeerTagSchema.java | 95 +++++++++++++++---- .../common/metrics/TagCardinalityHandler.java | 9 ++ .../trace/core/monitor/HealthMetrics.java | 9 ++ .../core/monitor/TracerHealthMetrics.java | 5 + .../common/metrics/AggregateTableTest.java | 3 +- 7 files changed, 109 insertions(+), 26 deletions(-) diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java index 91202db20a3..8f2ae1cc6b3 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java @@ -466,10 +466,11 @@ void populate(SpanSnapshot s) { } /** - * Fills {@link #peerTagsBuffer} with canonical UTF8 forms, applying {@code schema.handler(i)} - * to each value at the same index. 
Handler returns {@code EMPTY} for null inputs; we elide - * those from the buffer so the wire-format list-of-pairs only contains present peer tags. No - * allocation when the schema/values are absent or all values are null (buffer is just cleared). + * Fills {@link #peerTagsBuffer} with canonical UTF8 forms, applying the schema's per-tag + * handler + warn-once notification at the same index. Returns {@code EMPTY} for null inputs; + * we elide those from the buffer so the wire-format list-of-pairs only contains present peer + * tags. No allocation when the schema/values are absent or all values are null (buffer is just + * cleared). */ private void populatePeerTags(PeerTagSchema schema, String[] values) { peerTagsBuffer.clear(); @@ -478,7 +479,7 @@ private void populatePeerTags(PeerTagSchema schema, String[] values) { } int n = schema.size(); for (int i = 0; i < n; i++) { - UTF8BytesString utf8 = schema.handler(i).register(values[i]); + UTF8BytesString utf8 = schema.register(i, values[i]); if (utf8 != UTF8BytesString.EMPTY) { peerTagsBuffer.add(utf8); } diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/ClientStatsAggregator.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/ClientStatsAggregator.java index eadef788bb0..1f212c0ed65 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/ClientStatsAggregator.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/ClientStatsAggregator.java @@ -374,7 +374,8 @@ private synchronized PeerTagSchema refreshPeerAggSchema(long revision) { } Set names = features.peerTags(); PeerTagSchema schema = - PeerTagSchema.of(names == null ? Collections.emptySet() : names, revision); + PeerTagSchema.of( + names == null ? 
Collections.emptySet() : names, revision, healthMetrics); cachedPeerAggSchema = schema; return schema; } diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/PeerTagSchema.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/PeerTagSchema.java index 0dc6e1c9e23..7fcdc00fd77 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/PeerTagSchema.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/PeerTagSchema.java @@ -2,41 +2,54 @@ import static datadog.trace.api.DDTags.BASE_SERVICE; +import datadog.trace.bootstrap.instrumentation.api.UTF8BytesString; +import datadog.trace.core.monitor.HealthMetrics; +import java.util.HashSet; import java.util.Set; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; /** * Parallel arrays of peer-tag names and their {@link TagCardinalityHandler}s, indexed in lockstep. * *

<p>Replaces the previous {@code Map} lookup with positional array * access: the producer captures span tag values into a {@code String[]} parallel to {@link #names}, - * and the consumer applies {@link #handler(int)} at the same index to canonicalize. + * and the consumer calls {@link #register(int, String)} at the same index to canonicalize the + * value through the per-tag cardinality handler. * *
<p>Two schemas exist: * *
<ul> *
<li>{@link #INTERNAL} -- a singleton with one entry for {@code base.service}, used for * internal-kind spans where only the base service is aggregated. - *
<li>A peer-aggregation schema built via {@link #of(Set, long)} for {@code client}/{@code - * producer}/{@code consumer} spans. {@link ClientStatsAggregator} caches the most recently - * built schema and compares its {@link #peerTagsRevision} against {@code + *
<li>A peer-aggregation schema built via {@link #of(Set, long, HealthMetrics)} for {@code + * client}/{@code producer}/{@code consumer} spans. {@link ClientStatsAggregator} caches the + * most recently built schema and compares its {@link #peerTagsRevision} against {@code * DDAgentFeaturesDiscovery.peerTagsRevision()} to decide when to rebuild. *
</ul> * + *
<p>Cardinality blocks emit a one-shot warn log per reporting cycle per tag (tracked via {@link + * #warnedCardinality}) and accumulate a per-tag block counter (tracked via {@link #blockedCounts}) + * that is flushed to {@link HealthMetrics#onTagCardinalityBlocked(String, long)} once per affected + * tag at cycle reset. All per-cycle state resets in {@link #resetCardinalityHandlers()}. + * *
<p>Each {@link SpanSnapshot} captures its own schema reference so producer and consumer agree on * the indexing even if the current schema is replaced between capture and consumption. * - *
<p>Thread-safety: {@link TagCardinalityHandler}s are not thread-safe and must only be * exercised on the aggregator thread. {@link #names} and {@link #peerTagsRevision} are final and * safe to read from any thread. + *
<p>
    Thread-safety: {@link TagCardinalityHandler}s and the warn-once set are not + * thread-safe and must only be exercised on the aggregator thread. {@link #names} and {@link + * #peerTagsRevision} are final and safe to read from any thread. */ final class PeerTagSchema { + private static final Logger log = LoggerFactory.getLogger(PeerTagSchema.class); + /** Sentinel revision for {@link #INTERNAL} -- it never changes. */ static final long INTERNAL_REVISION = -1L; /** Singleton schema for internal-kind spans -- only {@code base.service}. */ static final PeerTagSchema INTERNAL = - new PeerTagSchema(new String[] {BASE_SERVICE}, INTERNAL_REVISION); + new PeerTagSchema(new String[] {BASE_SERVICE}, INTERNAL_REVISION, HealthMetrics.NO_OP); final String[] names; final TagCardinalityHandler[] handlers; @@ -48,15 +61,34 @@ final class PeerTagSchema { */ final long peerTagsRevision; + private final HealthMetrics healthMetrics; + + /** + * Per-cycle warn-once gating. {@code Set.add(name)} returns true exactly the first time a tag + * gets blocked this cycle, which is the only time we want to emit the warn log. Cleared by + * {@link #resetCardinalityHandlers()}. + */ + private final Set warnedCardinality = new HashSet<>(); + + /** + * Per-tag block counter, indexed in lockstep with {@link #names}. Incremented on every blocked + * value during the cycle; flushed to {@link HealthMetrics#onTagCardinalityBlocked(String, long)} + * and zeroed in {@link #resetCardinalityHandlers()}. Single statsd call per affected tag per + * cycle keeps a misconfigured high-cardinality tag from flooding the metrics pipe. + */ + private final long[] blockedCounts; + /** Builds a schema for the given peer-tag names. Order is determined by the {@link Set}. 
*/ - static PeerTagSchema of(Set names, long peerTagsRevision) { - return new PeerTagSchema(names.toArray(new String[0]), peerTagsRevision); + static PeerTagSchema of(Set names, long peerTagsRevision, HealthMetrics healthMetrics) { + return new PeerTagSchema(names.toArray(new String[0]), peerTagsRevision, healthMetrics); } - private PeerTagSchema(String[] names, long peerTagsRevision) { + private PeerTagSchema(String[] names, long peerTagsRevision, HealthMetrics healthMetrics) { this.names = names; this.peerTagsRevision = peerTagsRevision; + this.healthMetrics = healthMetrics; this.handlers = new TagCardinalityHandler[names.length]; + this.blockedCounts = new long[names.length]; for (int i = 0; i < names.length; i++) { this.handlers[i] = new TagCardinalityHandler(names[i], MetricCardinalityLimits.PEER_TAG_VALUE); @@ -64,13 +96,42 @@ private PeerTagSchema(String[] names, long peerTagsRevision) { } /** - * Resets every {@link TagCardinalityHandler}'s working set. Must be called on the aggregator - * thread; handlers are not thread-safe. + * Canonicalizes the peer-tag value at slot {@code i}. Returns {@link UTF8BytesString#EMPTY} for + * null inputs and the handler's {@code ":blocked_by_tracer"} sentinel when the per-tag + * cardinality budget is exhausted. Increments the per-tag block counter on every block and emits + * a one-shot warn log per cycle per tag; the counter is flushed to {@link HealthMetrics} in + * {@link #resetCardinalityHandlers()}. 
+ */ + UTF8BytesString register(int i, String value) { + TagCardinalityHandler handler = handlers[i]; + UTF8BytesString result = handler.register(value); + if (handler.isBlockedResult(result)) { + blockedCounts[i]++; + String name = names[i]; + if (warnedCardinality.add(name)) { + log.warn( + "Cardinality limit reached for peer tag '{}'; further values are reported as" + + " 'blocked_by_tracer' until the next reporting cycle", + name); + } + } + return result; + } + + /** + * Resets every {@link TagCardinalityHandler}'s working set, flushes accumulated per-tag block + * counts to {@link HealthMetrics}, and clears the per-cycle warn-once tracking. Must be called + * on the aggregator thread; handlers are not thread-safe. */ void resetCardinalityHandlers() { - for (TagCardinalityHandler h : handlers) { - h.reset(); + for (int i = 0; i < handlers.length; i++) { + handlers[i].reset(); + if (blockedCounts[i] > 0) { + healthMetrics.onTagCardinalityBlocked(names[i], blockedCounts[i]); + blockedCounts[i] = 0; + } } + warnedCardinality.clear(); } int size() { @@ -80,8 +141,4 @@ int size() { String name(int i) { return names[i]; } - - TagCardinalityHandler handler(int i) { - return handlers[i]; - } } diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/TagCardinalityHandler.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/TagCardinalityHandler.java index c8a0b8779e3..d96f16f4024 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/TagCardinalityHandler.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/TagCardinalityHandler.java @@ -78,6 +78,15 @@ private int probe(String[] keys, String value) { return idx; } + /** + * Whether {@code result} (returned from a prior {@link #register} call) is this handler's + * blocked sentinel. The size check short-circuits the hot path so the sentinel is never + * materialized before any value has actually been blocked this cycle. 
+ */ + boolean isBlockedResult(UTF8BytesString result) { + return this.curSize >= this.cardinalityLimit && result == blockedByTracer(); + } + private UTF8BytesString blockedByTracer() { UTF8BytesString cacheBlocked = this.cacheBlocked; if (cacheBlocked != null) return cacheBlocked; diff --git a/dd-trace-core/src/main/java/datadog/trace/core/monitor/HealthMetrics.java b/dd-trace-core/src/main/java/datadog/trace/core/monitor/HealthMetrics.java index d1c7fe126b4..6f9a263f593 100644 --- a/dd-trace-core/src/main/java/datadog/trace/core/monitor/HealthMetrics.java +++ b/dd-trace-core/src/main/java/datadog/trace/core/monitor/HealthMetrics.java @@ -98,6 +98,15 @@ public void onStatsAggregateDropped() {} */ public void onStatsInboxFull() {} + /** + * Reports a batch of {@code count} tag values collapsed into the {@code blocked_by_tracer} + * sentinel for {@code tag} during the just-completed reporting cycle (per-tag cardinality budget + * exhausted, or per-value length cap exceeded). Called from the aggregator thread once per + * affected tag at cycle reset, so the implementation can do a single counter update rather than + * one per blocked value. + */ + public void onTagCardinalityBlocked(String tag, long count) {} + /** * @return Human-readable summary of the current health metrics. 
*/ diff --git a/dd-trace-core/src/main/java/datadog/trace/core/monitor/TracerHealthMetrics.java b/dd-trace-core/src/main/java/datadog/trace/core/monitor/TracerHealthMetrics.java index db384a7e42e..c00ef708abf 100644 --- a/dd-trace-core/src/main/java/datadog/trace/core/monitor/TracerHealthMetrics.java +++ b/dd-trace-core/src/main/java/datadog/trace/core/monitor/TracerHealthMetrics.java @@ -363,6 +363,11 @@ public void onStatsInboxFull() { statsInboxFull.increment(); } + @Override + public void onTagCardinalityBlocked(String tag, long count) { + statsd.count("stats.tag_cardinality_blocked", count, new String[] {"tag:" + tag}); + } + @Override public void close() { if (null != cancellation) { diff --git a/dd-trace-core/src/test/java/datadog/trace/common/metrics/AggregateTableTest.java b/dd-trace-core/src/test/java/datadog/trace/common/metrics/AggregateTableTest.java index 57ac6ddef8b..c90594b1895 100644 --- a/dd-trace-core/src/test/java/datadog/trace/common/metrics/AggregateTableTest.java +++ b/dd-trace-core/src/test/java/datadog/trace/common/metrics/AggregateTableTest.java @@ -238,7 +238,8 @@ SnapshotBuilder peerTags(String... namesAndValues) { for (int i = 0; i < namesAndValues.length; i += 2) { names.add(namesAndValues[i]); } - this.peerTagSchema = PeerTagSchema.of(names, 0L); + this.peerTagSchema = + PeerTagSchema.of(names, 0L, datadog.trace.core.monitor.HealthMetrics.NO_OP); this.peerTagValues = new String[peerTagSchema.size()]; for (int i = 0; i < namesAndValues.length; i += 2) { for (int j = 0; j < peerTagSchema.size(); j++) { From 5b6c5aae5f5497965b7a8ce22beda227312fcc16 Mon Sep 17 00:00:00 2001 From: Douglas Q Hawkins Date: Wed, 20 May 2026 13:24:08 -0400 Subject: [PATCH 26/33] Address PR #11387 review: dual-role docs, rename, @Nullable, consumer-side reconcile - PropertyCardinalityHandler / TagCardinalityHandler: header comment explaining the limiter-and-cache dual role and the prior-cycle reuse trick that preserves UTF8 caching across resets. 
- ClientStatsAggregator: rename peerAggSchema -> peerTagSchema across field, method, and parameter; disambiguate the inner per-span local as spanPeerTagSchema (return of peerTagSchemaFor). - SpanSnapshot: replace prose "or null" docstrings with javax.annotation.@Nullable on peerTagSchema/peerTagValues fields and their constructor params. - Consumer-side peer-tag reconciliation: * DDAgentFeaturesDiscovery: drop State.peerTagsRevision + bump logic + peerTagsRevision() accessor. Expose getLastTimeDiscovered(). * PeerTagSchema: rename peerTagsRevision -> lastTimeDiscovered, drop final (consumer-thread-only mutation), add hasSameTagsAs(Set). * ClientStatsAggregator: producer hot path is now a single volatile read with a one-time synchronized bootstrap; resetCardinalityHandlers runs reconcilePeerTagSchema first, which fast-paths on timestamp equality and either bumps in place (preserving warm handlers when the tag set is unchanged) or swaps in a fresh schema. The schema's timestamp field no longer needs to be volatile because mutation is confined to the aggregator thread. Note: the @Nullable annotations on AggregateEntry's errorLatencies and related fields only apply after the downstream lazy-init / Canonical buffer work; those land in a separate commit on the downstream branches. 
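For reference, a minimal sketch of the consumer-side reconcile decision described in the ClientStatsAggregator bullet above. It is illustrative only: ReconcileSketch, Schema, and Outcome are hypothetical names that do not exist in this patch, the real reconcile method returns void, and health metrics and cardinality handlers are left out.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical stand-in for the aggregator's once-per-cycle reconcile step.
final class ReconcileSketch {
  enum Outcome { SKIPPED, BUMPED, SWAPPED }

  // Hypothetical stand-in for PeerTagSchema: tag names plus a discovery timestamp.
  static final class Schema {
    final String[] names;
    // Plain (non-volatile) field: in this sketch only the consumer thread writes it.
    long lastTimeDiscovered;

    Schema(Set<String> names, long lastTimeDiscovered) {
      this.names = names.toArray(new String[0]);
      this.lastTimeDiscovered = lastTimeDiscovered;
    }

    boolean hasSameTagsAs(Set<String> other) {
      if (names.length != other.size()) {
        return false;
      }
      for (String name : names) {
        if (!other.contains(name)) {
          return false;
        }
      }
      return true;
    }
  }

  // volatile so a swapped-in schema becomes visible to producer threads.
  volatile Schema cached;

  Outcome reconcile(Set<String> latestNames, long latestDiscoveredAt) {
    Schema current = cached;
    // Fast path: nothing cached yet, or discovery has not refreshed since the last check.
    if (current == null || current.lastTimeDiscovered == latestDiscoveredAt) {
      return Outcome.SKIPPED;
    }
    Set<String> normalized =
        latestNames == null ? Collections.<String>emptySet() : latestNames;
    if (current.hasSameTagsAs(normalized)) {
      // Discovery refreshed but the tag set is unchanged: keep the schema, bump the timestamp.
      current.lastTimeDiscovered = latestDiscoveredAt;
      return Outcome.BUMPED;
    }
    // Tag set changed: build a new schema and publish it through the volatile field.
    cached = new Schema(normalized, latestDiscoveredAt);
    return Outcome.SWAPPED;
  }

  public static void main(String[] args) {
    ReconcileSketch r = new ReconcileSketch();
    Set<String> tags = new LinkedHashSet<>(Arrays.asList("peer.hostname"));
    r.cached = new Schema(tags, 1000L);
    System.out.println(r.reconcile(tags, 1000L)); // same timestamp
    System.out.println(r.reconcile(tags, 2000L)); // new timestamp, same tags
    System.out.println(r.reconcile(Collections.singleton("peer.service"), 3000L)); // new tags
  }
}
```

The point of the in-place bump is that the existing schema object (and, in the real code, its warm cardinality handlers) survives a discovery refresh that did not change the tag set; only a real tag change pays for a rebuild and a volatile write.
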
Co-Authored-By: Claude Opus 4.7 (1M context) --- .../ddagent/DDAgentFeaturesDiscovery.java | 15 +- .../common/metrics/ClientStatsAggregator.java | 138 +++++++++++------- .../trace/common/metrics/PeerTagSchema.java | 68 ++++++--- .../metrics/PropertyCardinalityHandler.java | 11 ++ .../trace/common/metrics/SpanSnapshot.java | 10 +- .../common/metrics/TagCardinalityHandler.java | 16 +- 6 files changed, 166 insertions(+), 92 deletions(-) diff --git a/communication/src/main/java/datadog/communication/ddagent/DDAgentFeaturesDiscovery.java b/communication/src/main/java/datadog/communication/ddagent/DDAgentFeaturesDiscovery.java index 387491a426a..514ab59ec3a 100644 --- a/communication/src/main/java/datadog/communication/ddagent/DDAgentFeaturesDiscovery.java +++ b/communication/src/main/java/datadog/communication/ddagent/DDAgentFeaturesDiscovery.java @@ -101,7 +101,6 @@ private static class State { String version; String telemetryProxyEndpoint; Set peerTags = emptySet(); - long peerTagsRevision; long lastTimeDiscovered; } @@ -145,8 +144,6 @@ private synchronized void discoverIfOutdated(final long maxElapsedMs) { final State newState = new State(); doDiscovery(newState); newState.lastTimeDiscovered = now; - newState.peerTagsRevision = - previous.peerTagsRevision + (newState.peerTags.equals(previous.peerTags) ? 0L : 1L); // swap atomically states discoveryState = newState; } @@ -408,13 +405,13 @@ public Set peerTags() { } /** - * Monotonically increasing counter bumped each time {@link #peerTags()} produces a Set that is - * not equal to the previous one. Callers can compare this against a cached snapshot to detect - * peer-tag config changes without re-comparing the Sets themselves -- e.g. the client-stats - * aggregator uses it to decide when to rebuild its {@code PeerTagSchema}. + * Wall-clock timestamp ({@link System#currentTimeMillis()}) of the most recent successful + * feature discovery, or {@code 0L} if discovery has never run. Callers (e.g. 
the client-stats + * aggregator) snapshot this alongside {@link #peerTags()} to detect when discovery has refreshed + * and a cached view of feature state may be stale. */ - public long peerTagsRevision() { - return discoveryState.peerTagsRevision; + public long getLastTimeDiscovered() { + return discoveryState.lastTimeDiscovered; } public String getMetricsEndpoint() { diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/ClientStatsAggregator.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/ClientStatsAggregator.java index 1f212c0ed65..393181b5936 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/ClientStatsAggregator.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/ClientStatsAggregator.java @@ -73,17 +73,22 @@ public final class ClientStatsAggregator implements MetricsAggregator, EventList private final boolean includeEndpointInMetrics; /** - * Cached peer-aggregation schema. The schema carries its own {@link - * PeerTagSchema#peerTagsRevision} (the {@link DDAgentFeaturesDiscovery#peerTagsRevision()} value - * it was built from); {@link #publish(List)} compares that against the current revision and only - * rebuilds when they differ. An empty schema (size 0) represents the "peer tags unconfigured" - * state; {@code null} only on the bootstrap window before the first publish. + * Cached peer-tag schema. Producers read this reference once per trace and pass it through to the + * consumer in {@link SpanSnapshot}; they never inspect the schema's timestamp or rebuild it. + * Reconciliation is the aggregator thread's job: {@link #resetCardinalityHandlers()} compares the + * schema's {@link PeerTagSchema#lastTimeDiscovered} against {@link + * DDAgentFeaturesDiscovery#getLastTimeDiscovered()} once per reporting cycle and either updates + * the timestamp in place (when the tag set is unchanged, preserving the schema's warm cardinality + * handlers) or swaps in a freshly-built schema. * - *

<p>{@code volatile} because {@code publish} is called on arbitrary producer threads. The reset - * hook ({@link #resetCardinalityHandlers()}) runs on the aggregator thread and only mutates the - * schema's internal handler state (not this field). + *

<p>An empty schema (size 0) represents the "peer tags unconfigured" state; {@code null} only on + * the bootstrap window before {@link #bootstrapPeerTagSchema()} runs on the first publish. + * + *

    {@code volatile} so the consumer's reconcile-time replacement is visible to producer + * threads; the schema's own internal mutable state (handlers, block counters, timestamp) is + * exercised only on the aggregator thread. */ - private volatile PeerTagSchema cachedPeerAggSchema; + private volatile PeerTagSchema cachedPeerTagSchema; private volatile AgentTaskScheduler.Scheduled cancellation; @@ -261,10 +266,14 @@ public boolean publish(List> trace) { boolean forceKeep = false; int counted = 0; if (features.supportsMetrics()) { - // Sync the peer-aggregation schema once per trace. The cache is keyed on - // features.peerTagsRevision(), which only bumps when the agent's peer-tag set actually - // changes -- so the steady-state cost is a volatile read and a long compare. - PeerTagSchema peerAggSchema = peerAggSchema(features.peerTagsRevision()); + // Producer-side fast path: one volatile read and use whatever schema is currently cached. + // The aggregator thread keeps this schema in sync with feature discovery in + // resetCardinalityHandlers(). The only producer-side rebuild is the one-time bootstrap on + // the first publish. 
+ PeerTagSchema peerTagSchema = cachedPeerTagSchema; + if (peerTagSchema == null) { + peerTagSchema = bootstrapPeerTagSchema(); + } for (CoreSpan span : trace) { boolean isTopLevel = span.isTopLevel(); if (shouldComputeMetric(span, isTopLevel)) { @@ -275,7 +284,7 @@ public boolean publish(List> trace) { break; } counted++; - forceKeep |= publish(span, isTopLevel, peerAggSchema); + forceKeep |= publish(span, isTopLevel, peerTagSchema); } } healthMetrics.onClientStatTraceComputed(counted, trace.size(), !forceKeep); @@ -290,7 +299,7 @@ private boolean shouldComputeMetric(CoreSpan span, boolean isTopLevel) { && span.getDurationNano() > 0; } - private boolean publish(CoreSpan span, boolean isTopLevel, PeerTagSchema peerAggSchema) { + private boolean publish(CoreSpan span, boolean isTopLevel, PeerTagSchema peerTagSchema) { // Extract HTTP method and endpoint only if the feature is enabled String httpMethod = null; String httpEndpoint = null; @@ -318,13 +327,13 @@ private boolean publish(CoreSpan span, boolean isTopLevel, PeerTagSchema peer long tagAndDuration = span.getDurationNano() | (error ? ERROR_TAG : 0L) | (isTopLevel ? TOP_LEVEL_TAG : 0L); - PeerTagSchema peerTagSchema = peerTagSchemaFor(span, peerAggSchema); + PeerTagSchema spanPeerTagSchema = peerTagSchemaFor(span, peerTagSchema); String[] peerTagValues = - peerTagSchema == null ? null : capturePeerTagValues(span, peerTagSchema); + spanPeerTagSchema == null ? null : capturePeerTagValues(span, spanPeerTagSchema); if (peerTagValues == null) { // capture returned no non-null values -- drop the schema reference so the consumer doesn't // bother iterating an all-null array. 
- peerTagSchema = null; + spanPeerTagSchema = null; } SpanSnapshot snapshot = @@ -338,7 +347,7 @@ private boolean publish(CoreSpan span, boolean isTopLevel, PeerTagSchema peer isSynthetic(span), span.getParentId() == 0, spanKind, - peerTagSchema, + spanPeerTagSchema, peerTagValues, httpMethod, httpEndpoint, @@ -352,57 +361,84 @@ private boolean publish(CoreSpan span, boolean isTopLevel, PeerTagSchema peer } /** - * Returns the peer-aggregation schema synced to the given revision, rebuilding it if the cached - * one is stale. Fast path: one volatile read + a long compare against the schema's own embedded - * revision. Rebuild is rare (peer-tag config changes), so the synchronization is only on the slow - * path. Always returns non-null -- an empty schema (size 0) represents the "peer tags - * unconfigured" state so subsequent calls still short-circuit on the fast path. + * One-time producer-side bootstrap of {@link #cachedPeerTagSchema}. Synchronized double-check + * guards against two producers racing on the very first publish; after this returns, {@code + * cachedPeerTagSchema} is non-null forever and the aggregator thread is the sole subsequent + * mutator (see {@link #reconcilePeerTagSchema()}). */ - private PeerTagSchema peerAggSchema(long revision) { - PeerTagSchema cached = cachedPeerAggSchema; - if (cached != null && cached.peerTagsRevision == revision) { + private synchronized PeerTagSchema bootstrapPeerTagSchema() { + PeerTagSchema cached = cachedPeerTagSchema; + if (cached != null) { return cached; } - return refreshPeerAggSchema(revision); + PeerTagSchema schema = buildPeerTagSchema(); + cachedPeerTagSchema = schema; + return schema; } - private synchronized PeerTagSchema refreshPeerAggSchema(long revision) { - // Double-checked: another producer may have rebuilt while we were waiting on the monitor. 
- PeerTagSchema cached = cachedPeerAggSchema; - if (cached != null && cached.peerTagsRevision == revision) { - return cached; - } + /** Builds a fresh {@link PeerTagSchema} from the current state of feature discovery. */ + private PeerTagSchema buildPeerTagSchema() { Set names = features.peerTags(); - PeerTagSchema schema = - PeerTagSchema.of( - names == null ? Collections.emptySet() : names, revision, healthMetrics); - cachedPeerAggSchema = schema; - return schema; + return PeerTagSchema.of( + names == null ? Collections.emptySet() : names, + features.getLastTimeDiscovered(), + healthMetrics); } /** - * Single reset hook invoked on the aggregator thread at the end of each report cycle. Resets all - * cardinality state in lockstep: the static property handlers + {@code PeerTagSchema.INTERNAL} - * (via {@link AggregateEntry#resetCardinalityHandlers()}) and the cached peer-aggregation schema. - * New handlers added anywhere in this pipeline should be reset from here. + * Single reset hook invoked on the aggregator thread at the end of each report cycle. Reconciles + * the cached peer-tag schema against the latest feature discovery, then resets all cardinality + * state in lockstep: the static property handlers + {@code PeerTagSchema.INTERNAL} (via {@link + * AggregateEntry#resetCardinalityHandlers()}) and the cached peer-tag schema (with whatever + * reconciliation just produced). New handlers added anywhere in this pipeline should be reset + * from here. */ private void resetCardinalityHandlers() { + reconcilePeerTagSchema(); AggregateEntry.resetCardinalityHandlers(); - PeerTagSchema schema = cachedPeerAggSchema; + PeerTagSchema schema = cachedPeerTagSchema; if (schema != null) { schema.resetCardinalityHandlers(); } } /** - * Picks the peer-tag schema for a span. 
The {@code peerAggSchema} argument is the per-trace - * cached schema (synced from {@code features.peerTagsRevision()} once in {@link #publish(List)}) - * -- always non-null but possibly empty when peer tags are unconfigured. For internal-kind spans - * the static {@link PeerTagSchema#INTERNAL} schema is used regardless. + * Reconciles {@link #cachedPeerTagSchema} with the latest feature discovery. Runs on the + * aggregator thread once per reporting cycle. Cheap fast path: a long compare against the cached + * schema's embedded timestamp short-circuits when discovery hasn't refreshed since the schema was + * built. On mismatch, a set compare distinguishes "discovery refreshed but tags unchanged" (just + * bump the timestamp in place to preserve the warm cardinality handlers) from "tags actually + * changed" (build a new schema and swap the volatile reference). + */ + private void reconcilePeerTagSchema() { + PeerTagSchema cached = cachedPeerTagSchema; + if (cached == null) { + // First reset before the first publish -- producer-side bootstrap hasn't run yet. + return; + } + long latestDiscoveredAt = features.getLastTimeDiscovered(); + if (cached.lastTimeDiscovered == latestDiscoveredAt) { + return; + } + Set latestNames = features.peerTags(); + Set normalized = latestNames == null ? Collections.emptySet() : latestNames; + if (cached.hasSameTagsAs(normalized)) { + cached.lastTimeDiscovered = latestDiscoveredAt; + } else { + cachedPeerTagSchema = PeerTagSchema.of(normalized, latestDiscoveredAt, healthMetrics); + } + } + + /** + * Picks the peer-tag schema for a span. The {@code peerTagSchema} argument is the per-trace + * cached schema (read once in {@link #publish(List)} via the volatile {@link + * #cachedPeerTagSchema}, with {@link #bootstrapPeerTagSchema()} taking care of the first-publish + * window) -- always non-null but possibly empty when peer tags are unconfigured. 
For + * internal-kind spans the static {@link PeerTagSchema#INTERNAL} schema is used regardless. */ - private static PeerTagSchema peerTagSchemaFor(CoreSpan span, PeerTagSchema peerAggSchema) { - if (peerAggSchema.size() > 0 && span.isKind(PEER_AGGREGATION_KINDS)) { - return peerAggSchema; + private static PeerTagSchema peerTagSchemaFor(CoreSpan span, PeerTagSchema peerTagSchema) { + if (peerTagSchema.size() > 0 && span.isKind(PEER_AGGREGATION_KINDS)) { + return peerTagSchema; } if (span.isKind(INTERNAL_KIND)) { return PeerTagSchema.INTERNAL; diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/PeerTagSchema.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/PeerTagSchema.java index 7fcdc00fd77..d66b2e497d7 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/PeerTagSchema.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/PeerTagSchema.java @@ -14,8 +14,8 @@ * *

<p>Replaces the previous {@code Map} lookup with positional array * access: the producer captures span tag values into a {@code String[]} parallel to {@link #names}, - * and the consumer calls {@link #register(int, String)} at the same index to canonicalize the - * value through the per-tag cardinality handler. + * and the consumer calls {@link #register(int, String)} at the same index to canonicalize the value + * through the per-tag cardinality handler. * *

<p>Two schemas exist: * @@ -24,8 +24,9 @@ * internal-kind spans where only the base service is aggregated. *

<li>A peer-aggregation schema built via {@link #of(Set, long, HealthMetrics)} for {@code * client}/{@code producer}/{@code consumer} spans. {@link ClientStatsAggregator} caches the - * most recently built schema and compares its {@link #peerTagsRevision} against {@code - * DDAgentFeaturesDiscovery.peerTagsRevision()} to decide when to rebuild. + * most recently built schema and reconciles it on the aggregator thread once per reporting + * cycle by comparing {@link #lastTimeDiscovered} against {@code + * DDAgentFeaturesDiscovery.getLastTimeDiscovered()}. *
</ul> * *

<p>Cardinality blocks emit a one-shot warn log per reporting cycle per tag (tracked via {@link @@ -36,37 +37,39 @@ *

<p>Each {@link SpanSnapshot} captures its own schema reference so producer and consumer agree on * the indexing even if the current schema is replaced between capture and consumption. * - *

<p>Thread-safety: {@link TagCardinalityHandler}s and the warn-once set are not - * thread-safe and must only be exercised on the aggregator thread. {@link #names} and {@link - * #peerTagsRevision} are final and safe to read from any thread. + *

Thread-safety: all mutable state ({@link TagCardinalityHandler}s, the warn-once set, + * {@link #blockedCounts}, and {@link #lastTimeDiscovered}) is exercised only on the aggregator + * thread. {@link #names} and {@link #handlers} are final and safe to read from any thread; producer + * threads access them through the volatile {@code cachedPeerTagSchema} reference in {@link + * ClientStatsAggregator}. */ final class PeerTagSchema { private static final Logger log = LoggerFactory.getLogger(PeerTagSchema.class); - /** Sentinel revision for {@link #INTERNAL} -- it never changes. */ - static final long INTERNAL_REVISION = -1L; - /** Singleton schema for internal-kind spans -- only {@code base.service}. */ static final PeerTagSchema INTERNAL = - new PeerTagSchema(new String[] {BASE_SERVICE}, INTERNAL_REVISION, HealthMetrics.NO_OP); + // -1L sentinel; INTERNAL is never reconciled, so the value just has to be distinct from any + // real System.currentTimeMillis() that the aggregator might observe. + new PeerTagSchema(new String[] {BASE_SERVICE}, -1L, HealthMetrics.NO_OP); final String[] names; final TagCardinalityHandler[] handlers; /** - * The {@code DDAgentFeaturesDiscovery.peerTagsRevision()} value this schema was built from. Cache - * callers ({@link ClientStatsAggregator}) compare this against the current revision to decide - * whether to rebuild -- one final long carries the cache key on the schema itself. + * The {@code DDAgentFeaturesDiscovery.getLastTimeDiscovered()} value this schema was built from. + * The aggregator thread reads and updates this once per reporting cycle when reconciling against + * the latest discovery; producer threads never touch it. Plain (non-volatile, non-final) because + * the aggregator is the sole reader/writer. */ - final long peerTagsRevision; + long lastTimeDiscovered; private final HealthMetrics healthMetrics; /** * Per-cycle warn-once gating. 
{@code Set.add(name)} returns true exactly the first time a tag - * gets blocked this cycle, which is the only time we want to emit the warn log. Cleared by - * {@link #resetCardinalityHandlers()}. + * gets blocked this cycle, which is the only time we want to emit the warn log. Cleared by {@link + * #resetCardinalityHandlers()}. */ private final Set warnedCardinality = new HashSet<>(); @@ -79,13 +82,13 @@ final class PeerTagSchema { private final long[] blockedCounts; /** Builds a schema for the given peer-tag names. Order is determined by the {@link Set}. */ - static PeerTagSchema of(Set names, long peerTagsRevision, HealthMetrics healthMetrics) { - return new PeerTagSchema(names.toArray(new String[0]), peerTagsRevision, healthMetrics); + static PeerTagSchema of(Set names, long lastTimeDiscovered, HealthMetrics healthMetrics) { + return new PeerTagSchema(names.toArray(new String[0]), lastTimeDiscovered, healthMetrics); } - private PeerTagSchema(String[] names, long peerTagsRevision, HealthMetrics healthMetrics) { + private PeerTagSchema(String[] names, long lastTimeDiscovered, HealthMetrics healthMetrics) { this.names = names; - this.peerTagsRevision = peerTagsRevision; + this.lastTimeDiscovered = lastTimeDiscovered; this.healthMetrics = healthMetrics; this.handlers = new TagCardinalityHandler[names.length]; this.blockedCounts = new long[names.length]; @@ -95,6 +98,25 @@ private PeerTagSchema(String[] names, long peerTagsRevision, HealthMetrics healt } } + /** + * Whether this schema's tag names exactly match {@code other}. Used by the aggregator's reconcile + * path: when a feature discovery refresh bumps {@link + * DDAgentFeaturesDiscovery#getLastTimeDiscovered()} but the resulting set is unchanged, the + * aggregator can keep this schema (and its warm cardinality handlers) and just bump {@link + * #lastTimeDiscovered} instead of rebuilding. 
+ */ + boolean hasSameTagsAs(Set other) { + if (this.names.length != other.size()) { + return false; + } + for (String name : this.names) { + if (!other.contains(name)) { + return false; + } + } + return true; + } + /** * Canonicalizes the peer-tag value at slot {@code i}. Returns {@link UTF8BytesString#EMPTY} for * null inputs and the handler's {@code ":blocked_by_tracer"} sentinel when the per-tag @@ -120,8 +142,8 @@ UTF8BytesString register(int i, String value) { /** * Resets every {@link TagCardinalityHandler}'s working set, flushes accumulated per-tag block - * counts to {@link HealthMetrics}, and clears the per-cycle warn-once tracking. Must be called - * on the aggregator thread; handlers are not thread-safe. + * counts to {@link HealthMetrics}, and clears the per-cycle warn-once tracking. Must be called on + * the aggregator thread; handlers are not thread-safe. */ void resetCardinalityHandlers() { for (int i = 0; i < handlers.length; i++) { diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/PropertyCardinalityHandler.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/PropertyCardinalityHandler.java index f43d1864fc8..14af0bd0b27 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/PropertyCardinalityHandler.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/PropertyCardinalityHandler.java @@ -6,6 +6,17 @@ /** * Cardinality-capped UTF8 canonicalizer for one property field. * + *

<p>Dual role -- limiter and cache. Prior versions ran a per-field {@code DDCache} for UTF8 + * reuse with a separate global cardinality cap on top. Under high load that wasn't enough to stave + * off long GC cycles: every miss still concatenated / UTF8-encoded the value before the cache could + * store it. A cardinality limiter and a recent-value cache are both sets of recently used + * values, so this class collapses them into one structure. Cardinality limiting happens first, + * which lets the blocked path skip the concatenation and encoding entirely. + * + *

<p>A pure limiter would fully reset each reporting cycle and destroy the cache. To preserve UTF8 + * reuse across resets, the handler keeps the previous cycle's entries verbatim in a parallel table + * and reuses any matching {@link UTF8BytesString} when a value first appears in the new cycle. + * *

Accepts any {@link CharSequence} input -- mixed {@code String}/{@code UTF8BytesString} of the * same content collapse to one slot because {@link UTF8BytesString#hashCode()} delegates to the * underlying String's hash and probe equality is the content-based {@code diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/SpanSnapshot.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/SpanSnapshot.java index 4fce49d0695..7b44029cfcd 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/SpanSnapshot.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/SpanSnapshot.java @@ -1,5 +1,7 @@ package datadog.trace.common.metrics; +import javax.annotation.Nullable; + /** * Immutable per-span value posted from the producer to the aggregator thread. Carries the raw * inputs the aggregator needs to look up or build an {@link AggregateEntry} and update its @@ -25,14 +27,14 @@ final class SpanSnapshot implements InboxItem { * carries the names + {@link TagCardinalityHandler}s in parallel array form; {@code * peerTagValues} holds the per-span tag values at the same indices. */ - final PeerTagSchema peerTagSchema; + @Nullable final PeerTagSchema peerTagSchema; /** * Peer tag values captured from the span, parallel to {@code peerTagSchema.names}. A {@code null} * entry means the span didn't have that peer tag set. {@code null} (the whole array) when {@link * #peerTagSchema} is {@code null}. 
*/ - final String[] peerTagValues; + @Nullable final String[] peerTagValues; final String httpMethod; final String httpEndpoint; @@ -51,8 +53,8 @@ final class SpanSnapshot implements InboxItem { boolean synthetic, boolean traceRoot, String spanKind, - PeerTagSchema peerTagSchema, - String[] peerTagValues, + @Nullable PeerTagSchema peerTagSchema, + @Nullable String[] peerTagValues, String httpMethod, String httpEndpoint, String grpcStatusCode, diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/TagCardinalityHandler.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/TagCardinalityHandler.java index d96f16f4024..7cb6076dabc 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/TagCardinalityHandler.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/TagCardinalityHandler.java @@ -7,8 +7,14 @@ * Cardinality-capped UTF8 canonicalizer for one peer-tag name. Output is the pre-encoded {@code * "tag:value"} form the serializer writes. * - *

<p>Same open-addressed flat-array + prior-cycle reuse design as {@link - * PropertyCardinalityHandler} -- see that class for full description. + *

<p>Like {@link PropertyCardinalityHandler}, this serves a dual role -- cardinality limiter and + * UTF8 cache fused into one set of recently used values, with the prior cycle's entries retained so + * UTF8 reuse survives the per-cycle reset. See {@link PropertyCardinalityHandler} for the full + * rationale and storage layout. + * + *

The structural difference here is that the cached {@link UTF8BytesString} holds the {@code + * "tag:value"} concatenation rather than the bare value, so a parallel {@code String[]} keys table + * is needed to probe by the raw value. */ final class TagCardinalityHandler { private final String tag; @@ -79,9 +85,9 @@ private int probe(String[] keys, String value) { } /** - * Whether {@code result} (returned from a prior {@link #register} call) is this handler's - * blocked sentinel. The size check short-circuits the hot path so the sentinel is never - * materialized before any value has actually been blocked this cycle. + * Whether {@code result} (returned from a prior {@link #register} call) is this handler's blocked + * sentinel. The size check short-circuits the hot path so the sentinel is never materialized + * before any value has actually been blocked this cycle. */ boolean isBlockedResult(UTF8BytesString result) { return this.curSize >= this.cardinalityLimit && result == blockedByTracer(); From 3c3c8b1ce264d089646e49b3fb16b380f7bed8c1 Mon Sep 17 00:00:00 2001 From: Douglas Q Hawkins Date: Wed, 20 May 2026 13:47:05 -0400 Subject: [PATCH 27/33] Lock in cardinality-handler prior-cycle UTF8 reuse with explicit tests Addresses PR #11387 review (test coverage gap): - Fix misleading comment in propertyResetRefreshesBudget ("the previous instances aren't reused") -- they ARE reused; the test only passed because it asserted on .toString() content rather than identity. - Add propertyPriorCycleInstancesAreReusedAcrossReset: explicit assertSame check that registering the same value after a reset returns the SAME UTF8BytesString instance from the prior cycle. This is the "dual role as cache" property the canonical-key lookup depends on. - Add propertyPriorCycleReuseSurvivesOneResetButNotTwo: nails down the reuse window depth (one cycle, not two). 
- Add tagPriorCycleInstancesAreReusedAcrossReset mirroring the property handler test for the tag handler (cached "tag:value" UTF8BytesString). Co-Authored-By: Claude Opus 4.7 (1M context) --- .../metrics/CardinalityHandlerTest.java | 53 ++++++++++++++++++- 1 file changed, 52 insertions(+), 1 deletion(-) diff --git a/dd-trace-core/src/test/java/datadog/trace/common/metrics/CardinalityHandlerTest.java b/dd-trace-core/src/test/java/datadog/trace/common/metrics/CardinalityHandlerTest.java index b6b3a216e5a..08ecbdef628 100644 --- a/dd-trace-core/src/test/java/datadog/trace/common/metrics/CardinalityHandlerTest.java +++ b/dd-trace-core/src/test/java/datadog/trace/common/metrics/CardinalityHandlerTest.java @@ -42,7 +42,10 @@ void propertyResetRefreshesBudget() { h.reset(); - // After reset, three distinct values fit again, but the previous instances aren't reused. + // After reset, three distinct values fit again. Prior-cycle instances are reused + // (see propertyPriorCycleInstancesAreReusedAcrossReset for the dedicated check); here + // we just confirm that the budget refreshed so values previously blocked now have + // a slot. UTF8BytesString afterReset = h.register("a"); assertEquals("a", afterReset.toString()); UTF8BytesString c = h.register("c"); @@ -53,6 +56,39 @@ void propertyResetRefreshesBudget() { assertSame(blockedAgain, blockedYetAgain); } + @Test + void propertyPriorCycleInstancesAreReusedAcrossReset() { + // Dual role: the handler is also a UTF8 cache. Values held in the prior cycle are + // reused on the first registration in the new cycle, so aggregate entries that hold a + // reference to a UTF8BytesString still match on identity after the per-cycle reset. + // This is the cache-survives-reset property the canonical-key lookup depends on. 
+ PropertyCardinalityHandler h = new PropertyCardinalityHandler(4); + UTF8BytesString aBefore = h.register("a"); + UTF8BytesString bBefore = h.register("b"); + + h.reset(); + + assertSame(aBefore, h.register("a")); + assertSame(bBefore, h.register("b")); + // Same-cycle subsequent registration continues to return the reused instance. + assertSame(aBefore, h.register("a")); + } + + @Test + void propertyPriorCycleReuseSurvivesOneResetButNotTwo() { + // Reuse window is one cycle deep -- the handler swaps current/prior on reset, so a + // value last seen two cycles ago is no longer cached and will be re-allocated. + PropertyCardinalityHandler h = new PropertyCardinalityHandler(4); + UTF8BytesString first = h.register("a"); + + h.reset(); + h.reset(); + + UTF8BytesString afterTwoResets = h.register("a"); + assertNotSame(first, afterTwoResets); + assertEquals("a", afterTwoResets.toString()); + } + @Test void tagPrefixesValuesAndReusesUnderLimit() { TagCardinalityHandler h = new TagCardinalityHandler("peer.hostname", 4); @@ -86,6 +122,21 @@ void tagResetRefreshesBudgetAndSentinelStaysStable() { assertSame(blockedBefore, blockedAfter); } + @Test + void tagPriorCycleInstancesAreReusedAcrossReset() { + // Mirrors propertyPriorCycleInstancesAreReusedAcrossReset: the pre-built "tag:value" + // UTF8BytesString from the prior cycle is reused on the first registration in the new + // cycle -- no re-concatenation, no re-encoding. 
+ TagCardinalityHandler h = new TagCardinalityHandler("peer.hostname", 4); + UTF8BytesString hostABefore = h.register("host-a"); + UTF8BytesString hostBBefore = h.register("host-b"); + + h.reset(); + + assertSame(hostABefore, h.register("host-a")); + assertSame(hostBBefore, h.register("host-b")); + } + @Test void propertyRegisterOfNullReturnsEmpty() { PropertyCardinalityHandler h = new PropertyCardinalityHandler(4); From 9bbe2d0b96f270729ae99aea4213edf603725d59 Mon Sep 17 00:00:00 2001 From: Douglas Q Hawkins Date: Thu, 21 May 2026 11:51:06 -0400 Subject: [PATCH 28/33] Rename bootstrap test to ClientStatsAggregator + adapt PeerTagSchemaTest #11387's ClientStatsAggregator renames ConflatingMetricsAggregator; the test file's name and class refs need to match. PeerTagSchemaTest's PeerTagSchema.of() calls need the (Set, long, HealthMetrics) signature this branch introduced. Co-Authored-By: Claude Opus 4.7 (1M context) --- ...> ClientStatsAggregatorBootstrapTest.java} | 16 +++++------ .../common/metrics/PeerTagSchemaTest.java | 27 ++++++++++++++----- 2 files changed, 28 insertions(+), 15 deletions(-) rename dd-trace-core/src/test/java/datadog/trace/common/metrics/{ConflatingMetricsAggregatorBootstrapTest.java => ClientStatsAggregatorBootstrapTest.java} (96%) diff --git a/dd-trace-core/src/test/java/datadog/trace/common/metrics/ConflatingMetricsAggregatorBootstrapTest.java b/dd-trace-core/src/test/java/datadog/trace/common/metrics/ClientStatsAggregatorBootstrapTest.java similarity index 96% rename from dd-trace-core/src/test/java/datadog/trace/common/metrics/ConflatingMetricsAggregatorBootstrapTest.java rename to dd-trace-core/src/test/java/datadog/trace/common/metrics/ClientStatsAggregatorBootstrapTest.java index 76347e505c0..f6ee6ee8859 100644 --- a/dd-trace-core/src/test/java/datadog/trace/common/metrics/ConflatingMetricsAggregatorBootstrapTest.java +++ b/dd-trace-core/src/test/java/datadog/trace/common/metrics/ClientStatsAggregatorBootstrapTest.java @@ -21,7 +21,7 @@ 
import org.junit.jupiter.api.Test; /** - * Coverage for the {@code ConflatingMetricsAggregator} peer-tag schema bootstrap and reconcile + * Coverage for the {@code ClientStatsAggregator} peer-tag schema bootstrap and reconcile * paths. * *

<ul>
    @@ -36,7 +36,7 @@ * correctly across cycles. * </ul>
*/ -class ConflatingMetricsAggregatorBootstrapTest { +class ClientStatsAggregatorBootstrapTest { @Test void bootstrapHappensOnceOnFirstPublish() { @@ -50,8 +50,8 @@ void bootstrapHappensOnceOnFirstPublish() { when(features.peerTags()).thenReturn(Collections.singleton("peer.hostname")); when(features.getLastTimeDiscovered()).thenReturn(1000L); - ConflatingMetricsAggregator aggregator = - new ConflatingMetricsAggregator( + ClientStatsAggregator aggregator = + new ClientStatsAggregator( Collections.emptySet(), features, healthMetrics, @@ -87,8 +87,8 @@ void reconcileSkipsDeepCompareWhenTimestampMatches() throws Exception { when(features.peerTags()).thenReturn(Collections.singleton("peer.hostname")); when(features.getLastTimeDiscovered()).thenReturn(1000L); - ConflatingMetricsAggregator aggregator = - new ConflatingMetricsAggregator( + ClientStatsAggregator aggregator = + new ClientStatsAggregator( Collections.emptySet(), features, healthMetrics, @@ -156,8 +156,8 @@ void reconcileSurvivesTimestampBumpWhenTagsUnchanged() throws Exception { // Timestamp bumps every reconcile -- forces reconcile into the slow path each time. 
when(features.getLastTimeDiscovered()).thenReturn(1L, 2L, 3L); - ConflatingMetricsAggregator aggregator = - new ConflatingMetricsAggregator( + ClientStatsAggregator aggregator = + new ClientStatsAggregator( Collections.emptySet(), features, healthMetrics, diff --git a/dd-trace-core/src/test/java/datadog/trace/common/metrics/PeerTagSchemaTest.java b/dd-trace-core/src/test/java/datadog/trace/common/metrics/PeerTagSchemaTest.java index 6b9f557d046..4711cb09ca6 100644 --- a/dd-trace-core/src/test/java/datadog/trace/common/metrics/PeerTagSchemaTest.java +++ b/dd-trace-core/src/test/java/datadog/trace/common/metrics/PeerTagSchemaTest.java @@ -5,6 +5,7 @@ import static org.junit.jupiter.api.Assertions.assertFalse; import static org.junit.jupiter.api.Assertions.assertTrue; +import datadog.trace.core.monitor.HealthMetrics; import java.util.Arrays; import java.util.Collections; import java.util.HashSet; @@ -22,7 +23,7 @@ class PeerTagSchemaTest { @Test void ofBuildsSchemaFromSetWithTimestamp() { Set tags = new LinkedHashSet<>(Arrays.asList("peer.hostname", "peer.service")); - PeerTagSchema schema = PeerTagSchema.of(tags, 1234L); + PeerTagSchema schema = PeerTagSchema.of(tags, 1234L, HealthMetrics.NO_OP); assertArrayEquals(new String[] {"peer.hostname", "peer.service"}, schema.names); assertEquals(1234L, schema.lastTimeDiscovered); @@ -31,7 +32,8 @@ void ofBuildsSchemaFromSetWithTimestamp() { @Test void ofHandlesEmptySet() { - PeerTagSchema schema = PeerTagSchema.of(Collections.emptySet(), 0L); + PeerTagSchema schema = + PeerTagSchema.of(Collections.emptySet(), 0L, HealthMetrics.NO_OP); assertEquals(0, schema.size()); assertEquals(0, schema.names.length); @@ -46,7 +48,10 @@ void internalSingletonCarriesBaseService() { @Test void hasSameTagsAsReturnsTrueForExactMatch() { PeerTagSchema schema = - PeerTagSchema.of(new LinkedHashSet<>(Arrays.asList("peer.hostname", "peer.service")), 1L); + PeerTagSchema.of( + new LinkedHashSet<>(Arrays.asList("peer.hostname", "peer.service")), + 
1L, + HealthMetrics.NO_OP); // Same content via a different Set reference -- this is the case the reconcile fast-path // depends on (Set returned from a fresh discovery cycle is content-equal to the prior one). @@ -56,7 +61,9 @@ void hasSameTagsAsReturnsTrueForExactMatch() { @Test void hasSameTagsAsReturnsFalseWhenSetGrew() { - PeerTagSchema schema = PeerTagSchema.of(Collections.singleton("peer.hostname"), 1L); + PeerTagSchema schema = + PeerTagSchema.of( + Collections.singleton("peer.hostname"), 1L, HealthMetrics.NO_OP); Set larger = new HashSet<>(Arrays.asList("peer.hostname", "peer.service")); assertFalse(schema.hasSameTagsAs(larger)); @@ -65,21 +72,27 @@ void hasSameTagsAsReturnsFalseWhenSetGrew() { @Test void hasSameTagsAsReturnsFalseWhenSetShrank() { PeerTagSchema schema = - PeerTagSchema.of(new LinkedHashSet<>(Arrays.asList("peer.hostname", "peer.service")), 1L); + PeerTagSchema.of( + new LinkedHashSet<>(Arrays.asList("peer.hostname", "peer.service")), + 1L, + HealthMetrics.NO_OP); assertFalse(schema.hasSameTagsAs(Collections.singleton("peer.hostname"))); } @Test void hasSameTagsAsReturnsFalseWhenContentDifferent() { - PeerTagSchema schema = PeerTagSchema.of(Collections.singleton("peer.hostname"), 1L); + PeerTagSchema schema = + PeerTagSchema.of( + Collections.singleton("peer.hostname"), 1L, HealthMetrics.NO_OP); assertFalse(schema.hasSameTagsAs(Collections.singleton("peer.service"))); } @Test void hasSameTagsAsHandlesEmpty() { - PeerTagSchema empty = PeerTagSchema.of(Collections.emptySet(), 1L); + PeerTagSchema empty = + PeerTagSchema.of(Collections.emptySet(), 1L, HealthMetrics.NO_OP); assertTrue(empty.hasSameTagsAs(Collections.emptySet())); assertFalse(empty.hasSameTagsAs(Collections.singleton("peer.hostname"))); From 0c50037746813f36f9cc39aa4fa0e413c7e19ac2 Mon Sep 17 00:00:00 2001 From: Douglas Q Hawkins Date: Thu, 21 May 2026 16:22:49 -0400 Subject: [PATCH 29/33] Drop unused Tags imports flagged by codenarc Leftover from removing the isKind() override 
in TraceGenerator earlier in this session -- I dropped the SpanKindFilter import but missed datadog.trace.bootstrap.instrumentation.api.Tags, which is no longer referenced in either file. Resolves codenarcTest and codenarcTraceAgentTest UnusedImport violations. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../groovy/datadog/trace/common/writer/TraceGenerator.groovy | 1 - dd-trace-core/src/traceAgentTest/groovy/TraceGenerator.groovy | 1 - 2 files changed, 2 deletions(-) diff --git a/dd-trace-core/src/test/groovy/datadog/trace/common/writer/TraceGenerator.groovy b/dd-trace-core/src/test/groovy/datadog/trace/common/writer/TraceGenerator.groovy index 1e251f09bf2..a6b45b60aa7 100644 --- a/dd-trace-core/src/test/groovy/datadog/trace/common/writer/TraceGenerator.groovy +++ b/dd-trace-core/src/test/groovy/datadog/trace/common/writer/TraceGenerator.groovy @@ -11,7 +11,6 @@ import datadog.trace.api.ProcessTags import datadog.trace.api.TagMap import datadog.trace.api.sampling.PrioritySampling import datadog.trace.bootstrap.instrumentation.api.AgentSpanLink -import datadog.trace.bootstrap.instrumentation.api.Tags import datadog.trace.bootstrap.instrumentation.api.UTF8BytesString import datadog.trace.core.CoreSpan import datadog.trace.core.Metadata diff --git a/dd-trace-core/src/traceAgentTest/groovy/TraceGenerator.groovy b/dd-trace-core/src/traceAgentTest/groovy/TraceGenerator.groovy index e7b08915d5f..665739cfaff 100644 --- a/dd-trace-core/src/traceAgentTest/groovy/TraceGenerator.groovy +++ b/dd-trace-core/src/traceAgentTest/groovy/TraceGenerator.groovy @@ -9,7 +9,6 @@ import datadog.trace.api.DDTags import datadog.trace.api.DDTraceId import datadog.trace.api.IdGenerationStrategy import datadog.trace.api.TagMap -import datadog.trace.bootstrap.instrumentation.api.Tags import datadog.trace.bootstrap.instrumentation.api.UTF8BytesString import datadog.trace.core.CoreSpan import datadog.trace.core.Metadata From 078382f6f53ae78a2087d632aba98b61f5819c3c Mon Sep 17 00:00:00 2001 
From: Douglas Q Hawkins Date: Thu, 21 May 2026 16:50:11 -0400 Subject: [PATCH 30/33] Drop unused SpanKindFilter imports flagged by codenarc Leftover from earlier cleanup of the isKind() override -- #11387 hadn't yet cascaded that part, so the import is stale here too. Resolves codenarcTest and codenarcTraceAgentTest UnusedImport violations. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../groovy/datadog/trace/common/writer/TraceGenerator.groovy | 1 - dd-trace-core/src/traceAgentTest/groovy/TraceGenerator.groovy | 1 - 2 files changed, 2 deletions(-) diff --git a/dd-trace-core/src/test/groovy/datadog/trace/common/writer/TraceGenerator.groovy b/dd-trace-core/src/test/groovy/datadog/trace/common/writer/TraceGenerator.groovy index a6b45b60aa7..66bdbab137b 100644 --- a/dd-trace-core/src/test/groovy/datadog/trace/common/writer/TraceGenerator.groovy +++ b/dd-trace-core/src/test/groovy/datadog/trace/common/writer/TraceGenerator.groovy @@ -15,7 +15,6 @@ import datadog.trace.bootstrap.instrumentation.api.UTF8BytesString import datadog.trace.core.CoreSpan import datadog.trace.core.Metadata import datadog.trace.core.MetadataConsumer -import datadog.trace.core.SpanKindFilter import java.util.concurrent.ThreadLocalRandom import java.util.concurrent.TimeUnit diff --git a/dd-trace-core/src/traceAgentTest/groovy/TraceGenerator.groovy b/dd-trace-core/src/traceAgentTest/groovy/TraceGenerator.groovy index 665739cfaff..e668d0112a6 100644 --- a/dd-trace-core/src/traceAgentTest/groovy/TraceGenerator.groovy +++ b/dd-trace-core/src/traceAgentTest/groovy/TraceGenerator.groovy @@ -13,7 +13,6 @@ import datadog.trace.bootstrap.instrumentation.api.UTF8BytesString import datadog.trace.core.CoreSpan import datadog.trace.core.Metadata import datadog.trace.core.MetadataConsumer -import datadog.trace.core.SpanKindFilter import java.util.concurrent.ThreadLocalRandom import java.util.concurrent.TimeUnit From 4171d15c874a1c7e06053508dd023332ae618dda Mon Sep 17 00:00:00 2001 From: Douglas Q Hawkins 
Date: Thu, 21 May 2026 17:37:48 -0400 Subject: [PATCH 31/33] Sync client_metrics_design doc with reconcile-on-aggregator-thread The doc described an old design where the producer thread per-trace read a peerTagsRevision() and rebuilt the cached PeerTagSchema under a monitor. The actual implementation (cascaded from #11381) runs reconcile once per report cycle on the aggregator thread via the onReportCycle hook, keyed on getLastTimeDiscovered(). Producers do nothing more than a volatile read of the cached schema. Updates: - Producer-side flow: drop the per-trace sync description; document the volatile-read steady state and the one-time synchronized bootstrap on first publish. - New "Aggregator-side reconcile" section under "Reporting cadence and cardinality reset" describing the timestamp fast path, the same-tags slow path that preserves warm handlers, and the read-order race fix (timestamp before names). - Memory and lifetime: replace peerTagsRevision pairing with the on-schema lastTimeDiscovered + per-aggregator-instance lifecycle. - "Why the redesign" point 6: rewritten to describe the aggregator- thread reconcile rather than the producer-side revision check. Resolves dougqh's open review thread about peerTagsRevision vs lastTimeDiscovered. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/client_metrics_design.md | 74 ++++++++++++++++++++++------------- 1 file changed, 47 insertions(+), 27 deletions(-) diff --git a/docs/client_metrics_design.md b/docs/client_metrics_design.md index ca5f200c97f..bdf24b92274 100644 --- a/docs/client_metrics_design.md +++ b/docs/client_metrics_design.md @@ -63,19 +63,22 @@ Three rules govern the design: The producer holds **no shared state**. Per trace it: -1. Snapshots the current peer-aggregation schema **once per trace** (not per - span): +1. 
Reads the **cached peer-aggregation schema** from a volatile field on + `ClientStatsAggregator`: ```java - PeerTagSchema peerAggSchema = peerAggSchema(features.peerTagsRevision()); + PeerTagSchema schema = cachedPeerTagSchema; + if (schema == null) { schema = bootstrapPeerTagSchema(); } ``` - `peerAggSchema(...)` reads a `volatile long` revision held on the - aggregator and compares it to the value the cached `PeerTagSchema` was - built from. Match → return the cached schema (the common case, since - `peerTagsRevision()` only bumps when `DDAgentFeaturesDiscovery` observes a - peer-tag set that doesn't equal the previous one). Mismatch → take a - monitor on the aggregator, rebuild via `PeerTagSchema.of(names)`, and - publish the new schema + revision. The steady-state cost is one volatile - read + one long compare. + The steady-state cost is one volatile read. The producer does **not** + reconcile the schema against `DDAgentFeaturesDiscovery` — that's the + aggregator thread's job, run once per reporting cycle (see + [Aggregator-side reconcile](#aggregator-side-reconcile) below). + + The bootstrap path is a synchronized double-check that runs exactly once, + on the very first publish. It builds the initial schema by reading + `features.getLastTimeDiscovered()` *first*, then `features.peerTags()` + (read-order matters; see the inline Javadoc on `buildPeerTagSchema`). The + schema cache is per-`ClientStatsAggregator` instance, not static. 2. Iterates the trace; for each metrics-eligible span: @@ -93,7 +96,7 @@ The producer holds **no shared state**. Per trace it: trace is dropped on a match. - **Picks the peer-tag schema** (`peerTagSchemaFor`): for client/producer/ - consumer kinds → `peerAggSchema` (already synced for this trace); for + consumer kinds → the cached peer-aggregation schema from step 1; for internal-kind spans → `PeerTagSchema.INTERNAL` (single `base.service` entry); otherwise `null`. @@ -216,12 +219,24 @@ Two distinct cadences: handlers. 
The handlers reset *every reporting cycle*, so the per-field budgets refresh. -- **Schema sync**: `ClientStatsAggregator.peerAggSchema(long)` runs on the - producer thread per trace, keyed on `DDAgentFeaturesDiscovery.peerTagsRevision()`. - The cached schema is replaced when remote-config reconfigures the peer-tag - set (i.e., when the revision bumps). The schema's - `TagCardinalityHandler`s are reset on the aggregator thread each report - cycle via a hook passed into `Aggregator`. +- **Schema sync** (`reconcilePeerTagSchema`): + runs on the **aggregator thread** at the start of every report cycle, via a + hook (`onReportCycle`) passed into `Aggregator`. Fast path: compares the + cached schema's embedded `lastTimeDiscovered` against + `features.getLastTimeDiscovered()` — match → no-op. Mismatch path: reads + `features.peerTags()`; if the tag set is unchanged, just bumps the cached + schema's `lastTimeDiscovered` in place (preserving its warm + `TagCardinalityHandler`s); if the tag set changed, builds a fresh + `PeerTagSchema` and writes it to the volatile `cachedPeerTagSchema`. The + schema's `TagCardinalityHandler`s are reset alongside the property handlers + in the same cycle. + + **Read-order note.** `DDAgentFeaturesDiscovery` exposes `peerTags()` and + `getLastTimeDiscovered()` as separate accessors over its volatile state. + Both `buildPeerTagSchema` and `reconcilePeerTagSchema` read the timestamp + *before* the tag set so that an interleaving discovery refresh leaves the + schema "older than its names" rather than "newer", letting the next + reconcile cycle detect the mismatch and self-heal. ## Memory and lifetime @@ -231,10 +246,12 @@ Two distinct cadences: schedule-driven `REPORT`, drainer-driven inserts) route through the inbox. - `Canonical` and the cardinality handlers are aggregator-thread-only. 
- The cached `PeerTagSchema` lives on `ClientStatsAggregator` as a `volatile` - field paired with the `peerTagsRevision` it was built from; rebuild is - guarded by a monitor on the aggregator instance. The schema's - `TagCardinalityHandler`s themselves are aggregator-thread-only and are - reset alongside the property handlers each cycle. + field. Bootstrap (one-time, on the very first publish) is a synchronized + double-check; thereafter only the aggregator thread mutates the field, via + `reconcilePeerTagSchema` once per report cycle. The schema itself carries + the `lastTimeDiscovered` value it was built from. The schema's + `TagCardinalityHandler`s are aggregator-thread-only and are reset + alongside the property handlers each cycle. - Entries retain their `UTF8BytesString` references across handler resets; matches via content-equality so post-reset snapshots still resolve. - Cap: `tracerMetricsMaxAggregates` bounds table size. Cap-overrun policy: @@ -289,11 +306,14 @@ showed the producer dominating CPU time. The major shifts: `PeerTagSchema`; the producer carries values in a parallel `String[]`. The aggregator does the `tag:value` interning via `TagCardinalityHandler` on its own thread. -6. **Sync peer-tag schema once per trace.** The producer reads - `features.peerTagsRevision()` and compares it to the revision the cached - `PeerTagSchema` was built from; the steady-state cost is one volatile read - and one long compare. The cache lives on `ClientStatsAggregator`, not as - static state on `PeerTagSchema`. +6. **Move peer-tag schema reconcile off the producer.** The producer just + reads the volatile cached `PeerTagSchema` (steady-state: one volatile + read). Schema reconciliation runs once per report cycle on the aggregator + thread (`reconcilePeerTagSchema`), keyed on + `DDAgentFeaturesDiscovery.getLastTimeDiscovered()` with a same-tags + slow-path that preserves warm cardinality handlers across discovery + refreshes. 
The cache lives on `ClientStatsAggregator`, not as static + state on `PeerTagSchema`. 7. **Single owner of all shared state.** `disable()` routes through `CLEAR` rather than mutating the aggregate table directly. From 3fb86d32764dfd789bf48319612667ce42a552d3 Mon Sep 17 00:00:00 2001 From: Douglas Q Hawkins Date: Thu, 21 May 2026 23:06:44 -0400 Subject: [PATCH 32/33] Spread input hash before masking in cardinality-handler probes Both PropertyCardinalityHandler and TagCardinalityHandler linear-probe on (value.hashCode() & capacityMask). Without a spreader, inputs that share a low-bit pattern (e.g. URL templates with a common prefix, or String.hashCode values clustered around 0 for short strings) collapse onto the same probe chain. With the load factor capped at 0.5 the chain length is bounded but can still grow under pathological inputs. Mixing the input hash with its upper half (h ^ (h >>> 16)) before masking spreads the high bits down, same trick HashMap.hash uses. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../trace/common/metrics/PropertyCardinalityHandler.java | 7 ++++++- .../trace/common/metrics/TagCardinalityHandler.java | 8 +++++++- 2 files changed, 13 insertions(+), 2 deletions(-) diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/PropertyCardinalityHandler.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/PropertyCardinalityHandler.java index 14af0bd0b27..e9e257928f5 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/PropertyCardinalityHandler.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/PropertyCardinalityHandler.java @@ -109,9 +109,14 @@ UTF8BytesString register(CharSequence value) { * UTF8BytesString, or the first empty slot in the probe chain. {@link UTF8BytesString#hashCode} * is content-stable with the underlying String, so the same content hashes to the same slot * regardless of whether the input is a String or UTF8BytesString. + * + *

Mixes the input hash with its upper half ({@code h ^ (h >>> 16)}) before masking so that + * inputs sharing a low-bit pattern (e.g. URL templates with a common prefix) don't collapse onto + * the same probe chain. Same trick {@code HashMap.hash} uses. */ private int probe(UTF8BytesString[] values, CharSequence value) { - int idx = value.hashCode() & this.capacityMask; + int h = value.hashCode(); + int idx = (h ^ (h >>> 16)) & this.capacityMask; while (values[idx] != null && !values[idx].toString().contentEquals(value)) { idx = (idx + 1) & this.capacityMask; } diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/TagCardinalityHandler.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/TagCardinalityHandler.java index 7cb6076dabc..70725589045 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/TagCardinalityHandler.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/TagCardinalityHandler.java @@ -76,8 +76,14 @@ UTF8BytesString register(String value) { return utf8; } + /** + * Mixes the input hash with its upper half ({@code h ^ (h >>> 16)}) before masking so that inputs + * sharing a low-bit pattern don't collapse onto the same probe chain. Same trick {@code + * HashMap.hash} uses. + */ private int probe(String[] keys, String value) { - int idx = value.hashCode() & this.capacityMask; + int h = value.hashCode(); + int idx = (h ^ (h >>> 16)) & this.capacityMask; while (keys[idx] != null && !keys[idx].equals(value)) { idx = (idx + 1) & this.capacityMask; } From e5cfb549fdfbbc89d278cb97b9ea8bae1410f1a5 Mon Sep 17 00:00:00 2001 From: Douglas Q Hawkins Date: Fri, 22 May 2026 07:49:31 -0400 Subject: [PATCH 33/33] Apply Spotless Javadoc reflows on metrics files Pure formatting -- google-java-format reflows of Javadoc paragraph breaks and parameter wrapping. No behavior change. Picked up from a prior session's spotlessApply that wasn't bundled into the relevant commit. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- .../datadog/trace/common/metrics/AggregateEntry.java | 6 +++--- .../java/datadog/trace/common/metrics/PeerTagSchema.java | 8 ++++---- .../metrics/ClientStatsAggregatorBootstrapTest.java | 3 +-- .../datadog/trace/common/metrics/PeerTagSchemaTest.java | 9 +++------ 4 files changed, 11 insertions(+), 15 deletions(-) diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java index 27b359636f3..e5d8a59c7bd 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java @@ -446,9 +446,9 @@ void populate(SpanSnapshot s) { /** * Fills {@link #peerTagsBuffer} with canonical UTF8 forms, applying the schema's per-tag - * handler + warn-once notification at the same index. Returns {@code EMPTY} for null inputs; - * we elide those from the buffer so the wire-format list-of-pairs only contains present peer - * tags. No allocation when the schema/values are absent or all values are null (buffer is just + * handler + warn-once notification at the same index. Returns {@code EMPTY} for null inputs; we + * elide those from the buffer so the wire-format list-of-pairs only contains present peer tags. + * No allocation when the schema/values are absent or all values are null (buffer is just * cleared). 
*/ private void populatePeerTags(PeerTagSchema schema, String[] values) { diff --git a/dd-trace-core/src/main/java/datadog/trace/common/metrics/PeerTagSchema.java b/dd-trace-core/src/main/java/datadog/trace/common/metrics/PeerTagSchema.java index 295ab27117c..2b0fb8bcdc9 100644 --- a/dd-trace-core/src/main/java/datadog/trace/common/metrics/PeerTagSchema.java +++ b/dd-trace-core/src/main/java/datadog/trace/common/metrics/PeerTagSchema.java @@ -87,10 +87,10 @@ static PeerTagSchema of(Set names, long lastTimeDiscovered, HealthMetric } /** - * Test-only factory that takes the names array directly so tests can build a schema in a - * specific order without going through a {@link Set}. Uses {@link HealthMetrics#NO_OP} and a - * sentinel discovery timestamp; tests exercising the cardinality-handler reset path should use - * {@link #of(Set, long, HealthMetrics)} instead. + * Test-only factory that takes the names array directly so tests can build a schema in a specific + * order without going through a {@link Set}. Uses {@link HealthMetrics#NO_OP} and a sentinel + * discovery timestamp; tests exercising the cardinality-handler reset path should use {@link + * #of(Set, long, HealthMetrics)} instead. */ static PeerTagSchema testSchema(String[] names) { return new PeerTagSchema(names, 0L, HealthMetrics.NO_OP); diff --git a/dd-trace-core/src/test/java/datadog/trace/common/metrics/ClientStatsAggregatorBootstrapTest.java b/dd-trace-core/src/test/java/datadog/trace/common/metrics/ClientStatsAggregatorBootstrapTest.java index bcc262e8b92..cde75221ac9 100644 --- a/dd-trace-core/src/test/java/datadog/trace/common/metrics/ClientStatsAggregatorBootstrapTest.java +++ b/dd-trace-core/src/test/java/datadog/trace/common/metrics/ClientStatsAggregatorBootstrapTest.java @@ -26,8 +26,7 @@ import org.mockito.ArgumentCaptor; /** - * Coverage for the {@code ClientStatsAggregator} peer-tag schema bootstrap and reconcile - * paths. 
+ * Coverage for the {@code ClientStatsAggregator} peer-tag schema bootstrap and reconcile paths. * *

    *
  • {@link #bootstrapHappensOnceOnFirstPublish()} -- verifies the synchronized producer-side diff --git a/dd-trace-core/src/test/java/datadog/trace/common/metrics/PeerTagSchemaTest.java b/dd-trace-core/src/test/java/datadog/trace/common/metrics/PeerTagSchemaTest.java index 4711cb09ca6..a8876c86d25 100644 --- a/dd-trace-core/src/test/java/datadog/trace/common/metrics/PeerTagSchemaTest.java +++ b/dd-trace-core/src/test/java/datadog/trace/common/metrics/PeerTagSchemaTest.java @@ -62,8 +62,7 @@ void hasSameTagsAsReturnsTrueForExactMatch() { @Test void hasSameTagsAsReturnsFalseWhenSetGrew() { PeerTagSchema schema = - PeerTagSchema.of( - Collections.singleton("peer.hostname"), 1L, HealthMetrics.NO_OP); + PeerTagSchema.of(Collections.singleton("peer.hostname"), 1L, HealthMetrics.NO_OP); Set larger = new HashSet<>(Arrays.asList("peer.hostname", "peer.service")); assertFalse(schema.hasSameTagsAs(larger)); @@ -83,16 +82,14 @@ void hasSameTagsAsReturnsFalseWhenSetShrank() { @Test void hasSameTagsAsReturnsFalseWhenContentDifferent() { PeerTagSchema schema = - PeerTagSchema.of( - Collections.singleton("peer.hostname"), 1L, HealthMetrics.NO_OP); + PeerTagSchema.of(Collections.singleton("peer.hostname"), 1L, HealthMetrics.NO_OP); assertFalse(schema.hasSameTagsAs(Collections.singleton("peer.service"))); } @Test void hasSameTagsAsHandlesEmpty() { - PeerTagSchema empty = - PeerTagSchema.of(Collections.emptySet(), 1L, HealthMetrics.NO_OP); + PeerTagSchema empty = PeerTagSchema.of(Collections.emptySet(), 1L, HealthMetrics.NO_OP); assertTrue(empty.hasSameTagsAs(Collections.emptySet())); assertFalse(empty.hasSameTagsAs(Collections.singleton("peer.hostname")));
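
Reviewer note on PATCH 32/33 (hash spreading before masking). The sketch below is a minimal, standalone illustration of why `h ^ (h >>> 16)` helps a power-of-two linear-probe table. It is not the tracer's code: `SpreadProbeSketch`, `spread`, `naiveSlot`, and `spreadSlot` are hypothetical names, and the table stores plain `String` keys rather than `UTF8BytesString`. The three sample hashes are chosen to differ only above bit 16, which is the worst case for a bare `hash & mask`.

```java
/** Hypothetical stand-in for illustration; not a class in dd-trace-core. */
final class SpreadProbeSketch {
  private final String[] keys;
  private final int capacityMask;

  SpreadProbeSketch(int capacityPowerOfTwo) {
    // Masking only works as a modulo when the capacity is a power of two.
    if (Integer.bitCount(capacityPowerOfTwo) != 1) {
      throw new IllegalArgumentException("capacity must be a power of two");
    }
    this.keys = new String[capacityPowerOfTwo];
    this.capacityMask = capacityPowerOfTwo - 1;
  }

  /** Folds the upper 16 bits into the lower 16, the same mix the patch describes. */
  static int spread(int h) {
    return h ^ (h >>> 16);
  }

  /** Slot selection without the spreader: only the low bits of the hash matter. */
  static int naiveSlot(int hash, int mask) {
    return hash & mask;
  }

  /** Slot selection with the spreader applied first. */
  static int spreadSlot(int hash, int mask) {
    return spread(hash) & mask;
  }

  /**
   * Linear probe: returns the index holding {@code value}, or the first empty slot in its chain.
   * Assumes the caller keeps the table from filling up (the patch mentions a 0.5 load factor).
   */
  int probe(String value) {
    int idx = spreadSlot(value.hashCode(), capacityMask);
    while (keys[idx] != null && !keys[idx].equals(value)) {
      idx = (idx + 1) & capacityMask;
    }
    return idx;
  }

  /** Stores the value at its probed slot and returns that slot. */
  int insert(String value) {
    int idx = probe(value);
    keys[idx] = value;
    return idx;
  }

  public static void main(String[] args) {
    int mask = 15; // 16-slot table
    int[] hashes = {0x00010000, 0x00020000, 0x00030000};
    for (int h : hashes) {
      // Without spreading every sample lands in slot 0; with it they land in 1, 2, 3.
      System.out.println(naiveSlot(h, mask) + " -> " + spreadSlot(h, mask));
    }
  }
}
```

With a 16-slot table, the three sample hashes all map to slot 0 under `naiveSlot` and to slots 1, 2, and 3 under `spreadSlot`, so the probe chain for each starts in a different place. The mix costs one shift and one XOR per lookup, which is why it is cheap enough for the aggregator's per-snapshot path.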