
Commit aa8d323

Make log-based metrics available with diagnostics tooling (#613)
* Add DiagnosticsMetrics class for enhanced metrics recording

  Introduces the DiagnosticsMetrics class, which provides a high-level API for recording metrics using the `@vtex/diagnostics-nodejs` library. It supports recording latency, incrementing counters, and setting gauge values, while managing attribute limits to prevent high cardinality. The new metrics system is initialized in the startApp function, preserving backward compatibility with the existing MetricsAccumulator.

* Add unit tests for DiagnosticsMetrics class

  Adds a comprehensive test suite covering initialization, latency recording, counter incrementing, and gauge setting. The tests verify proper handling of attributes, including limits on the number of attributes, and that metrics are recorded correctly under various scenarios, including error handling during initialization.

* Add method to update HTTP agent metrics for diagnostics

  Introduces the `updateHttpAgentMetrics` method in the HttpAgentSingleton class, which periodically exports the current state of the HTTP agent as diagnostics metrics: the current number of sockets, free sockets, and pending requests, reported as gauges. These stats are also used in `MetricsAccumulator` to produce the same metrics in the log-based format.

* Integrate HTTP agent metrics update into status tracking

  Updates the `trackStatus` function to call `HttpAgentSingleton.updateHttpAgentMetrics()`, exporting HTTP agent statistics as diagnostics metrics alongside the existing legacy status tracking.

* Add unit tests for HttpAgentSingleton class

  Validates HTTP agent statistics retrieval and metrics reporting, covering both populated and empty agent states to ensure accurate gauge reporting to diagnostics metrics.

* Enhance total incoming requests reporting in middleware

  Updates the trackIncomingRequestStats middleware to report detailed metrics for total incoming requests, introducing a cumulative counter and reporting to diagnostics when available.

* Enhance closed incoming request metrics reporting in middleware

  Modifies the requestClosed function to report closed request metrics to diagnostics, incremented with relevant route and status information, improving the middleware's monitoring capabilities.

* Add diagnostics metrics reporting for aborted requests in middleware

  Enhances the requestAborted function to report aborted request metrics to diagnostics, including relevant route and status information.

* Add unit tests for requestStats middleware

  Validates tracking of incoming request statistics, including total, aborted, and closed requests. The tests ensure accurate reporting to diagnostics metrics and cover various scenarios, including the absence of global diagnostics metrics.

* Instrument http-client metrics with diagnostics

  - Add the OpenTelemetry `Attributes` import in `metricsMiddleware` to annotate HTTP client metric dimensions
  - Capture elapsed request time and feed it to `global.diagnosticsMetrics.recordLatency`
  - Create and increment a request counter

* Add cache state in http-client metrics

  - Capture the resolved cache state for each HTTP client response
  - Include `cache_state` on latency histograms and the primary request counter
  - Emit the `http_client_cache_total` diagnostics counter to replace legacy extension stats

* Report retries via diagnostics metrics for http-client metrics

  - Read the retry count from config with a safe default so zero attempts are tracked
  - Retain legacy status extension updates while preparing to deprecate them
  - Emit the `http_client_requests_retried_total` counter with base attributes for diagnostics

* Add unit tests for metricsMiddleware

  Validates metrics recording for various HTTP client scenarios, including successful requests, cache hits, retries, and error handling.

* Instrument timings middleware with diagnostics metrics

  The HTTP timings middleware now uses DiagnosticsMetrics whenever it is available, emitting request latency and counter telemetry with stable OpenTelemetry-style attributes. This ensures `http-handler` metrics from the legacy log-based approach are properly instrumented with diagnostics patterns. Legacy batching remains in place for backward compatibility, and a warning is logged when the diagnostics client is missing to make instrumentation gaps visible.

  - Extend context destructuring to capture the route type for tagging
  - Record latency histograms for every request while keeping legacy batching intact
  - Increment the new http_handler_requests_total counter with consistent attributes

* Add unit tests for timings middleware

  Validates the recording of diagnostics metrics for various HTTP request scenarios, including successful requests, error responses, and legacy compatibility. The tests ensure accurate logging of timing and billing information, proper handling of different status codes, and graceful degradation when diagnostics metrics are unavailable.

* Enhance GraphQL metrics with diagnostics support

  Updates the Metric schema directive to integrate the new diagnostics metrics, recording latency histograms and counters with stable attributes. It maintains backward compatibility with legacy metrics while warning when diagnostics metrics are unavailable. Key changes:

  - Refactor status handling for clarity
  - Add attributes for component, field name, and status in diagnostics metrics
  - Record latency for all requests and increment a counter for GraphQL field requests

* Add unit tests for Metric schema directive

  Validates metric recording for both successful and failed GraphQL field resolutions, covering latency and error metrics as well as proper handling when diagnostics metrics are unavailable. Key additions:

  - Tests for successful field resolution metrics
  - Tests for failed field resolution metrics
  - Warning checks when diagnostics metrics are not present

* Export DiagnosticsMetrics

* Update DiagnosticsMetrics to conditionally log limit warnings

  The DiagnosticsMetrics class now logs warnings about exceeded attribute limits only in a linked app context.

* Add a description explaining the unit test that handles initialization errors

* fix: correct mock context and test setup for metrics tests

  Add missing mock properties to requestStats tests (`status`, `route`) and enable `LINKED` constant mocking for the DiagnosticsMetrics attribute-limiting tests.

* Release beta version
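The commit message mentions "managing attribute limits to prevent high cardinality" but the DiagnosticsMetrics implementation itself is not shown in this diff. A minimal sketch of that idea follows; the `limitAttributes` helper, the `MAX_ATTRIBUTES` value, and the warning text are all assumptions for illustration, not taken from the actual `@vtex/api` source:

```typescript
// Hypothetical sketch: cap the number of attributes attached to a metric so
// that a single misbehaving caller cannot explode time-series cardinality.
type Attributes = Record<string, string | number>

const MAX_ATTRIBUTES = 10 // assumed limit, not taken from the diff

function limitAttributes(attrs: Attributes, max: number = MAX_ATTRIBUTES): Attributes {
  const entries = Object.entries(attrs)
  if (entries.length <= max) {
    return attrs
  }
  // Keep the first `max` attributes in insertion order and drop the rest,
  // warning so that instrumentation gaps stay visible.
  console.warn(`Attribute limit exceeded: dropping ${entries.length - max} attribute(s)`)
  return Object.fromEntries(entries.slice(0, max))
}

const limited = limitAttributes(
  { component: 'http-client', status: 'success', a: 1, b: 2 },
  3
)
console.log(Object.keys(limited).length) // 3
```

A guard like this trades completeness for safety: attribute keys beyond the cap are silently dropped from the metric, which is why the commit later restricts the warning to linked-app contexts where a developer can act on it.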
1 parent 0a92667 commit aa8d323

17 files changed

Lines changed: 2003 additions & 12 deletions


package.json

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 {
   "name": "@vtex/api",
-  "version": "7.2.4",
+  "version": "7.2.5-beta.0",
   "description": "VTEX I/O API client",
   "main": "lib/index.js",
   "typings": "lib/index.d.ts",
Lines changed: 335 additions & 0 deletions
@@ -0,0 +1,335 @@
// Mock @vtex/diagnostics-nodejs before any imports
jest.mock('@vtex/diagnostics-nodejs', () => ({
  Types: {},
  getMetricClient: jest.fn(),
  getLogger: jest.fn(),
}))

jest.mock('../../service/metrics/client', () => ({
  getMetricClient: jest.fn(),
}))

jest.mock('../../errors/RequestCancelledError', () => ({
  RequestCancelledError: class RequestCancelledError extends Error {
    constructor(message: string) {
      super(message)
      this.name = 'RequestCancelledError'
    }
  },
}))

import { DiagnosticsMetrics } from '../../metrics/DiagnosticsMetrics'
import { MetricsAccumulator } from '../../metrics/MetricsAccumulator'
import { metricsMiddleware } from './metrics'
import { MiddlewareContext } from '../typings'

describe('metricsMiddleware', () => {
  let mockMetrics: jest.Mocked<MetricsAccumulator>
  let mockDiagnosticsMetrics: jest.Mocked<DiagnosticsMetrics>
  let mockNext: jest.Mock
  let mockCtx: MiddlewareContext

  beforeEach(() => {
    // Mock MetricsAccumulator
    mockMetrics = {
      batch: jest.fn(),
    } as any

    // Mock DiagnosticsMetrics
    mockDiagnosticsMetrics = {
      recordLatency: jest.fn(),
      incrementCounter: jest.fn(),
      setGauge: jest.fn(),
    } as any

    // Set up global
    global.diagnosticsMetrics = mockDiagnosticsMetrics

    // Mock next function
    mockNext = jest.fn().mockResolvedValue(undefined)

    // Mock context
    mockCtx = {
      config: {
        metric: 'test-client',
        retryCount: 0,
      },
      response: {
        status: 200,
      },
      cacheHit: undefined,
      inflightHit: undefined,
      memoizedHit: undefined,
    } as any
  })

  afterEach(() => {
    jest.clearAllMocks()
    delete (global as any).diagnosticsMetrics
  })

  describe('successful requests', () => {
    it('should record metrics for successful request with no cache', async () => {
      const middleware = metricsMiddleware({ metrics: mockMetrics, name: 'test' })

      await middleware(mockCtx, mockNext)

      // Legacy metrics
      const batchCall = mockMetrics.batch.mock.calls[0]
      expect(batchCall[0]).toBe('http-client-test-client')
      expect(Array.isArray(batchCall[1])).toBe(true) // hrtime tuple
      expect(batchCall[2]).toMatchObject({
        success: 1,
        'success-miss': 1,
      })

      // Diagnostics metrics
      const latencyCall = mockDiagnosticsMetrics.recordLatency.mock.calls[0]
      expect(Array.isArray(latencyCall[0])).toBe(true) // hrtime tuple
      expect(latencyCall[1]).toMatchObject({
        component: 'http-client',
        client_metric: 'test-client',
        status_code: 200,
        status: 'success',
        cache_state: 'miss',
      })

      // Main request counter with status as attribute
      expect(mockDiagnosticsMetrics.incrementCounter).toHaveBeenCalledWith(
        'http_client_requests_total',
        1,
        expect.objectContaining({
          component: 'http-client',
          client_metric: 'test-client',
          status_code: 200,
          status: 'success',
        })
      )

      // Cache counter with cache_state as attribute
      expect(mockDiagnosticsMetrics.incrementCounter).toHaveBeenCalledWith(
        'http_client_cache_total',
        1,
        expect.objectContaining({
          component: 'http-client',
          client_metric: 'test-client',
          status: 'success',
          cache_state: 'miss',
        })
      )
    })

    it('should record metrics with cache hit', async () => {
      mockCtx.cacheHit = { revalidated: 1 }

      const middleware = metricsMiddleware({ metrics: mockMetrics, name: 'test' })

      await middleware(mockCtx, mockNext)

      const latencyCall = mockDiagnosticsMetrics.recordLatency.mock.calls[0]
      expect(latencyCall[1]).toMatchObject({
        cache_state: 'hit',
      })

      // Should have cache counter with cache_state attribute
      expect(mockDiagnosticsMetrics.incrementCounter).toHaveBeenCalledWith(
        'http_client_cache_total',
        1,
        expect.objectContaining({
          component: 'http-client',
          client_metric: 'test-client',
          cache_state: 'hit',
        })
      )
    })

    it('should record metrics with inflight hit', async () => {
      mockCtx.inflightHit = true

      const middleware = metricsMiddleware({ metrics: mockMetrics, name: 'test' })

      await middleware(mockCtx, mockNext)

      const latencyCall = mockDiagnosticsMetrics.recordLatency.mock.calls[0]
      expect(latencyCall[1]).toMatchObject({
        cache_state: 'inflight',
      })
    })

    it('should record metrics with memoized hit', async () => {
      mockCtx.memoizedHit = true

      const middleware = metricsMiddleware({ metrics: mockMetrics, name: 'test' })

      await middleware(mockCtx, mockNext)

      const latencyCall = mockDiagnosticsMetrics.recordLatency.mock.calls[0]
      expect(latencyCall[1]).toMatchObject({
        cache_state: 'memoized',
      })
    })

    it('should include retry count in attributes when retries occurred', async () => {
      mockCtx.config.retryCount = 2

      const middleware = metricsMiddleware({ metrics: mockMetrics, name: 'test' })

      await middleware(mockCtx, mockNext)

      // Histogram should not include retry_count (only in counter)
      const latencyCall = mockDiagnosticsMetrics.recordLatency.mock.calls[0]
      expect(latencyCall[1]).toMatchObject({
        status: 'success',
        cache_state: 'miss',
      })

      // Should have retry counter with retry_count attribute
      expect(mockDiagnosticsMetrics.incrementCounter).toHaveBeenCalledWith(
        'http_client_requests_retried_total',
        1,
        expect.objectContaining({
          component: 'http-client',
          client_metric: 'test-client',
          status: 'success',
          status_code: 200,
          retry_count: 2, // Number, not string
        })
      )
    })
  })

  describe('error handling', () => {
    it('should record metrics for timeout errors', async () => {
      mockNext.mockRejectedValueOnce({
        response: {
          data: { code: 'ProxyTimeout' }, // TIMEOUT_CODE
        },
      })

      const middleware = metricsMiddleware({ metrics: mockMetrics, name: 'test' })

      await expect(middleware(mockCtx, mockNext)).rejects.toMatchObject({
        response: expect.any(Object),
      })

      const latencyCall = mockDiagnosticsMetrics.recordLatency.mock.calls[0]
      expect(latencyCall[1]).toMatchObject({
        status: 'timeout',
      })
    })

    it('should record metrics for aborted requests', async () => {
      mockNext.mockRejectedValueOnce({
        code: 'ECONNABORTED',
      })

      const middleware = metricsMiddleware({ metrics: mockMetrics, name: 'test' })

      await expect(middleware(mockCtx, mockNext)).rejects.toMatchObject({
        code: 'ECONNABORTED',
      })

      const latencyCall = mockDiagnosticsMetrics.recordLatency.mock.calls[0]
      expect(latencyCall[1]).toMatchObject({
        status: 'aborted',
      })
    })

    it('should record metrics for cancelled requests', async () => {
      mockNext.mockRejectedValueOnce({
        message: 'Request cancelled',
      })

      const middleware = metricsMiddleware({ metrics: mockMetrics, name: 'test' })

      await expect(middleware(mockCtx, mockNext)).rejects.toThrow('Request cancelled')

      const latencyCall = mockDiagnosticsMetrics.recordLatency.mock.calls[0]
      expect(latencyCall[1]).toMatchObject({
        status: 'cancelled',
      })
    })

    it('should record metrics for HTTP error responses', async () => {
      mockNext.mockRejectedValueOnce({
        response: {
          status: 500,
        },
      })

      const middleware = metricsMiddleware({ metrics: mockMetrics, name: 'test' })

      await expect(middleware(mockCtx, mockNext)).rejects.toMatchObject({
        response: { status: 500 },
      })

      const latencyCall = mockDiagnosticsMetrics.recordLatency.mock.calls[0]
      expect(latencyCall[1]).toMatchObject({
        status: 'error',
      })
    })

    it('should record metrics for generic errors', async () => {
      mockNext.mockRejectedValueOnce(new Error('Generic error'))

      const middleware = metricsMiddleware({ metrics: mockMetrics, name: 'test' })

      await expect(middleware(mockCtx, mockNext)).rejects.toThrow('Generic error')

      const latencyCall = mockDiagnosticsMetrics.recordLatency.mock.calls[0]
      expect(latencyCall[1]).toMatchObject({
        status: 'error',
      })
    })
  })

  describe('backward compatibility', () => {
    it('should maintain legacy metrics when config.metric is set', async () => {
      const middleware = metricsMiddleware({ metrics: mockMetrics, name: 'test' })

      await middleware(mockCtx, mockNext)

      // Verify legacy metrics still called
      expect(mockMetrics.batch).toHaveBeenCalledTimes(1)
      const batchCall = mockMetrics.batch.mock.calls[0]
      expect(batchCall[0]).toBe('http-client-test-client')
      expect(Array.isArray(batchCall[1])).toBe(true)
      expect(batchCall[2]).toBeTruthy()
    })

    it('should not record metrics when config.metric is not set', async () => {
      mockCtx.config.metric = undefined

      const middleware = metricsMiddleware({ metrics: mockMetrics, name: 'test' })

      await middleware(mockCtx, mockNext)

      expect(mockMetrics.batch).not.toHaveBeenCalled()
      expect(mockDiagnosticsMetrics.recordLatency).not.toHaveBeenCalled()
      expect(mockDiagnosticsMetrics.incrementCounter).not.toHaveBeenCalled()
    })
  })

  describe('graceful degradation', () => {
    it('should work without global.diagnosticsMetrics', async () => {
      delete (global as any).diagnosticsMetrics

      const consoleWarnSpy = jest.spyOn(console, 'warn').mockImplementation()

      const middleware = metricsMiddleware({ metrics: mockMetrics, name: 'test' })

      await middleware(mockCtx, mockNext)

      // Legacy metrics still work
      expect(mockMetrics.batch).toHaveBeenCalled()

      // Warning logged
      expect(consoleWarnSpy).toHaveBeenCalledWith(
        'DiagnosticsMetrics not available. HTTP client metrics not reported.'
      )

      consoleWarnSpy.mockRestore()
    })
  })
})
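The cache-state tests above imply a mapping from the context flags (`cacheHit`, `inflightHit`, `memoizedHit`) to a single `cache_state` attribute, defaulting to `miss`. A minimal sketch of that resolution follows; since each test sets only one flag at a time, the precedence order shown here is an assumption, not taken from the actual `@vtex/api` source:

```typescript
// Sketch of deriving a cache_state attribute from the middleware context
// flags exercised in the tests above. Flag precedence is an assumption.
interface CacheFlags {
  cacheHit?: Record<string, number>
  inflightHit?: boolean
  memoizedHit?: boolean
}

function resolveCacheState(ctx: CacheFlags): 'hit' | 'inflight' | 'memoized' | 'miss' {
  if (ctx.cacheHit) return 'hit'
  if (ctx.inflightHit) return 'inflight'
  if (ctx.memoizedHit) return 'memoized'
  return 'miss'
}

console.log(resolveCacheState({ cacheHit: { revalidated: 1 } })) // hit
console.log(resolveCacheState({ inflightHit: true })) // inflight
console.log(resolveCacheState({})) // miss
```

Collapsing the three flags into one attribute keeps the metric dimensions bounded (four possible values) instead of emitting three independent boolean attributes, which fits the commit's stated goal of avoiding high cardinality.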
