feat: add comprehensive Prometheus metrics across all layers #62
Merged
…r, and storage layers

Add metrics for cache request outcomes, chunk writes, flush failures, fillrange operations, upstream latency/errors, request coalescing, request duration, active connections, panic recovery, indexdb operations, evictions, migrations, and cached object counts, all under the shared tavern namespace.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pull request overview
This PR introduces Prometheus instrumentation across Tavern’s proxy, HTTP server, middleware, and disk bucket layers to improve operational visibility (latency, errors, cache behavior, panics, and connection tracking).
Changes:
- Added new Prometheus metric definitions and registration for proxy, server, caching middleware, recovery middleware, and disk bucket.
- Instrumented request paths to emit latency histograms and counters (upstream request duration/errors, cache outcomes, panic count, IndexDB op latency, migrations/evictions, etc.).
- Extended server handler/connection lifecycle to report end-to-end request latency and a connection gauge.
Reviewed changes
Copilot reviewed 11 out of 12 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| storage/bucket/disk/metrics.go | Adds disk-bucket metrics (IndexDB latency, I/O bytes, evictions/migrations, cache object gauge). |
| storage/bucket/disk/disk.go | Emits disk-bucket metrics during eviction/migration and IndexDB get/set/delete operations. |
| server/metrics.go | Adds server request duration histogram and a connection gauge. |
| server/server.go | Instruments HTTP handler latency and connection state changes. |
| server/middleware/caching/metrics.go | Adds caching middleware counters (requests, chunk writes, flush failures, fillrange). |
| server/middleware/caching/internal.go | Increments fillrange counter for upstream sub-requests. |
| server/middleware/caching/caching.go | Emits caching request/chunk/flush metrics along caching flow. |
| server/middleware/recovery/metrics.go | Adds panic counter for recovery middleware. |
| server/middleware/recovery/recovery.go | Increments panic counter when recovery catches a panic. |
| proxy/metrics.go | Adds proxy metrics (upstream duration, upstream errors, singleflight collapse counter). |
| proxy/proxy.go | Emits upstream proxy metrics and adds upstream error classification. |
| .gitignore | Ignores .codegraph/ and reasonix.toml. |
Comment on lines +268 to +273

    start := time.Now()
    if err := d.indexdb.Delete(ctx, md.ID.Bytes()); err != nil {
        indexdbOperationDuration.With(prometheus.Labels{"op": "delete", "bucket": d.ID()}).Observe(time.Since(start).Seconds())
        clog.Warnf("failed to delete metadata %s: %v", md.ID.WPath(d.path), err)
    }
    indexdbOperationDuration.With(prometheus.Labels{"op": "delete", "bucket": d.ID()}).Observe(time.Since(start).Seconds())
Comment on lines 133 to 137

    discard := func(evicted lru.Eviction[object.IDHash, storage.Mark]) {
        fd := evicted.Key.WPath(d.path)
        clog.Debugf("evict file %s, last-access %d", fd, evicted.Value.LastAccess())
        cacheEvictionsTotal.WithLabelValues(d.ID(), "lru").Inc()
        _ = d.DiscardWithHash(context.Background(), evicted.Key)
Comment on lines 149 to 156

    if err := demote(evicted); err != nil {
        log.Warnf("demote failed: %v", err)
        // fallback to discard
        discard(evicted)
        continue
    }
    cacheEvictionsTotal.WithLabelValues(d.ID(), "demote").Inc()
    continue
Comment on lines +18 to +22

    // diskIOBytesTotal tracks bytes read/written to disk by bucket.
    // Labels: bucket, direction (read/write)
    diskIOBytesTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
        Namespace: pkgmetrics.Namespace,
        Name: "disk_io_bytes_total",
Comment on lines +9 to +11

    // indexdbOperationDuration tracks indexdb operation latency by operation type and bucket.
    // Labels: op (get/set/delete/iterate), bucket
    indexdbOperationDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
Comment on lines +26 to +28

    // cacheEvictionsTotal counts cache eviction events by bucket and reason.
    // Labels: bucket, reason (lru/demote/discard)
    cacheEvictionsTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
Comment on lines 95 to +100

    if !collapsed {
        return client.Do(req)
        //return r.uncompress(client.Do(req))
        return trackedDo()
    }

    ret := <-r.flight.DoChan(onceKey(req), waitTimeout, func() (*http.Response, error) {
        //return r.uncompress(client.Do(req))
        return client.Do(req)
        return trackedDo()
Comment on lines +18 to +20

    // upstreamErrorsTotal counts upstream errors by upstream address and error type.
    // Labels: addr, error_type (network/timeout/http_status)
    upstreamErrorsTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
Comment on lines +29 to +34

    // connectionsActive tracks the current number of active client connections.
    connectionsActive = prometheus.NewGauge(prometheus.GaugeOpts{
        Namespace: pkgmetrics.Namespace,
        Name: "connections_active",
        Help: "The current number of active client connections",
    })
Comment on lines +11 to +13

    upstreamRequestDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
        Namespace: pkgmetrics.Namespace,
        Name: "upstream_request_duration_seconds",
Summary
Adds Prometheus monitoring metrics to each of Tavern's core components, all under the shared `tavern` namespace, covering the cache, proxy, server, and storage layers.

New metrics

Proxy (proxy/metrics.go)
- tavern_upstream_request_duration_seconds: upstream request latency histogram (grouped by addr)
- tavern_upstream_errors_total: upstream request error count (grouped by addr and error_type)
- tavern_collapse_requests_total: singleflight request-collapse count (primary/shared)

Server (server/metrics.go, server/server.go)
- tavern_request_duration_seconds: end-to-end request latency histogram (grouped by HTTP method)
- tavern_connections_active: current number of active connections

Cache Middleware (server/middleware/caching/metrics.go)
- tavern_cache_requests_total: cache request outcomes (grouped by cache_status and store_type)
- tavern_cache_chunk_write_total: chunk write outcome count
- tavern_cache_flush_failed_total: cache flush failure count
- tavern_cache_fillrange_total: fillrange upstream sub-request count

Recovery Middleware (server/middleware/recovery/metrics.go)
- tavern_panics_total: count of recovered panics

Disk Bucket (storage/bucket/disk/metrics.go)
- tavern_indexdb_operation_duration_seconds: IndexDB operation latency histogram
- tavern_disk_io_bytes_total: disk I/O byte count
- tavern_cache_evictions_total: cache eviction event count (lru/demote)
- tavern_cache_migration_total: cache migration count (promote/demote)
- tavern_cache_objects: current number of cached objects

Test Plan
- make check passes
- go test ./... passes
- The new metrics are visible at the /metrics endpoint

🤖 Generated with Claude Code