feat: add comprehensive Prometheus metrics across all layers by sendya · Pull Request #62 · omalloc/tavern

sendya · 2026-06-08T03:03:08Z

Summary

Adds Prometheus metrics to Tavern's core components, all under the shared tavern namespace, covering the cache, proxy, server, and storage layers.

New metrics

Proxy (proxy/metrics.go)

tavern_upstream_request_duration_seconds: upstream request latency histogram (labeled by addr)
tavern_upstream_errors_total: upstream request error count (labeled by addr and error_type)
tavern_collapse_requests_total: singleflight request-coalescing count (primary/shared)

Server (server/metrics.go, server/server.go)

tavern_request_duration_seconds: end-to-end request latency histogram (labeled by HTTP method)
tavern_connections_active: current number of active connections
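One common way to keep an active-connection gauge accurate is the `http.Server.ConnState` hook. The sketch below is not the code from this PR: `connTracker`, `hook`, and `simulate` are hypothetical names, and an atomic counter stands in for the Prometheus gauge so the example needs only the standard library.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"sync/atomic"
)

// connTracker is a hypothetical stand-in for a gauge such as
// tavern_connections_active; it only stores the current count.
type connTracker struct {
	active atomic.Int64
}

// hook returns a callback that can be assigned to http.Server.ConnState.
// A connection is counted when it is first accepted and released when it
// reaches a terminal state (closed or hijacked).
func (t *connTracker) hook() func(net.Conn, http.ConnState) {
	return func(_ net.Conn, state http.ConnState) {
		switch state {
		case http.StateNew:
			t.active.Add(1)
		case http.StateClosed, http.StateHijacked:
			t.active.Add(-1)
		}
	}
}

// simulate replays the state transitions of two connections and returns
// the count while both are open and the count after both have ended.
func simulate() (peak, final int64) {
	var t connTracker
	cb := t.hook()
	cb(nil, http.StateNew)
	cb(nil, http.StateNew)
	cb(nil, http.StateActive)
	peak = t.active.Load()
	cb(nil, http.StateIdle)
	cb(nil, http.StateClosed)
	cb(nil, http.StateHijacked)
	final = t.active.Load()
	return peak, final
}

func main() {
	peak, final := simulate()
	fmt.Println("peak:", peak, "final:", final)
}
```

In a real server the hook would be wired up as `&http.Server{ConnState: t.hook()}`, with the `Add` calls replaced by the gauge's `Inc` and `Dec`.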

Cache Middleware (server/middleware/caching/metrics.go)

tavern_cache_requests_total: cache request outcomes (labeled by cache_status and store_type)
tavern_cache_chunk_write_total: chunk write outcome count
tavern_cache_flush_failed_total: cache flush failure count
tavern_cache_fillrange_total: fillrange upstream sub-request count

Recovery Middleware (server/middleware/recovery/metrics.go)

tavern_panics_total: count of recovered panics
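A minimal sketch of how a recovery middleware can count panics, assuming a plain `net/http` handler chain. `countPanics` and `demoPanic` are hypothetical names, and an atomic counter stands in for the tavern_panics_total counter; the actual middleware in this PR may differ.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// countPanics wraps next, recovers from any panic it raises, increments
// count once per recovered panic, and answers with a 500 status.
func countPanics(count *atomic.Int64, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer func() {
			if rec := recover(); rec != nil {
				count.Add(1)
				http.Error(w, "internal server error", http.StatusInternalServerError)
			}
		}()
		next.ServeHTTP(w, r)
	})
}

// demoPanic sends one request through a handler that always panics and
// returns the recorded panic count and the response status code.
func demoPanic() (int64, int) {
	var count atomic.Int64
	h := countPanics(&count, http.HandlerFunc(func(http.ResponseWriter, *http.Request) {
		panic("boom")
	}))
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/", nil))
	return count.Load(), rec.Code
}

func main() {
	panics, status := demoPanic()
	fmt.Println("panics:", panics, "status:", status)
}
```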

Disk Bucket (storage/bucket/disk/metrics.go)

tavern_indexdb_operation_duration_seconds: IndexDB operation latency histogram
tavern_disk_io_bytes_total: disk I/O byte count
tavern_cache_evictions_total: cache eviction event count (lru/demote)
tavern_cache_migration_total: cache migration count (promote/demote)
tavern_cache_objects: current number of cached objects

Test Plan

make check passes
go test ./... passes
After starting tavern, the new metrics are visible at the /metrics endpoint
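The last check could be automated roughly as follows. This is a sketch under assumptions: `missingMetrics` is a hypothetical helper that scans Prometheus text exposition output for `# TYPE <name> <type>` lines, and fetching the body from a running instance is left to the caller.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// missingMetrics returns the names in want that have no "# TYPE" line in
// the given text exposition output.
func missingMetrics(exposition string, want []string) []string {
	seen := make(map[string]bool)
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		// Metric families are announced as: # TYPE <name> <type>
		if len(fields) == 4 && fields[0] == "#" && fields[1] == "TYPE" {
			seen[fields[2]] = true
		}
	}
	var missing []string
	for _, name := range want {
		if !seen[name] {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	// Sample body; in practice this would be the response from /metrics.
	body := "# HELP tavern_panics_total Recovered panics\n" +
		"# TYPE tavern_panics_total counter\n" +
		"tavern_panics_total 0\n"
	want := []string{"tavern_panics_total", "tavern_connections_active"}
	fmt.Println("missing:", missingMetrics(body, want))
}
```

With client_golang, a labeled vector usually exports nothing until one of its label combinations has been used, so a freshly started process may legitimately lack some of the labeled metrics until traffic arrives.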

🤖 Generated with Claude Code

…r, and storage layers

Add metrics for cache request outcomes, chunk writes, flush failures, fillrange operations, upstream latency/errors, request coalescing, request duration, active connections, panic recovery, indexdb operations, evictions, migrations, and cached object counts, all under the shared tavern namespace.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot

Pull request overview

This PR introduces Prometheus instrumentation across Tavern’s proxy, HTTP server, middleware, and disk bucket layers to improve operational visibility (latency, errors, cache behavior, panics, and connection tracking).

Changes:

Added new Prometheus metric definitions and registration for proxy, server, caching middleware, recovery middleware, and disk bucket.
Instrumented request paths to emit latency histograms and counters (upstream request duration/errors, cache outcomes, panic count, IndexDB op latency, migrations/evictions, etc.).
Extended server handler/connection lifecycle to report end-to-end request latency and a connection gauge.

Reviewed changes

Copilot reviewed 11 out of 12 changed files in this pull request and generated 10 comments.


| File | Description |
| --- | --- |
| `storage/bucket/disk/metrics.go` | Adds disk-bucket metrics (IndexDB latency, I/O bytes, evictions/migrations, cache object gauge). |
| `storage/bucket/disk/disk.go` | Emits disk-bucket metrics during eviction/migration and IndexDB get/set/delete operations. |
| `server/metrics.go` | Adds server request duration histogram and a connection gauge. |
| `server/server.go` | Instruments HTTP handler latency and connection state changes. |
| `server/middleware/caching/metrics.go` | Adds caching middleware counters (requests, chunk writes, flush failures, fillrange). |
| `server/middleware/caching/internal.go` | Increments fillrange counter for upstream sub-requests. |
| `server/middleware/caching/caching.go` | Emits caching request/chunk/flush metrics along caching flow. |
| `server/middleware/recovery/metrics.go` | Adds panic counter for recovery middleware. |
| `server/middleware/recovery/recovery.go` | Increments panic counter when recovery catches a panic. |
| `proxy/metrics.go` | Adds proxy metrics (upstream duration, upstream errors, singleflight collapse counter). |
| `proxy/proxy.go` | Emits upstream proxy metrics and adds upstream error classification. |
| `.gitignore` | Ignores `.codegraph/` and `reasonix.toml`. |


+	start := time.Now()
 	if err := d.indexdb.Delete(ctx, md.ID.Bytes()); err != nil {
+		indexdbOperationDuration.With(prometheus.Labels{"op": "delete", "bucket": d.ID()}).Observe(time.Since(start).Seconds())
 		clog.Warnf("failed to delete metadata %s: %v", md.ID.WPath(d.path), err)
 	}
+	indexdbOperationDuration.With(prometheus.Labels{"op": "delete", "bucket": d.ID()}).Observe(time.Since(start).Seconds())
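As quoted, the hunk above calls `Observe` inside the error branch and again after it, so a failing `Delete` would be recorded twice. One way to record each call exactly once is a deferred observation. This is a sketch only: `timed` and `observationsPerCall` are hypothetical helpers, and the `observe` callback stands in for the histogram's `Observe` method.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// timed runs op and reports its duration exactly once, whether or not
// op returns an error.
func timed(observe func(seconds float64), op func() error) error {
	start := time.Now()
	defer func() { observe(time.Since(start).Seconds()) }()
	return op()
}

// observationsPerCall counts how many observations one successful call
// and one failing call each produce.
func observationsPerCall() (onSuccess, onFailure int) {
	count := func(n *int) func(float64) {
		return func(float64) { *n++ }
	}
	_ = timed(count(&onSuccess), func() error { return nil })
	_ = timed(count(&onFailure), func() error { return errors.New("delete failed") })
	return onSuccess, onFailure
}

func main() {
	ok, failed := observationsPerCall()
	fmt.Println("observations on success:", ok, "on failure:", failed)
}
```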


 	discard := func(evicted lru.Eviction[object.IDHash, storage.Mark]) {
 		fd := evicted.Key.WPath(d.path)
 		clog.Debugf("evict file %s, last-access %d", fd, evicted.Value.LastAccess())
+		cacheEvictionsTotal.WithLabelValues(d.ID(), "lru").Inc()
 		_ = d.DiscardWithHash(context.Background(), evicted.Key)


 					if err := demote(evicted); err != nil {
 						log.Warnf("demote failed: %v", err)
 						// fallback to discard
 						discard(evicted)
 						continue
 					}
+					cacheEvictionsTotal.WithLabelValues(d.ID(), "demote").Inc()
 					continue


+	// diskIOBytesTotal tracks bytes read/written to disk by bucket.
+	// Labels: bucket, direction (read/write)
+	diskIOBytesTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
+		Namespace: pkgmetrics.Namespace,
+		Name:      "disk_io_bytes_total",


+	// indexdbOperationDuration tracks indexdb operation latency by operation type and bucket.
+	// Labels: op (get/set/delete/iterate), bucket
+	indexdbOperationDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{


+	// cacheEvictionsTotal counts cache eviction events by bucket and reason.
+	// Labels: bucket, reason (lru/demote/discard)
+	cacheEvictionsTotal = prometheus.NewCounterVec(prometheus.CounterOpts{


 	if !collapsed {
-		return client.Do(req)
-		//return r.uncompress(client.Do(req))
+		return trackedDo()
 	}

 	ret := <-r.flight.DoChan(onceKey(req), waitTimeout, func() (*http.Response, error) {
-		//return r.uncompress(client.Do(req))
-		return client.Do(req)
+		return trackedDo()


+	// upstreamErrorsTotal counts upstream errors by upstream address and error type.
+	// Labels: addr, error_type (network/timeout/http_status)
+	upstreamErrorsTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
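The comment above lists three error_type values (network/timeout/http_status) but the classification logic is not shown here. A minimal sketch of one plausible mapping follows. `classifyUpstreamError` and `classifyExamples` are hypothetical names, and treating only 5xx responses as http_status errors is an assumption, not something the PR states.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"net/http"
)

// classifyUpstreamError maps a round-trip result to an error_type label.
// It returns "" when the result is not considered an error.
func classifyUpstreamError(resp *http.Response, err error) string {
	if err != nil {
		var netErr net.Error
		if errors.Is(err, context.DeadlineExceeded) ||
			(errors.As(err, &netErr) && netErr.Timeout()) {
			return "timeout"
		}
		return "network"
	}
	// Assumption: only server-side (5xx) statuses count as upstream errors.
	if resp != nil && resp.StatusCode >= 500 {
		return "http_status"
	}
	return ""
}

// classifyExamples classifies four representative results: a deadline
// error, a generic transport error, a 502 response, and a 200 response.
func classifyExamples() [4]string {
	return [4]string{
		classifyUpstreamError(nil, context.DeadlineExceeded),
		classifyUpstreamError(nil, errors.New("connection refused")),
		classifyUpstreamError(&http.Response{StatusCode: 502}, nil),
		classifyUpstreamError(&http.Response{StatusCode: 200}, nil),
	}
}

func main() {
	for _, label := range classifyExamples() {
		fmt.Printf("%q\n", label)
	}
}
```

Keeping the label set small and fixed, as here, keeps the counter's cardinality bounded regardless of what errors the upstream produces.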


+	// connectionsActive tracks the current number of active client connections.
+	connectionsActive = prometheus.NewGauge(prometheus.GaugeOpts{
+		Namespace: pkgmetrics.Namespace,
+		Name:      "connections_active",
+		Help:      "The current number of active client connections",
+	})


+	upstreamRequestDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
+		Namespace: pkgmetrics.Namespace,
+		Name:      "upstream_request_duration_seconds",


Copilot AI review requested due to automatic review settings June 8, 2026 03:03

Copilot started reviewing on behalf of sendya June 8, 2026 03:03

Copilot AI reviewed Jun 8, 2026


sendya merged commit 753da1c into main Jun 8, 2026
2 checks passed

sendya merged 1 commit into main from feat/cache-metric
