feat: add comprehensive Prometheus metrics across all layers #62

Merged
sendya merged 1 commit into main from feat/cache-metric on Jun 8, 2026

sendya (Member) commented on Jun 8, 2026

Summary

Adds Prometheus metrics to Tavern's core components. All metrics share the tavern namespace and cover the cache, proxy, server, and storage layers.

New metrics

Proxy (proxy/metrics.go)

  • tavern_upstream_request_duration_seconds: upstream request latency histogram (grouped by addr)
  • tavern_upstream_errors_total: upstream request error count (grouped by addr and error_type)
  • tavern_collapse_requests_total: singleflight request-coalescing count (primary/shared)

Server (server/metrics.go, server/server.go)

  • tavern_request_duration_seconds: end-to-end request latency histogram (grouped by HTTP method)
  • tavern_connections_active: current number of active connections

Cache Middleware (server/middleware/caching/metrics.go)

  • tavern_cache_requests_total: cache request outcomes (grouped by cache_status and store_type)
  • tavern_cache_chunk_write_total: chunk write outcome count
  • tavern_cache_flush_failed_total: cache flush failure count
  • tavern_cache_fillrange_total: fillrange upstream sub-request count

Recovery Middleware (server/middleware/recovery/metrics.go)

  • tavern_panics_total: count of recovered panics

Disk Bucket (storage/bucket/disk/metrics.go)

  • tavern_indexdb_operation_duration_seconds: IndexDB operation latency histogram
  • tavern_disk_io_bytes_total: disk I/O byte count
  • tavern_cache_evictions_total: cache eviction event count (lru/demote)
  • tavern_cache_migration_total: cache migration count (promote/demote)
  • tavern_cache_objects: current number of cached objects

Test Plan

  • make check passes
  • go test ./... passes
  • After starting tavern, the new metrics are visible at the /metrics endpoint
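
The last check could be scripted roughly as follows. This is a sketch under assumptions: missingMetrics and fetchMetrics are hypothetical helpers, and a fake httptest endpoint stands in for a running tavern instance, whose real address is not given in the PR.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// missingMetrics returns the names from want that do not appear in body,
// where body is the text returned by a /metrics endpoint.
// It uses a plain substring match, which is enough for a smoke test.
func missingMetrics(body string, want []string) []string {
	var missing []string
	for _, name := range want {
		if !strings.Contains(body, name) {
			missing = append(missing, name)
		}
	}
	return missing
}

// fetchMetrics GETs url and returns the response body as a string.
func fetchMetrics(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	b, err := io.ReadAll(resp.Body)
	return string(b), err
}

func main() {
	// A fake endpoint stands in for a running tavern instance.
	ts := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		io.WriteString(w, "# TYPE tavern_panics_total counter\ntavern_panics_total 0\n")
	}))
	defer ts.Close()

	body, err := fetchMetrics(ts.URL)
	if err != nil {
		panic(err)
	}
	fmt.Println(missingMetrics(body, []string{"tavern_panics_total", "tavern_cache_objects"}))
	// prints: [tavern_cache_objects]
}
```

Note that with the Prometheus Go client, a labeled metric vector typically exposes no series until at least one label combination has been used, so a freshly started instance may legitimately lack some of the labeled metrics until traffic arrives.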

🤖 Generated with Claude Code

…r, and storage layers

Add metrics for cache request outcomes, chunk writes, flush failures, fillrange
operations, upstream latency/errors, request coalescing, request duration, active
connections, panic recovery, indexdb operations, evictions, migrations, and cached
object counts — all under the shared tavern namespace.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 8, 2026 03:03

Copilot AI (Contributor) left a comment

Pull request overview

This PR introduces Prometheus instrumentation across Tavern’s proxy, HTTP server, middleware, and disk bucket layers to improve operational visibility (latency, errors, cache behavior, panics, and connection tracking).

Changes:

  • Added new Prometheus metric definitions and registration for proxy, server, caching middleware, recovery middleware, and disk bucket.
  • Instrumented request paths to emit latency histograms and counters (upstream request duration/errors, cache outcomes, panic count, IndexDB op latency, migrations/evictions, etc.).
  • Extended server handler/connection lifecycle to report end-to-end request latency and a connection gauge.

Reviewed changes

Copilot reviewed 11 out of 12 changed files in this pull request and generated 10 comments.

Summary per file

| File | Description |
| --- | --- |
| storage/bucket/disk/metrics.go | Adds disk-bucket metrics (IndexDB latency, I/O bytes, evictions/migrations, cache object gauge). |
| storage/bucket/disk/disk.go | Emits disk-bucket metrics during eviction/migration and IndexDB get/set/delete operations. |
| server/metrics.go | Adds server request duration histogram and a connection gauge. |
| server/server.go | Instruments HTTP handler latency and connection state changes. |
| server/middleware/caching/metrics.go | Adds caching middleware counters (requests, chunk writes, flush failures, fillrange). |
| server/middleware/caching/internal.go | Increments fillrange counter for upstream sub-requests. |
| server/middleware/caching/caching.go | Emits caching request/chunk/flush metrics along caching flow. |
| server/middleware/recovery/metrics.go | Adds panic counter for recovery middleware. |
| server/middleware/recovery/recovery.go | Increments panic counter when recovery catches a panic. |
| proxy/metrics.go | Adds proxy metrics (upstream duration, upstream errors, singleflight collapse counter). |
| proxy/proxy.go | Emits upstream proxy metrics and adds upstream error classification. |
| .gitignore | Ignores .codegraph/ and reasonix.toml. |

Comment on lines +268 to +273
    start := time.Now()
    if err := d.indexdb.Delete(ctx, md.ID.Bytes()); err != nil {
        indexdbOperationDuration.With(prometheus.Labels{"op": "delete", "bucket": d.ID()}).Observe(time.Since(start).Seconds())
        clog.Warnf("failed to delete metadata %s: %v", md.ID.WPath(d.path), err)
    }
    indexdbOperationDuration.With(prometheus.Labels{"op": "delete", "bucket": d.ID()}).Observe(time.Since(start).Seconds())
Comment on lines 133 to 137
    discard := func(evicted lru.Eviction[object.IDHash, storage.Mark]) {
        fd := evicted.Key.WPath(d.path)
        clog.Debugf("evict file %s, last-access %d", fd, evicted.Value.LastAccess())
        cacheEvictionsTotal.WithLabelValues(d.ID(), "lru").Inc()
        _ = d.DiscardWithHash(context.Background(), evicted.Key)
Comment on lines 149 to 156
    if err := demote(evicted); err != nil {
        log.Warnf("demote failed: %v", err)
        // fallback to discard
        discard(evicted)
        continue
    }
    cacheEvictionsTotal.WithLabelValues(d.ID(), "demote").Inc()
    continue
Comment on lines +18 to +22
    // diskIOBytesTotal tracks bytes read/written to disk by bucket.
    // Labels: bucket, direction (read/write)
    diskIOBytesTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
        Namespace: pkgmetrics.Namespace,
        Name:      "disk_io_bytes_total",
Comment on lines +9 to +11
    // indexdbOperationDuration tracks indexdb operation latency by operation type and bucket.
    // Labels: op (get/set/delete/iterate), bucket
    indexdbOperationDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
Comment on lines +26 to +28
    // cacheEvictionsTotal counts cache eviction events by bucket and reason.
    // Labels: bucket, reason (lru/demote/discard)
    cacheEvictionsTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
Comment thread proxy/proxy.go
Comment on lines 95 to +100
      if !collapsed {
    -     return client.Do(req)
          //return r.uncompress(client.Do(req))
    +     return trackedDo()
      }

      ret := <-r.flight.DoChan(onceKey(req), waitTimeout, func() (*http.Response, error) {
          //return r.uncompress(client.Do(req))
    -     return client.Do(req)
    +     return trackedDo()
Comment thread proxy/metrics.go
Comment on lines +18 to +20
    // upstreamErrorsTotal counts upstream errors by upstream address and error type.
    // Labels: addr, error_type (network/timeout/http_status)
    upstreamErrorsTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
Comment thread server/metrics.go
Comment on lines +29 to +34
    // connectionsActive tracks the current number of active client connections.
    connectionsActive = prometheus.NewGauge(prometheus.GaugeOpts{
        Namespace: pkgmetrics.Namespace,
        Name:      "connections_active",
        Help:      "The current number of active client connections",
    })
Comment thread proxy/metrics.go
Comment on lines +11 to +13
    upstreamRequestDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
        Namespace: pkgmetrics.Namespace,
        Name:      "upstream_request_duration_seconds",
sendya merged commit 753da1c into main on Jun 8, 2026
2 checks passed