2 changes: 2 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -2,6 +2,8 @@ Changelog for the bc-prometheus-ruby gem.

### Pending Release

- Add opt-in per-Resque-job histograms `resque_job_queue_latency_seconds` and `resque_job_perform_duration_seconds`, labelled by `job_class`.

## 0.8.1

- Prometheus client respects the enabled setting
13 changes: 13 additions & 0 deletions README.md
@@ -43,6 +43,18 @@ require 'bigcommerce/prometheus'
Bigcommerce::Prometheus::Instrumentors::Resque.new(app: Rails.application).start
```

### Per-job metrics (opt-in)

Set `PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED=1` on Resque worker pods to enable two additional histograms recorded from the parent worker process.

- `resque_job_queue_latency_seconds{job_class}`: time from `scheduled_at` (falling back to `enqueued_at`) until a worker picks the job up. Measured per attempt; a retry with backoff is anchored on its `scheduled_at`, so the intentional backoff does not show up as latency.
- `resque_job_perform_duration_seconds{job_class}`: total lifetime of the Resque child process (fork → `Process.waitpid` return). Includes fork overhead, Redis reconnect, `after_fork` hooks, perform, and exit.

These are off by default because every job emits histogram observations from the worker pod that runs it, and each distinct `job_class` value adds its own set of histogram series, which increases cardinality. Opt in per service.
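
The queue-latency rule described above can be sketched in plain Ruby. This is an illustration under stated assumptions, not the gem's code: `parse_anchor` and `queue_latency_seconds` are hypothetical helper names, and the timestamps are made up.

```ruby
require 'time'

# Hypothetical helpers for illustration only; they are not part of the gem's API.

# Parse an ISO 8601 string, treating missing or malformed values as absent.
def parse_anchor(value)
  return nil if value.nil? || value.to_s.empty?

  Time.iso8601(value.to_s)
rescue ArgumentError
  nil
end

# Queue latency in seconds: prefer scheduled_at, fall back to enqueued_at,
# and clamp at zero so clock skew never yields a negative value.
def queue_latency_seconds(scheduled_at:, enqueued_at:, now: Time.now)
  anchor = parse_anchor(scheduled_at) || parse_anchor(enqueued_at)
  return nil unless anchor

  [(now - anchor).to_f, 0.0].max
end

now = Time.iso8601('2024-01-01T00:01:00Z')

# First attempt: only enqueued_at is present, so latency is 60 seconds.
first_attempt = queue_latency_seconds(scheduled_at: nil, enqueued_at: '2024-01-01T00:00:00Z', now: now)

# Retry with backoff: scheduled_at wins, so the 50-second backoff is not counted.
retry_attempt = queue_latency_seconds(
  scheduled_at: '2024-01-01T00:00:50Z', enqueued_at: '2024-01-01T00:00:00Z', now: now
)

# No timestamps at all: nil, meaning nothing would be recorded.
no_timestamps = queue_latency_seconds(scheduled_at: nil, enqueued_at: nil, now: now)

puts first_attempt, retry_attempt, no_timestamps.inspect
```

The retry case is the reason for anchoring on `scheduled_at`: measured from `enqueued_at`, the same pickup would report 60 seconds, most of which is deliberate backoff.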

`resque_job_queue_latency_seconds` is supported for jobs enqueued via ActiveJob (`.perform_later`) — the enqueue timestamps come from ActiveJob's serialized payload, and the `job_class` label is the user's job class name, not `ActiveJob::QueueAdapters::ResqueAdapter::JobWrapper`.
Vanilla Resque jobs (`Resque.enqueue`) carry no enqueue timestamps, so `queue_latency` silently no-ops for them; `perform_duration` works for every job regardless.
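
To make the difference between the two enqueue styles concrete, the sketch below shows what each payload roughly looks like and how a shape-based check tells them apart. The job names and argument values are invented, and `active_job_inner` is a hypothetical stand-in written for this example rather than a public method of the gem.

```ruby
# Illustrative payloads; the job names and argument values are made up.
active_job_payload = {
  'class' => 'ActiveJob::QueueAdapters::ResqueAdapter::JobWrapper',
  'args' => [
    { 'job_class' => 'OrderSyncJob', 'enqueued_at' => '2024-01-01T00:00:00Z', 'arguments' => [42] }
  ]
}
vanilla_payload = { 'class' => 'MyJob', 'args' => [42, 'foo'] }

# Hypothetical helper: returns the inner ActiveJob-shaped hash, or nil when
# args[0] is not a Hash carrying a truthy 'job_class'.
def active_job_inner(payload)
  args = payload['args']
  first = args.is_a?(Array) ? args.first : nil
  first if first.is_a?(Hash) && first['job_class']
end

inner = active_job_inner(active_job_payload)
puts inner['job_class']                        # the user's job class, not the JobWrapper name
puts active_job_inner(vanilla_payload).inspect # nil: no timestamps to anchor queue latency on
```

Because the check looks only at the shape of `args[0]`, the wrapper's class name never has to be matched.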

## Configuration

After requiring the main file, you can further configure with:
@@ -58,6 +70,7 @@ After requiring the main file, you can further configure with:
| server_thread_pool_size | The number of threads used for the exporter server | `3` | `ENV['PROMETHEUS_SERVER_THREAD_POOL_SIZE']` |
| process_name | What the current process name is (used in logging) | `"unknown"` | `ENV['PROCESS']` |
| railtie_disabled | Opt out flag for Railtie; use `Bigcommerce::Prometheus::Instrumentors::Web.new(app: Rails.application).start` in your app's code to start it up yourself | `0` | `ENV['PROMETHEUS_DISABLE_RAILTIE']` |
| resque_per_job_metrics_enabled | Enable per-job queue-latency and perform-duration histograms (parent-side, no synchronous flush) | `0` | `ENV['PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED']` |
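
As a rough guide to which values turn `resque_per_job_metrics_enabled` on, the sketch below applies the same integer-based parsing that this change uses in `configuration.rb` (`ENV.fetch(..., 0).to_i.positive?`) to a plain Hash. `per_job_metrics_enabled?` is a hypothetical helper written for this example, and a Hash stands in for `ENV`.

```ruby
# Hypothetical helper mirroring the flag's parsing expression; `env` is any
# Hash-like object that responds to #fetch, standing in for ENV.
def per_job_metrics_enabled?(env)
  env.fetch('PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED', 0).to_i.positive?
end

key = 'PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED'

puts per_job_metrics_enabled?({})                # false: unset defaults to 0
puts per_job_metrics_enabled?({ key => '1' })    # true
puts per_job_metrics_enabled?({ key => '0' })    # false
puts per_job_metrics_enabled?({ key => 'true' }) # false: "true".to_i is 0
```

Under this parsing only a positive integer enables the feature, so a value such as `true` leaves it off.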

## Custom Collectors

5 changes: 5 additions & 0 deletions lib/bigcommerce/prometheus.rb
@@ -35,6 +35,7 @@
require_relative 'prometheus/collectors/resque'
require_relative 'prometheus/type_collectors/base'
require_relative 'prometheus/type_collectors/resque'
require_relative 'prometheus/type_collectors/resque_job'
require_relative 'prometheus/integrations/active_record'
require_relative 'prometheus/type_collectors/active_record'

@@ -44,6 +45,10 @@
require_relative 'prometheus/integrations/railtie' if defined?(Rails)
require_relative 'prometheus/integrations/puma'
require_relative 'prometheus/integrations/resque'
require_relative 'prometheus/integrations/resque/active_job_payload'
require_relative 'prometheus/integrations/resque/vanilla_resque_payload'
require_relative 'prometheus/integrations/resque/job_payload'
require_relative 'prometheus/integrations/resque/job_metrics'

require_relative 'prometheus/servers/puma/server'
require_relative 'prometheus/servers/puma/rack_app'
1 change: 1 addition & 0 deletions lib/bigcommerce/prometheus/configuration.rb
@@ -35,6 +35,7 @@ module Configuration
puma_process_label: ENV.fetch('PROMETHEUS_PUMA_PROCESS_LABEL', 'web').to_s,
resque_collection_frequency: ENV.fetch('PROMETHEUS_RESQUE_COLLECTION_FREQUENCY', 30).to_i,
resque_process_label: ENV.fetch('PROMETHEUS_REQUEST_PROCESS_LABEL', 'resque').to_s,
resque_per_job_metrics_enabled: ENV.fetch('PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED', 0).to_i.positive?,

# Server configuration
not_found_body: ENV.fetch('PROMETHEUS_SERVER_NOT_FOUND_BODY', 'Not Found! The Prometheus Ruby Exporter only listens on /metrics and /send-metrics').to_s,
1 change: 1 addition & 0 deletions lib/bigcommerce/prometheus/instrumentors/resque.rb
@@ -46,6 +46,7 @@ def start

server.add_type_collector(PrometheusExporter::Server::ActiveRecordCollector.new)
server.add_type_collector(Bigcommerce::Prometheus::TypeCollectors::Resque.new)
server.add_type_collector(Bigcommerce::Prometheus::TypeCollectors::ResqueJob.new)
@type_collectors.each do |tc|
server.add_type_collector(tc)
end
3 changes: 3 additions & 0 deletions lib/bigcommerce/prometheus/integrations/resque.rb
@@ -34,6 +34,9 @@ def self.start(client: nil)
client: client || ::Bigcommerce::Prometheus.client,
frequency: ::Bigcommerce::Prometheus.resque_collection_frequency
)
::Bigcommerce::Prometheus::Integrations::Resque::JobMetrics.start(
client: client || ::Bigcommerce::Prometheus.client
)
end
end
end
65 changes: 65 additions & 0 deletions lib/bigcommerce/prometheus/integrations/resque/active_job_payload.rb
@@ -0,0 +1,65 @@
# frozen_string_literal: true

# Copyright (c) 2019-present, BigCommerce Pty. Ltd. All rights reserved
#
# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
# documentation files (the "Software"), to deal in the Software without restriction, including without limitation the
# rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit
# persons to whom the Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all copies or substantial portions of the
# Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE
# WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
# COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
# OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
#
require 'time'

module Bigcommerce
  module Prometheus
    module Integrations
      class Resque
        ##
        # Payload fields for an ActiveJob-shaped Resque job, read from the
        # inner hash at `args[0]`. ActiveJob's JobWrapper stamps the three
        # fields the per-job metrics consume:
        #
        #   * job_class    - the user's actual job class name; used as the
        #                    metric label.
        #   * enqueued_at  - ISO 8601 string; queue-latency anchor when
        #                    scheduled_at is absent.
        #   * scheduled_at - ISO 8601 string; preferred over enqueued_at
        #                    when present (e.g. retries-with-backoff, so the
        #                    intentional wait isn't counted as latency).
        class ActiveJobPayload
          # @return [String] the user's actual job class name
          attr_reader :job_class

          # @return [Time, nil] the queue-latency anchor; nil when both
          #   timestamps are absent or unparseable
          attr_reader :anchor_time

          # @param [Hash] inner the ActiveJob-shaped hash at `args[0]`;
          #   JobPayload.for guarantees a truthy 'job_class'
          def initialize(inner)
            @job_class = inner['job_class']
            @anchor_time = parse_time(inner['scheduled_at']) || parse_time(inner['enqueued_at'])
          end

          private

          def parse_time(value)
            return value if value.is_a?(Time)
            return nil if value.nil? || value.to_s.empty?

            Time.iso8601(value.to_s)
          rescue ArgumentError
            nil
          end
        end
      end
    end
  end
end
166 changes: 166 additions & 0 deletions lib/bigcommerce/prometheus/integrations/resque/job_metrics.rb
@@ -0,0 +1,166 @@
# frozen_string_literal: true

# Copyright (c) 2019-present, BigCommerce Pty. Ltd. All rights reserved
#
# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
# documentation files (the "Software"), to deal in the Software without restriction, including without limitation the
# rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit
# persons to whom the Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all copies or substantial portions of the
# Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE
# WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
# COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
# OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
#
module Bigcommerce
  module Prometheus
    module Integrations
      class Resque
        ##
        # Per-Resque-job histogram metrics, recorded from the parent worker process.
        # Hooked via a prepend around Resque::Worker#perform_with_fork.
        # Queue latency is captured before super, perform duration after.
        #
        # Off unless PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED=1.
        # Emits one histogram observation per job per worker process; every distinct
        # job_class adds its own histogram series, which can be high cardinality at scale.
        #
        # NOTE: queue_latency is supported for jobs enqueued via ActiveJob.
        # The gem reads three fields from `payload['args'][0]` (which must be a Hash):
        #
        #   * job_class    - the user's actual job class name; used as the
        #                    metric label.
        #   * enqueued_at  - ISO 8601 string; used as the queue-latency
        #                    anchor when scheduled_at is absent.
        #   * scheduled_at - ISO 8601 string; preferred over enqueued_at
        #                    when present (e.g. retries-with-backoff, so
        #                    the intentional wait isn't counted as latency).
        #
        # ActiveJob produces this shape natively: the payload is wrapped by
        # ActiveJob::QueueAdapters::ResqueAdapter::JobWrapper, which stamps
        # the three fields above into `args[0]`.
        #
        # Vanilla Resque jobs enqueued via Resque.enqueue carry no enqueue timestamps:
        #
        #   class MyJob
        #     @queue = :foo
        #
        #     def self.perform; end
        #   end
        #
        # Their args are raw primitive values, not a wrapping hash.
        # For these jobs, queue_latency silently no-ops.
        # perform_duration works for both styles regardless.
        #
        # Payloads that replicate the three fields above are read the same way.
        # Detection is by shape, not by wrapper class name.
        # This means a vanilla job can opt in to queue_latency by either:
        #   - converting to ActiveJob, or
        #   - enqueueing through a small wrapper class that stamps these fields into args[0].
        #
        module JobMetrics
          class << self
            ##
            # Install the parent-side hooks if the per-job metrics feature is enabled.
            # Idempotent: safe to call multiple times.
            #
            # @param [PrometheusExporter::Client] client
            #
            def start(client:)
              return unless ::Bigcommerce::Prometheus.resque_per_job_metrics_enabled

              @client = client
              install_hooks
            end

            ##
            # Push the queue-latency observation for a job that's about to be picked up by a worker.
            # Anchors on scheduled_at if present so retries-with-backoff don't show the intentional wait as latency.
            # Falls back to enqueued_at if scheduled_at isn't present.
            #
            # @param [ActiveJobPayload, VanillaResquePayload] payload
            #
            def record_queue_latency(payload)
              anchor = payload.anchor_time
              return unless anchor

              # Clock skew between the enqueuer/scheduler and the worker can put the anchor in the future.
              # Clamp to zero so the histogram never records a negative latency.
              latency = (Time.now - anchor).to_f.clamp(0.0..)

              @client.send_json(
                type: 'resque_job',
                metric: 'queue_latency',
                value: latency,
                custom_labels: { job_class: payload.job_class }
              )
            rescue StandardError => e
              ::Bigcommerce::Prometheus.logger&.warn(
                "[bigcommerce-prometheus] resque_job queue_latency push failed: #{e.message}"
              )
            end

            ##
            # Push the perform-duration observation for a completed job.
            # Called from the `Resque::Worker#perform_with_fork` prepend, so it measures the full child lifetime:
            # fork + reconnect + perform + exit
            #
            # @param [ActiveJobPayload, VanillaResquePayload] payload
            # @param [Float] duration in seconds
            #
            def record_perform_duration(payload, duration)
              @client.send_json(
                type: 'resque_job',
                metric: 'perform_duration',
                value: duration,
                custom_labels: { job_class: payload.job_class }
              )
            rescue StandardError => e
              ::Bigcommerce::Prometheus.logger&.warn(
                "[bigcommerce-prometheus] resque_job perform_duration push failed: #{e.message}"
              )
            end

            private

            def install_hooks
              return if @hooks_installed

              ::Resque::Worker.prepend(WorkerInstrumentation)
              @hooks_installed = true
            end
          end

          ##
          # Prepended onto Resque::Worker to capture for every job that goes through perform_with_fork:
          #   - queue latency: before super
          #   - perform duration: after super
          # The duration timer starts immediately before super so the
          # observation matches the documented fork-to-waitpid window and
          # excludes payload parsing and the queue-latency push. The ensure
          # is guarded so instrumentation failures can never mask a job
          # exception.
          module WorkerInstrumentation
            def perform_with_fork(job, &block)
              payload = begin
                JobPayload.for(job)
              rescue StandardError
                nil
              end
              JobMetrics.record_queue_latency(payload) if payload
              started_at = Process.clock_gettime(Process::CLOCK_MONOTONIC)
              super
            ensure
              if payload && started_at
                JobMetrics.record_perform_duration(
                  payload,
                  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started_at
                )
              end
            end
          end
        end
      end
    end
  end
end
69 changes: 69 additions & 0 deletions lib/bigcommerce/prometheus/integrations/resque/job_payload.rb
@@ -0,0 +1,69 @@
# frozen_string_literal: true

# Copyright (c) 2019-present, BigCommerce Pty. Ltd. All rights reserved
#
# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
# documentation files (the "Software"), to deal in the Software without restriction, including without limitation the
# rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit
# persons to whom the Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all copies or substantial portions of the
# Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE
# WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
# COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
# OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
#
module Bigcommerce
  module Prometheus
    module Integrations
      class Resque
        ##
        # Classifies a Resque::Job's payload and builds the matching
        # shape-specific payload object for per-job metrics.
        #
        # A payload is ActiveJob-shaped when `args[0]` is a Hash carrying a
        # truthy 'job_class', the shape
        # ActiveJob::QueueAdapters::ResqueAdapter::JobWrapper produces
        # natively. Detection is by shape rather than by wrapper class name:
        # the fields are ActiveJob's stable serialization format (persisted
        # payloads must survive Rails upgrades), while the wrapper's class
        # name is a private Rails constant, so matching on it would silently
        # kill the metric if Rails ever moved it. Payloads that replicate
        # these fields are read the same way, by the same mechanism.
        # Everything else (vanilla Resque jobs with primitive args, nil or
        # non-Hash payloads, `args` not being an Array) is treated as vanilla.
        #
        # Both payload classes expose the same interface: #job_class
        # (String) and #anchor_time (Time or nil).
        #
        module JobPayload
          class << self
            ##
            # @param [Resque::Job] resque_job
            # @return [ActiveJobPayload, VanillaResquePayload]
            #
            def for(resque_job)
              payload = resque_job.payload
              payload = {} unless payload.is_a?(Hash)

              inner = activejob_inner(payload)
              inner ? ActiveJobPayload.new(inner) : VanillaResquePayload.new(payload)
            end

            private

            def activejob_inner(payload)
              args = payload['args']
              first = args.is_a?(Array) ? args.first : nil
              return nil unless first.is_a?(Hash) && first['job_class']

              first
            end
          end
        end
      end
    end
  end
end