Skip to content

PAYMENTS-11567 Resque latency metrics#30

Draft
WillemHoman wants to merge 2 commits into
mainfrom
PAYMENTS-11567-resque_latency
Draft

PAYMENTS-11567 Resque latency metrics#30
WillemHoman wants to merge 2 commits into
mainfrom
PAYMENTS-11567-resque_latency

Conversation

@WillemHoman

@WillemHoman WillemHoman commented May 28, 2026

Copy link
Copy Markdown

What? Why?

Bigpay's domain-event publishing was suspected of being slow. To diagnose this, a previous PR (bigpay#10597, PAYMENTS-11567) added per-job Prometheus histograms (queue latency and publish duration) directly inside the Resque worker child process.
This required a synchronous metric-flush to ensure observations weren't lost when the fork-per-job child exited.
It had to wait on the bc-prom client's background drain cycle, adding ~480 ms per job.
With 20 worker pods each handling jobs serially, fleet throughput on the scheduled_action queue collapsed from ~1000 jobs/sec to ~40 jobs/sec, queues backed up, and the change was reverted.

This PR provides the same diagnostic visibility — but implemented in the parent Resque worker process where no synchronous flush is needed.
Bigpay (and any other service using bc-prometheus-ruby with Resque) that opt in via the PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED env var, gets per-job metrics with bounded cardinality, and pays zero per-job latency cost.

Two new histograms are exposed:

  • resque_job_queue_latency_seconds{job_class} — time from scheduled_at (or enqueued_at) until the worker picks the job up. Tells you whether the queue is backing up. Supported for ActiveJob-enqueued jobs — see below.
  • resque_job_perform_duration_seconds{job_class} — total child-process lifetime (fork → Process.waitpid return). A proxy for "how long is this job class actually taking?" — includes fork overhead and exit, but at the seconds-scale used for incident diagnosis the ~15 ms of fixed overhead is noise. Works for every Resque job regardless of payload shape.

Off by default. Opt in per service by setting PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED=1 on the worker pods.

How queue_latency reads the payload: the gem reads three fields from payload['args'][0] (which must be a Hash):

  • job_class — the user's actual job class name; used as the metric label.
  • enqueued_at — ISO 8601 string; used as the queue-latency anchor when scheduled_at is absent.
  • scheduled_at — ISO 8601 string; preferred over enqueued_at when present (e.g. retries-with-backoff).

ActiveJob (jobs inheriting from ApplicationJob, enqueued via .perform_later) produces this shape natively — the payload is wrapped by ActiveJob::QueueAdapters::ResqueAdapter::JobWrapper, which stamps those three keys into args[0]. Vanilla Resque jobs (class MyJob; @queue = :foo; def self.perform; end, enqueued via Resque.enqueue) do not produce this shape — their args are raw primitive values, not a wrapping hash — so queue_latency silently no-ops for them. perform_duration is unaffected. See queue_latency is ActiveJob-only below.

Why in bc-prometheus-ruby (not in each client service)

The instrumentation hooks live at the Resque::Worker level — Resque.before_fork for queue latency, Module#prepend around Resque::Worker#perform_with_fork for perform duration. These are gem-level concerns: every BigCommerce service using Resque has the same Worker class and the same fork-per-job lifecycle. Adding the metrics here means:

  • One implementation, used by all services, with consistent metric names and label semantics for cross-service comparison in Grafana.
  • No per-service code change required to adopt — just flip the env var on worker pods.
  • The architectural lesson from the bigpay incident is encoded once. Future services don't need to rediscover that in-child metric collection requires a synchronous flush and that the flush is incompatible with fork-per-job latency budgets. The gem just does the right thing.
  • Opt-in via env var protects services that don't want the per-job cardinality cost (one histogram observation per job per worker pod can be a lot at fleet scale).

If this lived in each service's repo, every team would write similar Resque.before_fork glue, with subtle inconsistencies in label naming, anchor-time logic, and error handling.


What was implemented

New env var

  • PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED (default 0). Read once at boot via the existing Bigcommerce::Prometheus.configure machinery (lib/bigcommerce/prometheus/configuration.rb). Mirrors the existing PROMETHEUS_ENABLED opt-in pattern.

New module: Bigcommerce::Prometheus::Integrations::Resque::JobMetrics

Lives at lib/bigcommerce/prometheus/integrations/resque/job_metrics.rb. Class-method API:

  • JobMetrics.start(client:) — no-op unless the env var is on. Prepends WorkerInstrumentation onto Resque::Worker. Idempotent.
  • JobMetrics.record_queue_latency(payload) — pushes a resque_job envelope with metric: 'queue_latency', value = seconds from scheduled_at ∨ enqueued_at to now, label job_class. Takes the payload object built once per job in the prepend.
  • JobMetrics.record_perform_duration(payload, duration) — pushes a resque_job envelope with metric: 'perform_duration', value = the supplied duration, label job_class. Same payload instance.

Payload parsing is classified up front: JobPayload.for(resque_job) inspects the payload shape once and builds either an ActiveJobPayload (reads job_class and the scheduled_at/enqueued_at anchor from the inner args[0] hash) or a VanillaResquePayload (labels from the top-level class, never an anchor). Both expose the same #job_class/#anchor_time interface, eagerly extracted via type-checking (no exception-based control flow). The JobMetrics recording methods take this prebuilt payload rather than re-parsing the job on each call.

The WorkerInstrumentation submodule is the prepend that wraps perform_with_fork. It builds the payload object once (via JobPayload.for), records the queue latency before super, and records the perform duration in ensure:

def perform_with_fork(job, &block)
  started_at = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  payload    = JobPayload.for(job)
  JobMetrics.record_queue_latency(payload)
  super
ensure
  JobMetrics.record_perform_duration(
    payload,
    Process.clock_gettime(Process::CLOCK_MONOTONIC) - started_at
  )
end

Both record_* methods rescue StandardError and log a warning — metric push failures never propagate into the publish/perform path.

Type collectors

Two type collectors, one per envelope shape, registered side-by-side in Instrumentors::Resque#start. The upstream PrometheusExporter::Server::Collector routes each envelope to whichever collector's type matches envelope['type'] — no in-collector dispatch needed.

  • Bigcommerce::Prometheus::TypeCollectors::Resque (lib/bigcommerce/prometheus/type_collectors/resque.rb) continues to own the aggregate worker/queue gauges (resque_workers_total, jobs_failed_total, jobs_pending_total, jobs_processed_total, queues_total, queue_sizes) fed by Collectors::Resque#collect.

  • Bigcommerce::Prometheus::TypeCollectors::ResqueJob (lib/bigcommerce/prometheus/type_collectors/resque_job.rb) owns the two new histograms:

    • resque_job_queue_latency_seconds with buckets tuned for queue dwell ([0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 30, 60, 120, 300]).
    • resque_job_perform_duration_seconds with buckets tuned for per-job work ([0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, 30, 60]).

    It calls super(type: 'resque_job', …) in initialize so the registered type string matches the type: 'resque_job' envelopes that JobMetrics.record_* emits.

Wiring

Integrations::Resque.start(client:) now also calls JobMetrics.start(client:). If the env var is off, that call returns immediately and nothing is hooked. If on, the hooks install and the metrics flow.

Docs

  • README updated with a new "Per-job metrics (opt-in)" subsection under Resque, and a row added to the Configuration table for resque_per_job_metrics_enabled.
  • CHANGELOG entry under Pending Release.

Specs

  • spec/bigcommerce/prometheus/integrations/resque/job_metrics_spec.rbJobMetrics-specific responsibilities (envelope shape, error rescue, integration with the payload classes).
  • spec/bigcommerce/prometheus/integrations/resque/job_payload_spec.rbJobPayload.for classification (which payload class each wire shape builds). active_job_payload_spec.rb — anchor selection between scheduled_at and enqueued_at, time-parsing edge cases, field independence under partial failure. vanilla_resque_payload_spec.rb — top-level class labelling, 'unknown' fallback, anchor always nil.
  • spec/bigcommerce/prometheus/type_collectors/resque_spec.rb — aggregate type collector; existing aggregate behaviour preserved.
  • spec/bigcommerce/prometheus/type_collectors/resque_job_spec.rb — per-job type collector, including a #type assertion that the registered string is 'resque_job' (the production-routing contract).

Design decisions

Parent-side, not child-side

This is the central decision and the whole reason this PR exists. In-child metric collection requires a synchronous flush before exit! (Resque's default child-exit), and the flush is bounded by the bc-prom worker thread's sleep cadence (0.5 s by default) — which makes it slow relative to fast publishes. The parent process is long-lived, so the bc-prom worker thread drains naturally between jobs without anyone synchronously waiting on it. Moving the instrumentation to the parent eliminates the per-job latency tax entirely.

Module#prepend for Resque::Worker#perform_with_fork

Resque exposes before_fork for the queue-latency timing (parent-side, runs just before fork), but it doesn't have an "after-perform-in-parent" hook to capture fork-to-waitpid duration. Module#prepend is the cleanest Ruby idiom for adding ensure-block timing around a method we don't own. Less fragile than alias_method chains, doesn't shadow inheritance behaviour for subclasses.

job_class as the only label (this PR)

Per-job histograms can blow up Prometheus cardinality if labels are unbounded (e.g. user IDs, store IDs). job_class is bounded by the number of ActiveJob classes in the service — typically tens, not thousands. ActiveJob payloads are unwrapped (payload.dig('args', 0, 'job_class')) to label by the user's actual class name rather than ActiveJob::QueueAdapters::ResqueAdapter::JobWrapper. Falls back to the raw Resque payload class for non-ActiveJob jobs.

A follow-up PR will add an optional protocolClass.additional_resque_metric_labels(payload) — letting jobs add extra labels (e.g. event_name for bigpay's domain-event publish jobs). Deferred to keep this PR focused on the generic baseline.

Opt-in via env var, default off

Two reasons:

  • Cardinality budget: per-job histogram observations across the fleet add up. Services that don't need this signal shouldn't pay the cost.
  • Operational caution: the previous in-child attempt regressed throughput 25× in prod. Even though the parent-side approach doesn't have the same failure mode, defaulting off and letting each service opt in on its own timeline reduces risk.

Pattern mirrors PROMETHEUS_ENABLED. Toggling requires a worker-pod restart, which matches the env-var-at-boot pattern used elsewhere in the gem.

Idempotent start

start is called from Integrations::Resque.start, which itself is invoked from before_first_fork. The hook system is designed to be called once per worker process lifetime, but the prepend should still be safe under repeated calls. Implementation guards against double-registration of both the before_fork hook and the Resque::Worker prepend.

Errors never propagate

Both record_queue_latency and record_perform_duration rescue StandardError and log a warning. A misbehaving Prometheus client, a network blip, or a bug in the bc-prom transport should never break a Resque job's execution path. The metric is observational; the job is operational.

Per-job histograms live in their own TypeCollector

The two new histograms get a dedicated TypeCollectors::ResqueJob rather than being grafted onto the existing TypeCollectors::Resque. Reasons:

  • One *TypeCollector per envelope type matches the upstream gem convention (PrometheusExporter::Server::ResqueCollector#type returns the literal "resque", ActiveRecordCollector#type returns "active_record", etc.). Two envelope shapes → two collectors → the upstream router does the dispatch. Putting a case data['type'] inside one collector would duplicate routing logic the framework already provides, and hide the per-job collector behind the aggregate collector's name.
  • Explicit, legible type string. ResqueJob#initialize calls super(type: 'resque_job', …). So 'resque_job' is what appears in the wire envelopes from JobMetrics, in the registered @collectors[...] slot, in router lookups, and in any debugging output. Without the override, TypeCollectors::Base's auto-derivation would yield bigcommerce_prometheus_s_resquejob (the s is the leftover from typecollectorS, plus the flattened ResqueJob class name) — an ugly accidental string leaking into the wire format. One line of override keeps the type human-readable and decouples routing from the class name.
  • Doesn't lean on the symmetric-gsub coincidence that makes the existing aggregate pair work. Collectors::ResqueTypeCollectors::Resque routes today only because the auto-derivation on both sides happens to produce the same string (bigcommerce_prometheus_s_resque). Renaming either side silently breaks routing; the upstream router still finds something, but it's the wrong something. The new pair asserts its type explicitly and is robust to class renames.
  • Single-responsibility within the file. Aggregate gauges and per-job histograms have different sources, different update cadences, and different cardinality profiles. Keeping them in separate files makes each one's surface easier to read and to reason about.

Resque is not added as a development dependency (and the testing gap that leaves)

An initial attempt added gem 'resque', '>= 2.0' to the dev Gemfile to enable faithful testing of the new integration code. It was reverted when CI surfaced a dependency-resolution conflict:

  • The gemspec requires rack >= 3.0.
  • gem 'resque' pulls in sinatra as a transitive dep.
  • Older Sinatra versions cap rack < 3.
  • Under cold bundle resolution (CI has no Gemfile.lock — it's gitignored), the resolver picks an older Sinatra to satisfy Resque's loose sinatra >= 0.9.2 constraint, which then conflicts with the gemspec's rack >= 3.0.
  • bundle install fails on CI before specs even start.

The fix is straightforward (gem 'sinatra', '>= 4.0' forces a rack-3-compatible Sinatra), but that's a separate concern that deserves its own PR. It changes the gem's dev-bundle composition independently of the Resque-metrics work and should be revertable on its own if it causes unrelated friction. Bundling it into this PR couples two distinct changes.

Testing gap in this PR: the spec for Integrations::Resque::JobMetrics does not invoke .start, because .start calls ::Resque::Worker.prepend — which needs Resque loaded. Instead, the spec sets @client directly via instance_variable_set and exercises the pure logic (record_queue_latency, record_perform_duration — anchor selection, payload unwrapping, label assembly, error rescue) without touching Resque's API surface.

What's covered: the value-shaping logic (where the majority of bugs would live).

What's not covered: the install side — does .start actually hook before_fork and prepend the Worker module idempotently? Not testable without Resque in dev deps.

This is consistent with the gem's existing pattern. Collectors::Resque, Integrations::Resque.start, and Instrumentors::Resque#start are also entirely untested for the same reason — Resque isn't in dev deps. The "optional integration" framing made this look intentional; in practice it's accumulated testing debt.

Follow-up PRs (separate from this work):

  1. Bump the gem's Sinatra dev dep to >= 4.0 so Resque can be added to dev deps without breaking rack-3 resolution. Small, independently reviewable, independently revertable.
  2. Once Resque is available in dev: backfill specs for the install-side behaviour of JobMetrics.start (idempotent prepend, hook registration), plus the existing untested code paths in Collectors::Resque#collect, Integrations::Resque.start, and Instrumentors::Resque#start.

Both deferred to keep this PR scoped to the metrics feature itself.


queue_latency is ActiveJob-only

queue_latency is supported for jobs enqueued via ActiveJob (.perform_later). Vanilla Resque jobs (Resque.enqueue) carry no enqueue timestamps in their payloads, so the metric silently no-ops for them. perform_duration covers every job regardless of payload shape.

If a vanilla Resque job ever needs queue-latency visibility, the supported route is converting that job to ActiveJob. Things to plan for when converting: in-flight old-shape payloads in Redis will fail against the converted class at deploy time (drain the queue or keep a temporary self.perform shim), resque-scheduler YAML schedules can't enqueue ActiveJobs natively (each scheduled entry needs a small shim class), and arguments pass through ActiveJob serialization (GlobalID for ActiveRecord objects).

Mechanism note: the gem detects the payload by the three fields above rather than by matching the JobWrapper class name — the fields are ActiveJob's stable serialization format (persisted jobs must survive Rails upgrades), while the wrapper's class name is a private Rails constant. A payload that replicates those fields is therefore read the same way, but that is an implementation detail of how ActiveJob payloads are parsed, not a supported integration contract.

For bigpay this boundary costs nothing today: every job on the scheduled_action queue — the queue this PR was built to observe — is already ActiveJob.


Rollout

After release, consuming services opt in by:

  1. Bumping their Gemfile to the new gem version.
  2. Setting PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED=1 on worker pods only (not web, gRPC, etc. — they don't run Resque::Worker).
  3. Verifying metrics appear at /metrics (or wherever scraped) with the resque_job_* names and job_class labels.

For bigpay specifically, this replaces the reverted PAYMENTS-11567 instrumentation. A follow-up bigpay PR will:

  • Set the env var on worker pods via Launchbay.
  • Delete the in-child PublishJobInstrumentation machinery introduced and reverted in the original attempt.
  • Optionally implement additional_resque_metric_labels on BasePublishToDomainEventingJob once the follow-up gem PR adds the label-extension protocol.

How was it tested?

# HELP ruby_resque_job_queue_latency_seconds Seconds between when a Resque job was due to run (scheduled_at if set, falling back to enqueued_at) and when a worker process picked it up. Recorded per attempt; retries-with-backoff anchor on scheduled_at, excluding the intentional backoff wait. Opt-in via PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED.
# TYPE ruby_resque_job_queue_latency_seconds histogram
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="+Inf"} 2
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="300"} 2
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="120"} 2
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="60"} 2
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="30"} 2
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="5"} 2
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="2.5"} 0
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="1"} 0
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="0.5"} 0
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="0.25"} 0
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="0.1"} 0
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="0.05"} 0
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="0.01"} 0
ruby_resque_job_queue_latency_seconds_count{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob"} 2
ruby_resque_job_queue_latency_seconds_sum{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob"} 8.344688
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="+Inf"} 1
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="300"} 1
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="120"} 1
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="60"} 1
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="30"} 1
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="5"} 1
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="2.5"} 0
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="1"} 0
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="0.5"} 0
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="0.25"} 0
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="0.1"} 0
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="0.05"} 0
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="0.01"} 0
ruby_resque_job_queue_latency_seconds_count{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob"} 1
ruby_resque_job_queue_latency_seconds_sum{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob"} 4.981598

# HELP ruby_resque_job_perform_duration_seconds Total Resque child process lifetime (fork to waitpid). Includes fork overhead, Redis reconnect, after_fork hooks, perform, and exit. Used as the per-job throughput signal at the worker-pod level. Opt-in via PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED.
# TYPE ruby_resque_job_perform_duration_seconds histogram
ruby_resque_job_perform_duration_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="+Inf"} 1
ruby_resque_job_perform_duration_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="60"} 1
ruby_resque_job_perform_duration_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="30"} 1
ruby_resque_job_perform_duration_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="10"} 1
ruby_resque_job_perform_duration_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="5"} 1
ruby_resque_job_perform_duration_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="2"} 1
ruby_resque_job_perform_duration_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="1"} 0
ruby_resque_job_perform_duration_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="0.5"} 0
ruby_resque_job_perform_duration_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="0.25"} 0
ruby_resque_job_perform_duration_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="0.1"} 0
ruby_resque_job_perform_duration_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="0.05"} 0
ruby_resque_job_perform_duration_seconds_count{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob"} 1
ruby_resque_job_perform_duration_seconds_sum{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob"} 1.4852839999366552
ruby_resque_job_perform_duration_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="+Inf"} 1
ruby_resque_job_perform_duration_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="60"} 1
ruby_resque_job_perform_duration_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="30"} 1
ruby_resque_job_perform_duration_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="10"} 1
ruby_resque_job_perform_duration_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="5"} 1
ruby_resque_job_perform_duration_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="2"} 1
ruby_resque_job_perform_duration_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="1"} 1
ruby_resque_job_perform_duration_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="0.5"} 0
ruby_resque_job_perform_duration_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="0.25"} 0
ruby_resque_job_perform_duration_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="0.1"} 0
ruby_resque_job_perform_duration_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="0.05"} 0
ruby_resque_job_perform_duration_seconds_count{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob"} 1
ruby_resque_job_perform_duration_seconds_sum{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob"} 0.7090679999673739


Note

Cursor Bugbot is generating a summary for commit 12b01d8. Configure here.

@WillemHoman WillemHoman force-pushed the PAYMENTS-11567-resque_latency branch 3 times, most recently from fe8acb7 to 12b01d8 Compare May 29, 2026 00:11
@WillemHoman

Copy link
Copy Markdown
Author

bugbot run

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit 12b01d8. Configure here.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an opt-in Resque integration to emit per-job Prometheus histograms (queue latency and perform duration) from the parent worker process, avoiding the per-job synchronous flush overhead that previously regressed throughput in fork-per-job children.

Changes:

  • Add PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED configuration and parent-side Resque worker instrumentation to emit resque_job_* envelopes.
  • Introduce a dedicated resque_job type collector (TypeCollectors::ResqueJob) to expose the new per-job histograms.
  • Add/adjust specs and documentation (README + changelog) for the new functionality.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
spec/bigcommerce/prometheus/type_collectors/resque_spec.rb Updates spec to exercise TypeCollectors::Base#collect label-merging behavior.
spec/bigcommerce/prometheus/type_collectors/resque_job_spec.rb Adds spec coverage for new ResqueJob type collector routing + histogram observation behavior.
spec/bigcommerce/prometheus/integrations/resque/job_payload_spec.rb Adds spec coverage for payload parsing (job_class + anchor timestamp selection/parsing).
spec/bigcommerce/prometheus/integrations/resque/job_metrics_spec.rb Adds spec coverage for envelope shape and error-rescue behavior without requiring Resque.
README.md Documents opt-in per-job metrics and adds configuration table row.
lib/bigcommerce/prometheus/type_collectors/resque.rb Clarifies responsibility boundaries between aggregate Resque metrics vs per-job metrics.
lib/bigcommerce/prometheus/type_collectors/resque_job.rb New type collector for per-job histograms with explicit type: 'resque_job' routing.
lib/bigcommerce/prometheus/integrations/resque/job_payload.rb New payload parser used by per-job metrics logic.
lib/bigcommerce/prometheus/integrations/resque/job_metrics.rb New parent-side Resque worker instrumentation + metric envelope emission.
lib/bigcommerce/prometheus/integrations/resque.rb Wires JobMetrics.start into the Resque integration startup path.
lib/bigcommerce/prometheus/instrumentors/resque.rb Registers the new ResqueJob type collector with the exporter server.
lib/bigcommerce/prometheus/configuration.rb Adds resque_per_job_metrics_enabled config key (env var gated).
lib/bigcommerce/prometheus.rb Requires new integration/type collector files.
CHANGELOG.md Adds pending-release entry for the new per-job Resque metrics.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1 to +2
# frozen_string_literal: true

Comment on lines +86 to +91
@client.send_json(
type: 'resque_job',
metric: 'queue_latency',
value: (Time.now - anchor).to_f,
custom_labels: { job_class: payload.job_class }
)
Comment thread CHANGELOG.md Outdated

### Pending Release

- Add opt-in per-Resque-job histograms `resque_job_queue_latency_seconds` and `resque_job_perform_duration_seconds`, labelled by `job_class`. Recorded from the parent worker process (via `Resque.before_fork` and a `Module#prepend` around `Resque::Worker#perform_with_fork`), so no synchronous flush is needed in the forked child. Gated by `PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED` (default off).

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.

Comment on lines +141 to +151
def perform_with_fork(job, &block)
started_at = Process.clock_gettime(Process::CLOCK_MONOTONIC)
payload = JobPayload.new(job)
JobMetrics.record_queue_latency(payload)
super
ensure
JobMetrics.record_perform_duration(
payload,
Process.clock_gettime(Process::CLOCK_MONOTONIC) - started_at
)
end
Comment on lines +52 to +57
def collect_metrics(data:, labels: {})
name = data['metric']&.to_sym
return unless %i[queue_latency perform_duration].include?(name)

metric(name).observe(data['value'], labels)
end
Comment on lines +63 to 68
it "builds the #{hash_key} with the expected class and name" do
metric = subject[hash_key]
expect(metric).to be_a config[:class]
expect(metric.name).to eq config[:name]
expect(metric.help).to eq config[:help]
expect(metric.help).to eq(config[:help]) if config[:help]
end
Comment thread README.md
- `resque_job_queue_latency_seconds{job_class}` — time from `scheduled_at` (falling back to `enqueued_at`) until a worker picks the job up. Per attempt; retries-with-backoff anchor on `scheduled_at` so the intentional backoff doesn't show as latency.
- `resque_job_perform_duration_seconds{job_class}` — total Resque child lifetime (fork → `Process.waitpid` return). Includes fork overhead, Redis reconnect, after_fork hooks, perform, and exit.

These are off by default because they emit one histogram observation per job per worker pod, which adds cardinality. Opt in per service.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants