PAYMENTS-11567 Resque latency metrics#30
Draft
WillemHoman wants to merge 2 commits into
Draft
Conversation
fe8acb7 to
12b01d8
Compare
Author
|
bugbot run |
There was a problem hiding this comment.
✅ Bugbot reviewed your changes and found no new issues!
Comment @cursor review or bugbot run to trigger another review on this PR
Reviewed by Cursor Bugbot for commit 12b01d8. Configure here.
There was a problem hiding this comment.
Pull request overview
Adds an opt-in Resque integration to emit per-job Prometheus histograms (queue latency and perform duration) from the parent worker process, avoiding the per-job synchronous flush overhead that previously regressed throughput in fork-per-job children.
Changes:
- Add
PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLEDconfiguration and parent-side Resque worker instrumentation to emitresque_job_*envelopes. - Introduce a dedicated
resque_jobtype collector (TypeCollectors::ResqueJob) to expose the new per-job histograms. - Add/adjust specs and documentation (README + changelog) for the new functionality.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| spec/bigcommerce/prometheus/type_collectors/resque_spec.rb | Updates spec to exercise TypeCollectors::Base#collect label-merging behavior. |
| spec/bigcommerce/prometheus/type_collectors/resque_job_spec.rb | Adds spec coverage for new ResqueJob type collector routing + histogram observation behavior. |
| spec/bigcommerce/prometheus/integrations/resque/job_payload_spec.rb | Adds spec coverage for payload parsing (job_class + anchor timestamp selection/parsing). |
| spec/bigcommerce/prometheus/integrations/resque/job_metrics_spec.rb | Adds spec coverage for envelope shape and error-rescue behavior without requiring Resque. |
| README.md | Documents opt-in per-job metrics and adds configuration table row. |
| lib/bigcommerce/prometheus/type_collectors/resque.rb | Clarifies responsibility boundaries between aggregate Resque metrics vs per-job metrics. |
| lib/bigcommerce/prometheus/type_collectors/resque_job.rb | New type collector for per-job histograms with explicit type: 'resque_job' routing. |
| lib/bigcommerce/prometheus/integrations/resque/job_payload.rb | New payload parser used by per-job metrics logic. |
| lib/bigcommerce/prometheus/integrations/resque/job_metrics.rb | New parent-side Resque worker instrumentation + metric envelope emission. |
| lib/bigcommerce/prometheus/integrations/resque.rb | Wires JobMetrics.start into the Resque integration startup path. |
| lib/bigcommerce/prometheus/instrumentors/resque.rb | Registers the new ResqueJob type collector with the exporter server. |
| lib/bigcommerce/prometheus/configuration.rb | Adds resque_per_job_metrics_enabled config key (env var gated). |
| lib/bigcommerce/prometheus.rb | Requires new integration/type collector files. |
| CHANGELOG.md | Adds pending-release entry for the new per-job Resque metrics. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+1
to
+2
| # frozen_string_literal: true | ||
|
|
Comment on lines
+86
to
+91
| @client.send_json( | ||
| type: 'resque_job', | ||
| metric: 'queue_latency', | ||
| value: (Time.now - anchor).to_f, | ||
| custom_labels: { job_class: payload.job_class } | ||
| ) |
|
|
||
| ### Pending Release | ||
|
|
||
| - Add opt-in per-Resque-job histograms `resque_job_queue_latency_seconds` and `resque_job_perform_duration_seconds`, labelled by `job_class`. Recorded from the parent worker process (via `Resque.before_fork` and a `Module#prepend` around `Resque::Worker#perform_with_fork`), so no synchronous flush is needed in the forked child. Gated by `PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED` (default off). |
… on worker processing time
30d8f7c to
599ca5c
Compare
Comment on lines
+141
to
+151
| def perform_with_fork(job, &block) | ||
| started_at = Process.clock_gettime(Process::CLOCK_MONOTONIC) | ||
| payload = JobPayload.new(job) | ||
| JobMetrics.record_queue_latency(payload) | ||
| super | ||
| ensure | ||
| JobMetrics.record_perform_duration( | ||
| payload, | ||
| Process.clock_gettime(Process::CLOCK_MONOTONIC) - started_at | ||
| ) | ||
| end |
Comment on lines
+52
to
+57
| def collect_metrics(data:, labels: {}) | ||
| name = data['metric']&.to_sym | ||
| return unless %i[queue_latency perform_duration].include?(name) | ||
|
|
||
| metric(name).observe(data['value'], labels) | ||
| end |
Comment on lines
+63
to
68
| it "builds the #{hash_key} with the expected class and name" do | ||
| metric = subject[hash_key] | ||
| expect(metric).to be_a config[:class] | ||
| expect(metric.name).to eq config[:name] | ||
| expect(metric.help).to eq config[:help] | ||
| expect(metric.help).to eq(config[:help]) if config[:help] | ||
| end |
| - `resque_job_queue_latency_seconds{job_class}` — time from `scheduled_at` (falling back to `enqueued_at`) until a worker picks the job up. Per attempt; retries-with-backoff anchor on `scheduled_at` so the intentional backoff doesn't show as latency. | ||
| - `resque_job_perform_duration_seconds{job_class}` — total Resque child lifetime (fork → `Process.waitpid` return). Includes fork overhead, Redis reconnect, after_fork hooks, perform, and exit. | ||
|
|
||
| These are off by default because they emit one histogram observation per job per worker pod, which adds cardinality. Opt in per service. |
…load and VanillaResquePayload
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What? Why?
Bigpay's domain-event publishing was suspected of being slow. To diagnose this, a previous PR (bigpay#10597, PAYMENTS-11567) added per-job Prometheus histograms (queue latency and publish duration) directly inside the Resque worker child process.
This required a synchronous metric-flush to ensure observations weren't lost when the fork-per-job child exited.
It had to wait on the bc-prom client's background drain cycle, adding ~480 ms per job.
With 20 worker pods each handling jobs serially, fleet throughput on the
scheduled_actionqueue collapsed from ~1000 jobs/sec to ~40 jobs/sec, queues backed up, and the change was reverted.This PR provides the same diagnostic visibility — but implemented in the parent Resque worker process where no synchronous flush is needed.
Bigpay (and any other service using bc-prometheus-ruby with Resque) that opt in via the
PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLEDenv var, gets per-job metrics with bounded cardinality, and pays zero per-job latency cost.Two new histograms are exposed:
resque_job_queue_latency_seconds{job_class}— time fromscheduled_at(orenqueued_at) until the worker picks the job up. Tells you whether the queue is backing up. Supported for ActiveJob-enqueued jobs — see below.resque_job_perform_duration_seconds{job_class}— total child-process lifetime (fork →Process.waitpidreturn). A proxy for "how long is this job class actually taking?" — includes fork overhead and exit, but at the seconds-scale used for incident diagnosis the ~15 ms of fixed overhead is noise. Works for every Resque job regardless of payload shape.Off by default. Opt in per service by setting
PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED=1on the worker pods.How
queue_latencyreads the payload: the gem reads three fields frompayload['args'][0](which must be a Hash):job_class— the user's actual job class name; used as the metric label.enqueued_at— ISO 8601 string; used as the queue-latency anchor whenscheduled_atis absent.scheduled_at— ISO 8601 string; preferred overenqueued_atwhen present (e.g. retries-with-backoff).ActiveJob (jobs inheriting from
ApplicationJob, enqueued via.perform_later) produces this shape natively — the payload is wrapped byActiveJob::QueueAdapters::ResqueAdapter::JobWrapper, which stamps those three keys intoargs[0]. Vanilla Resque jobs (class MyJob; @queue = :foo; def self.perform; end, enqueued viaResque.enqueue) do not produce this shape — their args are raw primitive values, not a wrapping hash — soqueue_latencysilently no-ops for them.perform_durationis unaffected. Seequeue_latencyis ActiveJob-only below.Why in bc-prometheus-ruby (not in each client service)
The instrumentation hooks live at the
Resque::Workerlevel —Resque.before_forkfor queue latency,Module#prependaroundResque::Worker#perform_with_forkfor perform duration. These are gem-level concerns: every BigCommerce service using Resque has the same Worker class and the same fork-per-job lifecycle. Adding the metrics here means:If this lived in each service's repo, every team would write similar
Resque.before_forkglue, with subtle inconsistencies in label naming, anchor-time logic, and error handling.What was implemented
New env var
PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED(default0). Read once at boot via the existingBigcommerce::Prometheus.configuremachinery (lib/bigcommerce/prometheus/configuration.rb). Mirrors the existingPROMETHEUS_ENABLEDopt-in pattern.New module:
Bigcommerce::Prometheus::Integrations::Resque::JobMetricsLives at
lib/bigcommerce/prometheus/integrations/resque/job_metrics.rb. Class-method API:JobMetrics.start(client:)— no-op unless the env var is on. PrependsWorkerInstrumentationontoResque::Worker. Idempotent.JobMetrics.record_queue_latency(payload)— pushes aresque_jobenvelope withmetric: 'queue_latency', value = seconds fromscheduled_at ∨ enqueued_atto now, labeljob_class. Takes the payload object built once per job in the prepend.JobMetrics.record_perform_duration(payload, duration)— pushes aresque_jobenvelope withmetric: 'perform_duration', value = the supplied duration, labeljob_class. Same payload instance.Payload parsing is classified up front:
JobPayload.for(resque_job)inspects the payload shape once and builds either anActiveJobPayload(readsjob_classand thescheduled_at/enqueued_atanchor from the innerargs[0]hash) or aVanillaResquePayload(labels from the top-levelclass, never an anchor). Both expose the same#job_class/#anchor_timeinterface, eagerly extracted via type-checking (no exception-based control flow). TheJobMetricsrecording methods take this prebuilt payload rather than re-parsing the job on each call.The
WorkerInstrumentationsubmodule is the prepend that wrapsperform_with_fork. It builds the payload object once (viaJobPayload.for), records the queue latency beforesuper, and records the perform duration inensure:Both
record_*methods rescueStandardErrorand log a warning — metric push failures never propagate into the publish/perform path.Type collectors
Two type collectors, one per envelope shape, registered side-by-side in
Instrumentors::Resque#start. The upstreamPrometheusExporter::Server::Collectorroutes each envelope to whichever collector'stypematchesenvelope['type']— no in-collector dispatch needed.Bigcommerce::Prometheus::TypeCollectors::Resque(lib/bigcommerce/prometheus/type_collectors/resque.rb) continues to own the aggregate worker/queue gauges (resque_workers_total,jobs_failed_total,jobs_pending_total,jobs_processed_total,queues_total,queue_sizes) fed byCollectors::Resque#collect.Bigcommerce::Prometheus::TypeCollectors::ResqueJob(lib/bigcommerce/prometheus/type_collectors/resque_job.rb) owns the two new histograms:resque_job_queue_latency_secondswith buckets tuned for queue dwell ([0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 30, 60, 120, 300]).resque_job_perform_duration_secondswith buckets tuned for per-job work ([0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, 30, 60]).It calls
super(type: 'resque_job', …)ininitializeso the registered type string matches thetype: 'resque_job'envelopes thatJobMetrics.record_*emits.Wiring
Integrations::Resque.start(client:)now also callsJobMetrics.start(client:). If the env var is off, that call returns immediately and nothing is hooked. If on, the hooks install and the metrics flow.Docs
resque_per_job_metrics_enabled.Specs
spec/bigcommerce/prometheus/integrations/resque/job_metrics_spec.rb—JobMetrics-specific responsibilities (envelope shape, error rescue, integration with the payload classes).spec/bigcommerce/prometheus/integrations/resque/job_payload_spec.rb—JobPayload.forclassification (which payload class each wire shape builds).active_job_payload_spec.rb— anchor selection betweenscheduled_atandenqueued_at, time-parsing edge cases, field independence under partial failure.vanilla_resque_payload_spec.rb— top-level class labelling,'unknown'fallback, anchor always nil.spec/bigcommerce/prometheus/type_collectors/resque_spec.rb— aggregate type collector; existing aggregate behaviour preserved.spec/bigcommerce/prometheus/type_collectors/resque_job_spec.rb— per-job type collector, including a#typeassertion that the registered string is'resque_job'(the production-routing contract).Design decisions
Parent-side, not child-side
This is the central decision and the whole reason this PR exists. In-child metric collection requires a synchronous flush before
exit!(Resque's default child-exit), and the flush is bounded by the bc-prom worker thread's sleep cadence (0.5 s by default) — which makes it slow relative to fast publishes. The parent process is long-lived, so the bc-prom worker thread drains naturally between jobs without anyone synchronously waiting on it. Moving the instrumentation to the parent eliminates the per-job latency tax entirely.Module#prependforResque::Worker#perform_with_forkResque exposes
before_forkfor the queue-latency timing (parent-side, runs just before fork), but it doesn't have an "after-perform-in-parent" hook to capture fork-to-waitpid duration.Module#prependis the cleanest Ruby idiom for addingensure-block timing around a method we don't own. Less fragile thanalias_methodchains, doesn't shadow inheritance behaviour for subclasses.job_classas the only label (this PR)Per-job histograms can blow up Prometheus cardinality if labels are unbounded (e.g. user IDs, store IDs).
job_classis bounded by the number of ActiveJob classes in the service — typically tens, not thousands. ActiveJob payloads are unwrapped (payload.dig('args', 0, 'job_class')) to label by the user's actual class name rather thanActiveJob::QueueAdapters::ResqueAdapter::JobWrapper. Falls back to the raw Resque payload class for non-ActiveJob jobs.A follow-up PR will add an optional protocol —
Class.additional_resque_metric_labels(payload)— letting jobs add extra labels (e.g.event_namefor bigpay's domain-event publish jobs). Deferred to keep this PR focused on the generic baseline.Opt-in via env var, default off
Two reasons:
Pattern mirrors
PROMETHEUS_ENABLED. Toggling requires a worker-pod restart, which matches the env-var-at-boot pattern used elsewhere in the gem.Idempotent
startstartis called fromIntegrations::Resque.start, which itself is invoked frombefore_first_fork. The hook system is designed to be called once per worker process lifetime, but the prepend should still be safe under repeated calls. Implementation guards against double-registration of both thebefore_forkhook and theResque::Workerprepend.Errors never propagate
Both
record_queue_latencyandrecord_perform_durationrescueStandardErrorand log a warning. A misbehaving Prometheus client, a network blip, or a bug in the bc-prom transport should never break a Resque job's execution path. The metric is observational; the job is operational.Per-job histograms live in their own TypeCollector
The two new histograms get a dedicated
TypeCollectors::ResqueJobrather than being grafted onto the existingTypeCollectors::Resque. Reasons:*TypeCollectorper envelope type matches the upstream gem convention (PrometheusExporter::Server::ResqueCollector#typereturns the literal"resque",ActiveRecordCollector#typereturns"active_record", etc.). Two envelope shapes → two collectors → the upstream router does the dispatch. Putting acase data['type']inside one collector would duplicate routing logic the framework already provides, and hide the per-job collector behind the aggregate collector's name.ResqueJob#initializecallssuper(type: 'resque_job', …). So'resque_job'is what appears in the wire envelopes fromJobMetrics, in the registered@collectors[...]slot, in router lookups, and in any debugging output. Without the override,TypeCollectors::Base's auto-derivation would yieldbigcommerce_prometheus_s_resquejob(thesis the leftover fromtypecollectorS, plus the flattenedResqueJobclass name) — an ugly accidental string leaking into the wire format. One line of override keeps the type human-readable and decouples routing from the class name.Collectors::Resque→TypeCollectors::Resqueroutes today only because the auto-derivation on both sides happens to produce the same string (bigcommerce_prometheus_s_resque). Renaming either side silently breaks routing; the upstream router still finds something, but it's the wrong something. The new pair asserts its type explicitly and is robust to class renames.Resque is not added as a development dependency (and the testing gap that leaves)
An initial attempt added
gem 'resque', '>= 2.0'to the devGemfileto enable faithful testing of the new integration code. It was reverted when CI surfaced a dependency-resolution conflict:rack >= 3.0.gem 'resque'pulls insinatraas a transitive dep.rack < 3.Gemfile.lock— it's gitignored), the resolver picks an older Sinatra to satisfy Resque's loosesinatra >= 0.9.2constraint, which then conflicts with the gemspec'srack >= 3.0.bundle installfails on CI before specs even start.The fix is straightforward (
gem 'sinatra', '>= 4.0'forces a rack-3-compatible Sinatra), but that's a separate concern that deserves its own PR. It changes the gem's dev-bundle composition independently of the Resque-metrics work and should be revertable on its own if it causes unrelated friction. Bundling it into this PR couples two distinct changes.Testing gap in this PR: the spec for
Integrations::Resque::JobMetricsdoes not invoke.start, because.startcalls::Resque::Worker.prepend— which needs Resque loaded. Instead, the spec sets@clientdirectly viainstance_variable_setand exercises the pure logic (record_queue_latency,record_perform_duration— anchor selection, payload unwrapping, label assembly, error rescue) without touching Resque's API surface.What's covered: the value-shaping logic (where the majority of bugs would live).
What's not covered: the install side — does
.startactually hookbefore_forkand prepend the Worker module idempotently? Not testable without Resque in dev deps.This is consistent with the gem's existing pattern.
Collectors::Resque,Integrations::Resque.start, andInstrumentors::Resque#startare also entirely untested for the same reason — Resque isn't in dev deps. The "optional integration" framing made this look intentional; in practice it's accumulated testing debt.Follow-up PRs (separate from this work):
>= 4.0so Resque can be added to dev deps without breaking rack-3 resolution. Small, independently reviewable, independently revertable.JobMetrics.start(idempotent prepend, hook registration), plus the existing untested code paths inCollectors::Resque#collect,Integrations::Resque.start, andInstrumentors::Resque#start.Both deferred to keep this PR scoped to the metrics feature itself.
queue_latencyis ActiveJob-onlyqueue_latencyis supported for jobs enqueued via ActiveJob (.perform_later). Vanilla Resque jobs (Resque.enqueue) carry no enqueue timestamps in their payloads, so the metric silently no-ops for them.perform_durationcovers every job regardless of payload shape.If a vanilla Resque job ever needs queue-latency visibility, the supported route is converting that job to ActiveJob. Things to plan for when converting: in-flight old-shape payloads in Redis will fail against the converted class at deploy time (drain the queue or keep a temporary
self.performshim), resque-scheduler YAML schedules can't enqueue ActiveJobs natively (each scheduled entry needs a small shim class), and arguments pass through ActiveJob serialization (GlobalID for ActiveRecord objects).Mechanism note: the gem detects the payload by the three fields above rather than by matching the
JobWrapperclass name — the fields are ActiveJob's stable serialization format (persisted jobs must survive Rails upgrades), while the wrapper's class name is a private Rails constant. A payload that replicates those fields is therefore read the same way, but that is an implementation detail of how ActiveJob payloads are parsed, not a supported integration contract.For bigpay this boundary costs nothing today: every job on the
scheduled_actionqueue — the queue this PR was built to observe — is already ActiveJob.Rollout
After release, consuming services opt in by:
PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED=1on worker pods only (not web, gRPC, etc. — they don't runResque::Worker)./metrics(or wherever scraped) with theresque_job_*names andjob_classlabels.For bigpay specifically, this replaces the reverted PAYMENTS-11567 instrumentation. A follow-up bigpay PR will:
PublishJobInstrumentationmachinery introduced and reverted in the original attempt.additional_resque_metric_labelsonBasePublishToDomainEventingJobonce the follow-up gem PR adds the label-extension protocol.How was it tested?
Note
Cursor Bugbot is generating a summary for commit 12b01d8. Configure here.