diff --git a/CHANGELOG.md b/CHANGELOG.md
index f2e6a49..12cd1bd 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -10,6 +10,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ### Added
 
 - Slow job webhook alert — set `alert_slow_job_count_threshold` (integer) to fire a webhook whenever the number of currently-running slow jobs meets or exceeds the configured count; requires `slow_job_threshold` to define what "slow" means; uses the same `alert_webhook_url` and `alert_webhook_cooldown` settings as other alert types; event name `slow_job_threshold_exceeded`
+- Stale process webhook alert — set `alert_stale_process_threshold` (integer) to fire a webhook whenever the number of stale workers meets or exceeds the configured count; a process is stale when its heartbeat has not been updated within `SolidQueue.process_alive_threshold`; uses the same `alert_webhook_url` and `alert_webhook_cooldown` settings; event name `stale_process_detected`
 
 ## [1.3.0] - 2026-05-27
 
diff --git a/README.md b/README.md
index 95cb1ca..43e83fa 100644
--- a/README.md
+++ b/README.md
@@ -53,7 +53,7 @@ SolidQueueWeb surfaces all of this in a browser UI available at any route you ch
 - **Dashboard quick actions** — "Retry All Failed" and "Discard All Blocked" cards appear on the dashboard only when the respective count is non-zero; one-click bulk operations with confirm dialogs, keeping the dashboard clean when everything is healthy
 - **CSV export** — "Export CSV" button on the jobs, failed jobs, and history pages downloads all records matching the current filters; columns are tailored per view
 - **Slow job detection** — when `slow_job_threshold` is configured, claimed jobs running longer than the threshold are flagged with an orange row, a "slow" badge, and a "Running For" duration column on the Running tab; a "Slow Jobs" warning card appears on the dashboard with a link to the Running tab
-- **Webhook alerts** — set `alert_webhook_url` and `alert_failure_threshold` to receive a POST request whenever the failed job count meets or exceeds the threshold; set `alert_queue_thresholds` for per-queue depth alerts; set `alert_slow_job_count_threshold` (requires `slow_job_threshold`) for slow-job count alerts; all fire asynchronously with a configurable cooldown (default 1 h) to prevent repeated alerts
+- **Webhook alerts** — set `alert_webhook_url` and `alert_failure_threshold` to receive a POST request whenever the failed job count meets or exceeds the threshold; set `alert_queue_thresholds` for per-queue depth alerts; set `alert_slow_job_count_threshold` (requires `slow_job_threshold`) for slow-job count alerts; set `alert_stale_process_threshold` for stale-worker alerts; all fire asynchronously with a configurable cooldown (default 1 h) to prevent repeated alerts
 - **Performance analytics** — per-job-class statistics at `/jobs/performance` showing run count, average, p50, p95, p99, standard deviation, min, and max duration; sorted by p95 descending so the slowest classes surface first; high std dev surfaces inconsistent jobs worth investigating; period filter scopes to 1h / 24h / 7d or all time; each class name links to the filtered History view
 - **Failed job trend chart** — a "Failures — Last 12 Hours" bar chart on the dashboard shows failures per hour over the last 12 hours; bars are red, making failure spikes visible before clicking into the failed jobs list
 - **Error frequency report** — `GET /jobs/failed_jobs/errors` groups all failed jobs by error class and message prefix, shows a count per group, and surfaces a sample backtrace in an expandable row; sorted by count descending so the most common errors appear first; accessible via the "Error Summary" button on the Failed Jobs page
@@ -107,7 +107,8 @@ SolidQueueWeb.configure do |config|
   config.alert_webhook_url = "https://hooks.example.com/solid-queue" # POST target — string or array (default: nil = disabled)
   config.alert_failure_threshold = 10 # fire when failed count >= this (default: nil = disabled)
   config.alert_queue_thresholds = { "critical" => 50, "default" => 200 } # fire when queue depth >= threshold (default: {})
-  config.alert_slow_job_count_threshold = 5 # fire when slow job count >= this (default: nil = disabled)
+  config.alert_slow_job_count_threshold = 5 # fire when slow job count >= this (default: nil = disabled)
+  config.alert_stale_process_threshold = 1 # fire when stale process count >= this (default: nil = disabled)
   config.alert_webhook_cooldown = 1800 # seconds between repeated alerts per alert type (default: 3600)
   config.connects_to = { reading: :reading, writing: :writing } # read replica (default: nil)
   config.time_zone = "America/New_York" # display timezone for all timestamps (default: nil = UTC)
@@ -209,6 +210,31 @@ The same `alert_webhook_url` endpoint(s) receive the payload with a distinct eve
 
 The alert fires on every dashboard page load while the condition persists, subject to the cooldown window.
 
+## Stale process alerts
+
+Set `alert_stale_process_threshold` to fire a webhook when the number of stale workers meets or exceeds a count. A process is considered stale when its `last_heartbeat_at` has not been updated within `SolidQueue.process_alive_threshold` (default 5 minutes). A stale worker means jobs in its queues have silently stopped processing.
+
+```ruby
+SolidQueueWeb.configure do |config|
+  config.alert_stale_process_threshold = 1 # fire when any process goes stale
+  config.alert_webhook_url = "https://hooks.example.com/solid-queue"
+  config.alert_webhook_cooldown = 1800 # don't re-fire for 30 minutes (default: 3600)
+end
+```
+
+The same `alert_webhook_url` endpoint(s) receive the payload with a distinct event type:
+
+```json
+{
+  "event": "stale_process_detected",
+  "stale_process_count": 2,
+  "threshold": 1,
+  "fired_at": "2026-05-28T08:00:00Z"
+}
+```
+
+The alert fires on every dashboard page load while the condition persists, subject to the cooldown window.
+
 ## Metrics endpoint
 
 `GET /jobs/metrics.json` returns a machine-readable JSON document suitable for Prometheus scraping, uptime monitors, or external dashboards. No configuration is required — the endpoint is available as soon as the engine is mounted.
diff --git a/ROADMAP.md b/ROADMAP.md
index b21141d..aad32c4 100644
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -15,7 +15,7 @@ Pull requests for any of these are welcome. See [Contributing](README.md#contrib
 | Feature | Notes |
 |---|---|
 | ~~**Slow job webhook alert**~~ | ✅ Shipped in v1.4 — `alert_slow_job_count_threshold` fires when slow-job count meets or exceeds the threshold. |
-| **Process stale webhook alert** | Fire when a worker's `last_heartbeat_at` expires. A worker going silent means jobs stop processing silently. |
+| ~~**Process stale webhook alert**~~ | ✅ Shipped in v1.4 — `alert_stale_process_threshold` fires when stale worker count meets or exceeds the threshold. |
 | **Job wait time column** | Show time from `enqueued_at` to `created_at` on claimed executions — a direct measure of queue SLA. |
 
 ---
diff --git a/app/controllers/solid_queue_web/dashboard_controller.rb b/app/controllers/solid_queue_web/dashboard_controller.rb
index 1e8c081..5285280 100644
--- a/app/controllers/solid_queue_web/dashboard_controller.rb
+++ b/app/controllers/solid_queue_web/dashboard_controller.rb
@@ -5,6 +5,7 @@ def index
       AlertWebhook.call(failure_count: @stats.counts[:failed])
       QueueDepthAlert.call
       SlowJobAlert.call
+      StaleProcessAlert.call
     end
   end
 end
diff --git a/app/services/solid_queue_web/stale_process_alert.rb b/app/services/solid_queue_web/stale_process_alert.rb
new file mode 100644
index 0000000..146c3d4
--- /dev/null
+++ b/app/services/solid_queue_web/stale_process_alert.rb
@@ -0,0 +1,68 @@
+require "net/http"
+require "json"
+require "uri"
+
+module SolidQueueWeb
+  class StaleProcessAlert
+    MUTEX = Mutex.new
+
+    class << self
+      def call
+        return unless configured?
+
+        stale_count = SolidQueue::Process
+          .where("last_heartbeat_at < ?", SolidQueue.process_alive_threshold.ago)
+          .count
+
+        return if stale_count < SolidQueueWeb.alert_stale_process_threshold
+        return unless should_fire?
+
+        urls = webhook_urls
+        Thread.new { urls.each { |url| post(url, stale_count) } }
+      end
+
+      def reset!
+        MUTEX.synchronize { @last_fired_at = nil }
+      end
+
+      private
+
+      def configured?
+        SolidQueueWeb.alert_stale_process_threshold.present? && webhook_urls.any?
+      end
+
+      def webhook_urls
+        Array(SolidQueueWeb.alert_webhook_url).flatten.compact.select(&:present?)
+      end
+
+      def should_fire?
+        MUTEX.synchronize do
+          cooldown = SolidQueueWeb.alert_webhook_cooldown
+          return false if @last_fired_at && Time.current - @last_fired_at < cooldown
+
+          @last_fired_at = Time.current
+          true
+        end
+      end
+
+      def post(url_string, stale_count)
+        uri = URI.parse(url_string)
+        payload = JSON.generate(
+          event: "stale_process_detected",
+          stale_process_count: stale_count,
+          threshold: SolidQueueWeb.alert_stale_process_threshold,
+          fired_at: Time.current.iso8601
+        )
+        http = Net::HTTP.new(uri.host, uri.port)
+        http.use_ssl = uri.scheme == "https"
+        http.open_timeout = 5
+        http.read_timeout = 10
+        request = Net::HTTP::Post.new(uri.path.presence || "/", "Content-Type" => "application/json")
+        request.body = payload
+        http.request(request)
+      rescue => e
+        Rails.logger.error("[SolidQueueWeb] Stale process alert webhook failed: #{e.message}")
+      end
+    end
+  end
+end
diff --git a/lib/solid_queue_web.rb b/lib/solid_queue_web.rb
index 81db3a0..0ea65c1 100644
--- a/lib/solid_queue_web.rb
+++ b/lib/solid_queue_web.rb
@@ -6,7 +6,8 @@ module SolidQueueWeb
   class << self
     attr_writer :page_size, :dashboard_refresh_interval, :default_refresh_interval, :search_results_limit,
                 :slow_job_threshold, :alert_webhook_url, :alert_failure_threshold, :alert_webhook_cooldown,
-                :alert_queue_thresholds, :alert_slow_job_count_threshold, :connects_to, :time_zone
+                :alert_queue_thresholds, :alert_slow_job_count_threshold, :alert_stale_process_threshold,
+                :connects_to, :time_zone
 
     def page_size
       @page_size || 25
@@ -48,6 +49,10 @@ def alert_slow_job_count_threshold
       @alert_slow_job_count_threshold
     end
 
+    def alert_stale_process_threshold
+      @alert_stale_process_threshold
+    end
+
     def connects_to
       @connects_to
     end
diff --git a/spec/services/solid_queue_web/stale_process_alert_spec.rb b/spec/services/solid_queue_web/stale_process_alert_spec.rb
new file mode 100644
index 0000000..2fd0184
--- /dev/null
+++ b/spec/services/solid_queue_web/stale_process_alert_spec.rb
@@ -0,0 +1,137 @@
+require "rails_helper"
+
+RSpec.describe SolidQueueWeb::StaleProcessAlert do
+  let(:webhook_url) { "http://example.com/webhook" }
+
+  before do
+    SolidQueueWeb.alert_webhook_url = webhook_url
+    SolidQueueWeb.alert_stale_process_threshold = 1
+    SolidQueueWeb.alert_webhook_cooldown = 3600
+    allow(Thread).to receive(:new).and_yield
+    allow_any_instance_of(Net::HTTP).to receive(:request).and_return(Net::HTTPSuccess.new("1.1", "200", "OK"))
+  end
+
+  after do
+    SolidQueueWeb.alert_webhook_url = nil
+    SolidQueueWeb.alert_stale_process_threshold = nil
+    SolidQueueWeb.alert_webhook_cooldown = nil
+    described_class.reset!
+  end
+
+  def create_process(stale: false)
+    heartbeat = stale ? (SolidQueue.process_alive_threshold + 1.minute).ago : 30.seconds.ago
+    SolidQueue::Process.create!(
+      kind: "Worker", pid: rand(10_000..99_999), hostname: "test-host",
+      name: "worker-#{SecureRandom.hex(4)}", last_heartbeat_at: heartbeat
+    )
+  end
+
+  describe ".call" do
+    it "fires when stale process count meets the threshold" do
+      create_process(stale: true)
+      expect_any_instance_of(Net::HTTP).to receive(:request)
+      described_class.call
+    end
+
+    it "fires when stale process count exceeds the threshold" do
+      3.times { create_process(stale: true) }
+      expect_any_instance_of(Net::HTTP).to receive(:request)
+      described_class.call
+    end
+
+    it "does not fire when stale process count is below the threshold" do
+      SolidQueueWeb.alert_stale_process_threshold = 2
+      create_process(stale: true)
+      expect_any_instance_of(Net::HTTP).not_to receive(:request)
+      described_class.call
+    end
+
+    it "does not count healthy processes as stale" do
+      create_process(stale: false)
+      expect_any_instance_of(Net::HTTP).not_to receive(:request)
+      described_class.call
+    end
+
+    it "does not fire when alert_stale_process_threshold is not configured" do
+      SolidQueueWeb.alert_stale_process_threshold = nil
+      create_process(stale: true)
+      expect_any_instance_of(Net::HTTP).not_to receive(:request)
+      described_class.call
+    end
+
+    it "does not fire when webhook url is not configured" do
+      SolidQueueWeb.alert_webhook_url = nil
+      create_process(stale: true)
+      expect_any_instance_of(Net::HTTP).not_to receive(:request)
+      described_class.call
+    end
+
+    it "does not fire when no processes exist" do
+      expect_any_instance_of(Net::HTTP).not_to receive(:request)
+      described_class.call
+    end
+
+    it "does not fire again within the cooldown window" do
+      create_process(stale: true)
+      described_class.call
+      expect_any_instance_of(Net::HTTP).not_to receive(:request)
+      described_class.call
+    end
+
+    it "fires again after the cooldown window expires" do
+      create_process(stale: true)
+      described_class.call
+      described_class.instance_variable_set(:@last_fired_at, 2.hours.ago)
+      expect_any_instance_of(Net::HTTP).to receive(:request)
+      described_class.call
+    end
+
+    it "sends a JSON payload with the correct fields" do
+      2.times { create_process(stale: true) }
+      posted_body = nil
+      allow_any_instance_of(Net::HTTP).to receive(:request) do |_, req|
+        posted_body = JSON.parse(req.body)
+        Net::HTTPSuccess.new("1.1", "200", "OK")
+      end
+
+      described_class.call
+
+      expect(posted_body["event"]).to eq("stale_process_detected")
+      expect(posted_body["stale_process_count"]).to eq(2)
+      expect(posted_body["threshold"]).to eq(1)
+      expect(posted_body["fired_at"]).to be_present
+    end
+
+    it "sets Content-Type to application/json" do
+      create_process(stale: true)
+      sent_request = nil
+      allow_any_instance_of(Net::HTTP).to receive(:request) do |_, req|
+        sent_request = req
+        Net::HTTPSuccess.new("1.1", "200", "OK")
+      end
+
+      described_class.call
+
+      expect(sent_request["Content-Type"]).to eq("application/json")
+    end
+
+    it "logs an error and does not raise when the HTTP request fails" do
+      create_process(stale: true)
+      allow_any_instance_of(Net::HTTP).to receive(:request).and_raise(RuntimeError, "connection refused")
+      expect(Rails.logger).to receive(:error).with(/Stale process alert webhook failed/)
+      expect { described_class.call }.not_to raise_error
+    end
+
+    context "when alert_webhook_url is an array" do
+      let(:second_url) { "http://example.com/second-webhook" }
+
+      before { SolidQueueWeb.alert_webhook_url = [webhook_url, second_url] }
+
+      it "posts to all configured URLs" do
+        create_process(stale: true)
+        expect(Net::HTTP).to receive(:new).twice.and_call_original
+        described_class.call
+      end
+    end
+  end
+end
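
Note on the cooldown gate: `should_fire?` in `StaleProcessAlert` appears to combine a mutex with an in-memory timestamp so that only one caller per cooldown window gets to send the webhook. The sketch below isolates that pattern in plain Ruby so it can be run without Rails. `CooldownGate`, `fire?`, and the injectable `clock` are hypothetical names introduced for illustration and are not part of the gem; `Time.now` stands in for `Time.current`.

```ruby
# Hypothetical, Rails-free sketch of the cooldown pattern used by should_fire?.
# CooldownGate and its clock: parameter are illustrative, not gem API.
class CooldownGate
  def initialize(cooldown_seconds, clock: -> { Time.now })
    @cooldown = cooldown_seconds
    @clock = clock
    @mutex = Mutex.new
    @last_fired_at = nil
  end

  # Returns true at most once per cooldown window, even under concurrent calls.
  def fire?
    @mutex.synchronize do
      now = @clock.call
      return false if @last_fired_at && now - @last_fired_at < @cooldown

      @last_fired_at = now
      true
    end
  end

  # Clears the timestamp so the next call fires again (mirrors reset! in the diff).
  def reset!
    @mutex.synchronize { @last_fired_at = nil }
  end
end

# Drive the gate with a fake clock so no real waiting is needed.
now = Time.at(0)
gate = CooldownGate.new(1800, clock: -> { now })

first  = gate.fire? # true: nothing has fired yet
second = gate.fire? # false: still inside the 1800 s window
now += 1800
third  = gate.fire? # true: the window has elapsed
```

Because the timestamp lives in process memory, a gate like this is scoped to a single Ruby process. If the dashboard is served by several processes, each one would presumably track its own cooldown.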