Merged
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -10,6 +10,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Added

- Slow job webhook alert — set `alert_slow_job_count_threshold` (integer) to fire a webhook whenever the number of currently-running slow jobs meets or exceeds the configured count; requires `slow_job_threshold` to define what "slow" means; uses the same `alert_webhook_url` and `alert_webhook_cooldown` settings as other alert types; event name `slow_job_threshold_exceeded`
- Stale process webhook alert — set `alert_stale_process_threshold` (integer) to fire a webhook whenever the number of stale workers meets or exceeds the configured count; a process is stale when its heartbeat has not been updated within `SolidQueue.process_alive_threshold`; uses the same `alert_webhook_url` and `alert_webhook_cooldown` settings; event name `stale_process_detected`

## [1.3.0] - 2026-05-27

30 changes: 28 additions & 2 deletions README.md
@@ -53,7 +53,7 @@ SolidQueueWeb surfaces all of this in a browser UI available at any route you ch
- **Dashboard quick actions** — "Retry All Failed" and "Discard All Blocked" cards appear on the dashboard only when the respective count is non-zero; one-click bulk operations with confirm dialogs, keeping the dashboard clean when everything is healthy
- **CSV export** — "Export CSV" button on the jobs, failed jobs, and history pages downloads all records matching the current filters; columns are tailored per view
- **Slow job detection** — when `slow_job_threshold` is configured, claimed jobs running longer than the threshold are flagged with an orange row, a "slow" badge, and a "Running For" duration column on the Running tab; a "Slow Jobs" warning card appears on the dashboard with a link to the Running tab
-- **Webhook alerts** — set `alert_webhook_url` and `alert_failure_threshold` to receive a POST request whenever the failed job count meets or exceeds the threshold; set `alert_queue_thresholds` for per-queue depth alerts; set `alert_slow_job_count_threshold` (requires `slow_job_threshold`) for slow-job count alerts; all fire asynchronously with a configurable cooldown (default 1 h) to prevent repeated alerts
- **Webhook alerts** — set `alert_webhook_url` and `alert_failure_threshold` to receive a POST request whenever the failed job count meets or exceeds the threshold; set `alert_queue_thresholds` for per-queue depth alerts; set `alert_slow_job_count_threshold` (requires `slow_job_threshold`) for slow-job count alerts; set `alert_stale_process_threshold` for stale-worker alerts; all fire asynchronously with a configurable cooldown (default 1 h) to prevent repeated alerts
- **Performance analytics** — per-job-class statistics at `/jobs/performance` showing run count, average, p50, p95, p99, standard deviation, min, and max duration; sorted by p95 descending so the slowest classes surface first; high std dev surfaces inconsistent jobs worth investigating; period filter scopes to 1h / 24h / 7d or all time; each class name links to the filtered History view
- **Failed job trend chart** — a "Failures — Last 12 Hours" bar chart on the dashboard shows failures per hour over the last 12 hours; bars are red, making failure spikes visible before clicking into the failed jobs list
- **Error frequency report** — `GET /jobs/failed_jobs/errors` groups all failed jobs by error class and message prefix, shows a count per group, and surfaces a sample backtrace in an expandable row; sorted by count descending so the most common errors appear first; accessible via the "Error Summary" button on the Failed Jobs page
@@ -107,7 +107,8 @@ SolidQueueWeb.configure do |config|
  config.alert_webhook_url = "https://hooks.example.com/solid-queue" # POST target — string or array (default: nil = disabled)
  config.alert_failure_threshold = 10 # fire when failed count >= this (default: nil = disabled)
  config.alert_queue_thresholds = { "critical" => 50, "default" => 200 } # fire when queue depth >= threshold (default: {})
- config.alert_slow_job_count_threshold = 5 # fire when slow job count >= this (default: nil = disabled)
  config.alert_slow_job_count_threshold = 5 # fire when slow job count >= this (default: nil = disabled)
  config.alert_stale_process_threshold = 1 # fire when stale process count >= this (default: nil = disabled)
  config.alert_webhook_cooldown = 1800 # seconds between repeated alerts per alert type (default: 3600)
  config.connects_to = { reading: :reading, writing: :writing } # read replica (default: nil)
  config.time_zone = "America/New_York" # display timezone for all timestamps (default: nil = UTC)
@@ -209,6 +210,31 @@ The same `alert_webhook_url` endpoint(s) receive the payload with a distinct eve

The alert fires on every dashboard page load while the condition persists, subject to the cooldown window.

## Stale process alerts

Set `alert_stale_process_threshold` to fire a webhook when the number of stale processes meets or exceeds the configured count. A process is considered stale when its `last_heartbeat_at` has not been updated within `SolidQueue.process_alive_threshold` (default 5 minutes). The count covers every row in `SolidQueue::Process`, not only workers. A stale worker means jobs in its queues have silently stopped processing.

```ruby
SolidQueueWeb.configure do |config|
  config.alert_stale_process_threshold = 1 # fire when any process goes stale
  config.alert_webhook_url = "https://hooks.example.com/solid-queue"
  config.alert_webhook_cooldown = 1800 # don't re-fire for 30 minutes (default: 3600)
end
```

The same `alert_webhook_url` endpoint(s) receive the payload with a distinct event type:

```json
{
  "event": "stale_process_detected",
  "stale_process_count": 2,
  "threshold": 1,
  "fired_at": "2026-05-28T08:00:00Z"
}
```
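
On the receiving side, the payload above could be handled along these lines. This is only a sketch: `summarize_stale_process_alert` is a hypothetical helper written for illustration and is not part of SolidQueueWeb. It assumes the body has exactly the four fields shown in the sample payload.

```ruby
require "json"
require "time"

# Hypothetical receiver-side helper (not part of SolidQueueWeb).
# Returns a one-line summary for a stale-process alert body, or nil when
# the event is some other alert type sent to the same endpoint.
def summarize_stale_process_alert(body)
  payload = JSON.parse(body)
  return nil unless payload["event"] == "stale_process_detected"

  # fetch raises KeyError if a field is missing, so malformed payloads fail loudly.
  fired_at  = Time.iso8601(payload.fetch("fired_at"))
  count     = payload.fetch("stale_process_count")
  threshold = payload.fetch("threshold")

  "#{count} stale process(es) (threshold #{threshold}) at #{fired_at.utc.iso8601}"
end
```

Because every alert type posts to the same `alert_webhook_url`, checking the `event` field first keeps a single endpoint able to route each alert to its own handler.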

The threshold is checked on every dashboard page load; while the condition persists, the webhook fires at most once per cooldown window.
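
The cooldown behaviour can be illustrated with a small stand-alone sketch. `CooldownGate` is a hypothetical class written for this example, not the engine's implementation; it assumes the last-fired time is held in memory behind a mutex, so each application process would track its own cooldown.

```ruby
# Hypothetical illustration of a cooldown gate (not a SolidQueueWeb class).
class CooldownGate
  def initialize(cooldown_seconds)
    @cooldown = cooldown_seconds
    @last_fired_at = nil
    @mutex = Mutex.new
  end

  # Returns true and records `now` when the alert may fire;
  # returns false while the previous firing is still inside the cooldown.
  def fire?(now = Time.now)
    @mutex.synchronize do
      next false if @last_fired_at && now - @last_fired_at < @cooldown

      @last_fired_at = now
      true
    end
  end
end

gate = CooldownGate.new(1800) # matches alert_webhook_cooldown = 1800
t0 = Time.at(0)
gate.fire?(t0)        # first page load with stale processes: allowed
gate.fire?(t0 + 600)  # page load 10 minutes later: suppressed
gate.fire?(t0 + 1800) # page load 30 minutes later: allowed again
```

The timestamp is recorded when the gate opens rather than when the HTTP request completes, so a slow or failing endpoint does not cause extra firings inside the window.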

## Metrics endpoint

`GET /jobs/metrics.json` returns a machine-readable JSON document suitable for Prometheus scraping, uptime monitors, or external dashboards. No configuration is required — the endpoint is available as soon as the engine is mounted.
2 changes: 1 addition & 1 deletion ROADMAP.md
@@ -15,7 +15,7 @@ Pull requests for any of these are welcome. See [Contributing](README.md#contrib
| Feature | Notes |
|---|---|
| ~~**Slow job webhook alert**~~ | ✅ Shipped in v1.4 — `alert_slow_job_count_threshold` fires when slow-job count meets or exceeds the threshold. |
-| **Process stale webhook alert** | Fire when a worker's `last_heartbeat_at` expires. A worker going silent means jobs stop processing silently. |
| ~~**Process stale webhook alert**~~ | ✅ Shipped in v1.4 — `alert_stale_process_threshold` fires when stale worker count meets or exceeds the threshold. |
| **Job wait time column** | Show time from `enqueued_at` to `created_at` on claimed executions — a direct measure of queue SLA. |

---
1 change: 1 addition & 0 deletions app/controllers/solid_queue_web/dashboard_controller.rb
@@ -5,6 +5,7 @@ def index
      AlertWebhook.call(failure_count: @stats.counts[:failed])
      QueueDepthAlert.call
      SlowJobAlert.call
      StaleProcessAlert.call
    end
  end
end
68 changes: 68 additions & 0 deletions app/services/solid_queue_web/stale_process_alert.rb
@@ -0,0 +1,68 @@
require "net/http"
require "json"
require "uri"

module SolidQueueWeb
  class StaleProcessAlert
    MUTEX = Mutex.new

    class << self
      def call
        return unless configured?

        stale_count = SolidQueue::Process
          .where("last_heartbeat_at < ?", SolidQueue.process_alive_threshold.ago)
          .count

        return if stale_count < SolidQueueWeb.alert_stale_process_threshold
        return unless should_fire?

        urls = webhook_urls
        Thread.new { urls.each { |url| post(url, stale_count) } }
      end

      def reset!
        MUTEX.synchronize { @last_fired_at = nil }
      end

      private

      def configured?
        SolidQueueWeb.alert_stale_process_threshold.present? && webhook_urls.any?
      end

      def webhook_urls
        Array(SolidQueueWeb.alert_webhook_url).flatten.compact.select(&:present?)
      end

      def should_fire?
        MUTEX.synchronize do
          cooldown = SolidQueueWeb.alert_webhook_cooldown
          return false if @last_fired_at && Time.current - @last_fired_at < cooldown

          @last_fired_at = Time.current
          true
        end
      end

      def post(url_string, stale_count)
        uri = URI.parse(url_string)
        payload = JSON.generate(
          event: "stale_process_detected",
          stale_process_count: stale_count,
          threshold: SolidQueueWeb.alert_stale_process_threshold,
          fired_at: Time.current.iso8601
        )
        http = Net::HTTP.new(uri.host, uri.port)
        http.use_ssl = uri.scheme == "https"
        http.open_timeout = 5
        http.read_timeout = 10
        request = Net::HTTP::Post.new(uri.request_uri, "Content-Type" => "application/json") # request_uri keeps the query string that uri.path would drop
        request.body = payload
        http.request(request)
      rescue => e
        Rails.logger.error("[SolidQueueWeb] Stale process alert webhook failed: #{e.message}")
      end
    end
  end
end
7 changes: 6 additions & 1 deletion lib/solid_queue_web.rb
@@ -6,7 +6,8 @@ module SolidQueueWeb
  class << self
    attr_writer :page_size, :dashboard_refresh_interval, :default_refresh_interval, :search_results_limit,
                :slow_job_threshold, :alert_webhook_url, :alert_failure_threshold, :alert_webhook_cooldown,
-               :alert_queue_thresholds, :alert_slow_job_count_threshold, :connects_to, :time_zone
                :alert_queue_thresholds, :alert_slow_job_count_threshold, :alert_stale_process_threshold,
                :connects_to, :time_zone

    def page_size
      @page_size || 25
@@ -48,6 +49,10 @@ def alert_slow_job_count_threshold
      @alert_slow_job_count_threshold
    end

    def alert_stale_process_threshold
      @alert_stale_process_threshold
    end

    def connects_to
      @connects_to
    end
137 changes: 137 additions & 0 deletions spec/services/solid_queue_web/stale_process_alert_spec.rb
@@ -0,0 +1,137 @@
require "rails_helper"

RSpec.describe SolidQueueWeb::StaleProcessAlert do
  let(:webhook_url) { "http://example.com/webhook" }

  before do
    SolidQueueWeb.alert_webhook_url = webhook_url
    SolidQueueWeb.alert_stale_process_threshold = 1
    SolidQueueWeb.alert_webhook_cooldown = 3600
    allow(Thread).to receive(:new).and_yield
    allow_any_instance_of(Net::HTTP).to receive(:request).and_return(Net::HTTPSuccess.new("1.1", "200", "OK"))
  end

  after do
    SolidQueueWeb.alert_webhook_url = nil
    SolidQueueWeb.alert_stale_process_threshold = nil
    SolidQueueWeb.alert_webhook_cooldown = nil
    described_class.reset!
  end

  def create_process(stale: false)
    heartbeat = stale ? (SolidQueue.process_alive_threshold + 1.minute).ago : 30.seconds.ago
    SolidQueue::Process.create!(
      kind: "Worker", pid: rand(10_000..99_999), hostname: "test-host",
      name: "worker-#{SecureRandom.hex(4)}", last_heartbeat_at: heartbeat
    )
  end

  describe ".call" do
    it "fires when stale process count meets the threshold" do
      create_process(stale: true)
      expect_any_instance_of(Net::HTTP).to receive(:request)
      described_class.call
    end

    it "fires when stale process count exceeds the threshold" do
      3.times { create_process(stale: true) }
      expect_any_instance_of(Net::HTTP).to receive(:request)
      described_class.call
    end

    it "does not fire when stale process count is below the threshold" do
      SolidQueueWeb.alert_stale_process_threshold = 2
      create_process(stale: true)
      expect_any_instance_of(Net::HTTP).not_to receive(:request)
      described_class.call
    end

    it "does not count healthy processes as stale" do
      create_process(stale: false)
      expect_any_instance_of(Net::HTTP).not_to receive(:request)
      described_class.call
    end

    it "does not fire when alert_stale_process_threshold is not configured" do
      SolidQueueWeb.alert_stale_process_threshold = nil
      create_process(stale: true)
      expect_any_instance_of(Net::HTTP).not_to receive(:request)
      described_class.call
    end

    it "does not fire when webhook url is not configured" do
      SolidQueueWeb.alert_webhook_url = nil
      create_process(stale: true)
      expect_any_instance_of(Net::HTTP).not_to receive(:request)
      described_class.call
    end

    it "does not fire when no processes exist" do
      expect_any_instance_of(Net::HTTP).not_to receive(:request)
      described_class.call
    end

    it "does not fire again within the cooldown window" do
      create_process(stale: true)
      described_class.call
      expect_any_instance_of(Net::HTTP).not_to receive(:request)
      described_class.call
    end

    it "fires again after the cooldown window expires" do
      create_process(stale: true)
      described_class.call
      described_class.instance_variable_set(:@last_fired_at, 2.hours.ago)
      expect_any_instance_of(Net::HTTP).to receive(:request)
      described_class.call
    end

    it "sends a JSON payload with the correct fields" do
      2.times { create_process(stale: true) }
      posted_body = nil
      allow_any_instance_of(Net::HTTP).to receive(:request) do |_, req|
        posted_body = JSON.parse(req.body)
        Net::HTTPSuccess.new("1.1", "200", "OK")
      end

      described_class.call

      expect(posted_body["event"]).to eq("stale_process_detected")
      expect(posted_body["stale_process_count"]).to eq(2)
      expect(posted_body["threshold"]).to eq(1)
      expect(posted_body["fired_at"]).to be_present
    end

    it "sets Content-Type to application/json" do
      create_process(stale: true)
      sent_request = nil
      allow_any_instance_of(Net::HTTP).to receive(:request) do |_, req|
        sent_request = req
        Net::HTTPSuccess.new("1.1", "200", "OK")
      end

      described_class.call

      expect(sent_request["Content-Type"]).to eq("application/json")
    end

    it "logs an error and does not raise when the HTTP request fails" do
      create_process(stale: true)
      allow_any_instance_of(Net::HTTP).to receive(:request).and_raise(RuntimeError, "connection refused")
      expect(Rails.logger).to receive(:error).with(/Stale process alert webhook failed/)
      expect { described_class.call }.not_to raise_error
    end

    context "when alert_webhook_url is an array" do
      let(:second_url) { "http://example.com/second-webhook" }

      before { SolidQueueWeb.alert_webhook_url = [webhook_url, second_url] }

      it "posts to all configured URLs" do
        create_process(stale: true)
        expect(Net::HTTP).to receive(:new).twice.and_call_original
        described_class.call
      end
    end
  end
end