Merged
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -9,6 +9,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Added

- Failed job trend chart — a "Failures — Last 12 Hours" bar chart card on the dashboard shows how many jobs failed per hour over the last 12 hours; bars are red (distinct from the blue throughput and purple queue depth charts); header shows the total failure count for the window; empty state shown when no failures exist in the period

- Error frequency report — a new Error Summary page (`/jobs/failed_jobs/errors`) groups all failed jobs by error class and message prefix, shows a count per group, and displays a sample backtrace (first 10 lines) in an expandable `<details>` element; groups are sorted by count descending so the most common errors appear first; accessible via an "Error Summary" button on the Failed Jobs page

## [1.1.0] - 2026-05-21
1 change: 1 addition & 0 deletions README.md
@@ -55,6 +55,7 @@ SolidQueueWeb surfaces all of this in a browser UI available at any route you ch
- **Slow job detection** — when `slow_job_threshold` is configured, claimed jobs running longer than the threshold are flagged with an orange row, a "slow" badge, and a "Running For" duration column on the Running tab; a "Slow Jobs" warning card appears on the dashboard with a link to the Running tab
- **Webhook alerts** — set `alert_webhook_url` and `alert_failure_threshold` to receive a POST request whenever the failed job count meets or exceeds the threshold; fires asynchronously so dashboard performance is unaffected; a configurable cooldown (default 1 h) prevents repeated alerts while the count stays elevated
- **Performance analytics** — per-job-class statistics at `/jobs/performance` showing run count, average, p50, p95, min, and max duration; sorted by p95 descending so the slowest classes surface first; period filter scopes to 1h / 24h / 7d or all time; each class name links to the filtered History view
- **Failed job trend chart** — a "Failures — Last 12 Hours" bar chart on the dashboard shows failures per hour over the last 12 hours; bars are red, making failure spikes visible before clicking into the failed jobs list
- **Error frequency report** — `GET /jobs/failed_jobs/errors` groups all failed jobs by error class and message prefix, shows a count per group, and surfaces a sample backtrace in an expandable row; sorted by count descending so the most common errors appear first; accessible via the "Error Summary" button on the Failed Jobs page
- **Metrics / health endpoint** — `GET /jobs/metrics.json` returns a machine-readable JSON document with job counts, throughput, per-queue depth and pause state, and process health summary; suitable for Prometheus scraping, uptime monitors, or external dashboards; `slow_jobs` count included when `slow_job_threshold` is configured
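
The error frequency report in the list above groups failures by error class and message prefix, then sorts by count. A minimal plain-Ruby sketch of that grouping is shown here. The hash keys mirror the `error` payload used in this PR's spec; the `group_errors` name and the 100-character prefix length are assumptions for illustration, not the gem's actual implementation.

```ruby
# Assumed prefix length; the PR text does not state how long the prefix is.
PREFIX_LENGTH = 100

# failures: array of hashes with :exception_class, :message and :backtrace keys
# (hypothetical in-memory stand-ins for failed execution records).
def group_errors(failures, prefix_length: PREFIX_LENGTH)
  grouped = failures.group_by do |f|
    [f[:exception_class], f[:message].to_s[0, prefix_length]]
  end

  summaries = grouped.map do |(klass, prefix), members|
    {
      exception_class: klass,
      message_prefix: prefix,
      count: members.size,
      # Keep only the first 10 backtrace lines of one sample failure.
      sample_backtrace: Array(members.first[:backtrace]).first(10)
    }
  end

  # Most common errors first.
  summaries.sort_by { |group| -group[:count] }
end

sample_failures = [
  { exception_class: "RuntimeError", message: "boom", backtrace: ["a.rb:1"] },
  { exception_class: "ArgumentError", message: "bad input", backtrace: [] },
  { exception_class: "RuntimeError", message: "boom", backtrace: ["b.rb:2"] }
]
group_errors(sample_failures).each do |g|
  puts "#{g[:count]}x #{g[:exception_class]}: #{g[:message_prefix]}"
end
```

Grouping in Ruby keeps the sketch self-contained; a real report over a large table would more likely push the grouping into the database.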

2 changes: 1 addition & 1 deletion ROADMAP.md
@@ -13,7 +13,7 @@ Pull requests for any of these are welcome. See [Contributing](README.md#contrib
| Feature | Notes |
|---|---|
| ~~**Error frequency report**~~ | ✓ Shipped — `/jobs/failed_jobs/errors` groups all failed jobs by error class and message prefix, shows count and a sample backtrace in an expandable row; sorted by count descending; accessible via "Error Summary" button on the Failed Jobs page. |
- | **Failed job trend chart** | A "Failures — Last 12 Hours" sparkline on the dashboard (same pattern as the existing throughput and queue depth charts). Makes failure spikes visible before you click into the failed jobs list. |
+ | ~~**Failed job trend chart**~~ | ✓ Shipped — a "Failures — Last 12 Hours" bar chart card on the dashboard shows failures per hour; red bars, empty state when no failures; same pattern as the throughput and queue depth charts. |
| **P99 + std dev in performance analytics** | Extend `JobPerformanceStats` with a 99th percentile and standard deviation column. High std dev signals inconsistent jobs worth investigating. |

---
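
The remaining roadmap item mentions a 99th percentile and a standard deviation column. A rough plain-Ruby sketch of both calculations could look like the following; the method names, the nearest-rank percentile definition, and the population (not sample) standard deviation are all assumptions, since the roadmap does not specify how `JobPerformanceStats` would compute them.

```ruby
# Nearest-rank percentile over an already sorted array of durations.
# pct is an integer such as 50, 95 or 99.
def percentile(sorted, pct)
  return nil if sorted.empty?
  rank = (pct * sorted.size / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

# Population standard deviation; a high value relative to the mean
# suggests inconsistent run times.
def std_dev(values)
  return nil if values.empty?
  mean = values.sum.to_f / values.size
  Math.sqrt(values.sum { |v| (v - mean)**2 } / values.size)
end

durations = [2, 4, 4, 4, 5, 5, 7, 9]
puts "p99: #{percentile(durations.sort, 99)}"
puts "std dev: #{std_dev(durations)}"
```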
4 changes: 4 additions & 0 deletions app/assets/stylesheets/solid_queue_web/_11_throughput.css
@@ -94,4 +94,8 @@

.sqd-sparkline__bar--depth {
background: var(--purple);
}

.sqd-sparkline__bar--failure {
background: var(--danger);
}
9 changes: 8 additions & 1 deletion app/services/solid_queue_web/dashboard_stats.rb
@@ -1,6 +1,6 @@
module SolidQueueWeb
class DashboardStats
- attr_reader :counts, :throughput, :sparkline, :depth_sparkline, :slow_jobs_count
+ attr_reader :counts, :throughput, :sparkline, :depth_sparkline, :failure_sparkline, :slow_jobs_count

def initialize
@now = Time.current
@@ -32,6 +32,13 @@ def compute
finished_times.count { |t| t >= from && t < to }
end

failed_times = SolidQueue::FailedExecution.where(created_at: 12.hours.ago..@now).pluck(:created_at)
@failure_sparkline = 12.times.map do |i|
from = (12 - i).hours.ago
to = i == 11 ? @now : (11 - i).hours.ago
failed_times.count { |t| t >= from && t < to }
end

threshold = SolidQueueWeb.slow_job_threshold
@slow_jobs_count = threshold ? SolidQueue::ClaimedExecution.where("created_at <= ?", threshold.ago).count : 0
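
The hourly bucketing above can also be sketched in plain Ruby without Rails. This version is an illustrative alternative, not the code in this PR: it measures every bucket boundary from one fixed `now` rather than calling `hours.ago` repeatedly, and it walks the timestamps once. The `hourly_failure_counts` name and the choice to place a timestamp equal to `now` in the last bucket are assumptions.

```ruby
HOUR = 3600 # seconds

# Counts timestamps into hourly buckets, oldest hour first.
# failed_times: array of Time; now: the single reference instant.
def hourly_failure_counts(failed_times, now, hours: 12)
  window_start = now - hours * HOUR
  counts = Array.new(hours, 0)

  failed_times.each do |t|
    next if t < window_start || t > now
    # Index 0 is the oldest hour; clamp so t == now lands in the last bucket.
    index = [((t - window_start) / HOUR).floor, hours - 1].min
    counts[index] += 1
  end

  counts
end

now = Time.utc(2026, 5, 21, 12, 0, 0)
times = [now - 30 * 60, now - 45 * 60, now - 5 * HOUR - 1, now - 13 * HOUR]
p hourly_failure_counts(times, now)
```

Using one reference instant keeps adjacent buckets contiguous, so a timestamp cannot fall between two boundaries that were computed a few microseconds apart.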

29 changes: 29 additions & 0 deletions app/views/solid_queue_web/dashboard/index.html.erb
@@ -104,6 +104,35 @@
<% end %>
</div>

<% max_failures = [@stats.failure_sparkline.max, 1].max %>
<div class="sqd-card" style="margin-bottom: 1rem;">
<div class="sqd-card__header">
<span class="sqd-card__title">Failures &mdash; Last 12 Hours</span>
<div class="sqd-throughput__summary">
<span>Total: <strong><%= @stats.failure_sparkline.sum %></strong></span>
</div>
</div>
<% if @stats.failure_sparkline.all?(&:zero?) %>
<div class="sqd-sparkline__empty">No failures in the last 12 hours</div>
<% else %>
<div class="sqd-sparkline" aria-label="Failed jobs per hour over the last 12 hours">
<% @stats.failure_sparkline.each_with_index do |count, i| %>
<% pct = (count.to_f / max_failures * 100).round %>
<% hour_start = (12 - i).hours.ago %>
<% show_tick = [0, 3, 6, 9, 11].include?(i) %>
<div class="sqd-sparkline__col">
<div class="sqd-sparkline__bar-wrap">
<div class="sqd-sparkline__bar sqd-sparkline__bar--failure"
style="height: <%= [pct, 3].max %>%"
title="<%= hour_start.strftime('%-I%p').downcase %>: <%= count %> <%= "failure".pluralize(count) %>"></div>
</div>
<div class="sqd-sparkline__tick"><%= show_tick ? (i == 11 ? "now" : hour_start.strftime("%-I%p").downcase) : "" %></div>
</div>
<% end %>
</div>
<% end %>
</div>

<div style="display:grid; grid-template-columns: repeat(auto-fit, minmax(240px, 1fr)); gap: 1rem;">
<div class="sqd-card">
<div class="sqd-card__header">
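
The bar-height arithmetic in the failures card above can be isolated as a small plain-Ruby function. This is a sketch for illustration only: `bar_heights` and `MIN_BAR_PCT` are hypothetical names, and the template in this PR computes the same values inline.

```ruby
MIN_BAR_PCT = 3 # smallest rendered height, matching the 3% floor in the view

# Scales each hourly count to a percentage of the tallest bar.
def bar_heights(counts)
  # Guard against dividing by zero when every count is 0 (or the list is empty).
  max = [counts.max || 0, 1].max
  counts.map { |count| [(count.to_f / max * 100).round, MIN_BAR_PCT].max }
end

p bar_heights([0, 1, 4, 2])
```

Because of the 3% floor, an hour with zero failures still renders a thin bar whenever the card is not in its empty state.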
26 changes: 26 additions & 0 deletions spec/requests/solid_queue_web/dashboard_spec.rb
@@ -44,6 +44,32 @@
get "/jobs"
expect(response.body).to include("No completed jobs in the last 24 hours")
end

it "renders the failures sparkline card" do
get "/jobs"
expect(response.body).to include("Failures")
expect(response.body).to include("Last 12 Hours")
end

it "shows empty state for failures card when no failures exist" do
get "/jobs"
expect(response.body).to include("No failures in the last 12 hours")
end

it "renders failure bars when failed jobs exist within the last 12 hours" do
job = SolidQueue::Job.create!(
queue_name: "default", class_name: "BrokenJob",
arguments: {}, active_job_id: SecureRandom.uuid
)
job.ready_execution&.destroy
execution = SolidQueue::FailedExecution.create!(
job: job, error: { exception_class: "RuntimeError", message: "boom", backtrace: [] }
)
execution.update_columns(created_at: 30.minutes.ago)

get "/jobs"
expect(response.body).to include("sqd-sparkline__bar--failure")
end
end

describe "slow jobs card" do