4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Added

- Error frequency report — a new Error Summary page (`/jobs/failed_jobs/errors`) groups all failed jobs by error class and message prefix, shows a count per group, and displays a sample backtrace (first 10 lines) in an expandable `<details>` element; groups are sorted by count descending so the most common errors appear first; accessible via an "Error Summary" button on the Failed Jobs page

## [1.1.0] - 2026-05-21

### Added
1 change: 1 addition & 0 deletions README.md
@@ -55,6 +55,7 @@ SolidQueueWeb surfaces all of this in a browser UI available at any route you ch
- **Slow job detection** — when `slow_job_threshold` is configured, claimed jobs running longer than the threshold are flagged with an orange row, a "slow" badge, and a "Running For" duration column on the Running tab; a "Slow Jobs" warning card appears on the dashboard with a link to the Running tab
- **Webhook alerts** — set `alert_webhook_url` and `alert_failure_threshold` to receive a POST request whenever the failed job count meets or exceeds the threshold; fires asynchronously so dashboard performance is unaffected; a configurable cooldown (default 1 h) prevents repeated alerts while the count stays elevated
- **Performance analytics** — per-job-class statistics at `/jobs/performance` showing run count, average, p50, p95, min, and max duration; sorted by p95 descending so the slowest classes surface first; period filter scopes to 1h / 24h / 7d or all time; each class name links to the filtered History view
- **Error frequency report** — `GET /jobs/failed_jobs/errors` groups all failed jobs by error class and message prefix (the first 120 characters of the message), shows a count per group, and shows the first 10 lines of a sample backtrace in an expandable `<details>` element; sorted by count descending so the most common errors appear first; accessible via the "Error Summary" button on the Failed Jobs page
- **Metrics / health endpoint** — `GET /jobs/metrics.json` returns a machine-readable JSON document with job counts, throughput, per-queue depth and pause state, and process health summary; suitable for Prometheus scraping, uptime monitors, or external dashboards; `slow_jobs` count included when `slow_job_threshold` is configured

## Compatibility
2 changes: 1 addition & 1 deletion ROADMAP.md
@@ -12,7 +12,7 @@ Pull requests for any of these are welcome. See [Contributing](README.md#contrib

| Feature | Notes |
|---|---|
| **Error frequency report** | Group all failed jobs by error class + message prefix, show count and a sample backtrace. When you have hundreds of failed jobs, you want to see "ArgumentError (x212), TimeoutError (x88)" at a glance. |
| ~~**Error frequency report**~~ | ✓ Shipped: `/jobs/failed_jobs/errors` groups all failed jobs by error class and message prefix, shows a count per group and the first 10 lines of a sample backtrace in an expandable `<details>` element; sorted by count descending; accessible via the "Error Summary" button on the Failed Jobs page. |
| **Failed job trend chart** | A "Failures — Last 12 Hours" sparkline on the dashboard (same pattern as the existing throughput and queue depth charts). Makes failure spikes visible before you click into the failed jobs list. |
| **P99 + std dev in performance analytics** | Extend `JobPerformanceStats` with a 99th percentile and standard deviation column. High std dev signals inconsistent jobs worth investigating. |

7 changes: 7 additions & 0 deletions app/assets/stylesheets/solid_queue_web/_08_detail.css
@@ -75,6 +75,13 @@

.sqd-pre--muted { color: var(--muted); }

.sqd-error-details summary {
  cursor: pointer;
  list-style: none;
}
.sqd-error-details summary::-webkit-details-marker { display: none; }
.sqd-error-details .sqd-pre { margin-top: 0.5rem; }

.sqd-error-header {
font-size: 13px;
padding: 0.5rem 0.75rem;
@@ -0,0 +1,9 @@
module SolidQueueWeb
  module FailedJobs
    class ErrorsController < ApplicationController
      def index
        @groups = ErrorFrequencyReport.new.groups
      end
    end
  end
end
34 changes: 34 additions & 0 deletions app/services/solid_queue_web/error_frequency_report.rb
@@ -0,0 +1,34 @@
module SolidQueueWeb
  class ErrorFrequencyReport
    Row = Data.define(:exception_class, :message_prefix, :count, :sample_backtrace)

    MESSAGE_LIMIT = 120 # messages are cut to this many characters before grouping

    def groups
      SolidQueue::FailedExecution
        .order(created_at: :desc)
        .each_with_object({}) do |execution, acc| # iterates every failed execution in memory
          key = [execution.exception_class.to_s, message_prefix(execution.message)]
          entry = acc[key] ||= { count: 0, sample_backtrace: nil }
          entry[:count] += 1
          entry[:sample_backtrace] ||= execution.backtrace # newest first, so this keeps the most recent backtrace
        end
        .map do |(exception_class, prefix), data|
          Row.new(
            exception_class: exception_class,
            message_prefix: prefix,
            count: data[:count],
            sample_backtrace: data[:sample_backtrace]
          )
        end
        .sort_by { |row| -row.count } # most frequent group first
    end

    private

    def message_prefix(message)
      return "" if message.nil?
      message.length > MESSAGE_LIMIT ? "#{message[0, MESSAGE_LIMIT]}…" : message
    end
  end
end
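The grouping in `ErrorFrequencyReport#groups` can be tried outside Rails. Below is a minimal sketch, not the code in this PR: `FakeExecution`, `SketchRow`, `sketch_message_prefix`, and `sketch_groups` are hypothetical names, the structs stand in for `SolidQueue::FailedExecution` and `Row`, and `group_by` is swapped in for the `each_with_object` accumulator. It assumes the input is already ordered newest first.

```ruby
# Hypothetical stand-ins (not part of the PR): only the three readers
# the report calls on a failed execution are modelled.
FakeExecution = Struct.new(:exception_class, :message, :backtrace, keyword_init: true)
SketchRow = Struct.new(:exception_class, :message_prefix, :count, :sample_backtrace, keyword_init: true)

SKETCH_MESSAGE_LIMIT = 120

# Same truncation rule as the service: nil becomes "", long messages are
# cut to the limit and get a trailing ellipsis.
def sketch_message_prefix(message)
  return "" if message.nil?
  message.length > SKETCH_MESSAGE_LIMIT ? "#{message[0, SKETCH_MESSAGE_LIMIT]}…" : message
end

# Groups by [exception class, truncated message], counts each group, keeps
# the first non-nil backtrace seen, and sorts by count descending.
def sketch_groups(executions)
  executions
    .group_by { |e| [e.exception_class.to_s, sketch_message_prefix(e.message)] }
    .map do |(exception_class, prefix), members|
      SketchRow.new(
        exception_class: exception_class,
        message_prefix: prefix,
        count: members.size,
        sample_backtrace: members.map(&:backtrace).compact.first
      )
    end
    .sort_by { |row| -row.count }
end

rows = sketch_groups([
  FakeExecution.new(exception_class: "ArgumentError", message: "bad arg", backtrace: ["a.rb:1"]),
  FakeExecution.new(exception_class: "TimeoutError", message: "timed out", backtrace: nil),
  FakeExecution.new(exception_class: "ArgumentError", message: "bad arg", backtrace: ["b.rb:2"])
])
rows.each { |row| puts "#{row.exception_class} (x#{row.count})" }
```

Because messages are truncated before they are used as a key, two failures whose messages differ only after the first 120 characters land in the same group.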
44 changes: 44 additions & 0 deletions app/views/solid_queue_web/failed_jobs/errors/index.html.erb
@@ -0,0 +1,44 @@
<div class="sqd-page-header">
  <h1 class="sqd-page-title">Error Summary</h1>
  <div class="sqd-actions">
    <%= link_to "← Failed Jobs", failed_jobs_path, class: "sqd-btn sqd-btn--muted sqd-btn--sm" %>
  </div>
</div>

<% if @groups.any? %>
  <div class="sqd-card">
    <table>
      <thead>
        <tr>
          <th scope="col">Error Class</th>
          <th scope="col">Message</th>
          <th scope="col" style="text-align: right;">Count</th>
        </tr>
      </thead>
      <tbody>
        <% @groups.each do |group| %>
          <tr>
            <td class="sqd-mono"><%= group.exception_class.presence || "—" %></td>
            <td>
              <% if group.sample_backtrace.present? %>
                <details class="sqd-error-details">
                  <summary class="sqd-truncate" title="<%= group.message_prefix %>">
                    <%= group.message_prefix.presence || "—" %>
                  </summary>
                  <pre class="sqd-pre sqd-pre--muted"><%= Array(group.sample_backtrace).first(10).join("\n") %></pre>
                </details>
              <% else %>
                <span class="sqd-truncate" title="<%= group.message_prefix %>"><%= group.message_prefix.presence || "—" %></span>
              <% end %>
            </td>
            <td style="text-align: right;"><%= group.count %></td>
          </tr>
        <% end %>
      </tbody>
    </table>
  </div>
<% else %>
  <div class="sqd-card">
    <div class="sqd-empty">No failed jobs. All clear!</div>
  </div>
<% end %>
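The backtrace preview in the view comes from the expression `Array(group.sample_backtrace).first(10).join("\n")`. The sketch below isolates that expression in a hypothetical helper, `backtrace_preview` (the view inlines it; no such helper exists in this PR), to show how it behaves for an array, `nil`, and a single string.

```ruby
# Hypothetical helper wrapping the expression the view inlines.
# Kernel#Array turns nil into [] and wraps a bare String in a one-element
# array, so the call chain is safe for all three input shapes.
def backtrace_preview(backtrace, limit = 10)
  Array(backtrace).first(limit).join("\n")
end

frames = (1..12).map { |i| "app/jobs/test_job.rb:#{i}" }

puts backtrace_preview(frames).lines.count      # 10
puts backtrace_preview(nil).empty?              # true
puts backtrace_preview("one\ntwo") == "one\ntwo" # true
```

One consequence worth noting: if a backtrace were ever stored as a single newline-joined string rather than an array of frames, it would count as one element and the 10-line limit would not apply to it.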
1 change: 1 addition & 0 deletions app/views/solid_queue_web/failed_jobs/index.html.erb
@@ -2,6 +2,7 @@
<h1 class="sqd-page-title">Failed Jobs</h1>
<% if @failed_jobs.any? %>
<div class="sqd-actions">
<%= link_to "Error Summary", failed_job_errors_path, class: "sqd-btn sqd-btn--muted sqd-btn--sm" %>
<%= link_to "Export CSV", failed_jobs_path(format: :csv, queue: @queue, q: @search, period: @period),
class: "sqd-btn sqd-btn--muted", data: { turbo: false } %>
<%= button_to "Retry All", retry_all_failed_jobs_path,
2 changes: 2 additions & 0 deletions config/routes.rb
@@ -35,6 +35,8 @@
end
end

get "failed_jobs/errors", to: "failed_jobs/errors#index", as: :failed_job_errors

resource :failed_job_selection, path: "failed_jobs/selection", only: [:create, :destroy],
controller: "failed_jobs/selections"
resources :failed_jobs, only: [:index, :destroy] do
98 changes: 98 additions & 0 deletions spec/requests/solid_queue_web/failed_job_errors_spec.rb
@@ -0,0 +1,98 @@
require "rails_helper"

RSpec.describe "FailedJobErrors", type: :request do
  def failed_execution(class_name: "TestJob", exception_class: "RuntimeError", message: "boom", backtrace: ["app/jobs/test_job.rb:10"])
    job = SolidQueue::Job.create!(
      queue_name: "default", class_name: class_name,
      arguments: {}, active_job_id: SecureRandom.uuid
    )
    job.ready_execution&.destroy
    SolidQueue::FailedExecution.create!(
      job: job,
      error: { exception_class: exception_class, message: message, backtrace: backtrace }
    )
  end

  describe "GET /jobs/failed_jobs/errors" do
    it "returns HTTP success" do
      get "/jobs/failed_jobs/errors"
      expect(response).to have_http_status(:ok)
    end

    it "displays the Error Summary heading" do
      get "/jobs/failed_jobs/errors"
      expect(response.body).to include("Error Summary")
    end

    it "shows an empty state when no failed jobs exist" do
      get "/jobs/failed_jobs/errors"
      expect(response.body).to include("No failed jobs")
    end

    it "renders a row for each distinct error class" do
      failed_execution(exception_class: "ArgumentError", message: "bad arg")
      failed_execution(exception_class: "TimeoutError", message: "timed out")

      get "/jobs/failed_jobs/errors"
      expect(response.body).to include("ArgumentError")
      expect(response.body).to include("TimeoutError")
    end

    it "aggregates multiple failures with the same error class and message" do
      2.times { failed_execution(exception_class: "RuntimeError", message: "boom") }

      get "/jobs/failed_jobs/errors"
      expect(response.body).to include("RuntimeError")
      expect(response.body.scan("RuntimeError").size).to eq(1)
    end

    it "sorts groups by count descending" do
      3.times { failed_execution(exception_class: "FrequentError", message: "often") }
      1.times { failed_execution(exception_class: "RareError", message: "once") }

      get "/jobs/failed_jobs/errors"
      frequent_pos = response.body.index("FrequentError")
      rare_pos = response.body.index("RareError")
      expect(frequent_pos).to be < rare_pos
    end

    it "truncates long messages to MESSAGE_LIMIT characters" do
      long_message = "x" * 200
      failed_execution(exception_class: "RuntimeError", message: long_message)

      get "/jobs/failed_jobs/errors"
      expect(response.body).not_to include(long_message)
      expect(response.body).to include("x" * 120)
    end

    it "renders a backtrace inside a details element when present" do
      failed_execution(exception_class: "RuntimeError", message: "boom", backtrace: ["app/jobs/test_job.rb:10"])

      get "/jobs/failed_jobs/errors"
      expect(response.body).to include("<details")
      expect(response.body).to include("app/jobs/test_job.rb:10")
    end

    it "links back to the failed jobs page" do
      get "/jobs/failed_jobs/errors"
      expect(response.body).to include("Failed Jobs")
      expect(response.body).to include("/jobs/failed_jobs")
    end

    describe "authentication" do
      after { SolidQueueWeb.instance_variable_set(:@authenticate, nil) }

      it "allows access when auth block returns truthy" do
        SolidQueueWeb.authenticate { true }
        get "/jobs/failed_jobs/errors"
        expect(response).to have_http_status(:ok)
      end

      it "returns 401 when auth block returns falsy" do
        SolidQueueWeb.authenticate { false }
        get "/jobs/failed_jobs/errors"
        expect(response).to have_http_status(:unauthorized)
      end
    end
  end
end