diff --git a/CHANGELOG.md b/CHANGELOG.md index d53d795..603d3a6 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -9,6 +9,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ### Added +- Failed job trend chart — a "Failures — Last 12 Hours" bar chart card on the dashboard shows how many jobs failed per hour over the last 12 hours; bars are red (distinct from the blue throughput and purple queue depth charts); header shows the total failure count for the window; empty state shown when no failures exist in the period + - Error frequency report — a new Error Summary page (`/jobs/failed_jobs/errors`) groups all failed jobs by error class and message prefix, shows a count per group, and displays a sample backtrace (first 10 lines) in an expandable `<details>` element; groups are sorted by count descending so the most common errors appear first; accessible via an "Error Summary" button on the Failed Jobs page ## [1.1.0] - 2026-05-21 diff --git a/README.md b/README.md index 9d6dd99..4726034 100644 --- a/README.md +++ b/README.md @@ -55,6 +55,7 @@ SolidQueueWeb surfaces all of this in a browser UI available at any route you ch - **Slow job detection** — when `slow_job_threshold` is configured, claimed jobs running longer than the threshold are flagged with an orange row, a "slow" badge, and a "Running For" duration column on the Running tab; a "Slow Jobs" warning card appears on the dashboard with a link to the Running tab - **Webhook alerts** — set `alert_webhook_url` and `alert_failure_threshold` to receive a POST request whenever the failed job count meets or exceeds the threshold; fires asynchronously so dashboard performance is unaffected; a configurable cooldown (default 1 h) prevents repeated alerts while the count stays elevated - **Performance analytics** — per-job-class statistics at `/jobs/performance` showing run count, average, p50, p95, min, and max duration; sorted by p95 descending so the slowest classes surface first; period filter scopes to 1h / 24h / 7d or all time; each class name links to the filtered History view +- **Failed job trend chart** — a "Failures — Last 12 Hours" bar chart on the dashboard shows failures per hour over the last 12 hours; bars are red, making failure spikes visible before clicking into the failed jobs list - **Error frequency report** — `GET /jobs/failed_jobs/errors` groups all failed jobs by error class and message prefix, shows a count per group, and surfaces a sample backtrace in an expandable row; sorted by count descending so the most common errors appear first; accessible via the "Error Summary" button on the Failed Jobs page - **Metrics / health endpoint** — `GET /jobs/metrics.json` returns a machine-readable JSON document with job counts, throughput, per-queue depth
and pause state, and process health summary; suitable for Prometheus scraping, uptime monitors, or external dashboards; `slow_jobs` count included when `slow_job_threshold` is configured diff --git a/ROADMAP.md b/ROADMAP.md index fe8bd55..2f63650 100644 --- a/ROADMAP.md +++ b/ROADMAP.md @@ -13,7 +13,7 @@ Pull requests for any of these are welcome. See [Contributing](README.md#contrib | Feature | Notes | |---|---| | ~~**Error frequency report**~~ | ✓ Shipped — `/jobs/failed_jobs/errors` groups all failed jobs by error class and message prefix, shows count and a sample backtrace in an expandable row; sorted by count descending; accessible via "Error Summary" button on the Failed Jobs page. | -| **Failed job trend chart** | A "Failures — Last 12 Hours" sparkline on the dashboard (same pattern as the existing throughput and queue depth charts). Makes failure spikes visible before you click into the failed jobs list. | +| ~~**Failed job trend chart**~~ | ✓ Shipped — a "Failures — Last 12 Hours" bar chart card on the dashboard shows failures per hour; red bars, empty state when no failures; same pattern as the throughput and queue depth charts. | | **P99 + std dev in performance analytics** | Extend `JobPerformanceStats` with a 99th percentile and standard deviation column. High std dev signals inconsistent jobs worth investigating. 
| --- diff --git a/app/assets/stylesheets/solid_queue_web/_11_throughput.css b/app/assets/stylesheets/solid_queue_web/_11_throughput.css index b302e59..f4ea0a6 100644 --- a/app/assets/stylesheets/solid_queue_web/_11_throughput.css +++ b/app/assets/stylesheets/solid_queue_web/_11_throughput.css @@ -94,4 +94,8 @@ .sqd-sparkline__bar--depth { background: var(--purple); +} + +.sqd-sparkline__bar--failure { + background: var(--danger); } \ No newline at end of file diff --git a/app/services/solid_queue_web/dashboard_stats.rb b/app/services/solid_queue_web/dashboard_stats.rb index 7c84e43..d010866 100644 --- a/app/services/solid_queue_web/dashboard_stats.rb +++ b/app/services/solid_queue_web/dashboard_stats.rb @@ -1,6 +1,6 @@ module SolidQueueWeb class DashboardStats - attr_reader :counts, :throughput, :sparkline, :depth_sparkline, :slow_jobs_count + attr_reader :counts, :throughput, :sparkline, :depth_sparkline, :failure_sparkline, :slow_jobs_count def initialize @now = Time.current @@ -32,6 +32,13 @@ def compute finished_times.count { |t| t >= from && t < to } end + failed_times = SolidQueue::FailedExecution.where(created_at: 12.hours.ago..@now).pluck(:created_at) + @failure_sparkline = 12.times.map do |i| + from = (12 - i).hours.ago + to = i == 11 ? @now : (11 - i).hours.ago + failed_times.count { |t| t >= from && t < to } + end + threshold = SolidQueueWeb.slow_job_threshold @slow_jobs_count = threshold ? SolidQueue::ClaimedExecution.where("created_at <= ?", threshold.ago).count : 0 diff --git a/app/views/solid_queue_web/dashboard/index.html.erb b/app/views/solid_queue_web/dashboard/index.html.erb index 616d37c..cfbbefd 100644 --- a/app/views/solid_queue_web/dashboard/index.html.erb +++ b/app/views/solid_queue_web/dashboard/index.html.erb @@ -104,6 +104,35 @@ <% end %> +<% max_failures = [@stats.failure_sparkline.max, 1].max %> +
+
+ Failures — Last 12 Hours +
+ Total: <%= @stats.failure_sparkline.sum %> +
+
+ <% if @stats.failure_sparkline.all?(&:zero?) %> +
No failures in the last 12 hours
+ <% else %> +
+ <% @stats.failure_sparkline.each_with_index do |count, i| %> + <% pct = (count.to_f / max_failures * 100).round %> + <% hour_start = (12 - i).hours.ago %> + <% show_tick = [0, 3, 6, 9, 11].include?(i) %> +
+
+
">
+
+
<%= show_tick ? (i == 11 ? "now" : hour_start.strftime("%-I%p").downcase) : "" %>
+
+ <% end %> +
+ <% end %> +
+
diff --git a/spec/requests/solid_queue_web/dashboard_spec.rb b/spec/requests/solid_queue_web/dashboard_spec.rb index a0cc814..e5aa9ee 100644 --- a/spec/requests/solid_queue_web/dashboard_spec.rb +++ b/spec/requests/solid_queue_web/dashboard_spec.rb @@ -44,6 +44,32 @@ get "/jobs" expect(response.body).to include("No completed jobs in the last 24 hours") end + + it "renders the failures sparkline card" do + get "/jobs" + expect(response.body).to include("Failures") + expect(response.body).to include("Last 12 Hours") + end + + it "shows empty state for failures card when no failures exist" do + get "/jobs" + expect(response.body).to include("No failures in the last 12 hours") + end + + it "renders failure bars when failed jobs exist within the last 12 hours" do + job = SolidQueue::Job.create!( + queue_name: "default", class_name: "BrokenJob", + arguments: {}, active_job_id: SecureRandom.uuid + ) + job.ready_execution&.destroy + execution = SolidQueue::FailedExecution.create!( + job: job, error: { exception_class: "RuntimeError", message: "boom", backtrace: [] } + ) + execution.update_columns(created_at: 30.minutes.ago) + + get "/jobs" + expect(response.body).to include("sqd-sparkline__bar--failure") + end end describe "slow jobs card" do
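
Review note: the hourly bucketing that `DashboardStats#compute` performs for the failure sparkline can be exercised without Rails. The sketch below is a plain-Ruby approximation, not the code in this diff. `hourly_buckets` is a hypothetical helper name, and it pins every window boundary to a single `now` value, whereas the diff calls `hours.ago` per bucket. It also counts a timestamp exactly equal to `now` in the last bucket, which the diff's `t < to` comparison would exclude.

```ruby
# Hypothetical helper (not part of the diff): counts timestamps per
# one-hour bucket over the `hours` preceding `now`, oldest bucket first.
def hourly_buckets(times, now:, hours: 12)
  window_start = now - hours * 3600
  counts = Array.new(hours, 0)

  times.each do |t|
    # Ignore anything outside the window.
    next if t < window_start || t > now

    # Whole hours elapsed since the window start; clamp so that a
    # timestamp exactly at `now` lands in the final bucket.
    index = [((t - window_start) / 3600).floor, hours - 1].min
    counts[index] += 1
  end

  counts
end

# Assumed sample data: two failures in the most recent hour,
# one failure a little over five hours ago, one outside the window.
now = Time.utc(2026, 5, 21, 12, 0, 0)
failures = [
  now - 30 * 60,
  now - 45 * 60,
  now - 5 * 3600 - 60,
  now - 13 * 3600
]

counts = hourly_buckets(failures, now: now)
max_failures = [counts.max, 1].max
# Same scaling as the view: bar height as a rounded percentage of the tallest bar.
percentages = counts.map { |c| (c.to_f / max_failures * 100).round }
```

Pinning the boundaries to one `now` keeps adjacent buckets contiguous even if the computation spans a clock tick. Whether that matters in practice for a dashboard chart is a judgment call.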