2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -9,6 +9,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Added

+- p99 and standard deviation columns in performance analytics — `JobPerformanceStats` now computes a 99th percentile and population standard deviation for each job class; both columns appear in the Performance table between p95 and Min; high std dev surfaces inconsistent jobs worth investigating
+
- Failed job trend chart — a "Failures — Last 12 Hours" bar chart card on the dashboard shows how many jobs failed per hour over the last 12 hours; bars are red (distinct from the blue throughput and purple queue depth charts); header shows the total failure count for the window; empty state shown when no failures exist in the period

- Error frequency report — a new Error Summary page (`/jobs/failed_jobs/errors`) groups all failed jobs by error class and message prefix, shows a count per group, and displays a sample backtrace (first 10 lines) in an expandable `<details>` element; groups are sorted by count descending so the most common errors appear first; accessible via an "Error Summary" button on the Failed Jobs page
2 changes: 1 addition & 1 deletion README.md
@@ -54,7 +54,7 @@ SolidQueueWeb surfaces all of this in a browser UI available at any route you ch
- **CSV export** — "Export CSV" button on the jobs, failed jobs, and history pages downloads all records matching the current filters; columns are tailored per view
- **Slow job detection** — when `slow_job_threshold` is configured, claimed jobs running longer than the threshold are flagged with an orange row, a "slow" badge, and a "Running For" duration column on the Running tab; a "Slow Jobs" warning card appears on the dashboard with a link to the Running tab
- **Webhook alerts** — set `alert_webhook_url` and `alert_failure_threshold` to receive a POST request whenever the failed job count meets or exceeds the threshold; fires asynchronously so dashboard performance is unaffected; a configurable cooldown (default 1 h) prevents repeated alerts while the count stays elevated
-- **Performance analytics** — per-job-class statistics at `/jobs/performance` showing run count, average, p50, p95, min, and max duration; sorted by p95 descending so the slowest classes surface first; period filter scopes to 1h / 24h / 7d or all time; each class name links to the filtered History view
+- **Performance analytics** — per-job-class statistics at `/jobs/performance` showing run count, average, p50, p95, p99, standard deviation, min, and max duration; sorted by p95 descending so the slowest classes surface first; high std dev surfaces inconsistent jobs worth investigating; period filter scopes to 1h / 24h / 7d or all time; each class name links to the filtered History view
- **Failed job trend chart** — a "Failures — Last 12 Hours" bar chart on the dashboard shows failures per hour over the last 12 hours; bars are red, making failure spikes visible before clicking into the failed jobs list
- **Error frequency report** — `GET /jobs/failed_jobs/errors` groups all failed jobs by error class and message prefix, shows a count per group, and surfaces a sample backtrace in an expandable row; sorted by count descending so the most common errors appear first; accessible via the "Error Summary" button on the Failed Jobs page
- **Metrics / health endpoint** — `GET /jobs/metrics.json` returns a machine-readable JSON document with job counts, throughput, per-queue depth and pause state, and process health summary; suitable for Prometheus scraping, uptime monitors, or external dashboards; `slow_jobs` count included when `slow_job_threshold` is configured
2 changes: 1 addition & 1 deletion ROADMAP.md
@@ -14,7 +14,7 @@ Pull requests for any of these are welcome. See [Contributing](README.md#contrib
|---|---|
| ~~**Error frequency report**~~ | ✓ Shipped — `/jobs/failed_jobs/errors` groups all failed jobs by error class and message prefix, shows count and a sample backtrace in an expandable row; sorted by count descending; accessible via "Error Summary" button on the Failed Jobs page. |
| ~~**Failed job trend chart**~~ | ✓ Shipped — a "Failures — Last 12 Hours" bar chart card on the dashboard shows failures per hour; red bars, empty state when no failures; same pattern as the throughput and queue depth charts. |
-| **P99 + std dev in performance analytics** | Extend `JobPerformanceStats` with a 99th percentile and standard deviation column. High std dev signals inconsistent jobs worth investigating. |
+| ~~**P99 + std dev in performance analytics**~~ | ✓ Shipped — `JobPerformanceStats` computes p99 and population std dev for each job class; both columns appear in the Performance table between p95 and Min. |

---

10 changes: 9 additions & 1 deletion app/services/solid_queue_web/job_performance_stats.rb
@@ -1,6 +1,6 @@
 module SolidQueueWeb
   class JobPerformanceStats
-    Row = Struct.new(:class_name, :count, :avg, :p50, :p95, :min, :max, keyword_init: true)
+    Row = Struct.new(:class_name, :count, :avg, :p50, :p95, :p99, :std_dev, :min, :max, keyword_init: true)

     def initialize(scope)
       @scope = scope
@@ -18,6 +18,8 @@ def rows
           avg: mean(durations),
           p50: percentile(durations, 50),
           p95: percentile(durations, 95),
+          p99: percentile(durations, 99),
+          std_dev: std_dev(durations),
           min: durations.first,
           max: durations.last
         )
@@ -34,5 +36,11 @@ def percentile(sorted, pct)
       idx = [(pct / 100.0 * sorted.size).ceil - 1, 0].max
       sorted[idx]
     end
+
+    def std_dev(sorted)
+      return 0.0 if sorted.size < 2
+      m = mean(sorted)
+      Math.sqrt(sorted.sum { |x| (x - m)**2 } / sorted.size)
+    end
   end
 end
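
The two statistics added in this file can be tried outside the engine. The following is a minimal standalone sketch, not the gem's code: the `DurationStats` module name is hypothetical, and the body of `mean` is an assumption because the diff does not show it. `percentile` and `std_dev` follow the hunks above (nearest-rank percentile over an ascending-sorted array, population standard deviation).

```ruby
# Hypothetical standalone module for experimenting with the formulas in the diff.
module DurationStats
  module_function

  # Assumed implementation; the real `mean` is not visible in the diff.
  def mean(values)
    values.sum.to_f / values.size
  end

  # Nearest-rank percentile: expects `sorted` in ascending order.
  def percentile(sorted, pct)
    idx = [(pct / 100.0 * sorted.size).ceil - 1, 0].max
    sorted[idx]
  end

  # Population standard deviation (divides by n, not n - 1).
  def std_dev(values)
    return 0.0 if values.size < 2
    m = mean(values)
    Math.sqrt(values.sum { |x| (x - m)**2 } / values.size)
  end
end

durations = [10, 20, 30, 40, 50].sort

DurationStats.percentile(durations, 50)    # => 30
DurationStats.percentile(durations, 95)    # => 50
DurationStats.percentile(durations, 99)    # => 50
DurationStats.std_dev(durations).round(2)  # => 14.14
```

With only five samples, nearest-rank p95 and p99 both resolve to the maximum, so the two columns only diverge for job classes with many finished runs. Dividing by `n` gives the population figure named in the changelog; a sample standard deviation would divide by `n - 1` and come out slightly larger.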
4 changes: 4 additions & 0 deletions app/views/solid_queue_web/performance/index.html.erb
@@ -21,6 +21,8 @@
   <th scope="col" style="text-align: right;">Avg</th>
   <th scope="col" style="text-align: right;">p50</th>
   <th scope="col" style="text-align: right;">p95</th>
+  <th scope="col" style="text-align: right;">p99</th>
+  <th scope="col" style="text-align: right;">Std Dev</th>
   <th scope="col" style="text-align: right;">Min</th>
   <th scope="col" style="text-align: right;">Max</th>
 </tr>
@@ -36,6 +38,8 @@
   <td class="sqd-mono" style="text-align: right;"><%= format_duration(row.avg) %></td>
   <td class="sqd-mono" style="text-align: right;"><%= format_duration(row.p50) %></td>
   <td class="sqd-mono" style="text-align: right;"><%= format_duration(row.p95) %></td>
+  <td class="sqd-mono" style="text-align: right;"><%= format_duration(row.p99) %></td>
+  <td class="sqd-mono" style="text-align: right;"><%= format_duration(row.std_dev) %></td>
   <td class="sqd-mono" style="text-align: right;"><%= format_duration(row.min) %></td>
   <td class="sqd-mono" style="text-align: right;"><%= format_duration(row.max) %></td>
 </tr>
17 changes: 17 additions & 0 deletions spec/requests/solid_queue_web/performance_spec.rb
@@ -67,6 +67,23 @@ def finished_job(class_name:, duration_seconds:, finished_ago: 1.hour)
     expect(response.body).to include("No finished jobs found in the last 1h")
   end

+  it "renders p99 and Std Dev column headers" do
+    finished_job(class_name: "WorkJob", duration_seconds: 10)
+
+    get "/jobs/performance"
+    expect(response.body).to include("p99")
+    expect(response.body).to include("Std Dev")
+  end
+
+  it "renders p99 and std_dev values for a job class" do
+    5.times { |i| finished_job(class_name: "WorkJob", duration_seconds: (i + 1) * 10) }
+
+    get "/jobs/performance"
+    expect(response.body).to include("WorkJob")
+    expect(response.body).to include("p99")
+    expect(response.body).to include("Std Dev")
+  end
+
   it "sorts rows by p95 descending (slowest class first)" do
     finished_job(class_name: "FastJob", duration_seconds: 2)
     finished_job(class_name: "SlowJob", duration_seconds: 120)