diff --git a/CHANGELOG.md b/CHANGELOG.md
index 603d3a6..aafae45 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -9,6 +9,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ### Added
+- p99 and standard deviation columns in performance analytics — `JobPerformanceStats` now computes a 99th percentile and population standard deviation for each job class; both columns appear in the Performance table between p95 and Min; high std dev surfaces inconsistent jobs worth investigating
+
 - Failed job trend chart — a "Failures — Last 12 Hours" bar chart card on the dashboard shows how many jobs failed per hour over the last 12 hours; bars are red (distinct from the blue throughput and purple queue depth charts); header shows the total failure count for the window; empty state shown when no failures exist in the period
 - Error frequency report — a new Error Summary page (`/jobs/failed_jobs/errors`) groups all failed jobs by error class and message prefix, shows a count per group, and displays a sample backtrace (first 10 lines) in an expandable `<details>` element; groups are sorted by count descending so the most common errors appear first; accessible via an "Error Summary" button on the Failed Jobs page
diff --git a/README.md b/README.md
index 4726034..5cedbc7 100644
--- a/README.md
+++ b/README.md
@@ -54,7 +54,7 @@ SolidQueueWeb surfaces all of this in a browser UI available at any route you ch
 - **CSV export** — "Export CSV" button on the jobs, failed jobs, and history pages downloads all records matching the current filters; columns are tailored per view
 - **Slow job detection** — when `slow_job_threshold` is configured, claimed jobs running longer than the threshold are flagged with an orange row, a "slow" badge, and a "Running For" duration column on the Running tab; a "Slow Jobs" warning card appears on the dashboard with a link to the Running tab
 - **Webhook alerts** — set `alert_webhook_url` and `alert_failure_threshold` to receive a POST request whenever the failed job count meets or exceeds the threshold; fires asynchronously so dashboard performance is unaffected; a configurable cooldown (default 1 h) prevents repeated alerts while the count stays elevated
-- **Performance analytics** — per-job-class statistics at `/jobs/performance` showing run count, average, p50, p95, min, and max duration; sorted by p95 descending so the slowest classes surface first; period filter scopes to 1h / 24h / 7d or all time; each class name links to the filtered History view
+- **Performance analytics** — per-job-class statistics at `/jobs/performance` showing run count, average, p50, p95, p99, standard deviation, min, and max duration; sorted by p95 descending so the slowest classes surface first; high std dev surfaces inconsistent jobs worth investigating; period filter scopes to 1h / 24h / 7d or all time; each class name links to the filtered History view
 - **Failed job trend chart** — a "Failures — Last 12 Hours" bar chart on the dashboard shows failures per hour over the last 12 hours; bars are red, making failure spikes visible before clicking into the failed jobs list
 - **Error frequency report** — `GET /jobs/failed_jobs/errors` groups all failed jobs by error class and message prefix, shows a count per group, and surfaces a sample backtrace in an expandable row; sorted by count descending so the most common errors appear first; accessible via the "Error Summary" button on the Failed Jobs page
 - **Metrics / health endpoint** — `GET /jobs/metrics.json` returns a machine-readable JSON document with job counts, throughput, per-queue depth and pause state, and process health summary; suitable for Prometheus scraping, uptime monitors, or external dashboards; `slow_jobs` count included when `slow_job_threshold` is configured
diff --git a/ROADMAP.md b/ROADMAP.md
index 2f63650..c2543fd 100644
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -14,7 +14,7 @@ Pull requests for any of these are welcome. See [Contributing](README.md#contrib
 |---|---|
 | ~~**Error frequency report**~~ | ✓ Shipped — `/jobs/failed_jobs/errors` groups all failed jobs by error class and message prefix, shows count and a sample backtrace in an expandable row; sorted by count descending; accessible via "Error Summary" button on the Failed Jobs page. |
 | ~~**Failed job trend chart**~~ | ✓ Shipped — a "Failures — Last 12 Hours" bar chart card on the dashboard shows failures per hour; red bars, empty state when no failures; same pattern as the throughput and queue depth charts. |
-| **P99 + std dev in performance analytics** | Extend `JobPerformanceStats` with a 99th percentile and standard deviation column. High std dev signals inconsistent jobs worth investigating. |
+| ~~**P99 + std dev in performance analytics**~~ | ✓ Shipped — `JobPerformanceStats` computes p99 and population std dev for each job class; both columns appear in the Performance table between p95 and Min. |
 ---
diff --git a/app/services/solid_queue_web/job_performance_stats.rb b/app/services/solid_queue_web/job_performance_stats.rb
index b880119..c6a78a4 100644
--- a/app/services/solid_queue_web/job_performance_stats.rb
+++ b/app/services/solid_queue_web/job_performance_stats.rb
@@ -1,6 +1,6 @@
 module SolidQueueWeb
   class JobPerformanceStats
-    Row = Struct.new(:class_name, :count, :avg, :p50, :p95, :min, :max, keyword_init: true)
+    Row = Struct.new(:class_name, :count, :avg, :p50, :p95, :p99, :std_dev, :min, :max, keyword_init: true)
     def initialize(scope)
       @scope = scope
@@ -18,6 +18,8 @@ def rows
           avg: mean(durations),
           p50: percentile(durations, 50),
           p95: percentile(durations, 95),
+          p99: percentile(durations, 99),
+          std_dev: std_dev(durations),
           min: durations.first,
           max: durations.last
         )
@@ -34,5 +36,11 @@ def percentile(sorted, pct)
       idx = [(pct / 100.0 * sorted.size).ceil - 1, 0].max
       sorted[idx]
     end
+
+    def std_dev(sorted)
+      return 0.0 if sorted.size < 2
+      m = mean(sorted)
+      Math.sqrt(sorted.sum { |x| (x - m)**2 } / sorted.size)
+    end
   end
 end
diff --git a/app/views/solid_queue_web/performance/index.html.erb b/app/views/solid_queue_web/performance/index.html.erb
index 7a12851..f07e1f2 100644
--- a/app/views/solid_queue_web/performance/index.html.erb
+++ b/app/views/solid_queue_web/performance/index.html.erb
@@ -21,6 +21,8 @@
 Avg
 p50
 p95
+p99
+Std Dev
 Min
 Max
@@ -36,6 +38,8 @@
 <%= format_duration(row.avg) %>
 <%= format_duration(row.p50) %>
 <%= format_duration(row.p95) %>
+<%= format_duration(row.p99) %>
+<%= format_duration(row.std_dev) %>
 <%= format_duration(row.min) %>
 <%= format_duration(row.max) %>
diff --git a/spec/requests/solid_queue_web/performance_spec.rb b/spec/requests/solid_queue_web/performance_spec.rb
index e1b1e8d..a36915e 100644
--- a/spec/requests/solid_queue_web/performance_spec.rb
+++ b/spec/requests/solid_queue_web/performance_spec.rb
@@ -67,6 +67,23 @@ def finished_job(class_name:, duration_seconds:, finished_ago: 1.hour)
       expect(response.body).to include("No finished jobs found in the last 1h")
     end
+    it "renders p99 and Std Dev column headers" do
+      finished_job(class_name: "WorkJob", duration_seconds: 10)
+
+      get "/jobs/performance"
+      expect(response.body).to include("p99")
+      expect(response.body).to include("Std Dev")
+    end
+
+    it "renders p99 and std_dev values for a job class" do
+      5.times { |i| finished_job(class_name: "WorkJob", duration_seconds: (i + 1) * 10) }
+
+      get "/jobs/performance"
+      expect(response.body).to include("WorkJob")
+      expect(response.body).to include("p99")
+      expect(response.body).to include("Std Dev")
+    end
+
     it "sorts rows by p95 descending (slowest class first)" do
       finished_job(class_name: "FastJob", duration_seconds: 2)
       finished_job(class_name: "SlowJob", duration_seconds: 120)
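
Reviewer note: the sketch below works through the two new statistics on the five durations used in the second spec (10 to 50 seconds). It copies the `percentile` and `std_dev` bodies from the diff into a hypothetical standalone module named `DurationStats`. The `mean` helper is not shown in the diff, so its definition here is an assumption. This is an illustration of the arithmetic, not the gem's code.

```ruby
# Hypothetical standalone module for illustration only; the method bodies for
# `percentile` and `std_dev` mirror the diff, `mean` is an assumed definition.
module DurationStats
  module_function

  # Arithmetic mean as a Float (assumption: the existing helper behaves like this).
  def mean(values)
    values.sum / values.size.to_f
  end

  # Nearest-rank percentile over an ascending-sorted array.
  def percentile(sorted, pct)
    idx = [(pct / 100.0 * sorted.size).ceil - 1, 0].max
    sorted[idx]
  end

  # Population standard deviation: divides by n, not n - 1.
  def std_dev(values)
    return 0.0 if values.size < 2
    m = mean(values)
    Math.sqrt(values.sum { |x| (x - m)**2 } / values.size)
  end
end

durations = [10, 20, 30, 40, 50] # seconds, already sorted ascending

puts DurationStats.percentile(durations, 50)   # 30
puts DurationStats.percentile(durations, 95)   # 50
puts DurationStats.percentile(durations, 99)   # 50
puts DurationStats.std_dev(durations).round(2) # 14.14
```

With the nearest-rank rule, p99 equals the maximum for any class with fewer than 100 finished jobs, so on small samples the p99 and Max columns will show the same value.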