
Feature/stale job monitor #5558

Draft
brendankowitz wants to merge 11 commits into main from feature/stale-job-monitor

Conversation

@brendankowitz
Member

@brendankowitz brendankowitz commented May 7, 2026

Description

This pull request introduces a stale job monitor for SQL-backed async job queues. The monitor reports the age of the oldest queued job per queue type so stalled queues can be detected before customers observe delayed operations.

Key changes

Stale job monitor

  • Added StaleJobWatchdog to query active jobs for each QueueType, compute the oldest queued job age per queue, log stale queues, and publish StaleJobMetricsNotification.
  • Added StaleJobMetricsNotification and StaleJobMetricHandler to expose the latest queue-age snapshot through the FhirServer meter as Jobs.OldestQueuedAgeSeconds with a queue_type tag.
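
The per-queue age computation described above can be sketched roughly as follows. This is a language-neutral illustration in Python, not the C# code in this PR. The `Job` record, the status strings, the queue names, and the choice to report 0 for a queue that has a running job or no queued jobs are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


# Hypothetical job record; field names are assumptions, not the PR's schema.
@dataclass(frozen=True)
class Job:
    queue_type: str
    status: str  # "Created" (queued) or "Running"
    create_date: datetime


def compute_queue_ages(jobs, queue_types, now):
    """Return {queue_type: oldest queued age in seconds} for every queue type."""
    ages = {}
    for queue_type in queue_types:
        in_queue = [j for j in jobs if j.queue_type == queue_type]
        # The running check is evaluated per queue, so a running job in one
        # queue does not hide staleness in another.
        has_running = any(j.status == "Running" for j in in_queue)
        queued = [j for j in in_queue if j.status == "Created"]
        if has_running or not queued:
            # Assumption: a queue that is making progress, or is empty, reports 0.
            ages[queue_type] = 0.0
        else:
            oldest = min(j.create_date for j in queued)
            ages[queue_type] = (now - oldest).total_seconds()
    return ages


now = datetime(2026, 5, 7, 12, 0, tzinfo=timezone.utc)
jobs = [
    Job("Export", "Created", now - timedelta(minutes=30)),
    Job("Export", "Created", now - timedelta(minutes=5)),
    Job("Import", "Running", now - timedelta(minutes=50)),
    Job("Import", "Created", now - timedelta(minutes=45)),
]
ages = compute_queue_ages(jobs, ["Export", "Import", "BulkDelete"], now)
print(ages)
```

In this sketch every queue type appears in the result even when it has no jobs, which mirrors the integration test's expectation that notifications include all queue types for an empty queue.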

Dependency injection and background service integration

  • Registered StaleJobWatchdog as a singleton in SQL Server service registration.
  • Re-registered StaleJobMetricHandler as a singleton MediatR notification handler so the observable gauge reads a stable metric snapshot.
  • Updated WatchdogsBackgroundService to start StaleJobWatchdog with the existing SQL watchdogs.
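
One way to picture why a single shared handler instance matters: the notification handler and the gauge callback have to read and write the same snapshot. The sketch below is a Python illustration of the swap-the-whole-snapshot idea, not the PR's C# implementation. The class and method names are hypothetical, and a plain attribute assignment stands in for the volatile reference swap.

```python
from types import MappingProxyType


class StaleJobMetricSnapshot:
    """Holds the latest queue-age snapshot. All names here are hypothetical."""

    def __init__(self):
        # Start with an empty, read-only snapshot.
        self._snapshot = MappingProxyType({})

    def handle(self, queue_ages):
        # Build the complete new mapping first, then publish it with a single
        # reference assignment, so a reader sees either the old snapshot or
        # the new one, never a mix of both.
        self._snapshot = MappingProxyType(dict(queue_ages))

    def observe(self):
        # Read the reference once and iterate only that object.
        snapshot = self._snapshot
        return [(age, {"queue_type": q}) for q, age in snapshot.items()]


# A single shared instance plays the role of the singleton registration.
handler = StaleJobMetricSnapshot()
handler.handle({"Export": 1800.0, "Import": 0.0})
measurements = handler.observe()
print(measurements)
```

If the handler were created per notification instead, each instance would hold its own snapshot and the gauge callback would keep reading whichever instance it was first bound to.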

Testing and documentation

  • Added logic tests for queue age computation and metric snapshot updates.
  • Added a SQL watchdog integration test that verifies notifications include all queue types when the queue is empty.
  • Added ADR documentation at docs/arch/adr-2605-stale-job-monitor.md.

Related issues

Addresses AB#164461.

Testing

  • dotnet build .\src\Microsoft.Health.Fhir.Core\Microsoft.Health.Fhir.Core.csproj -c Release -f net8.0 --no-restore
  • dotnet build .\src\Microsoft.Health.Fhir.Core\Microsoft.Health.Fhir.Core.csproj -c Release -f net9.0 --no-restore
  • dotnet build .\src\Microsoft.Health.Fhir.SqlServer\Microsoft.Health.Fhir.SqlServer.csproj -c Release -f net8.0 --no-restore
  • dotnet build .\src\Microsoft.Health.Fhir.SqlServer\Microsoft.Health.Fhir.SqlServer.csproj -c Release -f net9.0 --no-restore

FHIR Team Checklist

  • Title is succinct and less than 65 characters.
  • Milestone added for the sprint that it is merged.
  • Tagged with the type of update: New Feature.
  • Tagged with release area: Azure Healthcare APIs.
  • Tagged with PaaS compatibility: No-PaaS-breaking-change.
  • ADR included: docs/arch/adr-2605-stale-job-monitor.md.
  • CI is green before merge.
  • Reviewed squash-merge requirements.

Semver Change

Feature

brendankowitz and others added 9 commits April 21, 2026 11:12
Spec for a StaleJobWatchdog that emits fhir_oldest_queued_job_age_seconds
Prometheus gauge per queue type when no jobs are running.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
6-task plan: notification, metric handler, watchdog, WatchdogsBackgroundService
wiring, DI registration, integration test.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Previously, a single running job in any queue masked staleness in every
  other queue, defeating per-queue-type alerting. ComputeQueueAges now
  evaluates the running check per queue.
- StaleJobMetricHandler swapped from a per-key-updated ConcurrentDictionary
  to a volatile reference swap so ObservableGauge scrapes never observe a
  partial multi-queue update.
- Added logic test asserting a running job in one queue does not suppress
  another queue's staleness.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@brendankowitz brendankowitz added the New Feature, Azure Healthcare APIs, No-PaaS-breaking-change, and ADR-Included labels May 7, 2026
@brendankowitz brendankowitz added this to the FY26\Q4\2Wk\2Wk23 milestone May 7, 2026
brendankowitz and others added 2 commits May 11, 2026 12:02
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add an ObservableGauge<long> named Jobs.QueueDepth to the existing
StaleJobMetricHandler, using the same per-tick SQL result set already
fetched by StaleJobWatchdog. Reports pending (Created) and running job
counts per QueueType via queue_type and state tags, complementing the
existing Jobs.OldestQueuedAgeSeconds metric for full active-queue
observability. ADR 2605 amended with the depth metric decision.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
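
The depth metric from the last commit could be derived from the same per-tick rows roughly like this. It is a Python sketch under stated assumptions, not the PR's code: the `(queue_type, state)` row shape, the queue names, and the decision to emit a zero for every queue/state pair are hypothetical.

```python
from collections import Counter


def compute_queue_depth(rows, queue_types, states=("Created", "Running")):
    """Return (count, tags) measurements for every queue type and state.

    rows: iterable of (queue_type, state) pairs, one per active job.
    """
    counts = Counter(rows)
    return [
        # Counter returns 0 for missing keys, so idle queues still report.
        (counts[(queue_type, state)], {"queue_type": queue_type, "state": state})
        for queue_type in queue_types
        for state in states
    ]


rows = [("Export", "Created"), ("Export", "Created"), ("Import", "Running")]
depth = compute_queue_depth(rows, ["Export", "Import"])
for count, tags in depth:
    print(count, tags)
```

Emitting explicit zeros keeps each `queue_type`/`state` series continuous. Whether the actual handler does this is not stated in the PR.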
