Feature/stale job monitor #5558
Draft
brendankowitz wants to merge 11 commits into
Spec for a StaleJobWatchdog that emits fhir_oldest_queued_job_age_seconds Prometheus gauge per queue type when no jobs are running. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
6-task plan: notification, metric handler, watchdog, WatchdogsBackgroundService wiring, DI registration, integration test. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Previously, a single running job in any queue masked staleness in every other queue, defeating per-queue-type alerting. ComputeQueueAges now evaluates the running check per queue.
- StaleJobMetricHandler now replaces its snapshot with a single volatile reference swap instead of updating a ConcurrentDictionary key by key, so ObservableGauge scrapes never observe a partial multi-queue update.
- Added a logic test asserting that a running job in one queue does not suppress another queue's staleness.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add an ObservableGauge<long> named Jobs.QueueDepth to the existing StaleJobMetricHandler, using the same per-tick SQL result set already fetched by StaleJobWatchdog. Reports pending (Created) and running job counts per QueueType via queue_type and state tags, complementing the existing Jobs.OldestQueuedAgeSeconds metric for full active-queue observability. ADR 2605 amended with the depth metric decision. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
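The snapshot-swap approach described in the commit notes above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the class name QueueAgeGaugeSketch and the Publish and Observe methods are hypothetical, and only the meter name (FhirServer), the instrument name (Jobs.OldestQueuedAgeSeconds), and the queue_type tag come from the PR text. The gauge callback reads the snapshot reference once, so it sees either the previous tick's values or the new ones, never a mixture.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;
using System.Linq;

// Hypothetical stand-in for the PR's StaleJobMetricHandler (names are assumptions).
public sealed class QueueAgeGaugeSketch : IDisposable
{
    private readonly Meter _meter;

    // The whole snapshot is replaced at once; it is never mutated after publication.
    private volatile IReadOnlyDictionary<string, long> _snapshot = new Dictionary<string, long>();

    public QueueAgeGaugeSketch(string meterName = "FhirServer")
    {
        _meter = new Meter(meterName);

        // Explicit delegate type selects the multi-measurement overload.
        Func<IEnumerable<Measurement<long>>> observe = Observe;
        _meter.CreateObservableGauge<long>("Jobs.OldestQueuedAgeSeconds", observe);
    }

    // Called once per watchdog tick with the complete result for all queue types.
    public void Publish(IReadOnlyDictionary<string, long> agesByQueueType)
    {
        // Copy first so later changes by the caller cannot leak into the snapshot,
        // then swap the reference in a single volatile write.
        _snapshot = new Dictionary<string, long>(agesByQueueType);
    }

    // Invoked by the metrics pipeline on each scrape.
    public IEnumerable<Measurement<long>> Observe()
    {
        IReadOnlyDictionary<string, long> current = _snapshot; // single volatile read
        return current
            .Select(kv => new Measurement<long>(
                kv.Value,
                new KeyValuePair<string, object?>("queue_type", kv.Key)))
            .ToList();
    }

    public void Dispose() => _meter.Dispose();
}
```

A per-key update on a ConcurrentDictionary is thread-safe for each key, but a scrape that lands between two key updates could report one queue from the new tick and another from the old one. Swapping one reference avoids that at the cost of allocating a small dictionary per tick.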
Description
This pull request introduces a stale job monitor for SQL-backed async job queues. The monitor reports the age of the oldest queued job per queue type so stalled queues can be detected before customers observe delayed operations.
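As a rough illustration of the per-queue age calculation, the sketch below groups active jobs by queue type and applies the running check inside each group. The ActiveJob record, the JobState enum, and the choice to report 0 for a queue that has a running job or nothing queued are assumptions made for this example; only the ComputeQueueAges name and the per-queue running check come from the PR's commit notes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical row shape; the real query result in the PR may differ.
public enum JobState { Created, Running }

public sealed record ActiveJob(string QueueType, JobState State, DateTimeOffset CreateDate);

public static class QueueAgeSketch
{
    // For each queue type, returns the age in seconds of the oldest job still in
    // the Created state, or 0 when that queue has a running job or nothing queued.
    public static Dictionary<string, long> ComputeQueueAges(
        IEnumerable<ActiveJob> jobs, DateTimeOffset now)
    {
        var ages = new Dictionary<string, long>();

        foreach (var queue in jobs.GroupBy(j => j.QueueType))
        {
            // The running check is scoped to this queue only, so work in one
            // queue cannot hide a stalled backlog in another.
            bool anyRunning = queue.Any(j => j.State == JobState.Running);
            var queued = queue.Where(j => j.State == JobState.Created).ToList();

            if (anyRunning || queued.Count == 0)
            {
                ages[queue.Key] = 0;
                continue;
            }

            DateTimeOffset oldest = queued.Min(j => j.CreateDate);
            ages[queue.Key] = Math.Max(0L, (long)(now - oldest).TotalSeconds);
        }

        return ages;
    }
}
```

With one queue holding a running job plus an old queued job, and a second queue holding only a job queued 300 seconds ago, this sketch reports 0 for the first queue and 300 for the second.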
Key changes
Stale job monitor
- StaleJobWatchdog queries active jobs for each QueueType, computes the oldest queued job age per queue, logs stale queues, and publishes StaleJobMetricsNotification.
- StaleJobMetricsNotification and StaleJobMetricHandler expose the latest queue-age snapshot through the FhirServer meter as Jobs.OldestQueuedAgeSeconds with a queue_type tag.
Dependency injection and background service integration
- StaleJobWatchdog is registered as a singleton in SQL Server service registration.
- StaleJobMetricHandler is registered as a singleton MediatR notification handler so the observable gauge reads a stable metric snapshot.
- WatchdogsBackgroundService starts StaleJobWatchdog with the existing SQL watchdogs.
Testing and documentation
- docs/arch/adr-2605-stale-job-monitor.md.
Related issues
Addresses AB#164461.
Testing
- dotnet build .\src\Microsoft.Health.Fhir.Core\Microsoft.Health.Fhir.Core.csproj -c Release -f net8.0 --no-restore
- dotnet build .\src\Microsoft.Health.Fhir.Core\Microsoft.Health.Fhir.Core.csproj -c Release -f net9.0 --no-restore
- dotnet build .\src\Microsoft.Health.Fhir.SqlServer\Microsoft.Health.Fhir.SqlServer.csproj -c Release -f net8.0 --no-restore
- dotnet build .\src\Microsoft.Health.Fhir.SqlServer\Microsoft.Health.Fhir.SqlServer.csproj -c Release -f net9.0 --no-restore
FHIR Team Checklist
- New Feature.
- Azure Healthcare APIs.
- No-PaaS-breaking-change.
- docs/arch/adr-2605-stale-job-monitor.md.
Semver Change
Feature