Fix duplicate plan limit overage emails #2269
Pull request overview
This PR addresses duplicate “plan limit overage” organization emails by adding queue-level duplicate detection (via a stable work-item unique identifier) and by tightening handler-side suppression logic so stale duplicates can’t trigger extra monthly overage emails later.
Changes:
- Add a stable UniqueIdentifier to OrganizationNotificationWorkItem to enable cross-pod/work-queue deduplication.
- Update OrganizationNotificationWorkItemHandler to (a) ignore hourly-only items and (b) suppress repeat monthly sends using a per-organization 24h “monthly-sent” cache marker plus a monthly-only lock.
- Add regression tests covering delayed duplicate processing, hourly-then-monthly ordering, org isolation, and queue dedup behavior.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| tests/Exceptionless.Tests/Mail/CountingMailer.cs | Adds a test mailer that records organization-notice sends for assertions. |
| tests/Exceptionless.Tests/Jobs/WorkItemHandlers/OrganizationNotificationWorkItemHandlerTests.cs | Adds regression tests for duplicate enqueue/processing and correct monthly notification behavior. |
| src/Exceptionless.Core/Models/WorkItems/OrganizationNotificationWorkItem.cs | Implements IHaveUniqueIdentifier to provide a stable dedup key per org + overage type. |
| src/Exceptionless.Core/Jobs/WorkItemHandlers/OrganizationNotificationWorkItemHandler.cs | Reworks handler throttling/suppression: monthly-only lock + 24h “sent” marker; hourly-only items no longer suppress monthly. |
| src/Exceptionless.Core/Bootstrapper.cs | Registers DuplicateDetectionQueueBehavior<WorkItemData> and wires queue behaviors into queue creation. |
    public required bool IsOverHourlyLimit { get; init; }
    public required bool IsOverMonthlyLimit { get; init; }

    public string? UniqueIdentifier => $"org-notification:{OrganizationId}:{(IsOverMonthlyLimit ? "monthly" : "hourly")}";
Should this be Organization:{OrganizationId}:notification:{(IsOverMonthlyLimit ? "monthly" : "hourly")}?
I normalized the notification type helpers, but I kept UniqueIdentifier on the deployed org-notification:{org}:{type} key so mixed-version web nodes still share the same queue dedupe marker during rollout.
Correct — the UniqueIdentifier is now Organization:{orgId}:notification:{type} (monthly or hourly) via GetNotificationKey. All legacy key helpers removed.
268ea30 to 78bf071
Updated this PR with the deeper RCA and coverage. The most likely failure mode is: (1) every web pod subscribes to |
    services.ReplaceSingleton<ICacheClient>(sp => new InMemoryCacheClient(new InMemoryCacheClientOptions
    {
        TimeProvider = sp.GetRequiredService<TimeProvider>(),
        LoggerFactory = sp.GetRequiredService<ILoggerFactory>()
    }));

    services.ReplaceSingleton<IMessageBus>(sp => new InMemoryMessageBus(new InMemoryMessageBusOptions
    {
        Serializer = sp.GetRequiredService<ISerializer>(),
        TimeProvider = sp.GetRequiredService<TimeProvider>(),
        LoggerFactory = sp.GetRequiredService<ILoggerFactory>()
    }));

    services.ReplaceSingleton<IMessagePublisher>(sp => sp.GetRequiredService<IMessageBus>());
    services.ReplaceSingleton<IMessageSubscriber>(sp => sp.GetRequiredService<IMessageBus>());
This should already be the default. Why are we registering it again?
    }

    [Fact]
    public async Task RunAsync_WhenOnePlanOverageIsObservedBySixSubscribersWithQueueDedup_ShouldEnqueueOneWorkItem()
Three-part name. Check the PR.
    public override Task<ILock?> GetWorkItemLockAsync(object workItem, CancellationToken cancellationToken = default)
    {
        var wi = (OrganizationNotificationWorkItem)workItem;
        if (!ShouldSendNotificationEmail(wi))
            return Task.FromResult<ILock?>(null);

        return _lockProvider.TryAcquireAsync(GetLegacyNotificationLockKey(wi.OrganizationId, wi.NotificationType), TimeSpan.FromMinutes(15), cancellationToken);
    }

    public static string GetLegacyNotificationLockKey(string organizationId, string notificationType)
    {
        return notificationType == OrganizationNotificationWorkItem.MonthlyNotificationType
            ? $"{nameof(OrganizationNotificationWorkItemHandler)}:{organizationId}:{notificationType}-lock"
            : GetNotificationLockKey(organizationId, notificationType);
    }
Root cause: every web pod subscribes to PlanOverage at startup via EnqueueOrganizationNotificationOnPlanOverage. Foundatio pub/sub delivers each message to all subscribers, so a single monthly overage event enqueued one work item per running web pod. The original ThrottlingLockProvider(1/hour) allowed exactly one item through per calendar-hour bucket; abandoned duplicates were re-queued and reprocessed once each new bucket opened, producing one email per hour for each duplicate item.

Fix:
- Queue-level dedup: OrganizationNotificationWorkItem implements IHaveUniqueIdentifier and DuplicateDetectionQueueBehavior is registered so fanout enqueues collapse to one item.
- Handler-level idempotency: per-org distributed lock (30 min) + 24-hour sent marker ensure stale duplicates already in the queue at deploy time cannot retrigger an email.
- Hourly items short-circuit at GetWorkItemLockAsync and never enter the lock/sent-key path, preventing hourly overages from suppressing subsequent monthly notifications.

Also add RCA-pinning unit tests (TestWithServices) and integration tests (IntegrationTestsBase) covering fanout dedup, legacy hourly throttle regression, per-org isolation, 24h resend window, hourly-before-monthly ordering, and idempotency via existing sent marker.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
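The root cause described in the commit message can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: `OverageFanoutSimulation` and its `Run` method are hypothetical names, the queue and throttle are plain in-memory stand-ins rather than Foundatio types, and each pending item is assumed to get exactly one attempt per hour bucket.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical model, not Foundatio code: one queued item per subscriber,
// plus a throttle that admits a single item per hour bucket.
public static class OverageFanoutSimulation
{
    // Returns the hour (0-based) at which each email would be sent.
    public static List<int> Run(int subscriberCount, int hoursToSimulate)
    {
        // Pub/sub fanout: every subscriber enqueues its own copy of the item.
        var queue = new Queue<string>(Enumerable.Repeat("org1:monthly", subscriberCount));
        var emailHours = new List<int>();

        for (int hour = 0; hour < hoursToSimulate && queue.Count > 0; hour++)
        {
            bool slotUsed = false;      // throttle: one slot per hour bucket
            int pending = queue.Count;  // each item gets one attempt this hour

            for (int i = 0; i < pending; i++)
            {
                string item = queue.Dequeue();
                if (!slotUsed)
                {
                    slotUsed = true;
                    emailHours.Add(hour);   // lock acquired: the email is sent
                }
                else
                {
                    queue.Enqueue(item);    // lock denied: abandoned and retried later
                }
            }
        }

        return emailHours;
    }
}
```

Under these assumptions, `OverageFanoutSimulation.Run(6, 24)` returns one send in each of the first six hours, which is the shape of the reported pattern. The real retry timing depends on queue configuration that this model does not capture.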
78bf071 to 97940c8
    {
        // Arrange
        using var workItemQueue = CreateWorkItemQueue(
            new DuplicateDetectionQueueBehavior<WorkItemData>(CacheClient, GetService<ILoggerFactory>(), TimeSpan.FromHours(24)));
Root cause
Every web pod registers EnqueueOrganizationNotificationOnPlanOverage at startup. Foundatio pub/sub delivers each PlanOverage message to all subscribers, so a single monthly overage event enqueued one work item per running web pod (e.g. 6 pods → 6 identical items).

The previous ThrottlingLockProvider(slotsPerPeriod: 1, period: 1 hour) allowed exactly one item through per calendar-hour bucket. When a duplicate item lost the lock race it was abandoned back to the queue (not discarded). Once the next hour bucket opened, the item was reprocessed and acquired a fresh lock, producing one email per hour for each duplicate and matching the reported six-emails-over-a-day pattern.

The TimeSpan.FromMinutes(15) in the old GetWorkItemLockAsync was the work-item processing timeout (how long the lock was held during execution), not the throttle window; these are independent parameters.

Could the bot-cleanup job have retriggered the edge? Unlikely: bot event deletion removes documents from Elasticsearch but does not decrement the Redis usage counters that IncrementTotalAsync uses for edge detection, so the monthly overage edge would not re-fire from cleanup.

Fix
Two independent layers, both required:
1. Queue-level dedup: OrganizationNotificationWorkItem implements IHaveUniqueIdentifier and DuplicateDetectionQueueBehavior<WorkItemData> is registered in Bootstrapper. The unique identifier is Organization:{orgId}:notification:{type} (via GetNotificationKey). Fanout enqueues from all pods collapse to a single queue entry.
2. Handler-level idempotency: In OrganizationNotificationWorkItemHandler:
   - A per-org distributed lock (30-min lease) plus a 24-hour sent marker (Organization:{orgId}:notification:monthly-sent) ensures stale duplicates already in the queue at deploy time cannot retrigger an email.
   - Hourly items short-circuit at GetWorkItemLockAsync (return null lock) so they never occupy the lock/sent-key path and cannot suppress a later monthly notification.

Known limitations (acceptable trade-offs, documented in code comments):
- If SendOverageNotificationsAsync throws mid-loop, some recipients will already have received the email and will receive it again on retry. This is intentional: suppressing retries on partial failure would silently skip un-notified users.

Changes
| File | Description |
|---|---|
| OrganizationNotificationWorkItemHandler.cs | Replaces ThrottlingLockProvider with handler lock + 24h sent marker; hourly items bypass email path; 30-min lock lease; class XML doc with RCA |
| OrganizationNotificationWorkItem.cs | IHaveUniqueIdentifier, NotificationType constants, GetNotificationKey static helper; removed all legacy key helpers |
| Bootstrapper.cs | DuplicateDetectionQueueBehavior<WorkItemData> via DI |
| OrganizationNotificationWorkItemHandlerTests.cs | |
| OrganizationNotificationWorkItemHandlerIntegrationTests.cs | |
| CountingMailer.cs | |

Testing
All 10 notification tests pass. Full build is clean.
Breaking changes
None.
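
The two layers described under Fix can be sketched as follows. This is a simplified, hypothetical model rather than the PR's implementation: `NotificationPipelineSketch`, `TryEnqueue`, and `ProcessAll` are illustrative names, and the in-memory set and dictionary stand in for the queue behavior and cache client. Only the key format and the 24-hour window are taken from the description; the distributed lock is omitted.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory stand-ins for the queue dedup and the sent marker.
public sealed class NotificationPipelineSketch
{
    private readonly HashSet<string> _queuedIds = new();               // layer 1: queue dedup
    private readonly Dictionary<string, DateTime> _sentUntil = new();  // layer 2: sent marker
    private readonly Queue<(string OrgId, string Type)> _queue = new();

    public int EmailsSent { get; private set; }

    public static string GetKey(string orgId, string type) =>
        $"Organization:{orgId}:notification:{type}";

    // Layer 1: drop an enqueue whose unique identifier is already queued.
    public bool TryEnqueue(string orgId, string type)
    {
        if (!_queuedIds.Add(GetKey(orgId, type)))
            return false;
        _queue.Enqueue((orgId, type));
        return true;
    }

    // Layer 2: skip hourly items, and skip monthly items with a live sent marker.
    public void ProcessAll(DateTime now)
    {
        while (_queue.Count > 0)
        {
            var (orgId, type) = _queue.Dequeue();
            _queuedIds.Remove(GetKey(orgId, type));

            if (type != "monthly")
                continue; // hourly items never touch the sent marker

            string markerKey = GetKey(orgId, type) + "-sent";
            if (_sentUntil.TryGetValue(markerKey, out var expires) && expires > now)
                continue; // already sent within the last 24 hours

            EmailsSent++;
            _sentUntil[markerKey] = now.AddHours(24);
        }
    }
}
```

In this model, six fanout enqueues for the same organization collapse to one item, a stale duplicate processed an hour later is suppressed by the marker, and an hourly item processed first does not block the monthly email.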