Skip to content

fix: Optimize cleanup jobs, eliminate raw ES client usage, add test coverage#2267

Open
niemyjski wants to merge 1 commit into
mainfrom
niemyjski/verbose-guacamole
Open

fix: Optimize cleanup jobs, eliminate raw ES client usage, add test coverage#2267
niemyjski wants to merge 1 commit into
mainfrom
niemyjski/verbose-guacamole

Conversation

@niemyjski
Copy link
Copy Markdown
Member

@niemyjski niemyjski commented May 29, 2026

Summary

Comprehensive optimization of cleanup jobs: eliminate all direct Elasticsearch client usage, fix critical bugs, restore correct loop semantics, and add full integration test coverage.

Critical Bug Fixes

@min_count:2@min:2 in GetDuplicateSignaturesAsync

The @min_count syntax is silently ignored by Foundatio Parsers, causing the aggregation to return ALL signature hashes (not just duplicates). This would have caused FixDuplicateStacks to treat every stack as a duplicate on the next run.

FixDuplicateStacks only ran one batch (regression from refactor)

The original code looped until GetDuplicateSignaturesAsync returned empty. The refactor broke this. Restored the while loop with:

  • Batch tracking and per-batch logging
  • Infinite-loop guard: exits if an entire batch fails to process
  • ImmediateConsistency() on CountAsync inside GetDuplicateSignaturesAsync (one refresh per batch, matching original Indices.RefreshAsync call pattern — NOT per item)

ReassignStackAsync data-loss hazard on empty sequence

If sourceStackIds was empty, PatchAllAsync would have no stack filter and would reassign ALL events to the target stack. Added materialization + early return guard.

FixDuplicateStacks event-first ordering

Moved ReassignStackAsync before SaveAsync (soft-delete). If event reassignment fails, duplicate stacks remain visible and no data is lost. Previously: soft-delete first → event reassignment failure → orphan cleanup deletes the events.

GetDistinctFieldValuesAsync afterKey cursor leak

afterKey was only populated when composite.AfterKey != null, never cleared. Added: always clear cursor first, then repopulate. Callers checking afterKey.AfterKey.Count > 0 now correctly detect end-of-pagination.

Architecture (eliminate raw ES client from jobs)

  • Refactored CleanupOrphanedDataJob to use repository methods exclusively
  • Added GetDistinctFieldValuesAsync using composite aggregation (encapsulated in repository — composite aggregation is not in Foundatio's DSL, so raw client use is justified and documented)
  • Added RemoveAllByProjectIds/RemoveAllByOrganizationIds, RemoveAllByStackIdsAsync to IEventRepository
  • Added ReassignStackAsync using parameterized Painless script
  • Added GetDuplicateSignaturesAsync to IStackRepository

Other Fixes

  • Added OperationCanceledException filter before generic catch
  • Added lock renewal at page boundaries in CleanupDataJob
  • Consolidated two SaveAsync calls into one using spread syntax
  • Removed redundant is_deleted:false filter (repository applies soft-delete filter by default)
  • Reverted page size to 5 for org/project cleanup (2.5s sleep/item makes larger pages impractical)
  • CompositeKeyResult converted from class to record
  • Pattern matching throughout (is null/is not null, is [])
  • XML documentation on composite aggregation rationale and Painless script safety
  • Added :{Message} to error log format strings

Rebased onto main

Resolved merge conflict with #2268 (Deleted counter tests) — preserved all tests from both branches.

Test Coverage (39 job tests + 278 repo tests, all passing)

New job tests (named Method_Given_Expected):

  • RunAsync_SuspendedOrganization_SuspendsRelatedTokens
  • RunAsync_SoftDeletedOrganization_RemovesAllRelatedData
  • RunAsync_SoftDeletedProject_RemovesProjectAndEvents
  • RunAsync_SoftDeletedStack_RemovesStackAndEvents
  • RunAsync_EventsOutsideRetentionPeriod_RemovesExpiredEvents
  • DeleteOrphanedEventsByStack_WithLargeDataset_DeletesAllOrphanedEvents
  • CleanupSoftDeletedOrganizations_WithMultiplePages_RemovesAllData
  • CleanupSoftDeletedStacks_WithMultiplePages_RemovesAllStacks
  • EnforceRetention_WithMultipleOrganizations_RespectsPerOrgRetention
  • EnforceRetention_WithEventsOutsideRetention_DeletesOnlyExpiredEvents
  • Plus 5 deleted-counter tests from #2268 (merged)

New repository tests:

  • GetDistinctStackIds_WithMultipleStacks_ReturnsAllUniqueIds
  • GetDistinctStackIds_WithPagination_ReturnsAllIds
  • ReassignStack_WithSourceEvents_MovesAllEventsToTarget
  • RemoveAllByProjectIds_WithMatchingEvents_RemovesAll
  • RemoveAllByOrganizationIds_WithMatchingEvents_RemovesAll
  • GetDuplicateSignatures_WithDuplicates_ReturnsSignatures
  • GetDuplicateSignatures_WithNoDuplicates_ReturnsEmpty
  • GetDuplicateSignatures_WithSoftDeletedStacks_ExcludesThem

Comment thread src/Exceptionless.Core/Repositories/EventRepository.cs Fixed
Comment thread src/Exceptionless.Core/Jobs/CleanupOrphanedDataJob.cs Fixed
@niemyjski niemyjski self-assigned this May 30, 2026

namespace Exceptionless.Tests.Jobs;

public class CleanupOrphanedDataJobTests : IntegrationTestsBase
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests need 3 name part with act, assert, arrange. check pr.

}

[Fact]
public async Task CanCleanupMultipleSoftDeletedOrganizations()
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests need 3 name part with act, assert, arrange. check pr.

);
}

public async Task<IReadOnlyCollection<string>> GetDuplicateSignaturesAsync(int maxResults = 10000)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure we have integration tests for this.


public async Task<IReadOnlyCollection<string>> GetDuplicateSignaturesAsync(int maxResults = 10000)
{
var result = await CountAsync(q => q.FilterExpression("is_deleted:false")
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

soft deletes should be filtered out by default.

Comment on lines +197 to +198
if (buckets == null || buckets.Count == 0)
return Array.Empty<string>();
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use pattern matching and use []

public Task<long> RemoveAllByProjectIdsAsync(string[] projectIds)
{
ArgumentNullException.ThrowIfNull(projectIds);
if (projectIds.Length == 0)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prefer is and is not over == and !=, check pr.

return RemoveAllAsync(q => q.Organization(organizationIds));
}

public Task<long> ReassignStackAsync(IEnumerable<string> sourceStackIds, string targetStackId)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this feels dangerous where is it used? make sure we have integration tests for this and every method added in repos.

Comment on lines +265 to +266
if (composite?.Buckets == null || composite.Buckets.Count == 0)
return Array.Empty<string>();
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pattern matching here and return [] check pr for things like this.

{
var search = await _configuration.Client.SearchAsync<PersistentEvent>(s =>
{
s.Size(0).Aggregations(a => a
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should document here why we are using a composite agg. and performance / concerns.

Task<IReadOnlyCollection<string>> GetDistinctOrganizationIdsAsync(int batchSize, CompositeKeyResult? afterKey = null);
}

public class CompositeKeyResult
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use record for this

catch (Exception ex)
{
error++;
_logger.LogError(ex, "Error fixing duplicate stack {ProjectId} {SignatureHash}", projectId, signature);
Copy link
Copy Markdown
Member Author

@niemyjski niemyjski May 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

look at all log statements here we always include :{Message}

Comment on lines +235 to +236
await _stackRepository.SaveAsync(duplicateStacks);
await _stackRepository.SaveAsync(targetStack);
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we call this twice when you can just call [] overload.

namespace Exceptionless.Core.Jobs;

[Job(Description = "Deletes orphaned data.", IsContinuous = false)]
public class CleanupOrphanedDataJob : JobWithLockBase, IHealthCheck
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sure we have 100% coverage for this before we change this and it's passing in origin/main.

private async Task CleanupSoftDeletedProjectsAsync(JobContext context)
{
var projectResults = await _projectRepository.GetAllAsync(o => o.SoftDeleteMode(SoftDeleteQueryMode.DeletedOnly).SearchAfterPaging().PageLimit(5));
var projectResults = await _projectRepository.GetAllAsync(o => o.SoftDeleteMode(SoftDeleteQueryMode.DeletedOnly).SearchAfterPaging().PageLimit(100));
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we be changing the page size in this pr? wouldn't it be better to get smaller projects incase they are removed etc.. and search over those.

var projectResults = await _projectRepository.GetAllAsync(o => o.SoftDeleteMode(SoftDeleteQueryMode.DeletedOnly).SearchAfterPaging().PageLimit(100));
_logger.CleanupProjectSoftDeletes(projectResults.Total);

while (projectResults.Documents.Count > 0 && !context.CancellationToken.IsCancellationRequested)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

during cleanup what's renewing the job lock?

@niemyjski niemyjski force-pushed the niemyjski/verbose-guacamole branch from 154223e to 7788c75 Compare May 30, 2026 02:45
Comment thread src/Exceptionless.Core/Jobs/CleanupOrphanedDataJob.cs Fixed
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors cleanup jobs to rely on repository abstractions, improves duplicate stack cleanup behavior, and adds integration coverage for cleanup/repository operations.

Changes:

  • Reworked orphaned-data and cleanup jobs with lock renewal, cancellation checks, and repository-based deletes/updates.
  • Added event repository helpers for bulk deletion, stack reassignment, and distinct-id aggregation.
  • Added integration tests for cleanup pagination, retention, duplicate signatures, and event repository operations.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/Exceptionless.Core/Jobs/CleanupDataJob.cs Extends lock duration and renews locks during paged cleanup/retention loops.
src/Exceptionless.Core/Jobs/CleanupOrphanedDataJob.cs Replaces direct Elasticsearch calls with repository methods for orphan cleanup and duplicate-stack fixing.
src/Exceptionless.Core/Repositories/EventRepository.cs Adds bulk delete, stack reassignment, and composite aggregation helpers.
src/Exceptionless.Core/Repositories/Interfaces/IEventRepository.cs Exposes new event repository cleanup/query APIs and composite cursor type.
src/Exceptionless.Core/Repositories/StackRepository.cs Adds duplicate signature aggregation lookup.
src/Exceptionless.Core/Repositories/Interfaces/IStackRepository.cs Exposes duplicate signature lookup API.
tests/Exceptionless.Tests/Jobs/CleanupDataJobTests.cs Adds cleanup pagination and retention integration coverage.
tests/Exceptionless.Tests/Jobs/CleanupOrphanedDataJobTests.cs Adds integration coverage for orphan cleanup and duplicate stack merging.
tests/Exceptionless.Tests/Repositories/EventRepositoryTests.cs Adds coverage for distinct ids, stack reassignment, and bulk delete helpers.
tests/Exceptionless.Tests/Repositories/StackRepositoryTests.cs Adds duplicate signature repository coverage.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +223 to +230
public Task<long> ReassignStackAsync(IEnumerable<string> sourceStackIds, string targetStackId)
{
ArgumentNullException.ThrowIfNull(sourceStackIds);
ArgumentException.ThrowIfNullOrEmpty(targetStackId);

return PatchAllAsync(
q => q.Stack(sourceStackIds),
new ScriptPatch("ctx._source.stack_id = params.targetStackId")

var buckets = duplicateStackAgg.Aggregations.Terms("stacks")?.Buckets ?? new List<Nest.KeyedBucket<string>>();
int total = buckets.Count;
var duplicateSignatures = await _stackRepository.GetDuplicateSignaturesAsync();
…overage

- Refactor CleanupOrphanedDataJob to use repository methods exclusively
  (eliminates all direct IElasticClient usage)
- Fix critical bug: @min_count:2 → @min:2 in GetDuplicateSignaturesAsync
  (invalid syntax was silently ignored, returning ALL signatures as duplicates)
- Add lock renewal at page boundaries in CleanupDataJob
- Add OperationCanceledException filter before generic catch
- Convert CompositeKeyResult class to record
- Use pattern matching (is null/is not null, is []) throughout
- Remove redundant is_deleted:false filter (repository handles soft deletes)
- Revert page size to 5 (2.5s sleep/item makes large pages impractical)
- Consolidate duplicate SaveAsync calls using spread syntax
- Add XML documentation for composite aggregation and script safety
- Add 8 new integration tests for EventRepository and StackRepository
- Rename all tests to Method_Given_Expected convention

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@niemyjski niemyjski force-pushed the niemyjski/verbose-guacamole branch from 7788c75 to 3c5880b Compare May 30, 2026 03:27
@github-actions
Copy link
Copy Markdown

Code Coverage

Package Line Rate Branch Rate Complexity Health
Exceptionless.Insulation 25% 23% 203
Exceptionless.Web 73% 62% 3922
Exceptionless.AppHost 18% 9% 82
Exceptionless.Core 70% 64% 7855
Summary 69% (13707 / 19928) 62% (7215 / 11598) 12062

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants