Skip to content

Reverted unnecessary reindex test changes#5563

Merged
SergeyGaluzo merged 13 commits into
mainfrom
users/sergal/tests-back
May 14, 2026
Merged

Reverted unnecessary reindex test changes#5563
SergeyGaluzo merged 13 commits into
mainfrom
users/sergal/tests-back

Conversation

@SergeyGaluzo
Copy link
Copy Markdown
Contributor

@SergeyGaluzo SergeyGaluzo commented May 12, 2026

Reverts back all recent changes related to CI pipeline not working in the e2e ReindexTests class.
Reverts back parallel update test in e2e ReindexTests class to spread the update load.
Add deletes of resources before reindex for count sensitive test.
Changes resource deletes on cleanup to hard deletes.
Adjusted CI settings to match PR ones (exception CPU and RAM on replicas)

@SergeyGaluzo SergeyGaluzo requested a review from a team as a code owner May 12, 2026 16:04
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.42%. Comparing base (57ff116) to head (acf203d).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #5563      +/-   ##
==========================================
+ Coverage   77.36%   77.42%   +0.05%     
==========================================
  Files         993      993              
  Lines       36406    36418      +12     
  Branches     5515     5518       +3     
==========================================
+ Hits        28167    28197      +30     
+ Misses       6879     6860      -19     
- Partials     1360     1361       +1     

see 13 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@SergeyGaluzo SergeyGaluzo changed the title Reverted previous reindex test changes Reverted unnecessary reindex test changes May 12, 2026
Comment thread test/Microsoft.Health.Fhir.Shared.Tests.E2E/Rest/Reindex/ReindexTests.cs Dismissed
fhibf
fhibf previously approved these changes May 13, 2026
// Maximum time to wait for a reindex job to reach a terminal state. Set high enough to accommodate
// multi-replica search-parameter cache convergence in CI (poll interval up to 30s, conformance refresh
// up to 60s, plus reindex worker queue scheduling and retry backoffs).
private static readonly TimeSpan ReindexJobCompletionTimeout = TimeSpan.FromMinutes(20);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removing this will cause issues for in PaaS as well as CI as we have seen cases where reindex tests take longer than 5 minutes. IF you have already solved this, great, otherwise you will recreate this failure of the failing as it expects completed but got running......

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All reindex tests are passing in CI

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor Author

@SergeyGaluzo SergeyGaluzo May 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should not any good reason for single resource test to run longer than 5 minutes. We need to look at why it is happening. Increasing time is not a correct approach because it might hide the root cause.

Copy link
Copy Markdown
Contributor

@jestradaMS jestradaMS May 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on log analytics it was a phantom host that had scaled down but fell into the lookback period set in the convergence logic. You can take a look at this job in CI log analytics.

Finding: Of 11 reindex orchestrator jobs in this build, only Job 1571 (the first) exceeded 5 minutes — it ran 330 seconds (17:33:39 → 17:39:09 UTC) before being cancelled by the test client. Jobs 1572–1644 all completed in 81–94 seconds.

Root cause: ReindexOrchestratorJob.WaitForAllInstancesCacheSyncAsync got stuck waiting for 3/4 hosts synced. The 4th "active" host was a phantom — a replica (xhgdw or lqlp8) that ACA had scaled down ~2 min before the test started but whose last heartbeat still
fell inside the orchestrator's 180 s active-hosts window (ActiveHostsEventsMultiplier=9 × SearchParameterCacheRefreshIntervalSeconds=20s). The dead replica could never refresh its search-parameter cache hash, so the orchestrator polled until the E2E client
cancelled it first.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good info. If we waited a little bit longer, orchestrator would have failed anyway. Therefore, increase in total reindex wait time is not a solution for this problem. I think it is acceptable to have intermittent tests failures because of this,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something that could happen in production as well during scaling. If nothing else we should at least have a follow up work item to harden the convergence for self healing this case e.g. for hosts that haven't converged, check if they are still active in last x time vs just relying on the start lookback.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not following. First, reindex will fail with "unable to update cache please retry" or such. When customer retries, old pods should not be considered as there are no messages in the interval orchestrator looks at. Looks that we do not need to do anything. Am I missing sometging?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's a bad customer experience for something we can solve.

Copy link
Copy Markdown
Contributor Author

@SergeyGaluzo SergeyGaluzo May 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are discussing current functionality that exists in PROD, and it is not related to this PR.
I don't see indications that we have PROD problems in this functionality, and therefore I am not comfortable to add any work items. If you think it is justified, please go ahead.

Comment thread test/Microsoft.Health.Fhir.Shared.Tests.E2E/Rest/Reindex/ReindexTests.cs Outdated
@SergeyGaluzo SergeyGaluzo merged commit 0c27653 into main May 14, 2026
48 of 49 checks passed
@SergeyGaluzo SergeyGaluzo deleted the users/sergal/tests-back branch May 14, 2026 23:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants