Reverted unnecessary reindex test changes #5563
Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
## main #5563 +/- ##
==========================================
+ Coverage 77.36% 77.42% +0.05%
==========================================
Files 993 993
Lines 36406 36418 +12
Branches 5515 5518 +3
==========================================
+ Hits 28167 28197 +30
+ Misses 6879 6860 -19
- Partials 1360 1361 +1
// Maximum time to wait for a reindex job to reach a terminal state. Set high enough to accommodate
// multi-replica search-parameter cache convergence in CI (poll interval up to 30s, conformance refresh
// up to 60s, plus reindex worker queue scheduling and retry backoffs).
private static readonly TimeSpan ReindexJobCompletionTimeout = TimeSpan.FromMinutes(20);
Removing this will cause issues in PaaS as well as CI, as we have seen cases where reindex tests take longer than 5 minutes. If you have already solved this, great; otherwise you will recreate the failure where the test expects Completed but gets Running.
All reindex tests are passing in CI
For example, in your current run it passed the next time because it got lucky on retry with no other load.
There should not be any good reason for a single-resource test to run longer than 5 minutes. We need to look at why it is happening. Increasing the timeout is not the correct approach because it might hide the root cause.
Based on log analytics, it was a phantom host that had scaled down but still fell into the lookback period set in the convergence logic. You can take a look at this job in CI log analytics.
Finding: Of 11 reindex orchestrator jobs in this build, only Job 1571 (the first) exceeded 5 minutes — it ran 330 seconds (17:33:39 → 17:39:09 UTC) before being cancelled by the test client. Jobs 1572–1644 all completed in 81–94 seconds.
Root cause: ReindexOrchestratorJob.WaitForAllInstancesCacheSyncAsync got stuck waiting at 3/4 hosts synced. The 4th "active" host was a phantom — a replica (xhgdw or lqlp8) that ACA had scaled down ~2 min before the test started, but whose last heartbeat still fell inside the orchestrator's 180 s active-hosts window (ActiveHostsEventsMultiplier=9 × SearchParameterCacheRefreshIntervalSeconds=20s). The dead replica could never refresh its search-parameter cache hash, so the orchestrator polled until the E2E client cancelled it first.
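The timing described above can be sketched numerically. This is a minimal illustration (in Python, not the actual C# orchestrator code) of why a replica that scaled down ~2 minutes earlier still counts as "active": the constant names mirror the settings quoted in the comment, but the function and variable names here are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Values quoted in the discussion above; names are illustrative,
# not the actual FHIR server configuration keys.
ACTIVE_HOSTS_EVENTS_MULTIPLIER = 9
CACHE_REFRESH_INTERVAL = timedelta(seconds=20)

# Active-hosts lookback window: 9 * 20 s = 180 s.
ACTIVE_WINDOW = ACTIVE_HOSTS_EVENTS_MULTIPLIER * CACHE_REFRESH_INTERVAL

def is_considered_active(last_heartbeat: datetime, now: datetime) -> bool:
    """A host counts as active if its last heartbeat falls inside the window."""
    return now - last_heartbeat <= ACTIVE_WINDOW

now = datetime(2024, 1, 1, 17, 33, 39, tzinfo=timezone.utc)
# Replica scaled down ~2 minutes (120 s) before the test started:
phantom_heartbeat = now - timedelta(seconds=120)

# 120 s < 180 s, so the dead replica is still counted as active and the
# orchestrator keeps waiting for a cache refresh that can never arrive.
print(is_considered_active(phantom_heartbeat, now))  # True
```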
Good info. If we had waited a little longer, the orchestrator would have failed anyway. Therefore, increasing the total reindex wait time is not a solution for this problem. I think it is acceptable to have intermittent test failures because of this.
This is something that could happen in production as well during scaling. If nothing else, we should at least have a follow-up work item to harden the convergence logic to self-heal this case, e.g. for hosts that haven't converged, check whether they were still active in the last x minutes rather than relying only on the start-of-job lookback.
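The hardening suggested here can be sketched as a recency filter on the hosts the orchestrator waits for. This is an illustrative Python sketch under stated assumptions (hypothetical heartbeat records and a hypothetical liveness threshold), not the server's actual implementation:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical recency threshold for "still alive"; deliberately shorter
# than the 180 s start-of-job lookback discussed above.
LIVENESS_WINDOW = timedelta(seconds=60)

@dataclass
class Host:
    name: str
    last_heartbeat: datetime

def hosts_to_wait_for(hosts: list[Host], now: datetime) -> list[Host]:
    """Drop hosts with stale heartbeats: they have likely scaled down and
    will never refresh their search-parameter cache hash."""
    return [h for h in hosts if now - h.last_heartbeat <= LIVENESS_WINDOW]

now = datetime(2024, 1, 1, 17, 35, 0, tzinfo=timezone.utc)
hosts = [
    Host("live-1", now - timedelta(seconds=10)),
    Host("phantom", now - timedelta(seconds=120)),  # scaled down ~2 min ago
]
print([h.name for h in hosts_to_wait_for(hosts, now)])  # ['live-1']
```

With a filter like this, convergence would only block on hosts that have heartbeated recently, instead of on every host seen since the job's lookback start.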
I am not following. First, reindex will fail with "unable to update cache, please retry" or similar. When the customer retries, old pods should not be considered, as there are no messages in the interval the orchestrator looks at. It looks like we do not need to do anything. Am I missing something?
Yes, it's a bad customer experience for something we can solve.
We are discussing current functionality that exists in PROD, and it is not related to this PR.
I don't see indications that we have PROD problems in this functionality, and therefore I am not comfortable adding any work items. If you think it is justified, please go ahead.
Reverts all recent changes related to the CI pipeline not working in the e2e ReindexTests class.
Restores the parallel update test in the e2e ReindexTests class to spread the update load.
Adds deletes of resources before reindex for the count-sensitive test.
Changes resource deletes on cleanup to hard deletes.
Adjusts CI settings to match the PR ones (except CPU and RAM on replicas).