Fix OpenSearch source pagination to handle failures correctly #6891
Merged
graytaylor0 merged 2 commits on May 30, 2026
Pagination previously terminated whenever a page returned fewer documents than the configured batch_size, which silently dropped the rest of an index whenever a request hit partial shard failures or a transient error. The correct termination signal is used instead: nextSearchAfter == null or an empty page for the search_after and PIT workers, and an empty page for the scroll worker.

Shard failures are now captured in a bounded map of normalized reason -> count (capped at 20 distinct keys with an "__other__" overflow bucket), persisted on OpenSearchIndexProgressState, surfaced as new counters (searchShardsFailed, searchRequestsFailed, indicesCompletedWithFailures), and logged per page plus once at index completion.

The scroll worker no longer aborts an index on a single per-request exception; it tolerates up to MAX_CONSECUTIVE_SCROLL_FAILURES retries before giving up the partition.

Signed-off-by: Keyur-S-Patel <keyurpatel.opensource@gmail.com>

Fixes opensearch-project#6337
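The termination change described in this commit message can be illustrated with a minimal sketch. The class and method names below are hypothetical and are not taken from the PR; only the conditions (short page under the old heuristic, nextSearchAfter == null or an empty page under the new one) come from the text above.

```java
// Hypothetical sketch; names and signatures are illustrative, not the PR's actual code.
class PaginationTerminationSketch {

    // Old heuristic described above: a page smaller than batch_size was treated as the last page.
    static boolean oldHasMorePages(int docsInPage, int batchSize) {
        return docsInPage >= batchSize;
    }

    // search_after / PIT workers: stop only on an empty page or a missing next cursor.
    static boolean hasMorePagesSearchAfter(int docsInPage, Object[] nextSearchAfter) {
        return docsInPage > 0 && nextSearchAfter != null;
    }

    // Scroll worker: stop only on an empty page.
    static boolean hasMorePagesScroll(int docsInPage) {
        return docsInPage > 0;
    }

    public static void main(String[] args) {
        int batchSize = 1000;
        int shortPage = 400; // e.g. some shards failed, so the page came back short

        System.out.println(oldHasMorePages(shortPage, batchSize));                       // false
        System.out.println(hasMorePagesSearchAfter(shortPage, new Object[] {"id-400"})); // true
        System.out.println(hasMorePagesScroll(shortPage));                               // true
    }
}
```

With the old predicate, the 400-document page ends pagination and the remaining documents are never read; with the new predicates, the worker keeps going until a page is actually empty or no cursor is returned.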
…urce

- Add @JsonProperty on fields for explicit bidirectional JSON mapping
- Extract ShardFailureAggregator for cleaner separation of concerns
- Remove unused totalHits from SearchScrollResponse
- Add IP:port normalization to SearchShardStatistics
- Improve scroll failure log message to note possible skipped documents
- Add unit tests for WorkerCommonUtils.hasMorePages
- Add ScrollWorker test for short-page continuation (key fix behavior)
- Replace brittle call-count mocking with per-page result objects in PitWorkerTest
- Replace Thread.sleep with awaitility in NoSearchContextWorkerTest

Signed-off-by: Keyur-S-Patel <keyurpatel.opensource@gmail.com>
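To make the bounded failure map and the IP:port normalization more concrete, here is a rough sketch. It is not the PR's ShardFailureAggregator or SearchShardStatistics code: the class name, the regular expression, the "<ip:port>" placeholder, and the choice to count the overflow bucket outside the 20-key cap are all assumptions. Only the cap of 20 and the "__other__" key come from the PR text.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a bounded "normalized reason -> count" map.
class BoundedFailureCountsSketch {
    static final int MAX_DISTINCT_REASONS = 20;     // cap stated in the PR
    static final String OVERFLOW_KEY = "__other__"; // overflow bucket named in the PR

    private final Map<String, Long> counts = new LinkedHashMap<>();

    // Assumed normalization: replace IPv4 host:port pairs with a placeholder so that
    // the same failure reported by different nodes collapses into one key.
    static String normalize(String reason) {
        if (reason == null || reason.trim().isEmpty()) {
            return "unknown";
        }
        return reason.replaceAll("\\b\\d{1,3}(\\.\\d{1,3}){3}:\\d{1,5}\\b", "<ip:port>").trim();
    }

    void record(String rawReason) {
        String key = normalize(rawReason);
        // Assumption: the overflow bucket does not count toward the distinct-key cap.
        int distinct = counts.size() - (counts.containsKey(OVERFLOW_KEY) ? 1 : 0);
        if (!counts.containsKey(key) && distinct >= MAX_DISTINCT_REASONS) {
            key = OVERFLOW_KEY;
        }
        counts.merge(key, 1L, Long::sum);
    }

    // Defensive copy so callers cannot mutate the internal state.
    Map<String, Long> snapshot() {
        return new LinkedHashMap<>(counts);
    }
}
```

Capping the number of keys keeps the persisted progress state small even when every shard reports a slightly different message; anything past the cap is still counted, just without its individual reason.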
45d78ed to 5852f9b
Contributor (Author)
Created a new PR due to DCO failure: #6829
graytaylor0 approved these changes May 29, 2026
Zhangxunmt approved these changes May 29, 2026
Description
Fixes #6337
Pagination previously terminated whenever a page returned fewer documents than the configured batch_size, which silently dropped the rest of an index whenever a request hit partial shard failures or a transient error.

Changes
Pagination termination fix:
- nextSearchAfter == null or empty page

Shard failure tracking:
- SearchShardStatistics model captures shard-level failures from search responses
- Map<String, Long> (capped at 20 distinct keys with __other__ overflow bucket)

Scroll worker resilience:
- SearchContextLimitException and IndexNotFoundException still abort immediately

Metrics (3 new counters):
- searchRequestsFailed: search/scroll page requests that threw
- searchShardsFailed: total failed shards observed across all pages
- indicesCompletedWithFailures: indices that completed with at least one recorded failure

Progress state persistence:
- hadSearchFailures flag and failureReasonCounts map added to OpenSearchIndexProgressState

Testing
- SearchShardStatistics (normalization, capping, merging, fromShardCounts factory)
- OpenSearchIndexProgressState (recordShardFailures, recordRequestFailure)
- WorkerCommonUtils completion tests for failure summary emission

Check List
- Signed-off-by)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
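The scroll worker resilience listed under Changes might look roughly like the loop below. This is a sketch under stated assumptions, not the PR's ScrollWorker: the value 3 for MAX_CONSECUTIVE_SCROLL_FAILURES, the FatalSearchException stand-in (representing SearchContextLimitException and IndexNotFoundException), the Supplier-based page source, and resetting the counter after a successful page are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch of tolerating consecutive scroll failures.
class ScrollRetrySketch {
    static final int MAX_CONSECUTIVE_SCROLL_FAILURES = 3; // assumed value, not from the PR

    // Stand-in for exceptions that still abort immediately.
    static class FatalSearchException extends RuntimeException {
        FatalSearchException(String message) {
            super(message);
        }
    }

    // Reads pages until an empty page is returned, retrying transient failures.
    static List<String> drain(Supplier<List<String>> nextPage) {
        List<String> documents = new ArrayList<>();
        int consecutiveFailures = 0;
        while (true) {
            List<String> page;
            try {
                page = nextPage.get();
            } catch (FatalSearchException e) {
                throw e; // abort immediately, no retry
            } catch (RuntimeException e) {
                consecutiveFailures++;
                if (consecutiveFailures > MAX_CONSECUTIVE_SCROLL_FAILURES) {
                    throw new IllegalStateException("Giving up partition after repeated scroll failures", e);
                }
                continue; // retry; documents from the failed request may be skipped
            }
            consecutiveFailures = 0; // assumption: a successful page resets the counter
            if (page.isEmpty()) {
                return documents; // an empty page is the only normal termination signal
            }
            documents.addAll(page);
        }
    }
}
```

A single failed request no longer ends the index; only a run of failures longer than the limit, or one of the fatal exception types, stops the loop early.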