Skip to content

Fix OpenSearch source pagination to handle failures correctly#6891

Merged
graytaylor0 merged 2 commits into
opensearch-project:mainfrom
Keyur-S-Patel:fix/6337-opensearch-source-failure-handling-v2
May 30, 2026
Merged

Fix OpenSearch source pagination to handle failures correctly#6891
graytaylor0 merged 2 commits into
opensearch-project:mainfrom
Keyur-S-Patel:fix/6337-opensearch-source-failure-handling-v2

Conversation

@Keyur-S-Patel
Copy link
Copy Markdown
Contributor

Description

Fixes #6337

Pagination previously terminated whenever a page returned fewer documents than the configured batch_size, which silently dropped the rest of an index whenever a request hit partial shard failures or a transient error.

Changes

Pagination termination fix:

  • search_after/PIT workers: stop only when nextSearchAfter == null or empty page
  • Scroll worker: stop only when page is empty
  • Short pages are no longer treated as end-of-index

Shard failure tracking:

  • New SearchShardStatistics model captures shard-level failures from search responses
  • Failure reasons are normalized (shard IDs, node IDs, UUIDs stripped) and aggregated into a bounded Map<String, Long> (capped at 20 distinct keys with __other__ overflow bucket)
  • Failures are logged per page and summarized at index completion

Scroll worker resilience:

  • Tolerates up to 3 consecutive scroll failures before giving up the partition
  • SearchContextLimitException and IndexNotFoundException still abort immediately

Metrics (3 new counters):

  • searchRequestsFailed — search/scroll page requests that threw
  • searchShardsFailed — total failed shards observed across all pages
  • indicesCompletedWithFailures — indices that completed with at least one recorded failure

Progress state persistence:

  • hadSearchFailures flag and failureReasonCounts map added to OpenSearchIndexProgressState
  • Backward compatible: old persisted state deserializes cleanly (defaults to false/empty)

Testing

  • Unit tests for SearchShardStatistics (normalization, capping, merging, fromShardCounts factory)
  • Unit tests for OpenSearchIndexProgressState (recordShardFailures, recordRequestFailure)
  • Worker tests covering: continuation past short pages, correct termination signals, shard failure recording, scroll retry tolerance
  • WorkerCommonUtils completion tests for failure summary emission

Check List

  • New functionality includes testing
  • New functionality has been documented in the design doc
  • Commits are signed (Signed-off-by)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Pagination previously terminated whenever a page returned fewer
documents than the configured batch_size, which silently dropped the
rest of an index whenever a request hit partial shard failures or a
transient error. The correct termination signal is used instead:
nextSearchAfter == null / empty page for search_after and PIT workers,
and an empty page for the scroll worker.

Shard failures are now captured in a bounded map of normalized reason
-> count (capped at 20 distinct keys with an "__other__" overflow
bucket), persisted on OpenSearchIndexProgressState, surfaced as new
counters (searchShardsFailed, searchRequestsFailed,
indicesCompletedWithFailures), and logged per page plus once at index
completion.

The scroll worker no longer aborts an index on a single per-request
exception; it tolerates up to MAX_CONSECUTIVE_SCROLL_FAILURES retries
before giving up the partition.

Signed-off-by: Keyur-S-Patel <keyurpatel.opensource@gmail.com>

Fixes opensearch-project#6337
…urce

- Add @JsonProperty on fields for explicit bidirectional JSON mapping
- Extract ShardFailureAggregator for cleaner separation of concerns
- Remove unused totalHits from SearchScrollResponse
- Add IP:port normalization to SearchShardStatistics
- Improve scroll failure log message to note possible skipped documents
- Add unit tests for WorkerCommonUtils.hasMorePages
- Add ScrollWorker test for short-page continuation (key fix behavior)
- Replace brittle call-count mocking with per-page result objects in PitWorkerTest
- Replace Thread.sleep with awaitility in NoSearchContextWorkerTest

Signed-off-by: Keyur-S-Patel <keyurpatel.opensource@gmail.com>
@Keyur-S-Patel Keyur-S-Patel force-pushed the fix/6337-opensearch-source-failure-handling-v2 branch from 45d78ed to 5852f9b Compare May 29, 2026 19:49
@Keyur-S-Patel
Copy link
Copy Markdown
Contributor Author

Create new PR due to DCO failure : #6829

@graytaylor0 graytaylor0 merged commit 6769025 into opensearch-project:main May 30, 2026
100 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] OpenSearch source does not handle failures correctly

3 participants