HDDS-13482. Intermittent failure in TestContainerStateMachineFailures #10397

Draft
chungen0126 wants to merge 6 commits into apache:master from chungen0126:HDDS-13482

chungen0126 commented May 31, 2026

What changes were proposed in this pull request?

Summary

Fix intermittent failures in TestContainerStateMachine#testApplyTransactionFailure, TestContainerStateMachine#testContainerStateMachineRestartWithDNChangePipeline, testWriteStateMachineDataIdempotencyWithClosedContainer, and testApplyTransactionIdempotencyWithClosedContainer.

Changes

For testWriteStateMachineDataIdempotencyWithClosedContainer

The failure stemmed from a race between a retried write and a close container request. The test expects idempotency for identical data, but the initial write and the retried write contained different data, which caused intermittent failures.

  • Case A (Success): If close container executes first, no error occurs.
  • Case B (Failure): If the retry write executes before the close container, a mismatch occurs between the written data "hello" and the committed metadata. While the container successfully closes, it is later marked as "unhealthy" by the scanner due to a checksum mismatch.

Fix: Updated the test to ensure data consistency during retries or adjusted the timing expectations to handle the race condition correctly.
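Why the content of the retried write matters can be illustrated with a small, self-contained sketch. It is not code from this PR or from Ozone: the class name, the helper methods, the "initial-data" payload, and the use of CRC32 as the checksum are all assumptions made for illustration. It only shows that a retry carrying identical bytes stays consistent with an already committed checksum, while a retry carrying different bytes (such as "hello") does not.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical illustration only; none of these names come from Ozone.
final class RetryWriteChecksumSketch {

  // Stand-in for the checksum recorded in the committed chunk metadata.
  static long checksum(byte[] data) {
    CRC32 crc = new CRC32();
    crc.update(data, 0, data.length);
    return crc.getValue();
  }

  // Scanner-style check: do the bytes on disk still match the committed checksum?
  static boolean matchesCommitted(byte[] onDisk, long committedChecksum) {
    return checksum(onDisk) == committedChecksum;
  }

  public static void main(String[] args) {
    // Assumed payload of the initial write; the real test data is not given here.
    byte[] initialWrite = "initial-data".getBytes(StandardCharsets.UTF_8);
    long committed = checksum(initialWrite);

    // A retry with the same bytes is idempotent with respect to the checksum.
    byte[] identicalRetry = initialWrite.clone();
    // A retry with different bytes overwrites the data but not the metadata.
    byte[] differentRetry = "hello".getBytes(StandardCharsets.UTF_8);

    System.out.println("identical retry matches: "
        + matchesCommitted(identicalRetry, committed));
    System.out.println("different retry matches: "
        + matchesCommitted(differentRetry, committed));
  }
}
```

In this model, Case B corresponds to the second check: the container can still close, but a later verification against the committed checksum fails.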

For testContainerStateMachineFailures

testContainerStateMachineFailures causes failures in subsequent tests, including testContainerStateMachineRestartWithDNChangePipeline. The test triggers a Ratis storage reset that invalidates existing pipelines. Because these pipelines are closed passively via client-side retries rather than by the ScrubbingService, they remain in the PipelineManager, and subsequent tests erroneously select these stale pipelines and fail.

Fix: Make testContainerStateMachineFailures run at the end of the class.

For testApplyTransactionFailure

Intermittent failures in this test happen because the initial takeSnapshot call does not guarantee that the snapshot is the final one before the container data deletion. Any notifyTermIndexUpdated event occurring after that point triggers a new snapshot, leading to snapshot inconsistency.

Fix: Refactored the test. By capturing the snapshot after the container data is deleted, we ensure that the snapshot is the last one before the deletion. Subsequent transactions and snapshot operations are then applied to verify that these actions do not alter the existing, consistent snapshot.

For testApplyTransactionIdempotencyWithClosedContainer

When the close container command has been sent, the last applied index is not guaranteed to have been updated yet. If takeSnapshot is triggered immediately afterward, the resulting snapshot may not reflect the latest state.

Fix: Added a wait after the close container command is sent to ensure that the last applied index has been updated before taking the snapshot. This guarantees that the generated snapshot reflects the latest index.
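The waiting step can be pictured as a bounded poll on the applied index. The sketch below is an assumption-laden illustration, not the code in this PR: waitFor, lastAppliedIndex, and the background "applier" thread are hypothetical stand-ins, and the real test may use a different helper and read the index from the state machine instead of an AtomicLong.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.BooleanSupplier;

// Hypothetical illustration only; none of these names come from Ozone or Ratis.
final class WaitForAppliedIndexSketch {

  // Polls until the check passes or the timeout elapses; returns whether it passed.
  static boolean waitFor(BooleanSupplier check, long intervalMillis, long timeoutMillis) {
    long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
    while (!check.getAsBoolean()) {
      if (System.nanoTime() - deadline >= 0) {
        return false; // timed out before the condition became true
      }
      try {
        Thread.sleep(intervalMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) throws InterruptedException {
    // Stand-in for the state machine's last applied index.
    AtomicLong lastAppliedIndex = new AtomicLong(5);
    long indexBeforeClose = lastAppliedIndex.get();

    // Simulates the close container transaction being applied asynchronously,
    // some time after the command itself has been sent.
    Thread applier = new Thread(() -> {
      try {
        Thread.sleep(50);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      lastAppliedIndex.incrementAndGet();
    });
    applier.start();

    // Wait for the index to advance before "taking the snapshot".
    boolean caughtUp = waitFor(() -> lastAppliedIndex.get() > indexBeforeClose, 10, 5_000);
    applier.join();
    System.out.println("applied index caught up before snapshot: " + caughtUp);
  }
}
```

Returning a boolean instead of throwing keeps the sketch free of checked exceptions. A test would more likely fail fast when the wait times out.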

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-13482
https://issues.apache.org/jira/browse/HDDS-12215
https://issues.apache.org/jira/browse/HDDS-14962
https://issues.apache.org/jira/browse/HDDS-6115

How was this patch tested?

Before changes: TestContainerStateMachine failed 22 times in 20 * 10 iterations. https://github.com/chungen0126/ozone/actions/runs/26375145366

After changes: TestContainerStateMachine passed all 20 * 10 iterations. https://github.com/chungen0126/ozone/actions/runs/26744043803
