HDDS-15269. Avoid 30s shutdown wait in ReconTaskController by chihsuan · Pull Request #10402 · apache/ozone

chihsuan · 2026-06-01T13:29:35Z

What changes were proposed in this pull request?

TestReconTaskControllerImpl took ~145s to run 17 tests. Profiling each method
showed the time was not spread out: four tests each blocked for almost exactly
30 seconds inside ReconTaskController.stop().

Root cause

ReconTaskControllerImpl runs a single background thread
(processBufferedEventsAsync) that loops on an interruptible
eventBuffer.poll(...) and only exits when the thread is interrupted:

while (!Thread.currentThread().isInterrupted()) {
    ReconEvent event = eventBuffer.poll(1000);
    ...
}

When this loop was introduced (HDDS-13576), stop() used shutdownNow(), which
interrupts the thread, so it exited immediately. A later reliability change
(HDDS-13956) replaced that with a graceful shutdown:

executor.shutdown();                       // does NOT interrupt running tasks
executor.awaitTermination(30, SECONDS);    // waits the full timeout
executor.shutdownNow();                     // only now interrupts

shutdown() stops accepting new work but never interrupts the running thread,
and the loop has no non-interrupt exit path. So the graceful phase can never
drain the loop: awaitTermination always burns its full 30s timeout before the
fallback shutdownNow() finally stops it. Every stop() therefore costs ~30s.
This is not test-only: in production it delays Recon shutdown by ~30s and logs a
misleading "did not terminate within 30 seconds" warning on every shutdown.

Fix (production)

Give the loop a cooperative exit path: a volatile boolean running flag set in
start() and cleared in stop(). The loop checks it each iteration, so after
stop() it exits on the next poll cycle and the existing graceful shutdown
completes promptly.

I chose the cooperative flag over simply reverting to shutdownNow() because it
preserves the intent of HDDS-13956: an in-flight event is still allowed to
finish instead of being interrupted mid-processing. The 30s awaitTermination
stays as a genuine safety net rather than the normal path.

Fix (test)

Two tests (testProcessReInitializationEventWith*) call stop() on a Mockito
spy(controller) to quiesce the background loop. A spy is a shallow copy, so it
flips the running flag on the copy while the live thread (started on the
original controller in setUp) keeps running, still hitting the 30s timeout.
They now stop the original controller. This does not change what the tests
assert; stop() there is only setup to avoid a race with the background loop.

Scope

Kept intentionally small to fix only the critical issue (the 30s production
shutdown hang). Two unrelated, lower-impact test-only slowdowns remain and are
left for a follow-up to avoid scope creep:

testNewRetryLogicWithMaxRetriesExceeded and testFailedTaskRetryLogic use
real-clock Thread.sleep to wait out RETRY_DELAY_MS; speeding these up needs
the retry delay to be injectable.

Result

TestReconTaskControllerImpl: ~145s → ~26s, all tests pass, checkstyle clean.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-15269

How was this patch tested?

Added testStopCompletesPromptly, which fails (~31s) before the fix and passes
(~1s) after, guarding against regression.
Ran the full TestReconTaskControllerImpl class: 18 tests, 0 failures,
~26s (down from ~145s).
mvn -pl hadoop-ozone/recon checkstyle:check: 0 violations.

ReconTaskControllerImpl.stop() switched to a graceful shutdown() + awaitTermination(30s) in HDDS-13956, but the event processing loop only exits on interrupt, so shutdown() can never drain it and stop() always burned the full 30s timeout before shutdownNow() took effect. Add a volatile running flag that the loop checks, cleared in stop(), so the loop exits on its next poll cycle and graceful shutdown completes promptly. Add testStopCompletesPromptly as a guard.

testProcessReInitializationEventWith* called stop() on a Mockito spy to quiesce the background event loop. The spy is a shallow copy, so this only flipped the running flag on the copy while the live event-processing thread (started on the original controller) kept running, forcing stop() to wait out the full 30s shutdown timeout. Stop the original controller instead.

adoroszlai

Thanks @chihsuan for the patch. Not only does this speed up TestReconTaskControllerImpl, but all Recon integration tests as well (saving 30s for each cluster stop):

-Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 56.83 s -- in org.apache.hadoop.ozone.recon.TestNSSummaryAdmin
+Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 27.02 s -- in org.apache.hadoop.ozone.recon.TestNSSummaryAdmin
-Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 82.93 s -- in org.apache.hadoop.ozone.recon.TestNSSummaryMemoryLeak
+Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 54.92 s -- in org.apache.hadoop.ozone.recon.TestNSSummaryMemoryLeak
-Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 149.4 s -- in org.apache.hadoop.ozone.recon.TestReconAsPassiveScm
+Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 61.61 s -- in org.apache.hadoop.ozone.recon.TestReconAsPassiveScm
-Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 120.9 s -- in org.apache.hadoop.ozone.recon.TestReconContainerEndpoint
+Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 61.89 s -- in org.apache.hadoop.ozone.recon.TestReconContainerEndpoint
-Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 174.1 s -- in org.apache.hadoop.ozone.recon.TestReconContainerHealthSummaryEndToEnd
+Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 86.49 s -- in org.apache.hadoop.ozone.recon.TestReconContainerHealthSummaryEndToEnd
-Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 108.3 s -- in org.apache.hadoop.ozone.recon.TestReconInsightsForDeletedDirectories
+Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 81.83 s -- in org.apache.hadoop.ozone.recon.TestReconInsightsForDeletedDirectories
-Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 62.88 s -- in org.apache.hadoop.ozone.recon.TestReconQuasiClosedContainerEndpoint
+Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 32.82 s -- in org.apache.hadoop.ozone.recon.TestReconQuasiClosedContainerEndpoint
-Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 172.0 s -- in org.apache.hadoop.ozone.recon.TestReconScmSnapshot
+Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 83.28 s -- in org.apache.hadoop.ozone.recon.TestReconScmSnapshot
-Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 354.9 s -- in org.apache.hadoop.ozone.recon.TestReconTasks
+Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 176.5 s -- in org.apache.hadoop.ozone.recon.TestReconTasks
-Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 62.43 s -- in org.apache.hadoop.ozone.recon.TestReconTasksMultiNode
+Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 34.28 s -- in org.apache.hadoop.ozone.recon.TestReconTasksMultiNode
-Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 103.2 s -- in org.apache.hadoop.ozone.recon.TestReconWithOzoneManager
+Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 45.61 s -- in org.apache.hadoop.ozone.recon.TestReconWithOzoneManager
-Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 68.88 s -- in org.apache.hadoop.ozone.recon.TestReconWithOzoneManagerFSO
+Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 38.21 s -- in org.apache.hadoop.ozone.recon.TestReconWithOzoneManagerFSO
-Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 164.3 s -- in org.apache.hadoop.ozone.recon.TestStorageDistributionEndpointEC
+Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 135.2 s -- in org.apache.hadoop.ozone.recon.TestStorageDistributionEndpointEC
-Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 164.0 s -- in org.apache.hadoop.ozone.recon.TestStorageDistributionEndpointRatis
+Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 134.2 s -- in org.apache.hadoop.ozone.recon.TestStorageDistributionEndpointRatis

chihsuan added 2 commits June 1, 2026 21:26

chihsuan marked this pull request as ready for review June 1, 2026 13:53

adoroszlai changed the title ~~HDDS-15269. Avoid 30s shutdown wait in ReconTaskController, speeding up its test~~ HDDS-15269. Avoid 30s shutdown wait in ReconTaskController Jun 1, 2026

adoroszlai approved these changes Jun 1, 2026

View reviewed changes

adoroszlai merged commit 20f7097 into apache:master Jun 1, 2026
104 of 107 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HDDS-15269. Avoid 30s shutdown wait in ReconTaskController#10402

HDDS-15269. Avoid 30s shutdown wait in ReconTaskController#10402
adoroszlai merged 2 commits into
apache:masterfrom
chihsuan:HDDS-15269

chihsuan commented Jun 1, 2026 •

edited

Loading

Uh oh!

adoroszlai left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

chihsuan commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Root cause

Fix (production)

Fix (test)

Scope

Result

What is the link to the Apache JIRA

How was this patch tested?

Uh oh!

adoroszlai left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

chihsuan commented Jun 1, 2026 •

edited

Loading