Skip to content

HDDS-15269. Avoid 30s shutdown wait in ReconTaskController#10402

Merged
adoroszlai merged 2 commits into
apache:masterfrom
chihsuan:HDDS-15269
Jun 1, 2026
Merged

HDDS-15269. Avoid 30s shutdown wait in ReconTaskController#10402
adoroszlai merged 2 commits into
apache:masterfrom
chihsuan:HDDS-15269

Conversation

@chihsuan
Copy link
Copy Markdown
Contributor

@chihsuan chihsuan commented Jun 1, 2026

What changes were proposed in this pull request?

TestReconTaskControllerImpl took ~145s to run 17 tests. Profiling each method
showed the time was not spread out: four tests each blocked for almost exactly
30 seconds inside ReconTaskController.stop().

Root cause

ReconTaskControllerImpl runs a single background thread
(processBufferedEventsAsync) that loops on an interruptible
eventBuffer.poll(...) and only exits when the thread is interrupted:

while (!Thread.currentThread().isInterrupted()) {
    ReconEvent event = eventBuffer.poll(1000);
    ...
}

When this loop was introduced (HDDS-13576), stop() used shutdownNow(), which
interrupts the thread, so it exited immediately. A later reliability change
(HDDS-13956) replaced that with a graceful shutdown:

executor.shutdown();                       // does NOT interrupt running tasks
executor.awaitTermination(30, SECONDS);    // waits the full timeout
executor.shutdownNow();                     // only now interrupts

shutdown() stops accepting new work but never interrupts the running thread,
and the loop has no non-interrupt exit path. So the graceful phase can never
drain the loop: awaitTermination always burns its full 30s timeout before the
fallback shutdownNow() finally stops it. Every stop() therefore costs ~30s.
This is not test-only: in production it delays Recon shutdown by ~30s and logs a
misleading "did not terminate within 30 seconds" warning on every shutdown.

Fix (production)

Give the loop a cooperative exit path: a volatile boolean running flag set in
start() and cleared in stop(). The loop checks it each iteration, so after
stop() it exits on the next poll cycle and the existing graceful shutdown
completes promptly.

I chose the cooperative flag over simply reverting to shutdownNow() because it
preserves the intent of HDDS-13956: an in-flight event is still allowed to
finish instead of being interrupted mid-processing. The 30s awaitTermination
stays as a genuine safety net rather than the normal path.

Fix (test)

Two tests (testProcessReInitializationEventWith*) call stop() on a Mockito
spy(controller) to quiesce the background loop. A spy is a shallow copy, so it
flips the running flag on the copy while the live thread (started on the
original controller in setUp) keeps running, still hitting the 30s timeout.
They now stop the original controller. This does not change what the tests
assert; stop() there is only setup to avoid a race with the background loop.

Scope

Kept intentionally small to fix only the critical issue (the 30s production
shutdown hang). Two unrelated, lower-impact test-only slowdowns remain and are
left for a follow-up to avoid scope creep:

  • testNewRetryLogicWithMaxRetriesExceeded and testFailedTaskRetryLogic use
    real-clock Thread.sleep to wait out RETRY_DELAY_MS; speeding these up needs
    the retry delay to be injectable.

Result

TestReconTaskControllerImpl: ~145s → ~26s, all tests pass, checkstyle clean.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-15269

How was this patch tested?

  • Added testStopCompletesPromptly, which fails (~31s) before the fix and passes
    (~1s) after, guarding against regression.
  • Ran the full TestReconTaskControllerImpl class: 18 tests, 0 failures,
    ~26s (down from ~145s).
  • mvn -pl hadoop-ozone/recon checkstyle:check: 0 violations.

chihsuan added 2 commits June 1, 2026 21:26
ReconTaskControllerImpl.stop() switched to a graceful shutdown() +
awaitTermination(30s) in HDDS-13956, but the event processing loop only
exits on interrupt, so shutdown() can never drain it and stop() always
burned the full 30s timeout before shutdownNow() took effect.

Add a volatile running flag that the loop checks, cleared in stop(), so
the loop exits on its next poll cycle and graceful shutdown completes
promptly. Add testStopCompletesPromptly as a guard.
testProcessReInitializationEventWith* called stop() on a Mockito spy to
quiesce the background event loop. The spy is a shallow copy, so this only
flipped the running flag on the copy while the live event-processing thread
(started on the original controller) kept running, forcing stop() to wait
out the full 30s shutdown timeout. Stop the original controller instead.
@chihsuan chihsuan marked this pull request as ready for review June 1, 2026 13:53
@adoroszlai adoroszlai changed the title HDDS-15269. Avoid 30s shutdown wait in ReconTaskController, speeding up its test HDDS-15269. Avoid 30s shutdown wait in ReconTaskController Jun 1, 2026
Copy link
Copy Markdown
Contributor

@adoroszlai adoroszlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @chihsuan for the patch. Not only does this speed up TestReconTaskControllerImpl, but all Recon integration tests as well (saving 30s for each cluster stop):

-Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 56.83 s -- in org.apache.hadoop.ozone.recon.TestNSSummaryAdmin
+Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 27.02 s -- in org.apache.hadoop.ozone.recon.TestNSSummaryAdmin
-Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 82.93 s -- in org.apache.hadoop.ozone.recon.TestNSSummaryMemoryLeak
+Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 54.92 s -- in org.apache.hadoop.ozone.recon.TestNSSummaryMemoryLeak
-Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 149.4 s -- in org.apache.hadoop.ozone.recon.TestReconAsPassiveScm
+Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 61.61 s -- in org.apache.hadoop.ozone.recon.TestReconAsPassiveScm
-Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 120.9 s -- in org.apache.hadoop.ozone.recon.TestReconContainerEndpoint
+Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 61.89 s -- in org.apache.hadoop.ozone.recon.TestReconContainerEndpoint
-Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 174.1 s -- in org.apache.hadoop.ozone.recon.TestReconContainerHealthSummaryEndToEnd
+Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 86.49 s -- in org.apache.hadoop.ozone.recon.TestReconContainerHealthSummaryEndToEnd
-Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 108.3 s -- in org.apache.hadoop.ozone.recon.TestReconInsightsForDeletedDirectories
+Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 81.83 s -- in org.apache.hadoop.ozone.recon.TestReconInsightsForDeletedDirectories
-Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 62.88 s -- in org.apache.hadoop.ozone.recon.TestReconQuasiClosedContainerEndpoint
+Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 32.82 s -- in org.apache.hadoop.ozone.recon.TestReconQuasiClosedContainerEndpoint
-Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 172.0 s -- in org.apache.hadoop.ozone.recon.TestReconScmSnapshot
+Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 83.28 s -- in org.apache.hadoop.ozone.recon.TestReconScmSnapshot
-Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 354.9 s -- in org.apache.hadoop.ozone.recon.TestReconTasks
+Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 176.5 s -- in org.apache.hadoop.ozone.recon.TestReconTasks
-Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 62.43 s -- in org.apache.hadoop.ozone.recon.TestReconTasksMultiNode
+Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 34.28 s -- in org.apache.hadoop.ozone.recon.TestReconTasksMultiNode
-Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 103.2 s -- in org.apache.hadoop.ozone.recon.TestReconWithOzoneManager
+Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 45.61 s -- in org.apache.hadoop.ozone.recon.TestReconWithOzoneManager
-Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 68.88 s -- in org.apache.hadoop.ozone.recon.TestReconWithOzoneManagerFSO
+Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 38.21 s -- in org.apache.hadoop.ozone.recon.TestReconWithOzoneManagerFSO
-Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 164.3 s -- in org.apache.hadoop.ozone.recon.TestStorageDistributionEndpointEC
+Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 135.2 s -- in org.apache.hadoop.ozone.recon.TestStorageDistributionEndpointEC
-Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 164.0 s -- in org.apache.hadoop.ozone.recon.TestStorageDistributionEndpointRatis
+Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 134.2 s -- in org.apache.hadoop.ozone.recon.TestStorageDistributionEndpointRatis

@adoroszlai adoroszlai merged commit 20f7097 into apache:master Jun 1, 2026
104 of 107 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants