Search before asking
Fluss version
0.9.0 (latest release)
Please describe the bug 🐞
Summary
When a table is deleted and the StopReplicaRequest fails to reach the target TabletServer (e.g., due to network issues or the TabletServer being down), the deletion process gets permanently stuck. The replica state remains in ReplicaDeletionStarted, which blocks any retry. As a result, fluss_coordinator_tableCount stays higher than the actual number of tables in ZooKeeper, and there is no self-healing mechanism.
Root Cause
The Deletion State Machine
OnlineReplica → OfflineReplica → ReplicaDeletionStarted → ReplicaDeletionSuccessful → NonExistentReplica
completeDeleteTable (which calls coordinatorContext.removeTable() and decrements tableCount) is invoked only once all replicas have reached ReplicaDeletionSuccessful.
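The state chain and the completion condition above can be sketched as a small standalone model. This is only an illustration: the class name, the enum constants, and the helper methods are hypothetical and do not come from the Fluss code base, and only the forward deletion path listed above is modeled (any retry transitions in the real state machine are left out).

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative model only; names mirror the states listed above, not Fluss's actual classes. */
final class ReplicaDeletionStateMachineSketch {

    enum ReplicaState {
        ONLINE_REPLICA,
        OFFLINE_REPLICA,
        REPLICA_DELETION_STARTED,
        REPLICA_DELETION_SUCCESSFUL,
        NON_EXISTENT_REPLICA
    }

    // Forward transitions along the deletion path described in the issue (assumption: no others).
    private static final Map<ReplicaState, Set<ReplicaState>> NEXT =
            new EnumMap<>(ReplicaState.class);

    static {
        NEXT.put(ReplicaState.ONLINE_REPLICA, EnumSet.of(ReplicaState.OFFLINE_REPLICA));
        NEXT.put(ReplicaState.OFFLINE_REPLICA, EnumSet.of(ReplicaState.REPLICA_DELETION_STARTED));
        NEXT.put(
                ReplicaState.REPLICA_DELETION_STARTED,
                EnumSet.of(ReplicaState.REPLICA_DELETION_SUCCESSFUL));
        NEXT.put(
                ReplicaState.REPLICA_DELETION_SUCCESSFUL,
                EnumSet.of(ReplicaState.NON_EXISTENT_REPLICA));
        NEXT.put(ReplicaState.NON_EXISTENT_REPLICA, EnumSet.noneOf(ReplicaState.class));
    }

    /** Returns true if the modeled deletion path allows moving from one state to the other. */
    static boolean canTransition(ReplicaState from, ReplicaState to) {
        return NEXT.get(from).contains(to);
    }

    /** Table deletion may complete only when every replica has been deleted successfully. */
    static boolean canCompleteDeleteTable(List<ReplicaState> replicaStates) {
        return replicaStates.stream()
                .allMatch(s -> s == ReplicaState.REPLICA_DELETION_SUCCESSFUL);
    }
}
```

In this model, a single replica left in REPLICA_DELETION_STARTED is enough to keep canCompleteDeleteTable false, which is the condition that keeps the table counted.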
The Deadlock
In CoordinatorRequestBatch.sendStopRequest() (lines 465-469):
    if (throwable != null) {
        // todo: in FLUSS-55886145, we will introduce a sender thread to send the request.
        // in here, we just ignore the error.
        LOG.warn("Failed to send stop replica request to tablet server {}.", serverId, throwable);
        return; // ← No event produced, no state transition
    }
When the request fails to send (network error, TabletServer down):
- No DeleteReplicaResponseReceivedEvent is produced
- Replica state stays at ReplicaDeletionStarted permanently
- isEligibleForDeletion() checks !coordinatorContext.isAnyReplicaInState(tableId, ReplicaDeletionStarted), which returns false
- Even if resumeDeletions() is called later, it cannot retry because the eligibility check fails
- Permanent deadlock: no code path can move the replica out of ReplicaDeletionStarted
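The sequence above can be reproduced in a small standalone simulation. This is a sketch under stated assumptions, not the Fluss implementation: the class, the per-replica map, and the sendSucceeds flag are hypothetical, and only the method names isEligibleForDeletion and resumeDeletions are borrowed from the description above.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical simulation of the stuck deletion; not Fluss code. */
final class StuckDeletionSketch {

    enum State { OFFLINE_REPLICA, REPLICA_DELETION_STARTED, REPLICA_DELETION_SUCCESSFUL }

    private final Map<Integer, State> replicaStates = new HashMap<>();
    private int stopRequestsAttempted = 0;

    StuckDeletionSketch(int replicaCount) {
        for (int i = 0; i < replicaCount; i++) {
            replicaStates.put(i, State.OFFLINE_REPLICA);
        }
    }

    /** Mirrors the described check: ineligible while any replica is in deletion-started. */
    boolean isEligibleForDeletion() {
        return !replicaStates.containsValue(State.REPLICA_DELETION_STARTED);
    }

    /** True once every replica has been deleted, i.e. the table could be removed. */
    boolean tableDeleted() {
        return replicaStates.values().stream()
                .allMatch(s -> s == State.REPLICA_DELETION_SUCCESSFUL);
    }

    int stopRequestsAttempted() {
        return stopRequestsAttempted;
    }

    /** One deletion attempt; sendSucceeds models whether the stop request reaches the server. */
    void resumeDeletions(boolean sendSucceeds) {
        if (!isEligibleForDeletion()) {
            return; // retry is blocked by the eligibility check
        }
        for (Map.Entry<Integer, State> entry : replicaStates.entrySet()) {
            if (entry.getValue() == State.REPLICA_DELETION_SUCCESSFUL) {
                continue;
            }
            entry.setValue(State.REPLICA_DELETION_STARTED);
            stopRequestsAttempted++;
            if (sendSucceeds) {
                entry.setValue(State.REPLICA_DELETION_SUCCESSFUL);
            }
            // On send failure the error is only logged: no event, no state transition.
        }
    }
}
```

Calling resumeDeletions(false) once and then resumeDeletions(true) leaves tableDeleted() false and sends no further requests in this model, because the second call exits at the eligibility check.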
Contrast with Successful Response Error
When the request is sent successfully but the TabletServer returns an error response, the flow works correctly:
- DeleteReplicaResponseReceivedEvent is produced
- retryDeleteAndSuccessDeleteReplicas handles the failure
- Replica state transitions to allow retry
The bug is specifically in the request send failure path (network-level failure or tablet server down).
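One way to picture the difference between the two paths is a callback that emits an event for both outcomes, so that a send failure would enter the same handling as an error response. The sketch below is purely illustrative and is not a fix proposed or confirmed by the Fluss maintainers: DeletionResultEvent, the event queue, and the method signatures are hypothetical names introduced here.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;

/** Hypothetical sketch: treat a send failure like a failed response. Not Fluss code. */
final class SendFailureEventSketch {

    /** Hypothetical stand-in for a deletion response event. */
    static final class DeletionResultEvent {
        final int serverId;
        final boolean success;

        DeletionResultEvent(int serverId, boolean success) {
            this.serverId = serverId;
            this.success = success;
        }
    }

    private final Queue<DeletionResultEvent> events = new ArrayDeque<>();

    /** Registers a callback that produces an event whether or not the request was delivered. */
    void sendStopRequest(int serverId, CompletableFuture<Void> response) {
        response.whenComplete(
                (ignored, throwable) ->
                        // A non-null throwable models a send failure; it still yields an event,
                        // so downstream retry handling has something to react to.
                        events.add(new DeletionResultEvent(serverId, throwable == null)));
    }

    /** Returns the next pending event, or null if none is queued. */
    DeletionResultEvent pollEvent() {
        return events.poll();
    }
}
```

In this sketch a failed future still results in a queued event with success set to false, which is the hook the current send-failure path lacks.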
Solution
No response
Are you willing to submit a PR?