[server] Table deletion stuck permanently when StopReplica request fails #3357

@gyang94

Description

Search before asking

  • I searched in the issues and found nothing similar.

Fluss version

0.9.0 (latest release)

Please describe the bug 🐞

Summary

When a table is deleted and the StopReplicaRequest fails to reach the target TabletServer (e.g., because of network issues or because the TabletServer is down), the deletion process gets stuck permanently. The replica state remains in ReplicaDeletionStarted, which blocks any retry. As a result, fluss_coordinator_tableCount stays higher than the actual number of tables in ZooKeeper, and there is no self-healing mechanism.

Root Cause

The Deletion State Machine

OnlineReplica → OfflineReplica → ReplicaDeletionStarted → ReplicaDeletionSuccessful → NonExistentReplica

completeDeleteTable (which calls coordinatorContext.removeTable() and decreases tableCount) is invoked only when all replicas reach ReplicaDeletionSuccessful.
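
As a rough illustration of the state path and the completion condition above, here is a minimal Java sketch. It models only the linear path listed in this issue; the real Fluss state machine may allow other transitions. The class name DeletionStateMachineSketch and the methods canTransition and canCompleteDeleteTable are hypothetical and are not Fluss APIs.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative model only; these are NOT the Fluss classes.
enum ReplicaState {
    OnlineReplica,
    OfflineReplica,
    ReplicaDeletionStarted,
    ReplicaDeletionSuccessful,
    NonExistentReplica
}

final class DeletionStateMachineSketch {
    // Forward transitions on the deletion path described in this issue (assumed linear).
    private static final Map<ReplicaState, Set<ReplicaState>> NEXT =
            new EnumMap<>(ReplicaState.class);

    static {
        NEXT.put(ReplicaState.OnlineReplica, EnumSet.of(ReplicaState.OfflineReplica));
        NEXT.put(ReplicaState.OfflineReplica, EnumSet.of(ReplicaState.ReplicaDeletionStarted));
        NEXT.put(ReplicaState.ReplicaDeletionStarted,
                EnumSet.of(ReplicaState.ReplicaDeletionSuccessful));
        NEXT.put(ReplicaState.ReplicaDeletionSuccessful,
                EnumSet.of(ReplicaState.NonExistentReplica));
        NEXT.put(ReplicaState.NonExistentReplica, EnumSet.noneOf(ReplicaState.class));
    }

    static boolean canTransition(ReplicaState from, ReplicaState to) {
        return NEXT.get(from).contains(to);
    }

    // Mirrors the stated condition: table deletion completes only when
    // every replica has reached ReplicaDeletionSuccessful.
    static boolean canCompleteDeleteTable(List<ReplicaState> replicas) {
        return !replicas.isEmpty()
                && replicas.stream().allMatch(s -> s == ReplicaState.ReplicaDeletionSuccessful);
    }
}
```

In this model, a single replica left in ReplicaDeletionStarted is enough to keep canCompleteDeleteTable false, which is why tableCount is never decreased.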

The Deadlock

In CoordinatorRequestBatch.sendStopRequest() (lines 465-469):

if (throwable != null) {
    // todo: in FLUSS-55886145, we will introduce a sender thread to send the request.
    // in here, we just ignore the error.
    LOG.warn("Failed to send stop replica request to tablet server {}.", serverId, throwable);
    return;  // ← No event produced, no state transition
}

When the request fails to send (network error, TabletServer down):

  1. No DeleteReplicaResponseReceivedEvent is produced
  2. Replica state stays at ReplicaDeletionStarted permanently
  3. isEligibleForDeletion() requires !coordinatorContext.isAnyReplicaInState(tableId, ReplicaDeletionStarted), which now evaluates to false
  4. Even if resumeDeletions() is called later, it cannot retry because the eligibility check fails
  5. Permanent deadlock: no code path can move the replica out of ReplicaDeletionStarted
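
The sequence above can be reproduced in a small toy model. The Java sketch below is not Fluss code: it reuses the names isEligibleForDeletion, isAnyReplicaInState, and resumeDeletions from this issue for readability, but the class StuckDeletionSketch, the serverReachable flag, and the synchronous handling of the response are simplifying assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the stuck path; names mirror the issue but this is NOT Fluss code.
final class StuckDeletionSketch {
    enum State { OfflineReplica, ReplicaDeletionStarted, ReplicaDeletionSuccessful }

    private final Map<Integer, State> replicaStates = new HashMap<>();
    int stopRequestsAttempted = 0;

    StuckDeletionSketch(int replicaCount) {
        for (int i = 0; i < replicaCount; i++) {
            replicaStates.put(i, State.OfflineReplica);
        }
    }

    boolean isAnyReplicaInState(State state) {
        return replicaStates.containsValue(state);
    }

    // Same shape as the check quoted in step 3.
    boolean isEligibleForDeletion() {
        return !isAnyReplicaInState(State.ReplicaDeletionStarted);
    }

    boolean isTableDeleted() {
        return replicaStates.values().stream()
                .allMatch(s -> s == State.ReplicaDeletionSuccessful);
    }

    // Returns true if a deletion round was started, false if the table was not eligible.
    boolean resumeDeletions(boolean serverReachable) {
        if (!isEligibleForDeletion()) {
            return false;
        }
        for (Map.Entry<Integer, State> entry : replicaStates.entrySet()) {
            if (entry.getValue() == State.ReplicaDeletionSuccessful) {
                continue;
            }
            entry.setValue(State.ReplicaDeletionStarted);
            stopRequestsAttempted++;
            if (serverReachable) {
                // Simplification: the response event arrives and reports success.
                entry.setValue(State.ReplicaDeletionSuccessful);
            }
            // Otherwise the send failed: only a warning is logged, no event is
            // produced, and the replica stays in ReplicaDeletionStarted.
        }
        return true;
    }
}
```

With an unreachable server in the first round, every later call to resumeDeletions returns false without sending anything, even after the server becomes reachable again, so isTableDeleted never becomes true.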

Contrast with an Error Response

When the request is sent successfully but the TabletServer returns an error response, the flow works correctly:

  • DeleteReplicaResponseReceivedEvent is produced
  • retryDeleteAndSuccessDeleteReplicas handles the failure
  • Replica state transitions to allow retry

The bug is specifically in the request send failure path (network-level failure or tablet server down).
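
To make the difference between the two paths concrete, the following Java sketch shows one possible direction: treating a send failure like an error response by still enqueuing an event, so that the existing retry handling gets a chance to run. This is only an assumption about how the gap could be closed, not a confirmed or proposed Fluss fix. SendFailureEventSketch, StopReplicaResultEvent, eventQueue, and onStopReplicaCompleted are hypothetical names.

```java
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch; NOT the Fluss implementation.
final class SendFailureEventSketch {
    // Hypothetical stand-in for an event such as DeleteReplicaResponseReceivedEvent.
    static final class StopReplicaResultEvent {
        final int serverId;
        final boolean succeeded;

        StopReplicaResultEvent(int serverId, boolean succeeded) {
            this.serverId = serverId;
            this.succeeded = succeeded;
        }
    }

    // Stand-in for the coordinator's event queue.
    final Queue<StopReplicaResultEvent> eventQueue = new ConcurrentLinkedQueue<>();

    void onStopReplicaCompleted(int serverId, CompletableFuture<Boolean> responseFuture) {
        responseFuture.whenComplete((ok, throwable) -> {
            if (throwable != null) {
                // Send failure: instead of returning silently, report a failed
                // deletion so the event handler can move the replica to a
                // state that allows a retry.
                eventQueue.add(new StopReplicaResultEvent(serverId, false));
                return;
            }
            // Request delivered: report whatever the TabletServer answered.
            eventQueue.add(new StopReplicaResultEvent(serverId, Boolean.TRUE.equals(ok)));
        });
    }
}
```

In this sketch both the error-response path and the send-failure path end in an event, which is the property the current send-failure path lacks.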

Solution

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!
