[server] Table deletion stuck permanently when StopReplica request fails #3357

### Search before asking

- [x] I searched in the [issues](https://github.com/apache/fluss/issues) and found nothing similar.


### Fluss version

0.9.0 (latest release)

### Please describe the bug 🐞

## Summary

When a table is deleted and the `StopReplicaRequest` fails to reach the target TabletServer (e.g., due to network issues or the TabletServer being down), the deletion process gets stuck permanently. The replica state remains `ReplicaDeletionStarted`, which blocks any retry. As a result, `fluss_coordinator_tableCount` stays higher than the number of tables actually present in ZooKeeper, and there is no self-healing mechanism.

## Root Cause

### The Deletion State Machine

```
OnlineReplica → OfflineReplica → ReplicaDeletionStarted → ReplicaDeletionSuccessful → NonExistentReplica
```

`completeDeleteTable` (which calls `coordinatorContext.removeTable()` and decrements tableCount) is invoked only when **all** replicas reach `ReplicaDeletionSuccessful`.
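
To make the completion rule concrete, here is a minimal sketch of the linear deletion path and the "all replicas successful" condition. The class, enum, and method names are hypothetical and only mirror the diagram above. They are not the actual Fluss classes, and the real state machine may allow additional transitions.

```java
import java.util.List;

/** Illustrative model only; names are hypothetical, not the Fluss implementation. */
class DeletionCompletionSketch {

    /** The states from the diagram above, in deletion order. */
    enum State {
        ONLINE_REPLICA,
        OFFLINE_REPLICA,
        REPLICA_DELETION_STARTED,
        REPLICA_DELETION_SUCCESSFUL,
        NON_EXISTENT_REPLICA;

        /** Next state on the linear deletion path, or null after the last state. */
        State next() {
            int i = ordinal() + 1;
            return i < values().length ? values()[i] : null;
        }
    }

    /** Table deletion may complete only if every replica reached REPLICA_DELETION_SUCCESSFUL. */
    static boolean allReplicasDeleted(List<State> replicaStates) {
        return replicaStates.stream().allMatch(s -> s == State.REPLICA_DELETION_SUCCESSFUL);
    }

    public static void main(String[] args) {
        // One replica is still in REPLICA_DELETION_STARTED, so completion is blocked.
        List<State> states =
                List.of(State.REPLICA_DELETION_SUCCESSFUL, State.REPLICA_DELETION_STARTED);
        System.out.println("can complete: " + allReplicasDeleted(states));
    }
}
```

A single replica left in `ReplicaDeletionStarted` is therefore enough to keep the table counted indefinitely.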

### The Deadlock

In `CoordinatorRequestBatch.sendStopRequest()` (lines 465-469):

```java
if (throwable != null) {
    // todo: in FLUSS-55886145, we will introduce a sender thread to send the request.
    // in here, we just ignore the error.
    LOG.warn("Failed to send stop replica request to tablet server {}.", serverId, throwable);
    return;  // ← No event produced, no state transition
}
```

When the request **fails to send** (network error, TabletServer down):

1. No `DeleteReplicaResponseReceivedEvent` is produced.
2. The replica state stays at `ReplicaDeletionStarted` permanently.
3. `isEligibleForDeletion()` checks `!coordinatorContext.isAnyReplicaInState(tableId, ReplicaDeletionStarted)`, which evaluates to **false**.
4. Even if `resumeDeletions()` is called later, it cannot retry because the eligibility check fails.
5. **Permanent deadlock**: no code path can move the replica out of `ReplicaDeletionStarted`.
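
The sequence above can be reproduced in a small stand-alone model. This is a simplified sketch, not the coordinator code: `StuckDeletionSketch`, its in-memory state map, and the `sendSucceeds` flag are assumptions made for illustration. Only the method names `isEligibleForDeletion` and `resumeDeletions` are borrowed from the description above.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified model of the stuck-deletion scenario; not the Fluss coordinator. */
class StuckDeletionSketch {

    enum State { REPLICA_DELETION_STARTED, REPLICA_DELETION_SUCCESSFUL }

    // Replica id -> state, for a single table.
    private final Map<Integer, State> replicaStates = new HashMap<>();

    /** Marks the replica as started, then models the outcome of sending the stop request. */
    void startDeletion(int replicaId, boolean sendSucceeds) {
        replicaStates.put(replicaId, State.REPLICA_DELETION_STARTED);
        if (!sendSucceeds) {
            // Models the early return on send failure: no event, no state transition.
            return;
        }
        // Models a response event being received and handled as a success.
        replicaStates.put(replicaId, State.REPLICA_DELETION_SUCCESSFUL);
    }

    /** Models the eligibility gate: false while any replica is in REPLICA_DELETION_STARTED. */
    boolean isEligibleForDeletion() {
        return !replicaStates.containsValue(State.REPLICA_DELETION_STARTED);
    }

    /** Models a later resume attempt; returns true only if the gate lets it proceed. */
    boolean resumeDeletions() {
        return isEligibleForDeletion();
    }

    public static void main(String[] args) {
        StuckDeletionSketch table = new StuckDeletionSketch();
        table.startDeletion(0, true);   // request delivered and acknowledged
        table.startDeletion(1, false);  // request never reached the tablet server
        for (int attempt = 1; attempt <= 3; attempt++) {
            System.out.println("resume attempt " + attempt + " proceeds: " + table.resumeDeletions());
        }
    }
}
```

In this model every resume attempt is rejected, because nothing ever moves replica 1 out of the started state.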

### Contrast with Successful Response Error

When the request is sent successfully but the TabletServer returns an error response, the flow works correctly:
- `DeleteReplicaResponseReceivedEvent` is produced
- `retryDeleteAndSuccessDeleteReplicas` handles the failure
- Replica state transitions to allow retry

The bug is specifically in the **request send failure** path (network-level failure or tablet server down).
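
For illustration only, one possible direction is to treat a send failure like an error response, so that the existing retry handling gets a chance to run. The sketch below is not a proposed patch: `SendFailureHandlingSketch`, `DeleteReplicaResultEvent`, and the in-memory queue are hypothetical stand-ins, since the real event constructor and event manager API are not shown in this issue.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical sketch: report send failures as events instead of dropping them. */
class SendFailureHandlingSketch {

    /** Hypothetical stand-in for an event carrying the outcome of a stop-replica request. */
    static final class DeleteReplicaResultEvent {
        final int serverId;
        final boolean succeeded;

        DeleteReplicaResultEvent(int serverId, boolean succeeded) {
            this.serverId = serverId;
            this.succeeded = succeeded;
        }
    }

    // Stand-in for the coordinator's event queue (assumption).
    private final Queue<DeleteReplicaResultEvent> eventQueue = new ArrayDeque<>();

    /** Completion callback for a stop-replica request sent to the given tablet server. */
    void onStopReplicaCompleted(int serverId, Throwable throwable) {
        if (throwable != null) {
            // Instead of returning silently, enqueue a failed result so that
            // the failure-handling path can move the replica to a retryable state.
            eventQueue.add(new DeleteReplicaResultEvent(serverId, false));
            return;
        }
        eventQueue.add(new DeleteReplicaResultEvent(serverId, true));
    }

    Queue<DeleteReplicaResultEvent> pendingEvents() {
        return eventQueue;
    }

    public static void main(String[] args) {
        SendFailureHandlingSketch sketch = new SendFailureHandlingSketch();
        sketch.onStopReplicaCompleted(1, new RuntimeException("connection refused"));
        DeleteReplicaResultEvent event = sketch.pendingEvents().peek();
        System.out.println("server " + event.serverId + " succeeded: " + event.succeeded);
    }
}
```

Whether a synthesized failure event, a periodic retry, or the sender thread mentioned in the TODO comment is the right fix is left open here.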


### Solution

_No response_

### Are you willing to submit a PR?

- [ ] I'm willing to submit a PR!
