Is your feature request related to a problem? Please describe.
When a BifroMQ cluster node fails due to disk exhaustion or partial storage corruption, the SWIM failure detector can be stalled indefinitely. If the OS is still responsive enough to ACK TCP probes while the application layer is hung, the failure detector repeatedly cancels its suspicion timer and never declares the node dead. As a result, the Raft voter list is never cleaned up, write operations stall, and the cluster cannot self-heal, regardless of whether the node is removed from the load balancer. The only reliable fix today is to physically power off the machine and wait up to ~90 seconds for the automatic pipeline to complete. There is no operator-facing API to accelerate or override this.
Re-deploying on the same IP makes things significantly worse: the new process gets a new UUID-based identity while the old identity persists in the CRDT membership, causing other nodes' indirect probes to keep the old identity alive. The cluster ends up stuck with a ghost voter that can never be evicted.
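To make the stall concrete, here is a toy model of a suspicion timer that any ACK resets. It is a simplified sketch, not BifroMQ's failure detector: the class name, the round-based timing, and the parameter values are all assumptions made for illustration.

```java
import java.util.function.IntPredicate;

/** Toy model of a suspicion timer that any ACK resets. Not BifroMQ code. */
public class SuspicionTimerSketch {

    /**
     * Returns the probe round in which the node is declared dead,
     * or -1 if that never happens within maxRounds.
     *
     * @param ackReceived      whether the probe sent in a given round is ACKed
     * @param suspicionTimeout consecutive un-ACKed rounds needed to declare death
     * @param maxRounds        how many probe rounds to simulate
     */
    public static int roundDeclaredDead(IntPredicate ackReceived, int suspicionTimeout, int maxRounds) {
        int suspectedRounds = 0;
        for (int round = 1; round <= maxRounds; round++) {
            if (ackReceived.test(round)) {
                suspectedRounds = 0; // any ACK cancels the pending suspicion
            } else if (++suspectedRounds >= suspicionTimeout) {
                return round; // suspicion survived long enough: node is dead
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Powered-off node: no ACKs at all, so the timeout eventually fires.
        System.out.println(roundDeclaredDead(round -> false, 5, 1000)); // prints 5

        // Half-dead node: one sporadic ACK every 4 rounds keeps resetting the timer.
        System.out.println(roundDeclaredDead(round -> round % 4 == 0, 5, 1000)); // prints -1
    }
}
```

As long as the gap between sporadic ACKs is shorter than the suspicion timeout, the node is never declared dead in this model, however long the simulation runs.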
Describe the solution you'd like
Two new endpoints in bifromq-apiserver (default port 8091):
- Evict a store node from all its Raft groups
DELETE /store/node
Headers:
store_name: dist.worker | inbox.store | retain.store
store_id:
This triggers a ChangeConfigCommand against every KV range for which the target store is a voter or learner, removing it immediately. This is exactly what UnreachableReplicaRemovalBalancer does automatically, but driven by operator intent rather than a timeout.
Relevant internal code: UnreachableReplicaRemovalBalancer.java, KVStoreBalanceController.java
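From the operator side, a call to the proposed endpoint could look like the sketch below. It only builds the request with the JDK's `java.net.http` API and does not send it, since the endpoint does not exist yet. The class name, the address `127.0.0.1`, and the store id `example-store-id` are placeholders, not real values.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Builds (but does not send) the proposed DELETE /store/node request. */
public class EvictStoreRequestSketch {

    public static HttpRequest build(String apiServerHost, int apiServerPort,
                                    String storeName, String storeId) {
        return HttpRequest.newBuilder()
            .uri(URI.create("http://" + apiServerHost + ":" + apiServerPort + "/store/node"))
            .header("store_name", storeName) // dist.worker | inbox.store | retain.store
            .header("store_id", storeId)     // id of the store to evict
            .DELETE()
            .build();
    }

    public static void main(String[] args) {
        // Placeholder address and store id, for illustration only.
        HttpRequest request = build("127.0.0.1", 8091, "dist.worker", "example-store-id");
        System.out.println(request.method() + " " + request.uri());
        request.headers().map()
            .forEach((name, values) -> System.out.println(name + ": " + values));
    }
}
```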
- Evict a node from base-cluster membership
DELETE /cluster/node
Headers:
host:
port:
This bypasses the suspicion timeout and directly removes the target HostEndpoint from CRDT membership, unblocking the isMissingInStore() gate that UnreachableReplicaRemovalBalancer depends on.
Relevant internal code: AutoDropper.java, HostMemberList.java
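The dependency between the two endpoints can be illustrated with a toy model. This is not the real isMissingInStore() logic or BifroMQ's data structures: the class, its methods, and the use of a single string to identify both a cluster member and its store are simplifying assumptions.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model: a voter can only be removed once it is gone from membership. */
public class EvictionGateSketch {
    private final Set<String> clusterMembers; // stand-in for CRDT membership
    private final Set<String> rangeVoters;    // stand-in for one KV range's voters

    public EvictionGateSketch(List<String> members, List<String> voters) {
        this.clusterMembers = new HashSet<>(members);
        this.rangeVoters = new HashSet<>(voters);
    }

    /** Models the proposed DELETE /cluster/node: drop the member immediately. */
    public void evictMember(String node) {
        clusterMembers.remove(node);
    }

    /** Simplified gate: removable only if still a voter but no longer a member. */
    public boolean isRemovable(String node) {
        return rangeVoters.contains(node) && !clusterMembers.contains(node);
    }

    /** One balancer pass: removes every voter that passes the gate. */
    public Set<String> balanceOnce() {
        Set<String> removed = new HashSet<>();
        for (String voter : rangeVoters) {
            if (isRemovable(voter)) {
                removed.add(voter);
            }
        }
        rangeVoters.removeAll(removed);
        return removed;
    }

    public Set<String> voters() {
        return Set.copyOf(rangeVoters);
    }

    public static void main(String[] args) {
        EvictionGateSketch cluster =
            new EvictionGateSketch(List.of("a", "b", "c"), List.of("a", "b", "c"));
        System.out.println(cluster.balanceOnce()); // ghost "c" is still a member: prints []
        cluster.evictMember("c");
        System.out.println(cluster.balanceOnce()); // gate is now open: prints [c]
    }
}
```

In this model the balancer pass is a no-op until the member is evicted, which is why the membership endpoint is needed alongside the store endpoint.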
Describe alternatives you've considered
- Tuning timeouts (zombieProbeDelayInMS, suspicionMultiplier, baseProbeInterval): Reduces the self-healing window in normal failure cases, but does not help when the node is half-dead and the failure detector keeps getting suppressed by sporadic ACKs.
- Custom balancer plugin: A custom balancer can produce ChangeConfigCommand, but it still cannot bypass the isMissingInStore() gate without also evicting the node from base-cluster membership. It also requires custom code deployment rather than a simple API call.
- Graceful shutdown on the failed node: Only works if the node is reachable enough to run AgentHost.close(), which is not the case when the disk is full or the process is unresponsive.
Additional context
The internal command types (ChangeConfigCommand, QuitCommand, RecoveryCommand) and the balancer framework already exist in base-kv-store-balance-spi. The apiserver handler pattern is well-established in bifromq-apiserver/src/main/java/org/apache/bifromq/apiserver/http/handler/. The main work is wiring an external trigger through to these existing internal mechanisms.