Expose a Store Node Eviction API via APIServer #253

  **Is your feature request related to a problem? Please describe.**
  
  When a BifroMQ cluster node fails due to disk exhaustion or partial storage corruption, the SWIM failure detector can stall indefinitely. If the OS is still responsive enough to ACK TCP probes while the application layer is hung, the failure detector repeatedly cancels its suspicion timer and never declares the node dead. As a result, the Raft voter list is never cleaned up, write operations stall, and the cluster cannot self-heal, regardless of whether the node is removed from the load balancer. The only reliable fix today is to physically power off the machine and wait roughly 90 seconds for the automatic pipeline to complete. There is no operator-facing API to accelerate or override this.

  Re-deploying on the same IP makes things significantly worse: the new process gets a new UUID-based identity while the old identity persists in the CRDT membership, causing other nodes' indirect probes to keep the old identity alive. The cluster ends up stuck with a ghost voter that can never be evicted.
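
  To make the stall concrete, here is a toy model of a suspicion timer that is cancelled by any ACK. It is not BifroMQ's failure detector: the class name, the round-based timer, and the threshold of 3 rounds are all assumptions made for illustration.

```java
import java.util.List;

public class SuspicionTimerSketch {
    /**
     * Returns the 1-based probe round at which the node is declared dead,
     * or -1 if the suspicion timer never expires within the given rounds.
     * acks.get(i) is true when probe round i+1 was acknowledged.
     * Hypothetical model: the timer is counted in consecutive missed rounds.
     */
    public static int declaredDeadAt(List<Boolean> acks, int suspicionRounds) {
        int missed = 0; // consecutive unacknowledged rounds, i.e. the running timer
        for (int round = 1; round <= acks.size(); round++) {
            if (acks.get(round - 1)) {
                missed = 0; // any ACK cancels the suspicion timer
            } else if (++missed >= suspicionRounds) {
                return round; // timer expired: node is declared dead
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Powered-off node: no ACKs at all, so the timer expires.
        System.out.println(declaredDeadAt(List.of(false, false, false, false), 3)); // 3
        // Half-dead node: one sporadic ACK every third round keeps resetting the timer.
        System.out.println(declaredDeadAt(
            List.of(false, false, true, false, false, true, false, false), 3)); // -1
    }
}
```

  In this model the half-dead node is never declared dead, no matter how many rounds are observed, as long as ACKs arrive more often than the timer length.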

  **Describe the solution you'd like**
  
  Two new endpoints in bifromq-apiserver (default port 8091):

  1. Evict a store node from all its Raft groups
  
  DELETE /store/node
  Headers:
    store_name: dist.worker | inbox.store | retain.store
    store_id:   <storeId to evict>

  This triggers a ChangeConfigCommand against every KV range in which the target store is a voter or learner, removing it immediately. It is the same action UnreachableReplicaRemovalBalancer takes automatically, but driven by operator intent rather than a timeout.

  Relevant internal code: UnreachableReplicaRemovalBalancer.java, KVStoreBalanceController.java
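
  From the operator side, a call to the proposed endpoint might look like the sketch below, using the JDK's built-in `java.net.http` client (Java 11+). The endpoint does not exist yet; the class name, the `localhost` address, and the store ID value are placeholders, and only the path and header names come from this proposal.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class StoreEvictionRequestSketch {
    /** Builds (but does not send) the proposed DELETE /store/node request. */
    public static HttpRequest evictStore(String apiServer, String storeName, String storeId) {
        return HttpRequest.newBuilder(URI.create(apiServer + "/store/node"))
            .header("store_name", storeName) // dist.worker | inbox.store | retain.store
            .header("store_id", storeId)     // the store to evict
            .DELETE()
            .build();
    }

    public static void main(String[] args) {
        // Placeholder address and store ID; 8091 is the default apiserver port.
        HttpRequest req = evictStore("http://localhost:8091", "dist.worker", "example-store-id");
        System.out.println(req.method() + " " + req.uri());
        System.out.println(req.headers().firstValue("store_name").orElse(""));
    }
}
```

  Sending it would be a matter of passing the request to `HttpClient.newHttpClient().send(req, HttpResponse.BodyHandlers.ofString())` once the endpoint is available.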

  2. Evict a node from base-cluster membership
  
  DELETE /cluster/node
  Headers:
    host: <IP of the node to evict>
    port: <cluster port>

  This bypasses the suspicion timeout and directly removes the target HostEndpoint from CRDT membership, unblocking the isMissingInStore() gate that UnreachableReplicaRemovalBalancer depends on.

  Relevant internal code: AutoDropper.java, HostMemberList.java
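
  The interaction between membership and the balancer gate can be sketched with a toy model. This is not the actual isMissingInStore() logic, whose details are not shown here: the sketch assumes the gate reduces to "the voter is absent from the membership set", and it identifies members by a made-up store name rather than by HostEndpoint.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class MembershipGateSketch {
    /** Voters the balancer may remove: only those no longer present in cluster membership. */
    public static List<String> removableVoters(List<String> voters, Set<String> members) {
        return voters.stream()
            .filter(v -> !members.contains(v)) // simplified stand-in for the isMissingInStore() gate
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Set<String> members = new HashSet<>(Set.of("store-a", "store-b", "store-ghost"));
        List<String> voters = List.of("store-a", "store-b", "store-ghost");

        // The ghost is still a member, so no voter is eligible for removal.
        System.out.println(removableVoters(voters, members)); // []

        // Operator-driven eviction from membership (the proposed DELETE /cluster/node).
        members.remove("store-ghost");
        System.out.println(removableVoters(voters, members)); // [store-ghost]
    }
}
```

  Under that assumption, the membership eviction is what makes the ghost voter visible to the removal logic; the store-level eviction alone would not pass the gate.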
  
  **Describe alternatives you've considered**

  - Tuning timeouts (zombieProbeDelayInMS, suspicionMultiplier, baseProbeInterval): Reduces the self-healing window in normal failure cases, but does not help when the node is half-dead and the failure detector keeps getting suppressed by sporadic ACKs.
  - Custom balancer plugin: A custom balancer can produce ChangeConfigCommand, but it still cannot bypass the isMissingInStore() gate without also evicting the node from base-cluster membership. It also requires custom code deployment rather than a simple API call.
  - Graceful shutdown on the failed node: Only works if the node is reachable enough to run AgentHost.close(), which is not the case when the disk is full or the process is unresponsive.

  **Additional context**

  The internal command types (ChangeConfigCommand, QuitCommand, RecoveryCommand) and the balancer framework already exist in base-kv-store-balance-spi. The apiserver handler pattern is well-established in bifromq-apiserver/src/main/java/org/apache/bifromq/apiserver/http/handler/. The main work is wiring an external trigger through to these existing internal mechanisms.


