Quorum queue member cannot recover, runs into an {error, name_not_registered}
#15727
I'm trying to understand the procedure here, based on your reconstruction. There are a few things that are bothering me about this.

First, a single node being down in a three-node cluster should not result in a quorum queue being in the down state. Under normal circumstances, when one node of a 3-node cluster goes down, the other two remaining nodes are quorum critical, meaning they cannot be removed. However, the quorum queue ought to automatically elect a new leader on one of the two remaining nodes and continue to run in the absence of the third member. Leaders are not fixed, so the fact that the leader remains on the down node and the queue reports as down is puzzling. The entire design of quorum queues is such that they should continue to operate on the majority of nodes while a minority of nodes is down.

However, in the case of a three-node cluster, removing a second member from the cluster while one member is down is guaranteed to remove quorum, and thus everything will stop in that situation. Why are you attempting to remove a queue member from the cluster while a node is down? Normally no intervention beyond ensuring the down node eventually comes back online should be required.
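The quorum arithmetic described above can be made concrete with a few lines of illustrative Python (a sketch, not RabbitMQ code): a 3-node cluster needs 2 members available, so one node down is survivable, but losing or removing a second member is not.

```python
# Illustrative sketch of Raft majority arithmetic; not RabbitMQ source code.

def majority(cluster_size: int) -> int:
    """Smallest number of members that constitutes a quorum."""
    return cluster_size // 2 + 1

def has_quorum(cluster_size: int, available: int) -> bool:
    """Can this many available members still elect a leader and commit?"""
    return available >= majority(cluster_size)

# A 3-node cluster tolerates exactly one unavailable member.
assert has_quorum(3, 2)       # one node down: still a quorum
assert not has_quorum(3, 1)   # second member down or removed: no quorum
```

This is why removing a second member while one is already down stops everything: the single remaining replica can never reach the required majority of 2.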
Thank you for the response. Let me clarify the actual sequence of events. The issue was discovered while testing some availability scenarios not related to quorum queues (basically I just stopped RabbitMQ on node1). So the queue was never truly functional even when it appeared to be.
@timur-ND we need logs from all nodes in order to investigate this behavior. We will not guess what the relative timing of these events is.
This is a symptom, not the cause.

### What Does the Warning Mean?

There is effectively one code path that results in an `{error, name_not_registered}`. So the quorum queue in question does not have its local data. Maybe the node restart was actually a reset.
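As a rough illustration of why missing local data matters (the function and directory names here are hypothetical, not Ra or RabbitMQ APIs): a replica that finds durable state on disk can recover and rejoin its peers, while one whose data directory was wiped has no way to know it was ever a member and behaves like a brand-new node.

```python
# Hypothetical sketch of the recovery decision a Raft replica faces on boot.
# Function name and layout are illustrative only, not Ra/RabbitMQ internals.
import os

def recover_or_bootstrap(data_dir: str) -> str:
    """A replica with durable state recovers; one without looks brand new.

    If the data directory did not survive a restart, the replica cannot
    know it was ever a cluster member, which is why durable storage is a
    fundamental assumption of Raft.
    """
    if os.path.isdir(data_dir) and os.listdir(data_dir):
        return "recover"    # replay the local log and rejoin peers
    return "bootstrap"      # no prior state: behaves like a fresh member
```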
This further confirms my hypothesis that the Raft data directories (or parts of them) somehow do not survive a restart, leading quorum queue replicas on every node to believe they have no prior state. Durable storage is a fundamental assumption of Raft, not something specific to RabbitMQ's quorum queues or our Raft implementation. The "Expected Behavior" suggestions are not specific enough and make it sound like the problem is trivial, but in fact we have been chasing edge cases that can trip up leader election for years.

You keep hammering on the `{error, name_not_registered}` error, but the problem comes down to quorum queue members in your environment repeatedly not finding their local data. If this were common behavior we would be flooded with such reports, but the only other issue that I could find was #10007 from 2023.

### How to Proceed

There are quite a few possible scenarios and not enough specifics to conclude anything. Please stop making it sound like the queue "should just join its existing members" (it tries to do just that) and share logs from all nodes. Note that all hostnames, virtual host names, usernames, queue and stream names in the logs can be obfuscated.
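As one possible approach to obfuscation (illustrative Python only, not the specific tool referenced in this thread): replace sensitive names with stable placeholders before sharing logs, so the timeline stays correlatable across nodes.

```python
# Minimal illustration of log obfuscation before sharing logs publicly.
# The mapping values are arbitrary placeholders chosen by the operator.
import re

def obfuscate(line: str, mapping: dict[str, str]) -> str:
    """Replace each sensitive name with its stable placeholder."""
    for real, placeholder in mapping.items():
        line = re.sub(re.escape(real), placeholder, line)
    return line

mapping = {"node1.prod.example.com": "node-A", "fastedge_dev": "vhost-1"}
print(obfuscate("queue on node1.prod.example.com in fastedge_dev", mapping))
# prints "queue on node-A in vhost-1"
```

Using the same mapping for every node's logs keeps cross-node references consistent, which matters when maintainers need the relative timing of events.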
UIDs are node-local and will not be the same across a Ra cluster. Are you sure you didn't accidentally nuke your disks during the upgrade?
Thank you for the detailed feedback. The root cause seems to be a failed quorum queue replica removal during `forget_cluster_node`, followed by a node re-join.

Exact sequence (from logs):

- All queues were removed successfully except `fastedge_compile` in vhost `fastedge_dev`, which timed out on all nodes
- Mar 2, 15:26 — node1 was removed from the Khepri cluster
- Mar 2, 15:30 — node1 came back up and rejoined the cluster
- A new independent Ra instance was bootstrapped instead of re-syncing with the existing Raft group; this is consistent with the node having been reset and rejoined as a fresh member

Question: I followed your suggestion and obfuscated the logs. If you have time, could you check these logs and correct my timeline and root-cause analysis where needed?



That message is irrelevant: it comes from the CQ (classic queue) message store and has nothing to do with quorum queues.
So the data on one of the nodes was wiped after all (by `rabbitmqctl forget_cluster_node`, which performs an equivalent of `rabbitmq-queues shrink_all` before leaving the cluster as of #9449), and the node rejoined.

My best hypothesis is this:

- `rabbitmqctl forget_cluster_node` left the cluster before a few QQ replicas were removed
- `rabbitmqct…`
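The hypothesis above can be illustrated with a small simulation (everything here is an assumption for illustration, not RabbitMQ source): if one replica removal times out during the shrink step, the departing node leaves behind a queue whose membership still references it, and when the node later rejoins as a fresh member, that replica has no local data to recover from.

```python
# Illustrative simulation of the hypothesized failure mode. The function,
# its arguments, and the queue names are hypothetical stand-ins.

def forget_cluster_node(node, queues, removal_ok):
    """queues: {name: set of member nodes}; removal_ok: per-queue success.

    Models the shrink step of forget_cluster_node: each replica on the
    departing node is removed, but a removal that times out leaves the
    queue's membership still referencing the departed node.
    """
    leftovers = []
    for name, members in queues.items():
        if node in members:
            if removal_ok.get(name, True):
                members.discard(node)      # replica removed cleanly
            else:
                leftovers.append(name)     # e.g. removal timed out
    return leftovers                        # queues with stale membership

queues = {"q1": {"node1", "node2", "node3"},
          "fastedge_compile": {"node1", "node2", "node3"}}
stale = forget_cluster_node("node1", queues, {"fastedge_compile": False})
# 'fastedge_compile' still lists node1; a rejoined node1 replica would have
# no prior local state for it, matching the name_not_registered symptom.
```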