Quorum queue member cannot recover, runs into an {error, name_not_registered}
#15727
I'm trying to understand the procedure here, based on your reconstruction. There are a few things that are bothering me about this.

First, a single node being down in a three-node cluster should not result in a quorum queue being in the down state. Under normal circumstances, when one node of a 3-node cluster goes down, the other two remaining nodes are quorum critical, meaning they cannot be removed. However, the quorum queue ought to automatically elect a new leader on one of the two remaining nodes and continue to run in the absence of the third member. Leaders are not fixed, so the fact that the leader remains on the down node and the queue reports as down is puzzling. The entire design of quorum queues is such that they should continue to operate on the majority of nodes while a minority of nodes is down.

However, in the case of a three-node cluster, removing a second member from the cluster while one member is down is guaranteed to remove quorum, and thus everything will stop in that situation. Why are you attempting to remove a queue member from the cluster while a node is down? Normally no intervention beyond ensuring the down node eventually comes back online should be required.
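The quorum arithmetic described above can be made concrete with a few lines of illustrative Python (a sketch, not RabbitMQ code): a 3-node cluster needs 2 members available, so one node down is survivable, but losing or removing a second member is not.

```python
# Illustrative sketch of Raft majority arithmetic; not RabbitMQ source code.

def majority(cluster_size: int) -> int:
    """Smallest number of members that constitutes a quorum."""
    return cluster_size // 2 + 1

def has_quorum(cluster_size: int, available: int) -> bool:
    """Can this many available members still elect a leader and commit?"""
    return available >= majority(cluster_size)

# A 3-node cluster tolerates exactly one unavailable member.
assert has_quorum(3, 2)       # one node down: still a quorum
assert not has_quorum(3, 1)   # second member down or removed: no quorum
```

This is why removing a second member while one is already down stops everything: the single remaining replica can never reach the required majority of 2.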
Thank you for the response. Let me clarify the actual sequence of events. The issue was discovered while testing some availability scenarios not related to quorum queues (basically I just stopped RabbitMQ on node1). So the queue was never truly functional even when it appeared to be.
@timur-ND we need logs from all nodes in order to investigate this behavior. We will not guess what the relative timing of these events is.
This is a symptom, not the cause.

### What Does the Warning Mean?

There is effectively one code path that results in an `{error, name_not_registered}`. So the quorum queue in question does not have its local data. Maybe the node restart was actually a reset.
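As a rough illustration of why missing local data matters (the function and directory names here are hypothetical, not Ra or RabbitMQ APIs): a replica that finds durable state on disk can recover and rejoin its peers, while one whose data directory was wiped has no way to know it was ever a member and behaves like a brand-new node.

```python
# Hypothetical sketch of the recovery decision a Raft replica faces on boot.
# Function name and layout are illustrative only, not Ra/RabbitMQ internals.
import os

def recover_or_bootstrap(data_dir: str) -> str:
    """A replica with durable state recovers; one without looks brand new.

    If the data directory did not survive a restart, the replica cannot
    know it was ever a cluster member, which is why durable storage is a
    fundamental assumption of Raft.
    """
    if os.path.isdir(data_dir) and os.listdir(data_dir):
        return "recover"    # replay the local log and rejoin peers
    return "bootstrap"      # no prior state: behaves like a fresh member
```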
This further confirms my hypothesis that the Raft data directories (or parts of them) somehow do not survive a restart, leading quorum queue replicas on every node to believe they have no prior state. Durable storage is a fundamental assumption of Raft, not something specific to RabbitMQ's quorum queues or our Raft implementation. The "Expected Behavior" suggestions are not specific enough and make it sound like the problem is trivial, but in fact we have been chasing edge cases that can trip up leader election for years.

You keep hammering on the `{error, name_not_registered}` error, but the problem comes down to quorum queue members in your environment repeatedly not finding their local data. If this were common behavior we would be flooded with such reports, but the only other issue that I could find was #10007 from 2023.

### How to Proceed

There are quite a few possible scenarios and not enough specifics to conclude anything. Please stop making it sound like the queue "should just join its existing members" (it tries to do just that) and share logs from all nodes. Note that all hostnames, virtual host names, usernames, queue and stream names in the logs can be obfuscated.
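As one possible approach to obfuscation (illustrative Python only, not the specific tool referenced in this thread): replace sensitive names with stable placeholders before sharing logs, so the timeline stays correlatable across nodes.

```python
# Minimal illustration of log obfuscation before sharing logs publicly.
# The mapping values are arbitrary placeholders chosen by the operator.
import re

def obfuscate(line: str, mapping: dict[str, str]) -> str:
    """Replace each sensitive name with its stable placeholder."""
    for real, placeholder in mapping.items():
        line = re.sub(re.escape(real), placeholder, line)
    return line

mapping = {"node1.prod.example.com": "node-A", "fastedge_dev": "vhost-1"}
print(obfuscate("queue on node1.prod.example.com in fastedge_dev", mapping))
# prints "queue on node-A in vhost-1"
```

Using the same mapping for every node's logs keeps cross-node references consistent, which matters when maintainers need the relative timing of events.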
UIDs are node-local and will not be the same across a Ra cluster. Are you sure you didn't accidentally nuke your disks during the upgrade?
Thank you for the detailed feedback. The root cause seems to be a failed quorum queue replica removal during `forget_cluster_node`, followed by a node re-join.

Exact sequence (from logs):

- All queues were removed successfully except `fastedge_compile` in vhost `fastedge_dev`, which timed out on all nodes
- Mar 2, 15:26 — node1 was removed from the Khepri cluster
- Mar 2, 15:30 — node1 came back up and rejoined the cluster
- A new independent Ra instance was bootstrapped instead of re-syncing with the existing Raft group; this is consistent with the node having been reset and rejoined as a fresh member

Question: I followed your suggestion and obfuscated the logs. If you have time, could you check these logs and correct my timeline and root-cause analysis where needed?



That message is irrelevant: it comes from the CQ (classic queue) message store and has nothing to do with quorum queues.
So the data on one of the nodes was wiped after all (by `rabbitmqctl forget_cluster_node`, which performs an equivalent of `rabbitmq-queues shrink_all` before leaving the cluster as of #9449), and the node rejoined.

My best hypothesis is this:

- `rabbitmqctl forget_cluster_node` left the cluster before a few QQ replicas were removed
- `rabbitmqct…`
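The hypothesis above can be illustrated with a small simulation (everything here is an assumption for illustration, not RabbitMQ source): if one replica removal times out during the shrink step, the departing node leaves behind a queue whose membership still references it, and when the node later rejoins as a fresh member, that replica has no local data to recover from.

```python
# Illustrative simulation of the hypothesized failure mode. The function,
# its arguments, and the queue names are hypothetical stand-ins.

def forget_cluster_node(node, queues, removal_ok):
    """queues: {name: set of member nodes}; removal_ok: per-queue success.

    Models the shrink step of forget_cluster_node: each replica on the
    departing node is removed, but a removal that times out leaves the
    queue's membership still referencing the departed node.
    """
    leftovers = []
    for name, members in queues.items():
        if node in members:
            if removal_ok.get(name, True):
                members.discard(node)      # replica removed cleanly
            else:
                leftovers.append(name)     # e.g. removal timed out
    return leftovers                        # queues with stale membership

queues = {"q1": {"node1", "node2", "node3"},
          "fastedge_compile": {"node1", "node2", "node3"}}
stale = forget_cluster_node("node1", queues, {"fastedge_compile": False})
# 'fastedge_compile' still lists node1; a rejoined node1 replica would have
# no prior local state for it, matching the name_not_registered symptom.
```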