Not all messages are routed after a network failure (with Khepri) #15726
@lukasfraser we need an executable way to reproduce your topology. We won't guess what exactly your topology is like, how many queues you have, and so on. So far it sounds like a combination of 2-3 known behaviors or issues. On top of that, you seemingly believe that auto-delete queues are supposed to be cleaned up when their declaring clients disconnect. That is the behavior of exclusive queues; auto-delete queues are not tied to the lifecycle of their declaring connection.

Topic Exchange Bindings
Topic exchange bindings with Khepri had a few known bugs addressed by
Where we (or the users) knew how to reproduce this behavior of some bindings being left behind, we did. This can be the explanation of the "some messages are rejected" behavior, although what specifically "rejected" means is not clear to me. They were not routed, not rejected in the publisher confirms/DLX sense.

Mass Disconnect of Clients and The Redeclaration Race™
There is a widely known scenario where a mass disconnect of clients results in exclusive queues being removed concurrently with client reconnections (that include queue declarations). There is very little that RabbitMQ can do about that besides #15276. With Khepri this scenario is actually less problematic than with Mnesia; see the very first item in the Why are the
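To illustrate the lifecycle distinction above, here is a minimal sketch (in Python, purely illustrative; the function name is ours, not a client-library API):

```python
# Sketch contrasting the two classic-queue lifecycles described above.
# Assumption: this helper is illustrative only and not part of any client API.

def deleted_when(exclusive: bool, auto_delete: bool) -> str:
    """Return the event that removes a classic queue with these flags."""
    if exclusive:
        # Exclusive queues die with the connection that declared them.
        return "declaring connection closes"
    if auto_delete:
        # Auto-delete queues die when their LAST CONSUMER goes away,
        # not when the declaring connection closes.
        return "last consumer is cancelled or its channel/connection closes"
    return "explicit deletion (or a policy/argument such as x-expires)"

print(deleted_when(exclusive=True, auto_delete=False))
print(deleted_when(exclusive=False, auto_delete=True))
```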
Thank you for the reply.

Regarding topology:

Regarding rejection:

Regarding reproducibility: For us it would be perfectly acceptable if RabbitMQ temporarily "nacks" messages while a node is leaving or joining the cluster, giving RabbitMQ some time to figure things out. Our clients will just retry publishing the message after a few seconds. However, if the system remains in this unhappy state, we are quite literally blocked, leaving us only the option of triggering the workaround that deletes & recreates the affected exchanges.
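The retry behaviour described above could look roughly like this (a sketch; `publish_once` is a hypothetical stand-in for whatever the real C++/Node.js client exposes, injected so the logic can be shown without a broker):

```python
import time

# Sketch of client-side retry: re-publish after a few seconds when the
# broker nacks. `publish_once` is a hypothetical stand-in (returns True
# on a publisher-confirm ack, False on a nack).

def publish_with_retry(publish_once, message, retries=5, delay_s=3.0,
                       sleep=time.sleep):
    for attempt in range(1, retries + 1):
        if publish_once(message):      # broker confirmed the publish
            return attempt
        if attempt < retries:
            sleep(delay_s)             # give the cluster time to settle
    raise RuntimeError("broker kept nacking; exchange may need recreation")

# Example: broker nacks twice, then acks on the third attempt.
outcomes = iter([False, False, True])
attempts = publish_with_retry(lambda m: next(outcomes), b"payload",
                              sleep=lambda s: None)
print(attempts)  # 3
```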
Describe the bug
Environment
We are using RabbitMQ version 4.2.4 (Erlang 27.3.4.8) with Khepri on a Kubernetes cluster consisting of 3 nodes, where each node runs a single RabbitMQ pod.
Our own processes are based on C++ and Node.js with a RabbitMQ heartbeat of around 5 to 10 seconds. They are deployed to the same nodes as RabbitMQ.
On average we can expect around 100-200 messages per second, with startup peaks of up to maybe 1000-2000 per second. The overall connection count is well below 100.
Usage
We create durable topic-based exchanges for publishing messages.
Consumers create classic auto-delete & exclusive queues with bindings to these exchanges.
If I am not mistaken, RabbitMQ should automatically delete these queues & bindings when the consumer disconnects. To be on the safe side we also set the x-expires property to 60000 (= 60 seconds).
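The consumer-queue properties described above might be expressed like this (a sketch; the reporters use C++/Node.js clients, so the helper below is illustrative, with a pika-style usage note at the end):

```python
# Sketch of the consumer-queue properties described above. Assumption:
# the helper name is ours; only the flag/argument names are real AMQP ones.

def consumer_queue_properties(ttl_ms: int = 60_000) -> dict:
    """Properties for a classic, exclusive, auto-delete consumer queue.

    exclusive=True   -> deleted when the declaring connection closes
    auto_delete=True -> deleted when its last consumer is cancelled
    x-expires        -> safety net: deleted after ttl_ms of disuse
    """
    return {
        "exclusive": True,
        "auto_delete": True,
        "arguments": {"x-expires": ttl_ms},
    }

props = consumer_queue_properties()
print(props["arguments"]["x-expires"])  # 60000
```

With a pika-style Python client this would map onto something like `channel.queue_declare(queue="", **consumer_queue_properties())`, letting the broker generate the queue name.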
How to reproduce
When we simulate a network disconnect on one of the three nodes (e.g., disable the network interface on the host machine for 3 minutes via
ip link set eth0 down; sleep 180; ip link set eth0 up), we have noticed that in some cases some (not all!) exchanges may start to cause issues. From what we can tell, broken exchanges are caused by "zombie" queues & bindings within RabbitMQ (= auto-delete queues without a consumer that should already have been destroyed by RabbitMQ). Interestingly enough, the "zombie" queues are only visible via the RabbitMQ Management GUI; they do not show up when calling "rabbitmqctl list_queues" on the command line. All three nodes show the same list of queues.
The "zombies" are still there after 1-2 hours.
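One way to spot the discrepancy described above is to diff the two listings. A minimal sketch follows (the management endpoint `GET /api/queues` and `rabbitmqctl list_queues` are real interfaces, but the fetching/parsing is elided and the sample queue names below are made up):

```python
# Sketch: detect "zombie" queues by diffing the management API listing
# against `rabbitmqctl list_queues` output. Fetching and parsing are left
# out; the sample data below is invented for illustration.

def find_zombie_queues(mgmt_queues, ctl_queues):
    """Queue names visible in the management UI but absent from rabbitmqctl."""
    return sorted(set(mgmt_queues) - set(ctl_queues))

# e.g. mgmt listing from GET /api/queues,
#      ctl listing from `rabbitmqctl list_queues name`
mgmt = ["orders.q1", "orders.q2", "amq.gen-abc123"]
ctl = ["orders.q1", "orders.q2"]
print(find_zombie_queues(mgmt, ctl))  # ['amq.gen-abc123']
```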
When a message is published to an affected exchange, RabbitMQ gets confused when trying to distribute it to the affected bindings & queues.
Please note that the same rejections also occur when publishing a message via the RabbitMQ Management GUI, so it is not just our processes / libraries being unhappy. It does not appear to make any difference which RabbitMQ pod you are connected to. Client reconnects / restarts do not fix the problem. There are no relevant messages in the RabbitMQ logs; all we can see is our processes reconnecting and doing the "normal" stuff. No obvious errors.
Workaround
A workaround is to delete the exchange and recreate it. However, in order to do that you first need to reliably detect that an exchange is broken on the producer side. We do not want to trigger this unless it is absolutely necessary.
After recreating the exchange the consumer also needs to find out that its bindings to the old exchange are gone and bind to the new exchange.
We cannot use the mandatory flag for publishing updates because there are situations where we do not have any consumers.
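The escalation logic of this workaround could be sketched as follows (illustrative only; all three callables are hypothetical stand-ins injected so the control flow can be shown and tested without a broker):

```python
# Sketch of the workaround described above: only after repeated publish
# failures (detected via publisher confirms) is the exchange deleted and
# recreated. All three callables are hypothetical stand-ins.

def publish_or_recreate(publish_once, delete_exchange, declare_exchange,
                        message, max_failures=3):
    for _ in range(max_failures):
        if publish_once(message):      # broker acked the publish
            return "published"
    # The exchange looks broken: recreate it, then try once more.
    delete_exchange()
    declare_exchange()                 # note: consumers must re-bind afterwards
    return "recreated" if publish_once(message) else "failed"

# Simulated broken exchange that starts working once recreated.
calls = []
state = {"broken": True}
result = publish_or_recreate(
    publish_once=lambda m: not state["broken"],
    delete_exchange=lambda: calls.append("delete"),
    declare_exchange=lambda: (calls.append("declare"),
                              state.update(broken=False)),
    message=b"payload",
)
print(result, calls)  # recreated ['delete', 'declare']
```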
What we have noticed is that the exchange behavior has changed between RabbitMQ 4.0.9 and 4.2.4 (both using Khepri). The same issue also occurred in 4.0.9, but less frequently.
Questions
Is this a known issue? I could not find any existing ticket that matches what we experience (aside from old issues related to Mnesia).
Are we doing something wrong in the way we are using exchanges?
Is there something else we can check?
Reproduction steps
ip link set eth0 down; sleep 180; ip link set eth0 up
Expected behavior
Temporary issues / disconnects are expected due to the specific test scenario.
However, exchanges should not continue to reject messages after the network is restored.
Additional context
No response