[Questions] Autoheal partition handling behavior and quorum queues #15563
RabbitMQ version used: 4.1.2
Erlang version used: 26.2.x
Operating system (distribution) used: Linux
How is RabbitMQ deployed? Generic binary package
rabbitmq-diagnostics status output: not provided

Logs from node 1 (with sensitive values edited out):
2026-02-22 02:52:50.510020-03:00 [warning] <0.227735327.0> Some consumers : '#{}' created by channel : '<0.227735327.0>' are die out.

Logs from node 2 (with sensitive values edited out):
2026-02-22 03:43:51.829163-03:00 [warning] <0.238192351.0> Some consumers : '#{}' created by channel : '<0.238192351.0>' are die out.

Logs from node 3 (with sensitive values edited out):
2026-02-22 03:38:50.525720-03:00 [info] <0.243981234.0> closing AMQP connection (172.23.62.2:43646 -> 172.23.1.1:5673 - cachingConnectionFactory#65f3e805:0, vhost: '/', user: 'rabbitmq_ma', duration: '2M, 48s')

rabbitmq.conf:
listeners.tcp = none

Steps to deploy RabbitMQ cluster: config and run 'rabbitmq-server'
Steps to reproduce the behavior in question: not sure
advanced.config: not provided
Application code: not provided
Kubernetes deployment file: not provided

What problem are you trying to solve?

The timeline is as follows:

2. At '2026-02-22 03:44:35.170842-03:00', the rabbit@rabbitmqservice-1 instance printed a log indicating that rabbit@rabbitmqservice-2 was offline, without any other context.
3. At '2026-02-22 03:44:35.256189-03:00', the rabbit@rabbitmqservice-0 instance printed a log.

My guess is that rabbit@rabbitmqservice-2 had a network partition with rabbit@rabbitmqservice-1 and rabbit@rabbitmqservice-0. Because I have set 'cluster_partition_handling = autoheal', rabbit@rabbitmqservice-2 later became the winner, and it made rabbit@rabbitmqservice-1 and rabbit@rabbitmqservice-0 restart and recover.

I was wondering if more detailed logging could be added to indicate the cause of a network partition, rather than just printing 'rabbit on node 'rabbit@rabbitmqservice-1' down'.

Secondly, I would like to know whether there was a specific reason for this network partition, or whether it was simply a network problem.

Finally, I want to know whether there are parameters that can be adjusted to tune the heartbeat timeout and failure-detection sensitivity, and to reduce the frequency of re-electing the primary node.

Best wishes to you.
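For reference, this is roughly what the relevant part of rabbitmq.conf looks like for the setup described above. Only listeners.tcp = none comes from the config actually shared; the cluster_partition_handling line reflects the statement that autoheal is in use, so this is a minimal sketch rather than the full configuration:

```
# rabbitmq.conf (minimal sketch, not the full configuration)
listeners.tcp = none

# partition handling strategy described in the post;
# other documented values include pause_minority and ignore
cluster_partition_handling = autoheal
```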
Replies: 2 comments
@dormanze RabbitMQ 4.1.x is out of community support.

The autoheal partition handling strategy restarts all nodes except for the first one on the list (the "winner").

We cannot possibly know what the reason for the network partition was given these few snippets. What we do know is that everything Raft-based uses rabbitmq/aten for peer failure detection, but that is entirely orthogonal to what triggers the partition handling strategy. For that, see Inter-node Communication Heartbeats. aten has several settings that can be adjusted via advanced.config, just like any other setting not exposed in rabbitmq.conf. One setting is exposed, though: raft.adaptive_failure_detector.poll_interval. This i…
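A sketch of where those knobs live, assuming the aten application's poll_interval key (in milliseconds) and the Erlang kernel net_ticktime (in seconds, 60 by default); the exact key names and values should be verified against the Inter-node Communication Heartbeats guide and the aten README for the release in use:

```
%% advanced.config -- illustrative sketch only, values are examples
[
  {kernel, [
    %% inter-node heartbeat ("net tick") time in seconds; default is 60
    {net_ticktime, 120}
  ]},
  {aten, [
    %% Raft peer failure detector poll interval in milliseconds
    {poll_interval, 5000}
  ]}
].
```

The one setting that is exposed in rabbitmq.conf, per the above, would be set like this:

```
# rabbitmq.conf
raft.adaptive_failure_detector.poll_interval = 5000
```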
Increasing the heartbeat timeout beyond the default 60 seconds is rarely necessary. Most likely your quorum queues have generated enough inter-node traffic to trigger the above mechanism, since all QQs use a single TCP connection. Use streams for such workloads: each stream uses a separate connection for replication, so such scenarios effectively never happen in practice per our experience.

We could add more logging, but we won't, because Mnesia is gone in …

Anyhow, I've already responded with way more than an out-of-community-support series would require me to.
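To illustrate the "use streams" suggestion: a minimal sketch of declaring a stream instead of a quorum queue using the documented x-queue-type argument. The client library (pika), connection parameters, and queue name are placeholders, not taken from this thread:

```python
import pika

# Placeholder connection details; streams must be declared durable.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# Declaring the queue with x-queue-type=stream makes it a stream,
# which replicates over its own connection rather than the shared
# quorum-queue connection discussed above.
channel.queue_declare(
    queue="events.stream",
    durable=True,
    arguments={"x-queue-type": "stream"},
)

connection.close()
```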
#15564 will expose the rest of the aten settings.

However, again, the issue here is likely the known and fundamental "all QQs use a single connection for replication" problem, which can produce false positives via net ticks.

A major step forward after complete Mnesia removal for … This might happen by …