[Questions] Autoheal partition handling behavior and quorum queues #15563
RabbitMQ version used: 4.1.2
Erlang version used: 26.2.x
Operating system (distribution) used: Linux
How is RabbitMQ deployed? Generic binary package
rabbitmq-diagnostics status output: not provided

Logs from node 1 (with sensitive values edited out):
2026-02-22 02:52:50.510020-03:00 [warning] <0.227735327.0> Some consumers : '#{}' created by channel : '<0.227735327.0>' are die out.

Logs from node 2 (with sensitive values edited out):
2026-02-22 03:43:51.829163-03:00 [warning] <0.238192351.0> Some consumers : '#{}' created by channel : '<0.238192351.0>' are die out.

Logs from node 3 (with sensitive values edited out):
2026-02-22 03:38:50.525720-03:00 [info] <0.243981234.0> closing AMQP connection (172.23.62.2:43646 -> 172.23.1.1:5673 - cachingConnectionFactory#65f3e805:0, vhost: '/', user: 'rabbitmq_ma', duration: '2M, 48s')

rabbitmq.conf:
listeners.tcp = none

Steps to deploy RabbitMQ cluster: config and run 'rabbitmq-server'
Steps to reproduce the behavior in question: not sure
advanced.config: not provided
Application code: not provided
Kubernetes deployment file: not provided

What problem are you trying to solve?

The timeline is as follows:

2. At '2026-02-22 03:44:35.170842-03:00', the rabbit@rabbitmqservice-1 instance printed a log indicating that rabbit@rabbitmqservice-2 was offline, without any other context.
3. At '2026-02-22 03:44:35.256189-03:00', the rabbit@rabbitmqservice-0 instance printed a log.

My guess is that rabbit@rabbitmqservice-2 had a network partition with rabbit@rabbitmqservice-1 and rabbit@rabbitmqservice-0. Because I have set 'cluster_partition_handling = autoheal', rabbit@rabbitmqservice-2 later became the winner, and it made rabbit@rabbitmqservice-1 and rabbit@rabbitmqservice-0 restart and recover.

I was wondering if more detailed logging could be added to indicate the cause of a network partition, rather than just printing 'rabbit on node 'rabbit@rabbitmqservice-1' down'.

Secondly, I would like to know whether there was a specific reason for this network partition, or whether it was simply a network problem.

Finally, I want to know whether there are parameters that can be adjusted to tune the heartbeat timeout and failure-detection sensitivity, and to reduce the frequency of re-electing the primary node.

Best wishes to you.
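For reference, this is roughly what the relevant part of rabbitmq.conf looks like for the setup described above. Only listeners.tcp = none comes from the config actually shared; the cluster_partition_handling line reflects the statement that autoheal is in use, so this is a minimal sketch rather than the full configuration:

```
# rabbitmq.conf (minimal sketch, not the full configuration)
listeners.tcp = none

# partition handling strategy described in the post;
# other documented values include pause_minority and ignore
cluster_partition_handling = autoheal
```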
Replies: 2 comments
@dormanze RabbitMQ 4.1.x is out of community support.

The autoheal partition handling strategy restarts all nodes except for the first one on the list (the "winner").

We cannot possibly know what the reason for the network partition was given these few snippets. What we do know is that everything Raft-based uses rabbitmq/aten for peer failure detection, but that is entirely orthogonal to what triggers the partition handling strategy. For that, see Inter-node Communication Heartbeats. aten has several settings that can be adjusted via advanced.config, just like any other setting not exposed in rabbitmq.conf. One setting is exposed, though: raft.adaptive_failure_detector.poll_interval. This i…
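A sketch of where those knobs live, assuming the aten application's poll_interval key (in milliseconds) and the Erlang kernel net_ticktime (in seconds, 60 by default); the exact key names and values should be verified against the Inter-node Communication Heartbeats guide and the aten README for the release in use:

```
%% advanced.config -- illustrative sketch only, values are examples
[
  {kernel, [
    %% inter-node heartbeat ("net tick") time in seconds; default is 60
    {net_ticktime, 120}
  ]},
  {aten, [
    %% Raft peer failure detector poll interval in milliseconds
    {poll_interval, 5000}
  ]}
].
```

The one setting that is exposed in rabbitmq.conf, per the above, would be set like this:

```
# rabbitmq.conf
raft.adaptive_failure_detector.poll_interval = 5000
```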
Increasing the heartbeat timeout beyond the default 60 seconds is rarely necessary. Most likely your quorum queues have generated enough inter-node traffic to trigger the above mechanism, since all QQs use a single TCP connection. Use streams for such workloads: each stream uses a separate connection for replication, so such scenarios effectively never happen in practice per our experience.

We could add more logging, but we won't, because Mnesia is gone in …

Anyhow, I've already responded with way more than an out-of-community-support series would require me to.
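To illustrate the "use streams" suggestion: a minimal sketch of declaring a stream instead of a quorum queue using the documented x-queue-type argument. The client library (pika), connection parameters, and queue name are placeholders, not taken from this thread:

```python
import pika

# Placeholder connection details; streams must be declared durable.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# Declaring the queue with x-queue-type=stream makes it a stream,
# which replicates over its own connection rather than the shared
# quorum-queue connection discussed above.
channel.queue_declare(
    queue="events.stream",
    durable=True,
    arguments={"x-queue-type": "stream"},
)

connection.close()
```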
#15564 will expose the rest of the aten settings.

However, again, the issue here is likely the known and fundamental "all QQs use a single connection for replication" problem, which can produce false positives via net ticks.

A major step forward after complete Mnesia removal for … This might happen by …