Summary
In a large-scale IoT benchmark with 10 million concurrent MQTT connections, all devices publish status messages matching a single $share group subscription. The publish side sustains 400K+ msg/s, but the subscribe side only consumes 20-40K msg/s (5-10% of the publish rate). The dist-service layer is the confirmed bottleneck.
Environment
BifroMQ Cluster (15 nodes):
- Broker: 7 nodes, 48 vCPU each — 1-5% CPU, idle
- Dist: 5 nodes, 48 vCPU each — 64-79% CPU, near saturation
- Retain: 3 nodes, 8 vCPU each — 0-1% CPU, idle
Load Generator: 25 machines x 400K connections = 10M total, 8 NICs per machine, custom Rust MQTT client.
Subscriber: 13 machines x 30 connections = 390 subscribers, all in one $share group.
BifroMQ Version: 4.0.0-incubating, JDK 25, Linux (Alibaba Cloud ECS)
Topic Structure:
Publish: /device_status/{device_id}/{product_key}/{product_model}/{region}/
Subscribe: $share/group_name//device_status/#
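To make the topic structure concrete, here is a minimal sketch of how the shared subscription above resolves against a published topic under standard MQTT filter rules. The function names and the sample device values are hypothetical, and this is not BifroMQ's matcher. The double slash matters: the first slash ends the group name, the second is the leading (empty) level of the topic filter.

```python
def split_shared_filter(subscription: str):
    """Split '$share/<group>/<filter>' into (group, filter).

    Returns (None, subscription) for a non-shared subscription.
    """
    prefix = "$share/"
    if not subscription.startswith(prefix):
        return None, subscription
    group, _, topic_filter = subscription[len(prefix):].partition("/")
    return group, topic_filter


def topic_matches(topic_filter: str, topic: str) -> bool:
    """Level-by-level MQTT filter match supporting '+' and '#'.

    Simplified: the special handling of topics starting with '$' is omitted.
    """
    f_levels = topic_filter.split("/")
    t_levels = topic.split("/")
    for i, f in enumerate(f_levels):
        if f == "#":
            return True  # matches the parent level and everything below it
        if i >= len(t_levels):
            return False
        if f != "+" and f != t_levels[i]:
            return False
    return len(f_levels) == len(t_levels)


group, topic_filter = split_shared_filter("$share/group_name//device_status/#")
# Sample values below are made up for illustration.
sample_topic = "/device_status/dev42/pk1/modelX/cn-hangzhou/"
print(group, topic_filter, topic_matches(topic_filter, sample_topic))
```

Because every publish topic starts with /device_status/, every message in the benchmark matches this one filter, so all traffic funnels into the single shared group.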
Test Results
Config A — 430,000 publishers, 1 msg/s each
- Pub QPS: 431K (peak at 8.4M connections)
- Sub QPS: 4-6K/s
Config B — 5,000 publishers, 86 msg/s each
- Pub QPS: 424K (zero drops)
- Sub QPS: 4K/s
Config C — 100,000 publishers, 4 msg/s each
- Pub QPS: 389K (zero drops at low connection count)
- Sub QPS: 22-42K/s (best result)
Config D — Same as C, observed at only 290K connections
- Pub QPS: 389K, zero drops, zero ConnFail, broker 1-5% CPU
- Sub QPS: 22K/s — proves the ceiling exists regardless of broker load
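For reference, the consumed share of the publish rate in each config can be derived directly from the figures above. This is plain arithmetic on the reported numbers; the dictionary layout and function name are only for illustration.

```python
# Reported figures from Configs A-D: publish QPS and (low, high) subscribe QPS.
configs = {
    "A": {"pub_qps": 431_000, "sub_qps": (4_000, 6_000)},
    "B": {"pub_qps": 424_000, "sub_qps": (4_000, 4_000)},
    "C": {"pub_qps": 389_000, "sub_qps": (22_000, 42_000)},
    "D": {"pub_qps": 389_000, "sub_qps": (22_000, 22_000)},
}


def delivery_ratio(pub_qps: int, sub_range: tuple) -> tuple:
    """Return the (low, high) fraction of published messages that were consumed."""
    low, high = sub_range
    return low / pub_qps, high / pub_qps


ratios = {name: delivery_ratio(c["pub_qps"], c["sub_qps"]) for name, c in configs.items()}

for name, (low, high) in ratios.items():
    print(f"Config {name}: {low:.1%} to {high:.1%} of published messages consumed")
```

Only Config C reaches the 5-10% range quoted in the summary; Configs A and B stay around 1%.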
Key Observations
- Broker nodes are nearly idle (1-5% CPU), so MQTT protocol handling is not the bottleneck.
- Dist nodes are the bottleneck at 64-79% CPU. All 5 dist nodes show a load average of 30-38 on 48-core machines.
- Sub throughput scales with publisher diversity: 5K unique publishers gives 4K/s sub; 100K unique publishers gives 22-42K/s sub. This suggests dist-layer partitioning doesn't distribute load evenly when message sources are concentrated.
- Subscriber QoS 0 vs QoS 1 makes no difference, so the bottleneck is in the dist-layer fanout/delivery pipeline, not in broker-to-subscriber PUBACK.
- Payload size reduction (900B to 100B) did not improve sub throughput.
- At low connection count (290K), with zero drops, sub is still capped at 22K/s, so the shared subscription delivery ceiling is independent of broker load.
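To put the 22K/s ceiling in per-node terms, the sketch below divides the Config D figures across the dist nodes and subscribers. It assumes an even spread, which the observations above suggest may not hold, so treat the results as rough averages rather than measurements.

```python
DIST_NODES = 5            # dist nodes in the cluster
SUBSCRIBERS = 13 * 30     # 390 subscribers in the $share group
PUB_QPS = 389_000         # Config D publish rate
SUB_QPS = 22_000          # Config D subscribe rate

# Averages under the (unverified) assumption of an even spread.
pub_per_dist_node = PUB_QPS / DIST_NODES
sub_per_dist_node = SUB_QPS / DIST_NODES
sub_per_subscriber = SUB_QPS / SUBSCRIBERS

print(f"inbound per dist node:   {pub_per_dist_node:,.0f} msg/s")
print(f"delivered per dist node: {sub_per_dist_node:,.0f} msg/s")
print(f"delivered per subscriber: {sub_per_subscriber:.1f} msg/s")
```

At roughly 56 msg/s per subscriber connection, the individual subscribers are unlikely to be the limiting factor, which is consistent with the QoS and payload-size observations.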
Questions
- Is there a recommended way to tune shared subscription throughput for high fan-in scenarios (many publishers, single $share group, few hundred subscribers)?
- Would increasing dist_worker_fanout_parallelism help? Default is max(2, cpus/8) = 6 on 48-core machines, which seems conservative for 400K+ msg/s.
- Are there other system properties we should tune? (e.g. data_plane_burst_latency_ms, dist_worker_inline_fanout_threshold)
- Is the single-threaded DeliverExecutor design intentional for ordering? Could it be relaxed for unordered shared subscriptions?
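The default quoted in the second question can be worked out as follows. This is a sketch of the formula as stated in this issue, not code taken from BifroMQ; integer division is an assumption, and the per-slot load figure assumes an even spread across nodes with fanout parallelism as the only limit.

```python
def default_fanout_parallelism(cpus: int) -> int:
    """Default as quoted in the question: max(2, cpus/8). Integer division assumed."""
    return max(2, cpus // 8)


# How the default scales with core count.
for cpus in (8, 16, 48, 96):
    print(f"{cpus} cores -> parallelism {default_fanout_parallelism(cpus)}")

DIST_NODES = 5
PUB_QPS = 400_000

# Cluster-wide fanout slots with the default on 48-core dist nodes.
total_slots = DIST_NODES * default_fanout_parallelism(48)
# Hypothetical average load per slot under an even spread.
msgs_per_slot = PUB_QPS / total_slots

print(f"{total_slots} slots, about {msgs_per_slot:,.0f} msg/s per slot")
```

With the default, each slot would need to handle on the order of 13K msg/s to keep up with the publish rate, which is the basis for asking whether the setting is too conservative here.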
Detailed source code analysis and full optimization history in comments below.