Shared Subscription Throughput Bottleneck: 400K msg/s Published, Only 20-40K msg/s Consumed #262

@wxhdavid

Summary

In a large-scale IoT benchmark with 10 million concurrent MQTT connections, all devices publish status messages matching a single $share group subscription. The publish side sustains around 400K msg/s, but the subscribe side consumes at most 20-40K msg/s (about 5-10% of the publish rate), and only 4-6K msg/s in two of the four configurations below. The dist-service layer is the bottleneck: its nodes run at 64-79% CPU while the broker nodes sit at 1-5%.

Environment

BifroMQ Cluster (15 nodes):

  • Broker: 7 nodes, 48 vCPU each — 1-5% CPU, idle
  • Dist: 5 nodes, 48 vCPU each — 64-79% CPU, near saturation
  • Retain: 3 nodes, 8 vCPU each — 0-1% CPU, idle

Load Generator: 25 machines x 400K connections = 10M total, 8 NICs per machine, custom Rust MQTT client.

Subscriber: 13 machines x 30 connections = 390 subscribers, all in one $share group.

BifroMQ Version: 4.0.0-incubating, JDK 25, Linux (Alibaba Cloud ECS)

Topic Structure:

Publish:   /device_status/{device_id}/{product_key}/{product_model}/{region}/
Subscribe: $share/group_name//device_status/#
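
The double slash in the subscription is easy to misread as a typo. A minimal sketch of why it is needed, assuming standard MQTT 5 shared-subscription syntax ($share/{group}/{filter}) and standard wildcard matching: because the publish topics start with "/", the filter itself must start with "/". The helper names and the sample device values below are hypothetical and are not taken from BifroMQ.

```python
def parse_shared_filter(subscription: str) -> tuple[str, str]:
    """Split '$share/{group}/{filter}' into (group, filter)."""
    prefix = "$share/"
    if not subscription.startswith(prefix):
        raise ValueError("not a shared subscription")
    # Only the first '/' after the group name is a separator;
    # everything after it, including a leading '/', belongs to the filter.
    group, sep, topic_filter = subscription[len(prefix):].partition("/")
    if not group or not sep or not topic_filter:
        raise ValueError("malformed shared subscription")
    return group, topic_filter


def topic_matches(topic_filter: str, topic: str) -> bool:
    """MQTT-style matching with '+' (one level) and '#' (remaining levels)."""
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            return True  # matches the parent level and everything below it
        if i >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[i]:
            return False
    return len(filter_levels) == len(topic_levels)


group, topic_filter = parse_shared_filter("$share/group_name//device_status/#")
sample_topic = "/device_status/dev42/pk1/modelA/cn-hangzhou/"  # hypothetical values

print(group, topic_filter)
print(topic_matches(topic_filter, sample_topic))
print(topic_matches("device_status/#", sample_topic))
```

With a single slash the filter would be device_status/#, whose first level is "device_status", while the first level of every published topic is the empty string before the leading "/", so nothing would match.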

Test Results

Config A — 430,000 publishers, 1 msg/s each

  • Pub QPS: 431K (peak at 8.4M connections)
  • Sub QPS: 4-6K/s

Config B — 5,000 publishers, 86 msg/s each

  • Pub QPS: 424K (zero drops)
  • Sub QPS: 4K/s

Config C — 100,000 publishers, 4 msg/s each

  • Pub QPS: 389K (zero drops at low connection count)
  • Sub QPS: 22-42K/s (best result)

Config D — Same as C, observed at only 290K connections

  • Pub QPS: 389K, zero drops, zero ConnFail, broker 1-5% CPU
  • Sub QPS: 22K/s — proves the ceiling exists regardless of broker load
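
To put the four results on a common scale, the sketch below computes the consumed share of the publish rate and the average rate per subscriber. It uses only the figures reported above, plus the 390 subscribers from the Environment section. The per-subscriber number assumes messages are spread evenly across the group, which was not measured.

```python
# (pub msg/s, sub msg/s low, sub msg/s high) as reported per configuration
RESULTS = {
    "A": (431_000, 4_000, 6_000),
    "B": (424_000, 4_000, 4_000),
    "C": (389_000, 22_000, 42_000),
    "D": (389_000, 22_000, 22_000),
}
SUBSCRIBERS = 13 * 30  # 13 machines x 30 connections


def consumed_percent(pub_rate: int, sub_rate: int) -> float:
    """Share of published messages that reach the subscribers, in percent."""
    return 100.0 * sub_rate / pub_rate


def per_subscriber_rate(sub_rate: int, subscribers: int = SUBSCRIBERS) -> float:
    """Average msg/s per subscriber, assuming an even spread (an assumption)."""
    return sub_rate / subscribers


for name, (pub, sub_lo, sub_hi) in RESULTS.items():
    print(
        f"Config {name}: "
        f"{consumed_percent(pub, sub_lo):.1f}-{consumed_percent(pub, sub_hi):.1f}% consumed, "
        f"{per_subscriber_rate(sub_lo):.0f}-{per_subscriber_rate(sub_hi):.0f} msg/s per subscriber"
    )
```

Only Config C and D land in the 5-10% range; Config A and B deliver roughly 1% of what is published.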

Key Observations

  1. Broker nodes nearly idle (1-5% CPU) — MQTT protocol handling is not the bottleneck.

  2. Dist nodes are the bottleneck at 64-79% CPU. All 5 dist nodes show load avg 30-38 on 48-core machines.

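  3. Sub throughput appears to scale with publisher diversity: 5K unique publishers (Config B) give 4K/s sub; 100K unique publishers (Config C) give 22-42K/s sub. This suggests dist-layer partitioning doesn't distribute load evenly when message sources are concentrated. Config A does not fit this pattern (430K publishers, 4-6K/s sub), but it was measured at 8.4M connections, so connection count may also play a role.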

  4. Subscriber QoS 0 vs QoS 1 makes no difference — bottleneck is in the dist-layer fanout/delivery pipeline, not in broker-to-subscriber PUBACK.

  5. Payload size reduction (900B to 100B) did not improve sub throughput.

  6. At low connection count (290K), with zero drops, sub is still capped at 22K/s — the shared subscription delivery ceiling is independent of broker load.

Questions

  1. Is there a recommended way to tune shared subscription throughput for high fan-in scenarios (many publishers, single $share group, few hundred subscribers)?

  2. Would increasing dist_worker_fanout_parallelism help? Default is max(2, cpus/8) = 6 on 48-core machines — seems conservative for 400K+ msg/s.

  3. Are there other system properties we should tune? (e.g. data_plane_burst_latency_ms, dist_worker_inline_fanout_threshold)

  4. Is the single-threaded DeliverExecutor design intentional for ordering? Could it be relaxed for unordered shared subscriptions?
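
For context on question 2, here is a back-of-the-envelope sketch of what the quoted default implies. It takes the formula max(2, cpus/8) exactly as stated above, without checking it against the BifroMQ source. It also assumes integer division, that every dist node runs that many fan-out workers, and that the load is spread evenly across them. The function names are hypothetical.

```python
def default_fanout_parallelism(cpus: int) -> int:
    """Default as quoted in this issue: max(2, cpus/8), integer division assumed."""
    return max(2, cpus // 8)


def per_worker_rate(pub_rate: float, dist_nodes: int, parallelism: int) -> float:
    """msg/s each fan-out worker would need to handle under an even spread."""
    return pub_rate / (dist_nodes * parallelism)


parallelism = default_fanout_parallelism(48)  # 48 vCPU dist nodes
print(parallelism)
print(round(per_worker_rate(400_000, dist_nodes=5, parallelism=parallelism)))
```

Under these assumptions each of the 30 fan-out workers (5 dist nodes x 6) would need to sustain about 13K msg/s to keep up with 400K msg/s published.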

Detailed source code analysis and full optimization history in comments below.
