Shared Subscription Throughput Bottleneck: 400K msg/s Published, Only 20-40K msg/s Consumed

## Summary

In a large-scale IoT benchmark with **10 million concurrent MQTT connections**, all devices publish status messages matching a single `$share` group subscription. The publish side sustains **400K+ msg/s**, but the subscribe side consumes only **20-40K msg/s** (5-10%) in the best configuration and as little as 4K msg/s in the worst. The dist-service layer is the confirmed bottleneck.

## Environment

**BifroMQ Cluster (15 nodes):**

- Broker: 7 nodes, 48 vCPU each — **1-5% CPU, idle**
- Dist: 5 nodes, 48 vCPU each — **64-79% CPU, near saturation**
- Retain: 3 nodes, 8 vCPU each — 0-1% CPU, idle

**Load Generator:** 25 machines x 400K connections = 10M total, 8 NICs per machine, custom Rust MQTT client.

**Subscriber:** 13 machines x 30 connections = 390 subscribers, all in one `$share` group.

**BifroMQ Version:** 4.0.0-incubating, JDK 25, Linux (Alibaba Cloud ECS)

**Topic Structure:**

```
Publish:   /device_status/{device_id}/{product_key}/{product_model}/{region}/
Subscribe: $share/group_name//device_status/#
```
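To make the double slash in the subscription explicit: the group name is followed by a topic filter that itself starts with `/`, matching the leading `/` of the publish topics. The sketch below is a minimal illustration of standard MQTT filter matching, not BifroMQ's matcher. The function names and the sample topic values are hypothetical, and the special handling of topics beginning with `$` is omitted.

```python
def parse_shared_filter(subscription):
    """Split '$share/{group}/{filter}' into (group, filter)."""
    # maxsplit=2 keeps every '/' after the group name inside the filter.
    prefix, group, topic_filter = subscription.split("/", 2)
    if prefix != "$share" or not group or not topic_filter:
        raise ValueError(f"not a shared subscription: {subscription!r}")
    return group, topic_filter


def topic_matches(topic_filter, topic):
    """Level-by-level MQTT wildcard match ('+' one level, '#' the rest)."""
    f_levels = topic_filter.split("/")
    t_levels = topic.split("/")
    for i, f in enumerate(f_levels):
        if f == "#":
            # '#' is only valid as the last level and matches everything after.
            return i == len(f_levels) - 1
        if i >= len(t_levels):
            return False
        if f != "+" and f != t_levels[i]:
            return False
    return len(f_levels) == len(t_levels)


if __name__ == "__main__":
    group, flt = parse_shared_filter("$share/group_name//device_status/#")
    # Hypothetical device topic following the publish pattern above.
    sample = "/device_status/dev-001/pk-a/model-x/cn-hangzhou/"
    print(group, flt, topic_matches(flt, sample))
```

Every publish topic in the benchmark matches this one filter, so all published traffic fans in to the single shared group.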

## Test Results

**Config A** — 430,000 publishers, 1 msg/s each
- Pub QPS: 431K (peak at 8.4M connections)
- Sub QPS: **4-6K/s**

**Config B** — 5,000 publishers, 86 msg/s each
- Pub QPS: 424K (zero drops)
- Sub QPS: **4K/s**

**Config C** — 100,000 publishers, 4 msg/s each
- Pub QPS: 389K (zero drops at low connection count)
- Sub QPS: **22-42K/s** (best result)

**Config D** — Same as C, observed at only 290K connections
- Pub QPS: 389K, zero drops, zero ConnFail, broker 1-5% CPU
- Sub QPS: **22K/s** — proves the ceiling exists regardless of broker load
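For reference, the consumed fraction per config can be derived directly from the figures above. This is plain arithmetic on the reported numbers; the dictionary layout and function names are only illustrative.

```python
# Pub/sub rates copied from the Test Results above (msg/s).
# Each entry: (pub_qps, sub_qps_low, sub_qps_high).
RESULTS = {
    "A": (431_000, 4_000, 6_000),
    "B": (424_000, 4_000, 4_000),
    "C": (389_000, 22_000, 42_000),
    "D": (389_000, 22_000, 22_000),
}


def consumed_pct(pub_qps, sub_qps):
    """Percentage of published messages that reached subscribers."""
    return 100.0 * sub_qps / pub_qps


def summarize(results):
    """Return {config: (low_pct, high_pct)}, rounded to one decimal."""
    return {
        name: (round(consumed_pct(pub, lo), 1), round(consumed_pct(pub, hi), 1))
        for name, (pub, lo, hi) in results.items()
    }


if __name__ == "__main__":
    for name, (lo, hi) in summarize(RESULTS).items():
        print(f"Config {name}: {lo}% to {hi}% consumed")
```

Configs A and B deliver roughly 1% of published traffic, while C and D land in the 5-11% range.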

## Key Observations

1. **Broker nodes nearly idle** (1-5% CPU) — MQTT protocol handling is not the bottleneck.

2. **Dist nodes are the bottleneck** at 64-79% CPU. All 5 dist nodes show load avg 30-38 on 48-core machines.

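3. **Sub throughput scales with publisher diversity between Configs B and C**: 5K unique publishers gives 4K/s sub; 100K unique publishers gives 22-42K/s sub. This suggests dist-layer partitioning doesn't distribute load evenly when message sources are concentrated. Config A (430K publishers, 4-6K/s sub) does not follow this trend, so publisher count alone does not explain the ceiling.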

4. **Subscriber QoS 0 vs QoS 1 makes no difference**, so the bottleneck is in the dist-layer fanout/delivery pipeline, not in the QoS 1 PUBACK round trip between broker and subscriber.

5. **Payload size reduction (900B to 100B) did not improve sub throughput.**

6. **At low connection count (290K), with zero drops, sub is still capped at 22K/s** — the shared subscription delivery ceiling is independent of broker load.
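As a rough cross-check of observation 2, the load averages can be normalized by core count. This is only an approximation: Linux load average counts tasks in uninterruptible sleep as well as runnable ones, so it is not a pure CPU measure. The helper name is illustrative.

```python
def load_per_core(load_avg, cores):
    """Load average divided by core count (1.0 means every core is busy)."""
    return load_avg / cores


if __name__ == "__main__":
    CORES = 48  # dist node size from the Environment section
    for load in (30, 38):  # reported load average range
        print(f"load {load}: {load_per_core(load, CORES):.2f} per core")
```

The resulting range of about 0.63 to 0.79 per core lines up with the reported 64-79% CPU, and stays below 1.0, so the dist nodes look busy but not yet queueing heavily on CPU.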

## Questions

1. Is there a recommended way to tune shared subscription throughput for high fan-in scenarios (many publishers, single `$share` group, few hundred subscribers)?

2. Would increasing `dist_worker_fanout_parallelism` help? Default is `max(2, cpus/8) = 6` on 48-core machines — seems conservative for 400K+ msg/s.

3. Are there other system properties we should tune? (e.g. `data_plane_burst_latency_ms`, `dist_worker_inline_fanout_threshold`)

4. Is the single-threaded `DeliverExecutor` design intentional for ordering? Could it be relaxed for unordered shared subscriptions?
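To put question 2 in numbers, here is a small sketch of the default as quoted there, plus the per-worker rate it implies. Assumptions: the formula uses integer division, load is spread evenly across all 5 dist nodes, and each unit of parallelism behaves like one independent worker. None of this has been checked against the BifroMQ source, and the function names are hypothetical.

```python
def default_fanout_parallelism(cpus):
    """Default as quoted in question 2: max(2, cpus/8), integer division assumed."""
    return max(2, cpus // 8)


def per_worker_rate(total_msg_per_s, dist_nodes, parallelism):
    """Messages per second each fanout worker must handle under an even spread."""
    return total_msg_per_s / (dist_nodes * parallelism)


if __name__ == "__main__":
    p = default_fanout_parallelism(48)
    rate = per_worker_rate(400_000, dist_nodes=5, parallelism=p)
    print(f"parallelism={p}, about {rate:,.0f} msg/s per worker")
```

Under these assumptions each worker would need to sustain roughly 13K msg/s to keep up with 400K msg/s, and more if the load is skewed toward a subset of dist nodes.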

Detailed source code analysis and the full optimization history are in the comments below.

