
[RFC]: Optimizing Mooncake TCP Transport for High-Concurrency Throughput #1749

@yxj-97

Description

Changes proposed

1. Summary

This RFC proposes optimizations to Mooncake TCP Transport by introducing multiple io_contexts for parallel processing, a concurrency limiter, asynchronous connection establishment, and CUDA asynchronous copy, in order to improve throughput and GPU memory transfer efficiency under high-concurrency workloads.

2. Motivation

2.1 Current Architecture Analysis

The existing TCP Transport uses a single io_context and worker thread model:

[Diagram: current single io_context / single worker thread architecture]

2.2 Performance Bottlenecks

| Bottleneck | Cause | Impact |
|---|---|---|
| Single-threaded I/O | A single io_context is driven by a single worker thread, so I/O events cannot be processed in parallel across CPU cores; throughput is capped under high concurrency. | Low utilization of multicore CPUs |
| Synchronous connection establishment | getConnection waits synchronously for the TCP three-way handshake to complete. | Blocks the calling thread; connection latency propagates into transfer latency |
| No concurrency limit | Transfer tasks are submitted without restriction. | File descriptor exhaustion, memory pressure, and system instability during traffic spikes |
| Synchronous CUDA copy | cudaMemcpy blocks until the copy completes. | Blocks the I/O thread, preventing it from handling other socket events |
| Temporary buffers | Each CUDA transfer allocates and frees a temporary buffer via new/delete. | Memory allocation overhead and fragmentation |

2.3 Current Performance

In an environment where iperf3 measures 10.8 GB/s of available bandwidth, current performance is as follows:

| Scenario | Bandwidth | Bandwidth Utilization |
|---|---|---|
| DRAM transfer | 3.2 GB/s | 30% |
| VRAM transfer | 2.3 GB/s | 22% |

2.4 Design Goals

  • Fully utilize multicore CPUs to improve throughput under high-concurrency workloads
  • Make connection establishment asynchronous to eliminate blocking
  • Limit the number of concurrent transfers to keep resource usage controllable
  • Make GPU memory transfers non-blocking so they do not interfere with I/O threads

3. Proposed Design

3.1 Architecture Overview

[Diagram: proposed multi-io_context architecture overview]

3.2 Multi-IO Context Pool

3.2.1 Design Principles

In the current single io_context model, all socket events (accept, read, write) are handled by the same thread through io_context.run(). On multicore systems, this leads to:

  • One core being fully saturated while other cores remain idle
  • The internal ASIO queue becoming a contention hotspot

The proposed improvement is:

  • Create N independent io_contexts, each running in its own thread
  • Bind each io_context to an independent acceptor, and use SO_REUSEPORT to achieve kernel-level load balancing
  • Distribute client connections to different io_contexts in round-robin fashion

The number of I/O threads is configured via the MC_NUM_TCP_IO_THREADS environment variable.

3.2.2 Acceptor Load Balancing

Use SO_REUSEPORT to allow multiple sockets to bind to the same port, so the kernel can automatically distribute incoming connections across different acceptors:

[Diagram: SO_REUSEPORT acceptor load balancing]

3.3 ConcurrencyLimiter

3.3.1 Design Principles

Unrestricted concurrency may cause:

  • File descriptor exhaustion
  • Memory pressure, since each transfer consumes buffers
  • Overload of the remote peer

A soft flow-control mechanism is introduced through ConcurrencyLimiter:

[Diagram: ConcurrencyLimiter flow control]

3.3.2 Key Interfaces

| Method | Responsibility |
|---|---|
| submit(slice, fn) | If the limit is not exceeded, execute fn immediately; otherwise enqueue it |
| onTransferFinished(fn) | Decrement the active count, then try to dequeue and execute the next task |
| activeCount() | Return the number of currently active transfers, for monitoring |
| pendingCount() | Return the length of the waiting queue, for monitoring |

3.4 Async Connection Pool

3.4.1 Differences from the Current Implementation

| Feature | Existing Implementation | Proposed Solution |
|---|---|---|
| Acquisition method | Synchronous blocking getConnection() | Asynchronous asyncGetConnection(callback) |
| Connection failure handling | Return error directly | Retry with exponential backoff |
| Idle detection | Clean up during acquisition | Independent periodic cleanup timer |
| Statistics | None | Reuse/new-connection counters |

3.4.2 Retry Strategy

Failed connection attempts are retried using exponential backoff with random jitter.

3.4.3 GC Mechanism

The connection pool maintains a periodic timer to scan and clean up:

  • Connections that have been idle for longer than a configured threshold
  • Invalid connections that have already been closed

3.5 CUDA Async Copy Subsystem

3.5.1 Current Problems

In the existing implementation, Session::writeBody / readBody have the following issues:

if (isCudaMemory(addr)) {
    dram_buffer = new char[size];              // temporary staging buffer per transfer
    cudaMemcpy(dram_buffer, addr, size, ...);  // synchronous copy: blocks the I/O thread
    // ... async_write ...
    delete[] dram_buffer;                      // freed immediately after
}
  • Synchronous cudaMemcpy blocks the I/O thread, preventing it from handling other sockets during the copy
  • A temporary buffer is allocated and freed for each transfer, which incurs overhead

3.5.2 Proposed Solution

Pinned memory reuse:

  • Each Session holds a pinned buffer allocated via cudaMallocHost
  • The lifetime of the buffer is bound to the Session, avoiding repeated allocations and deallocations

Asynchronous copy + event polling:

  • Use asynchronous copy operations
  • Poll CUDA events to detect completion
[Diagram: CUDA async copy with event polling]
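Put together, the VRAM send path might look like the following sketch (a fragment, not a definitive implementation: `pinned_buf_`, `copy_done_`, `schedulePollLater`, and `startAsyncWrite` are assumed names, and error handling is elided):

```cpp
// Stage the device buffer into the Session's pinned buffer without blocking,
// then record an event on the same stream to mark copy completion.
cudaMemcpyAsync(pinned_buf_, device_addr, size,
                cudaMemcpyDeviceToHost, stream_);
cudaEventRecord(copy_done_, stream_);

// Instead of blocking on cudaEventSynchronize, the I/O thread polls the event
// and keeps serving other sockets in between.
void Session::pollCudaEvent() {
    cudaError_t st = cudaEventQuery(copy_done_);
    if (st == cudaErrorNotReady) {
        schedulePollLater();  // re-arm a short timer on the io_context
        return;
    }
    // st == cudaSuccess: the staging buffer is ready, start the socket write.
    startAsyncWrite(pinned_buf_, size_);
}
```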

4. Configuration

4.1 New Configuration Items

| Parameter | Environment Variable | Default | Description |
|---|---|---|---|
| num_tcp_io_threads | MC_NUM_TCP_IO_THREADS | 8 | Number of I/O threads (= number of io_contexts) |
| tcp_max_connections | MC_TCP_MAX_CONNECTIONS | 64 | Maximum number of concurrent transfers |

5. Compatibility

5.1 API Compatibility

Public APIs remain unchanged.

5.2 Behavioral Differences

| Behavior | Before | After |
|---|---|---|
| Number of I/O threads | 1 | N (configurable) |
| Connection acquisition | Synchronous | Asynchronous |
| CUDA copy | Synchronous and blocking | Asynchronous and non-blocking |
| Concurrency control | None | Upper limit enforced |

6. Performance

Benchmarking was performed using transfer_engine_bench.

6.1 Test Environment

| Item | Value |
|---|---|
| CPU | 23 cores |
| Network | single-node loopback 47 GB/s; cross-node 10.8 GB/s (iperf3) |
| Test configuration | io_threads=16, max_connections=64, io_size=1 MiB |

6.2 Test Results

6.2.1 Single-node Test

| Operation | Native | Optimized | Improvement |
|---|---|---|---|
| DRAM read | 6.73 GB/s | 46.24 GB/s | 6.9× |
| DRAM write | 6.71 GB/s | 45.39 GB/s | 6.8× |
| VRAM read | 2.36 GB/s | 29.02 GB/s | 12.3× |
| VRAM write | 2.36 GB/s | 29.25 GB/s | 12.4× |

6.2.2 Cross-node Test

| Operation | Native | Optimized | Improvement |
|---|---|---|---|
| DRAM read | 3.16 GB/s | 10.73 GB/s | 3.4× |
| DRAM write | 3.29 GB/s | 10.81 GB/s | 3.3× |
| VRAM read | 2.33 GB/s | 10.68 GB/s | 4.6× |
| VRAM write | 2.31 GB/s | 10.66 GB/s | 4.6× |

7. Rollout Plan

| Phase | Content |
|---|---|
| Phase 1 | Multi-IO Context + ConcurrencyLimiter |
| Phase 2 | Async Connection Pool |
| Phase 3 | CUDA Async Subsystem + Session adaptation |
