Changes proposed
1. Summary
This RFC proposes optimizations to Mooncake TCP Transport by introducing multiple io_contexts for parallel processing, a concurrency limiter, asynchronous connection establishment, and CUDA asynchronous copy, in order to improve throughput and GPU memory transfer efficiency under high-concurrency workloads.
2. Motivation
2.1 Current Architecture Analysis
The existing TCP Transport uses a single io_context and worker thread model: one io_context is created at startup and a single thread drives it via io_context.run(), so every accept, read, and write handler executes on that one thread.
2.2 Performance Bottlenecks
| Bottleneck | Cause | Impact |
|---|---|---|
| Single-threaded I/O | A single io_context is driven by a single worker thread, so multicore CPU resources cannot be used to process I/O events in parallel; throughput is capped under high concurrency. | Low utilization of multicore CPUs |
| Synchronous connection establishment | getConnection() waits synchronously for the TCP three-way handshake to complete. | Blocks the calling thread; connection latency propagates into transfer latency |
| No concurrency limit | Transfer tasks are submitted without restriction. | File descriptor exhaustion, memory pressure, and system instability during traffic spikes |
| Synchronous CUDA copy | cudaMemcpy blocks until the copy completes. | Blocks the I/O thread, preventing it from handling other socket events |
| Temporary buffers | Each CUDA transfer allocates and frees a temporary buffer via new/delete. | Memory allocation overhead and fragmentation |
2.3 Current Performance
In an environment where iperf3 measures 10.8 GB/s of available bandwidth, current performance is as follows:
| Scenario | Bandwidth | Bandwidth Utilization |
|---|---|---|
| DRAM transfer | 3.2 GB/s | 30% |
| VRAM transfer | 2.3 GB/s | 22% |
2.4 Design Goals
- Fully utilize multicore CPUs to improve throughput under high-concurrency workloads
- Make connection establishment asynchronous to eliminate blocking
- Limit the number of concurrent transfers to keep resource usage controllable
- Make GPU memory transfers non-blocking so they do not interfere with I/O threads
3. Proposed Design
3.1 Architecture Overview
3.2 Multi-IO Context Pool
3.2.1 Design Principles
In the current single io_context model, all socket events (accept, read, write) are handled by the same thread through io_context.run(). On multicore systems, this leads to:
- One core being fully saturated while other cores remain idle
- The internal ASIO queue becoming a contention hotspot
The proposed improvement is:
- Create N independent io_contexts, each running in its own thread
- Bind each io_context to an independent acceptor, and use SO_REUSEPORT to achieve kernel-level load balancing
- Distribute client connections to different io_contexts in round-robin fashion
Configure the number of I/O threads via the MC_NUM_TCP_IO_THREADS environment variable
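The round-robin distribution step can be sketched independently of ASIO. The `IoContextPool` name and layout below are illustrative, not the actual Mooncake API; in the real transport each slot would own an io_context and its worker thread:

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical pool: picks the next io_context index in round-robin order.
// An atomic counter makes the selection safe from any thread.
class IoContextPool {
public:
    explicit IoContextPool(std::size_t n) : size_(n), next_(0) {}

    // Thread-safe round-robin selection of the io_context for a new connection.
    std::size_t pickIndex() {
        return next_.fetch_add(1, std::memory_order_relaxed) % size_;
    }

private:
    std::size_t size_;
    std::atomic<std::size_t> next_;
};
```

Each new client connection is then handed to `pool[pickIndex()]`, so handlers for different connections run on different threads.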
3.2.2 Acceptor Load Balancing
Use SO_REUSEPORT to allow multiple sockets to bind to the same port, so the kernel can automatically distribute incoming connections across different acceptors:
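A minimal POSIX-level illustration of the option (plain sockets rather than ASIO acceptors; in ASIO this corresponds to setting a reuse-port socket option on each acceptor before bind):

```cpp
#include <arpa/inet.h>
#include <cstdint>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Create a listening socket with SO_REUSEPORT so several acceptors
// (one per io_context) can bind the same port; the kernel then
// load-balances incoming connections across them.
int makeReuseportListener(uint16_t port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;
    int on = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &on, sizeof(on));
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);
    if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0 ||
        listen(fd, SOMAXCONN) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Without SO_REUSEPORT the second bind to the same port would fail with EADDRINUSE; with it, all N acceptors coexist and the kernel spreads accepts among them.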
3.3 ConcurrencyLimiter
3.3.1 Design Principles
Unrestricted concurrency may cause:
- File descriptor exhaustion
- Memory pressure, since each transfer consumes buffers
- Overload of the remote peer
A soft flow-control mechanism is introduced through ConcurrencyLimiter:
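A simplified sketch of such a limiter (illustrative only; the real class also carries the slice argument and interacts with the ASIO handlers):

```cpp
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>

// Soft flow control: at most `limit` transfers run at once; excess
// submissions wait in a FIFO queue until a running transfer finishes.
class ConcurrencyLimiter {
public:
    explicit ConcurrencyLimiter(std::size_t limit) : limit_(limit) {}

    void submit(std::function<void()> fn) {
        {
            std::lock_guard<std::mutex> lk(mu_);
            if (active_ >= limit_) {        // over the limit: park the task
                pending_.push(std::move(fn));
                return;
            }
            ++active_;
        }
        fn();                               // under the limit: run immediately
    }

    void onTransferFinished() {
        std::function<void()> next;
        {
            std::lock_guard<std::mutex> lk(mu_);
            if (pending_.empty()) { --active_; return; }
            next = std::move(pending_.front());  // hand the slot to a waiter
            pending_.pop();
        }
        next();
    }

    std::size_t activeCount() const {
        std::lock_guard<std::mutex> lk(mu_);
        return active_;
    }
    std::size_t pendingCount() const {
        std::lock_guard<std::mutex> lk(mu_);
        return pending_.size();
    }

private:
    mutable std::mutex mu_;
    std::size_t limit_;
    std::size_t active_ = 0;
    std::queue<std::function<void()>> pending_;
};
```

Note that a finished transfer hands its slot directly to the oldest waiter, so the active count never exceeds the limit and queued tasks run in submission order.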
3.3.2 Key Interfaces
| Method | Responsibility |
|---|---|
| submit(slice, fn) | If the limit is not exceeded, execute fn immediately; otherwise enqueue it |
| onTransferFinished(fn) | Decrement the active count and try to dequeue and execute the next task |
| activeCount() | Return the number of currently active transfers, for monitoring |
| pendingCount() | Return the length of the waiting queue, for monitoring |
3.4 Async Connection Pool
3.4.1 Differences from the Current Implementation
| Feature | Existing Implementation | Proposed Solution |
|---|---|---|
| Acquisition method | Synchronous blocking getConnection() | Asynchronous asyncGetConnection(callback) |
| Connection failure handling | Return error directly | Retry with exponential backoff |
| Idle detection | Clean up during acquisition | Independent periodic cleanup timer |
| Statistics | None | Provide reuse/new connection counters |
3.4.2 Retry Strategy
Exponential backoff with random jitter.
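A sketch of the delay computation under assumed parameters (50 ms base, doubling per attempt, 2 s cap, ±20% uniform jitter; the actual constants are implementation-defined):

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <random>

// Exponential backoff: delay = min(base * 2^attempt, cap), plus up to
// +/- 20% random jitter so simultaneous retries do not synchronize.
std::chrono::milliseconds retryDelay(uint32_t attempt, std::mt19937& rng) {
    const int64_t base_ms = 50;
    const int64_t cap_ms = 2000;
    int64_t delay = std::min(base_ms << std::min<uint32_t>(attempt, 20), cap_ms);
    std::uniform_int_distribution<int64_t> jitter(-delay / 5, delay / 5);
    return std::chrono::milliseconds(delay + jitter(rng));
}
```

The jitter matters under traffic spikes: without it, every connection that failed at the same instant would retry at the same instant too.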
3.4.3 GC Mechanism
The connection pool maintains a periodic timer to scan and clean up:
- Connections that have been idle for longer than a configured threshold
- Invalid connections that have already been closed
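The sweep itself reduces to filtering the pool by last-use timestamp. The `Conn` struct below is a hypothetical stand-in for a pooled connection entry; in the transport this logic would run from a periodic timer:

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

using Clock = std::chrono::steady_clock;

// Minimal stand-in for a pooled connection entry.
struct Conn {
    Clock::time_point last_used;
    bool open;
};

// Remove entries that are closed or idle longer than `idle_timeout`.
// Returns the number of connections reclaimed.
std::size_t sweepIdle(std::vector<Conn>& pool, Clock::duration idle_timeout,
                      Clock::time_point now) {
    std::size_t before = pool.size();
    pool.erase(std::remove_if(pool.begin(), pool.end(),
                              [&](const Conn& c) {
                                  return !c.open ||
                                         now - c.last_used > idle_timeout;
                              }),
               pool.end());
    return before - pool.size();
}
```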
3.5 CUDA Async Copy Subsystem
3.5.1 Current Problems
In the existing implementation, Session::writeBody / readBody have the following issues:
```cpp
if (isCudaMemory(addr)) {
    char* dram_buffer = new char[size];       // temporary staging buffer
    cudaMemcpy(dram_buffer, addr, size,
               cudaMemcpyDeviceToHost);       // synchronous: blocks the I/O thread
    // ... async_write(dram_buffer) ...
    delete[] dram_buffer;                     // freed after every transfer
}
```
- Synchronous cudaMemcpy blocks the I/O thread, preventing it from handling other sockets during the copy
- A temporary buffer is allocated and freed for each transfer, which incurs overhead
3.5.2 Proposed Solution
Pinned memory reuse:
- Each Session holds a pinned buffer allocated via cudaMallocHost
- The lifetime of the buffer is bound to the Session, avoiding repeated allocations and deallocations
Asynchronous copy + event polling:
- Use asynchronous copy operations
- Poll CUDA events to detect completion
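A sketch of the device-to-host path under these assumptions: the pinned buffer is owned by the Session, each copy gets a stream and an event, and error handling is elided. This is illustrative, not the actual Session code:

```cpp
// One-time setup, bound to the Session's lifetime.
char* pinned = nullptr;
cudaMallocHost(&pinned, kBufSize);   // reusable pinned staging buffer
cudaStream_t stream;
cudaStreamCreate(&stream);
cudaEvent_t done;
cudaEventCreate(&done);

// Per transfer: enqueue the copy without blocking the I/O thread.
cudaMemcpyAsync(pinned, device_addr, size, cudaMemcpyDeviceToHost, stream);
cudaEventRecord(done, stream);

// Later, from the event-polling loop: check completion without blocking.
if (cudaEventQuery(done) == cudaSuccess) {
    // Copy finished; hand `pinned` to async_write on the socket.
}
```

Pinned (page-locked) memory is what lets cudaMemcpyAsync actually overlap with host work; with pageable memory the copy degrades to a synchronous one.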
4. Configuration
4.1 New Configuration Items
| Parameter | Environment Variable | Default | Description |
|---|---|---|---|
| num_tcp_io_threads | MC_NUM_TCP_IO_THREADS | 8 | Number of I/O threads (= number of io_contexts) |
| tcp_max_connections | MC_TCP_MAX_CONNECTIONS | 64 | Maximum number of concurrent transfers |
5. Compatibility
5.1 API Compatibility
Public APIs remain unchanged.
5.2 Behavioral Differences
| Behavior | Before | After |
|---|---|---|
| Number of I/O threads | 1 | N (configurable) |
| Connection acquisition | Synchronous | Asynchronous |
| CUDA copy | Synchronous and blocking | Asynchronous and non-blocking |
| Concurrency control | None | With an upper limit |
6. Performance
Benchmarking was performed using transfer_engine_bench.
6.1 Test Environment
| Item | Value |
|---|---|
| CPU | 23 cores |
| Network | single-node loopback 47 GB/s, cross-node 10.8 GB/s (iperf3) |
| Test configuration | io_threads=16, max_connections=64, io_size=1MiB |
6.2 Test Results
6.2.1 Single-node Test
| Memory | Operation | Native | Optimized | Improvement |
|---|---|---|---|---|
| DRAM | read | 6.73 GB/s | 46.24 GB/s | 6.9× |
| DRAM | write | 6.71 GB/s | 45.39 GB/s | 6.8× |
| VRAM | read | 2.36 GB/s | 29.02 GB/s | 12.3× |
| VRAM | write | 2.36 GB/s | 29.25 GB/s | 12.4× |
6.2.2 Cross-node Test
| Memory | Operation | Native | Optimized | Improvement |
|---|---|---|---|---|
| DRAM | read | 3.16 GB/s | 10.73 GB/s | 3.4× |
| DRAM | write | 3.29 GB/s | 10.81 GB/s | 3.3× |
| VRAM | read | 2.33 GB/s | 10.68 GB/s | 4.6× |
| VRAM | write | 2.31 GB/s | 10.66 GB/s | 4.6× |
7. Rollout Plan
| Phase | Content |
|---|---|
| Phase 1 | Multi-IO Context + ConcurrencyLimiter |
| Phase 2 | Async Connection Pool |
| Phase 3 | CUDA Async Subsystem + Session adaptation |