
[RFC]: Optimizing Mooncake TCP Transport for High-Concurrency Throughput #1749

@yxj-97

Description

Changes proposed

1. Summary

This RFC proposes optimizations to Mooncake TCP Transport by introducing multiple io_contexts for parallel processing, a concurrency limiter, asynchronous connection establishment, and CUDA asynchronous copy, in order to improve throughput and GPU memory transfer efficiency under high-concurrency workloads.

2. Motivation

2.1 Current Architecture Analysis

The existing TCP Transport uses a single io_context and worker thread model:

[Diagram: current single io_context / single worker thread architecture]

2.2 Performance Bottlenecks

| Bottleneck | Cause | Impact |
|---|---|---|
| Single-threaded I/O | A single io_context is driven by a single worker thread, so I/O events cannot be processed in parallel across CPU cores; throughput is capped under high concurrency. | Low utilization of multicore CPUs |
| Synchronous connection establishment | getConnection waits synchronously for the TCP three-way handshake to complete. | Blocks the calling thread; connection latency propagates into transfer latency |
| No concurrency limit | Transfer tasks are submitted without restriction. | File descriptor exhaustion, memory pressure, and system instability during traffic spikes |
| Synchronous CUDA copy | cudaMemcpy blocks until the copy completes. | Blocks the I/O thread, preventing it from handling other socket events |
| Temporary buffers | Each CUDA transfer allocates and frees a temporary buffer via new/delete. | Memory allocation overhead and fragmentation |

2.3 Current Performance

In an environment where iperf3 measures 10.8 GB/s of available bandwidth, current performance is as follows:

| Scenario | Bandwidth | Bandwidth Utilization |
|---|---|---|
| DRAM transfer | 3.2 GB/s | 30% |
| VRAM transfer | 2.3 GB/s | 22% |

2.4 Design Goals

  • Fully utilize multicore CPUs to improve throughput under high-concurrency workloads
  • Make connection establishment asynchronous to eliminate blocking
  • Limit the number of concurrent transfers to keep resource usage controllable
  • Make GPU memory transfers non-blocking so they do not interfere with I/O threads

3. Proposed Design

3.1 Architecture Overview

[Diagram: proposed multi-io_context architecture overview]

3.2 Multi-IO Context Pool

3.2.1 Design Principles

In the current single io_context model, all socket events (accept, read, write) are handled by the same thread through io_context.run(). On multicore systems, this leads to:

  • One core being fully saturated while other cores remain idle
  • The internal ASIO queue becoming a contention hotspot

The proposed improvement is:

  • Create N independent io_contexts, each running in its own thread
  • Bind each io_context to an independent acceptor, and use SO_REUSEPORT to achieve kernel-level load balancing
  • Distribute client connections to different io_contexts in round-robin fashion

The number of I/O threads is configured via the MC_NUM_TCP_IO_THREADS environment variable.

3.2.2 Acceptor Load Balancing

Use SO_REUSEPORT to allow multiple sockets to bind to the same port, so the kernel can automatically distribute incoming connections across different acceptors:

[Diagram: SO_REUSEPORT acceptor load balancing]

3.3 ConcurrencyLimiter

3.3.1 Design Principles

Unrestricted concurrency may cause:

  • File descriptor exhaustion
  • Memory pressure, since each transfer consumes buffers
  • Overload of the remote peer

A soft flow-control mechanism is introduced through ConcurrencyLimiter:

[Diagram: ConcurrencyLimiter flow control]

3.3.2 Key Interfaces

| Method | Responsibility |
|---|---|
| submit(slice, fn) | If the limit is not exceeded, execute fn immediately; otherwise enqueue it |
| onTransferFinished(fn) | Decrement the active count, then try to dequeue and execute the next task |
| activeCount() | Return the number of currently active transfers, for monitoring |
| pendingCount() | Return the length of the waiting queue, for monitoring |

3.4 Async Connection Pool

3.4.1 Differences from the Current Implementation

| Feature | Existing Implementation | Proposed Solution |
|---|---|---|
| Acquisition method | Synchronous blocking getConnection() | Asynchronous asyncGetConnection(callback) |
| Connection failure handling | Return error directly | Retry with exponential backoff |
| Idle detection | Clean up during acquisition | Independent periodic cleanup timer |
| Statistics | None | Reuse/new-connection counters |

3.4.2 Retry Strategy

Failed connection attempts are retried using exponential backoff with random jitter.

3.4.3 GC Mechanism

The connection pool maintains a periodic timer to scan and clean up:

  • Connections that have been idle for longer than a configured threshold
  • Invalid connections that have already been closed

3.5 CUDA Async Copy Subsystem

3.5.1 Current Problems

In the existing implementation, Session::writeBody / readBody have the following issues:

if (isCudaMemory(addr)) {
    dram_buffer = new char[size];              // temporary staging buffer per transfer
    cudaMemcpy(dram_buffer, addr, size, ...);  // synchronous copy: blocks the I/O thread
    // ... async_write ...
    delete[] dram_buffer;                      // freed immediately after
}
  • Synchronous cudaMemcpy blocks the I/O thread, preventing it from handling other sockets during the copy
  • A temporary buffer is allocated and freed for each transfer, which incurs overhead

3.5.2 Proposed Solution

Pinned memory reuse:

  • Each Session holds a pinned buffer allocated via cudaMallocHost
  • The lifetime of the buffer is bound to the Session, avoiding repeated allocations and deallocations

Asynchronous copy + event polling:

  • Use asynchronous copy operations
  • Poll CUDA events to detect completion
[Diagram: CUDA async copy with event polling]
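Put together, the VRAM send path might look like the following sketch (a fragment, not a definitive implementation: `pinned_buf_`, `copy_done_`, `schedulePollLater`, and `startAsyncWrite` are assumed names, and error handling is elided):

```cpp
// Stage the device buffer into the Session's pinned buffer without blocking,
// then record an event on the same stream to mark copy completion.
cudaMemcpyAsync(pinned_buf_, device_addr, size,
                cudaMemcpyDeviceToHost, stream_);
cudaEventRecord(copy_done_, stream_);

// Instead of blocking on cudaEventSynchronize, the I/O thread polls the event
// and keeps serving other sockets in between.
void Session::pollCudaEvent() {
    cudaError_t st = cudaEventQuery(copy_done_);
    if (st == cudaErrorNotReady) {
        schedulePollLater();  // re-arm a short timer on the io_context
        return;
    }
    // st == cudaSuccess: the staging buffer is ready, start the socket write.
    startAsyncWrite(pinned_buf_, size_);
}
```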

4. Configuration

4.1 New Configuration Items

| Parameter | Environment Variable | Default | Description |
|---|---|---|---|
| num_tcp_io_threads | MC_NUM_TCP_IO_THREADS | 8 | Number of I/O threads (= number of io_contexts) |
| tcp_max_connections | MC_TCP_MAX_CONNECTIONS | 64 | Maximum number of concurrent transfers |

5. Compatibility

5.1 API Compatibility

Public APIs remain unchanged.

5.2 Behavioral Differences

| Behavior | Before | After |
|---|---|---|
| Number of I/O threads | 1 | N (configurable) |
| Connection acquisition | Synchronous | Asynchronous |
| CUDA copy | Synchronous and blocking | Asynchronous and non-blocking |
| Concurrency control | None | Upper limit enforced |

6. Performance

Benchmarking was performed using transfer_engine_bench.

6.1 Test Environment

| Item | Value |
|---|---|
| CPU | 23 cores |
| Network | single-node loopback 47 GB/s; cross-node 10.8 GB/s (iperf3) |
| Test configuration | io_threads=16, max_connections=64, io_size=1 MiB |

6.2 Test Results

6.2.1 Single-node Test

| Operation | Native | Optimized | Improvement |
|---|---|---|---|
| DRAM read | 6.73 GB/s | 46.24 GB/s | 6.9× |
| DRAM write | 6.71 GB/s | 45.39 GB/s | 6.8× |
| VRAM read | 2.36 GB/s | 29.02 GB/s | 12.3× |
| VRAM write | 2.36 GB/s | 29.25 GB/s | 12.4× |

6.2.2 Cross-node Test

| Operation | Native | Optimized | Improvement |
|---|---|---|---|
| DRAM read | 3.16 GB/s | 10.73 GB/s | 3.4× |
| DRAM write | 3.29 GB/s | 10.81 GB/s | 3.3× |
| VRAM read | 2.33 GB/s | 10.68 GB/s | 4.6× |
| VRAM write | 2.31 GB/s | 10.66 GB/s | 4.6× |

7. Rollout Plan

| Phase | Content |
|---|---|
| Phase 1 | Multi-IO Context + ConcurrencyLimiter |
| Phase 2 | Async Connection Pool |
| Phase 3 | CUDA Async Subsystem + Session adaptation |
