BlueCache enables high-throughput, CPU-bypass data movement between Host GPU memory and BlueField DPU memory/storage. It is implemented as two companion components:
- NIXL backend plugin (`BLUE_CACHE`): runs on the host, inside a NIXL application.
- DPU agent (`blue-cache`): runs on the BlueField DPU ARM cores.
             Host                                              DPU
┌─────────────────────────────┐                  ┌─────────────────────────────┐
│ Application (LMCache etc.)  │                  │ blue-cache agent            │
│         │                   │                  │   ┌─────────────────────┐   │
│         ▼                   │    DOCA Comch    │   │ DOCA DMA engine     │   │
│  ┌─────────────┐            │      or TCP      │   │  ┌───────────────┐  │   │
│  │ NIXL agent  │            │◄────────────────►│   │  │ staging buffer │  │   │
│  │  + BLUE_    │            │     control      │   │  └───────┬───────┘  │   │
│  │    CACHE    │            │     messages     │   │          │          │   │
│  └──────┬──────┘            │                  │   │          ▼          │   │
│         │                   │                  │   │  ┌───────────────┐  │   │
│         ▼ VRAM_SEG          │                  │   │  │ NIXL storage  │  │   │
│  ┌─────────────┐            │                  │   │  │ backend       │  │   │
│  │ GPU buffer  │            │     DOCA DMA     │   │  │ (POSIX/OBJ)   │  │   │
│  │ (exported   │◄───────────┼─────────────────►│   │  └───────┬───────┘  │   │
│  │  via PCI)   │            │                  │   │          │          │   │
│  └─────────────┘            │                  │   │          ▼          │   │
│                             │                  │   │  ┌───────────────┐  │   │
│                             │                  │   │  │ local storage │  │   │
│                             │                  │   │  └───────────────┘  │   │
│                             │                  │   └─────────────────────┘   │
└─────────────────────────────┘                  └─────────────────────────────┘
The plugin exposes two NIXL memory types:
- `VRAM_SEG`: host GPU memory. Registered buffers are exported via `doca_mmap_export_pci()` so the DPU can import them.
- `OBJ_SEG`: DPU-resident object. The `metaInfo` field carries the object path or key; the actual I/O happens on the DPU.
The backend is local-only (`supportsRemote() == false`). Both source and destination must be reachable through the same host-side BlueField PCI function.
The host plugin and DPU agent communicate through a small request/response protocol defined in `common/include/dma_transfer.h`.
Two transports are supported:
- DOCA Communication Channel (Comch): the default. Messages travel over DOCA Comch, so no reachable DPU management IP is required.
- TCP: fallback on port `18517`. Used when Comch is unavailable or misconfigured.
The control plane carries only metadata: operation direction, file/object path, remote GPU address, PCI export descriptor, and status. Bulk data never traverses the control plane.
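To make the "metadata only" point concrete, here is a minimal sketch of what a control request could look like. Every name, field size, and the `length` field are assumptions made for illustration; the real layout lives in `dma_transfer.h` and is not reproduced here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names and sizes; NOT taken from dma_transfer.h. */
#define SKETCH_PATH_MAX        256
#define SKETCH_EXPORT_DESC_MAX 512

enum sketch_direction { SKETCH_DIR_PUSH = 0, SKETCH_DIR_PULL = 1 };

/* A control request holds descriptors of the transfer, never the payload. */
struct sketch_ctrl_request {
    uint32_t direction;                           /* push or pull */
    uint64_t remote_gpu_addr;                     /* address of the exported GPU buffer */
    uint64_t length;                              /* assumed field: bytes to move via DMA */
    uint32_t export_desc_len;                     /* valid bytes in export_desc */
    uint8_t  export_desc[SKETCH_EXPORT_DESC_MAX]; /* opaque PCI export descriptor */
    char     path[SKETCH_PATH_MAX];               /* file or object path, NUL-terminated */
};

/* Fill a request. Returns 0 on success, -1 if the path or descriptor does not fit. */
int sketch_build_request(struct sketch_ctrl_request *req,
                         enum sketch_direction dir,
                         const char *path,
                         uint64_t gpu_addr, uint64_t length,
                         const void *export_desc, uint32_t export_desc_len)
{
    assert(req != NULL && path != NULL);

    size_t path_len = strlen(path);
    if (path_len >= SKETCH_PATH_MAX || export_desc_len > SKETCH_EXPORT_DESC_MAX)
        return -1;

    memset(req, 0, sizeof(*req));
    req->direction = (uint32_t)dir;
    req->remote_gpu_addr = gpu_addr;
    req->length = length;
    req->export_desc_len = export_desc_len;
    if (export_desc_len > 0)
        memcpy(req->export_desc, export_desc, export_desc_len);
    memcpy(req->path, path, path_len + 1); /* include the terminating NUL */
    return 0;
}
```

The request size is fixed and independent of `length`: a multi-gigabyte transfer and a 4 KiB transfer produce the same number of control-plane bytes.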
For a write (`NIXL_WRITE`, GPU → DPU storage):
- Host plugin creates a DOCA mmap over the GPU buffer and exports it.
- Host plugin sends a batch of `DMA_REQ_BATCH_PUSH` requests.
- DPU agent imports the remote mmap.
- DPU agent DMAs chunks from GPU into its pre-allocated staging buffer.
- DPU agent writes chunks to storage via its NIXL storage backend.
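The chunking step in the write path can be sketched as simple offset arithmetic. This is an illustration only: the function names are hypothetical, and it assumes each chunk is at most one fixed `chunk_size` (for example, the size of a staging slot), which the text does not specify.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper types; not part of the BlueCache sources. */
struct sketch_chunk {
    uint64_t offset; /* byte offset into the GPU buffer and the target object */
    uint64_t length; /* bytes covered by this chunk */
};

/* Number of chunks needed to cover total_len bytes (rounds up). */
uint64_t sketch_chunk_count(uint64_t total_len, uint64_t chunk_size)
{
    assert(chunk_size > 0);
    return total_len / chunk_size + (total_len % chunk_size != 0);
}

/* Offset and length of chunk `index`; only the final chunk may be short. */
struct sketch_chunk sketch_chunk_at(uint64_t total_len, uint64_t chunk_size,
                                    uint64_t index)
{
    assert(index < sketch_chunk_count(total_len, chunk_size));

    struct sketch_chunk c;
    c.offset = index * chunk_size;
    uint64_t remaining = total_len - c.offset;
    c.length = remaining < chunk_size ? remaining : chunk_size;
    return c;
}
```

For a 10 MiB buffer and 4 MiB chunks this yields three chunks, the last one covering the final 2 MiB. Each chunk would then be moved by DMA into staging and written to storage at the same offset.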
For a read (`NIXL_READ`, DPU storage → GPU):
- Host plugin queries the object size (`DMA_REQ_PULL_INFO`).
- Host plugin exports a writable GPU mmap and sends `DMA_REQ_BATCH_PULL`.
- DPU agent uses a pipelined reader/DMA worker to overlap storage reads with DMA back to GPU.
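Why overlapping helps can be shown with a toy, single-threaded model of the read pipeline. It is not the agent's implementation: it assumes a storage read and a DMA each take one time step, and that the reader may run at most `depth` chunks ahead of the DMA worker. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Count the time steps needed to read n_chunks from storage and DMA them to
 * the GPU when both stages can run in the same step on different chunks.
 */
size_t sketch_pipeline_steps(size_t n_chunks, size_t depth)
{
    assert(depth > 0);

    size_t read_done = 0; /* chunks fully read into staging */
    size_t dma_done = 0;  /* chunks fully copied to the GPU */
    size_t steps = 0;

    while (dma_done < n_chunks) {
        /* Both decisions use the state at the start of the step, so a chunk
         * is never read and DMA'd within the same step. */
        bool can_dma = dma_done < read_done;
        bool can_read = read_done < n_chunks && (read_done - dma_done) < depth;

        if (can_dma)
            dma_done++;
        if (can_read)
            read_done++;
        steps++;
    }
    return steps;
}
```

With a single staging slot the two stages alternate, so 4 chunks take 8 steps. With two or more slots the stages overlap and the same 4 chunks take 5 steps (one step to fill the pipeline, then one per chunk).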
The DPU agent (`blue-cache/src/blue_cache_agent.c`) maintains:
- A single DOCA DMA device/context/progress engine.
- A reusable staging buffer registered once at startup.
- A slot pool sized by `queue_depth`.
- A pluggable storage backend (`storage_backend.cpp`) implemented on top of NIXL:
  - `posix_storage_backend`: local files via NIXL `FILE_SEG`.
  - `xdfs_storage_backend`: object storage via NIXL `OBJ_SEG` (optional).
  - `xdfs_kv_storage_backend`: object storage with key validation (optional).
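One way a slot pool over a single staging buffer might work is sketched below. This is an assumption-laden illustration, not the agent's code: it assumes the staging buffer is split evenly into `queue_depth` slots and that free slots are tracked on a small stack. All identifiers are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_SLOTS 64 /* hypothetical upper bound on queue_depth */

struct sketch_slot_pool {
    size_t free_ids[SKETCH_MAX_SLOTS]; /* stack of free slot indices */
    size_t free_count;                 /* number of valid entries in free_ids */
    size_t queue_depth;                /* total number of slots */
    size_t slot_size;                  /* bytes of staging buffer per slot */
};

/* Split a staging buffer of staging_size bytes into queue_depth equal slots. */
int sketch_pool_init(struct sketch_slot_pool *p, size_t queue_depth,
                     size_t staging_size)
{
    if (queue_depth == 0 || queue_depth > SKETCH_MAX_SLOTS)
        return -1;
    p->queue_depth = queue_depth;
    p->slot_size = staging_size / queue_depth;
    p->free_count = queue_depth;
    for (size_t i = 0; i < queue_depth; i++)
        p->free_ids[i] = queue_depth - 1 - i; /* slot 0 ends up on top */
    return 0;
}

/* Take a free slot; returns its index, or -1 when every slot is in flight. */
long sketch_pool_acquire(struct sketch_slot_pool *p)
{
    if (p->free_count == 0)
        return -1;
    return (long)p->free_ids[--p->free_count];
}

/* Return a slot to the pool once its DMA and storage I/O have completed. */
void sketch_pool_release(struct sketch_slot_pool *p, size_t slot)
{
    assert(slot < p->queue_depth && p->free_count < p->queue_depth);
    p->free_ids[p->free_count++] = slot;
}

/* Byte offset of a slot inside the staging buffer. */
size_t sketch_slot_offset(const struct sketch_slot_pool *p, size_t slot)
{
    return slot * p->slot_size;
}
```

Because the staging buffer is registered once, acquiring a slot is only index bookkeeping; a failed acquire is the natural back-pressure signal when `queue_depth` operations are already in flight.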
The host plugin (`nixl-plugin/src/blue_cache_backend.cpp`) implements the NIXL backend engine interface:
- `registerMem`/`deregisterMem`: manage DOCA mmaps for GPU buffers and object descriptors.
- `prepXfer`/`postXfer`: validate descriptor pairs and spawn a worker thread that sends batched control requests.
- `checkXfer`: poll the asynchronous request handle state.
Transfers are asynchronous: `postXfer` returns `NIXL_IN_PROG` and a background worker drives the control channel.
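The post-then-poll pattern can be sketched with a handle whose state is an atomic integer. This is a simplified stand-in, not the NIXL API: the `sketch_*` names and status values are hypothetical, and the worker thread is represented by a plain function call so the example stays self-contained.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical status codes standing in for the backend's real ones. */
enum sketch_status { SKETCH_SUCCESS = 0, SKETCH_IN_PROG = 1, SKETCH_ERR = -1 };

struct sketch_xfer_handle {
    atomic_int state; /* written by the worker, read by the polling caller */
};

void sketch_handle_init(struct sketch_xfer_handle *h)
{
    assert(h != NULL);
    atomic_init(&h->state, SKETCH_IN_PROG);
}

/* Post a transfer: mark it in progress and return immediately. */
enum sketch_status sketch_post_xfer(struct sketch_xfer_handle *h)
{
    atomic_store(&h->state, SKETCH_IN_PROG);
    /* A real backend would hand the request to a worker thread here. */
    return SKETCH_IN_PROG;
}

/* Called by the worker when the control exchange has finished. */
void sketch_worker_finish(struct sketch_xfer_handle *h, int ok)
{
    atomic_store(&h->state, ok ? SKETCH_SUCCESS : SKETCH_ERR);
}

/* Non-blocking poll of the handle state. */
enum sketch_status sketch_check_xfer(struct sketch_xfer_handle *h)
{
    return (enum sketch_status)atomic_load(&h->state);
}
```

A caller would post once and then call the check function until it stops returning the in-progress value; the atomic state lets the worker publish completion without a lock.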
`common/include/dma_transfer.h` defines:
- Magic number `DMA_TRANSFER_MAGIC` (`0x44545246`, ASCII "DTRF").
- Protocol version `DMA_TRANSFER_VERSION`.
- Request/response structs, including the batched segment layout.
Both the host plugin and DPU agent must use the same protocol version. The new project keeps this file as a single source of truth.
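A receiver would typically reject any message whose magic or version does not match before touching the rest of the payload. The sketch below shows that check under stated assumptions: the magic value comes from the text above, while the header layout, the version value `1`, and all `sketch_*` names are placeholders, since the real definitions in `dma_transfer.h` are not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DMA_TRANSFER_MAGIC      0x44545246u /* "DTRF", value given in the text */
#define SKETCH_PROTOCOL_VERSION 1u          /* placeholder; real version not shown */

/* Hypothetical header layout: magic first, then version. */
struct sketch_msg_header {
    uint32_t magic;
    uint32_t version;
};

enum sketch_hdr_result {
    SKETCH_HDR_OK = 0,
    SKETCH_HDR_BAD_MAGIC = -1,
    SKETCH_HDR_BAD_VERSION = -2
};

/* Validate a received header before interpreting the rest of the message. */
enum sketch_hdr_result sketch_check_header(const struct sketch_msg_header *h)
{
    assert(h != NULL);
    if (h->magic != DMA_TRANSFER_MAGIC)
        return SKETCH_HDR_BAD_MAGIC;
    if (h->version != SKETCH_PROTOCOL_VERSION)
        return SKETCH_HDR_BAD_VERSION;
    return SKETCH_HDR_OK;
}

/* Spell out the magic, most significant byte first, as a NUL-terminated string. */
void sketch_magic_to_ascii(uint32_t magic, char out[5])
{
    out[0] = (char)((magic >> 24) & 0xFF);
    out[1] = (char)((magic >> 16) & 0xFF);
    out[2] = (char)((magic >> 8) & 0xFF);
    out[3] = (char)(magic & 0xFF);
    out[4] = '\0';
}
```

Checking the magic first distinguishes "not a BlueCache peer at all" from "a BlueCache peer built against a different header", which makes version mismatches between host and DPU easier to diagnose.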