support kv quant/offload #1035
Conversation
Code Review
This pull request introduces a comprehensive KV cache management system for autoregressive transformer inference, supporting rolling windows, quantization (int8/fp8), and CPU offloading via asynchronous CUDA streams. Key additions include a new KVCacheManager, specialized cache pools (e.g., OffloadQuantRollingKVCachePool), and Triton kernels for efficient quantization and rescaling. The feedback identifies critical issues regarding device-agnostic code, specifically hardcoded device strings and global capability checks that could lead to runtime errors in multi-GPU environments. Additionally, there are performance concerns regarding synchronous CPU-GPU transfers caused by calling .item() on GPU tensors within the inference loop.
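The two issues the review raises can be illustrated together in a short sketch. This is not code from the PR: `advance_cache_position` and the variable names are hypothetical, chosen only to show the pattern of (a) deriving the device from an existing tensor instead of hardcoding a device string, and (b) keeping loop state on-device so no `.item()` call forces a synchronous CPU-GPU transfer inside the decode loop.

```python
import torch

def advance_cache_position(pos: torch.Tensor, window: int) -> torch.Tensor:
    """Advance a rolling-window write position without a host sync.

    `pos` stays a 0-dim tensor on whatever device the cache lives on,
    so no `.item()` (a blocking device-to-host copy) runs per step.
    """
    # On-device modular arithmetic; equivalent to (pos + 1) % window.
    return torch.remainder(pos + 1, window)

# Device-agnostic setup: take the device from an existing cache tensor
# rather than hardcoding "cuda" / "cuda:0", so the code also works on
# cpu or a non-default GPU in a multi-GPU setup.
kv_block = torch.zeros(4, 8)  # stand-in for one cache block
pos = torch.zeros((), dtype=torch.long, device=kv_block.device)

for _ in range(10):  # stand-in for the decode loop
    pos = advance_cache_position(pos, window=4)

# Read back once, after the hot loop, if a Python int is truly needed.
print(int(pos))  # -> 2
```

The same reasoning applies to capability checks: querying `torch.cuda.get_device_capability(tensor.device)` per device, rather than a single global check at import time, avoids picking the wrong kernel path (e.g. fp8 vs int8) on heterogeneous multi-GPU machines.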
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
No description provided.