support kv quant/offload #1035

Merged
gushiqiao merged 8 commits into main from gsq/kvcache on Apr 23, 2026

Conversation

@gushiqiao
Contributor

No description provided.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a KV cache management system for autoregressive transformer inference, supporting rolling windows, quantization (int8/fp8), and CPU offloading via asynchronous CUDA streams. Key additions include a new KVCacheManager, specialized cache pools (e.g., OffloadQuantRollingKVCachePool), and Triton kernels for quantization and rescaling. The feedback flags critical device-portability issues: hardcoded device strings and global capability checks that could cause runtime errors in multi-GPU environments. It also raises a performance concern: calling .item() on GPU tensors inside the inference loop forces a synchronous GPU-to-CPU transfer.
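
For readers unfamiliar with the two ideas combined above, the following is a minimal pure-Python sketch of a rolling window whose entries are stored in symmetric int8 form. It is not the code from this PR: `RollingQuantCache`, `QuantizedToken`, `quantize_int8`, and `dequantize_int8` are hypothetical names chosen for illustration, and the per-vector scaling scheme is an assumption. The actual implementation reportedly operates on GPU tensors with Triton kernels and also covers fp8 and CPU offloading, none of which is modeled here.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class QuantizedToken:
    """One cached vector stored as int8 codes plus a float scale (hypothetical type)."""
    codes: list
    scale: float


def quantize_int8(vec):
    """Symmetric per-vector int8 quantization: x is approximated by code * scale."""
    max_abs = max((abs(x) for x in vec), default=0.0)
    # Use a scale of 1.0 for an all-zero vector to avoid dividing by zero.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(x / scale))) for x in vec]
    return QuantizedToken(codes, scale)


def dequantize_int8(tok):
    """Recover approximate float values from the int8 codes."""
    return [c * tok.scale for c in tok.codes]


class RollingQuantCache:
    """Toy rolling cache (assumption, not the PR's class): keeps the newest `window` entries."""

    def __init__(self, window):
        # A deque with maxlen silently evicts the oldest entry once full.
        self._slots = deque(maxlen=window)

    def append(self, vec):
        self._slots.append(quantize_int8(vec))

    def read(self):
        return [dequantize_int8(t) for t in self._slots]

    def __len__(self):
        return len(self._slots)


if __name__ == "__main__":
    cache = RollingQuantCache(window=2)
    for vec in ([1.0, -2.0], [0.5, 0.25], [4.0, 0.0]):
        cache.append(vec)
    print(len(cache))    # only the two newest vectors remain
    print(cache.read())  # dequantized approximations of those vectors
```

The rounding error of each element is bounded by half of that vector's scale, which is why the scale is stored next to the codes. In a GPU implementation the scale would normally stay on the device as a tensor rather than being pulled to the host with `.item()`, which is the synchronization cost the review points out.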

Comment thread lightx2v/common/kvcache/quant.py Outdated
Comment thread lightx2v/common/ops/attn/sage_attn.py Outdated
Comment thread lightx2v/common/kvcache/offload.py Outdated
Comment thread lightx2v/common/kvcache/rolling.py
gushiqiao and others added 3 commits April 23, 2026 12:17
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@gushiqiao gushiqiao merged commit d7ec87f into main Apr 23, 2026
2 checks passed
@gushiqiao gushiqiao deleted the gsq/kvcache branch April 23, 2026 04:21

2 participants