Update dependency cache_dit to v1.5.0 #33
Open
renovate[bot] wants to merge 1 commit into
This PR contains the following updates:
Package: cache_dit, Change: `==1.1.10` → `==1.5.0`
Release Notes
vipshop/cache-dit (cache_dit)
v1.5.0: Major Release
🚀 Cache-DiT v1.5.0 Release Notes
Release Date: 2026-06-16
Comparison: v1.3.0...v1.5.0 (176 commits)
Full Changelog: GitHub Releases
📋 Overview
Cache-DiT v1.5.0 is a major feature release spanning 3 months (2026-03-12 ~ 2026-06-16), covering 176 PRs. This release is organized around four core modules: SVDQuant W4A4 Quantization (PTQ/DQ/NVFP4/Converter CLI), DMD Calibrator (exponential-basis Dynamic Mode Decomposition), Bucket-style Layerwise CPU Offload (compute-communication overlapped offloading), and Ray Wrapper (transparent distributed inference). Additional highlights include FLUX.2-klein-kv series support, full CUDA Graph integration, extensive kernel refactoring, quantization enhancements, parallelism framework improvements, and comprehensively updated documentation.
✨ Core Highlights
1. 💎 SVDQuant W4A4 Quantization
Cache-DiT v1.5.0 natively integrates a complete SVDQuant W4A4 quantization workflow, the most significant feature of this release. Instead of relying on third-party libraries, users can now perform end-to-end W4A4 quantization directly through Cache-DiT's `cache_dit.quantize()` / `cache_dit.load()` API.
📊 PTQ (Post-Training Quantization)
Supports `svdq_int4_r{rank}` and `svdq_nvfp4_r{rank}` quant types:
- `calibrate_fn` → SVD low-rank decomposition → INT4 packing → runtime W4A4 GEMM.
- Supports three calibration precision levels: `low` (recommended default, ~18× speedup), `medium`, `high`.
- Serialize to `{quant_type}.safetensors` + `quant_config.json`; restore via `cache_dit.load()`.
- `runtime_kernel="v1"` is supported for NVFP4.
Performance (FLUX.2-klein-4B, 1024×1024, L20):
NVFP4 Performance (RTX 5090, FLUX.2-klein-4B, 1024×1024):
⚡ DQ (Dynamic Quantization)
Zero-calibration quantization via `_dq` suffix types (e.g., `svdq_int4_r128_dq`):
- [...] (`auto` / `stable_auto` / `power` / `log` / `rank` / `top` / `fixed`).
- Supports `few_shot_auto_compile` for deferred compilation after quantization.
DQ Performance (FLUX.2-klein-4B, 1024×1024, identity, rank=128): 1.28s, PSNR 28.71 dB.
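The PTQ pipeline above names an SVD low-rank decomposition followed by INT4 packing. The following is a minimal NumPy sketch of those two steps only, not Cache-DiT's kernel or storage layout. It assumes the low-rank branch stays in high precision and only the residual is quantized, with symmetric per-row scales and two 4-bit values packed per byte. The function names (`svd_lowrank_int4`, `pack_int4`, `unpack_int4`) are hypothetical, and calibration and smoothing are omitted.

```python
import numpy as np


def svd_lowrank_int4(weight, rank):
    """Split weight into a rank-`rank` branch plus an INT4-quantized residual."""
    u, s, vh = np.linalg.svd(weight, full_matrices=False)
    l1 = u[:, :rank] * s[:rank]        # (out, rank), singular values folded in
    l2 = vh[:rank]                     # (rank, in)
    residual = weight - l1 @ l2
    # Symmetric per-row scale so the largest |value| in each row maps to 7.
    scale = np.abs(residual).max(axis=1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0            # avoid dividing by zero on all-zero rows
    q = np.clip(np.round(residual / scale), -8, 7).astype(np.int8)
    return l1, l2, q, scale


def pack_int4(q):
    """Pack pairs of signed 4-bit values (stored as int8) into single bytes."""
    nibbles = (q & 0x0F).astype(np.uint8)      # two's-complement low nibble
    return nibbles[:, 0::2] | (nibbles[:, 1::2] << 4)


def unpack_int4(packed):
    """Inverse of pack_int4: recover the signed 4-bit values as int8."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = (packed >> 4).astype(np.int8)
    out = np.empty((packed.shape[0], packed.shape[1] * 2), dtype=np.int8)
    out[:, 0::2] = np.where(lo > 7, lo - 16, lo)
    out[:, 1::2] = np.where(hi > 7, hi - 16, hi)
    return out


rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16))               # toy weight, even input width
l1, l2, q, scale = svd_lowrank_int4(w, rank=4)
packed = pack_int4(q)                          # half as many bytes as q
w_hat = l1 @ l2 + unpack_int4(packed) * scale  # dequantized reconstruction
```

Because each row's scale maps its largest residual to 7, no value is clipped and the per-element reconstruction error is at most half a quantization step. The sketch requires an even number of input columns so that values pair up into bytes.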
🔧 SVDQ Converter CLI
New `cache-dit-convert` command-line tool for one-click model conversion to SVDQ W4A4.
Supports INT4/NVFP4 DQ conversion, custom smooth strategies, and multiple precision options. Outputs `{quant_type}.safetensors` + `quant_config.json`.
🔀 Fused MLP
New `fused_gelu_mlp` / `fused_gelu_proj` passes (enable via `svdq_kwargs["fused_mlp"]=True`), fusing the first GEMM + GELU activation into a single kernel for reduced kernel launch overhead and intermediate activation memory.
🔗 Parallelism Compatibility (Cache-DiT Exclusive)
SVDQuant in Cache-DiT fully supports Context Parallelism (Ulysses / Ring / USP) and Tensor Parallelism (PyTorch DTensor). Users can layer quantization on top of distributed parallelism for extreme memory compression and throughput gains.
`SVDQW4A4ShardLinear` (`dtensor.py`) provides native TP sharding support. This is a differentiating capability of Cache-DiT's SVDQuant implementation versus other W4A4 solutions.
⚙️ Quantization Configuration Enhancements
- `regional_quantize=True` + `repeated_blocks`: Quantize only repeated transformer blocks, keeping embedding layers at full precision.
- `precision_plan`: Assign different quant types to different sub-layers by name pattern.
- `per_tensor_fallback=True` (default on): Auto-fallback from per-row to per-tensor for TP-incompatible layers.
- `QuantizeBackend` enum (AUTO / TORCHAO / CACHE_DIT / NONE).
- `QuantizeConfig` + `svdq_kwargs`.
📦 cache-dit-cu13 Pre-built Wheel
Pre-compiled SVDQuant wheel for CUDA 13 users: `pip install cache-dit-cu13` (no source build needed).
2. 💾 Bucket-style Layerwise CPU Offload
Cache-DiT v1.5.0 introduces a novel bucket-style layerwise offload mechanism that overcomes the "every layer waits for its own transfer" inefficiency of traditional approaches.
Core Design:
- `transfer_buckets`, `persistent_buckets`, `persistent_bins`, `prefetch_limit`, `max_copy_streams`, `max_inflight_prefetch_bytes`.
Performance (FLUX.1-dev, L20):
Only ~1.2s added latency versus no-offload baseline for a 38→16 GiB memory reduction.
torch.compile Compatible: Apply offload before compile; offload hooks work correctly after compilation.
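To make the bucket idea concrete, here is a small pure-Python sketch of the scheduling logic only: consecutive layers are grouped into buckets, and copies for upcoming buckets are issued ahead of the bucket being computed. It is an illustration under stated assumptions, not Cache-DiT's implementation. It assumes `transfer_buckets` acts as a prefetch depth and `persistent_buckets` as the number of leading buckets kept resident on the GPU. The helper names and the `bucket_bytes` budget are hypothetical.

```python
def make_buckets(layer_bytes, bucket_bytes):
    """Greedily group consecutive layers so each bucket fits in bucket_bytes."""
    buckets, current, current_size = [], [], 0
    for index, size in enumerate(layer_bytes):
        if current and current_size + size > bucket_bytes:
            buckets.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        buckets.append(current)
    return buckets


def prefetch_schedule(num_buckets, transfer_buckets=1, persistent_buckets=0):
    """For each compute step i, list the buckets whose host-to-device copy is
    issued just before bucket i runs (bucket i itself plus the look-ahead)."""
    issued = set(range(persistent_buckets))   # resident buckets never transfer
    schedule = []
    for i in range(num_buckets):
        wanted = range(i, min(i + transfer_buckets + 1, num_buckets))
        new_copies = [b for b in wanted if b not in issued]
        issued.update(new_copies)
        schedule.append(new_copies)
    return schedule


buckets = make_buckets([4, 4, 4, 4, 4], bucket_bytes=8)
plan = prefetch_schedule(len(buckets) + 1, transfer_buckets=1, persistent_buckets=1)
```

With a look-ahead of one bucket, the copy for bucket `i + 1` is issued while bucket `i` computes, which is the overlap that avoids each layer waiting on its own transfer. A real implementation would also evict buckets after use and cap in-flight bytes.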
CLI Quick Start:
3. 🌩️ Ray Wrapper (Transparent Distributed Inference)
The Ray Wrapper makes distributed inference completely transparent to user code. No `torchrun`, no `dist.init_process_group`, no manual model sharding: just set `use_ray=True`, and Cache-DiT handles everything.
Two Wrapper Levels:
Key Features:
- `ray_transfer_fn`: User-defined per-worker model loading, bypassing serialization overhead and custom module class resolution issues.
- `ray_use_compile`: Automatic per-worker compilation.
- `ray_runtime_env`: Custom module import handling via `PYTHONPATH`.
Performance (FLUX.2-klein-base-9B):
Minimal Example:
4. 🔮 DMD Calibrator (Dynamic Mode Decomposition)
DMD (Dynamic Mode Decomposition) is an exponential-basis feed-forward calibrator, serving as a drop-in alternative to TaylorSeer's polynomial basis.
Mathematical Principle: The cached feature stream is modeled as a linear dynamical system $Y_{t+1} \approx A \cdot Y_t$. The propagator $A$ is identified via one economy SVD, then eigendecomposed for extrapolation via $\lambda^k$. Unlike TaylorSeer's polynomial extrapolation (which diverges as $t^n \to \infty$), DMD is bounded when $\lvert\lambda\rvert \leq 1$.
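The steps in this paragraph (fit $A$ from snapshot pairs with one economy SVD, eigendecompose, extrapolate with $\lambda^k$) can be sketched in NumPy as follows. This is a generic projected-DMD illustration on a toy 2-D system, not Cache-DiT's calibrator. The function name `dmd_extrapolate`, its arguments, and the rank-truncation threshold are assumptions made for the example.

```python
import numpy as np


def dmd_extrapolate(history, steps=1, rank=None):
    """Extrapolate `steps` past the last snapshot of a (features, time) array."""
    x, y = history[:, :-1], history[:, 1:]             # pairs with y ≈ A @ x
    u, s, vh = np.linalg.svd(x, full_matrices=False)   # one economy SVD
    r = rank if rank is not None else int(np.sum(s > 1e-10 * s[0]))
    u, s, vh = u[:, :r], s[:r], vh[:r]
    # Propagator projected onto the leading r left singular vectors.
    a_tilde = (u.conj().T @ y @ vh.conj().T) / s
    lam, w = np.linalg.eig(a_tilde)
    # Express the last snapshot in the eigenbasis, then advance by lam**steps.
    coeffs = np.linalg.solve(w, u.conj().T @ history[:, -1])
    prediction = u @ (w @ (lam**steps * coeffs))
    return prediction.real, lam


# Toy linear system: a rotation that shrinks by 10% per step.
theta, decay = 0.3, 0.9
a_true = decay * np.array([[np.cos(theta), -np.sin(theta)],
                           [np.sin(theta), np.cos(theta)]])
snapshots = [np.array([1.0, 0.5])]
for _ in range(5):                      # 6 snapshots in total
    snapshots.append(a_true @ snapshots[-1])
history = np.stack(snapshots, axis=1)   # shape (2, 6)
pred, lam = dmd_extrapolate(history, steps=3)
```

In this toy case the recovered eigenvalues have modulus 0.9, so the $\lambda^k$ factor shrinks with $k$ and the extrapolation stays bounded, which is the property the paragraph contrasts with polynomial extrapolation.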
TaylorSeer vs DMD:
Usage:
CLI:
`python -m cache_dit.generate flux --cache --dmd --dmd-history 6`
🔧 Other Enhancements
🧩 FLUX.2-klein-kv Series
📈 CUDA Graph
- `descent_tuning` enabled by default (#935)
🏗️ Kernel & Infrastructure Refactoring
- `torch.library` ops (#905)
🔗 Parallelism Improvements
- `record_plans` for TP/TE-P planners (#916-#917)
⚡ Compilation Optimizations
🤝 Community Integrations
📦 Dependency Updates
- `uv` for dependency management (#992)
📝 Documentation
- `QuantizeConfig` + `svdq_kwargs`. Legacy `grad_ckpt`, `reorder_before_quantize`, etc. removed
👥 New Contributors
Thank you to the new contributors who joined the Cache-DiT community:
🙏 Contributors
Thank you to all contributors for this release: @DefTruth, @Archerkattri, @FNGarvin, @blian6.
For the full list of changes, see GitHub Release v1.5.0 and the full changelog.
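🇨🇳 Chinese Version
📋 Overview
Cache-DiT v1.5.0 is a major feature update spanning 3 months (2026-03-12 ~ 2026-06-16) and covering 176 PRs. The release centers on four core modules: SVDQuant W4A4 quantization (PTQ/DQ/NVFP4/converter CLI), DMD Calibrator (a dynamic mode decomposition calibrator on an exponential basis), Bucket-style Layerwise CPU Offload (layerwise offloading with compute-communication overlap), and Ray Wrapper (a wrapper that makes distributed inference transparent). It also includes FLUX.2-klein-kv series support, full CUDA Graph integration, extensive kernel refactoring and quantization enhancements, parallelism framework improvements, and a fully updated documentation set.
✨ Core Highlights
1. 💎 SVDQuant W4A4 Quantization
Cache-DiT v1.5.0 natively integrates a complete SVDQuant W4A4 quantization workflow, the most important feature of this release. Rather than relying on third-party libraries, users can now run the full W4A4 pipeline, from calibration to inference, directly through Cache-DiT's `cache_dit.quantize()` / `cache_dit.load()` API.
📊 PTQ (Post-Training Quantization)
Supports two quant types, `svdq_int4_r{rank}` and `svdq_nvfp4_r{rank}`:
- `calibrate_fn` collects activation statistics → SVD low-rank decomposition → INT4 packing → runtime W4A4 GEMM.
- Supports three calibration precision levels: `low` (recommended default, ~18x speedup), `medium`, `high`.
- Serializes to `{quant_type}.safetensors` + `quant_config.json`; restore in one step via `cache_dit.load()`.
- `runtime_kernel="v1"`.
Performance (FLUX.2-klein-4B, 1024×1024, L20):
NVFP4 Performance (RTX 5090, FLUX.2-klein-4B, 1024×1024):
⚡ DQ (Dynamic Quantization)
Zero-shot quantization that requires no calibration data, using the `_dq` type suffix (e.g., `svdq_int4_r128_dq`):
- [...] (`auto` / `stable_auto` / `power` / `log` / `rank` / `top` / `fixed`), configurable via `few_shot_steps`, `few_shot_relax_factor`, `few_shot_relax_top_ratio`, `few_shot_relax_strategy`.
- Supports `few_shot_auto_compile`, which triggers `torch.compile` automatically once quantization completes.
DQ Performance (FLUX.2-klein-4B, 1024×1024, identity smooth, rank=128): 1.28s, PSNR 28.71 dB.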
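🔧 SVDQ Converter CLI
New `cache-dit-convert` command-line tool that converts a pretrained model to the SVDQ W4A4 format in one step. Supports INT4/NVFP4 DQ conversion, custom smooth strategies, and multiple precision options. Produces `{quant_type}.safetensors` + `quant_config.json`, which can be loaded directly via `cache_dit.load()`.
🔀 Fused MLP
New `fused_gelu_mlp` / `fused_gelu_proj` passes (enabled via `svdq_kwargs["fused_mlp"]=True`) fuse the first GEMM + GELU activation into a single kernel, reducing kernel launch overhead and intermediate activation memory.
🔗 Parallelism Compatibility (Cache-DiT Exclusive)
SVDQuant in Cache-DiT fully supports Context Parallelism (Ulysses / Ring / USP) and Tensor Parallelism (PyTorch DTensor). Users can combine quantization with parallelism in distributed inference for maximum memory compression and throughput gains.
`SVDQW4A4ShardLinear` (`dtensor.py`) provides native TP sharding support. This capability differentiates Cache-DiT's SVDQuant from other W4A4 implementations.
⚙️ Quantization Configuration Enhancements
- `regional_quantize=True` + `repeated_blocks`: quantize only the transformer's repeated blocks, keeping sensitive layers such as embeddings at full precision.
- `precision_plan`: assign different quant types to different sub-layers by layer-name pattern (e.g., `attn.to_q` uses `float8_per_tensor`, `attn.to_k` uses `float8_weight_only`).
- `per_tensor_fallback=True` (on by default): under TP, layers that do not support per-row quantization fall back automatically to per-tensor, eliminating skip warnings and improving coverage (144 → 144, 0 skip).
- `QuantizeBackend` enum (AUTO / TORCHAO / CACHE_DIT / NONE).
- `QuantizeConfig` + `svdq_kwargs`.
📦 cache-dit-cu13 Pre-built Wheel
A pre-built SVDQuant wheel for CUDA 13 users: `pip install cache-dit-cu13`, which avoids compiling the SVDQ kernel from source.
2. 💾 Bucket-style Layerwise CPU Offload
Cache-DiT v1.5.0 introduces a new bucket-style layerwise offload mechanism that addresses the "every layer waits for its transfer" inefficiency of traditional layerwise offload.
Core Design:
- `transfer_buckets` (prefetch depth), `persistent_buckets` (number of resident buckets), `persistent_bins` (number of resident distribution bins), `prefetch_limit` (conservative prefetch limit), `max_copy_streams` (number of concurrent copy streams), `max_inflight_prefetch_bytes` (prefetch byte budget).
Performance (FLUX.1-dev, L20):
Only 1.2s of additional latency (vs. 23.4s without offload) reduces memory from 38 GiB to 16 GiB.
torch.compile Compatible: Apply offload first, then compile; offload hooks work correctly after compilation.
CLI Quick Start: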
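3. 🌩️ Ray Wrapper (Transparent Distributed Inference)
The Ray Wrapper makes distributed inference completely transparent to user code. You need no `torchrun`, no `dist.init_process_group`, and no manual model sharding: just set `use_ray=True` and Cache-DiT takes over.
Two Wrapper Levels:
Key Features:
- `ray_transfer_fn`: user-defined model loading logic for each worker, bypassing serialization/deserialization overhead and resolving class resolution issues for custom modules.
- `ray_use_compile`: automatic compilation inside each worker.
- `ray_runtime_env`: handles custom module imports via `PYTHONPATH`.
Performance (FLUX.2-klein-base-9B):
Minimal Example:
4. 🔮 DMD Calibrator (Dynamic Mode Decomposition Calibrator)
DMD (Dynamic Mode Decomposition) is an exponential-basis feed-forward calibrator that serves as an alternative to TaylorSeer (polynomial basis).
Mathematical Principle: The cached feature stream is modeled as a linear dynamical system $Y_{t+1} \approx A \cdot Y_t$. The propagator $A$ is identified via a single SVD, then eigendecomposed for extrapolation via $\lambda^k$. Compared with TaylorSeer's polynomial extrapolation ($t^n$ diverges), DMD is bounded when $\lvert\lambda\rvert \leq 1$.
TaylorSeer vs DMD:
Usage:
CLI:
`python -m cache_dit.generate flux --cache --dmd --dmd-history 6`
🔧 Other Enhancements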
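🧩 FLUX.2-klein-kv Series Support
📈 CUDA Graph
🏗️ Kernel & Infrastructure Refactoring
- `torch.library` ops (#905)
🔗 Parallelism Improvements
- `record_plans` function support (TP/TE-P planners, #916-#917)
⚡ Compilation Optimizations
🤝 Community Integrations
📦 Dependency Upgrades
- `uv` replaces pip for dependency management (#992)
📝 Documentation
- `QuantizeConfig` + `svdq_kwargs`; the legacy `grad_ckpt`, `reorder_before_quantize`, and similar parameters have been removed
👥 New Contributors
Thanks to the following new contributors for joining the Cache-DiT community:
🙏 Acknowledgments
Thanks to all contributors: @DefTruth, @Archerkattri, @FNGarvin, @blian6.
For the full list of changes, see GitHub Release v1.5.0.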
v1.3.12
What's Changed
Full Changelog: vipshop/cache-dit@v1.3.11...v1.3.12
v1.3.11
What's Changed
Full Changelog: vipshop/cache-dit@v1.3.10...v1.3.11
v1.3.10
What's Changed
New Contributors
Full Changelog: vipshop/cache-dit@v1.3.9...v1.3.10
v1.3.9
What's Changed
Full Changelog: vipshop/cache-dit@v1.3.8...v1.3.9
v1.3.8
What's Changed
Full Changelog: vipshop/cache-dit@v1.3.7...v1.3.8
v1.3.7
What's Changed
Full Changelog: vipshop/cache-dit@v1.3.6...v1.3.7
v1.3.6
What's Changed
Full Changelog: vipshop/cache-dit@v1.3.5...v1.3.6
v1.3.5: Quantization
Low-bits Quantization
Overview
Quantization is a powerful technique to reduce the memory footprint and computational cost of deep learning models by representing weights and activations with lower precision data types. Cache-DiT supports various quantization methods, including FP8, INT8, and INT4 quantization, to help users achieve faster inference and lower memory usage while maintaining acceptable model performance.
FP8 Quantization
Currently, TorchAO is fully integrated into Cache-DiT as the backend for online quantization. You can quantize a model either by calling `quantize` directly or by passing a `QuantizeConfig` to the `enable_cache` API (recommended).
For GPUs with low memory capacity, we recommend using float8_per_row or float8_per_block, as these
Configuration
📅 Schedule: (UTC)
🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.
♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
This PR was generated by Mend Renovate. View the repository job log.