Update dependency cache_dit to v1.5.0 by renovate[bot] · Pull Request #33 · BaizeAI/image-gen-runtime

renovate · 2026-01-16T10:29:54Z

ℹ️ Note

This PR body was truncated due to platform limits.

This PR contains the following updates:

Package	Change	Age	Confidence
cache_dit	`==1.1.10` → `==1.5.0`

Release Notes

vipshop/cache-dit (cache_dit)

`v1.5.0`: Major Release

Compare Source

🚀 Cache-DiT v1.5.0 Release Notes

Release Date: 2026-06-16
Comparison: v1.3.0...v1.5.0 (176 commits)
Full Changelog: GitHub Releases

📋 Overview

Cache-DiT v1.5.0 is a major feature release spanning 3 months (2026-03-12 ~ 2026-06-16), covering 176 PRs. This release is organized around four core modules: SVDQuant W4A4 Quantization (PTQ/DQ/NVFP4/Converter CLI), DMD Calibrator (exponential-basis Dynamic Mode Decomposition), Bucket-style Layerwise CPU Offload (compute-communication overlapped offloading), and Ray Wrapper (transparent distributed inference). Additional highlights include FLUX.2-klein-kv series support, full CUDA Graph integration, extensive kernel refactoring, quantization enhancements, parallelism framework improvements, and comprehensively updated documentation.

✨ Core Highlights

1. 💎 SVDQuant W4A4 Quantization

Cache-DiT v1.5.0 natively integrates a complete SVDQuant W4A4 quantization workflow — the most significant feature of this release. Instead of relying on third-party libraries, users can now perform end-to-end W4A4 quantization directly through Cache-DiT's cache_dit.quantize() / cache_dit.load() API.

📊 PTQ (Post-Training Quantization)

Supports svdq_int4_r{rank} and svdq_nvfp4_r{rank} quant types:

INT4 PTQ (≥sm80): Collect activation statistics via calibrate_fn → SVD low-rank decomposition → INT4 packing → runtime W4A4 GEMM. Supports three calibration precision levels: low (recommended default, ~18× faster SVD decomposition), medium, high. Serialize to {quant_type}.safetensors + quant_config.json; restore via cache_dit.load().
NVFP4 PTQ (≥sm120, Blackwell): Designed for RTX 5090 and other Blackwell GPUs. Currently only runtime_kernel="v1" is supported for NVFP4.
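
To make the "INT4 packing" step of the PTQ pipeline more concrete, here is a minimal pure-Python sketch of symmetric per-row 4-bit quantization with two values packed per byte. It is an illustration under stated assumptions, not Cache-DiT's kernel or storage layout: the function names, the per-row scale, and the low-nibble-first packing order are all hypothetical, and the calibration and SVD low-rank steps are omitted.

```python
def quantize_int4(row):
    """Symmetric per-row quantization to signed 4-bit integers in [-8, 7]."""
    max_abs = max(abs(v) for v in row)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(v / scale))) for v in row]
    return scale, q


def pack_int4(q):
    """Pack pairs of signed 4-bit values into bytes (assumed: low nibble first)."""
    if len(q) % 2:
        q = q + [0]  # pad to an even count
    return bytes((q[i] & 0xF) | ((q[i + 1] & 0xF) << 4) for i in range(0, len(q), 2))


def unpack_int4(packed, n):
    """Inverse of pack_int4: recover the first n signed 4-bit values."""
    out = []
    for b in packed:
        for nibble in (b & 0xF, b >> 4):
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out[:n]


# Made-up weights, for illustration only.
row = [0.70, -0.35, 0.10, -0.05, 0.00, 0.21]
scale, q = quantize_int4(row)
packed = pack_int4(q)
restored = [v * scale for v in unpack_int4(packed, len(row))]
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(len(packed), "bytes for", len(row), "weights; max abs error", max_error)
```

Six weights fit into three bytes, and the round-trip error is bounded by half a quantization step. A low-rank branch such as the one SVDQuant uses exists to absorb the part of the weight that this coarse grid represents poorly.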

Performance (FLUX.2-klein-4B, 1024×1024, L20):

Stage	Latency (s)	Memory (GiB)	Transformer Weight (GiB)
BF16 baseline	2.13	17.32	7.22
SVDQuant INT4	1.24	12.39	2.28
SVDQuant + compile	1.02	12.39	2.28

Transformer weight reduction: ~3.2× compression (7.22 → 2.28 GiB)
End-to-end latency: ~1.7× speedup (2.13 → 1.24s), ~2.1× with compile (2.13 → 1.02s)
PSNR > 29 dB, near-lossless visual quality
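
The ratios quoted above follow directly from the table. A small sketch that recomputes them (the dictionary layout is just a convenience for this example):

```python
# Values copied from the FLUX.2-klein-4B / L20 table above.
bf16 = {"latency_s": 2.13, "weight_gib": 7.22}
int4 = {"latency_s": 1.24, "weight_gib": 2.28}
int4_compile = {"latency_s": 1.02, "weight_gib": 2.28}

compression = bf16["weight_gib"] / int4["weight_gib"]
speedup = bf16["latency_s"] / int4["latency_s"]
speedup_compile = bf16["latency_s"] / int4_compile["latency_s"]

print(f"weight compression: {compression:.1f}x")
print(f"latency speedup:    {speedup:.1f}x ({speedup_compile:.1f}x with compile)")
```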

NVFP4 Performance (RTX 5090, FLUX.2-klein-4B, 1024×1024):

Stage	Latency (s)	Speedup	Memory (GiB)
BF16 baseline	0.97	1.00×	17.32
NVFP4 PTQ	0.58	1.69×	12.50
NVFP4 + compile	0.47	2.05×	12.50

⚡ DQ (Dynamic Quantization)

Zero-calibration quantization via _dq suffix types (e.g., svdq_int4_r128_dq):

identity (default): Apply SVD low-rank decomposition directly to the original weight matrix — no calibration, no serialization.
weight / weight_inv: Weight-statistics-only heuristic smooth strategies (experimental).
few_shot: Collect a small number of real inference forwards at runtime, then quantize in-place with configurable relaxation strategies (7 strategies: auto/stable_auto/power/log/rank/top/fixed). Supports few_shot_auto_compile for deferred compilation after quantization.

DQ Performance (FLUX.2-klein-4B, 1024×1024, identity, rank=128): 1.28s, PSNR 28.71 dB.
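
The quant type strings follow a regular naming scheme: svdq_{format}_r{rank}, with an optional _dq suffix. The sketch below parses that scheme. The parser is a hypothetical helper written for this note, not part of the cache_dit API; the minimum compute capabilities come from the PTQ list above, and accepting the hyphenated spelling is an assumption based on the converter CLI example.

```python
import re
from typing import NamedTuple


class SVDQType(NamedTuple):
    fmt: str        # "int4" or "nvfp4"
    rank: int       # low-rank branch size, e.g. 128
    dynamic: bool   # True when the "_dq" suffix is present


# Minimum compute capability per format, as stated in the release notes.
MIN_SM = {"int4": 80, "nvfp4": 120}

_PATTERN = re.compile(r"^svdq_(int4|nvfp4)_r(\d+)(_dq)?$")


def parse_quant_type(name: str) -> SVDQType:
    """Parse 'svdq_int4_r128_dq' or the hyphenated 'svdq-int4-r128-dq'."""
    match = _PATTERN.match(name.replace("-", "_"))
    if match is None:
        raise ValueError(f"not an SVDQ quant type: {name!r}")
    return SVDQType(match.group(1), int(match.group(2)), match.group(3) is not None)


parsed = parse_quant_type("svdq_int4_r128_dq")
print(parsed, "needs sm >=", MIN_SM[parsed.fmt])
```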

🔧 SVDQ Converter CLI

New cache-dit-convert command-line tool for one-click model conversion to SVDQ W4A4:

cache-dit-convert --model-path black-forest-labs/FLUX.2-klein-4B \
  --save-dir ./FLUX.2-klein-4B-svdq \
  --quant-type svdq-int4-r128-dq

Supports INT4/NVFP4 DQ conversion, custom smooth strategies, and multiple precision options. Outputs {quant_type}.safetensors + quant_config.json.

🔀 Fused MLP

New fused_gelu_mlp / fused_gelu_proj passes (enable via svdq_kwargs["fused_mlp"]=True), fusing the first GEMM + GELU activation into a single kernel for reduced kernel launch overhead and intermediate activation memory.

🔗 Parallelism Compatibility (Cache-DiT Exclusive)

SVDQuant in Cache-DiT fully supports Context Parallelism (Ulysses / Ring / USP) and Tensor Parallelism (PyTorch DTensor). Users can layer quantization on top of distributed parallelism for extreme memory compression and throughput gains. SVDQW4A4ShardLinear (dtensor.py) provides native TP sharding support. This is a differentiating capability of Cache-DiT's SVDQuant implementation versus other W4A4 solutions.

⚙️ Quantization Configuration Enhancements

Regional Quantization (regional_quantize=True + repeated_blocks): Quantize only repeated transformer blocks, keeping embedding layers at full precision.
Hybrid Precision Plan (precision_plan): Assign different quant types to different sub-layers by name pattern.
FP8 Per-Tensor Fallback (per_tensor_fallback=True, default on): Auto-fallback from per-row to per-tensor for TP-incompatible layers.
TorchAO Backend Refactor: Cleaner QuantizeBackend enum (AUTO / TORCHAO / CACHE_DIT / NONE).
Quantize API Refactor: Deprecated legacy kwargs, unified under QuantizeConfig + svdq_kwargs.
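
Regional quantization and the hybrid precision plan both amount to a mapping from layer names to quant types. The sketch below shows one way such a mapping could be resolved. It is a conceptual illustration only: the dictionary shape, the glob-style matching, the block prefix, and the default type are assumptions for this example, not the actual precision_plan format. The quant type names are ones that appear elsewhere in these notes.

```python
from fnmatch import fnmatch

# Hypothetical plan: layer-name pattern -> quant type.
PRECISION_PLAN = {
    "*.attn.to_q": "float8_per_tensor",
    "*.attn.to_k": "float8_weight_only",
}
DEFAULT_QUANT_TYPE = "float8_per_row"
# Assumed name prefix of the repeated transformer blocks.
REPEATED_BLOCKS_PREFIX = "transformer_blocks."


def resolve_quant_type(layer_name):
    """Return the quant type for a layer, or None to keep it at full precision."""
    if not layer_name.startswith(REPEATED_BLOCKS_PREFIX):
        return None  # regional quantization: layers outside repeated blocks are skipped
    for pattern, quant_type in PRECISION_PLAN.items():
        if fnmatch(layer_name, pattern):
            return quant_type
    return DEFAULT_QUANT_TYPE


for name in ("x_embedder", "transformer_blocks.0.attn.to_q", "transformer_blocks.3.ff.net.0"):
    print(name, "->", resolve_quant_type(name))
```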

📦 cache-dit-cu13 Pre-built Wheel

Pre-compiled SVDQuant wheel for CUDA 13 users: pip install cache-dit-cu13 — no source build needed.

2. 💾 Bucket-style Layerwise CPU Offload

Cache-DiT v1.5.0 introduces a novel bucket-style layerwise offload mechanism that overcomes the "every layer waits for its own transfer" inefficiency of traditional approaches.

Core Design:

Bucket Pipeline: Divide target modules into small contiguous buckets; prefetch the next bucket asynchronously while the current one executes.
Dual Independent Copy Stream Pools: Separate CUDA stream pools for onload (H2D) and offload (D2H).
Persistent Bins: Distribute the persistent budget evenly across the target sequence.
Flexible Resource Controls: transfer_buckets, persistent_buckets, persistent_bins, prefetch_limit, max_copy_streams, max_inflight_prefetch_bytes.
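
The benefit of the bucket pipeline can be seen with a toy latency model. The sketch below is not Cache-DiT code: it assumes every layer has the same compute and transfer time, ignores the D2H offload path and stream limits, and uses made-up timings. It only contrasts "each layer waits for its own transfer" with "prefetch the next bucket while the current one runs".

```python
def make_buckets(layers, bucket_size):
    """Split a layer list into contiguous buckets of at most bucket_size layers."""
    return [layers[i:i + bucket_size] for i in range(0, len(layers), bucket_size)]


def sequential_latency(buckets, compute_s, transfer_s):
    """No overlap: every layer is transferred, then executed."""
    return sum(len(b) * (compute_s + transfer_s) for b in buckets)


def pipelined_latency(buckets, compute_s, transfer_s):
    """Overlap: while bucket i computes, bucket i+1 is transferred."""
    if not buckets:
        return 0.0
    total = len(buckets[0]) * transfer_s  # the first bucket cannot be hidden
    for current, upcoming in zip(buckets, buckets[1:] + [[]]):
        total += max(len(current) * compute_s, len(upcoming) * transfer_s)
    return total


layers = [f"block_{i}" for i in range(8)]
buckets = make_buckets(layers, bucket_size=2)
seq = sequential_latency(buckets, compute_s=0.010, transfer_s=0.008)
pipe = pipelined_latency(buckets, compute_s=0.010, transfer_s=0.008)
print(f"{len(buckets)} buckets: sequential {seq:.3f}s, pipelined {pipe:.3f}s")
```

In this model the pipelined schedule pays for only the first transfer, as long as each bucket's compute time is at least the next bucket's transfer time. That is the condition the bucket size and prefetch controls are meant to tune.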

Performance (FLUX.1-dev, L20):

Config	Memory	Latency
No offload	~38 GiB	23.4s
Diffusers sequential	~1 GiB	335s
Layerwise (transfer=4, persistent=32, bins=4)	~16 GiB	24.6s

Only ~1.2s added latency versus no-offload baseline for a 38→16 GiB memory reduction.
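
The trade-off in the table can be expressed as memory saved versus latency overhead, both relative to the no-offload baseline. A short sketch using the approximate figures above:

```python
# (memory GiB, latency s), copied from the FLUX.1-dev / L20 table above.
ROWS = {
    "no_offload": (38.0, 23.4),
    "diffusers_sequential": (1.0, 335.0),
    "layerwise": (16.0, 24.6),
}
BASE_MEM, BASE_LAT = ROWS["no_offload"]


def tradeoff(name):
    """Percent memory saved and percent latency added versus no offload."""
    mem, lat = ROWS[name]
    return {
        "memory_saved_pct": 100 * (1 - mem / BASE_MEM),
        "latency_overhead_pct": 100 * (lat / BASE_LAT - 1),
    }


for name in ("diffusers_sequential", "layerwise"):
    t = tradeoff(name)
    print(f"{name}: -{t['memory_saved_pct']:.1f}% memory, +{t['latency_overhead_pct']:.1f}% latency")
```

On these figures the layerwise configuration saves roughly 58% of memory for about 5% extra latency, while sequential offload saves more memory at more than ten times the latency.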

torch.compile Compatible: Apply offload before compile; offload hooks work correctly after compilation.

CLI Quick Start:

python3 -m cache_dit.generate flux \
  --layerwise-offload --layerwise-async-transfer \
  --layerwise-transfer-buckets 4 --layerwise-persistent-buckets 32 \
  --layerwise-persistent-bins 4 --layerwise-prefetch-limit \
  --layerwise-max-inflight-prefetch-bytes 8gib --compile

3. 🌩️ Ray Wrapper (Transparent Distributed Inference)

The Ray Wrapper makes distributed inference completely transparent to user code. No torchrun, no dist.init_process_group, no manual model sharding — just use_ray=True, and Cache-DiT handles everything.

Two Wrapper Levels:

Level	Description	Best For
Pipeline Wrapper (recommended)	Ray manages the entire pipeline execution	Full feature support (cache, quant, parallelism), simplest, fastest.
Transformer Wrapper	Only the transformer runs on Ray workers	Lightweight, but slightly slower

Key Features:

ray_transfer_fn: User-defined per-worker model loading, bypassing serialization overhead and custom module class resolution issues.
ray_use_compile: Automatic per-worker compilation.
ray_runtime_env: Custom module import handling via PYTHONPATH.
Supports all parallelism strategies: TP, Ulysses, Ring.
LoRA support: fuse before enabling (TP requires fused LoRAs).

Performance (FLUX.2-klein-base-9B):

Config	Latency
Baseline (single GPU)	47.41s
Ray TP=2 + compile	24.57s

Minimal Example:

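import cache_dit
from cache_dit import ParallelismConfig

# `pipe` is an already-loaded Diffusers pipeline.
cache_dit.enable_cache(
  pipe,
  parallelism_config=ParallelismConfig(tp_size=2, use_ray=True),
)
image = pipe(prompt="A cat...").images[0]  # Code unchanged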

4. 🔮 DMD Calibrator (Dynamic Mode Decomposition)

DMD (Dynamic Mode Decomposition) is an exponential-basis feed-forward calibrator, serving as a drop-in alternative to TaylorSeer's polynomial basis.

Mathematical Principle: The cached feature stream is modeled as a linear dynamical system $Y_{t+1} \approx A \cdot Y_t$. The propagator $A$ is identified via one economy SVD, then eigendecomposed for extrapolation via $\lambda^k$. Unlike TaylorSeer's polynomial extrapolation (diverges as $t^n \to \infty$), DMD is bounded when $\lvert\lambda\rvert \leq 1$.

TaylorSeer vs DMD:

Aspect	TaylorSeer (Polynomial)	DMD (Exponential)
Basis	$Y(t) \approx \sum \frac{d^nY}{dt^n} \frac{t^n}{n!}$	$Y(t) \approx \Phi \cdot \text{diag}(\lambda^t) \cdot b$
Extrapolation	Diverges as $t^n \to \infty$	Bounded when $\lvert\lambda\rvert \leq 1$
Snapshots needed	2+ (1st order)	≥ 4 uniformly spaced
Best for	DiT-class denoising (DDPM)	Flow-matching generators (Hunyuan3D, etc.)
Noise sensitivity	Low	Moderate (SVD truncation suppresses noise)
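
The difference between the two bases is easiest to see in one dimension, where the propagator $A$ reduces to a scalar $a$ and its only eigenvalue is $\lambda = a$. The sketch below is a simplified illustration, not the Cache-DiT calibrator: it fits $a$ by ridge-regularized least squares instead of an SVD, and the function names and the decaying test signal are made up for the example.

```python
def fit_propagator(history, ridge=1e-8):
    """Least-squares fit of a in y[t+1] ~= a * y[t], with a small ridge term."""
    num = sum(y0 * y1 for y0, y1 in zip(history, history[1:]))
    den = sum(y0 * y0 for y0 in history[:-1]) + ridge
    return num / den


def dmd_extrapolate(history, k, ridge=1e-8):
    """Exponential-basis forecast: y[t+k] ~= a**k * y[t]."""
    return history[-1] * fit_propagator(history, ridge) ** k


def taylor1_extrapolate(history, k):
    """First-order polynomial forecast from the last two snapshots."""
    return history[-1] + k * (history[-1] - history[-2])


# Six uniformly spaced snapshots of a decaying signal y[t] = 0.8**t.
history = [0.8 ** t for t in range(6)]
k = 10
truth = 0.8 ** (5 + k)
a = fit_propagator(history)
dmd = dmd_extrapolate(history, k)
taylor = taylor1_extrapolate(history, k)
print(f"a={a:.6f} truth={truth:.5f} dmd={dmd:.5f} taylor={taylor:.5f}")
```

With $\lvert a \rvert < 1$ the exponential forecast decays toward zero along with the signal. The linear forecast keeps moving at the last observed slope, so its error grows with the horizon $k$.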

Usage:

cache_dit.enable_cache(
  pipe,
  cache_config=DBCacheConfig(...),
  calibrator_config=DMDCalibratorConfig(
    dmd_history=6, dmd_rank=0, dmd_ridge=1e-8,
  ),
)

CLI: python -m cache_dit.generate flux --cache --dmd --dmd-history 6

🔧 Other Enhancements

🧩 FLUX.2-klein-kv Series

TP + compile integration (#888)
fp8 per-row + TP support (#896)
Async Ulysses support (#877)

📈 CUDA Graph

Full CUDA Graph support (#942-#952)
CUDA Graph + fp8 rowwise integration (#952)
descent_tuning enabled by default (#935)
compile full-graph CLI option (#946)

🏗️ Kernel & Infrastructure Refactoring

Triton kernel refactoring (3/N, #907-#909)
Communication kernels registered as torch.library ops (#905)
CuTeDSL communication kernels (fp8, #977), merge-attn-states kernel (#973)
Unified all2all/ring communication API (#985)
Async Ulysses refactoring (#986)
Unified ops registration policy (#931, #939, #967)

🔗 Parallelism Improvements

record_plans for TP/TE-P planners (#916-#917)
sub cp_plan support (#989)
CP & VAE-P planner refactoring for better logging (#918)

⚡ Compilation Optimizations

Removed manual graph breaks (#885)
CUDA Graph for dynamic compile (#951)
SVDQuant + compile compatibility; offload + compile (#1014)
Log suppression improvements (#934, #923)

🤝 Community Integrations

MindIE-SD: Huawei Ascend NPU attention/compilation backend support (@blian6, #1004)
TensorRT-LLM community link (#991)

📦 Dependency Updates

PyTorch → 2.11.0 (#900)
uv for dependency management (#992)
Python version compatibility fix (@FNGarvin, #1025)
TorchAO ≥ 0.17.0

📝 Documentation

New/reworked docs: QUANTIZATION.md, OFFLOAD.md, RAY.md, CACHE_API.md (DMD), CUDA_GRAPH.md, COMPILE.md
Formatting and typo fixes (#886-#887, #899, #903, #913)
FAQ update: Flash Attention 2 install guide (#915)
Community link fixes (#928, #940)

⚠️ Breaking Changes

Change	PR	Migration
Serving module deprecated	#933	Migrate to SGLang Diffusion or vLLM-Omni
Native Diffusers parallelism backend deprecated	#1017	Use Cache-DiT native parallelism backends (better performance)
Quantize API legacy kwargs deprecated	#910-#911	Migrate to unified `QuantizeConfig` + `svdq_kwargs`. Legacy `grad_ckpt`, `reorder_before_quantize` etc. removed
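
Before merging a bump across these breaking changes, it can help to search the consuming code for the removed keyword arguments. The sketch below is a hypothetical helper, not part of Cache-DiT or Renovate. It only knows the two names listed in the table (the notes say "etc.", so the list is incomplete), and the sample text is an invented call, not a verified signature.

```python
import re

# Only the kwargs named explicitly in the breaking-changes table.
REMOVED_KWARGS = ("grad_ckpt", "reorder_before_quantize")


def find_removed_kwargs(source):
    """Return (line number, kwarg name) for each use of a removed keyword argument."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for name in REMOVED_KWARGS:
            if re.search(rf"\b{name}\s*=", line):
                hits.append((lineno, name))
    return hits


# Invented example input; `run_quantize` is a placeholder, not a real API.
sample = "model = load()\nrun_quantize(model, grad_ckpt=True)\n"
print(find_removed_kwargs(sample))
```

In practice the same function could be applied to each file in the repository, for example via pathlib.Path.rglob("*.py").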

👥 New Contributors

Thank you to the new contributors who joined the Cache-DiT community:

@blian6 — MindIE-SD NPU attention/compilation backend support
@FNGarvin — Python version compatibility fix
@Archerkattri — DMD Calibrator

🙏 Contributors

Thank you to all contributors for this release: @DefTruth, @Archerkattri, @FNGarvin, @blian6.

For the full list of changes, see GitHub Release v1.5.0 and the full changelog.

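🇨🇳 Chinese Version

📋 Overview

Cache-DiT v1.5.0 is a major feature release spanning 3 months (2026-03-12 ~ 2026-06-16) and covering 176 PRs. The release is organized around four core modules: SVDQuant W4A4 Quantization (PTQ/DQ/NVFP4/Converter CLI), DMD Calibrator (a Dynamic Mode Decomposition calibrator built on an exponential basis), Bucket-style Layerwise CPU Offload (layerwise offloading with compute-communication overlap), and Ray Wrapper (a wrapper that makes distributed inference transparent). It also includes FLUX.2-klein-kv series support, full CUDA Graph integration, extensive kernel refactoring and quantization enhancements, parallelism framework improvements, and comprehensively updated documentation.

✨ Core Highlights

1. 💎 SVDQuant W4A4 Quantization

Cache-DiT v1.5.0 natively integrates a complete SVDQuant W4A4 quantization workflow, the most important feature of this release. Instead of relying on third-party libraries, users can now run the full W4A4 pipeline, from calibration to inference, directly through Cache-DiT's cache_dit.quantize() / cache_dit.load() API.

📊 PTQ (Post-Training Quantization)

Supports two quant types, svdq_int4_r{rank} and svdq_nvfp4_r{rank}:

INT4 PTQ (≥sm80): Collect activation statistics via calibrate_fn → SVD low-rank decomposition → INT4 packing → runtime W4A4 GEMM. Supports three calibration precision levels: low (recommended default, ~18× speedup), medium, high. Serialize to {quant_type}.safetensors + quant_config.json; restore in one step via cache_dit.load().
NVFP4 PTQ (≥sm120, Blackwell): Designed for RTX 5090 and other Blackwell GPUs; packs weights in the NVFP4 format. Currently only runtime_kernel="v1" is supported.

Performance (FLUX.2-klein-4B, 1024×1024, L20):

Stage	Latency (s)	Memory (GiB)	Transformer Weight (GiB)
BF16 baseline	2.13	17.32	7.22
SVDQuant INT4	1.24	12.39	2.28
SVDQuant + compile	1.02	12.39	2.28

Transformer weights: ~3.2× compression (7.22 → 2.28 GiB)
End-to-end latency: ~1.7× speedup (2.13 → 1.24s), ~2.1× with compile on top (2.13 → 1.02s)
PSNR stays at 29+ dB, near-lossless visual quality

NVFP4 Performance (RTX 5090, FLUX.2-klein-4B, 1024×1024):

Stage	Latency (s)	Speedup	Memory (GiB)
BF16 baseline	0.97	1.00×	17.32
NVFP4 PTQ	0.58	1.69×	12.50
NVFP4 + compile	0.47	2.05×	12.50

⚡ DQ (Dynamic Quantization)

Zero-shot quantization that needs no calibration data, selected via the _dq type suffix (e.g., svdq_int4_r128_dq):

identity (default): Apply SVD low-rank decomposition directly to the original weight matrix; no calibration, no serialization.
weight / weight_inv: Heuristic smooth strategies based only on weight statistics (experimental).
few_shot: Collect activation statistics from a small number of forward passes at runtime, then quantize on the fly. Supports 7 relaxation strategies (auto/stable_auto/power/log/rank/top/fixed), configurable via few_shot_steps, few_shot_relax_factor, few_shot_relax_top_ratio, and few_shot_relax_strategy. Supports few_shot_auto_compile to trigger torch.compile automatically once quantization completes.

DQ Performance (FLUX.2-klein-4B, 1024×1024, identity smooth, rank=128): 1.28s, PSNR 28.71 dB.

🔧 SVDQ Converter CLI

New cache-dit-convert command-line tool for one-click conversion of pretrained models to the SVDQ W4A4 format: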

cache-dit-convert --model-path black-forest-labs/FLUX.2-klein-4B \
  --save-dir ./FLUX.2-klein-4B-svdq \
  --quant-type svdq-int4-r128-dq

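Supports INT4/NVFP4 DQ conversion, custom smooth strategies, and multiple precision options. Produces {quant_type}.safetensors + quant_config.json, which can be loaded directly via cache_dit.load().

🔀 Fused MLP

New fused_gelu_mlp / fused_gelu_proj passes (enable via svdq_kwargs["fused_mlp"]=True) fuse the first GEMM + GELU activation into a single kernel, reducing kernel launch overhead and intermediate activation memory.

🔗 Parallelism Compatibility (Cache-DiT Exclusive)

SVDQuant in Cache-DiT fully supports Context Parallelism (Ulysses / Ring / USP) and Tensor Parallelism (PyTorch DTensor). Users can combine quantization with parallelism in distributed inference for extreme memory compression and throughput gains. SVDQW4A4ShardLinear (dtensor.py) provides native TP sharding support. This capability differentiates Cache-DiT's SVDQuant from other W4A4 implementations.

⚙️ Quantization Configuration Enhancements

Regional Quantization (regional_quantize=True + repeated_blocks): Quantize only the repeated transformer blocks, keeping sensitive layers such as embeddings at full precision.
Hybrid Precision Plan (precision_plan): Assign different quant types to different sub-layers by layer-name pattern (e.g., float8_per_tensor for attn.to_q, float8_weight_only for attn.to_k).
FP8 Per-Tensor Fallback (per_tensor_fallback=True, default on): Under TP, layers that do not support per-row quantization automatically fall back to per-tensor, eliminating skip warnings and improving coverage (144 → 144, 0 skip).
TorchAO Backend Refactor: Cleaner QuantizeBackend enum (AUTO / TORCHAO / CACHE_DIT / NONE).
Quantize API Refactor: Deprecated legacy kwargs, unified under QuantizeConfig + svdq_kwargs.

📦 cache-dit-cu13 Pre-built Wheel

Pre-compiled SVDQuant wheel for CUDA 13 users: pip install cache-dit-cu13, which avoids building the SVDQ kernels from source.

2. 💾 Bucket-style Layerwise CPU Offload

Cache-DiT v1.5.0 introduces a new bucket-style layerwise offload mechanism that addresses the "every layer waits for its transfer" inefficiency of traditional layerwise offload.

Core Design:

Bucket Pipeline: Split the target modules into small contiguous buckets; while the current bucket executes, asynchronously prefetch the next one to overlap compute and communication.
Dual Independent Copy Stream Pools: onload (H2D) and offload (D2H) each have their own CUDA stream pool.
Persistent Bins: Distribute the persistent budget evenly across the target sequence so that hot weights do not cluster in the prefix.
Flexible Resource Controls: transfer_buckets (prefetch depth), persistent_buckets (number of persistent buckets), persistent_bins (number of bins the persistent buckets are spread over), prefetch_limit (conservative prefetch limit), max_copy_streams (number of concurrent copy streams), max_inflight_prefetch_bytes (prefetch byte budget).

Performance (FLUX.1-dev, L20):

Config	Memory	Latency
No offload	~38 GiB	23.4s
Native diffusers offload	~25 GiB	56s
Native diffusers sequential	~1 GiB	335s
Layerwise (transfer=4, persistent=32, bins=4)	~16 GiB	24.6s

Only 1.2s of added latency (vs 23.4s without offload) to reduce memory from 38 GiB to 16 GiB.

torch.compile Compatible: Apply offload first, then compile; offload hooks work correctly after compilation.

CLI Quick Start: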

python3 -m cache_dit.generate flux \
  --layerwise-offload --layerwise-async-transfer \
  --layerwise-transfer-buckets 4 --layerwise-persistent-buckets 32 \
  --layerwise-persistent-bins 4 --layerwise-prefetch-limit \
  --layerwise-max-inflight-prefetch-bytes 8gib --compile

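3. 🌩️ Ray Wrapper (Transparent Distributed Inference)

The Ray Wrapper makes distributed inference completely transparent to user code. No torchrun, no dist.init_process_group, no manual model sharding: just set use_ray=True and Cache-DiT takes over everything.

Two Wrapper Levels:

Level	Description	Best For
Pipeline Wrapper (recommended)	Ray manages the entire pipeline execution, including the text encoder, VAE, and scheduler	Full feature support (cache, quantization, parallelism), simplest user experience
Transformer Wrapper	Only the transformer runs on Ray; other components stay in the main process	Lightweight, but slower than the pipeline level

Key Features:

ray_transfer_fn: User-defined per-worker model loading, bypassing serialization/deserialization overhead and resolving class resolution issues for custom modules.
ray_use_compile: Automatic compilation inside each worker.
ray_runtime_env: Custom module import handling via PYTHONPATH.
Supports all parallelism strategies, including TP, Ulysses, and Ring.
LoRA support: fusing before use is recommended (TP does not support unfused LoRAs).

Performance (FLUX.2-klein-base-9B):

Config	Latency
Baseline (single GPU)	47.41s
Ray Wrapper TP=2 + compile	24.57s

Minimal Example: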

cache_dit.enable_cache(
  pipe,
  parallelism_config=ParallelismConfig(tp_size=2, use_ray=True),
)
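image = pipe(prompt="A cat...").images[0]  # Code unchanged

4. 🔮 DMD Calibrator (Dynamic Mode Decomposition Calibrator)

DMD (Dynamic Mode Decomposition) is a feed-forward calibrator built on an exponential basis, offered as an alternative to TaylorSeer (polynomial basis).

Mathematical Principle: The cached feature stream is modeled as a linear dynamical system $Y_{t+1} \approx A \cdot Y_t$. The propagator $A$ is identified via a single SVD, then eigendecomposed and extrapolated via $\lambda^k$. In contrast to TaylorSeer's polynomial extrapolation (which diverges as $t^n$ grows), DMD is bounded when $\lvert\lambda\rvert \leq 1$.

TaylorSeer vs DMD:

Aspect	TaylorSeer (Polynomial)	DMD (Exponential)
Basis	$Y(t) \approx \sum \frac{d^nY}{dt^n} \frac{t^n}{n!}$	$Y(t) \approx \Phi \cdot \text{diag}(\lambda^t) \cdot b$
Extrapolation	Diverges as $t^n \to \infty$	Bounded when $\lvert\lambda\rvert \leq 1$
Snapshots needed	2+ (1st order)	≥ 4 uniformly spaced
Best for	DiT-class denoising (DDPM)	Flow-matching generators (Hunyuan3D, etc.)
Noise sensitivity	Low	Moderate (SVD truncation suppresses noise)

Usage: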

cache_dit.enable_cache(
  pipe,
  cache_config=DBCacheConfig(...),
  calibrator_config=DMDCalibratorConfig(
    dmd_history=6, dmd_rank=0, dmd_ridge=1e-8,
  ),
)

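CLI: python -m cache_dit.generate flux --cache --dmd --dmd-history 6

🔧 Other Enhancements

🧩 FLUX.2-klein-kv Series Support

TP + compile integration (#888)
fp8 per-row + TP support (#896)
Async Ulysses support (#877)

📈 CUDA Graph

Full CUDA Graph support (#942-#952)
CUDA Graph + fp8 rowwise used together (#952)
descent_tuning enabled by default (#935)
compile full-graph CLI option (#946)

🏗️ Kernel & Infrastructure Refactoring

Triton kernel refactoring (3/N, #907-#909)
Communication kernels registered as torch.library ops (#905)
CuTeDSL communication kernels (fp8, #977), merge-attn-states kernel (#973)
Unified all2all/ring communication API (#985)
Async Ulysses refactoring (#986)
Unified ops registration policy (#931, #939, #967)

🔗 Parallelism Improvements

record_plans function support (TP/TE-P planners, #916-#917)
sub cp_plan support (#989)
CP & VAE-P planner refactoring for better logging (#918)

⚡ Compilation Optimizations

Removed manual graph breaks (#885)
CUDA Graph for dynamic compile (#951)
SVDQuant + compile compatibility (#1014 offload + compile)
Log suppression improvements (#934, #923)

🤝 Community Integrations

MindIE-SD: attention/compilation backend support for Huawei Ascend NPU (@blian6, #1004)
TensorRT-LLM community link (#991)

📦 Dependency Updates

PyTorch → 2.11.0 (#900)
Use uv instead of pip for dependency management (#992)
Python version compatibility fix (@FNGarvin, #1025)
TorchAO ≥ 0.17.0

📝 Documentation

New or substantially updated docs: QUANTIZATION.md, OFFLOAD.md, RAY.md, CACHE_API.md (DMD), CUDA_GRAPH.md, COMPILE.md
Documentation formatting and typo fixes (#886-#887, #899, #903, #913)
FAQ update: Flash Attention 2 install guide (#915)
Community link fixes (#928, #940)

⚠️ Breaking Changes

Change	PR	Migration
Serving module deprecated	#933	Migrating to SGLang Diffusion or vLLM-Omni is recommended
Native Diffusers parallelism backend deprecated	#1017	Use Cache-DiT native parallelism backends (better performance)
Quantize API legacy kwargs deprecated	#910-#911	Use the unified `QuantizeConfig` + `svdq_kwargs`; legacy parameters such as `grad_ckpt` and `reorder_before_quantize` have been removed

👥 New Contributors

Thank you to the following new contributors for joining the Cache-DiT community:

@blian6: MindIE-SD NPU attention/compilation backend support
@FNGarvin: Python version compatibility fix
@Archerkattri: DMD Calibrator

🙏 Acknowledgements

Thank you to all contributors: @DefTruth, @Archerkattri, @FNGarvin, @blian6.

For the full list of changes, see GitHub Release v1.5.0.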

`v1.3.12`

Compare Source

What's Changed

chore: update installation guide by @DefTruth in #1035
chore: Update README.md by @DefTruth in #1036
Update README.md by @DefTruth in #1037
chore: update installation guide by @DefTruth in #1038
feat: config yaml support svdq dq/few-shot by @DefTruth in #1040
[2/N]: config yaml support svdq dq/few-shot by @DefTruth in #1042
chore: update why not svqd nvfp4 for sm100 by @DefTruth in #1043
chore: hightlight bucket-style layerwise offload by @DefTruth in #1044
chore: Update README.md by @DefTruth in #1045
svdq: add fused gelu mlp/proj pass by @DefTruth in #1047
docs: add fused gelu mlp/proj docs by @DefTruth in #1048

Full Changelog: vipshop/cache-dit@v1.3.11...v1.3.12

`v1.3.11`

Compare Source

What's Changed

chore: update release tools by @DefTruth in #1034

Full Changelog: vipshop/cache-dit@v1.3.10...v1.3.11

`v1.3.10`

Compare Source

What's Changed

feat: add MindIE-SD as optional NPU attention and compilation backend by @blian6 in #1004
chore: simplify attn backend auto select by @DefTruth in #1024
Fix Python version mismatch in setup.py by @FNGarvin in #1025
offload: extract copy stream pool and split init by @DefTruth in #1026
feat: support svdquant nvfp4 ptq/dq by @DefTruth in #1029
chore: Update README.md by @DefTruth in #1030
whl: cache-dit-cu13 pkg w/ svdq kernels by @DefTruth in #1031
whl: fix build_releases.sh tool by @DefTruth in #1032
whl: fix build_releases.sh tool by @DefTruth in #1033

New Contributors

@blian6 made their first contribution in #1004
@FNGarvin made their first contribution in #1025

Full Changelog: vipshop/cache-dit@v1.3.9...v1.3.10

`v1.3.9`

Compare Source

What's Changed

ray: refactor ray wrapper impl by @DefTruth in #1016
parallel: deprecated native diffusers backend by @DefTruth in #1017
ray: allow pass init_fn to ray wrapper by @DefTruth in #1019
docs: add torch.compile section to offload docs by @DefTruth in #1020
API: remove dup ray api call by @DefTruth in #1021
ray: simplify ray wrapper dispatch by @DefTruth in #1022
ray: soft check for ray path by @DefTruth in #1023

Full Changelog: vipshop/cache-dit@v1.3.8...v1.3.9

`v1.3.8`

Compare Source

What's Changed

CLI: allow 8-steps lora for qwen-image edit lightning by @DefTruth in #1011
skills: add triton-kernel skill by @DefTruth in #1013
feat: make layerwise offload compatible w/ compile by @DefTruth in #1014
ray: fix custom components serialize by @DefTruth in #1015

Full Changelog: vipshop/cache-dit@v1.3.7...v1.3.8

`v1.3.7`

Compare Source

What's Changed

docs: update ray wrapper docs by @DefTruth in #1005
docs: update ray wrapper docs by @DefTruth in #1006
ray: pass runtime env to workers by @DefTruth in #1007
chore: update ray wrapper docs by @DefTruth in #1008
ray: disable dashboard by default by @DefTruth in #1009

Full Changelog: vipshop/cache-dit@v1.3.6...v1.3.7

`v1.3.6`

Compare Source

What's Changed

chore: update cache-dit arch by @DefTruth in #932
bc: deprecated serving module by @DefTruth in #933
chore: suppress torch compile tuning logs by @DefTruth in #934
compile: enabled descent_tuning by default by @DefTruth in #935
docs: update quantization docs by @DefTruth in #937
quant: add quantize backend enum by @DefTruth in #938
kernel: refactor ops register by @DefTruth in #939
chore: fix vllm-omni docs links by @DefTruth in #940
examples: add cuda graph option by @DefTruth in #942
chore: fix utils log info by @DefTruth in #943
chore: add cuda graph usage docs by @DefTruth in #944
chore: add cuda graph usage to overview by @DefTruth in #945
CLI: add compile full-graph option by @DefTruth in #946
chore: fix fullgraph param typo by @DefTruth in #947
docs: add more cuda graph perf results by @DefTruth in #948
docs: add more cuda graph perf results by @DefTruth in #949
docs: update cuda graphs docs by @DefTruth in #950
chore: allow cuda graph for dynamic compile by @DefTruth in #951
feat: support cuda graph + fp8 rowwise by @DefTruth in #952
chore: hotfix for mkdocs broken by @DefTruth in #953
[1/N] feat: support svdquant w4a4 - kernels & skills by @DefTruth in #954
pytest: fast_svd mode for testing by @DefTruth in #955
[2/N] feat: streaming quantize for svdquant by @DefTruth in #956
[3/N] feat: PTQ workflow for svdquant by @DefTruth in #957
SKILL: add ptq-workflow-integration skill by @DefTruth in #958
pytest: separate kernels and quantization tests by @DefTruth in #959
chore: add docs strings to codebase by @DefTruth in #960
chore: add svdq e2e example and format code by @DefTruth in #961
SKILL: add Cute-DSL/CUDA/CUTLASS skills by @DefTruth in #962
chore: update docs by @DefTruth in #963
kernel: tune svdq w4a4 gemm stage/blk size for Ada by @DefTruth in #966
kernel: unified ops register policy by @DefTruth in #967
bench: refactor cache-dit bench by @DefTruth in #968
svdquant: fast svd decompose, ~18x speedup by @DefTruth in #969
[2/N] tune svdq w4a4 gemm for ada by @DefTruth in #970
bc: refactor distributed codebase by @DefTruth in #971
kernel: add cute-dsl based merge-attn-states kernel by @DefTruth in #973
feat: extend SVDQ PTQ -> SVDQ DQ by @DefTruth in #974
fix: support 3D input/output for W4A4 linear by @DefTruth in #975
chore: support svdq-calib option in examples by @DefTruth in #976
kernel: add cute-dsl based fp8 comm kernels by @DefTruth in #977
[1/N] feat: support cute-dsl based svdquant w4a4 by @DefTruth in #978
feat: support svdq-dq few shot by @DefTruth in #979
chore: update svdq-dq few shot docs by @DefTruth in #980
feat: support layerwise cpu offload by @DefTruth in #981
[2/N] feat: support layerwise offload by @DefTruth in #982
[3/N] feat: support layerwise offload by @DefTruth in #983
[4/N] feat: support layerwise offload by @DefTruth in #984
chore: unified all2all/ring comm api by @DefTruth in #985
chore: refactor async ulysses codebase by @DefTruth in #986
remove cutedsl based svdq kernels by @DefTruth in #987
fix tensor parallel register import error by @DefTruth in #988
feat: support sub cp_plan for context parallel by @DefTruth in #989
chore: fix attention dispatch comments by @DefTruth in #990
community: add tensorrt-llm x cache-dit link by @DefTruth in #991
deps: use uv to install deps by @DefTruth in #992
chore: update docs by @DefTruth in #993
chore: add layerwise offload to overview by @DefTruth in #994
chore: update layerwise offload cli quick start by @DefTruth in #995
attention: fix sage-attn backend dispatch by @DefTruth in #996
chore: add exclude-layers param to ptq example by @DefTruth in #997
attention: separate attn backends by @DefTruth in #998
svdq: support converter cli for dq workflow by @DefTruth in #999
chore: fix typo by @DefTruth in #1000
chore: revise quantization example in README by @DefTruth in #1001
chore: update README by @DefTruth in #1002
feat: support ray wrapper by @DefTruth in #1003

Full Changelog: vipshop/cache-dit@v1.3.5...v1.3.6

`v1.3.5`: Quantization

Compare Source

Low-bits Quantization

Overview

Quantization is a powerful technique to reduce the memory footprint and computational cost of deep learning models by representing weights and activations with lower precision data types. Cache-DiT supports various quantization methods, including FP8, INT8, and INT4 quantization, to help users achieve faster inference and lower memory usage while maintaining acceptable model performance.

quantization type	description	devices
float8_per_row	quantize weights and activations to float8 (dynamic quantization) with rowwise method. (recommended)	>=sm89, Ada, Hopper or newer
float8_per_tensor	quantize weights and activations to float8 (dynamic quantization) with tensorwise method.	>=sm89, Ada, Hopper or newer
float8_per_block	block-wise quantization of weights and activations to float8 (dynamic quantization), which can provide better precision; activation block size: (1, 128), weight block size: (128, 128)	>=sm89, Ada, Hopper or newer
float8_weight_only	quantize only weights to float8, keep activations in full precision	>=sm89, Ada, Hopper or newer
int8_per_row	quantize weights and activations to int8 (dynamic quantization) with rowwise method.	>=sm80, Ampere or newer
int8_per_tensor	quantize weights and activations to int8 (dynamic quantization) with tensorwise method.	>=sm80, Ampere or newer
int8_weight_only	quantize only weights to int8, keep activations in full precision	>=sm80, Ampere or newer
int4_weight_only	quantize only weights to int4, keep activations in full precision	>=sm90, Hopper or newer, TMA required
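
The device column of this table can be turned into a simple lookup when choosing a quant type for a given GPU. The sketch below only encodes the minimum compute capabilities listed above. The helper name is hypothetical, and the extra "TMA required" condition on int4_weight_only is not modeled.

```python
# Minimum compute capability (sm) per quant type, copied from the table above.
MIN_SM = {
    "float8_per_row": 89,
    "float8_per_tensor": 89,
    "float8_per_block": 89,
    "float8_weight_only": 89,
    "int8_per_row": 80,
    "int8_per_tensor": 80,
    "int8_weight_only": 80,
    "int4_weight_only": 90,
}


def supported_quant_types(sm):
    """Quant types whose minimum compute capability is met by `sm` (e.g. 89 for Ada)."""
    return sorted(name for name, min_sm in MIN_SM.items() if sm >= min_sm)


for sm in (80, 89, 90):
    print(f"sm{sm}:", ", ".join(supported_quant_types(sm)))
```

On a machine with PyTorch and a CUDA device, the `sm` value could be derived from torch.cuda.get_device_capability(), which returns a (major, minor) tuple.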

FP8 Quantization

Currently, TorchAO is fully integrated into Cache-DiT as the backend for online quantization. You can quantize a model either by calling quantize or by passing a QuantizeConfig to the enable_cache API (recommended).

For GPUs with low memory capacity, we recommend using float8_per_row or float8_per_block, as these

✂ Note

PR body was truncated to here.

Configuration

📅 Schedule: (UTC)

Branch creation
- At any time (no schedule defined)
Automerge
- At any time (no schedule defined)

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.

If you want to rebase/retry this PR, check this box

This PR was generated by Mend Renovate. View the repository job log.

renovate Bot force-pushed the renovate/cache_dit-1.x branch from 9463dfe to fd0ef19 Compare February 2, 2026 05:23

renovate Bot changed the title ~~Update dependency cache_dit to v1.2.0~~ Update dependency cache_dit to v1.2.1 Feb 2, 2026

renovate Bot force-pushed the renovate/cache_dit-1.x branch from fd0ef19 to c65665d Compare February 10, 2026 08:50

renovate Bot changed the title ~~Update dependency cache_dit to v1.2.1~~ Update dependency cache_dit to v1.2.2 Feb 10, 2026

renovate Bot force-pushed the renovate/cache_dit-1.x branch from c65665d to 376529f Compare February 26, 2026 08:32

renovate Bot changed the title ~~Update dependency cache_dit to v1.2.2~~ Update dependency cache_dit to v1.2.3 Feb 26, 2026

renovate Bot force-pushed the renovate/cache_dit-1.x branch from 376529f to 8e12ad4 Compare March 11, 2026 12:52

renovate Bot changed the title ~~Update dependency cache_dit to v1.2.3~~ Update dependency cache_dit to v1.3.0 Mar 11, 2026

renovate Bot force-pushed the renovate/cache_dit-1.x branch from 8e12ad4 to 90ea7f3 Compare March 25, 2026 09:55

renovate Bot changed the title ~~Update dependency cache_dit to v1.3.0~~ Update dependency cache_dit to v1.3.1 Mar 25, 2026

renovate Bot force-pushed the renovate/cache_dit-1.x branch from 90ea7f3 to 7f6b76d Compare March 26, 2026 13:02

renovate Bot changed the title ~~Update dependency cache_dit to v1.3.1~~ Update dependency cache_dit to v1.3.3 Mar 26, 2026

renovate Bot force-pushed the renovate/cache_dit-1.x branch from 7f6b76d to ba50475 Compare March 27, 2026 05:24

renovate Bot changed the title ~~Update dependency cache_dit to v1.3.3~~ Update dependency cache_dit to v1.3.4 Mar 27, 2026

renovate Bot force-pushed the renovate/cache_dit-1.x branch from ba50475 to 0b1a44a Compare March 30, 2026 10:15

renovate Bot changed the title ~~Update dependency cache_dit to v1.3.4~~ Update dependency cache_dit to v1.3.5 Mar 30, 2026

renovate Bot force-pushed the renovate/cache_dit-1.x branch from 0b1a44a to 1ce318c Compare May 11, 2026 05:43

renovate Bot changed the title ~~Update dependency cache_dit to v1.3.5~~ Update dependency cache_dit to v1.3.6 May 11, 2026

renovate Bot force-pushed the renovate/cache_dit-1.x branch from 1ce318c to 28fa8b8 Compare May 12, 2026 09:11

renovate Bot changed the title ~~Update dependency cache_dit to v1.3.6~~ Update dependency cache_dit to v1.3.7 May 12, 2026

renovate Bot force-pushed the renovate/cache_dit-1.x branch from 28fa8b8 to 9ea5b4b Compare May 25, 2026 14:11

renovate Bot changed the title ~~Update dependency cache_dit to v1.3.7~~ Update dependency cache_dit to v1.3.8 May 25, 2026

renovate Bot force-pushed the renovate/cache_dit-1.x branch from 9ea5b4b to 302b19a Compare May 27, 2026 04:30

renovate Bot changed the title ~~Update dependency cache_dit to v1.3.8~~ Update dependency cache_dit to v1.3.9 May 27, 2026

renovate Bot force-pushed the renovate/cache_dit-1.x branch from 302b19a to 92bb116 Compare June 4, 2026 10:34

renovate Bot changed the title ~~Update dependency cache_dit to v1.3.9~~ Update dependency cache_dit to v1.3.11 Jun 4, 2026

renovate Bot force-pushed the renovate/cache_dit-1.x branch from 92bb116 to 3b5d57c Compare June 9, 2026 05:33

renovate Bot changed the title ~~Update dependency cache_dit to v1.3.11~~ Update dependency cache_dit to v1.3.12 Jun 9, 2026

Update dependency cache_dit to v1.5.0

8b1494c

renovate Bot force-pushed the renovate/cache_dit-1.x branch from 3b5d57c to 8b1494c Compare June 16, 2026 08:45

renovate Bot changed the title ~~Update dependency cache_dit to v1.3.12~~ Update dependency cache_dit to v1.5.0 Jun 16, 2026

renovate[bot] wants to merge 1 commit into main from renovate/cache_dit-1.x
