
Update dependency cache_dit to v1.5.0 #33

Open
renovate[bot] wants to merge 1 commit into main from renovate/cache_dit-1.x

renovate[bot] commented on Jan 16, 2026

ℹ️ Note

This PR body was truncated due to platform limits.

This PR contains the following updates:

Package | Change | Age | Confidence
cache_dit | ==1.1.10 → ==1.5.0 | age | confidence

Release Notes

vipshop/cache-dit (cache_dit)

v1.5.0: Major Release

Compare Source

🚀 Cache-DiT v1.5.0 Release Notes

Release Date: 2026-06-16
Comparison: v1.3.0...v1.5.0 (176 commits)
Full Changelog: GitHub Releases

📋 Overview

Cache-DiT v1.5.0 is a major feature release spanning 3 months (2026-03-12 ~ 2026-06-16), covering 176 PRs. This release is organized around four core modules: SVDQuant W4A4 Quantization (PTQ/DQ/NVFP4/Converter CLI), DMD Calibrator (exponential-basis Dynamic Mode Decomposition), Bucket-style Layerwise CPU Offload (compute-communication overlapped offloading), and Ray Wrapper (transparent distributed inference). Additional highlights include FLUX.2-klein-kv series support, full CUDA Graph integration, extensive kernel refactoring, quantization enhancements, parallelism framework improvements, and comprehensively updated documentation.

✨ Core Highlights

1. 💎 SVDQuant W4A4 Quantization

Cache-DiT v1.5.0 natively integrates a complete SVDQuant W4A4 quantization workflow — the most significant feature of this release. Instead of relying on third-party libraries, users can now perform end-to-end W4A4 quantization directly through Cache-DiT's cache_dit.quantize() / cache_dit.load() API.

📊 PTQ (Post-Training Quantization)

Supports svdq_int4_r{rank} and svdq_nvfp4_r{rank} quant types:

  • INT4 PTQ (≥sm80): Collect activation statistics via calibrate_fn → SVD low-rank decomposition → INT4 packing → runtime W4A4 GEMM. Supports three calibration precision levels: low (recommended default, ~18× speedup), medium, high. Serialize to {quant_type}.safetensors + quant_config.json; restore via cache_dit.load().
  • NVFP4 PTQ (≥sm120, Blackwell): Designed for RTX 5090 and other Blackwell GPUs. Currently only runtime_kernel="v1" is supported for NVFP4.
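
The "INT4 packing" step in the INT4 PTQ flow can be illustrated with a minimal sketch. It assumes symmetric per-row quantization and a low-nibble-first byte layout; both are illustrative assumptions, not Cache-DiT's actual storage format, and the SVD low-rank branch and the W4A4 GEMM kernel are not modeled. All function names are hypothetical.

```python
def quantize_int4(values):
    """Symmetric INT4 quantization of one row: returns (codes, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    # Clamp to the signed 4-bit range [-8, 7].
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def pack_int4(codes):
    """Pack pairs of signed 4-bit codes into bytes (low nibble first, assumed layout)."""
    if len(codes) % 2:
        codes = codes + [0]  # pad to an even count
    packed = bytearray()
    for lo, hi in zip(codes[0::2], codes[1::2]):
        packed.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(packed)


def unpack_int4(packed, count):
    """Inverse of pack_int4: recover `count` signed 4-bit codes."""
    codes = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            codes.append(nibble - 16 if nibble >= 8 else nibble)
    return codes[:count]


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]


row = [0.7, -0.3, 0.1, -0.7]
codes, scale = quantize_int4(row)
packed = pack_int4(codes)
restored = dequantize(unpack_int4(packed, len(row)), scale)
print(codes, len(packed), restored)
```

Four weights occupy two bytes after packing, which is where the 4-bit storage saving comes from; the per-row scale is kept alongside so the values can be rescaled at runtime.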

Performance (FLUX.2-klein-4B, 1024×1024, L20):

Stage | Latency (s) | Memory (GiB) | Transformer Weight (GiB)
BF16 baseline | 2.13 | 17.32 | 7.22
SVDQuant INT4 | 1.24 | 12.39 | 2.28
SVDQuant + compile | 1.02 | 12.39 | 2.28

  • Transformer weight reduction: ~3.2× compression (7.22 → 2.28 GiB)
  • End-to-end latency: ~1.7× speedup (2.13 → 1.24s), ~2.1× with compile (2.13 → 1.02s)
  • PSNR > 29 dB, near-lossless visual quality
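
The compression and speedup figures quoted above follow directly from the table. A short sketch of the arithmetic, using only the numbers reported in these notes (the dictionary layout and helper name are illustrative):

```python
# Figures copied from the FLUX.2-klein-4B / L20 table in these release notes.
rows = {
    "BF16 baseline": {"latency_s": 2.13, "weight_gib": 7.22},
    "SVDQuant INT4": {"latency_s": 1.24, "weight_gib": 2.28},
    "SVDQuant + compile": {"latency_s": 1.02, "weight_gib": 2.28},
}


def ratio(metric, baseline, candidate):
    """How many times smaller/faster `candidate` is than `baseline` on `metric`."""
    return rows[baseline][metric] / rows[candidate][metric]


compression = ratio("weight_gib", "BF16 baseline", "SVDQuant INT4")
speedup_int4 = ratio("latency_s", "BF16 baseline", "SVDQuant INT4")
speedup_compiled = ratio("latency_s", "BF16 baseline", "SVDQuant + compile")

print(f"weights: {compression:.2f}x, INT4: {speedup_int4:.2f}x, "
      f"INT4 + compile: {speedup_compiled:.2f}x")
```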

NVFP4 Performance (RTX 5090, FLUX.2-klein-4B, 1024×1024):

Stage | Latency (s) | Speedup | Memory (GiB)
BF16 baseline | 0.97 | 1.00× | 17.32
NVFP4 PTQ | 0.58 | 1.69× | 12.50
NVFP4 + compile | 0.47 | 2.05× | 12.50

⚡ DQ (Dynamic Quantization)

Zero-calibration quantization via _dq suffix types (e.g., svdq_int4_r128_dq):

  • identity (default): Apply SVD low-rank decomposition directly to the original weight matrix — no calibration, no serialization.
  • weight / weight_inv: Weight-statistics-only heuristic smooth strategies (experimental).
  • few_shot: Collect a small number of real inference forwards at runtime, then quantize in-place with configurable relaxation strategies (7 strategies: auto/stable_auto/power/log/rank/top/fixed). Supports few_shot_auto_compile for deferred compilation after quantization.

DQ Performance (FLUX.2-klein-4B, 1024×1024, identity, rank=128): 1.28s, PSNR 28.71 dB.
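
The quant-type strings used in these notes follow a visible naming pattern: `svdq_{int4|nvfp4}_r{rank}` with an optional `_dq` suffix, and the converter CLI example spells the same name with hyphens. The helper below is a hypothetical sketch that decodes such names; it is not part of the Cache-DiT API, and treating hyphens and underscores as interchangeable is an assumption drawn only from the CLI example.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SvdqType:
    fmt: str        # "int4" or "nvfp4"
    rank: int       # low-rank branch size, e.g. 128
    dynamic: bool   # True when the name carries the _dq suffix


_PATTERN = re.compile(r"^svdq_(int4|nvfp4)_r(\d+)(_dq)?$")


def parse_quant_type(name: str) -> SvdqType:
    """Decode a name such as 'svdq_int4_r128_dq' (hypothetical helper)."""
    normalized = name.replace("-", "_")  # assumed: CLI hyphens map to underscores
    match = _PATTERN.match(normalized)
    if match is None:
        raise ValueError(f"not an SVDQ quant type: {name!r}")
    return SvdqType(match.group(1), int(match.group(2)), match.group(3) is not None)


print(parse_quant_type("svdq_int4_r128_dq"))
print(parse_quant_type("svdq-nvfp4-r32"))
```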

🔧 SVDQ Converter CLI

New cache-dit-convert command-line tool for one-click model conversion to SVDQ W4A4:

cache-dit-convert --model-path black-forest-labs/FLUX.2-klein-4B \
  --save-dir ./FLUX.2-klein-4B-svdq \
  --quant-type svdq-int4-r128-dq

Supports INT4/NVFP4 DQ conversion, custom smooth strategies, and multiple precision options. Outputs {quant_type}.safetensors + quant_config.json.

🔀 Fused MLP

New fused_gelu_mlp / fused_gelu_proj passes (enable via svdq_kwargs["fused_mlp"]=True), fusing the first GEMM + GELU activation into a single kernel for reduced kernel launch overhead and intermediate activation memory.

🔗 Parallelism Compatibility (Cache-DiT Exclusive)

SVDQuant in Cache-DiT fully supports Context Parallelism (Ulysses / Ring / USP) and Tensor Parallelism (PyTorch DTensor). Users can layer quantization on top of distributed parallelism for extreme memory compression and throughput gains. SVDQW4A4ShardLinear (dtensor.py) provides native TP sharding support. This is a differentiating capability of Cache-DiT's SVDQuant implementation versus other W4A4 solutions.

⚙️ Quantization Configuration Enhancements
  • Regional Quantization (regional_quantize=True + repeated_blocks): Quantize only repeated transformer blocks, keeping embedding layers at full precision.
  • Hybrid Precision Plan (precision_plan): Assign different quant types to different sub-layers by name pattern.
  • FP8 Per-Tensor Fallback (per_tensor_fallback=True, default on): Auto-fallback from per-row to per-tensor for TP-incompatible layers.
  • TorchAO Backend Refactor: Cleaner QuantizeBackend enum (AUTO / TORCHAO / CACHE_DIT / NONE).
  • Quantize API Refactor: Deprecated legacy kwargs, unified under QuantizeConfig + svdq_kwargs.
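
To make the idea behind `precision_plan` concrete, here is a minimal sketch of resolving a quant type per layer by name pattern. The glob-style matching, the first-match-wins rule, the layer names, and the `resolve_quant_type` helper are all assumptions for illustration; the notes do not specify how Cache-DiT matches patterns. The two pattern/type pairs mirror the example given in the Chinese edition of these notes.

```python
from fnmatch import fnmatchcase


def resolve_quant_type(layer_name, precision_plan, default=None):
    """Return the quant type of the first pattern matching layer_name (hypothetical)."""
    for pattern, quant_type in precision_plan.items():
        if fnmatchcase(layer_name, pattern):
            return quant_type
    return default


# Pattern -> quant type; dict order defines priority in this sketch.
plan = {
    "*attn.to_q": "float8_per_tensor",
    "*attn.to_k": "float8_weight_only",
}

for name in ("transformer_blocks.0.attn.to_q",
             "transformer_blocks.0.attn.to_k",
             "transformer_blocks.0.ff.net.0"):
    print(name, "->", resolve_quant_type(name, plan, default="float8_per_row"))
```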
📦 cache-dit-cu13 Pre-built Wheel

Pre-compiled SVDQuant wheel for CUDA 13 users: pip install cache-dit-cu13 — no source build needed.


2. 💾 Bucket-style Layerwise CPU Offload

Cache-DiT v1.5.0 introduces a novel bucket-style layerwise offload mechanism that overcomes the "every layer waits for its own transfer" inefficiency of traditional approaches.

Core Design:

  • Bucket Pipeline: Divide target modules into small contiguous buckets; prefetch the next bucket asynchronously while the current one executes.
  • Dual Independent Copy Stream Pools: Separate CUDA stream pools for onload (H2D) and offload (D2H).
  • Persistent Bins: Distribute the persistent budget evenly across the target sequence.
  • Flexible Resource Controls: transfer_buckets, persistent_buckets, persistent_bins, prefetch_limit, max_copy_streams, max_inflight_prefetch_bytes.
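
The benefit of prefetching the next bucket while the current one executes can be seen in a toy timing model. This is a simplified sketch with a prefetch depth of one and a single copy stream; the function names and the timing model are assumptions, not the Cache-DiT scheduler.

```python
def serial_time(compute, transfer):
    """Every bucket waits for its own transfer before it computes."""
    return sum(c + t for c, t in zip(compute, transfer))


def pipelined_time(compute, transfer):
    """Bucket i+1 is copied while bucket i computes (prefetch depth 1)."""
    if not compute:
        return 0
    start = transfer[0]          # the first bucket cannot be hidden
    end = start
    for i, c in enumerate(compute):
        end = start + c
        if i + 1 < len(compute):
            arrive_next = start + transfer[i + 1]   # copy began with compute i
            start = max(end, arrive_next)           # stall only if the copy is slower
    return end


compute = [4, 4, 4, 4]    # per-bucket compute time (arbitrary units)
transfer = [3, 3, 3, 3]   # per-bucket host-to-device copy time
print(serial_time(compute, transfer), pipelined_time(compute, transfer))
```

When each copy is shorter than the compute it overlaps with, only the first transfer remains on the critical path, which is consistent with the small added latency reported in the table that follows.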

Performance (FLUX.1-dev, L20):

Config | Memory | Latency
No offload | ~38 GiB | 23.4s
Diffusers sequential | ~1 GiB | 335s
Layerwise (transfer=4, persistent=32, bins=4) | ~16 GiB | 24.6s

Only ~1.2s added latency versus no-offload baseline for a 38→16 GiB memory reduction.

torch.compile Compatible: Apply offload before compile; offload hooks work correctly after compilation.

CLI Quick Start:

python3 -m cache_dit.generate flux \
  --layerwise-offload --layerwise-async-transfer \
  --layerwise-transfer-buckets 4 --layerwise-persistent-buckets 32 \
  --layerwise-persistent-bins 4 --layerwise-prefetch-limit \
  --layerwise-max-inflight-prefetch-bytes 8gib --compile

3. 🌩️ Ray Wrapper (Transparent Distributed Inference)

The Ray Wrapper makes distributed inference completely transparent to user code. No torchrun, no dist.init_process_group, no manual model sharding — just use_ray=True, and Cache-DiT handles everything.

Two Wrapper Levels:

Level | Description | Best For
Pipeline Wrapper (recommended) | Ray manages the entire pipeline execution | Full feature support (cache, quant, parallelism); simplest and fastest
Transformer Wrapper | Only the transformer runs on Ray workers | Lightweight, but slightly slower

Key Features:

  • ray_transfer_fn: User-defined per-worker model loading, bypassing serialization overhead and custom module class resolution issues.
  • ray_use_compile: Automatic per-worker compilation.
  • ray_runtime_env: Custom module import handling via PYTHONPATH.
  • Supports all parallelism strategies: TP, Ulysses, Ring.
  • LoRA support: fuse before enabling (TP requires fused LoRAs).

Performance (FLUX.2-klein-base-9B):

Config | Latency
Baseline (single GPU) | 47.41s
Ray TP=2 + compile | 24.57s
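
For reference, the speedup and per-GPU efficiency implied by these two latencies can be computed as below. The helper names are illustrative. Note that the Ray configuration also enables compile, and the notes do not say whether the baseline was compiled, so the ratio is not a pure tensor-parallel scaling measurement.

```python
def speedup(baseline_s, candidate_s):
    """Latency ratio: how many times faster the candidate is."""
    return baseline_s / candidate_s


def parallel_efficiency(baseline_s, candidate_s, num_gpus):
    """Speedup divided by the number of GPUs used (1.0 = ideal linear scaling)."""
    return speedup(baseline_s, candidate_s) / num_gpus


ray_speedup = speedup(47.41, 24.57)
ray_efficiency = parallel_efficiency(47.41, 24.57, num_gpus=2)
print(f"speedup: {ray_speedup:.2f}x, efficiency: {ray_efficiency:.0%}")
```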

Minimal Example:

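# Import path assumed; `pipe` is an already-loaded diffusers pipeline.
import cache_dit
from cache_dit import ParallelismConfig

cache_dit.enable_cache(
  pipe,
  parallelism_config=ParallelismConfig(tp_size=2, use_ray=True),
)
image = pipe(prompt="A cat...").images[0]  # Code unchanged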

4. 🔮 DMD Calibrator (Dynamic Mode Decomposition)

DMD (Dynamic Mode Decomposition) is an exponential-basis feed-forward calibrator, serving as a drop-in alternative to TaylorSeer's polynomial basis.

Mathematical Principle: The cached feature stream is modeled as a linear dynamical system $Y_{t+1} \approx A \cdot Y_t$. The propagator $A$ is identified with a single economy SVD and then eigendecomposed, so a $k$-step extrapolation scales each mode by $\lambda^k$. TaylorSeer's polynomial extrapolation grows without bound because its $t^n$ terms diverge as $t$ increases, whereas the DMD extrapolation stays bounded whenever every eigenvalue satisfies $\lvert\lambda\rvert \leq 1$.

TaylorSeer vs DMD:

Aspect | TaylorSeer (Polynomial) | DMD (Exponential)
Basis | $Y(t) \approx \sum \frac{d^nY}{dt^n} \frac{t^n}{n!}$ | $Y(t) \approx \Phi \cdot \text{diag}(\lambda^t) \cdot b$
Extrapolation | Diverges as $t^n \to \infty$ | Bounded when $\lvert\lambda\rvert \leq 1$
Snapshots needed | 2+ (1st order) | ≥ 4 uniformly spaced
Best for | DiT-class denoising (DDPM) | Flow-matching generators (Hunyuan3D, etc.)
Noise sensitivity | Low | Moderate (SVD truncation suppresses noise)
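
The difference between the two bases is easiest to see in the scalar case, where the propagator $A$ reduces to a single number fitted by least squares. The sketch below is an illustration under that simplification: the real calibrator operates on feature tensors via SVD and eigendecomposition, the function names are hypothetical, and the `ridge` argument only borrows the name of `dmd_ridge` with assumed semantics (a small term added to the denominator).

```python
def fit_propagator(snapshots, ridge=1e-8):
    """Least-squares fit of a in y[t+1] ~= a * y[t], with a ridge term."""
    num = sum(x * y for x, y in zip(snapshots[:-1], snapshots[1:]))
    den = sum(x * x for x in snapshots[:-1]) + ridge
    return num / den


def dmd_extrapolate(snapshots, steps, ridge=1e-8):
    """Exponential-basis prediction: last snapshot scaled by a**steps."""
    a = fit_propagator(snapshots, ridge)
    return snapshots[-1] * a ** steps


def taylor_extrapolate(snapshots, steps):
    """First-order polynomial prediction from the last finite difference."""
    slope = snapshots[-1] - snapshots[-2]
    return snapshots[-1] + steps * slope


history = [1.0, 0.5, 0.25, 0.125]   # a decaying stream with |a| = 0.5
for k in (2, 8):
    print(k, dmd_extrapolate(history, k), taylor_extrapolate(history, k))
```

On this decaying sequence the exponential prediction keeps shrinking toward zero, while the first-order polynomial prediction overshoots and grows in magnitude as the horizon increases.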

Usage:

cache_dit.enable_cache(
  pipe,
  cache_config=DBCacheConfig(...),
  calibrator_config=DMDCalibratorConfig(
    dmd_history=6, dmd_rank=0, dmd_ridge=1e-8,
  ),
)

CLI: python -m cache_dit.generate flux --cache --dmd --dmd-history 6


🔧 Other Enhancements

🧩 FLUX.2-klein-kv Series
📈 CUDA Graph
🏗️ Kernel & Infrastructure Refactoring
🔗 Parallelism Improvements
  • record_plans for TP/TE-P planners (#​916-#​917)
  • sub cp_plan support (#​989)
  • CP & VAE-P planner refactoring for better logging (#​918)
⚡ Compilation Optimizations
  • Removed manual graph breaks (#​885)
  • CUDA Graph for dynamic compile (#​951)
  • SVDQuant + compile compatibility; offload + compile (#​1014)
  • Log suppression improvements (#​934, #​923)
🤝 Community Integrations
  • MindIE-SD: Huawei Ascend NPU attention/compilation backend support (@​blian6, #​1004)
  • TensorRT-LLM community link (#​991)
📦 Dependency Updates
📝 Documentation

⚠️ Breaking Changes

Change | PR | Migration
Serving module deprecated | #933 | Migrate to SGLang Diffusion or vLLM-Omni
Native Diffusers parallelism backend deprecated | #1017 | Use Cache-DiT native parallelism backends (better performance)
Quantize API legacy kwargs deprecated | #910-#911 | Migrate to the unified QuantizeConfig + svdq_kwargs. Legacy kwargs such as grad_ckpt and reorder_before_quantize have been removed

👥 New Contributors

Thank you to the new contributors who joined the Cache-DiT community:

🙏 Contributors

Thank you to all contributors for this release: @​DefTruth, @​Archerkattri, @​FNGarvin, @​blian6.

For the full list of changes, see GitHub Release v1.5.0 and the full changelog.


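🇨🇳 Chinese Version

📋 Overview

Cache-DiT v1.5.0 is a major feature update spanning 3 months (2026-03-12 ~ 2026-06-16) and covering 176 PRs. The release centers on four core modules: SVDQuant W4A4 quantization (PTQ/DQ/NVFP4/converter CLI), DMD Calibrator (a Dynamic Mode Decomposition calibrator built on an exponential basis), Bucket-style Layerwise CPU Offload (layerwise offload with compute-communication overlap), and Ray Wrapper (a wrapper that makes distributed inference transparent). It also includes FLUX.2-klein-kv series support, full CUDA Graph integration, extensive kernel refactoring and quantization enhancements, parallelism framework improvements, and a comprehensively updated documentation set.


✨ Core Highlights

1. 💎 SVDQuant W4A4 Quantization

Cache-DiT v1.5.0 natively integrates a complete SVDQuant W4A4 quantization workflow, the most important feature of this release. Instead of relying on third-party libraries, users can now run the full W4A4 quantization path, from calibration to inference, directly through Cache-DiT's cache_dit.quantize() / cache_dit.load() API.

📊 PTQ (Post-Training Quantization)

Supports two quant types, svdq_int4_r{rank} and svdq_nvfp4_r{rank}:

  • INT4 PTQ (≥sm80): Collect activation statistics via calibrate_fn → SVD low-rank decomposition → INT4 packing → runtime W4A4 GEMM. Supports three calibration precision levels: low (recommended default, ~18x speedup), medium, and high. Serialize to {quant_type}.safetensors + quant_config.json and restore in one step via cache_dit.load().
  • NVFP4 PTQ (≥sm120, Blackwell): Designed for RTX 5090 and other Blackwell GPUs; weights are packed in the NVFP4 format. Currently only runtime_kernel="v1" is supported.

Performance (FLUX.2-klein-4B, 1024×1024, L20):

Stage | Latency (s) | Memory (GiB) | Transformer Weight (GiB)
BF16 baseline | 2.13 | 17.32 | 7.22
SVDQuant INT4 | 1.24 | 12.39 | 2.28
SVDQuant + compile | 1.02 | 12.39 | 2.28

  • Transformer weights: ~3.2× compression (7.22 → 2.28 GiB)
  • End-to-end latency: ~1.7× speedup (2.13 → 1.24s), ~2.1× with compile on top (2.13 → 1.02s)
  • PSNR stays at 29+ dB, with near-lossless visual quality

NVFP4 Performance (RTX 5090, FLUX.2-klein-4B, 1024×1024):

Stage | Latency (s) | Speedup | Memory (GiB)
BF16 baseline | 0.97 | 1.00× | 17.32
NVFP4 PTQ | 0.58 | 1.69× | 12.50
NVFP4 + compile | 0.47 | 2.05× | 12.50

⚡ DQ (Dynamic Quantization)

Zero-shot quantization that needs no calibration data, selected with the _dq type suffix (e.g., svdq_int4_r128_dq):

  • identity (default): Apply SVD low-rank decomposition directly to the original weight matrix; no calibration and no serialization required.
  • weight / weight_inv: Heuristic smooth strategies based only on weight statistics (experimental).
  • few_shot: Collect activation statistics from a small number of forward passes at runtime, then quantize on the fly. Supports 7 relaxation strategies (auto/stable_auto/power/log/rank/top/fixed), configurable via few_shot_steps, few_shot_relax_factor, few_shot_relax_top_ratio, and few_shot_relax_strategy. few_shot_auto_compile triggers torch.compile automatically once quantization completes.

DQ Performance (FLUX.2-klein-4B, 1024×1024, identity smooth, rank=128): 1.28s, PSNR 28.71 dB.

🔧 SVDQ Converter CLI

New cache-dit-convert command-line tool that converts a pretrained model to the SVDQ W4A4 format in one step: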

cache-dit-convert --model-path black-forest-labs/FLUX.2-klein-4B \
  --save-dir ./FLUX.2-klein-4B-svdq \
  --quant-type svdq-int4-r128-dq

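Supports INT4/NVFP4 DQ conversion, custom smooth strategies, and multiple precision options. Produces {quant_type}.safetensors + quant_config.json, which can be loaded directly with cache_dit.load().

🔀 Fused MLP

New fused_gelu_mlp / fused_gelu_proj passes (enable via svdq_kwargs["fused_mlp"]=True) fuse the first GEMM + GELU activation into a single kernel, reducing kernel launch overhead and intermediate activation memory.

🔗 Parallelism Compatibility (Cache-DiT Exclusive)

SVDQuant in Cache-DiT fully supports Context Parallelism (Ulysses / Ring / USP) and Tensor Parallelism (PyTorch DTensor). Users can combine quantization with parallelism in distributed inference for extreme memory compression and throughput gains. SVDQW4A4ShardLinear (dtensor.py) provides native TP sharding support. This capability differentiates Cache-DiT's SVDQuant from other W4A4 implementations.

⚙️ Quantization Configuration Enhancements

  • Regional Quantization (regional_quantize=True + repeated_blocks): Quantize only the repeated transformer blocks, keeping sensitive layers such as embeddings at full precision.
  • Hybrid Precision Plan (precision_plan): Assign different quant types to different sub-layers by layer-name pattern (e.g., attn.to_q as float8_per_tensor, attn.to_k as float8_weight_only).
  • FP8 Per-Tensor Fallback (per_tensor_fallback=True, on by default): Under TP, layers that do not support per-row quantization fall back to per-tensor automatically, which removes skip warnings and improves coverage (144 → 144, 0 skip).
  • TorchAO Backend Refactor: Cleaner QuantizeBackend enum (AUTO / TORCHAO / CACHE_DIT / NONE).
  • Quantize API Refactor: Legacy kwargs deprecated and unified under QuantizeConfig + svdq_kwargs.

📦 cache-dit-cu13 Pre-built Wheel

Pre-compiled SVDQuant wheel for CUDA 13 users: pip install cache-dit-cu13, which avoids building the SVDQ kernels from source.


2. 💾 Bucket-style Layerwise CPU Offload

Cache-DiT v1.5.0 introduces a new bucket-style layerwise offload mechanism that solves the "every layer waits for its transfer" inefficiency of traditional layerwise offload.

Core Design:

  • Bucket Pipeline: Split the target modules into small contiguous buckets and prefetch the next bucket asynchronously while the current one executes, overlapping compute with communication.
  • Dual Independent Copy Stream Pools: onload (H2D) and offload (D2H) each have their own CUDA stream pool.
  • Persistent Bins: Spread the persistent budget evenly across the target sequence so hot weights do not concentrate in the prefix.
  • Flexible Resource Controls: transfer_buckets (prefetch depth), persistent_buckets (number of resident buckets), persistent_bins (number of bins the resident buckets are spread over), prefetch_limit (conservative prefetch limit), max_copy_streams (number of concurrent copy streams), max_inflight_prefetch_bytes (prefetch byte budget).

Performance (FLUX.1-dev, L20):

Config | Memory | Latency
No offload | ~38 GiB | 23.4s
Native diffusers offload | ~25 GiB | 56s
Native diffusers sequential | ~1 GiB | 335s
Layerwise (transfer=4, persistent=32, bins=4) | ~16 GiB | 24.6s

Only 1.2s of added latency (vs 23.4s with no offload) to shrink memory from 38 GiB to 16 GiB.

torch.compile Compatible: Apply offload first, then compile; offload hooks work correctly after compilation.

CLI Quick Start: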

python3 -m cache_dit.generate flux \
  --layerwise-offload --layerwise-async-transfer \
  --layerwise-transfer-buckets 4 --layerwise-persistent-buckets 32 \
  --layerwise-persistent-bins 4 --layerwise-prefetch-limit \
  --layerwise-max-inflight-prefetch-bytes 8gib --compile
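
3. 🌩️ Ray Wrapper (Transparent Distributed Inference)

The Ray Wrapper makes distributed inference completely transparent to user code. You do not need torchrun, dist.init_process_group, or manual model sharding: just set use_ray=True and Cache-DiT takes over.

Two Wrapper Levels:

Level | Description | Best For
Pipeline Wrapper (recommended) | Ray manages the entire pipeline execution, including the text encoder, VAE, and scheduler | Full feature support (cache, quantization, parallelism); simplest user experience
Transformer Wrapper | Only the transformer is executed by Ray; other components stay in the main process | Lightweight, but slower than the pipeline level

Key Features:

  • ray_transfer_fn: User-defined per-worker model loading logic, bypassing serialization/deserialization overhead and resolving class-resolution issues for custom modules.
  • ray_use_compile: Automatic compilation inside each worker.
  • ray_runtime_env: Handles custom module imports via PYTHONPATH.
  • Supports all parallelism strategies, including TP, Ulysses, and Ring.
  • LoRA support: fusing before use is recommended (TP does not support unfused LoRAs).

Performance (FLUX.2-klein-base-9B):

Config | Latency
Baseline (single GPU) | 47.41s
Ray Wrapper TP=2 + compile | 24.57s

Minimal Example: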

cache_dit.enable_cache(
  pipe,
  parallelism_config=ParallelismConfig(tp_size=2, use_ray=True),
)
image = pipe(prompt="A cat...").images[0]  # Code is completely unchanged

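4. 🔮 DMD Calibrator (Dynamic Mode Decomposition)

DMD (Dynamic Mode Decomposition) is a feed-forward calibrator built on an exponential basis, offered as an alternative to TaylorSeer (polynomial basis).

Mathematical Principle: The cached feature stream is modeled as a linear dynamical system $Y_{t+1} \approx A \cdot Y_t$. The propagator $A$ is identified with a single SVD and then eigendecomposed, and extrapolation uses $\lambda^k$. Compared with TaylorSeer's polynomial extrapolation (where $t^n$ diverges), DMD is bounded when $\lvert\lambda\rvert \leq 1$.

TaylorSeer vs DMD:

Aspect | TaylorSeer (Polynomial) | DMD (Exponential)
Basis | $Y(t) \approx \sum \frac{d^nY}{dt^n} \frac{t^n}{n!}$ | $Y(t) \approx \Phi \cdot \text{diag}(\lambda^t) \cdot b$
Extrapolation | Diverges as $t^n \to \infty$ | Bounded when $\lvert\lambda\rvert \leq 1$
Snapshots needed | 2+ (1st order) | ≥ 4 uniformly spaced
Best for | DiT-class denoising (DDPM) | Flow-matching generators (Hunyuan3D, etc.)
Noise sensitivity | Low | Moderate (SVD truncation suppresses noise)

Usage: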

cache_dit.enable_cache(
  pipe,
  cache_config=DBCacheConfig(...),
  calibrator_config=DMDCalibratorConfig(
    dmd_history=6, dmd_rank=0, dmd_ridge=1e-8,
  ),
)

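CLI: python -m cache_dit.generate flux --cache --dmd --dmd-history 6


🔧 Other Enhancements

🧩 FLUX.2-klein-kv Series Support
  • TP + compile integration (#888)
  • fp8 per-row + TP support (#896)
  • Async Ulysses support (#877)
📈 CUDA Graph
  • Full CUDA Graph support (#942-#952)
  • CUDA Graph combined with fp8 rowwise (#952)
  • descent_tuning enabled by default (#935)
  • compile full-graph CLI option (#946)
🏗️ Kernel & Infrastructure Refactoring
🔗 Parallelism Improvements
  • record_plans function support (TP/TE-P planners, #916-#917)
  • sub cp_plan support (#989)
  • CP & VAE-P planner refactoring for better logging (#918)
⚡ Compilation Optimizations
  • Removed manual graph breaks (#885)
  • CUDA Graph for dynamic compile (#951)
  • SVDQuant + compile compatibility; offload + compile (#1014)
  • Log suppression improvements (#934, #923)
🤝 Community Integrations
  • MindIE-SD: attention/compilation backend support for Huawei Ascend NPU (@blian6, #1004)
  • TensorRT-LLM community link (#991)
📦 Dependency Updates
📝 Documentation
  • New or substantially updated docs: QUANTIZATION.md, OFFLOAD.md, RAY.md, CACHE_API.md (DMD), CUDA_GRAPH.md, COMPILE.md
  • Documentation formatting and typo fixes (#886-#887, #899, #903, #913)
  • FAQ update: Flash Attention 2 installation guide (#915)
  • Community link fixes (#928, #940)

⚠️ Breaking Changes

Change | PR | Migration
Serving module deprecated | #933 | Migrating to SGLang Diffusion or vLLM-Omni is recommended
Native Diffusers parallelism backend deprecated | #1017 | Use Cache-DiT native parallelism backends (better performance)
Quantize API legacy kwargs deprecated | #910-#911 | Use the unified QuantizeConfig + svdq_kwargs; legacy parameters such as grad_ckpt and reorder_before_quantize have been removed

👥 New Contributors

Thank you to the following new contributors for joining the Cache-DiT community:

🙏 Acknowledgements

Thank you to all contributors: @DefTruth, @Archerkattri, @FNGarvin, @blian6.

For the full list of changes, see GitHub Release v1.5.0.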

v1.3.12

Compare Source

What's Changed

Full Changelog: vipshop/cache-dit@v1.3.11...v1.3.12

v1.3.11

Compare Source

What's Changed

Full Changelog: vipshop/cache-dit@v1.3.10...v1.3.11

v1.3.10

Compare Source

What's Changed

New Contributors

Full Changelog: vipshop/cache-dit@v1.3.9...v1.3.10

v1.3.9

Compare Source

What's Changed

Full Changelog: vipshop/cache-dit@v1.3.8...v1.3.9

v1.3.8

Compare Source

What's Changed

Full Changelog: vipshop/cache-dit@v1.3.7...v1.3.8

v1.3.7

Compare Source

What's Changed

Full Changelog: vipshop/cache-dit@v1.3.6...v1.3.7

v1.3.6

Compare Source

What's Changed

Full Changelog: vipshop/cache-dit@v1.3.5...v1.3.6

v1.3.5: Quantization

Compare Source

Low-bits Quantization

Overview

Quantization is a powerful technique to reduce the memory footprint and computational cost of deep learning models by representing weights and activations with lower precision data types. Cache-DiT supports various quantization methods, including FP8, INT8, and INT4 quantization, to help users achieve faster inference and lower memory usage while maintaining acceptable model performance.

quantization type | description | devices
float8_per_row | quantize weights and activations to float8 (dynamic quantization), row-wise (recommended) | >=sm89, Ada, Hopper or newer
float8_per_tensor | quantize weights and activations to float8 (dynamic quantization), tensor-wise | >=sm89, Ada, Hopper or newer
float8_per_block | block-wise dynamic quantization of weights and activations to float8, which can give better precision; activation block size: (1, 128), weight block size: (128, 128) | >=sm89, Ada, Hopper or newer
float8_weight_only | quantize only weights to float8, keep activations in full precision | >=sm89, Ada, Hopper or newer
int8_per_row | quantize weights and activations to int8 (dynamic quantization), row-wise | >=sm80, Ampere or newer
int8_per_tensor | quantize weights and activations to int8 (dynamic quantization), tensor-wise | >=sm80, Ampere or newer
int8_weight_only | quantize only weights to int8, keep activations in full precision | >=sm80, Ampere or newer
int4_weight_only | quantize only weights to int4, keep activations in full precision | >=sm90, Hopper or newer, TMA required
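
The device column above can be turned into a quick lookup. The sketch below encodes only the minimum compute capabilities listed in the table; the `MIN_SM` mapping and `supported_quant_types` helper are hypothetical, and the extra TMA requirement for int4_weight_only is not modeled.

```python
# Minimum compute capability per quant type, copied from the table above.
MIN_SM = {
    "float8_per_row": 89,
    "float8_per_tensor": 89,
    "float8_per_block": 89,
    "float8_weight_only": 89,
    "int8_per_row": 80,
    "int8_per_tensor": 80,
    "int8_weight_only": 80,
    "int4_weight_only": 90,
}


def supported_quant_types(major, minor):
    """List quant types whose minimum sm level is met by sm{major}{minor}."""
    sm = major * 10 + minor
    return sorted(name for name, need in MIN_SM.items() if sm >= need)


print(supported_quant_types(8, 0))   # Ampere-class device
print(supported_quant_types(8, 9))   # Ada-class device
```

In PyTorch, the `(major, minor)` pair for the current GPU can be obtained from `torch.cuda.get_device_capability()`.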

FP8 Quantization

TorchAO is now fully integrated into Cache-DiT as the backend for online quantization. To quantize a model, either call quantize directly or pass a QuantizeConfig to the enable_cache API (recommended).

For GPUs with low memory capacity, we recommend using float8_per_row or float8_per_block, as these

Note

PR body was truncated to here.


Configuration

📅 Schedule: (UTC)

  • Branch creation
    • At any time (no schedule defined)
  • Automerge
    • At any time (no schedule defined)

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.


  • If you want to rebase/retry this PR, check this box

This PR was generated by Mend Renovate. View the repository job log.

renovate[bot] force-pushed the renovate/cache_dit-1.x branch from 9463dfe to fd0ef19 (February 2, 2026 05:23)
renovate[bot] changed the title from "Update dependency cache_dit to v1.2.0" to "Update dependency cache_dit to v1.2.1" (Feb 2, 2026)
renovate[bot] force-pushed the renovate/cache_dit-1.x branch from fd0ef19 to c65665d (February 10, 2026 08:50)
renovate[bot] changed the title from "Update dependency cache_dit to v1.2.1" to "Update dependency cache_dit to v1.2.2" (Feb 10, 2026)
renovate[bot] force-pushed the renovate/cache_dit-1.x branch from c65665d to 376529f (February 26, 2026 08:32)
renovate[bot] changed the title from "Update dependency cache_dit to v1.2.2" to "Update dependency cache_dit to v1.2.3" (Feb 26, 2026)
renovate[bot] force-pushed the renovate/cache_dit-1.x branch from 376529f to 8e12ad4 (March 11, 2026 12:52)
renovate[bot] changed the title from "Update dependency cache_dit to v1.2.3" to "Update dependency cache_dit to v1.3.0" (Mar 11, 2026)
renovate[bot] force-pushed the renovate/cache_dit-1.x branch from 8e12ad4 to 90ea7f3 (March 25, 2026 09:55)
renovate[bot] changed the title from "Update dependency cache_dit to v1.3.0" to "Update dependency cache_dit to v1.3.1" (Mar 25, 2026)
renovate[bot] force-pushed the renovate/cache_dit-1.x branch from 90ea7f3 to 7f6b76d (March 26, 2026 13:02)
renovate[bot] changed the title from "Update dependency cache_dit to v1.3.1" to "Update dependency cache_dit to v1.3.3" (Mar 26, 2026)
renovate[bot] force-pushed the renovate/cache_dit-1.x branch from 7f6b76d to ba50475 (March 27, 2026 05:24)
renovate[bot] changed the title from "Update dependency cache_dit to v1.3.3" to "Update dependency cache_dit to v1.3.4" (Mar 27, 2026)
renovate[bot] force-pushed the renovate/cache_dit-1.x branch from ba50475 to 0b1a44a (March 30, 2026 10:15)
renovate[bot] changed the title from "Update dependency cache_dit to v1.3.4" to "Update dependency cache_dit to v1.3.5" (Mar 30, 2026)
renovate[bot] force-pushed the renovate/cache_dit-1.x branch from 0b1a44a to 1ce318c (May 11, 2026 05:43)
renovate[bot] changed the title from "Update dependency cache_dit to v1.3.5" to "Update dependency cache_dit to v1.3.6" (May 11, 2026)
renovate[bot] force-pushed the renovate/cache_dit-1.x branch from 1ce318c to 28fa8b8 (May 12, 2026 09:11)
renovate[bot] changed the title from "Update dependency cache_dit to v1.3.6" to "Update dependency cache_dit to v1.3.7" (May 12, 2026)
renovate[bot] force-pushed the renovate/cache_dit-1.x branch from 28fa8b8 to 9ea5b4b (May 25, 2026 14:11)
renovate[bot] changed the title from "Update dependency cache_dit to v1.3.7" to "Update dependency cache_dit to v1.3.8" (May 25, 2026)
renovate[bot] force-pushed the renovate/cache_dit-1.x branch from 9ea5b4b to 302b19a (May 27, 2026 04:30)
renovate[bot] changed the title from "Update dependency cache_dit to v1.3.8" to "Update dependency cache_dit to v1.3.9" (May 27, 2026)
renovate[bot] force-pushed the renovate/cache_dit-1.x branch from 302b19a to 92bb116 (June 4, 2026 10:34)
renovate[bot] changed the title from "Update dependency cache_dit to v1.3.9" to "Update dependency cache_dit to v1.3.11" (Jun 4, 2026)
renovate[bot] force-pushed the renovate/cache_dit-1.x branch from 92bb116 to 3b5d57c (June 9, 2026 05:33)
renovate[bot] changed the title from "Update dependency cache_dit to v1.3.11" to "Update dependency cache_dit to v1.3.12" (Jun 9, 2026)
renovate[bot] force-pushed the renovate/cache_dit-1.x branch from 3b5d57c to 8b1494c (June 16, 2026 08:45)
renovate[bot] changed the title from "Update dependency cache_dit to v1.3.12" to "Update dependency cache_dit to v1.5.0" (Jun 16, 2026)