
v1.0.0 🎉 First Stable Release

@DefTruth released this 25 Sep 07:23 · 22 commits to main since this release · 9119f6a

🎉 v1.0.0 First Stable Release

We are excited to announce that the first API-stable version of cache-dit has finally been released!

cache-dit is a Unified, Flexible, and Training-free cache acceleration framework for 🤗 Diffusers, enabling cache acceleration with just one line of code. Key features include Unified Cache APIs, Forward Pattern Matching, Automatic Block Adapter, Hybrid Forward Pattern, DBCache, TaylorSeer Calibrator, and Cache CFG.

🔥 Core Features

  • Full 🤗 Diffusers Support: Notably, cache-dit now supports nearly all of Diffusers' DiT-based pipelines, such as Qwen-Image, FLUX.1, Qwen-Image-Lightning, Wan 2.1/2.2, HunyuanImage-2.1, HunyuanVideo, HunyuanDiT, HiDream, AuraFlow, CogView3Plus, CogView4, LTXVideo, CogVideoX/X 1.5, ConsisID, Cosmos, SkyReelsV2, VisualCloze, OmniGen 1/2, Lumina 1/2, PixArt, Chroma, Sana, Allegro, Mochi, SD 3/3.5, Amused, and DiT-XL.
  • Extremely Easy to Use: In most cases, you only need ♥️ one line ♥️ of code: cache_dit.enable_cache(...). After calling this API, just use the pipeline as normal.
  • Easy New Model Integration: Features like Unified Cache APIs, Forward Pattern Matching, Automatic Block Adapter, Hybrid Forward Pattern, and Patch Functor make it highly functional and flexible. For example, we achieved 🎉 Day 1 support for HunyuanImage-2.1—even before it was available in the Diffusers library.
  • State-of-the-Art Performance: Compared with algorithms including Δ-DiT, Chipmunk, FORA, DuCa, TaylorSeer, and FoCa, cache-dit's DBCache achieves the best accuracy when the speedup ratio is below 3x.
  • Support for 4/8-Step Distilled Models: Surprisingly, cache-dit's DBCache works for extremely few-step distilled models—something many other methods fail to do.
  • Compatibility with Other Optimizations: Designed to work seamlessly with torch.compile, model CPU offload, sequential CPU offload, group offloading, etc.
  • Hybrid Cache Acceleration: Now supports hybrid DBCache + Calibrator schemes (e.g., DBCache + TaylorSeerCalibrator). DBCache acts as the Indicator to decide when to cache, while the Calibrator decides how to cache. More mainstream cache acceleration algorithms (e.g., FoCa) will be supported in the future, along with additional benchmarks—stay tuned for updates!
  • 🤗 Diffusers Ecosystem Integration: 🔥 cache-dit has joined the 🤗 Diffusers community ecosystem as the first DiT-specific cache acceleration framework! Check out the documentation here: Diffusers Docs.
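To make the "one line of code" claim concrete, here is a minimal usage sketch. The `cache_dit.enable_cache(...)` call is the entry point named in this release; the specific pipeline (FLUX.1-dev), prompt, and step count are illustrative assumptions, and running this requires a GPU plus the model weights.

```python
import torch
from diffusers import FluxPipeline

import cache_dit

# Load any supported DiT-based pipeline (FLUX.1-dev is an illustrative choice).
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# The one-line API from this release: enable cache acceleration in place.
cache_dit.enable_cache(pipe)

# After the call, use the pipeline exactly as before -- no other changes needed.
image = pipe("a cat wearing sunglasses", num_inference_steps=28).images[0]
image.save("cat.png")
```

Because the cache is attached to the pipeline rather than baked into the model, this composes with the optimizations listed above (e.g. `torch.compile` or CPU offload) without further changes.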

🔥Benchmarks

(Figures: ImageReward benchmark and CLIP Score benchmark)

The table below compares cache-dit: DBCache with Δ-DiT, Chipmunk, FORA, DuCa, TaylorSeer, and FoCa. At speedup ratios below 3x, cache-dit achieves the best accuracy. Surprisingly, cache-dit: DBCache still works on extremely few-step distilled models. For the complete benchmark, please refer to 📚Benchmarks.

| Method | TFLOPs(↓) | SpeedUp(↑) | ImageReward(↑) | Clip Score(↑) |
| --- | --- | --- | --- | --- |
| [FLUX.1-dev]: 50 steps | 3726.87 | 1.00× | 0.9898 | 32.404 |
| [FLUX.1-dev]: 60% steps | 2231.70 | 1.67× | 0.9663 | 32.312 |
| Δ-DiT(N=2) | 2480.01 | 1.50× | 0.9444 | 32.273 |
| Δ-DiT(N=3) | 1686.76 | 2.21× | 0.8721 | 32.102 |
| [FLUX.1-dev]: 34% steps | 1264.63 | 3.13× | 0.9453 | 32.114 |
| Chipmunk | 1505.87 | 2.47× | 0.9936 | 32.776 |
| FORA(N=3) | 1320.07 | 2.82× | 0.9776 | 32.266 |
| DBCache(F=4,B=0,W=4,MC=4) | 1400.08 | 2.66× | 1.0065 | 32.838 |
| DBCache+TaylorSeer(F=1,B=0,O=1) | 1153.05 | 3.23× | 1.0221 | 32.819 |
| DuCa(N=5) | 978.76 | 3.80× | 0.9955 | 32.241 |
| TaylorSeer(N=4,O=2) | 1042.27 | 3.57× | 0.9857 | 32.413 |
| DBCache(F=1,B=0,W=4,MC=6) | 944.75 | 3.94× | 0.9997 | 32.849 |
| DBCache+TaylorSeer(F=1,B=0,O=1) | 944.75 | 3.94× | 1.0107 | 32.865 |
| FoCa(N=5): arxiv.2508.16211 | 893.54 | 4.16× | 1.0029 | 32.948 |



Full Changelog: v0.1.0...v1.0.0