{"id":28813960,"url":"https://github.com/vipshop/cache-dit","last_synced_at":"2026-03-11T12:28:27.806Z","repository":{"id":299451564,"uuid":"1000711946","full_name":"vipshop/cache-dit","owner":"vipshop","description":"🤗CacheDiT: A Training-free and Easy-to-use Cache Acceleration Toolbox for Diffusion Transformers🔥","archived":false,"fork":false,"pushed_at":"2025-06-25T08:02:26.000Z","size":91322,"stargazers_count":64,"open_issues_count":2,"forks_count":2,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-25T08:30:34.668Z","etag":null,"topics":["acceleration","cogvideox","diffusion","dit","flux","transformers","wan"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/cache-dit","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vipshop.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-06-12T07:54:30.000Z","updated_at":"2025-06-25T07:58:02.000Z","dependencies_parsed_at":"2025-06-17T07:35:22.093Z","dependency_job_id":null,"html_url":"https://github.com/vipshop/cache-dit","commit_stats":null,"previous_names":["vipshop/dbcache","vipshop/cachedit"],"tags_count":4,"template":false,"template_full_name":null,"purl":"pkg:github/vipshop/cache-dit","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vipshop%2Fcache-dit","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vipshop%2Fcache-dit/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vipshop%2Fcache-dit/releases","manifests_url":"https://repos.eco
syste.ms/api/v1/hosts/GitHub/repositories/vipshop%2Fcache-dit/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/vipshop","download_url":"https://codeload.github.com/vipshop/cache-dit/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vipshop%2Fcache-dit/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":262444943,"owners_count":23312258,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["acceleration","cogvideox","diffusion","dit","flux","transformers","wan"],"created_at":"2025-06-18T15:02:07.969Z","updated_at":"2025-10-08T07:21:42.351Z","avatar_url":"https://github.com/vipshop.png","language":"Python","funding_links":[],"categories":["📖 News 🔥🔥","📙 Caching","Computation and Communication Optimisation"],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n  \u003cp align=\"center\"\u003e\n    \u003ch2\u003e🤗 CacheDiT: A Training-free and Easy-to-use Cache Acceleration \u003cbr\u003eToolbox for Diffusion Transformers\u003c/h2\u003e\n  \u003c/p\u003e\n  \u003cimg src=https://github.com/vipshop/cache-dit/raw/main/assets/cache-dit-v1.png \u003e\n  \u003cdiv align='center'\u003e\n      \u003cimg src=https://img.shields.io/badge/Language-Python-brightgreen.svg \u003e\n      \u003cimg src=https://img.shields.io/badge/PRs-welcome-9cf.svg \u003e\n      \u003cimg src=https://img.shields.io/badge/PyPI-pass-brightgreen.svg \u003e\n      \u003cimg src=https://static.pepy.tech/badge/cache-dit \u003e\n      \u003cimg 
src=https://img.shields.io/badge/Python-3.10|3.11|3.12-9cf.svg \u003e\n      \u003cimg src=https://img.shields.io/badge/Release-v0.2.1-brightgreen.svg \u003e\n \u003c/div\u003e\n  \u003cp align=\"center\"\u003e\n    DeepCache is for UNet not DiT. Most DiT cache speedups are complex and not training-free. CacheDiT \u003cbr\u003eoffers a set of training-free cache accelerators for DiT: 🔥DBCache, DBPrune, FBCache, etc🔥\n  \u003c/p\u003e\n\u003c/div\u003e\n\n\n## 🤗 Introduction \n\n\u003cdiv align=\"center\"\u003e\n  \u003cp align=\"center\"\u003e\n    \u003ch3\u003e🔥DBCache: Dual Block Caching for Diffusion Transformers\u003c/h3\u003e\n  \u003c/p\u003e\n\u003c/div\u003e \n\n**DBCache**: **Dual Block Caching** for Diffusion Transformers. We have enhanced `FBCache` into a more general and customizable cache algorithm, namely `DBCache`, enabling it to achieve fully `UNet-style` cache acceleration for DiT models. Different configurations of compute blocks (**F8B12**, etc.) can be customized in DBCache. Moreover, it can be entirely **training**-**free**. 
DBCache can strike a perfect **balance** between performance and precision!\n\n\u003cdiv align=\"center\"\u003e\n  \u003cp align=\"center\"\u003e\n    DBCache, \u003cb\u003e L20x1 \u003c/b\u003e, Steps: 28, \"A cat holding a sign that says hello world with complex background\"\n  \u003c/p\u003e\n\u003c/div\u003e\n\n|Baseline(L20x1)|F1B0 (0.08)|F1B0 (0.20)|F8B8 (0.15)|F12B12 (0.20)|F16B16 (0.20)|\n|:---:|:---:|:---:|:---:|:---:|:---:|\n|24.85s|15.59s|8.58s|15.41s|15.11s|17.74s|\n|\u003cimg src=https://github.com/vipshop/cache-dit/raw/main/assets/NONE_R0.08_S0.png width=105px\u003e|\u003cimg src=https://github.com/vipshop/cache-dit/raw/main/assets/DBCACHE_F1B0S1_R0.08_S11.png width=105px\u003e | \u003cimg src=https://github.com/vipshop/cache-dit/raw/main/assets/DBCACHE_F1B0S1_R0.2_S19.png width=105px\u003e|\u003cimg src=https://github.com/vipshop/cache-dit/raw/main/assets/DBCACHE_F8B8S1_R0.15_S15.png width=105px\u003e|\u003cimg src=https://github.com/vipshop/cache-dit/raw/main/assets/DBCACHE_F12B12S4_R0.2_S16.png width=105px\u003e|\u003cimg src=https://github.com/vipshop/cache-dit/raw/main/assets/DBCACHE_F16B16S4_R0.2_S13.png width=105px\u003e|\n|**Baseline(L20x1)**|**F1B0 (0.08)**|**F8B8 (0.12)**|**F8B12 (0.12)**|**F8B16 (0.20)**|**F8B20 (0.20)**|\n|27.85s|6.04s|5.88s|5.77s|6.01s|6.20s|\n|\u003cimg src=https://github.com/vipshop/cache-dit/raw/main/assets/TEXTURE_NONE_R0.08.png width=105px\u003e|\u003cimg src=https://github.com/vipshop/cache-dit/raw/main/assets/TEXTURE_DBCACHE_F1B0_R0.08.png width=105px\u003e |\u003cimg src=https://github.com/vipshop/cache-dit/raw/main/assets/TEXTURE_DBCACHE_F8B8_R0.12.png width=105px\u003e|\u003cimg src=https://github.com/vipshop/cache-dit/raw/main/assets/TEXTURE_DBCACHE_F8B12_R0.12.png width=105px\u003e|\u003cimg src=https://github.com/vipshop/cache-dit/raw/main/assets/TEXTURE_DBCACHE_F8B16_R0.2.png width=105px\u003e|\u003cimg src=https://github.com/vipshop/cache-dit/raw/main/assets/TEXTURE_DBCACHE_F8B20_R0.2.png 
width=105px\u003e|\n\n\u003c!--\n|\u003cimg src=https://github.com/user-attachments/assets/70ea57f4-d8f2-415b-8a96-d8315974a5e6 width=105px\u003e|\u003cimg src=https://github.com/user-attachments/assets/fc0e1a67-19cc-44aa-bf50-04696e7978a0 width=105px\u003e |\u003cimg src=https://github.com/user-attachments/assets/d1434896-628c-436b-95ad-43c085a8629e width=105px\u003e|\u003cimg src=https://github.com/user-attachments/assets/aaa42cd2-57de-4c4e-8bfb-913018a8251d width=105px\u003e|\u003cimg src=https://github.com/user-attachments/assets/dc0ba2a4-ef7c-436d-8a39-67055deab92f width=105px\u003e|\u003cimg src=https://github.com/user-attachments/assets/aede466f-61ed-4256-8df0-fecf8020c5ca width=105px\u003e|\n--\u003e\n\n\u003cdiv align=\"center\"\u003e\n  \u003cp align=\"center\"\u003e\n    DBCache, \u003cb\u003e L20x4 \u003c/b\u003e, Steps: 20, case to show the texture recovery ability of DBCache\n  \u003c/p\u003e\n\u003c/div\u003e\n\nThese case studies demonstrate that even with relatively high thresholds (such as 0.12, 0.15, 0.2, etc.) under the DBCache **F12B12** or **F8B16** configuration, the detailed texture of the kitten's fur, colored cloth, and the clarity of text can still be preserved. This suggests that users can leverage DBCache to effectively balance performance and precision in their workflows! \n\n\u003cdiv align=\"center\"\u003e\n  \u003cp align=\"center\"\u003e\n    \u003ch3\u003e🔥DBPrune: Dynamic Block Prune with Residual Caching\u003c/h3\u003e\n  \u003c/p\u003e\n\u003c/div\u003e \n\n**DBPrune**: We have further implemented a new **Dynamic Block Prune** algorithm based on **Residual Caching** for Diffusion Transformers, referred to as DBPrune. DBPrune caches each block's hidden states and residuals, then **dynamically prunes** blocks during inference by computing the L1 distance between previous hidden states. 
When a block is pruned, its output is approximated using the cached residuals.\n\n\u003cdiv align=\"center\"\u003e\n  \u003cp align=\"center\"\u003e\n    DBPrune, \u003cb\u003e L20x1 \u003c/b\u003e, Steps: 28, \"A cat holding a sign that says hello world with complex background\"\n  \u003c/p\u003e\n\u003c/div\u003e\n\n|Baseline(L20x1)|Pruned(24%)|Pruned(35%)|Pruned(38%)|Pruned(45%)|Pruned(60%)|\n|:---:|:---:|:---:|:---:|:---:|:---:|\n|24.85s|19.43s|16.82s|15.95s|14.24s|10.66s|\n|\u003cimg src=https://github.com/vipshop/cache-dit/raw/main/assets/NONE_R0.08_S0.png width=105px\u003e|\u003cimg src=https://github.com/vipshop/cache-dit/raw/main/assets/DBPRUNE_F1B0_R0.03_P24.0_T19.43s.png width=105px\u003e | \u003cimg src=https://github.com/vipshop/cache-dit/raw/main/assets/DBPRUNE_F1B0_R0.04_P34.6_T16.82s.png width=105px\u003e|\u003cimg src=https://github.com/vipshop/cache-dit/raw/main/assets/DBPRUNE_F1B0_R0.05_P38.3_T15.95s.png width=105px\u003e|\u003cimg src=https://github.com/vipshop/cache-dit/raw/main/assets/DBPRUNE_F1B0_R0.06_P45.2_T14.24s.png width=105px\u003e|\u003cimg src=https://github.com/vipshop/cache-dit/raw/main/assets/DBPRUNE_F1B0_R0.2_P59.5_T10.66s.png width=105px\u003e|\n\n\u003cdiv align=\"center\"\u003e\n  \u003cp align=\"center\"\u003e\n    \u003ch3\u003e🔥Context Parallelism and Torch Compile\u003c/h3\u003e\n  \u003c/p\u003e\n\u003c/div\u003e \n\nMoreover, **CacheDiT** is a **plug-and-play** solution that works hand-in-hand with [ParaAttention](https://github.com/chengzeyi/ParaAttention). Users can easily tap into its **Context Parallelism** features for distributed inference. 
CacheDiT is also designed to be compatible with **torch.compile**. You can combine CacheDiT with torch.compile to improve performance further.\n\n\u003cdiv align=\"center\"\u003e\n  \u003cp align=\"center\"\u003e\n  DBPrune + \u003cb\u003etorch.compile + context parallelism\u003c/b\u003e \u003cbr\u003eSteps: 28, \"A cat holding a sign that says hello world with complex background\"\n  \u003c/p\u003e\n\u003c/div\u003e\n\n|Baseline|Pruned(24%)|Pruned(35%)|Pruned(38%)|Pruned(45%)|Pruned(60%)|\n|:---:|:---:|:---:|:---:|:---:|:---:|\n|+compile:20.43s|16.25s|14.12s|13.41s|12.00s|8.86s|\n|+L20x4:7.75s|6.62s|6.03s|5.81s|5.24s|3.93s|\n|\u003cimg src=https://github.com/vipshop/cache-dit/raw/main/assets/U0_C1_NONE_R0.08_S0_T20.43s.png width=105px\u003e|\u003cimg src=https://github.com/vipshop/cache-dit/raw/main/assets/U0_C1_DBPRUNE_F1B0_R0.03_P24.0_T16.25s.png width=105px\u003e | \u003cimg src=https://github.com/vipshop/cache-dit/raw/main/assets/U0_C1_DBPRUNE_F1B0_R0.04_P34.6_T14.12s.png width=105px\u003e|\u003cimg src=https://github.com/vipshop/cache-dit/raw/main/assets/U0_C1_DBPRUNE_F1B0_R0.045_P38.2_T13.41s.png width=105px\u003e|\u003cimg src=https://github.com/vipshop/cache-dit/raw/main/assets/U0_C1_DBPRUNE_F1B0_R0.055_P45.1_T12.00s.png width=105px\u003e|\u003cimg src=https://github.com/vipshop/cache-dit/raw/main/assets/U0_C1_DBPRUNE_F1B0_R0.2_P59.5_T8.86s.png width=105px\u003e|\n\n\u003cdiv align=\"center\"\u003e\n  \u003cp align=\"center\"\u003e\n    \u003cb\u003e♥️ Please consider leaving a ⭐️ Star to support us ~ ♥️\u003c/b\u003e\n  \u003c/p\u003e\n\u003c/div\u003e \n\n## ©️Citations\n\n```BibTeX\n@misc{CacheDiT@2025,\n  title={CacheDiT: A Training-free and Easy-to-use cache acceleration Toolbox for Diffusion Transformers},\n  url={https://github.com/vipshop/cache-dit.git},\n  note={Open-source software available at https://github.com/vipshop/cache-dit.git},\n  author={vipshop.com},\n  year={2025}\n}\n```\n\n## 👋Reference\n\n\u003cdiv 
id=\"reference\"\u003e\u003c/div\u003e\n\nThe **CacheDiT** codebase is adapted from [FBCache](https://github.com/chengzeyi/ParaAttention/tree/main/src/para_attn/first_block_cache). Special thanks to their excellent work! \n\n## 📖Contents \n\n\u003cdiv id=\"contents\"\u003e\u003c/div\u003e  \n\n- [⚙️Installation](#️installation)\n- [🔥Supported Models](#supported)\n- [⚡️Dual Block Cache](#dbcache)\n- [🎉First Block Cache](#fbcache)\n- [⚡️Dynamic Block Prune](#dbprune)\n- [🎉Context Parallelism](#context-parallelism)  \n- [🔥Torch Compile](#compile)\n- [👋Contribute](#contribute)\n- [©️License](#license)\n\n## ⚙️Installation  \n\n\u003cdiv id=\"installation\"\u003e\u003c/div\u003e\n\nYou can install the stable release of `cache-dit` from PyPI:\n\n```bash\npip3 install cache-dit\n```\nOr you can install the latest develop version from GitHub:\n\n```bash\npip3 install git+https://github.com/vipshop/cache-dit.git\n```\n\n## 🔥Supported Models  \n\n\u003cdiv id=\"supported\"\u003e\u003c/div\u003e\n\n- [🚀FLUX.1](https://github.com/vipshop/cache-dit/raw/main/examples)  \n- [🚀Mochi](https://github.com/vipshop/cache-dit/raw/main/examples)\n- [🚀CogVideoX](https://github.com/vipshop/cache-dit/raw/main/examples)\n- [🚀CogVideoX1.5](https://github.com/vipshop/cache-dit/raw/main/examples)\n- [🚀Wan2.1](https://github.com/vipshop/cache-dit/raw/main/examples)\n- [🚀HunyuanVideo](https://github.com/vipshop/cache-dit/raw/main/examples)\n\n\n\u003c!--\n\u003cp align=\"center\"\u003e\n  \u003ch4\u003e 🔥Supported Models🔥\u003c/h4\u003e\n  \u003ca href=https://github.com/vipshop/cache-dit/raw/main/examples\u003e \u003cb\u003e🚀FLUX.1\u003c/b\u003e: ✔️DBCache, ✔️DBPrune, ✔️FBCache🔥\u003c/a\u003e \u003cbr\u003e\n  \u003ca href=https://github.com/vipshop/cache-dit/raw/main/examples\u003e \u003cb\u003e🚀Mochi\u003c/b\u003e: ✔️DBCache, ✔️DBPrune, ✔️FBCache🔥\u003c/a\u003e \u003cbr\u003e\n  \u003ca href=https://github.com/vipshop/cache-dit/raw/main/examples\u003e \u003cb\u003e🚀CogVideoX\u003c/b\u003e: 
✔️DBCache, ✔️DBPrune, ✔️FBCache🔥\u003c/a\u003e \u003cbr\u003e\n  \u003ca href=https://github.com/vipshop/cache-dit/raw/main/examples\u003e \u003cb\u003e🚀CogVideoX1.5\u003c/b\u003e: ✔️DBCache, ✔️DBPrune, ✔️FBCache🔥\u003c/a\u003e \u003cbr\u003e\n  \u003ca href=https://github.com/vipshop/cache-dit/raw/main/examples\u003e \u003cb\u003e🚀Wan2.1\u003c/b\u003e: ✔️DBCache, ✔️DBPrune, ✔️FBCache🔥\u003c/a\u003e \u003cbr\u003e\n  \u003ca href=https://github.com/vipshop/cache-dit/raw/main/examples\u003e \u003cb\u003e🚀HunyuanVideo\u003c/b\u003e: ✔️DBCache, ✔️DBPrune, ✔️FBCache🔥\u003c/a\u003e \u003cbr\u003e\n\u003c/p\u003e\n--\u003e\n\n## ⚡️DBCache: Dual Block Cache  \n\n\u003cdiv id=\"dbcache\"\u003e\u003c/div\u003e\n\n![](https://github.com/vipshop/cache-dit/raw/main/assets/dbcache-v1.png)\n\n**DBCache** provides configurable parameters for custom optimization, enabling a balanced trade-off between performance and precision:\n\n- **Fn**: Specifies that DBCache uses the **first n** Transformer blocks to fit the information at time step t, enabling the calculation of a more stable L1 diff and delivering more accurate information to subsequent blocks.\n- **Bn**: Further fuses approximate information in the **last n** Transformer blocks to enhance prediction accuracy. 
These blocks act as an auto-scaler for approximate hidden states that use residual cache.\n\n![](https://github.com/vipshop/cache-dit/raw/main/assets/dbcache-fnbn-v1.png)\n\n- **warmup_steps**: (default: 0) DBCache does not apply the caching strategy when the number of running steps is less than or equal to this value, so that the early steps, which establish the basic features, are computed in full.\n- **max_cached_steps**: (default: -1) DBCache disables the caching strategy once the number of cached steps exceeds this value, to prevent precision degradation; -1 means no limit.\n- **residual_diff_threshold**: The residual diff threshold; a higher value yields faster inference at the cost of lower precision.\n\nFor a good balance between performance and precision, DBCache is configured by default with **F8B8**, 8 warmup steps, and unlimited cached steps.\n\n```python\nimport torch\nfrom diffusers import FluxPipeline\nfrom cache_dit.cache_factory import apply_cache_on_pipe, CacheType\n\npipe = FluxPipeline.from_pretrained(\n    \"black-forest-labs/FLUX.1-dev\",\n    torch_dtype=torch.bfloat16,\n).to(\"cuda\")\n\n# Default options, F8B8, good balance between performance and precision\ncache_options = CacheType.default_options(CacheType.DBCache)\n\n# Custom options, F8B16, higher precision\ncache_options = {\n    \"cache_type\": CacheType.DBCache,\n    \"warmup_steps\": 8,\n    \"max_cached_steps\": 8,    # -1 means no limit\n    \"Fn_compute_blocks\": 8,   # Fn, F8, etc.\n    \"Bn_compute_blocks\": 16,  # Bn, B16, etc.\n    \"residual_diff_threshold\": 0.12,\n}\n\napply_cache_on_pipe(pipe, **cache_options)\n```\nMoreover, users configuring higher **Bn** values (e.g., **F8B16**) while aiming to maintain good performance can specify **Bn_compute_blocks_ids** to work with Bn. 
DBCache will only compute the specified blocks, with the remaining estimated using the previous step's residual cache.\n\n```python\n# Custom options, F8B16, higher precision with good performance.\ncache_options = {\n    # 0, 2, 4, ..., 14, 15, etc. [0,16)\n    \"Bn_compute_blocks_ids\": CacheType.range(0, 16, 2),\n    # If the L1 difference is below this threshold, skip Bn blocks \n    # not in `Bn_compute_blocks_ids`(1, 3,..., etc), Otherwise, \n    # compute these blocks.\n    \"non_compute_blocks_diff_threshold\": 0.08,\n}\n```\n\n\u003cdiv align=\"center\"\u003e\n  \u003cp align=\"center\"\u003e\n    DBCache, \u003cb\u003e L20x1 \u003c/b\u003e, Steps: 28, \"A cat holding a sign that says hello world with complex background\"\n  \u003c/p\u003e\n\u003c/div\u003e\n\n|Baseline(L20x1)|F1B0 (0.08)|F1B0 (0.20)|F8B8 (0.15)|F12B12 (0.20)|F16B16 (0.20)|\n|:---:|:---:|:---:|:---:|:---:|:---:|\n|24.85s|15.59s|8.58s|15.41s|15.11s|17.74s|\n|\u003cimg src=https://github.com/vipshop/cache-dit/raw/main/assets/NONE_R0.08_S0.png width=105px\u003e|\u003cimg src=https://github.com/vipshop/cache-dit/raw/main/assets/DBCACHE_F1B0S1_R0.08_S11.png width=105px\u003e | \u003cimg src=https://github.com/vipshop/cache-dit/raw/main/assets/DBCACHE_F1B0S1_R0.2_S19.png width=105px\u003e|\u003cimg src=https://github.com/vipshop/cache-dit/raw/main/assets/DBCACHE_F8B8S1_R0.15_S15.png width=105px\u003e|\u003cimg src=https://github.com/vipshop/cache-dit/raw/main/assets/DBCACHE_F12B12S4_R0.2_S16.png width=105px\u003e|\u003cimg src=https://github.com/vipshop/cache-dit/raw/main/assets/DBCACHE_F16B16S4_R0.2_S13.png width=105px\u003e|\n\n## 🎉FBCache: First Block Cache  \n\n\u003cdiv id=\"fbcache\"\u003e\u003c/div\u003e\n\n\u003c!--\n![](https://github.com/user-attachments/assets/0fb66656-b711-457a-92a7-a830f134272d)\n--\u003e\n\n![](https://github.com/vipshop/cache-dit/raw/main/assets/fbcache-v1.png)\n\n**DBCache** is a more general cache algorithm than **FBCache**. 
When Fn=1 and Bn=0, DBCache behaves identically to FBCache. Therefore, you can either use the original FBCache implementation directly or configure **DBCache** with **F1B0** settings to achieve the same functionality.\n\n```python\nimport torch\nfrom diffusers import FluxPipeline\nfrom cache_dit.cache_factory import apply_cache_on_pipe, CacheType\n\npipe = FluxPipeline.from_pretrained(\n    \"black-forest-labs/FLUX.1-dev\",\n    torch_dtype=torch.bfloat16,\n).to(\"cuda\")\n\n# Using FBCache directly\ncache_options = CacheType.default_options(CacheType.FBCache)\n\n# Or using DBCache with F1B0. \n# Fn=1, Bn=0 means FBCache; otherwise, Dual Block Cache\ncache_options = {\n    \"cache_type\": CacheType.DBCache,\n    \"warmup_steps\": 8,\n    \"max_cached_steps\": 8,   # -1 means no limit\n    \"Fn_compute_blocks\": 1,  # Fn, F1, etc.\n    \"Bn_compute_blocks\": 0,  # Bn, B0, etc.\n    \"residual_diff_threshold\": 0.12,\n}\n\napply_cache_on_pipe(pipe, **cache_options)\n```\n\n## ⚡️DBPrune: Dynamic Block Prune\n\n\u003cdiv id=\"dbprune\"\u003e\u003c/div\u003e  \n\n\u003c!--\n![](https://github.com/user-attachments/assets/932b6360-9533-4352-b176-4c4d84bd4695)\n--\u003e\n\n![](https://github.com/vipshop/cache-dit/raw/main/assets/dbprune-v1.png)\n\nWe have further implemented a new **Dynamic Block Prune** algorithm based on **Residual Caching** for Diffusion Transformers, which is referred to as **DBPrune**. DBPrune caches each block's hidden states and residuals, then dynamically prunes blocks during inference by computing the L1 distance between previous hidden states. When a block is pruned, its output is approximated using the cached residuals. 
DBPrune is currently experimental; stay tuned for upcoming updates.\n\n```python\nimport torch\nfrom diffusers import FluxPipeline\nfrom cache_dit.cache_factory import apply_cache_on_pipe, CacheType\n\npipe = FluxPipeline.from_pretrained(\n    \"black-forest-labs/FLUX.1-dev\",\n    torch_dtype=torch.bfloat16,\n).to(\"cuda\")\n\n# Using DBPrune with default options\ncache_options = CacheType.default_options(CacheType.DBPrune)\n\napply_cache_on_pipe(pipe, **cache_options)\n```\n\nWe have also brought the designs from DBCache to DBPrune to make it a more general and customizable block prune algorithm. You can specify the values of **Fn** and **Bn** for higher precision, or set up the non-prune blocks list **non_prune_blocks_ids** to avoid aggressive pruning. For example:\n\n```python\n# Custom options for DBPrune\ncache_options = {\n    \"cache_type\": CacheType.DBPrune,\n    \"residual_diff_threshold\": 0.05,\n    # Never prune the first `Fn` and last `Bn` blocks.\n    \"Fn_compute_blocks\": 8,  # default 1\n    \"Bn_compute_blocks\": 8,  # default 0\n    \"warmup_steps\": 8,  # default -1\n    # Disable the pruning strategy once the number of \n    # previously pruned steps exceeds this value.\n    \"max_pruned_steps\": 12,  # default, -1 means no limit\n    # Enable a dynamic prune threshold within each step; a higher \n    # `max_dynamic_prune_threshold` value may lead to a more \n    # aggressive pruning strategy.\n    \"enable_dynamic_prune_threshold\": True,\n    \"max_dynamic_prune_threshold\": 2 * 0.05,\n    # (New thresh) = mean(previous_block_diffs_within_step) * 1.25\n    # (New thresh) = ((New thresh) if (New thresh) \u003c\n    # max_dynamic_prune_threshold else residual_diff_threshold)\n    \"dynamic_prune_threshold_relax_ratio\": 1.25,\n    # The step interval to update residual cache. 
For example, \n    # 2: means the update steps will be [0, 2, 4, ...].\n    \"residual_cache_update_interval\": 1,\n    # You can set non-prune blocks to avoid aggressive pruning. \n    # For example, FLUX.1 has 19 + 38 blocks, so we can set it \n    # to 0, 2, 4, ..., 56, etc.\n    \"non_prune_blocks_ids\": [],\n}\n\napply_cache_on_pipe(pipe, **cache_options)\n```\n\n\u003e [!Important]\n\u003e Please note that for GPUs with lower VRAM, DBPrune may not be suitable for use on video DiTs, as it caches the hidden states and residuals of each block, leading to higher GPU memory requirements. In such cases, please use DBCache, which only caches the hidden states and residuals of 2 blocks.\n\n\u003cdiv align=\"center\"\u003e\n  \u003cp align=\"center\"\u003e\n    DBPrune, \u003cb\u003e L20x1 \u003c/b\u003e, Steps: 28, \"A cat holding a sign that says hello world with complex background\"\n  \u003c/p\u003e\n\u003c/div\u003e\n\n|Baseline(L20x1)|Pruned(24%)|Pruned(35%)|Pruned(38%)|Pruned(45%)|Pruned(60%)|\n|:---:|:---:|:---:|:---:|:---:|:---:|\n|24.85s|19.43s|16.82s|15.95s|14.24s|10.66s|\n|\u003cimg src=https://github.com/vipshop/cache-dit/raw/main/assets/NONE_R0.08_S0.png width=105px\u003e|\u003cimg src=https://github.com/vipshop/cache-dit/raw/main/assets/DBPRUNE_F1B0_R0.03_P24.0_T19.43s.png width=105px\u003e | \u003cimg src=https://github.com/vipshop/cache-dit/raw/main/assets/DBPRUNE_F1B0_R0.04_P34.6_T16.82s.png width=105px\u003e|\u003cimg src=https://github.com/vipshop/cache-dit/raw/main/assets/DBPRUNE_F1B0_R0.05_P38.3_T15.95s.png width=105px\u003e|\u003cimg src=https://github.com/vipshop/cache-dit/raw/main/assets/DBPRUNE_F1B0_R0.06_P45.2_T14.24s.png width=105px\u003e|\u003cimg src=https://github.com/vipshop/cache-dit/raw/main/assets/DBPRUNE_F1B0_R0.2_P59.5_T10.66s.png width=105px\u003e|\n\n## 🎉Context Parallelism\n\n\u003cdiv id=\"context-parallelism\"\u003e\u003c/div\u003e  \n\n**CacheDiT** is a **plug-and-play** solution that works hand-in-hand with 
[ParaAttention](https://github.com/chengzeyi/ParaAttention). Users can **easily tap into** its **Context Parallelism** features for distributed inference. First, install `para-attn` from PyPI:\n\n```bash\npip3 install para-attn  # or install `para-attn` from sources.\n```\n\nThen, you can run **DBCache** or **DBPrune** with **Context Parallelism** on 4 GPUs:\n\n```python\nimport torch\nimport torch.distributed as dist\nfrom diffusers import FluxPipeline\nfrom para_attn.context_parallel import init_context_parallel_mesh\nfrom para_attn.context_parallel.diffusers_adapters import parallelize_pipe\nfrom cache_dit.cache_factory import apply_cache_on_pipe, CacheType\n\n# Init distributed process group\ndist.init_process_group()\ntorch.cuda.set_device(dist.get_rank())\n\npipe = FluxPipeline.from_pretrained(\n    \"black-forest-labs/FLUX.1-dev\",\n    torch_dtype=torch.bfloat16,\n).to(\"cuda\")\n\n# Context Parallel from ParaAttention\nparallelize_pipe(\n    pipe, mesh=init_context_parallel_mesh(\n        pipe.device.type, max_ulysses_dim_size=4\n    )\n)\n\n# DBPrune with default options from this library\napply_cache_on_pipe(\n    pipe, **CacheType.default_options(CacheType.DBPrune)\n)\n\ndist.destroy_process_group()\n```\nThen, run the Python test script with `torchrun`:\n```bash\ntorchrun --nproc_per_node=4 parallel_cache.py\n```\n\n## 🔥Torch Compile\n\n\u003cdiv id=\"compile\"\u003e\u003c/div\u003e  \n\n**CacheDiT** is also designed to be compatible with **torch.compile**. You can combine CacheDiT with torch.compile to improve performance further. For example:\n\n```python\napply_cache_on_pipe(\n    pipe, **CacheType.default_options(CacheType.DBPrune)\n)\n# Compile the Transformer module\npipe.transformer = torch.compile(pipe.transformer)\n```\nHowever, users intending to use **CacheDiT** for DiT with **dynamic input shapes** should consider increasing the **recompile** **limit** of `torch._dynamo`. 
Otherwise, the `recompile_limit` error may be triggered, causing the module to fall back to eager mode. \n```python\ntorch._dynamo.config.recompile_limit = 96  # default is 8\ntorch._dynamo.config.accumulated_recompile_limit = 2048  # default is 256\n```\n\n## 👋Contribute \n\u003cdiv id=\"contribute\"\u003e\u003c/div\u003e\n\nHow to contribute? Star ⭐️ this repo to support us or check [CONTRIBUTE.md](https://github.com/vipshop/cache-dit/raw/main/CONTRIBUTE.md).\n\n## ©️License   \n\n\u003cdiv id=\"license\"\u003e\u003c/div\u003e\n\n\nWe follow the original license from [ParaAttention](https://github.com/chengzeyi/ParaAttention); please check [LICENSE](https://github.com/vipshop/cache-dit/raw/main/LICENSE) for more details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvipshop%2Fcache-dit","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvipshop%2Fcache-dit","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvipshop%2Fcache-dit/lists"}