https://github.com/vchitect/taca

TACA: Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers




Zhengyao Lv<sup>\*1</sup>, Tianlin Pan<sup>\*2,3</sup>, Chenyang Si<sup>2‡†</sup>, Zhaoxi Chen<sup>4</sup>, Wangmeng Zuo<sup>5</sup>, Ziwei Liu<sup>4†</sup>, Kwan-Yee K. Wong<sup>1†</sup>

<sup>1</sup>The University of Hong Kong, <sup>2</sup>Nanjing University, <sup>3</sup>University of Chinese Academy of Sciences, <sup>4</sup>Nanyang Technological University, <sup>5</sup>Harbin Institute of Technology

(\*Equal Contribution. ‡Project Leader. †Corresponding Author.)


Paper |
Project Page |
LoRA Weights

# About
We propose **TACA**, a parameter-efficient method that dynamically rebalances cross-modal attention in multimodal diffusion transformers to improve text-image alignment.
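
To make "rebalancing cross-modal attention" concrete, here is a minimal sketch in plain Python. It assumes that rebalancing means multiplying the logits an image query assigns to text keys by a factor `gamma` before the joint softmax. The factor `gamma`, the function names, and the toy numbers are illustrative assumptions, not the repository's API and not necessarily the exact method.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rebalanced_attention(text_logits, image_logits, gamma=1.0):
    """Attention weights of one image query over text keys and image keys.

    ASSUMPTION (illustrative only): the text-key logits are multiplied by
    gamma before the joint softmax; gamma = 1.0 gives ordinary attention.
    """
    weights = softmax([gamma * x for x in text_logits] + list(image_logits))
    n_text = len(text_logits)
    return weights[:n_text], weights[n_text:]


# Toy example: 2 text keys compete with 6 image keys for one image query.
text_logits = [1.0, 0.5]
image_logits = [1.0] * 6

for gamma in (1.0, 1.5):
    text_w, _ = rebalanced_attention(text_logits, image_logits, gamma)
    print(f"gamma={gamma}: attention mass on text = {sum(text_w):.3f}")
```

In this toy case the text logits are positive, so `gamma > 1` shifts attention mass from the more numerous image keys toward the text keys. A real implementation would apply such a factor inside the model's attention layers and could vary it over denoising steps.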

https://github.com/user-attachments/assets/ae15a853-ee99-4eee-b0fd-8f5f53c308f9

# Usage
For Stable Diffusion 3.5, simply run:
``` sh
python infer/infer_sd3.py
```

For FLUX.1, run:
``` sh
python infer/infer_flux.py
```

# Benchmark
Alignment scores on T2I-CompBench for FLUX.1-Dev and SD3.5-Medium, with and without TACA. Higher is better ($\uparrow$), the best score for each base model is in bold, and $r$ is the LoRA rank.

| Model | Attribute Binding: Color $\uparrow$ | Attribute Binding: Shape $\uparrow$ | Attribute Binding: Texture $\uparrow$ | Object Relationship: Spatial $\uparrow$ | Object Relationship: Non-Spatial $\uparrow$ | Complex $\uparrow$ |
|---|---|---|---|---|---|---|
| FLUX.1-Dev | 0.7678 | 0.5064 | 0.6756 | 0.2066 | 0.3035 | 0.4359 |
| FLUX.1-Dev + TACA ($r = 64$) | **0.7843** | **0.5362** | **0.6872** | **0.2405** | 0.3041 | **0.4494** |
| FLUX.1-Dev + TACA ($r = 16$) | 0.7842 | 0.5347 | 0.6814 | 0.2321 | **0.3046** | 0.4479 |
| SD3.5-Medium | 0.7890 | 0.5770 | 0.7328 | 0.2087 | 0.3104 | 0.4441 |
| SD3.5-Medium + TACA ($r = 64$) | **0.8074** | **0.5938** | **0.7522** | **0.2678** | 0.3106 | 0.4470 |
| SD3.5-Medium + TACA ($r = 16$) | 0.7984 | 0.5834 | 0.7467 | 0.2374 | **0.3111** | **0.4505** |
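
To read the table as relative improvements, the short script below computes the percentage gain of each TACA variant over its base model. The scores are copied from the table above; the dictionary layout and the `relative_gain` helper are only an illustrative sketch and are not part of the repository.

```python
# Scores copied from the benchmark table (T2I-CompBench, higher is better).
METRICS = ["Color", "Shape", "Texture", "Spatial", "Non-Spatial", "Complex"]

SCORES = {
    "FLUX.1-Dev": [0.7678, 0.5064, 0.6756, 0.2066, 0.3035, 0.4359],
    "FLUX.1-Dev + TACA (r=64)": [0.7843, 0.5362, 0.6872, 0.2405, 0.3041, 0.4494],
    "FLUX.1-Dev + TACA (r=16)": [0.7842, 0.5347, 0.6814, 0.2321, 0.3046, 0.4479],
    "SD3.5-Medium": [0.7890, 0.5770, 0.7328, 0.2087, 0.3104, 0.4441],
    "SD3.5-Medium + TACA (r=64)": [0.8074, 0.5938, 0.7522, 0.2678, 0.3106, 0.4470],
    "SD3.5-Medium + TACA (r=16)": [0.7984, 0.5834, 0.7467, 0.2374, 0.3111, 0.4505],
}


def relative_gain(baseline, variant):
    """Percentage change of each metric in `variant` relative to `baseline`."""
    return [100.0 * (v - b) / b for b, v in zip(baseline, variant)]


for base in ("FLUX.1-Dev", "SD3.5-Medium"):
    for name, scores in SCORES.items():
        # Compare every "<base> + TACA" row against its own base model.
        if name.startswith(base + " + TACA"):
            gains = relative_gain(SCORES[base], scores)
            summary = ", ".join(
                f"{metric} {gain:+.2f}%" for metric, gain in zip(METRICS, gains)
            )
            print(f"{name}: {summary}")
```

The largest relative gains are in the Spatial column: roughly 16% for FLUX.1-Dev and 28% for SD3.5-Medium at $r = 64$. Non-Spatial scores change by well under 1% in every configuration.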

# Showcases
![](static/images/short_1.png)
![](static/images/short_2.png)
![](static/images/long_1.png)
![](static/images/long_2.png)