https://github.com/foundationvision/llamagen

Autoregressive Model Beats Diffusion: 🦙 Llama for Scalable Image Generation
https://github.com/foundationvision/llamagen

auto-regressive-model diffusion diffusion-models image-generation llama llm text2image

Last synced: 5 months ago
JSON representation

Autoregressive Model Beats Diffusion: 🦙 Llama for Scalable Image Generation

Host: GitHub
URL: https://github.com/foundationvision/llamagen
Owner: FoundationVision
License: mit
Created: 2024-06-03T08:18:41.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2024-08-15T20:10:35.000Z (about 1 year ago)
Last Synced: 2025-05-07T17:37:13.723Z (5 months ago)
Topics: auto-regressive-model, diffusion, diffusion-models, image-generation, llama, llm, text2image
Language: Python
Homepage: https://arxiv.org/abs/2406.06525
Size: 5.35 MB
Stars: 1,741
Watchers: 23
Forks: 77
Open Issues: 58
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

          # Autoregressive Model Beats Diffusion: 🦙 Llama for Scalable Image Generation



[![demo](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Online_Demo-blue)](https://huggingface.co/spaces/FoundationVision/LlamaGen) 

[![arXiv](https://img.shields.io/badge/arXiv%20paper-2406.06525-b31b1b.svg)](https://arxiv.org/abs/2406.06525) 

[![project page](https://img.shields.io/badge/Project_page-More_visualizations-green)](https://peizesun.github.io/llamagen/) 









This repo contains pre-trained model weights and training/sampling PyTorch(torch>=2.1.0) codes used in

> [**Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation**](https://arxiv.org/abs/2406.06525)


> [Peize Sun](https://peizesun.github.io/), [Yi Jiang](https://enjoyyi.github.io/), [Shoufa Chen](https://www.shoufachen.com/), [Shilong Zhang](https://jshilong.github.io/), [Bingyue Peng](), [Ping Luo](http://luoping.me/), [Zehuan Yuan](https://shallowyuan.github.io/)

> 
HKU, ByteDance


You can find more visualizations on [![project page](https://img.shields.io/badge/Project_page-More_visualizations-green)](https://peizesun.github.io/llamagen/)

## 🔥 Update

- [2024.06.28] Image tokenizers and AR models for text-conditional image generation are released ! Try it !

- [2024.06.15] All models ranging from 100M to 3B parameters are supported by vLLM ! 

- [2024.06.11] Image tokenizers and AR models for class-conditional image generation are released !

- [2024.06.11] Code and Demo are released !

## 🌿 Introduction

We introduce LlamaGen, a new family of image generation models that apply original ``next-token prediction`` paradigm of large language models to visual generation domain. It is an affirmative answer to whether vanilla autoregressive models, e.g., Llama, ``without inductive biases`` on visual signals can achieve state-of-the-art image generation performance if scaling properly. We reexamine design spaces of image tokenizers, scalability properties of image generation models, and their training data quality.

In this repo, we release:

* Two image tokenizers of downsample ratio 16 and 8.

* Seven class-conditional generation models ranging from 100M to 3B parameters.

* Two text-conditional generation models of 700M parameters.

* Online demos in  [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/FoundationVision/LlamaGen) for running pre-trained models.

* Supported vLLM serving framework to enable 300% - 400% speedup.

## 🦄 Class-conditional image generation on ImageNet

### VQ-VAE models

Method | params | tokens | rFID (256x256) | weight

--- |:---:|:---:|:---:|:---:

vq_ds16_c2i | 72M | 16x16 | 2.19 | [vq_ds16_c2i.pt](https://huggingface.co/FoundationVision/LlamaGen/resolve/main/vq_ds16_c2i.pt) 

vq_ds16_c2i | 72M | 24x24 | 0.94 | above

vq_ds16_c2i | 72M | 32x32 | 0.70 | above

vq_ds8_c2i  | 70M | 32x32 | 0.59 | [vq_ds8_c2i.pt](https://huggingface.co/FoundationVision/LlamaGen/resolve/main/vq_ds8_c2i.pt)

### AR models

Method | params | training | tokens | FID (256x256) | weight 

--- |:---:|:---:|:---:|:---:|:---:|

LlamaGen-B   | 111M | DDP | 16x16 | 5.46 | [c2i_B_256.pt](https://huggingface.co/FoundationVision/LlamaGen/resolve/main/c2i_B_256.pt)

LlamaGen-B   | 111M | DDP | 24x24 | 6.09 | [c2i_B_384.pt](https://huggingface.co/FoundationVision/LlamaGen/resolve/main/c2i_B_384.pt)

LlamaGen-L   | 343M | DDP | 16x16 | 3.80 | [c2i_L_256.pt](https://huggingface.co/FoundationVision/LlamaGen/resolve/main/c2i_L_256.pt)

LlamaGen-L   | 343M | DDP | 24x24 | 3.07 | [c2i_L_384.pt](https://huggingface.co/FoundationVision/LlamaGen/resolve/main/c2i_L_384.pt)

LlamaGen-XL  | 775M | DDP | 24x24 | 2.62 | [c2i_X_384L.pt](https://huggingface.co/FoundationVision/LlamaGen/resolve/main/c2i_XL_384.pt)

LlamaGen-XXL | 1.4B | FSDP | 24x24 | 2.34 | [c2i_XXL_384.pt](https://huggingface.co/FoundationVision/LlamaGen/resolve/main/c2i_XXL_384.pt)

LlamaGen-3B  | 3.1B | FSDP | 24x24 | 2.18 | [c2i_3B_384.pt](https://huggingface.co/FoundationVision/LlamaGen/resolve/main/c2i_3B_384.pt)

### Demo

Please download models, put them in the folder `./pretrained_models`, and run

```

python3 autoregressive/sample/sample_c2i.py --vq-ckpt ./pretrained_models/vq_ds16_c2i.pt --gpt-ckpt ./pretrained_models/c2i_L_384.pt --gpt-model GPT-L --image-size 384

# or

python3 autoregressive/sample/sample_c2i.py --vq-ckpt ./pretrained_models/vq_ds16_c2i.pt --gpt-ckpt ./pretrained_models/c2i_XXL_384.pt --gpt-model GPT-XXL --from-fsdp --image-size 384

```

The generated images will be saved to `sample_c2i.png`.

### Gradio Demo 

You can use our online gradio demo [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/FoundationVision/LlamaGen) or run gradio locally:

```bash

python app.py

```

## 🚀 Text-conditional image generation

### VQ-VAE models

Method | params | tokens | data | weight

--- |:---:|:---:|:---:|:---:

vq_ds16_t2i | 72M | 16x16 | LAION COCO (50M) + internal data (10M) | [vq_ds16_t2i.pt](https://huggingface.co/peizesun/llamagen_t2i/resolve/main/vq_ds16_t2i.pt)

### AR models

Method | params | tokens | data | weight 

--- |:---:|:---:|:---:|:---:

LlamaGen-XL  | 775M | 16x16 | LAION COCO (50M) | [t2i_XL_stage1_256.pt](https://huggingface.co/peizesun/llamagen_t2i/resolve/main/t2i_XL_stage1_256.pt)

LlamaGen-XL  | 775M | 32x32 | internal data (10M) | [t2i_XL_stage2_512.pt](https://huggingface.co/peizesun/llamagen_t2i/resolve/main/t2i_XL_stage2_512.pt)

### Demo

Before running demo, please refer to [language readme](language/README.md) to install the required packages and language models.  

Please download models, put them in the folder `./pretrained_models`, and run

```

python3 autoregressive/sample/sample_t2i.py --vq-ckpt ./pretrained_models/vq_ds16_t2i.pt --gpt-ckpt ./pretrained_models/t2i_XL_stage1_256.pt --gpt-model GPT-XL --image-size 256

# or

python3 autoregressive/sample/sample_t2i.py --vq-ckpt ./pretrained_models/vq_ds16_t2i.pt --gpt-ckpt ./pretrained_models/t2i_XL_stage2_512.pt --gpt-model GPT-XL --image-size 512

```

The generated images will be saved to `sample_t2i.png`.

### Local Gradio Demo

## ⚡ Serving

We use serving framework [vLLM](https://github.com/vllm-project/vllm) to enable higher throughput. Please refer to [serving readme](autoregressive/serve/README.md) to install the required packages.  

```

python3 autoregressive/serve/sample_c2i.py --vq-ckpt ./pretrained_models/vq_ds16_c2i.pt --gpt-ckpt ./pretrained_models/c2i_XXL_384.pt --gpt-model GPT-XXL --from-fsdp --image-size 384

```

The generated images will be saved to `sample_c2i_vllm.png`.

## Getting Started

See [Getting Started](GETTING_STARTED.md) for installation, training and evaluation.

## License

The majority of this project is licensed under MIT License. Portions of the project are available under separate license of referred projects, detailed in corresponding files.

## BibTeX

```bibtex

@article{sun2024autoregressive,

  title={Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation},

  author={Sun, Peize and Jiang, Yi and Chen, Shoufa and Zhang, Shilong and Peng, Bingyue and Luo, Ping and Yuan, Zehuan},

  journal={arXiv preprint arXiv:2406.06525},

  year={2024}

}

```

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/foundationvision/llamagen

Awesome Lists containing this project

README