https://github.com/Alpha-VLLM/Lumina-mGPT

Official Implementation of "Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining"
https://github.com/Alpha-VLLM/Lumina-mGPT

Last synced: 4 months ago
JSON representation

Official Implementation of "Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining"

Host: GitHub
URL: https://github.com/Alpha-VLLM/Lumina-mGPT
Owner: Alpha-VLLM
Created: 2024-08-02T07:50:23.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2024-08-16T17:11:30.000Z (over 1 year ago)
Last Synced: 2024-08-16T18:40:47.572Z (over 1 year ago)
Language: Python
Homepage: https://arxiv.org/abs/2408.02657
Size: 10.7 MB
Stars: 384
Watchers: 7
Forks: 14
Open Issues: 6
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

ai-game-devtools - Lumina-mGPT - mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining. |[arXiv](https://arxiv.org/abs/2408.02657) | | Image | (<span id="image">Image</span> / <span id="tool">LLM (LLM & Tool)</span>)

README

          




# Lumina-mGPT

 A family of multimodal autoregressive models capable of various vision and language tasks, particularly excelling in generating flexible photorealistic images from text descriptions. 👋 join our WeChat 

[![Lumina-mGPT](https://img.shields.io/badge/Paper-Lumina--mGPT-2b9348.svg?logo=arXiv)](https://arxiv.org/abs/2408.02657) 

[![Static Badge](https://img.shields.io/badge/Official(node1)-6B88E3?logo=youtubegaming&label=Demo%20Lumina-mGPT)](http://106.14.2.150:10020/) 

[![Static Badge](https://img.shields.io/badge/Official(node2)-6B88E3?logo=youtubegaming&label=Demo%20Lumina-mGPT)](http://106.14.2.150:10021/) 





## 📰 News

- **[2024-08-11] 🎉🎉🎉 [Training codes and documents](./lumina_mgpt/TRAIN.md) are released! 🎉🎉🎉**

- **[2024-07-08] 🎉🎉🎉 Lumina-mGPT is released! 🎉🎉🎉**

## ⚙️ Installation

See [INSTALL.md](./INSTALL.md) for detailed instructions.

Note that the Lumina-mGPT implementation heavily relies on

the [xllmx](./xllmx) module, which is evolved from [LLaMA2-Accessory](https://github.com/Alpha-VLLM/LLaMA2-Accessory) for supporting

LLM-centered multimodal tasks. Make sure it is installed correctly as a python package before going on.

## ⛽ Training

See [lumina_mgpt/TRAIN.md](lumina_mgpt/TRAIN.md)

## 📽️ Inference

> [!Note]

>

> Before using the Lumina-mGPT model, run

>

> ```bash

> # bash

> cd lumina_mgpt

> ```

>

> to enter the directory of the Lumina-mGPT implementation.

### Perpetration

Since currently the Chameleon implementation in transformers does not contain the VQ-VAE decoder, please manually download the original VQ-VAE weights [provided by Meta](https://github.com/facebookresearch/chameleon) and

put them to the following directory:

```

Lumina-mGPT

- lumina_mgpt/

    - ckpts/

        - chameleon/

            - tokenizer/

                - text_tokenizer.json

                - vqgan.yaml

                - vqgan.ckpt

- xllmx/

- ...

```

### Local Gradio Demos

We have prepared three different Gradio demos, each showcasing unique functionalities, to help you quickly become familiar with the capabilities of the Lumina-mGPT models.

#### 1. [demos/demo_image_generation.py](./Lumina-mGPT/demos/demo_image_generation.py)

This demo is customized for Image Generation tasks, where you can input a text description and generate a corresponding image.

To host this demo, run:

```bash

# Note to set the `--target_size` argument consistent with the checkpoint

python -u demos/demo_image_generation.py \

--pretrained_path Alpha-VLLM/Lumina-mGPT-7B-768 \

--target_size 768

```

#### 2. [demos/demo_image2image.py](./Lumina-mGPT/demos/demo_image2image.py)

This demo is designed for models trained with Omni-SFT. you can conveniently switch between the multiple downstream tasks using this demo.

```bash

# Note to set the `--target_size` argument consistent with the checkpoint

python -u demos/demo_image2image.py \

--pretrained_path Alpha-VLLM/Lumina-mGPT-7B-768-Omni \

--target_size 768

```

#### 3. [demos/demo_freeform.py](./Lumina-mGPT/demos/demo_freeform.py)

This is a powerful demo with minimal constraint on the input format. It supports flexible interation and is suitable for in-deep exploration.

```bash

# Note to set the `--target_size` argument consistent with the checkpoint

python -u demos/demo_freeform.py \

--pretrained_path Alpha-VLLM/Lumina-mGPT-7B-768-Omni \

--target_size 768

```

### Simple Inference

The simplest code for Lumina-mGPT inference:

```python

from inference_solver import FlexARInferenceSolver

from PIL import Image

# ******************** Image Generation ********************

inference_solver = FlexARInferenceSolver(

    model_path="Alpha-VLLM/Lumina-mGPT-7B-768",

    precision="bf16",

    target_size=768,

)

q1 = f"Generate an image of 768x768 according to the following prompt:\n"

     f"Image of a dog playing water, and a waterfall is in the background."

# generated: tuple of (generated response, list of generated images)

generated = inference_solver.generate(

    images=[],

    qas=[[q1, None]],

    max_gen_len=8192,

    temperature=1.0,

    logits_processor=inference_solver.create_logits_processor(cfg=4.0, image_top_k=2000),

)

a1, new_image = generated[0], generated[1][0]

# ******************* Image Understanding ******************

inference_solver = FlexARInferenceSolver(

    model_path="Alpha-VLLM/Lumina-mGPT-7B-512",

    precision="bf16",

    target_size=512,

)

# "<|image|>" symbol will be replaced with sequence of image tokens before fed to LLM

q1 = "Describe the image in detail. <|image|>"

images = [Image.open("image.png")]

qas = [[q1, None]]

# `len(images)` should be equal to the number of appearance of "<|image|>" in qas

generated = inference_solver.generate(

    images=images,

    qas=qas,

    max_gen_len=8192,

    temperature=1.0,

    logits_processor=inference_solver.create_logits_processor(cfg=4.0, image_top_k=2000),

)

a1 = generated[0]

# generated[1], namely the list of newly generated images, should typically be empty in this case.

# ********************* Omni-Potent *********************

inference_solver = FlexARInferenceSolver(

    model_path="Alpha-VLLM/Lumina-mGPT-7B-768-Omni",

    precision="bf16",

    target_size=768,

)

# Example: Depth Estimation

# For more instructions, see demos/demo_image2image.py

q1 = "Depth estimation. <|image|>"

images = [Image.open("image.png")]

qas = [[q1, None]]

generated = inference_solver.generate(

    images=images,

    qas=qas,

    max_gen_len=8192,

    temperature=1.0,

    logits_processor=inference_solver.create_logits_processor(cfg=1.0, image_top_k=200),

)

a1 = generated[0]

new_image = generated[1][0]

```

## 🤗 Checkpoints

**Configurations**





**7B models**

| Model        | Size | Huggingface                                                                              |

| ------------ | ---- | ---------------------------------------------------------------------------------------- |

| FP-SFT@512   | 7B   | [Alpha-VLLM/Lumina-mGPT-7B-512](https://huggingface.co/Alpha-VLLM/Lumina-mGPT-7B-512)       |

| FP-SFT@768   | 7B   | [Alpha-VLLM/Lumina-mGPT-7B-768](https://huggingface.co/Alpha-VLLM/Lumina-mGPT-7B-768)       |

| Omni-SFT@768 | 7B   | [Alpha-VLLM/Lumina-mGPT-7B-768-Omni](https://huggingface.co/Alpha-VLLM/Lumina-mGPT-7B-768-Omni) |

| FP-SFT@1024  | 7B   | [Alpha-VLLM/Lumina-mGPT-7B-1024](https://huggingface.co/Alpha-VLLM/Lumina-mGPT-7B-1024)     |

**34B models**

| Model      | Size | Huggingface                                                                          |

| ---------- | ---- | ------------------------------------------------------------------------------------ |

| FP-SFT@512 | 34B  | [Alpha-VLLM/Lumina-mGPT-34B-512](https://huggingface.co/Alpha-VLLM/Lumina-mGPT-34B-512) |

More checkpoints coming soon.

## 📑 Open-source Plan

- [X] Inference code

- [X] Training code

## 🔥 Open positions

We are hiring interns, postdocs, and full-time researchers at the General Vision Group, Shanghai AI Lab, with a focus on multi-modality and vision foundation models. If you are interested, please contact gaopengcuhk@gmail.com.

## 📄 Citation

```

@misc{liu2024lumina-mgpt,

      title={Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining},

      author={Dongyang Liu and Shitian Zhao and Le Zhuo and Weifeng Lin and Yu Qiao and Hongsheng Li and Peng Gao},

      year={2024},

      eprint={2408.02657},

      archivePrefix={arXiv},

      primaryClass={cs.CV},

      url={https://arxiv.org/abs/2408.02657},

}

```

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/Alpha-VLLM/Lumina-mGPT

Awesome Lists containing this project

README