https://github.com/jy0205/LaVIT

LaVIT: Empower the Large Language Model to Understand and Generate Visual Content

Host: GitHub
URL: https://github.com/jy0205/LaVIT
Owner: jy0205
License: other
Created: 2023-09-09T02:21:27.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2024-07-01T15:39:32.000Z (10 months ago)
Last Synced: 2024-08-07T23:51:33.235Z (9 months ago)
Language: Jupyter Notebook
Homepage:
Size: 83.5 MB
Stars: 471
Watchers: 15
Forks: 24
Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

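StarryDivineSky - jy0205/LaVIT - unified modeling of language. The core idea of LaVIT is to use visual tokens as a bridge for visual information, so that a language model can process images the same way it processes text. The project supports several vision tasks, such as image captioning, visual question answering, and image generation. Training has two stages, pre-training and fine-tuning: pre-training learns the representation of the visual tokens, and fine-tuning optimizes for specific tasks. The project provides detailed code and documentation to make experiments and further development easier. LaVIT's main advantages are its simplicity and extensibility: it can be integrated into existing language models easily and supports multiple visual modalities. LaVIT offers a valuable framework for exploring general vision-language models. (Multimodal large models / Resource transfer and download)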

README

# LaVIT: Empower the Large Language Model to Understand and Generate Visual Content

This is the official repository for the multi-modal large language models **LaVIT** and **Video-LaVIT**. The LaVIT project aims to leverage the capabilities of LLMs to handle visual content. The proposed pre-training strategy supports both visual understanding and visual generation within one unified framework.

* Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization, ICLR 2024, [[`arXiv`](https://arxiv.org/abs/2309.04669)] [[`BibTeX`](#citation)]

* Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization, [[`arXiv`](https://arxiv.org/abs/2402.03161)] [[`Project`](https://video-lavit.github.io)] [[`BibTeX`](#citation)]

## News and Updates

* ```2024.04.21``` 🚀🚀🚀 We have released the pre-trained weights for **Video-LaVIT** on HuggingFace and provided the inference code.

* ```2024.02.05``` 🌟🌟🌟 We have proposed **Video-LaVIT**, an effective multimodal pre-training approach that empowers LLMs to comprehend and generate video content in a unified framework.

* ```2024.01.15``` 👏👏👏 LaVIT has been accepted by ICLR 2024!

* ```2023.10.17``` 🚀🚀🚀 We have released the pre-trained weights for **LaVIT** on HuggingFace and provided inference code for both multi-modal understanding and generation.

## Introduction
**LaVIT** and **Video-LaVIT** are general-purpose multi-modal foundation models that inherit the successful learning paradigm of LLMs: predicting the next visual or textual token in an auto-regressive manner. The core design of the LaVIT series consists of a **visual tokenizer** and a **detokenizer**. The visual tokenizer translates non-linguistic visual content (e.g., images and videos) into a sequence of discrete tokens, like a foreign language that the LLM can read. The detokenizer converts the discrete tokens generated by the LLM back into continuous visual signals.

LaVIT Pipeline

Video-LaVIT Pipeline
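
To make the tokenizer and detokenizer roles concrete, here is a minimal toy sketch of discrete visual tokenization by nearest-neighbour codebook lookup. It is an illustration under stated assumptions, not the LaVIT implementation: the function names, the tiny 2-D codebook, and the feature values are all hypothetical, and the real models use learned networks in both directions rather than a plain table lookup.

```python
# Toy illustration only: NOT the LaVIT tokenizer. All names and values are hypothetical.
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def tokenize(features: List[Vector], codebook: List[Vector]) -> List[int]:
    """Map each continuous feature vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda i: squared_distance(f, codebook[i]))
        for f in features
    ]


def detokenize(token_ids: List[int], codebook: List[Vector]) -> List[Vector]:
    """Map discrete token ids back to continuous vectors (here: a plain lookup)."""
    return [codebook[i] for i in token_ids]


# A hypothetical codebook with four entries and three "patch features".
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
patch_features = [(0.1, -0.1), (0.9, 0.2), (0.2, 0.8)]

token_ids = tokenize(patch_features, codebook)
reconstructed = detokenize(token_ids, codebook)
print(token_ids)       # discrete ids an LLM could consume
print(reconstructed)   # continuous approximation of the input
```

The point of the sketch is only the round trip: continuous features become integer ids, and integer ids become continuous values again, with some quantization error in between.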

After pre-training, LaVIT and Video-LaVIT support:

* Reading image and video content, generating captions, and answering questions.
* Text-to-image, text-to-video, and image-to-video generation.
* Generation from multi-modal prompts.
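
Once visual content is expressed as discrete tokens, a multi-modal prompt can be treated as a single token sequence for next-token prediction. The sketch below shows one way such a sequence might be assembled. It is an assumption for illustration, not the repository's API: the vocabulary size, the begin and end markers, the id offset, and the helper names are all hypothetical.

```python
# Toy illustration only: NOT the LaVIT API. Every constant and name here is hypothetical.
from typing import List, Tuple

TEXT_VOCAB_SIZE = 1000               # assumed size of the text vocabulary
BOI = TEXT_VOCAB_SIZE                # assumed "begin of image" marker id
EOI = TEXT_VOCAB_SIZE + 1            # assumed "end of image" marker id
VISUAL_OFFSET = TEXT_VOCAB_SIZE + 2  # visual ids are shifted past text ids and markers


def wrap_visual(visual_ids: List[int]) -> List[int]:
    """Shift visual token ids into their own id range and add boundary markers."""
    return [BOI] + [VISUAL_OFFSET + v for v in visual_ids] + [EOI]


def build_prompt(segments: List[Tuple[str, List[int]]]) -> List[int]:
    """Concatenate interleaved text and image segments into one token sequence."""
    sequence: List[int] = []
    for kind, ids in segments:
        if kind == "text":
            sequence.extend(ids)
        elif kind == "image":
            sequence.extend(wrap_visual(ids))
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return sequence


# Text, then an image, then more text, all in one sequence.
prompt = build_prompt([("text", [5, 17]), ("image", [0, 3]), ("text", [42])])
print(prompt)
```

Keeping text and visual ids in disjoint ranges lets one auto-regressive model predict either kind of token from the same output layer, which is the general idea behind a unified framework.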

## Citation
If LaVIT helps your research, consider giving this repository a star and citing it in your publications.

```
@article{jin2023unified,
title={Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization},
author={Jin, Yang and Xu, Kun and Xu, Kun and Chen, Liwei and Liao, Chao and Tan, Jianchao and Mu, Yadong and others},
journal={arXiv preprint arXiv:2309.04669},
year={2023}
}

@article{jin2024video,
title={Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization},
author={Jin, Yang and Sun, Zhicheng and Xu, Kun and Chen, Liwei and Jiang, Hao and Huang, Quzhe and Song, Chengru and Liu, Yuliang and Zhang, Di and Song, Yang and others},
journal={arXiv preprint arXiv:2402.03161},
year={2024}
}
```
