https://github.com/Yangyi-Chen/SOLO
[TMLR] Public code repo for paper "A Single Transformer for Scalable Vision-Language Modeling"
- Host: GitHub
- URL: https://github.com/Yangyi-Chen/SOLO
- Owner: Yangyi-Chen
- License: apache-2.0
- Created: 2024-07-04T00:24:33.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-11-14T00:01:34.000Z (about 1 year ago)
- Last Synced: 2024-11-14T01:17:45.391Z (about 1 year ago)
- Language: Jupyter Notebook
- Size: 2.4 MB
- Stars: 113
- Watchers: 2
- Forks: 4
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project:
- ai-game-devtools: SOLO ([arXiv](https://arxiv.org/abs/2407.06438)), listed under VLM (Visual)
README
# [TMLR] SOLO: A Single Transformer for Scalable Vision-Language Modeling

📃 [Paper](https://arxiv.org/abs/2407.06438) • 🤗 Model (SOLO-7B)

We present **SOLO**, a **single Transformer architecture for unified vision-language modeling**.
SOLO accepts both raw image patches (in *pixels*) and text as input, *without* relying on a separate pre-trained vision encoder.
## TODO Roadmap
 ✅ **Release the instruction tuning data mixture**
 ✅ **Release the [code for instruction tuning](https://github.com/Yangyi-Chen/SOLO/blob/main/SFT_GUIDE.md)**
 ✅ **Release the [pre-training code](https://github.com/Yangyi-Chen/SOLO/blob/main/PRETRAIN_GUIDE.md)**
 ✅ **Release the SOLO model** 🤗 Model (SOLO-7B)
 ✅ **Paper on arXiv** 📃 Paper
## Setup
### Clone Repo
```bash
git clone https://github.com/Yangyi-Chen/SOLO
cd SOLO
git submodule update --init --recursive
```
### Setup Environment for Data Processing
```bash
conda env create -f environment.yml
conda activate solo
```
OR simply
```bash
pip install -r requirements.txt
```
## SOLO Inference with Hugging Face
Check [`scripts/notebook/demo.ipynb`](scripts/notebook/demo.ipynb) for an example of performing inference on the model.
## Pre-Training
Please refer to [PRETRAIN_GUIDE.md](PRETRAIN_GUIDE.md) for more details on how to perform pre-training, including the statistics of the pre-training data mixture.

## Instruction Fine-Tuning
Please refer to [SFT_GUIDE.md](SFT_GUIDE.md) for more details on how to perform instruction fine-tuning, including the statistics of the instruction-tuning data mixture.

## Citation
If you use or extend our work, please consider citing our paper.
```bibtex
@article{chen2024single,
  title={A Single Transformer for Scalable Vision-Language Modeling},
  author={Chen, Yangyi and Wang, Xingyao and Peng, Hao and Ji, Heng},
  journal={arXiv preprint arXiv:2407.06438},
  year={2024}
}
```