
# What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?



**Yan Zeng\*, Hanbo Zhang\*, Jiani Zheng\*, Jiangnan Xia, Guoqiang Wei, Yang Wei, Yuchen Zhang, Tao Kong**
\*Equal Contribution

[![Project](http://img.shields.io/badge/Project-Lynx-E3E4C8.svg)](https://lynx-llm.github.io/)
[![Paper](http://img.shields.io/badge/Paper-arxiv.2307.02469-99D4C8.svg)](https://arxiv.org/abs/2307.02469)

**update**
- Jul 2023: Released the preprint on [arXiv](https://arxiv.org/abs/2307.02469) and the project [page](https://lynx-llm.github.io/)

Lynx (8B parameters):

- Results on Open-VQA image test sets
- Results on Open-VQA video test sets, OwlEval human evaluation, and the MME benchmark
- Ablation results

## Quick Start

### environment
```bash
conda env create -f environment.yml
conda activate lynx
```

### prepare data
#### step 1: prepare annotation file
The Open-VQA annotation files are located at `data/Open_VQA_images.jsonl` and `data/Open_VQA_videos.jsonl`. Here is an example entry:
```jsonc
{
  "dataset": "Open_VQA_images", # the dataset name of your data
  "question": "What is in the image?",
  "answer": ["platform or tunnel"], # answers are given as a list
  "index": 1,
  "image": "images/places365/val_256/Places365_val_00000698.jpg", # relative path of the image
  "origin_dataset": "places365",
  "class": "Place" # one of the eight image VQA types or two video VQA types in the Open-VQA dataset
}
```
You can also convert your own data into this JSONL format; the keys `origin_dataset` and `class` are optional.
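
A minimal sketch of such a conversion is shown below. The `my_samples` list, file paths, and dataset name are hypothetical placeholders, not part of the repository:

```python
import json

# Hypothetical samples of your own data; replace with your real questions/answers.
my_samples = [
    {
        "question": "What is in the image?",
        "answer": ["a red bicycle"],                # answers are given as a list
        "image": "images/my_dataset/img_0001.jpg",  # relative path of the image
    },
]

# Write one JSON object per line (JSONL), mirroring data/Open_VQA_images.jsonl.
with open("data/my_vqa_images.jsonl", "w", encoding="utf-8") as f:
    for idx, sample in enumerate(my_samples, start=1):
        record = {
            "dataset": "my_vqa_images",  # name identifying this test set
            "question": sample["question"],
            "answer": sample["answer"],
            "index": idx,
            "image": sample["image"],
            # "origin_dataset" and "class" are optional and omitted here
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```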

#### step 2: prepare images
Download the raw images and videos from the corresponding websites: [Places365(256x256)](http://places2.csail.mit.edu/download.html), [VQAv2](https://visualqa.org/download.html), [OCRVQA](https://ocr-vqa.github.io/), [Something-Something-v.2](https://developer.qualcomm.com/software/ai-datasets/something-something), [MSVD-QA](https://github.com/xudejing/video-question-answering), [NeXT-QA](https://github.com/doc-doc/NExT-QA), and [MSRVTT-QA](https://github.com/xudejing/video-question-answering).
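
Once downloaded, it is worth confirming that every file referenced in the annotation JSONL actually exists on disk before running evaluation. The helper below is a hypothetical sketch, not part of the repository; it assumes the relative `image`/`video` paths resolve against the repository root as in the file layout shown further down, so adjust `DATA_ROOT` if your setup differs:

```python
import json
import os

# Assumption: relative "image"/"video" paths resolve against the repository root.
DATA_ROOT = "."

def report_missing(jsonl_path: str, vision_key: str = "image") -> None:
    """Print annotation entries whose image/video file is not on disk."""
    missing = 0
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            path = os.path.join(DATA_ROOT, record[vision_key])
            if not os.path.exists(path):
                missing += 1
                print("missing:", path)
    print(f"{jsonl_path}: {missing} missing file(s)")

report_missing("data/Open_VQA_images.jsonl", vision_key="image")
report_missing("data/Open_VQA_videos.jsonl", vision_key="video")
```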

#### step 3: modify the default setting in the code
You need to check some important settings in the config file `configs/LYNX.yaml`, for example:
```yaml
# change this prompt for different tasks; this is the default prompt
prompt: "User: {question}\nBot:"
# the key must match the vision key in test_files
# if you test Open_VQA_videos.jsonl, change this to "video"
vision_prompt_dict: "image"
output_prompt_dict: "answer"
```
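
For intuition, the sketch below shows how these settings are presumably consumed: the prompt template is filled with each question, and the vision key selects the media path from an annotation record. This is an illustrative reading of the config, not the repository's actual inference code:

```python
import json

import yaml  # PyYAML

# Load the same config used by the inference scripts.
with open("configs/LYNX.yaml", encoding="utf-8") as f:
    config = yaml.safe_load(f)

prompt_template = config["prompt"]         # e.g. "User: {question}\nBot:"
vision_key = config["vision_prompt_dict"]  # "image" for image test sets, "video" for videos

# Fill the template with the first annotation entry.
with open("data/Open_VQA_images.jsonl", encoding="utf-8") as f:
    record = json.loads(f.readline())

text_input = prompt_template.format(question=record["question"])
media_path = record[vision_key]
print(media_path, "->", repr(text_input))
```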

### prepare checkpoint
- step 1: download the `eva_vit_1b` checkpoint (`EVA01_g_psz14.pt`) from the official [website](https://huggingface.co/QuanSun/EVA-CLIP/blob/main/EVA01_g_psz14.pt), put it under `data/`, and rename it to `eva_vit_g.pth`
- step 2: prepare `vicuna-7b` and put it under `data/`
  - method 1: download it from [huggingface](https://huggingface.co/lmsys/vicuna-7b-v1.1) directly.
  - method 2:
    - download Vicuna's **delta** weights from the [v1.1 version](https://huggingface.co/lmsys/vicuna-7b-delta-v1.1) (use git-lfs)
    - get `LLaMA-7b` from [here](https://huggingface.co/docs/transformers/main/model_doc/llama) or from the Internet.
    - install FastChat: `pip install git+https://github.com/lm-sys/FastChat.git`
    - run `python -m fastchat.model.apply_delta --base /path/to/llama-7b-hf/ --target ./data/vicuna-7b/ --delta /path/to/vicuna-7b-delta-v1.1/`
- step 3: download [pretrain_lynx.pt](https://lf-robot-opensource.bytetos.com/obj/lab-robot-public/lynx_release/pretrain_lynx.pt) or [finetune_lynx.pt](https://lf-robot-opensource.bytetos.com/obj/lab-robot-public/lynx_release/finetune_lynx.pt) and put it under `data/` (make sure the `checkpoint` entry in the config matches the file you downloaded)

organize the files like this:
```text
lynx-llm/
    data/
        Open_VQA_images.jsonl
        Open_VQA_videos.jsonl
        eva_vit_g.pth
        vicuna-7b/
        finetune_lynx.pt
        pretrain_lynx.pt
    images/
        vqav2/val2014/*.jpg
        places365/val_256/*.jpg
        ocrvqa/images/*.jpg
        sthsthv2cap/val/*.mp4
        msvdqa/test/*.mp4
        nextqa/*.mp4
        msrvttqa/*.mp4
```
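
Before running inference, a quick sanity check can confirm that the checkpoint files from the previous steps are in place. This is a hypothetical helper, not part of the repository:

```python
import os

# Files/directories expected after the "prepare checkpoint" steps.
required = [
    "data/eva_vit_g.pth",
    "data/vicuna-7b",
    "data/finetune_lynx.pt",  # or data/pretrain_lynx.pt, depending on your config
]

for path in required:
    print(("ok      " if os.path.exists(path) else "MISSING ") + path)
```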

### infer
```bash
sh generate.sh
```

## Citation
If you find this repository useful, please consider giving it a ⭐ or citing:
```bibtex
@article{zeng2023matters,
  title={What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?},
  author={Zeng, Yan and Zhang, Hanbo and Zheng, Jiani and Xia, Jiangnan and Wei, Guoqiang and Wei, Yang and Zhang, Yuchen and Kong, Tao},
  journal={arXiv preprint arXiv:2307.02469},
  year={2023}
}
```

## Contact
For issues using this code, please submit a GitHub issue.

## License

This project is licensed under the [Apache-2.0 License](LICENSE).