An open API service indexing awesome lists of open source software.

https://github.com/hewei2001/reachqa

Code & Dataset for Paper: "Distill Visual Chart Reasoning Ability from LLMs to MLLMs"
https://github.com/hewei2001/reachqa

data-synthesis llm mllm

Last synced: 11 months ago
JSON representation

Code & Dataset for Paper: "Distill Visual Chart Reasoning Ability from LLMs to MLLMs"

Awesome Lists containing this project

README

          


๐Ÿช„Distill Visual Chart Reasoning Ability

from LLMs to MLLMs

This is the official repository for ๐Ÿ“ƒ[Distill Visual Chart Reasoning Ability from LLMs to MLLMs](https://arxiv.org/abs/2410.18798).

You have two options to obtain our dataset:

1. Download directly from the ๐Ÿค—**HuggingFace** Datasets: [hewei2001/ReachQA](https://huggingface.co/datasets/hewei2001/ReachQA).
2. Clone this repository and **generate ๐Ÿ“Šcharts using the synthetic code**: The process takes about **3 minutes**!

## ๐Ÿ“–Introduction

### ๐Ÿ”ฎCode-as-Intermediary Translation

We propose **Code-as-Intermediary Translation (CIT)**, a cost-effective, efficient and easily scalable data synthesis method for distilling visual reasoning abilities **from LLMs to MLLMs**. The code serves as an intermediary that translates visual chart representations into textual representations, enabling LLMs to understand cross-modal information. Specifically, we employ text-based synthesizing techniques to construct chart-plotting code and produce **ReachQA**, a dataset containing 3k **rea**soning-intensive **ch**arts and 20k Q&A pairs to enhance both recognition and reasoning abilities. Experiments show that when fine-tuned with our data, models not only perform well on chart-related benchmarks, but also demonstrate improved multimodal reasoning abilities on general mathematical benchmarks such as MathVista.

> Figure: Overview of the CIT method for synthesizing multimodal instruction data. The process begins with **33 seed codes** and generates plot codes across various chart types, topics, and complexity levels through the Self-Instruct and Evol-Instruct stages. The chart set and instruction set are constructed bi-directionally, and the final filtered data yields ReachQA, a dataset for distilling visual chart reasoning abilities from LLMs to MLLMs.

### ๐Ÿ“ˆReachQA

> Table: Comparison of existing chart-related datasets across **three properties**. Only the chart question-answering (CQA) task is considered, despite some datasets having multiple tasks. Abbreviations: Vis.=visual, Comp.=complexity, Temp.=template, Refer.=Reference, Reas.=reasoning, Rat.=rationale, Annot.=annotation and Scal.=scalable.

> Table: ReachQA dataset statistics. Question and answer lengths are calculated based on the GPT-4o tokenizer.

## ๐Ÿ› Install

1. For dataset usage:
```bash
git clone https://github.com/hewei2001/ReachQA.git
cd ReachQA
conda create -n ReachQA_data python=3.10 -y
conda activate ReachQA_data

pip install -r requirements_data.txt
pip install lmdeploy # Optional, for MLLM filter
```

2. For training / evaluation usage:
```Shell
git clone https://github.com/hewei2001/ReachQA.git
cd ReachQA
conda create -n ReachQA_train python=3.10 -y
conda activate ReachQA_train

pip install -r requirements_train.txt --force-reinstall --no-deps
```

## ๐ŸŒณProject Structure

```
ReachQA
โ”œโ”€โ”€ assets
โ”œโ”€โ”€ data
โ”‚ โ”œโ”€โ”€ reachqa_seed
โ”‚ โ”œโ”€โ”€ reachqa_test
โ”‚ โ””โ”€โ”€ reachqa_train
โ”œโ”€โ”€ scripts
โ”‚ โ”œโ”€โ”€ data
โ”‚ โ”œโ”€โ”€ eval
โ”‚ โ”œโ”€โ”€ filter
โ”‚ โ””โ”€โ”€ train
โ”œโ”€โ”€ utils
โ”‚ โ”œโ”€โ”€ chart_notes.py
โ”‚ โ”œโ”€โ”€ openai_utils.py
โ”‚ โ””โ”€โ”€ __init__.py
โ”œโ”€โ”€ batch_filter_image.py
โ”œโ”€โ”€ batch_filter_QA.py
โ”œโ”€โ”€ openai_generate_code.py
โ”œโ”€โ”€ openai_generate_QA.py
โ”œโ”€โ”€ openai_llm_evaluation.py
โ”œโ”€โ”€ swift_infer_dataset.py
โ”œโ”€โ”€ requirements_data.txt
โ””โ”€โ”€ README.md
```
| File | Description |
|--------------------------|--------------------------------------------|
| assets/ | Folder for project-related resources |
| data/ | Folder for dataset storage |
| scripts/ | Folder for scripts to run |
| utils/ | Folder for utility functions |
| batch_filter_QA.py | Code for filtering Q&A with MLLMs |
| batch_filter_image.py | Code for filtering images with MLLMs |
| openai_generate_QA.py | Code for synthesizing Q&A |
| openai_generate_code.py | Code for synthesizing code for charts |
| openai_llm_evaluation.py | Code for LLM-as-a-Jugde evaluation |

## โฉ๏ธQuick Start

1. **Obtain ReachQA dataset in 3 minutes:**

```bash
cd ReachQA
conda activate ReachQA_data

python ./data/reachqa_train/execute_code.py \
--code_dir ./data/reachqa_train/code/ \
--image_dir ./data/reachqa_train/images/

python ./data/reachqa_test/execute_code.py
--code_dir ./data/reachqa_test/code/ \
--image_dir ./data/reachqa_test/images/
```

2. **Data Construction with CIT:**

Before generating, the parameters in the `scripts/` should be modified!

```bash
cd ReachQA
conda activate ReachQA_data

# Generate code
bash ./scripts/data/run_openai_generate_code.sh

# Execute code and generate images
python ./data/reachqa_train/execute_code.py \
--code_dir ./data/reachqa_train/all_code/ \
--image_dir ./data/reachqa_train/all_images/

# Filter images
bash ./scripts/filter/run_rating_images.sh
python ./data/reachqa_train/filter_rated_image.py \
--data_dir ./data/reachqa_train/

# Generate QA
bash ./scripts/data/run_openai_generate_QA.sh

# Filter QA
bash ./scripts/filter/run_rating_QA.sh
python ./data/reachqa_train/filter_rated_QA.py \
--data_dir ./data/reachqa_train/
```

3. **Training / Inference / Evaluation:**

Before training, the JSON instruction file needs to be processed into **Swift format**!

For the specific format, refer to the [Official Swift Documentation](https://github.com/modelscope/ms-swift/tree/main).
```bash
cd ReachQA
conda activate ReachQA_train

# Swift format
cd ./data/reachqa_train/
python process_to_swift_internvl.py

# Training
cd ../..
bash ./scripts/train/internvl2_lora.sh

# Inference
bash ./scripts/eval/infer_InternVL2-8B.sh

# Evaluation
bash ./scripts/eval/run_openai_evaluation.sh
```
## ๐ŸŒŸMain Results

> Table: Evaluation results on seven benchmarks. Details for these benchmarks and models are presented in ยง 4.1. The best performance for each category and task is in **bold**. The percentage of performance improvements compared to the vanilla model is denoted by (โ†‘).

---

> Figure: An example of **attention visualization** from the ChartQA dataset. The top row shows the results from the vanilla LLaVA-Next-Llama3-8B model, while the bottom row displays the results from our fine-tuned model. For each output, we present the attention distribution (highlighted zones) at **three key steps**, calculated by averaging the attention values of all tokens in each step.

## ๐Ÿ“ŒTODOs

- [x] Release the implementation of Code-as-Intermediary Translation (CIT).
- [x] Release the example code for training & evaluation.
- [x] Release the full ReachQA dataset we used in this paper.
- [ ] Release the [vllm](https://github.com/vllm-project/vllm)-implementation of CIT, for generating data with open-source LLMs.
- [ ] Release the manually curated ReachQA-v2 training set.

## ๐Ÿ“งContact

If you have any questions, please feel free to reach us at [whe23@m.fudan.edu.cn](mailto:whe23@m.fudan.edu.cn).

## ๐Ÿ”ŽCitation

If you find our work helpful or relevant to your research, please kindly cite our paper:

```
@article{he2024distill,
title={Distill Visual Chart Reasoning Ability from LLMs to MLLMs},
author={He, Wei and Xi, Zhiheng and Zhao, Wanxu and Fan, Xiaoran and Ding, Yiwen and Shan, Zifei and Gui, Tao and Zhang, Qi and Huang, Xuan-Jing},
journal={arXiv preprint arXiv:2410.18798},
year={2024}
}
```