{"id":22340851,"url":"https://github.com/hewei2001/reachqa","last_synced_at":"2025-07-30T01:31:38.052Z","repository":{"id":259520063,"uuid":"878044680","full_name":"hewei2001/ReachQA","owner":"hewei2001","description":"Code \u0026 Dataset for Paper: \"Distill Visual Chart Reasoning Ability from LLMs to MLLMs\"","archived":false,"fork":false,"pushed_at":"2024-10-28T07:31:58.000Z","size":10295,"stargazers_count":51,"open_issues_count":3,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-05T22:51:12.866Z","etag":null,"topics":["data-synthesis","llm","mllm"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2410.18798","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hewei2001.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-10-24T17:20:07.000Z","updated_at":"2025-03-25T09:56:25.000Z","dependencies_parsed_at":"2024-12-04T07:53:49.375Z","dependency_job_id":null,"html_url":"https://github.com/hewei2001/ReachQA","commit_stats":null,"previous_names":["hewei2001/reachqa"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/hewei2001/ReachQA","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hewei2001%2FReachQA","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hewei2001%2FReachQA/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hewei2001%2FReachQA/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hewei2001%2FReachQA/manifests","owner_url":"
https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hewei2001","download_url":"https://codeload.github.com/hewei2001/ReachQA/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hewei2001%2FReachQA/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":267792646,"owners_count":24144929,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-29T02:00:12.549Z","response_time":2574,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-synthesis","llm","mllm"],"created_at":"2024-12-04T07:42:05.404Z","updated_at":"2025-07-30T01:31:36.304Z","avatar_url":"https://github.com/hewei2001.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=center\u003e\u003ch1\u003e\n    🪄Distill Visual Chart Reasoning Ability\u003cbr\u003e\n    from LLMs to MLLMs\n\u003c/h1\u003e\u003c/div\u003e\n\nThis is the official repository for 📃[Distill Visual Chart Reasoning Ability from LLMs to MLLMs](https://arxiv.org/abs/2410.18798).\n\nYou have two options to obtain our dataset:\n\n1. Download directly from the 🤗**HuggingFace** Datasets: [hewei2001/ReachQA](https://huggingface.co/datasets/hewei2001/ReachQA).\n2. 
Clone this repository and **generate 📊charts using the synthetic code**: The process takes about **3 minutes**!\n\n## 📖Introduction\n\n### 🔮Code-as-Intermediary Translation\n\nWe propose **Code-as-Intermediary Translation (CIT)**, a cost-effective, efficient and easily scalable data synthesis method for distilling visual reasoning abilities **from LLMs to MLLMs**.  The code serves as an intermediary that translates visual chart representations into textual representations, enabling LLMs to understand cross-modal information. Specifically, we employ text-based synthesizing techniques to construct chart-plotting code and produce **ReachQA**, a dataset containing 3k **rea**soning-intensive **ch**arts and 20k Q\u0026A pairs to enhance both recognition and reasoning abilities.  Experiments show that when fine-tuned with our data, models not only perform well on chart-related benchmarks, but also demonstrate improved multimodal reasoning abilities on general mathematical benchmarks such as MathVista.\n\n\u003cdiv align=center\u003e\u003cimg src=\"./assets/ReachQA.jpg\" width=\"90%\" /\u003e\u003c/div\u003e\n\n\u003e Figure: Overview of the CIT method for synthesizing multimodal instruction data. The process begins with **33 seed codes** and generates plot codes across various chart types, topics, and complexity levels through the Self-Instruct and Evol-Instruct stages. The chart set and instruction set are constructed bi-directionally, and the final filtered data yields ReachQA, a dataset for distilling visual chart reasoning abilities from LLMs to MLLMs.\n\n### 📈ReachQA\n\n\u003e Table: Comparison of existing chart-related datasets across **three properties**. Only the chart question-answering (CQA) task is considered, despite some datasets having multiple tasks. 
Abbreviations: Vis.=visual, Comp.=complexity, Temp.=template, Refer.=Reference, Reas.=reasoning, Rat.=rationale, Annot.=annotation and Scal.=scalable.\n\n\u003cdiv align=center\u003e\u003cimg src=\"./assets/Compare.png\" width=\"90%\" /\u003e\u003c/div\u003e\n\n\u003e Table: ReachQA dataset statistics. Question and answer lengths are calculated based on the GPT-4o tokenizer.\n\n\u003cdiv align=center\u003e\u003cimg src=\"./assets/statistics.png\" width=\"40%\" /\u003e\u003c/div\u003e\n\n## 🛠Install\n\n1. For dataset usage:\n```bash\ngit clone https://github.com/hewei2001/ReachQA.git\ncd ReachQA\nconda create -n ReachQA_data python=3.10 -y\nconda activate ReachQA_data\n\npip install -r requirements_data.txt\npip install lmdeploy # Optional, for MLLM filter\n```\n\n2. For training / evaluation usage:\n```Shell\ngit clone https://github.com/hewei2001/ReachQA.git\ncd ReachQA\nconda create -n ReachQA_train python=3.10 -y\nconda activate ReachQA_train\n\npip install -r requirements_train.txt --force-reinstall --no-deps\n```\n\n## 🌳Project Structure\n\n```\nReachQA\n ├── assets\n ├── data\n │   ├── reachqa_seed\n │   ├── reachqa_test\n │   └── reachqa_train\n ├── scripts\n │   ├── data\n │   ├── eval\n │   ├── filter\n │   └── train\n ├── utils\n │   ├── chart_notes.py\n │   ├── openai_utils.py\n │   └── __init__.py\n ├── batch_filter_image.py\n ├── batch_filter_QA.py\n ├── openai_generate_code.py\n ├── openai_generate_QA.py\n ├── openai_llm_evaluation.py\n ├── swift_infer_dataset.py\n ├── requirements_data.txt\n └── README.md\n```\n| File                     | Description                                |\n|--------------------------|--------------------------------------------|\n| assets/                  | Folder for project-related resources       |\n| data/                    | Folder for dataset storage                 |\n| scripts/                 | Folder for scripts to run |\n| utils/                   | Folder for utility functions               |\n| 
batch_filter_QA.py      | Code for filtering Q\u0026A with MLLMs |\n| batch_filter_image.py    | Code for filtering images with MLLMs |\n| openai_generate_QA.py    | Code for synthesizing Q\u0026A |\n| openai_generate_code.py  | Code for synthesizing code for charts |\n| openai_llm_evaluation.py | Code for LLM-as-a-Judge evaluation |\n\n## ⏩️Quick Start\n\n1. **Obtain ReachQA dataset in 3 minutes:**\n\n```bash\ncd ReachQA\nconda activate ReachQA_data\n\npython ./data/reachqa_train/execute_code.py \\\n\t--code_dir ./data/reachqa_train/code/ \\\n\t--image_dir ./data/reachqa_train/images/ \n\t\npython ./data/reachqa_test/execute_code.py \\\n\t--code_dir ./data/reachqa_test/code/ \\\n\t--image_dir ./data/reachqa_test/images/ \n```\n\n2. **Data Construction with CIT:**\n\nBefore generating, modify the parameters in the `scripts/` directory!\n\n```bash\ncd ReachQA\nconda activate ReachQA_data\n\n# Generate code\nbash ./scripts/data/run_openai_generate_code.sh\n\n# Execute code and generate images\npython ./data/reachqa_train/execute_code.py \\\n\t--code_dir ./data/reachqa_train/all_code/ \\\n\t--image_dir ./data/reachqa_train/all_images/ \n\n# Filter images\nbash ./scripts/filter/run_rating_images.sh\npython ./data/reachqa_train/filter_rated_image.py \\\n\t--data_dir ./data/reachqa_train/\n\n# Generate QA\nbash ./scripts/data/run_openai_generate_QA.sh\n\n# Filter QA\nbash ./scripts/filter/run_rating_QA.sh\npython ./data/reachqa_train/filter_rated_QA.py \\\n\t--data_dir ./data/reachqa_train/\n```\n\n3. 
**Training / Inference / Evaluation:**\n\nBefore training, the JSON instruction file needs to be processed into **Swift format**!\n\nFor the specific format, refer to the [Official Swift Documentation](https://github.com/modelscope/ms-swift/tree/main).\n```bash\ncd ReachQA\nconda activate ReachQA_train\n\n# Swift format\ncd ./data/reachqa_train/\npython process_to_swift_internvl.py\n\n# Training\ncd ../..\nbash ./scripts/train/internvl2_lora.sh\n\n# Inference\nbash ./scripts/eval/infer_InternVL2-8B.sh\n\n# Evaluation\nbash ./scripts/eval/run_openai_evaluation.sh\n```\n## 🌟Main Results\n\n\u003e Table: Evaluation results on seven benchmarks. Details for these benchmarks and models are presented in § 4.1. The best performance for each category and task is in **bold**. The percentage of performance improvements compared to the vanilla model is denoted by (↑).\n\n\u003cdiv align=center\u003e\u003cimg src=\"./assets/results.png\" width=\"90%\" /\u003e\u003c/div\u003e\n\n---\n\n\u003cdiv align=center\u003e\u003cimg src=\"./assets/attention.png\" width=\"90%\" /\u003e\u003c/div\u003e\n\n\u003e Figure: An example of **attention visualization** from the ChartQA dataset. The top row shows the results from the vanilla LLaVA-Next-Llama3-8B model, while the bottom row displays the results from our fine-tuned model. 
For each output, we present the attention distribution (highlighted zones) at **three key steps**, calculated by averaging the attention values of all tokens in each step.\n\n## 📌TODOs\n\n- [x] Release the implementation of Code-as-Intermediary Translation (CIT).\n- [x] Release the example code for training \u0026 evaluation.\n- [x] Release the full ReachQA dataset we used in this paper.\n- [ ] Release the [vllm](https://github.com/vllm-project/vllm)-implementation of CIT, for generating data with open-source LLMs.\n- [ ] Release the manually curated ReachQA-v2 training set.\n\n## 📧Contact\n\nIf you have any questions, please feel free to reach us at [whe23@m.fudan.edu.cn](mailto:whe23@m.fudan.edu.cn).\n\n## 🔎Citation\n\nIf you find our work helpful or relevant to your research, please kindly cite our paper:\n\n```\n@article{he2024distill,\n      title={Distill Visual Chart Reasoning Ability from LLMs to MLLMs}, \n      author={He, Wei and Xi, Zhiheng and Zhao, Wanxu and Fan, Xiaoran and Ding, Yiwen and Shan, Zifei and Gui, Tao and Zhang, Qi and Huang, Xuan-Jing},\n      journal={arXiv preprint arXiv:2410.18798},\n      year={2024}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhewei2001%2Freachqa","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhewei2001%2Freachqa","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhewei2001%2Freachqa/lists"}