https://github.com/andrewliao11/longperceptualthoughts

[COLM'25] The official implementation of "LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception"
https://github.com/andrewliao11/longperceptualthoughts

computer-vision large-language-models reasoning reasoning-language-models vision-language-model visual-reasoning

Last synced: 10 months ago
JSON representation

[COLM'25] The official implementation of "LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception"

Host: GitHub
URL: https://github.com/andrewliao11/longperceptualthoughts
Owner: andrewliao11
License: apache-2.0
Created: 2025-03-30T16:45:20.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-08-04T18:23:15.000Z (11 months ago)
Last Synced: 2025-08-18T05:19:24.872Z (11 months ago)
Topics: computer-vision, large-language-models, reasoning, reasoning-language-models, vision-language-model, visual-reasoning
Language: Python
Homepage: https://andrewliao11.github.io/LongPerceptualThoughts/
Size: 5.03 MB
Stars: 7
Watchers: 2
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

# LongPerceptualThoughts

A data engine that produces long **Chain-of-thoughts** (CoTs) data for visual reasoning. This is a joint work with Sven Elflein, Liu He, Laura Leal-Taixé, Yejin Choi, Sanja Fidler, and David Acuna.

**🎉 This paper is accepted to COLM'25. See you in Montreal**

[**paper**](https://arxiv.org/abs/2504.15362) |
[**website**](https://andrewliao11.github.io/LongPerceptualThoughts/) |
[**dataset host on Huggingface**](https://huggingface.co/datasets/andrewliao11/LongPerceptualThoughts-30k) |
[**checkpoints on Huggingface**](https://huggingface.co/collections/andrewliao11/longperceptualthoughts-6882358a8a6143fe5b4c5f44) |
[**X post**](https://x.com/andrewliao11/status/1917602672493973818)

![](./assets/overall_pipeline.gif)

## News
- ⭐ 2025/08/05: released checkpoints
- ⭐ 2025/05/26: updated LLaMA-Factory version for DPO training
- ⭐ 2025/05/23: released train and eval code
- ⭐ 2025/05/09: released code for data generation
- ⭐ 2025/04/21: released paper and dataset

## 🔧 Usage

### Environment setup

Prerequisites

1. CUDA==12.4
2. torch==2.6.0
3. transformers==4.53.2
4. xformers==0.0.29.post2

Simple environment setup

```
git clone https://github.com/andrewliao11/LongPerceptualThoughts.git --recursive
cd LongPerceptualThoughts/

conda env create -f environment.yml -n LongPerceptualThoughts
# or use the script to install the environment line-by-line:
# conda create -n LongPerceptualThoughts python==3.10 -y
# conda activate LongPerceptualThoughts
# bash scripts/install_conda_env.sh
```

Note: Both LLaMA-Factory and vllm are actively developed open-source projecets and the code might break when there are version mismatches.

### Evaluate our checkpoints

The following snippet will download and prepare the benchmark data in ShareGPT format. Then download our checkpoints for evaluation.
```bash
# 1. Prepare evaluation benchmark
bash ./scripts/prepare_benchmark.sh
# 2. Run evaluation using vllm and LLaMA-factory
bash ./scripts/evaluate_lpt_checkpoints.sh
```

### Generate your own LongPerceptualThoughts

We provide a three-stage data synthesis pipeline using image-caption datasets (e.g., [google/DOCCI](https://huggingface.co/datasets/google/docci)) to generate multiple-choice questions, short CoTs and long CoTs.

```bash
export OPENAI_API_KEY=API_KEY # Model used in stage 1
export QWEN2_5_VL_INSTRUCT_PATH="/PATH/TO/QWEN2.5-VL-INSTRUCT-7B" # Model used in stage 2
export R1_DISTILLED_QWEN_32_B="/PATH/TO/R1-DISTILLED-QWEN-32B" # Model used in stage 3
bash ./scripts/generate_custom_lpt.sh
```

### Download pre-generated LongPerceptualThoughts and Post-train using LLaMA-Factory

The following snippet will first download pre-generated long CoTs from huggingface and run SFT or DPO using LLaMA-Factory.

```bash
# Download DOCCI and the pre-generated CoTs
bash download_and_process_lpt_30k.sh
export DISABLE_VERSION_CHECK=1
export LLAMAFACTORY_DIR="LLaMA-Factory"
# The following training configs are for references. You may need to modify `model_name_or_path`, `template`, etc if needed.
llamafactory-cli train config/llama_factory_sft_train_config.yaml # SFT training
llamafactory-cli train config/llama_factory_dpo_train_config.yaml # DPO training
```

## 📚 Citation

If you find this repository helpful, please cite:

```bibtex
@misc{liao2025longperceptualthoughtsdistillingsystem2reasoning,
title={LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception},
author={Yuan-Hong Liao and Sven Elflein and Liu He and Laura Leal-Taixé and Yejin Choi and Sanja Fidler and David Acuna},
year={2025},
eprint={2504.15362},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2504.15362},
}
```

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/andrewliao11/longperceptualthoughts

Awesome Lists containing this project

README