{"id":28653887,"url":"https://github.com/tiger-ai-lab/visualwebinstruct","last_synced_at":"2025-06-13T07:08:01.791Z","repository":{"id":282162657,"uuid":"947669576","full_name":"TIGER-AI-Lab/VisualWebInstruct","owner":"TIGER-AI-Lab","description":"The official repo for \"VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search\"","archived":false,"fork":false,"pushed_at":"2025-05-04T05:25:38.000Z","size":8198,"stargazers_count":24,"open_issues_count":1,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-05-04T05:26:09.004Z","etag":null,"topics":["llm","vlm"],"latest_commit_sha":null,"homepage":"https://tiger-ai-lab.github.io/VisualWebInstruct/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/TIGER-AI-Lab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-03-13T03:59:36.000Z","updated_at":"2025-05-04T05:25:42.000Z","dependencies_parsed_at":"2025-05-04T05:34:41.011Z","dependency_job_id":null,"html_url":"https://github.com/TIGER-AI-Lab/VisualWebInstruct","commit_stats":null,"previous_names":["tiger-ai-lab/visualwebinstruct"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/TIGER-AI-Lab/VisualWebInstruct","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TIGER-AI-Lab%2FVisualWebInstruct","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TIGER-AI-Lab%2FVisualWebInstruct/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TIGER-AI-Lab%2FVisualWebI
nstruct/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TIGER-AI-Lab%2FVisualWebInstruct/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/TIGER-AI-Lab","download_url":"https://codeload.github.com/TIGER-AI-Lab/VisualWebInstruct/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TIGER-AI-Lab%2FVisualWebInstruct/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259599331,"owners_count":22882357,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["llm","vlm"],"created_at":"2025-06-13T07:08:00.611Z","updated_at":"2025-06-13T07:08:01.713Z","avatar_url":"https://github.com/TIGER-AI-Lab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# VisualWebInstruct\nThe official repo for [VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search](https://arxiv.org/abs/2503.10582).\n\n\u003ca target=\"_blank\" href=\"https://arxiv.org/abs/2503.10582\"\u003e\n\u003cimg style=\"height:22pt\" src=\"https://img.shields.io/badge/-Paper-black?style=flat\u0026logo=arxiv\"\u003e\n\u003c/a\u003e\n\n\u003ca target=\"_blank\" href=\"https://huggingface.co/datasets/TIGER-Lab/VisualWebInstruct\"\u003e\n\u003cimg style=\"height:22pt\" src=\"https://img.shields.io/badge/-🤗%20Dataset-red?style=flat\"\u003e\n\u003c/a\u003e\n\n\u003ca target=\"_blank\" href=\"https://huggingface.co/TIGER-Lab/MAmmoTH-VL2\"\u003e\n\u003cimg style=\"height:22pt\" 
src=\"https://img.shields.io/badge/-🤗%20Models-red?style=flat\"\u003e\n\u003c/a\u003e\n\n\u003ca target=\"_blank\" href=\"https://huggingface.co/spaces/TIGER-Lab/MAmmoTH-VL2\"\u003e\n\u003cimg style=\"height:22pt\" src=\"https://img.shields.io/badge/-🤗%20Demo-red?style=flat\"\u003e\n\u003c/a\u003e\n\n\u003ca target=\"_blank\" href=\"https://tiger-ai-lab.github.io/VisualWebInstruct/\"\u003e\n\u003cimg style=\"height:22pt\" src=\"https://img.shields.io/badge/-📝%20Website-red?style=flat\"\u003e\n\u003c/a\u003e\n\n\u003cbr\u003e\n\n\n## Overview\nWe use Google Search as a tool to augment multimodal reasoning datasets:\n\u003cimg width=\"800\" alt=\"abs\" src=\"teaser.jpg\"\u003e\n\n\n\n## Introduction\nVision-Language Models have made significant progress on many perception-focused tasks. However, their progress on reasoning-focused tasks seems to be limited by the lack of high-quality and diverse training data. In this work, we aim to address the scarcity of reasoning-focused multimodal datasets. We propose VisualWebInstruct, a novel approach that leverages a search engine to create a diverse, high-quality dataset spanning multiple disciplines like math, physics, finance, chemistry, etc. Starting with 30,000 meticulously selected seed images, we employ Google Image search to identify websites containing similar images. We collect and process the HTML from over 700K unique URL sources. Through a pipeline of content extraction, filtering and synthesis, we build a dataset of approximately 900K question-answer pairs, of which 40% are visual QA pairs and the rest are text QA pairs. Models fine-tuned on VisualWebInstruct demonstrate significant performance gains: (1) training from Llava-OV-mid shows 10-20% absolute point gains across benchmarks, (2) training from MAmmoTH-VL shows a 5% absolute gain. Our best model MAmmoTH-VL2 shows state-of-the-art performance within the 10B parameter class on MMMU-Pro-std (40.7%), MathVerse (42.6%), and DynaMath (55.7%). 
These remarkable results highlight the effectiveness of our dataset in enhancing VLMs' reasoning capabilities for complex multimodal tasks.\n\n## Repository Structure\n\nThe repository is organized into the following directories:\n\n### VisualWebInstruct\nContains the data processing pipeline used to create the dataset:\n\n- **Stage 1: Mining Data from the Internet**\n  - Google Image searching\n  - Accessibility tree building\n  - QA pair extraction\n  - Post-processing\n\n- **Stage 2: Dataset Refinement**\n  - Answer refinement with consistency checking\n  - Answer alignment with original web content\n\n### MAmmoTH-VL\nContains code for model training and evaluation. Since we finetune our model based on MAmmoTH-VL, we use the same codebase:\n\n- **train**: Scripts for finetuning MAmmoTH-VL on VisualWebInstruct\n\n- **evaluation**: Code for evaluating the model on various benchmarks\n\n## Dataset Statistics\n\nOur dataset exhibits the following distribution across knowledge domains:\n\n| Category | Percentage |\n|----------|------------|\n| Math | 62.50% |\n| Physics | 14.50% |\n| Finance | 7.25% |\n| Chemistry | 4.80% |\n| Engineering | 4.35% |\n| Others | 6.60% |\n\nThe \"Others\" category includes General Knowledge (2.45%), Computer Science (2.25%), Biology (1.40%), and humanities subjects.\n\n## Model Performance\n\nModels fine-tuned on VisualWebInstruct demonstrate significant performance gains:\n\n1. Training from Llava-OV-mid shows 10-20% absolute point gains across benchmarks\n2. 
Training from MAmmoTH-VL shows 5% absolute gain\n\nOur best model MAmmoTH-VL2 shows state-of-the-art performance within the 10B parameter class on:\n- MMMU-Pro-std (40.7%)\n- MathVerse (42.6%)\n- DynaMath (55.7%)\n\n\n## Dataset Access\n\nThe VisualWebInstruct dataset is available on [Hugging Face](https://huggingface.co/datasets/TIGER-Lab/VisualWebInstruct).\n\nTo download the data for finetuning, you can use\n```bash\n# Data Preparation\n\nexport DATA_DIR=xxx #set your data folder first\n\nhuggingface-cli download TIGER-Lab/VisualWebInstruct --repo-type dataset --revision main --local-dir $DATA_DIR\n\nunzip $DATA_DIR/images.zip -d $DATA_DIR/imgs\n```\nAfter unzipping, the folder structure in `$DATA_DIR/imgs` will be:\n\n```\n$DATA_DIR/imgs/\n├── CLEVR_v1.0\n├── ai2d\n├── chartqa\n├── coco\n├── data\n├── docvqa\n├── geoqa+\n├── gqa\n├── llava\n├── ocr_vqa\n├── pisc\n├── sam\n├── share_textvqa\n├── sqa\n├── textvqa\n├── vg\n├── visualwebinstruct \u003c-- This is the image folder of our dataset\n├── web-celebrity\n├── web-landmark\n└── wikiart\n```\n## Model Training\n\n\n### Environment Setup\n\n```bash\n# System configuration\nexport OMP_NUM_THREADS=8\nexport NCCL_IB_DISABLE=0\nexport NCCL_IB_GID_INDEX=3\nexport NCCL_SOCKET_IFNAME=eth0\nexport NCCL_DEBUG=INFO\n\n# Model configuration\nexport LLM_VERSION=\"Qwen/Qwen2.5-7B-Instruct\"\nexport LLM_VERSION_CLEAN=\"${LLM_VERSION//\\//_}\"\nexport VISION_MODEL_VERSION=\"google/siglip-so400m-patch14-384\"\nexport VISION_MODEL_VERSION_CLEAN=\"${VISION_MODEL_VERSION//\\//_}\"\nexport PROMPT_VERSION=\"qwen_2_5\"\n\n# Path configuration\nexport HF_HOME=\u003cyour_huggingface_cache_path\u003e\nexport IMAGE_FOLDER=\"$DATA_DIR/imgs\"\nexport OUTPUT_DIR=\u003cpath_to_output_directory\u003e\n\n# Wandb configuration\nexport WANDB_API_KEY=\u003cyour_wandb_api_key\u003e\n\n# Training configuration\nexport BASE_RUN_NAME=\u003cyour_run_name\u003e\nexport CKPT_PATH=$LLM_VERSION  # this could be the previous stage checkpoint like 
MammothVL\nexport NUM_GPUS=\u003cnumber_of_gpus\u003e\nexport NNODES=\u003cnumber_of_nodes\u003e\nexport RANK=\u003cnode_rank\u003e\nexport ADDR=\u003cmaster_address\u003e\nexport PORT=\u003cmaster_port\u003e\nexport CUDA_VISIBLE_DEVICES=\u003cgpu ids\u003e\n\n```\n\n### Login to Weights \u0026 Biases\n\n```bash\nwandb login --relogin $WANDB_API_KEY\n```\n\n### Run Training\n```bash\ncd train/LLaVA-NeXT\n```\n\nYou can run from commandline:\n```bash\nACCELERATE_CPU_AFFINITY=1 torchrun --nproc_per_node=\"${NUM_GPUS}\" --nnodes=\"${NNODES}\" --node_rank=\"${RANK}\" --master_addr=\"${ADDR}\" --master_port=\"${PORT}\" \\\n    llava/train/train_mem.py \\\n    --deepspeed scripts/zero3.json \\\n    --model_name_or_path ${CKPT_PATH} \\\n    --version ${PROMPT_VERSION} \\\n    --data_path scripts/train/mammoth_vl/visualwebinstruct.yaml \\\n    --image_folder ${IMAGE_FOLDER} \\\n    --video_folder \"\" \\\n    --mm_tunable_parts=\"mm_vision_tower,mm_mlp_adapter,mm_language_model\" \\\n    --mm_vision_tower_lr=2e-6 \\\n    --vision_tower ${VISION_MODEL_VERSION} \\\n    --mm_projector_type mlp2x_gelu \\\n    --mm_vision_select_layer -2 \\\n    --mm_use_im_start_end False \\\n    --mm_use_im_patch_token False \\\n    --group_by_modality_length True \\\n    --image_aspect_ratio anyres_max_4 \\\n    --image_grid_pinpoints  \"(1x1),...,(6x6)\" \\\n    --mm_patch_merge_type spatial_unpad \\\n    --bf16 True \\\n    --run_name $BASE_RUN_NAME \\\n    --output_dir ${OUTPUT_DIR} \\\n    --num_train_epochs 1 \\\n    --per_device_train_batch_size 1 \\\n    --per_device_eval_batch_size 4 \\\n    --gradient_accumulation_steps 2 \\\n    --evaluation_strategy \"no\" \\\n    --save_strategy \"steps\" \\\n    --save_steps 1000 \\\n    --save_total_limit 20 \\\n    --learning_rate 1e-5 \\\n    --weight_decay 0. 
\\\n    --warmup_ratio 0.03 \\\n    --lr_scheduler_type \"cosine\" \\\n    --logging_steps 1 \\\n    --tf32 True \\\n    --model_max_length 8192 \\\n    --gradient_checkpointing True \\\n    --dataloader_num_workers 4 \\\n    --lazy_preprocess True \\\n    --report_to wandb \\\n    --torch_compile True \\\n    --torch_compile_backend \"inductor\" \\\n    --dataloader_drop_last True \\\n    --frames_upbound 32\n```\nAlternatively, you can also run the training script:\n```bash\nbash scripts/train/mammoth_vl/finetune_visualwebinstruct.sh\n```\n### Dataset Configuration\n\nYou'll need to set up the VisualWebInstruct dataset YAML file. Create or modify the YAML file at `VisualWebInstruct/MAmmoTH-VL/train/LLaVA-NeXT/scripts/train/mammoth_vl/visualwebinstruct.yaml` with the following content:\n\n```yaml\ndatasets:\n  - json_path:  # Path to the jsonl file of visualwebinstruct\n    sampling_strategy: \"all\"\n```\n\nThis configuration uses the `$DATA_DIR` environment variable that you set in the environment setup section.\n\n### Notes\n\n- This script trains a multimodal model combining Qwen2.5-7B-Instruct with SigLIP vision model\n- The training uses DeepSpeed ZeRO-3 for optimization\n- Parameters like `NUM_GPUS`, `NNODES`, etc. 
should be set according to your environment\n- Replace placeholder values (indicated by `\u003c...\u003e`) with your actual configuration\n\n## Evaluation\n\n### Installation\n```bash\ncurl -LsSf https://astral.sh/uv/install.sh | sh\nuv venv eval\nuv venv --python 3.12\nsource eval/bin/activate\ncd MAmmoTH-VL/train/LLaVA-NeXT/\nuv pip install -e .\ncd -\ncd MAmmoTH-VL/eval/lmms-eval\nuv pip install -e .\ncd -\n```\n\n### Setup Environment\nEnter the evaluation folder.\n\n```bash\n# Required environment variables\nexport HF_TOKEN=\u003cyour_huggingface_token\u003e\nexport OPENAI_API_KEY=\u003cyour_openai_api_key\u003e\nexport MODEL_PATH=TIGER-Lab/MAmmoTH-VL2\nexport TASK_NAME=mmmu_pro_standard_cot\nexport OUTPUT_PATH=./log/\n\nexport VLLM_WORKER_MULTIPROC_METHOD=spawn\nexport NCCL_BLOCKING_WAIT=1\nexport NCCL_TIMEOUT=18000000\nexport NCCL_DEBUG=DEBUG\n```\n\nTo evaluate the model:\n```bash\nCUDA_VISIBLE_DEVICES=0 accelerate launch --num_processes=1 \\\n    -m lmms_eval \\\n    --model llava_onevision \\\n    --model_args pretrained=${MODEL_PATH},conv_template=qwen_2_5,model_name=llava_qwen \\\n    --tasks ${TASK_NAME} \\\n    --batch_size 1 \\\n    --log_samples \\\n    --log_samples_suffix ${TASK_NAME} \\\n    --output_path ${OUTPUT_PATH}\n```\n\n## Pretrained Models\n\nOur pretrained models are available on [Hugging Face](https://huggingface.co/TIGER-Lab/MAmmoTH-VL2).\n\n## Acknowledgements\n\nOur implementation builds upon the following codebases:\n- [MAmmoTH-VL](https://github.com/MAmmoTH-VL/MAmmoTH-VL)\n- [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT)\n- [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval)\n\nWe thank the authors of these repositories for their valuable contributions.\n\n## Citation\n```\n@article{visualwebinstruct,\n    title={VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search},\n    author = {Jia, Yiming and Li, Jiachen and Yue, Xiang and Li, Bo and Nie, Ping and Zou, Kai and Chen, Wenhu},\n    
journal={arXiv preprint arXiv:2503.10582},\n    year={2025}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftiger-ai-lab%2Fvisualwebinstruct","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftiger-ai-lab%2Fvisualwebinstruct","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftiger-ai-lab%2Fvisualwebinstruct/lists"}