{"id":30254577,"url":"https://github.com/wanghao9610/X-SAM","last_synced_at":"2025-08-15T14:04:37.516Z","repository":{"id":306212649,"uuid":"1020939066","full_name":"wanghao9610/X-SAM","owner":"wanghao9610","description":"X-SAM: From Segment Anything to Any Segmentation","archived":false,"fork":false,"pushed_at":"2025-08-14T03:11:31.000Z","size":103820,"stargazers_count":78,"open_issues_count":2,"forks_count":2,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-08-14T04:30:52.966Z","etag":null,"topics":["mllms","sam","segmentation"],"latest_commit_sha":null,"homepage":"https://wanghao9610.github.io/X-SAM/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/wanghao9610.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-07-16T16:13:58.000Z","updated_at":"2025-08-14T03:11:35.000Z","dependencies_parsed_at":"2025-07-24T11:27:14.499Z","dependency_job_id":"ea0d98aa-8fe2-4f54-b0ba-b11441350d19","html_url":"https://github.com/wanghao9610/X-SAM","commit_stats":null,"previous_names":["wanghao9610/x-sam"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/wanghao9610/X-SAM","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wanghao9610%2FX-SAM","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wanghao9610%2FX-SAM/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wanghao9610%2FX-SAM/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wanghao9610%2FX-SA
M/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/wanghao9610","download_url":"https://codeload.github.com/wanghao9610/X-SAM/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wanghao9610%2FX-SAM/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":270579613,"owners_count":24610044,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-15T02:00:12.559Z","response_time":110,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["mllms","sam","segmentation"],"created_at":"2025-08-15T14:02:16.476Z","updated_at":"2025-08-15T14:04:37.504Z","avatar_url":"https://github.com/wanghao9610.png","language":"Python","funding_links":[],"categories":["Paper List","Python"],"sub_categories":["Follow-up Papers"],"readme":"\u003cdiv align=\"center\"\u003e\n\u003ch1\u003e✨X-SAM \u003c/h1\u003e\n\u003ch3\u003eFrom Segment Anything to Any Segmentation\u003c/h3\u003e\n\n[Hao Wang](https://github.com/wanghao9610)\u003csup\u003e1,2\u003c/sup\u003e,[Limeng Qiao](https://scholar.google.com/citations?user=3PFZAg0AAAAJ\u0026hl=en)\u003csup\u003e3\u003c/sup\u003e,[Zequn Jie](https://scholar.google.com/citations?user=4sKGNB0AAAAJ\u0026hl)\u003csup\u003e3\u003c/sup\u003e, [Zhijian Huang](https://zhijian11.github.io/)\u003csup\u003e1\u003c/sup\u003e, [Chengjian 
Feng](https://fcjian.github.io/)\u003csup\u003e3\u003c/sup\u003e, \n\n[Qingfang Zheng](https://openreview.net/profile?id=%7EZheng_Qingfang1)\u003csup\u003e2\u003c/sup\u003e, [Lin Ma](https://forestlinma.com/)\u003csup\u003e3\u003c/sup\u003e, [Xiangyuan Lan](https://scholar.google.com/citations?user=c3iwWRcAAAAJ\u0026hl)\u003csup\u003e2\u003c/sup\u003e\u003csup\u003e:email:\u003c/sup\u003e, [Xiaodan Liang](https://scholar.google.com/citations?user=voxznZAAAAAJ\u0026hl)\u003csup\u003e1\u003c/sup\u003e\u003csup\u003e:email:\u003c/sup\u003e\n\n\u003csup\u003e1\u003c/sup\u003e Sun Yat-sen University, \u003csup\u003e2\u003c/sup\u003e Peng Cheng Laboratory, \u003csup\u003e3\u003c/sup\u003e Meituan Inc.\n\n\u003csup\u003e:email:\u003c/sup\u003e Corresponding author\n\u003c/div\u003e\n\n\u003cdiv align=\"center\" style=\"display: flex; justify-content: center; align-items: center;\"\u003e\n  \u003ca href=\"https://arxiv.org/abs/2508.04655\" style=\"margin: 0 2px;\"\u003e\n    \u003cimg src='https://img.shields.io/badge/arXiv-2508.04655-red?style=flat\u0026logo=arXiv\u0026logoColor=red' alt='arxiv'\u003e\n  \u003c/a\u003e\n  \u003ca href='https://huggingface.co/hao9610/X-SAM' style=\"margin: 0 2px;\"\u003e\n    \u003cimg src='https://img.shields.io/badge/HuggingFace-ckpts-orange?style=flat\u0026logo=HuggingFace\u0026logoColor=orange' alt='huggingface'\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://github.com/wanghao9610/X-SAM\" style=\"margin: 0 2px;\"\u003e\n    \u003cimg src='https://img.shields.io/badge/GitHub-Repo-blue?style=flat\u0026logo=GitHub' alt='GitHub'\u003e\n  \u003c/a\u003e\n  \u003ca href=\"http://47.115.200.157:7861\" style=\"margin: 0 2px;\"\u003e\n    \u003cimg src='https://img.shields.io/badge/Demo-Gradio-gold?style=flat\u0026logo=Gradio\u0026logoColor=red' alt='Demo'\u003e\n  \u003c/a\u003e\n  \u003ca href=\"http://121.43.252.12:7862\" style=\"margin: 0 2px;\"\u003e\n    \u003cimg 
src='https://img.shields.io/badge/Demo-Gradio-gold?style=flat\u0026logo=Gradio\u0026logoColor=red' alt='Demo'\u003e\n  \u003c/a\u003e\n  \u003ca href='https://wanghao9610.github.io/X-SAM/' style=\"margin: 0 2px;\"\u003e\n    \u003cimg src='https://img.shields.io/badge/🌐_Project-Webpage-green?style=flat\u0026logoColor=white' alt='webpage'\u003e\n  \u003c/a\u003e\n\u003c/div\u003e\n\n## :eyes: Notice\n\nX-SAM is under active development, and we will continue to update the code and documentation.\n\nWe recommend that everyone use English to communicate in issues, as this helps developers from around the world discuss, share experiences, and answer questions together.\n\n*If you have any questions or would like to collaborate, please feel free to open an issue or reach out to me at `wanghao9610@gmail.com`.*\n\n## :boom: Updates\n\n- **`2025-08-11`**: Thanks for your great attention to our work! We have deployed another [Online Demo2](http://121.43.252.12:7862). You can also try it if [Online Demo1](http://47.115.200.157:7861) is not available.\n- **`2025-08-11`**: We released the effective code for [Evaluation on All Segmentation Benchmarks](#evaluate-on-all-segmentation-benchmarks). 
We have updated all code except for [Training X-SAM](#stage-3-mixed-fine-tuning).\n- **`2025-08-10`**: We released the detailed instructions for [Demo Deployment](#computer-demo).\n- **`2025-08-09`**: We released the code for [Training LLaVA-based MLLMs](#llava).\n- **`2025-08-08`**: We released the simple code for [Evaluation on All VLM Benchmarks](#evaluate-on-all-vlm-benchmarks).\n- **`2025-08-06`**: We are excited to publish the [Technical Report](https://arxiv.org/pdf/2508.04655), please check it out for more technical details.\n- **`2025-08-05`**: We provided the [Model Weights](https://huggingface.co/hao9610/X-SAM) on the HuggingFace🤗.\n- **`2025-07-26`**: We deployed the [Online Demo](http://47.115.200.157:7861), you can try it now!\n\n## :rocket: Introduction\nThis repository provides the official PyTorch implementation, pre-trained models, training, evaluation, visualization, and demo code of X-SAM:\n\n* X-SAM introduces a unified multimodal large language model (MLLM) framework, extending the segmentation paradigm from *segment anything* to *any segmentation*, thereby enhancing pixel-level perceptual understanding.\n\n* X-SAM proposes a novel Visual GrounDed (VGD) segmentation task, which segments all instance objects using interactive visual prompts, empowering the model with visually grounded, pixel-wise interpretative capabilities.\n\n* X-SAM presents a unified training strategy that enables co-training across multiple datasets. Experimental results demonstrate that X-SAM achieves state-of-the-art performance on various image segmentation benchmarks, highlighting its efficiency in multimodal, pixel-level visual understanding.\n\n:sparkles: **HIGHLIGHT**: This repository provides unified and effective code for training, evaluation, and visualization of segmentation MLLMs, including LLaVA-based MLLMs. 
We hope this repository will promote further research on MLLMs.\n\n## :bookmark: Abstract\n\nLarge Language Models (LLMs) demonstrate strong capabilities in broad knowledge representation, yet they are inherently deficient in pixel-level perceptual understanding. Although the Segment Anything Model (SAM) represents a significant advancement in visual-prompt-driven image segmentation, it exhibits notable limitations in multi-mask prediction and category-specific segmentation tasks, and it cannot integrate all segmentation tasks within a unified model architecture. To address these limitations, we present X-SAM, a streamlined Multimodal Large Language Model (MLLM) framework that extends the segmentation paradigm from *segment anything* to *any segmentation*. Specifically, we introduce a novel unified framework that enables more advanced pixel-level perceptual comprehension for MLLMs. Furthermore, we propose a new segmentation task, termed Visual GrounDed (VGD) segmentation, which segments all instance objects with interactive visual prompts and empowers MLLMs with visual grounded, pixel-wise interpretative capabilities. To enable effective training on diverse data sources, we present a unified training strategy that supports co-training across multiple datasets. Experimental results demonstrate that X-SAM achieves state-of-the-art performance on a wide range of image segmentation benchmarks, highlighting its efficiency for multimodal, pixel-level visual understanding.\n\n## :mag: Overview\n\n\u003cimg src=\"docs/images/xsam_framework.png\" width=\"800\"\u003e\n\n## :bar_chart: Benchmarks\n\nPlease refer to the [Benchmark Results](docs/benchmark_results.md) for more details.\n\n## :checkered_flag: Getting Started\n### 1. Structure\nWe provide a detailed project structure for X-SAM. 
Please follow this structure to organize the project.\n\n\u003cdetails open\u003e\n\u003csummary\u003e📁 Structure (Click to collapse)\u003c/summary\u003e\n\n```bash\nX-SAM\n├── datas\n│   ├── gcg_seg_data\n│   ├── gen_seg_data\n│   ├── img_conv_data\n│   ├── inter_seg_data\n│   ├── LMUData\n│   ├── ov_seg_data\n│   ├── rea_seg_data\n│   ├── ref_seg_data\n│   └── vgd_seg_data\n├── inits\n│   ├── huggingface\n│   ├── mask2former-swin-large-coco-panoptic\n│   ├── Phi-3-mini-4k-instruct\n│   ├── sam-vit-large\n│   └── X-SAM\n├── xsam\n│   ├── docs\n│   ├── requirements\n│   ├── xsam\n│   │   ├── configs\n│   │   ├── dataset\n│   │   ├── demo\n│   │   ├── engine\n│   │   ├── evaluation\n│   │   ├── model\n│   │   ├── structures\n│   │   ├── tools\n│   │   └── utils\n├── wkdrs\n│   ├── s1_seg_finetune\n│   │   ├── ...\n│   ├── s2_align_pretrain\n│   │   ├── ...\n│   ├── s2_mixed_finetune\n│   │   ├── ...\n│   ├── ...\n...\n```\n\u003c/details\u003e\n\n### 2. Installation\nWe provide a detailed installation guide for creating an environment for X-SAM; please follow the steps below.\n\n\u003cdetails open\u003e\n\u003csummary\u003e⚙️ Installation (Click to collapse)\u003c/summary\u003e\n\n```bash\n# clone X-SAM\ngit clone --depth=1 https://github.com/wanghao9610/X-SAM.git\n\n# set root_dir\ncd X-SAM\nexport root_dir=$(realpath ./)\ncd $root_dir/xsam\n\n# set CUDA_HOME for cuda12.4(optional).\n# X-SAM uses cuda12.4 by default; if your cuda is not cuda12.4, you first need to export the CUDA_HOME env manually.\nexport CUDA_HOME=\"your_cuda12.4_path\"\nexport PATH=$CUDA_HOME/bin:$PATH\nexport LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH\necho -e \"cuda version:\\n$(nvcc -V)\"\n\n# create conda env for X-SAM\nconda create -n xsam python=3.10 -y\nconda activate xsam\nconda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 pytorch-cuda=12.4 -c pytorch -c nvidia\n# install gcc11(optional)\nconda install gcc=11 gxx=11 -c conda-forge -y\n# install xtuner0.2.0\npip 
install git+https://github.com/InternLM/xtuner.git@v0.2.0\n# or install xtuner0.2.0 from source code\n# git clone -b v0.2.0 https://github.com/InternLM/xtuner.git\n# cd xtuner\n# pip install '.[all]'\n# install deepspeed\npip install -r requirements/deepspeed.txt\n# install xsam requirements\npip install -r requirements/xsam.txt\n# install flash-attention\npip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl\n\n# install VLMEvalKit for evaluation on VLM benchmarks(optional)\ncd $root_dir\ngit clone -b v0.3rc1 https://github.com/open-compass/VLMEvalKit.git\ncd VLMEvalKit\npip install -e .\n\n# install aria2 for downloading datasets and models(optional)\npip install aria2\n```\n\n\u003c/details\u003e\n\n### 3. Preparing\nThere are many datasets and models to prepare, please refer to [Dataset Preparing](docs/dataset_preparing.md) and [Model Preparing](docs/model_preparing.md) for more details.\n\n### 4. 
Training \u0026 Evaluation\n:sparkles: **One Script for All !**\n```bash\ncd $root_dir\nbash runs/run.sh --modes MODES --config CONFIG_FILE --work-dir WORK_DIR --suffix WORK_DIR_SUFFIX\n# MODES: train, segeval, vlmeval, visualize, demo\n# bash runs/run.sh -h # echo help.\n# Read the runs/run.sh for more details.\n```\nPrepare the [Datasets](docs/dataset_preparing.md) and [Models](docs/model_preparing.md), and then refer to the following commands to start training and evaluation.\n\n\n#### X-SAM\n\n\u003cdetails open\u003e\n\u003csummary\u003e🔥 Training (Click to collapse)\u003c/summary\u003e\n\n##### Stage 1: Segmentor Fine-tuning\n```bash\ncd $root_dir\nbash runs/run.sh --modes train --config xsam/configs/xsam/phi3_mini_4k_instruct_siglip2_so400m_p14_384/s1_seg_finetune/xsam_sam_large_m2f_e36_gpu16_seg_finetune.py\n```\n\n##### Stage 2: Alignment Pre-training\n```bash\ncd $root_dir\nbash runs/run.sh --modes train --config xsam/configs/xsam/phi3_mini_4k_instruct_siglip2_so400m_p14_384/s2_align_pretrain/xsam_phi3_mini_4k_instruct_siglip2_so400m_p14_384_sam_large_e1_gpu16_align_pretrain.py\n```\n\n##### Stage 3: Mixed Fine-tuning\n```bash\n# 🫣Coming soon...\n\n# ‼️NOTE: Training for Mixed Fine-tuning will be available with more than 500 🌟.\n```\n\u003c/details\u003e\n\n\u003cdetails open\u003e\n\u003csummary\u003e🧪 Evaluation (Click to collapse)\u003c/summary\u003e\n\n##### Evaluate on all segmentation benchmarks\n```bash\ncd $root_dir\n# Evaluate on all segmentation benchmarks.\n# NOTE: ONLY generic segmentation and VGD segmentation are supported NOW.\nbash runs/run.sh --modes segeval --config xsam/configs/xsam/phi3_mini_4k_instruct_siglip2_so400m_p14_384/s3_mixed_finetune/xsam_phi3_mini_4k_instruct_siglip2_so400m_p14_384_sam_large_m2f_gpu16_mixed_finetune.py --work-dir $root_dir/inits/X-SAM/s3_mixed_finetune/xsam_phi3_mini_4k_instruct_siglip2_so400m_p14_384_sam_large_m2f_gpu16_mixed_finetune\n```\n\n##### Evaluate on all VLM benchmarks\n```bash\ncd $root_dir\n# 
Evaluate on all VLM benchmarks.\nbash runs/run.sh --modes vlmeval --config xsam/configs/xsam/phi3_mini_4k_instruct_siglip2_so400m_p14_384/s3_mixed_finetune/xsam_phi3_mini_4k_instruct_siglip2_so400m_p14_384_sam_large_m2f_gpu16_mixed_finetune.py --work-dir $root_dir/inits/X-SAM/s3_mixed_finetune/xsam_phi3_mini_4k_instruct_siglip2_so400m_p14_384_sam_large_m2f_gpu16_mixed_finetune\n```\n\n\u003c/details\u003e\n\n#### LLaVA\n\n\u003cdetails\u003e\n\u003csummary\u003e🔥 Training (Click to expand)\u003c/summary\u003e\n\n##### Stage 1: Alignment Pre-training\n```bash\ncd $root_dir\nbash runs/run.sh --modes train --config xsam/configs/llava/phi3_mini_4k_instruct_siglip2_so400m_p14_384/s1_pretrain/llava_phi3_mini_4k_instruct_siglip2_so400m_p14_384_e1_gpu16_pretrain.py\n```\n\n##### Stage 2: Instruction Fine-tuning\n```bash\ncd $root_dir\nbash runs/run.sh --modes train --config xsam/configs/llava/phi3_mini_4k_instruct_siglip2_so400m_p14_384/s2_finetune/llava_phi3_mini_4k_instruct_siglip2_so400m_p14_384_e1_gpu16_finetune.py\n```\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e🧪 Evaluation (Click to expand)\u003c/summary\u003e\n\n##### Evaluate on all VLM benchmarks\n```bash\ncd $root_dir\nbash runs/run.sh --modes vlmeval --config xsam/configs/llava/phi3_mini_4k_instruct_siglip2_so400m_p14_384/s2_finetune/llava_phi3_mini_4k_instruct_siglip2_so400m_p14_384_e1_gpu16_finetune.py\n```\n\u003c/details\u003e\n\n## :computer: Demo\nWe provide detailed instructions for demo deployment, and a demo video is shown below.\n\n\u003cdetails open\u003e\n\u003csummary\u003e🛠️ Deployment (Click to collapse)\u003c/summary\u003e\n\n```bash\ncd $root_dir\nbash runs/run.sh --modes demo --config xsam/configs/xsam/phi3_mini_4k_instruct_siglip2_so400m_p14_384/s3_mixed_finetune/xsam_phi3_mini_4k_instruct_siglip2_so400m_p14_384_sam_large_m2f_gpu16_mixed_finetune.py --work-dir 
$root_dir/inits/X-SAM/s3_mixed_finetune/xsam_phi3_mini_4k_instruct_siglip2_so400m_p14_384_sam_large_m2f_gpu16_mixed_finetune\n```\n\n\u003c/details\u003e\n\n\u003cdetails open\u003e\n\u003csummary\u003e🎥 Video (Click to collapse)\u003c/summary\u003e\n\u003cvideo src=\"https://github.com/user-attachments/assets/1a21cf21-c0bb-42cd-91c8-290324b68618\"\n  controls\n  muted\n  style=\"max-width:100%;\"\u003e\u003c/video\u003e\n\u003c/details\u003e\n\n## :white_check_mark: TODO\n- [x] Release the [Online Demo](http://47.115.200.157:7861).\n- [x] Release the [Model Weights](https://huggingface.co/hao9610/X-SAM).\n- [x] Release the [Technical Report](https://arxiv.org/abs/2508.04655).\n- [x] Release the code for [Training LLaVA-based MLLMs](#llava).\n- [x] Release the code for [Evaluation on All VLM Benchmarks](#evaluate-on-all-vlm-benchmarks).\n- [x] Release the code for [Demo Deployment](#computer-demo).\n- [x] Release the code for [Evaluation on All Segmentation Benchmarks](#evaluate-on-all-segmentation-benchmarks).\n- [ ] Release the code for [Training X-SAM](#stage-3-mixed-fine-tuning) (more than 500 🌟).\n\n## :blush: Acknowledge\nThis project has referenced some excellent open-sourced repos ([xtuner](https://github.com/InternLM/xtuner), [VLMEvalKit](https://github.com/open-compass/VLMEvalKit), [Sa2VA](https://github.com/magic-research/Sa2VA)). 
Thanks for their wonderful work and contributions to the community.\n\n## :pushpin: Citation\nIf you find X-SAM helpful for your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry.\n\n```bibtex\n@article{wang2025xsam,\n  title={X-SAM: From Segment Anything to Any Segmentation},\n  author={Wang, Hao and Qiao, Limeng and Jie, Zequn and Huang, Zhijian and Feng, Chengjian and Zheng, Qingfang and Ma, Lin and Lan, Xiangyuan and Liang, Xiaodan},\n  journal={arXiv preprint arXiv:2508.04655},\n  year={2025}\n}\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwanghao9610%2FX-SAM","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwanghao9610%2FX-SAM","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwanghao9610%2FX-SAM/lists"}