{"id":13562898,"url":"https://github.com/dvlab-research/LISA","last_synced_at":"2025-04-03T19:31:48.391Z","repository":{"id":185486724,"uuid":"673431525","full_name":"dvlab-research/LISA","owner":"dvlab-research","description":"Project Page for \"LISA: Reasoning Segmentation via Large Language Model\"","archived":false,"fork":false,"pushed_at":"2024-07-02T08:27:03.000Z","size":29141,"stargazers_count":1826,"open_issues_count":73,"forks_count":128,"subscribers_count":11,"default_branch":"main","last_synced_at":"2024-10-29T10:08:38.798Z","etag":null,"topics":["large-language-model","llm","multi-modal","segmentation"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dvlab-research.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-08-01T15:53:33.000Z","updated_at":"2024-10-29T04:01:28.000Z","dependencies_parsed_at":null,"dependency_job_id":"5c918e1d-1bab-41dd-820a-bdcc2f8fcf85","html_url":"https://github.com/dvlab-research/LISA","commit_stats":{"total_commits":127,"total_committers":12,"mean_commits":"10.583333333333334","dds":0.6220472440944882,"last_synced_commit":"dbe026abfdf2e15457d1f6849a21bd3c32b690eb"},"previous_names":["dvlab-research/lisa"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dvlab-research%2FLISA","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dvlab-research%2FLISA/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposi
tories/dvlab-research%2FLISA/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dvlab-research%2FLISA/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dvlab-research","download_url":"https://codeload.github.com/dvlab-research/LISA/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246790026,"owners_count":20834411,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["large-language-model","llm","multi-modal","segmentation"],"created_at":"2024-08-01T13:01:13.241Z","updated_at":"2025-04-03T19:31:48.382Z","avatar_url":"https://github.com/dvlab-research.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"[![Gradio](https://img.shields.io/badge/Gradio-Online%20Demo-blue)](http://103.170.5.190:7860/)\n[![Open in OpenXLab](https://cdn-static.openxlab.org.cn/app-center/openxlab_app.svg)](https://openxlab.org.cn/apps/detail/openxlab-app/LISA)\n\n# LISA: Reasoning Segmentation via Large Language Model\n\n\u003cfont size=7\u003e\u003cdiv align='center'\u003e\u003cb\u003eLISA\u003c/b\u003e: Large \u003cb\u003eL\u003c/b\u003eanguage \u003cb\u003eI\u003c/b\u003enstructed \u003cb\u003eS\u003c/b\u003eegmentation \u003cb\u003eA\u003c/b\u003essistant\u003c/div\u003e\u003c/font\u003e\n\n\u003cfont size=7\u003e\u003cdiv align='center'\u003e\n    \u003ca href=\"https://arxiv.org/pdf/2308.00692.pdf\"\u003e\u003cstrong\u003ePaper\u003c/strong\u003e\u003c/a\u003e | \n    \u003ca 
href=\"https://huggingface.co/xinlai\"\u003e\u003cstrong\u003eModels\u003c/strong\u003e\u003c/a\u003e | \n    \u003ca href=\"#training\"\u003e\u003cstrong\u003eTraining\u003c/strong\u003e\u003c/a\u003e | \n    \u003ca href=\"#inference\"\u003e\u003cstrong\u003eInference\u003c/strong\u003e\u003c/a\u003e | \n    \u003ca href=\"#deployment\"\u003e\u003cstrong\u003eLocal Deployment\u003c/strong\u003e\u003c/a\u003e | \n    \u003ca href=\"#dataset\"\u003e\u003cstrong\u003eDataset\u003c/strong\u003e\u003c/a\u003e | \n    \u003ca href=\"\"\u003e\u003cstrong\u003eOnline Demo\u003c/strong\u003e\u003c/a\u003e | \n    \u003ca href=\"https://huggingface.co/collections/Senqiao/lisa-67713837a32d6abf516a162e\"\u003e\u003cstrong\u003eLISA++ Dataset\u003c/strong\u003e\u003c/a\u003e | \n    \u003ca href=\"https://huggingface.co/Senqiao/LISA_Plus_7b\"\u003e\u003cstrong\u003eLISA++ Model\u003c/strong\u003e\u003c/a\u003e\n\u003c/div\u003e\u003c/font\u003e\n\n\u003c!-- \u003cp align=\"center\"\u003e \u003cimg src=\"imgs/teaser.jpg\" width=\"100%\"\u003e \u003c/p\u003e --\u003e\n\n\u003ctable class=\"center\"\u003e\n\u003ctr\u003e\n  \u003ctd style=\"text-align:center;\"\u003e\u003cb\u003eInput\u003c/b\u003e\u003c/td\u003e\n  \u003ctd style=\"text-align:center;\"\u003e\u003cb\u003eOutput\u003c/b\u003e\u003c/td\u003e\n  \u003ctd style=\"text-align:center;\"\u003e\u003cb\u003eInput\u003c/b\u003e\u003c/td\u003e\n  \u003ctd style=\"text-align:center;\"\u003e\u003cb\u003eOutput\u003c/b\u003e\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n  \u003ctd\u003e\u003cimg src=\"imgs/obama.jpg\"\u003e\u003c/td\u003e\n  \u003ctd\u003e\u003cimg src=\"vis_output/obama.jpg\"\u003e\u003c/td\u003e\n  \u003ctd\u003e\u003cimg src=\"imgs/trump.jpg\"\u003e\u003c/td\u003e              \n  \u003ctd\u003e\u003cimg src=\"vis_output/trump.jpg\"\u003e\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n  \u003ctd width=25% style=\"text-align:center;color:gray;\"\u003e\"Who was \u003cins\u003ethe president of the 
US\u003c/ins\u003e in this image? Please output segmentation mask and explain the reason.\"\u003c/td\u003e\n  \u003ctd width=25% style=\"text-align:center;\"\u003e\"Sure, the segmentation result is [SEG]. The President of the United States in the image is \u003cins\u003ePresident Obama.\u003c/ins\u003e\"\u003c/td\u003e\n  \u003ctd width=25% style=\"text-align:center;color:gray\"\u003e\"Who was \u003cins\u003ethe president of the US\u003c/ins\u003e in this image? Please output segmentation mask and explain why.\"\u003c/td\u003e\n  \u003ctd width=25% style=\"text-align:center;\"\u003e\"Sure, [SEG]. In the image, the President of the United States is \u003cins\u003ePresident Trump.\u003c/ins\u003e\"\u003c/td\u003e\n\u003c/tr\u003e\n\n\u003ctr\u003e\n  \u003ctd\u003e\u003cimg src=\"imgs/stand_higher.jpg\"\u003e\u003c/td\u003e\n  \u003ctd\u003e\u003cimg src=\"vis_output/stand_higher.jpg\"\u003e\u003c/td\u003e\n  \u003ctd\u003e\u003cimg src=\"imgs/camera_lens.jpg\"\u003e\u003c/td\u003e              \n  \u003ctd\u003e\u003cimg src=\"vis_output/camera_lens.jpg\"\u003e\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n  \u003ctd width=25% style=\"text-align:center;color:gray;\"\u003e\"What can \u003cins\u003emake the woman stand higher\u003c/ins\u003e? Please output segmentation mask and explain why.\"\u003c/td\u003e\n  \u003ctd width=25% style=\"text-align:center;\"\u003e\"Sure, [SEG]. 
The woman is \u003cins\u003estanding higher by using a ladder\u003c/ins\u003e...\"\u003c/td\u003e\n  \u003ctd width=25% style=\"text-align:center;color:gray\"\u003e\"Can you segment \u003cins\u003ethe camera lens that is more suitable for photographing nearby objects\u003c/ins\u003e in this image?\"\u003c/td\u003e\n  \u003ctd width=25% style=\"text-align:center;\"\u003e\"Sure, [SEG].\"\u003c/td\u003e\n\u003c/tr\u003e\n\n\u003ctr\u003e\n  \u003ctd\u003e\u003cimg src=\"imgs/dog_with_horn.jpg\"\u003e\u003c/td\u003e\n  \u003ctd\u003e\u003cimg src=\"vis_output/dog_with_horn.jpg\"\u003e\u003c/td\u003e\n  \u003ctd\u003e\u003cimg src=\"imgs/wash_hands.jpg\"\u003e\u003c/td\u003e              \n  \u003ctd\u003e\u003cimg src=\"vis_output/wash_hands.jpg\"\u003e\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n  \u003ctd width=25% style=\"text-align:center;color:gray;\"\u003e\"Can you segment \u003cins\u003ethe unusual part\u003c/ins\u003e in this image and explain why.\"\u003c/td\u003e\n  \u003ctd width=25% style=\"text-align:center;\"\u003e\"Sure, it is [SEG]. In the image, the unusual part is \u003cins\u003ethe dog wearing a reindeer antler headband\u003c/ins\u003e...\"\u003c/td\u003e\n  \u003ctd width=25% style=\"text-align:center;color:gray\"\u003e\"Where to \u003cins\u003ewash hands\u003c/ins\u003e in this image? 
Please output segmentation mask.\"\u003c/td\u003e\n  \u003ctd width=25% style=\"text-align:center;\"\u003e\"Sure, [SEG].\"\u003c/td\u003e\n\u003c/tr\u003e\n\n\u003ctr\u003e\n  \u003ctd\u003e\u003cimg src=\"imgs/jackma.jpg\"\u003e\u003c/td\u003e\n  \u003ctd\u003e\u003cimg src=\"vis_output/jackma.jpg\"\u003e\u003c/td\u003e\n  \u003ctd\u003e\u003cimg src=\"imgs/blackpink.jpg\"\u003e\u003c/td\u003e              \n  \u003ctd\u003e\u003cimg src=\"vis_output/blackpink.jpg\"\u003e\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n  \u003ctd width=25% style=\"text-align:center;color:gray;\"\u003e\"Can you segment \u003cins\u003ethe founder of Alibaba\u003c/ins\u003e in this image and explain why?\"\u003c/td\u003e\n  \u003ctd width=25% style=\"text-align:center;\"\u003e\"Sure, it is [SEG]. In the image, the man wearing a red tie and sitting on one of the chairs is \u003cins\u003eJack Ma\u003c/ins\u003e, the co-founder of Alibaba Group...\"\u003c/td\u003e\n  \u003ctd width=25% style=\"text-align:center;color:gray\"\u003e\"Please segment \u003cins\u003eLisa\u003c/ins\u003e in this figure.\"\u003c/td\u003e\n  \u003ctd width=25% style=\"text-align:center;\"\u003e\"Sure, [SEG].\"\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/table\u003e\n\n\u003cp align=\"center\"\u003e \u003cimg src=\"imgs/fig_overview.jpg\" width=\"100%\"\u003e \u003c/p\u003e\n\n## News\n- [x] [2024.12.30] We released the [LISA++](https://arxiv.org/abs/2312.17240) model and datasets, available [here](https://huggingface.co/collections/Senqiao/lisa-67713837a32d6abf516a162e). Our findings show that incorporating Visual COT data can further enhance the model’s global understanding. 
We will update the paper soon, stay tuned!\n- [x] [2024.6.21] LISA is selected as Oral Presentation in CVPR 2024!\n- [x] [2023.8.30] Release three new models [LISA-7B-v1](https://huggingface.co/xinlai/LISA-7B-v1), [LISA-7B-v1-explanatory](https://huggingface.co/xinlai/LISA-7B-v1-explanatory), and [LISA-13B-llama2-v1-explanatory](https://huggingface.co/xinlai/LISA-13B-llama2-v1-explanatory). Welcome to check them out!\n- [x] [2023.8.23] Refactor code, and release new model [LISA-13B-llama2-v1](https://huggingface.co/xinlai/LISA-13B-llama2-v1). Welcome to check it out!\n- [x] [2023.8.9] Training code is released!\n- [x] [2023.8.4] [Online Demo](http://103.170.5.190:7860/) is released! \n- [x] [2023.8.4] [*ReasonSeg* Dataset](https://drive.google.com/drive/folders/125mewyg5Ao6tZ3ZdJ-1-E3n04LGVELqy?usp=sharing) and the [LISA-13B-llama2-v0-explanatory](https://huggingface.co/xinlai/LISA-13B-llama2-v0-explanatory) model are released! \n- [x] [2023.8.3] Inference code and the [LISA-13B-llama2-v0](https://huggingface.co/xinlai/LISA-13B-llama2-v0) model are released. 
Welcome to check them out!\n- [x] [2023.8.2] [Paper](https://arxiv.org/pdf/2308.00692.pdf) is released and GitHub repo is created.\n\n**LISA: Reasoning Segmentation via Large Language Model [[Paper](https://arxiv.org/abs/2308.00692)]** \u003cbr /\u003e\n[Xin Lai](https://scholar.google.com/citations?user=tqNDPA4AAAAJ\u0026hl=zh-CN),\n[Zhuotao Tian](https://scholar.google.com/citations?user=mEjhz-IAAAAJ\u0026hl=en),\n[Yukang Chen](https://scholar.google.com/citations?user=6p0ygKUAAAAJ\u0026hl=en),\n[Yanwei Li](https://scholar.google.com/citations?user=I-UCPPcAAAAJ\u0026hl=zh-CN),\n[Yuhui Yuan](https://scholar.google.com/citations?user=PzyvzksAAAAJ\u0026hl=en),\n[Shu Liu](https://scholar.google.com.hk/citations?user=BUEDUFkAAAAJ\u0026hl=zh-CN),\n[Jiaya Jia](https://scholar.google.com/citations?user=XPAkzTEAAAAJ\u0026hl=en)\u003cbr /\u003e\n\n**LISA++: An Improved Baseline for Reasoning Segmentation with Large Language Model [[Paper](https://arxiv.org/abs/2312.17240)]** \u003cbr /\u003e\n[Senqiao Yang](https://scholar.google.com/citations?user=NcJc-RwAAAAJ),\nTianyuan Qu,\n[Xin Lai](https://scholar.google.com/citations?user=tqNDPA4AAAAJ\u0026hl=zh-CN),\n[Zhuotao Tian](https://scholar.google.com/citations?user=mEjhz-IAAAAJ\u0026hl=en),\n[Bohao Peng](https://scholar.google.com.hk/citations?user=9xcCm1oAAAAJ),\n[Shu Liu](https://scholar.google.com.hk/citations?user=BUEDUFkAAAAJ\u0026hl=zh-CN),\n[Jiaya Jia](https://scholar.google.com/citations?user=XPAkzTEAAAAJ\u0026hl=en)\u003cbr /\u003e\n\n## Abstract\nIn this work, we propose a new segmentation task --- ***reasoning segmentation***. The task is designed to output a segmentation mask given a complex and implicit query text. We establish a benchmark comprising over one thousand image-instruction pairs, incorporating intricate reasoning and world knowledge for evaluation purposes. 
Finally, we present LISA: Large-language Instructed Segmentation Assistant, which inherits the language generation capabilities of the multi-modal Large Language Model (LLM) while also possessing the ability to produce segmentation masks.\nFor more details, please refer to the [paper](https://arxiv.org/abs/2308.00692).\n\n## Highlights\n**LISA** unlocks the new segmentation capabilities of multi-modal LLMs, and can handle cases involving: \n1. complex reasoning; \n2. world knowledge; \n3. explanatory answers; \n4. multi-turn conversation. \n\n**LISA** also demonstrates robust zero-shot capability when trained exclusively on reasoning-free datasets. In addition, fine-tuning the model with merely 239 reasoning segmentation image-instruction pairs results in further performance enhancement.\n\n## Experimental results\n\u003cp align=\"center\"\u003e \u003cimg src=\"imgs/table1.jpg\" width=\"80%\"\u003e \u003c/p\u003e\n\n## Installation\n```\npip install -r requirements.txt\npip install flash-attn --no-build-isolation\n```\n\n## Training\n### Training Data Preparation\nThe training data consists of 4 types of data:\n\n1. Semantic segmentation datasets: [ADE20K](http://data.csail.mit.edu/places/ADEchallenge/ADEChallengeData2016.zip), [COCO-Stuff](http://calvin.inf.ed.ac.uk/wp-content/uploads/data/cocostuffdataset/stuffthingmaps_trainval2017.zip), [Mapillary](https://www.mapillary.com/dataset/vistas), [PACO-LVIS](https://github.com/facebookresearch/paco/tree/main#dataset-setup), [PASCAL-Part](https://github.com/facebookresearch/VLPart/tree/main/datasets#pascal-part), [COCO Images](http://images.cocodataset.org/zips/train2017.zip)\n\n    Note: For COCO-Stuff, we use the annotation file stuffthingmaps_trainval2017.zip. We only use the PACO-LVIS part in PACO. COCO Images should be put into the `dataset/coco/` directory.\n\n3. 
Referring segmentation datasets: [refCOCO](https://web.archive.org/web/20220413011718/https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcoco.zip), [refCOCO+](https://web.archive.org/web/20220413011656/https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcoco+.zip), [refCOCOg](https://web.archive.org/web/20220413012904/https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcocog.zip), [refCLEF](https://web.archive.org/web/20220413011817/https://bvisionweb1.cs.unc.edu/licheng/referit/data/refclef.zip) ([saiapr_tc-12](https://web.archive.org/web/20220515000000/http://bvisionweb1.cs.unc.edu/licheng/referit/data/images/saiapr_tc-12.zip)) \n\n    Note: the original links of refCOCO series data are down, and we update them with new ones. If the download speed is super slow or unstable, we also provide a [OneDrive link](https://mycuhk-my.sharepoint.com/:f:/g/personal/1155154502_link_cuhk_edu_hk/Em5yELVBvfREodKC94nOFLoBLro_LPxsOxNV44PHRWgLcA?e=zQPjsc) to download. **You must also follow the rules that the original datasets require.**\n\n4. Visual Question Answering dataset: [LLaVA-Instruct-150k](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/blob/main/llava_instruct_150k.json)\n\n5. 
Reasoning segmentation dataset: [ReasonSeg](https://github.com/dvlab-research/LISA#dataset)\n\nDownload them from the above links, and organize them as follows.\n\n```\n├── dataset\n│   ├── ade20k\n│   │   ├── annotations\n│   │   └── images\n│   ├── coco\n│   │   └── train2017\n│   │       ├── 000000000009.jpg\n│   │       └── ...\n│   ├── cocostuff\n│   │   └── train2017\n│   │       ├── 000000000009.png\n│   │       └── ...\n│   ├── llava_dataset\n│   │   └── llava_instruct_150k.json\n│   ├── mapillary\n│   │   ├── config_v2.0.json\n│   │   ├── testing\n│   │   ├── training\n│   │   └── validation\n│   ├── reason_seg\n│   │   └── ReasonSeg\n│   │       ├── train\n│   │       ├── val\n│   │       └── explanatory\n│   ├── refer_seg\n│   │   ├── images\n│   │   |   ├── saiapr_tc-12 \n│   │   |   └── mscoco\n│   │   |       └── images\n│   │   |           └── train2014\n│   │   ├── refclef\n│   │   ├── refcoco\n│   │   ├── refcoco+\n│   │   └── refcocog\n│   └── vlpart\n│       ├── paco\n│       │   └── annotations\n│       └── pascal_part\n│           ├── train.json\n│           └── VOCdevkit\n```\n\n### Pre-trained weights\n\n#### LLaVA\nTo train LISA-7B or 13B, you need to follow the [instruction](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md) to merge the LLaVA delta weights. Typically, we use the final weights `LLaVA-Lightning-7B-v1-1` and `LLaVA-13B-v1-1` merged from `liuhaotian/LLaVA-Lightning-7B-delta-v1-1` and `liuhaotian/LLaVA-13b-delta-v1-1`, respectively. 
For Llama2, we can directly use the LLaVA full weights `liuhaotian/llava-llama-2-13b-chat-lightning-preview`.\n\n#### SAM ViT-H weights\nDownload SAM ViT-H pre-trained weights from the [link](https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth).\n\n### Training\n```\ndeepspeed --master_port=24999 train_ds.py \\\n  --version=\"PATH_TO_LLaVA\" \\\n  --dataset_dir='./dataset' \\\n  --vision_pretrained=\"PATH_TO_SAM\" \\\n  --dataset=\"sem_seg||refer_seg||vqa||reason_seg\" \\\n  --sample_rates=\"9,3,3,1\" \\\n  --exp_name=\"lisa-7b\"\n```\nWhen training is finished, to get the full model weight:\n```\ncd ./runs/lisa-7b/ckpt_model \u0026\u0026 python zero_to_fp32.py . ../pytorch_model.bin\n```\n\n### Merge LoRA Weight\nMerge the LoRA weights of `pytorch_model.bin`, save the resulting model into your desired path in the Hugging Face format:\n```\nCUDA_VISIBLE_DEVICES=\"\" python merge_lora_weights_and_save_hf_model.py \\\n  --version=\"PATH_TO_LLaVA\" \\\n  --weight=\"PATH_TO_pytorch_model.bin\" \\\n  --save_path=\"PATH_TO_SAVED_MODEL\"\n```\n\nFor example:\n```\nCUDA_VISIBLE_DEVICES=\"\" python3 merge_lora_weights_and_save_hf_model.py \\\n  --version=\"./LLaVA/LLaVA-Lightning-7B-v1-1\" \\\n  --weight=\"lisa-7b/pytorch_model.bin\" \\\n  --save_path=\"./LISA-7B\"\n```\n\n### Validation\n```\ndeepspeed --master_port=24999 train_ds.py \\\n  --version=\"PATH_TO_LISA_HF_Model_Directory\" \\\n  --dataset_dir='./dataset' \\\n  --vision_pretrained=\"PATH_TO_SAM\" \\\n  --exp_name=\"lisa-7b\" \\\n  --eval_only\n```\n\nNote: the `v1` model is trained using both `train+val` sets, so please use the `v0` model to reproduce the validation results. 
(To use the `v0` models, first check out the legacy version of the repo with `git checkout 0e26916`.)\n\n \n## Inference \n\nTo chat with [LISA-13B-llama2-v1](https://huggingface.co/xinlai/LISA-13B-llama2-v1) or [LISA-13B-llama2-v1-explanatory](https://huggingface.co/xinlai/LISA-13B-llama2-v1-explanatory):\n(Note that `chat.py` currently does not support `v0` models, i.e., `LISA-13B-llama2-v0` and `LISA-13B-llama2-v0-explanatory`. To use the `v0` models, first check out the legacy version of the repo with `git checkout 0e26916`.)\n```\nCUDA_VISIBLE_DEVICES=0 python chat.py --version='xinlai/LISA-13B-llama2-v1'\nCUDA_VISIBLE_DEVICES=0 python chat.py --version='xinlai/LISA-13B-llama2-v1-explanatory'\n```\nTo use the `bf16` or `fp16` data type for inference:\n```\nCUDA_VISIBLE_DEVICES=0 python chat.py --version='xinlai/LISA-13B-llama2-v1' --precision='bf16'\n```\nTo use the `8bit` or `4bit` data type for inference (this enables running the 13B model on a single 24G or 12G GPU at some cost in generation quality):\n```\nCUDA_VISIBLE_DEVICES=0 python chat.py --version='xinlai/LISA-13B-llama2-v1' --precision='fp16' --load_in_8bit\nCUDA_VISIBLE_DEVICES=0 python chat.py --version='xinlai/LISA-13B-llama2-v1' --precision='fp16' --load_in_4bit\n```\nHint: for the 13B model, 16-bit inference consumes 30G of VRAM on a single GPU, 8-bit inference consumes 16G, and 4-bit inference consumes 9G.\n\nAfter that, input the text prompt and then the image path. For example:\n```\n- Please input your prompt: Where can the driver see the car speed in this image? 
Please output segmentation mask.\n- Please input the image path: imgs/example1.jpg\n\n- Please input your prompt: Can you segment the food that tastes spicy and hot?\n- Please input the image path: imgs/example2.jpg\n```\nThe results should look like:\n\u003cp align=\"center\"\u003e \u003cimg src=\"imgs/example1.jpg\" width=\"22%\"\u003e \u003cimg src=\"vis_output/example1_masked_img_0.jpg\" width=\"22%\"\u003e \u003cimg src=\"imgs/example2.jpg\" width=\"25%\"\u003e \u003cimg src=\"vis_output/example2_masked_img_0.jpg\" width=\"25%\"\u003e \u003c/p\u003e\n\n## Deployment\n```\nCUDA_VISIBLE_DEVICES=0 python app.py --version='xinlai/LISA-13B-llama2-v1' --load_in_4bit\nCUDA_VISIBLE_DEVICES=0 python app.py --version='xinlai/LISA-13B-llama2-v1-explanatory' --load_in_4bit\n```\nBy default, we use 4-bit quantization. Feel free to delete the `--load_in_4bit` argument for 16-bit inference or replace it with the `--load_in_8bit` argument for 8-bit inference.\n\n\n## Dataset\nIn ReasonSeg, we have collected 1218 images (239 train, 200 val, and 779 test). The training and validation sets can be downloaded from \u003ca href=\"https://drive.google.com/drive/folders/125mewyg5Ao6tZ3ZdJ-1-E3n04LGVELqy?usp=sharing\"\u003e**this link**\u003c/a\u003e. \n\nEach image is provided with an annotation JSON file:\n```\nimage_1.jpg, image_1.json\nimage_2.jpg, image_2.json\n...\nimage_n.jpg, image_n.json\n```\nImportant keys contained in JSON files:\n```\n- \"text\": text instructions.\n- \"is_sentence\": whether the text instructions are long sentences.\n- \"shapes\": target polygons.\n```\n\nThe elements of \"shapes\" fall into two categories, namely **\"target\"** and **\"ignore\"**. The former category is indispensable for evaluation, while the latter category denotes an ambiguous region and is hence disregarded during the evaluation process. 
\n\nWe provide a \u003ca href=\"https://github.com/dvlab-research/LISA/blob/main/utils/data_processing.py\"\u003e**script**\u003c/a\u003e that demonstrates how to process the annotations:\n```\npython3 utils/data_processing.py\n```\n\nIn addition, we used GPT-3.5 to rephrase instructions, so images in the training set may have **more than one instruction (but fewer than six)** in the \"text\" field. During training, users may randomly select one as the text query to obtain a better model.\n\n\n## Citation \nIf you find this project useful in your research, please consider citing:\n\n```\n@article{lai2023lisa,\n  title={LISA: Reasoning Segmentation via Large Language Model},\n  author={Lai, Xin and Tian, Zhuotao and Chen, Yukang and Li, Yanwei and Yuan, Yuhui and Liu, Shu and Jia, Jiaya},\n  journal={arXiv preprint arXiv:2308.00692},\n  year={2023}\n}\n@article{yang2023improved,\n  title={An Improved Baseline for Reasoning Segmentation with Large Language Model},\n  author={Yang, Senqiao and Qu, Tianyuan and Lai, Xin and Tian, Zhuotao and Peng, Bohao and Liu, Shu and Jia, Jiaya},\n  journal={arXiv preprint arXiv:2312.17240},\n  year={2023}\n}\n```\n\n## Acknowledgement\n-  This work is built upon [LLaVA](https://github.com/haotian-liu/LLaVA) and [SAM](https://github.com/facebookresearch/segment-anything). \n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdvlab-research%2FLISA","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdvlab-research%2FLISA","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdvlab-research%2FLISA/lists"}