Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/UX-Decoder/FIND
[NeurIPS 2024] Official implementation of the paper "Interfacing Foundation Models' Embeddings"
- Host: GitHub
- URL: https://github.com/UX-Decoder/FIND
- Owner: UX-Decoder
- Created: 2023-11-19T06:53:54.000Z (about 1 year ago)
- Default Branch: master
- Last Pushed: 2024-08-21T00:21:54.000Z (5 months ago)
- Last Synced: 2024-11-30T18:13:29.222Z (about 2 months ago)
- Language: Python
- Homepage:
- Size: 23.6 MB
- Stars: 112
- Watchers: 7
- Forks: 6
- Open Issues: 1
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- Awesome-Segment-Anything
README
# FIND: Interfacing Foundation Models' Embeddings
:grapes: \[[Read our arXiv Paper](https://arxiv.org/pdf/2312.07532.pdf)\]   :apple: \[[Try our Demo](https://e6efa3093a88ff2321.gradio.live)\]   :orange: \[[Walk through Project Page](https://x-decoder-vl.github.io/)\]

We introduce **FIND**, which can **IN**terface **F**oundation models' embe**DD**ings in an interleaved shared embedding space. Below is a brief introduction to the generic and interleaved tasks we support!
by [Xueyan Zou](https://maureenzou.github.io/), [Linjie Li](https://scholar.google.com/citations?user=WR875gYAAAAJ&hl=en), [Jianfeng Wang](http://jianfengwang.me/), [Jianwei Yang](https://jwyang.github.io/), [Mingyu Ding](https://dingmyu.github.io/), [Junyi Wei](https://scholar.google.com/citations?user=Kb1GL40AAAAJ&hl=en), [Zhengyuan Yang](https://zyang-ur.github.io/), [Feng Li](https://fengli-ust.github.io/), [Hao Zhang](https://scholar.google.com/citations?user=B8hPxMQAAAAJ&hl=en), [Shilong Liu](https://lsl.zone/), [Arul Aravinthan](https://www.linkedin.com/in/arul-aravinthan-414509218/), [Yong Jae Lee*](https://pages.cs.wisc.edu/~yongjaelee/), [Lijuan Wang*](https://scholar.google.com/citations?user=cDcWXuIAAAAJ&hl=zh-CN)
\* Equal Advising
![FIND design](assets/images/teaser.jpg?raw=true)
## :rocket: Updates
* **[2024.8.20]** We have released an updated arXiv version, together with a comprehensive user guide on GitHub!
* **[2023.12.3]** We have a poster session @ NeurIPS 2023 for [SEEM](https://arxiv.org/pdf/2304.06718.pdf); feel free to visit us during 5:00-7:00pm (CT)!
* **[2023.12.2]** We have released all the training, evaluation, and demo code!

## :bookmark_tabs: Catalog
- [x] Demo Code
- [x] Model Checkpoint
- [x] Comprehensive User Guide
- [x] Dataset
- [x] Training Code
- [x] Evaluation Code

## :hammer: Getting Started
**Install Conda**

```
sh -c "$(curl -fsSL https://raw.github.com/ohmyzsh/ohmyzsh/master/tools/install.sh)"
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh
bash ~/miniconda.sh -b -p $HOME/miniconda
eval "$($HOME/miniconda/bin/conda shell.bash hook)"
conda init
conda init zsh
```

**Build Environment**
```
conda create --name find python=3.10
conda activate find
conda install -c conda-forge mpi4py
conda install -c conda-forge cudatoolkit=11.7
conda install -c nvidia/label/cuda-11.7.0 cuda-toolkit
pip install torch==2.0.1+cu117 torchvision==0.15.2+cu117 torchaudio==2.0.2+cu117 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r assets/requirements/requirements.txt
pip install -r assets/requirements/requirements_custom.txt
cd modeling/vision/encoder/ops
sh make.sh
cd ../../../..
```
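After building the ops, a quick sanity check confirms that PyTorch and CUDA are wired up correctly (a minimal sketch; it only assumes the installs above succeeded):

```
# Verify the PyTorch build and CUDA availability inside the conda env
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```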
**Build Dataset**

Explore through [🤗 Hugging Face: FIND-Bench](https://huggingface.co/datasets/xueyanz/FIND-Bench).
**Download Raw File:**
| entity_train2017.json | entity_val2017.json | entity_val2017_long.json |
|-----------------------|---------------------|--------------------------|
| [download](https://huggingface.co/datasets/xueyanz/FIND-Bench/resolve/main/entity_train2017.json) | [download](https://huggingface.co/datasets/xueyanz/FIND-Bench/resolve/main/entity_val2017.json) | [download](https://huggingface.co/datasets/xueyanz/FIND-Bench/resolve/main/entity_val2017_long.json) |

Data Structure
```
data/
└── coco/
    ├── annotations/
    │   ├── entity_train2017.json
    │   ├── *entity_val2017.json*
    │   └── *entity_val2017_long.json*
    ├── panoptic_semseg_train2017/
    ├── panoptic_semseg_val2017/
    ├── panoptic_train2017/
    ├── panoptic_val2017/
    ├── train2017/
    └── *val2017/*
```

* To run the **demo**, the files/folders marked with \*asterisks\* are required; please download the [COCO dataset](https://cocodataset.org/#download) and the FIND-Bench annotations [entity_val2017.json](https://huggingface.co/datasets/xueyanz/FIND-Bench/resolve/main/entity_val2017.json) and [entity_val2017_long.json](https://huggingface.co/datasets/xueyanz/FIND-Bench/resolve/main/entity_val2017_long.json) (a download snippet follows this list).
* To run evaluation, please additionally prepare panoptic_val2017 according to [Mask2Former](https://github.com/facebookresearch/Mask2Former).
* To run training, please additionally download and prepare all other files.
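For convenience, the demo annotations can be fetched directly into the layout above, for example (assuming the commands are run from the repository root):

```
# Download the FIND-Bench demo annotations into the expected folder
mkdir -p data/coco/annotations
wget -P data/coco/annotations https://huggingface.co/datasets/xueyanz/FIND-Bench/resolve/main/entity_val2017.json
wget -P data/coco/annotations https://huggingface.co/datasets/xueyanz/FIND-Bench/resolve/main/entity_val2017_long.json
```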
**Run Demo**

Command

```
python3 -m demo.find.demo_interleave_llama evaluate \
--conf_files configs/find/focall_llama_lang.yaml \
--overrides \
MODEL.DECODER.HIDDEN_DIM 512 \
MODEL.ENCODER.CONVS_DIM 512 \
MODEL.ENCODER.MASK_DIM 512 \
VLP.INPUT.SHORTEST_EDGE True \
VLP.INPUT.MIN_SIZE_TEST 480 \
VLP.INPUT.MAX_SIZE_TEST 640 \
VLP.TEST.BATCH_SIZE_TOTAL 8 \
RESUME_FROM /pth/to/grin_focall_llama_x640.pt \
FP16 True \
FAKE_UPDATE True
```
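Note that `RESUME_FROM` expects a local path to a checkpoint. The released weights listed in the Interleave Checkpoint table below can be downloaded first, e.g.:

```
# Fetch the Focal-L checkpoint released on Hugging Face
wget https://huggingface.co/xueyanz/FIND/resolve/main/find_focall_llama_x640.pt
```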
**Run Evaluation**

Single-GPU

```
python entry.py evaluate \
--conf_files configs/find/focall_llama_lang.yaml \
--overrides \
FP16 True \
MODEL.DECODER.MASK.ENABLED True \
MODEL.DECODER.CAPTION.ENABLED True \
MODEL.DECODER.SPATIAL.ENABLED True \
MODEL.DECODER.RETRIEVAL.ENABLED True \
MODEL.DECODER.GROUNDING.ENABLED True \
MODEL.DECODER.INTERLEAVE.ENABLED True \
MODEL.DECODER.INTERLEAVE.VISUAL_PROB 0.5 \
COCO.TRAIN.BATCH_SIZE_TOTAL 1 \
COCO.TRAIN.BATCH_SIZE_PER_GPU 1 \
COCO.TEST.BATCH_SIZE_TOTAL 1 \
REF.TEST.BATCH_SIZE_TOTAL 1 \
VLP.TEST.BATCH_SIZE_TOTAL 1 \
VLP.INPUT.SHORTEST_EDGE True \
VLP.INPUT.MIN_SIZE_TEST 512 \
VLP.INPUT.MAX_SIZE_TEST 720 \
COCO.INPUT.MIN_SIZE_TEST 640 \
COCO.INPUT.MAX_SIZE_TEST 1024 \
WEIGHT True \
RESUME_FROM /pth/to/grin_focall_llama_x640.pt
```

Multi-GPU

```
CUDA_VISIBLE_DEVICES=4,5,6,7 mpirun -n 4 python entry.py evaluate \
--conf_files configs/find/focall_llama_lang.yaml \
--overrides \
FP16 True \
MODEL.DECODER.MASK.ENABLED True \
MODEL.DECODER.CAPTION.ENABLED True \
MODEL.DECODER.SPATIAL.ENABLED True \
MODEL.DECODER.RETRIEVAL.ENABLED True \
MODEL.DECODER.GROUNDING.ENABLED True \
MODEL.DECODER.INTERLEAVE.ENABLED True \
MODEL.DECODER.INTERLEAVE.VISUAL_PROB 0.5 \
COCO.TRAIN.BATCH_SIZE_TOTAL 1 \
COCO.TRAIN.BATCH_SIZE_PER_GPU 1 \
COCO.TEST.BATCH_SIZE_TOTAL 4 \
REF.TEST.BATCH_SIZE_TOTAL 4 \
VLP.TEST.BATCH_SIZE_TOTAL 4 \
VLP.INPUT.SHORTEST_EDGE True \
VLP.INPUT.MIN_SIZE_TEST 512 \
VLP.INPUT.MAX_SIZE_TEST 720 \
COCO.INPUT.MIN_SIZE_TEST 640 \
COCO.INPUT.MAX_SIZE_TEST 1024 \
WEIGHT True \
RESUME_FROM /pth/to/grin_focall_llama_x640.pt
```

**Run Training**
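This README does not spell out the training command. By analogy with the evaluation commands above, a plausible launch looks like the following (a sketch only: the `train` subcommand and the chosen overrides are assumptions, not confirmed by this README):

```
# Hypothetical multi-GPU training launch, mirroring the evaluation pattern above
CUDA_VISIBLE_DEVICES=0,1,2,3 mpirun -n 4 python entry.py train \
    --conf_files configs/find/focall_llama_lang.yaml \
    --overrides \
    FP16 True \
    COCO.TRAIN.BATCH_SIZE_TOTAL 4 \
    COCO.TRAIN.BATCH_SIZE_PER_GPU 1
```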
## Interleave Checkpoint
| Model | Checkpoint | COCO-Entity | | | | COCO-Entity-Long | | | |
|-------------------|------------|------|------|------|-------|------|------|------|-------|
| | | cIoU | AP50 | IR@5 | IR@10 | cIoU | AP50 | IR@5 | IR@10 |
| ImageBIND (H) | - | - | - | 51.4 | 61.3 | - | - | 58.7 | 68.9 |
| Grounding-SAM (H) | - | 58.9 | 63.2 | - | - | 56.1 | 62.5 | - | - |
| Focal-T | [ckpt](https://huggingface.co/xueyanz/FIND/resolve/main/find_focalt_llama_x640.pt) | 74.9 | 79.5 | 43.5 | 57.1 | 73.2 | 77.7 | 49.4 | 63.9 |
| Focal-L | [ckpt](https://huggingface.co/xueyanz/FIND/resolve/main/find_focall_llama_x640.pt) | 76.2 | 81.3 | 81.1 | 88.7 | 74.8 | 79.3 | 89.3 | 94.6 |

## :framed_picture: FIND-Bench Visualization
## Citation
If you find this repo useful for your research and applications, please cite using this BibTeX:
```
@misc{zou2022xdecoder,
title={Generalized decoding for pixel, image, and language},
author={Zou*, Xueyan and Dou*, Zi-Yi and Yang*, Jianwei and Gan, Zhe and Li, Linjie and Li, Chunyuan and Dai, Xiyang and Behl, Harkirat and Wang, Jianfeng and Yuan, Lu and Peng, Nanyun and Wang, Lijuan and Lee†, Yong Jae and Gao†, Jianfeng},
publisher={CVPR},
year={2023},
}

@misc{zou2023seem,
title={Segment everything everywhere all at once},
author={Zou*, Xueyan and Yang*, Jianwei and Zhang*, Hao and Li*, Feng and Li, Linjie and Wang, Jianfeng and Wang, Lijuan and Gao†, Jianfeng and Lee†, Yong Jae},
publisher={NeurIPS},
year={2023},
}

@misc{zou2024find,
title={Interfacing Foundation Models' Embeddings},
author={Zou, Xueyan and Li, Linjie and Wang, Jianfeng and Yang, Jianwei and Ding, Mingyu and Yang, Zhengyuan and Li, Feng and Zhang, Hao and Liu, Shilong and Aravinthan, Arul and Lee†, Yong Jae and Wang†, Lijuan},
publisher={arXiv preprint arXiv:2312.07532},
year={2024},
}
```

## Acknowledgement
This research project has benefitted from the Microsoft Accelerate Foundation Models Research (AFMR) grant program.