https://github.com/showlab/sparseformer
(ICLR 2024, CVPR 2024) SparseFormer
- Host: GitHub
- URL: https://github.com/showlab/sparseformer
- Owner: showlab
- License: mit
- Created: 2023-04-02T05:31:12.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-11-10T12:28:01.000Z (about 2 months ago)
- Last Synced: 2024-12-26T23:05:06.260Z (7 days ago)
- Topics: computer-vision, efficient-neural-networks, sparseformer, transformer, vision-transformer
- Language: Python
- Homepage:
- Size: 261 KB
- Stars: 65
- Watchers: 9
- Forks: 2
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
# 🎆 SparseFormer
This is the official repo for SparseFormer research:
> [**SparseFormer: Sparse Visual Recognition via Limited Latent Tokens**](https://arxiv.org/abs/2304.03768) **(ICLR 2024)**
> Ziteng Gao, Zhan Tong, Limin Wang, Mike Zheng Shou

> [**Bootstrapping SparseFormers from Vision Foundation Models**](https://arxiv.org/abs/2312.01987) **(CVPR 2024)**
> Ziteng Gao, Zhan Tong, Kevin Qinghong Lin, Joya Chen, Mike Zheng Shou

## Out-of-box SparseFormer as a Library (recommended)
We provide out-of-the-box SparseFormer usage through the `sparseformer` library.

__Getting started__. You can install `sparseformer` as a library with the following command:
```shell
pip install -e sparseformer # in this folder
```

Available pre-trained model weights are listed [here](./sparseformer/sparseformer/factory.py#L11), including v1 weights and bootstrapped ones. You can simply use [`create_model`](./sparseformer/sparseformer/factory.py#L37) with the argument `download=True` to get pre-trained models. For example:
```python
from sparseformer.factory import create_model

# e.g., make a SparseFormer v1 tiny model
model = create_model("sparseformer_v1_tiny", download=True)

# or make a CLIP SparseFormer large model and put it in the OpenCLIP pipeline
import open_clip

# create_model_and_transforms returns (model, preprocess_train, preprocess_val)
clip, _, preprocess = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
visual = create_model("sparseformer_btsp_openai_clip_large", download=True)
clip.visual = visual
# ...
```
__Video SparseFormers__. We also provide a unified [`MediaSparseFormer`](./sparseformer/sparseformer/media.py#L103) implementation for both video and image inputs (an image is treated as a single-frame video) with the token inflation argument `replicates`. `MediaSparseFormer` can load pre-trained weights of the image `SparseFormer` via [`load_2d_state_dict`](./sparseformer/sparseformer/media.py#L147).

Note: Pre-trained weights for VideoSparseFormers are currently unavailable. We might reproduce VideoSparseFormers if there is strong demand from the community.
__ADVANCED: Make your own SparseFormer and load timm weights__.
Our codebase is generally compatible with [timm vision transformer](https://github.com/huggingface/pytorch-image-models/blob/main/timm/models/vision_transformer.py) weights, so you can make your own SparseFormer and load timm transformer weights, not limited to our provided configurations!

For example, you can make a SparseFormer similar to ViT-224/16, with sampling & decoding and RoI adjusting every 3 blocks, and load it with the official OpenAI CLIP pre-trained weights:
```python
from sparseformer.modeling import SparseFormer, OP
from sparseformer.config import base_btsp_config

# sampling & decoding and RoI adjusting every 3 blocks; plain attention + MLP otherwise
ops_list = []
num_layers = 12
for i in range(num_layers):
    if i % 3 == 0:
        ops_list.append([OP.SAMPLING_M, OP.ATTN, OP.MLP, OP.ROI_ADJ, OP.PE_INJECT])
    else:
        ops_list.append([OP.ATTN, OP.MLP])

config = base_btsp_config()
config.update(
    num_latent_tokens=16,
    num_sampling_points=9,
    width_configs=[768] * num_layers,
    repeats=[1] * num_layers,
    ops_list=ops_list,
)

model = SparseFormer(**config)

# load CLIP OpenAI weights from timm, renaming "blocks" -> "layers" in the keys
import timm

pretrained = timm.create_model("vit_base_patch16_clip_224.openai", pretrained=True)
new_dict = dict()
old_dict = pretrained.state_dict()
for k in old_dict:
    nk = k
    if "blocks" in k:
        nk = nk.replace("blocks", "layers")
    new_dict[nk] = old_dict[k]

# strict=False tolerates keys that do not match between the two models
print(model.load_state_dict(new_dict, strict=False))
```
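The key renaming in the snippet above can also be factored into a small helper and checked on a toy dictionary before touching real weights. The sketch below is only an illustration: `remap_timm_keys` and `summarize_keys` are hypothetical helper names (not part of `sparseformer` or timm), the toy keys merely imitate timm ViT naming, and the `missing` list is made up to show how the `missing_keys` reported by `load_state_dict(..., strict=False)` could be grouped by layer.

```python
from collections import Counter


def remap_timm_keys(state_dict, old="blocks", new="layers"):
    """Return a new dict with `old` replaced by `new` in every key (hypothetical helper)."""
    return {k.replace(old, new): v for k, v in state_dict.items()}


def summarize_keys(keys, depth=2):
    """Count keys by their first `depth` dot-separated components (hypothetical helper)."""
    return Counter(".".join(k.split(".")[:depth]) for k in keys)


# Toy stand-in for a timm ViT state dict; real values would be tensors.
toy = {
    "blocks.0.attn.qkv.weight": 0,
    "blocks.0.mlp.fc1.weight": 1,
    "cls_token": 2,
}
remapped = remap_timm_keys(toy)
print(sorted(remapped))

# Made-up key names, standing in for the `missing_keys` field of the
# result returned by load_state_dict(..., strict=False).
missing = [
    "layers.0.sampling.weight",
    "layers.0.sampling.bias",
    "layers.3.roi_adj.weight",
]
print(summarize_keys(missing))
```

With the real models, the same grouping could be applied to the `missing_keys` and `unexpected_keys` fields of the object that `load_state_dict` returns, to see which layers still carry newly initialized parameters.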
All weights of the attention and MLP layers should load successfully. The resulting SparseFormer must be fine-tuned to output meaningful results, since the sampling & decoding and RoI adjusting parts are newly initialized. Maybe you can fine-tune it into a CLIP-based open-vocabulary detector (not yet tried, but very promising imo! :D).

## Training (SparseFormer v1)
For training SparseFormer v1 on ImageNet ([**SparseFormer: Sparse Visual Recognition via Limited Latent Tokens**](https://arxiv.org/abs/2304.03768)), please check [imagenet](./imagenet/).

**Note:** this [imagenet](./imagenet/) sub-codebase will be refactored soon.
## Citation
If you find SparseFormer useful in your research or work, please consider citing us using the following entry:
```
@inproceedings{gao2024sparseformer,
author = {Ziteng Gao and
Zhan Tong and
Limin Wang and
Mike Zheng Shou},
title = {SparseFormer: Sparse Visual Recognition via Limited Latent Tokens},
booktitle = {{ICLR}},
publisher = {OpenReview.net},
year = {2024}
}

@inproceedings{gao2024bootstrapping,
author = {Ziteng Gao and
Zhan Tong and
Kevin Qinghong Lin and
Joya Chen and
Mike Zheng Shou},
title = {Bootstrapping SparseFormers from Vision Foundation Models},
booktitle = {{CVPR}},
pages = {17710--17721},
publisher = {{IEEE}},
year = {2024}
}
```