https://github.com/showlab/sparseformer

(ICLR 2024, CVPR 2024) SparseFormer
computer-vision efficient-neural-networks sparseformer transformer vision-transformer

Host: GitHub
URL: https://github.com/showlab/sparseformer
Owner: showlab
License: mit
Created: 2023-04-02T05:31:12.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2024-11-10T12:28:01.000Z (8 months ago)
Last Synced: 2025-03-24T09:13:46.013Z (3 months ago)
Topics: computer-vision, efficient-neural-networks, sparseformer, transformer, vision-transformer
Language: Python
Homepage:
Size: 261 KB
Stars: 73
Watchers: 9
Forks: 2
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

# 🎆 SparseFormer

This is the official repo for SparseFormer research:

> [**SparseFormer: Sparse Visual Recognition via Limited Latent Tokens**](https://arxiv.org/abs/2304.03768) **(ICLR 2024)**


> Ziteng Gao, Zhan Tong, Limin Wang, Mike Zheng Shou


> [**Bootstrapping SparseFormers from Vision Foundation Models**](https://arxiv.org/abs/2312.01987) **(CVPR 2024)**


> Ziteng Gao, Zhan Tong, Kevin Qinghong Lin, Joya Chen, Mike Zheng Shou


## Out-of-box SparseFormer as a Library (recommended)

SparseFormer works out of the box once the `sparseformer` library is installed.

__Getting started__. You can install `sparseformer` as a library with the following command:

```shell
pip install -e sparseformer # in this folder
```

Available pre-trained model weights are listed [here](./sparseformer/sparseformer/factory.py#L11), covering both v1 and bootstrapped models. You can simply call [`create_model`](./sparseformer/sparseformer/factory.py#L37) with the argument `download=True` to get pre-trained models. You can play with it like this!

```python
from sparseformer.factory import create_model

# e.g., make a SparseFormer v1 tiny model
model = create_model("sparseformer_v1_tiny", download=True)

# or make a CLIP SparseFormer large model and put it in the OpenCLIP pipeline
import open_clip

# create_model_and_transforms returns (model, preprocess_train, preprocess_val)
clip, _, preprocess = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
visual = create_model("sparseformer_btsp_openai_clip_large", download=True)
clip.visual = visual
# ...
```

__Video SparseFormers__. We also provide a unified [`MediaSparseFormer`](./sparseformer/sparseformer/media.py#L103) implementation for both video and image inputs (an image is treated as a single-frame video), with the token inflation argument `replicates`. `MediaSparseFormer` can load pre-trained weights of the image `SparseFormer` via [`load_2d_state_dict`](./sparseformer/sparseformer/media.py#L147).

Note: Pre-trained weights for video SparseFormers are currently unavailable. We might reproduce them if the community needs them.

__ADVANCED: Make your own SparseFormer and load timm weights__. 

Our codebase is generally compatible with [timm vision transformer](https://github.com/huggingface/pytorch-image-models/blob/main/timm/models/vision_transformer.py) weights. So here is something to play with: you can make your own SparseFormer and load timm transformer weights into it, without being limited to our provided configurations!

For example, you can make a SparseFormer similar to ViT-224/16, with sampling & decoding and RoI adjusting every 3 blocks, and load it with the official OpenAI CLIP pre-trained weights:

```python
import timm

from sparseformer.config import base_btsp_config
from sparseformer.modeling import OP, SparseFormer

# Sampling & decoding plus RoI adjusting every 3 blocks; plain attention + MLP otherwise.
num_layers = 12
ops_list = []
for i in range(num_layers):
    if i % 3 == 0:
        ops_list.append([OP.SAMPLING_M, OP.ATTN, OP.MLP, OP.ROI_ADJ, OP.PE_INJECT])
    else:
        ops_list.append([OP.ATTN, OP.MLP])

config = base_btsp_config()
config.update(
    num_latent_tokens=16,
    num_sampling_points=9,
    width_configs=[768] * num_layers,
    repeats=[1] * num_layers,
    ops_list=ops_list,
)
model = SparseFormer(**config)

# Load the OpenAI CLIP ViT-B/16 weights from timm; timm's "blocks" become our "layers".
pretrained = timm.create_model("vit_base_patch16_clip_224.openai", pretrained=True)
new_dict = {k.replace("blocks", "layers"): v for k, v in pretrained.state_dict().items()}

# strict=False because the sampling & decoding and RoI adjusting parts have no timm counterpart.
print(model.load_state_dict(new_dict, strict=False))
```

All weights of the attention and MLP layers should load successfully. The resulting SparseFormer needs to be fine-tuned before it outputs meaningful results, since the sampling & decoding and RoI adjusting parts are newly initialized. Maybe you can fine-tune it into a CLIP-based open-vocabulary detector (have not yet tried, but very promising imo! :D).
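To check that only the newly initialized parts are missing after loading, it can help to rename the keys in a separate step and then group the leftover keys by module. The following is a minimal sketch in plain Python, not part of the `sparseformer` library: the helper names `remap_timm_keys` and `group_by_prefix` and all key names in the toy data are hypothetical, and integers stand in for tensors.

```python
from collections import Counter


def remap_timm_keys(state_dict, old="blocks", new="layers"):
    """Return a copy of `state_dict` with the path component `old` renamed to `new`."""
    remapped = {}
    for key, value in state_dict.items():
        # Rename only whole dot-separated components, never substrings of other names.
        parts = [new if part == old else part for part in key.split(".")]
        remapped[".".join(parts)] = value
    return remapped


def group_by_prefix(keys, depth=1):
    """Count keys by their first `depth` dot-separated components."""
    return Counter(".".join(key.split(".")[:depth]) for key in keys)


# Toy stand-ins for a timm state dict (hypothetical key names, ints instead of tensors).
timm_like = {
    "blocks.0.attn.qkv.weight": 0,
    "blocks.0.mlp.fc1.weight": 1,
    "norm.weight": 2,
}
print(sorted(remap_timm_keys(timm_like)))

# Toy stand-in for the missing keys reported after loading (hypothetical names).
missing = ["layers.0.sampling.weight", "layers.0.roi_adj.weight", "head.weight"]
print(group_by_prefix(missing, depth=1))
```

With a real model, the object returned by `load_state_dict(..., strict=False)` exposes `missing_keys` and `unexpected_keys`, so `group_by_prefix(result.missing_keys, depth=3)` gives a quick per-module overview. Matching whole path components is stricter than the substring replacement used above; for timm ViT keys the two should behave the same.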

## Training (SparseFormer v1)

For training SparseFormer v1 on ImageNet ([**SparseFormer: Sparse Visual Recognition via Limited Latent Tokens**](https://arxiv.org/abs/2304.03768)), please check [imagenet](./imagenet/).

**Note:** this [imagenet](./imagenet/) sub-codebase will be refactored soon.

## Citation

If you find SparseFormer useful in your research or work, please consider citing us using the following entry:

```
@inproceedings{gao2024sparseformer,
  author       = {Ziteng Gao and
                  Zhan Tong and
                  Limin Wang and
                  Mike Zheng Shou},
  title        = {SparseFormer: Sparse Visual Recognition via Limited Latent Tokens},
  booktitle    = {{ICLR}},
  publisher    = {OpenReview.net},
  year         = {2024}
}

@inproceedings{gao2024bootstrapping,
  author       = {Ziteng Gao and
                  Zhan Tong and
                  Kevin Qinghong Lin and
                  Joya Chen and
                  Mike Zheng Shou},
  title        = {Bootstrapping SparseFormers from Vision Foundation Models},
  booktitle    = {{CVPR}},
  pages        = {17710--17721},
  publisher    = {{IEEE}},
  year         = {2024}
}
```
