
# FasterViT: Fast Vision Transformers with Hierarchical Attention

Official PyTorch implementation of [**FasterViT: Fast Vision Transformers with Hierarchical Attention**](https://arxiv.org/abs/2306.06189).

[![Star on GitHub](https://img.shields.io/github/stars/NVlabs/FasterViT.svg?style=social)](https://github.com/NVlabs/FasterViT/stargazers)

[Ali Hatamizadeh](https://research.nvidia.com/person/ali-hatamizadeh),
[Greg Heinrich](https://developer.nvidia.com/blog/author/gheinrich/),
[Hongxu (Danny) Yin](https://hongxu-yin.github.io/),
[Andrew Tao](https://developer.nvidia.com/blog/author/atao/),
[Jose M. Alvarez](https://alvarezlopezjosem.github.io/),
[Jan Kautz](https://jankautz.com/),
[Pavlo Molchanov](https://www.pmolchanov.com/).

For business inquiries, please visit our website and submit the form: [NVIDIA Research Licensing](https://www.nvidia.com/en-us/research/inquiries/)

---

FasterViT achieves a new SOTA Pareto front in terms of Top-1 accuracy and throughput, without extra training data!



We introduce a new self-attention mechanism, denoted as Hierarchical Attention (HAT), that captures both short- and long-range information by learning cross-window carrier tokens.

![teaser](./fastervit/assets/hierarchial_attn.png)
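
For intuition, the following is a minimal, simplified sketch of the hierarchical attention idea: a few learnable carrier tokens per window attend globally across windows, and each window's tokens then attend jointly with their own carrier tokens. This is an illustrative approximation for exposition only, not the implementation in this repository (see the `fastervit` package for the actual models); names such as `HATBlockSketch` are made up.

```python
import torch
import torch.nn as nn

class HATBlockSketch(nn.Module):
    """Illustrative sketch of hierarchical attention with carrier tokens (not the repo's code)."""
    def __init__(self, dim, num_windows, tokens_per_window, ct_per_window=4, num_heads=8):
        super().__init__()
        # a small set of learnable carrier tokens per local window
        self.carrier_tokens = nn.Parameter(torch.zeros(1, num_windows * ct_per_window, dim))
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.num_windows = num_windows
        self.tokens_per_window = tokens_per_window
        self.ct_per_window = ct_per_window

    def forward(self, x):
        # x: (B, num_windows * tokens_per_window, dim), already window-partitioned
        B, _, dim = x.shape
        ct = self.carrier_tokens.expand(B, -1, -1)
        # 1) carrier tokens attend to each other across all windows (long-range, cross-window)
        ct, _ = self.global_attn(ct, ct, ct)
        # 2) each window's tokens attend together with that window's carrier tokens (short-range)
        xw = x.reshape(B * self.num_windows, self.tokens_per_window, dim)
        ctw = ct.reshape(B * self.num_windows, self.ct_per_window, dim)
        xc = torch.cat([ctw, xw], dim=1)
        xc, _ = self.local_attn(xc, xc, xc)
        # drop the carrier tokens and restore the original token layout
        return xc[:, self.ct_per_window:, :].reshape(B, -1, dim)
```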

Note: Please use the [**latest NVIDIA TensorRT release**](https://docs.nvidia.com/deeplearning/tensorrt/container-release-notes/index.html) to enjoy the benefits of optimized FasterViT ops.

## 💥 News 💥
- **[04.02.2024]** 🔥 Updated [manuscript](https://arxiv.org/abs/2306.06189) now available on arXiv !
- **[01.24.2024]** 🔥🔥🔥 **Object Tracking with MOTRv2 + FasterViT** is now open-sourced ([link](./downstream/object_tracking/motrv2/README.md)) !
- **[01.17.2024]** 🔥🔥🔥 FasterViT paper has been accepted to [ICLR 2024](https://openreview.net/group?id=ICLR.cc/2024/Conference#tab-your-consoles) !
- **[10.14.2023]** 🔥🔥 We have added the FasterViT [object detection repository](./downstream/object_detection/dino/README.md) with [DINO](https://arxiv.org/abs/2203.03605) !
- **[08.24.2023]** 🔥 FasterViT Keras models with pre-trained weights published in [keras_cv_attention_models](https://github.com/leondgarse/keras_cv_attention_models/tree/main/keras_cv_attention_models/fastervit) !
- **[08.20.2023]** 🔥🔥 We have added ImageNet-21K SOTA pre-trained models for various resolutions !
- **[07.20.2023]** We have created official NVIDIA FasterViT [HuggingFace](https://huggingface.co/nvidia/FasterViT) page.
- **[07.06.2023]** FasterViT checkpoints are now also accessible on HuggingFace!
- **[07.04.2023]** ImageNet pretrained FasterViT models can now be imported with **1 line of code**. Please install the latest FasterViT pip package to use this functionality (also supports Any-resolution FasterViT models).
- **[06.30.2023]** We have further improved the [TensorRT](https://developer.nvidia.com/tensorrt-getting-started) throughput of FasterViT models by 10-15% on average across different models. Please use the [**latest NVIDIA TensorRT release**](https://docs.nvidia.com/deeplearning/tensorrt/container-release-notes/index.html) to benefit from these throughput gains.
- **[06.29.2023]** Any-resolution FasterViT models can now be initialized from pre-trained ImageNet-resolution (224 x 224) models.
- **[06.18.2023]** We have released the FasterViT [pip package](https://pypi.org/project/fastervit/) !
- **[06.17.2023]** [Any-resolution FasterViT](./fastervit/models/faster_vit_any_res.py) model is now available! The model can be used for a variety of applications, such as detection and segmentation, or for high-resolution fine-tuning with arbitrary input image resolutions.
- **[06.09.2023]** 🔥🔥 We have released the source code and ImageNet-1K FasterViT models!

## Quick Start

### Object Detection

Please see the FasterViT [object detection repository](./object_detection/README.md) with [DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection](https://arxiv.org/abs/2203.03605) for more details.

### Classification

We can import pre-trained FasterViT models with **1 line of code**. First, install the FasterViT pip package:

```bash
pip install fastervit
```
Note: Please upgrade to ```fastervit>=0.9.8``` if you have already installed the package, in order to use the pretrained weights.
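
For example, an existing installation can be upgraded with:

```bash
pip install --upgrade "fastervit>=0.9.8"
```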

A pretrained FasterViT model with default hyper-parameters can be created as in:

```python
>>> from fastervit import create_model

# Define fastervit-0 model with 224 x 224 resolution
>>> model = create_model('faster_vit_0_224',
...                      pretrained=True,
...                      model_path="/tmp/faster_vit_0.pth.tar")
```

`model_path` sets the path to which the pretrained weights are downloaded.

We can also simply test the model by passing a dummy input image. The output is the logits:

```python
>>> import torch

>>> image = torch.rand(1, 3, 224, 224)
>>> output = model(image) # torch.Size([1, 1000])
```
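
If class probabilities are needed, the logits can be converted with a softmax; the following is standard PyTorch, shown here purely for illustration:

```python
>>> probs = torch.softmax(output, dim=-1)          # class probabilities
>>> top5_prob, top5_idx = probs.topk(5, dim=-1)    # top-5 predicted ImageNet-1K classes
```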

We can also use the any-resolution FasterViT model to accommodate arbitrary image resolutions. In the following, we define an any-resolution FasterViT-0 model with an input resolution of 576 x 960, window sizes of 12 and 6 in the 3rd and 4th stages, a carrier token size of 2, and an embedding dimension of 64:

```python
>>> from fastervit import create_model

# Define any-resolution FasterViT-0 model with 576 x 960 resolution
>>> model = create_model('faster_vit_0_any_res',
...                      resolution=[576, 960],
...                      window_size=[7, 7, 12, 6],
...                      ct_size=2,
...                      dim=64,
...                      pretrained=True)
```
Note that the above model is initialized from the original ImageNet pre-trained FasterViT model with a resolution of 224 x 224. As a result, missing keys and mismatches are expected, since new layers (e.g. new carrier tokens) are added.
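
If you want to see exactly which parameters were newly initialized, one hypothetical way (assuming the checkpoint downloaded in the earlier example, stored either as a plain state_dict or under a `state_dict` key) is:

```python
>>> import torch

# Hypothetical illustration, not part of the fastervit API
>>> ckpt = torch.load("/tmp/faster_vit_0.pth.tar", map_location="cpu")
>>> state_dict = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
>>> missing, unexpected = model.load_state_dict(state_dict, strict=False)
>>> print(len(missing), "newly initialized keys;", len(unexpected), "unused keys")
```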

We can test the model by passing a dummy input image. The output is the logits:

```python
>>> import torch

>>> image = torch.rand(1, 3, 576, 960)
>>> output = model(image) # torch.Size([1, 1000])
```

## Catalog
- [x] ImageNet-1K training code
- [x] ImageNet-1K pre-trained models
- [x] Any-resolution FasterViT
- [x] FasterViT pip-package release
- [x] Add capability to initialize any-resolution FasterViT from ImageNet-pretrained weights
- [x] ImageNet-21K pre-trained models
- [x] Detection code + models

---

## Results + Pretrained Models

### ImageNet-1K
**FasterViT ImageNet-1K Pretrained Models**


| Name | Acc@1 (%) | Acc@5 (%) | Throughput (Img/Sec) | Resolution | #Params (M) | FLOPs (G) | Download |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| FasterViT-0 | 82.1 | 95.9 | 5802 | 224x224 | 31.4 | 3.3 | model |
| FasterViT-1 | 83.2 | 96.5 | 4188 | 224x224 | 53.4 | 5.3 | model |
| FasterViT-2 | 84.2 | 96.8 | 3161 | 224x224 | 75.9 | 8.7 | model |
| FasterViT-3 | 84.9 | 97.2 | 1780 | 224x224 | 159.5 | 18.2 | model |
| FasterViT-4 | 85.4 | 97.3 | 849 | 224x224 | 424.6 | 36.6 | model |
| FasterViT-5 | 85.6 | 97.4 | 449 | 224x224 | 975.5 | 113.0 | model |
| FasterViT-6 | 85.8 | 97.4 | 352 | 224x224 | 1360.0 | 142.0 | model |

### ImageNet-21K
**FasterViT ImageNet-21K Pretrained Models (ImageNet-1K Fine-tuned)**


| Name | Acc@1 (%) | Acc@5 (%) | Resolution | #Params (M) | FLOPs (G) | Download |
|:---:|:---:|:---:|:---:|:---:|:---:|
| FasterViT-4-21K-224 | 86.6 | 97.8 | 224x224 | 271.9 | 40.8 | model |
| FasterViT-4-21K-384 | 87.6 | 98.3 | 384x384 | 271.9 | 120.1 | model |
| FasterViT-4-21K-512 | 87.8 | 98.4 | 512x512 | 271.9 | 213.5 | model |
| FasterViT-4-21K-768 | 87.9 | 98.5 | 768x768 | 271.9 | 480.4 | model |

Raw pre-trained ImageNet-21K model weights for FasterViT-4 are also available for download at this [link](https://drive.google.com/file/d/1T3jDrzlTmTcZVS1Dh01Fl3J2LXZHWKdL/view?usp=sharing).
### Robustness (ImageNet-A - ImageNet-R - ImageNet-V2)

All models use `crop_pct=0.875`. Results are obtained by running inference on ImageNet-1K pretrained models without finetuning.


| Name | A-Acc@1 (%) | A-Acc@5 (%) | R-Acc@1 (%) | R-Acc@5 (%) | V2-Acc@1 (%) | V2-Acc@5 (%) |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| FasterViT-0 | 23.9 | 57.6 | 45.9 | 60.4 | 70.9 | 90.0 |
| FasterViT-1 | 31.2 | 63.3 | 47.5 | 61.9 | 72.6 | 91.0 |
| FasterViT-2 | 38.2 | 68.9 | 49.6 | 63.4 | 73.7 | 91.6 |
| FasterViT-3 | 44.2 | 73.0 | 51.9 | 65.6 | 75.0 | 92.2 |
| FasterViT-4 | 49.0 | 75.4 | 56.0 | 69.6 | 75.7 | 92.7 |
| FasterViT-5 | 52.7 | 77.6 | 56.9 | 70.0 | 76.0 | 93.0 |
| FasterViT-6 | 53.7 | 78.4 | 57.1 | 70.1 | 76.1 | 93.0 |

A, R and V2 denote ImageNet-A, ImageNet-R and ImageNet-V2 respectively.
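
For reference, a `crop_pct=0.875` evaluation pipeline corresponds to a standard timm-style transform; the snippet below is illustrative only (it assumes `timm` is installed and is not part of this repository):

```python
from timm.data import create_transform

# Illustrative eval transform matching the crop_pct used for the robustness results above
eval_transform = create_transform(
    input_size=224,    # ImageNet-1K pretrained resolution
    is_training=False,
    crop_pct=0.875,    # center-crop fraction
)
```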

## Installation

We provide a [docker file](./Dockerfile). In addition, assuming that a recent [PyTorch](https://pytorch.org/get-started/locally/) package is installed, the dependencies can be installed by running:

```bash
pip install -r requirements.txt
```
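
For example, the provided Dockerfile can be built and run with standard Docker commands (the image tag `fastervit` below is arbitrary, and GPU access assumes the NVIDIA Container Toolkit is installed):

```bash
docker build -t fastervit .
docker run --gpus all -it --rm fastervit
```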

## Training

Please see [TRAINING.md](TRAINING.md) for detailed training instructions of all models.

## Evaluation

The FasterViT models can be evaluated on the ImageNet-1K validation set using the following command:

```bash
python validate.py \
  --model <model-name> \
  --checkpoint <checkpoint-path> \
  --data_dir <path-to-imagenet> \
  --batch-size <batch-size-per-gpu>
```
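
As a concrete (hypothetical) example, FasterViT-0 could be evaluated with the checkpoint downloaded earlier; the paths and batch size below are placeholders:

```bash
python validate.py \
  --model faster_vit_0_224 \
  --checkpoint /tmp/faster_vit_0.pth.tar \
  --data_dir /path/to/imagenet \
  --batch-size 128
```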