
# FasterViT: Fast Vision Transformers with Hierarchical Attention

Official PyTorch implementation of [**FasterViT: Fast Vision Transformers with Hierarchical Attention**](https://arxiv.org/abs/2306.06189).

[![Star on GitHub](https://img.shields.io/github/stars/NVlabs/FasterViT.svg?style=social)](https://github.com/NVlabs/FasterViT/stargazers)

[Ali Hatamizadeh](https://research.nvidia.com/person/ali-hatamizadeh),
[Greg Heinrich](https://developer.nvidia.com/blog/author/gheinrich/),
[Hongxu (Danny) Yin](https://hongxu-yin.github.io/),
[Andrew Tao](https://developer.nvidia.com/blog/author/atao/),
[Jose M. Alvarez](https://alvarezlopezjosem.github.io/),
[Jan Kautz](https://jankautz.com/),
[Pavlo Molchanov](https://www.pmolchanov.com/).

For business inquiries, please visit our website and submit the form: [NVIDIA Research Licensing](https://www.nvidia.com/en-us/research/inquiries/)

---

FasterViT achieves a new SOTA Pareto front in terms of Top-1 accuracy and throughput, without extra training data!



We introduce a new self-attention mechanism, denoted as Hierarchical Attention (HAT), that captures both short- and long-range information by learning cross-window carrier tokens.

![teaser](./fastervit/assets/hierarchial_attn.png)
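
For intuition, the following is a minimal, simplified sketch of the hierarchical attention idea: a few learnable carrier tokens per window attend globally across windows, and each window's tokens then attend jointly with their own carrier tokens. This is an illustrative approximation for exposition only, not the implementation in this repository (see the `fastervit` package for the actual models); names such as `HATBlockSketch` are made up.

```python
import torch
import torch.nn as nn

class HATBlockSketch(nn.Module):
    """Illustrative sketch of hierarchical attention with carrier tokens (not the repo's code)."""
    def __init__(self, dim, num_windows, tokens_per_window, ct_per_window=4, num_heads=8):
        super().__init__()
        # a small set of learnable carrier tokens per local window
        self.carrier_tokens = nn.Parameter(torch.zeros(1, num_windows * ct_per_window, dim))
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.num_windows = num_windows
        self.tokens_per_window = tokens_per_window
        self.ct_per_window = ct_per_window

    def forward(self, x):
        # x: (B, num_windows * tokens_per_window, dim), already window-partitioned
        B, _, dim = x.shape
        ct = self.carrier_tokens.expand(B, -1, -1)
        # 1) carrier tokens attend to each other across all windows (long-range, cross-window)
        ct, _ = self.global_attn(ct, ct, ct)
        # 2) each window's tokens attend together with that window's carrier tokens (short-range)
        xw = x.reshape(B * self.num_windows, self.tokens_per_window, dim)
        ctw = ct.reshape(B * self.num_windows, self.ct_per_window, dim)
        xc = torch.cat([ctw, xw], dim=1)
        xc, _ = self.local_attn(xc, xc, xc)
        # drop the carrier tokens and restore the original token layout
        return xc[:, self.ct_per_window:, :].reshape(B, -1, dim)
```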

Note: Please use the [**latest NVIDIA TensorRT release**](https://docs.nvidia.com/deeplearning/tensorrt/container-release-notes/index.html) to enjoy the benefits of optimized FasterViT ops.

## 💥 News 💥
- **[04.02.2024]** 🔥 Updated [manuscript](https://arxiv.org/abs/2306.06189) now available on arXiv !
- **[01.24.2024]** 🔥🔥🔥 **Object Tracking with MOTRv2 + FasterViT** is now open-sourced ([link](./downstream/object_tracking/motrv2/README.md)) !
- **[01.17.2024]** 🔥🔥🔥 FasterViT paper has been accepted to [ICLR 2024](https://openreview.net/group?id=ICLR.cc/2024/Conference#tab-your-consoles) !
- **[10.14.2023]** 🔥🔥 We have added the FasterViT [object detection repository](./downstream/object_detection/dino/README.md) with [DINO](https://arxiv.org/abs/2203.03605) !
- **[08.24.2023]** 🔥 FasterViT Keras models with pre-trained weights published in [keras_cv_attention_models](https://github.com/leondgarse/keras_cv_attention_models/tree/main/keras_cv_attention_models/fastervit) !
- **[08.20.2023]** 🔥🔥 We have added ImageNet-21K SOTA pre-trained models for various resolutions !
- **[07.20.2023]** We have created official NVIDIA FasterViT [HuggingFace](https://huggingface.co/nvidia/FasterViT) page.
- **[07.06.2023]** FasterViT checkpoints are now also accessible on HuggingFace!
- **[07.04.2023]** ImageNet pretrained FasterViT models can now be imported with **1 line of code**. Please install the latest FasterViT pip package to use this functionality (also supports Any-resolution FasterViT models).
- **[06.30.2023]** We have further improved the [TensorRT](https://developer.nvidia.com/tensorrt-getting-started) throughput of FasterViT models by 10-15% on average across different models. Please use the [**latest NVIDIA TensorRT release**](https://docs.nvidia.com/deeplearning/tensorrt/container-release-notes/index.html) to benefit from these throughput gains.
- **[06.29.2023]** Any-resolution FasterViT models can now be initialized from pre-trained ImageNet-resolution (224 x 224) models.
- **[06.18.2023]** We have released the FasterViT [pip package](https://pypi.org/project/fastervit/) !
- **[06.17.2023]** [Any-resolution FasterViT](./fastervit/models/faster_vit_any_res.py) model is now available! The model can be used for a variety of applications, such as detection and segmentation, or for high-resolution fine-tuning with arbitrary input image resolutions.
- **[06.09.2023]** 🔥🔥 We have released the source code and ImageNet-1K FasterViT models!

## Quick Start

### Object Detection

Please see the FasterViT [object detection repository](./object_detection/README.md) with [DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection](https://arxiv.org/abs/2203.03605) for more details.

### Classification

We can import pre-trained FasterViT models with **1 line of code**. First, install the FasterViT pip package:

```bash
pip install fastervit
```
Note: Please upgrade to ```fastervit>=0.9.8``` if you have already installed the package, in order to use the pretrained weights.
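
For example, an existing installation can be upgraded with:

```bash
pip install --upgrade "fastervit>=0.9.8"
```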

A pretrained FasterViT model with default hyper-parameters can be created as in:

```python
>>> from fastervit import create_model

# Define fastervit-0 model with 224 x 224 resolution
>>> model = create_model('faster_vit_0_224',
...                      pretrained=True,
...                      model_path="/tmp/faster_vit_0.pth.tar")
```

`model_path` sets the path to which the pretrained weights are downloaded.

We can also simply test the model by passing a dummy input image. The output is the logits:

```python
>>> import torch

>>> image = torch.rand(1, 3, 224, 224)
>>> output = model(image) # torch.Size([1, 1000])
```
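
If class probabilities are needed, the logits can be converted with a softmax; the following is standard PyTorch, shown here purely for illustration:

```python
>>> probs = torch.softmax(output, dim=-1)          # class probabilities
>>> top5_prob, top5_idx = probs.topk(5, dim=-1)    # top-5 predicted ImageNet-1K classes
```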

We can also use the any-resolution FasterViT model to accommodate arbitrary image resolutions. In the following, we define an any-resolution FasterViT-0 model with an input resolution of 576 x 960, window sizes of 12 and 6 in the 3rd and 4th stages, a carrier token size of 2, and an embedding dimension of 64:

```python
>>> from fastervit import create_model

# Define any-resolution FasterViT-0 model with 576 x 960 resolution
>>> model = create_model('faster_vit_0_any_res',
...                      resolution=[576, 960],
...                      window_size=[7, 7, 12, 6],
...                      ct_size=2,
...                      dim=64,
...                      pretrained=True)
```
Note that the above model is initialized from the original ImageNet pre-trained FasterViT model with a resolution of 224 x 224. As a result, missing keys and mismatches are expected, since new layers (e.g. new carrier tokens) are added.
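
If you want to see exactly which parameters were newly initialized, one hypothetical way (assuming the checkpoint downloaded in the earlier example, stored either as a plain state_dict or under a `state_dict` key) is:

```python
>>> import torch

# Hypothetical illustration, not part of the fastervit API
>>> ckpt = torch.load("/tmp/faster_vit_0.pth.tar", map_location="cpu")
>>> state_dict = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
>>> missing, unexpected = model.load_state_dict(state_dict, strict=False)
>>> print(len(missing), "newly initialized keys;", len(unexpected), "unused keys")
```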

We can test the model by passing a dummy input image. The output is the logits:

```python
>>> import torch

>>> image = torch.rand(1, 3, 576, 960)
>>> output = model(image) # torch.Size([1, 1000])
```

## Catalog
- [x] ImageNet-1K training code
- [x] ImageNet-1K pre-trained models
- [x] Any-resolution FasterViT
- [x] FasterViT pip-package release
- [x] Add capability to initialize any-resolution FasterViT from ImageNet-pretrained weights
- [x] ImageNet-21K pre-trained models
- [x] Detection code + models

---

## Results + Pretrained Models

### ImageNet-1K
**FasterViT ImageNet-1K Pretrained Models**


| Name | Acc@1 (%) | Acc@5 (%) | Throughput (Img/Sec) | Resolution | #Params (M) | FLOPs (G) | Download |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| FasterViT-0 | 82.1 | 95.9 | 5802 | 224x224 | 31.4 | 3.3 | model |
| FasterViT-1 | 83.2 | 96.5 | 4188 | 224x224 | 53.4 | 5.3 | model |
| FasterViT-2 | 84.2 | 96.8 | 3161 | 224x224 | 75.9 | 8.7 | model |
| FasterViT-3 | 84.9 | 97.2 | 1780 | 224x224 | 159.5 | 18.2 | model |
| FasterViT-4 | 85.4 | 97.3 | 849 | 224x224 | 424.6 | 36.6 | model |
| FasterViT-5 | 85.6 | 97.4 | 449 | 224x224 | 975.5 | 113.0 | model |
| FasterViT-6 | 85.8 | 97.4 | 352 | 224x224 | 1360.0 | 142.0 | model |

### ImageNet-21K
**FasterViT ImageNet-21K Pretrained Models (ImageNet-1K Fine-tuned)**


| Name | Acc@1 (%) | Acc@5 (%) | Resolution | #Params (M) | FLOPs (G) | Download |
|:---:|:---:|:---:|:---:|:---:|:---:|
| FasterViT-4-21K-224 | 86.6 | 97.8 | 224x224 | 271.9 | 40.8 | model |
| FasterViT-4-21K-384 | 87.6 | 98.3 | 384x384 | 271.9 | 120.1 | model |
| FasterViT-4-21K-512 | 87.8 | 98.4 | 512x512 | 271.9 | 213.5 | model |
| FasterViT-4-21K-768 | 87.9 | 98.5 | 768x768 | 271.9 | 480.4 | model |

Raw pre-trained ImageNet-21K model weights for FasterViT-4 are also available for download at this [link](https://drive.google.com/file/d/1T3jDrzlTmTcZVS1Dh01Fl3J2LXZHWKdL/view?usp=sharing).
### Robustness (ImageNet-A - ImageNet-R - ImageNet-V2)

All models use `crop_pct=0.875`. Results are obtained by running inference on ImageNet-1K pretrained models without finetuning.


| Name | A-Acc@1 (%) | A-Acc@5 (%) | R-Acc@1 (%) | R-Acc@5 (%) | V2-Acc@1 (%) | V2-Acc@5 (%) |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| FasterViT-0 | 23.9 | 57.6 | 45.9 | 60.4 | 70.9 | 90.0 |
| FasterViT-1 | 31.2 | 63.3 | 47.5 | 61.9 | 72.6 | 91.0 |
| FasterViT-2 | 38.2 | 68.9 | 49.6 | 63.4 | 73.7 | 91.6 |
| FasterViT-3 | 44.2 | 73.0 | 51.9 | 65.6 | 75.0 | 92.2 |
| FasterViT-4 | 49.0 | 75.4 | 56.0 | 69.6 | 75.7 | 92.7 |
| FasterViT-5 | 52.7 | 77.6 | 56.9 | 70.0 | 76.0 | 93.0 |
| FasterViT-6 | 53.7 | 78.4 | 57.1 | 70.1 | 76.1 | 93.0 |

A, R and V2 denote ImageNet-A, ImageNet-R and ImageNet-V2 respectively.
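
For reference, a `crop_pct=0.875` evaluation pipeline corresponds to a standard timm-style transform; the snippet below is illustrative only (it assumes `timm` is installed and is not part of this repository):

```python
from timm.data import create_transform

# Illustrative eval transform matching the crop_pct used for the robustness results above
eval_transform = create_transform(
    input_size=224,    # ImageNet-1K pretrained resolution
    is_training=False,
    crop_pct=0.875,    # center-crop fraction
)
```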

## Installation

We provide a [docker file](./Dockerfile). In addition, assuming that a recent [PyTorch](https://pytorch.org/get-started/locally/) package is installed, the dependencies can be installed by running:

```bash
pip install -r requirements.txt
```
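
For example, the provided Dockerfile can be built and run with standard Docker commands (the image tag `fastervit` below is arbitrary, and GPU access assumes the NVIDIA Container Toolkit is installed):

```bash
docker build -t fastervit .
docker run --gpus all -it --rm fastervit
```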

## Training

Please see [TRAINING.md](TRAINING.md) for detailed training instructions of all models.

## Evaluation

The FasterViT models can be evaluated on the ImageNet-1K validation set using the following command:

```bash
python validate.py \
  --model <model-name> \
  --checkpoint <checkpoint-path> \
  --data_dir <path-to-imagenet> \
  --batch-size <batch-size-per-gpu>
```
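
As a concrete (hypothetical) example, FasterViT-0 could be evaluated with the checkpoint downloaded earlier; the paths and batch size below are placeholders:

```bash
python validate.py \
  --model faster_vit_0_224 \
  --checkpoint /tmp/faster_vit_0.pth.tar \
  --data_dir /path/to/imagenet \
  --batch-size 128
```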