https://github.com/google-research/maxvit

[ECCV 2022] Official repository for "MaxViT: Multi-Axis Vision Transformer". SOTA foundation models for classification, detection, segmentation, image quality, and generative modeling...
architecture classification cnn computer-vision image image-processing mlp object-detection resnet segmentation transformer transformer-architecture vision-transformer

# MaxViT: Multi-Axis Vision Transformer (ECCV 2022)

[![Paper](https://img.shields.io/badge/arXiv-Paper-.svg)](https://arxiv.org/abs/2204.01697)
[![Tutorial In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/google-research/maxvit/blob/master/MaxViT_tutorial.ipynb)
[![video](https://img.shields.io/badge/Video-Presentation-F9D371)](https://youtu.be/WEgB4lAZyKM)

This repository hosts the official TensorFlow implementation of MaxViT models:

__[MaxViT: Multi-Axis Vision Transformer](https://arxiv.org/abs/2204.01697)__. ECCV 2022.\
[Zhengzhong Tu](https://twitter.com/_vztu), [Hossein Talebi](https://scholar.google.com/citations?hl=en&user=UOX9BigAAAAJ), [Han Zhang](https://sites.google.com/view/hanzhang), [Feng Yang](https://sites.google.com/view/feng-yang), [Peyman Milanfar](https://sites.google.com/view/milanfarhome/), [Alan Bovik](https://www.ece.utexas.edu/people/faculty/alan-bovik), and [Yinxiao Li](https://scholar.google.com/citations?user=kZsIU74AAAAJ&hl=en)\
Google Research, University of Texas at Austin

*Disclaimer: This is not an officially supported Google product.*

**News**:

- May 2023: MaxViT is officially released in the [TensorFlow Model Garden](https://github.com/tensorflow/models/tree/master/official/projects/maxvit) to support training!
- Oct 12, 2022: Added the remaining ImageNet-1K and -21K checkpoints.
- Oct 4, 2022: A list of updates:
  * Added MaxViTTiny and MaxViTSmall checkpoints.
  * Added a Colab tutorial.
- Sep 8, 2022: Our Google AI blog post covering both [MaxViT](https://arxiv.org/abs/2204.01697) and [MAXIM](https://github.com/google-research/maxim) is [live](https://ai.googleblog.com/2022/09/a-multi-axis-approach-for-vision.html).
- Sep 7, 2022: [@rwightman](https://github.com/rwightman) released a few small model weights in [timm](https://github.com/rwightman/pytorch-image-models#aug-26-2022). They achieve even better results than those reported in our paper. See more [here](https://github.com/rwightman/pytorch-image-models#aug-26-2022).
- Aug 26, 2022: our MaxViT models have been implemented in [timm (pytorch-image-models)](https://github.com/rwightman/pytorch-image-models#aug-26-2022). Kudos to [@rwightman](https://github.com/rwightman)!
- July 21, 2022: Initial code release of [MaxViT models](https://arxiv.org/abs/2204.01697): accepted to ECCV'22.
- Apr 6, 2022: MaxViT has been implemented by [@lucidrains](https://github.com/lucidrains): [vit-pytorch](https://github.com/lucidrains/vit-pytorch#maxvit) :scream: :exploding_head:
- Apr 4, 2022: Initial upload to [arXiv](https://arxiv.org/abs/2204.01697).

## MaxViT Models

[MaxViT](https://arxiv.org/abs/2204.01697) is a family of hybrid (CNN + ViT) image classification models that achieves better accuracy than both SoTA ConvNets and Transformers for a given parameter count or FLOPs budget. The models also scale well to large datasets such as ImageNet-21K. Notably, because the grid attention used has linear complexity in the number of input tokens, MaxViT is able to "see" globally throughout the entire network, even in the earlier, high-resolution stages.
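
To make the multi-axis idea more concrete, here is a minimal NumPy sketch of the two token partitions described in the paper: local block (window) partitioning and sparse grid partitioning. It is an illustration only, not the code in this repository; the function names `block_partition` and `grid_partition` and the toy 8x8 feature map are assumptions made for this example.

```python
import numpy as np


def block_partition(x, window):
    """Group an (H, W, C) feature map into non-overlapping window x window blocks.

    Returns an array of shape (num_blocks, window * window, C). Attention
    applied within each group mixes tokens that are spatially adjacent.
    """
    h, w, c = x.shape
    assert h % window == 0 and w % window == 0, "H and W must be divisible by window"
    x = x.reshape(h // window, window, w // window, window, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (H/P, W/P, P, P, C)
    return x.reshape(-1, window * window, c)


def grid_partition(x, grid):
    """Group an (H, W, C) feature map into grid x grid sets of strided tokens.

    Returns an array of shape (num_groups, grid * grid, C). Each group holds
    tokens sampled with stride H/grid and W/grid, so attention within a group
    mixes tokens spread over the whole feature map.
    """
    h, w, c = x.shape
    assert h % grid == 0 and w % grid == 0, "H and W must be divisible by grid"
    x = x.reshape(grid, h // grid, grid, w // grid, c)
    x = x.transpose(1, 3, 0, 2, 4)  # (H/G, W/G, G, G, C)
    return x.reshape(-1, grid * grid, c)


# Toy feature map whose single channel stores the flat token index.
feature_map = np.arange(64).reshape(8, 8, 1)
blocks = block_partition(feature_map, window=4)
grids = grid_partition(feature_map, grid=4)

print(blocks[0, :, 0])  # tokens from the top-left 4x4 patch
print(grids[0, :, 0])   # tokens from every second row and column
```

With a fixed window or grid size P, each group contains P² tokens and there are H·W/P² groups, so the total pairwise attention cost is H·W·P², which grows linearly with the number of tokens H·W.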

MaxViT meta-architecture:



Results on ImageNet-1k train and test:



Results on ImageNet-21k and JFT pre-trained models:



## Colab Demo

We have released a Google Colab tutorial that shows how to run MaxViT on images. Try it here: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/google-research/maxvit/blob/master/MaxViT_tutorial.ipynb)

## Pretrained MaxViT Checkpoints

The following table lists results and checkpoints for the ImageNet-1K models:

| Name | Resolution | Top-1 Acc. | #Params | FLOPs | Model |
| ---------- | ---------| ------ | ------ | ------ | ------ |
| MaxViT-T | 224x224 | 83.62% | 31M | 5.6B | [ckpt](https://console.cloud.google.com/storage/browser/gresearch/maxvit/ckpts/maxvittiny/i1k/224) |
| MaxViT-T | 384x384 | 85.24% | 31M | 17.7B | [ckpt](https://console.cloud.google.com/storage/browser/gresearch/maxvit/ckpts/maxvittiny/i1k/384) |
| MaxViT-T | 512x512 | 85.72% | 31M | 33.7B | [ckpt](https://console.cloud.google.com/storage/browser/gresearch/maxvit/ckpts/maxvittiny/i1k/512) |
| MaxViT-S | 224x224 | 84.45% | 69M | 11.7B | [ckpt](https://console.cloud.google.com/storage/browser/gresearch/maxvit/ckpts/maxvitsmall/i1k/224) |
| MaxViT-S | 384x384 | 85.74% | 69M | 36.1B | [ckpt](https://console.cloud.google.com/storage/browser/gresearch/maxvit/ckpts/maxvitsmall/i1k/384) |
| MaxViT-S | 512x512 | 86.19% | 69M | 67.6B | [ckpt](https://console.cloud.google.com/storage/browser/gresearch/maxvit/ckpts/maxvitsmall/i1k/512) |
| MaxViT-B | 224x224 | 84.95% | 119M | 24.2B | [ckpt](https://console.cloud.google.com/storage/browser/gresearch/maxvit/ckpts/maxvitbase/i1k/224) |
| MaxViT-B | 384x384 | 86.34% | 119M | 74.2B | [ckpt](https://console.cloud.google.com/storage/browser/gresearch/maxvit/ckpts/maxvitbase/i1k/384) |
| MaxViT-B | 512x512 | 86.66% | 119M | 138.5B | [ckpt](https://console.cloud.google.com/storage/browser/gresearch/maxvit/ckpts/maxvitbase/i1k/512) |
| MaxViT-L | 224x224 | 85.17% | 212M | 43.9B | [ckpt](https://console.cloud.google.com/storage/browser/gresearch/maxvit/ckpts/maxvitlarge/i1k/224) |
| MaxViT-L | 384x384 | 86.40% | 212M | 133.1B | [ckpt](https://console.cloud.google.com/storage/browser/gresearch/maxvit/ckpts/maxvitlarge/i1k/384) |
| MaxViT-L | 512x512 | 86.70% | 212M | 245.4B | [ckpt](https://console.cloud.google.com/storage/browser/gresearch/maxvit/ckpts/maxvitlarge/i1k/512) |
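
As a quick sanity check on the table above, the following sketch compares how FLOPs grow with input resolution against how the pixel count grows, using the MaxViT-T rows. The helper name `scaling_ratios` is an assumption made for this example; the FLOPs values are copied from the table.

```python
def scaling_ratios(flops_by_resolution, base):
    """Return {resolution: (pixel_ratio, flops_ratio)} relative to `base`.

    pixel_ratio is (resolution / base) ** 2 and flops_ratio is the reported
    FLOPs at that resolution divided by the reported FLOPs at `base`.
    """
    base_flops = flops_by_resolution[base]
    return {
        res: ((res / base) ** 2, flops / base_flops)
        for res, flops in flops_by_resolution.items()
    }


# MaxViT-T FLOPs in billions, taken from the table above.
maxvit_t_flops = {224: 5.6, 384: 17.7, 512: 33.7}

for res, (pixels, flops) in scaling_ratios(maxvit_t_flops, base=224).items():
    print(f"{res}x{res}: pixels x{pixels:.2f}, FLOPs x{flops:.2f}")
```

For these rows the FLOPs ratio stays close to the pixel ratio, although it grows somewhat faster. The table alone does not explain the gap.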

Here is a list of ImageNet-21K pretrained and ImageNet-1K fine-tuned models:

| Name | Resolution | Top-1 Acc. | #Params | FLOPs | 21k model | 1k model |
| ---------- | ------ | ------ | ------ | ------ | ------ | --------|
| MaxViT-B | 224x224 | - | 119M | 24.2B | [ckpt](https://console.cloud.google.com/storage/browser/gresearch/maxvit/ckpts/maxvitbase/i21k_pt/224) | - |
| MaxViT-B | 384x384 | - | 119M | 74.2B | - | [ckpt](https://console.cloud.google.com/storage/browser/gresearch/maxvit/ckpts/maxvitbase/i21k_i1k/384) |
| MaxViT-B | 512x512 | - | 119M | 138.5B | - | [ckpt](https://console.cloud.google.com/storage/browser/gresearch/maxvit/ckpts/maxvitbase/i21k_i1k/512) |
| MaxViT-L | 224x224 | - | 212M | 43.9B | [ckpt](https://console.cloud.google.com/storage/browser/gresearch/maxvit/ckpts/maxvitlarge/i21k_pt/224) | - |
| MaxViT-L | 384x384 | - | 212M | 133.1B | - | [ckpt](https://console.cloud.google.com/storage/browser/gresearch/maxvit/ckpts/maxvitlarge/i21k_i1k/384) |
| MaxViT-L | 512x512 | - | 212M | 245.4B | - | [ckpt](https://console.cloud.google.com/storage/browser/gresearch/maxvit/ckpts/maxvitlarge/i21k_i1k/512) |
| MaxViT-XL | 224x224 | - | 475M | 97.8B | [ckpt](https://console.cloud.google.com/storage/browser/gresearch/maxvit/ckpts/maxvitxlarge/i21k_pt/224) | - |
| MaxViT-XL | 384x384 | - | 475M | 293.7B | - | [ckpt](https://console.cloud.google.com/storage/browser/gresearch/maxvit/ckpts/maxvitxlarge/i21k_i1k/384) |
| MaxViT-XL | 512x512 | - | 475M | 535.2B | - | [ckpt](https://console.cloud.google.com/storage/browser/gresearch/maxvit/ckpts/maxvitxlarge/i21k_i1k/512) |
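
The checkpoint links in the two tables above follow a regular pattern: model size, then training setup (`i1k`, `i21k_pt`, or `i21k_i1k`), then resolution. The sketch below rebuilds those console URLs for scripting purposes. The helper name `checkpoint_url` and the `AVAILABLE` lookup are hypothetical and only mirror the combinations listed in the tables; they are not part of this repository.

```python
CONSOLE_ROOT = (
    "https://console.cloud.google.com/storage/browser/gresearch/maxvit/ckpts"
)

# Combinations listed in the tables above: setup -> (model sizes, resolutions).
AVAILABLE = {
    "i1k": (("tiny", "small", "base", "large"), (224, 384, 512)),
    "i21k_pt": (("base", "large", "xlarge"), (224,)),
    "i21k_i1k": (("base", "large", "xlarge"), (384, 512)),
}


def checkpoint_url(size, setup, resolution):
    """Build the console URL for a checkpoint listed in the tables.

    Raises ValueError for combinations that the tables do not list.
    """
    if setup not in AVAILABLE:
        raise ValueError(f"unknown training setup: {setup!r}")
    sizes, resolutions = AVAILABLE[setup]
    if size not in sizes or resolution not in resolutions:
        raise ValueError(f"no listed checkpoint for {size}/{setup}/{resolution}")
    return f"{CONSOLE_ROOT}/maxvit{size}/{setup}/{resolution}"


print(checkpoint_url("tiny", "i1k", 224))
print(checkpoint_url("xlarge", "i21k_i1k", 512))
```

Keeping the listed combinations in a lookup makes a mistyped request fail early instead of producing a link to a path that the tables do not list.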

## Citation
Should you find this repository useful, please consider citing:
```
@article{tu2022maxvit,
title={MaxViT: Multi-Axis Vision Transformer},
author={Tu, Zhengzhong and Talebi, Hossein and Zhang, Han and Yang, Feng and Milanfar, Peyman and Bovik, Alan and Li, Yinxiao},
journal={ECCV},
year={2022},
}
```

## Other Related Works

* MAXIM: Multi-Axis MLP for Image Processing, CVPR 2022. [Paper](https://arxiv.org/abs/2201.02973) | [Code](https://github.com/google-research/maxim)
* CoBEVT: Cooperative Bird's Eye View Semantic Segmentation with Sparse Transformers, CoRL 2022. [Paper](https://arxiv.org/abs/2207.02202) | [Code](https://github.com/DerrickXuNu/CoBEVT)
* Improved Transformer for High-Resolution GANs, NeurIPS 2021. [Paper](https://arxiv.org/abs/2106.07631) | [Code](https://github.com/google-research/hit-gan)
* CoAtNet: Marrying Convolution and Attention for All Data Sizes, NeurIPS 2021. [Paper](https://arxiv.org/abs/2106.04803)
* EfficientNetV2: Smaller Models and Faster Training, ICML 2021. [Paper](https://arxiv.org/abs/2104.00298) | [Code](https://github.com/google/automl/tree/master/efficientnetv2)

**Acknowledgement:** This repository is built on [EfficientNets](https://github.com/google/automl) and [CoAtNet](https://arxiv.org/abs/2106.04803).