https://github.com/kyegomez/vit-rgts
Open source implementation of "Vision Transformers Need Registers"
- Host: GitHub
- URL: https://github.com/kyegomez/vit-rgts
- Owner: kyegomez
- License: MIT
- Created: 2023-10-04T23:08:32.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2025-01-27T15:58:08.000Z (9 months ago)
- Last Synced: 2025-04-03T22:07:15.786Z (6 months ago)
- Topics: attention-mechanism, gpt4, vision-api, vision-transformer, vit
- Language: Python
- Homepage: https://discord.gg/qUtxnK2NMf
- Size: 214 KB
- Stars: 166
- Watchers: 5
- Forks: 15
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE
README
[Join the Discord](https://discord.gg/qUtxnK2NMf)
# VISION TRANSFORMERS NEED REGISTERS
An open-source implementation of the ViT model from the paper "Vision Transformers Need Registers", which reaches SOTA on dense visual prediction tasks, enables object discovery methods with larger models, and yields smoother feature maps and attention maps for downstream visual processing. Register tokens enable interpretable attention maps in all vision transformers!
[Paper Link](https://arxiv.org/pdf/2309.16588.pdf)
# Appreciation
* Lucidrains
* Agorians

# Install

`pip install vit-rgts`

# Usage
```python
import torch
from vit_rgts.main import VitRGTS

# Instantiate the model with the README's hyperparameters
v = VitRGTS(
    image_size = 256,    # input image resolution (H = W = 256)
    patch_size = 32,     # side length of each square patch
    num_classes = 1000,  # size of the classification head
    dim = 1024,          # token embedding dimension
    depth = 6,           # number of transformer blocks
    heads = 16,          # attention heads per block
    mlp_dim = 2048,      # hidden dimension of the feed-forward MLP
    dropout = 0.1,       # dropout inside the transformer
    emb_dropout = 0.1    # dropout on the patch/positional embeddings
)

# Dummy batch: one 3-channel 256x256 image
img = torch.randn(1, 3, 256, 256)
preds = v(img) # (1, 1000)
print(preds)
```
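The Todo below mentions a training script; as a minimal sketch of what supervised training with this model could look like (the AdamW settings, batch size, step count, and the random tensors standing in for a real dataloader are illustrative assumptions, not part of the repository):

```python
import torch
import torch.nn.functional as F
from vit_rgts.main import VitRGTS

# Same configuration as the usage example above
model = VitRGTS(
    image_size=256, patch_size=32, num_classes=1000,
    dim=1024, depth=6, heads=16, mlp_dim=2048,
    dropout=0.1, emb_dropout=0.1,
)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # assumed settings

model.train()
for step in range(10):                         # stand-in for a real dataloader loop
    images = torch.randn(8, 3, 256, 256)       # fake batch of 8 images
    labels = torch.randint(0, 1000, (8,))      # fake class labels
    logits = model(images)                     # (8, 1000)
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"step {step}: loss {loss.item():.4f}")
```

# Architecture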
- Register tokens: additional learnable tokens appended to the input sequence that clean up the low-informative background areas of images in the attention maps; see the sketch below.
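A minimal, self-contained sketch of the register mechanism (this is not the repository's actual module; the class name `RegisterTokensDemo` and the stand-in PyTorch encoder are illustrative): learnable register tokens are concatenated to the patch tokens before the transformer and simply discarded at the output.

```python
import torch
import torch.nn as nn

class RegisterTokensDemo(nn.Module):
    """Illustrative wrapper: prepend learnable register tokens to a
    patch-token sequence, run a transformer encoder, then drop them."""

    def __init__(self, dim: int, num_register_tokens: int, encoder: nn.Module):
        super().__init__()
        # Learnable registers, shared across the batch
        self.registers = nn.Parameter(torch.randn(num_register_tokens, dim))
        self.encoder = encoder

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        b, n, d = patch_tokens.shape
        regs = self.registers.unsqueeze(0).expand(b, -1, -1)  # (b, r, d)
        x = torch.cat([regs, patch_tokens], dim=1)            # (b, r + n, d)
        x = self.encoder(x)                                   # attend over all tokens
        return x[:, self.registers.shape[0]:]                 # discard registers

# Usage with a vanilla PyTorch encoder as the stand-in transformer
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2,
)
block = RegisterTokensDemo(dim=64, num_register_tokens=4, encoder=encoder)
tokens = torch.randn(2, 49, 64)   # e.g. a 7x7 patch grid
out = block(tokens)               # (2, 49, 64): registers removed
```

Because the registers participate in attention but are thrown away at the end, they give the model a place to park global, high-norm computations that would otherwise contaminate background patch tokens.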
# Dataset Strategy

Key datasets mentioned in the paper, with their metadata and source links:

| Dataset | Type | Size | Tasks | Source |
|-|-|-|-|-|
| ImageNet-1k | Image Classification | 1.2M images, 1000 classes | Pretraining | http://www.image-net.org/ |
| ImageNet-22k | Image Classification | 14M images, 21841 classes | Pretraining | https://github.com/google-research-datasets/ImageNet-21k-P |
| INaturalist (IN1k) | Image Classification | 437K images, 1000 classes | Evaluation | https://github.com/visipedia/inat_comp/tree/master/2018 |
| Places205 (P205) | Image Classification | 2.4M images, 205 classes | Evaluation | http://places2.csail.mit.edu/index.html |
| Aircraft (Airc.) | Image Classification | 10K images, 100 classes | Evaluation | https://www.robots.ox.ac.uk/~vgg/data/fgvc-aircraft/ |
| CIFAR-10 (CF10) | Image Classification | 60K images, 10 classes | Evaluation | https://www.cs.toronto.edu/~kriz/cifar.html |
| CIFAR-100 (CF100) | Image Classification | 60K images, 100 classes | Evaluation | https://www.cs.toronto.edu/~kriz/cifar.html |
| CUB-200-2011 (CUB) | Image Classification | 11.8K images, 200 classes | Evaluation | http://www.vision.caltech.edu/visipedia/CUB-200-2011.html |
| Caltech 101 (Cal101) | Image Classification | 9K images, 101 classes | Evaluation | http://www.vision.caltech.edu/Image_Datasets/Caltech101/ |
| Stanford Cars (Cars) | Image Classification | 16K images, 196 classes | Evaluation | https://ai.stanford.edu/~jkrause/cars/car_dataset.html |
| Describable Textures (DTD) | Image Classification | 5640 images, 47 classes | Evaluation | https://www.robots.ox.ac.uk/~vgg/data/dtd/index.html |
| MPI Sintel (Flow.) | Optical Flow | 1041 images | Evaluation | http://sintel.is.tue.mpg.de/ |
| Food-101 (Food) | Image Classification | 101K images, 101 classes | Evaluation | https://www.vision.ee.ethz.ch/datasets_extra/food-101/ |
| Oxford-IIIT Pets (Pets) | Image Classification | 7349 images, 37 classes | Evaluation | https://www.robots.ox.ac.uk/~vgg/data/pets/ |
| SUN397 (SUN) | Scene Classification | 108K images, 397 classes | Evaluation | https://groups.csail.mit.edu/vision/SUN/ |
| PASCAL VOC 2007 (VOC) | Object Detection | 5011 images, 20 classes | Evaluation | http://host.robots.ox.ac.uk/pascal/VOC/voc2007/ |
| PASCAL VOC 2012 (VOC) | Object Detection | 11540 images, 20 classes | Evaluation | http://host.robots.ox.ac.uk/pascal/VOC/voc2012/ |
| COCO 2017 (COCO) | Object Detection | 118K images, 80 classes | Evaluation | https://cocodataset.org/#home |
| ADE20K (ADE20k) | Semantic Segmentation | 20K images, 150 classes | Evaluation | https://groups.csail.mit.edu/vision/datasets/ADE20K/ |
| NYU Depth V2 (NYUd) | Monocular Depth Estimation | 1449 images | Evaluation | https://cs.nyu.edu/~silberman/datasets/nyu_depth_v2.html |
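As a sketch of preparing one of the evaluation datasets above for this model (the 256x256 resize matches the README's `image_size`; the normalization statistics are the common ImageNet values, an assumption rather than something specified here, and a CIFAR-10 fine-tune would use `num_classes=10` instead of 1000):

```python
import torch
from torchvision import datasets, transforms

# Resize CIFAR-10 to the model's 256x256 input; ImageNet mean/std are assumed
transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

cifar10 = datasets.CIFAR10(root="./data", train=False,
                           download=True, transform=transform)
loader = torch.utils.data.DataLoader(cifar10, batch_size=8, shuffle=False)

images, labels = next(iter(loader))
print(images.shape)   # torch.Size([8, 3, 256, 256]) -- ready for VitRGTS
```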
# License

MIT

# Citations
```
@misc{2309.16588,
Author = {Timothée Darcet and Maxime Oquab and Julien Mairal and Piotr Bojanowski},
Title = {Vision Transformers Need Registers},
Year = {2023},
Eprint = {arXiv:2309.16588},
}
```

# Todo
- [x] Make a new training script
- [x] Make a table of datasets used in the paper
- [ ] Make a blog article on architecture and applications
- [ ] Clean up operations: remove redundancy in the attention, transformer, and ViT modules