https://github.com/sayakpaul/probing-vits

Probing the representations of Vision Transformers.

attention explaining-vits image-recognition keras pre-training self-supervision tensorflow transformers vits


Host: GitHub
URL: https://github.com/sayakpaul/probing-vits
Owner: sayakpaul
License: apache-2.0
Created: 2022-03-12T07:44:18.000Z (over 3 years ago)
Default Branch: main
Last Pushed: 2022-10-05T11:41:11.000Z (over 2 years ago)
Last Synced: 2025-03-25T13:46:11.742Z (3 months ago)
Topics: attention, explaining-vits, image-recognition, keras, pre-training, self-supervision, tensorflow, transformers, vits
Language: Jupyter Notebook
Homepage: https://keras.io/examples/vision/probing_vits/
Size: 33.3 MB
Stars: 321
Watchers: 9
Forks: 20
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE


# Probing ViTs

[![TensorFlow 2.8](https://img.shields.io/badge/TensorFlow-2.8-FF6F00?logo=tensorflow)](https://github.com/tensorflow/tensorflow/releases/tag/v2.8.0)

[![Hugging Face badge](https://img.shields.io/badge/🤗%20Hugging%20Face-Spaces-yellow.svg)](https://huggingface.co/spaces)

_By [Aritra Roy Gosthipaty](https://github.com/ariG23498) and [Sayak Paul](https://github.com/sayakpaul) (equal contribution)_

In this repository, we provide tools to probe into the representations learned by different families of Vision Transformers (supervised pre-training with ImageNet-21k, ImageNet-1k, distillation, self-supervised pre-training):

* Original ViT [1] 

* DeiT [2]

* DINO [3]

We hope these tools prove useful for the community. Please follow along with [this post on keras.io](https://keras.io/examples/vision/probing_vits/) for easier navigation through the repository.

**Updates**

* June 3, 2022: The project got the [Google OSS Expert Prize](https://www.kaggle.com/discussions/general/328914).

* May 10, 2022: The project got a mention from Yannic Kilcher in [ML News](https://youtu.be/pwSnC8jlh50?t=712). Thanks, Yannic!

* May 4, 2022: We're glad to receive the [#TFCommunitySpotlight award](https://twitter.com/TensorFlow/status/1521558632768409600?s=20&t=hXgrZOfT_26AuTC_RyCZ_g) for this project.

## Self-attention visualization

| Original Image | Attention Maps | Attention Maps Overlaid |
| :--: | :--: | :--: |
| ![original image](./assets/bird.png) | ![attention maps](./assets/dino_attention_heads_inferno.png) | ![attention maps overlay](./assets/dino_attention_heads.png) |

https://user-images.githubusercontent.com/36856589/162609884-8e51156e-d461-421d-9f8a-4d4e48967bd6.mp4

Original Video Source

https://user-images.githubusercontent.com/36856589/162609907-4e432dc4-a731-40f4-9a20-94e0c8f648bc.mp4

Original Video Source

## Supervised salient representations

In the [DINO](https://ai.facebook.com/blog/dino-paws-computer-vision-with-self-supervised-transformers-and-10x-more-efficient-training/) blog post, the authors show a video with the following caption:

> The original video is shown on the left. In the middle is a segmentation example generated by a supervised model, and on the right is one generated by DINO. 

A screenshot of the video is as follows:




We obtain the attention maps generated with the supervised pre-trained model and find that they are less salient than those of the DINO model. We observe similar behaviour in our experiments as well. The figure below shows the attention heatmaps extracted with a ViT-B16 model pre-trained (supervised) using ImageNet-1k:

| Dinosaur | Dog |
| :--: | :--: |
| ![](./assets/supervised-dino.gif) | ![](./assets/supervised-dog.gif) |

We used this [Colab Notebook](https://github.com/sayakpaul/probing-vits/blob/main/notebooks/vitb16-attention-maps-video.ipynb) to conduct this experiment.

## Hugging Face Spaces

You can now probe into the ViTs with your own input images.

| Attention Heat Maps | Attention Rollout |
| :--: | :--: |
| [![Generic badge](https://img.shields.io/badge/🤗%20Spaces-Attention%20Heat%20Maps-black.svg)](https://huggingface.co/spaces/probing-vits/attention-heat-maps) | [![Generic badge](https://img.shields.io/badge/🤗%20Spaces-Attention%20Rollout-black.svg)](https://huggingface.co/spaces/probing-vits/attention-rollout) |

## Visualizing mean attention distances
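
Mean attention distance [1, 4] weights the pixel distance between a query patch and every key patch by the attention the query assigns to that key, then averages over queries, which gives one value per attention head. The following is a minimal NumPy sketch of that computation, not the repository's TensorFlow code: the function names, the `num_prefix_tokens` argument, and the choice to drop class-token rows and columns without renormalizing are assumptions made for illustration.

```python
import numpy as np


def patch_distance_matrix(grid_size, patch_size):
    """Pairwise pixel distances between patch positions on a square grid."""
    rows, cols = np.divmod(np.arange(grid_size * grid_size), grid_size)
    coords = np.stack([rows, cols], axis=-1).astype(float)
    diffs = coords[:, None, :] - coords[None, :, :]
    return patch_size * np.linalg.norm(diffs, axis=-1)


def mean_attention_distance(attention, patch_size, num_prefix_tokens=1):
    """Return one mean attention distance per head.

    attention: softmax weights of shape (num_heads, tokens, tokens) from a
        single transformer block.
    num_prefix_tokens: hypothetical argument for the number of non-patch
        tokens (1 for a class token, 2 if a distillation token is present).
    """
    # Keep only patch-to-patch attention (assumption: no renormalization).
    patch_attn = attention[:, num_prefix_tokens:, num_prefix_tokens:]
    num_patches = patch_attn.shape[-1]
    grid_size = int(round(np.sqrt(num_patches)))
    distances = patch_distance_matrix(grid_size, patch_size)
    # Attention-weighted distance per query, then the mean over queries.
    per_query = (patch_attn * distances[None]).sum(axis=-1)
    return per_query.mean(axis=-1)
```

Calling `mean_attention_distance` on the attention weights of each block and plotting the per-head values against block depth gives the kind of plot this section refers to. Small values indicate heads that attend locally, large values indicate heads that attend globally.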







## Methods

**We don't propose any novel methods for probing the representations of neural networks. Instead, we take existing methods and implement them in TensorFlow.**

* Mean attention distance [1, 4]

* Attention Rollout [5]

* Visualization of the learned projection filters [1]

* Visualization of the learned positional embeddings

* Attention maps from individual attention heads [3]

* Generation of attention heatmaps from videos [3]
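
To make Attention Rollout [5] from the list above concrete, here is a minimal NumPy sketch, not the repository's TensorFlow implementation. It assumes the heads are fused by averaging and that the residual connection is modelled by adding the identity matrix and renormalizing each row. The function names are hypothetical.

```python
import numpy as np


def attention_rollout(attentions):
    """Multiply residual-adjusted attention matrices across blocks.

    attentions: list of arrays of shape (num_heads, tokens, tokens), one per
        transformer block, ordered from the first block to the last.
    """
    num_tokens = attentions[0].shape[-1]
    identity = np.eye(num_tokens)
    rollout = identity.copy()
    for attn in attentions:
        fused = attn.mean(axis=0)  # assumption: average over heads
        fused = fused + identity  # account for the residual connection
        fused = fused / fused.sum(axis=-1, keepdims=True)
        rollout = fused @ rollout  # later blocks multiply from the left
    return rollout


def cls_rollout_map(rollout, grid_size):
    """Class-token row of the rollout, reshaped to the patch grid."""
    return rollout[0, 1:].reshape(grid_size, grid_size)
```

The resulting `grid_size` x `grid_size` map can be resized to the input resolution and overlaid on the image. This sketch assumes a single class token at index 0.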

Another interesting repository that also visualizes ViTs in PyTorch: https://github.com/jacobgil/vit-explain.

## Notes

We first implemented the above-mentioned architectures in TensorFlow and then populated them with the pre-trained parameters from the official codebases. To validate this, we evaluated the implementations on the ImageNet-1k validation set and confirmed that the top-1 accuracies matched the reported ones.

We value the spirit of open source. So, if you spot any bugs in the code or see scope for improvement, don't hesitate to open an issue or contribute a PR. We'd very much appreciate it.

## Navigating through the codebase

Our ViT implementations are in `vit`. The `notebooks` directory contains the following utility notebooks:

* [`dino-attention-maps-video.ipynb`](https://github.com/sayakpaul/probing-vits/blob/main/notebooks/dino-attention-maps-video.ipynb) shows how to generate attention heatmaps from a video. Visually, the best results were obtained with DINO.

* [`dino-attention-maps.ipynb`](https://github.com/sayakpaul/probing-vits/blob/main/notebooks/dino-attention-maps.ipynb) shows how to generate attention maps from the individual attention heads of the final transformer block. Visually, the best results were obtained with DINO.

* [`load-dino-weights-vitb16.ipynb`](https://github.com/sayakpaul/probing-vits/blob/main/notebooks/load-dino-weights-vitb16.ipynb) shows how to populate the pre-trained DINO parameters into our implementation (only for ViT-B16, but it can easily be extended to others).

* [`load-jax-weights-vitb16.ipynb`](https://github.com/sayakpaul/probing-vits/blob/main/notebooks/load-jax-weights-vitb16.ipynb) shows how to populate the pre-trained ViT parameters into our implementation (only for ViT-B16, but it can easily be extended to others).

* [`mean-attention-distance-1k.ipynb`](https://github.com/sayakpaul/probing-vits/blob/main/notebooks/mean-attention-distance-1k.ipynb) shows how to plot mean attention distances of different transformer blocks of different ViTs computed over 1000 images.

* [`single-instance-probing.ipynb`](https://github.com/sayakpaul/probing-vits/blob/main/notebooks/single-instance-probing.ipynb) shows how to compute the mean attention distance and the attention-rollout map for a single prediction instance.

* [`visualizing-linear-projections.ipynb`](https://github.com/sayakpaul/probing-vits/blob/main/notebooks/visualizing-linear-projections.ipynb) shows visualizations of the linear projection filters learned by ViTs.

* [`visualizing-positional-embeddings.ipynb`](https://github.com/sayakpaul/probing-vits/blob/main/notebooks/visualizing-positional-embeddings.ipynb) shows visualizations of the similarities of the positional embeddings learned by ViTs.
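
As an illustration of the positional-embedding similarities mentioned in the last item, the sketch below compares every patch position's embedding with all the others using cosine similarity. Cosine similarity is an assumption here (it is the measure used in [1]), and the function name and `num_prefix_tokens` argument are hypothetical. The notebook remains the reference for the actual procedure.

```python
import numpy as np


def positional_embedding_similarity(pos_embed, num_prefix_tokens=1):
    """Cosine similarity between all pairs of patch positional embeddings.

    pos_embed: array of shape (tokens, dim) holding the learned positional
        embeddings, with any class-token entries first.
    Returns an array of shape (grid, grid, grid, grid) indexed as
    [query_row, query_col, key_row, key_col].
    """
    patches = pos_embed[num_prefix_tokens:]  # drop class-token embedding(s)
    normed = patches / np.linalg.norm(patches, axis=-1, keepdims=True)
    similarity = normed @ normed.T
    grid = int(round(np.sqrt(similarity.shape[0])))
    return similarity.reshape(grid, grid, grid, grid)
```

Plotting `similarity[row, col]` as a heatmap for each `(row, col)` gives one tile per patch position.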

DeiT-related code lives in a separate repository: https://github.com/sayakpaul/deit-tf.

## Models

Here are the links to the models populated with the pre-trained parameters:

* [Original ViT model (pretrained on ImageNet-21k and fine-tuned on ImageNet-1k)](https://huggingface.co/probing-vits/vit_b16_patch16_224_i21k_i1k)

* [Original ViT model (pretrained on ImageNet-1k)](https://huggingface.co/probing-vits/vit_b16_patch16_224_i1k)

* [DINO model (pretrained on ImageNet-1k)](https://huggingface.co/probing-vits/vit-dino-base16)

* [DeiT models (pretrained on ImageNet-1k including distilled and non-distilled ones)](https://tfhub.dev/sayakpaul/collections/deit/1)

## Training and visualizing with small datasets

Coming soon!

## References

[1] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929

[2] DeiT: https://arxiv.org/abs/2012.12877

[3] DINO: https://arxiv.org/abs/2104.14294

[4] Do Vision Transformers See Like Convolutional Neural Networks?: https://arxiv.org/abs/2108.08810

[5] Quantifying Attention Flow in Transformers: https://arxiv.org/abs/2005.00928

## Acknowledgements

- [PyImageSearch](https://pyimagesearch.com)

- [Jarvislabs.ai](https://jarvislabs.ai/)

- [GDE Program](https://developers.google.com/programs/experts/)
