# Vision Transformer (ViT)
An implementation of the Vision Transformer (ViT) in PyTorch. ViT was introduced in the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://openreview.net/forum?id=YicbFdNTTy).
![](assets/vit.png)
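
To make the "16x16 words" idea concrete, here is a minimal sketch of patch embedding in PyTorch: the image is cut into non-overlapping patches and each patch is linearly projected into a token. The class name and defaults (taken from the CIFAR10 config further below) are illustrative, not this repo's code.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and linearly embed them (sketch)."""

    def __init__(self, in_channels=3, patch_size=4, embed_dim=256):
        super().__init__()
        # A conv with kernel_size == stride == patch_size is equivalent to
        # flattening each patch and applying a shared linear projection.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                  # (B, embed_dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)  # (B, num_patches, embed_dim)
        return x

# A CIFAR10-sized batch: 32x32 images -> 64 patches of 4x4 px
tokens = PatchEmbedding()(torch.randn(2, 3, 32, 32))
print(tokens.shape)  # torch.Size([2, 64, 256])
```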

# Implementation
The ViT code in this repo is based on the Japanese book ["Vision Transformer 入門"](https://gihyo.jp/book/2022/978-4-297-13058-9). I added code for dataset preparation and a training procedure using [CIFAR10](https://www.cs.toronto.edu/~kriz/cifar.html).
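
As a rough idea of what the CIFAR10 preparation might look like with torchvision (the repo's own transforms and options may differ):

```python
import torch
from torchvision import datasets, transforms

# Hypothetical example; the repo's own dataset code may use different transforms.
transform = transforms.Compose([
    transforms.ToTensor(),  # HWC uint8 -> CHW float in [0, 1]
])

train_set = datasets.CIFAR10(root="./datasets", train=True,
                             download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

images, labels = next(iter(train_loader))
print(images.shape)  # torch.Size([32, 3, 32, 32])
```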

# Usage
```
python run.py [-h] [-s SEED] FILE

positional arguments:
  FILE                  path to config file

options:
  -h, --help            show this help message and exit
  -s SEED, --seed SEED  seed for initializing training
```
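
An interface like this can be built with argparse; the sketch below reproduces the options shown above but is only an approximation of the repo's `run.py` (the positional argument's destination name is an assumption):

```python
import argparse

def parse_args():
    # Illustrative sketch of the CLI above; names other than FILE/--seed are assumptions.
    parser = argparse.ArgumentParser(prog="run.py")
    parser.add_argument("config", metavar="FILE", help="path to config file")
    parser.add_argument("-s", "--seed", type=int, default=None,
                        help="seed for initializing training")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    print(args.config, args.seed)
```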

# Example
```
python run.py examples/CIFAR10/config.ini
```

# Config
The following lists the available settings and what they mean.
Parameters are based on the ViT experiment [conducted by GMO](https://recruit.gmo.jp/engineer/jisedai/blog/vision_transformer/).
```ini
[dataset]
dir = ./datasets ; training data save directory
name = CIFAR10 ; dataset name, only CIFAR10 is supported
in_channels = 3 ; number of channels
image_size = 32 ; image size; 32x32
num_classes = 10 ; 10 class classification

[dataloader]
batch_size = 32
shuffle = true

[model]
patch_size = 4 ; use 4 x 4 px patches
embed_dim = 256 ; equivalent to dim=256 in `vit-pytorch`
num_blocks = 3 ; equivalent to depth=3 in `vit-pytorch`
heads = 4 ; number of attention heads
hidden_dim = 256 ; equivalent to mlp_dim=256 in `vit-pytorch`
dropout = 0.1 ; dropout ratio

[learning]
epochs = 20
learning_rate = 0.001
```
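
For reference, a hedged sketch of how these options could be read with `configparser` (the inline `;` comments require `inline_comment_prefixes`); the parameter grouping is illustrative and not necessarily how this repo loads its config:

```python
import configparser

config = configparser.ConfigParser(inline_comment_prefixes=(";",))
config.read("examples/CIFAR10/config.ini")

model_cfg = config["model"]
params = {
    "patch_size": model_cfg.getint("patch_size"),   # 4x4 px patches -> 64 tokens per 32x32 image
    "embed_dim":  model_cfg.getint("embed_dim"),    # token embedding dimension
    "num_blocks": model_cfg.getint("num_blocks"),   # number of transformer encoder blocks
    "heads":      model_cfg.getint("heads"),        # attention heads per block
    "hidden_dim": model_cfg.getint("hidden_dim"),   # MLP hidden size inside each block
    "dropout":    model_cfg.getfloat("dropout"),
}
print(params)
```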

# Result
ViT only becomes highly accurate when pre-trained on a large image dataset (such as [JFT-300M](https://paperswithcode.com/dataset/jft-300m)), so training from scratch on CIFAR10 alone, as this code does, does not drive the cross-entropy loss very low.
```
[2022-09-23 11:52:17] :vision_transformer.utils.logger: [INFO] loss: 2.0047439576718755
[2022-09-23 11:52:38] :vision_transformer.utils.logger: [INFO] loss: 1.8455862294370755
...
[2022-09-23 11:58:37] :vision_transformer.utils.logger: [INFO] loss: 1.2203882005268012
[2022-09-23 11:58:58] :vision_transformer.utils.logger: [INFO] loss: 1.2218489825915986
```
The same trend was observed in the [GMO experiment](https://recruit.gmo.jp/engineer/jisedai/blog/vision_transformer/).