https://github.com/jaykef/min-patchnizer

Minimal, clean code for video/image "patchnization" - a process commonly used in tokenizing visual data for use in a Transformer encoder.
computer-vision nlp patchnization tokenization transformer

Host: GitHub
URL: https://github.com/jaykef/min-patchnizer
Owner: Jaykef
Created: 2024-02-27T11:10:12.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2024-05-16T23:02:28.000Z (about 1 year ago)
Last Synced: 2024-05-17T00:21:45.473Z (about 1 year ago)
Topics: computer-vision, nlp, patchnization, tokenization, transformer
Language: Python
Homepage:
Size: 4.4 MB
Stars: 8
Watchers: 3
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md

# min-patchnizer

Minimal, clean code for video/image "patchnization" - a process commonly used in tokenizing visual data for use in a Transformer encoder. The code here first extracts still images (frames) from a video, splits the image frames into smaller fixed-size patches, linearly embeds each of them, adds position embeddings, and then saves the resulting sequence of vectors for use in a Vision Transformer encoder. I tried training Karpathy's minbpe on the resulting sequence vectors, and it took 2173.45 seconds per frame to tokenize. The whole "patchnization" took ~77.40s for a 20s video on my M2 Air.
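As a rough illustration of the patch and embedding steps described above, here is a minimal NumPy sketch. It is not the code from `patchnizer.py`: the function names `image_to_patches` and `embed_patches`, the random (untrained) projection, the `embed_dim` of 64, and the 32x48 test frame are all assumptions made for the example.

```python
import numpy as np


def image_to_patches(image, patch_size=16):
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    with patches ordered row by row.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("height and width must be multiples of patch_size")
    rows, cols = h // patch_size, w // patch_size
    grid = image.reshape(rows, patch_size, cols, patch_size, c)
    grid = grid.transpose(0, 2, 1, 3, 4)  # (rows, cols, P, P, C)
    return grid.reshape(rows * cols, patch_size * patch_size * c)


def embed_patches(patches, embed_dim=64, seed=0):
    """Linearly project patches and add one position embedding per patch.

    The weights are random here (hypothetical); a real model would learn them.
    """
    rng = np.random.default_rng(seed)
    num_patches, patch_dim = patches.shape
    projection = rng.normal(0.0, 0.02, size=(patch_dim, embed_dim))
    position_embeddings = rng.normal(0.0, 0.02, size=(num_patches, embed_dim))
    return patches @ projection + position_embeddings


# Hypothetical 32x48 RGB frame: 2 rows x 3 columns of 16x16 patches.
frame = np.arange(32 * 48 * 3, dtype=np.float32).reshape(32, 48, 3) / 255.0
patches = image_to_patches(frame)
tokens = embed_patches(patches)
print(patches.shape, tokens.shape)  # (6, 768) (6, 64)
```

Each row of `tokens` is one patch vector; the sequence of rows is what a Transformer encoder would consume.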

![IMG_5672](https://github.com/Jaykef/min-patchnizer/assets/11355002/de2eb521-58d5-4308-b061-19a32217cbb2)

The files in this repo work as follows:

patchnizer.py: Holds code for a simple implementation of the three stages involved: extract_image_frames from a video, reduce image_frames_to_patches of fixed size (16x16 pixels), then linearly_embed_patches into a 1D vector sequence with additional position embeddings.

patchnize.py: Performs the whole process with custom configs (patch_size, created dirs, video - I am using the "dogs playing in snow" video by Sora).

train.py: Trains Karpathy's minbpe (a minimal implementation of the byte-pair encoding algorithm) on the resulting one-dimensional vector sequence (linear_patch_embeddings + position_embeddings).

check.py: Checks whether the patch embeddings match the original image patches and then reconstructs the original image frames - this basically just does the reverse of linear embedding.
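To make "the reverse of linear embedding" concrete, the sketch below runs a round trip in NumPy. It is not the code from `check.py`. It assumes the embedding dimension is at least as large as the flattened patch dimension, so that a pseudo-inverse of the projection recovers the patches exactly. The helper `patches_to_image` and all sizes are hypothetical.

```python
import numpy as np


def patches_to_image(patches, height, width, channels, patch_size):
    """Reassemble row-ordered flattened patches into an (H, W, C) image."""
    rows, cols = height // patch_size, width // patch_size
    grid = patches.reshape(rows, cols, patch_size, patch_size, channels)
    grid = grid.transpose(0, 2, 1, 3, 4)  # (rows, P, cols, P, C)
    return grid.reshape(height, width, channels)


rng = np.random.default_rng(0)
P, H, W, C = 4, 8, 12, 3  # small hypothetical sizes
frame = rng.random((H, W, C))

# Forward pass: patchify, project, add position embeddings.
patches = (
    frame.reshape(H // P, P, W // P, P, C)
    .transpose(0, 2, 1, 3, 4)
    .reshape(-1, P * P * C)
)
projection = rng.normal(size=(P * P * C, 64))  # 48 -> 64, so it is invertible from the right
positions = rng.normal(size=(patches.shape[0], 64))
embeddings = patches @ projection + positions

# Reverse pass: remove positions, undo the projection, reassemble the frame.
recovered = (embeddings - positions) @ np.linalg.pinv(projection)
reconstructed = patches_to_image(recovered, H, W, C, P)

print(np.allclose(recovered, patches), np.allclose(reconstructed, frame))
```

If the embedding dimension were smaller than the patch dimension, the projection would lose information and the reconstruction would only be approximate.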

The whole process builds on the approach introduced in the Vision Transformer paper: "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale."
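In that paper's notation, and leaving out the class token that the paper prepends (this README does not mention one), the embedding step can be written as below. Here $\mathbf{x}_p^i$ is the $i$-th flattened patch, $\mathbf{E}$ the linear projection, and $\mathbf{E}_{pos}$ the position embeddings. The 224x224 RGB frame in the worked numbers is an assumed example, not the resolution used in this repo.

```latex
\mathbf{z}_0 = \left[\mathbf{x}_p^1\mathbf{E};\ \mathbf{x}_p^2\mathbf{E};\ \dots;\ \mathbf{x}_p^N\mathbf{E}\right] + \mathbf{E}_{pos},
\qquad \mathbf{E} \in \mathbb{R}^{(P^2 C) \times D},
\quad \mathbf{E}_{pos} \in \mathbb{R}^{N \times D},
\quad N = \frac{HW}{P^2}

% Worked example (assumed sizes): H = W = 224, C = 3, P = 16
% N = (224 \cdot 224) / 16^2 = 196 patches, each of length P^2 C = 16 \cdot 16 \cdot 3 = 768
```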

YouTube Video: Watch Demo

## Usage

First patchnize:
```
python patchnize.py
```

Next check:
```
python check.py
```

Then train:
```
python train.py
```
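For a sense of what byte-pair encoding does during the training step, here is a small self-contained sketch of the merge loop. It is not minbpe's code and not what `train.py` runs. How the embedding vectors are turned into a byte sequence is not stated in this README, so the quantization to 0..255 below is purely an assumption, as are the function names.

```python
from collections import Counter


def most_common_pair(ids):
    """Return the most frequent adjacent pair of ids, or None if there is none."""
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get) if pairs else None


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(ids, num_merges):
    """Repeatedly merge the most frequent pair; new ids start at 256."""
    ids = list(ids)
    merges = {}
    for step in range(num_merges):
        pair = most_common_pair(ids)
        if pair is None:
            break
        new_id = 256 + step
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


# Hypothetical input: embedding values in [0, 1) quantized to single bytes.
values = [0.1, 0.5, 0.1, 0.5, 0.9, 0.1, 0.5]
byte_ids = [min(255, int(v * 256)) for v in values]
tokens, merges = train_bpe(byte_ids, num_merges=2)
print(byte_ids, tokens, merges)
```

Each merge shortens the sequence by replacing a frequent pair with a single new token, which is why the vocabulary grows as training proceeds.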

## References

SORA Technical Report

"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", Alexey Dosovitskiy et al.

minbpe by karpathy

## License
MIT
