https://github.com/andrewboessen/simple-1d-tokenizer
Simple 1D image tokenizer from the paper An Image is Worth 32 Tokens for Reconstruction and Generation
- Host: GitHub
- URL: https://github.com/andrewboessen/simple-1d-tokenizer
- Owner: AndrewBoessen
- License: MIT
- Created: 2024-10-22T17:57:05.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-11-15T16:09:14.000Z (over 1 year ago)
- Last Synced: 2024-11-15T17:20:56.170Z (over 1 year ago)
- Topics: image-tokens, vector-quantization, vision-transformer
- Language: Python
- Homepage:
- Size: 1.03 MB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# 1D Tokenizer
Simple 1D image tokenizer from the paper [_An Image is Worth 32 Tokens for Reconstruction and Generation_](https://arxiv.org/pdf/2406.07550)

# Image Tokenizer
A neural architecture that encodes images into sequences of discrete tokens, using a Vision Transformer and Vector Quantization to enable efficient image compression and representation learning.
## Overview
The Image Tokenizer converts images into sequences of discrete tokens using a two-stage process:
1. Vision Transformer (ViT) for learning spatial relationships
2. Vector Quantization (VQ) for discretization

This architecture enables efficient image compression, learned discrete representations, and interpretable latent spaces suitable for downstream tasks.
## Architecture Details
### Image Tokenizer Pipeline
The image tokenizer processes input images $x \in \mathbb{R}^{H \times W \times C}$ through the following stages:
1. Patch embedding and tokenization
2. Transformer-based contextual encoding
3. Vector quantization
4. Discrete token generation
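
A minimal end-to-end sketch of these four stages, assuming PyTorch (the repo is Python, but the module structure, shapes, and names here are illustrative, not the repo's actual API):

```python
import torch
import torch.nn as nn

# Toy forward pass using the hyperparameters from "Model Configuration" below.
B, C, H, W, P, D, K = 1, 3, 256, 256, 64, 1024, 8192
N = (H * W) // (P * P)                                   # 16 patches

x = torch.rand(B, C, H, W) * 2 - 1                       # RGB image in [-1, 1]

# 1. Patch embedding and tokenization
patches = x.unfold(2, P, P).unfold(3, P, P)              # (B, C, H/P, W/P, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, N, C * P * P)
z0 = nn.Linear(C * P * P, D)(patches)                    # (B, N, D)
z0 = z0 + torch.randn(1, N, D)                           # stand-in for learned E_pos

# 2. Transformer-based contextual encoding
layer = nn.TransformerEncoderLayer(D, nhead=16, batch_first=True, norm_first=True)
zL = nn.TransformerEncoder(layer, num_layers=12)(z0)     # (B, N, D)

# 3-4. Vector quantization and discrete token generation
codebook = torch.randn(K, D)
tokens = torch.cdist(zL, codebook.expand(B, K, D)).argmin(-1)  # (B, N) indices
z_q = codebook[tokens]                                   # (B, N, D) quantized vectors
```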
### Vision Transformer (ViT)

The Vision Transformer processes images through several key stages:
1. **Patch Embedding:**
- Input image $x \in \mathbb{R}^{H \times W \times C}$ is divided into $N = \frac{HW}{P^2}$ patches
- Each patch $x_p \in \mathbb{R}^{P^2 \cdot C}$ is projected to dimension $D$
- Result: sequence of patch embeddings $z_0 \in \mathbb{R}^{N \times D}$
2. **Position Encoding:**
- Learned position embeddings $E_{pos} \in \mathbb{R}^{N \times D}$ added to patch embeddings
- Input sequence: $z_0 + E_{pos}$
3. **Transformer Encoding:**
- $L$ layers of multi-head self-attention and MLP blocks
- Layer $l$ computation: $$z'_l = \text{MSA}(\text{LN}(z_{l-1})) + z_{l-1}, \qquad z_l = \text{MLP}(\text{LN}(z'_l)) + z'_l$$
- Output: contextual representations $z_L \in \mathbb{R}^{N \times D}$
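
A single pre-norm block matching the layer equations above might look like this in PyTorch (a sketch; `mlp_ratio = 4` is an assumed hyperparameter, not stated in the repo):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One pre-norm ViT block: z' = MSA(LN(z)) + z ; z = MLP(LN(z')) + z'."""

    def __init__(self, dim: int = 1024, heads: int = 16, mlp_ratio: int = 4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        h = self.ln1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]  # MSA with residual
        z = z + self.mlp(self.ln2(z))                      # MLP with residual
        return z

block = TransformerBlock()
z = torch.randn(1, 16, 1024)   # (batch, N, D)
print(block(z).shape)          # torch.Size([1, 16, 1024])
```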
### Vector Quantization (VQ)

The VQ layer maps continuous latent vectors to discrete tokens:
1. **Codebook:**
- Contains $K$ embedding vectors: $\{e_k\}_{k=1}^K$ where $e_k \in \mathbb{R}^D$
- Learned during training via the codebook loss below, with gradients passed to the encoder through straight-through estimation
2. **Quantization Process:**
- For each input vector $z_i$, find nearest codebook vector:
$$k(i) = \arg\min_k \|z_i - e_k\|_2$$
- Replace with selected codebook vector:
$$z_q^i = e_{k(i)}$$
3. **Training Objectives:**
- Codebook loss: $\|sg(z) - e\|_2^2$
- Commitment loss: $\beta\|z - sg(e)\|_2^2$
- Where $sg()$ is the stop-gradient operator
4. **Token Generation:**
- Each quantized vector replaced by codebook index
- Final output: sequence of $N$ discrete tokens $\{k(i)\}_{i=1}^N$
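
The quantization step, straight-through estimator, and both VQ losses can be sketched as follows (assuming PyTorch; this mirrors the standard VQ-VAE formulation rather than the repo's exact code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """VQ layer: nearest-neighbor lookup, straight-through gradients, VQ losses."""

    def __init__(self, codebook_size: int = 8192, dim: int = 1024, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)
        self.beta = beta

    def forward(self, z: torch.Tensor):
        # Squared L2 distance from each z_i to every codebook vector e_k
        e = self.codebook.weight                           # (K, D)
        d = (z.pow(2).sum(-1, keepdim=True)
             - 2 * z @ e.t()
             + e.pow(2).sum(-1))                           # (B, N, K)
        tokens = d.argmin(-1)                              # (B, N) discrete indices
        z_q = self.codebook(tokens)                        # (B, N, D)

        # Codebook loss ||sg(z) - e||^2 plus commitment loss beta * ||z - sg(e)||^2
        loss = F.mse_loss(z.detach(), z_q) + self.beta * F.mse_loss(z, z_q.detach())

        # Straight-through estimator: copy gradients from z_q back to the encoder
        z_q = z + (z_q - z).detach()
        return z_q, tokens, loss

vq = VectorQuantizer()
z = torch.randn(1, 16, 1024)
z_q, tokens, loss = vq(z)      # tokens: (1, 16) integers in {0, ..., 8191}
```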
## Mathematical Framework
### Image Processing
For an input image with dimensions $H \times W$:
- Patch size $P$ results in $N = \frac{HW}{P^2}$ patches
- Each patch produces one embedding in final sequence
- Example: 256×256 image with 64×64 patches yields 16 embeddings
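
Checking the arithmetic for the example:

```python
H = W = 256   # image height and width
P = 64        # patch size
N = (H * W) // (P * P)
print(N)      # 16 patch embeddings, as stated above
```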
### Attention Mechanism
Multi-head attention computed as:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Where:
- $Q, K, V \in \mathbb{R}^{N \times D}$ are query, key, value matrices
- $d_k$ is the per-head key dimension, used as the softmax scaling factor
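
A direct transcription of the attention formula (single-head view, so $d_k = D$):

```python
import math
import torch

def attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """Compute softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (N, N) similarity scores
    return torch.softmax(scores, dim=-1) @ V            # weighted sum of values

Q = K = V = torch.randn(16, 1024)   # N = 16 tokens, D = 1024
out = attention(Q, K, V)            # (16, 1024)
```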
### Vector Quantization
The quantization operation $q(z)$ is defined as:
$$q(z) = e_k \text{ where } k = \arg\min_j \|z - e_j\|_2$$
Total loss:
$$\mathcal{L} = \mathcal{L}_\text{reconstruction} + \|sg(z) - e\|_2^2 + \beta\|z - sg(e)\|_2^2$$
## Model Configuration
Typical hyperparameters:
- Image size: 256×256
- Patch size: 64×64
- Model dimension: 1024
- Number of heads: 16
- Number of layers: 12
- Codebook size: 8192
- $\beta$ (commitment cost): 0.25
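
These could be collected in a config object (a sketch; field names are illustrative, not the repo's):

```python
from dataclasses import dataclass

@dataclass
class TokenizerConfig:
    image_size: int = 256
    patch_size: int = 64
    dim: int = 1024            # model dimension D
    num_heads: int = 16
    num_layers: int = 12
    codebook_size: int = 8192  # K
    beta: float = 0.25         # commitment cost

cfg = TokenizerConfig()
num_tokens = (cfg.image_size // cfg.patch_size) ** 2   # 16 tokens per image
```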
## Input-Output Specifications
Input:
- RGB images: $\mathbb{R}^{H \times W \times 3}$
- Normalized to [-1, 1] range
Output:
- Sequence of discrete tokens: $\{0, ..., K-1\}^N$
- Token sequence length = $\frac{HW}{P^2}$
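
For instance, normalizing an 8-bit RGB image to the expected input range (a sketch; the repo's actual preprocessing may differ):

```python
import torch

img = torch.randint(0, 256, (1, 3, 256, 256), dtype=torch.uint8)  # raw RGB
x = img.float() / 127.5 - 1.0              # normalized to [-1, 1]

# Output spec: HW / P^2 integer tokens, each in {0, ..., K-1}
tokens = torch.randint(0, 8192, (1, 16))   # illustrative output shape only
```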
## Performance Characteristics
1. **Compression Rate** (worked out in the sketch after this list):
- Input: $H \times W \times 3$ bytes of raw 8-bit RGB
- Output: $\frac{HW}{P^2} \times \log_2(K)$ bits
- For the configuration above: $16 \times 13 = 208$ bits per image, a bit-level compression ratio of roughly 7,560:1
2. **Computational Complexity:**
- Attention: $O(N^2D)$ per layer
- Vector Quantization: $O(NKD)$
3. **Memory Usage:**
- Codebook: $O(KD)$ parameters
- Transformer: $O(L D^2)$ parameters
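
The compression-rate arithmetic from item 1, worked out for the configuration above:

```python
import math

H = W = 256; P = 64; K = 8192
input_bits = H * W * 3 * 8                      # 1,572,864 bits of raw 8-bit RGB
output_bits = (H * W // P**2) * math.log2(K)    # 16 tokens * 13 bits = 208 bits
print(input_bits / output_bits)                 # ~7562x reduction at the bit level
```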