{"id":26265748,"url":"https://github.com/andrewboessen/simple-1d-tokenizer","last_synced_at":"2025-10-06T06:04:20.248Z","repository":{"id":259495724,"uuid":"876883277","full_name":"AndrewBoessen/simple-1d-tokenizer","owner":"AndrewBoessen","description":"Simple 1D image tokenizer from the paper An Image is Worth 32 Tokens for Reconstruction and Generation","archived":false,"fork":false,"pushed_at":"2024-11-15T16:09:14.000Z","size":1082,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-11-15T17:20:56.170Z","etag":null,"topics":["image-tokens","vector-quantization","vision-transformer"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AndrewBoessen.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-10-22T17:57:05.000Z","updated_at":"2024-11-15T16:09:18.000Z","dependencies_parsed_at":"2024-10-26T01:28:42.466Z","dependency_job_id":"1ae21512-58cd-42f5-95e2-a190e4c61c93","html_url":"https://github.com/AndrewBoessen/simple-1d-tokenizer","commit_stats":null,"previous_names":["andrewboessen/simple-1d-tokenizer"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AndrewBoessen%2Fsimple-1d-tokenizer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AndrewBoessen%2Fsimple-1d-tokenizer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AndrewBoessen%2Fsimple-1d-tokenizer/rele
ases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AndrewBoessen%2Fsimple-1d-tokenizer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AndrewBoessen","download_url":"https://codeload.github.com/AndrewBoessen/simple-1d-tokenizer/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243515553,"owners_count":20303258,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["image-tokens","vector-quantization","vision-transformer"],"created_at":"2025-03-14T03:15:09.673Z","updated_at":"2025-10-06T06:04:15.195Z","avatar_url":"https://github.com/AndrewBoessen.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 1D Tokenizer\n\nSimple 1D image tokenizer from the paper [_An Image is Worth 32 Tokens for Reconstruction and Generation_](https://arxiv.org/pdf/2406.07550)\n\n![Image Tokenizer](./assets/encoder.png)\n\n# Image Tokenizer\n\nA neural architecture for encoding images into sequences of discrete tokens, enabling efficient image compression and representation learning through the use of Vision Transformers and Vector Quantization.\n\n## Overview\n\nThe Image Tokenizer converts images into sequences of discrete tokens using a two-stage process:\n\n1. Vision Transformer (ViT) for learning spatial relationships\n2. 
Vector Quantization (VQ) for discretization\n\nThis architecture enables efficient image compression, learned discrete representations, and interpretable latent spaces suitable for downstream tasks.\n\n## Architecture Details\n\n### Image Tokenizer Pipeline\n\nThe image tokenizer processes input images $x \\in \\mathbb{R}^{H \\times W \\times C}$ through the following stages:\n\n1. Patch embedding and tokenization\n2. Transformer-based contextual encoding\n3. Vector quantization\n4. Discrete token generation\n\n### Vision Transformer (ViT)\n\n![ViT](./assets/vision_transformer.png)\n\nThe Vision Transformer processes images through several key stages:\n\n1. **Patch Embedding:**\n\n   - Input image $x \\in \\mathbb{R}^{H \\times W \\times C}$ is divided into $N = \\frac{HW}{P^2}$ patches\n   - Each patch $x_p \\in \\mathbb{R}^{P^2 \\cdot C}$ is projected to dimension $D$\n   - Result: sequence of patch embeddings $z_0 \\in \\mathbb{R}^{N \\times D}$\n\n2. **Position Encoding:**\n\n   - Learned position embeddings $E_{pos} \\in \\mathbb{R}^{N \\times D}$ are added to the patch embeddings\n   - Input sequence: $z_0 + E_{pos}$\n\n3. **Transformer Encoding:**\n   - $L$ layers of multi-head self-attention (MSA) and MLP blocks\n   - Layer $l$ computation: $$z'_l = \\text{MSA}(\\text{LN}(z_{l-1})) + z_{l-1}, \\qquad z_l = \\text{MLP}(\\text{LN}(z'_l)) + z'_l$$\n   - Output: contextual representations $z_L \\in \\mathbb{R}^{N \\times D}$\n\n### Vector Quantization (VQ)\n\n![VQ](./assets/vector_quant.png)\n\nThe VQ layer maps continuous latent vectors to discrete tokens:\n\n1. **Codebook:**\n\n   - Contains $K$ embedding vectors: $\\{e_k\\}_{k=1}^K$ where $e_k \\in \\mathbb{R}^D$\n   - Learned during training; straight-through gradient estimation carries gradients past the non-differentiable codebook lookup to the encoder\n\n2. **Quantization Process:**\n\n   - For each input vector $z_i$, find the nearest codebook vector:\n     $$k(i) = \\arg\\min_k \\|z_i - e_k\\|_2$$\n   - Replace it with the selected codebook vector:\n     $$z_q^i = e_{k(i)}$$\n\n3. 
**Training Objectives:**\n\n   - Codebook loss: $\\|sg(z) - e\\|_2^2$\n   - Commitment loss: $\\beta\\|z - sg(e)\\|_2^2$\n   - Where $sg()$ is the stop-gradient operator\n\n4. **Token Generation:**\n   - Each quantized vector is replaced by its codebook index\n   - Final output: sequence of $N$ discrete tokens $\\{k(i)\\}_{i=1}^N$\n\n## Mathematical Framework\n\n### Image Processing\n\nFor an input image with dimensions $H \\times W$:\n\n- Patch size $P$ results in $N = \\frac{HW}{P^2}$ patches\n- Each patch produces one embedding in the final sequence\n- Example: 256×256 image with 64×64 patches yields 16 embeddings\n\n### Attention Mechanism\n\nEach attention head computes scaled dot-product attention:\n$$\\text{Attention}(Q, K, V) = \\text{softmax}\\left(\\frac{QK^T}{\\sqrt{d_k}}\\right)V$$\n\nWhere:\n\n- $Q, K, V \\in \\mathbb{R}^{N \\times d_k}$ are the per-head query, key, and value matrices\n- $d_k$ is the head dimension, used as the scaling factor\n\n### Vector Quantization\n\nThe quantization operation $q(z)$ is defined as:\n$$q(z) = e_k \\text{ where } k = \\arg\\min_j \\|z - e_j\\|_2$$\n\nTotal loss:\n$$\\mathcal{L} = \\mathcal{L}_\\text{reconstruction} + \\|sg(z) - e\\|_2^2 + \\beta\\|z - sg(e)\\|_2^2$$\n\n## Model Configuration\n\nTypical hyperparameters:\n\n- Image size: 256×256\n- Patch size: 64×64\n- Model dimension: 1024\n- Number of heads: 16\n- Number of layers: 12\n- Codebook size: 8192\n- $\\beta$ (commitment cost): 0.25\n\n## Input-Output Specifications\n\nInput:\n\n- RGB images: $\\mathbb{R}^{H \\times W \\times 3}$\n- Normalized to [-1, 1] range\n\nOutput:\n\n- Sequence of discrete tokens: $\\{0, ..., K-1\\}^N$\n- Token sequence length = $\\frac{HW}{P^2}$\n\n## Performance Characteristics\n\n1. **Compression Rate:**\n\n   - Input: $H \\times W \\times 3$ bytes\n   - Output: $\\frac{HW}{P^2} \\times \\log_2(K)$ bits\n   - Example with the typical hyperparameters (256×256 RGB image, 64×64 patches, $K = 8192$): 196,608 bytes in, $16 \\times 13 = 208$ bits (26 bytes) out, a ratio of roughly 7,562:1\n\n2. **Computational Complexity:**\n\n   - Attention: $O(N^2D)$ per layer\n   - Vector Quantization: $O(NKD)$\n\n3. 
**Memory Usage:**\n   - Codebook: $O(KD)$ parameters\n   - Transformer: $O(L D^2)$ parameters\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fandrewboessen%2Fsimple-1d-tokenizer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fandrewboessen%2Fsimple-1d-tokenizer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fandrewboessen%2Fsimple-1d-tokenizer/lists"}