{"id":18322465,"url":"https://github.com/TencentARC/SEED-Voken","last_synced_at":"2025-07-22T22:30:36.269Z","repository":{"id":244741633,"uuid":"814038296","full_name":"TencentARC/Open-MAGVIT2","owner":"TencentARC","description":"Open-MAGVIT2: Democratizing Autoregressive Visual Generation","archived":false,"fork":false,"pushed_at":"2024-09-27T03:45:00.000Z","size":15610,"stargazers_count":705,"open_issues_count":4,"forks_count":29,"subscribers_count":20,"default_branch":"main","last_synced_at":"2024-11-19T19:07:58.213Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/TencentARC.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-12T08:16:18.000Z","updated_at":"2024-11-19T11:56:37.000Z","dependencies_parsed_at":"2024-06-17T06:23:45.603Z","dependency_job_id":"c99ff760-9f36-4f3f-876e-7880e0205e73","html_url":"https://github.com/TencentARC/Open-MAGVIT2","commit_stats":null,"previous_names":["tencentarc/open-magvit2"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentARC%2FOpen-MAGVIT2","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentARC%2FOpen-MAGVIT2/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentARC%2FOpen-MAGVIT2/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentARC%2FOpen-MAGVIT2/manifests","owner_url":"https://repos.ecosyst
e.ms/api/v1/hosts/GitHub/owners/TencentARC","download_url":"https://codeload.github.com/TencentARC/Open-MAGVIT2/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":227184651,"owners_count":17744298,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-05T18:24:43.221Z","updated_at":"2025-07-22T22:30:36.242Z","avatar_url":"https://github.com/TencentARC.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\r\n\u003ch1\u003e🚀 SEED-Voken: A Series of Powerful Visual Tokenizers\u003c/h1\u003e\r\n\r\n\u003c/div\u003e\r\n\r\nThe project aims to provide advanced visual tokenizers for autoregressive visual generation and currently supports the following methods: \u003cbr\u003e\u003cbr\u003e\r\n\r\n\u003e\u003ca href=\"https://arxiv.org/abs/2409.04410\"\u003eOpen-MAGVIT2: An Open-source Project Toward Democratizing Auto-Regressive Visual Generation\u003c/a\u003e\u003cbr\u003e\r\n\u003e[Zhuoyan Luo*](https://robertluo1.github.io/), [Fengyuan Shi*](https://shifengyuan1999.github.io/), [Yixiao Ge](https://geyixiao.com/), [Yujiu Yang](https://sites.google.com/view/iigroup-thu/people), [Limin Wang](https://wanglimin.github.io/), [Ying Shan](https://scholar.google.com/citations?user=4oXBp9UAAAAJ\u0026hl=en)\u003cbr\u003e\r\n\u003eARC Lab Tencent PCG, Tsinghua University, Nanjing University\u003cbr\u003e\r\n\u003ca href=\"./docs/Open-MAGVIT2.md\"\u003e📚Open-MAGVIT2.md\u003c/a\u003e\r\n\u003e ```\r\n\u003e @article{luo2024open,\r\n\u003e   
title={Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation},\r\n\u003e   author={Luo, Zhuoyan and Shi, Fengyuan and Ge, Yixiao and Yang, Yujiu and Wang, Limin and Shan, Ying},\r\n\u003e   journal={arXiv preprint arXiv:2409.04410},\r\n\u003e   year={2024}\r\n\u003e }\r\n\u003e ```\r\n\r\n\u003e \u003ca href=\"https://arxiv.org/abs/2412.02692\"\u003eIBQ: Taming Scalable Visual Tokenizer for Autoregressive Image Generation\u003c/a\u003e\u003cbr\u003e\r\n\u003e [Fengyuan Shi*](https://shifengyuan1999.github.io/), [Zhuoyan Luo*](https://robertluo1.github.io/), [Yixiao Ge](https://geyixiao.com/), [Yujiu Yang](https://sites.google.com/view/iigroup-thu/people), [Ying Shan](https://scholar.google.com/citations?user=4oXBp9UAAAAJ\u0026hl=en), [Limin Wang](https://wanglimin.github.io/)\u003cbr\u003e\r\n\u003e Nanjing University, Tsinghua University, ARC Lab Tencent PCG\u003cbr\u003e\r\n\u003e \u003ca href=\"./docs/IBQ.md\"\u003e📚IBQ.md\u003c/a\u003e\r\n\u003e ```\r\n\u003e @article{shi2024taming,\r\n\u003e   title={Taming Scalable Visual Tokenizer for Autoregressive Image Generation},\r\n\u003e   author={Shi, Fengyuan and Luo, Zhuoyan and Ge, Yixiao and Yang, Yujiu and Shan, Ying and Wang, Limin},\r\n\u003e   journal={arXiv preprint arXiv:2412.02692},\r\n\u003e   year={2024}\r\n\u003e }\r\n\u003e ```\r\n\r\n\u003cp align=\"center\"\u003e\r\n\u003cimg src=\"./assets/comparsion.png\" width=90%\u003e\r\n\u003c/p\u003e\r\n\r\n## 📰 News\r\n* **[2025.06.26]**:fire::fire::fire: **IBQ is accepted by ICCV 2025.**\r\n* **[2025.02.14]** The pretrained version of the **IBQ** visual tokenizers, which achieves SOTA performance with a high code dimension, is released.\r\n* **[2025.02.09]** We release the Open-MAGVIT2 video tokenizers, which achieve SOTA performance compared to OmniTokenizer, LARP and SweetTokenizer. \r\n* **[2025.01.21]** Open-MAGVIT2 tokenizers (codebook sizes of 16384 and 262144) for text-conditional image generation are now released! 
They are pretrained with large-scale image-text datasets, achieving SOTA performance compared to LlamaGen, Show-o, and Cosmos.\r\n* **[2024.11.26]** We are excited to release **IBQ**, a series of scalable visual tokenizers, which achieve a large-scale codebook (2^18) with high dimension (256) and high utilization.\r\n* **[2024.09.09]** We release an improved version of the Open-MAGVIT2 tokenizer and a family of auto-regressive models ranging from 300M to 1.5B.\r\n* **[2024.06.17]** We release the training code of the **Open-MAGVIT2** tokenizer and checkpoints for different resolutions, **achieving state-of-the-art performance (`0.39 rFID` for 8x downsampling)** compared to VQGAN, MaskGIT, and recent TiTok, LlamaGen, and OmniTokenizer.\r\n\r\n## 📖 Implementations\r\n\r\n**Our codebase supports both NPU and GPU for training and inference. All experiments were conducted using the Ascend 910B for training, and we validated our models on the V100. The observed performance on the two platforms is nearly identical.**\r\n\r\n### 🛠️ Installation\r\n#### GPU\r\n- **Env**: We have tested on `Python 3.8.8` and `CUDA 11.8` (other versions may also be fine).\r\n- **Dependencies**: `pip install -r requirements.txt`\r\n\r\n#### NPU\r\n##### Image Version\r\n- **Env**: `Python 3.9.16` and [`CANN 8.0.T13`](https://www.hiascend.com/en/software/cann)\r\n- **Main Dependencies**: `torch=2.1.0+cpu` + `torch-npu=2.1.0.post3-20240523` + [`Lightning`](https://github.com/hipudding/pytorch-lightning/tree/npu_support)\r\n\r\n##### Video Version\r\n- **Env**: `Python 3.9.16` and [`CANN 8.0.T62`](https://www.hiascend.com/en/software/cann)\r\n- **Main Dependencies**: `torch=2.1.0+cpu` + `torch-npu=2.1.0.post10.dev20241128` + [`Lightning`](https://github.com/hipudding/pytorch-lightning/tree/npu_support)\r\n\r\n**Other Dependencies**: see `requirements.txt`\r\n\r\n#### Datasets\r\n\r\n- **Image Dataset**\r\n\r\nWe use ImageNet2012 as our image dataset.\r\n```\r\nimagenet\r\n└── train/\r\n    ├── 
n01440764\r\n        ├── n01440764_10026.JPEG\r\n        ├── n01440764_10027.JPEG\r\n        ├── ...\r\n    ├── n01443537\r\n    ├── ...\r\n└── val/\r\n    ├── ...\r\n```\r\n\r\n- **Video Dataset**\r\n\r\nWe use UCF-101 as our video dataset.\r\n```\r\nUCF101\r\n└── train/\r\n    ├── class_0\r\n        ├── video_1.mp4\r\n        ├── video_2.mp4\r\n        ├── ...\r\n    ├── class_1\r\n    ├── class_2\r\n└── val/\r\n    ├── ...\r\n```\r\nTo prepare UCF-101, refer to [VideoGPT](https://github.com/wilson1yan/VideoGPT).\r\n\r\n- **Text2Image Datasets**\r\n\r\nWe recommend organizing the data in the following tar format.\r\n```\r\ndata\r\n└── LAION_COCO/\r\n    ├── webdataset\r\n        ├── 1.tar\r\n        ├── 2.tar\r\n        ├── 3.tar\r\n        ├── ...\r\n└── CC12M/\r\n    ├── webdataset\r\n        ├── 1.tar\r\n        ├── 2.tar\r\n        ├── 3.tar\r\n        ├── ...\r\n```\r\nBefore pretraining, prepare the sample.json and filter_keys.json for each dataset. Please refer to **src/Open_MAGVIT2/data/prepare_pretrain.py**.\r\n\r\n### ⚡ Training \u0026 Evaluation\r\nThe training and evaluation scripts are in \u003ca href=\"docs/Open-MAGVIT2.md\"\u003eOpen-MAGVIT2.md\u003c/a\u003e and \u003ca href=\"docs/IBQ.md\"\u003eIBQ.md\u003c/a\u003e.\r\n\r\n## ❤️ Acknowledgement\r\nWe thank [Lijun Yu](https://me.lj-y.com/) for his encouraging discussions. We draw heavily on [VQGAN](https://github.com/CompVis/taming-transformers) and [MAGVIT](https://github.com/google-research/magvit). We also refer to [LlamaGen](https://github.com/FoundationVision/LlamaGen), [VAR](https://github.com/FoundationVision/VAR), [RQVAE](https://github.com/kakaobrain/rq-vae-transformer), [VideoGPT](https://github.com/wilson1yan/VideoGPT), and [OmniTokenizer](https://github.com/FoundationVision/OmniTokenizer). 
Thanks for their wonderful work.\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FTencentARC%2FSEED-Voken","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FTencentARC%2FSEED-Voken","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FTencentARC%2FSEED-Voken/lists"}