{"id":19735972,"url":"https://github.com/nvidia/cosmos-tokenizer","last_synced_at":"2025-10-30T10:30:52.972Z","repository":{"id":261454903,"uuid":"881054702","full_name":"NVIDIA/Cosmos-Tokenizer","owner":"NVIDIA","description":"A suite of image and video neural tokenizers","archived":true,"fork":false,"pushed_at":"2025-02-11T05:49:24.000Z","size":17326,"stargazers_count":1552,"open_issues_count":3,"forks_count":67,"subscribers_count":25,"default_branch":"main","last_synced_at":"2025-02-15T05:05:05.017Z","etag":null,"topics":["diffusion","tokenization","transformers"],"latest_commit_sha":null,"homepage":"https://research.nvidia.com/labs/dir/cosmos-tokenizer","language":"Jupyter Notebook","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/NVIDIA.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-10-30T20:48:53.000Z","updated_at":"2025-02-14T12:52:09.000Z","dependencies_parsed_at":"2024-11-06T17:19:37.165Z","dependency_job_id":"31c2ab1c-f596-49f9-9482-f3aec3d8b7ce","html_url":"https://github.com/NVIDIA/Cosmos-Tokenizer","commit_stats":{"total_commits":30,"total_committers":5,"mean_commits":6.0,"dds":"0.23333333333333328","last_synced_commit":"b080d0e72c2c3470d18e74927d8490713cdd1e2c"},"previous_names":["nvidia/cosmos-tokenizer"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA%2FCosmos-Tokenizer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA%2FCosmos-Tokenizer/tags","releases_url":"https://re
pos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA%2FCosmos-Tokenizer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA%2FCosmos-Tokenizer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/NVIDIA","download_url":"https://codeload.github.com/NVIDIA/Cosmos-Tokenizer/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":238950461,"owners_count":19557534,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["diffusion","tokenization","transformers"],"created_at":"2024-11-12T01:04:25.555Z","updated_at":"2025-10-30T10:30:46.990Z","avatar_url":"https://github.com/NVIDIA.png","language":"Jupyter Notebook","readme":"\u003c!-- # SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION \u0026 AFFILIATES. All rights reserved.\n# SPDX-License-Identifier: Apache-2.0\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n# http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License. --\u003e\n# Cosmos Tokenizer: A suite of image and video neural tokenizers.\n\nAs of February 10th, 2025, this repository is **read-only**. 
\u003cbr /\u003e\n**Please visit [github.com/NVIDIA/Cosmos](https://github.com/nvidia/cosmos) for the latest updates and support on Cosmos Tokenizer**.\n\n### [Website](https://research.nvidia.com/labs/dir/cosmos-tokenizer) | [Paper](https://arxiv.org/abs/2501.03575) | [NVIDIA Cosmos](https://www.nvidia.com/en-us/ai/cosmos/) | [NVIDIA Blog](https://developer.nvidia.com/blog/state-of-the-art-multimodal-generative-ai-model-development-with-nvidia-nemo/)  | [Hugging Face](https://huggingface.co/collections/nvidia/cosmos-tokenizer-672b93023add81b66a8ff8e6) | [YouTube](https://youtu.be/Soy_myOfWIU) | [TokenBench](https://github.com/NVlabs/TokenBench)\n\nWe present [**NVIDIA Cosmos Tokenizer**](https://github.com/NVIDIA/Cosmos-Tokenizer), a suite of image and video tokenizers that advances the state-of-the-art in visual tokenization, paving the way for scalable, robust and efficient development of large auto-regressive transformers (such as LLMs) or diffusion generators. Cosmos Tokenizer is the core component of the [**NVIDIA Cosmos**](https://github.com/NVIDIA/Cosmos), a developer-first video foundation model platform designed to help Physical AI developers build their Physical AI systems better and faster. Please check out our [demo video](https://youtu.be/Soy_myOfWIU).\n\n\n|                   | Continuous ( C )    | Discrete ( D )      |\n| ------------------|---------------------|---------------------|\n| **Images ( I )**        | Cosmos-Tokenizer-CI      | Cosmos-Tokenizer-DI      |\n| **Videos ( V )**        | Cosmos-Tokenizer-CV      | Cosmos-Tokenizer-DV      |\n\u003cvideo src=\"https://github.com/user-attachments/assets/a40b0cc0-17dc-42e9-a97c-fe1c8bb03548\" controls poster=\"https://github.com/NVIDIA/Cosmos-Tokenizer/blob/main/assets/cosmos-tokenizer.jpg?raw=true\"\u003e\n  Your browser does not support the video tag.\n\u003c/video\u003e\n\nGiven an image or video, Cosmos Tokenizer outputs either continuous latents or discrete tokens. 
Cosmos Tokenizer achieves spatial compression rates of 8x or 16x and temporal compression factors of 4x or 8x, resulting in a total compression factor of up to 2048x (=8x16x16).\nCosmos Tokenizer delivers 8x more total compression than state-of-the-art (SOTA) methods, while simultaneously maintaining higher image quality and running up to 12x faster than the best available SOTA tokenizers.\n\n![Arch](assets/arch_diagram.jpg)\n\n## Web Demo\n\n* Image Tokenization [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/nvidia/Cosmos-Tokenizer/blob/main/notebook/Image_Tokenization.ipynb)\n* Video Tokenization [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/nvidia/Cosmos-Tokenizer/blob/main/notebook/Video_Tokenization.ipynb)\n\n## Licenses\n- **Models**: The models are licensed under [NVIDIA Open Model License](https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf). Under the NVIDIA Open Model License, NVIDIA confirms:\n  - Models are commercially usable. \n  - You are free to create and distribute Derivative Models. 
\n  - NVIDIA does not claim ownership to any outputs generated using the Models or Derivative Models.\n- **GitHub Code**: This repository is licensed under the [Apache 2.0\n  license](https://github.com/NVIDIA/Cosmos-Tokenizer/blob/main/LICENSE).\n\n## Installation\n- Clone the source code\n```\ngit clone https://github.com/NVIDIA/Cosmos-Tokenizer.git\ncd Cosmos-Tokenizer\n```\n- Install via pip\n```\napt-get install -y ffmpeg git-lfs\ngit lfs pull\npip3 install -e .\n```\n\nPreferably, build a docker image using the provided Dockerfile\n```\ndocker build -t cosmos-tokenizer -f Dockerfile .\n\n# You can run the container as:\ndocker run --gpus all -it --rm -v /home/${USER}:/home/${USER} \\\n    --workdir ${PWD} cosmos-tokenizer /bin/bash\n```\n\n## Download Pre-trained Checkpoints from Hugging Face\n\n\nWe host 12 Cosmos-Tokenizer models on [Hugging Face](https://huggingface.co/collections/nvidia/cosmos-tokenizer-672b93023add81b66a8ff8e6), with the following model names. You can use this snippet to download:\n```python\nfrom huggingface_hub import login, snapshot_download\nimport os\n\nlogin(token=\"\u003cYOUR-HF-TOKEN\u003e\", add_to_git_credential=True)\nmodel_names = [\n        \"Cosmos-0.1-Tokenizer-CI8x8\",\n        \"Cosmos-0.1-Tokenizer-CI16x16\",\n        \"Cosmos-0.1-Tokenizer-CV4x8x8\",\n        \"Cosmos-0.1-Tokenizer-CV8x8x8\",\n        \"Cosmos-0.1-Tokenizer-CV8x16x16\",\n        \"Cosmos-0.1-Tokenizer-DI8x8\",\n        \"Cosmos-0.1-Tokenizer-DI16x16\",\n        \"Cosmos-0.1-Tokenizer-DV4x8x8\",\n        \"Cosmos-0.1-Tokenizer-DV8x8x8\",\n        \"Cosmos-0.1-Tokenizer-DV8x16x16\",\n        \"Cosmos-1.0-Tokenizer-CV8x8x8\",\n        \"Cosmos-1.0-Tokenizer-DV8x16x16\",\n]\nfor model_name in model_names:\n    hf_repo = \"nvidia/\" + model_name\n    local_dir = \"pretrained_ckpts/\" + model_name\n    os.makedirs(local_dir, exist_ok=True)\n    print(f\"downloading {model_name}...\")\n    snapshot_download(repo_id=hf_repo, local_dir=local_dir)\n```\nUnder 
the checkpoint repository `pretrained_ckpts/{model_name}`, we provide the encoder, decoder and the full autoencoder JIT models.\n```bash \n├── Cosmos-1.0-Tokenizer-CV8x8x8/\n│   ├── encoder.jit\n│   ├── decoder.jit\n│   ├── autoencoder.jit\n```\n## Running the codes\nYou can use the following example commands to encode and decode images or videos. \u003cbr /\u003e\nFor each, the same command works for both continuous and discrete tokenization. Simply provide the proper JIT-compiled ckpt to `checkpoint_enc`, `checkpoint_dec`, or the full autoencoder ckpt to `checkpoint`.\n\n### Encoding into Continuous Latent Space\n\n```python\nimport torch\nfrom cosmos_tokenizer.video_lib import CausalVideoTokenizer\n\nmodel_name = \"Cosmos-0.1-Tokenizer-CV4x8x8\"\ninput_tensor = torch.randn(1, 3, 9, 512, 512).to('cuda').to(torch.bfloat16)  # [B, C, T, H, W]\nencoder = CausalVideoTokenizer(checkpoint_enc=f'pretrained_ckpts/{model_name}/encoder.jit')\n(latent,) = encoder.encode(input_tensor)\ntorch.testing.assert_close(latent.shape, (1, 16, 3, 64, 64))\n\n# The input tensor can be reconstructed by the decoder as:\ndecoder = CausalVideoTokenizer(checkpoint_dec=f'pretrained_ckpts/{model_name}/decoder.jit')\nreconstructed_tensor = decoder.decode(latent)\ntorch.testing.assert_close(reconstructed_tensor.shape, input_tensor.shape)\n```\nThe `latent` will have the shape `(1, 16, 3, 64, 64)`, where the first of the three latents represents the first frame, and C=16 is the number of channels of the latent.\n\n### Encoding into Discrete Tokens\n```python\nimport torch\nfrom cosmos_tokenizer.video_lib import CausalVideoTokenizer\n\nmodel_name = \"Cosmos-0.1-Tokenizer-DV4x8x8\"\ninput_tensor = torch.randn(1, 3, 9, 512, 512).to('cuda').to(torch.bfloat16)  # [B, C, T, H, W]\nencoder = CausalVideoTokenizer(checkpoint_enc=f'pretrained_ckpts/{model_name}/encoder.jit')\n(indices, codes) = encoder.encode(input_tensor)\ntorch.testing.assert_close(indices.shape, (1, 3, 64, 
64))\ntorch.testing.assert_close(codes.shape, (1, 6, 3, 64, 64))\n\n# The input tensor can be reconstructed by the decoder as:\ndecoder = CausalVideoTokenizer(checkpoint_dec=f'pretrained_ckpts/{model_name}/decoder.jit')\nreconstructed_tensor = decoder.decode(indices)\ntorch.testing.assert_close(reconstructed_tensor.shape, input_tensor.shape)\n```\nThe `indices` will have the shape `(1, 3, 64, 64)` and contain integral values in the range `[1..64K]`, where the first of the three integral maps represents the first frame. \nThe `codes` will contain the pre-quantization continuous latent with shape `(1, 6, 3, 64, 64)`, where C=6 represents the number of FSQ levels.\n\n## Torchscript (PyTorch JIT) Inference APIs\nThe following instructions run the various tokenizer on the example image and video provided in `test_data/`.\n\n- Autoencoding images. Accepts an input image, and outputs a reconstruction of the image obtained by decoding the encoded latents. \n```bash\n# Autoencoding images using `Cosmos-CI` with a compression rate of 8x8.\nmodel_name=\"Cosmos-0.1-Tokenizer-CI8x8\"\npython3 -m cosmos_tokenizer.image_cli \\\n    --image_pattern 'test_data/image.png' \\\n    --checkpoint_enc pretrained_ckpts/${model_name}/encoder.jit \\\n    --checkpoint_dec pretrained_ckpts/${model_name}/decoder.jit\n```\nIf `--output_dir` is not specified, you can find the reconstructed image at `test_data/reconstructions/image.png`.\n\n- Autoencoding videos. 
Accepts an input video, and outputs a reconstruction of the video obtained by decoding the encoded latents.\n```bash\n# Autoencoding videos using `Cosmos-DV` with a compression rate of 4x8x8.\nmodel_name=\"Cosmos-0.1-Tokenizer-DV4x8x8\"\npython3 -m cosmos_tokenizer.video_cli \\\n    --video_pattern 'test_data/video.mp4' \\\n    --checkpoint_enc pretrained_ckpts/${model_name}/encoder.jit \\\n    --checkpoint_dec pretrained_ckpts/${model_name}/decoder.jit\n```\nIf `--output_dir` is not specified, then you can find the reconstructed video at `test_data/reconstructions/video.mp4`.\n\n## PyTorch Inference APIs\n\nTo run the tokenizers in native PyTorch, append your commands with `--mode=torch`.  \u003cbr /\u003e\nIn PyTorch mode, the model is constructed from the native network definition scripts, which requires providing additional arguments to configure the model for instantiation. \n\nFor example, to instantiate a `Cosmos-DI` with a spatial compression factor of 8, append the following command line arguments:\n\n- `--mode=torch`\n- `--tokenizer_type=DI`\n- `--spatial_compression=8`\n\nNote that the `--checkpoint_enc`, `--checkpoint_dec`, and `--checkpoint` should still refer to JIT files. 
\u003cbr /\u003e\nThe necessary `state_dict`s will be extracted from the loaded JIT models to initialize the weights of the constructed native PyTorch model.\n\n```bash\n# Autoencoding images using `Cosmos-DI` with a compression rate of 8x8.\nmodel_name=\"Cosmos-0.1-Tokenizer-DI8x8\"\npython3 -m cosmos_tokenizer.image_cli \\\n    --image_pattern 'test_data/*.png' \\\n    --mode=torch \\\n    --tokenizer_type=DI \\\n    --spatial_compression=8 \\\n    --checkpoint_enc pretrained_ckpts/${model_name}/encoder.jit \\\n    --checkpoint_dec pretrained_ckpts/${model_name}/decoder.jit\n```\n\nTo instantiate a `Cosmos-CV` with a temporal factor of 8 and a spatial compression factor of 8, append the following command line arguments:\n\n- `--mode=torch`\n- `--tokenizer_type=CV`\n- `--temporal_compression=8`\n- `--spatial_compression=8`\n\n```bash\n# Autoencoding videos using `Cosmos-CV` with a compression rate of 8x8x8.\nmodel_name=\"Cosmos-1.0-Tokenizer-CV8x8x8\"\npython3 -m cosmos_tokenizer.video_cli \\\n    --video_pattern 'test_data/*.mp4' \\\n    --mode=torch \\\n    --tokenizer_type=CV \\\n    --temporal_compression=8 \\\n    --spatial_compression=8 \\\n    --checkpoint_enc pretrained_ckpts/${model_name}/encoder.jit \\\n    --checkpoint_dec pretrained_ckpts/${model_name}/decoder.jit\n```\n\n## Inference \u0026 dataset tokenization with NeMo (JIT/TensorRT)\nTensorRT inference is coming soon, which will be available in [Cosmos Tokenizer README within the NeMo repository](https://github.com/NVIDIA/NeMo/tree/main/nemo/collections/common/video_tokenizers)\n\n### JIT inference\nPlease install NeMo from the GitHub `main` branch following the instructions [here](https://github.com/NVIDIA/NeMo?tab=readme-ov-file#pip-from-a-source-branch).\n\nRun the following code to tokenize the video:\n\n```python\nimport torch\nfrom nemo.collections.common.video_tokenizers.cosmos_vision_tokenizer import CausalVideoTokenizer\nmodel_name = \"Cosmos-0.1-Tokenizer-CV4x8x8\"\nmodel = 
CausalVideoTokenizer.from_pretrained(model_name)\ninput_tensor = torch.randn(1, 3, 9, 512, 512).to('cuda').to(torch.bfloat16)\n(latent, ) = model.encode(input_tensor)\n```\n\n### Dataset tokenization and multimodal model training\nPlease see the [Cosmos Tokenizer README within the NeMo repository](https://github.com/NVIDIA/NeMo/tree/main/nemo/collections/common/video_tokenizers) for additional examples to create multimodal training datasets with the Cosmos Tokenizer.\n\n\n## Evaluation\nQuantitative comparison of our tokenizer and previous tokenizers on the DAVIS dataset (Perazzi et al., 2016). Cosmos Tokenizer achieves state-of-the-art results. Even at higher compression rates (8x8x8 and 8x16x16), Cosmos Tokenizer outperforms previous methods, demonstrating an excellent compression-quality trade-off.\n![Arch](assets/Davis-results.jpg)\n
## Performance\nComparison of parameter counts and average encoding and decoding times per image or per video frame on a single A100 80GB GPU. Cosmos Tokenizer achieves 2x to 12x faster speeds than previous methods while maintaining the smallest model sizes, demonstrating high tokenization efficiency.\n![Arch](assets/Performance.jpg)\n\n\n
## [TokenBench](https://github.com/NVlabs/TokenBench)\nTokenBench is a comprehensive benchmark that we have curated to standardize the evaluation of [Cosmos-Tokenizer](https://github.com/NVIDIA/Cosmos-Tokenizer). It covers a wide variety of domains, including robotic manipulation, driving, egocentric, and web videos. It consists of high-resolution, long-duration videos, and is designed to benchmark video tokenizers. We have made TokenBench publicly available at [github.com/NVlabs/TokenBench](https://github.com/NVlabs/TokenBench).\n\n
## Core Contributors\n\nFitsum Reda, Jinwei Gu, Xian Liu, Songwei Ge, Ting-Chun Wang, Haoxiang Wang, Ming-Yu Liu\n\n\n## Citation\n\nIf you find Cosmos Tokenizer useful in your work, please acknowledge it\nappropriately by citing:\n\n```\n@article{agarwal2025cosmos,\n  title={Cosmos World Foundation Model Platform for Physical AI},\n  author={NVIDIA et al.},\n  journal={arXiv preprint arXiv:2501.03575},\n  year={2025}\n}\n```\n
## Acknowledgments\nWe would like to acknowledge the following projects, from which parts of the code in the [cosmos_tokenizer/modules](cosmos_tokenizer/modules) folder are derived:\n- [CompVis/stable-diffusion](https://github.com/CompVis/stable-diffusion)\n- [lucidrains/magvit2-pytorch](https://github.com/lucidrains/magvit2-pytorch)\n- [lucidrains/vector-quantize-pytorch](https://github.com/lucidrains/vector-quantize-pytorch)\n- [CompVis/taming-transformers](https://github.com/CompVis/taming-transformers)\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnvidia%2Fcosmos-tokenizer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnvidia%2Fcosmos-tokenizer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnvidia%2Fcosmos-tokenizer/lists"}