{"id":28366993,"url":"https://github.com/thuml/ivideogpt","last_synced_at":"2025-10-27T22:33:59.553Z","repository":{"id":241457774,"uuid":"805279472","full_name":"thuml/iVideoGPT","owner":"thuml","description":"Official repository for \"iVideoGPT: Interactive VideoGPTs are Scalable World Models\" (NeurIPS 2024), https://arxiv.org/abs/2405.15223","archived":false,"fork":false,"pushed_at":"2025-05-22T03:48:56.000Z","size":40067,"stargazers_count":144,"open_issues_count":2,"forks_count":12,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-09-08T04:45:37.322Z","etag":null,"topics":["gpt","model-based-reinforcement-learning","open-x-embodiment","robotic-manipulation","video-generation","video-prediction","visual-planning","world-model"],"latest_commit_sha":null,"homepage":"https://thuml.github.io/iVideoGPT/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/thuml.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-05-24T08:45:14.000Z","updated_at":"2025-09-07T16:01:03.000Z","dependencies_parsed_at":"2024-05-28T13:07:45.863Z","dependency_job_id":"b0e30564-5cd4-4bf9-86d3-1eb469071357","html_url":"https://github.com/thuml/iVideoGPT","commit_stats":null,"previous_names":["thuml/ivideogpt"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/thuml/iVideoGPT","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thuml%2FiVideoGPT","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thuml%2FiVideoGPT/tags","releases_url":"http
s://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thuml%2FiVideoGPT/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thuml%2FiVideoGPT/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/thuml","download_url":"https://codeload.github.com/thuml/iVideoGPT/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thuml%2FiVideoGPT/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":281355381,"owners_count":26486897,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-27T02:00:05.855Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["gpt","model-based-reinforcement-learning","open-x-embodiment","robotic-manipulation","video-generation","video-prediction","visual-planning","world-model"],"created_at":"2025-05-29T00:13:49.062Z","updated_at":"2025-10-27T22:33:59.548Z","avatar_url":"https://github.com/thuml.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🌏 iVideoGPT: Interactive VideoGPTs are Scalable World Models (NeurIPS 2024)\n\n[[Project Page]](https://thuml.github.io/iVideoGPT/) [[Paper]](https://arxiv.org/abs/2405.15223) [[Models]](https://huggingface.co/collections/thuml/ivideogpt-674c59cae32231024d82d6c5) 
[[Poster]](https://manchery.github.io/assets/pub/nips2024_ivideogpt/poster.pdf) [[Slides]](https://manchery.github.io/assets/pub/nips2024_ivideogpt/slides.pdf) [[Blog (In Chinese)]](https://mp.weixin.qq.com/s/D94aamdqtO9WLekr4BSCUw)\n\nThis repo provides official code and checkpoints for iVideoGPT, a generic and efficient world model architecture that has been pre-trained on millions of human and robotic manipulation trajectories. \n\n![architecture](assets/architecture.png)\n\n## 🔥 News\n\n- 🚩 **2025.09.18**: [RLVR-World](https://github.com/thuml/RLVR-World) has been accepted by NeurIPS 2025, congrats!\n- 🚩 **2025.05.21**: We are excited to release a new work, [RLVR-World](https://github.com/thuml/RLVR-World), demonstrating that iVideoGPTs can be improved by reinforcement learning with verifiable rewards (RLVR)!\n- 🚩 **2024.11.01**: NeurIPS 2024 camera-ready version is released on [arXiv](https://arxiv.org/abs/2405.15223v3).\n- 🚩 **2024.09.26**: iVideoGPT has been accepted by NeurIPS 2024, congrats!\n- 🚩 **2024.08.31**: Training code is released.\n- 🚩 **2024.05.31**: Project website with video samples is released.\n- 🚩 **2024.05.30**: Model pre-trained on Open X-Embodiment and inference code are released.\n- 🚩 **2024.05.27**: Our paper is released on [arXiv](https://arxiv.org/abs/2405.15223v1).\n\n## 🛠️ Installation\n\n```bash\nconda create -n ivideogpt python==3.9\nconda activate ivideogpt\npip install -r requirements.txt\n```\n\nTo evaluate the FVD metric, download the [pretrained I3D model](https://www.dropbox.com/s/ge9e5ujwgetktms/i3d_torchscript.pt?dl=1) into `pretrained_models/i3d/i3d_torchscript.pt`.\n\n## 🤗 Models\n\nAt the moment we provide the following pre-trained models:\n\n| Model | Resolution | Action-conditioned | Goal-conditioned | Tokenizer Size | Transformer Size |\n| ---- | ---- | ---- | ---- | ---- | ---- |\n| [ivideogpt-oxe-64-act-free](https://huggingface.co/thuml/ivideogpt-oxe-64-act-free) | 64x64 | No | No | 114M   |  138M    |\n| 
[ivideogpt-oxe-64-act-free-medium](https://huggingface.co/thuml/ivideogpt-oxe-64-act-free-medium) | 64x64 | No | No |  114M   |  436M    |\n| [ivideogpt-oxe-64-goal-cond](https://huggingface.co/thuml/ivideogpt-oxe-64-goal-cond) | 64x64 | No | Yes | 114M   |  138M    |\n| [ivideogpt-oxe-256-act-free](https://huggingface.co/thuml/ivideogpt-oxe-256-act-free) | 256x256 | No | No | 310M   |  138M    |\n\nIf no network connection to Hugging Face, you can manually download from [Tsinghua Cloud](https://cloud.tsinghua.edu.cn/d/ef7d94c798504587a95e/).\n\n**Notes**:\n\n- Due to the heterogeneity of action spaces, we currently do not have an action-conditioned prediction model on OXE.\n- Pre-trained models at 256x256 resolution may not perform best due to insufficient training, but can serve as a good starting point for downstream fine-tuning.\n\n\u003cdetails\u003e\n  \u003csummary\u003e\u003cb\u003eMore models on downstream tasks\u003c/b\u003e\u003c/summary\u003e\n  \u003cbr\u003e\n  \n| Model | Resolution | Action-conditioned | Goal-conditioned | Tokenizer Size | Transformer Size |\n| ---- | ---- | ---- | ---- | ---- | ---- |\n| [ivideogpt-bair-64-act-free](https://huggingface.co/thuml/ivideogpt-bair-64-act-free) | 64x64 | No | No |  114M   |  138M    |\n| [ivideogpt-bair-64-act-cond](https://huggingface.co/thuml/ivideogpt-bair-64-act-cond) | 64x64 | Yes | No | 114M   |  138M    |\n| [ivideogpt-robonet-64-act-cond](https://huggingface.co/thuml/ivideogpt-robonet-64-act-cond) | 64x64 | Yes | No |  114M   |  138M    |\n| [ivideogpt-vp2-robosuite-64-act-cond](https://huggingface.co/thuml/ivideogpt-vp2-robosuite-64-act-cond) | 64x64 | Yes | No |  114M   |  138M    |\n| [ivideogpt-vp2-robodesk-64-act-cond](https://huggingface.co/thuml/ivideogpt-vp2-robodesk-64-act-cond) | 64x64 | Yes | No |  114M   |  138M    |\n\n- We are sorry that the checkpoints for RoboNet at 256x256 resolution were deleted by mistake during a disk cleanup, we will retrain and release them as our 
computational resources become idle.\n\u003c/details\u003e\n\n## 📦 Data Preparation\n\n**Open X-Embodiment**: Download datasets from [Open X-Embodiment](https://github.com/google-deepmind/open_x_embodiment) and extract single episodes as `.npz` files:\n\n```bash\npython datasets/oxe_data_converter.py --dataset_name {dataset name, e.g. bridge} --input_path {path to downloaded OXE} --output_path {path to stored npz}\n```\n\nTo replicate our pre-training on OXE, you need to extract all datasets listed under `OXE_SELECT` in `ivideogpt/data/dataset_mixes.py`.\n\nSee instructions at [`datasets`](/datasets) on preprocessing more datasets.\n\n## 🚀 Inference Examples\n\nFor action-free video prediction on Open X-Embodiment, run:\n\n```bash\npython inference/predict.py --pretrained_model_name_or_path \"thuml/ivideogpt-oxe-64-act-free\" --input_path inference/samples/fractal_sample.npz --dataset_name fractal20220817_data\n```\n\nSee more examples at [`inference`](/inference).\n\n## 🌟 Pre-training\n\nTo pre-train iVideoGPT, adjust the arguments in the command below as needed and run:\n\n```bash\nbash ./scripts/pretrain/ivideogpt-oxe-64-act-free.sh\n```\n\nSee more scripts for [pre-trained models](#-models) at [`scripts/pretrain`](/scripts/pretrain).\n\n## 🎇 Fine-tuning Video Prediction\n\n### Finetuning Tokenizer\n\nAfter preparing the [BAIR](/datasets#bair-robot-pushing) dataset, run the following:\n\n```bash\naccelerate launch train_tokenizer.py \\\n    --exp_name bair_tokenizer_ft --output_dir log_vqgan --seed 0 --mixed_precision bf16 \\\n    --model_type ctx_vqgan \\\n    --train_batch_size 16 --gradient_accumulation_steps 1 --disc_start 1000005 \\\n    --oxe_data_mixes_type bair --resolution 64 --dataloader_num_workers 16 \\\n    --rand_select --video_stepsize 1 --segment_horizon 16 --segment_length 8 --context_length 1 \\\n    --pretrained_model_name_or_path pretrained_models/ivideogpt-oxe-64-act-free/tokenizer \\\n    --max_train_steps 200005\n```\n\n### Finetuning 
Transformer\n\nFor action-conditioned video prediction, run the following:\n\n```bash\naccelerate launch train_gpt.py \\\n    --exp_name bair_llama_ft --output_dir log_trm --seed 0 --mixed_precision bf16 \\\n    --vqgan_type ctx_vqgan \\\n    --pretrained_model_name_or_path {log directory of finetuned tokenizer}/unwrapped_model \\\n    --config_name configs/llama/config.json --load_internal_llm --action_conditioned --action_dim 4 \\\n    --pretrained_transformer_path pretrained_models/ivideogpt-oxe-64-act-free/transformer \\\n    --per_device_train_batch_size 16 --gradient_accumulation_steps 1 \\\n    --learning_rate 1e-4 --lr_scheduler_type cosine \\\n    --oxe_data_mixes_type bair --resolution 64 --dataloader_num_workers 16 \\\n    --video_stepsize 1 --segment_length 16 --context_length 1 \\\n    --use_eval_dataset --use_fvd --use_frame_metrics \\\n    --weight_decay 0.01 --llama_attn_drop 0.1 --embed_no_wd \\\n    --max_train_steps 100005\n```\n\nFor action-free video prediction, remove `--load_internal_llm --action_conditioned`.\n\nSee more scripts at [`scripts/finetune`](/scripts/finetune).\n\n### Evaluation\n\nTo evaluate the checkpoints only, run:\n\n```bash\nbash ./scripts/evaluation/bair-64-act-cond.sh\n```\n\nSee more scripts for [released checkpoints](#-models) at [`scripts/evaluation`](/scripts/evaluation).\n\n## 🤖 Visual Control\n\n### Visual Model-based RL\n\nInstall the Metaworld version we used:\n\n```bash\npip install git+https://github.com/Farama-Foundation/Metaworld.git@83ac03ca3207c0060112bfc101393ca794ebf1bd\n```\n\nModify paths in `mbrl/cfgs/mbpo_config.yaml` to your own paths (currently only support absolute paths).\n\nRun model-based RL with iVideoGPT:\n\n```bash\npython mbrl/train_metaworld_mbpo.py task=plate_slide num_train_frames=100002 demo=true\n```\n\n### Visual Planning\n\nSee [`vp`](/vp) for detailed instructions.\n\n## 🎥 Showcases\n\n![showcase](assets/showcase.png)\n\n## 📜 Citation\n\nIf you find this project useful, please cite 
our paper as:\n\n```\n@inproceedings{wu2024ivideogpt,\n    title={iVideoGPT: Interactive VideoGPTs are Scalable World Models}, \n    author={Jialong Wu and Shaofeng Yin and Ningya Feng and Xu He and Dong Li and Jianye Hao and Mingsheng Long},\n    booktitle={Advances in Neural Information Processing Systems},\n    year={2024},\n}\n```\n\n## 🤝 Contact\n\nIf you have any questions, please contact wujialong0229@gmail.com.\n\n## 💡 Acknowledgement\n\nOur codebase is based on [huggingface/diffusers](https://github.com/huggingface/diffusers) and [facebookresearch/drqv2](https://github.com/facebookresearch/drqv2).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthuml%2Fivideogpt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthuml%2Fivideogpt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthuml%2Fivideogpt/lists"}