{"id":21337157,"url":"https://github.com/flymin/magicdrivedit","last_synced_at":"2025-05-16T04:03:36.530Z","repository":{"id":263967320,"uuid":"891941403","full_name":"flymin/MagicDriveDiT","owner":"flymin","description":"Official implementation of the paper “MagicDriveDiT: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control”","archived":false,"fork":false,"pushed_at":"2025-02-03T09:06:18.000Z","size":488,"stargazers_count":489,"open_issues_count":17,"forks_count":15,"subscribers_count":13,"default_branch":"main","last_synced_at":"2025-05-16T04:02:43.323Z","etag":null,"topics":["autonomous-driving","diffusion-models","multi-view","video-generation"],"latest_commit_sha":null,"homepage":"https://gaoruiyuan.com/magicdrivedit/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/flymin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-11-21T08:29:30.000Z","updated_at":"2025-05-15T07:02:39.000Z","dependencies_parsed_at":"2025-01-13T04:19:09.777Z","dependency_job_id":"40edaee1-17ce-4024-803b-2aad9e7097ef","html_url":"https://github.com/flymin/MagicDriveDiT","commit_stats":null,"previous_names":["flymin/magicdrivedit"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/flymin%2FMagicDriveDiT","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/flymin%2FMagicDriveDiT/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/flymin%2FMagicDriveDiT/releases","mani
fests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/flymin%2FMagicDriveDiT/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/flymin","download_url":"https://codeload.github.com/flymin/MagicDriveDiT/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254464891,"owners_count":22075570,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["autonomous-driving","diffusion-models","multi-view","video-generation"],"created_at":"2024-11-21T23:57:57.016Z","updated_at":"2025-05-16T04:03:36.432Z","avatar_url":"https://github.com/flymin.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# MagicDriveDiT\n\n[![arXiv](https://img.shields.io/badge/ArXiv-2411.13807-b31b1b.svg?style=plastic)](https://arxiv.org/abs/2411.13807) [![web](https://img.shields.io/badge/Web-MagicDriveDiT-blue.svg?style=plastic)](https://gaoruiyuan.com/magicdrivedit/) [![license](https://img.shields.io/github/license/flymin/MagicDriveDiT?style=plastic)](https://github.com/flymin/MagicDriveDiT/blob/main/LICENSE) [![star](https://img.shields.io/github/stars/flymin/MagicDriveDiT)](https://github.com/flymin/MagicDriveDiT) [![Paper](https://huggingface.co/datasets/huggingface/badges/resolve/main/paper-page-sm.svg)](https://huggingface.co/papers/2411.13807) [![Model](https://huggingface.co/datasets/huggingface/badges/resolve/main/model-on-hf-sm.svg)](https://huggingface.co/flymin/MagicDriveDiT-stage3-40k-ft) 
[![Dataset](https://huggingface.co/datasets/huggingface/badges/resolve/main/dataset-on-hf-sm.svg)](https://huggingface.co/datasets/flymin/MagicDriveDiT-nuScenes-metadata)\n\nThis repository contains the implementation of the paper \n\n\u003e MagicDriveDiT: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control \u003cbr\u003e\n\u003e [Ruiyuan Gao](https://gaoruiyuan.com/)\u003csup\u003e1\u003c/sup\u003e, [Kai Chen](https://kaichen1998.github.io/)\u003csup\u003e2\u003c/sup\u003e, [Bo Xiao](https://www.linkedin.com/in/bo-xiao-19909955/?originalSubdomain=ie)\u003csup\u003e3\u003c/sup\u003e, [Lanqing Hong](https://scholar.google.com.sg/citations?user=2p7x6OUAAAAJ\u0026hl=en)\u003csup\u003e4\u003c/sup\u003e, [Zhenguo Li](https://scholar.google.com/citations?user=XboZC1AAAAAJ\u0026hl=en)\u003csup\u003e4\u003c/sup\u003e, [Qiang Xu](https://cure-lab.github.io/)\u003csup\u003e1\u003c/sup\u003e\u003cbr\u003e\n\u003e \u003csup\u003e1\u003c/sup\u003eCUHK \u003csup\u003e2\u003c/sup\u003eHKUST \u003csup\u003e3\u003c/sup\u003eHuawei Cloud \u003csup\u003e4\u003c/sup\u003eHuawei Noah's Ark Lab \u003cbr\u003e\n\nhttps://github.com/user-attachments/assets/f43812ea-087b-4b70-883b-1e2f1c0df8d7\n\n## Abstract\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eTL; DR\u003c/b\u003e MagicDriveDiT generates high-resolution and long videos for street-view with diverse 3D geometry control and multiview consistency.\u003c/summary\u003e\n\nThe rapid advancement of diffusion models has greatly improved video synthesis, especially in controllable video generation, which is essential for applications like autonomous driving. However, existing methods are limited by scalability and how control conditions are integrated, failing to meet the needs for high-resolution and long videos for autonomous driving applications. In this paper, we introduce MagicDriveDiT, a novel approach based on the DiT architecture, and tackle these challenges. 
Our method enhances scalability through flow matching and employs a progressive training strategy to manage complex scenarios. By incorporating spatial-temporal conditional encoding, MagicDriveDiT achieves precise control over spatial-temporal latents. Comprehensive experiments show its superior performance in generating realistic street scene videos with higher resolution and more frames. MagicDriveDiT significantly improves video generation quality and spatial-temporal controls, expanding its potential applications across various tasks in autonomous driving.\n\n\u003c/details\u003e\n\n## News\n- [2025/01/27] We update **fine-tuned results on Waymo Open Dataset** on our project page. [Check it out](https://gaoruiyuan.com/magicdrivedit/#waymo)!\n- [2024/12/07] Stage-3 checkpoint and nuScenes metadata for training \u0026 inference release!\n- [2024/12/03] Train \u0026 inference code release! We will update links in readme later.\n- [2024/11/22] Paper and project page released! Check https://gaoruiyuan.com/magicdrivedit/\n\n## TODO\n\n- [x] train \u0026 inference code\n- [x] pretrained weight for stage 3 \u0026 metadata for nuScenes\n- [ ] pretrained weight for stage 1 \u0026 2 (will be released later)\n\n## Getting Started\n\n### Environment Setup\n\nClone this repo\n\n```bash\ngit clone https://github.com/flymin/MagicDriveDiT.git\n```\n\nThe code is tested on **A800/H20/Ascend 910b** servers. To setup the python environment, follow:\n\n\u003e [!NOTE]  \n\u003e Please use `pip` to set up your environment. We DO NOT recommend using `conda`+`yaml` directly for environment configuration.\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eNVIDIA Servers\u003c/b\u003e step-by-step guide:\u003c/summary\u003e\n\n1. 
Make sure you have an environment with the following packages:\n    ```bash\n    torch==2.4.0\n    torchvision==0.19.0\n\n    # may need to build from source\n    apex (https://github.com/NVIDIA/apex)\n    \n    # choose the correct wheel packages or build from the source\n    xformers\u003e=0.0.27\n    flash-attn\u003e=2.6.3\n    ```\n2. Install ColossalAI\n    ```bash\n    git clone https://github.com/flymin/ColossalAI.git\n    # enter the cloned repo before switching branches\n    cd ColossalAI\n    git checkout pt2.4 \u0026\u0026 git pull\n    BUILD_EXT=1 pip install .\n    ```\n3. Install other dependencies\n    ```bash\n    pip install -r requirements/requirements.txt\n    ```\n\u003c/details\u003e\n\nPlease refer to the following yaml files for further details:\n- A800: `requirements/a800_cu118.yaml`\n- H20: `requirements/h20_cu124.yaml`\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eAscend Servers\u003c/b\u003e step-by-step guide:\u003c/summary\u003e\n\n1. Make sure you have an environment with the following packages (please refer to [this page](https://www.hiascend.com/document/detail/zh/Pytorch/60RC2/configandinstg/instg/insg_0003.html?sub_id=%2Fzh%2FPytorch%2F60RC2%2Fconfigandinstg%2Finstg%2Finsg_0008.html) to set up the PyTorch environment):\n    ```bash\n    # based on CANN 8.0RC2\n    torch==2.3.1\n    torchvision==0.18.1\n    torch-npu==2.3.1\n    apex (https://gitee.com/ascend/apex)\n\n    # choose the correct wheel packages or build from the source\n    xformers==0.0.27\n    ```\n2. Install ColossalAI\n    ```bash\n    # We remove dependency on `bitsandbytes`.\n    git clone https://github.com/flymin/ColossalAI.git\n    # enter the cloned repo before switching branches\n    cd ColossalAI\n    git checkout ascend \u0026\u0026 git pull\n    BUILD_EXT=1 pip install .\n    ```\n3. 
Install other dependencies\n    ```bash\n    pip install -r requirements/requirements.txt\n    ```\n\u003c/details\u003e\n\nPlease refer to `requirements/910b_cann8.0.RC2_aarch64.yaml` for further details.\n\n### Pretrained Weights\n\n**VAE**: We use the 3DVAE from [THUDM/CogVideoX-2b](https://huggingface.co/THUDM/CogVideoX-2b). It is OK if you only download the `vae` sub-folder.\n\n**Text Encoder**: We use T5 Encoder from [google/t5-v1_1-xxl](https://huggingface.co/google/t5-v1_1-xxl).\n\nYou should organize them as follows:\n\n```bash\n${CODE_ROOT}/pretrained/\n├── CogVideoX-2b\n│   └── vae\n└── t5-v1_1-xxl\n```\n\n### MagicDriveDiT Checkpoints\n\nPlease download the stage-3 checkpoint from [flymin/MagicDriveDiT-stage3-40k-ft](https://huggingface.co/flymin/MagicDriveDiT-stage3-40k-ft) and put it in `${CODE_ROOT}/ckpts/` as:\n\n```bash\n${CODE_ROOT}/ckpts/\n└── MagicDriveDiT-stage3-40k-ft\n```\n\n### Prepare Data\n\nWe prepare the nuScenes dataset similar to [bevfusion's instructions](https://github.com/mit-han-lab/bevfusion#data-preparation). Specifically,\n\n1. Download the nuScenes dataset from the [website](https://www.nuscenes.org/nuscenes) and put them in `./data/`. You should have these files:\n    ```bash\n    ${CODE_ROOT}/data/nuscenes\n    ├── can_bus\n    ├── maps\n    ├── mini\n    ├── samples\n    ├── sweeps\n    ├── v1.0-mini\n    └── v1.0-trainval\n    ```\n    \n2. Download the metadata for `mmdet` from [flymin/MagicDriveDiT-nuScenes-metadata](https://huggingface.co/datasets/flymin/MagicDriveDiT-nuScenes-metadata). 
\n\n    \u003cdetails\u003e\u003csummary\u003e\u003cb\u003eOtherwise\u003c/b\u003e\u003c/summary\u003e\n    \n    Please interpolate the annotations to 12Hz as in [MagicDrive-t](https://github.com/cure-lab/MagicDrive/tree/video), and generate the metadata yourself with the commands in `tools/prepare_data/prepare_dataset.sh`.\n\n    If you have the metadata files from [MagicDrive-t](https://github.com/cure-lab/MagicDrive/tree/video), you can use `tools/prepare_data/add_box_id.py` to add the keys for instance id. See commands in `tools/prepare_data/prepare_dataset.sh`.\n\n    \u003c/details\u003e\n    \n    Your data folder should look like:\n\n    ```bash\n    ${CODE_ROOT}/data\n    ├── nuscenes\n    │   ├── ...\n    │   └── interp_12Hz_trainval\n    └── nuscenes_mmdet3d-12Hz\n        ├── nuscenes_interp_12Hz_infos_train_with_bid.pkl\n        └── nuscenes_interp_12Hz_infos_val_with_bid.pkl\n    ```\n\n3. (Optional) To accelerate data loading, we prepared cache files in h5 format for BEV maps.\n   \u003cdetails\u003e\u003csummary\u003e\u003cb\u003eInstructions\u003c/b\u003e\u003c/summary\u003e\n   \n   They can be generated through `tools/prepare_data/prepare_map_aux.py` with different configs in `configs/cache_gen`. For example:\n    ```bash\n    python tools/prepare_data/prepare_map_aux.py +cache_gen=map_cache_gen_interp \\\n        +process=val +subfix=8x200x200_12Hz\n    ```\n    Please find the full commands in `tools/prepare_data/prepare_dataset.sh`.\n    \n    Please make sure you move the generated cache file to the right path. 
Our defaults are:\n    \n    ```bash\n    ${CODE_ROOT}/data/nuscenes_map_aux_12Hz\n    ├── train_8x200x200_12Hz.h5 (25G)\n    ├── train_8x400x400_12Hz.h5 (99G)\n    ├── val_8x200x200_12Hz.h5 (5.3G)\n    └── val_8x400x400_12Hz.h5 (22G)\n\t```\n  \u003c/details\u003e\n\n## Try MagicDriveDiT\n\n*In most cases, you can use the same commands on both GPU servers and Ascend servers.*\n\n### Inference the model for Generation\n\n```bash\n# ${GPUS} can be 1/2/4/8 for sequence parallelism.\n# ${CFG} can be any file located in `configs/magicdrive/inference/`.\n# ${PATH_TO_MODEL} can be the path to `ema.pt` or the path to `model` from the checkpoint.\n# ${FRAME} can be 1/9/17/33/65/129/full...(8n+1). 1 for image; full for the full length of nuScenes.\n# `cpu_offload=true` and `scheduler.type=rflow-slice` can be omitted if you have enough GPU memory.\nexport PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True\ntorchrun --standalone --nproc_per_node ${GPUS} scripts/inference_magicdrive.py ${CFG} \\\n    --cfg-options model.from_pretrained=${PATH_TO_MODEL} num_frames=${FRAME} \\\n    cpu_offload=true scheduler.type=rflow-slice\n```\n\nPlease check the [FAQ](https://github.com/flymin/MagicDriveDiT/blob/flymin-dev/doc/FAQ.md#q21-minimum-gpu-memory-requirements-for-inference) for more information about GPU memory requirements.\n\nFor example, to generate the full-length video (20s@12fps) at the highest resolution (848x1600), with 8*H20/A800:\n\n```bash\nPYTORCH_CUDA_ALLOC_CONF=expandable_segments:True torchrun --standalone --nproc_per_node 8 \\\n    scripts/inference_magicdrive.py \\\n    configs/magicdrive/inference/fullx848x1600_stdit3_CogVAE_boxTDS_wCT_xCE_wSST.py \\\n    --cfg-options model.from_pretrained=./ckpts/MagicDriveDiT-stage3-40k-ft/ema.pt \\\n    num_frames=full cpu_offload=true scheduler.type=rflow-slice\n```\n\n\u003cdetails\u003e\n\u003csummary\u003eOther options for generation:\u003c/summary\u003e\n\u003cul\u003e\n\u003cli\u003e \u003ccode\u003eforce_daytime\u003c/code\u003e: (bool) 
force to generate daytime scenes. \u003c/li\u003e\n\u003cli\u003e \u003ccode\u003eforce_rainy\u003c/code\u003e: (bool) force to generate rainy scenes. \u003c/li\u003e\n\u003cli\u003e \u003ccode\u003eforce_night\u003c/code\u003e: (bool) force to generate night scenes. \u003c/li\u003e\n\u003cli\u003e \u003ccode\u003eallow_class\u003c/code\u003e: (list) limit the classes for generation. \u003c/li\u003e\n\u003cli\u003e \u003ccode\u003edel_box_ratio\u003c/code\u003e: (float) randomly drop boxes for generation. \u003c/li\u003e\n\u003cli\u003e \u003ccode\u003edrop_nearest_car\u003c/code\u003e: (int) drop N-nearest vehicles during generation. \u003c/li\u003e\n\u003c/ul\u003e\n\n\u003c/details\u003e\n\n### Inference the model for Test\n\nWe generate the videos in the format of [W-CODA2024 Track2](https://coda-dataset.github.io/w-coda2024/track2/) and test with the established benchmark. Before generation, please make sure the meta data for evaluation is prepared as follows:\n\n```bash\n${CODE_ROOT}/data/nuscenes_mmdet3d-12Hz\n├── nuscenes_interp_12Hz_infos_track2_eval.pkl # this can be downloaded from the page for track2\n└── nuscenes_interp_12Hz_infos_track2_eval_with_bid.pkl  # this can be generated or downloaded from this project.\n```\n\nTo generate the videos (with 8 GPUs/NPUs):\n\n```bash\nexport PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True  # for GPU\nexport PYTORCH_NPU_ALLOC_CONF=expandable_segments:True  # for NPU\ntorchrun --standalone --nproc_per_node 8 scripts/test_magicdrive.py \\\n    configs/magicdrive/test/17-16x848x1600_stdit3_CogVAE_boxTDS_wCT_xCE_wSST_map0_fsp8_cfg2.0.py \\\n    --cfg-options model.from_pretrained=${PATH_TO_MODEL} tag=${TAG}\n```\n\n\n## Train MagicDriveDiT\n\nLaunch training with (with 32xA800/H20):\n```bash\n# please change \"xx\" to real rank and ip\n# ${config} can be any file in `configs/magicdrive/train`.\n# For example: 
configs/magicdrive/train/stage3_higher-b-v3.1-12Hz_stdit3_CogVAE_boxTDS_wCT_xCE_wSST_bs4_lr1e-5_sp4simu8.py\ntorchrun --nproc-per-node=8 --nnode=4 --node_rank=xx --master_addr xx --master_port 18836 \\\n    scripts/train_magicdrive.py ${config} --cfg-options num_workers=2 prefetch_factor=2\n```\nWe also used 64 Ascend 910b NPUs to train stage 2; please see the configs in `configs/magicdrive/npu_64g`.\n\nWe also provide a debug config to test your environment and data loading process:\n```bash\n# for example (with 4xA800)\n# ${config} can be any file in `configs/magicdrive/train`.\n# For example: configs/magicdrive/train/stage3_higher-b-v3.1-12Hz_stdit3_CogVAE_boxTDS_wCT_xCE_wSST_bs4_lr1e-5_sp4simu8.py\nbash scripts/launch_1node.sh 4 ${config} --cfg-options debug=true\n\t\n# by setting `vsdebug=true` with 1 process, you can use the 'attach mode' from vscode to debug.\n```\n\nNote: `sp=4` (stage 3) needs at least 4 GPUs to run.\n\n\n## Cite Us\n\n```bibtex\n@misc{gao2024magicdrivedit,\n  title={{MagicDriveDiT}: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control},\n  author={Gao, Ruiyuan and Chen, Kai and Xiao, Bo and Hong, Lanqing and Li, Zhenguo and Xu, Qiang},\n  year={2024},\n  eprint={2411.13807},\n  archivePrefix={arXiv},\n}\n```\n\n## Credit\n\nWe adopt the following open-source projects:\n\n- [BEVFusion](https://github.com/mit-han-lab/bevfusion): dataloader to handle 3d bounding boxes and BEV map\n- [Open-Sora](https://github.com/hpcaitech/Open-Sora): STDiT3 and framework to train\n- [ColossalAI](https://github.com/hpcaitech/ColossalAI): framework for parallel and zero2\n- [CogVideoX](https://github.com/THUDM/CogVideo): we use their CogVAE\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fflymin%2Fmagicdrivedit","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fflymin%2Fmagicdrivedit","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fflymin%2Fmagicdrivedit/lists"}