{"id":25951364,"url":"https://github.com/buoyancy99/diffusion-forcing","last_synced_at":"2025-03-04T13:48:03.254Z","repository":{"id":248005788,"uuid":"820501957","full_name":"buoyancy99/diffusion-forcing","owner":"buoyancy99","description":"code for \"Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion\"","archived":false,"fork":false,"pushed_at":"2025-02-16T05:45:01.000Z","size":83571,"stargazers_count":719,"open_issues_count":1,"forks_count":37,"subscribers_count":11,"default_branch":"main","last_synced_at":"2025-02-16T06:27:03.716Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/buoyancy99.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-26T15:39:12.000Z","updated_at":"2025-02-16T05:45:05.000Z","dependencies_parsed_at":"2024-11-13T18:24:44.565Z","dependency_job_id":"dd98378f-cccc-4425-8865-50ad642f0afc","html_url":"https://github.com/buoyancy99/diffusion-forcing","commit_stats":null,"previous_names":["buoyancy99/diffusion-forcing"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/buoyancy99%2Fdiffusion-forcing","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/buoyancy99%2Fdiffusion-forcing/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/buoyancy99%2Fdiffusion-forcing/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/buoyancy99%2Fdiffusion-for
cing/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/buoyancy99","download_url":"https://codeload.github.com/buoyancy99/diffusion-forcing/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241860009,"owners_count":20032318,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-03-04T13:47:51.911Z","updated_at":"2025-03-04T13:48:03.246Z","avatar_url":"https://github.com/buoyancy99.png","language":"Python","funding_links":[],"categories":["Offline Reinforcement Learning","A01_文本生成_文本对话"],"sub_categories":["大语言对话模型及数据"],"readme":"# Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion\n\n#### [[Project Website]](https://boyuan.space/diffusion-forcing) [[Paper]](https://arxiv.org/abs/2407.01392)\n\n[Boyuan Chen\u003csup\u003e1\u003c/sup\u003e](https://boyuan.space/), [Diego Martí Monsó\u003csup\u003e2\u003c/sup\u003e](https://www.linkedin.com/in/diego-marti/?originalSubdomain=de), [ Yilun Du\u003csup\u003e1\u003c/sup\u003e](https://yilundu.github.io/), [Max Simchowitz\u003csup\u003e1\u003c/sup\u003e](https://msimchowitz.github.io/), [Russ Tedrake\u003csup\u003e1\u003c/sup\u003e](https://groups.csail.mit.edu/locomotion/russt.html), [Vincent Sitzmann\u003csup\u003e1\u003c/sup\u003e](https://www.vincentsitzmann.com/) \u003cbr/\u003e\n\u003csup\u003e1\u003c/sup\u003eMIT \u003csup\u003e2\u003c/sup\u003eTechnical University of Munich \u003c/br\u003e\n\nThis is the v1.5 code base for our paper [Diffusion Forcing: Next-token Prediction Meets Full-Sequence 
Diffusion](https://boyuan.space/diffusion-forcing). The **main** branch contains our latest reimplementation with temporal attention (recommended), while the **paper** branch contains the RNN code used by the original paper, kept for reproduction purposes. [Diffusion Forcing v2](https://boyuan.space/history-guidance/) has been released! It is a stronger technique for rolling out extremely long video generation, with modern architectures like DiT and latent diffusion. Please check out its [github repo](https://github.com/kwsong0113/diffusion-forcing-transformer) as well if you are only interested in video generation. I will reimplement Diffusion Forcing v2 on robot and planning domains here when I have time.\n\n![plot](teaser.png)\n\n```\n@article{chen2025diffusion,\n  title={Diffusion forcing: Next-token prediction meets full-sequence diffusion},\n  author={Chen, Boyuan and Mart{\\'\\i} Mons{\\'o}, Diego and Du, Yilun and Simchowitz, Max and Tedrake, Russ and Sitzmann, Vincent},\n  journal={Advances in Neural Information Processing Systems},\n  volume={37},\n  pages={24081--24125},\n  year={2025}\n}\n```\n\n# Project Instructions\n\n## Setup\n\nFirst of all, if you are interested only in higher-quality video generation, please also check out our newly released [Diffusion Forcing v2](https://boyuan.space/history-guidance/) repo. If you are interested in a cleaner code base and other domains (planning and robot), keep reading. If you want to use our latest improved implementation for video and planning with temporal attention instead of RNN, stay on this branch. 
If you are instead interested in reproducing the claims of the original paper, switch to the branch used by the original paper via `git checkout paper`.\n\nRun `conda create python=3.10 -n diffusion-forcing` to create the environment.\nRun `conda activate diffusion-forcing` to activate it.\n\nInstall dependencies for time series, video and robotics:\n\n```\npip install -r requirements.txt\n```\n\n[Sign up](https://wandb.ai/site) for a wandb account for cloud logging and checkpointing. On the command line, run `wandb login` to log in.\n\nThen set the wandb entity in `configurations/config.yaml` to your wandb account.\n\nOptionally, if you want to do maze planning, install the following extra dependencies, which are complicated because d4rl relies on outdated packages. This involves first installing mujoco 210 and then running\n\n```\npip install -r extra_requirements.txt\n```\n\n## Quick start with pretrained checkpoints\n\nSince the dataset is huge, we provide a mini subset and pre-trained checkpoints so you can quickly test out our model! To do so, download the mini dataset and checkpoints from [here](https://drive.google.com/file/d/1xAOQxWcLzcFyD4zc0_rC9jGXe_uaHb7b/view?usp=sharing) to the project root and extract with `tar -xzvf quickstart_atten.tar.gz`. Files should appear in `data` and `outputs/xxx.ckpt`. If you forked before the ckpt release, make sure you also git pull upstream to use the latest version of the code!\n\nThen run the following commands and go to the wandb panel to see the results.\n\n### Video Prediction:\n\nOur visualization is side by side, with the prediction on the left and the ground truth on the right. However, the ground truth is not expected to align with the prediction since the sequence is highly stochastic. 
Ground truth is shown only to give an idea of quality.\n\nAutoregressively generate minecraft video with 1x the length it's trained on:\n`python -m main +name=sample_minecraft_pretrained load=outputs/minecraft.ckpt experiment.tasks=[validation]`\n\nTo let the model roll out **longer than it's trained on**, simply append `dataset.validation_multiplier=8` to the above command, and it will roll out `8x` longer than the maximum sequence length it's trained on.\n\nThe above checkpoint is trained for 100K steps with a small number of frames. We've already verified that diffusion forcing works in the latent diffusion setting and can be extended to many more tokens without sacrificing compositionality (with some additional techniques outside this repo)! Stay tuned for our next project!\n\n### Maze Planning:\n\nThe maze planning setting has changed a bit as we gained more insights; please see the corresponding paragraphs in the training section for details. We haven't reimplemented MCTG yet, but you can already see nice visualizations in the wandb log.\n\nMedium Maze\n\n`python -m main experiment=exp_planning algorithm=df_planning dataset=maze2d_medium dataset.action_mean=[] dataset.action_std=[] dataset.observation_mean=[3.5092521,3.4765592] dataset.observation_std=[1.3371079,1.52102] load=outputs/maze2d_medium_x.ckpt experiment.tasks=[validation] algorithm.guidance_scale=3 +name=maze2d_medium_x_sampling`\n\nLarge Maze\n\n`python -m main experiment=exp_planning algorithm=df_planning dataset=maze2d_large dataset.observation_mean=[3.7296331,5.3047247] dataset.observation_std=[1.8070312,2.5687592] dataset.action_mean=[] dataset.action_std=[] load=outputs/maze2d_large_x.ckpt experiment.tasks=[validation] algorithm.guidance_scale=2 +name=maze2d_large_x_sampling`\n\nWe also explored a couple more settings but haven't reimplemented everything in the original paper yet. 
If you are interested in those checkpoints, see the source of this README file for the commented-out ckpt loading instructions.\n\n\u003c!--\nHere is also a position + velocity setting ckpt, but we don't recommend this because diffusing a quantity and its derivative together creates a bad optimization landscape.\n\n`python -m main experiment=exp_planning algorithm=df_planning dataset=maze2d_medium dataset.observation_std=[2.6742158,3.04204,9.3630628,9.4774808] dataset.action_mean=[] dataset.action_std=[] load=outputs/maze2d_medium_xv.ckpt experiment.tasks=[validation] algorithm.guidance_scale=4 +name=maze2d_medium_xv_sampling`\n\n`python -m main experiment=exp_planning algorithm=df_planning dataset=maze2d_large dataset.observation_std=[3.6140624,5.1375184,9.747382,10.5974788] dataset.action_mean=[] dataset.action_std=[] load=outputs/maze2d_large_xv.ckpt experiment.tasks=[validation] algorithm.guidance_scale=4 +name=maze2d_large_xv_sampling`\n\nHere is also a ckpt where we take diffused actions, a challenging setting that's not done in prior papers. We haven't got it working as well as the original RNN version of diffusion forcing, but it does have okay numbers. You can tune up the guidance scale a bit.\n\n`python -m main experiment=exp_planning algorithm=df_planning dataset=maze2d_medium dataset.observation_std=[2.67,3.04,8,8] dataset.action_std=[6,6] load=outputs/maze2d_medium_xva.ckpt experiment.tasks=[validation] algorithm.guidance_scale=2 algorithm.open_loop_horizon=10 +name=maze2d_medium_xva_sampling`\n\n`python -m main experiment=exp_planning algorithm=df_planning dataset=maze2d_large dataset.observation_std=[3.62,5.14,9.76,10.6] dataset.action_std=[3,3] load=outputs/maze2d_large_xva.ckpt experiment.tasks=[validation] algorithm.guidance_scale=2 algorithm.open_loop_horizon=10 +name=maze2d_large_xva_sampling` --\u003e\n\n## Training\n\n### Video\n\nVideo prediction requires downloading giant datasets. 
First, if you downloaded the mini subset following the `Quick start with pretrained checkpoints` section, delete the mini subset folders `data/minecraft` and `data/dmlab` because we have to download the whole dataset this time. The Python code will download the dataset for you if it doesn't already exist. Due to the slowness of the [source](https://github.com/wilson1yan/teco), this may take a couple of days. If you prefer to do it yourself via bash script, please refer to the bash scripts in the original [TECO dataset](https://github.com/wilson1yan/teco) and use `dmlab.sh` and `minecraft.sh` in the Dataset section of their README, and maybe split the bash script into parallel scripts.\n\nThen just run the corresponding commands:\n\n#### Minecraft\n\n`python -m main +name=your_experiment_name algorithm=df_video dataset=video_minecraft`\n\n#### DMLab\n\n`python -m main +name=your_experiment_name algorithm=df_video dataset=video_dmlab algorithm.weight_decay=1e-3 algorithm.diffusion.architecture.network_size=48 algorithm.diffusion.architecture.attn_dim_head=32 algorithm.diffusion.architecture.attn_resolutions=[8,16,32,64] algorithm.diffusion.beta_schedule=cosine`\n\n#### No causal masking\n\nSimply append `algorithm.causal=False` to your command.\n\n#### Play with sampling\n\nPlease take a look at the \"Load a checkpoint to eval\" paragraph to understand how to load a checkpoint with `load=`. Then, run the exact training command with `experiment.tasks=[validation] load={wandb_run_id}` to load a checkpoint and experiment with sampling.\n\nTo see how you can roll out longer than the sequence length the model is trained on, you can find instructions in the `quick start with pretrained checkpoints` section. Keep in mind that rolling out infinitely without a sliding window is a property of the original RNN implementation on the `paper` branch, and this version has to use a sliding window since it uses temporal attention.\n\nBy default, we run autoregressive sampling with stabilization. 
To sample the next 2 tokens jointly, you can append the following to the above command: `algorithm.scheduling_matrix=full_sequence algorithm.chunk_size=2`.\n\n## Maze Planning\n\nFor those who only wish to reproduce the original paper instead of the transformer architecture, please check out the `paper` branch of the code instead.\n\n**Medium Maze**\n\n`python -m main experiment=exp_planning algorithm=df_planning dataset=maze2d_medium dataset.action_mean=[] dataset.action_std=[] dataset.observation_mean=[3.5092521,3.4765592] dataset.observation_std=[1.3371079,1.52102] +name=maze2d_medium_x`\n\n**Large Maze**\n\n`python -m main experiment=exp_planning algorithm=df_planning dataset=maze2d_large dataset.observation_mean=[3.7296331,5.3047247] dataset.observation_std=[1.8070312,2.5687592] dataset.action_mean=[] dataset.action_std=[] +name=maze2d_large_x`\n\n**Run planning after model is trained**\n\nPlease take a look at the \"Load a checkpoint to eval\" paragraph to understand how to load a checkpoint with `load=`. To sample, simply append `load={wandb_id_of_above_runs} experiment.tasks=[validation] algorithm.guidance_scale=2 +name=maze2d_sampling` to the above command after training. Feel free to tune the `guidance_scale` from 1 to 5.\n\nThis version of maze planning uses a different version of diffusion forcing from the original paper. While working on the follow-up to diffusion forcing, we realized that training with independent noise actually constructs a smooth interpolation between causal and non-causal models too, since we can just mask out the future with complete noise (fully causal) or some noise (interpolation). 
The best thing is, you can still account for causal uncertainty via pyramid sampling in this setting, by masking out tokens at different noise levels, and you can still have a flexible horizon because you can tell the model that padded entries are pure noise, a unique ability of diffusion forcing.\n\nWe also reflected a bit on the environment and concluded that the original metric isn't necessarily a good metric, because maze planning should reward those who can plan the fastest route to the goal, not a slow-walking agent that gets there at the end of the episode. The dataset never contains data of staying at the goal, so agents are supposed to walk away after reaching the goal. I think [Diffuser](https://arxiv.org/abs/2205.09991) had an unfair advantage from just generating slow plans that happened to let the agent stay in the neighborhood of the goal for longer and got very high reward, exploiting flaws in the environment design (a good design would penalize longer time taken to reach the goal). So, in this version of the code, we just optimize for flexible horizon planning that tries to reach the goal asap, and the planner will automatically come back to the goal if it left the goal, since staying is never in the dataset. You can see the new metrics we designed in the wandb logging interface.\n\n## Timeseries and Robotics\n\nPlease check out the `paper` branch for the code used by the original paper. If I have time later, I will reimplement these two domains with transformer as well to complete this branch.\n\n# Change Log\n\n| Date      |                                              Notes                                              |\n| --------- | :---------------------------------------------------------------------------------------------: |\n| Jul/30/24 |             Upgrade RNN to temporal attention, move original code to 'paper' branch             |\n| Jul/03/24 | Initial release of the code. Email me if you have questions or find any errors in this version. 
|\n\n# Infra instructions\n\nThis repo is forked from [Boyuan Chen](https://boyuan.space/)'s research template [repo](https://github.com/buoyancy99/research-template). By its MIT license, you must keep the above sentence in `README.md` and the `LICENSE` file to credit the author.\n\nAll experiments can be launched via `python -m main +name=xxxx {options}`; you can find more details later in this document.\n\nThe code base will automatically use cuda or your Macbook M1 GPU when available.\n\nFor slurm clusters e.g. mit supercloud, you can run `python -m main cluster=mit_supercloud {options}` on the login node.\nIt will automatically generate slurm scripts and run them for you on a compute node. Even if compute nodes are offline,\nthe script will still automatically sync wandb logging to cloud with \u003c1min latency. It's also easy to add your own slurm cluster\nby following the `Add slurm clusters` section.\n\n## Modify for your own project\n\nFirst, create a new repository with this template. Make sure the new repository has the name you want to use for wandb\nlogging.\n\nAdd your method and baselines in `algorithms` following the `algorithms/README.md` as well as the example code in\n`algorithms/diffusion_forcing/df_video.py`. For pytorch experiments, write your algorithm as a [pytorch lightning](https://github.com/Lightning-AI/lightning)\n`pl.LightningModule` which has extensive\n[documentation](https://lightning.ai/docs/pytorch/stable/). For a quick start, read \"Define a LightningModule\" in this [link](https://lightning.ai/docs/pytorch/stable/starter/introduction.html). Finally, add a yaml config file to `configurations/algorithm` imitating that of `configurations/algorithm/df_video.yaml`, for each algorithm you added.\n\nAdd your dataset in `datasets` following the `datasets/README.md` as well as the example code in\n`datasets/video`. 
Finally, add a yaml config file to `configurations/dataset` imitating that of\n`configurations/dataset/video_dmlab.yaml`, for each dataset you added.\n\nAdd your experiment in `experiments` following the `experiments/README.md` or following the example code in\n`experiments/exp_video.py`. Then register your experiment in `experiments/__init__.py`.\nFinally, add a yaml config file to `configurations/experiment` imitating that of\n`configurations/experiment/exp_video.yaml`, for each experiment you added.\n\nModify `configurations/config.yaml` to set `algorithm` to the yaml file you want to use in `configurations/algorithm`;\nset `experiment` to the yaml file you want to use in `configurations/experiment`; set `dataset` to the yaml file you\nwant to use in `configurations/dataset`, or to `null` if no dataset is needed. Note that the fields should not contain the\n`.yaml` suffix.\n\nYou are all set!\n\n`cd` into your project root. Now you can launch your new experiment with `python main.py +name=\u003cname_your_experiment\u003e`. You can run baselines or\ndifferent datasets by adding arguments like `algorithm=xxx` or `dataset=xxx`. You can also override any `yaml` configurations by following the next section.\n\nOne special note: if you want to define a new task for your experiment (e.g. other than `training` and `test`), you can define it as a method in your experiment class and use `experiment.tasks=[task_name]` to run it. Let's say you have a `generate_dataset` task that should run before the `training` task and you implemented it in your experiment class; you can then run `python -m main +name=xxxx experiment.tasks=[generate_dataset,training]` to execute it before training.\n\n## Pass in arguments\n\nWe use [hydra](https://hydra.cc) instead of `argparse` to configure arguments at every code level. 
You can either write a static config in the `configurations` folder or, at runtime,\n[override part of your static config](https://hydra.cc/docs/tutorials/basic/your_first_app/simple_cli/) with command line arguments.\n\nFor example, arguments `algorithm=example_classifier experiment.lr=1e-3` will override the `lr` variable in `configurations/experiment/example_classifier.yaml`. The argument `wandb.mode` will override the `mode` under the `wandb` namespace in the file `configurations/config.yaml`.\n\nAll static configs and runtime overrides will be logged to cloud automatically.\n\n## Resume a checkpoint \u0026 logging\n\nFor machine learning experiments, all checkpoints and logs are logged to cloud automatically so you can resume them on another server. Simply append `resume={wandb_run_id}` to your command line arguments to resume a run. The run_id can be found in the URL of a wandb run in the wandb dashboard. By default, the latest checkpoint in a run is stored indefinitely and earlier checkpoints in the run will be deleted after 5 days to save your storage.\n\nOn the other hand, sometimes you may want to start a new run with a different run id but still load a prior ckpt. This can be done by setting the `load={wandb_run_id / ckpt path}` flag.\n\n## Load a checkpoint to eval\n\nThe argument `experiment.tasks=[task_name1,task_name2]` (note that the `[]` brackets are required) lets you select a sequence of tasks to execute, such as `training`, `validation` and `test`. Therefore, for testing a machine learning ckpt, you may run `python -m main load={your_wandb_run_id} experiment.tasks=[test]`.\n\nMore generally, the task names are the corresponding method names of your experiment class. 
For `BaseLightningExperiment`, we already defined three methods `training`, `validation` and `test` for you, but you can also define your own tasks by adding methods to your experiment class under the intended task names.\n\n## Debug\n\nWe provide a useful debug flag which you can enable with `python main.py debug=True`. This will enable numerical error tracking as well as set `cfg.debug` to `True` for your experiment, algorithm and dataset classes. However, this debug flag will make ML code very slow as it automatically tracks all parameters / gradients!\n\n## Add slurm clusters\n\nIt's very easy to add your own slurm clusters by adding a yaml file in `configurations/cluster`. You can take a look\nat `configurations/cluster/mit_vision.yaml` for an example.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbuoyancy99%2Fdiffusion-forcing","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbuoyancy99%2Fdiffusion-forcing","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbuoyancy99%2Fdiffusion-forcing/lists"}