{"id":19650280,"url":"https://github.com/flukeskywalker/nanodd","last_synced_at":"2025-04-30T18:24:44.378Z","repository":{"id":239135450,"uuid":"798604593","full_name":"flukeskywalker/nanoDD","owner":"flukeskywalker","description":"Simple Scalable Discrete Diffusion for text in PyTorch","archived":false,"fork":false,"pushed_at":"2024-09-27T23:04:32.000Z","size":1520,"stargazers_count":33,"open_issues_count":0,"forks_count":2,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-30T18:44:00.638Z","etag":null,"topics":["diffusion-models","discrete-diffusion","generative-model","pytorch","text-generation"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/flukeskywalker.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-05-10T05:37:02.000Z","updated_at":"2025-03-19T12:56:49.000Z","dependencies_parsed_at":"2024-09-18T09:06:01.642Z","dependency_job_id":"6cbd779b-60ce-448c-9a60-9b11d5fc531f","html_url":"https://github.com/flukeskywalker/nanoDD","commit_stats":null,"previous_names":["flukeskywalker/nanodd"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/flukeskywalker%2FnanoDD","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/flukeskywalker%2FnanoDD/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/flukeskywalker%2FnanoDD/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/flukeskywalker%2FnanoDD/
manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/flukeskywalker","download_url":"https://codeload.github.com/flukeskywalker/nanoDD/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251759221,"owners_count":21639190,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["diffusion-models","discrete-diffusion","generative-model","pytorch","text-generation"],"created_at":"2024-11-11T14:57:53.207Z","updated_at":"2025-04-30T18:24:44.356Z","avatar_url":"https://github.com/flukeskywalker.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"./_img/dd.gif\" width=800 style=\"max-width: 100%;\" alt=\"[nano] discrete diffusion\"\u003e\n\u003c/p\u003e\n\u003cp align=\"center\" style=\"font-family: sans-serif; font-size: 24px;\"\u003e\n  \u003cb\u003enanoDD\u003c/b\u003e\n\u003c/p\u003e\n\nI'm writing simple \u0026 scalable Discrete Diffusion implementations in PyTorch for education, research and fun!\n\n## What is Discrete Diffusion?\nIn simple terms, typical LLMs (such as GPTs) generate text from left-to-right, while Discrete Diffusion LMs generate a chunk of text in parallel.\n\nMore formally, Diffusion is a set of techniques for modeling data by learning a series of conditional *noisy* distributions over *all* token variables. \nThis is in contrast to autoregressive models (GPTs etc.) 
that learn *non-noisy* conditional distributions over *one* token variable at a time.\n*Discrete* Diffusion is the application of ideas similar to those used by *continuous* diffusion (used for image generation models like Stable Diffusion, Midjourney, Flux etc.) to discrete data, like text. \n\nExample:\n\n![img](./_img/sample.gif)\n\nThis GIF shows what sampling from a pre-trained masking-based discrete diffusion model looks like using the [sampling script](./sample.py) in this repo.\nWe start out with the whole sequence composed of mask tokens (maximum \"noise\") and iteratively unmask --- and hence reduce the noise --- in the sequence.\n\n## Goals for this repo\nI want more people to work on discrete diffusion, so the primary goals are to be correct, simple and instructive for newcomers to these algorithms.\nReaders should be able to use the implementations to help understand the original papers.\n\nI also want the code to be efficient and scalable so that the repo can be used as a starting point for hacking on ideas.\nSimilar to the philosophy in [nanoGPT](https://github.com/karpathy/nanoGPT), nanoDD relies on pure PyTorch and avoids abstractions (as well as dependencies that contain abstractions such as training frameworks).\nThe training script itself is directly adapted from nanoGPT with several modifications.\n\n## Models\n\nTo start off, you can train, evaluate, and sample from an [Absorbing (or \"Masking\") D3PM](https://arxiv.org/abs/2107.03006) using a [Diffusion Transformer](./dit.py) on the [text8](https://paperswithcode.com/dataset/text8) and [openwebtext](https://skylion007.github.io/OpenWebTextCorpus/) datasets.\nAdditional models are planned.\n\n## Usage\n\nInstall dependencies:\n```bash\npip install -r requirements.txt\n```\n\nDownload pre-trained model and generate samples for text8 dataset:\n```bash\n# download pre-trained weights for D3PMAbsorbing from HF (~700 MB for text8 ,~1GB for openwebtext)\ngit clone 
git@hf.co:rupspace/nanoDD-D3PM-text8\ngit clone git@hf.co:rupspace/nanoDD-D3PM-openwebtext\n\n# sample 1 text8 sequence (default):\npython sample.py ./nanoDD-D3PM-text8/ckpt.pt\npython sample.py ./nanoDD-D3PM-openwebtext/ckpt.pt --dataset openwebtext\n\n# sample 4 sequences in a batch:\npython sample.py ./nanoDD-D3PM-text8/ckpt.pt --batch_size 4\n# check options for sample.py\npython sample.py --help\n```\n\nOr train from scratch:\n```bash\n# download and prepare text8\npython data/prepare_text8.py\n# train Absorbing D3PM on single GPU (for prototyping etc)\npython train.py d3pm_text8\n```\n\n### Multi-GPU training\nMulti-GPU training is likely necessary if you want to train 12-layer models on openwebtext (or even text8).\nFor openwebtext in particular you should ideally train on 16 or 32 A100 GPUs.\nNote that Discrete Diffusion models take much longer to train than autoregressive models.\n\n```bash\n# d3pm_text8_4gpu modifies the single GPU config (see `configs.py`)\n# note that the batch_size config is per GPU, while global_batch_size == batch_size * gradient_accumulation_steps * num_gpus\n# validation uses all GPUs, so eval_iters should be modified when changing number of GPUs\n# following uses ~35 GB GPU memory per GPU in my experiments\ntorchrun --standalone --nproc_per_node=8 train.py d3pm_text8_4gpu\n\n# to train on openwebtext, first prepare using script borrowed from nanoGPT\npython data/prepare_openwebtext.py\n# train Absorbing D3PM\ntorchrun --standalone --nproc_per_node=8 train.py d3pm_openwebtext_8gpu\n# see train.py for commands to launch on 32 GPUs across 4 nodes\n```\n\n### Training Notes\n\nNote that the training will attempt to compile the model by default, which takes extra time to begin training.\nAppend `--no-compile` to the training commands when launching to disable this for debugging etc.\n\nCurrently, sampling and evaluation scripts do not compile the model, so they get going immediately.\n\n### Configuration\n\nThis is not a full-on 
research library so the config system is rather simple to avoid using a tool that readers might not know.\nHowever, there is basic support for using different models and configs.\n\nFor example, `d3pm_text8()` is a function that defines a configuration in [configs.py](./configs.py) for training D3PMAbsorbing on text8.\nOne can add new configs by defining functions in this file, then specify a config's name on the command line when running `train.py`.\nAny training args defined by a config function override any global training args defined in `train.py`.\n\n\n### Evaluating loss\n```bash\n# evaluate on val set (default)\npython evaluate.py ./nanoDD-D3PM-text8/ckpt.pt\n# evaluate on test set\npython evaluate.py ./nanoDD-D3PM-text8/ckpt.pt --split test\n# check options for eval script\npython evaluate.py --help\n```\n\n## Results (text8)\n\n| Model               | Test Bits/Token (text8) |\n|---------------------|:-----------------------:|\n| D3PM Absorbing      |          1.37           |\n\nTraining the D3PM Absorbing model will produce loss values similar to those in the plot below, finally reaching a validation loss of 1.30, which results in a test loss of 1.37.\nNote that this is substantially better than the result reported in the original paper (1.45).\nThe training loss will start at around 5.0 (approximated and converted to bits-per-token) and you will observe noisy loss values throughout training due to noise in the diffusion process.\n\nYou can sample diffusion time steps more uniformly per batch (\"low-discrepancy sampler\"), which reduces the variance in the loss but, in my experience, does not make training faster or reach a lower mean loss.\n\n![img](_img/d3pm_absorbing_loss.png)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fflukeskywalker%2Fnanodd","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fflukeskywalker%2Fnanodd","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fflukeskywalker%2Fnanodd/lists"}