{"id":17575882,"url":"https://github.com/LambdaLabsML/distributed-training-guide","last_synced_at":"2025-03-08T04:31:54.208Z","repository":{"id":252049118,"uuid":"836331773","full_name":"LambdaLabsML/distributed-training-guide","owner":"LambdaLabsML","description":"Best practices \u0026 guides on how to write distributed pytorch training code","archived":false,"fork":false,"pushed_at":"2025-02-24T05:42:35.000Z","size":439,"stargazers_count":358,"open_issues_count":1,"forks_count":27,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-03-03T12:44:13.151Z","etag":null,"topics":["cluster","cuda","deepspeed","distributed-training","fsdp","gpu","gpu-cluster","kuberentes","lambdalabs","mpi","nccl","pytorch","sharding","slurm"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/LambdaLabsML.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-31T16:11:23.000Z","updated_at":"2025-03-03T09:01:30.000Z","dependencies_parsed_at":"2024-09-06T04:25:16.504Z","dependency_job_id":"a6e4a59e-263a-46c6-94f4-28f534cae860","html_url":"https://github.com/LambdaLabsML/distributed-training-guide","commit_stats":{"total_commits":227,"total_committers":4,"mean_commits":56.75,"dds":"0.17180616740088106","last_synced_commit":"e0c1ddc8a11380bbd4d2e55e927a51181c152f16"},"previous_names":["lambdalabsml/distributed-training-tutorials","lambdalabsml/distributed-training-guide"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts
/GitHub/repositories/LambdaLabsML%2Fdistributed-training-guide","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LambdaLabsML%2Fdistributed-training-guide/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LambdaLabsML%2Fdistributed-training-guide/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LambdaLabsML%2Fdistributed-training-guide/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/LambdaLabsML","download_url":"https://codeload.github.com/LambdaLabsML/distributed-training-guide/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":242501028,"owners_count":20139319,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cluster","cuda","deepspeed","distributed-training","fsdp","gpu","gpu-cluster","kuberentes","lambdalabs","mpi","nccl","pytorch","sharding","slurm"],"created_at":"2024-10-21T23:01:27.240Z","updated_at":"2025-03-08T04:31:54.202Z","avatar_url":"https://github.com/LambdaLabsML.png","language":"Python","funding_links":[],"categories":["技巧 Tips"],"sub_categories":[],"readme":"# Distributed Training Guide\n\n\u003cimg src=\"https://lambdalabs.com/hubfs/distriubuted-training-guide.png\" width=\"400px\" /\u003e\n\n[Neurips 2024 presentation slides here](https://docs.google.com/presentation/d/1ANMmkOGaruYKTvhnsAbZgI9GrdMliNvibWGuNYw6HX8/edit?usp=sharing)\n\nEver wondered how to train a large neural network across a giant cluster? 
Look no further!\n\nThis is a comprehensive guide on best practices for distributed training, diagnosing errors, and fully utilizing all resources available. It is organized into sequential chapters, each containing a `README.md` and a `train_llm.py` script. Each README discusses both the high-level concepts of distributed training and the code changes introduced in that chapter.\n\nThe guide is written entirely in minimal, standard PyTorch, using `transformers` and `datasets` for models and data, respectively. No other library is used for distributed code; the distributed logic is written entirely in PyTorch.\n\n1. [Chapter 1](./01-single-gpu/) - A standard causal LLM training script that runs on a **single GPU**.\n2. [Chapter 2](./02-distributed-data-parallel/) - Upgrades the training script to support **multiple GPUs and to use DDP**.\n3. [Chapter 3](./03-job-launchers/) - Covers how to **launch training jobs** across clusters with multiple nodes.\n4. [Chapter 4](./04-fully-sharded-data-parallel/) - Upgrades the training script to **use FSDP** instead of DDP for more efficient memory usage.\n5. [Chapter 5](./05-training-llama-405b/) - Upgrades the training script to **train Llama-405b**.\n6. [Chapter 6](./06-tensor-parallel/) - Upgrades our single-GPU training script to support **tensor parallelism**.\n7. [Chapter 7](./06-2d-parallel/) - Upgrades our TP training script to use **2D parallelism (FSDP + TP)**.\n8. [Alternative Frameworks](./alternative-frameworks/) - Covers different frameworks that all work with PyTorch under the hood.\n9. [Diagnosing Errors](./diagnosing-errors/) - Best practices and how-tos for **quickly diagnosing errors** in your cluster.\n10. 
[Related Topics](./related-topics/) - Topics that you should be aware of when doing distributed training.\n\n\nQuestions this guide answers:\n\n- How do I update a single-GPU training/fine-tuning script to run on multiple GPUs or multiple nodes?\n- How do I diagnose hangs and errors that happen during training?\n- My model/optimizer is too big for a single GPU. How do I train or fine-tune it on my cluster?\n- How do I schedule/launch training on a cluster?\n- How do I scale my hyperparameters when increasing the number of workers?\n\nBest practices for logging stdout/stderr and for using wandb are also included, as logging is vitally important in diagnosing/debugging training runs on a cluster.\n\nEach of the training scripts is aimed at training a causal language model (e.g. GPT or Llama).\n\n## Set up\n\n### Clone this repo\n\n```bash\ngit clone https://github.com/LambdaLabsML/distributed-training-guide.git\n```\n\n### Virtual Environment\n\n```bash\ncd distributed-training-guide\npython3 -m venv venv\nsource venv/bin/activate\npython -m pip install -U pip\npip install -U setuptools wheel\npip install -r requirements.txt\npip install flash-attn --no-build-isolation\n```\n\n### wandb\n\nThis tutorial uses `wandb` as an experiment tracker.\n\n```bash\nwandb login\n```\n\n\u003cp align=\"center\"\u003e\n🦄 Other exciting ML projects at Lambda: \u003ca href=\"https://news.lambdalabs.com/news/today\"\u003eML Times\u003c/a\u003e, \u003ca href=\"https://lambdalabsml.github.io/Open-Sora/introduction/\"\u003eText2Video\u003c/a\u003e, \u003ca href=\"https://lambdalabs.com/gpu-benchmarks\"\u003eGPU Benchmark\u003c/a\u003e.\n\u003c/p\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FLambdaLabsML%2Fdistributed-training-guide","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FLambdaLabsML%2Fdistributed-training-guide","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FLambdaLabsML%2Fdistributed-training-guide/lists"}