{"id":13604203,"url":"https://github.com/MachineLearningSystem/Chimera","last_synced_at":"2025-04-11T23:32:05.795Z","repository":{"id":185461694,"uuid":"492475014","full_name":"MachineLearningSystem/Chimera","owner":"MachineLearningSystem","description":"Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines. ","archived":false,"fork":true,"pushed_at":"2022-03-11T20:45:08.000Z","size":739,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2024-11-07T08:42:30.510Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"Shigangli/Chimera","license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MachineLearningSystem.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2022-05-15T12:04:23.000Z","updated_at":"2022-02-23T14:57:05.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/MachineLearningSystem/Chimera","commit_stats":null,"previous_names":["machinelearningsystem/chimera"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2FChimera","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2FChimera/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2FChimera/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2FChimera/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MachineLe
arningSystem","download_url":"https://codeload.github.com/MachineLearningSystem/Chimera/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248495063,"owners_count":21113560,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T19:00:41.554Z","updated_at":"2025-04-11T23:32:00.783Z","avatar_url":"https://github.com/MachineLearningSystem.png","language":null,"readme":"\n## Chimera: efficiently training large-scale neural networks with bidirectional pipelines\n\nChimera is a novel pipeline parallelism approach for efficiently training large-scale neural network models (e.g., BERT, GPT-2/3) on parallel machines (e.g., GPU clusters). The key idea of Chimera is to reduce the number of bubbles in the pipeline **without** introducing staleness into the training process.\nOur implementation (SC'21) is based on PyTorch and adapted from PipeDream. 
We use GLOO as the distributed backend.\n\n**A new (concise and fully fledged) version of Chimera will be added** in the [Chimera-BERT branch](https://github.com/Shigangli/Chimera/tree/Chimera-BERT).\n\n## Directory Structure\n\n`chimera/chimera_bert`\nBERT in Chimera.\n\n`chimera/chimera_gpt2` \nGPT-2 in Chimera.\n\n`chimera/chimera_pipes` \nChimera generalized to more than two pipelines.\n\n`chimera/performance_model`\nPerformance modelling for communications.\n\n## Run the Experiments\n\nTo install the required Python modules: \n\n`conda create --name py37 python=3.7`\n\n`source activate py37`\n\n`pip install -r requirements.txt`\n\nWe run experiments on GPU clusters with the SLURM job scheduler. For example, one can submit a job to the job queue by:\n\n`cd ./job_scripts`\n\n`sbatch daint_bert48_32nodes_chimera_4w8d.sh`\n\n\n## Publication\n\nChimera was published at SC'21 and was a **Best Paper Finalist**. See the [paper](https://dl.acm.org/doi/abs/10.1145/3458817.3476145) and the [video talk](https://dl.acm.org/doi/abs/10.1145/3458817.3476145#sec-supp) for more details. To cite our work:\n```bibtex\n@inproceedings{li143,\n  author = {Li, Shigang and Hoefler, Torsten},\n  title = {Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines},\n  year = {2021},\n  isbn = {9781450384421},\n  publisher = {Association for Computing Machinery},\n  address = {New York, NY, USA},\n  url = {https://doi.org/10.1145/3458817.3476145},\n  doi = {10.1145/3458817.3476145},\n  booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},\n  articleno = {27},\n  numpages = {14},\n  location = {St. 
Louis, Missouri},\n  series = {SC '21}\n}\n\n```\n\n## License\n\nSee [LICENSE](LICENSE).\n","funding_links":[],"categories":["Paper-Code"],"sub_categories":["Parallellism Training"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMachineLearningSystem%2FChimera","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FMachineLearningSystem%2FChimera","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMachineLearningSystem%2FChimera/lists"}