{"id":13604280,"url":"https://github.com/MachineLearningSystem/EasyParallelLibrary","last_synced_at":"2025-04-11T23:32:12.142Z","repository":{"id":185461737,"uuid":"543825016","full_name":"MachineLearningSystem/EasyParallelLibrary","owner":"MachineLearningSystem","description":"Easy Parallel Library (EPL) is a general and efficient deep learning framework for distributed model training.","archived":false,"fork":true,"pushed_at":"2022-09-16T05:28:01.000Z","size":594,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2024-11-07T08:42:42.508Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"alibaba/EasyParallelLibrary","license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MachineLearningSystem.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2022-09-30T23:27:42.000Z","updated_at":"2022-09-26T06:24:37.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/MachineLearningSystem/EasyParallelLibrary","commit_stats":null,"previous_names":["machinelearningsystem/easyparallellibrary"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2FEasyParallelLibrary","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2FEasyParallelLibrary/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2FEasyParallelLibrary/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories
/MachineLearningSystem%2FEasyParallelLibrary/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MachineLearningSystem","download_url":"https://codeload.github.com/MachineLearningSystem/EasyParallelLibrary/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248495097,"owners_count":21113570,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T19:00:42.790Z","updated_at":"2025-04-11T23:32:07.129Z","avatar_url":"https://github.com/MachineLearningSystem.png","language":null,"readme":"[![pypi](https://img.shields.io/pypi/v/pyepl.svg)](https://pypi.org/project/pyepl)\n[![docs](https://img.shields.io/badge/docs-latest-brightgreen.svg)](https://easyparallellibrary.readthedocs.io/en/latest/)\n[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://github.com/alibaba/EasyParallelLibrary/blob/main/LICENSE)\n\nEnglish | [简体中文](README_cn.md)\n\n# Easy Parallel Library\n\n## Overview\n\nEasy Parallel Library (EPL) is a general and efficient library for distributed model training.\n- Usability - Users can implement different parallelism strategies with a few lines of annotations, including data parallelism, pipeline parallelism, tensor model parallelism, and their hybrids. \n- Memory Efficient - EPL provides various memory-saving techniques, including gradient checkpoint, ZERO, CPU Offload, etc. 
Users are able to train larger models with fewer computing resources.\n- High Performance - EPL provides an optimized communication library to achieve high scalability and efficiency.\n\nFor more information, you may [read the docs](https://easyparallellibrary.readthedocs.io/en/latest/).\n\nEPL [Model Zoo](https://github.com/alibaba/FastNN) provides end-to-end parallel training examples.\n\n## Installation\n\nTo install EPL, please refer to the following [instructions](https://easyparallellibrary.readthedocs.io/en/latest/installation_instructions.html).\n\n## Examples\n\nHere are a few examples of different parallelism strategies by changing only annotations.\nPlease refer to [API documentation](https://easyparallellibrary.readthedocs.io/en/latest/api/index.html) for API details and [tutorials](https://easyparallellibrary.readthedocs.io/en/latest/tutorials/index.html) for more examples.\n\n### Data Parallelism\n\nThe following example shows a basic data parallelism annotation.\nThe data parallelism degree is determined by the allocated GPU number.\n\n```diff\n+ import epl\n+ epl.init()\n+ with epl.replicate(device_count=1):\n    model()\n```\n\n\n### Pipeline Parallelism\n\nThe following example shows pipeline parallelism with two pipeline stages, each stage is computed with one GPU.\nIf the total GPU number is 4, EPL will automatically apply two-degree data parallelism over the model pipeline.\n\n```diff\n+ import epl\n+ \n+ config = epl.Config({\"pipeline.num_micro_batch\": 4})\n+ epl.init(config)\n+ with epl.replicate(device_count=1, name=\"stage_0\"):\n    model_part1()\n+ with epl.replicate(device_count=1, name=\"stage_1\"):\n    model_part2()\n```\n\n### Tensor Model Parallelism\nThe following example shows a tensor model parallelism annotation.\nWe apply data parallelism to the `ResNet` part, and apply tensor model parallelism to `classification` part.\n\n```diff\n+ import epl\n+ config = epl.Config({\"cluster.colocate_split_and_replicate\": True})\n+ 
epl.init(config)\n+ with epl.replicate(8):\n    ResNet()\n+ with epl.split(8):\n    classification()\n```\n\n\n## Publication\n\nIf you use EPL in your publication, please cite it by using the following BibTeX entry.\n\n```BibTeX\n@inproceedings {jia2022whale,\n\tauthor = {Xianyan Jia and Le Jiang and Ang Wang and Wencong Xiao and Ziji Shi and Jie Zhang and Xinyuan Li and Langshi Chen and Yong Li and Zhen Zheng and Xiaoyong Liu and Wei Lin},\n\ttitle = {Whale: Efficient Giant Model Training over Heterogeneous {GPUs}},\n\tbooktitle = {2022 USENIX Annual Technical Conference (USENIX ATC 22)},\n\tyear = {2022},\n\tisbn = {978-1-939133-29-57},\n\taddress = {Carlsbad, CA},\n\tpages = {673--688},\n\turl = {https://www.usenix.org/conference/atc22/presentation/jia-xianyan},\n\tpublisher = {USENIX Association},\n\tmonth = jul,\n}\n```\n\n## Contact Us\n\nJoin the Official Discussion Group on DingTalk.\n\n![DingTalk Group](docs/images/ding-group.png)\n","funding_links":[],"categories":["Paper-Code"],"sub_categories":["Training"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMachineLearningSystem%2FEasyParallelLibrary","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FMachineLearningSystem%2FEasyParallelLibrary","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMachineLearningSystem%2FEasyParallelLibrary/lists"}
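### Appendix: What `pipeline.num_micro_batch` Means

The pipeline example above sets `pipeline.num_micro_batch` to control how each global batch is sliced into micro-batches so that the two stages can overlap their work. The sketch below is a plain-Python illustration of that slicing idea only; it is not EPL's implementation, and the `stage_0`/`stage_1` functions are hypothetical stand-ins for `model_part1()` and `model_part2()`.

```python
# Standalone illustration of micro-batching for pipeline parallelism.
# Not EPL code: stage_0/stage_1 are toy stand-ins for the two stages.

def split_micro_batches(batch, num_micro_batch):
    """Slice a global batch into num_micro_batch equal micro-batches."""
    assert len(batch) % num_micro_batch == 0, "batch must divide evenly"
    size = len(batch) // num_micro_batch
    return [batch[i * size:(i + 1) * size] for i in range(num_micro_batch)]

def stage_0(x):
    # Toy stand-in for model_part1(): add 1 to every element.
    return [v + 1 for v in x]

def stage_1(x):
    # Toy stand-in for model_part2(): double every element.
    return [v * 2 for v in x]

def pipeline_forward(batch, num_micro_batch):
    """Feed each micro-batch through both stages, then reassemble."""
    outputs = []
    for mb in split_micro_batches(batch, num_micro_batch):
        outputs.extend(stage_1(stage_0(mb)))
    return outputs

print(pipeline_forward([1, 2, 3, 4, 5, 6, 7, 8], num_micro_batch=4))
# → [4, 6, 8, 10, 12, 14, 16, 18]
```

With 4 micro-batches of size 2, stage_0 can start on micro-batch k+1 while stage_1 is still processing micro-batch k, which is how pipelining keeps both GPUs busy; a single big batch would leave one stage idle at a time.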