{"id":13604894,"url":"https://github.com/MachineLearningSystem/slapo","last_synced_at":"2025-04-12T02:32:09.089Z","repository":{"id":185461948,"uuid":"589453800","full_name":"MachineLearningSystem/slapo","owner":"MachineLearningSystem","description":"A schedule language for progressive optimization of large deep learning model training","archived":false,"fork":true,"pushed_at":"2023-01-16T05:20:26.000Z","size":591,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2024-08-02T19:36:37.214Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"awslabs/slapo","license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MachineLearningSystem.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2023-01-16T06:37:34.000Z","updated_at":"2023-01-16T06:35:00.000Z","dependencies_parsed_at":null,"dependency_job_id":"743a0b67-0b51-486e-8d5c-a0cbc81e5e10","html_url":"https://github.com/MachineLearningSystem/slapo","commit_stats":null,"previous_names":["machinelearningsystem/slapo"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2Fslapo","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2Fslapo/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2Fslapo/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2Fslapo/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/o
wners/MachineLearningSystem","download_url":"https://codeload.github.com/MachineLearningSystem/slapo/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223489691,"owners_count":17153805,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T19:00:52.458Z","updated_at":"2024-11-07T09:31:12.310Z","avatar_url":"https://github.com/MachineLearningSystem.png","language":null,"readme":"\u003c!--- Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. --\u003e\n\u003c!--- SPDX-License-Identifier: Apache-2.0  --\u003e\n\n# Slapo: A Schedule Language for Large Model Training\n\nSlapo is a schedule language for progressive optimization of large deep learning model training.\n\nLarge deep learning models deliver state-of-the-art accuracy on a range of NLP and CV tasks, but they are hard to train efficiently without sacrificing usability. Slapo aims to address this tension through separation of concerns: it decouples model execution from model definition, enabling developers to use a set of schedule primitives to apply common training optimizations to a PyTorch model without directly changing the model itself.\n\nSlapo highlights the following features:\n\n:rocket: **Progressive optimization**. Slapo incorporates a \"trace by need\" approach that traces only the desired modules into static graphs for aggressive compiler-based optimizations.\n\n:building_construction: **Structure-preserving scheduling**. 
Slapo preserves the module hierarchy when constructing the schedule, so developers can easily locate a module and apply scheduling; this also helps users debug performance and convergence issues.\n\n:gear: **Auto-tuning**. Slapo provides a programming interface that allows developers to specify a set of tunable knobs to form an efficient tuning space, which the Slapo auto-tuner can then explore to find the optimal configuration.\n\n\n## Getting Started\n\n### Installation\n\nWe currently only support installation from source; a pip wheel will be provided in the future. Please make sure you have installed [PyTorch](https://pytorch.org/) (\u003e= v1.13) in advance.\n\n```bash\ngit clone https://github.com/awslabs/slapo.git slapo\ncd slapo\npip3 install -e \".[dev]\"\n```\n\nYou can optionally install [HuggingFace Transformers](https://github.com/huggingface/transformers) (\u003e= v4.25.1) to retrieve models. We also support the following frameworks; you can run the scheduled models on them if needed.\n* [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) \u003e= 3.0.2\n* [DeepSpeed](https://github.com/microsoft/DeepSpeed) \u003e= 0.7.7\n\n\n### Usage\nPlease see the [examples](examples/) folder for more details. 
Documentation will be released soon.\n```python\nimport slapo\n\n# Load a PyTorch model from HuggingFace Hub, TorchVision, etc.\nfrom transformers import BertLMHeadModel, AutoConfig\nconfig = AutoConfig.from_pretrained(\"bert-large-uncased\")\nbert = BertLMHeadModel(config)\n\n# Create a default schedule\nsch = slapo.create_schedule(bert)\n\n# Apply primitives to optimize the model\n# Please refer to examples/bert/schedule.py for details\nsch[\"bert.encoder.layer.0\"].primitive(...)\n\n# Build an optimized model\nopt_model = slapo.build(sch)\n\n# Run the optimized model\ninputs = ...\noutputs = opt_model(inputs)\n```\n\n\n## Supported Primitives\nTo minimize the risk introduced by tracers and compilers, we leverage **progressive optimization** to gradually apply primitives to parts of the model. We classify the primitives into two categories. The first type does *not* require tracing and can be directly applied to modules and parameters; the second type requires a static graph, and thus requires applying the `.trace()` primitive first.\n\nWe provide the following primitives for dynamic graph optimizations:\n| Feature | Primitive |\n| :--: | :-- |\n| Module replacement | `s[op].replace(new_module)` |\n| Tensor parallelism | `s[op].shard(param_name, axis)` |\n| Synchronization | `s[op].sync(mode=\"forward/backward/both\")` |\n| Checkpointing | `s[op].checkpoint()` |\n| Forward/Backward hook | `s[op].hook(mode=\"fw_pre/fw_post/bw_post\", func=hook)` |\n\nAnd the following primitives for static graph optimizations:\n| Feature | Primitive |\n| :--: | :-- |\n| Module tracing | `s.trace(leaves, flatten)` |\n| Pattern matching | `s.find(mod_name_regex, func_pattern)` |\n| Operator fusion | `s[op].fuse(compiler, subgraph)` |\n| Partial module replacement | `s[op].replace(new_module, subgraph)` |\n| Partial gradient checkpointing | `s[op].checkpoint()` |\n| Pipeline parallelism | `s[op].pipeline_split()` |\n\n\n### Auto-Tuning\nWe also 
provide a lightweight interface for auto-tuning, so developers can (1) construct a polyhedral search space using our APIs, and (2) leverage the Slapo auto-tuner to automatically search for the best training configuration.\n\n```bash\ncd benchmark\n# Single device\n# The following script triggers the tuning jobs for all the models\npython3 tune_single_device.py\n# Single node\npython3 tune_single_node.py\n```\n\n\n## Benchmarking\nWe provide scripts to reproduce our results on a single AWS EC2 p3.16xlarge node with 8 V100 GPUs.\n\n```bash\ncd benchmark\n# Download datasets\nbash download_benchmark_dataset.sh\n# Run benchmarking\n# Megatron-LM and DeepSpeed are required for executing the experiments\nbash run_all_single_node.sh config/single_node_v100.cfg\n```\n\n\n## Publication\nIf you use Slapo in your project, please feel free to cite our arXiv [paper](https://arxiv.org/):\n- **Slapo: A Schedule Language for Progressive Optimization of Large Deep Learning Model Training**\n  Hongzheng Chen, Cody Hao Yu, Shuai Zheng, Zhen Zhang, Zhiru Zhang, and Yida Wang.\n\n\n## License\nSlapo is released under the [Apache 2.0 license](LICENSE).\n","funding_links":[],"categories":["Paper-Code"],"sub_categories":["Schedule and Resource Management"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMachineLearningSystem%2Fslapo","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FMachineLearningSystem%2Fslapo","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMachineLearningSystem%2Fslapo/lists"}