{"id":13605228,"url":"https://github.com/MachineLearningSystem/substation","last_synced_at":"2025-04-12T05:32:12.220Z","repository":{"id":185461957,"uuid":"492476978","full_name":"MachineLearningSystem/substation","owner":"MachineLearningSystem","description":"Research and development for optimizing transformers","archived":false,"fork":true,"pushed_at":"2021-02-16T22:27:11.000Z","size":382,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2024-11-07T10:40:53.880Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"spcl/substation","license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MachineLearningSystem.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2022-05-15T12:13:00.000Z","updated_at":"2022-05-15T12:12:53.000Z","dependencies_parsed_at":null,"dependency_job_id":"a3371e46-bf08-4106-9425-f56e26a37db2","html_url":"https://github.com/MachineLearningSystem/substation","commit_stats":null,"previous_names":["machinelearningsystem/substation"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2Fsubstation","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2Fsubstation/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2Fsubstation/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2Fsubstation/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owne
rs/MachineLearningSystem","download_url":"https://codeload.github.com/MachineLearningSystem/substation/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248524010,"owners_count":21118606,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T19:00:56.118Z","updated_at":"2025-04-12T05:32:11.767Z","avatar_url":"https://github.com/MachineLearningSystem.png","language":null,"readme":"# Substation: Optimized Transformers :zap:\n\nSubstation is a project to optimize transformers using data movement analysis.\n\nThis code is presently at a research-and-development stage. We are actively working to make it both faster and more usable.\n\nFor more background, please see our paper, [_Data Movement Is All You Need: A Case Study on Optimizing Transformers_](https://arxiv.org/abs/2007.00072). If you use our code, please cite the paper:\n```\n@article{ivanov2020data,\n  title={Data Movement Is All You Need: A Case Study on Optimizing Transformers},\n  author={Ivanov, Andrei and Dryden, Nikoli and Ben-Nun, Tal and Li, Shigang and Hoefler, Torsten},\n  journal={arXiv preprint arXiv:2007.00072},\n  year={2020}\n}\n```\n\n## Current Performance\n\nWe presently include configurations for two versions of a single BERT-large encoder layer:\n1. Batch size 8 and max sequence length 512.\n2. Batch size 96 and max sequence length 128.\n\nThese benchmarks were run on the [Lassen supercomputer](https://hpc.llnl.gov/hardware/platforms/lassen). 
Note that the Nvidia V100s this system uses are the SXM2 variety, with a peak of 125 tflop/s using Tensor Cores. We compare with the same transformer architecture implemented in TensorFlow (with XLA), PyTorch, and DeepSpeed. These results are with the latest version of our code, but see our paper for other details.\n\nAll times are in milliseconds (ms).\n\n#### BERT-large, batch size 8, max sequence length 512 runtime\n| PyTorch | TensorFlow+XLA | DeepSpeed | Substation\n|---------|----------------|-----------|-----------\n| 9.14    | 8.4            | 7.6       | 6.71\n\n#### BERT-large, batch size 96, max sequence length 128 runtime\n| PyTorch | TensorFlow+XLA | DeepSpeed | Substation\n|---------|----------------|-----------|-----------\n| 18.43   | n/a            | 16.19     | 15.42\n\n## Usage\n\n_Note: We are actively working to improve the usability for standard deep learning workflows._\n\nOur encoder implementation is available as a PyTorch module in `pytorch_module/encoder.py`. Whenever you create a Substation encoder, you must specify an associated set of layouts and other configurations (see below for generating one yourself). We have provided the configurations used for the two BERT-large versions above as `layouts-bert-b8-l512-h16-e1024.pickle` and `layouts-bert-b96-l128-h16-e1024.pickle`, respectively. These configurations are optimized for the specific configuration and hardware, but should run for other problem sizes and on other hardware. The underlying optimized implementation for the encoder will be generated and compiled the first time you use it.\n\nFor performance benchmarking, we provide the `run_encoder.py` script. See its `--help` information for details.\n\n### Generating New Configurations\n\nIf you want to get the best performance for your particular problem configuration and/or hardware, you will need to generate a configuration. 
This involves two phases: benchmarking to gather performance data, then configuration selection.\n\n#### Benchmarking\n\n_Warning: This can take a long time._\n\nThis exhaustively benchmarks the possible layouts (and other options) for every operator used in the encoder layer. There are two sets of benchmarks, one for tensor contractions (which uses cuBLAS) and one for our custom fused kernel implementations.\n\n##### Tensor Contractions\n\nThese are located in `tc_profiling`.\n1. Run `compile.sh` to build cuBLAS benchmarks.\n2. Run `einsum_perms.py` (e.g., `einsum_perms.py --b 8 --j 512 --h 16 --i 1024`) to generate the benchmark configurations for each operator.\n3. These configurations can be run with `runprof.py \u003cconfig file\u003e`.\n\n##### Fused Kernels\n\nThese are run with the `pytorch_module/benchmark.py` script. You specify the kernel to benchmark with `--kernel name`. By default, this uses the batch size 8, sequence length 512 configuration of BERT-large. You can change the size using the `--size` argument. For example:\n```\npython benchmark.py --kernel softmax --size \"H=16,B=96,J=128,K=128,U=4096,N=1024,P=64\"\n```\nSee its `--help` for more arguments.\n\nYou will need to run every tensor contraction and kernel benchmark.\n\n#### Configuration Selection\n\nThese scripts are located in the `config_selection` directory. First, collect the benchmark data into a directory. You can just copy the kernel benchmark output. Use the `parse_tc_results.py` script to assemble the tensor contraction results and then copy them into the same directory.\n\nFinal configuration selection can then be run with `python optimize.py --output_config my_layouts.pickle results-dir`.\n\n##### Advanced\n\nThe `optimize.py` script can use several strategies for performing configuration selection, controlled with the `--graph_order` argument. The default, `bp_first`, will optimize the encoder layer's backpropagation pass first, and then its forward pass. 
`fp_first` will optimize forward propagation first, then backpropagation. `bp_first` typically results in configurations that are faster than `fp_first`. The third option, `combined`, will optimize over forward and backpropagation simultaneously, and typically results in the fastest configurations. However, this approach is somewhat finicky, and can often fail to find a valid layout. This can be worked around by telling the optimizer to \"split\" at certain variables using the `--split_vars` argument.\n\nThe `layouts-bert-b8-l512-h16-e1024.pickle` configuration was generated using `optimize.py --graph_order combined --split_vars X LN1 LN2 LIN2 DLIN2`. The `layouts-bert-b96-l128-h16-e1024.pickle` configuration was generated using `optimize.py --graph_order combined --split_vars X DROP2 LN1`.\n\n## Contributors\n\nThis project is led by the [Scalable Parallel Computing Lab](https://spcl.inf.ethz.ch/) at ETH Zurich.\n\nSee also the [list of contributors](https://github.com/spcl/substation/graphs/contributors).\n","funding_links":[],"categories":["Paper-Code"],"sub_categories":["Optimization"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMachineLearningSystem%2Fsubstation","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FMachineLearningSystem%2Fsubstation","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMachineLearningSystem%2Fsubstation/lists"}