{"id":13604220,"url":"https://github.com/MachineLearningSystem/terapipe","last_synced_at":"2025-04-11T23:32:01.071Z","repository":{"id":185461974,"uuid":"428586641","full_name":"MachineLearningSystem/terapipe","owner":"MachineLearningSystem","description":null,"archived":false,"fork":true,"pushed_at":"2021-05-04T05:50:00.000Z","size":399,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2024-11-07T08:42:26.708Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"zhuohan123/terapipe","license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MachineLearningSystem.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2021-11-16T09:08:06.000Z","updated_at":"2021-11-16T09:08:07.000Z","dependencies_parsed_at":"2023-08-02T03:17:01.827Z","dependency_job_id":null,"html_url":"https://github.com/MachineLearningSystem/terapipe","commit_stats":null,"previous_names":["machinelearningsystem/terapipe"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2Fterapipe","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2Fterapipe/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2Fterapipe/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2Fterapipe/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MachineLearningSystem","download_url":"https://codeload.github.com/Ma
chineLearningSystem/terapipe/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248495061,"owners_count":21113559,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T19:00:41.823Z","updated_at":"2025-04-11T23:32:00.581Z","avatar_url":"https://github.com/MachineLearningSystem.png","language":null,"readme":"# Large-Scale Language Modeling with Pipeline Parallelism\n\nIn this project, we propose to use pipeline parallelism for large-scale language modeling. Our contributions include:\n\n1. We discover a new dimension (sequence length dimension) for pipeline parallelism on Transformer-based language models. This removes the obstacles to applying previous pipeline parallelism methods on large-scale language models.\n2. We show that the optimal size for input shards for pipeline parallelism is only dependent on the compute bound of a single device, while independent with other factors such as the granularity of the pipeline.\n3. We systematically analyze the trade-off space between pipeline parallelism and model parallelism based on parallel matrix multiplication. We provide clear guidelines on how to choose between the two algorithms and how to combine them given the heterogeneity of interconnection speeds between different devices.\n4. 
With all proposed algorithms, we greatly accelerate the largest GPT-3 model, without modifying any of the original synchronous training procedure.\n\n## Cluster Setup and Installation\n\nSee [cluster/README.md](cluster/README.md) to set up a cluster for developing and testing. After that, clone the repo to the NFS directory `~/efs` shared by all nodes:\n```bash\ncd ~/efs\ngit clone https://github.com/zhuohan123/model-parallel-speed-test.git\n```\n\n## Model configurations\n\nSee `MODEL_CONFIGS` dictionary in [transformer_models.py](transformer_models.py) for the list of the models we are testing on.\n\n\n## Run all experiments\n\nPipelining on sequence length dimension on all GPUs in the node:\n```bash\n# number of nodes, number of gpus per node, model parallel size, \n# pipeline parallel size, model name, number of slices, number of steps\nN_NODES=1 # Number of nodes in the cluster\nN_GPUS=1 # Number of GPUs per node\nMODEL_PARALLEL_SIZE=1 # Number of devices in a single model parallel (parallel matmul) groups\nPIPELINE_PARALLEL_SIZE=1 # Number of stages for pipelining. 
\n# Note that $N_NODES * $N_GPUS == $MODEL_PARALLEL_SIZE * $PIPELINE_PARALLEL_SIZE\nMODEL=test # Name of the model to test (see MODEL_CONFIGS)\nN_SLICES=8 # Number of input shards (currently we uniformly slice the input)\nN_STEPS=10 # Number of testing steps to run\nEXTRA_ARGS=\"--mixed-precision\"\n./mpirun_terapipe.sh $N_NODES $N_GPUS $MODEL_PARALLEL_SIZE $PIPELINE_PARALLEL_SIZE $MODEL $N_SLICES $N_STEPS $EXTRA_ARGS\n```\n\n## Latency Model\n\n### Data collection\n\nEdit `auto_latency_benchmark.sh` and add your model for computation latency evaluation.\nRun `./auto_latency_benchmark.sh` over 1 p3.16xlarge machine.\nOutputs in `performance_model_data`.\n\nEdit `p2p_comm_latency.py.py` and add your model for communication latency evaluation.\nRun `./p2p_comm_latency.sh` over 2 p3.16xlarge machines.\nOutputs in `performance_model_data`.\n\n### Fit latency model and generate optimal slices with DP.\n\nEdit and run `latency_model.py` to generate the optimal slices with DP. Results are saved in `dp_results.json`.\n\n### Evaluate the optimal slices.\n\nEdit and run `auto_mpirun_dp_slices_evaluation.sh`. Results under `dp_evaluation_results`.\n\n## Useful scripts:\n\nGet the IPs of all the worker nodes in the cluster:\n\n```bash\npython scripts/get_worker_ips.py\n```\n\nLoad `$MY_IPADDR`, `$OTHERS_IPADDR`, `$ALL_IPADDR` as environment variables:\n\n```bash\nsource scripts/load_cluster_env.sh\n```\n\nRun the same command on all nodes (useful for killing processes and check states):\n\n```bash\nscripts/fornode pkill python\nscripts/fornode nvidia-smi\n```\n","funding_links":[],"categories":["Paper-Code"],"sub_categories":["Parallellism Training"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMachineLearningSystem%2Fterapipe","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FMachineLearningSystem%2Fterapipe","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMachineLearningSystem%2Fterapipe/lists"}
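The run script requires that `$N_NODES * $N_GPUS == $MODEL_PARALLEL_SIZE * $PIPELINE_PARALLEL_SIZE`, i.e. the world size factors exactly into the model-parallel and pipeline-parallel group sizes. A minimal sanity check of this constraint might look like the following; `check_layout` is an illustrative helper, not part of the repo:

```python
def check_layout(n_nodes, n_gpus, mp_size, pp_size):
    """Verify that the device layout factors into MP x PP groups.

    Returns the world size (total number of devices) on success,
    raises ValueError otherwise. Illustrative only.
    """
    world = n_nodes * n_gpus
    if world != mp_size * pp_size:
        raise ValueError(
            f"world size {world} != model_parallel {mp_size} "
            f"* pipeline_parallel {pp_size}"
        )
    return world
```

Running such a check before `mpirun` fails fast instead of hanging at process-group initialization.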
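`N_SLICES` controls how many input shards the sequence is cut into; the README notes the input is currently sliced uniformly along the sequence length dimension. A sketch of such uniform slicing (the helper name and return format are hypothetical, not the repo's actual API):

```python
def uniform_slices(seq_len, n_slices):
    """Split seq_len into n_slices near-equal contiguous shards.

    Returns a list of (start, length) pairs covering [0, seq_len).
    Illustrative sketch of uniform sequence-dimension slicing.
    """
    base, rem = divmod(seq_len, n_slices)
    slices, start = [], 0
    for i in range(n_slices):
        # the first `rem` shards absorb the remainder, one token each
        length = base + (1 if i < rem else 0)
        slices.append((start, length))
        start += length
    return slices
```

Each shard is then fed through a pipeline stage as soon as the previous shard's activations for that stage are done, which is what lets pipelining work within a single long sequence.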
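The data-collection step produces per-slice computation and communication latencies that the latency model is then fit to. As a toy stand-in for whatever `latency_model.py` actually fits (its real model is not specified here), a linear model `t ≈ a + b * L` over slice length can be fit with ordinary least squares using only the standard library:

```python
def fit_linear_latency(lengths, times):
    """Ordinary least squares fit of t ~ a + b * L.

    A toy stand-in for the repo's latency model; returns (a, b).
    """
    n = len(lengths)
    mean_x = sum(lengths) / n
    mean_y = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in lengths)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(lengths, times))
    b = sxy / sxx           # slope: marginal cost per token
    a = mean_y - b * mean_x  # intercept: fixed per-slice overhead
    return a, b
```

The intercept is the important part for slicing decisions: a large fixed per-slice overhead pushes the optimum toward fewer, larger slices.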
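`latency_model.py` chooses the optimal slice lengths with dynamic programming. A simplified sketch of that kind of prefix DP is below: it partitions a sequence into contiguous slices minimizing total predicted cost plus a fixed per-slice overhead. This objective is deliberately simplified and is not claimed to be the paper's exact formulation:

```python
def optimal_slices(seq_len, cost, overhead=0.0):
    """DP over prefixes: partition [0, seq_len) into contiguous slices.

    Minimizes sum(cost(length)) + overhead per slice (a simplified
    illustrative objective). Returns (best_total_cost, slice_lengths).
    """
    INF = float("inf")
    best = [0.0] + [INF] * seq_len   # best[e] = min cost of prefix [0, e)
    cut = [0] * (seq_len + 1)        # cut[e] = start of the last slice
    for end in range(1, seq_len + 1):
        for start in range(end):
            c = best[start] + cost(end - start) + overhead
            if c < best[end]:
                best[end] = c
                cut[end] = start
    # walk the cut points backwards to recover slice lengths
    lengths, pos = [], seq_len
    while pos > 0:
        lengths.append(pos - cut[pos])
        pos = cut[pos]
    return best[seq_len], lengths[::-1]
```

With a convex per-slice cost the DP favors many small slices until the per-slice overhead dominates, which mirrors the trade-off the README's `N_SLICES` parameter exposes. The full `O(L^2)` scan is fine at the granularity the slices are chosen at.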