{"id":13605451,"url":"https://github.com/MachineLearningSystem/FTPipe-ATC21-Finetune","last_synced_at":"2025-04-12T05:33:13.117Z","repository":{"id":185461777,"uuid":"613254099","full_name":"MachineLearningSystem/FTPipe-ATC21-Finetune","owner":"MachineLearningSystem","description":"FTPipe and related pipeline model parallelism research.","archived":false,"fork":true,"pushed_at":"2022-05-25T09:29:07.000Z","size":11936,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2024-10-15T01:03:13.088Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"saareliad/FTPipe","license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MachineLearningSystem.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2023-03-13T08:02:43.000Z","updated_at":"2022-12-09T03:29:01.000Z","dependencies_parsed_at":"2023-08-02T05:30:23.446Z","dependency_job_id":null,"html_url":"https://github.com/MachineLearningSystem/FTPipe-ATC21-Finetune","commit_stats":null,"previous_names":["machinelearningsystem/ftpipe-atc21-finetune"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2FFTPipe-ATC21-Finetune","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2FFTPipe-ATC21-Finetune/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2FFTPipe-ATC21-Finetune/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2FFTPipe-ATC21-F
inetune/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MachineLearningSystem","download_url":"https://codeload.github.com/MachineLearningSystem/FTPipe-ATC21-Finetune/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223497852,"owners_count":17155212,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T19:00:58.852Z","updated_at":"2024-11-07T10:30:37.809Z","avatar_url":"https://github.com/MachineLearningSystem.png","language":null,"readme":"\n# FTPipe\n\nThis repository contains code used for the FTPipe USENIX ATC21 [paper](https://www.usenix.org/system/files/atc21-eliad.pdf) \"Fine-tuning giant neural networks on commodity hardware with automatic pipeline model parallelism\", and future work.\n\nSee [citation](#citation) information at the bottom of this readme.\n\n\n## Overview\nThis repository was used to explore various unexplored territories of pipeline model parallelism.\nIt is capable of automatically partitioning, training, and fine-tuning giant neural networks, with both synchronous and asynchronous pipelines. \\\nCode for the Pipeline Staleness Mitigation study is included as well.\n\n\nModels supported and tested are Huggingface transformers (T5, GPT2, BERT, RoBERTa...), many Torchvision models (probably all), and Vision Transformers. 
(We conducted an out-of-the-box ViT PoC with the first pytorch implementation, by timm, right when it appeared.)\\\nThe setup for T5-11B is currently kept on a separate branch.\n\n## Basic Usage\n\nClone the repository:\n```\ngit clone https://github.com/saareliad/FTPipe.git\n```\nAll code is currently designed to run from the repository root.\n\nAfter completing the [environment setup](#setup), FTPipe's usage consists mainly of the following two steps:\n1. Partitioning models.\n2. Running models.\n\n```bash\npython -m autopipe.partition ... # partition models\n```\n\n```bash\npython -m pipe.main ... # train models (+eval)\n```\n\nAdditional documentation:\n* Training arguments should be passed via json [configuration files](https://github.com/saareliad/FTPipe/blob/master/pipe/configs) (*)\n* New models, training/fine-tuning tasks, and datasets should be [registered](docs/NewModels.md) to the framework.\n* Additional arguments are passed as cmd args. Do use the `--help` option to explore. (NOTE: It is also possible to override some configuration arguments using the command line; use with caution. Partitioning uses mostly cmd args.)\n* As P2P communication is done with MPI, running models often looks like this:\n\n```bash\nmpirun -np 8 python -m pipe.main --config $PATH_TO_JSON_CONFIG\n```\n* Refer to [examples](https://github.com/saareliad/FTPipe/tree/master/t5_used_scripts_example) of recent scripts we used to partition and conduct T5 experiments.\n* Feel free to contact us (issue/mail/linkedin/...).\n\n(*Note: a more comprehensive explanation is planned; meanwhile, the configuration can be understood via examples or code.)\n\n## Setup\n\n* Follow the [instructions](pipe/env_utils/create_env_new_server_new.sh) to set up the required conda env. 
This includes building pytorch from source with cuda-aware openmpi.\n* NOTE: Model partitioning can be done using a [much simpler conda env](https://github.com/saareliad/FTPipe/blob/master/pipe/env_utils/env_without_mpi.yml) (without mpi or building from source):\n```\nconda env create -f pipe/env_utils/env_without_mpi.yml\n```\n\n\nThe simple recipe below was used to set it up on our servers:\n```bash\nBUILD_DIR=\u003cSOMEPLACE_FOR_DOWNLOADED_SOFTWARE\u003e # openmpi, pytorch\ncd pipe/env_utils\ncp create_env_new_server.sh $BUILD_DIR\ncd $BUILD_DIR\nvim create_env_new_server.sh  # change paths: home_local, FTPIPE_ROOT\nbash create_env_new_server.sh # it is safer to run it step by step.\n```\nwhere `$BUILD_DIR` is set to a directory to hold the clones of openmpi and pytorch.\n\n\n### Additional docs\nWork is in progress to gather all docs in their own [docs directory](docs/).\n\nSome additional usage instructions are documented across the repository.\nFor example:\n - The [pipe](pipe/) module contains instructions and scripts for downloading data and running.\n - Refer to the [pipes-list](docs/PipeList.md) for the available staleness mitigation methods and pipelines which can be used at runtime.\n - See the [autopipe](autopipe/) module for available partitioning methods. See the [tasks](autopipe/tasks) directory for examples of partitioning tasks (e.g., different model architectures or downstream fine-tuning tasks).\n - A detailed example of the steps/changes taken to export a T5 model from huggingface can be found [here](models/new_t5_example).\n\n## Note\n_Note: some hyper-parameters in mpipe partitioning (e.g., GPU memory capacity), env and so on are still hardcoded to our setup and not available as cmd options. 
Currently, one will need to change them manually to experiment (as we did...)_\n\n## Citation\n```\n@inproceedings{ftpipe,\nauthor = {Saar Eliad and Ido Hakimi and Alon De Jagger and Mark Silberstein and Assaf Schuster},\ntitle = {Fine-tuning giant neural networks on commodity hardware with automatic pipeline model parallelism},\nbooktitle = {2021 {USENIX} Annual Technical Conference ({USENIX} {ATC} 21)},\nyear = {2021},\nisbn = {978-1-939133-23-6},\npages = {381--396},\nurl = {https://www.usenix.org/conference/atc21/presentation/eliad},\npublisher = {{USENIX} Association},\nmonth = jul,\n}\n```\n","funding_links":[],"categories":["Paper-Code"],"sub_categories":["Fine-Tune"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMachineLearningSystem%2FFTPipe-ATC21-Finetune","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FMachineLearningSystem%2FFTPipe-ATC21-Finetune","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMachineLearningSystem%2FFTPipe-ATC21-Finetune/lists"}