{"id":19529106,"url":"https://github.com/fangjin98/distributed-training-ina","last_synced_at":"2025-04-26T11:34:14.008Z","repository":{"id":128775723,"uuid":"427223004","full_name":"Fangjin98/distributed-training-INA","owner":"Fangjin98","description":"A PS ML training architecture with p4 programmable switches.","archived":false,"fork":false,"pushed_at":"2023-11-09T08:36:07.000Z","size":10900,"stargazers_count":9,"open_issues_count":0,"forks_count":4,"subscribers_count":1,"default_branch":"master","last_synced_at":"2023-11-10T12:41:01.539Z","etag":null,"topics":["distributed-machine-learning","in-network-compute","pytorch"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Fangjin98.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2021-11-12T03:42:17.000Z","updated_at":"2023-11-10T12:41:01.540Z","dependencies_parsed_at":null,"dependency_job_id":"f5384861-eee9-47b9-8399-6753eca08899","html_url":"https://github.com/Fangjin98/distributed-training-INA","commit_stats":null,"previous_names":[],"tags_count":0,"template":null,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Fangjin98%2Fdistributed-training-INA","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Fangjin98%2Fdistributed-training-INA/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Fangjin98%2Fdistributed-training-INA/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Fangjin98%2Fdistributed-training-INA/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/
hosts/GitHub/owners/Fangjin98","download_url":"https://codeload.github.com/Fangjin98/distributed-training-INA/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224033284,"owners_count":17244567,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["distributed-machine-learning","in-network-compute","pytorch"],"created_at":"2024-11-11T01:22:00.678Z","updated_at":"2024-11-11T01:22:02.335Z","avatar_url":"https://github.com/Fangjin98.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Distributed ML Training with In-Network Aggregation\n\nA distributed parameter-server (PS) training architecture accelerated by P4 programmable switches.\n\n## Dependencies\n\nSystem packages needed by PyTorch:\n\n```bash\nsudo apt install libjpeg-dev zlib1g-dev libssl-dev libffi-dev python-dev build-essential libxml2-dev libxslt1-dev\n```\n\nPython dependencies:\n\n```bash\npip3 install pulp numpy tensorboard\n```\n\nCPU-only PyTorch:\n\n```bash\npip3 install torch==1.10.0+cpu torchvision==0.11.1+cpu torchaudio==0.10.0+cpu -f https://download.pytorch.org/whl/cpu/torch_stable.html\n```\n\n## Usage\n\nThe config files are excluded from the repository for security reasons. 
You need to create `config/workers.json` for distributed training.\n\n```json\n[\n    {\n        \"host_ip\": \"IP address of worker 1\",\n        \"ssh_port\": \"port for ssh\",\n        \"ssh_usr\" : \"user account to ssh\",\n        \"ssh_psw\" : \"password\",\n        \"work_dir\": \"path of files\"\n    },\n    {\n        \"host_ip\": \"IP address of worker 2\",\n        \"ssh_port\": \"port for ssh\",\n        \"ssh_usr\" : \"user account to ssh\",\n        \"ssh_psw\" : \"password\",\n        \"work_dir\": \"path of files\"\n    }\n]\n```\n\nRun `./deploy.sh` to sync the code to all machines. Make sure you have created the `\u003crepo\u003e` directory first.\n\n```bash\n# deploy.sh\n\nscp -r current_path ssh_usr@machine_ip:dest_path\n```\n\nRun `./test.sh $WORKER_NUM` to start training. The script runs `src/launch.py` with the `--master` flag to launch the PS, which then launches the workers over SSH according to the host list in `config/workers.json`.\n\n```bash\n# test.sh\n\nWORKER_NUM=$1\n\nsudo python3 src/launch.py --master 1 --ip machine_ip --worker_num $WORKER_NUM --config_file config/workers.json --dataset CIFAR100 --model resnet50\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffangjin98%2Fdistributed-training-ina","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffangjin98%2Fdistributed-training-ina","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffangjin98%2Fdistributed-training-ina/lists"}