{"id":17661823,"url":"https://github.com/18520339/ml-distributed-training","last_synced_at":"2026-02-27T12:17:44.926Z","repository":{"id":117071527,"uuid":"452232572","full_name":"18520339/ml-distributed-training","owner":"18520339","description":"Reduce the training time of CNNs by leveraging the power of multiple GPUs in 2 approaches, Multi-worker \u0026 Parameter Server Training, using TensorFlow 2","archived":false,"fork":false,"pushed_at":"2022-09-26T15:36:46.000Z","size":8436,"stargazers_count":13,"open_issues_count":2,"forks_count":3,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-12T18:57:06.790Z","etag":null,"topics":["distributed","distributed-tensorflow","distributed-training","multi-gpu","multi-workers","parameter-server","tensorflow"],"latest_commit_sha":null,"homepage":"https://elib.vku.udn.vn/handle/123456789/2300","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/18520339.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-01-26T10:30:00.000Z","updated_at":"2025-04-12T15:03:30.000Z","dependencies_parsed_at":null,"dependency_job_id":"dbe13aac-3b56-406f-b6d2-51806bc08a26","html_url":"https://github.com/18520339/ml-distributed-training","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/18520339%2Fml-distributed-training","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/18520339%2Fml-distributed-training/tags","releases_url":"http
s://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/18520339%2Fml-distributed-training/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/18520339%2Fml-distributed-training/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/18520339","download_url":"https://codeload.github.com/18520339/ml-distributed-training/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248993389,"owners_count":21195192,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["distributed","distributed-tensorflow","distributed-training","multi-gpu","multi-workers","parameter-server","tensorflow"],"created_at":"2024-10-23T17:42:42.465Z","updated_at":"2026-02-27T12:17:44.861Z","avatar_url":"https://github.com/18520339.png","language":"Jupyter Notebook","readme":"# ML Distributed training\n\u003e Demo: https://youtu.be/OOPVA-eqBTY\n\n## Introduction\nThis project leverages the power of multiple GPUs to reduce the training time of complex models through data parallelism, using 2 approaches:\n1. Multi-worker Training using 2 PCs with GeForce RTX GPUs as Workers via:\n    - Local area network (LAN).\n    - VPN tunnel using [OpenVPN](https://openvpn.net) (not included in the demo).\n2. 
Parameter Server Training using 5 machines in a LAN:\n    - 2 Laptops as Parameter Servers connected via 5GHz Wi-Fi.\n    - 2 PCs with GeForce RTX GPUs as Workers.\n    - 1 CPU-only PC as the Coordinator.\n\n## Dataset\n\nWe used our self-built [30VNFoods](https://github.com/18520339/30VNFoods) dataset, which includes collected and labeled images of 30 famous Vietnamese dishes. This dataset is divided into:\n- 17,581 images for training.\n- 2,515 images for validation.\n- 5,040 images for testing.\n\nIn addition, we also used the small [TensorFlow flowers](https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz) dataset with about 3,700 images of flowers, which includes 5 folders corresponding to 5 types of flowers (`daisy`, `dandelion`, `roses`, `sunflowers`, `tulips`).\n\n## Setup\n| Setting           | Value      |\n|-------------------|------------|\n| Image size        | (224, 224) |\n| Batch size/worker |     32     |\n| Optimizer         |    Adam    |\n| Learning rate     |    0.001   |\n\nThe [iperf3](https://github.com/esnet/iperf) tool is used to measure the [bandwidth](bandwidth.csv) between machines in the network.\n### 1. Multi-worker Training\n![](images/multi-worker.png)\n\n### 2. Parameter Server Training\n![](images/parameter-server.png)\n\n## Results\n\n|  Training method  |   Dataset   |   Connection   | Avg. 
s/epoch |\n|:-----------------:|:-----------:|:--------------:|:------------:|\n| **Single-worker** | **flowers** |     **LAN**    |    **14**    |\n|    Multi-worker   |   flowers   |       LAN      |      18      |\n|  **Multi-worker** | **flowers** | **VPN Tunnel** |    **635**   |\n|    Multi-worker   |  30VNFoods  |       LAN      |      184     |\n|  Parameter Server |  30VNFoods  |       LAN      |      115     |\n\n\u0026rArr; For more information, see [Report.pdf](Report.pdf).\n\n## References\n- [Distributed training with Keras](https://www.tensorflow.org/tutorials/distribute/keras)\n- [A friendly introduction to distributed training (ML Tech Talks)](https://youtu.be/S1tN9a4Proc)\n- [Distributed TensorFlow training (Google I/O '18)](https://youtu.be/bRMGoPqsn20)\n- [Inside TensorFlow: Parameter server training](https://youtu.be/B2Tpv_N7wkg)\n- [Performance issue for Distributed TF](https://github.com/tensorflow/tensorflow/issues/4164)\n- [When is TensorFlow's ParameterServerStrategy preferable to its MultiWorkerMirroredStrategy?](https://stackoverflow.com/questions/63374495/when-is-tensorflows-parameterserverstrategy-preferable-to-its-multiworkermirror)\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F18520339%2Fml-distributed-training","html_url":"https://awesome.ecosyste.ms/projects/github.com%2F18520339%2Fml-distributed-training","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F18520339%2Fml-distributed-training/lists"}