https://github.com/18520339/ml-distributed-training
Reduce the training time of CNNs by leveraging the power of multiple GPUs with 2 approaches, Multi-worker & Parameter Server Training, using TensorFlow 2
- Host: GitHub
- URL: https://github.com/18520339/ml-distributed-training
- Owner: 18520339
- Created: 2022-01-26T10:30:00.000Z (almost 4 years ago)
- Default Branch: main
- Last Pushed: 2022-09-26T15:36:46.000Z (about 3 years ago)
- Last Synced: 2025-04-12T18:57:06.790Z (8 months ago)
- Topics: distributed, distributed-tensorflow, distributed-training, multi-gpu, multi-workers, parameter-server, tensorflow
- Language: Jupyter Notebook
- Homepage: https://elib.vku.udn.vn/handle/123456789/2300
- Size: 8.05 MB
- Stars: 13
- Watchers: 1
- Forks: 3
- Open Issues: 2
- Metadata Files:
  - Readme: README.md
# ML Distributed training
> Demo: https://youtu.be/OOPVA-eqBTY
## Introduction
This project leverages the power of multiple GPUs to reduce the training time of complex models through data parallelism, with 2 approaches:
1. Multi-worker Training using 2 PCs with GeForce RTX GPUs as Workers, connected via:
    - Local area network (LAN).
    - VPN tunnel using [OpenVPN](https://openvpn.net) (not included in the demo).
2. Parameter Server Training using 5 machines in a LAN:
    - 2 Laptops as Parameter Servers, connected via 5 GHz Wi-Fi.
    - 2 PCs with GeForce RTX GPUs as Workers.
    - 1 PC with only a CPU as the Coordinator.
## Dataset
We used our self-built [30VNFoods](https://github.com/18520339/30VNFoods) dataset, which includes collected and labeled images of 30 famous Vietnamese dishes. This dataset is divided into:
- 17,581 images for training.
- 2,515 images for validation.
- 5,040 images for testing.
In addition, we also used the small [TensorFlow flowers](https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz) dataset of about 3,700 flower images, organized into 5 folders corresponding to 5 types of flowers (`daisy`, `dandelion`, `roses`, `sunflowers`, `tulips`).
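As a rough sketch (not the repository's exact loader), an image-folder dataset like these can be read with `tf.keras.utils.image_dataset_from_directory`; the directory paths below are placeholders, and the image and batch sizes come from the Setup table that follows:

```python
import tensorflow as tf

IMG_SIZE = (224, 224)  # see the Setup table below
BATCH_SIZE = 32        # per-worker batch size

# Hypothetical layout: one sub-folder per class, e.g. 30VNFoods/train/<dish>/*.jpg
train_ds = tf.keras.utils.image_dataset_from_directory(
    '30VNFoods/train', image_size=IMG_SIZE, batch_size=BATCH_SIZE, shuffle=True)
val_ds = tf.keras.utils.image_dataset_from_directory(
    '30VNFoods/validation', image_size=IMG_SIZE, batch_size=BATCH_SIZE, shuffle=False)

# Prefetch so the GPUs never wait on the input pipeline
train_ds = train_ds.prefetch(tf.data.AUTOTUNE)
val_ds = val_ds.prefetch(tf.data.AUTOTUNE)
```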
## Setup
| Hyperparameter    | Value      |
|-------------------|------------|
| Image size | (224, 224) |
| Batch size/worker | 32 |
| Optimizer | Adam |
| Learning rate | 0.001 |
The [iperf3](https://github.com/esnet/iperf) tool is used to measure the [bandwidth](bandwidth.csv) between the machines in the network.
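As a minimal sketch of how these settings fit together, the snippet below builds and compiles a Keras classifier; the MobileNetV2 backbone is only an assumption for illustration, not necessarily the CNN used in the report:

```python
import tensorflow as tf

NUM_CLASSES = 30  # 30VNFoods; use 5 for the flowers dataset

def build_model():
    # Any CNN backbone works here; MobileNetV2 is just an illustrative choice
    base = tf.keras.applications.MobileNetV2(
        input_shape=(224, 224, 3), include_top=False, weights='imagenet')
    model = tf.keras.Sequential([
        tf.keras.layers.Rescaling(1. / 127.5, offset=-1),  # MobileNetV2 expects inputs in [-1, 1]
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(NUM_CLASSES, activation='softmax'),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),  # Adam, lr 0.001 as above
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy'])
    return model
```

In the distributed runs below, the only change is that `build_model()` is called inside `strategy.scope()` so its variables are created under the chosen distribution strategy.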
### 1. Multi-worker Training

### 2. Parameter Server Training

## Result
| Training method   | Dataset     | Connection     | Avg. seconds/epoch |
|:-----------------:|:-----------:|:--------------:|:------------:|
| **Single-worker** | **flowers** | **LAN** | **14** |
| Multi-worker | flowers | LAN | 18 |
| **Multi-worker** | **flowers** | **VPN Tunnel** | **635** |
| Multi-worker | 30VNFoods | LAN | 184 |
| Parameter Server | 30VNFoods | LAN | 115 |
⇒ For more information, see [Report.pdf](Report.pdf).
## References
- [Distributed training with Keras](https://www.tensorflow.org/tutorials/distribute/keras)
- [A friendly introduction to distributed training (ML Tech Talks)](https://youtu.be/S1tN9a4Proc)
- [Distributed TensorFlow training (Google I/O '18)](https://youtu.be/bRMGoPqsn20)
- [Inside TensorFlow: Parameter server training](https://youtu.be/B2Tpv_N7wkg)
- [Performance issue for Distributed TF](https://github.com/tensorflow/tensorflow/issues/4164)
- [When is TensorFlow's ParameterServerStrategy preferable to its MultiWorkerMirroredStrategy?](https://stackoverflow.com/questions/63374495/when-is-tensorflows-parameterserverstrategy-preferable-to-its-multiworkermirror)