https://github.com/graykode/horovod-ansible
Create Horovod cluster easily using Ansible
- Host: GitHub
- URL: https://github.com/graykode/horovod-ansible
- Owner: graykode
- Created: 2019-07-06T12:47:07.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2019-07-08T12:56:41.000Z (almost 6 years ago)
- Last Synced: 2025-03-30T22:32:20.899Z (about 1 month ago)
- Topics: ansible, deeplearning, distributed-training, horovod, openmpi, pytorch, tensorflow, terraform
- Language: HCL
- Size: 217 KB
- Stars: 22
- Watchers: 3
- Forks: 5
- Open Issues: 0
## horovod-ansible
[Horovod](https://github.com/horovod/horovod) is a distributed training framework for TensorFlow, Keras, PyTorch, and MXNet. [Ansible](https://github.com/ansible/ansible) is a radically simple IT automation system. This project uses Ansible to install Horovod automatically on every server, either on **AWS or on-premise**.
##### Before You Start
- All on-premise nodes should run Ubuntu 16.04 or later. **For on-premise setups, this guide assumes that Ansible is already installed on all nodes.**
- So far, only the TensorFlow and PyTorch examples can be used (MXNet, Caffe, etc. are not supported yet).
- `AWS` steps: 0 - 1 - 3
- `On-premise` steps: 0 - 2 - 3

## Usage
### 0. Docker setup (both AWS and on-premise)
To keep things simple for beginners, all steps are run inside a Docker container.

```bash
$ docker run -it --name horovod-ansible graykode/horovod-ansible:0.1 /bin/bash
```

### 1. AWS
To create the Horovod cluster environment, provision it with the `Terraform` code. Adjust any options you need in [`variables.tf`](https://github.com/graykode/horovod-ansible/blob/master/terraform/variables.tf), but do not change anything below the `## DO NOT CHANGE BELOW` line.
For example, if the EC2 instances are created with `number_of_worker` set to 3, the overall architecture looks like the picture below.
![]()
Export your own AWS access key and secret key:

```bash
$ export AWS_ACCESS_KEY_ID=
$ export AWS_SECRET_ACCESS_KEY=
```

Initialize Terraform and create the private key to use:
```bash
$ cd terraform/ && ssh-keygen -t rsa -N "" -f horovod
$ terraform init
```

Provision all resources, that is, EC2 and the VPC (gateway, router, subnet, etc.):
```bash
$ terraform apply
```

You should then see output like this:
```bash
Apply complete! Resources: 12 added, 0 changed, 0 destroyed.

Outputs:
horovod_master_public_ip =
horovod_workers_public_ip = ,
```

### 2. On-premise
As noted above, this step assumes that Ansible is installed on all nodes and that the network setup is complete. To install Ansible, see the [Ansible Install Guide](https://docs.ansible.com/ansible/latest/installation_guide/intro_installation.html).
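For on-premise clusters the node addresses are known up front, so the `inventory.ini` used in step 3 can be generated rather than typed by hand. The following is a minimal sketch, not part of this repository: `render_inventory` is a hypothetical helper name. It assumes the first argument is the master address, the remaining arguments are worker addresses, and the file layout is the one shown in step 3.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not shipped with horovod-ansible): print an
# inventory.ini in the layout used in step 3 of this README.
#   $1    address of the master node
#   $2..  addresses of the worker nodes (worker0, worker1, ...)
render_inventory() {
  local master_ip=$1
  shift
  local -a workers=("$@")
  local i

  # Host definitions: one alias per node.
  printf 'master ansible_host=%s\n' "$master_ip"
  for i in "${!workers[@]}"; do
    printf 'worker%d ansible_host=%s\n' "$i" "${workers[$i]}"
  done

  # Group containing every node.
  printf '\n[all]\nmaster\n'
  for i in "${!workers[@]}"; do
    printf 'worker%d\n' "$i"
  done

  # Group containing only the master.
  printf '\n[master-servers]\nmaster\n'

  # Group containing only the workers.
  printf '\n[worker-servers]\n'
  for i in "${!workers[@]}"; do
    printf 'worker%d\n' "$i"
  done
}
```

After sourcing the function, something like `render_inventory 10.0.0.10 10.0.0.11 10.0.0.12 > inventory.ini` would write the file; the addresses here are placeholders, not values from this project.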
### 3. Configure Horovod using Ansible (both AWS and on-premise)
Install `ansible` and `jinja2` using pip.
```bash
$ cd ../ansible && pip install -r requirements.txt
```

Set up `inventory.ini` in the `ansible` folder:
```ini
master ansible_host=
worker0 ansible_host=
worker1 ansible_host=
....
worker[n] ansible_host=

[all]
master
worker0
worker1
...
worker[n]

[master-servers]
master

[worker-servers]
worker0
worker1
...
worker[n]
```

Run a ping test against all nodes:
```bash
$ chmod +x ping.sh && ./ping.sh
```

Next, configure SSH for Open MPI, then download and build Open MPI:
```bash
$ chmod +x playbook.sh && ./playbook.sh
```

From the master node, test that MPI works across all nodes:
```bash
$ chmod +x test.sh && ./test.sh
# go to master node.
ubuntu@master:~$ mpirun -np 3 -mca btl sm,self,tcp -host master,worker0,worker1 ./test
Processor name: master
master (0/3)
Processor name: worker0
slave (1/3)
Processor name: worker1
slave (2/3)
```

### 4. Install the deep learning framework you want and Horovod (both AWS and on-premise)
Feel free to adapt this part to your own needs.
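When adapting the `horovodrun` commands in this step to a different cluster size, the `-np` count and the `-H` host list have to match the nodes in the inventory. The sketch below builds both values from a worker count. `horovod_hosts` and `horovod_np` are hypothetical helper names, and the sketch assumes the `master`, `worker0`, `worker1`, ... naming used in this README with one process per node.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of horovod-ansible).

# Print the comma-separated host list for a cluster with one master
# and $1 workers, e.g. "master,worker0,worker1" for 2 workers.
horovod_hosts() {
  local workers=$1
  local hosts="master"
  local i
  for ((i = 0; i < workers; i++)); do
    hosts+=",worker${i}"
  done
  printf '%s\n' "$hosts"
}

# Print the process count: one process per node, master included.
horovod_np() {
  local workers=$1
  printf '%d\n' "$((workers + 1))"
}
```

With two workers, `horovodrun -np "$(horovod_np 2)" -H "$(horovod_hosts 2)" python3 tensorflow-train.py` expands to the same arguments as the three-node examples in this step.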
- Install TensorFlow (CPU) and Horovod, then run distributed training
```bash
$ chmod +x tensorflow.sh && ./tensorflow.sh
# go to master node.
ubuntu@master:~$ horovodrun -np 3 -H master,worker0,worker1 python3 tensorflow-train.py
```

- Install PyTorch (CPU) and Horovod, then run distributed training
```bash
$ chmod +x pytorch.sh && ./pytorch.sh
# go to master node.
ubuntu@master:~$ horovodrun -np 3 -H master,worker0,worker1 python3 pytorch-train.py
```

- Issue note: To change frameworks after Horovod is installed, you must reinstall Horovod with the `HOROVOD_WITH_*` option, where `*` is the framework name. See this [horovod issue](https://github.com/horovod/horovod/issues/314). The Ansible scripts in this repository do not handle this yet.
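Because the Ansible scripts do not cover the reinstall yet, it has to be done by hand on each node. The sketch below only prints the command for a given framework, so it can be reviewed before it is run. `horovod_reinstall_cmd` is a hypothetical helper name. The sketch assumes `pip` is the installer in use and that the framework is one of TensorFlow, PyTorch, or MXNet.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of horovod-ansible): print the command
# that reinstalls Horovod with support for one framework forced on.
#   $1  framework name: tensorflow, pytorch, or mxnet (any case)
horovod_reinstall_cmd() {
  local framework
  # Upper-case the name to match the HOROVOD_WITH_* variable naming.
  framework=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')

  case "$framework" in
    TENSORFLOW | PYTORCH | MXNET) ;;
    *)
      echo "unsupported framework: $1" >&2
      return 1
      ;;
  esac

  # --no-cache-dir makes pip rebuild Horovod instead of reusing a
  # cached wheel that was built for the previous framework.
  printf 'pip uninstall -y horovod && HOROVOD_WITH_%s=1 pip install --no-cache-dir horovod\n' "$framework"
}
```

For example, `horovod_reinstall_cmd pytorch` prints a command with `HOROVOD_WITH_PYTORCH=1` set, which can then be run on the master and on every worker.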
## Author
- Tae Hwan Jung(Jeff Jung) @graykode