https://github.com/tonysy/distributed-tutorial
Tutorial for Distributed Deep Learning on Serveral Frameworks,e.g. Keras, Tensorflow, PyTorch
https://github.com/tonysy/distributed-tutorial
Last synced: about 2 months ago
JSON representation
Tutorial for Distributed Deep Learning on Serveral Frameworks,e.g. Keras, Tensorflow, PyTorch
- Host: GitHub
- URL: https://github.com/tonysy/distributed-tutorial
- Owner: tonysy
- Created: 2018-10-04T05:35:14.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2018-10-04T05:35:58.000Z (over 6 years ago)
- Last Synced: 2025-02-14T11:53:00.694Z (4 months ago)
- Homepage:
- Size: 10.7 KB
- Stars: 4
- Watchers: 3
- Forks: 1
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Install Softwares to support distributed training on AI cluster
本教程基于horovod分布式训练框架,支持keras, tensorflow和pytorch以下相关版本已经过测试:
- keras: 2.2.2
- tensorflow: 1.10.0
- pytorch: 0.4.0## 硬件环境
本教程针对ShanghaiTech AI集群
- Container开启时需使用`-network=host`
- 确保每个节点的hostname是不相同的,hostname请勿使用 `.`## 软件环境安装
以下软件均在计算节点安装### NCCL
1. 下载 [NCCL 2](https://developer.nvidia.com/nccl).
2. 安装步骤 [here](http://docs.nvidia.com/deeplearning/sdk/nccl-install-guide/index.html).3. If you have installed NCCL 2 using the `nccl-.txz` package, you should add the library path to `LD_LIBRARY_PATH`
environment variable or register it in `/etc/ld.so.conf`.
```bash
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/nccl-/lib
```### OpenMPI
1. 下载最新版安装包 [Open MPI](https://www.open-mpi.org/)
2. 安装步骤 [here](https://www.open-mpi.org/faq/?category=building#easy-build).
3. `ldconfig`### Deep learning framework
- 安装keras,tensorflow,pytorch
```shell
pip install keras==2.2.2 tensorflow-gpu==1.10.0 torch==0.4.0 torchvision
HOROVOD_NCCL_HOME=/usr/local/nccl- HOROVOD_GPU_ALLREDUCE=NCCL pip install --no-cache-dir horovod
```### Example
参见examples中的README.md