Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/douban/tfmesos

Tensorflow in Docker on Mesos #tfmesos #tensorflow #mesos
https://github.com/douban/tfmesos

deep-learning deep-neural-networks distributed docker machine-learning mesos ml neural-network nvidia-docker tensorflow

Last synced: 3 months ago
JSON representation

Tensorflow in Docker on Mesos #tfmesos #tensorflow #mesos

Lists

README

        

TFMesos
========

.. image:: https://badges.gitter.im/douban/tfmesos.svg
:alt: Join the chat at https://gitter.im/douban/tfmesos
:target: https://gitter.im/douban/tfmesos?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge

.. image:: https://img.shields.io/travis/douban/tfmesos.svg
:target: https://travis-ci.org/douban/tfmesos/
.. image:: https://img.shields.io/pypi/v/tfmesos.svg
:target: https://pypi.python.org/pypi/tfmesos
.. image:: https://img.shields.io/docker/automated/tfmesos/tfmesos.svg
:target: https://hub.docker.com/r/tfmesos/tfmesos/

``TFMesos`` is a lightweight framework to help running distributed `Tensorflow `_ Machine Learning tasks on `Apache Mesos `_ within `Docker `_ and `Nvidia-Docker `_ .

``TFMesos`` dynamically allocates resources from a ``Mesos`` cluster, builds a distributed training cluster for ``Tensorflow``, and makes different training tasks mangeed and isolated in the shared ``Mesos`` cluster with the help of ``Docker``.

Prerequisites
--------------

* For ``Mesos >= 1.0.0``:

1. ``Mesos`` Cluster (cf: `Mesos Getting Started `_). All nodes in the cluster should be reachable using their hostnames, and all nodes have identical ``/etc/passwd`` and ``/etc/group``.

2. Setup ``Mesos Agent`` to enable `Mesos Containerizer `_ and `Mesos Nvidia GPU Support `_ (optional). eg: ``mesos-agent --containerizers=mesos --image_providers=docker --isolation=filesystem/linux,docker/runtime,cgroups/devices,gpu/nvidia``

3. (optional) A Distributed Filesystem (eg: `MooseFS `_)

4. Ensure latest ``TFMesos`` docker image (`tfmesos/tfmesos `_) is pulled across the whole cluster

* For ``Mesos < 1.0.0``:

1. ``Mesos`` Cluster (cf: `Mesos Getting Started `_). All nodes in the cluster should be reachable using their hostnames, and all nodes have identical ``/etc/passwd`` and ``/etc/group``.

2. ``Docker`` (cf: `Docker Get Start Tutorial `_)

3. ``Mesos Docker Containerizer Support`` (cf: `Mesos Docker Containerizer `_)

4. (optional) ``Nvidia-docker`` installation (cf: `Nvidia-docker installation `_) and make sure nvidia-plugin is accessible from remote host (with ``-l 0.0.0.0:3476``)

5. (optional) A Distributed Filesystem (eg: `MooseFS `_)

6. Ensure latest ``TFMesos`` docker image (`tfmesos/tfmesos `_) is pulled across the whole cluster

If you are using ``AWS G2`` instance, here is a `sample `_ script to setup most of there prerequisites.

Running simple Test
------------------------
After setting up the mesos and pulling the docker image on a single node (or a cluser), you should be able to use the following command to run a simple test.

.. code:: bash

$ docker run -e MESOS_MASTER=mesos-master:5050 \
-e DOCKER_IMAGE=tfmesos/tfmesos \
--net=host \
-v /path-to-your-tfmesos-code/tfmesos/examples/plus.py:/tmp/plus.py \
--rm \
-it \
tfmesos/tfmesos \
python /tmp/plus.py mesos-master:5050

Successfully running the test should result in an output of 42 on the console.

Running in replica mode
------------------------
This mode is called `Between-graph replication` in official `Distributed Tensorflow Howto `_

Most distributed training models that Google has open sourced (such as `mnist_replica `_ and `inception `_) are using this mode. In this mode, two kind of Jobs are defined with the names `'ps'` and `'worker'`. `'ps'` tasks act as `'Parameter Server'` and `'worker'` tasks run the actual training process.

Here we use our modified `'mnist_replica' `_ as example:

1. Checkout the `mnist` example codes into a directory in shared filesystem, eg: `/nfs/mnist`
2. Assume Mesos master is `mesos-master:5050`
3. Now we can launch this script using following commands:

CPU:

.. code:: bash

$ docker run --rm -it -e MESOS_MASTER=mesos-master:5050 \
--net=host \
-v /nfs/mnist:/nfs/mnist \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-u `id -u` \
-w /nfs/mnist \
tfmesos/tfmesos \
tfrun -w 1 -s 1 \
-V /nfs/mnist:/nfs/mnist \
-- python mnist_replica.py \
--ps_hosts {ps_hosts} --worker_hosts {worker_hosts} \
--job_name {job_name} --worker_index {task_index}

GPU (1 GPU per worker):

.. code:: bash

$ nvidia-docker run --rm -it -e MESOS_MASTER=mesos-master:5050 \
--net=host \
-v /nfs/mnist:/nfs/mnist \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-u `id -u` \
-w /nfs/mnist \
tfmesos/tfmesos \
tfrun -w 1 -s 1 -Gw 1 -- python mnist_replica.py \
--ps_hosts {ps_hosts} --worker_hosts {worker_hosts} \
--job_name {job_name} --worker_index {task_index}

Note:

In this mode, `tfrun` is used to prepare the cluster and launch the training script on each node, and worker #0 (the chief worker) will be launched in the local container.
`tfrun` will substitute `{ps_hosts}`, `{worker_hosts}`, `{job_name}`, `{task_index}` with corresponding values of each task.

Running in fine-grained mode
-----------------------------

This mode is called `In-graph replication` in official `Distributed Tensorflow Howto `_

In this mode, we have more control over the cluster spec. All nodes in the cluster is remote and just running a `Grpc` server. Each worker is driven by a local thread to run the training task.

Here we use our modified `mnist `_ as example:

1. Checkout the `mnist` example codes into a directory, eg: `/tmp/mnist`
2. Assume Mesos master is `mesos-master:5050`
3. Now we can launch this script using following commands:

CPU:

.. code:: bash

$ docker run --rm -it -e MESOS_MASTER=mesos-master:5050 \
--net=host \
-v /tmp/mnist:/tmp/mnist \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-u `id -u` \
-w /tmp/mnist \
tfmesos/tfmesos \
python mnist.py

GPU (1 GPU per worker):

.. code:: bash

$ nvidia-docker run --rm -it -e MESOS_MASTER=mesos-master:5050 \
--net=host \
-v /tmp/mnist:/tmp/mnist \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-u `id -u` \
-w /tmp/mnist \
tfmesos/tfmesos \
python mnist.py --worker-gpus 1