{"id":13740122,"url":"https://github.com/douban/tfmesos","last_synced_at":"2025-04-06T01:07:08.024Z","repository":{"id":7823981,"uuid":"56483642","full_name":"douban/tfmesos","owner":"douban","description":"Tensorflow in Docker on Mesos #tfmesos #tensorflow #mesos","archived":false,"fork":false,"pushed_at":"2023-03-24T21:54:14.000Z","size":120,"stargazers_count":191,"open_issues_count":5,"forks_count":48,"subscribers_count":27,"default_branch":"master","last_synced_at":"2025-03-30T00:09:51.254Z","etag":null,"topics":["deep-learning","deep-neural-networks","distributed","docker","machine-learning","mesos","ml","neural-network","nvidia-docker","tensorflow"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/douban.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2016-04-18T06:44:24.000Z","updated_at":"2024-04-04T15:47:12.000Z","dependencies_parsed_at":"2024-01-18T17:43:59.050Z","dependency_job_id":"830067f2-9c45-4602-9c39-7b3de623dfe4","html_url":"https://github.com/douban/tfmesos","commit_stats":{"total_commits":95,"total_committers":10,"mean_commits":9.5,"dds":"0.42105263157894735","last_synced_commit":"e5da934739e90731e9eb159052b4d3e1d2ee5cf8"},"previous_names":[],"tags_count":9,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/douban%2Ftfmesos","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/douban%2Ftfmesos/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/douban%2Ftfmesos/releases",
"manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/douban%2Ftfmesos/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/douban","download_url":"https://codeload.github.com/douban/tfmesos/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247419860,"owners_count":20936012,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","deep-neural-networks","distributed","docker","machine-learning","mesos","ml","neural-network","nvidia-docker","tensorflow"],"created_at":"2024-08-03T04:00:42.942Z","updated_at":"2025-04-06T01:07:08.004Z","avatar_url":"https://github.com/douban.png","language":"Python","funding_links":[],"categories":["Frameworks"],"sub_categories":["Machine Learning"],"readme":"TFMesos \n========\n\n.. image:: https://badges.gitter.im/douban/tfmesos.svg\n   :alt: Join the chat at https://gitter.im/douban/tfmesos\n   :target: https://gitter.im/douban/tfmesos?utm_source=badge\u0026utm_medium=badge\u0026utm_campaign=pr-badge\u0026utm_content=badge\n\n.. image:: https://img.shields.io/travis/douban/tfmesos.svg\n    :target: https://travis-ci.org/douban/tfmesos/\n.. image:: https://img.shields.io/pypi/v/tfmesos.svg\n    :target: https://pypi.python.org/pypi/tfmesos\n.. 
image:: https://img.shields.io/docker/automated/tfmesos/tfmesos.svg\n    :target: https://hub.docker.com/r/tfmesos/tfmesos/\n\n``TFMesos`` is a lightweight framework that helps run distributed `Tensorflow \u003chttps://www.tensorflow.org\u003e`_ Machine Learning tasks on `Apache Mesos \u003chttp://mesos.apache.org\u003e`_ within `Docker \u003chttps://www.docker.com\u003e`_ and `Nvidia-Docker \u003chttps://github.com/NVIDIA/nvidia-docker/\u003e`_ .\n\n``TFMesos`` dynamically allocates resources from a ``Mesos`` cluster, builds a distributed training cluster for ``Tensorflow``, and keeps different training tasks managed and isolated in the shared ``Mesos`` cluster with the help of ``Docker``.\n\n\nPrerequisites\n--------------\n\n* For ``Mesos \u003e= 1.0.0``:\n\n1. ``Mesos`` Cluster (cf: `Mesos Getting Started \u003chttp://mesos.apache.org/documentation/latest/getting-started\u003e`_). All nodes in the cluster should be reachable using their hostnames, and all nodes should have identical ``/etc/passwd`` and ``/etc/group``.\n  \n2. Set up ``Mesos Agent`` to enable `Mesos Containerizer \u003chttp://mesos.apache.org/documentation/container-image/\u003e`_ and `Mesos Nvidia GPU Support \u003chttps://issues.apache.org/jira/browse/MESOS-4626\u003e`_ (optional). eg: ``mesos-agent --containerizers=mesos --image_providers=docker --isolation=filesystem/linux,docker/runtime,cgroups/devices,gpu/nvidia``\n    \n3. (optional) A Distributed Filesystem (eg: `MooseFS \u003chttps://moosefs.com\u003e`_)\n  \n4. Ensure the latest ``TFMesos`` docker image (`tfmesos/tfmesos \u003chttps://hub.docker.com/r/tfmesos/tfmesos/\u003e`_) is pulled across the whole cluster\n\n* For ``Mesos \u003c 1.0.0``:\n\n1. ``Mesos`` Cluster (cf: `Mesos Getting Started \u003chttp://mesos.apache.org/documentation/latest/getting-started\u003e`_). All nodes in the cluster should be reachable using their hostnames, and all nodes should have identical ``/etc/passwd`` and ``/etc/group``.\n\n2. 
``Docker`` (cf: `Docker Get Start Tutorial \u003chttps://docs.docker.com/engine/installation/linux/\u003e`_)\n\n3. ``Mesos Docker Containerizer Support`` (cf: `Mesos Docker Containerizer \u003chttp://mesos.apache.org/documentation/latest/docker-containerizer/\u003e`_)\n\n4. (optional) ``Nvidia-docker`` installation (cf: `Nvidia-docker installation \u003chttps://github.com/NVIDIA/nvidia-docker/wiki/Installation\u003e`_) and make sure nvidia-plugin is accessible from remote hosts (with ``-l 0.0.0.0:3476``)\n\n5. (optional) A Distributed Filesystem (eg: `MooseFS \u003chttps://moosefs.com\u003e`_)\n\n6. Ensure the latest ``TFMesos`` docker image (`tfmesos/tfmesos \u003chttps://hub.docker.com/r/tfmesos/tfmesos/\u003e`_) is pulled across the whole cluster\n\nIf you are using an ``AWS G2`` instance, here is a `sample \u003chttps://github.com/douban/tfmesos/blob/master/misc/setup-aws-g2.sh\u003e`_ script to set up most of these prerequisites.\n\n\nRunning a simple test\n------------------------\nAfter setting up Mesos and pulling the Docker image on a single node (or a cluster), you should be able to use the following command to run a simple test.\n\n.. 
code:: bash\n\n    $ docker run -e MESOS_MASTER=mesos-master:5050 \\\n        -e DOCKER_IMAGE=tfmesos/tfmesos \\\n        --net=host \\\n        -v /path-to-your-tfmesos-code/tfmesos/examples/plus.py:/tmp/plus.py \\\n        --rm \\\n        -it \\\n        tfmesos/tfmesos \\\n        python /tmp/plus.py mesos-master:5050\n\nSuccessfully running the test should result in an output of 42 on the console.\n\n\nRunning in replica mode\n------------------------\nThis mode is called `Between-graph replication` in the official `Distributed Tensorflow Howto \u003chttps://github.com/tensorflow/tensorflow/blob/master/tensorflow/g3doc/how_tos/distributed/index.md#replicated-training\u003e`_\n\nMost distributed training models that Google has open sourced (such as `mnist_replica \u003chttps://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/dist_test/python/mnist_replica.py\u003e`_ and `inception \u003chttps://github.com/tensorflow/models/blob/master/inception/inception/inception_distributed_train.py\u003e`_) use this mode. In this mode, two kinds of jobs are defined with the names `'ps'` and `'worker'`. `'ps'` tasks act as `'Parameter Server'` and `'worker'` tasks run the actual training process.\n\nHere we use our modified `'mnist_replica' \u003chttps://github.com/douban/tfmesos/blob/master/examples/mnist/mnist_replica.py\u003e`_ as an example:\n\n1. Check out the `mnist` example code into a directory on a shared filesystem, eg: `/nfs/mnist`\n2. Assume the Mesos master is `mesos-master:5050`\n3. Now we can launch this script using the following commands:\n\nCPU:\n\n.. 
code:: bash\n\n    $ docker run --rm -it -e MESOS_MASTER=mesos-master:5050 \\\n                 --net=host \\\n                 -v /nfs/mnist:/nfs/mnist \\\n                 -v /etc/passwd:/etc/passwd:ro \\\n                 -v /etc/group:/etc/group:ro \\\n                 -u `id -u` \\\n                 -w /nfs/mnist \\\n                 tfmesos/tfmesos \\\n                 tfrun -w 1 -s 1  \\\n                 -V /nfs/mnist:/nfs/mnist \\\n                 -- python mnist_replica.py \\\n                 --ps_hosts {ps_hosts} --worker_hosts {worker_hosts} \\\n                 --job_name {job_name} --worker_index {task_index}\n\nGPU (1 GPU per worker):\n\n.. code:: bash\n\n    $ nvidia-docker run --rm -it -e MESOS_MASTER=mesos-master:5050 \\\n                 --net=host \\\n                 -v /nfs/mnist:/nfs/mnist \\\n                 -v /etc/passwd:/etc/passwd:ro \\\n                 -v /etc/group:/etc/group:ro \\\n                 -u `id -u` \\\n                 -w /nfs/mnist \\\n                 tfmesos/tfmesos \\\n                 tfrun -w 1 -s 1 -Gw 1 -- python mnist_replica.py \\\n                 --ps_hosts {ps_hosts} --worker_hosts {worker_hosts} \\\n                 --job_name {job_name} --worker_index {task_index}\n\n\nNote:\n\nIn this mode, `tfrun` is used to prepare the cluster and launch the training script on each node, and worker #0 (the chief worker) will be launched in the local container.\n`tfrun` will substitute `{ps_hosts}`, `{worker_hosts}`, `{job_name}`, `{task_index}` with the corresponding values for each task.\n\n\nRunning in fine-grained mode\n-----------------------------\n\nThis mode is called `In-graph replication` in the official `Distributed Tensorflow Howto \u003chttps://github.com/tensorflow/tensorflow/blob/master/tensorflow/g3doc/how_tos/distributed/index.md#replicated-training\u003e`_\n\nIn this mode, we have more control over the cluster spec. All nodes in the cluster are remote and each just runs a `gRPC` server. 
Each worker is driven by a local thread to run the training task.\n\nHere we use our modified `mnist \u003chttps://github.com/douban/tfmesos/blob/master/examples/mnist/mnist.py\u003e`_ as an example:\n\n1. Check out the `mnist` example code into a directory, eg: `/tmp/mnist`\n2. Assume the Mesos master is `mesos-master:5050`\n3. Now we can launch this script using the following commands:\n\nCPU:\n\n.. code:: bash\n\n    $ docker run --rm -it -e MESOS_MASTER=mesos-master:5050 \\\n                 --net=host \\\n                 -v /tmp/mnist:/tmp/mnist \\\n                 -v /etc/passwd:/etc/passwd:ro \\\n                 -v /etc/group:/etc/group:ro \\\n                 -u `id -u` \\\n                 -w /tmp/mnist \\\n                 tfmesos/tfmesos \\\n                 python mnist.py \n\nGPU (1 GPU per worker):\n\n.. code:: bash\n\n    $ nvidia-docker run --rm -it -e MESOS_MASTER=mesos-master:5050 \\\n                 --net=host \\\n                 -v /tmp/mnist:/tmp/mnist \\\n                 -v /etc/passwd:/etc/passwd:ro \\\n                 -v /etc/group:/etc/group:ro \\\n                 -u `id -u` \\\n                 -w /tmp/mnist \\\n                 tfmesos/tfmesos \\\n                 python mnist.py --worker-gpus 1\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdouban%2Ftfmesos","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdouban%2Ftfmesos","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdouban%2Ftfmesos/lists"}