{"id":13604934,"url":"https://github.com/MachineLearningSystem/KungFu","last_synced_at":"2025-04-12T02:32:21.976Z","repository":{"id":185461822,"uuid":"505392247","full_name":"MachineLearningSystem/KungFu","owner":"MachineLearningSystem","description":"Fast and Adaptive Distributed Machine Learning for TensorFlow, PyTorch and MindSpore.","archived":false,"fork":true,"pushed_at":"2022-01-31T10:22:03.000Z","size":2009,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2024-08-02T19:36:37.282Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"lsds/KungFu","license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MachineLearningSystem.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2022-06-20T10:21:06.000Z","updated_at":"2022-06-13T21:48:59.000Z","dependencies_parsed_at":null,"dependency_job_id":"eb273335-3d03-47d4-949e-cdc317aed2e9","html_url":"https://github.com/MachineLearningSystem/KungFu","commit_stats":null,"previous_names":["machinelearningsystem/kungfu"],"tags_count":10,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2FKungFu","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2FKungFu/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2FKungFu/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2FKungFu/manifests","owner_url":"https://repos.ecosyste.ms
/api/v1/hosts/GitHub/owners/MachineLearningSystem","download_url":"https://codeload.github.com/MachineLearningSystem/KungFu/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223489708,"owners_count":17153807,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T19:00:52.876Z","updated_at":"2024-11-07T09:31:17.765Z","avatar_url":"https://github.com/MachineLearningSystem.png","language":null,"readme":"\u003cdiv align=\"center\"\u003e\n    \u003cimg src=\"docs/kungfu-logo.png\" width=\"50%\" height=\"30%\"/\u003e\n\u003c/div\u003e\n\n# KungFu\n\nMaking adaptive distributed machine learning easy and efficient.\n\n[![Build Status](https://travis-ci.com/lsds/KungFu.svg?branch=master)](https://travis-ci.com/lsds/KungFu)\n[![Documentation Status](https://readthedocs.org/projects/kungfu/badge/?version=latest)](https://kungfu.readthedocs.io/en/latest/?badge=latest)\n\n## Features\n\nKungFu aims to help users achieve *fast* and *adaptive* distributed machine learning with *minimal* effort. This is important because a machine learning system must cope with increasingly complex models and complicated deployment environments, making it\ndifficult to consistently deliver high performance with an *empirically chosen* configuration.\nTo address this, KungFu provides the following unique features:\n\n* Simplicity: KungFu enables distributed training by adding minimal code to your training program. KungFu is also simple to install and run. 
It does not require extra deployments such as parameter servers or heavy dependencies such as the MPI used by Horovod.\n* Adaptable distributed training: KungFu provides useful advanced [distributed optimizers](srcs/python/kungfu/tensorflow/optimizers/__init__.py) such as\nthe communication-efficient ``PairAveragingOptimizer`` and the hyper-parameter-robust ``SynchronousAveragingOptimizer`` to help you address the cases in which conventional synchronous SGD does not scale. See [Optimizers](https://github.com/lsds/KungFu#choosing-the-right-optimizer) for how to choose the right KungFu optimizer for your training scenario.\n* Online monitoring and control: KungFu aims to support [distributed SGD metrics](srcs/python/kungfu/tensorflow/optimizers/sync_sgd.py) such as [gradient noise scale](https://openai.com/blog/science-of-ai/) to help you understand the training process with low overhead.\nKungFu further provides control operators such as ``barrier`` and ``resize_cluster`` to help reconfigure training online, even in response to monitored metrics.\n* Fast and scalable: KungFu has a decentralized architecture, a non-blocking runtime, and high-performance implementations of communication, monitoring and control operators. Check out its performance in [Benchmark](https://github.com/lsds/KungFu#benchmark).\n\nWe have been using KungFu to scale out different deep learning models such as ResNet, DenseNet, OpenPose, BERT, CycleGAN and Alpha Zero. Check out their [examples](https://github.com/lsds/KungFu#examples).\n\n## Usage\n\nKungFu currently supports TensorFlow and Keras. To scale out your TensorFlow program, for example, you need to make two changes:\n\n1. Wrap your ``tf.train.Optimizer`` in KungFu's ``SynchronousSGDOptimizer``, ``SynchronousAveragingOptimizer``, or ``PairAveragingOptimizer``.\n\n2. 
Ensure all workers start with consistent states by broadcasting a worker's initial global variables.\n\n```python\nimport tensorflow as tf\n\n# Build model...\nloss = ...\nopt = tf.train.AdamOptimizer(0.01)\n\n# KungFu Step 1: Wrap the TensorFlow optimizer in a KungFu optimizer\nfrom kungfu.tensorflow.optimizers import SynchronousSGDOptimizer\nopt = SynchronousSGDOptimizer(opt)\n\n# Make the training operation\ntrain_op = opt.minimize(loss)\n\n# Train your model\nwith tf.Session() as sess:\n    sess.run(tf.global_variables_initializer())\n\n    # KungFu Step 2: ensure distributed workers start with consistent states\n    from kungfu.tensorflow.initializer import BroadcastGlobalVariablesOp\n    sess.run(BroadcastGlobalVariablesOp())\n\n    for step in range(10):\n        sess.run(train_op)\n```\n\nYou can find more details in the [Documentation](https://kungfu.readthedocs.io/en/latest/?badge=latest), for example, how to use KungFu with [Session](examples/tf1_mnist_session.py), [TensorFlow Keras](examples/tf1_mnist_keras.py), [Estimator](examples/tf1_mnist_estimator.py), and [GradientTape](examples/tf2_mnist_gradient_tape.py) in TensorFlow 1 and 2.\nFor KungFu with Keras, check out [this example](examples/keras_mnist.py).\n\n## Install\n\nKungFu is implemented in Go and C++.\nCurrently, it has a Python binding for TensorFlow (including v1 and v2) and Keras (assuming you use TensorFlow as the backend).\nKungFu for TensorFlow requires [Python 3](https://www.python.org/downloads/), [CMake 3.5+](https://cmake.org/install/), and [Golang 1.13+](https://golang.org/dl/).\nKungFu has been tested with [TensorFlow](https://www.tensorflow.org/install/pip#older-versions-of-tensorflow) 1.12, 1.13, 1.15 and 2.0.0.\nKungFu has a known installation issue with TensorFlow 1.14.\nAssuming you have the above prerequisites, you can install KungFu as follows:\n\n```bash\ngit clone https://github.com/lsds/KungFu.git\ncd KungFu\npip3 install --no-index -U --user .\n```\n\n\u003c!-- If you get `permission denied 
errors`, try installing `kungfu-run` separately\n```bash\n# or download golang during install if golang is missing\n# KUNGFU_DOWNLOAD_GO=1 pip3 install --no-index -U --user .\n\nKUNGFU_BUILD_TOOLS=OFF pip3 install --no-index -U --user . # skip installing kungfu-run with pip\nGOBIN=\u003cPATH\u003e go install -v ./srcs/go/cmd/kungfu-run # install kungfu-run to \u003cPATH\u003e\n``` --\u003e\n\nKungFu provides ``kungfu-run`` to launch a KungFu process on a multi-GPU server.\nIn a cluster, we need to launch ``kungfu-run`` on each node.\n\n```bash\n# Show the help of kungfu-run\nkungfu-run -help\n```\n\nYou can use KungFu with Docker. Check out the Docker files for [GPU](docker/Dockerfile.tf-gpu) and [CPU](docker/Dockerfile.tf-cpu) machines.\n\n## Run\n\nWe show how to run a KungFu program using an MNIST example.\nDownload the MNIST dataset ([script](scripts/download-mnist.sh)) first and then run the following training script:\n\n```bash\n# Train a Single Layer Perceptron (SLP) model for the MNIST dataset using 4 CPUs for 10 data epochs.\nkungfu-run -np 4 python3 examples/tf1_mnist_session.py --data-dir=./mnist\n```\n\nYou can run this example on two machines (assuming each has 8 GPUs) using the command below (NOTE: this command must be called on each machine):\n\n```bash\n# Assume the machines have NIC eth0 and their IPs are 192.168.0.1 and 192.168.0.2.\n# Assume NUM_GPU_SLOTS=8, NUM_GPUS=16\nkungfu-run -np $NUM_GPUS \\\n    -H 192.168.0.1:$NUM_GPU_SLOTS,192.168.0.2:$NUM_GPU_SLOTS -nic eth0 \\\n    python3 examples/tf1_mnist_session.py --data-dir=./mnist\n```\n\n``kungfu-run`` uses the ``-nic`` option to infer its IP and thus its role in the cluster.\n\n## Examples\n\nWe have been using KungFu to train\ndifferent kinds of AI models.\nThe following are representative examples:\n\n* ***ImageNet***: KungFu can speed up the training\nof ResNet, VGG, DenseNet and others for ImageNet.\nCheck out this in an [ImageNet benchmark 
suite](https://github.com/luomai/benchmarks/tree/cnn_tf_v1.12_compatible_kungfu/scripts/tf_cnn_benchmarks#running-kungfu) extended from the [TensorFlow benchmark](https://github.com/luomai/benchmarks/tree/cnn_tf_v1.12_compatible_kungfu).\n\n* ***Pose estimation***: Pose estimation models such as [OpenPose](https://github.com/CMU-Perceptual-Computing-Lab/openpose) are often batch-size sensitive.\nWe used KungFu in\na popular [OpenPose implementation](https://github.com/tensorlayer/openpose-plus) and improved time-to-accuracy\nusing the model averaging optimizer, which preserves the merits of a small batch size.\n\n* ***Natural language processing***:\nWe have an [example](https://github.com/luomai/bert) that shows how you can use a few lines to enable distributed training for the Google BERT model.\n\n* ***Adversarial learning***:\nAdversarial learning trains multiple networks in parallel and prefers small batches for training.\nKungFu thus becomes an attractive option, because of its minimal changes to GAN programs\nand its optimizers that decouple batch size and system parallelism.\nSee the [CycleGAN example](https://github.com/tensorlayer/cyclegan).\n\n* ***Reinforcement learning***:\nWe are working on an Alpha Zero distributed training example and will release it soon.\n\n## Choosing the right optimizer\n\nKungFu aims to help users decrease the\ntime to reach a desired accuracy (time-to-accuracy)\nthrough scaling.\nThere are two major ways to improve time-to-accuracy in KungFu:\n\n* Synchronous SGD: Adopt parallel workers to improve the estimation of gradients, and\nreach a minimum quickly using an increased learning rate.\n* Model averaging: Adopt parallel workers to explore the solution space and collaborate through averaging their diverged models in order\nto find a good minimum quickly.\n\n***Synchronous SGD***:\nSynchronous SGD (S-SGD) is implemented as ``SynchronousSGDOptimizer`` in KungFu, equivalent to\n``DistributedOptimizer`` in Horovod.\nThe use of S-SGD, 
however, poses scalability and accuracy challenges:\n(i) scalability-wise, all S-SGD workers must exchange all gradients per iteration, making\nit hard for them to cope with limited bandwidth and stragglers;\n(ii) accuracy-wise, S-SGD *couples* the training batch size with the number of workers,\nforcing users to use large batch sizes, which can adversely\naffect the generalization of a trained model (see [paper](https://arxiv.org/abs/1609.04836)).\nTo compensate for the loss of generalization, users must explore various [methods](https://research.fb.com/wp-content/uploads/2017/06/imagenet1kin1h5.pdf)\nfor tuning hyper-parameters.\n\n***Model averaging***:\nModel averaging is implemented as ``SynchronousAveragingOptimizer`` and\n``PairAveragingOptimizer`` in KungFu.\nThe former realizes the hyper-parameter-robust [SMA](http://www.vldb.org/pvldb/vol12/p1399-koliousis.pdf)\nalgorithm, while the latter implements the [AD-PSGD](https://arxiv.org/abs/1710.06952) algorithm,\nwhich reduces bandwidth consumption and tolerates stragglers.\nIn model averaging, each worker trains its local\nmodel using SGD, and averages\nits model with peers to speed up the search for minima.\nModel averaging algorithms have a convergence guarantee (see the [EA-SGD paper](https://arxiv.org/abs/1412.6651))\nand can converge fast with DL models (see the [Lookahead paper](https://arxiv.org/abs/1907.08610)).\nA useful property of model averaging is that it decouples\nthe batch size from system parallelism, often making\nit *hyper-parameter robust*. 
We find\nthis property valuable\nas DL users often find it hard and expensive to\ntune synchronous SGD at scale.\n\n***Convergence evaluation***:\nWe have tested KungFu optimizers using ResNet-50 and ResNet-101 for ImageNet.\nWhen using 8 V100 GPUs, all KungFu optimizers can reach the target 75% accuracy,\nthe same as the Horovod baseline.\nWhen using 16 V100 GPUs, Horovod and ``SynchronousSGDOptimizer`` suffer from\nthe increased batch size and their accuracy drops to 59%, while\n``SynchronousAveragingOptimizer`` and ``PairAveragingOptimizer`` still\nreach the target 75%.\nAll these tests use a per-GPU batch size of 64 and the [hyper-parameters](https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks#getting-started)\nsuggested by the TensorFlow benchmark authors.\n\n## Benchmark\n\nWe benchmark KungFu in a cluster that has 16 V100 GPUs hosted by 2 DGX-1 machines.\nThe machines are interconnected by a 100 Gbps network. We measure the training throughput of ResNet-50, VGG16 and InceptionV3. These models represent different kinds of training workloads.\n\nIn the ***synchronous training*** case, we compare KungFu (``SynchronousSGDOptimizer``) with [Horovod](https://github.com/horovod/horovod) (0.16.1). Horovod uses OpenMPI 4.0.0. We evaluate the spectrum of batch sizes (from 256 to 4096) commonly used by S-SGD users.\nThe batch size is evenly shared by the 16 GPUs.\nKungFu outperforms Horovod on all tested models, in particular with small batch sizes, which significantly raise the\nfrequency of synchronization.\n\n![sync](benchmarks/system/result/sync-scalability.svg)\n\nIn the ***asynchronous training*** case, we compare KungFu (``PairAveragingOptimizer``) with TensorFlow parameter servers (1.13.1). We use the same range of batch sizes as above. 
KungFu exhibits better scalability as well.\n\n![async](benchmarks/system/result/async-scalability.svg)\n\nWe have also run the same benchmark in a 16-server cluster (each server has a P100).\nKungFu also exhibits better scalability in this communication-constrained environment;\nwe thus only report the 16 V100 results here. You can find the benchmark scripts [here](benchmarks/system/).\n\n## Development\n\nKungFu is designed with extensibility in mind.\nIt has a low-level API and a modular architecture, making\nit suitable for implementing new distributed training algorithms.\nCheck out the developer [guideline](CONTRIBUTING.md) for more information.\n","funding_links":[],"categories":["Paper-Code"],"sub_categories":["Schedule and Resource Management"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMachineLearningSystem%2FKungFu","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FMachineLearningSystem%2FKungFu","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMachineLearningSystem%2FKungFu/lists"}