{"id":13795209,"url":"https://github.com/cerndb/dist-keras","last_synced_at":"2025-10-03T01:31:44.638Z","repository":{"id":41526922,"uuid":"64122944","full_name":"cerndb/dist-keras","owner":"cerndb","description":"Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.","archived":true,"fork":false,"pushed_at":"2018-07-25T01:44:09.000Z","size":57240,"stargazers_count":624,"open_issues_count":35,"forks_count":167,"subscribers_count":49,"default_branch":"master","last_synced_at":"2025-01-14T14:18:20.742Z","etag":null,"topics":["apache-spark","data-parallelism","data-science","deep-learning","distributed-optimizers","hadoop","keras","machine-learning","optimization-algorithms","tensorflow"],"latest_commit_sha":null,"homepage":"http://joerihermans.com/work/distributed-keras/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cerndb.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-07-25T09:47:37.000Z","updated_at":"2024-09-18T23:10:33.000Z","dependencies_parsed_at":"2022-08-26T05:51:11.609Z","dependency_job_id":null,"html_url":"https://github.com/cerndb/dist-keras","commit_stats":null,"previous_names":["joerihermans/dist-keras"],"tags_count":7,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cerndb%2Fdist-keras","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cerndb%2Fdist-keras/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cerndb%2Fdist-keras/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cerndb%2F
dist-keras/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cerndb","download_url":"https://codeload.github.com/cerndb/dist-keras/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":235059234,"owners_count":18929279,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-spark","data-parallelism","data-science","deep-learning","distributed-optimizers","hadoop","keras","machine-learning","optimization-algorithms","tensorflow"],"created_at":"2024-08-03T23:00:53.301Z","updated_at":"2025-10-03T01:31:39.592Z","avatar_url":"https://github.com/cerndb.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# Distributed Keras\n\nDistributed Deep Learning with Apache Spark and Keras.\n\n\n## Introduction\n\nDistributed Keras is a distributed deep learning framework built on top of Apache Spark and Keras, with a focus on \"state-of-the-art\" distributed optimization algorithms. We designed the framework in such a way that a new distributed optimizer could be implemented with ease, thus enabling a person to focus on research. Several distributed methods are supported, such as, but not restricted to, the training of **ensembles** and models using **data parallel** methods.\n\nMost of the distributed optimizers we provide are based on data parallel methods. 
A data parallel method, as described in [[1]](http://papers.nips.cc/paper/4687-large-scale-distributed-deep-networks.pdf), is a learning paradigm where multiple replicas of a single model are used to optimize a single objective. Using this approach, we are able to significantly reduce the training time of a model. Depending on the parametrization, we also observed that it is possible to achieve better statistical model performance compared to a more traditional approach (e.g., the [SingleTrainer](#single-trainer) implementation) while spending less wall-clock time on the training of the model. However, this is subject to further research.\n\n**Attention**: A rather complete introduction to the problem of Distributed Deep Learning is presented in my Master Thesis [http://github.com/JoeriHermans/master-thesis](http://github.com/JoeriHermans/master-thesis). Furthermore, the thesis includes several *novel* insights, such as a redefinition of parameter staleness, and several new distributed optimizers such as AGN and ADAG.\n\n\n## Installation\n\nThis section explains how to install Distributed Keras. We assume that an Apache Spark installation is available. In the following subsections, we describe two approaches to achieve this.\n\n### pip\n\nWhen you only require the framework for development purposes, just use `pip` to install dist-keras.\n\n```bash\npip install --upgrade dist-keras\n\n# OR\n\npip install --upgrade git+https://github.com/JoeriHermans/dist-keras.git\n```\n\n### git \u0026 pip\n\nHowever, if you would like to contribute or run some of the examples, it is probably best to clone the repository directly from GitHub and install it afterwards using `pip`. This will also resolve possible missing dependencies.\n\n```bash\ngit clone https://github.com/JoeriHermans/dist-keras\ncd dist-keras\npip install -e .\n```\n\n### General notes\n\n#### .bashrc\n\nMake sure the following variables are set in your `.bashrc`. 
It is possible, depending on your system configuration, that the following configuration **doesn't have to be applied**.\n\n```bash\n# Example of a .bashrc configuration.\nexport SPARK_HOME=/usr/lib/spark\nexport PYTHONPATH=\"$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH\"\n```\n\n\n## Running an example\n\nWe would like to refer the reader to the `workflow.ipynb` notebook in the examples folder. This will give you a complete introduction to the problem of distributed deep learning, and will guide you through the steps that have to be executed.\n\nFurthermore, we would also like to show exactly how you should process \"big\" datasets. This is shown in the examples starting with the prefix ```example_```. Please execute them in the provided sequence.\n\n### Spark 2.0\n\nIf you want to run the examples using Apache Spark 2.0.0 or higher, you will need to remove the line containing `sqlContext = SQLContext(sc)`. We need to do this because in Spark 2.0+, the SQLContext and the Hive context are merged into the Spark session.\n\n\n## Optimization Algorithms\n\n### Sequential Trainer\n\nThis optimizer follows the traditional scheme of training a model, i.e., it uses sequential gradient updates to optimize the parameters. It does this by executing the training procedure on a single Spark executor.\n\n```python\nSingleTrainer(model, features_col, label_col, batch_size, optimizer, loss, metrics=[\"accuracy\"])\n```\n\n### ADAG (Currently Recommended)\n\nA DOWNPOUR variant that is able to achieve significantly better statistical performance while being less sensitive to hyperparameters. This optimizer was developed using insights gained while developing this framework. 
More research regarding parameter staleness is still being conducted to further improve this optimizer.\n\n```python\nADAG(keras_model, worker_optimizer, loss, metrics=[\"accuracy\"], num_workers=2, batch_size=32,\n     features_col=\"features\", label_col=\"label\", num_epoch=1, communication_window=12)\n```\n\n### Dynamic SGD\n\nDynamic SGD dynamically maintains a learning rate for every worker by incorporating parameter staleness. This optimization scheme was introduced in \"Heterogeneity-aware Distributed Parameter Servers\" at the SIGMOD 2017 conference [[5]](http://net.pku.edu.cn/~cuibin/Papers/2017SIGMOD.pdf).\n\n```python\nDynSGD(keras_model, worker_optimizer, loss, metrics=[\"accuracy\"], num_workers=2, batch_size=32,\n       features_col=\"features\", label_col=\"label\", num_epoch=1, communication_window=10)\n```\n\n### Asynchronous Elastic Averaging SGD (AEASGD)\n\nThe distinctive idea of EASGD is to allow the local workers to perform more exploration (small rho) and the master to perform exploitation. This approach differs from other settings explored in the literature, and focuses on how fast the center variable converges [[2]](https://arxiv.org/pdf/1412.6651.pdf).\n\nIn this section we show the asynchronous version of EASGD. Instead of waiting on the synchronization of other trainers, this method communicates the elastic difference (as described in the paper) with the parameter server. The only synchronization mechanism that has been implemented ensures that no race conditions occur when updating the center variable.\n\n\n```python\nAEASGD(keras_model, worker_optimizer, loss, metrics=[\"accuracy\"], num_workers, batch_size, features_col,\n       label_col, num_epoch, communication_window, rho, learning_rate)\n```\n\n### Asynchronous Elastic Averaging Momentum SGD (AEAMSGD)\n\nAsynchronous EAMSGD is a variant of asynchronous EASGD. 
It is based on Nesterov's momentum scheme, where the update of the local worker is modified to incorporate a momentum term [[2]](https://arxiv.org/pdf/1412.6651.pdf).\n\n```python\nEAMSGD(keras_model, worker_optimizer, loss, metrics=[\"accuracy\"], num_workers, batch_size,\n       features_col, label_col, num_epoch, communication_window, rho,\n       learning_rate, momentum)\n```\n\n### DOWNPOUR\n\nAn asynchronous stochastic gradient descent procedure introduced by Dean et al., which supports a large number of model replicas and leverages adaptive learning rates. This implementation is based on the pseudocode provided by [[1]](http://papers.nips.cc/paper/4687-large-scale-distributed-deep-networks.pdf).\n\n```python\nDOWNPOUR(keras_model, worker_optimizer, loss, metrics=[\"accuracy\"], num_workers, batch_size,\n         features_col, label_col, num_epoch, learning_rate, communication_window)\n```\n\n### Ensemble Training\n\nIn ensemble training, we train `n` models in parallel on the same dataset. The models are trained in parallel, but each individual model is trained sequentially using Keras optimizers. After the training process, one can combine and, for example, average the output of the models.\n\n```python\nEnsembleTrainer(keras_model, worker_optimizer, loss, metrics=[\"accuracy\"], features_col,\n                label_col, batch_size, num_ensembles)\n```\n\n### Model Averaging\n\nModel averaging is a data parallel technique that averages the trainable parameters of the model replicas after every epoch.\n\n```python\nAveragingTrainer(keras_model, worker_optimizer, loss, metrics=[\"accuracy\"], features_col,\n                 label_col, num_epoch, batch_size, num_workers)\n```\n\n## Job deployment\n\nWe also support remote job deployment. For example, imagine you are developing your model on a local notebook using a small development set. 
However, in order to submit your job on a remote cluster, you first need to develop a cluster job and run the job there. To simplify this process, we have developed an interface for large-scale machine learning jobs.\n\nIn order to submit a job to a remote cluster, you simply run the following code:\n\n```python\n# Define the distributed optimization procedure, and its parameters.\ntrainer = ADAG(keras_model=mlp, worker_optimizer=optimizer_mlp, loss=loss_mlp, metrics=[\"accuracy\"], num_workers=20,\n               batch_size=32, communication_window=15, num_epoch=1,\n               features_col=\"features_normalized_dense\", label_col=\"label_encoded\")\n\n# Define the job parameters.\njob = Job(secret, job_name, data_path, num_executors, num_processes, trainer)\njob.send('http://yourcluster:[port]')\njob.wait_completion()\n# Fetch the trained model, and history for training evaluation.\ntrained_model = job.get_trained_model()\nhistory = job.get_history()\n```\n\n### Punchcard Server\n\nJob scheduling and execution are handled by our `Punchcard` server. This server will accept requests from a remote location given a specific `secret`, which is basically a long identification string of a specific user. However, a user can have multiple secrets. At the moment, a job is only executed if there are no other jobs running for the specified secret.\n\nIn order to submit jobs to `Punchcard` we need to specify a secrets file. 
This file is a JSON document with the following structure:\n\n```json\n[\n    {\n        \"secret\": \"secret_of_user_1\",\n        \"identity\": \"user1\"\n    },\n    {\n        \"secret\": \"secret_of_user_2\",\n        \"identity\": \"user2\"\n    }\n]\n```\n\nAfter the secrets file has been constructed, the Punchcard server can be started by issuing the following command.\n\n```sh\npython scripts/punchcard.py --secrets /path/to/secrets.json\n```\n\n#### Secret Generation\n\nIn order to simplify secret generation, we have added a custom script which will generate a unique key for the specified identity. The structure can be generated by running the following command.\n\n```sh\npython scripts/generate_secret.py --identity userX\n```\n\n## Optimization Schemes\n\nTODO\n\n## General note\n\nIt is known that adding more asynchronous workers deteriorates the statistical performance of the model. Several studies examine this particular effect. However, some of them conclude that adding more asynchronous workers actually contributes to what they call **implicit momentum** [[3]](https://arxiv.org/pdf/1605.09774.pdf). However, this is subject to further investigation.\n\n\n## Known issues\n\n- Python 3 compatibility.\n\n\n## TODO's\n\nList of possible future additions.\n\n- Save Keras model to HDFS.\n- Load Keras model from HDFS.\n- Compression / decompression of network transmissions.\n- Stop on target loss.\n- Multiple parameter servers for large Deep Networks.\n- Python 3 compatibility.\n- For every worker, spawn an additional thread which is responsible for sending updates to the parameter server. The actual worker thread will just submit tasks to this queue.\n\n\n## Citing\n\nIf you use this framework in any academic work, please use the following BibTeX code.\n\n```latex\n@misc{dist_keras_joerihermans,\n  author = {Joeri R. 
Hermans, CERN IT-DB},\n  title = {Distributed Keras: Distributed Deep Learning with Apache Spark and Keras},\n  year = {2016},\n  publisher = {GitHub},\n  journal = {GitHub Repository},\n  howpublished = {\\url{https://github.com/JoeriHermans/dist-keras/}},\n}\n```\n\n## References\n\n* Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., ... \u0026 Ng, A. Y. (2012). Large scale distributed deep networks. In Advances in neural information processing systems (pp. 1223-1231). [[1]](http://papers.nips.cc/paper/4687-large-scale-distributed-deep-networks.pdf)\n\n* Zhang, S., Choromanska, A. E., \u0026 LeCun, Y. (2015). Deep learning with elastic averaging SGD. In Advances in Neural Information Processing Systems (pp. 685-693). [[2]](https://arxiv.org/pdf/1412.6651.pdf)\n\n* Mitliagkas, Ioannis, et al. \"Asynchrony begets Momentum, with an Application to Deep Learning.\" arXiv preprint arXiv:1605.09774 (2016). [[3]](https://arxiv.org/pdf/1605.09774.pdf)\n\n\u003c!-- @misc{pumperla2015, --\u003e\n\u003c!-- author = {Max Pumperla}, --\u003e\n\u003c!-- title = {elephas}, --\u003e\n\u003c!-- year = {2015}, --\u003e\n\u003c!-- publisher = {GitHub}, --\u003e\n\u003c!-- journal = {GitHub repository}, --\u003e\n\u003c!-- howpublished = {\\url{https://github.com/maxpumperla/elephas}} --\u003e\n\u003c!-- } --\u003e\n* Pumperla, M. (2015). Elephas. Github Repository https://github.com/maxpumperla/elephas/. [4]\n* Jiawei Jiang, Bin Cui, Ce Zhang and Lele Yu (2017). Heterogeneity-aware Distributed Parameter Servers [[5]](http://net.pku.edu.cn/~cuibin/Papers/2017SIGMOD.pdf)\n\n\n## Licensing\n\n![GPLv3](resources/gpl_v3.png) ![CERN](resources/cern_logo.jpg)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcerndb%2Fdist-keras","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcerndb%2Fdist-keras","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcerndb%2Fdist-keras/lists"}