{"id":13710057,"url":"https://github.com/danielenricocahall/elephas","last_synced_at":"2026-01-14T08:37:50.251Z","repository":{"id":58487805,"uuid":"324855907","full_name":"danielenricocahall/elephas","owner":"danielenricocahall","description":"Distributed Deep learning with Keras \u0026 Spark","archived":false,"fork":true,"pushed_at":"2025-09-13T21:51:08.000Z","size":7837,"stargazers_count":21,"open_issues_count":8,"forks_count":7,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-10-28T03:47:51.519Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"maxpumperla/elephas","license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/danielenricocahall.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null},"funding":{"github":["danielenricocahall"],"patreon":null,"open_collective":null,"ko_fi":null,"tidelift":null,"community_bridge":null,"liberapay":null,"issuehunt":null,"otechie":null,"custom":null}},"created_at":"2020-12-27T21:55:38.000Z","updated_at":"2025-09-10T21:18:17.000Z","dependencies_parsed_at":"2025-04-14T14:50:47.458Z","dependency_job_id":null,"html_url":"https://github.com/danielenricocahall/elephas","commit_stats":null,"previous_names":[],"tags_count":28,"template":false,"template_full_name":null,"purl":"pkg:github/danielenricocahall/elephas","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/danielenricocahall%2Felephas","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/danielenricocahall%2Felephas/tags","releases_u
rl":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/danielenricocahall%2Felephas/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/danielenricocahall%2Felephas/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/danielenricocahall","download_url":"https://codeload.github.com/danielenricocahall/elephas/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/danielenricocahall%2Felephas/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28414668,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-14T08:31:27.429Z","status":"ssl_error","status_checked_at":"2026-01-14T08:31:19.098Z","response_time":107,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T23:00:51.316Z","updated_at":"2026-01-14T08:37:50.236Z","avatar_url":"https://github.com/danielenricocahall.png","language":"Python","funding_links":["https://github.com/sponsors/danielenricocahall"],"categories":["Deep Learning Framework"],"sub_categories":["Deployment \u0026 Distribution"],"readme":"# Elephas: Distributed Deep Learning with Keras \u0026 Spark \n\n![Elephas](https://raw.githubusercontent.com/danielenricocahall/elephas/master/elephas-logo.png)\n\n## \n\n[![Build 
Status](https://github.com/danielenricocahall/elephas/actions/workflows/ci.yaml/badge.svg)](https://github.com/danielenricocahall/elephas/actions/workflows/ci.yaml/badge.svg)\n[![license](https://img.shields.io/github/license/mashape/apistatus.svg?maxAge=2592000)](https://github.com/danielenricocahall/elephas/blob/master/LICENSE)\n[![Supported Versions](https://img.shields.io/badge/python-3.9%20%7C%203.10%20%7C%203.11%20%7C%203.12-blue)](https://img.shields.io/badge/python-3.9%20%7C%203.10%20%7C%203.11-blue)\n\nElephas is an extension of [Keras](http://keras.io), which allows you to run distributed deep learning models at \nscale with [Spark](http://spark.apache.org). Elephas currently supports a number of \napplications, including:\n\n- [Data-parallel training of deep learning models](#basic-spark-integration)\n- [Distributed inference and evaluation of deep learning models](#distributed-inference-and-evaluation)\n- [~~Distributed training of ensemble models~~](#distributed-training-of-ensemble-models)  (removed as of 3.0.0)\n- [~~Distributed hyper-parameter optimization~~](#distributed-hyper-parameter-optimization)  (removed as of 3.0.0)\n- [~~Distributed training and inference with Hugging Face models~~](#hugging-face-models-training-and-inference) (removed as of 7.0.0)\n\n\n\nSchematically, elephas works as follows.\n\n![Elephas](https://raw.githubusercontent.com/danielenricocahall/elephas/master/elephas.gif)\n\nTable of contents:\n* [Elephas: Distributed Deep Learning with Keras \u0026 Spark](#elephas-distributed-deep-learning-with-keras-\u0026-spark-)\n  * [Introduction](#introduction)\n  * [Getting started](#getting-started)\n  * [Basic Spark integration](#basic-spark-integration)\n  * [Distributed Inference and Evaluation](#distributed-inference-and-evaluation)\n  * [Spark MLlib integration](#spark-mllib-integration)\n  * [Spark ML integration](#spark-ml-integration)\n  * [Hadoop integration](#hadoop-integration)\n  * [Distributed hyper-parameter 
optimization](#distributed-hyper-parameter-optimization)\n  * [Distributed training of ensemble models](#distributed-training-of-ensemble-models)\n  * [Discussion](#discussion)\n  * [Literature](#literature)\n\n\n\n## Introduction\nElephas brings deep learning with [Keras](http://keras.io) to [Spark](http://spark.apache.org). Elephas intends to \nkeep the simplicity and high usability of Keras, thereby allowing for fast prototyping of distributed models, which \ncan be run on massive data sets. For an introductory example, see the following \n[iPython notebook](https://github.com/danielenricocahall/elephas/blob/master/examples/Spark_ML_Pipeline.ipynb).\n\nἐλέφας is Greek for _ivory_ and an accompanying project to κέρας, meaning _horn_. If this seems like a weird thing to mention, like \na bad dream, you can confirm that it actually is at the \n[Keras documentation](https://github.com/fchollet/keras/blob/master/README.md). \nElephas also means _elephant_, as in stuffed yellow elephant.\n\nElephas implements a class of data-parallel algorithms on top of Keras, using Spark's RDDs and data frames. \nKeras Models are initialized on the driver, then serialized and shipped to workers, along with data and broadcast \nmodel parameters. Spark workers deserialize the model, train on their chunk of data and send their gradients back to the \ndriver. The \"master\" model on the driver is updated by an optimizer, which takes gradients either synchronously or\nasynchronously.\n\n## Getting started\n\nJust install elephas from PyPI with the command below; Spark will be installed through `pyspark` for you.\n\n```\npip install elephas\n```\n\nThat's it; you should now be able to run Elephas examples.\n\n## Basic Spark integration\n\nAfter installing Elephas, you can train a model as follows. 
First, create a local pyspark context:\n```python\nfrom pyspark import SparkContext, SparkConf\nconf = SparkConf().setAppName('Elephas_App').setMaster('local[8]')\nsc = SparkContext(conf=conf)\n```\n\nNext, define and compile a Keras model:\n```python\nfrom tensorflow.keras.models import Sequential\nfrom tensorflow.keras.layers import Dense, Dropout, Activation\nfrom tensorflow.keras.optimizers import SGD\nmodel = Sequential()\nmodel.add(Dense(128, input_dim=784))\nmodel.add(Activation('relu'))\nmodel.add(Dropout(0.2))\nmodel.add(Dense(128))\nmodel.add(Activation('relu'))\nmodel.add(Dropout(0.2))\nmodel.add(Dense(10))\nmodel.add(Activation('softmax'))\nmodel.compile(loss='categorical_crossentropy', optimizer=SGD())\n```\n\nand create an RDD from numpy arrays (or however you want to create an RDD):\n```python\nfrom elephas.utils.rdd_utils import to_simple_rdd\n# x_train and y_train are assumed to be numpy arrays loaded beforehand (features and one-hot labels)\nrdd = to_simple_rdd(sc, x_train, y_train)\n```\n\nThe basic model in Elephas is the `SparkModel`. You initialize a `SparkModel` by passing in a compiled Keras model, \nan update frequency and a parallelization mode. After that you can simply `fit` the model on your RDD. Elephas `fit`\nhas the same options as a Keras model, so you can pass `epochs`, `batch_size` etc. as you're used to from tensorflow.keras.\n\n```python\nfrom elephas.spark_model import SparkModel, AsynchronousSparkModel\n\nspark_model = SparkModel(model)\n# or, if you want to use the asynchronous training paradigm\n# spark_model = AsynchronousSparkModel(model, frequency='epoch', mode='asynchronous')\nspark_model.fit(rdd, epochs=20, batch_size=32, verbose=0, validation_split=0.1)\n```\n\nYour script can now be run using spark-submit:\n```bash\nspark-submit --driver-memory 1G ./your_script.py\n```\n\nIncreasing the driver memory even further may be necessary, as the set of parameters in a network may be very large \nand collecting them on the driver eats up a lot of resources. 
See the examples folder for a few working examples.\n\n## Distributed Inference and Evaluation\n\nThe `SparkModel` can also be used for distributed inference (prediction) and evaluation. Similar to the `fit` method,  the `predict` and `evaluate` methods\nconform to the Keras Model API. \n\n```python\nfrom elephas.spark_model import SparkModel\n\n# create/train the model, similar to the previous section (Basic Spark Integration)\nmodel = ...\nspark_model = SparkModel(model, ...)\nspark_model.fit(...)\n\nx_test, y_test = ... # load test data\n\npredictions = spark_model.predict(x_test) # perform inference\nevaluation = spark_model.evaluate(x_test, y_test) # perform evaluation/scoring\n```\nThe paradigm is identical to the data parallelism in training, as the model is serialized and shipped to the workers and used to evaluate a chunk of the testing data. The predict method will take either a numpy array or an RDD.\n\n## Spark MLlib integration\n\nFollowing up on the last example, to use Spark's MLlib library with Elephas, you create an RDD of LabeledPoints for \nsupervised training as follows\n\n```python\nfrom elephas.utils.rdd_utils import to_labeled_point\nlp_rdd = to_labeled_point(sc, x_train, y_train, categorical=True)\n```\n\nTraining a given LabeledPoint-RDD is very similar to what we've seen already\n\n```python\nfrom elephas.spark_model import SparkMLlibModel\nspark_model = SparkMLlibModel(model, frequency='batch', mode='hogwild')\nspark_model.train(lp_rdd, epochs=20, batch_size=32, verbose=0, validation_split=0.1, \n                  categorical=True, nb_classes=nb_classes)\n```\n\n\n## Spark ML integration\n\nTo train a model with a SparkML estimator on a data frame, use the following syntax.\n```python\ndf = to_data_frame(sc, x_train, y_train, categorical=True)\ntest_df = to_data_frame(sc, x_test, y_test, categorical=True)\n\nestimator = ElephasEstimator(model, epochs=epochs, batch_size=batch_size, frequency='batch', mode='asynchronous',\n                  
           categorical=True, nb_classes=nb_classes)\nfitted_model = estimator.fit(df)\n```\n\nFitting an estimator results in a SparkML transformer, which we can use for predictions and other evaluations by \ncalling the transform method on it.\n\n```python\nimport numpy as np\nfrom pyspark.mllib.evaluation import MulticlassMetrics\n\nprediction = fitted_model.transform(test_df)\npnl = prediction.select(\"label\", \"prediction\")\npnl.show(100)\n# MulticlassMetrics expects (prediction, label) pairs, in that order\nprediction_and_label = pnl.rdd.map(lambda row: (float(np.argmax(row.prediction)), row.label))\n\nmetrics = MulticlassMetrics(prediction_and_label)\nprint(metrics.weightedPrecision)\nprint(metrics.weightedRecall)\n```\n\nIf the model uses a custom activation function, layer, or loss function, it must be supplied using the `set_custom_objects` method:\n\n```python\ndef custom_activation(x):\n    ...\nclass CustomLayer(Layer):\n    ...\nmodel = Sequential()\nmodel.add(CustomLayer(...))\n\nestimator = ElephasEstimator(model, epochs=epochs, batch_size=batch_size)\nestimator.set_custom_objects({'custom_activation': custom_activation, 'CustomLayer': CustomLayer})\n```\n\n## Hadoop Integration\n\nIn addition to saving locally, models may be saved directly into a network-accessible Hadoop cluster.\n\n```python\nspark_model.save('/absolute/file/path/model.h5', to_hadoop=True)\n```\n\nModels saved on a network-accessible Hadoop cluster may be loaded as follows.\n\n```python\nfrom elephas.spark_model import load_spark_model\n\nspark_model = load_spark_model('/absolute/file/path/model.h5', from_hadoop=True)\n```\n\n## Distributed hyper-parameter optimization\n\n\u003cspan style=\"color:red\"\u003e**UPDATE**: As of 3.0.0, Hyper-parameter optimization features have been removed, since Hyperas is no longer active and was causing versioning compatibility issues. To use these features, install version 2.1 or below.\u003c/span\u003e\n\nHyper-parameter optimization with elephas is based on [hyperas](https://github.com/maxpumperla/hyperas), a convenience \nwrapper for hyperopt and keras. 
Each Spark worker executes a number of trials, the results get collected and the best \nmodel is returned. As the distributed mode in hyperopt (using MongoDB), is somewhat difficult to configure and error \nprone at the time of writing, we chose to implement parallelization ourselves. Right now, the only available \noptimization algorithm is random search.\n\nThe first part of this example is more or less directly taken from the hyperas documentation. We define data and model \nas functions, hyper-parameter ranges are defined through braces. See the hyperas documentation for more on how \nthis works.\n\n```python\nfrom hyperopt import STATUS_OK\nfrom hyperas.distributions import choice, uniform\n\ndef data():\n    from tensorflow.keras.datasets import mnist\n    from tensorflow.keras.utils import to_categorical\n    (x_train, y_train), (x_test, y_test) = mnist.load_data()\n    x_train = x_train.reshape(60000, 784)\n    x_test = x_test.reshape(10000, 784)\n    x_train = x_train.astype('float32')\n    x_test = x_test.astype('float32')\n    x_train /= 255\n    x_test /= 255\n    nb_classes = 10\n    y_train = to_categorical(y_train, nb_classes)\n    y_test = to_categorical(y_test, nb_classes)\n    return x_train, y_train, x_test, y_test\n\n\ndef model(x_train, y_train, x_test, y_test):\n    from tensorflow.keras.models import Sequential\n    from tensorflow.keras.layers import Dense, Dropout, Activation\n    from tensorflow.keras.optimizers import RMSprop\n\n    model = Sequential()\n    model.add(Dense(512, input_shape=(784,)))\n    model.add(Activation('relu'))\n    model.add(Dropout({{uniform(0, 1)}}))\n    model.add(Dense({{choice([256, 512, 1024])}}))\n    model.add(Activation('relu'))\n    model.add(Dropout({{uniform(0, 1)}}))\n    model.add(Dense(10))\n    model.add(Activation('softmax'))\n\n    rms = RMSprop()\n    model.compile(loss='categorical_crossentropy', optimizer=rms)\n\n    model.fit(x_train, y_train,\n              batch_size={{choice([64, 128])}},\n 
             nb_epoch=1,\n              show_accuracy=True,\n              verbose=2,\n              validation_data=(x_test, y_test))\n    score, acc = model.evaluate(x_test, y_test, show_accuracy=True, verbose=0)\n    print('Test accuracy:', acc)\n    return {'loss': -acc, 'status': STATUS_OK, 'model': model.to_json()}\n```\n\nOnce the basic setup is defined, running the minimization is done in just a few lines of code:\n\n```python\nfrom elephas.hyperparam import HyperParamModel\nfrom pyspark import SparkContext, SparkConf\n\n# Create Spark context\nconf = SparkConf().setAppName('Elephas_Hyperparameter_Optimization').setMaster('local[8]')\nsc = SparkContext(conf=conf)\n\n# Define hyper-parameter model and run optimization\nhyperparam_model = HyperParamModel(sc)\nhyperparam_model.minimize(model=model, data=data, max_evals=5)\n```\n\n## Distributed training of ensemble models\n\u003cspan style=\"color:red\"\u003e**UPDATE**: As of 3.0.0, Hyper-parameter optimization features have been removed, since Hyperas is no longer active and was causing versioning compatibility issues. To use these features, install version 2.1 or below.\u003c/span\u003e\n\nBuilding on the last section, it is possible to train ensemble models with elephas by means of running hyper-parameter \noptimization on large search spaces and defining a resulting voting classifier on the top-n performing models. 
\nWith ```data``` and ```model``` defined as above, this is as simple as running\n\n```python\nresult = hyperparam_model.best_ensemble(nb_ensemble_models=10, model=model, data=data, max_evals=5)\n```\nIn this example, an ensemble of 10 models is built, based on optimization of at most 5 runs on each of the Spark workers.\n\n\n## Hugging Face Models Training and Inference\n**Note**: Due to incompatibilities with Keras 3.0 which would ultimately limit the Tensorflow version we can upgrade to, and the announcement of HuggingFace no longer supporting Tensorflow, HuggingFace support has been removed from Elephas.\nAs of 6.0.0, Elephas now supports distributed training (and inference) with [HuggingFace](https://huggingface.co/) models (using the Tensorflow/Keras backend), currently for text classification, token classification, and causal language modeling only, and in the `\"synchronous\"` training mode. In future releases, we hope to expand this to other types of models and the `\"asynchronous\"` and `\"hogwild\"` training modes. 
This can be accomplished using the `SparkHFModel`:\n\n```python \nfrom elephas.spark_model import SparkHFModel\nfrom elephas.utils.rdd_utils import to_simple_rdd\nfrom sklearn.datasets import fetch_20newsgroups\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.preprocessing import LabelEncoder\nfrom transformers import AutoTokenizer, TFAutoModelForSequenceClassification\nfrom tensorflow.keras.optimizers import SGD\nbatch_size = ...\nepochs = ...\nnum_workers = ...\n\nnewsgroups = fetch_20newsgroups(subset='train')\nx = newsgroups.data\ny = newsgroups.target\n\nencoder = LabelEncoder()\ny_encoded = encoder.fit_transform(y)\n\nx_train, x_test, y_train, y_test = train_test_split(x, y_encoded, test_size=0.2)\n\nmodel_name = 'albert-base-v2'\n\n# Note: the expectation is that text data is being supplied - tokenization is handled during training\nrdd = to_simple_rdd(spark_context, x_train, y_train)\n\nmodel = TFAutoModelForSequenceClassification.from_pretrained(model_name, num_labels=len(np.unique(y_encoded)))\ntokenizer = AutoTokenizer.from_pretrained(model_name)\ntokenizer_kwargs = {'padding': True, 'truncation': True, ...}\n\nmodel.compile(optimizer=SGD(), loss='sparse_categorical_crossentropy', metrics=['accuracy'])\nspark_model = SparkHFModel(model, num_workers=num_workers, mode=\"synchronous\", tokenizer=tokenizer, tokenizer_kwargs=tokenizer_kwargs, loader=TFAutoModelForSequenceClassification)\n\nspark_model.fit(rdd, epochs=epochs, batch_size=batch_size)\n\npredictions = spark_model.predict(spark_context.parallelize(x_test))\n```\nMore examples can be seen in the `examples` directory, namely `\"hf_causal_modeling.py\"`, `\"hf_token_classification.py\"`, and `\"hf_text_classification.py\"`.\n\nThe computational model is the same as for Keras models, except the model is serialized and deserialized differently due to differences in the HuggingFace API. 
\n\nTo use this capability, just install this package with the `huggingface` extra:\n\n```bash\npip install elephas[huggingface]\n```\n\n## Discussion\n\nPremature parallelization may not be the root of all evil, but it is not always the best idea either. Keep in \nmind that more workers mean less data per worker, and parallelizing a model is no substitute for actual learning. \nSo, if you can perfectly well fit your data into memory *and* you're happy with the training speed of the model, consider \njust using Keras.\n\nOne exception to this rule may be that you're already working within the Spark ecosystem and want to leverage what's \nthere. The above SparkML example shows how to use evaluation modules from Spark, and maybe you wish to further process \nthe outcome of an elephas model down the road. In this case, we recommend using elephas as a simple wrapper by setting \nnum_workers=1.\n\nNote that right now elephas restricts itself to data-parallel algorithms for two reasons. First, Spark simply makes it \nvery easy to distribute data. Second, neither Spark nor Theano makes it particularly easy to split up the actual model \nin parts, thus making model-parallelism practically impossible to realize.\n\nHaving said all that, we hope you learn to appreciate elephas as a playground for data-parallel deep-learning algorithms \nthat is pretty easy to set up and use.\n\n\n## Literature\n[1] J. Dean, G.S. Corrado, R. Monga, K. Chen, M. Devin, QV. Le, MZ. Mao, M’A. Ranzato, A. Senior, P. Tucker, K. Yang, and AY. Ng. [Large Scale Distributed Deep Networks](http://research.google.com/archive/large_deep_networks_nips2012.html).\n\n[2] F. Niu, B. Recht, C. Re, S.J. Wright. [HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent](http://arxiv.org/abs/1106.5730).\n\n[3] C. Noel, S. Osindero. [Dogwild! 
— Distributed Hogwild for CPU \u0026 GPU](http://stanford.edu/~rezab/nips2014workshop/submits/dogwild.pdf)\n\n## Maintainers / Contributions\n\nThis great project was started by Max Pumperla, and is currently maintained by Daniel Cahall (https://github.com/danielenricocahall). If you have any questions, please feel free to open up an issue or send an email to danielenricocahall@gmail.com. If you want to contribute, feel free to submit a PR, or start a conversation about how we can go about implementing something.\n\n## Star History\n\n[![Star History Chart](https://api.star-history.com/svg?repos=danielenricocahall/elephas\u0026type=Date)](https://star-history.com/#danielenricocahall/elephas\u0026Date)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdanielenricocahall%2Felephas","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdanielenricocahall%2Felephas","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdanielenricocahall%2Felephas/lists"}