{"id":14982308,"url":"https://github.com/yahoo/tensorflowonspark","last_synced_at":"2025-05-13T21:07:34.314Z","repository":{"id":40828229,"uuid":"79584587","full_name":"yahoo/TensorFlowOnSpark","owner":"yahoo","description":"TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.","archived":false,"fork":false,"pushed_at":"2023-07-10T10:34:11.000Z","size":9472,"stargazers_count":3873,"open_issues_count":16,"forks_count":944,"subscribers_count":277,"default_branch":"master","last_synced_at":"2025-04-28T13:59:00.911Z","etag":null,"topics":["cluster","featured","machine-learning","python","scala","spark","tensorflow","yahoo"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/yahoo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"Contributing.md","funding":null,"license":"LICENSE","code_of_conduct":"Code-of-Conduct.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2017-01-20T18:15:57.000Z","updated_at":"2025-04-14T03:07:33.000Z","dependencies_parsed_at":"2022-07-12T18:04:31.919Z","dependency_job_id":"de1b54d6-c7e8-4e53-aeac-0caa084aeba9","html_url":"https://github.com/yahoo/TensorFlowOnSpark","commit_stats":{"total_commits":422,"total_committers":31,"mean_commits":"13.612903225806452","dds":0.6374407582938388,"last_synced_commit":"a0f757b5ad6b21bc5466ac1077055525014e49f1"},"previous_names":[],"tags_count":25,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yahoo%2FTensorFlowOnSpark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yahoo%2FTensorFlowOnSpark/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yahoo%2FTen
sorFlowOnSpark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yahoo%2FTensorFlowOnSpark/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/yahoo","download_url":"https://codeload.github.com/yahoo/TensorFlowOnSpark/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254028818,"owners_count":22002279,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cluster","featured","machine-learning","python","scala","spark","tensorflow","yahoo"],"created_at":"2024-09-24T14:05:09.734Z","updated_at":"2025-05-13T21:07:29.302Z","avatar_url":"https://github.com/yahoo.png","language":"Python","readme":"\u003c!--\nCopyright 2019 Yahoo Inc.\nLicensed under the terms of the Apache 2.0 license.\nPlease see LICENSE file in the project root for terms.\n--\u003e\n# TensorFlowOnSpark\n\u003e _TensorFlowOnSpark brings scalable deep learning to Apache Hadoop and Apache Spark\nclusters._\n\n[![Build Status](https://cd.screwdriver.cd/pipelines/6384/badge)](https://cd.screwdriver.cd/pipelines/6384)\n[![Package](https://img.shields.io/badge/package-pypi-blue.svg)](https://pypi.org/project/tensorflowonspark/)\n[![Downloads](https://img.shields.io/pypi/dm/tensorflowonspark.svg)](https://img.shields.io/pypi/dm/tensorflowonspark.svg)\n[![Documentation](https://img.shields.io/badge/Documentation-latest-blue.svg)](https://yahoo.github.io/TensorFlowOnSpark/)\n\nBy combining salient features from the [TensorFlow](https://www.tensorflow.org) deep learning framework with [Apache 
Spark](http://spark.apache.org) and [Apache Hadoop](http://hadoop.apache.org), TensorFlowOnSpark enables distributed\ndeep learning on a cluster of GPU and CPU servers.\n\nIt enables both distributed TensorFlow training and\ninferencing on Spark clusters, with the goal of minimizing the\ncode changes required to run existing TensorFlow programs on a\nshared grid.  Its Spark-compatible API helps manage the TensorFlow\ncluster with the following steps:\n\n1. **Startup** - launches the TensorFlow main function on the executors, along with listeners for data/control messages.\n1. **Data ingestion**\n   - **InputMode.TENSORFLOW** - leverages TensorFlow's built-in APIs to read data files directly from HDFS.\n   - **InputMode.SPARK** - sends Spark RDD data to the TensorFlow nodes via the `TFNode.DataFeed` class.  Note that we leverage the [Hadoop Input/Output Format](https://github.com/tensorflow/ecosystem/tree/master/hadoop) to access TFRecords on HDFS.\n1. **Shutdown** - shuts down the TensorFlow workers and PS nodes on the executors.\n\n## Table of Contents\n\n- [Background](#background)\n- [Install](#install)\n- [Usage](#usage)\n- [API](#api)\n- [Contribute](#contribute)\n- [License](#license)\n\n## Background\n\nTensorFlowOnSpark was developed by Yahoo for large-scale distributed\ndeep learning on our Hadoop clusters in Yahoo's private cloud.\n\nTensorFlowOnSpark provides some important benefits (see [our\nblog](https://developer.yahoo.com/blogs/157196317141/))\nover alternative deep learning solutions.\n   * Easily migrate existing TensorFlow programs with \u003c10 lines of code changes.\n   * Support all TensorFlow functionality: synchronous/asynchronous training, model/data parallelism, inferencing and TensorBoard.\n   * Server-to-server direct communication achieves faster learning when available.\n   * Allow datasets on HDFS and other sources to be pushed by Spark or pulled by TensorFlow.\n   * Easily integrate with your existing Spark data processing 
pipelines.\n   * Easily deploy in the cloud or on-premises, on CPUs or GPUs.\n\n## Install\n\nTensorFlowOnSpark is provided as a pip package, which can be installed on single machines via:\n```\n# for tensorflow\u003e=2.0.0\npip install tensorflowonspark\n\n# for tensorflow\u003c2.0.0\npip install tensorflowonspark==1.4.4\n```\n\nFor distributed clusters, please see our [wiki site](../../wiki) for detailed documentation for specific environments, such as our getting started guides for [single-node Spark Standalone](https://github.com/yahoo/TensorFlowOnSpark/wiki/GetStarted_Standalone), [YARN clusters](../../wiki/GetStarted_YARN) and [AWS EC2](../../wiki/GetStarted_EC2).  Note: the Windows operating system is not currently supported due to [this issue](https://github.com/yahoo/TensorFlowOnSpark/issues/36).\n\n## Usage\n\nTo use TensorFlowOnSpark with an existing TensorFlow application, follow our [Conversion Guide](../../wiki/Conversion-Guide), which describes the required changes.  Additionally, our [wiki site](../../wiki) has pointers to some presentations which provide an overview of the platform.\n\n**Note: since TensorFlow 2.x breaks API compatibility with TensorFlow 1.x, the examples have been updated accordingly.  If you are using TensorFlow 1.x, you will need to check out the `v1.4.4` tag for compatible examples and instructions.**\n\n## API\n\n[API Documentation](https://yahoo.github.io/TensorFlowOnSpark/) is automatically generated from the code.\n\n## Contribute\n\nPlease join the [TensorFlowOnSpark user group](https://groups.google.com/forum/#!forum/TensorFlowOnSpark-users) for discussions and questions.  If you have a question, please review our [FAQ](../../wiki/Frequently-Asked-Questions) before posting.\n\nContributions are always welcome.  
For more information, please see our [guide for getting involved](Contributing.md).\n\n## License\n\nThe use and distribution terms for this software are covered by the Apache 2.0 license.\nSee the [LICENSE](LICENSE) file for terms.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyahoo%2Ftensorflowonspark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fyahoo%2Ftensorflowonspark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyahoo%2Ftensorflowonspark/lists"}