{"id":13935163,"url":"https://github.com/yahoo/CaffeOnSpark","last_synced_at":"2025-07-19T20:31:00.995Z","repository":{"id":66000684,"uuid":"49448481","full_name":"yahoo/CaffeOnSpark","owner":"yahoo","description":"Distributed deep learning on Hadoop and Spark clusters.","archived":true,"fork":false,"pushed_at":"2019-11-15T21:44:39.000Z","size":17466,"stargazers_count":1267,"open_issues_count":79,"forks_count":358,"subscribers_count":149,"default_branch":"master","last_synced_at":"2024-09-27T02:04:26.018Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/yahoo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2016-01-11T19:21:31.000Z","updated_at":"2024-08-22T12:26:59.000Z","dependencies_parsed_at":null,"dependency_job_id":"f69eaa52-2ecb-4c62-917b-cf75eb4d1fd7","html_url":"https://github.com/yahoo/CaffeOnSpark","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yahoo%2FCaffeOnSpark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yahoo%2FCaffeOnSpark/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yahoo%2FCaffeOnSpark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yahoo%2FCaffeOnSpark/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/yahoo","download_url":"https://codeload.github.com/yahoo/CaffeOnSpark/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https:
//github.com","kind":"github","repositories_count":226666541,"owners_count":17665043,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-07T23:01:26.081Z","updated_at":"2025-07-19T20:31:00.980Z","avatar_url":"https://github.com/yahoo.png","language":"Jupyter Notebook","funding_links":[],"categories":["Jupyter Notebook","人工智能","\u003ca name=\"Tools\"\u003e\u003c/a\u003e6. Tools"],"sub_categories":[],"readme":"\u003c!--\nCopyright 2016 Yahoo Inc.\nLicensed under the terms of the Apache 2.0 license.\nPlease see LICENSE file in the project root for terms.\n--\u003e\n### Note: we're lovingly marking this project as Archived since we're no longer supporting it. You are welcome to read the code and fork your own version of it and continue to use this code under the terms of the project license.\n\n# CaffeOnSpark\n\n## What's CaffeOnSpark?\n\nCaffeOnSpark brings deep learning to Hadoop and Spark clusters.  By\ncombining salient features from deep learning framework\n[Caffe](https://github.com/BVLC/caffe) and big-data frameworks [Apache\nSpark](http://spark.apache.org/) and [Apache Hadoop](http://hadoop.apache.org/), CaffeOnSpark enables distributed\ndeep learning on a cluster of GPU and CPU servers.\n\nAs a distributed extension of Caffe, CaffeOnSpark supports neural\nnetwork model training, testing, and feature extraction.  
Caffe users\ncan now perform distributed learning using their existing LMDB data\nfiles and a slightly adjusted network configuration (as\n[illustrated](../master/data/lenet_memory_train_test.prototxt#L10-L12)).\n\nCaffeOnSpark is a Spark package for deep learning. It is complementary\nto the non-deep-learning libraries MLlib and Spark SQL.\nCaffeOnSpark's Scala API provides Spark applications with an easy\nmechanism to invoke deep learning (see\n[sample](../master/caffe-grid/src/main/scala/com/yahoo/ml/caffe/examples/MyMLPipeline.scala))\nover distributed datasets.\n\nCaffeOnSpark was developed by Yahoo for [large-scale distributed deep\nlearning on our Hadoop\nclusters](http://yahoohadoop.tumblr.com/post/129872361846/large-scale-distributed-deep-learning-on-hadoop)\nin Yahoo's private cloud.  It's been in use by Yahoo for image search,\ncontent classification and several other use cases.\n\n## Why CaffeOnSpark?\n\nCaffeOnSpark provides some important benefits (see [our blog](http://yahoohadoop.tumblr.com/post/139916563586/caffeonspark-open-sourced-for-distributed-deep)) over alternative deep learning solutions.\n\n* It enables model training, testing, and feature extraction directly on Hadoop datasets stored in HDFS on Hadoop clusters.\n* It turns your Hadoop or Spark cluster(s) into a powerful platform for deep learning, without the need to set up a separate dedicated cluster.\n* Server-to-server direct communication (Ethernet or InfiniBand) achieves faster learning and eliminates scalability bottlenecks.\n* Caffe users' existing datasets (e.g. LMDB) and configurations can be applied to distributed learning without any conversion.\n* A high-level API lets Spark applications easily conduct deep learning.\n* Incremental learning is supported to leverage previously trained models or snapshots.\n* Additional data formats and network interfaces can be easily added.\n* It can be easily deployed on a public cloud (e.g. AWS EC2) or a private cloud.\n\n## Using CaffeOnSpark\n\nPlease check the CaffeOnSpark [wiki site](../../wiki) for detailed\ndocumentation such as [build instructions](../../wiki/build), [API\nreference](http://yahoo.github.io/CaffeOnSpark/scala_doc/#com.yahoo.ml.caffe.package)\nand getting started guides for [standalone\ncluster](../../wiki/GetStarted_local) and [AWS EC2\ncluster](../../wiki/GetStarted_EC2).\n\n\n* Batch sizes specified in prototxt files are per device.\n* Memory layers should not be shared among GPUs, and thus \"share_in_parallel: false\" is required for layer configuration.\n\n## Building for Spark 2.X\n\nCaffeOnSpark supports both Spark 1.x and 2.x. For Spark 2.0, our default settings are:\n  - spark-2.0.0\n  - hadoop-2.7.1\n  - scala-2.11.7\n\nYou may want to adjust them in caffe-grid/pom.xml.\n\n \n## Mailing List\n\nPlease join the [CaffeOnSpark user\ngroup](https://groups.google.com/forum/#!forum/caffeonspark-users) for\ndiscussions and questions.\n\n\n## License\n\nThe use and distribution terms for this software are covered by the\nApache 2.0 license. See the [LICENSE](LICENSE.txt) file for terms.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyahoo%2FCaffeOnSpark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fyahoo%2FCaffeOnSpark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyahoo%2FCaffeOnSpark/lists"}