https://github.com/shifuml/shifu
An end-to-end machine learning and data mining framework on Hadoop
- Host: GitHub
- URL: https://github.com/shifuml/shifu
- Owner: ShifuML
- License: apache-2.0
- Created: 2014-04-21T22:21:09.000Z (over 10 years ago)
- Default Branch: master
- Last Pushed: 2024-05-13T05:32:43.000Z (8 months ago)
- Last Synced: 2024-12-22T22:07:35.316Z (11 days ago)
- Topics: bigdata, end-to-end-machine-learning, gbdt, hadoop, machine-learning, neural-network, pipeline, random-forest, shifu
- Language: Java
- Homepage: https://github.com/ShifuML/shifu/wiki
- Size: 16.1 MB
- Stars: 252
- Watchers: 42
- Forks: 108
- Open Issues: 238
Metadata Files:
- Readme: README.md
- Changelog: CHANGES.txt
- License: LICENSE.txt
[![Build Status](https://travis-ci.org/ShifuML/shifu.svg)](https://travis-ci.org/ShifuML/shifu?branch=develop) [![Maven Central](https://maven-badges.herokuapp.com/maven-central/ml.shifu/shifu/badge.svg)](https://maven-badges.herokuapp.com/maven-central/ml.shifu/shifu)
## Download
Please download the latest Shifu release [here](https://github.com/ShifuML/shifu/wiki/shifu-0.12.0-hdp-yarn.tar.gz).
## Getting Started
After downloading Shifu, build your first model by following the [tutorial](https://github.com/ShifuML/shifu/wiki/Tutorial---Build-Your-First-ML-Model). More details about Shifu can be found in our [wiki pages](https://github.com/ShifuML/shifu/wiki).
## What is Shifu?
Shifu is an open-source, end-to-end machine learning and data mining framework built on top of Hadoop. It is designed for data scientists and simplifies the life cycle of building machine learning models. While originally built for fraud modeling, Shifu has been generalized to many other modeling domains.

One of Shifu's strengths is its end-to-end modeling pipeline. With configuration settings alone, a whole machine learning pipeline can be built, which makes models much easier to develop and push to production. The pipeline defined in Shifu is shown below:
![Shifu Pipeline](https://raw.githubusercontent.com/wiki/ShifuML/shifu/images/new-shifu-pipeline.png)
Shifu provides a simple command-line interface for each step of the model building process, including:
* Statistics calculation & variable selection to determine the most predictive variables in your data
* [Variable normalization](https://github.com/ShifuML/shifu/wiki/Variable%20Transform%20in%20Shifu)
* [Distributed variable selection based on sensitivity analysis](https://github.com/ShifuML/shifu/wiki/Variable%20Selection%20in%20Shifu)
* [Distributed neural network model training](https://github.com/ShifuML/shifu/wiki/Distributed%20Neural%20Network%20Training%20in%20Shifu)
* [Distributed tree ensemble model training](https://github.com/ShifuML/shifu/wiki/Distributed%20Tree%20Ensemble%20Model%20Training%20in%20Shifu)
* Post-training analysis & model evaluation
* [Distributed TensorFlow on Shifu](https://github.com/ShifuML/shifu/wiki/Distributed-Tensorflow-Support-On-Shifu)

Shifu’s fast, Hadoop-based distributed training for neural networks, logistic regression, and gradient boosted trees can reduce model training time from days to hours on terabyte-scale data sets. Shifu integrates with Pig workflows on Hadoop, and Shifu-trained models can be integrated into production code with a simple Java API. Shifu leverages Pig, Akka, Encog and other open source projects.
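
To make the variable normalization step listed above more concrete, here is a minimal Java sketch of z-score normalization with clipping. It is an illustration only and not Shifu's implementation: the class name `ZScoreSketch`, the `cutoff` parameter, and the use of population variance are all assumptions made for this example. See the linked wiki page for how Shifu actually transforms variables.

```java
import java.util.Arrays;

/** Illustrative z-score normalization; NOT Shifu's actual implementation. */
final class ZScoreSketch {

    /** Returns (x - mean) / stdDev for each value, clipped to [-cutoff, cutoff]. */
    static double[] normalize(double[] values, double cutoff) {
        if (values.length == 0) {
            return new double[0];
        }
        double mean = Arrays.stream(values).average().orElse(0.0);
        // Population variance (divide by n); an assumption for this sketch.
        double variance = Arrays.stream(values)
                .map(v -> (v - mean) * (v - mean))
                .sum() / values.length;
        double stdDev = Math.sqrt(variance);

        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            // A constant column has zero spread; map it to 0 instead of dividing by zero.
            double z = stdDev == 0.0 ? 0.0 : (values[i] - mean) / stdDev;
            // Clip extreme values so outliers cannot dominate training.
            out[i] = Math.max(-cutoff, Math.min(cutoff, z));
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical transaction amounts with one large outlier.
        double[] amounts = {10.0, 20.0, 30.0, 40.0, 1000.0};
        System.out.println(Arrays.toString(normalize(amounts, 4.0)));
    }
}
```

Clipping is one common way to keep a single extreme value from stretching the input range; whether and where to clip is a modeling choice.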
[Guagua](https://github.com/ShifuML/guagua), an in-memory iterative computing framework on Hadoop YARN, is developed as a sub-project of Shifu to accelerate training.
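
The general idea behind iterative distributed training can be sketched in a few lines of Java. The example below runs entirely in one process and does not use Guagua's API. It assumes a master/worker pattern in which each worker computes a gradient over its own data partition and a master sums those gradients and updates the model on every iteration. The class name, method names, and the one-weight linear model are hypothetical, chosen only to keep the sketch short.

```java
import java.util.List;

/** In-process sketch of master/worker iterative training; NOT Guagua's API. */
final class IterativeTrainingSketch {

    /** Worker step: squared-error gradient for y ~ w * x over one data partition. */
    static double partialGradient(double[][] partition, double w) {
        double gradient = 0.0;
        for (double[] row : partition) {
            double x = row[0];
            double y = row[1];
            gradient += (w * x - y) * x;
        }
        return gradient;
    }

    /** Master loop: sum the worker gradients, update the weight, repeat. */
    static double train(List<double[][]> partitions, int iterations, double learningRate) {
        int n = partitions.stream().mapToInt(p -> p.length).sum();
        if (n == 0) {
            return 0.0; // nothing to learn from
        }
        double w = 0.0;
        for (int it = 0; it < iterations; it++) {
            double total = 0.0;
            // In a real cluster each partition would be processed by a separate worker.
            for (double[][] partition : partitions) {
                total += partialGradient(partition, w);
            }
            w -= learningRate * total / n;
        }
        return w;
    }

    public static void main(String[] args) {
        // Two hypothetical partitions of (x, y) rows generated from y = 3x.
        List<double[][]> partitions = List.of(
                new double[][]{{1.0, 3.0}, {2.0, 6.0}},
                new double[][]{{3.0, 9.0}, {4.0, 12.0}});
        System.out.println(train(partitions, 100, 0.05));
    }
}
```

Keeping each partition in memory across iterations is what avoids re-reading the data on every pass, which is the main motivation for an in-memory iterative framework.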
## Conference
* [QCON Shanghai 2015](http://2015.qconshanghai.com/presentation/2827) [Slides](http://www.slideshare.net/pengshanzhang/large-scale-machine-learning-at-pay-pal-risk)
* [BDTC Beijing 2016](http://bdtc2016.hadooper.cn/dct/page/70107)
* [Strata Beijing 2017](https://strata.oreilly.com.cn/strata-cn/public/schedule/detail/59593?locale=en)
## Contributors
- Zhanghao Hu
- Grahame Jastrebski
- Lavar Li
- Mark Liu
- David Zhang
- Xin Zhong
- Simon Zhang
- Sharma Nitin
- Wayne Zhu
- Devin Wu
- Fred Bai
## Google Group
Please join the [Shifu group](https://groups.google.com/forum/#!forum/shifuml) if you have questions, bug reports, or anything else to discuss.
## Copyright and License
Copyright 2012-2019, PayPal Software Foundation. Released under the [Apache License](LICENSE.txt).