https://github.com/ceteri/pattern

"Pattern" sub-project for Cascading, which uses Cascading flows as containers for machine learning models, importing PMML model descriptions from R, SAS, KNIME, Weka, RapidMiner, etc.

cascading.pattern
=================
_Pattern_ sub-project for http://Cascading.org/ which uses flows as
containers for machine learning models, importing
[PMML](http://en.wikipedia.org/wiki/Predictive_Model_Markup_Language)
model descriptions from _R_, _SAS_, _Weka_, _RapidMiner_, _KNIME_,
_SQL Server_, etc.

Current support for PMML includes:

* [Random Forest](http://en.wikipedia.org/wiki/Random_forest) in [PMML 4.0+](http://www.dmg.org/v4-0-1/MultipleModels.html) exported from [R/Rattle](http://cran.r-project.org/web/packages/rattle/index.html)
* [Linear Regression](http://en.wikipedia.org/wiki/Linear_regression) in [PMML 1.1+](http://www.dmg.org/v1-1/generalregression.html)
* [Hierarchical Clustering](http://en.wikipedia.org/wiki/Hierarchical_clustering) and [K-Means Clustering](http://en.wikipedia.org/wiki/K-means_clustering) in [PMML 2.0+](http://www.dmg.org/v2-0/ClusteringModel.html)
* [Logistic Regression](http://en.wikipedia.org/wiki/Logistic_regression) in [PMML 4.0.1+](http://www.dmg.org/v4-0-1/Regression.html)
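
As a rough illustration of how these model families show up inside a PMML file, the Python sketch below reports which model element sits under the `<PMML>` root. It is not part of _Pattern_: the helper name `pmml_model_types` and the trimmed inline document are assumptions made for the example. Random Forest exports typically appear as a `MiningModel` element that wraps a set of tree models.

```python
import xml.etree.ElementTree as ET

# PMML element names for the model families listed above.
MODEL_ELEMENTS = {
    "MiningModel",             # ensembles such as Random Forest
    "RegressionModel",         # linear and logistic regression
    "GeneralRegressionModel",
    "ClusteringModel",         # k-means and hierarchical clustering
    "TreeModel",
}


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tags."""
    return tag.rsplit("}", 1)[-1]


def pmml_model_types(xml_text):
    """Return the model element names found directly under <PMML>."""
    root = ET.fromstring(xml_text)
    if local_name(root.tag) != "PMML":
        raise ValueError("not a PMML document")
    return [
        local_name(child.tag)
        for child in root
        if local_name(child.tag) in MODEL_ELEMENTS
    ]


# Hypothetical, heavily trimmed PMML document used only for illustration.
SAMPLE = """<PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0">
  <Header/>
  <DataDictionary numberOfFields="0"/>
  <MiningModel functionName="classification"/>
</PMML>"""

print(pmml_model_types(SAMPLE))
```

Running a check like this against an exported file before handing it to a flow is a quick way to confirm that the export contains one of the supported model types.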

Build Instructions
------------------
To build _Pattern_ and then run its unit tests:

    gradle --info --stacktrace clean test

The following scripts generate a baseline (model+data) for the _Random
Forest_ algorithm. This baseline includes a reference data set --
1000 independent variables, 500 rows of simulated ecommerce orders --
plus a predictive model in PMML:

    ./src/py/gen_orders.py 500 1000 > orders.tsv
    R --vanilla < ./src/r/rf_pmml.R > model.log

This will generate `huge.rf.xml` as the PMML export for a Random
Forest classifier plus `huge.tsv` as a baseline data set for
regression testing.

To build _Pattern_ and run a regression test:

    gradle clean jar
    rm -rf out
    hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap \
      --pmml data/sample.rf.xml --measure out/measure --assert

For each tuple in the data, a _stream assertion_ tests whether the
`predicted` field matches the `score` field generated by the
model. Tuples which fail that assertion get trapped into
`out/trap/part*` for inspection.
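
The effect of the assertion can be pictured with a small Python sketch. This is an illustration only, not how Cascading implements stream assertions or traps; the function name `split_by_assertion`, the `order_id` field, and the sample rows are all made up for the example.

```python
def split_by_assertion(tuples, expected="predicted", actual="score"):
    """Partition tuples into (passed, trapped) by comparing two fields."""
    passed, trapped = [], []
    for row in tuples:
        if row[expected] == row[actual]:
            passed.append(row)
        else:
            # In the real flow, failing tuples would land in the trap output.
            trapped.append(row)
    return passed, trapped


# Made-up rows: "predicted" comes from the baseline, "score" from the model.
rows = [
    {"order_id": 1, "predicted": "0", "score": "0"},
    {"order_id": 2, "predicted": "1", "score": "0"},  # mismatch
    {"order_id": 3, "predicted": "1", "score": "1"},
]

passed, trapped = split_by_assertion(rows)
print(len(passed), "passed;", len(trapped), "trapped")
```

An empty trap directory after a regression run therefore means every tuple was scored the same way as in the baseline.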

Also, the _confusion matrix_ shown in `out/measure/part*` should
match the one logged in `model.log` from the baseline generated in _R_.
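
For readers unfamiliar with the term, a confusion matrix simply counts how often each pair of expected and actual labels occurs. The Python sketch below shows the idea with made-up labels; the function name `confusion_matrix` is hypothetical, and the layout of `out/measure/part*` is not reproduced here.

```python
from collections import Counter


def confusion_matrix(pairs):
    """Count occurrences of each (predicted, score) label pair."""
    return Counter(pairs)


# Made-up (predicted, score) pairs: baseline label first, model label second.
pairs = [("0", "0"), ("0", "0"), ("1", "0"), ("1", "1")]

matrix = confusion_matrix(pairs)
for (predicted, score), count in sorted(matrix.items()):
    print(predicted, score, count)
```

Pairs whose two labels differ are the off-diagonal cells, which correspond to the tuples that a stream assertion would trap.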

To run on Amazon Elastic MapReduce (EMR), take a look at the `emr.sh` script.

Classifier vs. Predictive Model
-------------------------------
Here's how to run an example _classifier_ using Random Forest:

    gradle clean jar
    rm -rf out
    hadoop jar build/libs/pattern.jar data/iris.rf.tsv out/classify out/trap \
      --pmml data/iris.rf.xml --measure out/measure --label species

Here's how to run an example _predictive model_ using Linear Regression:

    gradle clean jar
    rm -rf out
    hadoop jar build/libs/pattern.jar data/iris.lm_p.tsv out/classify out/trap \
      --pmml data/iris.lm_p.xml --rmse out/measure
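
The `--rmse` option name suggests a root mean square error measure, which is the usual way to score a regression model where a confusion matrix does not apply. A minimal Python sketch of that formula follows; the function name `rmse` and the sample values are assumptions for illustration, not output from _Pattern_.

```python
import math


def rmse(predicted, observed):
    """Root mean square error between two equal-length numeric sequences."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values: model output first, reference values second.
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

A value of zero means the model reproduced the reference values exactly; larger values indicate larger average deviation.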

Use in Cascading Apps
---------------------
Alternatively, if you want to re-use this assembly for your own
Cascading app, remove the parts related to `verifyPipe` and
`measurePipe` from the sample code.

The following snippet in R shows how to train a Random Forest model,
then generate PMML as a file called `sample.rf.xml`:

    library(randomForest)
    library(pmml)
    library(XML)  # provides saveXML()
    f <- as.formula("as.factor(label) ~ .")
    fit <- randomForest(f, data_train, ntree=50)
    saveXML(pmml(fit), file="sample.rf.xml")

To use the PMML file in your Cascading app, pass its path to the
classifier. In this example the path comes from a command line
argument called `pmmlPath`:

    // define a "Classifier" model from PMML to evaluate the orders
    ClassifierFunction classFunc = new ClassifierFunction( new Fields( "score" ), pmmlPath );
    Pipe classifyPipe = new Each( new Pipe( "classify" ), classFunc.getFields(), classFunc, Fields.ALL );

Now when you run that Cascading app, provide a reference to
`sample.rf.xml` for the `pmmlPath` argument.

An architectural diagram of common use case patterns is provided in
`docs/pattern.graffle`, an OmniGraffle document.

Example Models
--------------
Check the `src/r/rattle_pmml.R` script for examples of predictive
models which are created in R, then exported using _Rattle_.
These examples use the popular
[Iris](http://en.wikipedia.org/wiki/Iris_flower_data_set) data set.

* random forest (rf)
* linear regression (lm)
* hierarchical clustering (hclust)
* k-means clustering (kmeans)
* logistic regression (glm)
* multinomial model (multinom)
* single hidden-layer neural network (nnet)
* support vector machine (ksvm)
* recursive partition classification tree (rpart)
* association rules

To execute the R script:

    R --vanilla < src/r/rattle_pmml.R

It is possible to extend PMML support to other kinds of modeling in R
and other analytics platforms. To discuss this, contact the developers on
the [cascading-user](https://groups.google.com/forum/?fromgroups#!forum/cascading-user)
email forum.

PMML Resources
--------------
* [Data Mining Group](http://www.dmg.org/) XML standards and supported vendors
* [PMML In Action](http://www.amazon.com/dp/1470003244) book
* [PMML validator](http://www.zementis.com/pmml_tools.htm)