https://github.com/ceteri/pattern

"Pattern" sub-project for Cascading, which uses Cascading flows as containers for machine learning models, importing PMML model descriptions from R, SAS, KNIME, Weka, RapidMiner, etc.

Host: GitHub
URL: https://github.com/ceteri/pattern
Owner: ceteri
License: other
Created: 2013-03-27T14:43:51.000Z (about 13 years ago)
Default Branch: master
Last Pushed: 2013-03-17T02:11:59.000Z (over 13 years ago)
Last Synced: 2025-02-24T12:22:58.944Z (over 1 year ago)
Language: Java
Homepage: http://cascading.org/
Size: 1.57 MB
Stars: 8
Watchers: 2
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt

cascading.pattern
=================

_Pattern_ is a sub-project of [Cascading](http://Cascading.org/) which uses flows as containers for machine learning models, importing [PMML](http://en.wikipedia.org/wiki/Predictive_Model_Markup_Language) model descriptions from _R_, _SAS_, _Weka_, _RapidMiner_, _KNIME_, _SQL Server_, etc.

Current support for PMML includes:

 * [Random Forest](http://en.wikipedia.org/wiki/Random_forest) in [PMML 4.0+](http://www.dmg.org/v4-0-1/MultipleModels.html) exported from [R/Rattle](http://cran.r-project.org/web/packages/rattle/index.html)

 * [Linear Regression](http://en.wikipedia.org/wiki/Linear_regression) in [PMML 1.1+](http://www.dmg.org/v1-1/generalregression.html)

 * [Hierarchical Clustering](http://en.wikipedia.org/wiki/Hierarchical_clustering) and [K-Means Clustering](http://en.wikipedia.org/wiki/K-means_clustering) in [PMML 2.0+](http://www.dmg.org/v2-0/ClusteringModel.html)

 * [Logistic Regression](http://en.wikipedia.org/wiki/Logistic_regression) in [PMML 4.0.1+](http://www.dmg.org/v4-0-1/Regression.html)
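
As a rough illustration of how an app might tell these model families apart before scoring, the sketch below inspects the top-level elements of a PMML document using only the JDK's XML parser. It is not part of _Pattern_: the class name `PmmlModelSniffer` is hypothetical, and the set of element names is an assumption about which PMML elements an exporter emits for the models listed above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of the Pattern code base.
final class PmmlModelSniffer {
    // Assumed PMML element names for the model families listed above.
    private static final Set<String> MODEL_ELEMENTS = Set.of(
        "MiningModel", "RegressionModel", "GeneralRegressionModel",
        "ClusteringModel", "TreeModel");

    /** Returns the first recognized model element under the root, or null if none. */
    static String firstModelElement(String pmmlXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // so getLocalName() ignores the PMML namespace
            Document doc = factory.newDocumentBuilder().parse(
                new ByteArrayInputStream(pmmlXml.getBytes(StandardCharsets.UTF_8)));
            NodeList children = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child.getNodeType() == Node.ELEMENT_NODE
                        && MODEL_ELEMENTS.contains(child.getLocalName())) {
                    return child.getLocalName();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not parseable as XML", e);
        }
    }
}
```

A caller could use the returned name to decide whether a given PMML file is one of the supported kinds before wiring it into a flow.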

Build Instructions
------------------

To build _Pattern_ and then run its unit tests:

    gradle --info --stacktrace clean test

The following scripts generate a baseline (model+data) for the _Random Forest_ algorithm. This baseline includes a reference data set (1000 independent variables, 500 rows of simulated ecommerce orders) plus a predictive model in PMML:

    ./src/py/gen_orders.py 500 1000 > orders.tsv

    R --vanilla < ./src/r/rf_pmml.R > model.log

This will generate `huge.rf.xml` as the PMML export for a Random Forest classifier plus `huge.tsv` as a baseline data set for regression testing.

To build _Pattern_ and run a regression test:

    gradle clean jar

    rm -rf out

    hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap \
     --pmml data/sample.rf.xml --measure out/measure --assert

For each tuple in the data, a _stream assertion_ tests whether the `predicted` field matches the `score` field generated by the model. Tuples which fail that assertion get trapped into `out/trap/part*` for inspection.
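
The check that the assertion performs can be sketched in plain Java. This is only an illustration of the comparison and the trapping behavior, not the Cascading API: the class name `StreamAssertionSketch` and the map-based tuple representation are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; tuples are modeled as field-name -> value maps.
final class StreamAssertionSketch {
    /** Returns the tuples whose "predicted" value differs from the model's "score". */
    static List<Map<String, String>> trapMismatches(List<Map<String, String>> tuples) {
        List<Map<String, String>> trapped = new ArrayList<>();
        for (Map<String, String> tuple : tuples) {
            String predicted = tuple.get("predicted");
            String score = tuple.get("score");
            // A missing or different value fails the assertion and is set aside.
            if (predicted == null || !predicted.equals(score)) {
                trapped.add(tuple);
            }
        }
        return trapped;
    }
}
```

An empty result corresponds to a regression run in which no tuples land in the trap directory.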

Also, the _confusion matrix_ shown in `out/measure/part*` should match the one logged in `model.log` from the baseline generated in _R_.
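
For readers unfamiliar with the term, a confusion matrix simply counts how often each expected label was paired with each scored label. The sketch below shows that counting in plain Java; the class name `ConfusionMatrixSketch` and the array-based inputs are hypothetical and do not reflect how _Pattern_ computes its measure output.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of confusion-matrix counting.
final class ConfusionMatrixSketch {
    /** Outer key: expected label; inner key: scored label; value: number of tuples. */
    static Map<String, Map<String, Integer>> count(String[] expected, String[] scored) {
        if (expected.length != scored.length) {
            throw new IllegalArgumentException("inputs must have the same length");
        }
        Map<String, Map<String, Integer>> matrix = new TreeMap<>();
        for (int i = 0; i < expected.length; i++) {
            matrix.computeIfAbsent(expected[i], k -> new TreeMap<>())
                  .merge(scored[i], 1, Integer::sum);
        }
        return matrix;
    }
}
```

Cells on the diagonal (expected label equal to scored label) are the agreements; everything else is a misclassification.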

To run on Amazon AWS, take a look at the `emr.sh` script.

Classifier vs. Predictive Model
-------------------------------

Here's how to run an example _classifier_ using Random Forest:

    gradle clean jar

    rm -rf out

    hadoop jar build/libs/pattern.jar data/iris.rf.tsv out/classify out/trap \
     --pmml data/iris.rf.xml --measure out/measure --label species

Here's how to run an example _predictive model_ using Linear Regression:

    gradle clean jar

    rm -rf out

    hadoop jar build/libs/pattern.jar data/iris.lm_p.tsv out/classify out/trap \
     --pmml data/iris.lm_p.xml --rmse out/measure
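
Assuming the `--rmse` option reports a root mean square error between the expected values and the model's scores, the underlying calculation can be sketched as follows. The class name `RmseSketch` is hypothetical and the code is independent of _Pattern_.

```java
// Hypothetical illustration of a root mean square error calculation.
final class RmseSketch {
    /** Square root of the mean of squared differences between the two arrays. */
    static double rmse(double[] expected, double[] scored) {
        if (expected.length == 0 || expected.length != scored.length) {
            throw new IllegalArgumentException("inputs must be non-empty and equal in length");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < expected.length; i++) {
            double error = expected[i] - scored[i];
            sumOfSquares += error * error;
        }
        return Math.sqrt(sumOfSquares / expected.length);
    }
}
```

A value of zero means the scores match the expected values exactly; larger values indicate larger average deviation.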

Use in Cascading Apps
---------------------

Alternatively, if you want to re-use this assembly for your own Cascading app, remove the parts related to `verifyPipe` and `measurePipe` from the sample code.

The following snippet in R shows how to train a Random Forest model, then generate PMML as a file called `sample.rf.xml`:

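    library(randomForest)  # provides randomForest()
    library(pmml)          # provides pmml()
    library(XML)           # provides saveXML()

    f <- as.formula("as.factor(label) ~ .")
    fit <- randomForest(f, data_train, ntree=50)
    saveXML(pmml(fit), file="sample.rf.xml")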

To use the PMML file in your Cascading app, pass its path to the app. In this example it is referenced as a command line argument called `pmmlPath`:

    // define a "Classifier" model from PMML to evaluate the orders

    ClassifierFunction classFunc = new ClassifierFunction( new Fields( "score" ), pmmlPath );

    Pipe classifyPipe = new Each( new Pipe( "classify" ), classFunc.getFields(), classFunc, Fields.ALL );

Now when you run that Cascading app, provide a reference to `sample.rf.xml` for the `pmmlPath` argument.

An architectural diagram for common use case patterns is shown in `docs/pattern.graffle` which is an OmniGraffle document.

Example Models
--------------

Check the `src/r/rattle_pmml.R` script for examples of predictive models which are created in R, then exported using _Rattle_. These examples use the popular [Iris](http://en.wikipedia.org/wiki/Iris_flower_data_set) data set.

 * random forest (rf)

 * linear regression (lm)

 * hierarchical clustering (hclust)

 * k-means clustering (kmeans)

 * logistic regression (glm)

 * multinomial model (multinom)

 * single hidden-layer neural network (nnet)

 * support vector machine (ksvm)

 * recursive partition classification tree (rpart)

 * association rules

To execute the R script:

    R --vanilla < src/r/rattle_pmml.R

It is possible to extend PMML support for other kinds of modeling in R and other analytics platforms. To discuss, contact the developers on the [cascading-user](https://groups.google.com/forum/?fromgroups#!forum/cascading-user) email forum.

PMML Resources
--------------

 * [Data Mining Group](http://www.dmg.org/) XML standards and supported vendors

 * [PMML In Action](http://www.amazon.com/dp/1470003244) book 

 * [PMML validator](http://www.zementis.com/pmml_tools.htm)
