https://github.com/ceteri/pattern
"Pattern" sub-project for Cascading, which uses Cascading flows as containers for machine learning models, importing PMML model descriptions from R, SAS, KNIME, Weka, RapidMiner, etc.
- Host: GitHub
- URL: https://github.com/ceteri/pattern
- Owner: ceteri
- License: other
- Created: 2013-03-27T14:43:51.000Z (about 13 years ago)
- Default Branch: master
- Last Pushed: 2013-03-17T02:11:59.000Z (over 13 years ago)
- Last Synced: 2025-02-24T12:22:58.944Z (over 1 year ago)
- Language: Java
- Homepage: http://cascading.org/
- Size: 1.57 MB
- Stars: 8
- Watchers: 2
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
README
cascading.pattern
=================
_Pattern_ sub-project for http://Cascading.org/ which uses flows as
containers for machine learning models, importing
[PMML](http://en.wikipedia.org/wiki/Predictive_Model_Markup_Language)
model descriptions from _R_, _SAS_, _Weka_, _RapidMiner_, _KNIME_,
_SQL Server_, etc.
Current support for PMML includes:
* [Random Forest](http://en.wikipedia.org/wiki/Random_forest) in [PMML 4.0+](http://www.dmg.org/v4-0-1/MultipleModels.html) exported from [R/Rattle](http://cran.r-project.org/web/packages/rattle/index.html)
* [Linear Regression](http://en.wikipedia.org/wiki/Linear_regression) in [PMML 1.1+](http://www.dmg.org/v1-1/generalregression.html)
* [Hierarchical Clustering](http://en.wikipedia.org/wiki/Hierarchical_clustering) and [K-Means Clustering](http://en.wikipedia.org/wiki/K-means_clustering) in [PMML 2.0+](http://www.dmg.org/v2-0/ClusteringModel.html)
* [Logistic Regression](http://en.wikipedia.org/wiki/Logistic_regression) in [PMML 4.0.1+](http://www.dmg.org/v4-0-1/Regression.html)
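Before handing a PMML file to _Pattern_, it can help to check which kind of model it contains. The following is a minimal sketch in Python, not part of _Pattern_ itself: it reads the `version` attribute of the `PMML` root element and lists the top-level model elements. The `MODEL_ELEMENTS` mapping, the `describe_pmml` helper, and the toy `SAMPLE` document are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# PMML model elements and a rough description of each.
# This mapping is an illustrative assumption, not taken from Pattern's source.
MODEL_ELEMENTS = {
    "MiningModel": "ensemble, e.g. Random Forest",
    "RegressionModel": "regression",
    "GeneralRegressionModel": "general regression",
    "ClusteringModel": "clustering",
    "TreeModel": "decision tree",
}


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def describe_pmml(xml_text):
    """Return (PMML version, names of the top-level model elements)."""
    root = ET.fromstring(xml_text)
    if local_name(root.tag) != "PMML":
        raise ValueError("not a PMML document")
    models = [
        local_name(child.tag)
        for child in root
        if local_name(child.tag) in MODEL_ELEMENTS
    ]
    return root.get("version"), models


# Toy document for demonstration only; real exports are much larger.
SAMPLE = """<PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0">
  <Header description="toy example"/>
  <DataDictionary numberOfFields="0"/>
  <MiningModel functionName="classification"/>
</PMML>"""

version, models = describe_pmml(SAMPLE)
print(version, models)
```

A Random Forest exported as PMML 4.0 would typically show up as a `MiningModel`, while a linear or logistic regression would show up as a `RegressionModel` or `GeneralRegressionModel`.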
Build Instructions
------------------
To build _Pattern_ and then run its unit tests:

    gradle --info --stacktrace clean test
The following scripts generate a baseline (model+data) for the _Random
Forest_ algorithm. This baseline includes a reference data set --
1000 independent variables, 500 rows of simulated ecommerce orders --
plus a predictive model in PMML:

    ./src/py/gen_orders.py 500 1000 > orders.tsv
    R --vanilla < ./src/r/rf_pmml.R > model.log
This will generate `huge.rf.xml` as the PMML export for a Random
Forest classifier plus `huge.tsv` as a baseline data set for
regression testing.
To build _Pattern_ and run a regression test:

    gradle clean jar
    rm -rf out
    hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap \
      --pmml data/sample.rf.xml --measure out/measure --assert
For each tuple in the data, a _stream assertion_ tests whether the
`predicted` field matches the `score` field generated by the
model. Tuples which fail that assertion get trapped into
`out/trap/part*` for inspection.
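The logic of that check can be sketched outside of Cascading. The Python below is an illustration only, not how _Pattern_ implements stream assertions: it compares the `predicted` and `score` columns of a TSV and separates the rows that disagree. The `split_by_assertion` helper, the `order_id` column, and the sample rows are hypothetical.

```python
import csv
import io


def split_by_assertion(tsv_text, expected="predicted", actual="score"):
    """Split TSV rows into (passed, trapped) by comparing two columns."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    passed, trapped = [], []
    for row in reader:
        # A row passes when the model's score equals the expected label.
        (passed if row[expected] == row[actual] else trapped).append(row)
    return passed, trapped


# Hypothetical sample data: the second row fails the assertion.
SAMPLE = "order_id\tpredicted\tscore\n1\t0\t0\n2\t1\t0\n3\t1\t1\n"

passed, trapped = split_by_assertion(SAMPLE)
print(len(passed), "passed;", len(trapped), "trapped")
```

An empty trap output would correspond to `trapped` being an empty list here.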
Also, the _confusion matrix_ shown in `out/measure/part*` should
match the one logged in `model.log` from the baseline generated in _R_.
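For reference, a confusion matrix simply counts how often each actual label was paired with each predicted label. A minimal Python sketch, assuming a list of `(actual, predicted)` pairs; the sample pairs are made up and the layout of the output written by _Pattern_ may differ:

```python
from collections import Counter


def confusion_matrix(pairs):
    """Count occurrences of each (actual, predicted) label pair."""
    return Counter(pairs)


# Made-up (actual, predicted) pairs for illustration.
pairs = [("0", "0"), ("0", "1"), ("1", "1"), ("1", "1"), ("0", "0")]
matrix = confusion_matrix(pairs)

# Print one row per actual label, one column per predicted label.
labels = sorted({label for pair in pairs for label in pair})
for actual in labels:
    print(actual, [matrix[(actual, predicted)] for predicted in labels])
```

Off-diagonal counts, such as `matrix[("0", "1")]`, are the misclassified tuples.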
To run on Amazon AWS, take a look at the `emr.sh` script.
Classifier vs. Predictive Model
-------------------------------
Here's how to run an example _classifier_ using Random Forest:

    gradle clean jar
    rm -rf out
    hadoop jar build/libs/pattern.jar data/iris.rf.tsv out/classify out/trap \
      --pmml data/iris.rf.xml --measure out/measure --label species
Here's how to run an example _predictive model_ using Linear Regression:

    gradle clean jar
    rm -rf out
    hadoop jar build/libs/pattern.jar data/iris.lm_p.tsv out/classify out/trap \
      --pmml data/iris.lm_p.xml --rmse out/measure
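The `--rmse` option suggests that the predictive model is evaluated by root mean squared error rather than by a confusion matrix. As a reminder of what that metric measures, here is a minimal Python sketch over `(actual, predicted)` pairs; the `rmse` helper and the sample values are illustrative and not taken from _Pattern_:

```python
import math


def rmse(pairs):
    """Root mean squared error over (actual, predicted) numeric pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one pair")
    squared_error = sum((actual - predicted) ** 2 for actual, predicted in pairs)
    return math.sqrt(squared_error / len(pairs))


# Made-up values: errors of 0, -2 and 2 give sqrt(8 / 3).
print(rmse([(1.0, 1.0), (2.0, 4.0), (3.0, 1.0)]))
```

A lower value means the predictions sit closer to the observed values; zero means a perfect fit on the data measured.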
Use in Cascading Apps
---------------------
Alternatively, if you want to re-use this assembly for your own
Cascading app, remove the parts related to `verifyPipe` and
`measurePipe` from the sample code.
The following snippet in R shows how to train a Random Forest model
on a training data frame `data_train`, then generate PMML as a file
called `sample.rf.xml`:

    library(randomForest)
    library(pmml)
    library(XML)
    f <- as.formula("as.factor(label) ~ .")
    fit <- randomForest(f, data_train, ntree=50)
    saveXML(pmml(fit), file="sample.rf.xml")
To use the PMML file in your Cascading app, this example references
it through a command line argument called `pmmlPath`:

    // define a "Classifier" model from PMML to evaluate the orders
    ClassifierFunction classFunc = new ClassifierFunction( new Fields( "score" ), pmmlPath );
    Pipe classifyPipe = new Each( new Pipe( "classify" ), classFunc.getFields(), classFunc, Fields.ALL );
Now when you run that Cascading app, provide a reference to
`sample.rf.xml` for the `pmmlPath` argument.
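Conceptually, a Random Forest classifier scores each tuple by letting every tree in the forest vote and taking the most common label. The Python sketch below illustrates that idea only; it does not reproduce the internals of `ClassifierFunction`, and the three toy trees and the `var0` and `var1` field names are hypothetical.

```python
from collections import Counter


def majority_vote(trees, row):
    """Return the label predicted by the largest number of trees."""
    votes = Counter(tree(row) for tree in trees)
    return votes.most_common(1)[0][0]


# Three toy "trees", each reduced to a single threshold test.
# An odd number of trees with two labels avoids tied votes.
trees = [
    lambda r: "1" if r["var0"] > 0.5 else "0",
    lambda r: "1" if r["var1"] > 0.5 else "0",
    lambda r: "1" if r["var0"] + r["var1"] > 1.2 else "0",
]

print(majority_vote(trees, {"var0": 0.9, "var1": 0.2}))
```

In this example the first tree votes `1` and the other two vote `0`, so the tuple is scored as `0`.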
An architectural diagram for common use case patterns is shown in
`docs/pattern.graffle` which is an OmniGraffle document.
Example Models
--------------
Check the `src/r/rattle_pmml.R` script for examples of predictive
models which are created in R, then exported using _Rattle_.
These examples use the popular
[Iris](http://en.wikipedia.org/wiki/Iris_flower_data_set) data set.
* random forest (rf)
* linear regression (lm)
* hierarchical clustering (hclust)
* k-means clustering (kmeans)
* logistic regression (glm)
* multinomial model (multinom)
* single hidden-layer neural network (nnet)
* support vector machine (ksvm)
* recursive partition classification tree (rpart)
* association rules
To execute the R script:

    R --vanilla < src/r/rattle_pmml.R
It is possible to extend PMML support to other kinds of modeling in R
and other analytics platforms. To discuss this, contact the developers on
the [cascading-user](https://groups.google.com/forum/?fromgroups#!forum/cascading-user)
email forum.
PMML Resources
--------------
* [Data Mining Group](http://www.dmg.org/) XML standards and supported vendors
* [PMML In Action](http://www.amazon.com/dp/1470003244) book
* [PMML validator](http://www.zementis.com/pmml_tools.htm)