{"id":19925673,"url":"https://github.com/openscoring/papis.io","last_synced_at":"2025-06-26T14:34:49.964Z","repository":{"id":68882682,"uuid":"152601876","full_name":"openscoring/papis.io","owner":"openscoring","description":"Putting five ML models to production in five minutes","archived":false,"fork":false,"pushed_at":"2018-12-22T21:47:27.000Z","size":57,"stargazers_count":7,"open_issues_count":0,"forks_count":2,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-04-07T13:37:39.528Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/openscoring.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-10-11T14:06:57.000Z","updated_at":"2023-02-16T19:29:19.000Z","dependencies_parsed_at":"2023-02-23T23:45:20.804Z","dependency_job_id":null,"html_url":"https://github.com/openscoring/papis.io","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openscoring%2Fpapis.io","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openscoring%2Fpapis.io/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openscoring%2Fpapis.io/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openscoring%2Fpapis.io/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/openscoring","download_url":"https://codeload.github.com/openscoring/papis.io/tar
.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252162521,"owners_count":21704266,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-12T22:23:40.676Z","updated_at":"2025-05-03T08:31:12.315Z","avatar_url":"https://github.com/openscoring.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[https://github.com/openscoring/papis.io](https://github.com/openscoring/papis.io)\n=========================================\n\n[PAPIs 2018](https://www.papis.io/2018) tool demonstration: [Putting five ML models to production in five minutes](https://papis2018.sched.com/event/FnJW/putting-five-ml-models-to-production-in-five-minutes)\n\n# Table of Contents #\n\n- [Introduction](#introduction)\n- [Prerequisites](#prerequisites)\n- [Installation and usage](#installation-and-usage)\n    + [R](#r)\n    + [Scikit-Learn](#scikit-learn)\n    + [Apache Spark](#apache-spark)\n    + [Openscoring](#openscoring)\n- [TL;DR, Demo](#tldr-demo)\n- [Demo](#demo)\n    + [Initialization](#initialization)\n    + [Logistic Regression in R](#logistic-regression-in-r)\n    + [XGBoost classification in Scikit-Learn](#xgboost-classification-in-scikit-learn)\n    + [H2O.ai Distributed Random Forest (DRF) classification in Scikit-Learn](#h2oai-distributed-random-forest-drf-classification-in-scikit-learn)\n    + [Regularized (Elastic net) Logistic Regression in Apache Spark](#regularized-elastic-net-logistic-regression-in-apache-spark)\n    + [Business rules classification in 
Scikit-Learn](#business-rules-classification-in-scikit-learn)\n    + [Scoring data](#scoring-data)\n- [Further reading](#further-reading)\n- [Contact](#contact)\n\n# Introduction #\n\nThe field of data science is split between two paradigms:\n\n| | **Structured** (ML) | **Unstructured** (AI) |\n| --- | --- | --- |\n| Scale | Small to large | Medium to extremely large |\n| Data | Relational | Images, videos, text |\n| Feature type | Scalar | Array/matrix |\n| Workflows | Manual, intelligent | Automated, brute-force |\n| Hardware | Commodity (CPU) | Specialized (GPU, TPU) |\n| Results | Explainable | \"Black-box\" |\n| Standards | [PMML](https://dmg.org/) | [ONNX](https://onnx.ai/), [TensorFlow](https://www.tensorflow.org/) |\n\nThe domain of structured data science is based on a solid foundation (statistics), and is responsible for delivering the majority of business value today and in the foreseeable future.\n\nEverything about data science is a lucrative and fast-growing market for software vendors. Legacy and continuation projects are typically served by proprietary/closed-source solutions. However, new projects tend to gravitate towards free- and open-source software (FOSS) solutions because of their superior functional and technical capabilities, and support options.\n\nDominant FOSS ML frameworks:\n\n* [R](https://www.r-project.org/)\n* [Scikit-Learn](https://scikit-learn.org/stable/)\n* [Apache Spark](https://spark.apache.org/)\n\nOn top of frameworks, there are a number of independent FOSS ML algorithms:\n\n* [H2O.ai](https://www.h2o.ai/)\n* [XGBoost](https://github.com/dmlc/xgboost)\n* [LightGBM](https://github.com/Microsoft/LightGBM)\n\nThird-party algorithms can deliver significant performance, predictivity and explainability gains over built-in algorithms.\n\nThe biggest issue with FOSS ML frameworks and algorithms is the difficulty of moving trained models \"from the laboratory to the factory\". There are two sides to it. 
First, the trained model object is functionally very tightly coupled to the original environment. Second, enterprise application programming languages such as Java, C# and SQL do not provide meaningful interoperability with R and Python.\n\nDominant productionalization strategies:\n\n* Containerization.\n* Translation from R/Python representation to Java/C#/SQL application code.\n* Translation from R/Python representation to standardized intermediate representation.\n\nThis tool demonstration is about the third strategy. We shall 1) train models using popular FOSS ML frameworks and algorithms, 2) translate them from their native R/Scikit-Learn/Apache Spark representation to the standardized Predictive Model Markup Language (PMML) representation, and 3) deploy them as such using the Openscoring REST web service.\n\n# Prerequisites\n\n* Java 1.8 or newer. The Java executable (`java.exe`) must be available on system path.\n* R 3.3 or newer\n* Python 2.7, 3.3 or newer\n* Apache Spark 2.0 or newer\n\n# Installation and usage #\n\n### R\n\nThe conversion is handled by the [`r2pmml`](https://github.com/jpmml/r2pmml) package.\n\nThis package is not available on CRAN. It can only be installed from its GitHub repository using the [`devtools`](https://cran.r-project.org/package=devtools) package:\n\n```R\nlibrary(\"devtools\")\n\ninstall_git(\"git://github.com/jpmml/r2pmml.git\")\n```\n\nThe conversion functionality is available via the `r2pmml::r2pmml(obj, pmml_path)` function:\n\n```R\nlibrary(\"r2pmml\")\n\nglm.obj = glm(y ~ ., data = mydata)\n\nr2pmml(glm.obj, \"MyModel.pmml\")\n```\n\n### Scikit-Learn\n\nThe conversion is handled by the [`sklearn2pmml`](https://github.com/jpmml/sklearn2pmml) package.\n\nThis package is available on PyPI. Alternatively, it can be installed from its GitHub repository:\n\n```\n$ pip install git+https://github.com/jpmml/sklearn2pmml.git\n``` \n\nThe `sklearn2pmml` package is \"softly dependent\" on `h2o`, `lightgbm` and `xgboost` packages. 
This tool demonstration needs two of them, so they must be installed separately:\n\n```\n$ pip install h2o xgboost\n```\n\nThe conversion functionality is available via the `sklearn2pmml.sklearn2pmml(pmml_pipeline, pmml_path)` function:\n\n```Python\nfrom sklearn2pmml import sklearn2pmml\nfrom sklearn2pmml.pipeline import PMMLPipeline\n\npipeline = PMMLPipeline([...])\n\nsklearn2pmml(pipeline, \"MyModel.pmml\")\n```\n\nThe only code change required is using `sklearn2pmml.pipeline.PMMLPipeline` instead of `sklearn.pipeline.Pipeline`. The former is a direct descendant of the latter (hence providing full API compatibility), but adds behind-the-scenes metadata collection and a couple of PMML-related methods (decision engineering, model configuration and verification).\n\n### Apache Spark\n\nThe conversion is handled by the [JPMML-SparkML](https://github.com/jpmml/jpmml-sparkml) library. R and Python users might feel more comfortable working with [`sparklyr2pmml`](https://github.com/jpmml/sparklyr2pmml) and [`pyspark2pmml`](https://github.com/jpmml/pyspark2pmml) packages, respectively.\n\nEnd users are advised to download a JPMML-SparkML release version from its GitHub releases page: https://github.com/jpmml/jpmml-sparkml/releases\n\nThe JPMML-SparkML library is being developed and released in five parallel version lines, one for each supported Apache Spark version line:\n\n| JPMML-SparkML | Apache Spark |\n| --- | --- |\n| [1.1.X](https://github.com/jpmml/jpmml-sparkml/tree/1.1.X) | 2.0.X |\n| [1.2.X](https://github.com/jpmml/jpmml-sparkml/tree/1.2.X) | 2.1.X |\n| [1.3.X](https://github.com/jpmml/jpmml-sparkml/tree/1.3.X) | 2.2.X |\n| [1.4.X](https://github.com/jpmml/jpmml-sparkml/tree/1.4.X) | 2.3.X |\n| [1.5.X](https://github.com/jpmml/jpmml-sparkml/tree/master) | 2.4.X |\n\nFor example, if targeting Apache Spark 2.3.X, then the end user should download the latest JPMML-SparkML 1.4.X version (1.4.6 at the time of PAPIs.io 2018).\n\nThe JPMML-SparkML library should be 
appended to Apache Spark application classpath. For command-line applications, this can be easily done using the `--jars` option:\n\n```\n$ spark-submit --jars jpmml-sparkml-executable-${version}.jar \u003capp jar | python file | R file\u003e\n```\n\nThe conversion functionality is available via the `org.jpmml.sparkml.PMMLBuilder` builder class:\n\n```Java\nDataFrame df = ...\nPipeline pipeline = ...\n\nPipelineModel pipelineModel = pipeline.fit(df);\n\nPMMLBuilder pmmlBuilder = new PMMLBuilder(df.schema(), pipelineModel);\n\npmmlBuilder.buildFile(new File(\"MyModel.pmml\"));\n```\n\n### Openscoring\n\nThe [Openscoring](https://github.com/openscoring/openscoring) REST web service is a thin JAX-RS wrapper around the [JPMML-Evaluator](https://github.com/jpmml/jpmml-evaluator) library.\n\nOpenscoring provides a microservices-style approach for turning static PMML documents into live functions:\n\n* Commissioning and decommissioning\n* Schema querying\n* Evaluation in single prediction, batch prediction and CSV prediction modes\n* Metrics\n\nEnd users are advised to download an Openscoring release version from its GitHub releases page: https://github.com/openscoring/openscoring/releases \n\nStarting up the standalone edition:\n\n```\n$ java -jar openscoring-server-executable-${version}.jar\n```\n\nBy default, Openscoring binds to `localhost:8080`, using `/openscoring` as the web context root. 
If the startup was successful, then performing an HTTP GET query against the model collection endpoint [`model/`](http://localhost:8080/openscoring/model) should return an empty JSON object `{}`.\n\nFurther interaction is possible using HTTP toolkits such as [cURL](https://curl.haxx.se/) or [Postman](https://www.getpostman.com/).\n\nEmulating the full lifecycle of a model using cURL:\n\n```\n$ curl -X PUT --data-binary @MyModel.pmml -H \"Content-type: text/xml\" http://localhost:8080/openscoring/model/MyModel\n$ curl -X GET http://localhost:8080/openscoring/model/MyModel\n$ curl -X POST --data-binary @input.csv -H \"Content-type: text/plain; charset=UTF-8\" http://localhost:8080/openscoring/model/MyModel/csv \u003e output.csv\n$ curl -X DELETE http://localhost:8080/openscoring/model/MyModel\n```\n\nR and Python users might feel more comfortable working with [`openscoring-r`](https://github.com/openscoring/openscoring-r) and [`openscoring-python`](https://github.com/openscoring/openscoring-python) packages, respectively.\n\nEmulating the full lifecycle of a model using the `openscoring-python` package:\n\n```Python\nfrom openscoring import Openscoring\n\nos = Openscoring(base_url = \"http://localhost:8080/openscoring\")\nos.deployFile(\"MyModel\", \"MyModel.pmml\")\nos.evaluateCsvFile(\"MyModel\", \"input.csv\", \"output.csv\")\nos.undeploy(\"MyModel\")\n```\n\n# TL;DR, Demo #\n\nInitialization:\n\n```\n$ java -jar openscoring-server-executable-${version}.jar\n```\n\nTraining, converting and deploying models:\n\n```\n$ Rscript --vanilla GLMAudit.R --deploy\n$ python XGBoostAudit.py --deploy\n$ python RandomForestAudit.py --deploy\n$ spark-shell --jars jpmml-sparkml-executable-${version}.jar,openscoring-client-executable-${version}.jar -i ElasticNetAudit.scala --conf spark.driver.args=\"--deploy\"\n$ python RuleSetIris.py --deploy\n```\n\nScoring data:\n\n```\n$ curl -X POST --data-binary @csv/Audit.csv -H \"Content-type: text/plain; charset=UTF-8\" 
http://localhost:8080/openscoring/model/RandomForestAudit/csv \u003e RandomForestAudit.csv\n$ curl -X POST --data-binary @csv/Iris.csv -H \"Content-type: text/plain; charset=UTF-8\" http://localhost:8080/openscoring/model/RuleSetIris/csv \u003e RuleSetIris.csv\n```\n\n# Demo #\n\n### Initialization\n\nStarting up Openscoring:\n\n```\n$ java -jar openscoring-server-executable-${version}.jar\n```\n\n### Logistic Regression in R\n\nThe R script file: [GLMAudit.R](GLMAudit.R)\n\nAll feature engineering should be done using the [model formula](https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html) approach in order to make it part of the model object state (ie. can be saved and read back into memory using [`base::saveRDS(obj, path)`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/readRDS.html) and [`base::readRDS(path)`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/readRDS.html) functions).\n\nBinning the \"Age\" feature using the [`base::cut(x, breaks)`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/cut.html) function:\n\n```R\nageQuantiles = quantile(audit$Age)\n\naudit.formula = formula(Adjusted ~ . - Age + base::cut(Age, breaks = ageQuantiles))\n```\n\nInteracting \"Gender\" and \"Marital\" features using the `:` operator:\n\n```R\naudit.formula = formula(Adjusted ~ . + Gender:Marital)\n```\n\nDeriving an hourly income based on \"Income\" (annual income) and \"Hours\" (the number of working hours in a week) features using arithmetic operators; as a matter of caution, all inline R expressions should be surrounded with the [`base::I(x)`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/AsIs.html) function:\n\n```R\naudit.formula = formula(Adjusted ~ . 
+ I(Income / (Hours * 52)))\n```\n\nAfter training, the model object is enhanced with verification data using the `r2pmml::verify(obj, newdata)` function:\n\n```R\nlibrary(\"r2pmml\")\n\naudit.glm = glm(Adjusted ~ ., data = audit)\n\n# Discard known values of the dependent variable\naudit$Adjusted = NULL\n\naudit.glm = verify(audit.glm, audit[sample(nrow(audit), 100), ])\n```\n\nRunning the R script file:\n\n```\n$ Rscript --vanilla GLMAudit.R --deploy\n```\n\nThe generated PMML document is saved as `pmml/GLMAudit.pmml` and deployed to Openscoring as [`model/GLMAudit`](http://localhost:8080/openscoring/model/GLMAudit).\n\n### XGBoost classification in Scikit-Learn\n\nThe Python script file: [XGBoostAudit.py](XGBoostAudit.py)\n\nAll column-oriented feature engineering should be done using the `sklearn_pandas.DataFrameMapper` meta-transformer class:\n\n```Python\nfrom sklearn.preprocessing import LabelBinarizer\nfrom sklearn_pandas import DataFrameMapper\nfrom sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain\n\nmapper = DataFrameMapper(\n\t[([cat_column], [CategoricalDomain(), LabelBinarizer()]) for cat_column in [...]] +\n\t[([cont_column], [ContinuousDomain()]) for cont_column in [...]]\n)\n```\n\nBinning the \"Age\" feature using the `sklearn2pmml.preprocessing.CutTransformer` transformer class:\n\n```Python\nfrom sklearn2pmml.preprocessing import CutTransformer\n\nmapper = DataFrameMapper([\n\t(\"Age\", [ContinuousDomain(), CutTransformer(bins = [17, 28, 37, 47, 83], labels = [\"q1\", \"q2\", \"q3\", \"q4\"]), LabelBinarizer()])\n])\n```\n\nInteracting \"Gender\" and \"Marital\" features using the `sklearn.preprocessing.PolynomialFeatures` transformer class:\n\n```Python\nfrom sklearn.pipeline import FeatureUnion, Pipeline\nfrom sklearn.preprocessing import PolynomialFeatures\n\nunion = FeatureUnion([\n\t(\"scalar_mapper\", DataFrameMapper([...])),\n\t(\"interaction_pipeline\", Pipeline([\n\t\t(\"interaction_mapper\", 
DataFrameMapper([\n\t\t\t(\"Gender\", [CategoricalDomain(), LabelBinarizer()]),\n\t\t\t(\"Marital\", [CategoricalDomain(), LabelBinarizer()])\n\t\t])),\n\t\t(\"polynomial_features\", PolynomialFeatures())\n\t]))\n])\n```\n\nDeriving an hourly income based on \"Income\" and \"Hours\" features using the `sklearn2pmml.preprocessing.ExpressionTransformer` transformer class:\n\n```Python\nfrom sklearn2pmml.decoration import Alias\nfrom sklearn2pmml.preprocessing import ExpressionTransformer\n\nmapper = DataFrameMapper([\n\t([\"Hours\", \"Income\"], Alias(ExpressionTransformer(\"X[1] / (X[0] * 52)\"), \"Hourly_Income\"))\n])\n```\n\nAfter training, the model object is re-encoded from binary splits to multi-way splits using the `PMMLPipeline.configure(**pmml_options)` method, and enhanced with verification data using the `PMMLPipeline.verify(X, precision, zeroThreshold)` method:\n\n```Python\nfrom sklearn2pmml.pipeline import PMMLPipeline\n\npipeline = PMMLPipeline([...])\n\npipeline.configure(compact = True)\npipeline.verify(audit_X.sample(100), zeroThreshold = 1e-6, precision = 1e-6)\n```\n\nRunning the Python script file:\n\n```\n$ python XGBoostAudit.py --deploy\n```\n\nThe generated PMML document is saved as `pmml/XGBoostAudit.pmml` and deployed to Openscoring as [`model/XGBoostAudit`](http://localhost:8080/openscoring/model/XGBoostAudit).\n\n\n### H2O.ai Distributed Random Forest (DRF) classification in Scikit-Learn\n\nThe Python script file: [RandomForestAudit.py](RandomForestAudit.py)\n\nH2O.ai algorithms provide full support for string categorical features. This is in stark contrast with other Python-accessible ML algorithms that require them to be binarized in one-hot-encoding fashion (eg. Scikit-Learn, XGBoost) or at least re-encoded (eg. 
LightGBM):\n\n```Python\nfrom sklearn_pandas import DataFrameMapper\nfrom sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain\n\nmapper = DataFrameMapper(\n\t[([cat_column], [CategoricalDomain()]) for cat_column in [...]] +\n\t[([cont_column], [ContinuousDomain()]) for cont_column in [...]]\n)\n```\n\nAll feature engineering happens on the local computer using Scikit-Learn transformer classes. The pre-processed dataset (could be a `pandas.DataFrame` or a NumPy matrix) is then uploaded to the remote computer where the H2O.ai compute engine resides using the `sklearn2pmml.preprocessing.h2o.H2OFrameCreator` meta-transformer class:\n\n```Python\nfrom h2o import H2OFrame\nfrom h2o.estimators.random_forest import H2ORandomForestEstimator\nfrom sklearn2pmml.pipeline import PMMLPipeline\nfrom sklearn2pmml.preprocessing.h2o import H2OFrameCreator\n\npipeline = PMMLPipeline([\n\t(\"local_mapper\", DataFrameMapper([...])),\n\t(\"uploaded\", H2OFrameCreator()),\n\t(\"remote_classifier\", H2ORandomForestEstimator())\n])\npipeline.fit(audit_X, H2OFrame(audit_y.to_frame(), column_types = [\"categorical\"]))\n```\n\nA `Pipeline.predict_proba(X)` method call returns a two-column matrix for binary classification problems, where the first column holds the probability of the negative (\"no-event\") scenario and the second column holds the probability of the positive (\"event\") scenario.\n\nThe Scikit-Learn framework does not support decision engineering (eg. appending transformation steps to the final estimator step) based on predicted labels or probability distributions. 
\n\nThe `PMMLPipeline` class makes it possible by adding the following attributes and methods:\n\n| Attribute | Method |\n| --- | --- |\n| `predict_transformer` | `predict_transform(X)` |\n| `predict_proba_transformer` | `predict_proba_transform(X)` |\n| `apply_transformer` | N/A |\n\nBinning the probability of the positive scenario using the `CutTransformer` transformer class:\n\n```Python\npredict_proba_transformer = Pipeline([\n\t(\"expression\", ExpressionTransformer(\"X[1]\")),\n\t(\"cut\", Alias(CutTransformer(bins = [0.0, 0.75, 0.90, 1.0], labels = [\"no\", \"maybe\", \"yes\"]), \"Decision\", prefit = True))\n])\n\npipeline = PMMLPipeline([...], predict_proba_transformer = predict_proba_transformer)\npipeline.fit(audit_X, H2OFrame(audit_y.to_frame(), column_types = [\"categorical\"]))\n\npipeline.predict_proba_transform(audit_X)\n```\n\nRunning the Python script file:\n\n```\n$ python RandomForestAudit.py --deploy\n```\n\nThe generated PMML document is saved as `pmml/RandomForestAudit.pmml` and deployed to Openscoring as [`model/RandomForestAudit`](http://localhost:8080/openscoring/model/RandomForestAudit).\n\n### Regularized (Elastic net) Logistic Regression in Apache Spark\n\nThe Scala script file: [ElasticNetAudit.scala](ElasticNetAudit.scala)\n\nApache Spark pipelines are much more flexible than Scikit-Learn pipelines. Specifically, they support model chains, transformations between models and after the last model. 
The JPMML-SparkML library should be able to convert all that into the standardized PMML representation in a fully automated way.\n\nBinning the \"Age\" feature using the `org.apache.spark.ml.feature.QuantileDiscretizer` transformer class:\n\n```Scala\nval ageDiscretizer = new QuantileDiscretizer()\n\t.setNumBuckets(4)\n\t.setInputCol(\"Age\")\n\t.setOutputCol(\"discretizedAge\");\n```\n\nInteracting \"Gender\" and \"Marital\" features using the `org.apache.spark.ml.feature.Interaction` transformer class:\n\n```Scala\nval genderMaritalInteraction = new Interaction()\n\t.setInputCols(Array(\"encodedGender\", \"encodedMarital\"))\n\t.setOutputCol(\"interactedGenderMarital\");\n```\n\nSearching for the best regularization parameter using the `org.apache.spark.ml.tuning.CrossValidator` meta-estimator class:\n\n```Scala\nval logisticRegression = new LogisticRegression()\n\t.setElasticNetParam(0.5)\n\t.setFeaturesCol(\"vectorizedFeatures\")\n\t.setLabelCol(\"indexedAdjusted\");\n\nstages += logisticRegression\t\n\nval estimator = new Pipeline().setStages(stages.toArray)\nval estimatorParamMaps = new ParamGridBuilder().addGrid(logisticRegression.regParam, Array(0.05, 0.10, 0.15)).build()\nval evaluator = new BinaryClassificationEvaluator().setLabelCol(\"indexedAdjusted\")\n\nval crossValidator = new CrossValidator()\n\t.setEstimator(estimator)\n\t.setEstimatorParamMaps(estimatorParamMaps)\n\t.setEvaluator(evaluator)\n\t.setSeed(42L);\n\nval pipeline = new Pipeline().setStages(Array(crossValidator))\nval pipelineModel = pipeline.fit(df)\n```\n\nRunning the Scala script without Openscoring deployment:\n\n```\n$ spark-shell --jars jpmml-sparkml-executable-${version}.jar -i ElasticNetAudit.scala\n```\n\nThe generated PMML document is saved as `pmml/ElasticNetAudit.pmml`.\n\nRunning the Scala script with Openscoring deployment:\n\n```\n$ spark-shell --jars jpmml-sparkml-executable-${version}.jar,openscoring-client-executable-${version}.jar -i ElasticNetAudit.scala --conf 
spark.driver.args=\"--deploy\"\n```\n\nThe generated PMML document is saved as `pmml/ElasticNetAudit.pmml` and deployed to Openscoring as [`model/ElasticNetAudit`](http://localhost:8080/openscoring/model/ElasticNetAudit).\n\n### Business rules classification in Scikit-Learn\n\nThe Python script file: [RuleSetIris.py](RuleSetIris.py)\n\nThere are data science problems where the solution is obvious/known in advance, and the whole machine learning workflow is reduced to just writing down the function.\n\nGenerating PMML documents manually is not too difficult. However, it would be a major usability/productivity advance if end users could accomplish everything from within their favourite environment, without having to learn and do anything new.\n\nThe `sklearn2pmml` package provides the `sklearn2pmml.ruleset.RuleSetClassifier` estimator class, which allows a data record to be labeled by matching it against a collection of Python predicates (ie. boolean expressions).\n\nImplementing a decision tree-like solution:\n\n```Python\nfrom sklearn2pmml.ruleset import RuleSetClassifier\n\nclassifier = RuleSetClassifier([\n\t(\"X['Petal_Length'] \u003c 2.45\", \"setosa\"),\n\t(\"X['Petal_Width'] \u003c 1.75\", \"versicolor\"),\n], default_score = \"virginica\")\n```\n\nRunning the Python script file:\n\n```\n$ python RuleSetIris.py --deploy\n```\n\nThe generated PMML document is saved as `pmml/RuleSetIris.pmml` and deployed to Openscoring as [`model/RuleSetIris`](http://localhost:8080/openscoring/model/RuleSetIris).\n\n### Scoring data\n\nAt this point, there should be five models deployed on Openscoring:\n\n* [`model/GLMAudit`](http://localhost:8080/openscoring/model/GLMAudit)\n* [`model/XGBoostAudit`](http://localhost:8080/openscoring/model/XGBoostAudit)\n* [`model/RandomForestAudit`](http://localhost:8080/openscoring/model/RandomForestAudit)\n* [`model/ElasticNetAudit`](http://localhost:8080/openscoring/model/ElasticNetAudit)\n* 
[`model/RuleSetIris`](http://localhost:8080/openscoring/model/RuleSetIris)\n\nScoring the [`csv/Audit.csv`](csv/Audit.csv) input file with the `RandomForestAudit` model using cURL:\n\n```\n$ curl -X POST --data-binary @csv/Audit.csv -H \"Content-type: text/plain; charset=UTF-8\" http://localhost:8080/openscoring/model/RandomForestAudit/csv \u003e RandomForestAudit.csv\n```\n\nThe `RandomForestAudit.csv` results file contains five columns - the \"Adjusted\" target column, and \"probability(0)\", \"probability(1)\", \"eval(X[1])\" and \"Decision\" output columns. The last one holds the outcome of our decision engineering efforts - all in all there are 154 \"yes\" decisions, 153 \"maybe\" decisions and 1592 \"no\" decisions.\n\nScoring the [`csv/Iris.csv`](csv/Iris.csv) input file with the `RuleSetIris` model using cURL:\n\n```\n$ curl -X POST --data-binary @csv/Iris.csv -H \"Content-type: text/plain; charset=UTF-8\" http://localhost:8080/openscoring/model/RuleSetIris/csv \u003e RuleSetIris.csv\n```\n\nThe `RuleSetIris.csv` results file contains a single \"Species\" target column.\n\n# Further reading #\n\nPresentations:\n\n* [State of the (J)PMML art](https://www.slideshare.net/VilluRuusmann/state-of-the-jpmml-art)\n* [Converting R to PMML](https://www.slideshare.net/VilluRuusmann/converting-r-to-pmml-82182483)\n* [Converting Scikit-Learn to PMML](https://www.slideshare.net/VilluRuusmann/converting-scikitlearn-to-pmml)\n\nSoftware:\n\n* [Java PMML API](https://github.com/jpmml)\n* [Openscoring REST API](https://github.com/openscoring)\n\n# Contact #\n\nVillu Ruusmann  \nCTO and Founder at Openscoring OÜ, Estonia\n\nGitHub: https://github.com/vruusmann  \nLinkedIn: https://ee.linkedin.com/in/villuruusmann/  \nSlideShare: https://slideshare.net/VilluRuusmann  \ne-mail: villu@openscoring.io  \nSkype: 
villu.ruusmann\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopenscoring%2Fpapis.io","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fopenscoring%2Fpapis.io","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopenscoring%2Fpapis.io/lists"}