{"id":19140836,"url":"https://github.com/codait/aardpfark","last_synced_at":"2025-05-06T23:17:11.848Z","repository":{"id":79555433,"uuid":"136139555","full_name":"CODAIT/aardpfark","owner":"CODAIT","description":"A library for exporting Spark ML models and pipelines to PFA","archived":false,"fork":false,"pushed_at":"2018-11-21T02:50:21.000Z","size":181,"stargazers_count":54,"open_issues_count":12,"forks_count":15,"subscribers_count":13,"default_branch":"master","last_synced_at":"2025-05-06T23:17:05.151Z","etag":null,"topics":["apache-spark","machine-learning","ml","model","model-export","pfa","pfa-standard","pipelines"],"latest_commit_sha":null,"homepage":"http://codait.org","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/CODAIT.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2018-06-05T07:44:13.000Z","updated_at":"2023-06-06T17:29:29.000Z","dependencies_parsed_at":null,"dependency_job_id":"6bcaad13-5697-4afe-a920-02dca8daaccc","html_url":"https://github.com/CODAIT/aardpfark","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CODAIT%2Faardpfark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CODAIT%2Faardpfark/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CODAIT%2Faardpfark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CODAIT%2Faardpfark/manifests","owner_url":
"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/CODAIT","download_url":"https://codeload.github.com/CODAIT/aardpfark/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252782835,"owners_count":21803410,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-spark","machine-learning","ml","model","model-export","pfa","pfa-standard","pipelines"],"created_at":"2024-11-09T07:18:58.674Z","updated_at":"2025-05-06T23:17:11.839Z","avatar_url":"https://github.com/CODAIT.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![Build Status](https://travis-ci.org/CODAIT/aardpfark.svg?branch=master)](https://travis-ci.org/CODAIT/aardpfark)\n\n# Aardpfark\n\nAardpfark is a library for exporting Spark ML models and pipelines to the [Portable Format for Analytics (PFA)](http://dmg.org/pfa/).\n\nPFA is a JSON format for representing machine learning models, data transformations and analytic applications.\nThe format encapsulates both serialization as well as the operations (or functions) to be applied to\ninput data to produce output data. 
It can essentially be thought of as a mini functional language, \ntogether with a data schema specification.\n\nA PFA [\"document\"](http://dmg.org/pfa/docs/document_structure/) is fully self-contained and can be executed by any\ncompliant execution engine, making a model written to PFA truly portable across languages, frameworks, and runtimes.\n\n# Installation\n\n## Prerequisites\n\n* [`sbt`](https://www.scala-sbt.org/)\n* [Apache Maven](https://maven.apache.org/) for [installing](#running-the-tests) the test dependency\n* [Apache Spark](https://spark.apache.org/)\n\n## Quick start\n\nAardpfark currently targets and has been tested on Apache Spark 2.2.0. Support for Spark 2.3.0 will be added soon.\n\n1. Build the `aardpfark` project (skipping tests) using `sbt 'set test in assembly := {}' clean assembly`\n2. Add the aardpfark JAR to your Spark application, e.g. using spark-shell:\n\n```\n./bin/spark-shell --driver-class-path /PATH_TO_AARDPFARK_JAR/aardpfark-assembly-0.1.0-SNAPSHOT.jar\n\n```\n\n### Use with SBT or Maven\n\n*Note* publishing to Maven is coming soon.\n\nFirst you will need to install `aardpfark` locally using `sbt publish-local`. Then, add it to your SBT build file:\n\n```scala\nlibraryDependencies += \"com.ibm\" %% \"aardpfark\" % \"0.1.0-SNAPSHOT\"\n```\n\n\n## Usage\n\nAardpfark provides functions for exporting supported models to PFA as JSON strings. For example,\nto export a simple logistic regression model and print the resulting PFA document:\n\n\n```scala\nimport com.ibm.aardpfark.spark.ml.SparkSupport.toPFA\n\nimport org.apache.spark.ml.classification._\n\nval data = spark.read.format(\"libsvm\").load(\"data/sample_multiclass_classification_data.txt\")\nval lr = new LogisticRegression()\nval model = lr.fit(data)\n\nval pfa = toPFA(model, true)\nprintln(pfa)\n\n```\n\n### Pipeline support\n\nAardpfark also supports exporting pipelines consisting of supported models and transformers. 
Because \npipeline export requires access to the schema information of the input DataFrame, you must also pass in that schema\nto the export function:\n\n```scala\nimport com.ibm.aardpfark.spark.ml.SparkSupport.toPFA\n\nimport org.apache.spark.ml._\nimport org.apache.spark.ml.classification._\nimport org.apache.spark.ml.feature._\n\nval data = spark.read.format(\"libsvm\").load(\"data/sample_multiclass_classification_data.txt\")\nval scaler = new StandardScaler().setInputCol(\"features\").setOutputCol(\"scaled\")\nval lr = new LogisticRegression().setFeaturesCol(\"scaled\")\nval pipeline = new Pipeline().setStages(Array(scaler, lr))\nval model = pipeline.fit(data)\n\nval pfa = toPFA(model, data.schema, true)\nprintln(pfa)\n\n```\n\nSupport for more natural implicit conversions is also in progress.\n\n## Scoring exported models\n\nTo score exported models, use a reference PFA scoring engine in Java, Python or R from the\n[Hadrian project](https://github.com/opendatagroup/hadrian). *Note* for the JVM engine you will \nneed to install the `daily` branch build (see the [instructions below](#running-the-tests)).\n\nFor example, using the Hadrian JVM engine (in Scala). 
You can add the Hadrian jar to the driver classpath\n\n```\n$SPARK_HOME/bin/spark-shell --driver-class-path /PATH_TO_AARDPFARK_JAR/aardpfark-assembly-0.1.0-SNAPSHOT.jar:/PATH_TO_HADRIAN_JAR/hadrian-0.8.5.jar\n```\n\nand execute the following:\n\n```scala\nimport com.opendatagroup.hadrian.jvmcompiler.PFAEngine\nimport com.ibm.aardpfark.spark.ml.SparkSupport.toPFA\n\nimport org.apache.spark.ml.classification._\n\nval data = spark.read.format(\"libsvm\").load(\"data/sample_multiclass_classification_data.txt\")\nval lr = new LogisticRegression()\nval model = lr.fit(data)\n\nval pfa = toPFA(model, true)\nval engine = PFAEngine.fromJson(pfa, multiplicity = 1).head\nval input = \"\"\"{\"features\":[-0.222222,0.5,-0.762712,-0.833333]}\"\"\"\nprintln(engine.action(engine.jsonInput(input)))\n```\n\nYou should see the result returned as JSON:\n\n```json\n{\n   \"rawPrediction\":[\n      -80.61228861915214,\n      100.66271325935413,\n      -20.050424640201975\n   ],\n   \"prediction\":1.0,\n   \"probability\":[\n      1.8761474921138084E-79,\n      1.0,\n      3.7579441119545976E-53\n   ]\n}\n```\n\nCheck out the aardpfark test cases to see further examples. We are working on adding more detailed \nexamples and benchmarks.\n\n## Running the tests\n\n*Note* `aardpfark` tests depend on the JVM reference implementation of a PFA scoring engine: [Hadrian](https://github.com/opendatagroup/hadrian).\nHadrian has not yet published a version supporting Scala 2.11 to Maven, so you will need to install the \n`daily` branch to run the tests.\n\nInstall Hadrian using the following steps:\n\n1. Clone the repo: `git clone https://github.com/opendatagroup/hadrian.git`\n2. Change to the cloned `hadrian` sub-directory: `cd hadrian/hadrian`\n3. Check out the `daily` branch: `git checkout daily`\n4. Install locally using Maven: `mvn install`\n\nRun tests using `sbt test`. 
The test cases include checking equivalence between what Spark ML components produce and\nwhat PFA produces.\n\n# Coverage\n\nAardpfark aims to provide complete coverage of all Spark ML components. The current coverage status\nis listed below.\n\n**NOTE** export to PFA is for Models and Transformers only (not Estimators)\n\n| Component | Status |\n| --- | --- |\n| _**Predictors**_ |  |\n| Logistic Regression | Supported |\n| LinearSVC | Supported |\n| Linear Regression | Supported |\n| Generalized Linear Model | Supported |\n| Multilayer Perceptron | Supported |\n| Decision Tree Classifier \u0026 Regressor | Supported |\n| Gradient Boosted Tree Classifier \u0026 Regressor | Supported |\n| Naive Bayes | Supported |\n| OneVsRest | Not yet |\n| AFTSurvivalRegression | Not yet |\n| IsotonicRegression | Not yet |\n| _**Clustering**_ |  |\n| KMeans | Supported |\n| Bisecting KMeans | Not yet |\n| LDA | Not yet |\n| Gaussian Mixture | Not yet |\n| _**Recommendations**_ |  |\n| ALS | Not yet |\n| _**Feature Extractors**_ |  |\n| CountVectorizerModel | Supported |\n| IDFModel | Supported |\n| Word2Vec | Not yet |\n| HashingTF | Not yet |\n| FeatureHasher | Not yet |\n| _**Feature Transformers**_ |  |\n| Binarizer | Supported |\n| Bucketizer | Supported |\n| ElementwiseProduct | Supported |\n| MaxAbsScalerModel | Supported |\n| MinMaxScalerModel | Supported |\n| NGram | Supported |\n| Normalizer | Supported |\n| PCAModel | Supported |\n| QuantileDiscretizer | Supported |\n| RegexTokenizer | Supported |\n| StandardScalerModel | Supported |\n| StopWordsRemover | Supported |\n| StringIndexerModel | Supported |\n| VectorAssembler | Supported |\n| OneHotEncoderModel | Not yet |\n| PolynomialExpansion | Not yet |\n| IndexToString | Not yet |\n| VectorIndexer | Not yet |\n| Imputer | Not yet |\n| Interaction | Not yet |\n| VectorSizeHint | Won't support (TBD) |\n| DCT | Won't support |\n| SQLTransformer | Won't support |\n| _**Feature Selectors**_ |  |\n| ChiSqSelectorModel | 
Supported |\n| VectorSlicer | Supported |\n| RFormula | Not yet |\n| _**LSH**_ |  |\n| LSH transformers | Not yet |\n\n\n# Roadmap\n\nImmediate objectives include:\n* Complete adding support for Spark ML components together with tests\n* Complete Scala DSL and tests\n* Add PySpark support\n* Improve the existing test coverage\n* Improve the pipeline support\n\nLonger term objectives include:\n\n* Add support for other ML libraries, starting with scikit-learn\n* Add support for generic vectors (mixed sparse/dense)\n\n\n# Contributing\n\nWe welcome contributions - whether it be adding or improving documentation and examples, adding support \nfor missing Spark ML components, or any other item on the roadmap above. See [CONTRIBUTING](CONTRIBUTING.md)\nfor details and open an issue or pull request.\n\n# License\n\nAardpfark is released under an [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0.html) (see [LICENSE](LICENSE)).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcodait%2Faardpfark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcodait%2Faardpfark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcodait%2Faardpfark/lists"}