{"id":18842819,"url":"https://github.com/aamend/ml-registry","last_synced_at":"2025-04-14T07:31:56.786Z","repository":{"id":52535790,"uuid":"207323466","full_name":"aamend/ml-registry","owner":"aamend","description":"Enabling continuous delivery and improvement of Spark pipeline models through devops methodology and ML governance","archived":false,"fork":false,"pushed_at":"2021-04-26T19:29:40.000Z","size":1225,"stargazers_count":4,"open_issues_count":2,"forks_count":1,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-03-27T21:22:14.941Z","etag":null,"topics":["datascience","devops","machinelearning","maven","ml","nexus","spark"],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aamend.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-09-09T14:03:21.000Z","updated_at":"2021-03-12T09:57:58.000Z","dependencies_parsed_at":"2022-08-26T13:11:39.002Z","dependency_job_id":null,"html_url":"https://github.com/aamend/ml-registry","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aamend%2Fml-registry","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aamend%2Fml-registry/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aamend%2Fml-registry/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aamend%2Fml-registry/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/aamend","download_url":"https://codeload.github.com/aam
end/ml-registry/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248839462,"owners_count":21169817,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["datascience","devops","machinelearning","maven","ml","nexus","spark"],"created_at":"2024-11-08T02:55:48.144Z","updated_at":"2025-04-14T07:31:56.442Z","avatar_url":"https://github.com/aamend.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"## Spark ML Registry package\n\nEnabling continuous delivery and improvement of Spark pipeline models through devops methodology and ML governance.\n\n[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) [![Maven Central](https://maven-badges.herokuapp.com/maven-central/com.aamend.spark/ml-registry/badge.svg)](https://maven-badges.herokuapp.com/maven-central/com.aamend.spark/ml-registry)\n\n## Principles\n\nWe enrich Spark ML framework to enable governance of machine learning models,\nleveraging software delivery tools such as [apache maven](https://maven.apache.org/), [Ivy](http://ant.apache.org/ivy/) and [nexus](https://www.sonatype.com/product-nexus-repository). 
\n\n- Use `maven` to version a trained pipeline model and package the binary as a `.jar` file\n- Use `nexus` as a central model registry to deploy immutable ML binaries\n- Use `ivy` to load specific models to Spark context via `--packages` functionality\n\nWith a central repository for ML models, models can be continuously retrained and re-injected\ninto your operational environment, as shown in the high-level workflow below. \nOne can operate data science under full governance where each\nmodel is trusted, validated, registered, deployed and continuously improved.\n\n![ml-flow](images/ml-flow.png)\n\nKey concepts of this project are explained below\n- [Model Versioning](#model-versioning)\n- [Model Registry](#model-registry)\n\nAlternatively, jump to the usage sections\n- [Pipeline Deployment](#deploy-pipeline)\n- [Pipeline Resolution](#resolve-pipeline)\n- [Pipeline Watermark](#versioned-pipeline)\n\n### Model Versioning\n\nWe propose a naming convention for machine learning models borrowed from standard \nsoftware delivery principles (see [docs](https://docs.oracle.com/middleware/1212/core/MAVEN/maven_version.htm)), in the form of\n\n```\n[groupId]:[artifactId]:[majorVersion].[minorVersion].[buildNumber]\n```\n\n- `groupId`: The name of your organisation, using reversed domain (e.g. `com.organisation`)\n- `artifactId`: The name of your pipeline, agnostic of the modelling technique used (e.g. `customer-propensity`)\n- `majorVersion`: The major version of your model. Specific to the technique and features used\n- `minorVersion`: The minor version of your model. 
A specific configuration was used but the technique remained the same (a version increment should be backwards compatible)\n- `buildNumber`: The build number, incremented automatically whenever the same model is retrained with the same configuration and technique but on up-to-date data\n\nAn example of a valid naming convention would be\n\n```\n    com.organisation:customer-propensity:1.0.0\n    com.organisation:customer-propensity:1.0.1\n    com.organisation:customer-propensity:1.1.0\n    com.organisation:customer-propensity:2.0.0\n```\n\nThe corresponding maven dependency can be added to a java / scala based project using GAV (**G**roup **A**rtifact **V**ersion) coordinates\n\n```xml\n\u003cdependency\u003e\n    \u003cgroupId\u003ecom.organisation\u003c/groupId\u003e\n    \u003cartifactId\u003ecustomer-propensity\u003c/artifactId\u003e\n    \u003cversion\u003e2.0.0\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\n### Model Registry\n\nWe use nexus as a central model registry.\nSetting up nexus is relatively easy, and nexus should already be a de facto standard in your organisation. \nThe project requires a `maven2` release repository to host versioned pipeline models, as for any standard Java dependency.\nIt also requires an HTTP connection between the edge nodes used to run Spark jobs and the nexus interface.\n\n![ml-registry](images/ml-registry.png)\n\nNote that we purposely did not enable the `SNAPSHOT` feature for machine learning models, as we consider each iteration \nof a model an immutable release, hence with a different build number.\n \nTo operate under full governance, it is advised to use multiple repositories where only validated\nmodels (e.g. 
validated through a QA process) can be promoted from one to another via the nexus [staging release process](https://help.sonatype.com/repomanager2/staging-releases)\n\n## Usage\n\nAvailable as a [spark package](https://spark-packages.org/package/aamend/ml-registry), include this package in your Spark application as follows\n\n```shell script\n$ spark-shell --packages com.aamend.spark:ml-registry:latest.release\n```\n\n### Deploy Pipeline\n\nInspired by the scikit-learn project, spark ML relies on Pipeline to execute machine learning workflows at scale.\nSpark stores binaries in a \"crude\" way by serializing model metadata to a given path (hdfs or s3).\n\n```scala\nimport org.apache.spark.ml.{Pipeline, PipelineModel}\n\nval pipeline: Pipeline = new Pipeline().setStages(stages)\nval model: PipelineModel = pipeline.fit(df)\nmodel.save(\"/path/to/hdfs\")\n```\n\nWe propose the following changes\n\n```scala\nimport com.aamend.spark.ml._\nmodel.deploy(\"com.aamend.spark:hello-world:1.0\")\n```\n\nThis process will\n \n- Serialize the model to disk as per the standard ML pipeline `save` function\n- Package the pipeline model as a `.jar` file \n- Work out the relevant build number for artifact `com.aamend.spark:hello-world` given a major and minor version\n- Upload model `com.aamend.spark:hello-world:latest` to nexus\n\nNexus authentication is enabled by passing an `application.conf` to your spark context or application classpath\n\n```shell script\n$ spark-shell \\\n  --files application.conf \\\n  --driver-java-options -Dconfig.file=application.conf \\\n  --packages com.aamend.spark:ml-registry:latest.release\n```\n\nConfiguration needs to contain the following information. \nNote that we highly recommend enabling [user tokens settings on nexus](https://help.sonatype.com/repomanager3/security/security-setup-with-user-tokens#SecuritySetupwithUserTokens-EnablingandResettingUserTokens) \nto encrypt username / password. 
\n\n```shell script\nmodel {\n    repository {\n        id: \"ml-registry\"\n        url: \"http://localhost:8081/repository/ml-registry/\"\n        username: \"5gEa1ez2\"\n        password: \"Rl5PpGxICA-vh8-cghkJoq3i3tWAmKJtqgOoYpZqhh-f\"\n    }\n}\n```\n\nAlternatively, one can pass nexus credentials to `deploy` function explicitly\n\n```scala\nimport com.aamend.spark.ml._\nMLRegistry.deploy(\n  model = model,\n  gav = \"com.aamend.spark:hello-world:1.0\",\n  repoId = \"ml-registry\",\n  repoURL = \"http://localhost:8081/repository/ml-registry/\",\n  repoUsername = \"5gEa1ez2\",\n  repoPassword = \"Rl5PpGxICA-vh8-cghkJoq3i3tWAmKJtqgOoYpZqhh-f\"\n)\n```\n\nResulting artifact will be released to nexus and - as such - considered as immutable across multiple environments.\n\n![ml-version](images/model-versioned.png)\n\nWe also extract all parameters used across ML pipeline and report those as part of artifact metadata on nexus\n\n![ml-metadata](images/model-metadata.png)\n\n### Resolve Pipeline\n\nGiven that we consider each pipeline model as a standard maven dependency available on nexus, \nwe can leverage Spark Ivy functionality (through `--packages`) to inject our model as a dependency to a spark context. \nNote that one needs to pass specific ivy settings to point to their internal nexus repository. \nAn example of `ivysettings.xml` can be found [here](examples/ivysettings.xml)\n\n```shell script\n$ spark-shell \\\n  --conf spark.jars.ivySettings=ivysettings.xml \\\n  --packages com.aamend.spark:hello-world:latest.release\n```\n\nBy specifying `latest.release` instead of specific version, Ivy framework will ensure latest version of a \nmodel is resolved and loaded, paving the way to online machine learning. 
\nUnder the hood, we read pipeline metadata from classpath, \nstore binaries to disk and load pipeline model through native spark `load` function.\n\n```scala\nimport com.aamend.spark.ml._\nval model: PipelineModel = MLRegistry.resolve(\"hello-world\")\n```\n\nNote that you do not need to add this specific project `com.aamend.spark:ml-registry` at runtime since it has been added to your model `pom.xml` specs, hence will be resolved as a transitive dependency - magic!\n\n### Versioned Pipeline\n\nIn order to guarantee model reproducibility, we have introduced `VersionedPipeline`, a new type of pipeline object \nthat appends model version as published to nexus. \n\n```scala\nimport com.aamend.spark.ml._\nval pipeline: Pipeline = new VersionedPipeline().setWatermarkCol(\"pipeline\").setStages(stages)\nval model: PipelineModel = pipeline.fit(df)\nMLRegistry.deploy(model, \"com.aamend.spark:hello-world:1.0\")\n```\n\nSchema is attached to pipeline object and enriched at deployment time with corresponding maven version\n\n```scala\nimport com.aamend.spark.ml._\nval model: PipelineModel = MLRegistry.resolve(\"hello-world\")\nmodel.transform(df).select(\"id\", \"pipeline\").show()\n```\n\nFor each record, we know the exact version of the model used. 
\n\n```\n+---+----------------------------------+\n|id |pipeline                          |\n+---+----------------------------------+\n|4  |com.aamend.spark:hello-world:1.0.0|\n|5  |com.aamend.spark:hello-world:1.0.0|\n|6  |com.aamend.spark:hello-world:1.0.0|\n|7  |com.aamend.spark:hello-world:1.0.0|\n+---+----------------------------------+\n```\n\nIdeally, this extra information at the record level should serve model monitoring, covering the user story below\n\n+ **As a** data scientist\n+ **I want to** be informed whenever the performance of my model degrades\n+ **so that** I can retrain a new model and deploy it via the above methodology\n\nNote that this also paves the way for a *champion/challenger* approach where two versions of a model run concurrently and are monitored simultaneously.\n\n## Backlog\n\n- serialize models using [MLeap](https://github.com/combust/mleap) to be used outside of a spark context\n\n## Install\n\n```shell script\nmvn clean package -Plocal\n```\n\n### Author\n\n[Antoine Amend](mailto:antoine.amend@gmail.com)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faamend%2Fml-registry","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faamend%2Fml-registry","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faamend%2Fml-registry/lists"}