{"id":18644110,"url":"https://github.com/indix/sparkplug","last_synced_at":"2025-04-11T12:30:59.854Z","repository":{"id":57725205,"uuid":"106816720","full_name":"indix/sparkplug","owner":"indix","description":"Spark package to \"plug\" holes in data using SQL based rules ⚡️ 🔌 ","archived":false,"fork":false,"pushed_at":"2020-05-15T09:11:07.000Z","size":515,"stargazers_count":28,"open_issues_count":0,"forks_count":2,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-03-25T13:39:27.744Z","etag":null,"topics":["datapipeline","spark","spark-sql"],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/indix.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-10-13T11:35:57.000Z","updated_at":"2023-01-06T02:07:41.000Z","dependencies_parsed_at":"2022-09-11T17:22:15.113Z","dependency_job_id":null,"html_url":"https://github.com/indix/sparkplug","commit_stats":null,"previous_names":[],"tags_count":8,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/indix%2Fsparkplug","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/indix%2Fsparkplug/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/indix%2Fsparkplug/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/indix%2Fsparkplug/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/indix","download_url":"https://codeload.github.com/indix/sparkplug/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github",
"repositories_count":248401936,"owners_count":21097328,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["datapipeline","spark","spark-sql"],"created_at":"2024-11-07T06:10:02.424Z","updated_at":"2025-04-11T12:30:59.541Z","avatar_url":"https://github.com/indix.png","language":"Scala","readme":"# ![Sparkplug](./logo/Sparkplug-Wide.svg)\nSpark package to \"plug\" holes in data using SQL based rules. \n\n[![Build Status](https://travis-ci.org/indix/sparkplug.svg?branch=master)](https://travis-ci.org/indix/sparkplug) [![Maven](https://maven-badges.herokuapp.com/maven-central/com.indix/sparkplug_2.11/badge.svg)](http://repo1.maven.org/maven2/com/indix/sparkplug_2.11/)\n\n## Motivation\n\nAt Indix, we work with a lot of data. Our data pipelines run a wide variety of ML models against our data. There are cases where we have to \"plug\" or override certain values or predictions in our data. This maybe due to bugs or deficiencies in our current models or just the inherent quality in the source/raw data.\n\n`SparkPlug` is a rule based override system that helps us to do fixes in our data. The rules also act as central place of \"debt\" that we need to pay by doing improvements to our aglorithms and models.\n\n## Design\n\nWe came up with a system that enables engineering, customer success and product management to make the necessary fixes in the data. Using rules based on SQL conditions (WHERE clause predicate), we provided a way to override and fix values / predictions in our data. 
Each rule has the condition along with the fields that are to be overridden with their respective override values.\n\nSparkPlug is the core of this system. We also have an internal GUI-based tool that utilizes our internal data pipeline platform to sample data, apply the rules and provide a detailed view into how each rule affects the data.\n\nSparkPlug leverages Spark-SQL to do much of its work. SparkPlug is designed to run within our internal data pipeline platform as well as in standalone Spark jobs.\n\n## Getting Started\n\nLet's first look at what the rules that SparkPlug works with look like.\n\n### Rules\n\nAn example rule is given below in JSON:\n\n```json\n{\n  \"name\": \"rule1\",\n  \"version\": \"version1\",\n  \"condition\": \"title like '%iPhone%'\",\n  \"actions\": [\n    {\n      \"key\": \"title\",\n      \"value\": \"Apple iPhone\"\n    }\n  ]\n}\n```\nEach rule identifies itself with a `name`. The current version of the rule can be identified using `version`. The SQL predicate in `condition` is used to identify the applicable rows in the data. On the selected rows, the `actions` - specified via the column name in `key` and its overridden `value` - are applied to plug the data. The value is currently always specified as a string, and is internally validated and converted to the appropriate type.\n\nRules can be fed into Sparkplug as a normal jsonlines dataset that Spark can work with.\n\nSparkplug comes with a helper to deserialize JSON rules into a collection of `PlugRule` objects, as shown below:\n\n```scala\n// example of creating a Spark session\nimplicit val spark: SparkSession = SparkSession.builder\n    .config(new SparkConf())\n    .enableHiveSupport()\n    .master(\"local[*]\")\n    .getOrCreate()\n    \nval rules = spark.readPlugRulesFrom(path)\n```\nThe `rules` can now be fed into SparkPlug.\n\n### Creating a SparkPlug instance\n\nSparkPlug comes with a builder that helps you instantiate a SparkPlug object with the right settings. 
The simple way to create one is as follows:\n\n```scala\nval sparkPlug = SparkPlug.builder.create()\n```\n\nOnce we have the instance, we can \"plug\" a `DataFrame` with the rules:\n\n```scala\nsparkPlug.plug(df, rules)\n```\n\n### Rules validation\n\nThe `SparkPlug.validate` method can be used to validate the input rules against a given schema.\n\n```scala\nsparkPlug.validate(df.schema, rules) // Returns a list of validation errors, if any.\n```\n\nThe SparkPlug object can also be created with validation enabled so that rules are validated before plugging:\n\n```scala\nval sparkPlug = SparkPlug.builder.enableRulesValidation.create()\n```\n\n### Plug details\n\nTo track what changes are being made (or not) to each record, it is possible to add `PlugDetails` to every record with information on which rules were applied to it. This is disabled by default and can be enabled as follows:\n\n```scala\nval sparkPlug = SparkPlug.builder.enablePlugDetails.create()\n```\nThis adds a `plugDetails` column of type `Seq[PlugDetail]` to the DataFrame. `PlugDetail` is a simple case class, defined below:\n\n```scala\ncase class PlugDetail(name: String, version: String, fieldNames: Seq[String])\n```\n\n#### Custom plug details column\n\nBy default, plug details are added to the column `plugDetails`. This can be overridden to a different column, say `overrideDetails`, as follows:\n\n```scala\nval sparkPlug = SparkPlug.builder.enablePlugDetails(\"overrideDetails\").create()\n```\n\n#### Custom plug details schema / class\n\nBy default, plug details are of type `Seq[PlugDetail]`. 
It is possible to provide a custom type by supplying a UDF to `SparkPlug` that defines how the plug details information is to be populated into the custom type.\n\nThe following example shows how one can go about adding plug details of type `Seq[OverrideDetail]`:\n\n```scala\ncase class OverrideDetail(ruleId: Option[String],\n                          fieldNames: Seq[String],\n                          ruleVersion: Option[String])\n\ncase class TestRowWithOverrideDetails(title: String,\n                                      brand: String,\n                                      price: Int,\n                                      overrideDetails: Seq[OverrideDetail])\n\nclass CustomAddPlugDetailUDF extends AddPlugDetailUDF[OverrideDetail] {\n  override def addPlugDetails(plugDetails: Seq[Row],\n                              ruleName: String,\n                              ruleVersion: String,\n                              fields: Seq[String]) = {\n    plugDetails :+ new GenericRowWithSchema(\n      Array(ruleName, fields, ruleVersion),\n      plugDetailSchema)\n  }\n}\n```\n\nAs seen in the example above, the custom UDF inherits from `AddPlugDetailUDF[T]` and implements the `addPlugDetails` method as needed.\n\n### Working with structs\n\nIt is possible to override values within a `StructType`:\n\n```json\n{\n  \"name\": \"rule1\",\n  \"version\": \"version1\",\n  \"condition\": \"title like '%iPhone%'\",\n  \"actions\": [\n    {\n      \"key\": \"price.min\",\n      \"value\": \"100.0\"\n    },\n    {\n      \"key\": \"price.max\",\n      \"value\": \"1000.0\"\n    }\n  ]\n}\n```\n\nCurrently SparkPlug supports only one level of nesting within structs.\n\n### SQL in values\n\nValues can be literals like \"iPhone\", \"100\" or \"999.9\". SparkPlug also allows SQL within values, so that overrides can use the power of SQL and, most importantly, depend on the values of other fields. 
Values enclosed within `` ` `` (backticks) are treated as SQL:\n\n```json\n{\n  \"name\": \"rule1\",\n  \"version\": \"version1\",\n  \"condition\": \"true\",\n  \"actions\": [\n    {\n      \"key\": \"title\",\n      \"value\": \"`concat(brand, ' ', title)`\"\n    }\n  ]\n}\n```\n\nThe above rule prepends the `brand` to the `title`.\n\n### Keeping track of the old value\n\nIf the old value of an overridden field needs to be tracked, SparkPlug can be built with the `keepOldField` option set:\n\n```scala\nval sparkPlug = SparkPlug.builder.keepOldField.create()\n```\n\nThis will add, for each action, a new column named `${actionKey}_${ruleName}_old` (e.g. `title_rule1_old` for a rule named `rule1` acting on `title`).\n\n**Note**: This feature is ideal only when `SparkPlug` is used with a single rule. It adds one column per action per rule and is not recommended for production jobs. We use this feature internally when adding a rule so that we can see how each rule affects the dataset.\n","funding_links":[],"categories":["Big Data"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Findix%2Fsparkplug","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Findix%2Fsparkplug","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Findix%2Fsparkplug/lists"}