{"id":18810381,"url":"https://github.com/absaoss/spline-spark-agent","last_synced_at":"2025-04-05T12:06:36.627Z","repository":{"id":36961877,"uuid":"231394159","full_name":"AbsaOSS/spline-spark-agent","owner":"AbsaOSS","description":"Spline agent for Apache Spark","archived":false,"fork":false,"pushed_at":"2024-04-09T10:45:17.000Z","size":2500,"stargazers_count":168,"open_issues_count":58,"forks_count":89,"subscribers_count":17,"default_branch":"develop","last_synced_at":"2024-04-12T07:05:56.428Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://absaoss.github.io/spline/","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AbsaOSS.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2020-01-02T14:06:46.000Z","updated_at":"2024-04-14T18:39:29.224Z","dependencies_parsed_at":"2024-04-14T18:35:31.651Z","dependency_job_id":null,"html_url":"https://github.com/AbsaOSS/spline-spark-agent","commit_stats":null,"previous_names":[],"tags_count":39,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2Fspline-spark-agent","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2Fspline-spark-agent/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2Fspline-spark-agent/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2Fspline-spark-agent/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AbsaOSS","download_url":"https://codeload.github.
com/AbsaOSS/spline-spark-agent/tar.gz/refs/heads/develop","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247332605,"owners_count":20921853,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-07T23:20:02.355Z","updated_at":"2025-04-05T12:06:36.599Z","avatar_url":"https://github.com/AbsaOSS.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"Spline Agent for Apache Spark\u0026trade;\n===\n\nThe _Spline Agent for Apache Spark\u0026trade;_ is a complementary module to the [Spline project](https://absaoss.github.io/spline/)\nthat captures runtime lineage information from the Apache Spark jobs.\n\nThe agent is a Scala library that is embedded into the Spark driver, listening to Spark events, and capturing logical execution plans.\nThe collected metadata is then handed over to the lineage dispatcher, from where it can either be sent to the Spline server\n(e.g. via REST API or Kafka), or used in another way, depending on selected dispatcher type (see [Lineage Dispatchers](#dispatchers)).\n\nThe agent can be used with or without a Spline server, depending on your use case. 
See [References](#references).\n\n[![Maven Central](https://maven-badges.herokuapp.com/maven-central/za.co.absa.spline.agent.spark/agent-core_2.12/badge.svg)](https://search.maven.org/search?q=g:za.co.absa.spline.agent.spark)\n[![TeamCity build](https://teamcity.jetbrains.com/app/rest/builds/aggregated/strob:%28locator:%28buildType:%28id:OpenSourceProjects_AbsaOSS_SplineAgentSpark_AutoBuildSpark24scala212%29,branch:develop%29%29/statusIcon.svg)](https://teamcity.jetbrains.com/viewType.html?buildTypeId=OpenSourceProjects_AbsaOSS_SplineAgentSpark_AutoBuildSpark24scala212\u0026branch=develop\u0026tab=buildTypeStatusDiv)\n[![Sonarcloud Status](https://sonarcloud.io/api/project_badges/measure?project=AbsaOSS_spline-spark-agent\u0026metric=alert_status)](https://sonarcloud.io/dashboard?id=AbsaOSS_spline-spark-agent)\n[![SonarCloud Maintainability](https://sonarcloud.io/api/project_badges/measure?project=AbsaOSS_spline-spark-agent\u0026metric=sqale_rating)](https://sonarcloud.io/dashboard?id=AbsaOSS_spline-spark-agent)\n[![SonarCloud Reliability](https://sonarcloud.io/api/project_badges/measure?project=AbsaOSS_spline-spark-agent\u0026metric=reliability_rating)](https://sonarcloud.io/dashboard?id=AbsaOSS_spline-spark-agent)\n[![SonarCloud Security](https://sonarcloud.io/api/project_badges/measure?project=AbsaOSS_spline-spark-agent\u0026metric=security_rating)](https://sonarcloud.io/dashboard?id=AbsaOSS_spline-spark-agent)\n[![Docker Pulls](https://badgen.net/docker/pulls/absaoss/spline-spark-agent?icon=docker\u0026label=pulls)](https://hub.docker.com/r/absaoss/spline-spark-agent/)\n\n\n## Table of Contents\n\n\u003c!--ts--\u003e\n\n* [Versioning](#versioning)\n    * [Spark / Scala version compatibility matrix](#compat-matrix)\n* [Usage](#usage)\n    * [Selecting artifact](#selecting-artifact)\n    * [Initialization](#initialization)\n        * [Codeless](#initialization-codeless)\n        * [Programmatic](#initialization-programmatic)\n* [Configuration](#configuration)\n   
 * [Properties](#properties)\n    * [Lineage Dispatchers](#dispatchers)\n    * [Post Processing Filters](#filters)\n* [Spark features coverage](#spark-coverage)\n* [Developer documentation](#dev-doc)\n    * [Plugin API](#plugins)\n    * [Building for different Scala and Spark versions](#building)\n* [References and Examples](#references)\n\n\u003c!-- Added by: wajda, at: Fri 14 May 18:05:53 CEST 2021 --\u003e\n\n\u003c!--te--\u003e\n\n\u003ca id=\"versioning\"\u003e\u003c/a\u003e\n\n## Versioning\n\nThe Spline Spark Agent follows the [Semantic Versioning](https://semver.org/) principles.\nThe _Public API_ is defined as a set of entry-point classes (`SparkLineageInitializer`, `SplineSparkSessionWrapper`),\nextension APIs (Plugin API, filters, dispatchers), configuration properties and a set of supported Spark versions.\nIn other words, the _Spline Spark Agent Public API_ in terms of _SemVer_ covers all entities and abstractions that are designed\nto be used or extended by client applications.\n\nThe version number **does not** directly reflect the relation of the Agent to the Spline Producer API (the Spline server). Both the Spline Server and\nthe Agent are designed to be as much mutually compatible as possible, assuming long-term operation and a possibly significant gap in the server and\nthe agent release dates. Such requirement is dictated by the nature of the Agent that could be embedded into some Spark jobs and only rarely if ever\nupdated without posing a risk to stop working because of eventual Spline server update. Likewise, it should be possible to update the Agent anytime\n(e.g. 
to fix a bug or support a newer Spark version or a feature that earlier agent version didn't support) without requiring a Spline server upgrade.\n\nAlthough not required by the above statement, for minimizing user astonishment when the compatibility between too distant _Agent_ and _Server_\nversions is dropped, we'll increment the _Major_ version component.\n\n\u003ca id=\"compat-matrix\"\u003e\u003c/a\u003e\n\n### Spark / Scala version compatibility matrix\n\n|                        |         Scala 2.11         | Scala 2.12 |\n|------------------------|:--------------------------:|:----------:|\n| **Spark 2.2**          | (no SQL; no codeless init) |  \u0026mdash;   |\n| **Spark 2.3**          |     (no Delta support)     |  \u0026mdash;   |\n| **Spark 2.4**          |            Yes             |    Yes     |\n| **Spark 3.0 or newer** |          \u0026mdash;           |    Yes     |\n\n\u003ca id=\"usage\"\u003e\u003c/a\u003e\n\n## Usage\n\n\u003ca id=\"selecting-artifact\"\u003e\u003c/a\u003e\n\n### Selecting artifact\n\nThere are two main agent artifacts:\n\n- `agent-core`\n  is a Java library that you can use with any compatible Spark version. 
Use this one if you want to include Spline agent into your\n  custom Spark application, and you want to manage all transitive dependencies yourself.\n\n- `spark-spline-agent-bundle`\n  is a fat jar that is designed to be embedded into the Spark driver, either by manually copying it to the Spark's `/jars` directory,\n  or by using `--jars` or `--packages` argument for the `spark-submit`, `spark-shell` or `pyspark` commands.\n  This artifact is self-sufficient and is **aimed to be used by most users**.\n\nBecause the bundle is pre-built with all necessary dependencies, it is important to select a proper version of it that matches the minor Spark\nand Scala versions of your target Spark installation.\n\n```\nspark-A.B-spline-agent-bundle_X.Y.jar\n```\n\nhere `A.B` is the first two Spark version numbers and `X.Y` is the first two Scala version numbers.\nFor example, if you have Spark 2.4.4 pre-built with Scala 2.12.10 then select the following agent bundle:\n\n```\nspark-2.4-spline-agent-bundle_2.12.jar\n```\n\n**AWS Glue Note**: dependency on `org.yaml:snakeyaml:1.33` is **missing** in Glue flavour of Spark. 
Please add this dependency to the classpath.\n\n\u003ca id=\"initialization\"\u003e\u003c/a\u003e\n\n### Initialization\n\nSpline agent is basically a Spark query listener that needs to be registered in a Spark session before it can be used.\nDepending on whether you are using it as a library in your custom Spark application or as a standalone bundle, you can choose\none of the following initialization approaches.\n\n\u003ca id=\"initialization-codeless\"\u003e\u003c/a\u003e\n\n#### Codeless Initialization\n\nThis is the most convenient approach and suits the majority of use cases.\nSimply include the Spline listener in the `spark.sql.queryExecutionListeners` config property\n(see [Static SQL Configuration](https://spark.apache.org/docs/latest/configuration.html#static-sql-configuration)).\n\nExample:\n\n```bash\npyspark \\\n  --packages za.co.absa.spline.agent.spark:spark-2.4-spline-agent-bundle_2.12:\u003cVERSION\u003e \\\n  --conf \"spark.sql.queryExecutionListeners=za.co.absa.spline.harvester.listener.SplineQueryExecutionListener\" \\\n  --conf \"spark.spline.lineageDispatcher.http.producer.url=http://localhost:9090/producer\"\n```\n\nThe same approach works for the `spark-submit` and `spark-shell` commands.\n\n**Note**: all Spline properties set via Spark conf should be prefixed with `spark.` in order to be visible to the Spline agent.  \nSee the [Configuration](#configuration) section for details.\n\n\u003ca id=\"initialization-programmatic\"\u003e\u003c/a\u003e\n\n#### Programmatic Initialization\n\n**Note**: starting from Spline 0.6 most agent components can be configured or even replaced in a declarative manner\neither using [Configuration](#configuration) or [Plugin API](#plugins). 
So normally there should be no need to use a programmatic initialization\nmethod.\n**We recommend using [Codeless Initialization](#initialization-codeless) instead**.\n\nIf, for some reason, Codeless Initialization doesn't fit your needs, or you want to customize the Spline agent further,\nyou can use the programmatic initialization method.\n\n```Scala\n// given a Spark session ...\nval sparkSession: SparkSession = ???\n\n// ... enable data lineage tracking with Spline\nimport za.co.absa.spline.harvester.SparkLineageInitializer._\nsparkSession.enableLineageTracking()\n\n// ... then run some Dataset computations as usual.\n// The lineage will be captured and sent to the configured Spline Producer endpoint.\n```\n\nor in Java syntax:\n\n```java\nimport za.co.absa.spline.harvester.SparkLineageInitializer;\n// ...\nSparkLineageInitializer.enableLineageTracking(session);\n```\n\nThe method `enableLineageTracking()` accepts an optional `AgentConfig` object that can be used to customize Spline behavior.\nThis is an alternative way to configure Spline. The other one is via the [property based configuration](#configuration).\n\nAn instance of `AgentConfig` can be created by using a builder or one of the factory methods.\n\n```scala\n// from a sequence of key-value pairs \nval config = AgentConfig.from(???: Iterable[(String, Any)])\n\n// from a Commons Configuration\nval config = AgentConfig.from(???: org.apache.commons.configuration.Configuration)\n\n// using a builder\nval config = AgentConfig.builder()\n  // call some builder methods here...\n  .build()\n\nsparkSession.enableLineageTracking(config)\n```\n\n**Note**: the `AgentConfig` object doesn't override the standard configuration stack. 
Instead, it serves as an additional configuration means\nwhose precedence lies between the `spline.yaml` and `spline.default.yaml` files (see below).\n\n\u003ca id=\"configuration\"\u003e\u003c/a\u003e\n\n## Configuration\n\nThe agent looks for configuration in the following sources (listed in order of precedence, highest first):\n\n- Hadoop configuration (`core-site.xml`)\n- Spark configuration\n- JVM system properties\n- `spline.properties` file on the classpath\n- `spline.yaml` file on the classpath\n- `AgentConfig` object\n- `spline.default.yaml` file on the classpath\n\nThe file [spline.default.yaml](core/src/main/resources/spline.default.yaml) contains default values\nfor all Spline properties along with additional documentation.\nIt's a good idea to look in the file to see what properties are available.\n\nThe order of precedence might look counter-intuitive, as one would expect that an explicitly provided config (`AgentConfig` instance) should\noverride ones defined in the outer scope. However, prioritizing global config over local config makes it easier to manage Spline settings centrally\non clusters, while still allowing room for customization by job developers.\n\nFor example, a company could require lineage metadata from jobs executed on a particular cluster to be sanitized, enhanced with some metrics\nand credentials and stored in a certain metadata store (a database, file, Spline server etc). In that case the Spline configuration needs to be set globally\nand applied to all Spark jobs automatically. 
However, some jobs might contain hardcoded properties that the developers used locally or in\na testing environment and forgot to remove before submitting the jobs to production.\nIn such a situation we want cluster settings to have precedence over the job settings.\nSince hardcoded settings would most likely be defined in the `AgentConfig` object, a property file or JVM properties,\nthe cluster-wide settings can be defined in the Spark config or Hadoop config, which take precedence.\n\nIf a property is defined in multiple sources, the first occurrence wins, but the `spline.lineageDispatcher` and `spline.postProcessingFilter` properties\nare composed instead. E.g. if a _LineageDispatcher_ is set to be _Kafka_ in one config source and _Http_ in another, they would be implicitly\nwrapped by a composite dispatcher, so both would be called in the order corresponding to the config source precedence.\nSee `CompositeLineageDispatcher` and `CompositePostProcessingFilter`.\n\nEvery config property is resolved independently. So, for instance, if a `DataSourcePasswordReplacingFilter` is used, some of its properties might be\ntaken from one config source and the others from another, according to the conflict resolution rules described above.\nThis allows administrators to tweak settings of individual Spline components (filters, dispatchers or plugins) without having to redefine and override\nthe whole piece of configuration for a given component.\n\n\u003ca id=\"properties\"\u003e\u003c/a\u003e\n\n### Properties\n\n#### `spline.mode`\n\n- `ENABLED` [default]\n\n  Spline will try to initialize itself, but if it fails it switches to DISABLED mode,\n  allowing the Spark application to proceed normally without lineage tracking.\n\n- `DISABLED`\n\n  Lineage tracking is completely disabled and Spline is unhooked from Spark.\n\n#### `spline.lineageDispatcher`\n\nThe logical name of the root lineage dispatcher. 
See the [Lineage Dispatchers](#dispatchers) section.\n\n#### `spline.postProcessingFilter`\n\nThe logical name of the root post-processing filter. See the [Post Processing Filters](#filters) section.\n\n\u003ca id=\"dispatchers\"\u003e\u003c/a\u003e\n\n### Lineage Dispatchers\n\nThe `LineageDispatcher` trait is responsible for sending out the captured lineage information.\nBy default, the `HttpLineageDispatcher` is used, which sends the lineage data to the Spline REST endpoint (see Spline Producer API).\n\nAvailable dispatchers:\n\n- `HttpLineageDispatcher` - sends lineage to an HTTP endpoint\n- `KafkaLineageDispatcher` - sends lineage to a Kafka topic\n- `ConsoleLineageDispatcher` - writes lineage to the console\n- `LoggingLineageDispatcher` - logs lineage using the Spark logger\n- `FallbackLineageDispatcher` - sends lineage to a fallback dispatcher if the primary one fails\n- `CompositeLineageDispatcher` - combines multiple dispatchers to send lineage to multiple endpoints\n\nEach dispatcher can have different configuration parameters.\nTo keep the configs clearly separated, each dispatcher has its own namespace in which all its parameters are defined.\nThe following Kafka example illustrates this.\n\nFirst, select the dispatcher:\n\n```properties\nspline.lineageDispatcher=kafka\n```\n\nOnce you have defined the dispatcher, all of its other parameters share the namespace prefix `spline.lineageDispatcher.{{dispatcher-name}}.`.\nIn this case it is `spline.lineageDispatcher.kafka.`.\n\nTo find out which parameters you can use, look into `spline.default.yaml`. For Kafka, you have to define at least these two properties:\n\n```properties\nspline.lineageDispatcher.kafka.topic=foo\nspline.lineageDispatcher.kafka.producer.bootstrap.servers=localhost:9092\n```\n\n#### Using the Http Dispatcher\n\nThis dispatcher is used by default. 
The only mandatory configuration is the URL of the Producer API REST endpoint\n(`spline.lineageDispatcher.http.producer.url`).\nAdditionally, timeouts, apiVersion and multiple custom headers can be set.\n\n```properties\nspline.lineageDispatcher.http.producer.url=\nspline.lineageDispatcher.http.timeout.connection=2000\nspline.lineageDispatcher.http.timeout.read=120000\nspline.lineageDispatcher.http.apiVersion=LATEST\nspline.lineageDispatcher.http.header.X-CUSTOM-HEADER=custom-header-value\n```\n\nIf the producer requires token based authentication for requests, the following properties must be included in the configuration.\n\n```properties\nspline.lineageDispatcher.http.authentication.type=OAUTH\nspline.lineageDispatcher.http.authentication.grantType=client_credentials\nspline.lineageDispatcher.http.authentication.clientId=\u003cclient_id\u003e\nspline.lineageDispatcher.http.authentication.clientSecret=\u003csecret\u003e\nspline.lineageDispatcher.http.authentication.scope=\u003cscope\u003e\nspline.lineageDispatcher.http.authentication.tokenUrl=\u003ctoken_url\u003e\n```\n\nExample: the API key header for an Azure HTTP trigger template can be set like this:\n\n```properties\nspline.lineageDispatcher.http.header.X-FUNCTIONS-KEY=USER_API_KEY\n```\n\nExample: the AWS REST API key header can be set like this:\n\n```properties\nspline.lineageDispatcher.http.header.X-API-Key=USER_API_KEY\n```\n\n#### Using the Fallback Dispatcher\n\nThe `FallbackLineageDispatcher` is a proxy dispatcher that sends lineage to the primary dispatcher first and, _if_ there is an error,\ncalls the fallback one.\n\nIn the following example the `HttpLineageDispatcher` will be used as the primary, and the `ConsoleLineageDispatcher` as the fallback.\n\n```properties\nspline.lineageDispatcher=fallback\nspline.lineageDispatcher.fallback.primaryDispatcher=http\nspline.lineageDispatcher.fallback.fallbackDispatcher=console\n```\n\n#### Using the Composite Dispatcher\n\nThe `CompositeLineageDispatcher` is a proxy dispatcher that forwards 
lineage data to multiple dispatchers.\n\nFor example, if you want the lineage data to be sent to an HTTP endpoint and to be logged to the console at the same time, you can do the following:\n\n```properties\nspline.lineageDispatcher=composite\nspline.lineageDispatcher.composite.dispatchers=http,console\n```\n\nBy default, if some dispatchers in the list fail, the others are still attempted. If you want an error in any dispatcher to be treated as fatal\nand propagated to the main process, set the `failOnErrors` property to `true`:\n\n```properties\nspline.lineageDispatcher.composite.failOnErrors=true\n```\n\n#### Creating your own dispatcher\n\nYou can also create your own dispatcher. It must implement the `LineageDispatcher` trait and have a constructor\nwith a single parameter of type `org.apache.commons.configuration.Configuration`.\nTo use it, you must define its name and class, plus any other parameters it needs. For example:\n\n```properties\nspline.lineageDispatcher=my-dispatcher\nspline.lineageDispatcher.my-dispatcher.className=org.example.spline.MyDispatcherImpl\nspline.lineageDispatcher.my-dispatcher.prop1=value1\nspline.lineageDispatcher.my-dispatcher.prop2=value2\n```\n\n#### Combining dispatchers (complex example)\n\nIf needed, you can combine multiple dispatchers into a single one using `CompositeLineageDispatcher` and `FallbackLineageDispatcher`\nin any combination.\n\nIn the following example the lineage will first be sent to the HTTP endpoint \"http://10.20.111.222/lineage-primary\". If that fails, it's redirected to\nthe \"http://10.20.111.222/lineage-secondary\" endpoint, and if that one fails as well, lineage is logged to the ERROR logs and the console at the 
same\ntime.\n\n```properties\nspline.lineageDispatcher.http1.className=za.co.absa.spline.harvester.dispatcher.HttpLineageDispatcher\nspline.lineageDispatcher.http1.producer.url=http://10.20.111.222/lineage-primary\n\nspline.lineageDispatcher.http2.className=za.co.absa.spline.harvester.dispatcher.HttpLineageDispatcher\nspline.lineageDispatcher.http2.producer.url=http://10.20.111.222/lineage-secondary\n\nspline.lineageDispatcher.errorLogs.className=za.co.absa.spline.harvester.dispatcher.LoggingLineageDispatcher\nspline.lineageDispatcher.errorLogs.level=ERROR\n\nspline.lineageDispatcher.disp1.className=za.co.absa.spline.harvester.dispatcher.FallbackLineageDispatcher\nspline.lineageDispatcher.disp1.primaryDispatcher=http1\nspline.lineageDispatcher.disp1.fallbackDispatcher=disp2\n\nspline.lineageDispatcher.disp2.className=za.co.absa.spline.harvester.dispatcher.FallbackLineageDispatcher\nspline.lineageDispatcher.disp2.primaryDispatcher=http2\nspline.lineageDispatcher.disp2.fallbackDispatcher=disp3\n\nspline.lineageDispatcher.disp3.className=za.co.absa.spline.harvester.dispatcher.CompositeLineageDispatcher\nspline.lineageDispatcher.disp3.dispatchers=errorLogs,console\n\nspline.lineageDispatcher=disp1\n```\n\n\u003ca id=\"filters\"\u003e\u003c/a\u003e\n\n### Post Processing Filters\n\nFilters can be used to enrich the lineage with your own custom data or to remove unwanted data like passwords.\nAll filters are applied after the Spark plan is converted to Spline DTOs, but before the dispatcher is called.\n\nFilters are registered and configured in much the same way as a `LineageDispatcher`.\nA custom filter class must implement the `za.co.absa.spline.harvester.postprocessing.PostProcessingFilter` trait and declare a constructor\nwith a single parameter of type `org.apache.commons.configuration.Configuration`.\nThen register and configure it like 
this:\n\n```properties\nspline.postProcessingFilter=my-filter\nspline.postProcessingFilter.my-filter.className=my.awesome.CustomFilter\nspline.postProcessingFilter.my-filter.prop1=value1\nspline.postProcessingFilter.my-filter.prop2=value2\n```\n\nUse the pre-registered `CompositePostProcessingFilter` to chain multiple filters:\n\n```properties\nspline.postProcessingFilter=composite\nspline.postProcessingFilter.composite.filters=myFilter1,myFilter2\n```\n\n(see `spline.default.yaml` for details and examples)\n\n#### Using MetadataCollectingFilter\n\n`MetadataCollectingFilter` provides a way to add additional data to the lineage produced by the Spline Agent.\n\nData can be added to the following lineage entities: `executionPlan`, `executionEvent`, `operation`, `read` and `write`.\n\nEach entity contains a dedicated map named `extra` that can store any additional user data.\n\n`executionPlan` and `executionEvent` have an additional map named `labels`. Labels are intended for identification and filtering on the server.\n\nExample usage:\n\n```properties\nspline.postProcessingFilter=userExtraMeta\nspline.postProcessingFilter.userExtraMeta.rules=file:///path/to/json-with-rules.json\n```\n\njson-with-rules.json could look like this:\n\n```json\n{\n    \"executionPlan\": {\n        \"extra\": {\n            \"my-extra-1\": 42,\n            \"my-extra-2\": [ \"aaa\", \"bbb\", \"ccc\" ]\n        },\n        \"labels\": {\n            \"my-label\": \"my-value\"\n        }\n    },\n    \"write\": {\n        \"extra\": {\n            \"foo\": \"extra-value\"\n        }\n    }\n}\n```\n\nThe `spline.postProcessingFilter.userExtraMeta.rules` property can be either a URL pointing to a JSON file or a JSON string.\nThe rules definition can be quite long, and providing the string directly may require a lot of escaping, so using a file is recommended.\n\nExample of escaping the rules string in a Scala String:\n```\n.config(\"spline.postProcessingFilter.userExtraMeta.rules\", 
\"{\\\"executionPlan\\\":{\\\"extra\\\":{\\\"qux\\\":42\\\\,\\\"tags\\\":[\\\"aaa\\\"\\\\,\\\"bbb\\\"\\\\,\\\"ccc\\\"]}}}\")\n```\n- `\"` needs to be escaped because it would end the string\n- `,` needs to be escaped because when passing configuration via Java properties the comma is used as a separator under the hood \n  and must be explicitly escaped.\n\nExample of escaping the rules string as VM option:\n```\n-Dspline.postProcessingFilter.userExtraMeta.rules={\\\"executionPlan\\\":{\\\"extra\\\":{\\\"qux\\\":42\\,\\\"tags\\\":[\\\"aaa\\\"\\,\\\"bbb\\\"\\,\\\"ccc\\\"]}}}\n```\n\nA convenient way how to provide rules json without need for escaping may be to specify the property in yaml config file.\nAn example of this can be seen in \n[spline examples yaml config](https://github.com/AbsaOSS/spline-spark-agent/blob/develop/examples/src/main/resources/spline.yaml).\n\n\nThere is also option to get environment variables using `$env`, jvm properties using `$jvm` and execute javascript using `$js`.\nSee the following example:\n\n```json\n{\n    \"executionPlan\": {\n        \"extra\": {\n            \"my-extra-1\": 42,\n            \"my-extra-2\": [ \"aaa\", \"bbb\", \"ccc\" ],\n            \"bar\": { \"$env\": \"BAR_HOME\" },\n            \"baz\": { \"$jvm\": \"some.jvm.prop\" },\n            \"daz\": { \"$js\": \"session.conf().get('k')\" },\n            \"appName\": { \"$js\":\"session.sparkContext().appName()\" }\n       }\n    }\n}\n```\n\nFor the javascript evaluation following variables are available by default:\n\n| variable          | Scala Type                                                 |\n|-------------------|:-----------------------------------------------------------|\n| `session`         | `org.apache.spark.sql.SparkSession`                        |\n| `logicalPlan`     | `org.apache.spark.sql.catalyst.plans.logical.LogicalPlan`  | \n| `executedPlanOpt` | `Option[org.apache.spark.sql.execution.SparkPlan]`         |\n\nUsing those objects it should be 
possible to extract almost any relevant information from Spark.\n\nThe rules can be conditional, meaning the specified params will be added only when some condition is met.\nSee the following example:\n\n```json\n{\n    \"executionEvent[@.timestamp \u003e 65]\": {\n        \"extra\": { \"tux\": 1 }\n    },\n    \"executionEvent[@.extra['foo'] == 'a' \u0026\u0026 @.extra['bar'] == 'x']\": {\n        \"extra\": { \"bux\": 2 }\n    },\n    \"executionEvent[@.extra['foo'] == 'a' \u0026\u0026 !@.extra['bar']]\": {\n        \"extra\": { \"dux\": 3 }\n    },\n    \"executionEvent[@.extra['baz'][2] \u003e= 3]\": {\n        \"extra\": { \"mux\": 4 }\n    },\n    \"executionEvent[@.extra['baz'][2] \u003c 3]\": {\n        \"extra\": { \"fux\": 5 }\n    },\n    \"executionEvent[session.sparkContext.conf['spark.ui.enabled'] == 'false']\": {\n      \"extra\": { \"tux\": 1 }\n    }\n}\n```\n\nThe condition is enclosed in `[]` after the entity name.\nHere the `@` serves as a reference to the currently processed entity, in this case `executionEvent`.\nThe `[]` inside the condition statement can also serve as a way to access maps and sequences.\nLogical and comparison operators are available.\n\nThe `session` variable and the other variables available for JavaScript evaluation can be used here as well.\n\n\nFor more usage examples please see the `MetadataCollectingFilterSpec` test class.\n\n\u003ca id=\"spark-coverage\"\u003e\u003c/a\u003e\n\n## Spark features coverage\n\n_Dataset_ operations are fully supported.\n\n_RDD_ transformations aren't supported due to Spark internal architecture specifics, but they might be supported semi-automatically\nin future Spline versions (see #33).\n\n_SQL_ dialect is mostly supported.\n\n_DDL_ operations are not supported, except for `CREATE TABLE ... 
AS SELECT ...` which is supported.\n\n**Note**: By default, the lineage is only captured on persistent (write) actions.\nTo capture in-memory actions like `collect()`, `show()` etc., the corresponding plugin needs to be activated\nby setting the following configuration property:\n\n```properties\nspline.plugins.za.co.absa.spline.harvester.plugin.embedded.NonPersistentActionsCapturePlugin.enabled=true\n```\n\n(See [spline.default.yaml](core/src/main/resources/spline.default.yaml#L230) for more information)\n\nThe following data formats and providers are supported out of the box:\n\n- Avro\n- Cassandra\n- COBOL\n- Delta\n- ElasticSearch\n- Excel\n- HDFS\n- Hive\n- JDBC\n- Kafka\n- MongoDB\n- XML\n\nAlthough Spark, being an extensible piece of software, can support much more,\nit doesn't provide any universal API that Spline can utilize to capture\nreads and writes from/to everything that Spark supports.\nSupport for most data sources and formats has to be added to Spline one by one.\nFortunately, starting with Spline 0.5.4, the auto-discoverable [Plugin API](#plugins)\nhas been introduced to make this process easier.\n\nBelow is the breakdown of the read/write command list that we have gone through.  
\nSome commands are implemented, others have yet to be implemented,\nand finally there are such that bear no lineage information and hence are ignored.\n\nAll commands inherit from `org.apache.spark.sql.catalyst.plans.logical.Command`.\n\nYou can see how to produce unimplemented commands in `za.co.absa.spline.harvester.SparkUnimplementedCommandsSpec`.\n\n\u003ca id=\"spark-coverage-done\"\u003e\u003c/a\u003e\n\n### Implemented\n\n- `CreateDataSourceTableAsSelectCommand`  (org.apache.spark.sql.execution.command)\n- `CreateHiveTableAsSelectCommand`  (org.apache.spark.sql.hive.execution)\n- `CreateTableCommand`  (org.apache.spark.sql.execution.command)\n- `DropTableCommand`  (org.apache.spark.sql.execution.command)\n- `InsertIntoDataSourceDirCommand`  (org.apache.spark.sql.execution.command)\n- `InsertIntoHadoopFsRelationCommand`  (org.apache.spark.sql.execution.datasources)\n- `InsertIntoHiveDirCommand`  (org.apache.spark.sql.hive.execution)\n- `InsertIntoHiveTable`  (org.apache.spark.sql.hive.execution)\n- `SaveIntoDataSourceCommand`  (org.apache.spark.sql.execution.datasources)\n\n\u003ca id=\"spark-coverage-todo\"\u003e\u003c/a\u003e\n\n### To be implemented\n\n- `AlterTableAddColumnsCommand`  (org.apache.spark.sql.execution.command)\n- `AlterTableChangeColumnCommand`  (org.apache.spark.sql.execution.command)\n- `AlterTableRenameCommand`  (org.apache.spark.sql.execution.command)\n- `AlterTableSetLocationCommand`  (org.apache.spark.sql.execution.command)\n- `CreateDataSourceTableCommand`  (org.apache.spark.sql.execution.command)\n- `CreateDatabaseCommand`  (org.apache.spark.sql.execution.command)\n- `CreateTableLikeCommand`  (org.apache.spark.sql.execution.command)\n- `DropDatabaseCommand`  (org.apache.spark.sql.execution.command)\n- `LoadDataCommand`  (org.apache.spark.sql.execution.command)\n- `TruncateTableCommand`  (org.apache.spark.sql.execution.command)\n\nWhen one of these commands occurs spline will let you know by logging a warning.\n\n\u003ca 
id=\"spark-coverage-ignored\"\u003e\u003c/a\u003e\n\n### Ignored\n\n- `AddFileCommand`  (org.apache.spark.sql.execution.command)\n- `AddJarCommand`  (org.apache.spark.sql.execution.command)\n- `AlterDatabasePropertiesCommand`  (org.apache.spark.sql.execution.command)\n- `AlterTableAddPartitionCommand`  (org.apache.spark.sql.execution.command)\n- `AlterTableDropPartitionCommand`  (org.apache.spark.sql.execution.command)\n- `AlterTableRecoverPartitionsCommand`  (org.apache.spark.sql.execution.command)\n- `AlterTableRenamePartitionCommand`  (org.apache.spark.sql.execution.command)\n- `AlterTableSerDePropertiesCommand`  (org.apache.spark.sql.execution.command)\n- `AlterTableSetPropertiesCommand`  (org.apache.spark.sql.execution.command)\n- `AlterTableUnsetPropertiesCommand`  (org.apache.spark.sql.execution.command)\n- `AlterViewAsCommand`  (org.apache.spark.sql.execution.command)\n- `AnalyzeColumnCommand`  (org.apache.spark.sql.execution.command)\n- `AnalyzePartitionCommand`  (org.apache.spark.sql.execution.command)\n- `AnalyzeTableCommand`  (org.apache.spark.sql.execution.command)\n- `CacheTableCommand`  (org.apache.spark.sql.execution.command)\n- `ClearCacheCommand`  (org.apache.spark.sql.execution.command)\n- `CreateFunctionCommand`  (org.apache.spark.sql.execution.command)\n- `CreateTempViewUsing`  (org.apache.spark.sql.execution.datasources)\n- `CreateViewCommand`  (org.apache.spark.sql.execution.command)\n- `DescribeColumnCommand`  (org.apache.spark.sql.execution.command)\n- `DescribeDatabaseCommand`  (org.apache.spark.sql.execution.command)\n- `DescribeFunctionCommand`  (org.apache.spark.sql.execution.command)\n- `DescribeTableCommand`  (org.apache.spark.sql.execution.command)\n- `DropFunctionCommand`  (org.apache.spark.sql.execution.command)\n- `ExplainCommand`  (org.apache.spark.sql.execution.command)\n- `InsertIntoDataSourceCommand`  (org.apache.spark.sql.execution.datasources) *\n- `ListFilesCommand`  (org.apache.spark.sql.execution.command)\n- 
`ListJarsCommand`  (org.apache.spark.sql.execution.command)\n- `RefreshResource`  (org.apache.spark.sql.execution.datasources)\n- `RefreshTable`  (org.apache.spark.sql.execution.datasources)\n- `ResetCommand$` (org.apache.spark.sql.execution.command)\n- `SetCommand`  (org.apache.spark.sql.execution.command)\n- `SetDatabaseCommand`  (org.apache.spark.sql.execution.command)\n- `ShowColumnsCommand`  (org.apache.spark.sql.execution.command)\n- `ShowCreateTableCommand`  (org.apache.spark.sql.execution.command)\n- `ShowDatabasesCommand`  (org.apache.spark.sql.execution.command)\n- `ShowFunctionsCommand`  (org.apache.spark.sql.execution.command)\n- `ShowPartitionsCommand`  (org.apache.spark.sql.execution.command)\n- `ShowTablePropertiesCommand`  (org.apache.spark.sql.execution.command)\n- `ShowTablesCommand`  (org.apache.spark.sql.execution.command)\n- `StreamingExplainCommand`  (org.apache.spark.sql.execution.command)\n- `UncacheTableCommand`  (org.apache.spark.sql.execution.command)\n\n\u003ca id=\"dev-doc\"\u003e\u003c/a\u003e\n\n## Developer documentation\n\n\u003ca id=\"plugins\"\u003e\u003c/a\u003e\n\n### Plugin API\n\nUsing the plugin API, you can capture lineage from a third-party data source provider.\nBy default, Spline discovers plugins automatically by scanning the classpath, so no special steps are required to register and configure a plugin.\nAll you need to do is create a class extending the `za.co.absa.spline.harvester.plugin.Plugin` marker trait\nmixed with one or more `*Processing` traits, depending on your needs.\n\nTo disable automatic plugin discovery and speed up initialization, set `spline.scanClasspath` to `false` in your configuration file.\nThen, you will need to register all necessary plugins one by one, using the `spline.plugins.{className}.enabled` property, e.g.:\n```properties\n# Disable automatic plugin discovery to save on bootstrap time\nspline.scanClasspath=false\n# This explicitly registers and enables a plugin with the class name 
'com.example.MyPlugin'\nspline.plugins.com.example.MyPlugin.enabled=true\n```\n\nThere are three general processing traits:\n\n- `DataSourceFormatNameResolving` - returns the name of the data provider/format in use.\n- `ReadNodeProcessing` - detects a read-command and gathers meta information.\n- `WriteNodeProcessing` - detects a write-command and gathers meta information.\n\nThere are also two additional traits that handle common cases of reading and writing:\n\n- `BaseRelationProcessing` - similar to `ReadNodeProcessing`, but instead of capturing all logical plan nodes it only reacts to `LogicalRelation`\n  (see `LogicalRelationPlugin`)\n- `RelationProviderProcessing` - similar to `WriteNodeProcessing`, but it only captures `SaveIntoDataSourceCommand`\n  (see `SaveIntoDataSourceCommandPlugin`)\n\nThe best way to illustrate how plugins work is to look at a real working example,\ne.g. [`za.co.absa.spline.harvester.plugin.embedded.JDBCPlugin`](core/src/main/scala/za/co/absa/spline/harvester/plugin/embedded/JDBCPlugin.scala)\n\nThe most common simplified pattern looks like this:\n\n```scala\npackage my.spline.plugin\n\nimport javax.annotation.Priority\nimport org.apache.spark.sql.SaveMode\nimport org.apache.spark.sql.catalyst.plans.logical.LogicalPlan\nimport org.apache.spark.sql.execution.datasources.{LogicalRelation, SaveIntoDataSourceCommand}\nimport org.apache.spark.sql.sources.BaseRelation\nimport za.co.absa.spline.harvester.builder._\nimport za.co.absa.spline.harvester.plugin.Plugin._\nimport za.co.absa.spline.harvester.plugin._\n\n@Priority(Precedence.User) // not required, but can be used to control your plugin precedence in the plugin chain. Default value is `User`.  \nclass FooBarPlugin\n  extends Plugin\n    with BaseRelationProcessing\n    with RelationProviderProcessing {\n\n  override def baseRelationProcessor: PartialFunction[(BaseRelation, LogicalRelation), ReadNodeInfo] = {\n    case (FooBarRelation(a, b, c, d), lr) if /*more conditions*/ =\u003e\n      val dataFormat: Option[AnyRef] = ??? // data format being read (will be resolved by the `DataSourceFormatResolver` later)\n      val dataSourceURI: String = ??? // a unique URI for the data source\n      val params: Map[String, Any] = ??? 
// additional parameters characterizing the read-command. E.g. (connection protocol, access mode, driver options etc)\n\n      (SourceIdentifier(dataFormat, dataSourceURI), params)\n  }\n\n  override def relationProviderProcessor: PartialFunction[(AnyRef, SaveIntoDataSourceCommand), WriteNodeInfo] = {\n    case (provider, cmd) if provider == \"foobar\" || provider.isInstanceOf[FooBarProvider] =\u003e\n      val dataFormat: Option[AnyRef] = ??? // data format being written (will be resolved by the `DataSourceFormatResolver` later)\n      val dataSourceURI: String = ??? // a unique URI for the data source\n      val writeMode: SaveMode = ??? // was it Append or Overwrite?\n      val query: LogicalPlan = ??? // the logical plan to get the rest of the lineage from\n      val params: Map[String, Any] = ??? // additional parameters characterizing the write-command\n\n      (SourceIdentifier(dataFormat, dataSourceURI), writeMode, query, params)\n  }\n}\n```\n\n**Note**: to avoid unintentionally shadowing other plugins (including future ones),\nmake sure that the pattern-matching criteria are as selective as possible for your plugin's needs.\n\nA plugin class is expected to have only a single constructor.\nThe constructor can have no arguments, or one or more arguments of the following types (the values will be autowired):\n\n- `SparkSession`\n- `PathQualifier`\n- `PluginRegistry`\n\nCompile your plugin and drop it into the Spline/Spark classpath.\nSpline will pick it up automatically.\n\n\u003ca id=\"building\"\u003e\u003c/a\u003e\n\n### Building for different Scala and Spark versions\n\n**Note:** The project requires Java version 1.8 (strictly) and [Apache Maven](https://maven.apache.org/) for building.\n\nCheck the build environment:\n\n```shell\nmvn --version\n```\n\nVerify that Maven is configured to run on Java 1.8. 
For example:\n\n```\nApache Maven 3.6.3 (Red Hat 3.6.3-8)\nMaven home: /usr/share/maven\nJava version: 1.8.0_302, vendor: Red Hat, Inc., runtime: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.302.b08-2.fc34.x86_64/jre\n```\n\nThere are several Maven profiles that make it easy to build the project with different versions of Spark and Scala.\n\n- Scala profiles: `scala-2.11`, `scala-2.12` (default)\n- Spark profiles: `spark-2.2`, `spark-2.3`, `spark-2.4` (default), `spark-3.0`, `spark-3.1`, `spark-3.2`, `spark-3.3`\n\nFor example, to build an agent for Spark 2.4 and Scala 2.11:\n\n```shell\n# Change Scala version in pom.xml.\nmvn scala-cross-build:change-version -Pscala-2.11\n\n# now you can build for Scala 2.11\nmvn clean install -Pscala-2.11,spark-2.4\n```\n\n### Build docker image\n\nThe agent docker image is mainly used to run [example jobs](examples/) and pre-fill the database with the sample lineage data.\n\n(Spline docker images are available on the DockerHub repo - https://hub.docker.com/u/absaoss)\n\n```shell\nmvn install -Ddocker -Ddockerfile.repositoryUrl=my\n```\n\nSee [How to build Spline Docker images](https://github.com/AbsaOSS/spline-getting-started/blob/main/building-docker.md) for details.\n\n\u003ca id=\"references\"\u003e\u003c/a\u003e\n\n### How to measure code coverage\n```shell\n./mvn verify -Dcode-coverage\n```\nIf a module contains measurable data, the code coverage report will be generated at the following path:\n```\n{local-path}\\spline-spark-agent\\{module}\\target\\site\\jacoco\n```\n\n## References and examples\n\nAlthough the primary goal of the Spline agent is to be used in combination with the [Spline server](https://github.com/AbsaOSS/spline),\nit is flexible enough to be used in isolation or in integration with other data lineage tracking solutions, including custom ones.\n\nBelow are a couple of examples of such integration:\n\n- [Databricks Lineage In Azure Purview](https://intellishore.dk/data-lineage-from-databricks-to-azure-purview/)\n- [Spark Compute Lineage 
to Datahub](https://firststr.com/2021/04/26/spark-compute-lineage-to-datahub)\n\n---\n\n    Copyright 2019 ABSA Group Limited\n    \n    Licensed under the Apache License, Version 2.0 (the \"License\");\n    you may not use this file except in compliance with the License.\n    You may obtain a copy of the License at\n    \n        http://www.apache.org/licenses/LICENSE-2.0\n    \n    Unless required by applicable law or agreed to in writing, software\n    distributed under the License is distributed on an \"AS IS\" BASIS,\n    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n    See the License for the specific language governing permissions and\n    limitations under the License.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabsaoss%2Fspline-spark-agent","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fabsaoss%2Fspline-spark-agent","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabsaoss%2Fspline-spark-agent/lists"}