{"id":15208897,"url":"https://github.com/chermenin/spark-states","last_synced_at":"2025-04-05T20:06:19.771Z","repository":{"id":57742670,"uuid":"144570004","full_name":"chermenin/spark-states","owner":"chermenin","description":"Custom state store providers for Apache Spark","archived":false,"fork":false,"pushed_at":"2025-02-14T09:59:53.000Z","size":273,"stargazers_count":92,"open_issues_count":0,"forks_count":26,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-03-29T19:04:04.836Z","etag":null,"topics":["apache","apache-spark","spark","spark-streaming","spark-structured-streaming","state","state-store","stateful","structured-streaming"],"latest_commit_sha":null,"homepage":"http://code.chermenin.ru/spark-states/","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/chermenin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"github":"chermenin","liberapay":"chermenin","issuehunt":"chermenin"}},"created_at":"2018-08-13T11:22:42.000Z","updated_at":"2025-02-14T09:59:57.000Z","dependencies_parsed_at":"2025-02-25T08:00:51.971Z","dependency_job_id":"96e589ea-7e86-4027-a991-062b68ed9ae4","html_url":"https://github.com/chermenin/spark-states","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chermenin%2Fspark-states","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chermenin%2Fspark-states/tags","releases_url":"https://repos.ecosyste.ms/api/v1/
hosts/GitHub/repositories/chermenin%2Fspark-states/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chermenin%2Fspark-states/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/chermenin","download_url":"https://codeload.github.com/chermenin/spark-states/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247393569,"owners_count":20931812,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache","apache-spark","spark","spark-streaming","spark-structured-streaming","state","state-store","stateful","structured-streaming"],"created_at":"2024-09-28T07:03:37.440Z","updated_at":"2025-04-05T20:06:19.740Z","avatar_url":"https://github.com/chermenin.png","language":"Scala","funding_links":["https://github.com/sponsors/chermenin","https://liberapay.com/chermenin","https://issuehunt.io/r/chermenin"],"categories":[],"sub_categories":[],"readme":"## Custom state store providers for Apache Spark\n\n[![Build Status](https://travis-ci.org/chermenin/spark-states.svg?branch=master)](https://travis-ci.org/chermenin/spark-states)\n[![CodeFactor](https://www.codefactor.io/repository/github/chermenin/spark-states/badge)](https://www.codefactor.io/repository/github/chermenin/spark-states)\n[![codecov](https://codecov.io/gh/chermenin/spark-states/branch/master/graph/badge.svg)](https://codecov.io/gh/chermenin/spark-states)\n[![Maven 
Central](https://img.shields.io/maven-central/v/ru.chermenin/spark-states_2.12.svg)](https://central.sonatype.com/search?q=g%3Aru.chermenin++spark-states_*)\n[![javadoc](https://javadoc.io/badge2/ru.chermenin/spark-states_2.12/javadoc.svg)](https://javadoc.io/doc/ru.chermenin/spark-states_2.12/latest/ru/chermenin/spark/sql/execution/streaming/state/RocksDbStateStoreProvider.html)\n\nState management extensions for Apache Spark to keep data across micro-batches during stateful stream processing.\n\n### Motivation\n\nOut of the box, Apache Spark has only one implementation of state store providers: `HDFSBackedStateStoreProvider`, which keeps all of the state data in memory and is therefore very memory-intensive. This repository and its custom state store providers were created to avoid `OutOfMemory` errors.\n\n### Usage\n\nTo use the custom state store provider in your pipelines, add the following configuration to the submit script or SparkConf:\n\n    --conf spark.sql.streaming.stateStore.providerClass=\"ru.chermenin.spark.sql.execution.streaming.state.RocksDbStateStoreProvider\"\n\nHere is some more information about it: https://docs.databricks.com/spark/latest/structured-streaming/production.html\n\nAlternatively, you can use the `useRocksDBStateStore()` helper method in your application while creating the SparkSession:\n\n```\nimport org.apache.spark.sql.SparkSession\nimport ru.chermenin.spark.sql.execution.streaming.state.implicits._\n\nval spark = SparkSession.builder().master(...).useRocksDBStateStore().getOrCreate()\n```\n\nNote: For the helper methods to be available, you must import the implicits as shown above.\n\n\n### State Timeout\n\nWith semantics similar to those of `GroupState`/`flatMapGroupsWithState`, state timeout features have been built directly into the custom state store.\n\nImportant points to note when using state timeouts:\n\n * Timeouts can be set differently for each streaming query. 
This relies on the query's `queryName` and `checkpointLocation`.\n * The trigger interval of a streaming query does not have to match the state expiration; the two can be set independently.\n * Timeouts are currently based on processing time.\n * An entry times out once a fixed duration has elapsed since\n    1) its creation, or\n    2) the most recent replacement (update) of its value, or\n    3) its last access.\n * Unlike `GroupState`, the timeout **is not** eventual, as it is independent of query progress.\n * Since the processing time timeout is based on clock time, it is affected by variations in the system clock (e.g. time zone changes, clock skew).\n * Strict expiration can optionally be enabled at a slight cost of memory. More info [here](https://github.com/chermenin/spark-states/issues/1).\n\nThere are two ways to configure the state timeout:\n\n1. Via additional configuration on SparkConf:\n\n   To set a processing time timeout for all streaming queries in strict mode:\n   ```\n   --conf spark.sql.streaming.stateStore.stateExpirySecs=5\n   --conf spark.sql.streaming.stateStore.strictExpire=true\n   ```\n\n   To configure a different state timeout for each query, modify the above configs as follows:\n   ```\n   --conf spark.sql.streaming.stateStore.stateExpirySecs.queryName1=5\n   --conf spark.sql.streaming.stateStore.stateExpirySecs.queryName2=10\n       ...\n       ...\n   --conf spark.sql.streaming.stateStore.strictExpire=true\n   ```\n\n2. 
Via the `stateTimeout()` helper method _(recommended)_:\n\n   ```\n   import org.apache.spark.sql.{DataFrame, SparkSession}\n   import org.apache.spark.sql.streaming.Trigger\n   import ru.chermenin.spark.sql.execution.streaming.state.implicits._\n\n   val spark: SparkSession = ...\n   val streamingDF: DataFrame = ...\n\n   streamingDF.writeStream\n         .format(...)\n         .outputMode(...)\n         .trigger(Trigger.ProcessingTime(1000L))\n         .queryName(\"myQuery1\")\n         .option(\"checkpointLocation\", \"chkpntloc\")\n         .stateTimeout(spark.conf, expirySecs = 5)\n         .start()\n\n   spark.streams.awaitAnyTermination()\n   ```\n\n   Preferably, set the `queryName` and `checkpointLocation` directly via the `stateTimeout()` method, as below:\n   ```\n   streamingDF.writeStream\n         .format(...)\n         .outputMode(...)\n         .trigger(Trigger.ProcessingTime(1000L))\n         .stateTimeout(spark.conf, queryName = \"myQuery1\", expirySecs = 5, checkpointLocation = \"chkpntloc\")\n         .start()\n   ```\n\nNote: If `queryName` is invalid or unavailable, the streaming query is tagged as `UNNAMED` and its timeout is taken from `spark.sql.streaming.stateStore.stateExpirySecs` (which defaults to -1, but can be overridden via SparkConf).\n\nOther points related to state timeout (applicable at both the global and the query level):\n * For no timeout, i.e. infinite state, set `spark.sql.streaming.stateStore.stateExpirySecs=-1`\n * For stateless processing, i.e. no state, set `spark.sql.streaming.stateStore.stateExpirySecs=0`\n\n### Contributing\n\nYou're welcome to submit pull requests with any changes for this repository at any time. 
I'll be very glad to see any contributions.\n\n### License\n\nThe standard [Apache 2.0](LICENSE) license is used for this project.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchermenin%2Fspark-states","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fchermenin%2Fspark-states","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchermenin%2Fspark-states/lists"}