https://github.com/chermenin/spark-states
Custom state store providers for Apache Spark
- Host: GitHub
- URL: https://github.com/chermenin/spark-states
- Owner: chermenin
- License: apache-2.0
- Created: 2018-08-13T11:22:42.000Z (almost 8 years ago)
- Default Branch: master
- Last Pushed: 2025-02-14T09:59:53.000Z (over 1 year ago)
- Last Synced: 2025-03-29T19:04:04.836Z (about 1 year ago)
- Topics: apache, apache-spark, spark, spark-streaming, spark-structured-streaming, state, state-store, stateful, structured-streaming
- Language: Scala
- Homepage: http://code.chermenin.ru/spark-states/
- Size: 267 KB
- Stars: 92
- Watchers: 7
- Forks: 26
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE
## Custom state store providers for Apache Spark
State management extensions for Apache Spark to keep data across micro-batches during stateful stream processing.
### Motivation
Out of the box, Apache Spark has only one state store provider implementation: `HDFSBackedStateStoreProvider`. It keeps all of the state data in memory, which is a very memory-intensive approach. The custom state store providers in this repository were created to avoid the resulting `OutOfMemory` errors.
### Usage
To use the custom state store provider in your pipelines, add the following configuration to the submit script or SparkConf:
--conf spark.sql.streaming.stateStore.providerClass="ru.chermenin.spark.sql.execution.streaming.state.RocksDbStateStoreProvider"
Here is some more information about it: https://docs.databricks.com/spark/latest/structured-streaming/production.html
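The same setting can also be applied programmatically on the session builder. The following is a minimal sketch, assuming the library is on the application classpath. The application name and the `local[*]` master are placeholders chosen for illustration only.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: set the provider class on the builder instead of passing --conf.
// The app name and master below are illustrative placeholders.
val spark = SparkSession.builder()
  .appName("rocksdb-state-example")
  .master("local[*]")
  .config(
    "spark.sql.streaming.stateStore.providerClass",
    "ru.chermenin.spark.sql.execution.streaming.state.RocksDbStateStoreProvider")
  .getOrCreate()
```

The provider class is a session-level setting, so it applies to every stateful streaming query started from this session.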
Alternatively, you can use the `useRocksDBStateStore()` helper method in your application when creating the SparkSession:
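```
import org.apache.spark.sql.SparkSession
import ru.chermenin.spark.sql.execution.streaming.state.implicits._
// The implicits import adds useRocksDBStateStore() to the session builder.
val spark = SparkSession.builder().master(...).useRocksDBStateStore().getOrCreate()
```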
Note: For the helper methods to be available, you must import the implicits as shown above.
### State Timeout
State timeout features are built directly into the custom state store, with semantics similar to those of `GroupState`/`flatMapGroupsWithState`.
Important points to note when using state timeouts:
* Timeouts can be set differently for each streaming query. This relies on `queryName` and its `checkpointLocation`.
* The trigger interval of a streaming query is independent of the state expiration; the two values do not have to match.
* Timeouts are currently based on processing time.
* The timeout occurs once a fixed duration has elapsed since
  1) the entry's creation, or
  2) the most recent replacement (update) of its value, or
  3) its last access.
* Unlike `GroupState`, the timeout **is not** eventual, because it is independent of query progress.
* Since the processing time timeout is based on clock time, it is affected by variations in the system clock (e.g. time zone changes or clock skew).
* Strict expiration is optional and comes at a slight cost in memory. More info [here](https://github.com/chermenin/spark-states/issues/1).
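The expiry rule described above can be illustrated with a small in-memory model. This is only a sketch of the semantics, not the library's implementation: `ExpiringStore`, its `ttlMillis` parameter and the injected `now` clock are hypothetical names, and strict mode and the RocksDB-backed storage are not modelled.

```scala
import scala.collection.mutable

// Illustrative model only, not the library's implementation:
// an entry expires once `ttlMillis` has passed since it was last written or read.
final class ExpiringStore[K, V](ttlMillis: Long, now: () => Long) {
  // key -> (value, timestamp of the last write or successful read)
  private val entries = mutable.Map.empty[K, (V, Long)]

  // Creating or replacing an entry restarts its timer.
  def put(key: K, value: V): Unit =
    entries(key) = (value, now())

  def get(key: K): Option[V] = entries.get(key) match {
    case Some((value, touched)) if now() - touched < ttlMillis =>
      entries(key) = (value, now()) // a successful read also restarts the timer
      Some(value)
    case Some(_) =>
      entries.remove(key) // expired: drop the entry
      None
    case None =>
      None
  }
}
```

Because the clock is passed in as a function, the model also shows why a processing-time timeout follows whatever the system clock reports: with a 5000 ms timeout, an entry last read at t = 8000 is still returned at t = 12999 and is gone at t = 13000.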
There are two ways to configure the state timeout:
1. Via additional configuration on SparkConf:
To set a processing time timeout for all streaming queries in strict mode:
```
--conf spark.sql.streaming.stateStore.stateExpirySecs=5
--conf spark.sql.streaming.stateStore.strictExpire=true
```
To configure a different state timeout for each query, append the query name to the configuration key:
```
--conf spark.sql.streaming.stateStore.stateExpirySecs.queryName1=5
--conf spark.sql.streaming.stateStore.stateExpirySecs.queryName2=10
...
...
--conf spark.sql.streaming.stateStore.strictExpire=true
```
2. Via `stateTimeout()` helper method _(recommended way)_:
```
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.Trigger
import ru.chermenin.spark.sql.execution.streaming.state.implicits._

val spark: SparkSession = ...
val streamingDF: DataFrame = ...

streamingDF.writeStream
  .format(...)
  .outputMode(...)
  .trigger(Trigger.ProcessingTime(1000L))
  .queryName("myQuery1")
  .option("checkpointLocation", "chkpntloc")
  // State for this query expires after 5 seconds.
  .stateTimeout(spark.conf, expirySecs = 5)
  .start()

spark.streams.awaitAnyTermination()
```
Preferably, the `queryName` and `checkpointLocation` can be set directly via the `stateTimeout()` method, as below:
```
streamingDF.writeStream
  .format(...)
  .outputMode(...)
  .trigger(Trigger.ProcessingTime(1000L))
  .stateTimeout(spark.conf, queryName = "myQuery1", expirySecs = 5, checkpointLocation = "chkpntloc")
  .start()
```
Note: If `queryName` is invalid or unavailable, the streaming query is tagged as `UNNAMED` and its timeout is taken from `spark.sql.streaming.stateStore.stateExpirySecs` (which defaults to -1 but can be overridden via SparkConf).
Other points about state timeouts (applicable at both the global and the query level):
* For no timeout, i.e. infinite state, set `spark.sql.streaming.stateStore.stateExpirySecs=-1`
* For stateless processing, i.e. no state, set `spark.sql.streaming.stateStore.stateExpirySecs=0`
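The special values above, together with the per-query keys, can be summarised in a small lookup. This is a sketch for illustration only: `ExpiryPolicy`, `ExpiryConf` and `resolve` are hypothetical names and not part of the library. It also assumes that a named query without its own key falls back to the global key, and it chooses to reject values below -1.

```scala
// Hypothetical helper for illustration; not part of the library's API.
sealed trait ExpiryPolicy
case object NoTimeout extends ExpiryPolicy                      // -1: state is kept indefinitely
case object Stateless extends ExpiryPolicy                      //  0: no state is kept
final case class ExpireAfter(seconds: Int) extends ExpiryPolicy // > 0: timeout in seconds

object ExpiryConf {
  val GlobalKey = "spark.sql.streaming.stateStore.stateExpirySecs"

  // Looks up the per-query key first, then the global key, then the default of -1.
  // The fallback from a named query to the global key is an assumption of this sketch.
  def resolve(conf: Map[String, String], queryName: Option[String]): ExpiryPolicy = {
    val perQuery = queryName.flatMap(name => conf.get(s"$GlobalKey.$name"))
    val secs = perQuery.orElse(conf.get(GlobalKey)).map(_.trim.toInt).getOrElse(-1)
    secs match {
      case -1         => NoTimeout
      case 0          => Stateless
      case s if s > 0 => ExpireAfter(s)
      case other      => throw new IllegalArgumentException(s"Unsupported expiry value: $other")
    }
  }
}
```

For example, with the global key set to `5` and the key for `myQuery1` set to `10`, the lookup yields `ExpireAfter(10)` for `myQuery1` and `ExpireAfter(5)` for a query without a name.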
### Contributing
You're welcome to submit pull requests with any changes for this repository at any time. I'll be very glad to see any contributions.
### License
The standard [Apache 2.0](LICENSE) license is used for this project.