{"id":15208912,"url":"https://github.com/heartsavior/spark-state-tools","last_synced_at":"2025-10-29T12:31:44.461Z","repository":{"id":57737192,"uuid":"178559541","full_name":"HeartSaVioR/spark-state-tools","owner":"HeartSaVioR","description":"Spark Structured Streaming State Tools","archived":false,"fork":false,"pushed_at":"2020-07-03T04:23:50.000Z","size":146,"stargazers_count":34,"open_issues_count":5,"forks_count":9,"subscribers_count":6,"default_branch":"develop-3.0","last_synced_at":"2025-02-02T01:31:56.710Z","etag":null,"topics":["apache-spark","structured-streaming"],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/HeartSaVioR.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-03-30T13:05:01.000Z","updated_at":"2023-11-17T00:56:16.000Z","dependencies_parsed_at":"2022-08-24T05:31:47.155Z","dependency_job_id":null,"html_url":"https://github.com/HeartSaVioR/spark-state-tools","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HeartSaVioR%2Fspark-state-tools","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HeartSaVioR%2Fspark-state-tools/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HeartSaVioR%2Fspark-state-tools/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HeartSaVioR%2Fspark-state-tools/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/HeartSaVioR","download_url":"https://codeload.github.com/HeartSaVioR/sp
ark-state-tools/tar.gz/refs/heads/develop-3.0","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":238825750,"owners_count":19537118,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-spark","structured-streaming"],"created_at":"2024-09-28T07:04:38.566Z","updated_at":"2025-10-29T12:31:39.093Z","avatar_url":"https://github.com/HeartSaVioR.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Spark State Tools \n\n[![CircleCI](https://circleci.com/gh/HeartSaVioR/spark-state-tools/tree/master.svg?style=svg)](https://circleci.com/gh/HeartSaVioR/spark-state-tools/tree/master)\n\nSpark State Tools provides features about offline manipulation of Structured Streaming state on existing query.\n\nThe features we provide as of now are:\n\n* Show some state information which you'll need to provide to enjoy below features\n  * state operator information from checkpoint\n  * state schema from streaming query\n* Create savepoint from existing checkpoint of Structured Streaming query\n  * You can pick specific batch (if it exists on metadata) to create savepoint\n* Read state as batch source of Spark SQL\n* Write DataFrame to state as batch sink of Spark SQL\n  * With feature of writing state, you can achieve rescaling state (repartition), simple schema evolution, etc.\n* Migrate state format from old to new\n  * migrating Streaming Aggregation from ver 1 to 2\n  * migrating FlatMapGroupsWithState from ver 1 to 2\n\nAs this project leverages Spark Structured Streaming's interfaces, and doesn't deal with 
internals\n(e.g. the structure of state files for the HDFS state store), the performance may be suboptimal.\n\nFor now, for the most part, state from Streaming Aggregation queries (`groupBy().agg()`) and (Flat)MapGroupsWithState is supported.\n\n## Disclaimer\n\nThis is more of a proof-of-concept implementation and may not be production ready.\nWhen you write state, you may want to back up your checkpoint with CheckpointUtil and try the write against a savepoint.\n\nThe project is intended to deal with offline state, not with the state of a running streaming query.\nWorking against a running query may be possible, but the state store provider in the running query can purge old batches, which would produce errors here.\n\n## Supported versions\n\nBoth Spark 3.0.x and 2.4.x are supported: this only means you should use these Spark versions when using this project.\n\nThe project is cross-compiled for Scala 2.11 and 2.12 (thanks [@redsk](https://github.com/redsk)!); please pick the right artifact for your Scala version.\n\nSpark version | Scala versions | artifact version\n------------- | -------------- | ----------------\n2.4.x         | 2.11 / 2.12    | 0.5.0-spark-2.4\n3.0.x         | 2.12           | 0.5.0-spark-3.0\n\n## Pulling artifacts\n\nYou may use this library in your applications with the following dependency information:\n\n```\ngroupId: net.heartsavior.spark\nartifactId: spark-state-tools_\u003cscala short version\u003e\n```\n\nYou are encouraged to always use the latest version that is compatible with your Apache Spark version.\n\ne.g. 
For Maven:\n\n(Please replace `{{...}}` with values from the matrix above.)\n\n```\n\u003cdependency\u003e\n  \u003cgroupId\u003enet.heartsavior.spark\u003c/groupId\u003e\n  \u003cartifactId\u003espark-state-tools_{{scala_version}}\u003c/artifactId\u003e\n  \u003cversion\u003e{{artifact_version}}\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\nFor other dependency management tools, refer to the pages below:\n\nhttps://search.maven.org/artifact/net.heartsavior.spark/spark-state-tools_2.11/\nhttps://search.maven.org/artifact/net.heartsavior.spark/spark-state-tools_2.12/\n\n(NOTE: Use 0.4.0 or higher, as earlier versions have a critical performance issue on the read path.)\n\n\n## How to use\n\nFirst of all, you may want to get the state and last batch information so that you can provide them as parameters.\nYou can get it from `StateInformationInCheckpoint`, either by calling it from your codebase or by running it with `spark-submit`.\nHere we assume you have the spark-state-tools artifact jar and want to run it from the CLI (leveraging `spark-submit`).\n\n```text\n\u003cspark_path\u003e/bin/spark-submit --master \"local[*]\" \\\n--class net.heartsavior.spark.sql.state.StateInformationInCheckpoint \\\nspark-state-tool-0.0.1-SNAPSHOT.jar \u003ccheckpoint_root_path\u003e\n```\n\nThe command prints checkpoint information like the following:\n\n```text\nLast committed batch ID: 2\nOperator ID: 0, partitions: 5, storeNames: List(default)\n```\n\nThis output means the last committed batch ID of the query is 2 (NOTE: the corresponding state version is 3, not 2),\nthere is only one stateful operator, with ID 0 and 5 partitions, and there is only one kind of store, named \"default\".\n\nYou can get the same information by calling `StateInformationInCheckpoint.gatherInformation` on the checkpoint directory.\n\n```scala\n// Here we assume 'spark' as SparkSession.\n// Here the class of Path is `org.apache.hadoop.fs.Path`\nval stateInfo = new 
StateInformationInCheckpoint(spark).gatherInformation(new Path(cpDir.getAbsolutePath))\n// Here stateInfo is `StateInformation`, which you can extract same information as running CLI app\n```\n\nTo read state from your existing query, you may want to provide state schema manually, or read from your existing query:\n\n* Read schema from existing query\n\n(supported: `streaming aggregation`, `flatMapGroupsWithState`)\n\n```scala\n// Here we assume 'spark' as SparkSession.\n// the query shouldn't have sink - you may need to get rid of writeStream part and pass DataFrame\nval schemaInfos = new StateSchemaExtractor(spark).extract(streamingQueryDf)\n// Here schemaInfos is `Seq[StateSchemaInfo]`, which you can extract keySchema,\n// and valueSchema and finally define state schema. Please refer \"Manual schema\"\n// to define state schema with key schema and value schema\n```\n\n* Manual schema\n\n```scala\nval stateKeySchema = new StructType()\n  .add(\"groupKey\", IntegerType)\n\nval stateValueSchema = new StructType()\n  .add(\"cnt\", LongType)\n  .add(\"sum\", LongType)\n  .add(\"max\", IntegerType)\n  .add(\"min\", IntegerType)\n\nval stateFormat = new StructType()\n  .add(\"key\", stateKeySchema)\n  .add(\"value\", stateValueSchema)\n```\n\nYou can also combine both state operator information in state information and state schema via `StateStoreReaderOperatorParamExtractor`\nto get necessary parameters for state batch read:\n\n```scala\n// Here we assume 'spark' as SparkSession.\nval stateInfo = new StateInformationInCheckpoint(spark).gatherInformation(new Path(cpDir.getAbsolutePath))\nval schemaInfos = new StateSchemaExtractor(spark).extract(streamingQueryDf)\nval stateReadParams = StateStoreReaderOperatorParamExtractor.extract(stateInfo, schemaInfos)\n// from `stateReadParams` you can get last committed state version, operatorId, storeName, state schema per each (operatorId, storeName) group\n```\n\nThen you can start your batch query like:\n\n```scala\nval 
operatorId = 0\nval batchId = 1 // the version of state for the output of batch is batchId + 1\n\n// Here we assume 'spark' as SparkSession\nval stateReadDf = spark.read\n  .format(\"state\")\n  .schema(stateSchema)\n  .option(StateStoreDataSourceProvider.PARAM_CHECKPOINT_LOCATION,\n    new Path(checkpointRoot, \"state\").getAbsolutePath)\n  .option(StateStoreDataSourceProvider.PARAM_VERSION, batchId + 1)\n  .option(StateStoreDataSourceProvider.PARAM_OPERATOR_ID, operatorId)\n  .load()\n\n\n// The schema of stateReadDf follows:\n// For streaming aggregation state format v1\n// (query ran with lower than Spark 2.4.0 for the first time)\n/*\nroot\n |-- key: struct (nullable = false)\n |    |-- groupKey: integer (nullable = true)\n |-- value: struct (nullable = false)\n |    |-- groupKey: integer (nullable = true)\n |    |-- cnt: long (nullable = true)\n |    |-- sum: long (nullable = true)\n |    |-- max: integer (nullable = true)\n |    |-- min: integer (nullable = true)\n*/\n\n// For streaming aggregation state format v2\n// (query ran with Spark 2.4.0 or higher for the first time)\n/*\nroot\n |-- key: struct (nullable = false)\n |    |-- groupKey: integer (nullable = true)\n |-- value: struct (nullable = false)\n |    |-- cnt: long (nullable = true)\n |    |-- sum: long (nullable = true)\n |    |-- max: integer (nullable = true)\n |    |-- min: integer (nullable = true)\n*/\n```\n\nTo write Dataset as state of Structured Streaming, you can transform your Dataset as having schema as follows:\n\n```text\nroot\n |-- key: struct (nullable = false)\n |    |-- ...key fields...\n |-- value: struct (nullable = false)\n |    |-- ...value fields...\n```\n\nand add state batch output as follow:\n\n```scala\nval operatorId = 0\nval batchId = 1 // the version of state for the output of batch is batchId + 1\nval newShufflePartitions = 10\n\ndf.write\n  .format(\"state\")\n  .option(StateStoreDataSourceProvider.PARAM_CHECKPOINT_LOCATION,\n    new Path(newCheckpointRoot, 
\"state\").getAbsolutePath)\n  .option(StateStoreDataSourceProvider.PARAM_VERSION, batchId + 1)\n  .option(StateStoreDataSourceProvider.PARAM_OPERATOR_ID, operatorId)\n  .option(StateStoreDataSourceProvider.PARAM_NEW_PARTITIONS, newShufflePartitions)\n  .save() // saveAsTable() also supported\n```\n\nBefore that, you may want to create a savepoint from existing checkpoint to another path, so that you can simply \nrun new Structured Streaming query with modified state.\n\n```scala\n// Here we assume 'spark' as SparkSession.\n// If you just want to create a savepoint without modifying state, provide `additionalMetadataConf` as `Map.empty`,\n// and `excludeState` as `false`.\n// That said, if you want to prepare state modification, it would be good to create a savepoint with providing\n// addConf to new shuffle partition (like below), and `excludeState` as `true` (to avoid unnecessary copy for state)\nval addConf = Map(SQLConf.SHUFFLE_PARTITIONS.key -\u003e newShufflePartitions.toString)\nCheckpointUtil.createSavePoint(spark, oldCpPath, newCpPath, newLastBatchId, addConf, excludeState = true)\n```\n\nIf you ran streaming aggregation query before Spark 2.4.0 and want to upgrade (or already upgraded) to Spark 2.4.0 or higher,\nyou may also want to migrate your state from state format 1 to 2 (Spark 2.4.0 introduces it) to reduce overall state size,\nand get some speedup from most of cases.\n\nPlease refer [SPARK-24763](https://issues.apache.org/jira/browse/SPARK-24763) for more details.\n\n```scala\n// Here we assume 'spark' as SparkSession.\n\n// Please refer above to see how to construct `stateSchema`\n// (manually, or reading from existing query)\n// Here we already construct `stateSchema` as state schema.\n\nval migrator = new StreamingAggregationMigrator(spark)\nmigrator.convertVersion1To2(oldCpPath, newCpPath, stateKeySchema, stateValueSchema)\n```\n\nSimilarly, if you ran flatMapGroupsWithState query before Spark 2.4.0 and want to upgrade (or already upgraded) to 
Spark 2.4.0 or higher,\nyou may also want to migrate your state from state format 1 to 2 (introduced in Spark 2.4.0) to enable setting a timeout even when the state is null.\n(This also changes the timeout timestamp type from int to long.)\n\nPlease refer to [SPARK-22187](https://issues.apache.org/jira/browse/SPARK-22187) for more details.\n\n```scala\n// Here we assume 'spark' as SparkSession.\n\n// See above for how to construct `stateKeySchema` and `stateValueSchema`\n// (manually, or by reading from an existing query).\n// Here we assume both schemas have already been constructed.\n\nval migrator = new FlatMapGroupsWithStateMigrator(spark)\nmigrator.convertVersion1To2(oldCpPath, newCpPath, stateKeySchema, stateValueSchema)\n```\n\nPlease refer to the [test code](https://github.com/HeartSaVioR/spark-state-tools/tree/master/src/test/scala/net/heartsavior/spark/sql/state) for more usage examples.\n\n## License\n\nCopyright 2019-2020 Jungtaek Lim \"\u003ckabhwan.opensource@gmail.com\u003e\"\n\nLicensed under the Apache License, Version 2.0 (the \"License\");\nyou may not use this file except in compliance with the License.\nYou may obtain a copy of the License at\n\nhttp://www.apache.org/licenses/LICENSE-2.0\n\nUnless required by applicable law or agreed to in writing, software\ndistributed under the License is distributed on an \"AS IS\" BASIS,\nWITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\nSee the License for the specific language governing permissions and\nlimitations under the License. \n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fheartsavior%2Fspark-state-tools","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fheartsavior%2Fspark-state-tools","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fheartsavior%2Fspark-state-tools/lists"}