{"id":28867161,"url":"https://github.com/minio/spark-streaming-checkpoint","last_synced_at":"2025-08-25T10:19:31.289Z","repository":{"id":141776043,"uuid":"605312039","full_name":"minio/spark-streaming-checkpoint","owner":"minio","description":"Spark Streaming Checkpoint File Manager for MinIO","archived":false,"fork":false,"pushed_at":"2023-04-25T03:46:17.000Z","size":42,"stargazers_count":11,"open_issues_count":0,"forks_count":1,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-06-20T11:53:27.152Z","etag":null,"topics":["checkpoints","java","scala","spark"],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/minio.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-02-22T22:26:56.000Z","updated_at":"2025-05-13T20:23:19.000Z","dependencies_parsed_at":null,"dependency_job_id":"c6370661-b1cf-4273-926a-59400e6bb26f","html_url":"https://github.com/minio/spark-streaming-checkpoint","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/minio/spark-streaming-checkpoint","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/minio%2Fspark-streaming-checkpoint","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/minio%2Fspark-streaming-checkpoint/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/minio%2Fspark-streaming-checkpoint/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositorie
s/minio%2Fspark-streaming-checkpoint/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/minio","download_url":"https://codeload.github.com/minio/spark-streaming-checkpoint/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/minio%2Fspark-streaming-checkpoint/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":272045420,"owners_count":24864021,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-25T02:00:12.092Z","response_time":1107,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["checkpoints","java","scala","spark"],"created_at":"2025-06-20T11:41:30.617Z","updated_at":"2025-08-25T10:19:31.283Z","avatar_url":"https://github.com/minio.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Spark Streaming Checkpoint File Manager for MinIO\n\nThis project implements a MinIO native CheckpointFileManager for Apache Spark Structured Streaming. \nMinIO is a strictly consistent S3-API compatible object store; all object operations are atomic and transactional. 
\nThis native CheckpointFileManager takes full advantage of the native object APIs and eliminates the Hadoop HCFS \nemulation layer, which is inefficient and unnecessary on object stores.\n \nSince filesystems did not support ACID transactions, applications wrote the files to a temporary location and \nused atomic renames to mimic the commit operation. Object stores do not have a rename API because the objects \ndo not appear in the namespace until the put or put-multipart transaction is complete. The default CheckpointFileManager \nshipped with Apache Spark is designed for HDFS and POSIX-based filesystems and it emulates the rename API on the object \nstore using PUT-COPY-LIST-DELETE APIs.\n\n## Sample Code used in testing\n\n```scala\nimport org.apache.spark.sql.SparkSession\nimport org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}\n\nobject SparkStreamingFromDirectory {\n\n  def main(args: Array[String]): Unit = {\n\n    val spark:SparkSession = SparkSession.builder()\n      .appName(\"SparkByExample\")\n      .config(\"spark.sql.streaming.checkpointFileManagerClass\", \"io.minio.spark.checkpoint.S3BasedCheckpointFileManager\")\n      .master(\"local[1]\").getOrCreate()\n\n    spark.sparkContext.setLogLevel(\"ERROR\")\n    spark.sparkContext.hadoopConfiguration.set(\"fs.s3a.endpoint\", \"http://127.0.0.1:9000\")\n    spark.sparkContext.hadoopConfiguration.set(\"fs.s3a.path.style.access\", \"true\")\n    spark.sparkContext.hadoopConfiguration.set(\"fs.s3a.access.key\", \"minioadmin\")\n    spark.sparkContext.hadoopConfiguration.set(\"fs.s3a.secret.key\", \"minioadmin\")\n\n    val schema = StructType(\n      List(\n        StructField(\"RecordNumber\", IntegerType, true),\n        StructField(\"Zipcode\", StringType, true),\n        StructField(\"ZipCodeType\", StringType, true),\n        StructField(\"City\", StringType, true),\n        StructField(\"State\", StringType, true),\n        StructField(\"LocationType\", StringType, true),\n    
    StructField(\"Lat\", StringType, true),\n        StructField(\"Long\", StringType, true),\n        StructField(\"Xaxis\", StringType, true),\n        StructField(\"Yaxis\", StringType, true),\n        StructField(\"Zaxis\", StringType, true),\n        StructField(\"WorldRegion\", StringType, true),\n        StructField(\"Country\", StringType, true),\n        StructField(\"LocationText\", StringType, true),\n        StructField(\"Location\", StringType, true),\n        StructField(\"Decommisioned\", StringType, true)\n      )\n    )\n\n    val df = spark.readStream\n      .schema(schema)\n      .json(\"./resources/\")\n\n    df.printSchema()\n\n    val groupDF = df.select(\"Zipcode\")\n        .groupBy(\"Zipcode\").count()\n    groupDF.printSchema()\n\n    groupDF.writeStream\n      .format(\"console\")\n      .outputMode(\"complete\")\n      .option(\"truncate\", false)\n      .option(\"newRows\", 30)\n      .option(\"checkpointLocation\", \"s3a://process-runner/checkpoints/\")\n      .start()\n      .awaitTermination()\n  }\n}\n```\n\nThe resources used for streaming inputs.\n```\ntree ../resources/\n../resources/\n├── zipcode10.json\n├── zipcode11.json\n├── zipcode12.json\n├── zipcode1.json\n├── zipcode2.json\n├── zipcode3.json\n├── zipcode4.json\n├── zipcode5.json\n├── zipcode6.json\n├── zipcode7.json\n├── zipcode8.json\n└── zipcode9.json\n\n0 directories, 12 files\n```\n\n## Results (concise)\n\n### Optimization can be seen in terms of total time taken for Batch '0'\n| Without Optimization | With Optimization |\n|----------------------|-------------------|\n| 72secs               | 17secs            |\n\n### Total number of namespace pollution\n| Total DEL markers without optimization | Total DEL markers with optimization |\n|----------------------------------------|-------------------------------------|\n| 409                                    | 0                                   |\n\n### Total number of excess objects on namespace\n| Total excess 
objects without optimization | Total excess objects with optimization |\n|-------------------------------------------|----------------------------------------|\n| 818 (out of which 409 are DEL markers)    | 0                                      |\n\n### Total number of API calls\n| Total number of API calls without optimization | Total number of API calls with optimization |\n|------------------------------------------------|---------------------------------------------|\n| 6938                                           | 224                                         |\n\n### The number of excess calls to object ratio \n| API Calls / Objects without optimization | API Calls / Objects with optimization |\n|------------------------------------------|---------------------------------------|\n| 33.8x                                    | 1.09x                                 |\n\n*These results show the overall benefits of using this CheckpointFileManager, and why the upstream S3A-based checkpointing is poorly designed to be used with object storage.*\n\n## Results (detailed) for each step\n\n### Spark-shell with S3A based checkpointing\n\n```\nWelcome to\n      ____              __\n     / __/__  ___ _____/ /__\n    _\\ \\/ _ \\/ _ `/ __/  '_/\n   /___/ .__/\\_,_/_/ /_/\\_\\   version 3.3.2\n      /_/\n         \nscala\u003e :load SparkStreamingFromDirectory-S3A.scala\nLoading SparkStreamingFromDirectory-S3A.scala...\nimport org.apache.spark.sql.SparkSession\nimport org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}\ndefined object SparkStreamingFromDirectory\n\nscala\u003e SparkStreamingFromDirectory.main(Array(\"\"))\n23/02/25 02:14:14 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.\nroot\n |-- RecordNumber: integer (nullable = true)\n |-- Zipcode: string (nullable = true)\n |-- ZipCodeType: string (nullable = true)\n |-- City: string (nullable = true)\n |-- State: string (nullable = 
true)\n |-- LocationType: string (nullable = true)\n |-- Lat: string (nullable = true)\n |-- Long: string (nullable = true)\n |-- Xaxis: string (nullable = true)\n |-- Yaxis: string (nullable = true)\n |-- Zaxis: string (nullable = true)\n |-- WorldRegion: string (nullable = true)\n |-- Country: string (nullable = true)\n |-- LocationText: string (nullable = true)\n |-- Location: string (nullable = true)\n |-- Decommisioned: string (nullable = true)\n\nroot\n |-- Zipcode: string (nullable = true)\n |-- count: long (nullable = false)\n\n-------------------------------------------                                     \nBatch: 0\n-------------------------------------------\n+-------+-----+\n|Zipcode|count|\n+-------+-----+\n|76166  |2    |\n|32564  |2    |\n|85210  |2    |\n|36275  |3    |\n|709    |3    |\n|35146  |3    |\n|708    |2    |\n|35585  |3    |\n|32046  |2    |\n|27203  |4    |\n|34445  |2    |\n|27007  |4    |\n|704    |10   |\n|27204  |4    |\n|34487  |2    |\n|85209  |2    |\n|76177  |4    |\n+-------+-----+\n```\n\nNumber of calls\n```\nmc support top api myminio/\n\nAPI                             RX      TX      CALLS   ERRORS \ns3.CopyObject                   48 KiB  47 KiB  208     0     \ns3.DeleteMultipleObjects        146 KiB 47 KiB  417     0     \ns3.DeleteObject                 32 KiB  0 B     211     0     \ns3.GetObject                    168 B   1.3 KiB 1       0     \ns3.HeadObject                   441 KiB 0 B     2950    0     \ns3.ListObjectsV2                408 KiB 1.4 MiB 2732    0     \ns3.PutObject                    128 KiB 0 B     419     0     \n\nSummary:\n\nTotal: 6938 CALLS, 1.2 MiB RX, 1.5 MiB TX - in 72.36s\n```\n\nThe number of files left over in the wake of this behavior on a versioned bucket.\n\n```\n~ mc ls -r --versions myminio/process-runner/ | wc -l\n1023\n```\n\nOut of which `614` are actual objects\n\n```\n~  mc ls -r --versions myminio/process-runner/ | grep PUT | wc -l\n614\n```\n\nand `409` are delete markers 
(soft deletes)\n\n```\n~ mc ls -r --versions myminio/process-runner/ | grep DEL | wc -l\n409\n```\n\nActual objects on namespace without versioning lookup\n```\n~ mc ls -r myminio/process-runner/  | wc -l\n205\n```\n\n### After Direct Checkpointing Write Optimization\n\n```\n...\nWelcome to\n      ____              __\n     / __/__  ___ _____/ /__\n    _\\ \\/ _ \\/ _ `/ __/  '_/\n   /___/ .__/\\_,_/_/ /_/\\_\\   version 3.3.2\n      /_/\n         \nUsing Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 11.0.17)\nType in expressions to have them evaluated.\nType :help for more information.\n\nscala\u003e :load SparkStreamingFromDirectory.scala\nLoading SparkStreamingFromDirectory.scala...\nimport org.apache.spark.sql.SparkSession\nimport org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}\ndefined object SparkStreamingFromDirectory\n\nscala\u003e SparkStreamingFromDirectory.main(Array(\"\"))\n23/02/25 02:20:25 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.\nroot\n |-- RecordNumber: integer (nullable = true)\n |-- Zipcode: string (nullable = true)\n |-- ZipCodeType: string (nullable = true)\n |-- City: string (nullable = true)\n |-- State: string (nullable = true)\n |-- LocationType: string (nullable = true)\n |-- Lat: string (nullable = true)\n |-- Long: string (nullable = true)\n |-- Xaxis: string (nullable = true)\n |-- Yaxis: string (nullable = true)\n |-- Zaxis: string (nullable = true)\n |-- WorldRegion: string (nullable = true)\n |-- Country: string (nullable = true)\n |-- LocationText: string (nullable = true)\n |-- Location: string (nullable = true)\n |-- Decommisioned: string (nullable = true)\n\nroot\n |-- Zipcode: string (nullable = true)\n |-- count: long (nullable = false)\n\n-------------------------------------------                                     \nBatch: 0\n-------------------------------------------\n+-------+-----+\n|Zipcode|count|\n+-------+-----+\n|76166  
|2    |\n|32564  |2    |\n|85210  |2    |\n|36275  |3    |\n|709    |3    |\n|35146  |3    |\n|708    |2    |\n|35585  |3    |\n|32046  |2    |\n|27203  |4    |\n|34445  |2    |\n|27007  |4    |\n|704    |10   |\n|27204  |4    |\n|34487  |2    |\n|85209  |2    |\n|76177  |4    |\n+-------+-----+\n```\n\n```\n~ mc support top api myminio/\n\nAPI                     RX      TX      CALLS   ERRORS \ns3.GetObject            159 B   1.3 KiB 1       0     \ns3.HeadObject           1.5 KiB 0 B     10      0     \ns3.ListObjectVersions   765 B   2.0 KiB 5       0     \ns3.PutObject            88 KiB  0 B     208     0     \n\nSummary:\n\nTotal: 224 CALLS, 90 KiB RX, 3.3 KiB TX - in 17.00s\n```\n\nActual number of valid objects \n```\n~ mc ls -r --versions myminio/process-runner/ | wc -l\n205\n```\n\nActual objects on namespace without versioning lookup\n```\n~ mc ls -r myminio/process-runner/  | wc -l\n205\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fminio%2Fspark-streaming-checkpoint","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fminio%2Fspark-streaming-checkpoint","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fminio%2Fspark-streaming-checkpoint/lists"}