Spark Streaming Checkpoint File Manager for MinIO
- Host: GitHub
- URL: https://github.com/minio/spark-streaming-checkpoint
- Owner: minio
- License: apache-2.0
- Created: 2023-02-22T22:26:56.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2023-04-25T03:46:17.000Z (over 2 years ago)
- Last Synced: 2025-06-20T11:53:27.152Z (6 months ago)
- Topics: checkpoints, java, scala, spark
- Language: Scala
- Size: 41 KB
- Stars: 11
- Watchers: 4
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Spark Streaming Checkpoint File Manager for MinIO
This project implements a MinIO-native CheckpointFileManager for Apache Spark Structured Streaming.
MinIO is a strictly consistent, S3-API-compatible object store; all object operations are atomic and transactional.
The native CheckpointFileManager takes full advantage of the object APIs and eliminates the Hadoop HCFS
emulation layer, which is inefficient and unnecessary on object stores.
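Enabling the manager is a single Spark SQL configuration key, set when the session is built; the full test program below does exactly this:
```scala
import org.apache.spark.sql.SparkSession

// Route Structured Streaming checkpoint I/O through the MinIO-native manager.
val spark = SparkSession.builder()
  .config("spark.sql.streaming.checkpointFileManagerClass",
    "io.minio.spark.checkpoint.S3BasedCheckpointFileManager")
  .getOrCreate()
```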
Because filesystems historically did not support ACID transactions, applications wrote files to a temporary location and
used an atomic rename to mimic a commit. Object stores have no rename API: an object does not appear in the namespace
until its PUT or multipart-PUT transaction completes, so the write itself is the commit. The default CheckpointFileManager
shipped with Apache Spark is designed for HDFS and POSIX-style filesystems, and it emulates rename on an object
store with a sequence of PUT, COPY, LIST, and DELETE calls.
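To make the difference concrete, here is a minimal, hypothetical sketch (not this repository's implementation) of committing a checkpoint file with one atomic PutObject through the MinIO Java SDK; the bucket, key, and payload are illustrative:
```scala
import io.minio.{MinioClient, PutObjectArgs}
import java.io.ByteArrayInputStream

object DirectCommitSketch {
  def main(args: Array[String]): Unit = {
    // Endpoint and credentials match the local test setup used below.
    val client = MinioClient.builder()
      .endpoint("http://127.0.0.1:9000")
      .credentials("minioadmin", "minioadmin")
      .build()

    val payload = """{"batchId": 0}""".getBytes("UTF-8")

    // A single PUT: the object becomes visible only when the call succeeds,
    // so there is no temporary key and no COPY, LIST, or DELETE to clean up.
    client.putObject(
      PutObjectArgs.builder()
        .bucket("process-runner")
        .`object`("checkpoints/offsets/0")
        .stream(new ByteArrayInputStream(payload), payload.length, -1)
        .build())
  }
}
```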
## Sample Code used in testing
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object SparkStreamingFromDirectory {

  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .appName("SparkByExample")
      // Use the MinIO-native checkpoint file manager instead of the
      // default rename-based one.
      .config("spark.sql.streaming.checkpointFileManagerClass",
        "io.minio.spark.checkpoint.S3BasedCheckpointFileManager")
      .master("local[1]")
      .getOrCreate()

    spark.sparkContext.setLogLevel("ERROR")

    // Point the S3A connector at the local MinIO server.
    spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", "http://127.0.0.1:9000")
    spark.sparkContext.hadoopConfiguration.set("fs.s3a.path.style.access", "true")
    spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "minioadmin")
    spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "minioadmin")

    // Schema of the zipcode JSON records streamed from the input directory.
    val schema = StructType(
      List(
        StructField("RecordNumber", IntegerType, true),
        StructField("Zipcode", StringType, true),
        StructField("ZipCodeType", StringType, true),
        StructField("City", StringType, true),
        StructField("State", StringType, true),
        StructField("LocationType", StringType, true),
        StructField("Lat", StringType, true),
        StructField("Long", StringType, true),
        StructField("Xaxis", StringType, true),
        StructField("Yaxis", StringType, true),
        StructField("Zaxis", StringType, true),
        StructField("WorldRegion", StringType, true),
        StructField("Country", StringType, true),
        StructField("LocationText", StringType, true),
        StructField("Location", StringType, true),
        StructField("Decommisioned", StringType, true)
      )
    )

    val df = spark.readStream
      .schema(schema)
      .json("./resources/")
    df.printSchema()

    // Count the records per zipcode.
    val groupDF = df.select("Zipcode")
      .groupBy("Zipcode").count()
    groupDF.printSchema()

    // Write each batch to the console, checkpointing to MinIO over s3a://.
    groupDF.writeStream
      .format("console")
      .outputMode("complete")
      .option("truncate", false)
      .option("numRows", 30)
      .option("checkpointLocation", "s3a://process-runner/checkpoints/")
      .start()
      .awaitTermination()
  }
}
```
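To reproduce the sessions below, spark-shell needs the S3A connector and this project's jar on the classpath. A typical invocation might look like the following; the jar name is a placeholder for whatever your build produces, and the hadoop-aws version should match your Spark distribution's Hadoop line:
```
spark-shell \
  --packages org.apache.hadoop:hadoop-aws:3.3.2 \
  --jars spark-streaming-checkpoint.jar
```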
The resources used as streaming inputs:
```
tree ../resources/
../resources/
├── zipcode10.json
├── zipcode11.json
├── zipcode12.json
├── zipcode1.json
├── zipcode2.json
├── zipcode3.json
├── zipcode4.json
├── zipcode5.json
├── zipcode6.json
├── zipcode7.json
├── zipcode8.json
└── zipcode9.json
0 directories, 12 files
```
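Each file contains line-delimited JSON records matching the schema above. A single illustrative record might look like this (the zipcode appears in the batch output below; the remaining values are made up):
```
{"RecordNumber": 1, "Zipcode": "704", "ZipCodeType": "STANDARD", "City": "Example City", "State": "PR", "LocationType": "PRIMARY", "Lat": "17.9", "Long": "-66.2", "Xaxis": "0.38", "Yaxis": "-0.87", "Zaxis": "0.30", "WorldRegion": "NA", "Country": "US", "LocationText": "Example City, PR", "Location": "NA-US-PR", "Decommisioned": "false"}
```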
## Results (concise)
### Total time taken for Batch 0
| Without optimization | With optimization |
|----------------------|-------------------|
| 72 s                 | 17 s              |
### Namespace pollution (delete markers)
| Delete markers without optimization | Delete markers with optimization |
|-------------------------------------|----------------------------------|
| 409                                 | 0                                |
### Excess objects left on the namespace
| Excess objects without optimization   | Excess objects with optimization |
|---------------------------------------|----------------------------------|
| 818 (409 of which are delete markers) | 0                                |
### Total number of API calls
| API calls without optimization | API calls with optimization |
|--------------------------------|-----------------------------|
| 6938                           | 224                         |
### Ratio of API calls to objects
| API calls per object without optimization | API calls per object with optimization |
|-------------------------------------------|----------------------------------------|
| 33.8x                                     | 1.09x                                  |
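These ratios are the total API calls divided by the 205 objects that actually end up on the namespace: 6938 / 205 ≈ 33.8, while 224 / 205 ≈ 1.09.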
*These results show the overall benefit of using this CheckpointFileManager, and why the upstream S3A-based checkpointing is a poor fit for object storage.*
## Results (detailed) with each step
### Spark-shell with S3A-based checkpointing
```
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.3.2
/_/
scala> :load SparkStreamingFromDirectory-S3A.scala
Loading SparkStreamingFromDirectory-S3A.scala...
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
defined object SparkStreamingFromDirectory
scala> SparkStreamingFromDirectory.main(Array(""))
23/02/25 02:14:14 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.
root
|-- RecordNumber: integer (nullable = true)
|-- Zipcode: string (nullable = true)
|-- ZipCodeType: string (nullable = true)
|-- City: string (nullable = true)
|-- State: string (nullable = true)
|-- LocationType: string (nullable = true)
|-- Lat: string (nullable = true)
|-- Long: string (nullable = true)
|-- Xaxis: string (nullable = true)
|-- Yaxis: string (nullable = true)
|-- Zaxis: string (nullable = true)
|-- WorldRegion: string (nullable = true)
|-- Country: string (nullable = true)
|-- LocationText: string (nullable = true)
|-- Location: string (nullable = true)
|-- Decommisioned: string (nullable = true)
root
|-- Zipcode: string (nullable = true)
|-- count: long (nullable = false)
-------------------------------------------
Batch: 0
-------------------------------------------
+-------+-----+
|Zipcode|count|
+-------+-----+
|76166 |2 |
|32564 |2 |
|85210 |2 |
|36275 |3 |
|709 |3 |
|35146 |3 |
|708 |2 |
|35585 |3 |
|32046 |2 |
|27203 |4 |
|34445 |2 |
|27007 |4 |
|704 |10 |
|27204 |4 |
|34487 |2 |
|85209 |2 |
|76177 |4 |
+-------+-----+
```
Number of API calls made:
```
mc support top api myminio/
API RX TX CALLS ERRORS
s3.CopyObject 48 KiB 47 KiB 208 0
s3.DeleteMultipleObjects 146 KiB 47 KiB 417 0
s3.DeleteObject 32 KiB 0 B 211 0
s3.GetObject 168 B 1.3 KiB 1 0
s3.HeadObject 441 KiB 0 B 2950 0
s3.ListObjectsV2 408 KiB 1.4 MiB 2732 0
s3.PutObject 128 KiB 0 B 419 0
Summary:
Total: 6938 CALLS, 1.2 MiB RX, 1.5 MiB TX - in 72.36s
```
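The CopyObject and DeleteObject counts are the fingerprint of rename emulation: each emulated rename is a COPY of the temporary file to its final key followed by a DELETE of the temporary key, and nearly every step is bracketed by HeadObject and ListObjectsV2 probes, which is why those two calls dominate the total.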
The number of files left over in the wake of this behavior on a versioned bucket:
```
~ mc ls -r --versions myminio/process-runner/ | wc -l
1023
```
Out of which `614` are actual objects:
```
~ mc ls -r --versions myminio/process-runner/ | grep PUT | wc -l
614
```
and `409` are delete markers (soft deletes):
```
~ mc ls -r --versions myminio/process-runner/ | grep DEL | wc -l
409
```
Actual objects in the namespace, without listing versions:
```
~ mc ls -r myminio/process-runner/ | wc -l
205
```
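The numbers are consistent: 614 object versions plus 409 delete markers account for all 1023 versioned entries, and subtracting the 205 objects that should exist leaves the 818 excess entries reported in the summary above.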
### After the direct checkpoint-write optimization
```
...
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.3.2
/_/
Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 11.0.17)
Type in expressions to have them evaluated.
Type :help for more information.
scala> :load SparkStreamingFromDirectory.scala
Loading SparkStreamingFromDirectory.scala...
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
defined object SparkStreamingFromDirectory
scala> SparkStreamingFromDirectory.main(Array(""))
23/02/25 02:20:25 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.
root
|-- RecordNumber: integer (nullable = true)
|-- Zipcode: string (nullable = true)
|-- ZipCodeType: string (nullable = true)
|-- City: string (nullable = true)
|-- State: string (nullable = true)
|-- LocationType: string (nullable = true)
|-- Lat: string (nullable = true)
|-- Long: string (nullable = true)
|-- Xaxis: string (nullable = true)
|-- Yaxis: string (nullable = true)
|-- Zaxis: string (nullable = true)
|-- WorldRegion: string (nullable = true)
|-- Country: string (nullable = true)
|-- LocationText: string (nullable = true)
|-- Location: string (nullable = true)
|-- Decommisioned: string (nullable = true)
root
|-- Zipcode: string (nullable = true)
|-- count: long (nullable = false)
-------------------------------------------
Batch: 0
-------------------------------------------
+-------+-----+
|Zipcode|count|
+-------+-----+
|76166 |2 |
|32564 |2 |
|85210 |2 |
|36275 |3 |
|709 |3 |
|35146 |3 |
|708 |2 |
|35585 |3 |
|32046 |2 |
|27203 |4 |
|34445 |2 |
|27007 |4 |
|704 |10 |
|27204 |4 |
|34487 |2 |
|85209 |2 |
|76177 |4 |
+-------+-----+
```
```
~ mc support top api myminio/
API RX TX CALLS ERRORS
s3.GetObject 159 B 1.3 KiB 1 0
s3.HeadObject 1.5 KiB 0 B 10 0
s3.ListObjectVersions 765 B 2.0 KiB 5 0
s3.PutObject 88 KiB 0 B 208 0
Summary:
Total: 224 CALLS, 90 KiB RX, 3.3 KiB TX - in 17.00s
```
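The call profile collapses to essentially one PutObject per checkpoint file: 208 PUTs, matching the 208 CopyObject calls the S3A path spent on renames alone, with no COPY or DELETE traffic at all.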
The versioned listing now contains only valid objects:
```
~ mc ls -r --versions myminio/process-runner/ | wc -l
205
```
Actual objects in the namespace, without listing versions:
```
~ mc ls -r myminio/process-runner/ | wc -l
205
```