Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Kinesis Connector for Structured Streaming
https://github.com/qubole/kinesis-sql
- Host: GitHub
- URL: https://github.com/qubole/kinesis-sql
- Owner: qubole
- License: apache-2.0
- Created: 2018-03-01T10:14:43.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2024-07-02T12:12:23.000Z (7 months ago)
- Last Synced: 2025-01-13T13:21:36.984Z (9 days ago)
- Topics: kinesis, real-time-processing, spark, spark-streaming, spark-structured-streaming, structured-streaming
- Language: Scala
- Homepage: http://www.qubole.com
- Size: 251 KB
- Stars: 136
- Watchers: 13
- Forks: 80
- Open Issues: 34
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG
- License: LICENSE
README
[![Build Status](https://travis-ci.org/qubole/kinesis-sql.svg?branch=master)](https://travis-ci.org/qubole/kinesis-sql)
## NOTE: This project is NO LONGER MAINTAINED.
[Ron Cremer](https://github.com/roncemer) has volunteered to maintain this project. Beginning with Spark 3.2, the new project is located here: https://github.com/roncemer/spark-sql-kinesis
# Kinesis Connector for Structured Streaming
Implementation of a Kinesis Source Provider in Spark Structured Streaming. [SPARK-18165](https://issues.apache.org/jira/browse/SPARK-18165) describes the need for such an implementation. More details on the implementation can be found in this [blog post](https://www.qubole.com/blog/kinesis-connector-for-structured-streaming/).
## Downloading and Using the Connector
The connector is available from the Maven Central repository. It can be added to a job with the --packages option or the spark.jars.packages configuration property. Use the following connector artifact:
Spark 3.0: com.qubole.spark/spark-sql-kinesis_2.12/1.2.0-spark_3.0
Spark 2.4: com.qubole.spark/spark-sql-kinesis_2.11/1.2.0-spark_2.4
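For example, the Spark 3.0 artifact above can be pulled from Maven Central when launching the shell:

$SPARK_HOME/bin/spark-shell --packages com.qubole.spark:spark-sql-kinesis_2.12:1.2.0-spark_3.0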
## Developer Setup
Check out the kinesis-sql branch corresponding to your Spark version; use the master branch for the latest Spark version.
###### Spark version 3.0.x
git clone [email protected]:qubole/kinesis-sql.git
cd kinesis-sql
git checkout master
mvn install -DskipTests

This will create the *target/spark-sql-kinesis_2.12-\*.jar* file, which contains the connector code and its dependency jars.
## How to use it
#### Setup Kinesis
Refer to the [Amazon Docs](https://docs.aws.amazon.com/cli/latest/reference/kinesis/create-stream.html) for more options.
###### Create Kinesis Stream
$ aws kinesis create-stream --stream-name test --shard-count 2
###### Add Records in the stream
$ aws kinesis put-record --stream-name test --partition-key 1 --data 'Kinesis'
$ aws kinesis put-record --stream-name test --partition-key 1 --data 'Connector'
$ aws kinesis put-record --stream-name test --partition-key 1 --data 'for'
$ aws kinesis put-record --stream-name test --partition-key 1 --data 'Apache'
$ aws kinesis put-record --stream-name test --partition-key 1 --data 'Spark'
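The same records can also be produced programmatically. A minimal Scala sketch using the AWS SDK for Java v1 (this assumes aws-java-sdk-kinesis is on the classpath and that default credentials and the us-east-1 region apply; it is not part of the connector itself):

import java.nio.ByteBuffer
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder
import com.amazonaws.services.kinesis.model.PutRecordRequest

// Equivalent of the aws-cli put-record calls above
val client = AmazonKinesisClientBuilder.standard().withRegion("us-east-1").build()
Seq("Kinesis", "Connector", "for", "Apache", "Spark").foreach { word =>
  val request = new PutRecordRequest()
    .withStreamName("test")
    .withPartitionKey("1")
    .withData(ByteBuffer.wrap(word.getBytes("UTF-8")))
  client.putRecord(request)
}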
#### Example Streaming Job
In the examples below, $SPARK_HOME refers to the Spark installation directory.
###### Open Spark-Shell
$SPARK_HOME/bin/spark-shell --jars target/spark-sql-kinesis_2.12-*.jar
###### Subscribe to Kinesis Source
// Subscribe to the "test" stream created above
scala> :paste
val kinesis = spark
  .readStream
  .format("kinesis")
  .option("streamName", "test")
  .option("endpointUrl", "https://kinesis.us-east-1.amazonaws.com")
  .option("awsAccessKeyId", [ACCESS_KEY])
  .option("awsSecretKey", [SECRET_KEY])
  .option("startingPosition", "TRIM_HORIZON")
  .load
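Instead of embedding static keys, the source can also assume an IAM role via the awsSTSRoleARN and awsSTSSessionName options documented below. A sketch (the role ARN and session name are placeholders):

// Same source, authenticating through an assumed STS role
val kinesisViaRole = spark
  .readStream
  .format("kinesis")
  .option("streamName", "test")
  .option("endpointUrl", "https://kinesis.us-east-1.amazonaws.com")
  .option("awsSTSRoleARN", "arn:aws:iam::123456789012:role/kinesis-reader") // placeholder ARN
  .option("awsSTSSessionName", "spark-kinesis-session")
  .load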
###### Check Schema
scala> kinesis.printSchema
root
|-- data: binary (nullable = true)
|-- streamName: string (nullable = true)
|-- partitionKey: string (nullable = true)
|-- sequenceNumber: string (nullable = true)
|-- approximateArrivalTimestamp: timestamp (nullable = true)
###### Word Count
// Cast data into string and group by the data column
scala> :paste
kinesis
  .selectExpr("CAST(data AS STRING)").as[String]
  .groupBy("data").count()
  .writeStream
  .format("console")
  .outputMode("complete")
  .start()
  .awaitTermination()
###### Output in Console
+------------+-----+
| data|count|
+------------+-----+
| for| 1|
| Apache| 1|
| Spark| 1|
| Kinesis| 1|
| Connector| 1|
+------------+-----+
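Beyond simple word counts, Kinesis payloads are often JSON. A sketch of decoding the binary data column into typed columns (the payload schema here is a made-up example):

scala> :paste
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType}

// Hypothetical payload shape: {"user": "...", "action": "..."}
val payloadSchema = new StructType()
  .add("user", StringType)
  .add("action", StringType)

val events = kinesis
  .selectExpr("CAST(data AS STRING) AS json")
  .select(from_json(col("json"), payloadSchema).as("event"))
  .select("event.*")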
###### Using the Kinesis Sink
// Aggregate the stream and write the running counts back to Kinesis
scala> :paste
kinesis
  .selectExpr("CAST(data AS STRING)").as[String]
  .groupBy("data").count()
  // Rebuild the partitionKey and data columns expected by the sink,
  // since the aggregation drops the original partitionKey
  .selectExpr("CAST(rand() AS STRING) AS partitionKey", "CAST(count AS STRING) AS data")
  .writeStream
  .format("kinesis")
  .outputMode("update")
  .option("streamName", "spark-sink-example")
  .option("endpointUrl", "https://kinesis.us-east-1.amazonaws.com")
  .option("awsAccessKeyId", [ACCESS_KEY])
  .option("awsSecretKey", [SECRET_KEY])
  // A checkpoint location is required for streaming sinks (example path)
  .option("checkpointLocation", "/tmp/kinesis-sink-checkpoint")
  .start()
  .awaitTermination()
## Kinesis Source Configuration
| Option Name | Default Value | Description |
| ------------- | ------------- | ----------- |
| streamName | - | Name of the stream in Kinesis to read from |
| endpointUrl | https://kinesis.us-east-1.amazonaws.com | Endpoint URL for the Kinesis stream |
| awsAccessKeyId | - | AWS Credentials for Kinesis describe, read record operations |
| awsSecretKey | - | AWS Credentials for Kinesis describe, read record operations |
| awsSTSRoleARN | - | AWS STS Role ARN for Kinesis describe, read record operations |
| awsSTSSessionName | - | AWS STS Session name for Kinesis describe, read record operations |
| awsUseInstanceProfile | true | Use instance profile credentials if no credentials are provided |
| startingPosition | LATEST | Starting Position in Kinesis to fetch data from. Possible values are "latest", "trim_horizon", "earliest" (alias for trim_horizon), or JSON serialized map shardId->KinesisPosition |
| failondataloss | true | Fail the streaming job if any active shard is missing or has expired |
| kinesis.executor.maxFetchTimeInMs | 1000 | Maximum time spent in executor to fetch record from Kinesis per Shard |
| kinesis.executor.maxFetchRecordsPerShard | 100000 | Maximum Number of records to fetch per shard |
| kinesis.executor.maxRecordPerRead | 10000 | Maximum Number of records to fetch per getRecords API call |
| kinesis.executor.addIdleTimeBetweenReads | false | Add a delay between two consecutive getRecords API calls |
| kinesis.executor.idleTimeBetweenReadsInMs | 1000 | Minimum delay between two consecutive getRecords calls |
| kinesis.client.describeShardInterval | 1s (1 second) | Minimum Interval between two ListShards API calls to consider resharding |
| kinesis.client.numRetries | 3 | Maximum Number of retries for Kinesis API requests |
| kinesis.client.retryIntervalMs | 1000 | Cool-off period before retrying a Kinesis API call |
| kinesis.client.maxRetryIntervalMs | 10000 | Maximum cool-off period between two retries |
| kinesis.client.avoidEmptyBatches | false | Avoid creating an empty microbatch by checking upfront whether there is any unread data in the stream before the batch is started |
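A sketch combining several of the source options above (the values are illustrative, not tuning recommendations):

// Source tuned with options from the table above
val tunedSource = spark
  .readStream
  .format("kinesis")
  .option("streamName", "test")
  .option("endpointUrl", "https://kinesis.us-east-1.amazonaws.com")
  .option("startingPosition", "TRIM_HORIZON")
  .option("kinesis.executor.maxFetchRecordsPerShard", "50000")
  .option("kinesis.executor.addIdleTimeBetweenReads", "true")
  .option("kinesis.executor.idleTimeBetweenReadsInMs", "500")
  .option("kinesis.client.describeShardInterval", "5s")
  .load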
## Kinesis Sink Configuration
| Option Name | Default Value | Description |
| ------------- | ------------- | ----------- |
| streamName | - | Name of the stream in Kinesis to write to|
| endpointUrl | https://kinesis.us-east-1.amazonaws.com | The AWS endpoint of the Kinesis stream |
| awsAccessKeyId | - | AWS credentials for Kinesis describe and put record operations |
| awsSecretKey | - | AWS credentials for Kinesis describe and put record operations |
| awsSTSRoleARN | - | AWS STS role ARN for Kinesis describe and put record operations |
| awsSTSSessionName | - | AWS STS session name for Kinesis describe and put record operations |
| awsUseInstanceProfile | true | Use instance profile credentials if no credentials are provided |
| kinesis.executor.recordMaxBufferedTime | 1000 (millis) | Specify the maximum buffered time of a record |
| kinesis.executor.maxConnections | 1 | Specify the maximum connections to Kinesis |
| kinesis.executor.aggregationEnabled | true | Specify if records should be aggregated before sending them to Kinesis |
| kinesis.executor.flushWaitTimeMillis | 100 | Wait time while flushing records to Kinesis on task end |
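Sink options are set per query in the same way. A sketch reusing the kinesis source defined earlier (the values are illustrative and the checkpoint path is an example):

// Sink with producer tuning options from the table above
kinesis
  .selectExpr("partitionKey", "CAST(data AS STRING) AS data")
  .writeStream
  .format("kinesis")
  .option("streamName", "spark-sink-example")
  .option("endpointUrl", "https://kinesis.us-east-1.amazonaws.com")
  .option("kinesis.executor.maxConnections", "4")
  .option("kinesis.executor.aggregationEnabled", "false")
  .option("checkpointLocation", "/tmp/kinesis-sink-checkpoint") // example path
  .start()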
## Roadmap
* We need to migrate to DataSource V2 APIs for MicroBatchExecution.
* Maintain per micro-batch shard commit state in DynamoDB.

## Acknowledgement
This connector would not have been possible without the reference implementations of the [Kafka connector](https://github.com/apache/spark/tree/branch-2.2/external/kafka-0-10-sql) for Structured Streaming, the [Kinesis Connector](https://github.com/apache/spark/tree/branch-2.2/external/kinesis-asl) for legacy streaming, and the [Kinesis Client Library](https://github.com/awslabs/amazon-kinesis-client). The structure of some parts of the code is influenced by the excellent work done by various Apache Spark contributors.