{"id":21268959,"url":"https://github.com/qubole/kinesis-sql","last_synced_at":"2025-04-08T03:11:49.002Z","repository":{"id":30446307,"uuid":"123418284","full_name":"qubole/kinesis-sql","owner":"qubole","description":"Kinesis Connector for Structured Streaming","archived":false,"fork":false,"pushed_at":"2024-07-02T12:12:23.000Z","size":257,"stargazers_count":136,"open_issues_count":34,"forks_count":80,"subscribers_count":12,"default_branch":"master","last_synced_at":"2025-03-31T17:18:13.991Z","etag":null,"topics":["kinesis","real-time-processing","spark","spark-streaming","spark-structured-streaming","structured-streaming"],"latest_commit_sha":null,"homepage":"http://www.qubole.com","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/qubole.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-03-01T10:14:43.000Z","updated_at":"2024-12-29T15:18:46.000Z","dependencies_parsed_at":"2024-12-21T03:08:21.822Z","dependency_job_id":null,"html_url":"https://github.com/qubole/kinesis-sql","commit_stats":null,"previous_names":[],"tags_count":10,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qubole%2Fkinesis-sql","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qubole%2Fkinesis-sql/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qubole%2Fkinesis-sql/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qubole%2Fkinesis-sql/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/qubole","download_url":"https://codeload.github.com/qubole/kinesis-sql/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247767236,"owners_count":20992548,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["kinesis","real-time-processing","spark","spark-streaming","spark-structured-streaming","structured-streaming"],"created_at":"2024-11-21T08:06:54.763Z","updated_at":"2025-04-08T03:11:48.986Z","avatar_url":"https://github.com/qubole.png","language":"Scala","readme":"[![Build Status](https://travis-ci.org/qubole/kinesis-sql.svg?branch=master)](https://travis-ci.org/qubole/kinesis-sql)\n\n## NOTE: This project is NO LONGER MAINTAINED. \n\n[Ron Cremer](https://github.com/roncemer) has volunteered to maintain this project. Beginning with Spark 3.2, the new project is located here: https://github.com/roncemer/spark-sql-kinesis\n\n\n# Kinesis Connector for Structured Streaming \n\nImplementation of Kinesis Source Provider in Spark Structured Streaming. [SPARK-18165](https://issues.apache.org/jira/browse/SPARK-18165) describes the need for such implementation. More details on the implementation can be read in this [blog](https://www.qubole.com/blog/kinesis-connector-for-structured-streaming/)\n\n## Downloading and Using the Connector\n\nThe connector is available from the Maven Central repository. It can be used using the --packages option or the spark.jars.packages configuration property. Use the following connector artifact\n\n\tSpark 3.0: com.qubole.spark/spark-sql-kinesis_2.12/1.2.0-spark_3.0\n\tSpark 2.4: com.qubole.spark/spark-sql-kinesis_2.11/1.2.0-spark_2.4\n\n## Developer Setup\nCheckout kinesis-sql branch depending upon your Spark version. Use Master branch for the latest Spark version \n\n###### Spark version 3.0.x\n\tgit clone git@github.com:qubole/kinesis-sql.git\n\tgit checkout master\n\tcd kinesis-sql\n\tmvn install -DskipTests\n\nThis will create *target/spark-sql-kinesis_2.12-\\*.jar* file which contains the connector code and its dependency jars.\n\n\n## How to use it\n\n#### Setup Kinesis\nRefer [Amazon Docs](https://docs.aws.amazon.com/cli/latest/reference/kinesis/create-stream.html) for more options\n\n###### Create Kinesis Stream \n\n\t$ aws kinesis create-stream --stream-name test --shard-count 2\n    \n###### Add Records in the stream\n\t\n    $ aws kinesis put-record --stream-name test --partition-key 1 --data 'Kinesis'\n    $ aws kinesis put-record --stream-name test --partition-key 1 --data 'Connector'\n    $ aws kinesis put-record --stream-name test --partition-key 1 --data 'for'\n    $ aws kinesis put-record --stream-name test --partition-key 1 --data 'Apache'\n    $ aws kinesis put-record --stream-name test --partition-key 1 --data 'Spark'\n\n#### Example Streaming Job\n\nRefering $SPARK_HOME to the Spark installation directory.\n\n###### Open Spark-Shell\n\n\t$SPARK_HOME/bin/spark-shell --jars target/spark-sql-kinesis_2.11-2.2.0.jar\n\n###### Subscribe to Kinesis Source\n\t// Subscribe the \"test\" stream\n\tscala\u003e :paste\n\tval kinesis = spark\n  \t\t.readStream\n  \t\t.format(\"kinesis\")\n    \t.option(\"streamName\", \"spark-streaming-example\")\n       \t.option(\"endpointUrl\", \"https://kinesis.us-east-1.amazonaws.com\")\n        .option(\"awsAccessKeyId\", [ACCESS_KEY])\n        .option(\"awsSecretKey\", [SECRET_KEY])\n        .option(\"startingposition\", \"TRIM_HORIZON\")\n  \t\t.load\n\n###### Check Schema \n\tscala\u003e kinesis.printSchema\n\troot\n \t|-- data: binary (nullable = true)\n \t|-- streamName: string (nullable = true)\n \t|-- partitionKey: string (nullable = true)\n \t|-- sequenceNumber: string (nullable = true)\n \t|-- approximateArrivalTimestamp: timestamp (nullable = true)\n\n###### Word Count \n\t// Cast data into string and group by data column\n\tscala\u003e :paste\n    \n    \t kinesis\n        .selectExpr(\"CAST(data AS STRING)\").as[(String)]\n        .groupBy(\"data\").count()\n  \t\t.writeStream\n  \t\t.format(\"console\")\n        .outputMode(\"complete\") \n  \t\t.start()\n  \t\t.awaitTermination()\n        \n###### Output in Console\n\n\n\t+------------+-----+\n\t|        data|count|\n\t+------------+-----+\n\t|         for|    1|\n\t|      Apache|    1|\n    |       Spark|    1|\n\t|     Kinesis|    1|\n\t|   Connector|    1|\n\t+------------+-----+ \n\n###### Using the Kinesis Sink\n    // Cast data into string and group by data column\n        scala\u003e :paste\n        kinesis\n        .selectExpr(\"CAST(rand() AS STRING) as partitionKey\",\"CAST(data AS STRING)\").as[(String,String)]\n        .groupBy(\"data\").count()\n  \t    .writeStream\n  \t    .format(\"kinesis\")\n        .outputMode(\"update\") \n        .option(\"streamName\", \"spark-sink-example\")\n        .option(\"endpointUrl\", \"https://kinesis.us-east-1.amazonaws.com\")\n        .option(\"awsAccessKeyId\", [ACCESS_KEY])\n        .option(\"awsSecretKey\", [SECRET_KEY])\n  \t    .start()\n  \t    .awaitTermination()\n\n## Kinesis Source Configuration \n\n Option-Name        | Default-Value           | Description  |\n| ------------- |:-------------:| -----:|\n| streamName     | - | Name of the stream in Kinesis to read from |\n| endpointUrl     |   https://kinesis.us-east-1.amazonaws.com    |   end-point URL for Kinesis Stream|\n| awsAccessKeyId |    -     |    AWS Credentials for Kinesis describe, read record operations |\n| awsSecretKey |      -  |    AWS Credentials for Kinesis describe, read record operations |\n| awsSTSRoleARN |      -  |    AWS STS Role ARN for Kinesis describe, read record operations |\n| awsSTSSessionName |      -  |    AWS STS Session name for Kinesis describe, read record operations |\n| awsUseInstanceProfile | true |    Use Instance Profile Credentials if none of credentials provided |\n| startingPosition |      LATEST |    Starting Position in Kinesis to fetch data from. Possible values are \"latest\", \"trim_horizon\", \"earliest\" (alias for trim_horizon), or JSON serialized map shardId-\u003eKinesisPosition   |\n| failondataloss| true | fail the streaming job if any active shard is missing or expired\n| kinesis.executor.maxFetchTimeInMs |     1000 |  Maximum time spent in executor to fetch record from Kinesis per Shard |\n| kinesis.executor.maxFetchRecordsPerShard |     100000 |  Maximum Number of records to fetch per shard  |\n| kinesis.executor.maxRecordPerRead |     10000 |  Maximum Number of records to fetch per getRecords API call  |\n| kinesis.executor.addIdleTimeBetweenReads\t| false\t| Add delay between two consecutive getRecords API call\t|\n| kinesis.executor.idleTimeBetweenReadsInMs\t| 1000\t| Minimum delay between two consecutive getRecords\t| \n| kinesis.client.describeShardInterval |      1s (1 second) |  Minimum Interval between two ListShards API calls to consider resharding  |\n| kinesis.client.numRetries |     3 |  Maximum Number of retries for Kinesis API requests  |\n| kinesis.client.retryIntervalMs |     1000 |  Cool-off period before retrying Kinesis API  |\n| kinesis.client.maxRetryIntervalMs\t| 10000\t| Max Cool-off period between 2 retries\t|\n| kinesis.client.avoidEmptyBatches| false | Avoid creating an empty microbatch job by checking upfront if there are any unread data in the stream before the batch is started\n\n## Kinesis Sink Configuration\n Option-Name        | Default-Value           | Description  |\n| ------------- |:-------------:| -----:|\n| streamName   | - | Name of the stream in Kinesis to write to|\n| endpointUrl  | https://kinesis.us-east-1.amazonaws.com |  The aws endpoint of the kinesis Stream |\n| awsAccessKeyId |    -     |    AWS Credentials for  Kinesis describe, read record operations    \n| awsSecretKey |      -  |    AWS Credentials for  Kinesis describe, read record |\n| awsSTSRoleARN |      -  |    AWS STS Role ARN for Kinesis describe, read record operations |\n| awsSTSSessionName |      -  |    AWS STS Session name for Kinesis describe, read record operations |\n| awsUseInstanceProfile | true |    Use Instance Profile Credentials if none of credentials provided |\n| kinesis.executor.recordMaxBufferedTime | 1000 (millis) | Specify the maximum buffered time of a record |\n| kinesis.executor.maxConnections | 1 | Specify the maximum connections to Kinesis | \n| kinesis.executor.aggregationEnabled | true | Specify if records should be aggregated before sending them to Kinesis |\n| kniesis.executor.flushwaittimemillis | 100 | Wait time while flushing records to Kinesis on Task End |\n\n## Roadmap\n*  We need to migrate to DataSource V2 APIs for MicroBatchExecution.\n*  Maintain Per Micro-Batch Shard Commit state in Dynamo DB\n\n## Acknowledgement\n\nThis connector would not have been possible without reference implemetation of [Kafka connector](https://github.com/apache/spark/tree/branch-2.2/external/kafka-0-10-sql) for Structured streaming, [Kinesis Connector](https://github.com/apache/spark/tree/branch-2.2/external/kinesis-asl) for Legacy Streaming and [Kinesis Client Library](https://github.com/awslabs/amazon-kinesis-client). Structure of some part of the code is influenced by the excellent work done by various Apache Spark Contributors.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fqubole%2Fkinesis-sql","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fqubole%2Fkinesis-sql","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fqubole%2Fkinesis-sql/lists"}