{"id":15056710,"url":"https://github.com/polomarcus/spark-structured-streaming-examples","last_synced_at":"2025-04-10T03:54:24.844Z","repository":{"id":73159128,"uuid":"94458320","full_name":"polomarcus/Spark-Structured-Streaming-Examples","owner":"polomarcus","description":"Spark Structured Streaming / Kafka / Cassandra / Elastic ","archived":false,"fork":false,"pushed_at":"2023-02-07T15:06:48.000Z","size":17267,"stargazers_count":183,"open_issues_count":5,"forks_count":78,"subscribers_count":10,"default_branch":"master","last_synced_at":"2025-04-10T03:54:17.692Z","etag":null,"topics":["cassandra","kafka","spark","spark-sql","structured-streaming"],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/polomarcus.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-06-15T16:27:21.000Z","updated_at":"2025-01-03T21:46:13.000Z","dependencies_parsed_at":null,"dependency_job_id":"84c6ff4f-9c8a-4ebb-95f8-59525bf96031","html_url":"https://github.com/polomarcus/Spark-Structured-Streaming-Examples","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/polomarcus%2FSpark-Structured-Streaming-Examples","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/polomarcus%2FSpark-Structured-Streaming-Examples/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/polomarcus%2FSpark-Structured-Streaming-Examples/releases","manifes
ts_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/polomarcus%2FSpark-Structured-Streaming-Examples/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/polomarcus","download_url":"https://codeload.github.com/polomarcus/Spark-Structured-Streaming-Examples/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248154999,"owners_count":21056542,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cassandra","kafka","spark","spark-sql","structured-streaming"],"created_at":"2024-09-24T21:55:22.783Z","updated_at":"2025-04-10T03:54:24.822Z","avatar_url":"https://github.com/polomarcus.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Kafka / Cassandra / Elastic with Spark Structured Streaming\n\n[![Codacy Badge](https://api.codacy.com/project/badge/Grade/214d5a4420ef471cba15ca3c59c15de0)](https://app.codacy.com/app/paleclercq/Spark-Structured-Streaming-Examples?utm_source=github.com\u0026utm_medium=referral\u0026utm_content=polomarcus/Spark-Structured-Streaming-Examples\u0026utm_campaign=Badge_Grade_Dashboard)\n\nStream the number of times **Drake is broadcast** on each radio station.\nAlso see how easy Spark Structured Streaming is to use with Spark SQL's DataFrame API.\n\n## Run the Project\n### Step 1 - Start containers\nStart the ZooKeeper, Kafka, and Cassandra containers in detached mode (-d):\n```\n./start-docker-compose.sh\n```\nThe script runs the following commands so you don't have to:\n```\ndocker-compose up -d\n```\n\n```\n# create Cassandra 
schema\ndocker-compose exec cassandra cqlsh -f /schema.cql;\n\n# confirm schema\ndocker-compose exec cassandra cqlsh -e \"DESCRIBE SCHEMA;\"\n```\n\n### Step 2 - Start Spark Structured Streaming\n```\nsbt run\n```\n\n### Run the project again\nBecause checkpointing lets us process our data exactly once, we need to delete the checkpoint folder to re-run the examples from scratch.\n```\nrm -rf checkpoint/\nsbt run\n```\n\n## Monitor\n* Spark : http://localhost:4040/SQL/\n* Kibana (index \"test\") : http://localhost:5601/app/kibana#/discover\n* Kafka : Read all messages sent\n```\ndocker-compose exec kafka  \\\n kafka-console-consumer --bootstrap-server localhost:9092 --topic test --from-beginning\n```\n\nExamples:\n```\n{\"radio\":\"nova\",\"artist\":\"Drake\",\"title\":\"From Time\",\"count\":18}\n{\"radio\":\"nova\",\"artist\":\"Drake\",\"title\":\"4pm In Calabasas\",\"count\":1}\n```\n## Requirements\n* SBT\n* [docker compose](https://github.com/docker/compose/releases/tag/1.17.1)\n\n### Linux\n```\ncurl -L https://github.com/docker/compose/releases/download/1.17.1/docker-compose-`uname -s`-`uname -m` -o /usr/local/bin/docker-compose\nchmod +x /usr/local/bin/docker-compose\n```\n### macOS\n```\nbrew install docker-compose\n```\n\n## Input data\nThe data comes from radio stations and is stored in a Parquet file; the stream is emulated with the `.option(\"maxFilesPerTrigger\", 1)` option.\n\nThe stream is read and then written to a Kafka sink.\nThen the data goes from Kafka to Cassandra.\n\n## Output data\nStored in Kafka and Cassandra, for example purposes only.\nThe Cassandra sinks use the [ForeachWriter](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.ForeachWriter) and the [StreamSinkProvider](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.sources.StreamSinkProvider) so that both sinks can be compared.\n\nOne uses **Datastax's Cassandra saveToCassandra** method. 
The other uses a messier (untyped) method that runs CQL in a custom foreach loop.\n\nFrom Spark's documentation on the trigger interval:\n\u003e Trigger interval: Optionally, specify the trigger interval. If it is not specified, the system will check for availability of new data as soon as the previous processing has completed. If a trigger time is missed because the previous processing has not completed, then the system will attempt to trigger at the next trigger point, not immediately after the processing has completed.\n\n### Kafka topic\nOne topic, `test`, with a single partition.\n\n#### List all topics\n```\ndocker-compose exec kafka  \\\n  kafka-topics --list --zookeeper zookeeper:32181\n```\n\n\n#### Send a message to be processed\n```\ndocker-compose exec kafka  \\\n kafka-console-producer --broker-list localhost:9092 --topic test\n\n\u003e {\"radio\":\"skyrock\",\"artist\":\"Drake\",\"title\":\"Hold On We’Re Going Home\",\"count\":38}\n```\n\n### Cassandra Tables\nThere are 3 tables: 2 are used as sinks, and the third stores Kafka metadata.\nHave a look at [schema.cql](https://github.com/polomarcus/Spark-Structured-Streaming-Examples/blob/e9afaf6691c860ffb4da64e311c6cec4cdee8968/src/conf/cassandra/schema.cql) for all the details.\n\n```\n docker-compose exec cassandra cqlsh -e \"SELECT * FROM structuredstreaming.radioOtherSink;\"\n\n radio   | title                    | artist | count\n---------+--------------------------+--------+-------\n skyrock |                Controlla |  Drake |     1\n skyrock |                Fake Love |  Drake |     9\n skyrock | Hold On We’Re Going Home |  Drake |    35\n skyrock |            Hotline Bling |  Drake |  1052\n skyrock |  Started From The Bottom |  Drake |    39\n    nova |         4pm In Calabasas |  Drake |     1\n    nova |             Feel No Ways |  Drake |     2\n    nova |                From Time |  Drake |    34\n    nova |                     Hype |  Drake |     2\n\n```\n\n### Kafka Metadata\n@TODO Verify this below 
information. Cf. this [SO comment](https://stackoverflow.com/questions/46153105/how-to-get-kafka-offsets-for-structured-query-for-manual-and-reliable-offset-man/46174353?noredirect=1#comment79536515_46174353)\n\nWhen upgrading the application, we cannot use [checkpointing](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failures-with-checkpointing), so we need to store our offsets in an external data source; Cassandra is used here.\nThen, when starting our Kafka source, we need to set the \"startingOffsets\" option to a JSON string such as:\n```\n\"\"\" {\"topicA\":{\"0\":23,\"1\":-1},\"topicB\":{\"0\":-2}} \"\"\"\n```\nLearn more [in the official Spark documentation for Kafka](https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#creating-a-kafka-source-for-batch-queries).\n\nIf no Kafka metadata is stored in Cassandra, **earliest** is used.\n\n```\ndocker-compose exec cassandra cqlsh -e \"SELECT * FROM structuredstreaming.kafkametadata;\"\n partition | offset\n-----------+--------\n         0 |    171\n```\n\n## Useful links\n* [Kafka tutorial #8 - Spark Structured Streaming](http://aseigneurin.github.io/2018/08/14/kafka-tutorial-8-spark-structured-streaming.html)\n* [Processing Data in Apache Kafka with Structured Streaming in Apache Spark 2.2](https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html)\n* https://databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html\n* https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach\n* https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes\n* [Elastic Structured Streaming doc](https://www.elastic.co/blog/structured-streaming-elasticsearch-for-hadoop-6-0)\n* [Structured Streaming - “Failed to find data source: es” 
](https://discuss.elastic.co/t/structured-streaming-failed-to-find-data-source-es)\n* [Arbitrary Stateful Processing in Apache Spark’s Structured Streaming][1]\n* [Deep dive stateful stream processing][2] \n* [Official documentation][3]\n\n\n  [1]: https://databricks.com/blog/2017/10/17/arbitrary-stateful-processing-in-apache-sparks-structured-streaming.html\n  [2]: https://databricks.com/session/deep-dive-stateful-stream-processing\n  [3]: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#arbitrary-stateful-operations\n\n### Docker-compose\n* [The Last Pickle's Docker example](https://github.com/thelastpickle/docker-cassandra-bootstrap/blob/master/docker-compose.yml)\n* [Confluent's Kafka Docker Compose](https://docs.confluent.io/current/installation/docker/docs/quickstart.html#getting-started-with-docker-compose)\n\n## Inspired by\n* https://github.com/ansrivas/spark-structured-streaming\n* [Holden Karau's High Performance Spark](https://github.com/holdenk/spark-structured-streaming-ml/blob/master/src/main/scala/com/high-performance-spark-examples/structuredstreaming/CustomSink.scala#L66)\n* [Jay Kreps's blog article](https://medium.com/@jaykreps/exactly-once-support-in-apache-kafka-55e1fdd0a35f)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpolomarcus%2Fspark-structured-streaming-examples","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpolomarcus%2Fspark-structured-streaming-examples","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpolomarcus%2Fspark-structured-streaming-examples/lists"}