{"id":27968592,"url":"https://github.com/ging/spark-streaming-delta-lake","last_synced_at":"2025-05-07T21:05:10.763Z","repository":{"id":235478816,"uuid":"790776371","full_name":"ging/spark-streaming-delta-lake","owner":"ging","description":null,"archived":false,"fork":false,"pushed_at":"2024-04-25T10:04:12.000Z","size":202,"stargazers_count":2,"open_issues_count":0,"forks_count":2,"subscribers_count":8,"default_branch":"main","last_synced_at":"2025-05-07T21:05:07.042Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ging.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2024-04-23T14:03:48.000Z","updated_at":"2025-01-15T06:50:29.000Z","dependencies_parsed_at":"2024-04-23T15:29:00.681Z","dependency_job_id":"48127098-6955-44e1-8c53-d8d2e958cf69","html_url":"https://github.com/ging/spark-streaming-delta-lake","commit_stats":null,"previous_names":["ging/spark-streaming-delta-lake"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ging%2Fspark-streaming-delta-lake","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ging%2Fspark-streaming-delta-lake/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ging%2Fspark-streaming-delta-lake/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ging%2Fspark-streaming-delta-lake/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ging","download_url":"https://codelo
ad.github.com/ging/spark-streaming-delta-lake/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252954429,"owners_count":21830903,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-05-07T21:05:10.241Z","updated_at":"2025-05-07T21:05:10.750Z","avatar_url":"https://github.com/ging.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Spark Streaming and Delta Lake\n\n## Context\n\nWe’re big data engineers working at ESA (the European Space Agency). After a technical meeting, we’ve been tasked with creating a pipeline for real-time data from the ISS (International Space Station). It involves building a data lake for historical data and a service that reports the highest altitude the satellite reaches, for aerospace research purposes.\n\nOur colleagues from ground control developed a service exposing a REST API endpoint, so that when we send an HTTP GET request, we get current information from the satellite.\n\nAfter a thorough review of the briefing, our team came up with the following service architecture:\n\n![architecture](./assets/architecture.png)\n\nThe architecture is not very complicated, but it’s effective and scalable. \n\n- First, we set up a Python script that runs a loop querying the ground control API service every 0.5 seconds. If the request is successful, it produces a Kafka message in the “iss” topic.\n- The Kafka cluster is simple for now: just one broker together with a ZooKeeper instance. 
We have only one Kafka instance, but it could scale out horizontally with more Kafka servers.\n- In the data lake, we have drawn two sections following the medallion architecture, with just two stages. The first one is the bronze stage, in which we just parse the JSON data coming from the Kafka broker, convert it to a Spark table with the Spark Structured Streaming API, and store it with the Delta Lake connector.\n- The second one is the gold stage, in which we aggregate data, again using the Spark Structured Streaming API, but this time consuming the real-time data stream from Delta Lake.\n\n## Setting up the project\n\n### Docker environment\n\nThe Kafka broker, ZooKeeper, and the data access service are dockerized. It’s important to review carefully how this setup has been done. Feel free to check out docker-compose.yaml and the data-source/src/main.py file to understand the environment behind the scenes. \n\nLaunch the Docker environment by opening a shell session in the root folder and typing:\n\n```bash\ndocker compose up -d\n```\n\nIf you want to restart or tear down the environment, just type:\n\n```bash\ndocker compose down\n```\n\n### Check out Kafka incoming data\n\nTo check the incoming messages in the “iss” topic, let’s get into the Kafka container and use the kafka-console-consumer utility to see the data flow. 
\n\n```bash\ndocker exec -it kafka sh\n```\n\nThen, in the container session, type:\n\n```bash\n/bin/kafka-console-consumer --bootstrap-server localhost:9092 --topic iss\n```\n\nYou should see the data flowing like this: \n\n```json\n{\"name\": \"iss\", \"id\": 25544, \"latitude\": -48.947826322504, \"longitude\": 125.29514904266, \"altitude\": 433.83561822897, \"velocity\": 27538.217323796, \"visibility\": \"daylight\", \"footprint\": 4577.1935346024, \"timestamp\": 1714032708, \"daynum\": 2460425.8415278, \"solar_lat\": 13.394772525764, \"solar_lon\": 56.529414593886, \"units\": \"kilometers\"}\n{\"name\": \"iss\", \"id\": 25544, \"latitude\": -48.969416507533, \"longitude\": 125.38312401158, \"altitude\": 433.84568019422, \"velocity\": 27538.194159784, \"visibility\": \"daylight\", \"footprint\": 4577.2437734697, \"timestamp\": 1714032709, \"daynum\": 2460425.8415394, \"solar_lat\": 13.394776271319, \"solar_lon\": 56.525247467118, \"units\": \"kilometers\"}\n{\"name\": \"iss\", \"id\": 25544, \"latitude\": -49.012376588641, \"longitude\": 125.55931500715, \"altitude\": 433.86570018068, \"velocity\": 27538.148067913, \"visibility\": \"daylight\", \"footprint\": 4577.3437302068, \"timestamp\": 1714032711, \"daynum\": 2460425.8415625, \"solar_lat\": 13.394783762579, \"solar_lon\": 56.516913045926, \"units\": \"kilometers\"}\n```\n\nTo exit this utility, press `Ctrl+C`, and to exit the container session, type `exit`.\n\nNow we have our project up and running. \n\n### Sbt and Java setup\n\nFirst, make sure that you have sbt and Java 1.8 installed on your machine. \n\n```bash\nwhich sbt\n```\n\nFor Java, we’ll use SDKMAN. First, check that SDKMAN itself is available: \n\n```bash\nsdk version\n```\n\nThen let’s make sure our Java version is 1.8.\n\n```bash\nsdk current java\n```\n\nIf that’s not the case, just run: \n\n```bash\nsdk install java 8.0.412-amzn\nsdk default java 8.0.412-amzn\n```\n\n### Launch Bronze ETL\n\nNow we’re going to launch our first Spark job. 
Instead of working with spark-submit, we just work with sbt. \n\n```bash\ncd etl\nsbt \"runMain etsit.ging.etl.etl_bronze.EtlBronze\"\n```\n\nIf everything works properly, you should see a lot of logs from Spark services bootstrapping, followed by a recurring log trace of the micro-batches produced by the Spark streaming writeStream in the console. \n\nYou will also notice that a new folder called `data` has been created in your root folder. Feel free to check out what is inside: you’ll see a lot of Parquet files and also the transaction logs. \n\nIf you see odd ERROR or WARN entries in the logs but the pipeline is still working, do not worry: those are related to checkpoint locations, Kafka offset caches, etc. \n\n### Launch Gold ETL\n\n```bash\nsbt \"runMain etsit.ging.etl.etl_gold.EtlGold\"\n```\n\nIf everything works properly, you should see a lot of logs from Spark services bootstrapping, followed by a recurring log trace of the micro-batches produced by the Spark streaming writeStream in the console. \n\nIf you see odd ERROR or WARN entries in the logs but the pipeline is still working, do not worry: those are related to checkpoint locations, Kafka offset caches, etc.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fging%2Fspark-streaming-delta-lake","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fging%2Fspark-streaming-delta-lake","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fging%2Fspark-streaming-delta-lake/lists"}