Spark and Kafka PoC
===================

### Examples
1. [Spark Batch](./spark-batch)
2. [Spark Streaming](./spark-streaming)
3. [Kafka Streams](./kstreams)

### Prerequisites
1. Gradle > 4.7 [Install via [sdkman](http://sdkman.io/)]
2. Docker for Mac [[Setup Instructions](./docs/Docker.md)]
3. Apache Spark [[Download Link](https://spark.apache.org/downloads.html)]

#### Install Spark via SDKMAN (preferred for Windows and Mac users)
```bash
# install Spark v2.1.1, or your preferred version.
# this gives you access to the spark-shell and spark-submit CLIs
sdk ls spark
sdk i spark 2.1.1
```

#### Install Spark via brew (Mac)
```bash
# as an alternative, you can install Spark via brew on a Mac
brew update
brew install apache-spark
# verify the installation
spark-shell
```

#### Install Spark by downloading

Download `spark-x.x.x-bin-hadoop2.7.tgz` from https://spark.apache.org/downloads.html and install Spark by unpacking it, e.g. to `/Developer/Applications/spark-2.2.0-bin-hadoop2.7`.

### Build

```bash
gradle shadowJar
# skip tests
gradle shadowJar -x test
```

### Start Standalone Spark Cluster
```bash
# run in the foreground
docker-compose up spark-master
docker-compose up spark-worker
docker-compose up zeppelin
# scale up workers if needed
docker-compose scale spark-worker=2
# restart any container
docker-compose restart spark-master
# shut down
docker-compose down
# see which host ports the workers are bound to
docker-compose ps
# follow logs
docker-compose logs -f zeppelin
# open a shell in a service (master)
docker-compose exec spark-master bash
hadoop fs -ls /data/in
```

The Spark UI will be running at `http://${YOUR_DOCKER_HOST or localhost}:8080` with one worker listed.

### Run

> Add the Spark commands to your environment `PATH` if you installed Spark by downloading
```bash
SPARK_HOME=/Developer/Applications/spark-2.2.0-bin-hadoop2.7
PATH=$PATH:$SPARK_HOME/bin
# set this if you get: WARN Utils: Service 'sparkDriver' could not bind on port 0. Attempting port 1.
export SPARK_LOCAL_IP="127.0.0.1"
```
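If you source the lines above from a shell profile, the `PATH` entry can end up duplicated. A small idempotent helper avoids that; this is a sketch, and the `add_spark_to_path` function name and the install path are examples, not part of this repo:

```bash
# add Spark's bin directory to PATH (only once) and export SPARK_HOME
add_spark_to_path() {
  spark_home="$1"
  case ":$PATH:" in
    *":$spark_home/bin:"*) ;;                 # already on PATH, nothing to do
    *) PATH="$PATH:$spark_home/bin" ;;
  esac
  SPARK_HOME="$spark_home"
  export SPARK_HOME PATH
}

add_spark_to_path /Developer/Applications/spark-2.2.0-bin-hadoop2.7
```

Calling it again with the same path is a no-op, so it is safe in `.bashrc`.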
#### Spark Shell

> To open the Spark shell
```bash
spark-shell --master spark://localhost:7077
# or with docker-compose
docker-compose exec spark-master bash
# start the Spark shell within this bash session
spark-shell --master spark://spark-master:7077
# or run the example `SparkPi` job
run-example SparkPi 10
```

#### Running Locally

```bash
# submit locally
spark-submit \
    --class com.sumo.experiments.BatchJobKt \
    --master local[2] \
    spark-batch/build/libs/spark-batch-0.1.0-SNAPSHOT-all.jar

spark-submit \
    --class com.sumo.experiments.LoadJobKt \
    --master local[2] \
    --properties-file application.properties \
    spark-batch/build/libs/spark-batch-0.1.0-SNAPSHOT-all.jar
```

In IDEs like IntelliJ, you can right-click the file and run it directly.

#### Launching on a Cluster

```bash
# submit to the cluster
spark-submit \
    --class com.sumo.experiments.BatchJobKt \
    --master spark://localhost:7077 \
    spark-batch/build/libs/spark-batch-0.1.0-SNAPSHOT-all.jar

spark-submit \
    --class com.sumo.experiments.LoadJobKt \
    --master spark://localhost:7077 \
    --properties-file application-prod.properties \
    spark-batch/build/libs/spark-batch-0.1.0-SNAPSHOT-all.jar

# submit to YARN in the background, logging to app.log
nohup spark-submit \
    --class com.sumo.experiments.LoadJobKt \
    --master yarn \
    --queue abcd \
    --num-executors 2 \
    --executor-memory 2G \
    --properties-file application-prod.properties \
    spark-batch/build/libs/spark-batch-0.1.0-SNAPSHOT-all.jar arg1 arg2 > app.log 2>&1 &
```

### Gradle Commands
```bash
# upgrade the project's Gradle wrapper
gradle wrapper --gradle-version 5.0 --distribution-type all
# Gradle daemon status
gradle --status
gradle --stop
# refresh dependencies
gradle build --refresh-dependencies
```
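All of the submit commands above follow the same `--class` / `--master` / jar pattern, so they can be wrapped in a small POSIX-shell helper. This is a sketch: `submit_job` and the `DRY_RUN` flag are hypothetical, and only the class and jar names come from this repo.

```bash
# submit_job CLASS MASTER JAR [extra spark-submit flags...]
# with DRY_RUN=1 it prints the command instead of running it
submit_job() {
  class="$1"; master="$2"; jar="$3"; shift 3
  cmd="spark-submit --class $class --master $master $* $jar"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$cmd"           # dry run: show the assembled command
  else
    $cmd                  # real run: hand off to spark-submit
  fi
}

# print the cluster-submit command for LoadJobKt without running it
DRY_RUN=1 submit_job com.sumo.experiments.LoadJobKt spark://localhost:7077 \
  spark-batch/build/libs/spark-batch-0.1.0-SNAPSHOT-all.jar \
  --properties-file application-prod.properties
```

The dry-run mode is handy for checking the command line before a `nohup ... &` background submit.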
### Reference
* [try kotlin avro serializer: avro4k-kafka-serializer](https://github.com/thake/avro4k-kafka-serializer)
* https://bigdatagurus.wordpress.com/2017/03/01/how-to-start-spark-cluster-in-minutes/
* https://zeppelin.apache.org/docs/0.7.2/install/cdh.html
* https://spark.apache.org/examples.html
* https://github.com/cliftbar/etl-stack/blob/master/docker-compose.yml
* https://github.com/big-data-europe/docker-hive
* https://github.com/SANSA-Stack/SANSA-Notebooks/tree/develop