{"id":22213968,"url":"https://github.com/xmlking/spark-demo","last_synced_at":"2025-03-25T06:23:56.005Z","repository":{"id":142313670,"uuid":"483836013","full_name":"xmlking/spark-demo","owner":"xmlking","description":"spark kotlin demo","archived":false,"fork":false,"pushed_at":"2022-04-22T00:32:36.000Z","size":73,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-01-30T05:43:18.573Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Kotlin","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/xmlking.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2022-04-20T22:47:46.000Z","updated_at":"2022-04-21T03:13:22.000Z","dependencies_parsed_at":"2024-02-15T05:33:27.484Z","dependency_job_id":null,"html_url":"https://github.com/xmlking/spark-demo","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xmlking%2Fspark-demo","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xmlking%2Fspark-demo/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xmlking%2Fspark-demo/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xmlking%2Fspark-demo/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/xmlking","download_url":"https://codeload.github.com/xmlking/spark-demo/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositor
ies_count":245409539,"owners_count":20610547,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-02T21:12:48.243Z","updated_at":"2025-03-25T06:23:55.984Z","avatar_url":"https://github.com/xmlking.png","language":"Kotlin","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Apache Spark\n\nSpark Batch examples.\n\n* DemoJob - count lines in this `README.md` file.\n* AvroJob - load avro file, transform and save to avro file.\n\n## Prerequisites\n```shell\n# spark/hadoop currently don't support Java 17\nsdk install java 11.0.14-zulu\nsdk use java 11.0.14-zulu \n# install `spark-shell`, `spark-submit` cli\nsdk install spark\n```\n\n## Build\n\n```shell\ngradle build\n# skip tests\ngradle build -x test\n```\n\n## Run\n\n### Running Locally\n\n\u003e In IDEs like IntelliJ, you can run `main` method directly.\n\n```shell\nsdk use java 11.0.14-zulu \n\ngradle run \n# passing arguments for main method\ngradle run --args=\"lorem ipsum dolor\"\n```\n\nOr via spark-submit\n\n```shell\n# Submit Local\nspark-submit \\\n    --class org.mycompany.spark.AvroJobKt \\\n    --master local \\\n    --properties-file application.properties \\\n    --packages org.apache.spark:spark-avro_2.12:3.2.0,com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.6 \\\n    build/libs/spark-demo-0.1.0-SNAPSHOT-all.jar\n\nspark-submit \\\n    --class org.mycompany.spark.AvroJobKt \\\n    --master local \\\n    --properties-file application.properties \\\n    --packages org.apache.spark:spark-avro_2.12:3.2.0 \\\n    build/libs/spark-demo-0.1.0-SNAPSHOT-all.jar\n```\n\n\n### 
Launching on a Cluster\n\n```shell\n# Submit to Cluster\nspark-submit \\\n    --class org.mycompany.spark.AvroJobKt \\\n    --master spark://localhost:7077 \\\n    build/libs/spark-demo-0.1.0-SNAPSHOT-all.jar\n\nspark-submit \\\n    --class org.mycompany.spark.AvroJobKt \\\n    --master spark://localhost:7077 \\\n    --properties-file application-prod.properties \\\n    --packages org.apache.spark:spark-avro_2.12:3.2.0 \\\n    build/libs/spark-demo-0.1.0-SNAPSHOT-all.jar\n    \nnohup spark-submit \\\n    --class org.mycompany.spark.AvroJobKt \\\n    --master yarn \\\n    --queue abcd \\\n    --num-executors 2 \\\n    --executor-memory 2G \\\n    --properties-file application-prod.properties \\\n    --packages org.apache.spark:spark-avro_2.12:3.2.0 \\\n    build/libs/spark-demo-0.1.0-SNAPSHOT-all.jar arg1 arg2 \u003e app.log 2\u003e\u00261 \u0026\n```\n \n### Google Cloud\n\n```shell\nexport GCP_PROJECT=\u003cgcp-project-id\u003e\nexport REGION=\u003cregion\u003e\nexport SUBNET=\u003csubnet\u003e\nexport GCS_STAGING_LOCATION=\u003cgcs-staging-bucket-folder\u003e\nexport HISTORY_SERVER_CLUSTER=\u003chistory-server\u003e\nexport BUCKET_NAME=my-demo-bucket-sumo\n```\n```shell\ngsutil mb gs://$BUCKET_NAME\ngsutil cp data/in/account.avro gs://$BUCKET_NAME/data/in\n\ngsutil cp build/libs/spark-demo-0.1.0-SNAPSHOT-all.jar gs://$BUCKET_NAME/java/build/libs/spark-demo-0.1.0-SNAPSHOT-all.jar\n```\n\n```shell\nGCP_PROJECT=\u003cgcp-project-id\u003e\nREGION=\u003cregion\u003e\nSUBNET=\u003csubnet\u003e\nGCS_STAGING_LOCATION=\u003cgcs-staging-bucket-folder\u003e\nHISTORY_SERVER_CLUSTER=\u003chistory-server\u003e\n\ngcloud dataproc jobs submit spark \\\n    --cluster=${CLUSTER} \\\n    --region=${REGION} \\\n    --class=org.mycompany.spark.AvroJobKt \\\n    --jars=gs://${BUCKET_NAME}/java/build/libs/spark-demo-0.1.0-SNAPSHOT-all.jar \\\n    --archives=org.apache.spark:spark-avro_2.12:3.2.0,com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.6 \\\n    --properties-file 
application.properties \\\n    -- gs://${BUCKET_NAME}/data/in/ gs://${BUCKET_NAME}/data/out/\n```\n\n## Reference \n- [Introducing Kotlin for Apache Spark Preview](https://blog.jetbrains.com/kotlin/2020/08/introducing-kotlin-for-apache-spark-preview/)\n- [Code Examples](https://github.com/JetBrains/kotlin-spark-api/tree/main/examples/src/main/kotlin/org/jetbrains/kotlinx/spark/examples)\n- [Processing AVRO data using Google Cloud DataProc](https://sourabhsjain.medium.com/processing-avro-data-using-google-cloud-dataproc-86352e70e50d)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxmlking%2Fspark-demo","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fxmlking%2Fspark-demo","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxmlking%2Fspark-demo/lists"}