# spark-demo

spark kotlin demo
- Host: GitHub
- URL: https://github.com/xmlking/spark-demo
- Owner: xmlking
- Created: 2022-04-20T22:47:46.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2022-04-22T00:32:36.000Z (over 3 years ago)
- Language: Kotlin
- Size: 71.3 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Apache Spark
Spark batch examples:
* DemoJob - counts the lines in this `README.md` file.
* AvroJob - loads an Avro file, applies a transformation, and saves the result to a new Avro file (a sketch of such a job follows below).
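For orientation, here is a minimal sketch of what an Avro batch job like the one above could look like. It is written against the plain Spark SQL API (the repository itself links the JetBrains kotlin-spark-api in the references below), and the paths and column names are purely illustrative, not the repository's actual code:

```kotlin
package org.mycompany.spark

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.SparkSession

fun main(args: Array<String>) {
    // Input/output locations; the defaults are illustrative and can be
    // overridden by passing program arguments after the jar.
    val inPath = args.getOrElse(0) { "data/in/account.avro" }
    val outPath = args.getOrElse(1) { "data/out/" }

    val spark = SparkSession.builder().appName("AvroJob").getOrCreate()

    // Reading/writing Avro requires the spark-avro package (see the --packages flags below).
    val accounts = spark.read().format("avro").load(inPath)

    // Placeholder transformation: project a couple of (hypothetical) columns.
    val transformed = accounts.select("id", "name")

    transformed.write().mode(SaveMode.Overwrite).format("avro").save(outPath)
    spark.stop()
}
```

Kotlin compiles a file `AvroJob.kt` with a top-level `main` into the class `AvroJobKt`, which is why the `spark-submit` commands below reference `org.mycompany.spark.AvroJobKt`.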
## Prerequisites
```shell
# Spark/Hadoop do not yet support Java 17; install Java 11 via SDKMAN (`sdk`)
sdk install java 11.0.14-zulu
sdk use java 11.0.14-zulu
# install the `spark-shell` and `spark-submit` CLIs
sdk install spark
```
## Build
```shell
gradle build
# skip tests
gradle build -x test
```
## Run
### Running Locally
> In IDEs like IntelliJ, you can run the `main` method directly.
```shell
sdk use java 11.0.14-zulu
gradle run
# pass arguments to the main method
gradle run --args="lorem ipsum dolor"
```
Or via `spark-submit`:
```shell
# Submit locally, including the GCS connector (only needed when reading/writing gs:// paths)
spark-submit \
  --class org.mycompany.spark.AvroJobKt \
  --master local \
  --properties-file application.properties \
  --packages org.apache.spark:spark-avro_2.12:3.2.0,com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.6 \
  build/libs/spark-demo-0.1.0-SNAPSHOT-all.jar

# Submit locally with the Avro package only
spark-submit \
  --class org.mycompany.spark.AvroJobKt \
  --master local \
  --properties-file application.properties \
  --packages org.apache.spark:spark-avro_2.12:3.2.0 \
  build/libs/spark-demo-0.1.0-SNAPSHOT-all.jar
```
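However the job is launched, trailing arguments (`--args="lorem ipsum dolor"` with Gradle, or anything placed after the jar with `spark-submit`) arrive as the `args` array of the job's `main` function. A minimal, hypothetical DemoJob-style sketch of this, again using the plain Spark SQL API rather than the repository's actual code:

```kotlin
import org.apache.spark.sql.SparkSession

fun main(args: Array<String>) {
    // With `gradle run --args="lorem ipsum dolor"`, args == ["lorem", "ipsum", "dolor"].
    println("arguments: ${args.joinToString(" ")}")

    // Count the lines of this README, as described for DemoJob above.
    val spark = SparkSession.builder().appName("DemoJob").master("local[*]").getOrCreate()
    val lines = spark.read().textFile("README.md")   // Dataset<String>, one element per line
    println("README.md has ${lines.count()} lines")
    spark.stop()
}
```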
### Launching on a Cluster
```shell
# Submit to a standalone Spark cluster
spark-submit \
  --class org.mycompany.spark.AvroJobKt \
  --master spark://localhost:7077 \
  build/libs/spark-demo-0.1.0-SNAPSHOT-all.jar

# Submit with a properties file and the Avro package
spark-submit \
  --class org.mycompany.spark.AvroJobKt \
  --master spark://localhost:7077 \
  --properties-file application-prod.properties \
  --packages org.apache.spark:spark-avro_2.12:3.2.0 \
  build/libs/spark-demo-0.1.0-SNAPSHOT-all.jar

# Submit to YARN in the background, passing job arguments and logging to app.log
nohup spark-submit \
  --class org.mycompany.spark.AvroJobKt \
  --master yarn \
  --queue abcd \
  --num-executors 2 \
  --executor-memory 2G \
  --properties-file application-prod.properties \
  --packages org.apache.spark:spark-avro_2.12:3.2.0 \
  build/libs/spark-demo-0.1.0-SNAPSHOT-all.jar arg1 arg2 > app.log 2>&1 &
```
### Google Cloud
```shell
# fill in values for your GCP environment
export GCP_PROJECT=
export REGION=
export SUBNET=
export GCS_STAGING_LOCATION=
export HISTORY_SERVER_CLUSTER=
export BUCKET_NAME=my-demo-bucket-sumo
```
```shell
gsutil mb gs://$BUCKET_NAME
gsutil cp data/in/account.avro gs://$BUCKET_NAME/data/in
gsutil cp build/libs/spark-demo-0.1.0-SNAPSHOT-all.jar gs://$BUCKET_NAME/java/build/libs/spark-demo-0.1.0-SNAPSHOT-all.jar
```
```shell
# uses the variables exported above, plus the name of your Dataproc cluster
CLUSTER=

# spark-avro is pulled in via spark.jars.packages; the GCS connector is already installed on Dataproc clusters
gcloud dataproc jobs submit spark \
  --cluster=${CLUSTER} \
  --region=${REGION} \
  --class=org.mycompany.spark.AvroJobKt \
  --jars=gs://${BUCKET_NAME}/java/build/libs/spark-demo-0.1.0-SNAPSHOT-all.jar \
  --properties=spark.jars.packages=org.apache.spark:spark-avro_2.12:3.2.0 \
  --properties-file=application.properties \
  -- gs://${BUCKET_NAME}/data/in/ gs://${BUCKET_NAME}/data/out/
```
## References
- [Introducing Kotlin for Apache Spark Preview](https://blog.jetbrains.com/kotlin/2020/08/introducing-kotlin-for-apache-spark-preview/)
- [Code Examples](https://github.com/JetBrains/kotlin-spark-api/tree/main/examples/src/main/kotlin/org/jetbrains/kotlinx/spark/examples)
- [Processing AVRO data using Google Cloud DataProc](https://sourabhsjain.medium.com/processing-avro-data-using-google-cloud-dataproc-86352e70e50d)