Run a job in Spark 2.x with HDInsight and submit the job through Livy
- Host: GitHub
- URL: https://github.com/adampaternostro/azure-spark-livy
- Owner: AdamPaternostro
- Created: 2017-05-31T14:16:57.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2017-06-15T15:49:10.000Z (over 8 years ago)
- Last Synced: 2025-01-31T04:32:19.136Z (8 months ago)
- Topics: azure, azure-data-lake, hdinsight, livy, spark, spark-sql
- Language: Scala
- Size: 168 KB
- Stars: 0
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Azure-Spark-Livy
Run a job in Spark 2.x with HDInsight and submit the job through Livy. The full code is in the zip, and the Scala files are above for easy reference.

## Steps to run this script:
1 - Create an Azure Data Lake Storage account (one way to script the folder setup below is sketched after step 3)
- Create a root folder called "livy"
- Create a folder under livy called "code" and upload the SparkApp.jar inside of the folder
- Create a folder under livy called "input" and upload the HVAC.csv inside of the folder
- Create a folder under livy called "output"2 - Create a Spark cluster (Spark 2.x) that uses Data Lake as its main storage and when you create the Service Principle grant acess to the /clusters directory and the /livy directory
3 - Run a job via Livy (open a Windows Bash or Linux prompt)
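Optionally, you can script the folder setup from step 1 with the Azure CLI instead of the portal. This is only a sketch: it assumes the `az dls` (Data Lake Storage Gen1) command group, that `<your-adls-account>` is replaced with your account name, and that SparkApp.jar and HVAC.csv sit in the current directory.

```
az dls fs create --account <your-adls-account> --path /livy --folder
az dls fs create --account <your-adls-account> --path /livy/code --folder
az dls fs create --account <your-adls-account> --path /livy/input --folder
az dls fs create --account <your-adls-account> --path /livy/output --folder
az dls fs upload --account <your-adls-account> --source-path ./SparkApp.jar --destination-path /livy/code/SparkApp.jar
az dls fs upload --account <your-adls-account> --source-path ./HVAC.csv --destination-path /livy/input/HVAC.csv
```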
## ADLS Job
Read data to/from Azure Data Lake Storage.

1 - Type "nano SparkApp1.txt" (or use vi or whatever) and place the below in the file. Change the << >> items. (A rough Scala sketch of the ADLSIOTest class referenced in the JSON appears at the end of this section.)
{ "args":
[
"adl://<>.azuredatalakestore.net/livy/input/HVAC.csv",
"adl://<>.azuredatalakestore.net/livy/output/ADLSIOTest"
],
"file":"adl://<>.azuredatalakestore.net/livy/code/SparkApp.jar",
"className":"com.adampaternostro.spark.example.ADLSIOTest" }2 - Run the job via Livy. You need to delete your output folder if it exists (e.g. /livy/output/ADLSIOTest)
curl -k --user "admin:<>" -v -H "Content-Type: application/json" -X POST --data @SparkApp1.txt "<>.azurehdinsight.net/livy/batches"3 - Get the status. The prior command will return a "id": ? (replace the 0 below with the ?) You can run this over and over to see the jobs status.
curl -k --user "admin:<>" -v -X GET "<>.azurehdinsight.net/livy/batches/0"4 - Delete the batch
curl -k --user "admin:<>" -v -X DELETE "<>.azurehdinsight.net/livy/batches/0"## SQL Job
## SQL Job

Run a Spark SQL statement using the Hive metastore.

1 - Type "nano SparkApp2.txt" (or use vi or whatever) and place the below in the file. Change the << >> items. (A rough Scala sketch of the SqlTest class appears at the end of this section.)
{ "args":
[
"file:/usr/hdp/2.6.0.2-76/spark2/bin/spark-warehouse",
"SELECT * FROM hivesampletable LIMIT 100",
"adl://<>.azuredatalakestore.net/livy/output/SqlTestOut"
],
"file":"adl://<>.azuredatalakestore.net/livy/code/SparkApp.jar",
"className":"com.adampaternostro.spark.example.SqlTest" }2 - Run the job via Livy. You need to delete your output folder if it exists (e.g. /livy/output/SqlTestOut)
curl -k --user "admin:<>" -v -H "Content-Type: application/json" -X POST --data @SparkApp2.txt "<>.azurehdinsight.net/livy/batches"3 - Get the status. The prior command will return a "id": ? (replace the 0 below with the ?) You can run this over and over to see the jobs status.
curl -k --user "admin:<>" -v -X GET "<>.azurehdinsight.net/livy/batches/0"4 - Delete the batch
curl -k --user "admin:<>" -v -X DELETE "<>.azurehdinsight.net/livy/batches/0"## Notes
## Notes

- Depending on your Spark version, the value "file:/usr/hdp/2.6.0.2-76/spark2/bin/spark-warehouse" might change. To get the current value, you can SSH into your HDInsight cluster:
- ssh sshuser@MY-CLUSTER-ssh.azurehdinsight.net
- cd $SPARK_HOME/bin
- spark-shell
- sc.getConf.getAll.foreach(println)
- look for: (hive.metastore.warehouse.dir,file:/usr/hdp/2.6.0.2-76/spark2/bin/spark-warehouse). This might change to "spark.sql.warehouse.dir" in the future.
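If you prefer not to hunt for that path by hand, one alternative (a sketch only, not what this repo does) is to read it from the session configuration inside the spark-shell:

```scala
// In spark-shell, `spark` is the predefined SparkSession.
// Depending on the Spark 2.x build, the warehouse location may be exposed under either key.
val warehouseDir = spark.conf.getOption("spark.sql.warehouse.dir")
  .orElse(spark.conf.getOption("hive.metastore.warehouse.dir"))
  .getOrElse("spark-warehouse")
println(warehouseDir)
```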