Run a job in Spark 2.x with HDInsight and submit the job through Livy
- Host: GitHub
- URL: https://github.com/adampaternostro/azure-spark-livy
- Owner: AdamPaternostro
- Created: 2017-05-31T14:16:57.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2017-06-15T15:49:10.000Z (over 8 years ago)
- Last Synced: 2025-01-31T04:32:19.136Z (8 months ago)
- Topics: azure, azure-data-lake, hdinsight, livy, spark, spark-sql
- Language: Scala
- Size: 168 KB
- Stars: 0
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Azure-Spark-Livy
Run a job in Spark 2.x with HDInsight and submit the job through Livy. The full code is in the zip, and the Scala files are above for easy reference.

## Steps to run this script:
1 - Create an Azure Data Lake Storage account (one way to script the folder setup below is sketched after step 3)
- Create a root folder called "livy"
- Create a folder under livy called "code" and upload the SparkApp.jar inside of the folder
- Create a folder under livy called "input" and upload the HVAC.csv inside of the folder
- Create a folder under livy called "output"2 - Create a Spark cluster (Spark 2.x) that uses Data Lake as its main storage and when you create the Service Principle grant acess to the /clusters directory and the /livy directory
3 - Run a job via Livy (open a Windows Bash or Linux prompt)
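Optionally, you can script the folder setup from step 1 with the Azure CLI instead of the portal. This is only a sketch: it assumes the `az dls` (Data Lake Storage Gen1) command group, that `<your-adls-account>` is replaced with your account name, and that SparkApp.jar and HVAC.csv sit in the current directory.

```
az dls fs create --account <your-adls-account> --path /livy --folder
az dls fs create --account <your-adls-account> --path /livy/code --folder
az dls fs create --account <your-adls-account> --path /livy/input --folder
az dls fs create --account <your-adls-account> --path /livy/output --folder
az dls fs upload --account <your-adls-account> --source-path ./SparkApp.jar --destination-path /livy/code/SparkApp.jar
az dls fs upload --account <your-adls-account> --source-path ./HVAC.csv --destination-path /livy/input/HVAC.csv
```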
## ADLS Job
Read data to/from Azure Data Lake Storage.

1 - Type "nano SparkApp1.txt" (or use vi or whatever) and place the below in the file. Change the << >> items. (A rough Scala sketch of the ADLSIOTest class referenced in the JSON appears at the end of this section.)
{ "args":
[
"adl://<>.azuredatalakestore.net/livy/input/HVAC.csv",
"adl://<>.azuredatalakestore.net/livy/output/ADLSIOTest"
],
"file":"adl://<>.azuredatalakestore.net/livy/code/SparkApp.jar",
"className":"com.adampaternostro.spark.example.ADLSIOTest" }2 - Run the job via Livy. You need to delete your output folder if it exists (e.g. /livy/output/ADLSIOTest)
curl -k --user "admin:<>" -v -H "Content-Type: application/json" -X POST --data @SparkApp1.txt "<>.azurehdinsight.net/livy/batches"3 - Get the status. The prior command will return a "id": ? (replace the 0 below with the ?) You can run this over and over to see the jobs status.
curl -k --user "admin:<>" -v -X GET "<>.azurehdinsight.net/livy/batches/0"4 - Delete the batch
curl -k --user "admin:<>" -v -X DELETE "<>.azurehdinsight.net/livy/batches/0"## SQL Job
## SQL Job

Run a Spark SQL statement using the Hive metastore.

1 - Type "nano SparkApp2.txt" (or use vi or whatever) and place the below in the file. Change the << >> items. (A rough Scala sketch of the SqlTest class appears at the end of this section.)
{ "args":
[
"file:/usr/hdp/2.6.0.2-76/spark2/bin/spark-warehouse",
"SELECT * FROM hivesampletable LIMIT 100",
"adl://<>.azuredatalakestore.net/livy/output/SqlTestOut"
],
"file":"adl://<>.azuredatalakestore.net/livy/code/SparkApp.jar",
"className":"com.adampaternostro.spark.example.SqlTest" }2 - Run the job via Livy. You need to delete your output folder if it exists (e.g. /livy/output/SqlTestOut)
curl -k --user "admin:<>" -v -H "Content-Type: application/json" -X POST --data @SparkApp2.txt "<>.azurehdinsight.net/livy/batches"3 - Get the status. The prior command will return a "id": ? (replace the 0 below with the ?) You can run this over and over to see the jobs status.
curl -k --user "admin:<>" -v -X GET "<>.azurehdinsight.net/livy/batches/0"4 - Delete the batch
curl -k --user "admin:<>" -v -X DELETE "<>.azurehdinsight.net/livy/batches/0"## Notes
## Notes

- Depending on your Spark version, the value "file:/usr/hdp/2.6.0.2-76/spark2/bin/spark-warehouse" might change. To get the current value, you can SSH into your HDInsight cluster:
- ssh sshuser@MY-CLUSTER-ssh.azurehdinsight.net
- cd $SPARK_HOME/bin
- spark-shell
- sc.getConf.getAll.foreach(println)
- look for: (hive.metastore.warehouse.dir,file:/usr/hdp/2.6.0.2-76/spark2/bin/spark-warehouse). This might change to "spark.sql.warehouse.dir" in the future.
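If you prefer not to hunt for that path by hand, one alternative (a sketch only, not what this repo does) is to read it from the session configuration inside the spark-shell:

```scala
// In spark-shell, `spark` is the predefined SparkSession.
// Depending on the Spark 2.x build, the warehouse location may be exposed under either key.
val warehouseDir = spark.conf.getOption("spark.sql.warehouse.dir")
  .orElse(spark.conf.getOption("hive.metastore.warehouse.dir"))
  .getOrElse("spark-warehouse")
println(warehouseDir)
```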