
# Azure-Databricks-HDInsight-Hive-Metastore
How to share an HDInsight Hive Metastore with Azure Databricks

## Steps
1. Create a SQL Database (PaaS) server in Azure
2. Create an empty database under the server
3. Create an HDInsight cluster in Azure and point the external Hive metastore to your database
4. Delete the HDInsight cluster after it has been created (provisioning the cluster initializes the Hive metastore schema in your database, which is all you need going forward)
5. Create a Databricks workspace in Azure
6. Launch the workspace and open a notebook
7. Run Step 1 in a cell
8. Run Step 2 in a cell (replace the placeholders with your server name, database name, username, and password)
9. Terminate your Databricks cluster
10. Restart your cluster
11. In a SQL notebook you can run SHOW TABLES (you should see the sample table from HDInsight)
* NOTE: You will not be able to select data from this table, because the table most likely points to blob or ADLS storage. Create all your tables in HDInsight/Databricks as EXTERNAL and make sure the LOCATION property is set to blob or ADLS, as in the sketch after this list. Your Databricks cluster will also need to be authorized to access ADLS or blob storage.
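To illustrate the NOTE above, here is a minimal sketch (run from a Scala notebook cell) of creating an EXTERNAL table whose LOCATION points at blob storage. The storage account, container, column, and table names are hypothetical placeholders, and the cluster must already be authorized to read that storage location.

```
// Minimal sketch with hypothetical names: register an EXTERNAL table whose data
// lives in blob storage, so both HDInsight and Databricks can resolve it through
// the shared metastore.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS sample_table (
    id INT,
    name STRING
  )
  STORED AS PARQUET
  LOCATION 'wasbs://<container>@<storage-account>.blob.core.windows.net/sample_table'
""")

// Both engines should now see the table through the shared metastore.
spark.sql("SHOW TABLES").show()
```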

### Step 1
```
// Create the DBFS directory where the cluster init script will live.
dbutils.fs.mkdirs("dbfs:/databricks/init/")
```
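As a quick optional check (not part of the original steps), you can list the directory to confirm it exists:

```
// List the init-script directory to confirm it was created.
dbutils.fs.ls("dbfs:/databricks/init/").foreach(f => println(f.path))
```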

### Step 2
```
dbutils.fs.put(
  "/databricks/init/external-metastore.sh",
  """#!/bin/sh
    |# Loads environment variables to determine the correct JDBC driver to use.
    |source /etc/environment
    |# Quoting the label (i.e. EOF) with single quotes to disable variable interpolation.
    |cat << 'EOF' > /databricks/driver/conf/00-custom-spark.conf
    |[driver] {
    |  # Hive-specific configuration options for metastores in local mode.
    |  "spark.hadoop.javax.jdo.option.ConnectionURL" = "jdbc:sqlserver://<server-name>.database.windows.net:1433;database=<database-name>;encrypt=true;trustServerCertificate=true;create=false;loginTimeout=300"
    |  "spark.hadoop.javax.jdo.option.ConnectionUserName" = "<username>"
    |  "spark.hadoop.javax.jdo.option.ConnectionPassword" = "<password>"
    |  # Verify the metastore schema rather than silently modifying it.
    |  "hive.metastore.schema.verification" = "true"
    |  "hive.metastore.schema.verification.record.version" = "true"
    |  # Use the Hive 2.1.1 metastore client, with jars downloaded from Maven.
    |  "spark.sql.hive.metastore.version" = "2.1.1"
    |  "spark.sql.hive.metastore.jars" = "maven"
    |EOF
    |# Add the JDBC driver separately since we must use variable expansion to choose
    |# the correct driver version.
    |cat << EOF >> /databricks/driver/conf/00-custom-spark.conf
    |  "spark.hadoop.javax.jdo.option.ConnectionDriverName" = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    |}
    |EOF
    |""".stripMargin,
  overwrite = true
)
```
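Once the cluster has been restarted (steps 9 and 10), two quick sanity checks, both optional additions to the original walkthrough, can confirm the script landed and the metastore is reachable:

```
// Print the init script that was written to DBFS.
println(dbutils.fs.head("/databricks/init/external-metastore.sh"))

// If the external metastore connection works, this lists its databases.
spark.sql("SHOW DATABASES").show()
```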