https://github.com/blairj09/databricks-exp
Experimenting with RStudio and Databricks Connect
- Host: GitHub
- URL: https://github.com/blairj09/databricks-exp
- Owner: blairj09
- Created: 2020-03-03T23:49:06.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2022-10-25T17:13:35.000Z (about 3 years ago)
- Last Synced: 2025-01-24T16:11:24.621Z (10 months ago)
- Language: HTML
- Size: 8.43 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Databricks Experiments
This repo highlights initial efforts using [Databricks
Connect](https://docs.databricks.com/dev-tools/databricks-connect.html) to
remotely connect to a Databricks cluster and interact with it via `sparklyr`.
## Requirements
- Development version of `sparklyr`: `remotes::install_github("sparklyr/sparklyr")`
- Follow the [Client setup
instructions](https://docs.databricks.com/dev-tools/databricks-connect.html#client-setup)
for Databricks Connect.
```shell
pip install -U databricks-connect==6.3.1
```
- Note: Running `databricks-connect` prints a message about adding
`spark.databricks.service.server.enabled true` to the Spark conf. This setting
isn't necessary on Databricks Runtime versions above 5.3.
## Connecting
To connect, there must be a local installation of Spark whose version matches
the Spark version of the Databricks cluster. This can be installed using
`sparklyr::spark_install()`.
The `SPARK_HOME` environment variable must be set to the output of
`databricks-connect get-spark-home`.
Connect using the following:
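```r
library(sparklyr)
# Point SPARK_HOME at the path printed by `databricks-connect get-spark-home`
Sys.setenv(SPARK_HOME = "[path returned by databricks-connect get-spark-home]")
# Open the connection through Databricks Connect
sc <- spark_connect(master = "local", method = "databricks")
```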
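As an alternative to hard-coding the path in `Sys.setenv()`, the value can be resolved from the CLI and stored where R picks it up at startup. The following is a minimal sketch, not part of this repository: `resolve_spark_home` and `write_spark_home` are hypothetical helper names, and the default target `~/.Renviron` assumes the usual R startup behavior of reading that file for environment variables.

```shell
#!/bin/sh
# Sketch: resolve SPARK_HOME from the Databricks Connect CLI.
# Helper names and the default target file are assumptions for illustration.

# Print the Spark home reported by databricks-connect, or fail with a
# message on stderr if the CLI is not on PATH.
resolve_spark_home() {
  if ! command -v databricks-connect >/dev/null 2>&1; then
    echo "databricks-connect not found on PATH" >&2
    return 1
  fi
  databricks-connect get-spark-home
}

# Append a SPARK_HOME entry to an Renviron-style file.
# $1: target file (assumed default: ~/.Renviron)
write_spark_home() {
  target="${1:-$HOME/.Renviron}"
  spark_home="$(resolve_spark_home)" || return 1
  printf 'SPARK_HOME=%s\n' "$spark_home" >> "$target"
}
```

Calling `write_spark_home` with no arguments appends the entry to `~/.Renviron`. A new R session should then see `SPARK_HOME` without the explicit `Sys.setenv()` call, provided no project-level `.Renviron` takes precedence.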
## Setup
This repository uses [`renv`](https://rstudio.github.io/renv/articles/renv.html),
and packages can be restored using `renv::restore()`. Note that no effort has
been made to keep dependencies light, so this project uses several packages.
## Test Environment
General description of the environment used for testing and proving out
Databricks + RStudio:
* Use `sol-eng-terraform` and RST standalone
* Install `databricks-connect` using the system Python installation under `/opt/python/`
* Run `databricks-connect configure` (needs to be run per user)
* Install RStudio Pro Drivers
* Install Databricks Spark Driver
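
Because `databricks-connect configure` has to be run per user, a small check can flag accounts that have not been configured yet. This is only a sketch: it assumes the CLI saves its settings to a `.databricks-connect` file in the user's home directory, and `check_user_configured` is a hypothetical helper name, not something this repository provides.

```shell
#!/bin/sh
# Sketch: report whether a user has run `databricks-connect configure`.
# Assumes the CLI stores its settings in <home>/.databricks-connect.

# $1: home directory to inspect (defaults to the current user's HOME)
check_user_configured() {
  home_dir="${1:-$HOME}"
  if [ -f "$home_dir/.databricks-connect" ]; then
    echo "configured"
  else
    echo "missing"
  fi
}
```

Running `check_user_configured` with no arguments inspects the current user's home directory. Passing a path such as another user's home directory checks that account instead.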