{"id":25155079,"url":"https://github.com/the-strategy-unit/sconn","last_synced_at":"2025-04-03T11:16:58.227Z","repository":{"id":276311712,"uuid":"928812788","full_name":"The-Strategy-Unit/sconn","owner":"The-Strategy-Unit","description":"Handles Spark connection to Databricks in R","archived":false,"fork":false,"pushed_at":"2025-03-18T16:27:02.000Z","size":36,"stargazers_count":1,"open_issues_count":1,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-03-28T19:17:25.123Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/The-Strategy-Unit.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":"CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-02-07T09:34:10.000Z","updated_at":"2025-03-18T16:27:07.000Z","dependencies_parsed_at":null,"dependency_job_id":"796c6bc2-4f9f-4921-b24e-2f62454316b9","html_url":"https://github.com/The-Strategy-Unit/sconn","commit_stats":null,"previous_names":["the-strategy-unit/sconn"],"tags_count":0,"template":false,"template_full_name":"The-Strategy-Unit/template-repository","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/The-Strategy-Unit%2Fsconn","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/The-Strategy-Unit%2Fsconn/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/The-Strategy-Unit%2Fsconn/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/The-Strategy-Unit%2Fsconn/manifests","owner_url":"https://repos.ecosyste.ms/ap
i/v1/hosts/GitHub/owners/The-Strategy-Unit","download_url":"https://codeload.github.com/The-Strategy-Unit/sconn/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246989754,"owners_count":20865331,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-02-09T00:39:55.456Z","updated_at":"2025-04-03T11:16:58.209Z","avatar_url":"https://github.com/The-Strategy-Unit.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"# sconn\n\nA very simple package that provides a function to connect to a\n  Databricks instance, and a function to disconnect.\nThe user should set up Databricks authentication details as environment\n  variables, ideally in their `.Renviron` file.\n\n\n## Caveats\n\nIf you have the `radian` console installed, this package will not work in\n  VSCode, due to a conflict with {reticulate} / Python virtual environments.\nIt should work in RStudio and Positron.\n\nThe first attempt to connect may take a long time, or fail, while the cluster\n  spins up.\nSubsequent connection attempts should then succeed, however.\n\n\n## Installation\n\n```r\nremotes::install_github(\"The-Strategy-Unit/sconn\")\n```\n\nOnce installed, there are some initial setup steps to complete before using the\n  connection function for the first time. 
See below.\n\n\n## Quick usage\n\nIt should be as simple as this:\n\n```r\nlibrary(sconn)\n\n# Initiate the connection - but this may take a while on first connect\nsc()\n\n# You can then keep on using the `sc()` function; it will just reuse the\n# existing connection (it will not try to create a new one).\nsparklyr::spark_connection_is_open(sc())\n\n# Then disconnect once you are done\nsc_disconnect()\n```\n\n\n## Setup: Environment variables\n\nThe connection function requires four environment variables to be available.\n\nThe best way to do this is to add them to your main `.Renviron` file,\n  which is read in automatically by R each time it starts.\n  You can alternatively store them in a per-project `.Renviron` file.\n\nTo edit your main `.Renviron` file, you can use the helper function:\n\n```r\nusethis::edit_r_environ()\n```\n\nThis will save you trying to find the file each time you want to edit it 😊.\n\nAdd the following lines to your `.Renviron`:\n\n```\nDATABRICKS_HOST=\nDATABRICKS_TOKEN=\nDATABRICKS_CLUSTER_ID=\nDATABRICKS_VENV=\n```\n\nand add the following information after each `=` sign:\n\n* for DATABRICKS_HOST, add the base URL of your Databricks instance, beginning with `https://` and ending with something like `azuredatabricks.net`\n* for DATABRICKS_TOKEN, go to your Databricks web instance, find your user settings, and in the 'Developer' section under 'Access tokens' click the 'Manage' button, then 'Generate new token' ([Databricks documentation](https://docs.databricks.com/en/dev-tools/auth/pat.html#databricks-personal-access-tokens-for-workspace-users))\n* for DATABRICKS_CLUSTER_ID, go to your Databricks instance, click on 'Compute' in the left-hand side menu, then click on the name of the cluster you want to use. Click on the three-dot (`⁝`) menu and then 'View JSON'; the cluster ID should appear there as the `cluster_id` value\n* for DATABRICKS_VENV, a simple `databricks` is the suggested value, but you can set this to whatever name you like. 
This variable sets the name of your local Python virtual\n  environment, which will store the necessary Python libraries.\n\nOnce you have added these variables to your `.Renviron`, save it and restart\n  your R session.\n\n\n## Setup: {reticulate} and virtual environments\n\nFirst, find out which version of Python your Databricks instance uses.\nThis can be done in a notebook with:\n\n```python\n%python\nimport sys\nprint(sys.version)\n```\n\nHere we will assume it is version 3.12.\n\nUse the {reticulate} package to make the right Python version available:\n\n```r\nlibrary(reticulate)\nreticulate::install_python(\"3.12\") # to match Databricks version\n```\n\nUse {reticulate} to create a custom Python virtual environment and install\n  PySpark.\n  (You can check which version of PySpark is installed by watching the output.)\n\nNB: the `force = TRUE` argument means that any existing virtual environment called\n  \"databricks\" (or whatever your `DATABRICKS_VENV` environment variable is set to)\n  will be replaced.\n\n```r\nreticulate::virtualenv_create(\n  envname = Sys.getenv(\"DATABRICKS_VENV\"),\n  python = \"3.12\", # match this to the version of Python installed above\n  packages = c(\"pandas\", \"pyarrow\", \"pyspark\"),\n  force = TRUE\n)\n```\n\nUse {pysparklyr} to install the Databricks libraries.\nCurrently we use the same virtual environment that we just created above.\nThis may not be strictly necessary, but it does avoid reinstalling various\n  dependencies that were already installed along with PySpark.\n\n```r\npysparklyr::install_databricks(\n  version = \"15.4\", # match the version of Databricks used in your instance\n  envname = Sys.getenv(\"DATABRICKS_VENV\"),\n  new_env = FALSE\n)\n```\n\n## Usage options\n\nThere are two main ways you can use the package to handle a connection,\n  for example within an R script you are writing.\n\n1. 
You can just use the `sc()` function each time.\n  The advantage of this is that, in theory, it will re-establish the connection\n  if it has timed out or been disconnected.\n  But if it's still connected, it will just use the existing connection; it\n  won't try to restart the connection from scratch.\n2. Or you can assign the connection to an object, like: `sc \u003c- sc()` and then\n  just refer to the `sc` object in your code.\n  But if it becomes disconnected, you will need to run `sc \u003c- sc()` again.\n\n\n## Problems\n\nPlease open an issue on GitHub if you experience problems setting up or\n  using the package.\n\n\n## Further notes and links\n\n* [Posit/RStudio documentation](https://posit.co/blog/databricks-clusters-in-rstudio-with-sparklyr/)\n* [Posit Spark/Databricks Connect documentation](https://spark.posit.co/deployment/databricks-connect.html)\n* [Databricks personal access tokens for workspace users](https://docs.databricks.com/en/dev-tools/auth/pat.html#databricks-personal-access-tokens-for-workspace-users)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthe-strategy-unit%2Fsconn","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthe-strategy-unit%2Fsconn","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthe-strategy-unit%2Fsconn/lists"}