# sparkbq: Google BigQuery Support for sparklyr

[![CRAN\_Status\_Badge](http://www.r-pkg.org/badges/version/sparkbq)](https://cran.r-project.org/package=sparkbq) [![Rdoc](http://www.rdocumentation.org/badges/version/sparkbq)](http://www.rdocumentation.org/packages/sparkbq)

**sparkbq** is a [sparklyr](https://spark.rstudio.com/) [extension](https://spark.rstudio.com/articles/guides-extensions.html) package providing an integration with [Google BigQuery](https://cloud.google.com/bigquery/). It builds on top of [spark-bigquery](https://github.com/miraisolutions/spark-bigquery), which provides a Google BigQuery data source to [Apache Spark](https://spark.apache.org/).

## Version Information

You can install the released version of **sparkbq** from CRAN via
``` r
install.packages("sparkbq")
```
or the latest development version through
``` r
devtools::install_github("miraisolutions/sparkbq", ref = "develop")
```

The following table provides an overview of the supported versions of Apache Spark, Scala, and [Google Dataproc](https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-versions):

| sparkbq | spark-bigquery | Apache Spark | Scala | Google Dataproc |
| :-----: | -------------- | --------------- | ----- | --------------- |
| 0.1.x | 0.1.0 | 2.2.x and 2.3.x | 2.11 | 1.2.x and 1.3.x |
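
If you manage Spark through sparklyr, you can request a matching version when connecting. The sketch below assumes a local Spark 2.3.x installation (e.g. set up via `sparklyr::spark_install("2.3")`); the exact release string is only an illustration:

``` r
library(sparklyr)
library(sparkbq)

# Connect to a local Spark 2.3.x instance, in line with the versions listed above
sc <- spark_connect(master = "local[*]", version = "2.3")
```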

**sparkbq** is based on the Spark package [spark-bigquery](https://spark-packages.org/package/miraisolutions/spark-bigquery) which is available in a separate [GitHub repository](https://github.com/miraisolutions/spark-bigquery).

## Example Usage

``` r
library(sparklyr)
library(sparkbq)
library(dplyr)

config <- spark_config()

sc <- spark_connect(master = "local[*]", config = config)

# Set Google BigQuery default settings
bigquery_defaults(
  billingProjectId = "",
  gcsBucket = "",
  datasetLocation = "US",
  serviceAccountKeyFile = "",
  type = "direct"
)

# Reading the public shakespeare data table
# https://cloud.google.com/bigquery/public-data/
# https://cloud.google.com/bigquery/sample-tables
hamlet <-
  spark_read_bigquery(
    sc,
    name = "hamlet",
    projectId = "bigquery-public-data",
    datasetId = "samples",
    tableId = "shakespeare") %>%
  filter(corpus == "hamlet") # NOTE: predicate pushdown to BigQuery!

# Retrieve results into a local tibble
hamlet %>% collect()

# Write result into "mysamples" dataset in our BigQuery (billing) project
spark_write_bigquery(
  hamlet,
  datasetId = "mysamples",
  tableId = "hamlet",
  mode = "overwrite")
```

## Authentication

When running outside of Google Cloud, it is necessary to specify a service account JSON key file. The key file can be passed as the parameter `serviceAccountKeyFile`, either to `bigquery_defaults` or directly to `spark_read_bigquery` and `spark_write_bigquery`.
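
For example, the key file can be set once in the defaults or passed per call (the file path, project ID, and bucket below are placeholders):

``` r
# Set the key file once via the defaults ...
bigquery_defaults(
  billingProjectId = "your-billing-project",       # placeholder
  gcsBucket = "your-gcs-bucket",                   # placeholder
  datasetLocation = "US",
  serviceAccountKeyFile = "/path/to/keyfile.json", # placeholder
  type = "direct"
)

# ... or pass it directly in an individual read call
words <- spark_read_bigquery(
  sc,
  name = "shakespeare",
  projectId = "bigquery-public-data",
  datasetId = "samples",
  tableId = "shakespeare",
  serviceAccountKeyFile = "/path/to/keyfile.json"  # placeholder
)
```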

Alternatively, the environment variable `GOOGLE_APPLICATION_CREDENTIALS` can be set, e.g. `export GOOGLE_APPLICATION_CREDENTIALS=/path/to/your/service_account_keyfile.json` (see https://cloud.google.com/docs/authentication/getting-started for more information). Make sure the variable is set before starting the R session.

When running on Google Cloud, e.g. on Google Cloud Dataproc, Application Default Credentials (ADC) may be used, in which case it is not necessary to specify a service account key file.
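
A minimal sketch of the defaults in that case, simply omitting the key file (project and bucket names are placeholders):

``` r
# On Google Cloud (e.g. Dataproc), Application Default Credentials are picked up
# automatically, so no service account key file is passed here
bigquery_defaults(
  billingProjectId = "your-billing-project",  # placeholder
  gcsBucket = "your-gcs-bucket",              # placeholder
  datasetLocation = "US",
  type = "direct"
)
```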

## Further Information

* [spark-bigquery on GitHub](https://github.com/miraisolutions/spark-bigquery)
* [spark-bigquery on Spark Packages](https://spark-packages.org/package/miraisolutions/spark-bigquery)

* [BigQuery pricing](https://cloud.google.com/bigquery/pricing)
* [BigQuery dataset locations](https://cloud.google.com/bigquery/docs/dataset-locations)
* [General authentication](https://cloud.google.com/docs/authentication/)
* [BigQuery authentication](https://cloud.google.com/bigquery/docs/authentication/)
* [BigQuery: authenticating with a service account key file](https://cloud.google.com/bigquery/docs/authentication/service-account-file)
* [Cloud Storage authentication](https://cloud.google.com/storage/docs/authentication/)