Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/samelamin/spark-bigquery
Google BigQuery support for Spark, Structured Streaming, SQL, and DataFrames with easy Databricks integration.
https://github.com/samelamin/spark-bigquery
bigquery data-frame schema spark
Last synced: 3 months ago
JSON representation
Google BigQuery support for Spark, Structured Streaming, SQL, and DataFrames with easy Databricks integration.
- Host: GitHub
- URL: https://github.com/samelamin/spark-bigquery
- Owner: samelamin
- License: apache-2.0
- Created: 2017-01-28T17:41:15.000Z (almost 8 years ago)
- Default Branch: master
- Last Pushed: 2023-05-08T18:39:01.000Z (over 1 year ago)
- Last Synced: 2024-09-29T07:04:05.957Z (4 months ago)
- Topics: bigquery, data-frame, schema, spark
- Language: Scala
- Homepage:
- Size: 146 KB
- Stars: 70
- Watchers: 4
- Forks: 28
- Open Issues: 8
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
spark-bigquery
===============This Spark module allows saving DataFrame as BigQuery table.
The project was inspired by [spotify/spark-bigquery](https://github.com/spotify/spark-bigquery), but there are several differences and enhancements:
* Use of the Structured Streaming API
* Use within Pyspark
* Saving via Decorators
* Allow saving to partitioned tables
* Easy integration with [Databricks](https://github.com/samelamin/spark-bigquery/blob/master/Databricks.md)
* Use of Standard SQL
* Use Of Time-Ingested Partition Columns
* Run Data Manipulation Language Queries [DML](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-manipulation-language)
* Update schemas on writes using the [setSchemaUpdateOptions](https://developers.google.com/resources/api-libraries/documentation/bigquery/v2/java/latest/com/google/api/services/bigquery/model/JobConfigurationQuery.html#setSchemaUpdateOptions(java.util.List))
* JSON is used as an intermediate format instead of Avro. This allows having fields on different levels named the same:
```json
{
"obj": {
"data": {
"data": {}
}
}
}
```
* DataFrame's schema is automatically adapted to a legal one:1. Illegal characters are replaced with `_`
2. Field names are converted to lower case to avoid ambiguity
3. Duplicate field names are given a numeric suffix (`_1`, `_2`, etc.)### Docker!
I created a container that launches zepplin with spark and the connector for ease of use and quick startup. You can find it [here](https://github.com/samelamin/docker-zeppelin)## Usage
### Including spark-bigquery into your project
#### Maven
```xml
oss-sonatype
oss-sonatype
https://oss.sonatype.org/content/repositories/releases/
true
com.github.samelamin
spark-bigquery_${scala.binary.version}
0.2.6
```
#### SBT
To use it in a local SBT console first add the package as a dependency then set up your project details
```sbt
resolvers += Opts.resolver.sonatypeReleaseslibraryDependencies += "com.github.samelamin" %% "spark-bigquery" % "0.2.6"
``````scala
import com.samelamin.spark.bigquery._// Set up GCP credentials
sqlContext.setGcpJsonKeyFile("")// Set up BigQuery project and bucket
sqlContext.setBigQueryProjectId("")
sqlContext.setBigQueryGcsBucket("")// Set up BigQuery dataset location, default is US
sqlContext.setBigQueryDatasetLocation("")
```### Structured Streaming from S3/HDFS to BigQuery
S3,Blob Storage or HDFS are the defacto technology for storage in the cloud, this package allows you to stream any data added to a Big Query Table of your choice
```scala
import com.samelamin.spark.bigquery._val df = spark.readStream.json("s3a://bucket")
df.writeStream
.option("checkpointLocation", "s3a://checkpoint/dir")
.option("tableReferenceSink","my-project:my_dataset.my_table")
.format("com.samelamin.spark.bigquery")
.start()
```### Structured Streaming from BigQuery Table
You can use this connector to stream from a BigQuery Table. The connector uses a Timestamped column to get offsets.
```scala
import com.samelamin.spark.bigquery._val df = spark
.readStream
.option("tableReferenceSource","my-project:my_dataset.my_table")
.format("com.samelamin.spark.bigquery")
.load()
```
You can also specify a custom timestamp column:
```scala
import com.samelamin.spark.bigquery._sqlContext.setBQTableTimestampColumn("column_name")
```You can also specify a custom Time Ingested Partition column:
```scala
import com.samelamin.spark.bigquery._sqlContext.setBQTimePartitioningField("column_name")
```### Saving DataFrame using BigQuery Hadoop writer API
By Default any table created by this connector has a timestamp column of `bq_load_timestamp` which has the value of the current timestamp.
```scala
import com.samelamin.spark.bigquery._val df = ...
df.saveAsBigQueryTable("project-id:dataset-id.table-name")
```You can also save to a table decorator by saving to `dataset-id.table-name$YYYYMMDD`
### Saving DataFrame using Pyspark
```python
from pyspark.sql import SparkSessionBQ_PROJECT_ID = "projectId"
DATASET_ID = "datasetId"
TABLE_NAME = "tableName"KEY_FILE = "/path/to/service_account.json" # When not on GCP
STAGING_BUCKET = "gcs-bucket" # Intermediate JSON files
DATASET_LOCATION = "US" # Location for dataset creation# Start session and reference the JVM package via py4j for convienence
session = SparkSession.builder.getOrCreate()
bigquery = session._sc._jvm.com.samelamin.spark.bigquery# Prepare the bigquery context
bq = bigquery.BigQuerySQLContext(session._wrapped._jsqlContext)
bq.setGcpJsonKeyFile(KEY_FILE)
bq.setBigQueryProjectId(BQ_PROJECT_ID)
bq.setGSProjectId(BQ_PROJECT_ID)
bq.setBigQueryGcsBucket(STAGING_BUCKET)
bq.setBigQueryDatasetLocation(DATASET_LOCATION)# Extract and Transform a dataframe
# df = session.read.csv(...)# Load into a table or table partition
bqDF = bigquery.BigQueryDataFrame(df._jdf)
bqDF.saveAsBigQueryTable(
"{0}:{1}.{2}".format(BQ_PROJECT_ID, DATASET_ID, TABLE_NAME),
False, # Day paritioned when created
0, # Partition expired when created
bigquery.__getattr__("package$WriteDisposition$").__getattr__("MODULE$").WRITE_EMPTY(),
bigquery.__getattr__("package$CreateDisposition$").__getattr__("MODULE$").CREATE_IF_NEEDED(),
)
```Submit with:
```bash
pyspark yourjob.py --packages com.github.samelamin:spark-bigquery_2.11:0.2.6
```Or
```bash
gcloud dataproc jobs submit pyspark yourjob.py --properties spark.jars.packages=com.github.samelamin:spark-bigquery_2.11:0.2.6
```### Reading DataFrame From BigQuery
```scala
import com.samelamin.spark.bigquery._
val sqlContext = spark.sqlContextsqlContext.setBigQueryGcsBucket("bucketname")
sqlContext.setBigQueryProjectId("projectid")
sqlContext.setGcpJsonKeyFile("keyfilepath")
sqlContext.hadoopConf.set("fs.gs.project.id","projectid")val df = spark.sqlContext.read.format("com.samelamin.spark.bigquery").option("tableReferenceSource","bigquery-public-data:samples.shakespeare").load()
``### Reading DataFrame From BigQuery in Pyspark
```python
bq = spark._sc._jvm.com.samelamin.spark.bigquery.BigQuerySQLContext(spark._wrapped._jsqlContext)
df= DataFrame(bq.bigQuerySelect("SELECT word, word_count FROM [bigquery-public-data:samples.shakespeare]"), session._wrapped)
```### Running DML Queries
```scala
import com.samelamin.spark.bigquery._// Load results from a SQL query
sqlContext.runDMLQuery("UPDATE dataset-id.table-name SET test_col = new_value WHERE test_col = old_value")
```
Please note that DML queries need to be done using Standard SQL### Update Schemas
You can also allow the saving of a dataframe to update a schema:
```scala
import com.samelamin.spark.bigquery._sqlContext.setAllowSchemaUpdates()
```Notes on using this API:
* Structured Streaming needs a partitioned table which is created by default when writing a stream
* Structured Streaming needs a timestamp column where offsets are retrieved from, by default all tables are created with a `bq_load_timestamp` column with a default value of the current timstamp.
* For use with Databricks please follow this [guide](https://github.com/samelamin/spark-bigquery/blob/master/Databricks.md)#TODO
Need to upgrade spark version
# License
Copyright 2016 samelamin.
Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0