{"id":14982417,"url":"https://github.com/samelamin/spark-bigquery","last_synced_at":"2026-03-17T22:32:23.741Z","repository":{"id":57722608,"uuid":"80299327","full_name":"samelamin/spark-bigquery","owner":"samelamin","description":"Google BigQuery support for Spark, Structured Streaming, SQL, and DataFrames with easy Databricks integration.","archived":false,"fork":false,"pushed_at":"2023-05-08T18:39:01.000Z","size":150,"stargazers_count":70,"open_issues_count":8,"forks_count":29,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-08-08T16:39:28.015Z","etag":null,"topics":["bigquery","data-frame","schema","spark"],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/samelamin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-01-28T17:41:15.000Z","updated_at":"2024-04-13T18:20:53.000Z","dependencies_parsed_at":"2026-02-23T21:01:46.260Z","dependency_job_id":null,"html_url":"https://github.com/samelamin/spark-bigquery","commit_stats":{"total_commits":108,"total_committers":10,"mean_commits":10.8,"dds":0.4444444444444444,"last_synced_commit":"c8f5929268d97b8905a035d5795568a9ae424d3c"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/samelamin/spark-bigquery","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/samelamin%2Fspark-bigquery","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/samelamin%2Fspark-bigquery/tags","releases_url":"https://repos.ecosyste.ms/api/
v1/hosts/GitHub/repositories/samelamin%2Fspark-bigquery/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/samelamin%2Fspark-bigquery/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/samelamin","download_url":"https://codeload.github.com/samelamin/spark-bigquery/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/samelamin%2Fspark-bigquery/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30633333,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-17T17:32:55.572Z","status":"ssl_error","status_checked_at":"2026-03-17T17:32:38.732Z","response_time":56,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bigquery","data-frame","schema","spark"],"created_at":"2024-09-24T14:05:22.767Z","updated_at":"2026-03-17T22:32:23.705Z","avatar_url":"https://github.com/samelamin.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"spark-bigquery\n===============\n\nThis Spark module allows saving a DataFrame as a BigQuery table.\n\nThe project was inspired by [spotify/spark-bigquery](https://github.com/spotify/spark-bigquery), but there are several differences and enhancements:\n\n* Use of the Structured Streaming API\n\n* Use within Pyspark\n\n* Saving via Decorators\n\n* Allow saving to partitioned tables\n\n* Easy 
integration with [Databricks](https://github.com/samelamin/spark-bigquery/blob/master/Databricks.md)\n\n* Use of Standard SQL\n\n* Use of Time-Ingested Partition Columns\n\n* Run Data Manipulation Language Queries [DML](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-manipulation-language)\n\n* Update schemas on writes using [setSchemaUpdateOptions](https://developers.google.com/resources/api-libraries/documentation/bigquery/v2/java/latest/com/google/api/services/bigquery/model/JobConfigurationQuery.html#setSchemaUpdateOptions(java.util.List))\n\n* JSON is used as an intermediate format instead of Avro. This allows fields on different levels to share the same name:\n\n```json\n{\n  \"obj\": {\n    \"data\": {\n      \"data\": {}\n    }\n  }\n}\n```\n* The DataFrame's schema is automatically adapted to a legal one:\n\n  1. Illegal characters are replaced with `_`\n  2. Field names are converted to lower case to avoid ambiguity\n  3. Duplicate field names are given a numeric suffix (`_1`, `_2`, etc.)\n\n\n### Docker!\nI created a container that launches Zeppelin with Spark and the connector for ease of use and quick startup. 
You can find it [here](https://github.com/samelamin/docker-zeppelin).\n\n## Usage\n\n### Including spark-bigquery into your project\n\n#### Maven\n\n```xml\n\u003crepositories\u003e\n  \u003crepository\u003e\n    \u003cid\u003eoss-sonatype\u003c/id\u003e\n    \u003cname\u003eoss-sonatype\u003c/name\u003e\n    \u003curl\u003ehttps://oss.sonatype.org/content/repositories/releases/\u003c/url\u003e\n    \u003csnapshots\u003e\n      \u003cenabled\u003etrue\u003c/enabled\u003e\n    \u003c/snapshots\u003e\n  \u003c/repository\u003e\n\u003c/repositories\u003e\n\n\u003cdependencies\u003e\n  \u003cdependency\u003e\n    \u003cgroupId\u003ecom.github.samelamin\u003c/groupId\u003e\n    \u003cartifactId\u003espark-bigquery_${scala.binary.version}\u003c/artifactId\u003e\n    \u003cversion\u003e0.2.6\u003c/version\u003e\n  \u003c/dependency\u003e\n\u003c/dependencies\u003e\n```\n\n#### SBT\n\nTo use it in a local SBT console, first add the package as a dependency, then set up your project details:\n```sbt\nresolvers += Opts.resolver.sonatypeReleases\n\nlibraryDependencies += \"com.github.samelamin\" %% \"spark-bigquery\" % \"0.2.6\"\n```\n\n```scala\nimport com.samelamin.spark.bigquery._\n\n// Set up GCP credentials\nsqlContext.setGcpJsonKeyFile(\"\u003cJSON_KEY_FILE\u003e\")\n\n// Set up BigQuery project and bucket\nsqlContext.setBigQueryProjectId(\"\u003cBILLING_PROJECT\u003e\")\nsqlContext.setBigQueryGcsBucket(\"\u003cGCS_BUCKET\u003e\")\n\n// Set up BigQuery dataset location, default is US\nsqlContext.setBigQueryDatasetLocation(\"\u003cDATASET_LOCATION\u003e\")\n```\n\n### Structured Streaming from S3/HDFS to BigQuery\n\nS3, Blob Storage, and HDFS are the de facto storage technologies in the cloud. This package allows you to stream any data added to them into a BigQuery table of your choice:\n```scala\nimport com.samelamin.spark.bigquery._\n\nval df = spark.readStream.json(\"s3a://bucket\")\n\ndf.writeStream\n      .option(\"checkpointLocation\", \"s3a://checkpoint/dir\")\n      
.option(\"tableReferenceSink\",\"my-project:my_dataset.my_table\")\n      .format(\"com.samelamin.spark.bigquery\")\n      .start()\n```\n\n### Structured Streaming from BigQuery Table\n\nYou can use this connector to stream from a BigQuery table. The connector uses a timestamp column to get offsets.\n\n```scala\nimport com.samelamin.spark.bigquery._\n\nval df = spark\n          .readStream\n          .option(\"tableReferenceSource\",\"my-project:my_dataset.my_table\")\n          .format(\"com.samelamin.spark.bigquery\")\n          .load()\n```\nYou can also specify a custom timestamp column:\n```scala\nimport com.samelamin.spark.bigquery._\n\nsqlContext.setBQTableTimestampColumn(\"column_name\")\n```\n\n\nYou can also specify a custom Time Ingested Partition column:\n```scala\nimport com.samelamin.spark.bigquery._\n\nsqlContext.setBQTimePartitioningField(\"column_name\")\n```\n\n### Saving DataFrame using BigQuery Hadoop writer API\nBy default, any table created by this connector has a timestamp column named `bq_load_timestamp`, whose value is the current timestamp.\n```scala\nimport com.samelamin.spark.bigquery._\n\nval df = ...\ndf.saveAsBigQueryTable(\"project-id:dataset-id.table-name\")\n```\n\nYou can also save to a table decorator by saving to `dataset-id.table-name$YYYYMMDD`.\n\n\n### Saving DataFrame using Pyspark\n\n```python\nfrom pyspark.sql import SparkSession\n\nBQ_PROJECT_ID = \"projectId\"\nDATASET_ID = \"datasetId\"\nTABLE_NAME = \"tableName\"\n\nKEY_FILE = \"/path/to/service_account.json\" # When not on GCP\nSTAGING_BUCKET = \"gcs-bucket\"              # Intermediate JSON files\nDATASET_LOCATION = \"US\"                    # Location for dataset creation\n\n# Start session and reference the JVM package via py4j for convenience\nsession = SparkSession.builder.getOrCreate()\nbigquery = session._sc._jvm.com.samelamin.spark.bigquery\n\n# Prepare the bigquery context\nbq = 
bigquery.BigQuerySQLContext(session._wrapped._jsqlContext)\nbq.setGcpJsonKeyFile(KEY_FILE)\nbq.setBigQueryProjectId(BQ_PROJECT_ID)\nbq.setGSProjectId(BQ_PROJECT_ID)\nbq.setBigQueryGcsBucket(STAGING_BUCKET)\nbq.setBigQueryDatasetLocation(DATASET_LOCATION)\n\n# Extract and Transform a dataframe\n# df = session.read.csv(...)\n\n# Load into a table or table partition\nbqDF = bigquery.BigQueryDataFrame(df._jdf)\nbqDF.saveAsBigQueryTable(\n    \"{0}:{1}.{2}\".format(BQ_PROJECT_ID, DATASET_ID, TABLE_NAME),\n    False, # Day partitioned when created\n    0,     # Partition expired when created\n    bigquery.__getattr__(\"package$WriteDisposition$\").__getattr__(\"MODULE$\").WRITE_EMPTY(),\n    bigquery.__getattr__(\"package$CreateDisposition$\").__getattr__(\"MODULE$\").CREATE_IF_NEEDED(),\n)\n```\n\nSubmit with:\n\n```bash\npyspark yourjob.py --packages com.github.samelamin:spark-bigquery_2.11:0.2.6\n```\n\nOr\n\n```bash\ngcloud dataproc jobs submit pyspark yourjob.py --properties spark.jars.packages=com.github.samelamin:spark-bigquery_2.11:0.2.6\n```\n\n### Reading DataFrame From BigQuery\n\n```scala\nimport com.samelamin.spark.bigquery._\nval sqlContext = spark.sqlContext\n\nsqlContext.setBigQueryGcsBucket(\"bucketname\")\nsqlContext.setBigQueryProjectId(\"projectid\")\nsqlContext.setGcpJsonKeyFile(\"keyfilepath\")\nsqlContext.hadoopConf.set(\"fs.gs.project.id\",\"projectid\")\n\nval df = spark.sqlContext.read.format(\"com.samelamin.spark.bigquery\").option(\"tableReferenceSource\",\"bigquery-public-data:samples.shakespeare\").load()\n```\n\n### Reading DataFrame From BigQuery in Pyspark\n\n```python\nfrom pyspark.sql import DataFrame\n\nbq = spark._sc._jvm.com.samelamin.spark.bigquery.BigQuerySQLContext(spark._wrapped._jsqlContext)\ndf = DataFrame(bq.bigQuerySelect(\"SELECT word, word_count FROM [bigquery-public-data:samples.shakespeare]\"), spark._wrapped)\n```\n\n### Running DML Queries\n\n```scala\nimport com.samelamin.spark.bigquery._\n\n// Run a DML statement\nsqlContext.runDMLQuery(\"UPDATE 
dataset-id.table-name SET test_col = new_value WHERE test_col = old_value\")\n```\nPlease note that DML queries must be written in Standard SQL.\n\n### Update Schemas\n\nYou can also allow the saving of a dataframe to update a schema:\n\n```scala\nimport com.samelamin.spark.bigquery._\n\nsqlContext.setAllowSchemaUpdates()\n```\n\nNotes on using this API:\n\n * Structured Streaming needs a partitioned table, which is created by default when writing a stream\n * Structured Streaming needs a timestamp column from which offsets are retrieved. By default, all tables are created with a `bq_load_timestamp` column whose default value is the current timestamp.\n * For use with Databricks, please follow this [guide](https://github.com/samelamin/spark-bigquery/blob/master/Databricks.md)\n\n\n# TODO\n\nNeed to upgrade the Spark version.\n\n# License\n\nCopyright 2016 samelamin.\n\nLicensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsamelamin%2Fspark-bigquery","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsamelamin%2Fspark-bigquery","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsamelamin%2Fspark-bigquery/lists"}