{"id":15284115,"url":"https://github.com/GoogleCloudDataproc/spark-bigquery-connector","last_synced_at":"2025-10-07T00:31:56.227Z","repository":{"id":37046446,"uuid":"172799788","full_name":"GoogleCloudDataproc/spark-bigquery-connector","owner":"GoogleCloudDataproc","description":"BigQuery data source for Apache Spark: Read data from BigQuery into DataFrames, write DataFrames into BigQuery tables.","archived":false,"fork":false,"pushed_at":"2025-10-03T23:22:06.000Z","size":6337,"stargazers_count":409,"open_issues_count":47,"forks_count":217,"subscribers_count":40,"default_branch":"master","last_synced_at":"2025-10-04T01:15:02.582Z","etag":null,"topics":["bigquery","bigquery-storage-api","google-bigquery","google-cloud","google-cloud-dataproc","spark"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/GoogleCloudDataproc.png","metadata":{"files":{"readme":"README-template.md","changelog":"CHANGES.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2019-02-26T22:20:05.000Z","updated_at":"2025-10-03T23:22:10.000Z","dependencies_parsed_at":"2023-10-15T23:25:04.516Z","dependency_job_id":"cdf04391-ad1b-4d92-b207-49cc0a23e87f","html_url":"https://github.com/GoogleCloudDataproc/spark-bigquery-connector","commit_stats":{"total_commits":821,"total_committers":80,"mean_commits":10.2625,"dds":0.656516443361754,"last_synced_commit":"14fa9b879b62a535c37c906ffb76386b8d144fef"},"previous_names":["googlecloudplatform/spark-bigquery-connect
or"],"tags_count":73,"template":false,"template_full_name":null,"purl":"pkg:github/GoogleCloudDataproc/spark-bigquery-connector","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudDataproc%2Fspark-bigquery-connector","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudDataproc%2Fspark-bigquery-connector/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudDataproc%2Fspark-bigquery-connector/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudDataproc%2Fspark-bigquery-connector/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/GoogleCloudDataproc","download_url":"https://codeload.github.com/GoogleCloudDataproc/spark-bigquery-connector/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudDataproc%2Fspark-bigquery-connector/sbom","scorecard":{"id":57646,"data":{"date":"2025-08-11","repo":{"name":"github.com/GoogleCloudDataproc/spark-bigquery-connector","commit":"4d4f1357d98c5ae6043f79531745ed4916ada86f"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":5.6,"checks":[{"name":"Dangerous-Workflow","score":10,"reason":"no dangerous workflow patterns detected","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Code-Review","score":7,"reason":"Found 22/30 approved changesets -- score normalized to 7","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are 
merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Maintained","score":10,"reason":"22 commit(s) and 15 issue activity found in the last 90 days -- score normalized to 10","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Token-Permissions","score":0,"reason":"detected GitHub workflow tokens with excessive permissions","details":["Info: jobLevel 'actions' permission set to 'read': .github/workflows/codeql-analysis.yml:28","Info: jobLevel 'contents' permission set to 'read': .github/workflows/codeql-analysis.yml:29","Warn: no topLevel permission defined: .github/workflows/codeql-analysis.yml:1","Warn: no topLevel permission defined: .github/workflows/cpd.yaml:1","Warn: no topLevel permission defined: .github/workflows/spotless.yaml:1","Info: no jobLevel write permissions found"],"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source 
repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"License","score":10,"reason":"license file detected","details":["Info: project has a license file: LICENSE:0","Info: FSF or OSI recognized license: Apache License 2.0: LICENSE:0"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Pinned-Dependencies","score":0,"reason":"dependency not pinned by hash detected -- score normalized to 0","details":["Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/codeql-analysis.yml:41: update your workflow using https://app.stepsecurity.io/secureworkflow/GoogleCloudDataproc/spark-bigquery-connector/codeql-analysis.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/codeql-analysis.yml:45: update your workflow using https://app.stepsecurity.io/secureworkflow/GoogleCloudDataproc/spark-bigquery-connector/codeql-analysis.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/codeql-analysis.yml:56: update your workflow using 
https://app.stepsecurity.io/secureworkflow/GoogleCloudDataproc/spark-bigquery-connector/codeql-analysis.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/codeql-analysis.yml:70: update your workflow using https://app.stepsecurity.io/secureworkflow/GoogleCloudDataproc/spark-bigquery-connector/codeql-analysis.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/cpd.yaml:16: update your workflow using https://app.stepsecurity.io/secureworkflow/GoogleCloudDataproc/spark-bigquery-connector/cpd.yaml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/cpd.yaml:17: update your workflow using https://app.stepsecurity.io/secureworkflow/GoogleCloudDataproc/spark-bigquery-connector/cpd.yaml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/spotless.yaml:16: update your workflow using https://app.stepsecurity.io/secureworkflow/GoogleCloudDataproc/spark-bigquery-connector/spotless.yaml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/spotless.yaml:17: update your workflow using https://app.stepsecurity.io/secureworkflow/GoogleCloudDataproc/spark-bigquery-connector/spotless.yaml/master?enable=pin","Warn: containerImage not pinned by hash: cloudbuild/Dockerfile:2: pin your Docker image by updating ubuntu:22.04 to ubuntu:22.04@sha256:1ec65b2719518e27d4d25f104d93f9fac60dc437f81452302406825c46fcc9cb","Warn: downloadThenRun not pinned by hash: cloudbuild/nightly.sh:58","Warn: downloadThenRun not pinned by hash: cloudbuild/presubmit.sh:54","Warn: downloadThenRun not pinned by hash: cloudbuild/presubmit.sh:104","Info:   0 out of   8 GitHub-owned GitHubAction dependencies pinned","Info:   0 out of   1 containerImage dependencies pinned","Info:   0 out of   3 downloadThenRun dependencies pinned"],"documentation":{"short":"Determines if the project has declared and pinned the dependencies 
of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Branch-Protection","score":-1,"reason":"internal error: error during branchesHandler.setup: internal error: githubv4.Query: Resource not accessible by integration","details":null,"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"Signed-Releases","score":0,"reason":"Project has not signed or included provenance with any releases.","details":["Warn: release artifact 0.42.4 not signed: https://api.github.com/repos/GoogleCloudDataproc/spark-bigquery-connector/releases/226778037","Warn: release artifact 0.42.3 not signed: https://api.github.com/repos/GoogleCloudDataproc/spark-bigquery-connector/releases/220500971","Warn: release artifact 0.42.2 not signed: https://api.github.com/repos/GoogleCloudDataproc/spark-bigquery-connector/releases/219232806","Warn: release artifact 0.34.1 not signed: https://api.github.com/repos/GoogleCloudDataproc/spark-bigquery-connector/releases/207788182","Warn: release artifact 0.42.1 not signed: https://api.github.com/repos/GoogleCloudDataproc/spark-bigquery-connector/releases/206407872","Warn: release artifact 0.42.4 does not have provenance: 
https://api.github.com/repos/GoogleCloudDataproc/spark-bigquery-connector/releases/226778037","Warn: release artifact 0.42.3 does not have provenance: https://api.github.com/repos/GoogleCloudDataproc/spark-bigquery-connector/releases/220500971","Warn: release artifact 0.42.2 does not have provenance: https://api.github.com/repos/GoogleCloudDataproc/spark-bigquery-connector/releases/219232806","Warn: release artifact 0.34.1 does not have provenance: https://api.github.com/repos/GoogleCloudDataproc/spark-bigquery-connector/releases/207788182","Warn: release artifact 0.42.1 does not have provenance: https://api.github.com/repos/GoogleCloudDataproc/spark-bigquery-connector/releases/206407872"],"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"SAST","score":9,"reason":"SAST tool detected but not run on all commits","details":["Info: SAST configuration detected: CodeQL","Warn: 26 commits out of 27 are checked with a SAST tool"],"documentation":{"short":"Determines if the project uses static code analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}},{"name":"Vulnerabilities","score":10,"reason":"0 existing vulnerabilities detected","details":null,"documentation":{"short":"Determines if the project has open, known unfixed 
vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}}]},"last_synced_at":"2025-08-15T01:05:53.676Z","repository_id":37046446,"created_at":"2025-08-15T01:05:53.676Z","updated_at":"2025-08-15T01:05:53.676Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278254473,"owners_count":25956598,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-03T02:00:06.070Z","response_time":53,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bigquery","bigquery-storage-api","google-bigquery","google-cloud","google-cloud-dataproc","spark"],"created_at":"2024-09-30T14:49:47.454Z","updated_at":"2025-10-07T00:31:56.219Z","avatar_url":"https://github.com/GoogleCloudDataproc.png","language":"Java","funding_links":[],"categories":["Java"],"sub_categories":[],"readme":"# Apache Spark SQL connector for Google BigQuery\n\n\u003c!--- TODO(#2): split out into more documents. 
--\u003e\n\nThe connector supports reading [Google BigQuery](https://cloud.google.com/bigquery/) tables into Spark's DataFrames, and writing DataFrames back into BigQuery.\nThis is done by using the [Spark SQL Data Source API](https://spark.apache.org/docs/latest/sql-programming-guide.html#data-sources) to communicate with BigQuery.\n\n## Unreleased Changes\n\nThis README may include documentation for changes that haven't been released yet.  The latest release's documentation and source code are found here.\n\nhttps://github.com/GoogleCloudDataproc/spark-bigquery-connector/blob/master/README.md\n\n## BigQuery Storage API\nThe [Storage API](https://cloud.google.com/bigquery/docs/reference/storage) streams data in parallel directly from BigQuery via gRPC without using Google Cloud Storage as an intermediary.\n\nIt has a number of advantages over using the previous export-based read flow that should generally lead to better read performance:\n\n### Direct Streaming\n\nIt does not leave any temporary files in Google Cloud Storage. Rows are read directly from BigQuery servers using the Arrow or Avro wire formats.\n\n### Filtering\n\nThe new API allows column and predicate filtering to only read the data you are interested in.\n\n#### Column Filtering\n\nSince BigQuery is [backed by a columnar datastore](https://cloud.google.com/blog/big-data/2016/04/inside-capacitor-bigquerys-next-generation-columnar-storage-format), it can efficiently stream data without reading all columns.\n\n#### Predicate Filtering\n\nThe Storage API supports arbitrary pushdown of predicate filters. Connector versions 0.8.0-beta and above support pushdown of arbitrary filters to BigQuery.\n\nThere is a known issue in Spark that does not allow pushdown of filters on nested fields. For example, filters like `address.city = \"Sunnyvale\"` will not be pushed down to BigQuery.\n\n### Dynamic Sharding\n\nThe API rebalances records between readers until they all complete. 
This means that all Map phases will finish nearly concurrently. See this blog article on [how dynamic sharding is similarly used in Google Cloud Dataflow](https://cloud.google.com/blog/products/gcp/no-shard-left-behind-dynamic-work-rebalancing-in-google-cloud-dataflow).\n\nSee [Configuring Partitioning](#configuring-partitioning) for more details.\n\n## Requirements\n\n### Enable the BigQuery Storage API\n\nFollow [these instructions](https://cloud.google.com/bigquery/docs/reference/storage/#enabling_the_api).\n\n### Create a Google Cloud Dataproc cluster (Optional)\n\nIf you do not have an Apache Spark environment you can create a Cloud Dataproc cluster with pre-configured auth. The following examples assume you are using Cloud Dataproc, but you can use `spark-submit` on any cluster.\n\nAny Dataproc cluster using the API needs the 'bigquery'  or 'cloud-platform' scopes. Dataproc clusters have the 'bigquery' scope by default, so most clusters in enabled projects should work by default e.g.\n\n```\nMY_CLUSTER=...\ngcloud dataproc clusters create \"$MY_CLUSTER\"\n```\n\n## Downloading and Using the Connector\n\nThe latest version of the connector is publicly available in the following links:\n\n| version    | Link                                                                                                                                                                                                                   |\n|------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| Spark 3.5  | `gs://spark-lib/bigquery/spark-3.5-bigquery-${next-release-tag}.jar`([HTTP link](https://storage.googleapis.com/spark-lib/bigquery/spark-3.5-bigquery-${next-release-tag}.jar))                                        |\n| Spark 3.4  | 
`gs://spark-lib/bigquery/spark-3.4-bigquery-${next-release-tag}.jar`([HTTP link](https://storage.googleapis.com/spark-lib/bigquery/spark-3.4-bigquery-${next-release-tag}.jar))                                        |\n| Spark 3.3  | `gs://spark-lib/bigquery/spark-3.3-bigquery-${next-release-tag}.jar`([HTTP link](https://storage.googleapis.com/spark-lib/bigquery/spark-3.3-bigquery-${next-release-tag}.jar))                                        |\n| Spark 3.2  | `gs://spark-lib/bigquery/spark-3.2-bigquery-${next-release-tag}.jar`([HTTP link](https://storage.googleapis.com/spark-lib/bigquery/spark-3.2-bigquery-${next-release-tag}.jar))                                        |\n| Spark 3.1  | `gs://spark-lib/bigquery/spark-3.1-bigquery-${next-release-tag}.jar`([HTTP link](https://storage.googleapis.com/spark-lib/bigquery/spark-3.1-bigquery-${next-release-tag}.jar))                                        |\n| Spark 2.4  | `gs://spark-lib/bigquery/spark-2.4-bigquery-0.37.0.jar`([HTTP link](https://storage.googleapis.com/spark-lib/bigquery/spark-2.4-bigquery-0.37.0.jar))                                                                  |\n| Scala 2.13 | `gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.13-${next-release-tag}.jar` ([HTTP link](https://storage.googleapis.com/spark-lib/bigquery/spark-bigquery-with-dependencies_2.13-${next-release-tag}.jar)) |\n| Scala 2.12 | `gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-${next-release-tag}.jar` ([HTTP link](https://storage.googleapis.com/spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-${next-release-tag}.jar)) |\n| Scala 2.11 | `gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.11-0.29.0.jar` ([HTTP link](https://storage.googleapis.com/spark-lib/bigquery/spark-bigquery-with-dependencies_2.11-0.29.0.jar))                           |\n\nThe first six versions are Java based connectors targeting Spark 2.4/3.1/3.2/3.3/3.4/3.5 of all Scala versions built on the new\nData Source APIs 
(Data Source API v2) of Spark.\n\nThe final two connectors are Scala based connectors, please use the jar relevant to your Spark installation as outlined\nbelow.\n\n### Connector to Spark Compatibility Matrix\n| Connector \\ Spark                     | 2.3     | 2.4     | 3.0     | 3.1     | 3.2     | 3.3     |3.4      | 3.5     |\n|---------------------------------------|---------|---------|---------|---------|---------|---------|---------|---------|\n| spark-3.5-bigquery                    |         |         |         |         |         |         |         | \u0026check; |\n| spark-3.4-bigquery                    |         |         |         |         |         |         | \u0026check; | \u0026check; |\n| spark-3.3-bigquery                    |         |         |         |         |         | \u0026check; | \u0026check; | \u0026check; |\n| spark-3.2-bigquery                    |         |         |         |         | \u0026check; | \u0026check; | \u0026check; | \u0026check; |\n| spark-3.1-bigquery                    |         |         |         | \u0026check; | \u0026check; | \u0026check; | \u0026check; | \u0026check; |\n| spark-2.4-bigquery                    |         | \u0026check; |         |         |         |         |         |         |\n| spark-bigquery-with-dependencies_2.13 |         |         |         |         | \u0026check; | \u0026check; | \u0026check; | \u0026check; |\n| spark-bigquery-with-dependencies_2.12 |         | \u0026check; | \u0026check; | \u0026check; | \u0026check; | \u0026check; | \u0026check; | \u0026check; |\n| spark-bigquery-with-dependencies_2.11 | \u0026check; | \u0026check; |         |         |         |         |         |         |\n\n### Connector to Dataproc Image Compatibility Matrix\n| Connector \\ Dataproc Image            | 1.3     | 1.4     | 1.5     | 2.0     | 2.1     | 2.2     | Serverless\u003cbr\u003eImage 1.0 | Serverless\u003cbr\u003eImage 2.0 | Serverless\u003cbr\u003eImage 2.1 | 
Serverless\u003cbr\u003eImage 2.2 |\n|---------------------------------------|---------|---------|---------|---------|---------|---------|-------------------------|-------------------------|-------------------------|-------------------------|\n| spark-3.5-bigquery                    |         |         |         |         |         | \u0026check; |                         |                         |                         | \u0026check;                 |\n| spark-3.4-bigquery                    |         |         |         |         |         | \u0026check; |                         |                         | \u0026check;                 | \u0026check;                 |\n| spark-3.3-bigquery                    |         |         |         |         | \u0026check; | \u0026check; | \u0026check;                 | \u0026check;                 | \u0026check;                 | \u0026check;                 |\n| spark-3.2-bigquery                    |         |         |         |         | \u0026check; | \u0026check; | \u0026check;                 | \u0026check;                 | \u0026check;                 | \u0026check;                 |\n| spark-3.1-bigquery                    |         |         |         | \u0026check; | \u0026check; | \u0026check; | \u0026check;                 | \u0026check;                 | \u0026check;                 | \u0026check;                 |\n| spark-2.4-bigquery                    |         | \u0026check; | \u0026check; |         |         |         |                         |                         |                         |                         |\n| spark-bigquery-with-dependencies_2.13 |         |         |         |         |         |         |                         | \u0026check;                 | \u0026check;                 | \u0026check;                 |\n| spark-bigquery-with-dependencies_2.12 |         |         | \u0026check; | \u0026check; | \u0026check; | \u0026check; | \u0026check;                 |          
               |                         |                         |\n| spark-bigquery-with-dependencies_2.11 | \u0026check; | \u0026check; |         |         |         |         |                         |                         |                         |                         |\n\n### Maven / Ivy Package Usage\nThe connector is also available from the\n[Maven Central](https://repo1.maven.org/maven2/com/google/cloud/spark/)\nrepository. It can be used using the `--packages` option or the\n`spark.jars.packages` configuration property. Use the following value\n\n| version    | Connector Artifact                                                                 |\n|------------|------------------------------------------------------------------------------------|\n| Spark 3.5  | `com.google.cloud.spark:spark-3.5-bigquery:${next-release-tag}`                    |\n| Spark 3.4  | `com.google.cloud.spark:spark-3.4-bigquery:${next-release-tag}`                    |\n| Spark 3.3  | `com.google.cloud.spark:spark-3.3-bigquery:${next-release-tag}`                    |\n| Spark 3.2  | `com.google.cloud.spark:spark-3.2-bigquery:${next-release-tag}`                    |\n| Spark 3.1  | `com.google.cloud.spark:spark-3.1-bigquery:${next-release-tag}`                    |\n| Spark 2.4  | `com.google.cloud.spark:spark-2.4-bigquery:0.37.0`                                 |\n| Scala 2.13 | `com.google.cloud.spark:spark-bigquery-with-dependencies_2.13:${next-release-tag}` |\n| Scala 2.12 | `com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:${next-release-tag}` |\n| Scala 2.11 | `com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.29.0`              |\n\n### Specifying the  Spark BigQuery connector version in a Dataproc cluster\n\nDataproc clusters created using image 2.1 and above, or batches using the Dataproc serverless service come with built-in Spark BigQuery connector.\nUsing the standard `--jars` or `--packages` (or alternatively, the 
`spark.jars`/`spark.jars.packages` configuration) won't help in this case as the built-in connector takes precedence.\n\nTo use another version than the built-in one, please do one of the following:\n\n* For Dataproc clusters, using image 2.1 and above, add the following flag on cluster creation to upgrade the version `--metadata SPARK_BQ_CONNECTOR_VERSION=${next-release-tag}`, or `--metadata SPARK_BQ_CONNECTOR_URL=gs://spark-lib/bigquery/spark-3.3-bigquery-${next-release-tag}.jar` to create the cluster with a different jar. The URL can point to any valid connector JAR for the cluster's Spark version.\n* For Dataproc serverless batches, add the following property on batch creation to upgrade the version: `--properties dataproc.sparkBqConnector.version=${next-release-tag}`, or `--properties dataproc.sparkBqConnector.uri=gs://spark-lib/bigquery/spark-3.3-bigquery-${next-release-tag}.jar` to create the batch with a different jar. The URL can point to any valid connector JAR for the runtime's Spark version.\n\n## Hello World Example\n\nYou can run a simple PySpark wordcount against the API without compilation by running\n\n**Dataproc image 1.5 and above**\n\n```\ngcloud dataproc jobs submit pyspark --cluster \"$MY_CLUSTER\" \\\n  --jars gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-${next-release-tag}.jar \\\n  examples/python/shakespeare.py\n```\n\n**Dataproc image 1.4 and below**\n\n```\ngcloud dataproc jobs submit pyspark --cluster \"$MY_CLUSTER\" \\\n  --jars gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.11-0.29.0.jar \\\n  examples/python/shakespeare.py\n```\n\n## Example Codelab ##\nhttps://codelabs.developers.google.com/codelabs/pyspark-bigquery\n\n## Usage\n\nThe connector uses the cross language [Spark SQL Data Source API](https://spark.apache.org/docs/latest/sql-programming-guide.html#data-sources):\n\n### Reading data from a BigQuery table\n\n```\ndf = spark.read \\\n  .format(\"bigquery\") \\\n  
.load(\"bigquery-public-data.samples.shakespeare\")\n```\n\nor the Scala only implicit API:\n\n```\nimport com.google.cloud.spark.bigquery._\nval df = spark.read.bigquery(\"bigquery-public-data.samples.shakespeare\")\n```\n\nThe connector supports reading from tables that contain spaces in their names.\n\n**Note on ambiguous table names**: If a table name contains both spaces and a SQL keyword (e.g., \"from\", \"where\", \"join\"), it may be misinterpreted as a SQL query. To resolve this ambiguity, quote the table identifier with backticks \\`. For example:\n\n```\ndf = spark.read \\\n  .format(\"bigquery\") \\\n  .load(\"`my_project.my_dataset.orders from 2023`\")\n```\n\nFor more information, see additional code samples in\n[Python](examples/python/shakespeare.py),\n[Scala](spark-bigquery-dsv1/src/main/scala/com/google/cloud/spark/bigquery/examples/Shakespeare.scala)\nand\n[Java](spark-bigquery-connector-common/src/main/java/com/google/cloud/spark/bigquery/examples/JavaShakespeare.java).\n\n### Reading data from a BigQuery query\n\nThe connector allows you to run any\n[Standard SQL](https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax)\nSELECT query on BigQuery and fetch its results directly to a Spark Dataframe.\nThis is easily done as described in the following code sample:\n```\nspark.conf.set(\"viewsEnabled\",\"true\")\n\nsql = \"\"\"\n  SELECT tag, COUNT(*) c\n  FROM (\n    SELECT SPLIT(tags, '|') tags\n    FROM `bigquery-public-data.stackoverflow.posts_questions` a\n    WHERE EXTRACT(YEAR FROM creation_date)\u003e=2014\n  ), UNNEST(tags) tag\n  GROUP BY 1\n  ORDER BY 2 DESC\n  LIMIT 10\n  \"\"\"\ndf = spark.read.format(\"bigquery\").load(sql)\ndf.show()\n```\nWhich yields the result\n```\n+----------+-------+\n|       tag|      c|\n+----------+-------+\n|javascript|1643617|\n|    python|1352904|\n|      java|1218220|\n|   android| 913638|\n|       php| 911806|\n|        c#| 905331|\n|      html| 769499|\n|    jquery| 608071|\n|       
css| 510343|\n|       c++| 458938|\n+----------+-------+\n```\nA second option is to use the `query` option like this:\n```\ndf = spark.read.format(\"bigquery\").option(\"query\", sql).load()\n```\n\nNotice that the execution should be faster as only the result is transmitted\nover the wire. In a similar fashion the queries can include JOINs more\nefficiently than running joins on Spark or use other BigQuery features such as\n[subqueries](https://cloud.google.com/bigquery/docs/reference/standard-sql/subqueries),\n[BigQuery user defined functions](https://cloud.google.com/bigquery/docs/reference/standard-sql/user-defined-functions),\n[wildcard tables](https://cloud.google.com/bigquery/docs/reference/standard-sql/wildcard-table-reference),\n[BigQuery ML](https://cloud.google.com/bigquery-ml/docs)\nand more.\n\nIn order to use this feature the `viewsEnabled` configuration MUST be set to\n`true`. This can also be done globally as shown in the example above.\n\n**Important:** This feature is implemented by running the query on BigQuery and\nsaving the result into a temporary table, from which Spark will read the\nresults. This may incur additional costs on your BigQuery account.\n\n### Reading From Parameterized Queries\n\nThe connector supports executing [BigQuery parameterized queries](https://cloud.google.com/bigquery/docs/parameterized-queries) using the\nstandard `spark.read.format('bigquery')` API.\n\nTo use parameterized queries:\n\n1. Provide the SQL query containing parameters using the\n   `.option(\"query\", \"SQL_STRING\")` with named (`@param`) or positional (`?`) parameters.\n2. Specify the parameter values using dedicated options:\n  * **Named Parameters:** Use options prefixed with `NamedParameters.`. 
The\n    parameter name follows the prefix (case-insensitive).\n    * Format: `.option(\"NamedParameters.\u003cparameter_name\u003e\", \"TYPE:value\")`\n    * Example: `.option(\"NamedParameters.corpus\", \"STRING:romeoandjuliet\")`\n  * **Positional Parameters:** Use options prefixed with\n    `PositionalParameters.`. The 1-based index follows the prefix.\n    * Format:\n      `.option(\"PositionalParameters.\u003cparameter_index\u003e\", \"TYPE:value\")`\n    * Example: `.option(\"PositionalParameters.1\", \"STRING:romeoandjuliet\")`\n\nThe `TYPE` in the `TYPE:value` string specifies the BigQuery Standard SQL data\ntype. Supported types currently include: `BOOL`, `INT64`, `FLOAT64`, `NUMERIC`,\n`STRING`, `DATE`, `DATETIME`, `JSON`, `TIME`, `GEOGRAPHY`, `TIMESTAMP`.\n\n`ARRAY` and `STRUCT` types are not supported as parameters at this time.\n\n### Reading From Views\n\nThe connector has a preliminary support for reading from\n[BigQuery views](https://cloud.google.com/bigquery/docs/views-intro). Please\nnote there are a few caveats:\n\n* BigQuery views are not materialized by default, which means that the connector\n  needs to materialize them before it can read them. This process affects the\n  read performance, even before running any `collect()` or `count()` action.\n* The materialization process can also incur additional costs to your BigQuery\n  bill.\n* Reading from views is **disabled** by default. In order to enable it,\n  either set the viewsEnabled option when reading the specific view\n  (`.option(\"viewsEnabled\", \"true\")`) or set it globally by calling\n  `spark.conf.set(\"viewsEnabled\", \"true\")`.\n\n**Notice:** Before version 0.42.1 of the connector, the following configurations\nare required:\n* By default, the materialized views are created in the same project and\n  dataset. Those can be configured by the optional `materializationProject`\n  and `materializationDataset` options, respectively. 
These options can also\n  be globally set by calling `spark.conf.set(...)` before reading the views.\n* As mentioned in the [BigQuery documentation](https://cloud.google.com/bigquery/docs/writing-results#temporary_and_permanent_tables),\n  the `materializationDataset` should be in the same location as the view.\n\nStarting with version 0.42.1, those configurations are **redundant** and are ignored.\nIt is highly recommended to upgrade to this version or a later one to enjoy\nsimpler configuration when using views or loading from queries.\n\n### Writing data to BigQuery\n\nWriting DataFrames to BigQuery can be done using two methods: Direct and Indirect.\n\n#### Direct write using the BigQuery Storage Write API\n\nIn this method the data is written directly to BigQuery using the\n[BigQuery Storage Write API](https://cloud.google.com/bigquery/docs/write-api). In order to enable this option, please\nset the `writeMethod` option to `direct`, as shown below:\n\n```\ndf.write \\\n  .format(\"bigquery\") \\\n  .option(\"writeMethod\", \"direct\") \\\n  .option(\"writeAtLeastOnce\", \"true\") \\\n  .save(\"dataset.table\")\n```\n\nWriting to existing partitioned tables (date partitioned, ingestion time partitioned and range\npartitioned) in APPEND save mode and OVERWRITE mode (only date and range partitioned) is fully supported by the connector and the BigQuery Storage Write\nAPI. 
The `datePartition`, `partitionField`, `partitionType`, `partitionRangeStart`, `partitionRangeEnd` and `partitionRangeInterval` options\ndescribed below are not currently supported by the direct write method.\n\n**Important:** Please refer to the [data ingestion pricing](https://cloud.google.com/bigquery/pricing#data_ingestion_pricing)\npage regarding the BigQuery Storage Write API pricing.\n\n**Important:** Please use version 0.24.2 or later for direct writes, as previous\nversions have a bug that may cause a table deletion in certain cases.\n\n#### Indirect write\nIn this method the data is first written to GCS and then loaded into BigQuery. A GCS bucket must be configured\nto indicate the temporary data location.\n\n```\ndf.write \\\n  .format(\"bigquery\") \\\n  .option(\"temporaryGcsBucket\",\"some-bucket\") \\\n  .save(\"dataset.table\")\n```\n\nThe data is temporarily stored using the [Apache Parquet](https://parquet.apache.org/),\n[Apache ORC](https://orc.apache.org/) or [Apache Avro](https://avro.apache.org/) formats.\n\nThe GCS bucket and the format can also be set globally using Spark's RuntimeConfig like this:\n```\nspark.conf.set(\"temporaryGcsBucket\",\"some-bucket\")\ndf.write \\\n  .format(\"bigquery\") \\\n  .save(\"dataset.table\")\n```\n\nWhen streaming a DataFrame to BigQuery, each batch is written in the same manner as a non-streaming DataFrame.\nNote that an HDFS-compatible\n[checkpoint location](http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failures-with-checkpointing)\n(e.g., `path/to/HDFS/dir` or `gs://checkpoint-bucket/checkpointDir`) must be specified.\n\n```\ndf.writeStream \\\n  .format(\"bigquery\") \\\n  .option(\"temporaryGcsBucket\",\"some-bucket\") \\\n  .option(\"checkpointLocation\", \"some-location\") \\\n  .option(\"table\", \"dataset.table\") \\\n  .start()\n```\n\n**Important:** The connector does not configure the GCS connector, in order to avoid conflict with another GCS connector, 
if one exists. In order to use the write capabilities of the connector, please configure the GCS connector on your cluster as explained [here](https://github.com/GoogleCloudPlatform/bigdata-interop/tree/master/gcs).\n\n### Running SQL on BigQuery\n\nThe connector supports Spark's [SparkSession#executeCommand](https://archive.apache.org/dist/spark/docs/3.0.0/api/java/org/apache/spark/sql/SparkSession.html#executeCommand-java.lang.String-java.lang.String-scala.collection.immutable.Map-)\nwith the Spark-X.Y-bigquery connectors. It can be used to run any arbitrary DDL/DML Standard SQL statement on BigQuery as\na query job. `SELECT` statements are not supported, as those are covered by reading from a query as shown above. It can\nbe used as follows:\n```\nspark.executeCommand(\"bigquery\", sql, options)\n```\nNote the following:\n* Apart from the authentication options, no other options are supported by this functionality.\n* This API is available only in the Scala/Java API. PySpark does not provide it.\n\n### Properties\n\nThe API supports a number of options to configure reads and writes.\n\n\u003c!--- TODO(#2): Convert to markdown --\u003e\n\u003ctable id=\"propertytable\"\u003e\n\u003cstyle\u003e\ntable#propertytable td, table th\n{\nword-break:break-word\n}\n\u003c/style\u003e\n  \u003ctr valign=\"top\"\u003e\n   \u003cth style=\"min-width:240px\"\u003eProperty\u003c/th\u003e\n   \u003cth\u003eMeaning\u003c/th\u003e\n   \u003cth style=\"min-width:80px\"\u003eUsage\u003c/th\u003e\n  \u003c/tr\u003e\n  \u003ctr valign=\"top\"\u003e\n   \u003ctd\u003e\u003ccode\u003etable\u003c/code\u003e\n   \u003c/td\u003e\n   \u003ctd\u003eThe BigQuery table in the format \u003ccode\u003e[[project:]dataset.]table\u003c/code\u003e.\n       It is recommended to use the \u003ccode\u003epath\u003c/code\u003e parameter of\n       \u003ccode\u003eload()\u003c/code\u003e/\u003ccode\u003esave()\u003c/code\u003e instead. 
This option has been\n       deprecated and will be removed in a future version.\n       \u003cbr/\u003e\u003cstrong\u003e(Deprecated)\u003c/strong\u003e\n   \u003c/td\u003e\n   \u003ctd\u003eRead/Write\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr valign=\"top\"\u003e\n   \u003ctd\u003e\u003ccode\u003edataset\u003c/code\u003e\n   \u003c/td\u003e\n   \u003ctd\u003eThe dataset containing the table. This option should be used with\n   standard tables and views, but not when loading query results.\n       \u003cbr/\u003e(Optional unless omitted in \u003ccode\u003etable\u003c/code\u003e)\n   \u003c/td\u003e\n   \u003ctd\u003eRead/Write\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr valign=\"top\"\u003e\n   \u003ctd\u003e\u003ccode\u003eproject\u003c/code\u003e\n   \u003c/td\u003e\n   \u003ctd\u003eThe Google Cloud Project ID of the table. This option should be used with\n   standard tables and views, but not when loading query results.\n       \u003cbr/\u003e(Optional. Defaults to the project of the Service Account being used)\n   \u003c/td\u003e\n   \u003ctd\u003eRead/Write\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr valign=\"top\"\u003e\n   \u003ctd\u003e\u003ccode\u003eparentProject\u003c/code\u003e\n   \u003c/td\u003e\n   \u003ctd\u003eThe Google Cloud Project ID of the table to bill for the export.\n       \u003cbr/\u003e(Optional. Defaults to the project of the Service Account being used)\n   \u003c/td\u003e\n   \u003ctd\u003eRead/Write\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr valign=\"top\"\u003e\n   \u003ctd\u003e\u003ccode\u003emaxParallelism\u003c/code\u003e\n   \u003c/td\u003e\n   \u003ctd\u003eThe maximum number of partitions to split the data into. The actual number\n       may be less if BigQuery deems the data small enough. 
If there are not\n       enough executors to schedule a reader per partition, some partitions may\n       be empty.\n       \u003cbr/\u003e\u003cb\u003eImportant:\u003c/b\u003e The old parameter (\u003ccode\u003eparallelism\u003c/code\u003e) is\n            still supported but is deprecated. It will be removed in\n            version 1.0 of the connector.\n       \u003cbr/\u003e(Optional. Defaults to the larger of \u003ccode\u003epreferredMinParallelism\u003c/code\u003e and 20,000.)\n   \u003c/td\u003e\n   \u003ctd\u003eRead\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr valign=\"top\"\u003e\n   \u003ctd\u003e\u003ccode\u003epreferredMinParallelism\u003c/code\u003e\n   \u003c/td\u003e\n   \u003ctd\u003eThe preferred minimum number of partitions to split the data into. The actual number\n       may be less if BigQuery deems the data small enough. If there are not\n       enough executors to schedule a reader per partition, some partitions may\n       be empty.\n       \u003cbr/\u003e(Optional. Defaults to the smaller of 3 times the application's default parallelism\n       and \u003ccode\u003emaxParallelism\u003c/code\u003e.)\n   \u003c/td\u003e\n   \u003ctd\u003eRead\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr valign=\"top\"\u003e\n   \u003ctd\u003e\u003ccode\u003eviewsEnabled\u003c/code\u003e\n   \u003c/td\u003e\n   \u003ctd\u003eEnables the connector to read from views and not only tables. Please read\n       the \u003ca href=\"#reading-from-views\"\u003erelevant section\u003c/a\u003e before activating\n       this option.\n       \u003cbr/\u003e(Optional. Defaults to \u003ccode\u003efalse\u003c/code\u003e)\n   \u003c/td\u003e\n   \u003ctd\u003eRead\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr valign=\"top\"\u003e\n   \u003ctd\u003e\u003ccode\u003ereadDataFormat\u003c/code\u003e\n   \u003c/td\u003e\n   \u003ctd\u003eData format for reading from BigQuery. Options: \u003ccode\u003eARROW\u003c/code\u003e, \u003ccode\u003eAVRO\u003c/code\u003e\n       \u003cbr/\u003e(Optional. 
Defaults to \u003ccode\u003eARROW\u003c/code\u003e)\n   \u003c/td\u003e\n   \u003ctd\u003eRead\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr valign=\"top\"\u003e\n   \u003ctd\u003e\u003ccode\u003eoptimizedEmptyProjection\u003c/code\u003e\n   \u003c/td\u003e\n   \u003ctd\u003eThe connector uses an optimized empty projection (select without any\n       columns) logic, used for \u003ccode\u003ecount()\u003c/code\u003e execution. This logic takes\n       the data directly from the table metadata or performs a much more efficient\n       `SELECT COUNT(*) WHERE...` in case there is a filter. You can disable\n       this logic by setting this option to \u003ccode\u003efalse\u003c/code\u003e.\n       \u003cbr/\u003e(Optional, defaults to \u003ccode\u003etrue\u003c/code\u003e)\n   \u003c/td\u003e\n   \u003ctd\u003eRead\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr valign=\"top\"\u003e\n   \u003ctd\u003e\u003ccode\u003epushAllFilters\u003c/code\u003e\n   \u003c/td\u003e\n   \u003ctd\u003eIf set to \u003ccode\u003etrue\u003c/code\u003e, the connector pushes all the filters Spark can delegate\n       to the BigQuery Storage API. This reduces the amount of data that needs to be sent from\n       BigQuery Storage API servers to Spark clients. This option has been\n       deprecated and will be removed in a future version.\n       \u003cbr/\u003e(Optional, defaults to \u003ccode\u003etrue\u003c/code\u003e)\n       \u003cbr/\u003e\u003cstrong\u003e(Deprecated)\u003c/strong\u003e\n   \u003c/td\u003e\n   \u003ctd\u003eRead\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr valign=\"top\"\u003e\n   \u003ctd\u003e\u003ccode\u003ebigQueryJobLabel\u003c/code\u003e\n   \u003c/td\u003e\n   \u003ctd\u003eCan be used to add labels to the connector-initiated query and load\n       BigQuery jobs. 
Multiple labels can be set.\n       \u003cbr/\u003e(Optional)\n   \u003c/td\u003e\n   \u003ctd\u003eRead\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr valign=\"top\"\u003e\n   \u003ctd\u003e\u003ccode\u003ebigQueryTableLabel\u003c/code\u003e\n   \u003c/td\u003e\n   \u003ctd\u003eCan be used to add labels to the table while writing to a table. Multiple\n       labels can be set.\n       \u003cbr/\u003e(Optional)\n   \u003c/td\u003e\n   \u003ctd\u003eWrite\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr valign=\"top\"\u003e\n   \u003ctd\u003e\u003ccode\u003etraceApplicationName\u003c/code\u003e\n   \u003c/td\u003e\n   \u003ctd\u003eApplication name used to trace BigQuery Storage read and write sessions.\n       Setting the application name is required to set the trace ID on the\n       sessions.\n       \u003cbr/\u003e(Optional)\n   \u003c/td\u003e\n   \u003ctd\u003eRead\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr valign=\"top\"\u003e\n   \u003ctd\u003e\u003ccode\u003etraceJobId\u003c/code\u003e\n   \u003c/td\u003e\n   \u003ctd\u003eJob ID used to trace BigQuery Storage read and write sessions.\n       \u003cbr/\u003e(Optional, defaults to the Dataproc job ID if it exists, otherwise uses\n       the Spark application ID)\n   \u003c/td\u003e\n   \u003ctd\u003eRead\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr valign=\"top\"\u003e\n     \u003ctd\u003e\u003ccode\u003ecreateDisposition\u003c/code\u003e\n      \u003c/td\u003e\n      \u003ctd\u003eSpecifies whether the job is allowed to create new tables. 
The permitted\n          values are:\n          \u003cul\u003e\n            \u003cli\u003e\u003ccode\u003eCREATE_IF_NEEDED\u003c/code\u003e - Configures the job to create the\n                table if it does not exist.\u003c/li\u003e\n            \u003cli\u003e\u003ccode\u003eCREATE_NEVER\u003c/code\u003e - Configures the job to fail if the\n                table does not exist.\u003c/li\u003e\n          \u003c/ul\u003e\n          This option takes effect only if Spark has decided to write data\n          to the table based on the SaveMode.\n         \u003cbr/\u003e(Optional. Defaults to \u003ccode\u003eCREATE_IF_NEEDED\u003c/code\u003e).\n      \u003c/td\u003e\n      \u003ctd\u003eWrite\u003c/td\u003e\n   \u003c/tr\u003e\n\n  \u003ctr valign=\"top\"\u003e\n   \u003ctd\u003e\u003ccode\u003ewriteMethod\u003c/code\u003e\n     \u003c/td\u003e\n       \u003ctd\u003eControls the method\n       by which the data is written to BigQuery. Available values are \u003ccode\u003edirect\u003c/code\u003e\n       to use the BigQuery Storage Write API and \u003ccode\u003eindirect\u003c/code\u003e which writes the\n       data first to GCS and then triggers a BigQuery load operation. See more\n       \u003ca href=\"#writing-data-to-bigquery\"\u003ehere\u003c/a\u003e\n       \u003cbr/\u003e(Optional, defaults to \u003ccode\u003eindirect\u003c/code\u003e)\n     \u003c/td\u003e\n   \u003ctd\u003eWrite\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr valign=\"top\"\u003e\n   \u003ctd\u003e\u003ccode\u003ewriteAtLeastOnce\u003c/code\u003e\n   \u003c/td\u003e\n   \u003ctd\u003eGuarantees that data is written to BigQuery at least once. This is a lesser\n    guarantee than exactly once. This is suitable for streaming scenarios\n    in which data is continuously being written in small batches.\n       \u003cbr/\u003e(Optional. 
Defaults to \u003ccode\u003efalse\u003c/code\u003e)\n       \u003cbr/\u003e\u003ci\u003eSupported only by the `DIRECT` write method, and only when the mode is \u003cb\u003eNOT\u003c/b\u003e `Overwrite`.\u003c/i\u003e\n   \u003c/td\u003e\n   \u003ctd\u003eWrite\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr valign=\"top\"\u003e\n   \u003ctd\u003e\u003ccode\u003etemporaryGcsBucket\u003c/code\u003e\n   \u003c/td\u003e\n   \u003ctd\u003eThe GCS bucket that temporarily holds the data before it is loaded to\n       BigQuery. Required unless set in the Spark configuration\n       (\u003ccode\u003espark.conf.set(...)\u003c/code\u003e).\n       \u003cbr/\u003eStarting with version 0.42.0, defaults to the `fs.gs.system.bucket` if it exists, for example on Google Cloud Dataproc clusters.\n       \u003cbr/\u003e\u003ci\u003eSupported only by the `INDIRECT` write method.\u003c/i\u003e\n   \u003c/td\u003e\n   \u003ctd\u003eWrite\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr valign=\"top\"\u003e\n   \u003ctd\u003e\u003ccode\u003epersistentGcsBucket\u003c/code\u003e\n   \u003c/td\u003e\n   \u003ctd\u003eThe GCS bucket that holds the data before it is loaded to\n       BigQuery. If set, the data won't be deleted after it is written\n       into BigQuery.\n       \u003cbr/\u003e\u003ci\u003eSupported only by the `INDIRECT` write method.\u003c/i\u003e\n   \u003c/td\u003e\n   \u003ctd\u003eWrite\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr valign=\"top\"\u003e\n   \u003ctd\u003e\u003ccode\u003epersistentGcsPath\u003c/code\u003e\n   \u003c/td\u003e\n   \u003ctd\u003eThe GCS path that holds the data before it is loaded to\n       BigQuery. 
Used only with \u003ccode\u003epersistentGcsBucket\u003c/code\u003e.\n       \u003cbr/\u003e\u003ci\u003eNot supported by the `DIRECT` write method.\u003c/i\u003e\n   \u003c/td\u003e\n   \u003ctd\u003eWrite\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr valign=\"top\"\u003e\n   \u003ctd\u003e\u003ccode\u003eintermediateFormat\u003c/code\u003e\n   \u003c/td\u003e\n   \u003ctd\u003eThe format of the data before it is loaded to BigQuery, values can be\n       either \"parquet\",\"orc\" or \"avro\". In order to use the Avro format, the\n       spark-avro package must be added in runtime.\n       \u003cbr/\u003e(Optional. Defaults to \u003ccode\u003eparquet\u003c/code\u003e). On write only. Supported only for the `INDIRECT` write method.\n   \u003c/td\u003e\n   \u003ctd\u003eWrite\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr valign=\"top\"\u003e\n   \u003ctd\u003e\u003ccode\u003euseAvroLogicalTypes\u003c/code\u003e\n   \u003c/td\u003e\n   \u003ctd\u003eWhen loading from Avro (`.option(\"intermediateFormat\", \"avro\")`), BigQuery uses the underlying Avro types instead of the logical types [by default](https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro#logical_types). Supplying this option converts Avro logical types to their corresponding BigQuery data types.\n       \u003cbr/\u003e(Optional. Defaults to \u003ccode\u003efalse\u003c/code\u003e). On write only.\n   \u003c/td\u003e\n   \u003ctd\u003eWrite\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr valign=\"top\"\u003e\n   \u003ctd\u003e\u003ccode\u003edatePartition\u003c/code\u003e\n   \u003c/td\u003e\n   \u003ctd\u003eThe date partition the data is going to be written to. Should be a date string\n       given in the format \u003ccode\u003eYYYYMMDD\u003c/code\u003e. 
Can be used to overwrite the data of\n\t   a single partition, like this: \u003ccode\u003e\u003cbr/\u003edf.write.format(\"bigquery\")\n       \u003cbr/\u003e\u0026nbsp;\u0026nbsp;.option(\"datePartition\", \"20220331\")\n       \u003cbr/\u003e\u0026nbsp;\u0026nbsp;.mode(\"overwrite\")\n       \u003cbr/\u003e\u0026nbsp;\u0026nbsp;.save(\"table\")\u003c/code\u003e\n       \u003cbr/\u003e(Optional). On write only.\n        \u003cbr/\u003e Can also be used with different partition types like:\n        \u003cbr/\u003e HOUR: \u003ccode\u003eYYYYMMDDHH\u003c/code\u003e\n        \u003cbr/\u003e MONTH: \u003ccode\u003eYYYYMM\u003c/code\u003e\n        \u003cbr/\u003e YEAR: \u003ccode\u003eYYYY\u003c/code\u003e\n        \u003cbr/\u003e\u003ci\u003eNot supported by the `DIRECT` write method.\u003c/i\u003e\n   \u003c/td\u003e\n   \u003ctd\u003eWrite\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr valign=\"top\"\u003e\n     \u003ctd\u003e\u003ccode\u003epartitionField\u003c/code\u003e\n     \u003c/td\u003e\n     \u003ctd\u003eIf this field is specified, the table is partitioned by this field.\n         \u003cbr/\u003eFor Time partitioning, specify together with the option `partitionType`.\n         \u003cbr/\u003eFor Integer-range partitioning, specify together with the 3 options: `partitionRangeStart`, `partitionRangeEnd`, `partitionRangeInterval`.\n         \u003cbr/\u003eThe field must be a top-level TIMESTAMP or DATE field for Time partitioning, or INT64 for Integer-range partitioning. 
Its mode must be \u003cstrong\u003eNULLABLE\u003c/strong\u003e\n         or \u003cstrong\u003eREQUIRED\u003c/strong\u003e.\n         If the option is not set for a Time partitioned table, then the table will be partitioned by a pseudo\n         column, referenced via either \u003ccode\u003e'_PARTITIONTIME' as TIMESTAMP\u003c/code\u003e type, or\n         \u003ccode\u003e'_PARTITIONDATE' as DATE\u003c/code\u003e type.\n         \u003cbr/\u003e(Optional).\n         \u003cbr/\u003e\u003ci\u003eNot supported by the `DIRECT` write method.\u003c/i\u003e\n     \u003c/td\u003e\n     \u003ctd\u003eWrite\u003c/td\u003e\n    \u003c/tr\u003e\n   \u003ctr valign=\"top\"\u003e\n    \u003ctd\u003e\u003ccode\u003epartitionExpirationMs\u003c/code\u003e\n     \u003c/td\u003e\n     \u003ctd\u003eNumber of milliseconds for which to keep the storage for partitions in the table.\n         The storage in a partition will have an expiration time of its partition time plus this value.\n        \u003cbr/\u003e(Optional).\n        \u003cbr/\u003e\u003ci\u003eNot supported by the `DIRECT` write method.\u003c/i\u003e\n     \u003c/td\u003e\n     \u003ctd\u003eWrite\u003c/td\u003e\n   \u003c/tr\u003e\n   \u003ctr valign=\"top\"\u003e\n       \u003ctd\u003e\u003ccode\u003epartitionType\u003c/code\u003e\n        \u003c/td\u003e\n        \u003ctd\u003eUsed to specify Time partitioning.\n            \u003cbr/\u003eSupported types are: \u003ccode\u003eHOUR, DAY, MONTH, YEAR\u003c/code\u003e\n            \u003cbr/\u003e This option is \u003cb\u003emandatory\u003c/b\u003e for a target table to be Time partitioned.\n            \u003cbr/\u003e(Optional. 
Defaults to DAY if \u003ccode\u003epartitionField\u003c/code\u003e is specified).\n            \u003cbr/\u003e\u003ci\u003eNot supported by the `DIRECT` write method.\u003c/i\u003e\n       \u003c/td\u003e\n        \u003ctd\u003eWrite\u003c/td\u003e\n     \u003c/tr\u003e\n   \u003ctr valign=\"top\"\u003e\n       \u003ctd\u003e\u003ccode\u003epartitionRangeStart\u003c/code\u003e,\n           \u003ccode\u003epartitionRangeEnd\u003c/code\u003e,\n           \u003ccode\u003epartitionRangeInterval\u003c/code\u003e\n        \u003c/td\u003e\n        \u003ctd\u003eUsed to specify Integer-range partitioning.\n            \u003cbr/\u003eThese options are \u003cb\u003emandatory\u003c/b\u003e for a target table to be Integer-range partitioned.\n            \u003cbr/\u003eAll 3 options must be specified.\n            \u003cbr/\u003e\u003ci\u003eNot supported by the `DIRECT` write method.\u003c/i\u003e\n       \u003c/td\u003e\n        \u003ctd\u003eWrite\u003c/td\u003e\n   \u003c/tr\u003e\n    \u003ctr valign=\"top\"\u003e\n           \u003ctd\u003e\u003ccode\u003eclusteredFields\u003c/code\u003e\n            \u003c/td\u003e\n            \u003ctd\u003eA comma-separated string of non-repeated, top-level columns.\n               \u003cbr/\u003e(Optional).\n            \u003c/td\u003e\n            \u003ctd\u003eWrite\u003c/td\u003e\n         \u003c/tr\u003e\n   \u003ctr valign=\"top\"\u003e\n       \u003ctd\u003e\u003ccode\u003eallowFieldAddition\u003c/code\u003e\n        \u003c/td\u003e\n        \u003ctd\u003eAdds the \u003ca href=\"https://googleapis.dev/java/google-cloud-clients/latest/com/google/cloud/bigquery/JobInfo.SchemaUpdateOption.html#ALLOW_FIELD_ADDITION\" target=\"_blank\"\u003eALLOW_FIELD_ADDITION\u003c/a\u003e\n            SchemaUpdateOption to the BigQuery LoadJob. Allowed values are \u003ccode\u003etrue\u003c/code\u003e and \u003ccode\u003efalse\u003c/code\u003e.\n           \u003cbr/\u003e(Optional. 
Defaults to \u003ccode\u003efalse\u003c/code\u003e).\n           \u003cbr/\u003e\u003ci\u003eSupported only by the `INDIRECT` write method.\u003c/i\u003e\n        \u003c/td\u003e\n        \u003ctd\u003eWrite\u003c/td\u003e\n     \u003c/tr\u003e\n   \u003ctr valign=\"top\"\u003e\n       \u003ctd\u003e\u003ccode\u003eallowFieldRelaxation\u003c/code\u003e\n        \u003c/td\u003e\n        \u003ctd\u003eAdds the \u003ca href=\"https://googleapis.dev/java/google-cloud-clients/latest/com/google/cloud/bigquery/JobInfo.SchemaUpdateOption.html#ALLOW_FIELD_RELAXATION\" target=\"_blank\"\u003eALLOW_FIELD_RELAXATION\u003c/a\u003e\n            SchemaUpdateOption to the BigQuery LoadJob. Allowed values are \u003ccode\u003etrue\u003c/code\u003e and \u003ccode\u003efalse\u003c/code\u003e.\n           \u003cbr/\u003e(Optional. Defaults to \u003ccode\u003efalse\u003c/code\u003e).\n           \u003cbr/\u003e\u003ci\u003eSupported only by the `INDIRECT` write method.\u003c/i\u003e\n        \u003c/td\u003e\n        \u003ctd\u003eWrite\u003c/td\u003e\n     \u003c/tr\u003e\n   \u003ctr valign=\"top\"\u003e\n     \u003ctd\u003e\u003ccode\u003eproxyAddress\u003c/code\u003e\n     \u003c/td\u003e\n     \u003ctd\u003e Address of the proxy server. The proxy must be an HTTP proxy and the address should be in the `host:port` format.\n          Can be alternatively set in the Spark configuration (\u003ccode\u003espark.conf.set(...)\u003c/code\u003e) or in Hadoop\n          Configuration (\u003ccode\u003efs.gs.proxy.address\u003c/code\u003e).\n          \u003cbr/\u003e (Optional. Required only if connecting to GCP via proxy.)\n     \u003c/td\u003e\n     \u003ctd\u003eRead/Write\u003c/td\u003e\n   \u003c/tr\u003e\n   \u003ctr valign=\"top\"\u003e\n     \u003ctd\u003e\u003ccode\u003eproxyUsername\u003c/code\u003e\n     \u003c/td\u003e\n     \u003ctd\u003e The username used to connect to the proxy. 
Can be alternatively set in the Spark configuration\n          (\u003ccode\u003espark.conf.set(...)\u003c/code\u003e) or in Hadoop Configuration (\u003ccode\u003efs.gs.proxy.username\u003c/code\u003e).\n          \u003cbr/\u003e (Optional. Required only if connecting to GCP via proxy with authentication.)\n     \u003c/td\u003e\n     \u003ctd\u003eRead/Write\u003c/td\u003e\n   \u003c/tr\u003e\n   \u003ctr valign=\"top\"\u003e\n     \u003ctd\u003e\u003ccode\u003eproxyPassword\u003c/code\u003e\n     \u003c/td\u003e\n     \u003ctd\u003e The password used to connect to the proxy. Can be alternatively set in the Spark configuration\n          (\u003ccode\u003espark.conf.set(...)\u003c/code\u003e) or in Hadoop Configuration (\u003ccode\u003efs.gs.proxy.password\u003c/code\u003e).\n          \u003cbr/\u003e (Optional. Required only if connecting to GCP via proxy with authentication.)\n     \u003c/td\u003e\n     \u003ctd\u003eRead/Write\u003c/td\u003e\n   \u003c/tr\u003e\n   \u003ctr valign=\"top\"\u003e\n     \u003ctd\u003e\u003ccode\u003ehttpMaxRetry\u003c/code\u003e\n     \u003c/td\u003e\n     \u003ctd\u003e The maximum number of retries for the low-level HTTP requests to BigQuery. Can be alternatively set in the\n          Spark configuration (\u003ccode\u003espark.conf.set(\"httpMaxRetry\", ...)\u003c/code\u003e) or in Hadoop Configuration\n          (\u003ccode\u003efs.gs.http.max.retry\u003c/code\u003e).\n          \u003cbr/\u003e (Optional. Default is 10)\n     \u003c/td\u003e\n     \u003ctd\u003eRead/Write\u003c/td\u003e\n   \u003c/tr\u003e\n   \u003ctr valign=\"top\"\u003e\n     \u003ctd\u003e\u003ccode\u003ehttpConnectTimeout\u003c/code\u003e\n     \u003c/td\u003e\n     \u003ctd\u003e The timeout in milliseconds to establish a connection with BigQuery. 
Can be alternatively set in the\n          Spark configuration (\u003ccode\u003espark.conf.set(\"httpConnectTimeout\", ...)\u003c/code\u003e) or in Hadoop Configuration\n          (\u003ccode\u003efs.gs.http.connect-timeout\u003c/code\u003e).\n          \u003cbr/\u003e (Optional. Default is 60000 ms. 0 for an infinite timeout, a negative number for 20000)\n     \u003c/td\u003e\n     \u003ctd\u003eRead/Write\u003c/td\u003e\n   \u003c/tr\u003e\n   \u003ctr valign=\"top\"\u003e\n     \u003ctd\u003e\u003ccode\u003ehttpReadTimeout\u003c/code\u003e\n     \u003c/td\u003e\n     \u003ctd\u003e The timeout in milliseconds to read data from an established connection. Can be alternatively set in the\n          Spark configuration (\u003ccode\u003espark.conf.set(\"httpReadTimeout\", ...)\u003c/code\u003e) or in Hadoop Configuration\n          (\u003ccode\u003efs.gs.http.read-timeout\u003c/code\u003e).\n          \u003cbr/\u003e (Optional. Default is 60000 ms. 0 for an infinite timeout, a negative number for 20000)\n     \u003c/td\u003e\n     \u003ctd\u003eRead\u003c/td\u003e\n   \u003c/tr\u003e\n   \u003ctr valign=\"top\"\u003e\n     \u003ctd\u003e\u003ccode\u003earrowCompressionCodec\u003c/code\u003e\n     \u003c/td\u003e\n     \u003ctd\u003e  Compression codec while reading from a BigQuery table when using Arrow format. Options :\n           \u003ccode\u003eZSTD (Zstandard compression)\u003c/code\u003e,\n           \u003ccode\u003eLZ4_FRAME (https://github.com/lz4/lz4/blob/dev/doc/lz4_Frame_format.md)\u003c/code\u003e,\n           \u003ccode\u003eCOMPRESSION_UNSPECIFIED\u003c/code\u003e. The recommended compression codec is \u003ccode\u003eZSTD\u003c/code\u003e\n           while using Java.\n          \u003cbr/\u003e (Optional. 
Defaults to \u003ccode\u003eCOMPRESSION_UNSPECIFIED\u003c/code\u003e which means no compression will be used)\n     \u003c/td\u003e\n     \u003ctd\u003eRead\u003c/td\u003e\n   \u003c/tr\u003e\n   \u003ctr valign=\"top\"\u003e\n     \u003ctd\u003e\u003ccode\u003eresponseCompressionCodec\u003c/code\u003e\n     \u003c/td\u003e\n     \u003ctd\u003e  Compression codec used to compress the ReadRowsResponse data. Options:\n           \u003ccode\u003eRESPONSE_COMPRESSION_CODEC_UNSPECIFIED\u003c/code\u003e,\n           \u003ccode\u003eRESPONSE_COMPRESSION_CODEC_LZ4\u003c/code\u003e\n          \u003cbr/\u003e (Optional. Defaults to \u003ccode\u003eRESPONSE_COMPRESSION_CODEC_UNSPECIFIED\u003c/code\u003e which means no compression will be used)\n     \u003c/td\u003e\n     \u003ctd\u003eRead\u003c/td\u003e\n   \u003c/tr\u003e\n   \u003ctr valign=\"top\"\u003e\n     \u003ctd\u003e\u003ccode\u003ecacheExpirationTimeInMinutes\u003c/code\u003e\n     \u003c/td\u003e\n     \u003ctd\u003e  The expiration time of the in-memory cache storing query information.\n          \u003cbr/\u003e To disable caching, set the value to 0.\n          \u003cbr/\u003e (Optional. Defaults to 15 minutes)\n     \u003c/td\u003e\n     \u003ctd\u003eRead\u003c/td\u003e\n   \u003c/tr\u003e\n   \u003ctr valign=\"top\"\u003e\n     \u003ctd\u003e\u003ccode\u003eenableModeCheckForSchemaFields\u003c/code\u003e\n     \u003c/td\u003e\n     \u003ctd\u003e  Checks the mode of every field in destination schema to be equal to the mode in corresponding source field schema, during DIRECT write.\n          \u003cbr/\u003e Default value is true i.e., the check is done by default. 
If set to false the mode check is ignored.\n     \u003c/td\u003e\n     \u003ctd\u003eWrite\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr valign=\"top\"\u003e\n     \u003ctd\u003e\u003ccode\u003eenableListInference\u003c/code\u003e\n     \u003c/td\u003e\n     \u003ctd\u003e  Indicates whether to use schema inference specifically when the mode is Parquet (https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#parquetoptions).\n        \u003cbr/\u003e Defaults to false.\n        \u003cbr/\u003e\n     \u003c/td\u003e\n     \u003ctd\u003eWrite\u003c/td\u003e\n   \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\u003ccode\u003ebqChannelPoolSize\u003c/code\u003e\u003c/td\u003e\n    \u003ctd\u003e  The (fixed) size of the gRPC channel pool created by the BigQueryReadClient.\n        \u003cbr/\u003eFor optimal performance, this should be set to at least the number of cores on the cluster executors.\n    \u003c/td\u003e\n    \u003ctd\u003eRead\u003c/td\u003e\n  \u003c/tr\u003e\n   \u003ctr\u003e\n     \u003ctd\u003e\u003ccode\u003ecreateReadSessionTimeoutInSeconds\u003c/code\u003e\n     \u003c/td\u003e\n     \u003ctd\u003e The timeout in seconds to create a ReadSession when reading a table.\n          \u003cbr/\u003e For extremely large tables, this value should be increased.\n          \u003cbr/\u003e (Optional. Defaults to 600 seconds)\n     \u003c/td\u003e\n     \u003ctd\u003eRead\u003c/td\u003e\n   \u003c/tr\u003e\n  \u003ctr\u003e\n     \u003ctd\u003e\u003ccode\u003equeryJobPriority\u003c/code\u003e\n     \u003c/td\u003e\n     \u003ctd\u003e Priority level set for the job while reading data from a BigQuery query. The permitted values are:\n          \u003cul\u003e\n            \u003cli\u003e\u003ccode\u003eBATCH\u003c/code\u003e - Query is queued and started as soon as idle resources are available, usually within a few minutes. 
If the query hasn't started within 3 hours, its priority is changed to \u003ccode\u003eINTERACTIVE\u003c/code\u003e.\u003c/li\u003e\n            \u003cli\u003e\u003ccode\u003eINTERACTIVE\u003c/code\u003e - Query is executed as soon as possible and counts towards the concurrent rate limit and the daily rate limit.\u003c/li\u003e\n          \u003c/ul\u003e\n          For WRITE, this option will be effective when DIRECT write is used with OVERWRITE mode, where the connector overwrites the destination table using a MERGE statement.\n          \u003cbr/\u003e (Optional. Defaults to \u003ccode\u003eINTERACTIVE\u003c/code\u003e)\n     \u003c/td\u003e\n     \u003ctd\u003eRead/Write\u003c/td\u003e\n   \u003c/tr\u003e\n  \u003ctr\u003e\n     \u003ctd\u003e\u003ccode\u003edestinationTableKmsKeyName\u003c/code\u003e\n     \u003c/td\u003e\n     \u003ctd\u003eDescribes the Cloud KMS encryption key that will be used to protect the destination BigQuery\n         table. The BigQuery Service Account associated with your project requires access to this\n         encryption key. For further information about using CMEK with BigQuery see\n         [here](https://cloud.google.com/bigquery/docs/customer-managed-encryption#key_resource_id).\n         \u003cbr/\u003e\u003cb\u003eNotice:\u003c/b\u003e The table will be encrypted by the key only if it is created by the\n         connector. 
A pre-existing unencrypted table won't be encrypted just by setting this option.\n         \u003cbr/\u003e (Optional)\n     \u003c/td\u003e\n     \u003ctd\u003eWrite\u003c/td\u003e\n   \u003c/tr\u003e\n  \u003ctr\u003e\n     \u003ctd\u003e\u003ccode\u003eallowMapTypeConversion\u003c/code\u003e\n     \u003c/td\u003e\n     \u003ctd\u003eBoolean config to disable conversion from BigQuery records to Spark MapType\n          when the record has two subfields with field names as \u003ccode\u003ekey\u003c/code\u003e and \u003ccode\u003evalue\u003c/code\u003e.\n          Default value is \u003ccode\u003etrue\u003c/code\u003e which allows the conversion.\n         \u003cbr/\u003e (Optional)\n     \u003c/td\u003e\n     \u003ctd\u003eRead\u003c/td\u003e\n   \u003c/tr\u003e\n  \u003ctr\u003e\n     \u003ctd\u003e\u003ccode\u003espark.sql.sources.partitionOverwriteMode\u003c/code\u003e\n     \u003c/td\u003e\n     \u003ctd\u003eConfig to specify the overwrite mode on write when the table is range/time partitioned.\n         Two modes are currently supported: \u003ccode\u003eSTATIC\u003c/code\u003e and \u003ccode\u003eDYNAMIC\u003c/code\u003e. In \u003ccode\u003eSTATIC\u003c/code\u003e mode,\n         the entire table is overwritten. In \u003ccode\u003eDYNAMIC\u003c/code\u003e mode, the data is overwritten by partitions of the existing table.\n         The default value is \u003ccode\u003eSTATIC\u003c/code\u003e.\n         \u003cbr/\u003e (Optional)\n     \u003c/td\u003e\n     \u003ctd\u003eWrite\u003c/td\u003e\n   \u003c/tr\u003e\n  \u003ctr\u003e\n     \u003ctd\u003e\u003ccode\u003eenableReadSessionCaching\u003c/code\u003e\n     \u003c/td\u003e\n     \u003ctd\u003eBoolean config to disable read session caching. 
Caches BigQuery read sessions to allow for faster Spark query planning.\n          Default value is \u003ccode\u003etrue\u003c/code\u003e.\n         \u003cbr/\u003e (Optional)\n     \u003c/td\u003e\n     \u003ctd\u003eRead\u003c/td\u003e\n   \u003c/tr\u003e\n  \u003ctr\u003e\n     \u003ctd\u003e\u003ccode\u003ereadSessionCacheDurationMins\u003c/code\u003e\n     \u003c/td\u003e\n     \u003ctd\u003eConfig to set the read session caching duration in minutes. Only works if \u003ccode\u003eenableReadSessionCaching\u003c/code\u003e is \u003ccode\u003etrue\u003c/code\u003e (default).\n          Allows specifying the duration to cache read sessions for. Maximum allowed value is \u003ccode\u003e300\u003c/code\u003e.\n          Default value is \u003ccode\u003e5\u003c/code\u003e.\n         \u003cbr/\u003e (Optional)\n     \u003c/td\u003e\n     \u003ctd\u003eRead\u003c/td\u003e\n   \u003c/tr\u003e\n  \u003ctr\u003e\n     \u003ctd\u003e\u003ccode\u003ebigQueryJobTimeoutInMinutes\u003c/code\u003e\n     \u003c/td\u003e\n     \u003ctd\u003eConfig to set the BigQuery job timeout in minutes.\n          Default value is \u003ccode\u003e360\u003c/code\u003e minutes.\n         \u003cbr/\u003e (Optional)\n     \u003c/td\u003e\n     \u003ctd\u003eRead/Write\u003c/td\u003e\n   \u003c/tr\u003e\n  \u003ctr\u003e\n     \u003ctd\u003e\u003ccode\u003esnapshotTimeMillis\u003c/code\u003e\n     \u003c/td\u003e\n     \u003ctd\u003eA timestamp specified in milliseconds to use to read a table snapshot.\n          By default this is not set and the latest version of a table is read.\n         \u003cbr/\u003e (Optional)\n     \u003c/td\u003e\n     \u003ctd\u003eRead\u003c/td\u003e\n   \u003c/tr\u003e\n  \u003ctr\u003e\n     \u003ctd\u003e\u003ccode\u003ebigNumericDefaultPrecision\u003c/code\u003e\n     \u003c/td\u003e\n     \u003ctd\u003eAn alternative default precision for BigNumeric fields, as the BigQuery default is too wide for Spark. 
Values can be between 1 and 38.\n          This default is used only when the field has an unparameterized BigNumeric type.\n          Please note that there might be data loss if the actual data's precision is more than what is specified.\n         \u003cbr/\u003e (Optional)\n     \u003c/td\u003e\n     \u003ctd\u003eRead/Write\u003c/td\u003e\n   \u003c/tr\u003e\n  \u003ctr\u003e\n     \u003ctd\u003e\u003ccode\u003ebigNumericDefaultScale\u003c/code\u003e\n     \u003c/td\u003e\n     \u003ctd\u003eAn alternative default scale for BigNumeric fields. Values can be between 0 and 38, and less than bigNumericFieldsPrecision.\n          This default is used only when the field has an unparameterized BigNumeric type.\n          Please note that there might be data loss if the actual data's scale is more than what is specified.\n         \u003cbr/\u003e (Optional)\n     \u003c/td\u003e\n     \u003ctd\u003eRead/Write\u003c/td\u003e\n   \u003c/tr\u003e\n  \u003ctr\u003e\n     \u003ctd\u003e\u003ccode\u003ecredentialsScopes\u003c/code\u003e\n     \u003c/td\u003e\n     \u003ctd\u003eReplaces the scopes of the Google Credentials if the credentials type supports that.\n         If scope replacement is not supported then it does nothing.\n         \u003cbr/\u003eThe value should be a comma separated list of valid scopes.\n         \u003cbr/\u003e (Optional)\n     \u003c/td\u003e\n     \u003ctd\u003eRead/Write\u003c/td\u003e\n   \u003c/tr\u003e\n\u003c/table\u003e\n\nOptions can also be set outside of the code, using the `--conf` parameter of `spark-submit` or `--properties` parameter\nof the `gcloud dataproc submit spark`. 
To use this, prepend the prefix `spark.datasource.bigquery.` to any of\nthe options; for example, `spark.conf.set(\"temporaryGcsBucket\", \"some-bucket\")` can also be set as\n`--conf spark.datasource.bigquery.temporaryGcsBucket=some-bucket`.\n\n### Data types\n\nWith the exception of `DATETIME` and `TIME`, all BigQuery data types map directly into the corresponding Spark SQL data types. Here are all of the mappings:\n\n\u003c!--- TODO(#2): Convert to markdown --\u003e\n\u003ctable\u003e\n  \u003ctr valign=\"top\"\u003e\n   \u003ctd\u003e\u003cstrong\u003eBigQuery Standard SQL Data Type \u003c/strong\u003e\n   \u003c/td\u003e\n   \u003ctd\u003e\u003cstrong\u003eSpark SQL\u003c/strong\u003e\n\u003cp\u003e\n\u003cstrong\u003eData Type\u003c/strong\u003e\n   \u003c/td\u003e\n   \u003ctd\u003e\u003cstrong\u003eNotes\u003c/strong\u003e\n   \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr valign=\"top\"\u003e\n   \u003ctd\u003e\u003cstrong\u003e\u003ccode\u003eBOOL\u003c/code\u003e\u003c/strong\u003e\n   \u003c/td\u003e\n   \u003ctd\u003e\u003cstrong\u003e\u003ccode\u003eBooleanType\u003c/code\u003e\u003c/strong\u003e\n   \u003c/td\u003e\n   \u003ctd\u003e\n   \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr valign=\"top\"\u003e\n   \u003ctd\u003e\u003cstrong\u003e\u003ccode\u003eINT64\u003c/code\u003e\u003c/strong\u003e\n   \u003c/td\u003e\n   \u003ctd\u003e\u003cstrong\u003e\u003ccode\u003eLongType\u003c/code\u003e\u003c/strong\u003e\n   \u003c/td\u003e\n   \u003ctd\u003e\n   \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr valign=\"top\"\u003e\n   \u003ctd\u003e\u003cstrong\u003e\u003ccode\u003eFLOAT64\u003c/code\u003e\u003c/strong\u003e\n   \u003c/td\u003e\n   \u003ctd\u003e\u003cstrong\u003e\u003ccode\u003eDoubleType\u003c/code\u003e\u003c/strong\u003e\n   \u003c/td\u003e\n   \u003ctd\u003e\n   \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr valign=\"top\"\u003e\n   \u003ctd\u003e\u003cstrong\u003e\u003ccode\u003eNUMERIC\u003c/code\u003e\u003c/strong\u003e\n   
\u003c/td\u003e\n   \u003ctd\u003e\u003cstrong\u003e\u003ccode\u003eDecimalType\u003c/code\u003e\u003c/strong\u003e\n   \u003c/td\u003e\n   \u003ctd\u003e\n     Please refer to \u003ca href=\"#numeric-and-bignumeric-support\"\u003eNumeric and BigNumeric support\u003c/a\u003e\n   \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr valign=\"top\"\u003e\n     \u003ctd\u003e\u003cstrong\u003e\u003ccode\u003eBIGNUMERIC\u003c/code\u003e\u003c/strong\u003e\n     \u003c/td\u003e\n     \u003ctd\u003e\u003cstrong\u003e\u003ccode\u003eDecimalType\u003c/code\u003e\u003c/strong\u003e\n     \u003c/td\u003e\n     \u003ctd\u003e\n       Please refer to \u003ca href=\"#numeric-and-bignumeric-support\"\u003eNumeric and BigNumeric support\u003c/a\u003e\n     \u003c/td\u003e\n    \u003c/tr\u003e\n  \u003ctr valign=\"top\"\u003e\n   \u003ctd\u003e\u003cstrong\u003e\u003ccode\u003eSTRING\u003c/code\u003e\u003c/strong\u003e\n   \u003c/td\u003e\n   \u003ctd\u003e\u003cstrong\u003e\u003ccode\u003eStringType\u003c/code\u003e\u003c/strong\u003e\n   \u003c/td\u003e\n   \u003ctd\u003e\n   \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr valign=\"top\"\u003e\n   \u003ctd\u003e\u003cstrong\u003e\u003ccode\u003eBYTES\u003c/code\u003e\u003c/strong\u003e\n   \u003c/td\u003e\n   \u003ctd\u003e\u003cstrong\u003e\u003ccode\u003eBinaryType\u003c/code\u003e\u003c/strong\u003e\n   \u003c/td\u003e\n   \u003ctd\u003e\n   \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr valign=\"top\"\u003e\n   \u003ctd\u003e\u003cstrong\u003e\u003ccode\u003eSTRUCT\u003c/code\u003e\u003c/strong\u003e\n   \u003c/td\u003e\n   \u003ctd\u003e\u003cstrong\u003e\u003ccode\u003eStructType\u003c/code\u003e\u003c/strong\u003e\n   \u003c/td\u003e\n   \u003ctd\u003e\n   \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr valign=\"top\"\u003e\n   \u003ctd\u003e\u003cstrong\u003e\u003ccode\u003eARRAY\u003c/code\u003e\u003c/strong\u003e\n   \u003c/td\u003e\n   
\u003ctd\u003e\u003cstrong\u003e\u003ccode\u003eArrayType\u003c/code\u003e\u003c/strong\u003e\n   \u003c/td\u003e\n   \u003ctd\u003e\n   \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr valign=\"top\"\u003e\n   \u003ctd\u003e\u003cstrong\u003e\u003ccode\u003eTIMESTAMP\u003c/code\u003e\u003c/strong\u003e\n   \u003c/td\u003e\n   \u003ctd\u003e\u003cstrong\u003e\u003ccode\u003eTimestampType\u003c/code\u003e\u003c/strong\u003e\n   \u003c/td\u003e\n   \u003ctd\u003e\n   \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr valign=\"top\"\u003e\n   \u003ctd\u003e\u003cstrong\u003e\u003ccode\u003eDATE\u003c/code\u003e\u003c/strong\u003e\n   \u003c/td\u003e\n   \u003ctd\u003e\u003cstrong\u003e\u003ccode\u003eDateType\u003c/code\u003e\u003c/strong\u003e\n   \u003c/td\u003e\n   \u003ctd\u003e\n   \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr valign=\"top\"\u003e\n   \u003ctd\u003e\u003cstrong\u003e\u003ccode\u003eDATETIME\u003c/code\u003e\u003c/strong\u003e\n   \u003c/td\u003e\n   \u003ctd\u003e\u003cstrong\u003e\u003ccode\u003eStringType\u003c/code\u003e, \u003c/strong\u003e\u003cstrong\u003e\u003ccode\u003eTimestampNTZType\u003c/code\u003e*\u003c/strong\u003e\n   \u003c/td\u003e\n   \u003ctd\u003eSpark has no DATETIME type.\n    \u003cp\u003e\n    Spark string can be written to an existing BQ DATETIME column provided it is in the \u003ca href=\"https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#canonical_format_for_datetime_literals\"\u003eformat for BQ DATETIME literals\u003c/a\u003e.\n    \u003cp\u003e\n    * For Spark 3.4+, BQ DATETIME is read as Spark's TimestampNTZ type i.e. 
Java LocalDateTime\n   \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr valign=\"top\"\u003e\n   \u003ctd\u003e\u003cstrong\u003e\u003ccode\u003eTIME\u003c/code\u003e\u003c/strong\u003e\n   \u003c/td\u003e\n   \u003ctd\u003e\u003cstrong\u003e\u003ccode\u003eLongType\u003c/code\u003e, \u003c/strong\u003e\u003cstrong\u003e\u003ccode\u003eStringType\u003c/code\u003e*\u003c/strong\u003e\n   \u003c/td\u003e\n   \u003ctd\u003eSpark has no TIME type. The generated longs, which indicate \u003ca href=\"https://avro.apache.org/docs/1.8.0/spec.html#Time+%2528microsecond+precision%2529\"\u003emicroseconds since midnight\u003c/a\u003e, can be safely cast to TimestampType, but this causes the date to be inferred as the current day. Thus times are left as longs and users can cast them if they like.\n\u003cp\u003e\nWhen casting to Timestamp, TIME has the same time zone issues as DATETIME.\n\u003cp\u003e\n* Spark string can be written to an existing BQ TIME column provided it is in the \u003ca href=\"https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#canonical_format_for_time_literals\"\u003eformat for BQ TIME literals\u003c/a\u003e.\n   \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr valign=\"top\"\u003e\n   \u003ctd\u003e\u003cstrong\u003e\u003ccode\u003eJSON\u003c/code\u003e\u003c/strong\u003e\n   \u003c/td\u003e\n   \u003ctd\u003e\u003cstrong\u003e\u003ccode\u003eStringType\u003c/code\u003e\u003c/strong\u003e\n   \u003c/td\u003e\n   \u003ctd\u003eSpark has no JSON type. The values are read as String. 
In order to write JSON back to BigQuery, the following conditions are \u003cb\u003eREQUIRED\u003c/b\u003e:\n       \u003cul\u003e\n          \u003cli\u003eUse the \u003ccode\u003eINDIRECT\u003c/code\u003e write method\u003c/li\u003e\n          \u003cli\u003eUse the \u003ccode\u003eAVRO\u003c/code\u003e intermediate format\u003c/li\u003e\n          \u003cli\u003eThe DataFrame field \u003cb\u003eMUST\u003c/b\u003e be of type \u003ccode\u003eString\u003c/code\u003e and has an entry of sqlType=JSON in its metadata\u003c/li\u003e\n       \u003c/ul\u003e\n   \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr valign=\"top\" id=\"datatype:map\"\u003e\n   \u003ctd\u003e\u003cstrong\u003e\u003ccode\u003eARRAY\u0026lt;STRUCT\u0026lt;key,value\u0026gt;\u0026gt;\u003c/code\u003e\u003c/strong\u003e\n   \u003c/td\u003e\n   \u003ctd\u003e\u003cstrong\u003e\u003ccode\u003eMapType\u003c/code\u003e\u003c/strong\u003e\n   \u003c/td\u003e\n   \u003ctd\u003eBigQuery has no MAP type, therefore similar to other conversions like Apache Avro and BigQuery Load jobs, the connector converts a Spark Map to a REPEATED STRUCT\u0026lt;key,value\u0026gt;.\n       This means that while writing and reading of maps is available, running a SQL on BigQuery that uses map semantics is not supported.\n       To refer to the map's values using BigQuery SQL, please check the \u003ca href=\"https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays\"\u003eBigQuery documentation\u003c/a\u003e.\n       Due to these incompatibilities, a few restrictions apply:\n       \u003cul\u003e\n          \u003cli\u003eKeys can be Strings only\u003c/li\u003e\n          \u003cli\u003eValues can be simple types (not structs)\u003c/li\u003e\n          \u003cli\u003eFor INDIRECT write, use the \u003ccode\u003eAVRO\u003c/code\u003e intermediate format. 
DIRECT write is supported as well\u003c/li\u003e\n       \u003c/ul\u003e\n   \u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\n#### Spark ML Data Types Support\n\nThe Spark ML [Vector](https://spark.apache.org/docs/2.4.5/api/python/pyspark.ml.html#pyspark.ml.linalg.Vector) and\n[Matrix](https://spark.apache.org/docs/2.4.5/api/python/pyspark.ml.html#pyspark.ml.linalg.Matrix) are supported,\nincluding their dense and sparse versions. The data is saved as a BigQuery RECORD. Notice that a suffix is added to\nthe field's description which includes the Spark type of the field.\n\nIn order to write those types to BigQuery, use the ORC or Avro intermediate format, and have them as a column of the\nRow (i.e. not a field in a struct).\n\n#### Numeric and BigNumeric support\nBigQuery's BigNumeric has a precision of 76.76 (the 77th digit is partial) and scale of 38. Since\nthis precision and scale exceed what Spark's DecimalType supports (38 scale and 38 precision),\nBigNumeric fields with precision larger than 38 cannot be used. Once this Spark limitation is\nlifted, the connector will be updated accordingly.\n\nThe Spark Decimal/BigQuery Numeric conversion tries to preserve the parameterization of the type, i.e.\n`NUMERIC(10,2)` will be converted to `Decimal(10,2)` and vice versa. 
Notice however that there are\ncases where [the parameters are lost](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#parameterized_data_types).\nThis means that the parameters will be reverted to the defaults - NUMERIC(38,9) and BIGNUMERIC(76,38).\nAs a result, BigNumeric reads are currently supported only from a standard table, but not from\na BigQuery view or when [reading data from a BigQuery query](#reading-data-from-a-bigquery-query).\n\n### Filtering\n\nThe connector automatically computes column and pushdown filters from the DataFrame's `SELECT` statement, e.g.\n\n```\nspark.read.bigquery(\"bigquery-public-data:samples.shakespeare\")\n  .select(\"word\")\n  .where(\"word = 'Hamlet' or word = 'Claudius'\")\n  .collect()\n```\n\n\nfilters to the column `word` and pushes down the predicate filter `word = 'Hamlet' or word = 'Claudius'`.\n\nIf you do not wish to make multiple read requests to BigQuery, you can cache the DataFrame before filtering, e.g.:\n\n```\nval cachedDF = spark.read.bigquery(\"bigquery-public-data:samples.shakespeare\").cache()\nval rows = cachedDF.select(\"word\")\n  .where(\"word = 'Hamlet'\")\n  .collect()\n// All of the table was cached and this doesn't require an API call\nval otherRows = cachedDF.select(\"word_count\")\n  .where(\"word = 'Romeo'\")\n  .collect()\n```\n\nYou can also manually specify the `filter` option, which will override automatic pushdown; Spark will do the rest of the filtering in the client.\n\n### Partitioned Tables\n\nThe pseudo columns \\_PARTITIONDATE and \\_PARTITIONTIME are not part of the table schema. Therefore, in order to query by the partitions of [partitioned tables](https://cloud.google.com/bigquery/docs/partitioned-tables), do not use the where() method shown above. 
Instead, add a filter option in the following manner:\n\n```\nval df = spark.read.format(\"bigquery\")\n  .option(\"filter\", \"_PARTITIONDATE \u003e '2019-01-01'\")\n  ...\n  .load(TABLE)\n```\n\n### Configuring Partitioning\n\nBy default the connector creates one partition per 400MB in the table being read (before filtering). This should roughly correspond to the maximum number of readers supported by the BigQuery Storage API.\nThis can be configured explicitly with the \u003ccode\u003e[maxParallelism](#properties)\u003c/code\u003e property. BigQuery may limit the number of partitions based on server constraints.\n\n## Tagging BigQuery Resources\n\nIn order to support tracking the usage of BigQuery resources, the connector\noffers the following options to tag BigQuery resources:\n\n### Adding BigQuery Jobs Labels\n\nThe connector can launch BigQuery load and query jobs. Adding labels to the jobs\nis done in the following manner:\n```\nspark.conf.set(\"bigQueryJobLabel.cost_center\", \"analytics\")\nspark.conf.set(\"bigQueryJobLabel.usage\", \"nightly_etl\")\n```\nThis will create labels `cost_center`=`analytics` and `usage`=`nightly_etl`.\n\n### Adding BigQuery Storage Trace ID\n\nUsed to annotate the read and write sessions. The trace ID is of the format\n`Spark:ApplicationName:JobID`. This is an opt-in option, and to use it the user\nneeds to set the `traceApplicationName` property. JobID is auto-generated from the\nDataproc job ID, with a fallback to the Spark application ID (such as\n`application_1648082975639_0001`). The Job ID can be overridden by setting the\n`traceJobId` option. Notice that the total length of the trace ID cannot be over\n256 characters.\n\n## Using in Jupyter Notebooks\n\nThe connector can be used in [Jupyter notebooks](https://jupyter.org/) even if\nit is not installed on the Spark cluster. 
It can be added as an external jar\nusing the following code:\n\n**Python:**\n```python\nfrom pyspark.sql import SparkSession\nspark = SparkSession.builder \\\n  .config(\"spark.jars.packages\", \"com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:${next-release-tag}\") \\\n  .getOrCreate()\ndf = spark.read.format(\"bigquery\") \\\n  .load(\"dataset.table\")\n```\n\n**Scala:**\n```scala\nval spark = SparkSession.builder\n  .config(\"spark.jars.packages\", \"com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:${next-release-tag}\")\n  .getOrCreate()\nval df = spark.read.format(\"bigquery\")\n  .load(\"dataset.table\")\n```\n\nIf the Spark cluster is using Scala 2.12 (optional for Spark 2.4.x,\nmandatory in 3.0.x), the relevant package is\ncom.google.cloud.spark:spark-bigquery-with-dependencies_**2.12**:${next-release-tag}. To\nfind out which Scala version is used, run the following code:\n\n**Python:**\n```python\nspark.sparkContext._jvm.scala.util.Properties.versionString()\n```\n\n**Scala:**\n```scala\nscala.util.Properties.versionString\n```\n## Compiling against the connector\n\nUnless you wish to use the implicit Scala API `spark.read.bigquery(\"TABLE_ID\")`, there is no need to compile against the connector.\n\nTo include the connector in your project:\n\n### Maven\n\n```xml\n\u003cdependency\u003e\n  \u003cgroupId\u003ecom.google.cloud.spark\u003c/groupId\u003e\n  \u003cartifactId\u003espark-bigquery-with-dependencies_${scala.version}\u003c/artifactId\u003e\n  \u003cversion\u003e${next-release-tag}\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\n### SBT\n\n```sbt\nlibraryDependencies += \"com.google.cloud.spark\" %% \"spark-bigquery-with-dependencies\" % \"${next-release-tag}\"\n```\n\n### Connector metrics and how to view them\n\nSpark populates a lot of metrics which can be found by the end user in the Spark history page. 
These metrics are all Spark-related and are collected implicitly, without any change from the connector.\nA few additional metrics are populated from BigQuery and are currently visible in the application logs, which can be read in the driver/executor logs.\n\nFrom Spark 3.2 onwards, Spark provides an API to expose custom metrics in the Spark UI page: https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/connector/metric/CustomMetric.html\n\nCurrently, using this API, the connector exposes the following BigQuery metrics during read:\n\u003ctable id=\"metricstable\"\u003e\n\u003cstyle\u003e\ntable#metricstable td, table th\n{\nword-break:break-word\n}\n\u003c/style\u003e\n  \u003ctr valign=\"top\"\u003e\n   \u003cth style=\"min-width:240px\"\u003eMetric Name\u003c/th\u003e\n   \u003cth style=\"min-width:240px\"\u003eDescription\u003c/th\u003e\n  \u003c/tr\u003e\n  \u003ctr valign=\"top\"\u003e\n   \u003ctd\u003e\u003ccode\u003ebytes read\u003c/code\u003e\u003c/td\u003e\n   \u003ctd\u003enumber of BigQuery bytes read\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr valign=\"top\"\u003e\n   \u003ctd\u003e\u003ccode\u003erows read\u003c/code\u003e\u003c/td\u003e\n   \u003ctd\u003enumber of BigQuery rows read\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr valign=\"top\"\u003e\n   \u003ctd\u003e\u003ccode\u003escan time\u003c/code\u003e\u003c/td\u003e\n   \u003ctd\u003ethe amount of time spent between read rows response requested to obtained across all the executors, in milliseconds.\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr valign=\"top\"\u003e\n   \u003ctd\u003e\u003ccode\u003eparse time\u003c/code\u003e\u003c/td\u003e\n   \u003ctd\u003ethe amount of time spent for parsing the rows read across all the executors, in milliseconds.\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr valign=\"top\"\u003e\n   \u003ctd\u003e\u003ccode\u003espark time\u003c/code\u003e\u003c/td\u003e\n   \u003ctd\u003ethe amount of time spent in spark to process the queries 
(i.e., apart from scanning and parsing), across all the executors, in milliseconds.\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\n\n**Note:** To use the metrics in the Spark UI page, you need to make sure the `spark-bigquery-metrics-${next-release-tag}.jar` is on the class path before starting the history-server and the connector version is `spark-3.2` or above.\n\n## FAQ\n\n### What is the Pricing for the Storage API?\n\nSee the [BigQuery pricing documentation](https://cloud.google.com/bigquery/pricing#storage-api).\n\n### I have very few partitions\n\nYou can manually set the number of partitions with the `maxParallelism` property. BigQuery may provide fewer partitions than you ask for. See [Configuring Partitioning](#configuring-partitioning).\n\nYou can also always repartition after reading in Spark.\n\n### I get quota exceeded errors while writing\n\nIf there are too many partitions, the CreateWriteStream or Throughput [quotas](https://cloud.google.com/bigquery/quotas#write-api-limits)\nmay be exceeded. This occurs because while the data within each partition is processed serially, independent\npartitions may be processed in parallel on different nodes within the Spark cluster. Generally, to ensure maximum\nsustained throughput you should file a quota increase request. However, you can also manually reduce the number of\npartitions being written by calling \u003ccode\u003ecoalesce\u003c/code\u003e on the DataFrame to mitigate this problem.\n\n```\ndesiredPartitionCount = 5\ndfNew = df.coalesce(desiredPartitionCount)\ndfNew.write\n```\n\nA rule of thumb is to have a single partition handle at least 1GB of data.\n\nAlso note that a job running with the `writeAtLeastOnce` property turned on will not encounter CreateWriteStream\nquota errors.\n\n### How do I authenticate outside GCE / Dataproc?\n\nThe connector needs an instance of GoogleCredentials in order to connect to the BigQuery APIs. 
There are multiple\noptions to provide it:\n\n* The default is to load the JSON key from the `GOOGLE_APPLICATION_CREDENTIALS` environment variable, as described\n  [here](https://cloud.google.com/docs/authentication/getting-started).\n* In case the environment variable cannot be changed, the credentials file can be configured as\n  a Spark option. The file should reside on the same path on all the nodes of the cluster.\n```\n// Globally\nspark.conf.set(\"credentialsFile\", \"\u003c/path/to/key/file\u003e\")\n// Per read/Write\nspark.read.format(\"bigquery\").option(\"credentialsFile\", \"\u003c/path/to/key/file\u003e\")\n```\n* Credentials can also be provided explicitly, either as a parameter or from Spark runtime configuration.\n  They should be passed in as a base64-encoded string directly.\n```\n// Globally\nspark.conf.set(\"credentials\", \"\u003cSERVICE_ACCOUNT_JSON_IN_BASE64\u003e\")\n// Per read/Write\nspark.read.format(\"bigquery\").option(\"credentials\", \"\u003cSERVICE_ACCOUNT_JSON_IN_BASE64\u003e\")\n```\n* In cases where the user has an internal service providing the Google AccessToken, a custom implementation\n  can be done, creating only the AccessToken and providing its TTL. Token refresh will re-generate a new token. In order\n  to use this, implement the\n  [com.google.cloud.bigquery.connector.common.AccessTokenProvider](https://github.com/GoogleCloudDataproc/spark-bigquery-connector/tree/master/bigquery-connector-common/src/main/java/com/google/cloud/bigquery/connector/common/AccessTokenProvider.java)\n  interface. The fully qualified class name of the implementation should be provided in the `gcpAccessTokenProvider`\n  option. `AccessTokenProvider` must be implemented in Java or another JVM language such as Scala or Kotlin. It must\n  either have a no-arg constructor or a constructor accepting a single `java.lang.String` argument. This configuration\n  parameter can be supplied using the `gcpAccessTokenProviderConfig` option. 
If this is not provided then the no-arg\n  constructor will be called. The jar containing the implementation should be on the cluster's classpath.\n```\n// Globally\nspark.conf.set(\"gcpAccessTokenProvider\", \"com.example.ExampleAccessTokenProvider\")\n// Per read/Write\nspark.read.format(\"bigquery\").option(\"gcpAccessTokenProvider\", \"com.example.ExampleAccessTokenProvider\")\n```\n* Service account impersonation can be configured for a specific username and a group name, or\n  for all users by default using the properties below:\n\n  - `gcpImpersonationServiceAccountForUser_\u003cUSER_NAME\u003e` (not set by default)\n\n    The service account impersonation for a specific user.\n\n  - `gcpImpersonationServiceAccountForGroup_\u003cGROUP_NAME\u003e` (not set by default)\n\n    The service account impersonation for a specific group.\n\n  - `gcpImpersonationServiceAccount` (not set by default)\n\n    Default service account impersonation for all users.\n\n  If any of the above properties are set then the service account specified will be impersonated by\n  generating short-lived credentials when accessing BigQuery.\n\n  If more than one property is set then the service account associated with the username will take\n  precedence over the service account associated with the group name for a matching user and group,\n  which in turn will take precedence over default service account impersonation.\n\n* For a simpler application, where access token refresh is not required, another alternative is to pass the access token\n  as the `gcpAccessToken` configuration option. 
You can get the access token by running\n  `gcloud auth application-default print-access-token`.\n```\n// Globally\nspark.conf.set(\"gcpAccessToken\", \"\u003caccess-token\u003e\")\n// Per read/Write\nspark.read.format(\"bigquery\").option(\"gcpAccessToken\", \"\u003caccess-token\u003e\")\n```\n\n**Important:** The `CredentialsProvider` and `AccessTokenProvider` need to be implemented in Java or\nanother JVM language such as Scala or Kotlin. The jar containing the implementation should be on the cluster's classpath.\n\n**Notice:** Only one of the above options should be provided.\n\n### How do I connect to GCP/BigQuery via Proxy?\n\nTo connect to a forward proxy and to authenticate the user credentials, configure the following options.\n\n`proxyAddress`: Address of the proxy server. The proxy must be an HTTP proxy and address should be in the `host:port`\nformat.\n\n`proxyUsername`: The userName used to connect to the proxy.\n\n`proxyPassword`: The password used to connect to the proxy.\n\n```\nval df = spark.read.format(\"bigquery\")\n  .option(\"proxyAddress\", \"http://my-proxy:1234\")\n  .option(\"proxyUsername\", \"my-username\")\n  .option(\"proxyPassword\", \"my-password\")\n  .load(\"some-table\")\n```\n\nThe same proxy parameters can also be set globally using Spark's RuntimeConfig like this:\n\n```\nspark.conf.set(\"proxyAddress\", \"http://my-proxy:1234\")\nspark.conf.set(\"proxyUsername\", \"my-username\")\nspark.conf.set(\"proxyPassword\", \"my-password\")\n\nval df = spark.read.format(\"bigquery\")\n  .load(\"some-table\")\n```\n\nYou can set the following in the Hadoop configuration as well.\n\n`fs.gs.proxy.address` (similar to \"proxyAddress\"), `fs.gs.proxy.username` (similar to \"proxyUsername\") and\n`fs.gs.proxy.password` (similar to \"proxyPassword\").\n\nIf the same parameter is set in multiple places, the order of priority is as follows:\n\noption(\"key\", \"value\") \u003e spark.conf \u003e hadoop 
configuration\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FGoogleCloudDataproc%2Fspark-bigquery-connector","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FGoogleCloudDataproc%2Fspark-bigquery-connector","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FGoogleCloudDataproc%2Fspark-bigquery-connector/lists"}