{"id":18304050,"url":"https://github.com/googleclouddataproc/spark-spanner-connector","last_synced_at":"2026-03-14T05:09:26.712Z","repository":{"id":80903559,"uuid":"138911951","full_name":"GoogleCloudDataproc/spark-spanner-connector","owner":"GoogleCloudDataproc","description":"Cloud Spanner Connector for Apache Spark","archived":false,"fork":false,"pushed_at":"2025-01-08T18:00:53.000Z","size":1304,"stargazers_count":17,"open_issues_count":4,"forks_count":17,"subscribers_count":31,"default_branch":"main","last_synced_at":"2025-09-09T16:36:46.969Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/GoogleCloudDataproc.png","metadata":{"files":{"readme":"README-template.md","changelog":"CHANGES.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-06-27T17:21:07.000Z","updated_at":"2025-01-08T18:00:58.000Z","dependencies_parsed_at":"2023-10-14T21:56:59.910Z","dependency_job_id":"981125ba-c6eb-46d8-b67e-bde662ecff55","html_url":"https://github.com/GoogleCloudDataproc/spark-spanner-connector","commit_stats":null,"previous_names":["googleclouddataproc/cloud-spanner-spark-connector"],"tags_count":7,"template":false,"template_full_name":null,"purl":"pkg:github/GoogleCloudDataproc/spark-spanner-connector","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudDataproc%2Fspark-spanner-connector","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudDataproc%2Fspark-spanner-connector/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/
GitHub/repositories/GoogleCloudDataproc%2Fspark-spanner-connector/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudDataproc%2Fspark-spanner-connector/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/GoogleCloudDataproc","download_url":"https://codeload.github.com/GoogleCloudDataproc/spark-spanner-connector/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudDataproc%2Fspark-spanner-connector/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278657141,"owners_count":26023393,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-06T02:00:05.630Z","response_time":65,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-05T15:27:39.541Z","updated_at":"2026-03-14T05:09:24.009Z","avatar_url":"https://github.com/GoogleCloudDataproc.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Apache Spark SQL Connector for Google Cloud Spanner\n\nThe connector supports reading\n[Google Cloud Spanner](https://cloud.google.com/spanner) tables and\n[graphs](https://cloud.google.com/spanner/docs/graph/overview) into 
Spark\n[DataFrames](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html)\nand\n[GraphFrames](https://graphframes.github.io/graphframes/docs/_site/user-guide.html).\n\n## Unreleased Changes\n\nThis Readme may include documentation for changes that haven't been released yet.  The latest release's documentation and source code are found here.\n\nhttps://github.com/GoogleCloudDataproc/spark-spanner-connector/blob/master/README.md\n\n## Requirements\n\n### Enable the Cloud Spanner API\nFollow the [instructions](https://cloud.google.com/spanner/docs/create-query-database-console) to create a project or Spanner table if you don't have an existing one.\n\n### Create a Google Cloud Dataproc cluster (Optional)\n\nIf you do not have an Apache Spark environment you can create a Cloud Dataproc cluster with pre-configured auth. The following examples assume you are using Cloud Dataproc, but you can use `spark-submit` on any cluster.\n\nAny Dataproc cluster using the API needs the 'Spanner' or 'cloud-platform' [scopes](https://developers.google.com/identity/protocols/oauth2/scopes#spanner). Dataproc clusters don't have the 'spanner' scope by default, but you can create a cluster with the scope. For example:\n\n```\nMY_CLUSTER=...\ngcloud dataproc clusters create \"$MY_CLUSTER\" --scopes https://www.googleapis.com/auth/cloud-platform\n```\n\n### Permission\n\nIf you run a Spark job on the Dataproc cluster, you'll have to assign corresponding [Spanner permission](https://cloud.google.com/spanner/docs/iam#permissions) to the [Dataproc VM service account](https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/service-accounts#dataproc_service_accounts_2). 
If you choose to use Dataproc Serverless, you'll have to make sure the [Serverless service account](https://cloud.google.com/dataproc-serverless/docs/concepts/service-account#console) has the required permission.\n\n## Downloading and Using the Connector\n\nYou can find the released JAR files in the Releases section on the right side of the GitHub page. The file name pattern is `spark-3.1-spanner-x.x.x.jar`, where 3.1 indicates that the connector depends on Spark 3.1 and x.x.x is the Spark Spanner connector version. Alternatively, you can use `gs://spark-lib/spanner/spark-3.1-spanner-${next-release-tag}.jar` directly.\n\n### Connector to Spark Compatibility Matrix\n| Connector \\ Spark | 2.3     | 2.4\u003cbr\u003e(Scala 2.11) | 2.4\u003cbr\u003e(Scala 2.12) | 3.0     | 3.1     | 3.2     | 3.3     | 3.4     | 3.5     |\n|-------------------|---------|---------------------|---------------------|---------|---------|---------|---------|---------|---------|\n| spark-3.1-spanner |         |                     |                     |         | \u0026check; | \u0026check; | \u0026check; | \u0026check; | \u0026check; |\n| spark-3.2-spanner |         |                     |                     |         |         | \u0026check; | \u0026check; | \u0026check; | \u0026check; |\n| spark-3.3-spanner |         |                     |                     |         |         |         | \u0026check; | \u0026check; | \u0026check; |\n| spark-3.5-spanner |         |                     |                     |         |         |         |         |         | \u0026check; |\n\n### Connector to Dataproc Image Compatibility Matrix\n| Connector \\ Dataproc Image | 1.3     | 1.4     | 1.5     | 2.0     | 2.1     | 2.2     | Serverless\u003cbr\u003eImage 1.1 | Serverless\u003cbr\u003eImage 1.2 | Serverless\u003cbr\u003eImage 2.0 | Serverless\u003cbr\u003eImage 2.1 | Serverless\u003cbr\u003eImage 2.2 
|\n|----------------------------|---------|---------|---------|---------|---------|---------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|\n| spark-3.1-spanner          |         |         |         | \u0026check; | \u0026check; | \u0026check; | \u0026check;                 | Note 1                  | \u0026check;                 | \u0026check;                 | Note 1                  |\n| spark-3.2-spanner          |         |         |         | \u0026check; | \u0026check; | \u0026check; | \u0026check;                 | Note 1                  | \u0026check;                 | \u0026check;                 | Note 1                  |\n| spark-3.3-spanner          |         |         |         | \u0026check; | \u0026check; | \u0026check; | \u0026check;                 | Note 1                  | \u0026check;                 | \u0026check;                 | Note 1                  |\n| spark-3.5-spanner          |         |         |         |         |         | \u0026check; |                         | \u0026check;                 |                         |                         | \u0026check;                 |\n\nNote 1: Dataproc compatibility to be tested.\n\n### Maven / Ivy Package\n\nThe connector is published to\n[Maven Central](https://repo1.maven.org/maven2/com/google/cloud/spark/spanner/).\nYou can browse all available versions at\n[mvnrepository.com](https://mvnrepository.com/artifact/com.google.cloud.spark.spanner).\n\nUsing published packages is the recommended way to consume the connector —\nno need to build from source. 
Supply the artifact coordinates via the\n`--packages` option or the `spark.jars.packages` configuration property:\n\n| version   | Connector Artifact                                                     |\n|-----------|------------------------------------------------------------------------|\n| Spark 3.5 | `com.google.cloud.spark.spanner:spark-3.5-spanner:${next-release-tag}` |\n| Spark 3.3 | `com.google.cloud.spark.spanner:spark-3.3-spanner:${next-release-tag}` |\n| Spark 3.2 | `com.google.cloud.spark.spanner:spark-3.2-spanner:${next-release-tag}` |\n| Spark 3.1 | `com.google.cloud.spark.spanner:spark-3.1-spanner:${next-release-tag}` |\n\nFor example, to start a PySpark shell with the connector:\n\n```shell\npyspark --packages com.google.cloud.spark.spanner:spark-3.5-spanner:${next-release-tag}\n```\n\nOr in a `spark-submit` job:\n\n```shell\nspark-submit --packages com.google.cloud.spark.spanner:spark-3.5-spanner:${next-release-tag} \\\n    my_job.py\n```\n\nYou can also set it programmatically when creating a `SparkSession`:\n\n```python\nspark = (SparkSession.builder\n         .config(\"spark.jars.packages\",\n                 \"com.google.cloud.spark.spanner:spark-3.5-spanner:${next-release-tag}\")\n         .getOrCreate())\n```\n\n### Specifying the Spark Spanner connector version in a Dataproc cluster\n\nYou can use the standard `--packages` or `--jars` (or alternatively, the `spark.jars.packages`/`spark.jars` configuration) to specify the Spark Spanner connector.\n\nUsing Maven coordinates (recommended):\n\n```shell\ngcloud dataproc jobs submit pyspark --cluster \"$MY_CLUSTER\" \\\n    --packages=com.google.cloud.spark.spanner:spark-3.5-spanner:${next-release-tag} \\\n    --region us-central1 examples/SpannerSpark.py\n```\n\nUsing a JAR from Google Cloud Storage:\n\n```shell\ngcloud dataproc jobs submit pyspark --cluster \"$MY_CLUSTER\" \\\n    --jars=gs://spark-lib/spanner/spark-3.5-spanner-${next-release-tag}.jar \\\n    --region us-central1 
examples/SpannerSpark.py\n```\n\n## Usage\n\nThe connector supports exporting both tables and graphs from Spanner, and importing to Spanner.\nIt uses the cross-language\n[Spark SQL Data Source API](https://spark.apache.org/docs/latest/sql-data-sources.html)\nto communicate with the\n[Spanner Java library](https://github.com/googleapis/java-spanner).\n\n### Exporting Spanner Tables\nThis is an example of using Python code to read a Spanner table. You can find more examples and documentation in the [DataFrame API reference](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.html).\n\n```python\nfrom pyspark.sql import SparkSession\n\nspark = SparkSession.builder.appName('Spanner Connect App').getOrCreate()\ndf = spark.read.format('cloud-spanner') \\\n   .option(\"projectId\", \"$YourProjectId\") \\\n   .option(\"instanceId\", \"$YourInstanceId\") \\\n   .option(\"databaseId\", \"$YourDatabaseId\") \\\n   .option(\"table\", \"$YourTable\") \\\n   .load()\ndf.show()\n```\n\nFor other languages, refer to the API documentation for\n[Scala](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html),\n[Java](https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/Dataset.html),\nand\n[R](https://spark.apache.org/docs/latest/api/R/reference/SparkDataFrame.html).\nTo submit a job in those languages, see the gcloud references for\n[Scala, Java](https://cloud.google.com/sdk/gcloud/reference/dataproc/jobs/submit/spark),\nand\n[R](https://cloud.google.com/sdk/gcloud/reference/dataproc/jobs/submit/spark-r).\n\n#### Table Connector Options\n\nHere are the options supported in the Spark Spanner connector for reading\ntables.\n\nVariable|Validation|Comments\n---|---|---\nprojectId|String|The projectID containing the Cloud Spanner database\ninstanceId|String|The instanceID of the Cloud Spanner database\ndatabaseId|String|The databaseID of the Cloud Spanner database\ntable|String|The table of the Cloud Spanner 
database that you are reading from\nenableDataboost|Boolean|Enable the [Data Boost](https://cloud.google.com/spanner/docs/databoost/databoost-overview), which provides independent compute resources to query Spanner with near-zero impact to existing workloads. Note the option may trigger [extra charge](https://cloud.google.com/spanner/pricing#spanner-data-boost-pricing).\nemulatorHost|String|The host and port of the Spanner emulator (e.g. `localhost:9010`). When set, the connector connects to the emulator instead of Cloud Spanner. Useful for local development and testing.\n\n### Writing to Spanner Tables\n\nHere is an example of using Python to write to a Spanner table.\n```python\nfrom pyspark.sql import SparkSession\n\nspark = SparkSession.builder.appName('Spanner Write App').getOrCreate()\n\ncolumns = ['id', 'name', 'email']\ndata = [(1, 'John Doe', 'john.doe@example.com'), (2, 'Jane Doe', 'jane.doe@example.com')]\ndf = spark.createDataFrame(data, columns)\n\ndf.write.format('cloud-spanner') \\\n   .option(\"projectId\", \"$YourProjectId\") \\\n   .option(\"instanceId\", \"$YourInstanceId\") \\\n   .option(\"databaseId\", \"$YourDatabaseId\") \\\n   .option(\"table\", \"$YourTable\") \\\n   .mode(\"append\") \\\n   .save()\n```\n\n#### Save Modes\n\nThe connector supports the following Spark save modes:\n\nSave Mode|Behavior\n---|---\n`Append`|Inserts rows into the existing Spanner table (default).\n`Overwrite`|Clears the existing data before writing. The behavior can be modified by the `overwriteMode` option (see below).\n`ErrorIfExists`|Creates a new table and writes data. Fails if the table already exists. Requires [Spark Catalog support](#spark-catalog-support).\n`Ignore`|Creates the table and writes data only if the table does not already exist. A no-op if the table exists. Requires [Spark Catalog support](#spark-catalog-support).\n\n\u003e **Note:** Writing a DataFrame to Spanner from Spark is not a single atomic operation. 
The connector splits large DataFrames into multiple transactions based on the `bytesPerTransaction` and `mutationsPerTransaction` limits.\n\u003e\n\u003e Similarly, actions in `Overwrite` are not atomic either - truncate or recreate actions cannot be undone if subsequent write fails.\n\n#### Write Connector Options\n\nThese are the options supported in the Spark Spanner connector for writing\ntables.\n\nVariable| Validation |Comments\n---|------------|---\nprojectId| String     |The projectID containing the Cloud Spanner database\ninstanceId| String     |The instanceID of the Cloud Spanner database\ndatabaseId| String     |The databaseID of the Cloud Spanner database\ntable| String     |The name of the destination Cloud Spanner table\nmutationsPerTransaction| Integer    |The number of mutations to send in a single transaction. Default: 1000\nbytesPerTransaction | Long       |Maximum size of each transaction. Default: 1048576 (1MB)\nnumWriteThreads| Integer    |The number of threads to use for writing per Spark worker.  Default: 8\nassumeIdempotentRows| Boolean    |When `true`, the connector uses a higher-throughput 'at-least-once' write mode. See [Spanner documentation](https://docs.cloud.google.com/spanner/docs/batch-write) for use cases and limitations. Default: `false`\nmaxPendingTransactions| Integer    |The maximum number of concurrent batches that can be in-flight. This is used to control backpressure. Default: 20\nmutationType| String     |The row write mode used. Valid values are: insert, insert_or_update, replace, update. Default: insert_or_update\noverwriteMode| String     |Controls behavior when using `mode(\"overwrite\")`. `truncate` (default) deletes all rows but keeps the table schema. `recreate` drops and recreates the table from the DataFrame schema.\nenablePartialRowUpdates| Boolean    |When `true`, the connector uses the DataFrame schema instead of the Spanner table schema, allowing writes with a subset of columns. 
Requires `mutationType` set to `update` or `insert_or_update`. Default: `false`\nemulatorHost| String     |The host and port of the Spanner emulator (e.g. `localhost:9010`). When set, the connector connects to the emulator instead of Cloud Spanner. Useful for local development and testing.\n\n`mutationsPerTransaction` and `bytesPerTransaction` are both used when building a transaction to send to spanner.\n\n\n#### Data Types\nThe connector supports writing the following Spark data types to Spanner.\n\n##### GoogleSQL\nSpark Data Type|Spanner GoogleSql Type\n---|---\n`LongType`|`INT64`\n`StringType`|`STRING`\n`BooleanType`|`BOOL`\n`DoubleType`|`FLOAT64`\n`BinaryType`|`BYTES`\n`TimestampType`|`TIMESTAMP`\n`DateType`|`DATE`\n`DecimalType`|`NUMERIC`\n`ArrayType(ElementType)`|`ARRAY\u003cElementType\u003e`\n\nSpark Array Element Type|Spanner GoogleSql Array Type\n---|---\n`LongType`|`ARRAY\u003cINT64\u003e`\n`StringType`|`ARRAY\u003cSTRING\u003e`\n`BooleanType`|`ARRAY\u003cBOOL\u003e`\n`DoubleType`|`ARRAY\u003cFLOAT64\u003e`\n`BinaryType`|`ARRAY\u003cBYTES\u003e`\n`TimestampType`|`ARRAY\u003cTIMESTAMP\u003e`\n`DateType`|`ARRAY\u003cDATE\u003e`\n`DecimalType`|`ARRAY\u003cNUMERIC\u003e`\n\n##### PostgreSQL\nSpark Data Type|Spanner PostgreSql Type\n---|---\n`LongType`|`bigint`/`int8`\n`StringType`|`varchar`/`text`/`character varying`\n`BooleanType`|`bool`/`boolean`\n`DoubleType`|`double precision`/`float8`\n`BinaryType`|`bytea`\n`TimestampType`|`timestamptz`/`timestamp with time zone`\n`DateType`|`date`\n`DecimalType`|`numeric`/`decimal`\n`ArrayType(ElementType)`|`ElementType[]`\n\nSpark Array Element Type|Spanner PostgreSql Array Type\n---|---\n`LongType`|`bigint[]`/`int8[]`\n`StringType`|`varchar[]`/`text[]`/`character varying[]`\n`BooleanType`|`bool[]`/`boolean[]`\n`DoubleType`|`double precision[]`/`float8[]`\n`BinaryType`|`bytea[]`\n`TimestampType`|`timestamptz[]`/`timestamp with time 
zone[]`\n`DateType`|`date[]`\n`DecimalType`|`numeric[]`/`decimal[]`\n`StructType`|`JSON`\n\u003e Pre-existing Google Spanner limitations apply. Specifically:\n\u003e - Column value size is limited to 10MB,\n\u003e - In GoogleSQL, `NUMERIC` type is limited to 9 digits of scale, Spark supports up to 38.\n\n### Spark Catalog Support \u003ca id=\"spark-catalog-support\"\u003e\u003c/a\u003e\n\nThe connector implements the Spark\n[TableCatalog](https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/connector/catalog/TableCatalog.html)\ninterface, allowing you to manage Spanner tables using Spark SQL DDL statements\nsuch as `CREATE TABLE`, `DROP TABLE`, `INSERT INTO`, and `SELECT`.\n\n#### Configuring the Catalog\n\nRegister the Spanner catalog in your Spark session configuration:\n\n```python\nfrom pyspark.sql import SparkSession\n\nspark = (SparkSession.builder\n         .appName(\"Spanner Catalog App\")\n         .config(\"spark.sql.catalog.spanner\",\n                 \"com.google.cloud.spark.spanner.SpannerCatalog\")\n         .config(\"spark.sql.catalog.spanner.projectId\", \"\u003cPROJECT_ID\u003e\")\n         .config(\"spark.sql.catalog.spanner.instanceId\", \"\u003cSPANNER_INSTANCE_ID\u003e\")\n         .config(\"spark.sql.catalog.spanner.databaseId\", \"\u003cSPANNER_DATABASE_ID\u003e\")\n         .getOrCreate())\n```\n\nOn Dataproc, you can pass these as cluster properties:\n\n```shell\ngcloud dataproc clusters create \"$MY_CLUSTER\" \\\n    --properties \"spark:spark.sql.catalog.spanner=com.google.cloud.spark.spanner.SpannerCatalog,spark:spark.sql.catalog.spanner.projectId=\u003cPROJECT_ID\u003e,spark:spark.sql.catalog.spanner.instanceId=\u003cSPANNER_INSTANCE_ID\u003e,spark:spark.sql.catalog.spanner.databaseId=\u003cSPANNER_DATABASE_ID\u003e\"\n```\n\n#### Creating Tables\n\nUse `CREATE TABLE` with the `USING` clause and specify primary keys via\n`TBLPROPERTIES`:\n\n```sql\nCREATE TABLE spanner.my_table (\n    id BIGINT NOT NULL,\n    name 
STRING,\n    score DOUBLE\n) USING `cloud-spanner`\nTBLPROPERTIES('primaryKeys' = 'id')\n```\n\nFor composite primary keys, provide a comma-separated list:\n\n```sql\nTBLPROPERTIES('primaryKeys' = 'id, name')\n```\n\nUse `CREATE TABLE IF NOT EXISTS` to skip creation when the table already exists\n(Ignore save mode).\n\n#### Inserting Data\n\n```sql\nINSERT INTO spanner.my_table VALUES (1, 'Alice', 95.5)\n```\n\n#### Querying Data\n\n```sql\nSELECT * FROM spanner.my_table WHERE score \u003e 90\n```\n\n#### Dropping Tables\n\n```sql\nDROP TABLE spanner.my_table\n```\n\n#### Using the DataFrame API with the Catalog\n\nYou can also use the DataFrame `writeTo` API for `ErrorIfExists` semantics:\n\n```python\ndf.writeTo(\"spanner.my_table\").tableProperty(\"primaryKeys\", \"id\").create()\n```\n\n### Exporting Spanner Graphs\n\nTo export [Spanner Graphs](https://cloud.google.com/spanner/docs/graph/overview),\nplease use the Python class `SpannerGraphConnector` included in the jar.\n\nThe connector supports exporting the graph into separate node and edge\nDataFrames, and exporting the graph into\n[GraphFrames](https://graphframes.github.io/graphframes/docs/_site/user-guide.html)\ndirectly.\n\nThis is an example of exporting a graph from Spanner as a GraphFrame:\n\n```python\nfrom pyspark.sql import SparkSession\n\nspark = (SparkSession.builder.appName(\"spanner-graphframe-graphx-example\")\n         .config(\"spark.jars.packages\", \"graphframes:graphframes:0.8.4-spark3.5-s_2.12\")\n         .config(\"spark.jars\", path_to_connector_jar)\n         .getOrCreate())\n\nspark.sparkContext.addPyFile(path_to_connector_jar)\nfrom spannergraph import SpannerGraphConnector\n\nconnector = (SpannerGraphConnector()\n             .spark(spark)\n             .project(\"$YourProjectId\")\n             .instance(\"$YourInstanceId\")\n             .database(\"$YourDatabaseId\")\n             .graph(\"$YourGraphId\"))\n\ng = 
connector.load_graph()\ng.vertices.show()\ng.edges.show()\n```\n\nTo export node and edge DataFrames rather than a GraphFrame, use\n`load_dfs`:\n\n```python\ndf_vertices, df_edges, df_id_map = connector.load_dfs()\n```\n\n#### Node ID Mapping\n\nWhile Spanner Graph allows nodes to be identified with more than one element\nkey, many libraries for processing graphs, including GraphFrames, expect only\none ID field, ideally an integer.\n\nWhen node IDs are not integers, the connector by default assigns a unique\ninteger ID to each row in node tables and maps node keys in edge tables to\ninteger IDs with DataFrame joins. Use `load_graph_and_mapping` or `load_dfs`\nto retrieve the mapping when loading a graph:\n\n```python\ng, df_id_map = connector.load_graph_and_mapping()\n```\n\nor\n\n```python\ndf_vertices, df_edges, df_id_map = connector.load_dfs()\n```\n\nIf you do not want the connector to perform this mapping, specify\n`.export_string_ids(True)` to have the connector output string concatenations of\ntable IDs (generated by the connector based on the graph schema) and element\nkeys directly. The format of the concatenated strings is\n`{table_id}@{key_1}|{key_2}|{key_3}|...`, where the element keys are joined with\n`|` as the separator and `\\ ` is used as the escape character. 
For example, the\nstring ID of a node with table ID `1` and keys `(a, b|b, c\\c)` will be\n`1@a|b\\|b|c\\\\c`.\n\n#### Graph Connector Options\n\nHere is a summary of the options supported by the graph connector.\nPlease refer to the API documentation of\n[`SpannerGraphConnector`](python/spannergraph/_connector.py) for details.\n\n##### Required\n\n| Option                  | Summary of Purpose                                                                                                    |\n|-------------------------|-----------------------------------------------------------------------------------------------------------------------|\n| spark                   | The spark session to read graph to                                                                                    |\n| project                 | ID of the Google Cloud project containing the graph                                                                   |\n| instance                | ID of the Spanner instance containing the graph                                                                       |\n| database                | ID of the Spanner database containing the graph                                                                       |\n| graph                   | Name of the graph as defined in the database schema                                                                   |\n\n##### Optional\n\n| Option                  | Summary of Purpose                                                                                                    | Default                                            |\n|-------------------------|-----------------------------------------------------------------------------------------------------------------------|----------------------------------------------------|\n| data_boost              | Enable [Data Boost](https://cloud.google.com/spanner/docs/databoost/databoost-overview)                               | Disabled                          
                 |\n| partition_size_bytes    | The [partitionSizeBytes](https://cloud.google.com/spanner/docs/reference/rest/v1/PartitionOptions) hint for Spanner   | No hint provided                                   |\n| repartition             | Enable repartitioning of node and edge DataFrames and set the target number of partitions                             | No repartitioning                                  |\n| read_timestamp          | The timestamp of the snapshot to read from                                                                            | Read the snapshot at the time when load is called  |\n| symmetrize_graph        | Symmetrizes the output graph by adding reverse edges                                                                  | No symmetrization                                  |\n| export_string_ids       | Output string concatenations of the element keys instead of assigning integer IDs and performing joins                | Output integer IDs                                 |\n| node_label / edge_label | Specify label element filters, additional properties to fetch, and element-wise property filters (details below)      | Export all nodes and edges and no element property |\n| node_query / edge_query | Overwrite the queries used to fetch nodes and edges (details below)                                                   | Use queries generated by the connector             |\n\n#### Filters and Element Properties\n\nYou can choose to include only graph elements with specific labels by providing\n`node_label` and/or `edge_label` options. `node_label` and `edge_label` can also\nbe used to specify element properties to include in the output and additional\nelement-wise filters (i.e., WHERE clauses). 
The columns for the returned\nproperties will be prefixed with \"property_\" to avoid naming conflicts (e.g.,\nwhen fetching a property named \"id\").\n\nTo fetch additional properties or specify an element-wise filter without\nperforming any filtering by label, please use `\"*\"` to match any label. Other\nlabel filters of the same type (node/edge) cannot be used if a `\"*\"` label\nfilter is specified for that type.\n\nThis example fetches all nodes with their \"name\" property, all \"KNOWS\" edges\nwith their \"SingerId\" and \"FriendId\" properties, and all \"CREATES_MUSIC\" edges\nwith a release date after 1900-01-01:\n\n```python\nconnector = (connector\n             .node_label(\"*\", properties=[\"name\"])\n             .edge_label(\"KNOWS\", properties=[\"SingerId\", \"FriendId\"])\n             .edge_label(\"CREATES_MUSIC\", where=\"release_date \u003e '1900-01-01'\"))\n```\n\n#### Direct Queries\n\nIn addition to letting the connector generate queries to read nodes and edges\nfrom Spanner, you can provide your own GQL queries with `node_query` and\n`edge_query` to fetch the node and edge tables, with some restrictions:\n\n- The queries must be\n  [root-partitionable](https://cloud.google.com/spanner/docs/reads#read_data_in_parallel).\n- The output columns must meet the following conditions:\n    - A column in the node DataFrame is named \"id\".\n      This column will be used to identify nodes.\n    - A column in the edge DataFrame is named \"src\".\n      This column will be used to identify source nodes.\n    - A column in the edge DataFrame is named \"dst\".\n      This column will be used to identify destination nodes.\n\nThis example provides custom GQL queries to fetch the node and edge tables of\nthe graph:\n\n```python\nnode_query = \"SELECT * FROM GRAPH_TABLE \" \\\n             \"(MusicGraph MATCH (n:SINGER) RETURN n.id AS id)\"\nedge_query = \"SELECT * FROM GRAPH_TABLE \" \\\n             \"(MusicGraph MATCH -[e:KNOWS]-\u003e \" \\\n           
  \"RETURN e.SingerId AS src, e.FriendId AS dst)\"\nconnector = (connector\n             .node_query(node_query)\n             .edge_query(edge_query))\n```\n\n#### Source and Destination Key Limitation\n\nCurrently, the graph connector expects source_key and destination_key of an Edge\nto match the node_element_key of the referenced source and destination Node\nrespectively\n([Element Definition](https://cloud.google.com/spanner/docs/reference/standard-sql/graph-schema-statements#element_definition)).\nFor example, if an edge table *E* references a node table *N* as source nodes,\nand *N* has a 2-part compound [node_c1, node_c2] as its node_element_key, the\nsource_key of *E* must also be a 2-part compound [edge_c1, edge_c2]. A partial\nmatch, e.g. source_key = [edge_c1], can logically form a hypergraph and is not\nsupported.\n\n### Data Types\n\nHere are the mappings for supported Spanner data types.\n\nSpanner GoogleSql Type|Spark Data Type|Notes\n---|---|---\nARRAY    |ArrayType    | Nested ARRAY is not supported, e.g. ARRAY\u003cARRAY\u003cBOOL\u003e\u003e.\nBOOL     |BooleanType  |\nBYTES    |BinaryType   |\nDATE     |DateType     | The date range is [1700-01-01, 9999-12-31].\nFLOAT64  |DoubleType   |\nINT64    |LongType     | The supported integer range is [-9,223,372,036,854,775,808, 9,223,372,036,854,775,807]\nJSON     |StringType   | Spark has no JSON type. The values are read as String.\nNUMERIC  |DecimalType  | The NUMERIC will be converted to DecimalType with 38 precision and 9 scale, which is the same as the Spanner definition.\nSTRING   |StringType   |\nTIMESTAMP|TimestampType| Only microseconds will be converted to Spark timestamp type. 
The timestamp range is [0001-01-01 00:00:00, 9999-12-31 23:59:59.999999].\n\n### Filter Pushdown\n\nThe connector automatically pushes down the column projection and filters derived from the DataFrame's `SELECT` statement. For example,\n\n```python\n(df.select(\"word\")\n   .where(\"word = 'Hamlet' or word = 'Claudius'\")\n   .collect())\n```\n\nreads only the column `word` and pushes down the predicate filter `word = 'Hamlet' or word = 'Claudius'`. Note that filters containing an ArrayType column are not pushed down.\n\nFilter pushdown is currently not supported when exporting graphs.\n\n### Monitoring\n\nWhen Data Boost is enabled, its usage can be monitored with Cloud Monitoring. This [page](https://cloud.google.com/spanner/docs/databoost/databoost-monitor#use_to_track_usage) explains how to do that step by step. However, the usage cannot be grouped by Spark job ID.\n\n### Debugging\n\nThe Dataproc [web interfaces](https://cloud.google.com/dataproc/docs/concepts/accessing/cluster-web-interfaces) can be used for debugging, especially for performance tuning. The `YARN Application Timeline` page displays the execution timeline details for the executors and other functions. You can assign more workers if many tasks are assigned to the same executor.\n\n### Root-partitionable Query\n\nWhen Data Boost is enabled, all queries sent to Cloud Spanner must be root-partitionable. Please see [`Read data in parallel`](https://cloud.google.com/spanner/docs/reads#read_data_in_parallel) for more details. 
If you encounter an issue related to partitioning when using this connector, it is likely that the table being read is not supported.\n\n### PostgreSQL\n\nThe connector supports Spanner [PostgreSQL interface-enabled databases](https://cloud.google.com/spanner/docs/postgresql-interface#postgresql-components).\n\n#### Data types\n\nSpanner PostgreSql Type|Spark Data Type|Notes\n---|---|---\narray                                |ArrayType    | Nested array is not supported.\nbool / boolean                       |BooleanType  |\nbytea                                |BinaryType   |\ndate                                 |DateType     | The date range is [1700-01-01, 9999-12-31].\ndouble precision / float8            |DoubleType   |\nint8 / bigint                        |LongType     | The supported integer range is [-9,223,372,036,854,775,808, 9,223,372,036,854,775,807].\njsonb                                |StringType   | Spark has no JSON type. The values are read as String.\nnumeric / decimal                    |DecimalType  | NUMERIC is converted to DecimalType with precision 38 and scale 9, which is the same as the Spanner definition.\nvarchar / text / character varying   |StringType   |\ntimestamptz/timestamp with time zone |TimestampType| Timestamps are converted with microsecond precision. The timestamp range is [0001-01-01 00:00:00, 9999-12-31 23:59:59.999999].\n\n#### Filter Pushdown\n\nSince jsonb is converted to StringType in Spark, a filter containing a jsonb column can only be pushed down as a string filter. 
For a jsonb column, the `IN` filter is not pushed down to Cloud Spanner.\n\nFilters containing an array column are not pushed down.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogleclouddataproc%2Fspark-spanner-connector","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgoogleclouddataproc%2Fspark-spanner-connector","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogleclouddataproc%2Fspark-spanner-connector/lists"}