{"id":23443548,"url":"https://github.com/neo4j-field/bigquery-connector","last_synced_at":"2025-04-13T12:13:55.554Z","repository":{"id":106299885,"uuid":"605667565","full_name":"neo4j-field/bigquery-connector","owner":"neo4j-field","description":"Bi-directional connectivity between Google BigQuery and Neo4j AuraDS","archived":false,"fork":false,"pushed_at":"2025-01-16T17:10:36.000Z","size":159,"stargazers_count":3,"open_issues_count":0,"forks_count":4,"subscribers_count":17,"default_branch":"main","last_synced_at":"2025-04-13T12:13:49.524Z","etag":null,"topics":["arrow-flight","bigquery","neo4j","protobuf","python","spark"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/neo4j-field.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-02-23T16:39:42.000Z","updated_at":"2025-01-16T17:10:38.000Z","dependencies_parsed_at":"2023-12-18T18:08:27.322Z","dependency_job_id":"b59d94a4-2350-4cdf-8f32-f746de63270a","html_url":"https://github.com/neo4j-field/bigquery-connector","commit_stats":null,"previous_names":[],"tags_count":9,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/neo4j-field%2Fbigquery-connector","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/neo4j-field%2Fbigquery-connector/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/neo4j-field%2Fbigquery-connector/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/Gi
tHub/repositories/neo4j-field%2Fbigquery-connector/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/neo4j-field","download_url":"https://codeload.github.com/neo4j-field/bigquery-connector/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248710448,"owners_count":21149191,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["arrow-flight","bigquery","neo4j","protobuf","python","spark"],"created_at":"2024-12-23T18:19:42.422Z","updated_at":"2025-04-13T12:13:55.531Z","avatar_url":"https://github.com/neo4j-field.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Neo4j BigQuery Connector\n\nThis project is a prototype of a Dataproc template to power a BigQuery\nstored procedure for Apache Spark (aka Dataproc Serverless).\n\nIt allows for bidirectional data loading between Neo4j and BigQuery.\n\n## Building\n\nThe code is packaged into a Docker image that gets deployed by\nDataproc onto the Spark environment.\n\nTo build (from the repository root, which is used as the build context):\n\n```\n$ docker build -t \"europe-west2-docker.pkg.dev/your-gcp-project/connectors/neo4j-bigquery-connector:0.6.1\" .\n```\n\nThen push to Google Artifact Registry:\n\n```\n$ docker push \"europe-west2-docker.pkg.dev/your-gcp-project/connectors/neo4j-bigquery-connector:0.6.1\"\n```\n\n\u003e Note: you will need to enable your local gcloud tooling to help\n\u003e authenticate. 
Try running: `gcloud auth configure-docker europe-west2-docker.pkg.dev`\n\n## Running\n\nThe template has been tested with AuraDS as well as self-managed GDS with Neo4j\nv5 Enterprise.\n\n### Network Prerequisites\n\nIn either case, you most likely need to configure a GCP network to use\n[Private Google Access](https://cloud.google.com/vpc/docs/private-google-access)\nand possibly Cloud NAT. (Cloud NAT is definitely needed for AuraDS.)\n\n### Running Locally\n\nThis project uses [poetry](https://python-poetry.org/) as the build tool.\nInstall `poetry`, define your environment with `poetry env use`, and invoke `poetry install` to install dependencies.\n\nTo build:\n\n```\npoetry build\n```\n\n\u003e Note: You may also need to install a Java 11/17 JRE and make sure\n\u003e `JAVA_HOME` is set.\n\nThen invoke one of the `main*.py` entrypoint scripts using the command\nline arguments supported by the template. For example:\n\nFor BigQuery to GDS/AuraDS data movement:\n\n```\n$ poetry run python src/main.py --graph_name=mag240 --neo4j_db=neo4j \\\n    --neo4j_action=\"create_graph\" \\\n    --neo4j_secret=\"projects/1055617507124/secrets/neo4j-bigquery-demo-2/versions/2\" \\\n    --graph_uri=\"gcs://my-storage/graph-model.json\" \\\n    --bq_project=neo4j-se-team-201905 --bq_dataset=bqr_neo4j_demo \\\n    --node_tables=AUTHOR,PAPER --edge_tables=PUBLISHED\n```\n\nFor GDS/AuraDS to BigQuery data movement:\n\n```\n$ poetry run python src/main_to_bq.py --graph_name=mag240 --neo4j_db=neo4j \\\n    --neo4j_secret=\"projects/1055617507124/secrets/neo4j-bigquery-demo-2/versions/2\" \\\n    --bq_project=neo4j-se-team-201905 --bq_dataset=bqr_neo4j_demo --bq_node_table=results_nodes \\\n    --bq_edge_table=results_edges \\\n    --neo4j_patterns=\"(:Paper{flag,years}),[:PUBLISHED{year}],(:Author{id})\"\n```\n\n### Submitting a Dataproc Serverless Job\n\nIf you're looking to just use the Dataproc capabilities or looking to\ndo some quick testing, you can submit a batch job directly to\nDataproc.\n\nUsing the `gcloud` tooling, use a 
shell script like:\n\n```sh\n#!/bin/sh\n# Use fewer, larger executors\nSPARK_EXE_CORES=8\nSPARK_EXE_MEMORY=16g\nSPARK_EXE_COUNT=2\nPROPERTIES=\"spark.executor.cores=${SPARK_EXE_CORES}\"\nPROPERTIES=\"${PROPERTIES},spark.executor.memory=${SPARK_EXE_MEMORY}\"\nPROPERTIES=\"${PROPERTIES},spark.dynamicAllocation.initialExecutors=${SPARK_EXE_COUNT}\"\nPROPERTIES=\"${PROPERTIES},spark.dynamicAllocation.minExecutors=${SPARK_EXE_COUNT}\"\n\ngcloud dataproc batches submit pyspark \\\n    --region=\"europe-west1\" \\\n    --version=\"2.1\" \\\n    --deps-bucket=\"gs://your-bucket\" \\\n    --container-image=\"europe-west2-docker.pkg.dev/your-gcp-project/connectors/neo4j-bigquery-connector:0.6.1\" \\\n    --properties=\"${PROPERTIES}\" \\\n    main.py -- \\\n    --graph_name=mag240 \\\n    --graph_uri=\"gs://your-bucket/folder/model.json\" \\\n    --neo4j_database=neo4j \\\n    --neo4j_secret=\"projects/123456/secrets/neo4j-bigquery/versions/1\" \\\n    --neo4j_action=\"create_graph\" \\\n    --bq_project=\"your-gcp-project\" \\\n    --bq_dataset=\"your_bq_dataset\" \\\n    --node_tables=\"papers,authors,institution\" \\\n    --edge_tables=\"citations,authorship,affiliation\"\n```\n\nThe key parts to note:\n\n1. The arguments _before_ `main.py` are specific to the PySpark job.\n2. The arguments _after_ the `main.py --` are specific to the Dataproc\n   template.\n\nCustomize (1) for your GCP environment and (2) for your AuraDS and\nBigQuery environments as needed.\n\n\u003e Note: you can put configuration values in a JSON document stored in\n\u003e a Google Secret Manager secret (that's a mouthful). Use the\n\u003e `--neo4j_secret` parameter to pass in the full resource id (which\n\u003e should include the secret version number).\n\n## Configuring a Google BigQuery Stored Procedure\n\nIn short, you'll want to familiarize yourself with the [Stored\nprocedures for Apache\nSpark](https://cloud.google.com/bigquery/docs/spark-procedures)\ndocumentation. 
Assuming you've got your environment properly\nconfigured and enrolled in the preview program to use Spark for stored\nprocedures, you need to create your stored procedure.\n\n### Creating the BigQuery --\u003e Neo4j Procedure\n\n```\nCREATE OR REPLACE PROCEDURE\n  `your-gcp-project.your_bigquery_dataset.neo4j_gds_graph_project`(graph_name STRING,\n    graph_uri STRING,\n    neo4j_secret STRING,\n    bq_project STRING,\n    bq_dataset STRING,\n    node_tables ARRAY\u003cSTRING\u003e,\n    edge_tables ARRAY\u003cSTRING\u003e)\nWITH CONNECTION `your-gcp-project.eu.spark-connection` OPTIONS (engine='SPARK',\n    runtime_version='2.1',\n    container_image='europe-west2-docker.pkg.dev/your-gcp-project/connectors/neo4j-bigquery-connector:0.6.1',\n    properties=[],\n    description=\"Project a graph from BigQuery into Neo4j AuraDS or GDS.\")\n  LANGUAGE python AS R\"\"\"\nfrom pyspark.sql import SparkSession\nfrom templates import BigQueryToNeo4jGDSTemplate\n\nspark = (\n      SparkSession\n      .builder\n      .appName(\"Neo4j BigQuery Connector\")\n      .getOrCreate()\n)\n\ntemplate = BigQueryToNeo4jGDSTemplate()\nargs = template.parse_args([\"--neo4j_action=create_graph\"])\ntemplate.run(spark, args)\n\"\"\";\n```\n\nSome details on the inputs:\n\n- `graph_name` -- the resulting name of the graph projection in AuraDS\n- `graph_uri` -- the GCS URI pointing to a JSON file describing the\n  [graph model](https://github.com/neo4j-field/dataflow-flex-pyarrow-to-gds#the-graph-model)\n  for your data\n- `neo4j_secret` -- a Google Secret Manager secret resource id\n  containing a JSON blob with additional arguments:\n    * `neo4j_user` -- name of the Neo4j user to connect as\n    * `neo4j_password` -- password for the given user\n    * `neo4j_uri` -- Connection URI of the AuraDS instance\n- `bq_project` -- GCP project id owning the BigQuery source data\n- `bq_dataset` -- BigQuery dataset name for the source data\n- `node_tables` -- an `ARRAY\u003cSTRING\u003e` of BigQuery 
table names representing nodes\n- `edge_tables` -- an `ARRAY\u003cSTRING\u003e` of BigQuery table names representing edges\n\n\u003e Note: you can leverage the fact the secret payload is JSON to tuck\n\u003e in any additional, supported arguments not exposed by your stored\n\u003e procedure. (For instance, you could override the default\n\u003e `neo4j_concurrency` setting.)\n\nAn example BigQuery SQL statement that calls the procedure:\n\n```\nDECLARE graph_name STRING DEFAULT \"test-graph\";\nDECLARE graph_uri STRING DEFAULT \"gs://your-bucket/folder/model.json\";\nDECLARE neo4j_secret STRING DEFAULT \"projects/123456/secrets/neo4j-bigquery/versions/1\";\nDECLARE bq_project STRING DEFAULT \"your-gcp-project\";\nDECLARE bq_dataset STRING DEFAULT \"your_bq_dataset\";\nDECLARE node_tables ARRAY\u003cSTRING\u003e DEFAULT [\"papers\", \"authors\", \"institution\"];\nDECLARE edge_tables ARRAY\u003cSTRING\u003e DEFAULT [\"citations\", \"authorship\", \"affiliation\"];\n\nCALL `your-gcp-project.your_bq_dataset.neo4j_gds_graph_project`(\n    graph_name, graph_uri, neo4j_secret, bq_project, bq_dataset,\n    node_tables, edge_tables);\n```\n\n### Creating the Neo4j --\u003e BigQuery Procedures\n\nOne Dataproc template (`Neo4jGDSToBigQueryTemplate`) supports writing\nNodes or Relationships back to BigQuery from AuraDS/GDS. 
The mode is\nsimply toggled via a `--bq_sink_mode` parameter that can either be\nhardcoded (as below) to make distinct stored procedures or exposed\nas a parameter for the user to populate.\n\nFor Nodes:\n\n```sql\nCREATE OR REPLACE PROCEDURE\n  `your-gcp-project.your_bigquery_dataset.neo4j_gds_stream_graph`(graph_name STRING,\n    neo4j_secret STRING,\n    bq_project STRING,\n    bq_dataset STRING,\n    bq_node_table STRING,\n    bq_edge_table STRING,\n    neo4j_patterns ARRAY\u003cSTRING\u003e)\nWITH CONNECTION `your-gcp-project.eu.spark-connection` OPTIONS (engine='SPARK',\n    runtime_version='2.1',\n    container_image='europe-west2-docker.pkg.dev/your-gcp-project/connectors/neo4j-bigquery-connector:0.6.1',\n    properties=[(\"spark.driver.cores\", \"8\"),\n        (\"spark.driver.maxResultSize\", \"4g\"),\n        (\"spark.driver.memory\", \"16g\")],\n    description=\"Stream graph entities from Neo4j AuraDS/GDS to BigQuery\")\n  LANGUAGE python AS R\"\"\"\nfrom pyspark.sql import SparkSession\nfrom templates import Neo4jGDSToBigQueryTemplate\n\nspark = (\n    SparkSession\n    .builder\n    .appName(\"Neo4j -\u003e BigQuery Connector\")\n    .getOrCreate()\n)\n\ntemplate = Neo4jGDSToBigQueryTemplate()\nargs = template.parse_args()\ntemplate.run(spark, args)\n\"\"\";\n```\n\nSome details on the inputs:\n\n- `graph_name` -- the name of the graph in AuraDS you're reading\n- `neo4j_secret` -- a Google Secret Manager secret resource id\n  containing a JSON blob with additional arguments:\n    * `neo4j_user` -- name of the Neo4j user to connect as\n    * `neo4j_password` -- password for the given user\n    * `neo4j_uri` -- Connection URI of the AuraDS instance\n- `bq_project` -- GCP project id owning the BigQuery destination tables\n- `bq_dataset` -- BigQuery dataset name for the destination tables\n- `bq_node_table` -- BigQuery table name to write nodes into\n- `bq_edge_table` -- BigQuery table name to write edges into\n- `neo4j_patterns` -- an `ARRAY\u003cSTRING\u003e` of Neo4j 
patterns to query from GDS/AuraDS, in the form of\n  Cypher-style node or relationship patterns, e.g. `(:Author{id,birth_year})` for nodes or `[:KNOWS{since_year}]` for\n  relationships.\n\n\u003e Note: Writing back to BigQuery currently requires sizing your\n\u003e Spark environment in advance to accommodate your data.\n\u003e See [Dataproc Serverless docs](https://cloud.google.com/dataproc-serverless/docs/concepts/properties#resource_allocation_properties)\n\u003e for options for increasing CPU or memory for Spark\n\u003e workers/executors.\n\n## Current Caveats\n\n- All known caveats for populating GDS via Arrow Flight apply\n  (e.g. node id formats).\n- Concurrency doesn't auto-tune. The current recommendation is to set\n  `neo4j_concurrency` to at least half the number of AuraDS CPUs, but\n  it's not clear how much it helps.\n\n## Copyright and Licensing\n\nAll artifacts and code in this project, unless noted in their\nrespective files, are copyright Neo4j and made available under the\nApache License, Version 2.0.\n\nNo support is currently provided.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fneo4j-field%2Fbigquery-connector","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fneo4j-field%2Fbigquery-connector","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fneo4j-field%2Fbigquery-connector/lists"}