{"id":19195904,"url":"https://github.com/mercari/dataflowtemplates","last_synced_at":"2025-10-25T10:36:20.008Z","repository":{"id":41817579,"uuid":"194847557","full_name":"mercari/DataflowTemplates","owner":"mercari","description":"Convenient Dataflow pipelines for transforming data between cloud data sources","archived":false,"fork":false,"pushed_at":"2024-02-14T13:11:19.000Z","size":100,"stargazers_count":24,"open_issues_count":3,"forks_count":12,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-10-22T10:47:01.042Z","etag":null,"topics":["apache-beam","bigquery","dataflow","dataflow-templates","spanner"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mercari.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-07-02T11:09:38.000Z","updated_at":"2024-06-13T07:52:17.000Z","dependencies_parsed_at":"2024-11-09T12:11:58.515Z","dependency_job_id":"5555c474-da98-459f-86a4-17c8e8361224","html_url":"https://github.com/mercari/DataflowTemplates","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/mercari/DataflowTemplates","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mercari%2FDataflowTemplates","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mercari%2FDataflowTemplates/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mercari%2FDataflowTemplates/releases","manifests_url":"https://repos.ecosyste.ms/ap
i/v1/hosts/GitHub/repositories/mercari%2FDataflowTemplates/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mercari","download_url":"https://codeload.github.com/mercari/DataflowTemplates/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mercari%2FDataflowTemplates/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":280943384,"owners_count":26417746,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-25T02:00:06.499Z","response_time":81,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-beam","bigquery","dataflow","dataflow-templates","spanner"],"created_at":"2024-11-09T12:11:55.777Z","updated_at":"2025-10-25T10:36:19.965Z","avatar_url":"https://github.com/mercari.png","language":"Java","readme":"# Mercari Dataflow Templates\n\n([New version released!](https://github.com/mercari/DataflowTemplate))\n\nUseful Cloud Dataflow custom templates.\n\nYou should use the [Google provided templates](https://github.com/GoogleCloudPlatform/DataflowTemplates) if your use case fits.\nThese templates target use cases that the official templates do not cover.\n\n## Template Pipelines\n\n* [Spanner to GCS Text](src/main/java/com/mercari/solution/templates/SpannerToText.java)\n* [Spanner to GCS Avro](src/main/java/com/mercari/solution/templates/SpannerToAvro.java)\n* [Spanner to 
BigQuery](src/main/java/com/mercari/solution/templates/SpannerToBigQuery.java)\n* [Spanner to Spanner](src/main/java/com/mercari/solution/templates/SpannerToSpanner.java)\n* [Spanner Bulk Delete](src/main/java/com/mercari/solution/templates/SpannerToSpannerDelete.java)\n* [BigQuery to Spanner](src/main/java/com/mercari/solution/templates/BigQueryToSpanner.java)\n* [BigQuery to Datastore](src/main/java/com/mercari/solution/templates/BigQueryToDatastore.java)\n* [BigQuery to TFRecord](src/main/java/com/mercari/solution/templates/BigQueryToTFRecord.java)\n* [GCS Avro to Spanner](src/main/java/com/mercari/solution/templates/AvroToSpanner.java)\n* [GCS Avro to Datastore](src/main/java/com/mercari/solution/templates/AvroToDatastore.java)\n* [Dummy to Spanner](src/main/java/com/mercari/solution/templates/DummyToSpanner.java)\n\n## Getting Started\n\n### Requirements\n\n* Java 8\n* Maven 3\n\n### Creating a Template File\n\nDataflow templates can be [created](https://cloud.google.com/dataflow/docs/templates/creating-templates#creating-and-staging-templates)\nusing a Maven command that builds the project and stages the template\nfile on Google Cloud Storage. Any parameters passed at template build\ntime cannot be overwritten at execution time.\n\n```sh\nmvn compile exec:java \\\n-Pdataflow-runner \\\n-Dexec.mainClass=com.mercari.solution.templates.\u003ctemplate-class\u003e \\\n-Dexec.cleanupDaemonThreads=false \\\n-Dexec.args=\" \\\n--project=\u003cproject-id\u003e \\\n--stagingLocation=gs://\u003cbucket-name\u003e/staging \\\n--tempLocation=gs://\u003cbucket-name\u003e/temp \\\n--templateLocation=gs://\u003cbucket-name\u003e/\u003ctemplate-path\u003e \\\n--runner=DataflowRunner\"\n```\n\n### Executing a Template File\n\nOnce the template is staged on Google Cloud Storage, it can then be\nexecuted using the\n[gcloud CLI](https://cloud.google.com/sdk/gcloud/reference/dataflow/jobs/run)\ntool. 
The runtime parameters required by the template can be passed in the\nparameters field as a comma-separated list of `paramName=Value`.\n\n```sh\ngcloud dataflow jobs run \u003cjob-name\u003e \\\n--gcs-location=\u003ctemplate-location\u003e \\\n--zone=\u003czone\u003e \\\n--parameters \u003cparameters\u003e\n```\n\n## Template Parameters\n\n### SpannerToText\n\nSpannerToText lets the user specify the SQL used to extract records as a template parameter.\nBecause the query is a template parameter, records can be extracted flexibly, for example for a daily differential backup from Spanner.\nSpannerToText supports json and csv formats.\n\n| Parameter       | Type   | Description                                      |\n|-----------------|--------|--------------------------------------------------|\n| type            | String | Output format. Set `json` (default) or `csv`.    |\n| projectId       | String | projectID for Spanner you will read.             |\n| instanceId      | String | Spanner instanceID you will read.                |\n| databaseId      | String | Spanner databaseID you will read.                |\n| query           | String | SQL query to read record from Spanner            |\n| output          | String | GCS path to output. prefix must start with gs:// |\n| timestampBound  | String | (Optional) timestamp bound (format: yyyy-MM-ddTHH:mm:SSZ). default is strong.   
|\n| outputNotify    | String | (Optional) GCS path to put a notification file that contains output file paths |\n| outputEmpty     | String | (Optional) If set to true, output an empty file even for an empty result |\n| splitField      | String | (Optional) Field name whose value splits the output destination |\n| withoutSharding | Boolean| (Optional) Turn off sharding so the filename is exactly as specified in output |\n| header          | String | (Optional) Header String |\n\n\n* If the query is root-partitionable (the query plan has a DistributedUnion at the root), the pipeline will read records from Spanner in parallel.\nFor example, a query that includes an 'order by' or 'limit' operation cannot have a DistributedUnion at the root.\nPlease run EXPLAIN for query plan details before running the template.\n* [timestampBound](https://cloud.google.com/spanner/docs/timestamp-bounds) must be within one hour.\n* timestampBound format example in Japan: '2018-10-01T18:00:00+09:00'.\n* Query will be split and executed in parallel if the delimiter string `--SPLITTER--` is present.\n\n\n### SpannerToAvro\n\nSpannerToAvro lets the user specify the SQL used to extract records as a template parameter.\nYou can restore Spanner or BigQuery tables from the avro files it outputs.\nTemplate parameters are the same as for SpannerToText.\n\n| Parameter       | Type   | Description                                      |\n|-----------------|--------|--------------------------------------------------|\n| projectId       | String | projectID for Spanner you will read.             |\n| instanceId      | String | Spanner instanceID you will read.                |\n| databaseId      | String | Spanner databaseID you will read.                |\n| query           | String | SQL query to read record from Spanner            |\n| output          | String | GCS path to output. 
prefix must start with gs:// |\n| splitField      | String | (Optional) Query result field by which to store avro files separately |\n| timestampBound  | String | (Optional) timestamp bound (format: yyyy-MM-ddTHH:mm:SSZ). default is strong.   |\n| outputNotify    | String | (Optional) GCS path to put a notification file that contains output file paths |\n| outputEmpty     | String | (Optional) If set to true, output an empty file even for an empty result |\n\n* Some Spanner data types are converted: Date is converted to epoch days (Int32), and Timestamp to epoch milliseconds (Int64).\n* Query will be split and executed in parallel if the delimiter string `--SPLITTER--` is present.\n\n\n### SpannerToBigQuery\n\nSpannerToBigQuery lets the user specify the SQL used to extract records as a template parameter.\nTemplate parameters are the same as for SpannerToText.\nThe BigQuery destination table must be created in advance.\n\n| Parameter       | Type   | Description                                      |\n|-----------------|--------|--------------------------------------------------|\n| projectId       | String | projectID for Spanner you will read.             |\n| instanceId      | String | Spanner instanceID you will read.                |\n| databaseId      | String | Spanner databaseID you will read.                |\n| query           | String | SQL query to read record from Spanner            |\n| output          | String | Destination BigQuery table. format {dataset}.{table} |\n| timestampBound  | String | (Optional) timestamp bound (format: yyyy-MM-ddTHH:mm:SSZ). default is strong.   
|\n\n* Query will be split and executed in parallel if the delimiter string `--SPLITTER--` present.\n\n\n### SpannerToSpanner\n\nSpannerToSpanner enables you to query from some Spanner table and write results to specified Spanner table using template.\nTemplate parameters are same as SpannerToText.\nSpanner destination table must be created.\n\n| Parameter       | Type   | Description                                        |\n|-----------------|--------|----------------------------------------------------|\n| inputProjectId  | String | projectID for Spanner you will read.               |\n| inputInstanceId | String | Spanner instanceID you will read.                  |\n| inputDatabaseId | String | Spanner databaseID you will read.                  |\n| query           | String | SQL query to read record from Spanner              |\n| outputProjectId | String | projectID for Spanner you will write query result. |\n| outputInstanceId| String | Spanner instanceID you will write query result.    |\n| outputDatabaseId| String | Spanner databaseID you will write query result.    |\n| table           | String | Spanner table name you will write query result.    |\n| mutationOp      | String | Spanner [insert policy](https://googleapis.dev/java/google-cloud-clients/latest/com/google/cloud/spanner/Mutation.Op.html). `INSERT` or `UPDATE` or `REPLACE` or `INSERT_OR_UPDATE` |\n| outputError     | String | GCS path to output error record as avro files.     |\n| timestampBound  | String | (Optional) timestamp bound (format: yyyy-MM-ddTHH:mm:SSZ). default is strong.   
|\n\n* Query will be split and executed in parallel if the delimiter string `--SPLITTER--` present.\n\n\n### SpannerToSpannerDelete\n\nSpannerToSpannerDelete delete records from table that matched query results user specified.\n\n| Parameter       | Type   | Description                                      |\n|-----------------|--------|--------------------------------------------------|\n| projectId       | String | projectID for Spanner you will read.             |\n| instanceId      | String | Spanner instanceID you will read.                |\n| databaseId      | String | Spanner databaseID you will read.                |\n| query           | String | SQL query to read record from Spanner            |\n| table           | String | Spanner table name to delete records             |\n| keyFields       | String | Key fields in query results. If composite key case, set comma-separated fields in key sequence. |\n| timestampBound  | String | (Optional) timestamp bound (format: yyyy-MM-ddTHH:mm:SSZ). default is strong.   |\n\n* Query will be split and executed in parallel if the delimiter string `--SPLITTER--` present.\n\n\n### AvroToSpanner\n\nAvroToSpanner recovers Spanner table from avro files you made using SpannerToAvro template.\nTemplate parameters are same as SpannerToText.\n\n| Parameter       | Type   | Description                                      |\n|-----------------|--------|--------------------------------------------------|\n| input           | String | GCS path for avro files. prefix must start with gs:// |\n| projectId       | String | projectID for Spanner you will recover.          |\n| instanceId      | String | Spanner instanceID you will recover.             |\n| databaseId      | String | Spanner databaseID you will recover.             |\n| table           | String | Spanner table name to insert records.            
|\n| mutationOp      | String | Spanner [insert policy](https://googleapis.dev/java/google-cloud-clients/latest/com/google/cloud/spanner/Mutation.Op.html). `INSERT` or `UPDATE` or `REPLACE` or `INSERT_OR_UPDATE` |\n\n\n### AvroToDatastore\n\nAvroToDatastore inserts records from avro files on GCS into Cloud Datastore.\n\n| Parameter              | Type   | Description                                      |\n|------------------------|--------|--------------------------------------------------|\n| input                  | String | GCS path for avro files. prefix must start with gs:// |\n| projectId              | String | GCP Project ID for Datastore to insert.          |\n| kind                   | String | Datastore kind name to insert.                   |\n| keyField               | String | Unique key field name in avro record.            |\n| excludeFromIndexFields | String | (Optional) Field names to exclude from index.    |\n\n\n### BigQueryToSpanner\n\nBigQueryToSpanner enables you to query BigQuery and write the results to a specified Spanner table.\nThe Spanner destination table will be created if it does not exist.\n\n| Parameter       | Type   | Description                                        |\n|-----------------|--------|----------------------------------------------------|\n| query           | String | SQL query to read record from BigQuery             |\n| projectId       | String | projectID for Spanner you will write query result. |\n| instanceId      | String | Spanner instanceID you will write query result.    |\n| databaseId      | String | Spanner databaseID you will write query result.    |\n| table           | String | Spanner table name you will write query result.    |\n| mutationOp      | String | Spanner [insert policy](https://googleapis.dev/java/google-cloud-clients/latest/com/google/cloud/spanner/Mutation.Op.html). 
`INSERT` or `UPDATE` or `REPLACE` or `INSERT_OR_UPDATE` |\n| outputError     | String | GCS path to output error record as avro files.     |\n| outputNotify    | String | (Optional) GCS path to put a notification file that contains output file paths |\n| primaryKeyFields| String | (Optional) Key field on the destination Spanner table. (Required if using table auto-generation) |\n\n* You must enable the [BigQuery Storage API](https://cloud.google.com/bigquery/docs/reference/storage/).\n* At this time, workers require high memory to use the Storage API. If the process does not work, please increase the worker size.\n\n\n### BigQueryToDatastore\n\nBigQueryToDatastore enables you to query BigQuery and write the results to a specified Cloud Datastore kind.\n\n| Parameter              | Type   | Description                                          |\n|------------------------|--------|------------------------------------------------------|\n| query                  | String | SQL query to read record from BigQuery               |\n| projectId              | String | projectID for Datastore you will write query result. |\n| kind                   | String | Cloud Datastore target kind name to store.           |\n| keyField               | String | Unique field name in query results from BigQuery.    |\n| excludeFromIndexFields | String | (Optional) Field names to exclude from index.        |\n\n* You must enable the [BigQuery Storage API](https://cloud.google.com/bigquery/docs/reference/storage/).\n* At this time, workers require high memory to use the Storage API. 
If the process does not work, please increase the worker size.\n\n\n### BigQueryToTFRecord\n\nBigQueryToTFRecord enables you to query BigQuery and write the results to a specified GCS path in TFRecord format.\n\n| Parameter     | Type   | Description                                          |\n|---------------|--------|------------------------------------------------------|\n| query         | String | SQL query to read record from BigQuery               |\n| output        | String | GCS path to store TFRecord file.                     |\n| splitField    | String | (Optional) Field names to separate records.          |\n| outputNotify  | String | (Optional) GCS path to put a notification file that contains output file paths |\n| outputEmpty   | String | (Optional) If set to true, output an empty file even for an empty result |\n\n* You must enable the [BigQuery Storage API](https://cloud.google.com/bigquery/docs/reference/storage/).\n* At this time, workers require high memory to use the Storage API. If the process does not work, please increase the worker size.\n\n\n### DummyToSpanner\n\nDummyToSpanner inserts dummy records into a specified Spanner table.\n\n| Parameter   | Type   | Description                                      |\n|-------------|--------|--------------------------------------------------|\n| projectId   | String | Spanner projectID.  |\n| instanceId  | String | Spanner instanceID. |\n| databaseId  | String | Spanner databaseID. |\n| tables      | String | Spanner table names and the number of dummy records to insert. format: tableName1:1000,tableName2:20000   |\n| config      | String | (Optional) Config yaml file. GCS path or yaml text itself.  |\n| parallelNum | Integer| (Optional) Number of parallel workers generating dummy records.  
|\n\n* You can specify the range of random values for each field.\n* The format of the config yaml file is as follows.\n\n```yaml\ntables:\n  - name: dummyTable1\n    randomRate: 0\n    fields:\n      - name: intField\n        range: [1,10]\n      - name: floatField\n        range: [-100.0,100.0]\n      - name: dateField\n        range: [2014-01-01,2018-01-01]\n      - name: timestampField\n        range: [2018-01-01T00:00:00,2019-01-01T00:00:00]\n```\n\n* `randomRate` describes the null value rate for `NULLABLE` fields.\n* `range` describes the minimum and maximum values for the field.\n\n\n## Committers\n\n * Yoichi Nagai ([@orfeon](https://github.com/orfeon))\n\n## Contribution\n\nPlease read the CLA carefully before submitting your contribution to Mercari.\nUnder any circumstances, by submitting your contribution, you are deemed to accept and agree to be bound by the terms and conditions of the CLA.\n\nhttps://www.mercari.com/cla/\n\n## License\n\nCopyright 2019 Mercari, Inc.\n\nLicensed under the MIT License.","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmercari%2Fdataflowtemplates","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmercari%2Fdataflowtemplates","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmercari%2Fdataflowtemplates/lists"}