Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/trocco-io/embulk-output-bigquery_java
Java flavor faster Embulk output plugin to load/insert data into Google BigQuery
- Host: GitHub
- URL: https://github.com/trocco-io/embulk-output-bigquery_java
- Owner: trocco-io
- License: mit
- Created: 2020-02-17T00:49:19.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2024-10-28T01:04:27.000Z (3 months ago)
- Last Synced: 2024-10-28T05:00:15.839Z (3 months ago)
- Topics: bigquery, embulk, etl, java
- Language: Java
- Homepage:
- Size: 371 KB
- Stars: 5
- Watchers: 4
- Forks: 3
- Open Issues: 27
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
# embulk-output-bigquery_java
[Embulk](https://github.com/embulk/embulk/) output plugin to load/insert data into [Google BigQuery](https://cloud.google.com/bigquery/) using [direct insert](https://cloud.google.com/bigquery/loading-data-into-bigquery#loaddatapostrequest)
## Overview
This plugin aims to be a Java version of [embulk-output-bigquery](https://github.com/embulk/embulk-output-bigquery) with some additional functions. Most features are not implemented right now; you should use the JRuby version for stable transfers.
Loads data into Google BigQuery as batch jobs, which is suitable for large amounts of data:
https://developers.google.com/bigquery/loading-data-into-bigquery

* **Plugin type**: output
* **Resume supported**: no
* **Cleanup supported**: no
* **Dynamic table creating**: yes

### NOT IMPLEMENTED
* insert data over streaming inserts
  * for continuous real-time insertions
  * Please use another product, such as [fluent-plugin-bigquery](https://github.com/kaizenplatform/fluent-plugin-bigquery)
  * https://developers.google.com/bigquery/streaming-data-into-bigquery#usecases

The current version of this plugin supports the Google API with Service Account Authentication, but does not support the OAuth flow for installed applications.

## Differences from [embulk-output-bigquery](https://github.com/embulk/embulk-output-bigquery)
- before_load: a SQL query that can be executed before loading into the table
- column_options.description: adds a description to a BigQuery field

## Configuration
Under construction
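While the documentation is under construction, the minimal sketch below shows roughly how the supported options fit together. The dataset, table, keyfile path, and SQL are placeholders, and it assumes the supported values called out in the tables that follow (`mode: append`, `auth_method: service_account`, `source_format: NEWLINE_DELIMITED_JSON`):

```yaml
out:
  type: bigquery
  mode: append                            # replace and append are supported
  auth_method: service_account            # the supported authentication method
  json_keyfile: /path/to/keyfile.json     # placeholder path to a service account key
  dataset: my_dataset                     # placeholder dataset name
  table: my_table                         # placeholder table name
  auto_create_table: true
  source_format: NEWLINE_DELIMITED_JSON
  compression: GZIP
  before_load: DELETE FROM my_dataset.my_table WHERE d = CURRENT_DATE()   # executed before loading, append mode only
  column_options:
    - {name: d, type: DATE, description: "load date"}   # description is an addition of this plugin
```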
#### Original options
| name (entries marked (x) are unsupported) | type | required? | default | description |
|:-------------------------------------|:------------|:-----------|:-------------------------|:-----------------------|
| mode (replace and append are supported) | string | optional | "append" | See [Mode](#mode) |
| auth_method (service_account is supported) | string | optional | "application\_default" | See [Authentication](#authentication) |
| json_keyfile | string | optional | | keyfile path or `content` |
| project (x) | string | required unless service\_account's `json_keyfile` is given. | | project\_id |
| dataset | string | required | | dataset |
| location | string | optional | nil | geographic location of dataset. See [Location](#location) |
| table | string | required | | table name, or table name with a partition decorator such as `table_name$20160929`|
| auto_create_dataset | boolean | optional | false | automatically create dataset |
| auto_create_table | boolean | optional | true | `false` is available only for `append_direct` mode. Other modes require `true`. See [Dynamic Table Creating](#dynamic-table-creating) and [Time Partitioning](#time-partitioning) |
| schema_file (x) | string | optional | | /path/to/schema.json |
| template_table | string | optional | | template table name. See [Dynamic Table Creating](#dynamic-table-creating) |
| job_status_max_polling_time | int | optional | 3600 sec | Max job status polling time |
| job_status_polling_interval | int | optional | 10 sec | Job status polling interval |
| is_skip_job_result_check | boolean | optional | false | Skip waiting for the load job to finish. Available for `append` or `delete_in_advance` mode |
| with_rehearsal (x) | boolean | optional | false | Load `rehearsal_counts` records as a rehearsal. The rehearsal loads into a temporary REHEARSAL table, which is finally deleted. You may use this option to investigate data errors at as early a stage as possible |
| rehearsal_counts (x) | integer | optional | 1000 | Specify the number of records to load in a rehearsal |
| abort_on_error | boolean | optional | true if max_bad_records is 0, otherwise false | Raise an error if the number of input rows and the number of output rows do not match |
| column_options (not fully supported) | hash | optional | | See [Column Options](#column-options) |
| default_timezone | string | optional | UTC | |
| default_timestamp_format | string | optional | %Y-%m-%d %H:%M:%S.%6N | |
| payload_column (x) | string | optional | nil | See [Formatter Performance Issue](#formatter-performance-issue) |
| payload_column_index (x) | integer | optional | nil | See [Formatter Performance Issue](#formatter-performance-issue) |
| gcs_bucket (x) | string | optional | nil | See [GCS Bucket](#gcs-bucket) |
| auto_create_gcs_bucket (x) | boolean | optional | false | See [GCS Bucket](#gcs-bucket) |
| progress_log_interval (x) | float | optional | nil (Disabled) | Progress log interval. The progress log is disabled when nil (default). NOTE: This option may be removed in the future because a filter plugin can achieve the same goal |
| before_load | string | optional | nil | If set, this SQL will be executed before loading all records in append mode. In replace mode, the SQL is not executed. |

#### Client or request options
| name | type | required? | default | description |
|:-------------------------------------|:------------|:-----------|:-------------------------|:-----------------------|
| open_timeout_sec (x) | integer | optional | 300 | Seconds to wait for the connection to open |
| timeout_sec (x) | integer | optional | 300 | Seconds to wait for one block to be read (google-api-ruby-client < v0.11.0) |
| send_timeout_sec (x) | integer | optional | 300 | Seconds to wait to send a request (google-api-ruby-client >= v0.11.0) |
| read_timeout_sec (x) | integer | optional | 300 | Seconds to wait to read a response (google-api-ruby-client >= v0.11.0) |
| retries | integer | optional | 5 | Number of retries |
| application_name (x) | string | optional | "Embulk BigQuery plugin" | User-Agent |
| sdk_log_level (x) | string | optional | nil (WARN) | Log level of google api client library |

#### Options for intermediate local files
| name | type | required? | default | description |
|:-------------------------------------|:------------|:-----------|:-------------------------|:-----------------------|
| path_prefix | string | optional | | Path prefix of local files such as "/tmp/prefix_". By default, a prefix is randomly generated with [tempfile](http://ruby-doc.org/stdlib-2.2.3/libdoc/tempfile/rdoc/Tempfile.html) |
| sequence_format (x) | string | optional | .%d.%d | Sequence format for pid, thread id |
| file_ext | string | optional | | The file extension of local files such as ".csv.gz" ".json.gz". By default, generated automatically from `source_format` and `compression` |
| skip_file_generation (x) | boolean | optional | | Load already generated local files into BigQuery if available. Specify the correct path_prefix and file_ext. |
| delete_from_local_when_job_end | boolean | optional | true | If set to true, delete generated local files when the job ends |
| compression | string | optional | "NONE" | Compression of local files (`GZIP` or `NONE`) |

`source_format` is also used to determine the formatter (csv or jsonl); a sketch of these options follows.
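For instance, the intermediate-file options above might be combined as in the sketch below (the prefix is a placeholder): gzip-compressed JSONL files would be written under `/tmp` and removed once the job ends.

```yaml
out:
  type: bigquery
  # ... connection and table options omitted ...
  path_prefix: /tmp/embulk_bq_            # placeholder prefix for intermediate files
  file_ext: .json.gz                      # matches source_format and compression below
  compression: GZIP
  source_format: NEWLINE_DELIMITED_JSON
  delete_from_local_when_job_end: true
```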
#### Same options as bq command-line tools or BigQuery job's property
The following options are the same as in the [bq command-line tools](https://cloud.google.com/bigquery/bq-command-line-tool#creatingtablefromfile) or the BigQuery [job's property](https://cloud.google.com/bigquery/docs/reference/v2/jobs#resource).
| name | type | required? | default | description |
|:----------------------------------|:---------|:----------|:--------|:-----------------------|
| source_format (jsonl is available) | string | required | "CSV" | File type (`NEWLINE_DELIMITED_JSON` or `CSV`) |
| max_bad_records | int | optional | 0 | |
| field_delimiter (x) | char | optional | "," | |
| encoding (x) | string | optional | "UTF-8" | `UTF-8` or `ISO-8859-1` |
| ignore_unknown_values (x) | boolean | optional | false | |
| allow_quoted_newlines (x) | boolean | optional | false | Set to true if data contains newline characters. It may cause slow processing |
| time_partitioning | hash | optional | `{"type":"DAY"}` if `table` parameter has a partition decorator, otherwise nil | See [Time Partitioning](#time-partitioning); a configuration sketch follows this table |
| time_partitioning.type | string | required | nil | The only type supported is DAY, which will generate one partition per day based on data loading time. |
| time_partitioning.expiration_ms | int | optional | nil | Number of milliseconds for which to keep the storage for a partition. |
| time_partitioning.field | string | optional | nil | `DATE` or `TIMESTAMP` column used for partitioning |
| clustering (x) | hash | optional | nil | Currently, clustering is supported for partitioned tables, so must be used with `time_partitioning` option. See [clustered tables](https://cloud.google.com/bigquery/docs/clustered-tables) |
| clustering.fields (x) | array | required | nil | One or more fields on which data should be clustered. The order of the specified columns determines the sort order of the data. |
| schema_update_options (x) | array | optional | nil | (Experimental) List of `ALLOW_FIELD_ADDITION` or `ALLOW_FIELD_RELAXATION` or both. See [jobs#configuration.load.schemaUpdateOptions](https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.load.schemaUpdateOptions). NOTE for the current status: `schema_update_options` does not work for the `copy` job, that is, it is not effective for most modes such as `append`, `replace` and `replace_backup`. `delete_in_advance` deletes the origin table, so it does not need to update the schema. Only `append_direct` can utilize schema update. |
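As mentioned in the `time_partitioning` rows above, a partitioned-table configuration might look like the following sketch. The table and field names are placeholders; `clustering` is marked as unsupported, so it is left out:

```yaml
out:
  type: bigquery
  # ... connection and file options omitted ...
  table: events                     # placeholder; a partition decorator such as events$20160929 also works
  auto_create_table: true           # required for modes other than append_direct
  time_partitioning:
    type: DAY                       # DAY is the only supported type
    field: event_date               # placeholder DATE or TIMESTAMP column used for partitioning
    expiration_ms: 7776000000       # keep each partition's storage for 90 days
```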
### Column Options (NOT fully supported)

Column options are used to aid in guessing the BigQuery schema, or to define conversion of values:
- **column_options**: advanced: an array of options for columns
- **name**: column name
  - **type**: BigQuery type such as `BOOLEAN`, `INTEGER`, `FLOAT`, `STRING`, `TIMESTAMP`, `DATETIME`, `DATE`, `RECORD`, and `NUMERIC`. See below for supported conversion types.
- boolean (x): `BOOLEAN`, `STRING` (default: `BOOLEAN`)
- long (x): `BOOLEAN`, `INTEGER`, `FLOAT`, `STRING`, `TIMESTAMP` (default: `INTEGER`)
- double (x): `INTEGER`, `FLOAT`, `STRING`, `TIMESTAMP` (default: `FLOAT`)
- string: `BOOLEAN`, `INTEGER`, `FLOAT`, `STRING`, `TIMESTAMP`, `DATETIME`, `DATE`, `RECORD` (default: `STRING`)
- timestamp (x): `INTEGER`, `FLOAT`, `STRING`, `TIMESTAMP`, `DATETIME`, `DATE` (default: `TIMESTAMP`)
- json (x): `STRING`, `RECORD` (default: `STRING`)
- numeric (x): `STRING`
- **mode**: BigQuery mode such as `NULLABLE`, `REQUIRED`, and `REPEATED` (string, default: `NULLABLE`)
  - **fields (x)**: Describes the nested schema fields if the type property is set to `RECORD`. Please note that this is **required** for `RECORD` columns.
- **timestamp_format**: timestamp format to convert into/from `timestamp` (string, default is `default_timestamp_format`)
- **timezone**: timezone to convert into/from `timestamp`, `date` (string, default is `default_timezone`).
- **description**: Description for BigQuery field
- **scale**: optional, [scale](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types?hl=ja#decimal_types) for numeric column (long, default is 9).
- **default_timestamp_format**: default timestamp format for column_options (string, default is "%Y-%m-%d %H:%M:%S.%6N")
- **default_timezone**: default timezone for column_options (string, default is "UTC")

Example:
```yaml
out:
  type: bigquery
  auto_create_table: true
  column_options:
    - {name: date, type: STRING, timestamp_format: "%Y-%m-%d", timezone: "Asia/Tokyo"}
    - name: json_column
      type: RECORD
      fields:
        - {name: key1, type: STRING}
        - {name: key2, type: STRING}
```
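The `scale` option listed under Column Options is not covered by the example above; assuming your build of the plugin handles `NUMERIC` columns, it might be combined with `description` like this (column name and scale are placeholders):

```yaml
column_options:
  - {name: price, type: NUMERIC, scale: 4, description: "unit price"}   # scale defaults to 9 when omitted
```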
## Build

```
$ ./gradlew gem # add -t to watch file changes and rebuild continuously
```

### Develop
```
$ embulk run config.yml -L PATH/embulk-output-bigquery_java/build/gemContents/
```