https://github.com/embulk/embulk-input-gcs

Embulk plugin that loads records from Google Cloud Storage
https://github.com/embulk/embulk-input-gcs

embulk embulk-input-plugin embulk-plugin gcp google-cloud-storage


Host: GitHub
URL: https://github.com/embulk/embulk-input-gcs
Owner: embulk
License: apache-2.0
Created: 2015-03-01T05:06:39.000Z (about 10 years ago)
Default Branch: master
Last Pushed: 2025-03-15T08:01:26.000Z (2 months ago)
Last Synced: 2025-04-02T05:08:23.054Z (about 2 months ago)
Topics: embulk, embulk-input-plugin, embulk-plugin, gcp, google-cloud-storage
Language: Java
Homepage: https://github.com/embulk/embulk-input-gcs
Size: 513 KB
Stars: 14
Watchers: 14
Forks: 8
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
- Codeowners: .github/CODEOWNERS

Google Cloud Storage file input plugin for Embulk
===================================================

Overview
---------

embulk-input-gcs v0.5.0+ requires Embulk v0.11.0+.

* Plugin type: **file input**
* Resume supported: **yes**
* Cleanup supported: **yes**

Usage
------

### Install plugin

```
java -jar embulk-X.Y.Z.jar install "org.embulk:embulk-input-gcs:0.5.0"
```

### Google Service Account Settings

If you choose "private_key" or "json_key" as the [auth_method](#Authentication), you can obtain the service_account_email and the private_key or json_key as follows.

1. Create a project in the [Google Developers Console](https://console.developers.google.com/project).

1. Create a "Service Account" by following [these steps](https://cloud.google.com/storage/docs/authentication#service_accounts).

   A service account can be given one of two scopes: read-only or read-write.

   embulk-input-gcs works with the "read-only" scope.

1. Generate a private key in P12 (PKCS12) format, or a JSON key, and upload it to the machine where Embulk runs.

### Run

```
java -jar embulk-X.Y.Z.jar run /path/to/config.yml
```

## Configuration

- **bucket**: Google Cloud Storage bucket name (string, required)
- **path_prefix**: prefix of target keys (string, either "path_prefix" or "paths" is required)
- **paths**: list of target keys (array of strings, either "path_prefix" or "paths" is required)
- **path_match_pattern**: regexp to match file paths. If a file path doesn't match this pattern, the file is skipped (regexp string, optional)
- **incremental**: enables incremental loading (boolean, optional, default: true). If incremental loading is enabled, the config diff for the next execution includes the `last_path` parameter so that the next execution skips files before that path. Otherwise, `last_path` is not included.
- **auth_method**: authentication method (string, optional, one of "private_key", "json_key" or "compute_engine", default: "private_key")
- **service_account_email**: email address of the Google Cloud service account (string, required when auth_method is private_key)
- **p12_keyfile**: full path of the P12 key file (string, required when auth_method is private_key)
- **json_keyfile**: full path of the JSON key file (string, required when auth_method is json_key)
- **application_name**: any application name you like (string, optional)
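
As a rough illustration of the `incremental` option described above, the sketch below turns incremental loading off, so that no `last_path` is recorded and each run reads every matching file again. The bucket name, prefix, and key file path are placeholders, not values taken from this project.

```yaml
in:
  type:
    source: maven
    group: org.embulk
    name: gcs
    version: "0.5.0"
  bucket: my-gcs-bucket                      # placeholder bucket name
  path_prefix: logs/csv-                     # placeholder prefix
  incremental: false                         # last_path is not added to the config diff
  auth_method: json_key
  json_keyfile: /path/to/json_keyfile.json   # placeholder path
```

Leaving `incremental` unset keeps the default (`true`), in which case the next run skips files before the recorded `last_path`.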

Example
--------

```yaml
in:
  type:
    source: maven
    group: org.embulk
    name: gcs
    version: "0.5.0"
  bucket: my-gcs-bucket
  path_prefix: logs/csv-
  auth_method: private_key #default
  service_account_email: ABCXYZ123ABCXYZ123.gserviceaccount.com
  p12_keyfile: /path/to/p12_keyfile.p12
  application_name: Anything you like
```

Example for "sample_01.csv.gz", generated by the [embulk example](https://github.com/embulk/embulk#trying-examples) command:

```yaml
in:
  type:
    source: maven
    group: org.embulk
    name: gcs
    version: "0.5.0"
  bucket: my-gcs-bucket
  path_prefix: sample_
  auth_method: private_key #default
  service_account_email: ABCXYZ123ABCXYZ123.gserviceaccount.com
  p12_keyfile: /path/to/p12_keyfile.p12
  application_name: Anything you like
  decoders:
  - {type: gzip}
  parser:
    charset: UTF-8
    newline: CRLF
    type: csv
    delimiter: ','
    quote: '"'
    header_line: true
    columns:
    - {name: id, type: long}
    - {name: account, type: long}
    - {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
    - {name: purchase, type: timestamp, format: '%Y%m%d'}
    - {name: comment, type: string}
out: {type: stdout}
```

To skip files using regexp:

```yaml
in:
  type:
    source: maven
    group: org.embulk
    name: gcs
    version: "0.5.0"
  bucket: my-gcs-bucket
  path_prefix: logs/csv-
  # ...
  path_match_pattern: \.csv$ # a file will be skipped if its path doesn't match this pattern
  ## some examples of regexp:
  #path_match_pattern: /archive/ # match files in .../archive/... directory
  #path_match_pattern: /data1/|/data2/ # match files in .../data1/... or .../data2/... directory
  #path_match_pattern: \.csv$|\.csv\.gz$ # match files whose suffix is .csv or .csv.gz
```

Authentication
---------------

Three methods are supported for fetching an access token for the service account.

1. Public-private key pair of a GCP (Google Cloud Platform) service account
2. JSON key of a GCP (Google Cloud Platform) service account
3. Pre-defined access token (Google Compute Engine only)

### Public-Private key pair of GCP's service account

First create a service account (client ID), download its private key, and deploy the key alongside Embulk.

```yaml
in:
  type:
    source: maven
    group: org.embulk
    name: gcs
    version: "0.5.0"
  auth_method: private_key
  service_account_email: ABCXYZ123ABCXYZ123.gserviceaccount.com
  p12_keyfile: /path/to/p12_keyfile.p12
```

### JSON key of GCP's service account

First create a service account (client ID), download its JSON key, and deploy the key alongside Embulk.

```yaml
in:
  type:
    source: maven
    group: org.embulk
    name: gcs
    version: "0.5.0"
  auth_method: json_key
  json_keyfile: /path/to/json_keyfile.json
```

You can also embed the contents of the JSON key file directly in config.yml.

```yaml
in:
  type:
    source: maven
    group: org.embulk
    name: gcs
    version: "0.5.0"
  auth_method: json_key
  json_keyfile:
    content: |
      {
        "private_key_id": "123456789",
        "private_key": "-----BEGIN PRIVATE KEY-----\nABCDEF",
        "client_email": "..."
      }
```

### Pre-defined access token (GCE only)

When you run Embulk on Google Compute Engine, you don't need to explicitly create a service account for Embulk. For this third authentication method, add the API scope "https://www.googleapis.com/auth/devstorage.read_only" to the scope list of your
Compute Engine VM instance, then configure Embulk as shown below.

[Setting the scope of service account access for instances](https://cloud.google.com/compute/docs/authentication)

```yaml
in:
  type:
    source: maven
    group: org.embulk
    name: gcs
    version: "0.5.0"
  auth_method: compute_engine
```

Eventual Consistency
-----------------------

Listing objects is eventually consistent, although getting objects is strongly consistent. See https://cloud.google.com/storage/docs/consistency.

`path_prefix` uses the objects list API, so it may miss some objects.
To avoid this, use the `paths` option, which specifies object paths directly without using the objects list API.
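
A minimal sketch of the `paths` approach, assuming the object names are known in advance. The bucket name, object keys, and key file path below are hypothetical placeholders.

```yaml
in:
  type:
    source: maven
    group: org.embulk
    name: gcs
    version: "0.5.0"
  bucket: my-gcs-bucket                      # placeholder bucket name
  paths:                                     # explicit object keys, no list API call
    - logs/csv-20250101.csv                  # hypothetical object key
    - logs/csv-20250102.csv                  # hypothetical object key
  auth_method: json_key
  json_keyfile: /path/to/json_keyfile.json   # placeholder path
```

Because `paths` replaces `path_prefix` here, only the listed objects are read.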

For Maintainers
----------------

### Build

```
./gradlew jar
```

### Test

To run the unit tests, configure the environment variables listed below.

Additionally, the following files need to be uploaded to an existing GCS bucket.
* [sample_01.csv](./src/test/resources/sample_01.csv)
* [sample_02.csv](./src/test/resources/sample_02.csv)

When the environment variables are not set, some test cases are skipped.

```
GCP_EMAIL
GCP_P12_KEYFILE
GCP_JSON_KEYFILE
GCP_BUCKET
GCP_BUCKET_DIRECTORY (optional, if needed)
```

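If you're using Mac OS X El Capitan and a GUI application (such as an IDE), set the environment variables as follows.
```
$ vi ~/Library/LaunchAgents/environment.plist

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>my.startup</string>
  <key>ProgramArguments</key>
  <array>
    <string>sh</string>
    <string>-c</string>
    <string>
      launchctl setenv GCP_EMAIL ABCXYZ123ABCXYZ123.gserviceaccount.com
      launchctl setenv GCP_P12_KEYFILE /path/to/p12_keyfile.p12
      launchctl setenv GCP_JSON_KEYFILE /path/to/json_keyfile.json
      launchctl setenv GCP_BUCKET my-bucket
      launchctl setenv GCP_BUCKET_DIRECTORY unittests
    </string>
  </array>
  <key>RunAtLoad</key>
  <true/>
</dict>
</plist>

$ launchctl load ~/Library/LaunchAgents/environment.plist
$ launchctl getenv GCP_EMAIL //try to get value.

Then start your applications.
```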

### Release

Modify `version` in `build.gradle` at a detached commit, and then tag the commit with an annotation.

```
git checkout --detach master

(Edit: Remove "-SNAPSHOT" in "version" in build.gradle.)

git add build.gradle

git commit -m "Release vX.Y.Z"

git tag -a vX.Y.Z

(Edit: Write a tag annotation in the changelog format.)
```

See [Keep a Changelog](https://keepachangelog.com/en/1.0.0/) for the changelog format. We adopt a part of it for Git's tag annotation like below.

```
## [X.Y.Z] - YYYY-MM-DD

### Added
- Added a feature.

### Changed
- Changed something.

### Fixed
- Fixed a bug.
```

Then push the annotated tag. It triggers a release operation on GitHub Actions after approval.

```
git push -u origin vX.Y.Z
```
