https://github.com/embulk/embulk-input-gcs
Embulk plugin that loads records from Google Cloud Storage
- Host: GitHub
- URL: https://github.com/embulk/embulk-input-gcs
- Owner: embulk
- License: apache-2.0
- Created: 2015-03-01T05:06:39.000Z (about 10 years ago)
- Default Branch: master
- Last Pushed: 2025-03-15T08:01:26.000Z (2 months ago)
- Last Synced: 2025-04-02T05:08:23.054Z (about 2 months ago)
- Topics: embulk, embulk-input-plugin, embulk-plugin, gcp, google-cloud-storage
- Language: Java
- Homepage: https://github.com/embulk/embulk-input-gcs
- Size: 513 KB
- Stars: 14
- Watchers: 14
- Forks: 8
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
- Codeowners: .github/CODEOWNERS
Google Cloud Storage file input plugin for Embulk
===================================================

Overview
---------

embulk-input-gcs v0.5.0+ requires Embulk v0.11.0+.

* Plugin type: **file input**
* Resume supported: **yes**
* Cleanup supported: **yes**

Usage
------

### Install plugin
```
java -jar embulk-X.Y.Z.jar install "org.embulk:embulk-input-gcs:0.5.0"
```

### Google Service Account Settings

If you choose `private_key` or `json_key` as the [auth_method](#Authentication), obtain `service_account_email` and either a P12 private key or a JSON key as follows.

1. Create a project in the [Google Developers Console](https://console.developers.google.com/project).
1. Create a service account by following [these steps](https://cloud.google.com/storage/docs/authentication#service_accounts).
   A service account can use two scopes: read-only and read-write.
   embulk-input-gcs works with the read-only scope.
1. Generate a private key in P12 (PKCS12) format or a JSON key, and upload it to the machine that runs Embulk.
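
After uploading a JSON key, it can help to confirm that the file parses and names the service account you expect before pointing `json_keyfile` at it. The following is a minimal sketch in Python, not part of the plugin. It assumes the key carries the `private_key_id`, `private_key` and `client_email` fields shown in the embedded-key example under Authentication. The function name `summarize_json_key` and the sample values are hypothetical.

```python
import json

# Field names taken from the embedded json_keyfile example in this README;
# real key files usually carry more fields than these.
EXPECTED_FIELDS = ("private_key_id", "private_key", "client_email")


def summarize_json_key(text):
    """Parse JSON key contents and return (client_email, missing_fields)."""
    key = json.loads(text)
    missing = [name for name in EXPECTED_FIELDS if name not in key]
    return key.get("client_email"), missing


# Usage with a made-up key; replace it with the contents of your own file.
sample = json.dumps({
    "private_key_id": "123456789",
    "client_email": "example@my-project.iam.gserviceaccount.com",
})
email, missing = summarize_json_key(sample)
print(email)    # example@my-project.iam.gserviceaccount.com
print(missing)  # ['private_key']
```

An empty `missing` list only means the expected fields are present; it does not prove that the key is valid or still active.
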
### Run
```
java -jar embulk-X.Y.Z.jar run /path/to/config.yml
```

## Configuration

- **bucket**: Google Cloud Storage bucket name (string, required)
- **path_prefix**: prefix of target keys (string, either "path_prefix" or "paths" is required)
- **paths**: list of target keys (array of strings, either "path_prefix" or "paths" is required)
- **path_match_pattern**: regexp to match file paths. A file is skipped if its path doesn't match this pattern (regexp string, optional)
- **incremental**: enables incremental loading (boolean, optional, default: true). If incremental loading is enabled, the config diff for the next execution includes the `last_path` parameter so that the next execution skips files before that path. Otherwise, `last_path` is not included.
- **auth_method**: authentication method (string, optional, "private_key", "json_key" or "compute_engine", default: "private_key")
- **service_account_email**: email address of the Google Cloud Storage service account (string, required when auth_method is private_key)
- **p12_keyfile**: full path of the P12 key file (string, required when auth_method is private_key)
- **json_keyfile**: full path of the JSON key file (string, required when auth_method is json_key)
- **application_name**: any application name you like (string, optional)

Example
--------

```yaml
in:
  type:
    source: maven
    group: org.embulk
    name: gcs
    version: "0.5.0"
  bucket: my-gcs-bucket
  path_prefix: logs/csv-
  auth_method: private_key #default
  service_account_email: ABCXYZ123ABCXYZ123.gserviceaccount.com
  p12_keyfile: /path/to/p12_keyfile.p12
  application_name: Anything you like
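  # Assumed alternative (the object keys below are made up): list exact object
  # keys with "paths" instead of "path_prefix"; one of the two is required.
  #paths:
  #  - logs/csv-20250101.csv
  #  - logs/csv-20250102.csv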
```

Example for "sample_01.csv.gz", generated by [embulk example](https://github.com/embulk/embulk#trying-examples):
```yaml
in:
  type:
    source: maven
    group: org.embulk
    name: gcs
    version: "0.5.0"
  bucket: my-gcs-bucket
  path_prefix: sample_
  auth_method: private_key #default
  service_account_email: ABCXYZ123ABCXYZ123.gserviceaccount.com
  p12_keyfile: /path/to/p12_keyfile.p12
  application_name: Anything you like
  decoders:
  - {type: gzip}
  parser:
    charset: UTF-8
    newline: CRLF
    type: csv
    delimiter: ','
    quote: '"'
    header_line: true
    columns:
    - {name: id, type: long}
    - {name: account, type: long}
    - {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
    - {name: purchase, type: timestamp, format: '%Y%m%d'}
    - {name: comment, type: string}
out: {type: stdout}
```

To skip files using regexp:
```yaml
in:
  type:
    source: maven
    group: org.embulk
    name: gcs
    version: "0.5.0"
  bucket: my-gcs-bucket
  path_prefix: logs/csv-
  # ...
  path_match_pattern: \.csv$ # a file is skipped if its path doesn't match this pattern
  ## some examples of regexp:
  #path_match_pattern: /archive/ # match files in .../archive/... directory
  #path_match_pattern: /data1/|/data2/ # match files in .../data1/... or .../data2/... directory
  #path_match_pattern: \.csv$|\.csv\.gz$ # match files whose suffix is .csv or .csv.gz
```

Authentication
---------------

There are three supported methods to fetch an access token for the service account.

1. Public-private key pair of a GCP (Google Cloud Platform) service account
2. JSON key of a GCP (Google Cloud Platform) service account
3. Pre-defined access token (Google Compute Engine only)

### Public-Private key pair of GCP's service account

First create a service account (client ID), download its private key, and deploy the key alongside Embulk.
```yaml
in:
  type:
    source: maven
    group: org.embulk
    name: gcs
    version: "0.5.0"
  auth_method: private_key
  service_account_email: ABCXYZ123ABCXYZ123.gserviceaccount.com
  p12_keyfile: /path/to/p12_keyfile.p12
```

### JSON key of GCP's service account

First create a service account (client ID), download its JSON key, and deploy the key alongside Embulk.
```yaml
in:
  type:
    source: maven
    group: org.embulk
    name: gcs
    version: "0.5.0"
  auth_method: json_key
  json_keyfile: /path/to/json_keyfile.json
```

You can also embed the contents of the JSON key file in config.yml.
```yaml
in:
  type:
    source: maven
    group: org.embulk
    name: gcs
    version: "0.5.0"
  auth_method: json_key
  json_keyfile:
    content: |
      {
        "private_key_id": "123456789",
        "private_key": "-----BEGIN PRIVATE KEY-----\nABCDEF",
        "client_email": "..."
      }
```

### Pre-defined access token (GCE only)

When you run Embulk on Google Compute Engine, you don't need to create a service account for Embulk explicitly.
For this third authentication method, add the API scope "https://www.googleapis.com/auth/devstorage.read_only" to the scope list of your
Compute Engine VM instance (see [Setting the scope of service account access for instances](https://cloud.google.com/compute/docs/authentication)), then configure Embulk like this.
```yaml
in:
  type:
    source: maven
    group: org.embulk
    name: gcs
    version: "0.5.0"
  auth_method: compute_engine
```

Eventual Consistency
-----------------------

Listing objects is eventually consistent, although getting an object is strongly consistent; see https://cloud.google.com/storage/docs/consistency.
Because `path_prefix` uses the objects list API, it may miss some objects.
To avoid this, use the `paths` option, which specifies object paths directly without going through the objects list API.

For Maintainers
----------------

### Build
```
./gradlew jar
```

### Test

To run the unit tests, configure the environment variables listed below.
In addition, upload the following files to an existing GCS bucket.

* [sample_01.csv](./src/test/resources/sample_01.csv)
* [sample_02.csv](./src/test/resources/sample_02.csv)

When the environment variables are not set, some test cases are skipped.
```
GCP_EMAIL
GCP_P12_KEYFILE
GCP_JSON_KEYFILE
GCP_BUCKET
GCP_BUCKET_DIRECTORY (optional, if needed)
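```

If you're using Mac OS X El Capitan and run the tests from a GUI application (such as an IDE), set the environment variables as follows.

```
$ vi ~/Library/LaunchAgents/environment.plist
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>my.startup</string>
  <key>ProgramArguments</key>
  <array>
    <string>sh</string>
    <string>-c</string>
    <string>
    launchctl setenv GCP_EMAIL ABCXYZ123ABCXYZ123.gserviceaccount.com
    launchctl setenv GCP_P12_KEYFILE /path/to/p12_keyfile.p12
    launchctl setenv GCP_JSON_KEYFILE /path/to/json_keyfile.json
    launchctl setenv GCP_BUCKET my-bucket
    launchctl setenv GCP_BUCKET_DIRECTORY unittests
    </string>
  </array>
  <key>RunAtLoad</key>
  <true/>
</dict>
</plist>
$ launchctl load ~/Library/LaunchAgents/environment.plist
$ launchctl getenv GCP_EMAIL  # try to get the value
```

Then start your applications.

### Release
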
Modify `version` in `build.gradle` at a detached commit, and then tag the commit with an annotation.
```
git checkout --detach master
(Edit: Remove "-SNAPSHOT" in "version" in build.gradle.)
git add build.gradle
git commit -m "Release vX.Y.Z"
git tag -a vX.Y.Z
(Edit: Write a tag annotation in the changelog format.)
```

See [Keep a Changelog](https://keepachangelog.com/en/1.0.0/) for the changelog format. We adopt a part of it for Git's tag annotation like below.

```
## [X.Y.Z] - YYYY-MM-DD

### Added
- Added a feature.

### Changed
- Changed something.

### Fixed
- Fixed a bug.
```

Then push the annotated tag. It triggers a release operation on GitHub Actions after approval.
```
git push -u origin vX.Y.Z
```