{"id":19651289,"url":"https://github.com/embulk/embulk-input-gcs","last_synced_at":"2025-09-03T12:38:04.545Z","repository":{"id":27993325,"uuid":"31487450","full_name":"embulk/embulk-input-gcs","owner":"embulk","description":"Embulk plugin that loads records from Google Cloud Storage","archived":false,"fork":false,"pushed_at":"2025-03-15T08:01:26.000Z","size":525,"stargazers_count":14,"open_issues_count":1,"forks_count":8,"subscribers_count":14,"default_branch":"master","last_synced_at":"2025-07-27T15:21:58.636Z","etag":null,"topics":["embulk","embulk-input-plugin","embulk-plugin","gcp","google-cloud-storage"],"latest_commit_sha":null,"homepage":"https://github.com/embulk/embulk-input-gcs","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/embulk.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2015-03-01T05:06:39.000Z","updated_at":"2025-03-15T07:57:24.000Z","dependencies_parsed_at":"2025-05-20T01:18:30.140Z","dependency_job_id":null,"html_url":"https://github.com/embulk/embulk-input-gcs","commit_stats":null,"previous_names":[],"tags_count":29,"template":false,"template_full_name":null,"purl":"pkg:github/embulk/embulk-input-gcs","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/embulk%2Fembulk-input-gcs","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/embulk%2Fembulk-input-gcs/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/embulk%2Fembulk-input-gcs/releases","manifests_url":"https://repos
.ecosyste.ms/api/v1/hosts/GitHub/repositories/embulk%2Fembulk-input-gcs/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/embulk","download_url":"https://codeload.github.com/embulk/embulk-input-gcs/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/embulk%2Fembulk-input-gcs/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":273445343,"owners_count":25107148,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-03T02:00:09.631Z","response_time":76,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["embulk","embulk-input-plugin","embulk-plugin","gcp","google-cloud-storage"],"created_at":"2024-11-11T15:05:58.091Z","updated_at":"2025-09-03T12:38:04.476Z","avatar_url":"https://github.com/embulk.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"Google Cloud Storage file input plugin for Embulk\n===================================================\n\nOverview\n---------\n\nembulk-input-gcs v0.5.0+ requires Embulk v0.11.0+.\n\n* Plugin type: **file input**\n* Resume supported: **yes**\n* Cleanup supported: **yes**\n\nUsage\n------\n\n### Install plugin\n\n```\njava -jar embulk-X.Y.Z.jar install \"org.embulk:embulk-input-gcs:0.5.0\"\n```\n\n### Google Service Account Settings\n\nIf you chose \"private_key\" or \"json_key\" as 
[auth_method](#Authentication), you can obtain the service_account_email and the private_key or json_key as follows.\n\n1. Create a project in the [Google Developers Console](https://console.developers.google.com/project).\n\n1. Create a \"Service Account\" by following [these steps](https://cloud.google.com/storage/docs/authentication#service_accounts).\n\n    A Service Account has two specific scopes: read-only and read-write.\n\n    embulk-input-gcs can run with the \"read-only\" scope.\n\n1. Generate a private key in P12 (PKCS12) format or a json_key, and upload it to the machine.\n\n### run\n\n```\njava -jar embulk-X.Y.Z.jar run /path/to/config.yml\n```\n\n## Configuration\n\n- **bucket** Google Cloud Storage bucket name (string, required)\n- **path_prefix** prefix of target keys (string, either \"path_prefix\" or \"paths\" is required)\n- **paths** list of target keys (array of string, either \"path_prefix\" or \"paths\" is required)\n- **path_match_pattern**: regexp to match file paths. If a file path doesn't match this pattern, the file will be skipped (regexp string, optional)\n- **incremental**: enables incremental loading (boolean, optional. default: true). If incremental loading is enabled, the config diff for the next execution will include the `last_path` parameter so that the next execution skips files before that path. Otherwise, `last_path` will not be included.\n- **auth_method**  (string, optional, \"private_key\", \"json_key\" or \"compute_engine\". 
default value is \"private_key\")\n- **service_account_email** Google Cloud Storage service_account_email (string, required when auth_method is private_key)\n- **p12_keyfile** full path of the p12 key (string, required when auth_method is private_key)\n- **json_keyfile** full path of the json_key (string, required when auth_method is json_key)\n- **application_name** any application name you like (string, optional)\n\nExample\n--------\n\n```yaml\nin:\n  type:\n    source: maven\n    group: org.embulk\n    name: gcs\n    version: \"0.5.0\"\n  bucket: my-gcs-bucket\n  path_prefix: logs/csv-\n  auth_method: private_key #default\n  service_account_email: ABCXYZ123ABCXYZ123.gserviceaccount.com\n  p12_keyfile: /path/to/p12_keyfile.p12\n  application_name: Anything you like\n```\n\nExample for \"sample_01.csv.gz\", generated by [embulk example](https://github.com/embulk/embulk#trying-examples)\n\n```yaml\nin:\n  type:\n    source: maven\n    group: org.embulk\n    name: gcs\n    version: \"0.5.0\"\n  bucket: my-gcs-bucket\n  path_prefix: sample_\n  auth_method: private_key #default\n  service_account_email: ABCXYZ123ABCXYZ123.gserviceaccount.com\n  p12_keyfile: /path/to/p12_keyfile.p12\n  application_name: Anything you like\n  decoders:\n  - {type: gzip}\n  parser:\n    charset: UTF-8\n    newline: CRLF\n    type: csv\n    delimiter: ','\n    quote: '\"'\n    header_line: true\n    columns:\n    - {name: id, type: long}\n    - {name: account, type: long}\n    - {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}\n    - {name: purchase, type: timestamp, format: '%Y%m%d'}\n    - {name: comment, type: string}\nout: {type: stdout}\n```\n\nTo skip files using regexp:\n\n```yaml\nin:\n  type:\n    source: maven\n    group: org.embulk\n    name: gcs\n    version: \"0.5.0\"\n  bucket: my-gcs-bucket\n  path_prefix: logs/csv-\n  # ...\n  path_match_pattern: \.csv$   # a file will be skipped if its path doesn't match this pattern\n  ## some examples of regexp:\n  
#path_match_pattern: /archive/         # match files in .../archive/... directory\n  #path_match_pattern: /data1/|/data2/   # match files in .../data1/... or .../data2/... directory\n  #path_match_pattern: .csv$|.csv.gz$    # match files whose suffix is .csv or .csv.gz\n```\n\nAuthentication\n---------------\n\nThree methods are supported for fetching an access token for the service account.\n\n1. Public-Private key pair of GCP (Google Cloud Platform)'s service account\n2. JSON key of GCP (Google Cloud Platform)'s service account\n3. Pre-defined access token (Google Compute Engine only)\n\n### Public-Private key pair of GCP's service account\n\nYou first need to create a service account (client ID), download its private key and deploy the key with embulk.\n\n```yaml\nin:\n  type:\n    source: maven\n    group: org.embulk\n    name: gcs\n    version: \"0.5.0\"\n  auth_method: private_key\n  service_account_email: ABCXYZ123ABCXYZ123.gserviceaccount.com\n  p12_keyfile: /path/to/p12_keyfile.p12\n```\n\n### JSON key of GCP's service account\n\nYou first need to create a service account (client ID), download its json key and deploy the key with embulk.\n\n```yaml\nin:\n  type:\n    source: maven\n    group: org.embulk\n    name: gcs\n    version: \"0.5.0\"\n  auth_method: json_key\n  json_keyfile: /path/to/json_keyfile.json\n```\n\nYou can also embed the contents of json_keyfile in config.yml.\n\n```yaml\nin:\n  type:\n    source: maven\n    group: org.embulk\n    name: gcs\n    version: \"0.5.0\"\n  auth_method: json_key\n  json_keyfile:\n    content: |\n      {\n          \"private_key_id\": \"123456789\",\n          \"private_key\": \"-----BEGIN PRIVATE KEY-----\\nABCDEF\",\n          \"client_email\": \"...\"\n       }\n```\n\n### Pre-defined access token(GCE only)\n\nOn the other hand, you don't need to explicitly create a service account for embulk when you\nrun embulk in Google Compute Engine. 
In this third authentication method, you need to\nadd the API scope \"https://www.googleapis.com/auth/devstorage.read_only\" to the scope list of your\nCompute Engine VM instance, then you can configure embulk like this.\n\n[Setting the scope of service account access for instances](https://cloud.google.com/compute/docs/authentication)\n\n```yaml\nin:\n  type:\n    source: maven\n    group: org.embulk\n    name: gcs\n    version: \"0.5.0\"\n  auth_method: compute_engine\n```\n\nEventual Consistency\n---------------------\n\nListing objects is eventually consistent, although getting objects is strongly consistent; see https://cloud.google.com/storage/docs/consistency.\n\n`path_prefix` uses the objects list API, so it may miss some objects.\nIf you want to avoid such situations, you should use the `paths` option, which directly specifies object paths without the objects list API.\n\nFor Maintainers\n----------------\n\n### Build\n\n```\n./gradlew jar\n```\n\n### Test\n\nTo run unit tests, we need to configure the following environment variables.\n\nAdditionally, the following files need to be uploaded to an existing GCS bucket.\n* [sample_01.csv](./src/test/resources/sample_01.csv)\n* [sample_02.csv](./src/test/resources/sample_02.csv)\n\nWhen the environment variables are not set, some test cases are skipped.\n\n```\nGCP_EMAIL\nGCP_P12_KEYFILE\nGCP_JSON_KEYFILE\nGCP_BUCKET\nGCP_BUCKET_DIRECTORY(optional, if needed)\n```\n\nIf you're using Mac OS X El Capitan and GUI applications (IDE), set the variables as follows.\n```\n$ vi ~/Library/LaunchAgents/environment.plist\n\u003c?xml version=\"1.0\" encoding=\"UTF-8\"?\u003e\n\u003c!DOCTYPE plist PUBLIC \"-//Apple//DTD PLIST 1.0//EN\" \"http://www.apple.com/DTDs/PropertyList-1.0.dtd\"\u003e\n\u003cplist version=\"1.0\"\u003e\n\u003cdict\u003e\n  \u003ckey\u003eLabel\u003c/key\u003e\n  \u003cstring\u003emy.startup\u003c/string\u003e\n  \u003ckey\u003eProgramArguments\u003c/key\u003e\n  \u003carray\u003e\n    
\u003cstring\u003esh\u003c/string\u003e\n    \u003cstring\u003e-c\u003c/string\u003e\n    \u003cstring\u003e\n      launchctl setenv GCP_EMAIL ABCXYZ123ABCXYZ123.gserviceaccount.com\n      launchctl setenv GCP_P12_KEYFILE /path/to/p12_keyfile.p12\n      launchctl setenv GCP_JSON_KEYFILE /path/to/json_keyfile.json\n      launchctl setenv GCP_BUCKET my-bucket\n      launchctl setenv GCP_BUCKET_DIRECTORY unittests\n    \u003c/string\u003e\n  \u003c/array\u003e\n  \u003ckey\u003eRunAtLoad\u003c/key\u003e\n  \u003ctrue/\u003e\n\u003c/dict\u003e\n\u003c/plist\u003e\n\n$ launchctl load ~/Library/LaunchAgents/environment.plist\n$ launchctl getenv GCP_EMAIL //try to get value.\n\nThen start your applications.\n```\n\n### Release\n\nModify `version` in `build.gradle` at a detached commit, and then tag the commit with an annotation.\n\n```\ngit checkout --detach master\n\n(Edit: Remove \"-SNAPSHOT\" in \"version\" in build.gradle.)\n\ngit add build.gradle\n\ngit commit -m \"Release vX.Y.Z\"\n\ngit tag -a vX.Y.Z\n\n(Edit: Write a tag annotation in the changelog format.)\n```\n\nSee [Keep a Changelog](https://keepachangelog.com/en/1.0.0/) for the changelog format. We adopt part of it for Git's tag annotation, like below.\n\n```\n## [X.Y.Z] - YYYY-MM-DD\n\n### Added\n- Added a feature.\n\n### Changed\n- Changed something.\n\n### Fixed\n- Fixed a bug.\n```\n\nThen push the annotated tag. This triggers a release operation on GitHub Actions after approval.\n\n```\ngit push -u origin vX.Y.Z\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fembulk%2Fembulk-input-gcs","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fembulk%2Fembulk-input-gcs","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fembulk%2Fembulk-input-gcs/lists"}