{"id":19651273,"url":"https://github.com/embulk/embulk-input-s3","last_synced_at":"2025-04-07T12:05:22.425Z","repository":{"id":26876391,"uuid":"30336945","full_name":"embulk/embulk-input-s3","owner":"embulk","description":"S3 file input plugin for Embulk","archived":false,"fork":false,"pushed_at":"2024-01-09T22:30:29.000Z","size":639,"stargazers_count":41,"open_issues_count":5,"forks_count":29,"subscribers_count":13,"default_branch":"master","last_synced_at":"2025-03-31T10:06:21.384Z","etag":null,"topics":["embulk","s3"],"latest_commit_sha":null,"homepage":"https://www.embulk.org/","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/embulk.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2015-02-05T04:03:40.000Z","updated_at":"2024-05-13T05:17:01.000Z","dependencies_parsed_at":"2025-01-07T06:13:16.139Z","dependency_job_id":null,"html_url":"https://github.com/embulk/embulk-input-s3","commit_stats":{"total_commits":242,"total_committers":16,"mean_commits":15.125,"dds":0.8223140495867769,"last_synced_commit":"7efefde6a4105eaacc47d0e08d6500b108784382"},"previous_names":[],"tags_count":42,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/embulk%2Fembulk-input-s3","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/embulk%2Fembulk-input-s3/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/embulk%2Fembulk-input-s3/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/e
mbulk%2Fembulk-input-s3/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/embulk","download_url":"https://codeload.github.com/embulk/embulk-input-s3/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247648976,"owners_count":20972945,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["embulk","s3"],"created_at":"2024-11-11T15:05:55.242Z","updated_at":"2025-04-07T12:05:22.374Z","avatar_url":"https://github.com/embulk.png","language":"Java","funding_links":[],"categories":["Java"],"sub_categories":[],"readme":"# S3 file input plugin for Embulk\n\n## Overview\n\nembulk-input-s3 v0.5.0+ requires Embulk v0.9.23+\n\n* Plugin type: **file input**\n* Resume supported: **yes**\n* Cleanup supported: **yes**\n\n## Configuration\n\n- **bucket** S3 bucket name (string, required)\n\n- **path_prefix** prefix of target keys (string, optional)\n\n- **path** the direct path to target key (string, optional)\n  - **Note:** Either **path** or **path_prefix** must be set. Both may be set at the same time, in which case **path** is used.\n\n- **endpoint** S3 endpoint (string, optional)\n\n- **region** S3 region. If both endpoint and region are specified, endpoint takes effect (string, optional)\n\n- **http_proxy** HTTP proxy configuration to use when accessing AWS S3 via an HTTP proxy. 
(optional)\n  - **host** proxy host (string, required)\n  - **port** proxy port (int, optional)\n  - **https** whether to use https (boolean, default true)\n  - **user** proxy user (string, optional)\n  - **password** proxy password (string, optional)\n\n- **auth_method**: name of mechanism to authenticate requests (basic, env, instance, profile, properties, anonymous, session, or default; defaults to basic)\n\n  - \"basic\": uses access_key_id and secret_access_key to authenticate.\n\n    - **access_key_id**: AWS access key ID (string, required)\n\n    - **secret_access_key**: AWS secret access key (string, required)\n\n  - \"env\": uses AWS_ACCESS_KEY_ID (or AWS_ACCESS_KEY) and AWS_SECRET_KEY (or AWS_SECRET_ACCESS_KEY) environment variables.\n\n  - \"instance\": uses EC2 instance profile.\n\n  - \"profile\": uses credentials written in a file. The format of the file is as follows, where `[...]` is the name of a profile.\n\n    - **profile_file**: path to a profiles file. (string, default: given by the AWS_CREDENTIAL_PROFILES_FILE environment variable, or ~/.aws/credentials).\n\n    - **profile_name**: name of a profile. (string, default: `\"default\"`)\n\n    ```\n    [default]\n    aws_access_key_id=YOUR_ACCESS_KEY_ID\n    aws_secret_access_key=YOUR_SECRET_ACCESS_KEY\n\n    [profile2]\n    ...\n    ```\n\n  - \"properties\": uses aws.accessKeyId and aws.secretKey Java system properties.\n\n  - \"anonymous\": uses anonymous access. This auth method can access only public files.\n\n  - \"session\": uses temporarily generated access_key_id, secret_access_key and session_token.\n\n    - **access_key_id**: AWS access key ID (string, required)\n\n    - **secret_access_key**: AWS secret access key (string, required)\n\n    - **session_token**: session token (string, required)\n\n  - \"default\": uses AWS SDK's default strategy to look up available credentials from the runtime environment. This method behaves like the combination of the following methods.\n\n    1. \"env\"\n    1. 
\"properties\"\n    1. \"profile\"\n    1. \"instance\"\n\n\n* **path_match_pattern**: regexp to match file paths. If a file path doesn't match this pattern, the file will be skipped (regexp string, optional)\n\n* **total_file_count_limit**: maximum number of files to read (integer, optional)\n\n* **min_task_size** (experimental): minimum byte size of a task. If this is larger than 0, one task includes multiple input files until their total size reaches this value. This is useful if too many tasks badly impact the performance of output or executor plugins. (integer, optional)\n\n* **use_modified_time**: use last modified time to filter files to read if enabled, otherwise last path is used (boolean, optional, default false)\n\n* **last_modified_time**: files are read if they were modified after this time. Timezone is UTC. If this parameter is empty, all files are read. Timestamp format: `yyyy-MM-dd'T'HH:mm:ss.SSSZ` (string, optional)\n\n* **last_path**: files are read if their path comes after `last_path` in lexicographical order (string, optional, default empty)\n\n* **incremental**: enables incremental loading. 
If incremental loading is enabled, config diff for the next execution will include:\n  * `last_modified_time` if `use_modified_time` is enabled, the next execution skips files before the modified time\n  * `last_path` if `use_modified_time` is disabled, the next execution skips files before the path.\n\n* **skip_glacier_objects**: if true, skip processing objects stored in Amazon Glacier (boolean, default false)\n\n\n## Example\n\n```yaml\nin:\n  type: s3\n  bucket: my-s3-bucket\n  path_prefix: logs/csv-\n  endpoint: s3-us-west-1.amazonaws.com\n  access_key_id: ABCXYZ123ABCXYZ123\n  secret_access_key: AbCxYz123aBcXyZ123\n```\n\nTo set http proxy configuration:\n\n```yaml\nin:\n  type: s3\n  bucket: my-s3-bucket\n  path_prefix: logs/csv-\n  endpoint: s3-us-west-1.amazonaws.com\n  access_key_id: ABCXYZ123ABCXYZ123\n  secret_access_key: AbCxYz123aBcXyZ123\n  http_proxy:\n    host: localhost\n    port: 8080\n```\n\nTo skip files using regexp:\n\n```yaml\nin:\n  type: s3\n  bucket: my-s3-bucket\n  path_prefix: logs/csv-\n  # ...\n  path_match_pattern: \\.csv$   # a file will be skipped if its path doesn't match with this pattern\n\n  ## some examples of regexp:\n  #path_match_pattern: /archive/         # match files in .../archive/... directory\n  #path_match_pattern: /data1/|/data2/   # match files in .../data1/... or .../data2/... 
directory\n  #path_match_pattern: .csv$|.csv.gz$    # match files whose suffix is .csv or .csv.gz\n```\n\nTo use AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables:\n\n```yaml\nin:\n  type: s3\n  bucket: my-s3-bucket\n  path_prefix: logs/csv-\n  endpoint: s3-us-west-1.amazonaws.com\n  auth_method: env\n```\n\nFor public S3 buckets such as `landsat-pds`:\n\n```yaml\nin:\n  type: s3\n  bucket: landsat-pds\n  path_prefix: scene_list.gz\n  auth_method: anonymous\n```\n\n## Build\n\n```\n./gradlew gem\n```\n\n## Release\n\n```\n./gradlew clean gem classpath\n./gradlew gemPush # release plugin gems to RubyGems.org\n./gradlew bintrayUpload # release embulk-(input-s3|input-riak_cs|util-aws-credentials) to Bintray maven repo\n```\n\n## Test\n\nTo run unit tests, we need to configure the following environment variables.\n```\nEMBULK_S3_TEST_BUCKET\nEMBULK_S3_TEST_ACCESS_KEY_ID\nEMBULK_S3_TEST_SECRET_ACCESS_KEY\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fembulk%2Fembulk-input-s3","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fembulk%2Fembulk-input-s3","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fembulk%2Fembulk-input-s3/lists"}