https://github.com/embulk/embulk-input-azure_blob_storage

Microsoft Azure Blob Storage file input plugin for Embulk

# Azure Blob Storage file input plugin for Embulk
[![Build Status](https://travis-ci.org/embulk/embulk-input-azure_blob_storage.svg?branch=master)](https://travis-ci.org/embulk/embulk-input-azure_blob_storage)

[Embulk](http://www.embulk.org/) file input plugin that reads files stored on [Microsoft Azure](https://azure.microsoft.com/) [Blob Storage](https://azure.microsoft.com/en-us/documentation/articles/storage-introduction/#blob-storage).

embulk-input-azure_blob_storage v0.2.0+ requires Embulk v0.9.12+.

## Overview

* **Plugin type**: file input
* **Resume supported**: no
* **Cleanup supported**: yes

## Configuration

First, create an Azure [Storage Account](https://azure.microsoft.com/en-us/documentation/articles/storage-create-storage-account/).

- **account_name**: storage account name (string, required)
- **account_key**: primary access key (string, required)
- **container**: name of the container where the data is stored (string, required)
- **path_prefix**: prefix of target keys (string, required)
- **incremental**: enables incremental loading (boolean, optional, default: true). If incremental loading is enabled, the config diff for the next execution includes a `last_path` parameter so that the next execution skips the files before that path. Otherwise, `last_path` is not included.
- **path_match_pattern**: regexp to match file paths. A file whose path doesn't match this pattern is skipped (regexp string, optional)
- **total_file_count_limit**: maximum number of files to read (integer, optional)
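
As a rough illustration of how `path_prefix`, `last_path`, `path_match_pattern`, and `total_file_count_limit` might interact, here is a minimal Python sketch. It is not the plugin's code: the function `select_blobs`, the lexicographic ordering, the choice to skip paths up to and including `last_path`, and applying the limit after filtering are all assumptions made for illustration.

```python
import re
from typing import Iterable, List, Optional


def select_blobs(
    blob_names: Iterable[str],
    path_prefix: str,
    last_path: Optional[str] = None,
    path_match_pattern: Optional[str] = None,
    total_file_count_limit: Optional[int] = None,
) -> List[str]:
    """Hypothetical model of which blobs a single run would read."""
    pattern = re.compile(path_match_pattern) if path_match_pattern else None
    selected = []
    # Assumption: blobs are considered in lexicographic order of their names.
    for name in sorted(blob_names):
        if not name.startswith(path_prefix):
            continue  # outside path_prefix
        if last_path is not None and name <= last_path:
            continue  # assumed to be loaded by a previous incremental run
        if pattern is not None and not pattern.search(name):
            continue  # path does not match path_match_pattern
        selected.append(name)
    # Assumption: the limit is applied after all other filters.
    if total_file_count_limit is not None:
        selected = selected[:total_file_count_limit]
    return selected


# Made-up blob names for the example.
blobs = [
    "logs/csv-01.csv",
    "logs/csv-02.csv",
    "logs/csv-03.csv.gz",
    "logs/readme.txt",
]

first_run = select_blobs(blobs, "logs/csv-", total_file_count_limit=2)
# With incremental loading, the last path read would be carried over
# to the next run as last_path.
next_last_path = first_run[-1]
second_run = select_blobs(blobs, "logs/csv-", last_path=next_last_path)

print(first_run)
print(second_run)
```

Under these assumptions, the first run reads the first two matching files, and the second run continues from the recorded `last_path`.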

### Proxy configuration

- **proxy**:
  - **type**: (string, required, default: `null`)
    - **http**: use HTTP Proxy
  - **host**: (string, required)
  - **port**: (int, required, default: `8080`)
  - **user**: (string, optional)
  - **password**: (string, optional)

## Example

```yaml
in:
  type: azure_blob_storage
  account_name: myaccount
  account_key: myaccount_key
  container: my-container
  path_prefix: logs/csv-
```

Example for "sample_01.csv.gz", generated by [embulk example](https://github.com/embulk/embulk#trying-examples):

```yaml
in:
  type: azure_blob_storage
  account_name: myaccount
  account_key: myaccount_key
  container: my-container
  path_prefix: logs/csv-
  decoders:
  - {type: gzip}
  parser:
    charset: UTF-8
    newline: CRLF
    type: csv
    delimiter: ','
    quote: '"'
    header_line: true
    columns:
    - {name: id, type: long}
    - {name: account, type: long}
    - {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
    - {name: purchase, type: timestamp, format: '%Y%m%d'}
    - {name: comment, type: string}
out: {type: stdout}
```

To filter files using regexp:

```yaml
in:
  type: azure_blob_storage
  path_prefix: logs/csv-
  ...
  path_match_pattern: \.csv$ # a file is skipped if its path doesn't match this pattern

  ## some examples of regexp:
  #path_match_pattern: /archive/ # match files in .../archive/... directory
  #path_match_pattern: /data1/|/data2/ # match files in .../data1/... or .../data2/... directory
  #path_match_pattern: \.csv$|\.csv\.gz$ # match files whose suffix is .csv or .csv.gz
```
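
To try a pattern before running a job, a small Python sketch like the following may help. It assumes the pattern is applied as an unanchored search against the full path, and it uses Python's `re` module as a stand-in for the plugin's regexp engine, so syntax details may differ. The sample paths and the helper `matching` are made up for illustration.

```python
import re
from typing import List

# Made-up sample paths.
paths = [
    "logs/csv-2016.csv",
    "logs/csv-2016.csv.gz",
    "logs/archive/csv-2015.csv",
    "logs/csv-notes.txt",
    "logs/tsv-as-csv",
]


def matching(pattern: str) -> List[str]:
    """Return the sample paths that the pattern matches anywhere in the path."""
    regex = re.compile(pattern)
    return [p for p in paths if regex.search(p)]


csv_only = matching(r"\.csv$")
archived = matching(r"/archive/")
csv_or_gz = matching(r"\.csv$|\.csv\.gz$")
# An unescaped "." matches any character, so this also accepts "...-csv".
unescaped = matching(r".csv$")

for label, result in [
    ("csv_only", csv_only),
    ("archived", archived),
    ("csv_or_gz", csv_or_gz),
    ("unescaped", unescaped),
]:
    print(label, result)
```

The last case shows why the dot is worth escaping: `.csv$` also accepts a path that merely ends in `csv` preceded by any character.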

With proxy:

```yaml
in:
  type: azure_blob_storage
  ...
  proxy:
    type: http
    host: proxy_host
    port: 8080
    user: proxy_user
    password: proxy_secret_pass
```

## Build

```
$ ./gradlew gem # -t to watch change of files and rebuild continuously
```

## Test

```
$ ./gradlew test # -t to watch change of files and rebuild continuously
```

To run the unit tests, configure the environment variables listed below.

Additionally, the following files need to be uploaded to an existing Azure Blob Storage container:

* [sample_01.csv](src/test/resources/sample_01.csv)
* [sample_02.csv](src/test/resources/sample_02.csv)
* [missing_02.csv](src/test/resources/missing_02.csv)
* [missing_03.csv](src/test/resources/missing_03.csv)

When the environment variables are not set, some test cases are skipped.

```
AZURE_ACCOUNT_NAME
AZURE_ACCOUNT_KEY
AZURE_CONTAINER
AZURE_CONTAINER_IMPORT_DIRECTORY (optional, if needed)
```
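
Before running the tests, it can be handy to check which of these variables are set. The following Python sketch is only an illustration: the helper `missing_variables` is hypothetical, and treating the first three variables as required is an assumption based on the list above.

```python
import os
from typing import List, Mapping

# Assumption: these three are needed for the storage-backed test cases.
REQUIRED = ("AZURE_ACCOUNT_NAME", "AZURE_ACCOUNT_KEY", "AZURE_CONTAINER")
# Marked optional in the list above.
OPTIONAL = ("AZURE_CONTAINER_IMPORT_DIRECTORY",)


def missing_variables(environ: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED if not environ.get(name)]


missing = missing_variables()
if missing:
    print("Not set (some test cases may be skipped):", ", ".join(missing))
else:
    print("All required variables are set.")
```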

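If you're using Mac OS X El Capitan and GUI applications (such as an IDE), set the environment variables as follows.

```xml
$ vi ~/Library/LaunchAgents/environment.plist
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>my.startup</string>
  <key>ProgramArguments</key>
  <array>
    <string>sh</string>
    <string>-c</string>
    <string>
      launchctl setenv AZURE_ACCOUNT_NAME my-account-name
      launchctl setenv AZURE_ACCOUNT_KEY my-account-key
      launchctl setenv AZURE_CONTAINER my-container
      launchctl setenv AZURE_CONTAINER_IMPORT_DIRECTORY unittests
    </string>
  </array>
  <key>RunAtLoad</key>
  <true/>
</dict>
</plist>

$ launchctl load ~/Library/LaunchAgents/environment.plist
$ launchctl getenv AZURE_ACCOUNT_NAME # try to get the value
```

Then start your applications.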