https://github.com/sonots/embulk-filter-timestamp_format


# Timestamp format filter plugin for Embulk

[![Build Status](https://secure.travis-ci.org/sonots/embulk-filter-timestamp_format.png?branch=master)](http://travis-ci.org/sonots/embulk-filter-timestamp_format)

A filter plugin for Embulk to change timestamp format

## Configuration

- **columns**: columns to retain (array of hash)
  - **name**: name of column (required)
  - **type**: type to cast to, one of `string`, `timestamp`, `long` (unixtimestamp), `double` (unixtimestamp) (string, default is `string`)
  - **from_format**: formats of the input string (array of strings, default is `default_from_timestamp_format`)
  - **from_timezone**: timezone of the input string (string, default is `default_from_timezone`)
  - **to_format**: format of the output string (string, default is `default_to_timestamp_format`)
  - **to_timezone**: timezone of the output string (string, default is `default_to_timezone`)
  - **from_unit**: time unit of the input unixtimestamp (string, default is `default_from_timestamp_unit`)
  - **to_unit**: time unit of the output unixtimestamp (string, default is `default_to_timestamp_unit`)
- **default_from_timestamp_format**: default timestamp formats for the input string (array of strings, default is `["%Y-%m-%d %H:%M:%S.%N %z"]`)
- **default_from_timezone**: default timezone for the input string (string, default is `UTC`)
- **default_to_timestamp_format**: default timestamp format for the output string (string, default is `%Y-%m-%d %H:%M:%S.%N %z`)
- **default_to_timezone**: default timezone for the output string (string, default is `UTC`)
- **default_from_timestamp_unit**: default time unit for the input unixtimestamp, such as `sec` (second), `ms` (millisecond), `us` (microsecond), `ns` (nanosecond) (string, default is `second`)
- **default_to_timestamp_unit**: default time unit for the output unixtimestamp, such as `sec` (second), `ms` (millisecond), `us` (microsecond), `ns` (nanosecond) (string, default is `second`)
- **stop_on_invalid_record**: stop the bulk load transaction if an invalid record is found (boolean, default is `false`)
- **timestamp_parser** (experimental): set to `auto_java` to try converting Ruby formats to Java formats so that the faster Java timestamp parser can be used (string, default is `auto`)
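
As a rough illustration of how per-column options override the `default_*` settings, the sketch below configures two columns. The column names `created_at` and `updated_at` are hypothetical, and it assumes `updated_at` arrives as a unixtimestamp in milliseconds. Only options from the list above are used.

```yaml
filters:
  - type: timestamp_format
    # Defaults applied to every column that does not override them
    default_to_timezone: "Asia/Tokyo"
    default_to_timestamp_format: "%Y-%m-%d %H:%M:%S.%N %z"
    stop_on_invalid_record: true
    columns:
      # Hypothetical string column: override the input format and the
      # output format/timezone for this column only
      - name: created_at
        from_format: ["%Y-%m-%d %H:%M:%S %z"]
        to_format: "%Y-%m-%d"
        to_timezone: "UTC"
      # Hypothetical unixtimestamp column: input assumed to be in
      # milliseconds, output cast to a timestamp
      - {name: updated_at, type: timestamp, from_unit: ms}
```

Columns that set none of the per-column options fall back to the `default_*` values.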

## Example

Suppose `example.jsonl` contains the following (a typical format when exporting a BigQuery table):

```
{"timestamp":"2015-07-12 15:00:00 UTC","nested":{"timestamp":"2015-07-12 15:00:00 UTC"}}
{"timestamp":"2015-07-12 15:00:00.1 UTC","nested":{"timestamp":"2015-07-12 15:00:00.1 UTC"}}
```

With the following configuration:

```yaml
in:
  type: file
  path_prefix: example/example.jsonl
  parser:
    type: jsonl # not json parser
    columns:
      - {name: timestamp, type: string}
      - {name: nested, type: json}
filters:
  - type: timestamp_format
    default_from_timestamp_format: ["%Y-%m-%d %H:%M:%S.%N %z", "%Y-%m-%d %H:%M:%S %z"]
    default_to_timezone: "Asia/Tokyo"
    default_to_timestamp_format: "%Y-%m-%d %H:%M:%S.%N"
    columns:
      - {name: timestamp, type: long, to_unit: ms}
      - {name: $.nested.timestamp}
out:
  type: stdout
```

The output will be:

```
{"timestamp":1436713200000,"nested":{"timestamp":"2015-07-13 00:00:00.0"}}
{"timestamp":1436713200100,"nested":{"timestamp":"2015-07-13 00:00:00.1"}}
```

See [./example](./example) for more examples.

## JSONPath

For a `type: json` column, you can specify a [JSONPath](http://goessner.net/articles/JsonPath/) expression as the column name:

```
name: $.payload.key1
name: "$.payload.array[0]"
name: "$.payload.array[*]"
name: $['payload']['key1.key2']
```

The following JSONPath operators are not supported:

* Multiple properties such as `['name','name']`
* Multiple array indexes such as `[1,2]`
* Array slice such as `[1:2]`
* Filter expression such as `[?()]`
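
To show where such a name goes in a full filter definition, here is a minimal sketch. It assumes a hypothetical `json` column named `payload` with a `created_at` key and a `timestamps` array. Those names are illustrative and are not taken from the plugin's examples.

```yaml
filters:
  - type: timestamp_format
    default_to_timezone: "Asia/Tokyo"
    columns:
      # A single nested value (hypothetical key `created_at`)
      - {name: "$.payload.created_at", to_format: "%Y-%m-%d %H:%M:%S"}
      # Every element of an array (hypothetical key `timestamps`)
      - {name: "$.payload.timestamps[*]", to_format: "%Y-%m-%d"}
```

Quoting the JSONPath keeps characters such as `[`, `]` and `*` from being interpreted as YAML syntax.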

## JRuby Timestamp Parser Performance Issue

**NEWS: (2017/07/10) embulk 0.8.27 is released with a fast Timestamp jruby parser. This issue should be resolved, so Java Timestamp parser support will be dropped in future releases.**

Embulk's timestamp parser was originally implemented in JRuby, which is slow.
To improve performance, this plugin also supports Joda-Time [DateTimeFormat](http://joda-time.sourceforge.net/apidocs/org/joda/time/format/DateTimeFormat.html) patterns, for example:

```yaml
in:
  type: file
  path_prefix: example/example.jsonl
  parser:
    type: jsonl
    columns:
      - {name: timestamp, type: string}
      - {name: nested, type: json}
filters:
  - type: timestamp_format
    default_from_timestamp_format: ["yyyy-MM-dd HH:mm:ss.SSS z", "yyyy-MM-dd HH:mm:ss z", "yyyy-MM-dd HH:mm:ss"]
    default_to_timezone: "Asia/Taipei"
    default_to_timestamp_format: "yyyy-MM-dd HH:mm:ss.SSS Z"
    columns:
      - {name: timestamp, type: long, to_unit: ms}
      - {name: $.nested.timestamp}
out:
  type: stdout
```

If a format string contains `%`, the JRuby parser/formatter is used. Otherwise, the Java parser/formatter is used.

**Automatic Conversion of Ruby Timestamp Format to Java Timestamp Format** (experimental)

If you configure `timestamp_parser: auto_java`, this plugin tries to convert Ruby formats into Java formats automatically so that the faster Java timestamp parser can be used.
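
For instance, the filter from the Example section could opt in to the conversion as sketched below. The format strings stay in Ruby (`%`) syntax and only the `timestamp_parser` line is added. Whether a given format can actually be converted depends on the plugin, so treat this as an untested sketch.

```yaml
filters:
  - type: timestamp_format
    timestamp_parser: auto_java  # experimental: try to use the Java parser
    default_from_timestamp_format: ["%Y-%m-%d %H:%M:%S.%N %z", "%Y-%m-%d %H:%M:%S %z"]
    default_to_timezone: "Asia/Tokyo"
    default_to_timestamp_format: "%Y-%m-%d %H:%M:%S.%N"
    columns:
      - {name: timestamp, type: long, to_unit: ms}
      - {name: $.nested.timestamp}
```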

**COMPARISON:**

Benchmark test sets are available at [./bench](./bench). In my environment (MacBook Pro), for 1,000,000 timestamps:

* java parser + java formatter: 1.3s
* java parser + jruby formatter: 1.4s
* jruby parser + java formatter: 64.52s
* jruby parser + jruby formatter: 65.06s

The JRuby parser is slow, but the JRuby formatter is not nearly as slow.

## Nano Resolution

The JRuby parser has microsecond resolution. The Java (Joda-Time) parser has millisecond resolution.

Nanosecond resolution is partially supported by this plugin itself. Use the parser format `nnnnnnnnn` for the Java parser:

```
yyyy-MM-dd HH:mm:ss.nnnnnnnnn z
```

This plugin locates the fractional-second digits in the input text with the regular expression `\.(\d+)`.

For the formatter, you can use `nnnnnnnnn` for nanoseconds and `nnnnnn` for microseconds:

```
yyyy-MM-dd HH:mm:ss.nnnnnnnnn z
yyyy-MM-dd HH:mm:ss.nnnnnn z
```
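
Put together, a filter that parses nanosecond input and writes microsecond output might look like the sketch below. It reuses the `timestamp` column from the Example section and assumes the input strings carry nine fractional digits and a timezone name.

```yaml
filters:
  - type: timestamp_format
    # Java-style patterns (no `%`), with `n` for the sub-second digits
    default_from_timestamp_format: ["yyyy-MM-dd HH:mm:ss.nnnnnnnnn z"]
    default_to_timestamp_format: "yyyy-MM-dd HH:mm:ss.nnnnnn z"
    columns:
      - {name: timestamp}
```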

FYI: Java 8's DateTimeFormatter supports nanosecond resolution, but it cannot be used here because Embulk supports Java 7.

## ToDo

* Write tests

## Development

Run example:

```
$ ./gradlew classpath
$ embulk preview -I lib example/example.yml
```

Run test:

```
$ ./gradlew test
```

Run checkstyle:

```
$ ./gradlew check
```

Release gem:

```
$ ./gradlew gemPush
```