Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/sonots/embulk-filter-timestamp_format
A filter plugin for Embulk to change timestamp format
https://github.com/sonots/embulk-filter-timestamp_format
Last synced: 3 months ago
JSON representation
A filter plugin for Embulk to change timestamp format
- Host: GitHub
- URL: https://github.com/sonots/embulk-filter-timestamp_format
- Owner: sonots
- License: mit
- Created: 2016-04-26T05:23:43.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2021-06-10T09:58:50.000Z (over 3 years ago)
- Last Synced: 2024-10-13T08:08:24.332Z (3 months ago)
- Language: Java
- Homepage:
- Size: 207 KB
- Stars: 10
- Watchers: 3
- Forks: 4
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE.txt
Awesome Lists containing this project
README
# Timestamp format filter plugin for Embulk
[![Build Status](https://secure.travis-ci.org/sonots/embulk-filter-timestamp_format.png?branch=master)](http://travis-ci.org/sonots/embulk-filter-timestamp_format)
A filter plugin for Embulk to change timestamp format
## Configuration
- **columns**: columns to retain (array of hash)
- **name**: name of column (required)
- **type**: type to cast, choose one of `string`, `timestamp`, `long` (unixtimestamp), `double` (unixtimestamp) (string, default is `string`)
- **from_format**: specify the format of the input string (array of strings, default is default_from_timestamp_format)
- **from_timezone**: specify the timezone of the input string (string, default is default_from_timezone)
- **to_format**: specify the format of the output string (string, default is default_to_timestamp_format)
- **to_timezone**: specify the timezone of the output string (string, default is default_to_timezone)
- **from_unit**: specify the time unit of the input unixtimestamp (string, default is default_from_timestamp_unit)
- **to_unit**: specify the time unit of the output unixtimestamp (string, default is default_to_timestamp_unit)
- **default_from_timestamp_format**: default timestamp format for the input string (array of strings, default is `["%Y-%m-%d %H:%M:%S.%N %z"]`)
- **default_from_timezone**: default timezone for the input string (string, default is `UTC`)
- **default_to_timestamp_format**: default timestamp format for the output string (string, default is `%Y-%m-%d %H:%M:%S.%N %z`)
- **default_to_timezone**: default timezone for the output string (string, default is `UTC`)
- **default_from_timestamp_unit**: default time unit such as `sec` (for second), `ms` (for milli second), `us` (for micro second), `ns` (for nano second) for the input unixtimestamp (string, default is `second`)
- **default_to_timestamp_unit**: default time unit such as `sec` (for second), `ms` (for milli second), `us` (for micro second), `ns` (for nano second) for the output unixtimestamp (string, default is `second`)
- **stop_on_invalid_record**: stop bulk load transaction if a invalid record is found (boolean, default is `false`)
- **timestamp_parser** (experimental): set `auto_java` to try to convert ruby format to java format to use faster java timestamp parser (string, default is `auto`)## Example
Say example.jsonl is as follows (this is a typical format on exporting a BigQuery table):
```
{"timestamp":"2015-07-12 15:00:00 UTC","nested":{"timestamp":"2015-07-12 15:00:00 UTC"}}
{"timestamp":"2015-07-12 15:00:00.1 UTC","nested":{"timestamp":"2015-07-12 15:00:00.1 UTC"}}
``````yaml
in:
type: file
path_prefix: example/example.jsonl
parser:
type: jsonl # not json parser
columns:
- {name: timestamp, type: string}
- {name: nested, type: json}
filters:
- type: timestamp_format
default_from_timestamp_format: ["%Y-%m-%d %H:%M:%S.%N %z", "%Y-%m-%d %H:%M:%S %z"]
default_to_timezone: "Asia/Tokyo"
default_to_timestamp_format: "%Y-%m-%d %H:%M:%S.%N"
columns:
- {name: timestamp, type: long, to_unit: ms}
- {name: $.nested.timestamp}
out:
type: stdout
```Output will be as:
```
{"timestamp":1436713200000,"nested":{"timestamp":"2015-07-13 00:00:00.0}}
{"timestamp":1436713200100,"nested":{"timestamp":"2015-07-13 00:00:00.1}}
```See [./example](./example) for more examples.
## JSONPath
For `type: json` column, you can specify [JSONPath](http://goessner.net/articles/JsonPath/) for column's name as:
```
name: $.payload.key1
name: "$.payload.array[0]"
name: "$.payload.array[*]"
name: $['payload']['key1.key2']
```Following operators of JSONPath are not supported:
* Multiple properties such as `['name','name']`
* Multiple array indexes such as `[1,2]`
* Array slice such as `[1:2]`
* Filter expression such as `[?()]`## JRuby Timestamp Parser Performance Issue
**NEWS: (2017/07/10) embulk 0.8.27 is released with a fast Timestamp jruby parser. This issue should be resolved, so Java Timestamp parser support will be dropped in future releases.**
Embulk's timestamp parser originally uses jruby implementation, but it is slow.
To improve performance, this plugin also supports Java's Joda-Time [DateTimeFormat](http://joda-time.sourceforge.net/apidocs/org/joda/time/format/DateTimeFormat.html) format as:```yaml
in:
type: file
path_prefix: example/example.jsonl
parser:
type: jsonl
columns:
- {name: timestamp, type: string}
- {name: nested, type: json}
filters:
- type: timestamp_format
default_from_timestamp_format: ["yyyy-MM-dd HH:mm:ss.SSS z", "yyyy-MM-dd HH:mm:ss z", "yyyy-MM-dd HH:mm:ss"]
default_to_timezone: "Asia/Taipei"
default_to_timestamp_format: "yyyy-MM-dd HH:mm:ss.SSS Z"
columns:
- {name: timestamp, type: long, to_unit: ms}
- {name: $.nested.timestamp}
out:
type: stdout
```If format strings contain `%`, jruby parser/formatter is used. Otherwirse, java parser/formatter is used
**Automatic Conversion of Ruby Timestamp Format to Java Timestamp Format** (experimental)
If you configure `timestamp_parser: auto_java`, this plugin tries to convert ruby format into java format automatically to use faster java timestamp parser.
**COMPARISON:**
Benchmark test sets are available at [./bench](./bench). In my environment (Mac Book Pro), for 1000000 timestamps:
* java parser + java formatter: 1.3s
* java parser + jruby formatter: 1.4s
* jruby parser + java formatter: 64.52s
* jruby parser + jruby formatter: 65.06sJRuby parser is slow, but JRuby formatter is not so slow.
## Nano Resolution
JRuby parser has micro second resolution. Java (Joda-Time) parser has milli second resolution.
Nano second resolution is partially supported by this plugin itself. Use parser format `nnnnnnnnn` for Java parser as
```
yyyy-MM-dd HH:mm:ss.nnnnnnnnn z
```This plugin finds places of nano second from texts with regular expression `\.(\d+)`.
For formatter, you can use `nnnnnnnnn` for nano and `nnnnnn` for micro as
```
yyyy-MM-dd HH:mm:ss.nnnnnnnnn z
yyyy-MM-dd HH:mm:ss.nnnnnn z
```FYI: Java8's DateTimeFormatter supports nano second resolution, but we can not use it because embulk supports Java7.
## ToDo
* Write test
## Development
Run example:
```
$ ./gradlew classpath
$ embulk preview -I lib example/example.yml
```Run test:
```
$ ./gradlew test
```Run checkstyle:
```
$ ./gradlew check
```Release gem:
```
$ ./gradlew gemPush
```