https://github.com/hishidama/embulk-parser-hadoop-seqfile
Hadoop SequenceFile parser plugin for Embulk
- Host: GitHub
- URL: https://github.com/hishidama/embulk-parser-hadoop-seqfile
- Owner: hishidama
- License: mit
- Created: 2023-10-01T01:13:10.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-10-08T01:05:52.000Z (over 1 year ago)
- Last Synced: 2024-11-10T15:27:17.760Z (3 months ago)
- Topics: embulk-parser-plugin, embulk-plugin, hadoop, java-8, sequencefile
- Language: Java
- Homepage: https://www.ne.jp/asahi/hishidama/home/tech/embulk/parser-sequencefile.html
- Size: 90.8 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
# Hadoop SequenceFile parser plugin for Embulk
Parses Hadoop SequenceFile data read by other file input plugins.
## Overview
* **Plugin type**: parser
* **Guess supported**: no
* Embulk 0.10 or later
* JDK 1.8 (JRE 1.8 is not supported), or Java 9 or later

## Example
### SequenceFile(key: Text, value: IntWritable)
```yaml
in:
  type: any file input plugin type
  parser:
    type: hadoop_seqfile
    key_class: org.apache.hadoop.io.Text
    value_class: org.apache.hadoop.io.IntWritable
    columns:
    - {name: word, type: string, key: true, wtype: text}
    - {name: count, type: long, key: false, wtype: int}
```

### SequenceFile(key: NullWritable, value: Asakusa Framework DataModel)
```yaml
in:
  type: any file input plugin type
  parser:
    type: hadoop_seqfile
    value_class: com.example.asakusafw.dmdl.model.WordCount
    columns:
    - {name: word, type: string, wtype: stringOption}
    - {name: count, type: long, wtype: intOption}
```

See [asakusafw-helper.xlsx](asakusafw-helper.xlsx) as a tool to assist in generating columns from dmdl.
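
As a further sketch, the configuration below combines several of the options listed under Configuration: a `LongWritable` key, a `Text` value, and an explicit `flush_count`. The `file` input with its `path_prefix`, the `stdout` output, and the column names `id` and `line` are assumptions chosen for illustration; they are not prescribed by this plugin, and the path is hypothetical.

```yaml
in:
  type: file                        # assumption: Embulk's built-in local file input
  path_prefix: /tmp/seqfile/part-   # hypothetical location of the SequenceFile(s)
  parser:
    type: hadoop_seqfile
    key_class: org.apache.hadoop.io.LongWritable
    value_class: org.apache.hadoop.io.Text
    flush_count: 100                # same as the documented default, written out explicitly
    columns:
    # key: true reads the column from the SequenceFile key, key: false from the value
    - {name: id, type: long, key: true, wtype: long}
    - {name: line, type: string, key: false, wtype: text}
out:
  type: stdout                      # assumption: print records for inspection
```

Both `key_class` and `value_class` are given here because each defaults to `org.apache.hadoop.io.NullWritable` when omitted.
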
## Configuration
* **key_class**: key class name. (string, default: `org.apache.hadoop.io.NullWritable`)
* **value_class**: value class name. (string, default: `org.apache.hadoop.io.NullWritable`)
* **columns**: column definition. see below. (hash, required)
* **default_timezone**: default time zone. (string, default: `UTC`)
* **default_timestamp_format**: default timestamp format. (string, default: `%Y-%m-%d %H:%M:%S.%N %z`)
* **flush_count**: flush count. (int, default: `100`)

### columns
* **name**: Embulk column name. (string, required)
* **type**: Embulk column type. (string, required)
* **key**: key or value (`true` for key, `false` for value). (boolean, default: `false`)
* **wtype**: Writable type. (string, required)
* **timezone**: time zone. (string, default: **default_timezone**)
* **format**: timestamp format. (string, default: **default_timestamp_format**)

#### wtype (Writable type)
| wtype | software | Writable class |
|------------------|-------------------|--------------------------------------------|
| `null` | Hadoop | org.apache.hadoop.io.NullWritable |
| `boolean` | Hadoop | org.apache.hadoop.io.BooleanWritable |
| `byte` | Hadoop | org.apache.hadoop.io.ByteWritable |
| `short` | Hadoop | org.apache.hadoop.io.ShortWritable |
| `int` | Hadoop | org.apache.hadoop.io.IntWritable |
| `long` | Hadoop | org.apache.hadoop.io.LongWritable |
| `float` | Hadoop | org.apache.hadoop.io.FloatWritable |
| `double` | Hadoop | org.apache.hadoop.io.DoubleWritable |
| `vint` | Hadoop | org.apache.hadoop.io.VIntWritable |
| `vlong` | Hadoop | org.apache.hadoop.io.VLongWritable |
| `text` | Hadoop | org.apache.hadoop.io.Text |
| `booleanOption` | Asakusa Framework | com.asakusafw.runtime.value.BooleanOption |
| `byteOption` | Asakusa Framework | com.asakusafw.runtime.value.ByteOption |
| `shortOption` | Asakusa Framework | com.asakusafw.runtime.value.ShortOption |
| `intOption` | Asakusa Framework | com.asakusafw.runtime.value.IntOption |
| `longOption` | Asakusa Framework | com.asakusafw.runtime.value.LongOption |
| `floatOption` | Asakusa Framework | com.asakusafw.runtime.value.FloatOption |
| `doubleOption` | Asakusa Framework | com.asakusafw.runtime.value.DoubleOption |
| `decimalOption` | Asakusa Framework | com.asakusafw.runtime.value.DecimalOption |
| `stringOption` | Asakusa Framework | com.asakusafw.runtime.value.StringOption |
| `dateOption` | Asakusa Framework | com.asakusafw.runtime.value.DateOption |
| `datetimeOption` | Asakusa Framework | com.asakusafw.runtime.value.DateTimeOption |

## Install
1. install plugin
```
$ mvn dependency:get -Dartifact=io.github.hishidama.embulk:embulk-parser-hadoop-seqfile:0.1.0
```

2. add setting to `$HOME/.embulk/embulk.properties`
```
plugins.parser.hadoop_seqfile=maven:io.github.hishidama.embulk:hadoop-seqfile:0.1.0
```

## Build
```
$ ./gradlew test
```

### Build to local Maven repository
```
./gradlew generatePomFileForMavenJavaPublication
mvn install -f build/publications/mavenJava/pom-default.xml
./gradlew publishToMavenLocal
```