https://github.com/toyama0919/embulk-filter-kuromoji

Morphological analysis plugin for Embulk.

# Kuromoji filter plugin for Embulk
[![Gem Version](https://badge.fury.io/rb/embulk-filter-kuromoji.svg)](http://badge.fury.io/rb/embulk-filter-kuromoji)

A filter plugin for Embulk that tokenizes Japanese text with Kuromoji.
The NEologd dictionary (mecab-ipadic-neologd) is also supported.

## Reference

* [Atilika - Applied Search Innovation](http://www.atilika.com/en/products/kuromoji.html)
* [Home · neologd/mecab-ipadic-neologd Wiki](https://github.com/neologd/mecab-ipadic-neologd/wiki)

## Overview

* **Plugin type**: filter

## Configuration

- **tokenizer**: tokenizer to use, `kuromoji` or `neologd` (string, default: `kuromoji`)
- **mode**: tokenization mode, `normal`, `search`, or `extended` (string, default: `normal`)
- **use_stop_tag**: neologd only (bool, default: `false`)
- **key_names**: names of the input columns to analyze (list, required)
- **keep_input**: whether to keep the input columns (bool, default: `true`)
- **ok_parts_of_speech**: allowed parts of speech (list, default: null)
- **dictionary_path**: path to a user dictionary file (string, default: null)
- **settings**: output settings, one entry per output column (list, required)
    - **suffix**: suffix appended to the input column name to form the output column name. If null, the input column is overwritten. (string, default: null)
    - **method**: token attribute to output: `surface_form`, `base_form`, or `reading` (string, required)
    - **delimiter**: delimiter used to join tokens (string, default: `","`)
    - **type**: output data type, `array` or `string`. `array` is emitted as a json type. (string, default: `"string"`)
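The `settings` options can be read as a recipe for turning a token list into output columns. The Python sketch below illustrates that reading only; it is not the plugin's implementation. The function name `apply_settings` and the hand-written `TOKENS` list are hypothetical. The fallback to `","` for a missing `delimiter` and the overwrite behavior for a missing `suffix` are assumptions taken from the option descriptions. `keep_input`, `mode`, and `ok_parts_of_speech` are not modeled.

```python
import json

# Hand-written tokens standing in for real tokenizer output (illustrative only).
TOKENS = [
    {"surface_form": "曲面", "base_form": "曲面", "reading": "キョクメン"},
    {"surface_form": "ボディ", "base_form": "ボディ", "reading": "ボディ"},
    {"surface_form": "に", "base_form": "に", "reading": "ニ"},
]


def apply_settings(record, key_names, settings, tokens_by_key):
    """Derive output columns from tokens, following the option descriptions."""
    output = dict(record)  # start from the input columns
    for key in key_names:
        tokens = tokens_by_key[key]
        for setting in settings:
            # `method` picks which token attribute is emitted.
            values = [token[setting["method"]] for token in tokens]
            # Assumption: a missing (null) suffix overwrites the input column.
            column = key + (setting.get("suffix") or "")
            if setting.get("type", "string") == "array":
                output[column] = values
            else:
                # Assumption: the delimiter defaults to "," when omitted.
                output[column] = setting.get("delimiter", ",").join(values)
    return output


example_settings = [
    {"method": "reading", "delimiter": ""},
    {"suffix": "_base_form", "method": "base_form", "delimiter": "###"},
    {"suffix": "_array", "method": "surface_form", "type": "array"},
]
result = apply_settings(
    {"catchcopy": "曲面ボディに"}, ["catchcopy"], example_settings, {"catchcopy": TOKENS}
)
print(json.dumps(result, ensure_ascii=False, indent=2))
```

Each `settings` entry yields one column per name in `key_names`, so three entries applied to one key produce three columns, the first of which replaces `catchcopy` because it has no suffix.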

## Neologd Example

```yaml
filters:
  - type: kuromoji
    tokenizer: neologd
    use_stop_tag: true
    key_names:
      - catchcopy
    settings:
      - { method: 'reading', delimiter: '' }
      - { suffix: _surface_form_no_delim, method: 'surface_form', delimiter: '' }
      - { suffix: _base_form, method: 'base_form', delimiter: '###' }
      - { suffix: _surface_form, method: 'surface_form', delimiter: '###' }
      - { suffix: _array, method: 'surface_form', type: 'array' }
```

## Pure kuromoji Example

```yaml
filters:
  - type: kuromoji
    keep_input: false
    mode: search
    ok_parts_of_speech:
      - 名詞
    key_names:
      - catchcopy
    settings:
      - { method: 'reading', delimiter: '' }
      - { suffix: _surface_form_no_delim, method: 'surface_form', delimiter: '' }
      - { suffix: _base_form, method: 'base_form', delimiter: '###' }
      - { suffix: _surface_form, method: 'surface_form', delimiter: '###' }
      - { suffix: _array, method: 'surface_form', type: 'array' }
```

### input

```json
{
  "catchcopy" : "安全・安心を追及した曲面ボディにデザインを一新しました。"
}
```

### output

```json
{
  "catchcopy" : "アンゼン・アンシンヲツイキュウシタキョクメンボディニデザインヲイッシン。",
  "catchcopy_surface_form_no_delim" : "安全・安心を追及した曲面ボディにデザインを一新。",
  "catchcopy_base_form" : "安全###・###安心###を###追及###する###た###曲面###ボディ###に###デザイン###を###一新###。",
  "catchcopy_surface_form" : "安全###・###安心###を###追及###し###た###曲面###ボディ###に###デザイン###を###一新###。",
  "catchcopy_array" : ["安全","・","安心","を","追及","し","た","曲面","ボディ","に","デザイン","を","一新","。"]
}
```

## Example 2 (using a user dictionary)

```yaml
filters:
  - type: kuromoji
    keep_input: false
    dictionary_path: /tmp/kuromoji.txt
    ok_parts_of_speech:
      - 名詞
    key_names:
      - catchcopy
    settings:
      - { method: 'reading', delimiter: '#' }
      - { suffix: _surface_form_no_delim, method: 'surface_form', delimiter: '' }
      - { suffix: _base_form, method: 'base_form', delimiter: '###' }
      - { suffix: _surface_form, method: 'surface_form', delimiter: '###' }
```

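## User dictionary example

Each line has four comma-separated fields: the surface text, its segmentation (space-separated), the readings (space-separated, one per segment), and the part-of-speech tag.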

```
西国分寺,西国分寺,ニシコクブンジ,駅名
東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞
```
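To sanity-check a dictionary file before pointing `dictionary_path` at it, the entries can be parsed with a few lines of Python. This is a minimal sketch under the assumption that every entry follows the four-field layout shown above. `UserDictEntry` and `parse_user_dictionary` are hypothetical names, not part of the plugin or of Kuromoji, and skipping lines that start with `#` is likewise an assumption about comment handling.

```python
import csv
import io
from typing import List, NamedTuple


class UserDictEntry(NamedTuple):
    surface: str          # text to match in the input
    segments: List[str]   # how the matched text is split into tokens
    readings: List[str]   # one reading per segment
    part_of_speech: str   # tag assigned to the resulting tokens


def parse_user_dictionary(text: str) -> List[UserDictEntry]:
    """Parse user dictionary lines and check that segments and readings align."""
    entries = []
    for row in csv.reader(io.StringIO(text)):
        # Skip blank lines and (assumed) comment lines.
        if not row or row[0].startswith("#"):
            continue
        if len(row) != 4:
            raise ValueError(f"expected 4 fields, got {len(row)}: {row}")
        surface, segmentation, readings, pos = (field.strip() for field in row)
        segments = segmentation.split()
        reading_list = readings.split()
        if len(segments) != len(reading_list):
            raise ValueError(f"segment/reading count mismatch for {surface!r}")
        entries.append(UserDictEntry(surface, segments, reading_list, pos))
    return entries


SAMPLE = (
    "西国分寺,西国分寺,ニシコクブンジ,駅名\n"
    "東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞\n"
)

for entry in parse_user_dictionary(SAMPLE):
    print(entry.surface, entry.segments, entry.readings, entry.part_of_speech)
```

The count check catches the most common mistake in hand-edited dictionaries: a segmentation with more or fewer parts than its readings.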

## Build

```
$ ./gradlew gem # -t to watch change of files and rebuild continuously
```