Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/toyama0919/embulk-filter-kuromoji
Morphological analysis plugin for Embulk.
https://github.com/toyama0919/embulk-filter-kuromoji
embulk kuromoji morphological-analysis neologd
Last synced: 3 months ago
JSON representation
Morphological analysis plugin for Embulk.
- Host: GitHub
- URL: https://github.com/toyama0919/embulk-filter-kuromoji
- Owner: toyama0919
- License: mit
- Created: 2015-09-30T04:35:38.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2017-07-04T11:21:29.000Z (over 7 years ago)
- Last Synced: 2024-09-13T12:38:10.527Z (4 months ago)
- Topics: embulk, kuromoji, morphological-analysis, neologd
- Language: Java
- Homepage:
- Size: 112 KB
- Stars: 5
- Watchers: 3
- Forks: 2
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
Awesome Lists containing this project
README
# Kuromoji filter plugin for Embulk
[![Gem Version](https://badge.fury.io/rb/embulk-filter-kuromoji.svg)](http://badge.fury.io/rb/embulk-filter-kuromoji)Kuromoji filter plugin for Embulk.
Neologd support.## Reference
* [Atilika - Applied Search Innovation](http://www.atilika.com/en/products/kuromoji.html)
* [Home · neologd/mecab-ipadic-neologd Wiki](https://github.com/neologd/mecab-ipadic-neologd/wiki)## Overview
* **Plugin type**: filter
## Configuration
- **tokenizer**: select tokenizer.(kuromoji or neologd) (string, default: kuromoji)
- **mode**: select mode.(normal or search or extended) (string, default: normal)
- **use_stop_tag**: neologd only.(bool, default: false)
- **key_names**: description (list, required)
- **keep_input**: keep input columns. (bool, default: `true`)
- **ok_parts_of_speech**: ok parts of speech. (list, default: null)
- **dictionary_path**: user dictionary file path. (string, default: null)
- **settings**: description (list, required)
- **suffix**: output column name suffix. if null overwrite column. (string, default: null)
- **method**: description (string, required. surface_form or base_form or reading)
- **delimiter**: delimiter (string, default: ",")
- **type**: extract data type, array or string. array is json type. (string, default: "string")## Neologd Example
```yaml
filters:
- type: kuromoji
tokenizer: neologd
use_stop_tag: true
key_names:
- catchcopy
settings:
- { method: 'reading', delimiter: '' }
- { suffix: _surface_form_no_delim, method: 'surface_form', delimiter: '' }
- { suffix: _base_form, method: 'base_form', delimiter: '###' }
- { suffix: _surface_form, method: 'surface_form', delimiter: '###' }
- { suffix: _array, method: 'surface_form', type: 'array' }
```## Pure kuromoji Example
```yaml
filters:
- type: kuromoji
keep_input: false
mode: search
ok_parts_of_speech:
- 名詞
key_names:
- catchcopy
settings:
- { method: 'reading', delimiter: '' }
- { suffix: _surface_form_no_delim, method: 'surface_form', delimiter: '' }
- { suffix: _base_form, method: 'base_form', delimiter: '###' }
- { suffix: _surface_form, method: 'surface_form', delimiter: '###' }
- { suffix: _array, method: 'surface_form', type: 'array' }
```### input
```json
{
"catchcopy" : "安全・安心を追及した曲面ボディにデザインを一新しました。"
}
```As below
```json
{
"catchcopy" : "アンゼン・アンシンヲツイキュウシタキョクメンボディニデザインヲイッシン。",
"catchcopy_surface_form_no_delim" : "安全・安心を追及した曲面ボディにデザインを一新。",
"catchcopy_base_form" : "安全###・###安心###を###追及###する###た###曲面###ボディ###に###デザイン###を###一新###。",
"catchcopy_surface_form" : "安全###・###安心###を###追及###し###た###曲面###ボディ###に###デザイン###を###一新###。",
"catchcopy_array" : ["安全","・","安心","を","追及","し","た","曲面","ボディ","に","デザイン","を","一新","。"]
}
```## Example2(use user dictionary)
```yaml
- type: kuromoji
keep_input: false
dictionary_path: /tmp/kuromoji.txt
ok_parts_of_speech:
- 名詞
key_names:
- catchcopy
settings:
- { method: 'reading', delimiter: '#' }
- { suffix: _surface_form_no_delim, method: 'surface_form', delimiter: '' }
- { suffix: _base_form, method: 'base_form', delimiter: '###' }
- { suffix: _surface_form, method: 'surface_form', delimiter: '###' }
```## user dictionary example
```
西国分寺,西国分寺,ニシコクブンジ,駅名
東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞
```## Build
```
$ ./gradlew gem # -t to watch change of files and rebuild continuously
```