https://github.com/takuti/hive-udf-tokenize_ko

Korean NLP on Hive

hive hive-udf java korean-nlp natural-language-processing



Host: GitHub
URL: https://github.com/takuti/hive-udf-tokenize_ko
Owner: takuti
License: apache-2.0
Created: 2019-01-25T08:50:36.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2019-02-07T02:54:09.000Z (over 6 years ago)
Last Synced: 2025-02-01T22:13:34.432Z (5 months ago)
Topics: hive, hive-udf, java, korean-nlp, natural-language-processing
Language: Java
Size: 71.3 KB
Stars: 2
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE


README

Korean NLP on Hive
===

Tokenize Korean sentences on Hive.

```
tokenize_ko(String line [,
            const array<string> userDict,
            const string mode = "discard",
            const array<string> stopTags,
            boolean outputUnknownUnigrams
           ]) - returns tokenized strings in array
```

The implementation is based on the [Lucene Korean analyzer](https://lucene.apache.org/core/7_4_0/analyzers-nori/org/apache/lucene/analysis/ko/KoreanAnalyzer.html).

## Usage

```sh
mvn clean install
```

```sql
add jar hive-udf-tokenize_ko-0.0.1.jar;
create temporary function tokenize_ko as 'me.takuti.hive.nlp.tokenizer.TokenizeKoUDF';

select tokenize_ko("소설 무궁화꽃이 피었습니다.");
-- ["소설","무궁","화","꽃","피"]

select tokenize_ko("소설 무궁화꽃이 피었습니다.", null, "mixed");
-- ["소설","무궁화","무궁","화","꽃","피"]

select tokenize_ko("소설 무궁화꽃이 피었습니다.", null, "discard", array("E", "VV"));
-- ["소설","무궁","화","꽃","이"]

select tokenize_ko("Hello, world.", null, "none", array(), true);
-- ["h","e","l","l","o","w","o","r","l","d"]

select tokenize_ko("Hello, world.", null, "none", array(), false);
-- ["hello","world"]

select tokenize_ko("나는 C++ 언어를 프로그래밍 언어로 사랑한다.", null, "discard", array());
-- ["나","는","c","언어","를","프로그래밍","언어","로","사랑","하","ᆫ다"]

select tokenize_ko("나는 C++ 언어를 프로그래밍 언어로 사랑한다.", array("C++"), "discard", array());
-- ["나","는","c++","언어","를","프로그래밍","언어","로","사랑","하","ᆫ다"]
```
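Since `tokenize_ko` returns an array, it can be combined with Hive's built-in `explode` to work on one token per row. The following is a minimal sketch of a token frequency count; the table `reviews` and its columns `id` and `body` are hypothetical and not part of this repository.

```sql
-- Assumption: a table `reviews(id INT, body STRING)` holding Korean text exists.
-- It is used here only for illustration.
add jar hive-udf-tokenize_ko-0.0.1.jar;
create temporary function tokenize_ko as 'me.takuti.hive.nlp.tokenizer.TokenizeKoUDF';

select
  t.word,
  count(1) as freq
from
  reviews r
  -- explode() turns the token array into one row per token
  lateral view explode(tokenize_ko(r.body)) t as word
group by
  t.word
order by
  freq desc
limit 20;
```

The optional arguments (`userDict`, `mode`, `stopTags`, `outputUnknownUnigrams`) can be passed in the same position as in the examples above if the default tokenization is not suitable.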

Note that tokenizers for other languages (English, Japanese, and Chinese) are similarly [provided by Apache Hivemall](http://hivemall.incubator.apache.org/userguide/misc/tokenizer.html).
