https://github.com/takuti/hive-udf-tokenize_ko
Korean NLP on Hive
hive hive-udf java korean-nlp natural-language-processing
Korean NLP on Hive
- Host: GitHub
- URL: https://github.com/takuti/hive-udf-tokenize_ko
- Owner: takuti
- License: apache-2.0
- Created: 2019-01-25T08:50:36.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2019-02-07T02:54:09.000Z (almost 6 years ago)
- Last Synced: 2024-10-16T01:41:37.545Z (3 months ago)
- Topics: hive, hive-udf, java, korean-nlp, natural-language-processing
- Language: Java
- Size: 71.3 KB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Korean NLP on Hive
===

Tokenize Korean sentences on Hive.

```
tokenize_ko(String line [,
            const array<string> userDict,
            const string mode = "discard",
            const array<string> stopTags,
            boolean outputUnknownUnigrams
           ]) - returns tokenized strings in array<string>
```

Implementation is based on the [Lucene Korean analyzer](https://lucene.apache.org/core/7_4_0/analyzers-nori/org/apache/lucene/analysis/ko/KoreanAnalyzer.html).

## Usage
```sh
mvn clean install
```

```sql
add jar hive-udf-tokenize_ko-0.0.1.jar;
create temporary function tokenize_ko as 'me.takuti.hive.nlp.tokenizer.TokenizeKoUDF';

select tokenize_ko("소설 무궁화꽃이 피었습니다.");
-- ["소설","무궁","화","꽃","피"]

select tokenize_ko("소설 무궁화꽃이 피었습니다.", null, "mixed");
-- ["소설","무궁화","무궁","화","꽃","피"]

select tokenize_ko("소설 무궁화꽃이 피었습니다.", null, "discard", array("E", "VV"));
-- ["소설","무궁","화","꽃","이"]

select tokenize_ko("Hello, world.", null, "none", array(), true);
-- ["h","e","l","l","o","w","o","r","l","d"]

select tokenize_ko("Hello, world.", null, "none", array(), false);
-- ["hello","world"]

select tokenize_ko("나는 C++ 언어를 프로그래밍 언어로 사랑한다.", null, "discard", array());
-- ["나","는","c","언어","를","프로그래밍","언어","로","사랑","하","ᆫ다"]

select tokenize_ko("나는 C++ 언어를 프로그래밍 언어로 사랑한다.", array("C++"), "discard", array());
-- ["나","는","c++","언어","를","프로그래밍","언어","로","사랑","하","ᆫ다"]
```

Note that other languages (English, Japanese, and Chinese) are similarly [supported by Apache Hivemall](http://hivemall.incubator.apache.org/userguide/misc/tokenizer.html).
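
Because `tokenize_ko` returns an array of strings, a common next step is to flatten that array into one row per token and aggregate. The following is a minimal sketch of a token frequency count, not something taken from this repository. It assumes the function has already been registered as shown above, and that a hypothetical table `docs` with columns `id` (int) and `body` (string) holds the Korean text.

```sql
-- Assumption: `docs(id int, body string)` is a hypothetical table
-- that is not part of this project; substitute your own table and column.
select
  t.token,
  count(1) as freq
from
  docs
  -- explode() emits one row per element of the array returned by tokenize_ko
  lateral view explode(tokenize_ko(body)) t as token
group by
  t.token
order by
  freq desc
limit 10;
```

The optional arguments (user dictionary, mode, stop tags, unknown-unigram flag) can be passed to `tokenize_ko` inside `explode(...)` in the same way as in the single-sentence examples.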