https://github.com/wanasit/kotori

A Japanese tokenizer and morphological analysis engine written in Kotlin

Host: GitHub
URL: https://github.com/wanasit/kotori
Owner: wanasit
License: mit
Created: 2020-05-08T03:13:20.000Z (about 6 years ago)
Default Branch: master
Last Pushed: 2020-08-30T06:18:17.000Z (almost 6 years ago)
Last Synced: 2025-04-04T03:22:49.347Z (over 1 year ago)
Language: Kotlin
Size: 24.1 MB
Stars: 54
Watchers: 4
Forks: 6
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

awesome-java - Kotori

README

# Kotori

A Japanese tokenizer and morphological analysis engine written in Kotlin

### Usage

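```kotlin
import com.github.wanasit.kotori.Tokenizer

fun main(args: Array<String>) {
    // Create a tokenizer backed by the built-in dictionary
    val tokenizer = Tokenizer.createDefaultTokenizer()

    // Tokenize a sentence and keep only the surface text of each token
    val words = tokenizer.tokenize("お寿司が食べたい。").map { it.text }
    println(words) // [お, 寿司, が, 食べ, たい, 。]
}
```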

### Installation

Kotori packages are hosted on [Bintray](https://bintray.com/beta/#/wanasit/maven/Kotori?tab=overview) and JCenter.

You can download and install it via Gradle or Maven.

Gradle:

```groovy
repositories {
    jcenter()
}

dependencies {
    ...
    implementation 'com.github.wanasit.kotori:kotori:0.0.3'
}
```

Maven:

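```xml
<dependency>
  <groupId>com.github.wanasit.kotori</groupId>
  <artifactId>kotori</artifactId>
  <version>VERSION_NUMBER</version>
  <type>pom</type>
</dependency>
```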

You can also install Kotori via [JitPack](https://jitpack.io/#wanasit/kotori).

### Dictionary 

Kotori has a built-in dictionary based on `mecab-ipadic-2.7.0-20070801`.

```kotlin
val dictionary = Dictionary.readDefaultFromResource()
val tokenizer = Tokenizer.create(dictionary)

tokenizer.tokenize("お寿司が食べたい。")
```

However, it also works out of the box with any MeCab dictionary. For example:

* IPADIC ([2.7.0-20070801](http://atilika.com/releases/mecab-ipadic/mecab-ipadic-2.7.0-20070801.tar.gz))

* UniDic ([2.1.2](http://atilika.com/releases/unidic-mecab/unidic-mecab-2.1.2_src.zip))

* JUMANDIC ([7.0-20130310](http://atilika.com/releases/mecab-jumandic/mecab-jumandic-7.0-20130310.tar.gz))

```kotlin
val dictionary = MeCabDictionary.readFromDirectory("~/Download/mecab-ipadic-2.7.0-20070801")
val tokenizer = Tokenizer.create(dictionary)

tokenizer.tokenize("お寿司が食べたい。")
```
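The path in the snippet above starts with `~`, which a shell expands but the JVM's file APIs do not. Whether `MeCabDictionary.readFromDirectory` expands it is not stated here, so the following is only a minimal sketch of resolving the path yourself before passing it in. `expandHome` is a hypothetical helper, not part of Kotori.

```kotlin
import java.io.File

// Hypothetical helper (not part of Kotori): replaces a leading "~" with the
// current user's home directory; any other path is returned unchanged.
fun expandHome(path: String): String {
    val home = System.getProperty("user.home")
    return when {
        path == "~" -> home
        path.startsWith("~/") -> File(home, path.substring(2)).path
        else -> path
    }
}

fun main() {
    // Prints the dictionary directory resolved against the home directory.
    println(expandHome("~/Download/mecab-ipadic-2.7.0-20070801"))
}
```

The resolved string could then be passed wherever the examples use a dictionary directory.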

Note: Support for [Sudachi](https://github.com/WorksApplications/Sudachi) dictionaries and plugins is under development.

### Performance

Kotori is heavily inspired by [Kuromoji](https://github.com/atilika/kuromoji) and [Sudachi](https://github.com/WorksApplications/Sudachi),
but its tokenization is even faster than that of other JVM-based tokenizers (based on our *probably unfair* benchmark).

The following statistics come from tokenizing Japanese sentences from [Tatoeba](https://tatoeba.org/eng/)
(193,898 sentence entries, 3,561,854 total characters) on a MacBook Pro 2020 (2.4 GHz 8-Core Intel Core i9).

| Tokenizer (dictionary) | Token Count | Time (ns per document) | Time (ns per token) |
|---|---:|---:|---:|
| Kuromoji (IPADIC) | 2,264,560 | 10,095 | 864 |
| **Kotori (IPADIC)** | 2,264,705 | **8,190** | **701** |
| Sudachi (sudachi-dictionary-20200330-small) | 2,308,873 | 27,352 | 2,296 |
| Kotori (sudachi-dictionary-20200330-small) | 2,157,820 | 13,079 | 1,175 |
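The two time columns are linked by the average number of tokens per document. The README does not say how the per-token figures were computed, so the sketch below assumes they are total time divided by token count. Under that assumption it reproduces the IPADIC rows, and the other rows agree to within rounding. `nsPerToken` is a name introduced here for illustration.

```kotlin
import kotlin.math.roundToLong

// Number of Tatoeba sentences ("documents") in the benchmark corpus.
const val DOCUMENTS = 193_898L

// Assumed relation: total time = nsPerDocument * DOCUMENTS,
// and per-token time = total time / tokenCount.
fun nsPerToken(nsPerDocument: Long, tokenCount: Long): Long =
    (nsPerDocument.toDouble() * DOCUMENTS / tokenCount).roundToLong()

fun main() {
    println(nsPerToken(10_095, 2_264_560)) // Kuromoji (IPADIC): 864
    println(nsPerToken(8_190, 2_264_705))  // Kotori (IPADIC): 701
}
```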

#### (Speculative) What makes Kotori fast

* **Minimal `String.substring()` usage**. [Since JDK 7](https://www.programcreek.com/2013/09/the-substring-method-in-jdk-6-and-jdk-7/),
the function copies the underlying characters and has O(n) overhead. Some tokenizers designed before that change (e.g. Kuromoji) still make many substring calls.
* **A customized trie data structure**.
`TransitionArrayTrie` can be built quickly, just-in-time, when a tokenizer is created,
yet it still performs well on Japanese text in UTF-16.
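The two ideas above can be illustrated together. This is a minimal sketch using a plain `HashMap`-based trie, not Kotori's `TransitionArrayTrie`, whose internals are not described here. `SimpleTrie`, `TrieNode`, and `commonPrefixEnds` are hypothetical names. The trie is built from a list of terms at construction time, and lookups return end offsets instead of allocating substrings.

```kotlin
// Hypothetical illustration only: a naive trie keyed by UTF-16 chars.
class TrieNode {
    val children = HashMap<Char, TrieNode>()
    var isTerm = false
}

class SimpleTrie(terms: List<String>) {
    private val root = TrieNode()

    init {
        // Build the trie just-in-time from a plain list of terms.
        for (term in terms) {
            var node = root
            for (ch in term) {
                node = node.children.getOrPut(ch) { TrieNode() }
            }
            node.isTerm = true
        }
    }

    // Returns the exclusive end offset of every term that begins at `start`.
    // The search walks the text by index and never calls substring().
    fun commonPrefixEnds(text: CharSequence, start: Int): List<Int> {
        val ends = ArrayList<Int>()
        var node = root
        var i = start
        while (i < text.length) {
            node = node.children[text[i]] ?: break
            i++
            if (node.isTerm) ends.add(i)
        }
        return ends
    }
}

fun main() {
    val trie = SimpleTrie(listOf("食べ", "食べたい", "たい", "寿司"))
    // Terms starting at index 4 ("食"): "食べ" ends at 6, "食べたい" ends at 8.
    println(trie.commonPrefixEnds("お寿司が食べたい。", 4)) // [6, 8]
}
```

A caller that needs the matched text can take a substring only for the candidates it finally keeps, so the search itself allocates none.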

#### (Speculative) What makes Kotori slow

* **Kotori doesn't rely on any pre-built data structure** (e.g. `DoubleArrayTrie`).
It reads a dictionary in a list-of-terms format and builds the trie just-in-time.
This is a design decision that keeps Kotori open to multiple dictionary formats in exchange for some boot-up time.
* Kotlin (as written by the inexperienced library author) is slower than Java,
mostly because Kotlin's `Array<T>` has some overhead compared to Java's native `T[]`.
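One well-known source of array overhead in Kotlin is boxing. It is shown here as an assumption about what the author may mean, not as something the README confirms. On the JVM, `Array<Int>` is backed by `Integer[]`, while `IntArray` is backed by the primitive `int[]`. `componentTypeName` is a helper introduced for this sketch.

```kotlin
// Returns the JVM element type of an array, e.g. "int" or "java.lang.Integer".
fun componentTypeName(array: Any): String =
    array.javaClass.componentType?.name ?: "not an array"

fun main() {
    val boxed: Array<Int> = arrayOf(1, 2, 3)      // each element is a boxed Integer
    val primitive: IntArray = intArrayOf(1, 2, 3) // elements stored as raw ints

    println(componentTypeName(boxed))     // java.lang.Integer
    println(componentTypeName(primitive)) // int
}
```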

#### Benchmark

The benchmark can be run as a Gradle task.

```bash
./gradlew benchmark
./gradlew benchmark --args='--tokenizer=kuromoji'
./gradlew benchmark --args='--tokenizer=kotori --dictionary=sudachi-small'
```

Check [the source code](https://github.com/wanasit/kotori/blob/master/kotori-benchmark/src/main/kotlin/com/github/wanasit/kotori/benchmark/Benchmark.kt)
in the `kotori-benchmark` project for more details.
