# ๐ŸŒฟ Konoha: Simple wrapper of Japanese Tokenizers

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/himkt/konoha/blob/main/example/Konoha_Example.ipynb)

[![GitHub stars](https://img.shields.io/github/stars/himkt/konoha?style=social)](https://github.com/himkt/konoha/stargazers)

[![Downloads](https://pepy.tech/badge/konoha)](https://pepy.tech/project/konoha)
[![Downloads](https://pepy.tech/badge/konoha/month)](https://pepy.tech/project/konoha/month)
[![Downloads](https://pepy.tech/badge/konoha/week)](https://pepy.tech/project/konoha/week)

[![Build Status](https://github.com/himkt/konoha/actions/workflows/ci.yml/badge.svg)](https://github.com/himkt/konoha/actions/workflows/ci.yml)
[![Documentation Status](https://readthedocs.org/projects/konoha/badge/?version=latest)](https://konoha.readthedocs.io/en/latest/?badge=latest)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/konoha)
[![PyPI](https://img.shields.io/pypi/v/konoha.svg)](https://pypi.python.org/pypi/konoha)
[![GitHub Issues](https://img.shields.io/github/issues/himkt/konoha.svg?cacheSeconds=60&color=yellow)](https://github.com/himkt/konoha/issues)
[![GitHub Pull Requests](https://img.shields.io/github/issues-pr/himkt/konoha.svg?cacheSeconds=60&color=yellow)](https://github.com/himkt/konoha/issues)

`Konoha` is a Python library that provides an easy-to-use, unified interface to various Japanese tokenizers,
which lets you switch tokenizers easily and speed up your pre-processing.

## Supported tokenizers

- MeCab
- Janome
- nagisa
- Sudachi
- SentencePiece
- KyTea

Also, `konoha` provides rule-based tokenizers (whitespace, character) and a rule-based sentence splitter.

## Quick Start with Docker

Simply run the following on your machine:

```bash
docker run --rm -p 8000:8000 -t himkt/konoha # from DockerHub
```

Or you can build the image on your machine:

```bash
git clone https://github.com/himkt/konoha # download konoha
cd konoha && docker-compose up --build # build and launch container
```

Tokenization is performed by posting a JSON object to `localhost:8000/api/v1/tokenize`.
You can also tokenize a batch of texts by passing `texts: ["๏ผ‘ใค็›ฎใฎๅ…ฅๅŠ›", "๏ผ’ใค็›ฎใฎๅ…ฅๅŠ›"]` to `localhost:8000/api/v1/batch_tokenize`.

(API documentation is available at `localhost:8000/redoc`; you can open it in your web browser.)

Send a request using `curl` from your terminal.
Note that the endpoint path changed in v4.6.4;
please check the release notes (https://github.com/himkt/konoha/releases/tag/v4.6.4).

```bash
$ curl localhost:8000/api/v1/tokenize -X POST -H "Content-Type: application/json" \
  -d '{"tokenizer": "mecab", "text": "ใ“ใ‚Œใฏใƒšใƒณใงใ™"}'

{
  "tokens": [
    [
      {
        "surface": "ใ“ใ‚Œ",
        "part_of_speech": "ๅ่ฉž"
      },
      {
        "surface": "ใฏ",
        "part_of_speech": "ๅŠฉ่ฉž"
      },
      {
        "surface": "ใƒšใƒณ",
        "part_of_speech": "ๅ่ฉž"
      },
      {
        "surface": "ใงใ™",
        "part_of_speech": "ๅŠฉๅ‹•่ฉž"
      }
    ]
  ]
}
```
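A request to the batch endpoint follows the same pattern. The request below only shows the call; the response shape is assumed to mirror the single-text endpoint, with one token list per input text:

```bash
$ curl localhost:8000/api/v1/batch_tokenize -X POST -H "Content-Type: application/json" \
  -d '{"tokenizer": "mecab", "texts": ["๏ผ‘ใค็›ฎใฎๅ…ฅๅŠ›", "๏ผ’ใค็›ฎใฎๅ…ฅๅŠ›"]}'
```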

## Installation

We recommend installing konoha with `pip install 'konoha[all]'`.

- Install konoha with a specific tokenizer: `pip install 'konoha[(tokenizer_name)]'`.
- Install konoha with a specific tokenizer and remote file support: `pip install 'konoha[(tokenizer_name),remote]'`.

To use a particular tokenizer, install konoha with the corresponding extra
(e.g. `konoha[mecab]`, `konoha[sudachi]`, etc.) or install the tokenizer packages individually.
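For example, using the extras mentioned above:

```bash
pip install 'konoha[all]'            # recommended: installs all extras
pip install 'konoha[mecab]'          # only the MeCab backend
pip install 'konoha[mecab,remote]'   # MeCab backend plus remote (S3) file support
```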

## Example

### Word level tokenization

```python
from konoha import WordTokenizer

sentence = '่‡ช็„ถ่จ€่ชžๅ‡ฆ็†ใ‚’ๅ‹‰ๅผทใ—ใฆใ„ใพใ™'

tokenizer = WordTokenizer('MeCab')
print(tokenizer.tokenize(sentence))
# => [่‡ช็„ถ, ่จ€่ชž, ๅ‡ฆ็†, ใ‚’, ๅ‹‰ๅผท, ใ—, ใฆ, ใ„, ใพใ™]

tokenizer = WordTokenizer('Sentencepiece', model_path="data/model.spm")
print(tokenizer.tokenize(sentence))
# => [โ–, ่‡ช็„ถ, ่จ€่ชž, ๅ‡ฆ็†, ใ‚’, ๅ‹‰ๅผท, ใ—, ใฆใ„ใพใ™]
```
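The rule-based tokenizers mentioned in "Supported tokenizers" can be selected the same way. This is a minimal sketch assuming they are chosen by name like the other backends (see the documentation for the exact names):

```python
from konoha import WordTokenizer

# assumed names for the rule-based backends; no external tokenizer is required
tokenizer = WordTokenizer('Whitespace')
print(tokenizer.tokenize('่‡ช็„ถ ่จ€่ชž ๅ‡ฆ็†'))   # split on whitespace

tokenizer = WordTokenizer('Character')
print(tokenizer.tokenize('่‡ช็„ถ่จ€่ชž'))        # one token per character
```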

For more detail, please see the `example/` directory.

### Remote files

Konoha supports loading dictionaries and models from cloud storage (currently Amazon S3).
This requires installing konoha with the `remote` extra; see [Installation](#installation).

```python
from konoha import WordTokenizer

sentence = '่‡ช็„ถ่จ€่ชžๅ‡ฆ็†ใ‚’ๅ‹‰ๅผทใ—ใฆใ„ใพใ™'

# download a user dictionary from S3
word_tokenizer = WordTokenizer("mecab", user_dictionary_path="s3://abc/xxx.dic")
print(word_tokenizer.tokenize(sentence))

# download a system dictionary from S3
word_tokenizer = WordTokenizer("mecab", system_dictionary_path="s3://abc/yyy")
print(word_tokenizer.tokenize(sentence))

# download a model file from S3
word_tokenizer = WordTokenizer("sentencepiece", model_path="s3://abc/zzz.model")
print(word_tokenizer.tokenize(sentence))
```

### Sentence level tokenization

```python
from konoha import SentenceTokenizer

sentence = "็งใฏ็Œซใ ใ€‚ๅๅ‰ใชใ‚“ใฆใ‚‚ใฎใฏใชใ„ใ€‚ใ ใŒ๏ผŒใ€Œใ‹ใ‚ใ„ใ„ใ€‚ใใ‚Œใงๅๅˆ†ใ ใ‚ใ†ใ€ใ€‚"

tokenizer = SentenceTokenizer()
print(tokenizer.tokenize(sentence))
# => ['็งใฏ็Œซใ ใ€‚', 'ๅๅ‰ใชใ‚“ใฆใ‚‚ใฎใฏใชใ„ใ€‚', 'ใ ใŒ๏ผŒใ€Œใ‹ใ‚ใ„ใ„ใ€‚ใใ‚Œใงๅๅˆ†ใ ใ‚ใ†ใ€ใ€‚']
```

You can change the symbols used for sentence splitting and bracket expressions.

1. sentence splitter

```python
sentence = "็งใฏ็Œซใ ใ€‚ๅๅ‰ใชใ‚“ใฆใ‚‚ใฎใฏใชใ„๏ผŽใ ใŒ๏ผŒใ€Œใ‹ใ‚ใ„ใ„ใ€‚ใใ‚Œใงๅๅˆ†ใ ใ‚ใ†ใ€ใ€‚"

tokenizer = SentenceTokenizer(period="๏ผŽ")
print(tokenizer.tokenize(sentence))
# => ['็งใฏ็Œซใ ใ€‚ๅๅ‰ใชใ‚“ใฆใ‚‚ใฎใฏใชใ„๏ผŽ', 'ใ ใŒ๏ผŒใ€Œใ‹ใ‚ใ„ใ„ใ€‚ใใ‚Œใงๅๅˆ†ใ ใ‚ใ†ใ€ใ€‚']
```

2. bracket expression

```python
import re

sentence = "็งใฏ็Œซใ ใ€‚ๅๅ‰ใชใ‚“ใฆใ‚‚ใฎใฏใชใ„ใ€‚ใ ใŒ๏ผŒใ€Žใ‹ใ‚ใ„ใ„ใ€‚ใใ‚Œใงๅๅˆ†ใ ใ‚ใ†ใ€ใ€‚"

tokenizer = SentenceTokenizer(
    patterns=SentenceTokenizer.PATTERNS + [re.compile(r"ใ€Ž.*?ใ€")],
)
print(tokenizer.tokenize(sentence))
# => ['็งใฏ็Œซใ ใ€‚', 'ๅๅ‰ใชใ‚“ใฆใ‚‚ใฎใฏใชใ„ใ€‚', 'ใ ใŒ๏ผŒใ€Žใ‹ใ‚ใ„ใ„ใ€‚ใใ‚Œใงๅๅˆ†ใ ใ‚ใ†ใ€ใ€‚']
```

## Test

```bash
python -m pytest
```

## Article

- [I built konoha, a library for switching tokenizers easily (in Japanese)](https://qiita.com/klis/items/bb9ffa4d9c886af0f531)
- [Implementing AllenNLP integration in the Japanese text-processing tool Konoha (in Japanese)](https://qiita.com/klis/items/f1d29cb431d1bf879898)

## Acknowledgement

The SentencePiece model used in the tests was provided by @yoheikikuta. Thanks!