Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/himkt/konoha
🌿 An easy-to-use Japanese Text Processing tool, which makes it possible to switch tokenizers with small changes of code.
janome japanese kytea mecab natural-language-processing nlp sentencepiece sudachi text-processing
Last synced: 7 days ago
- Host: GitHub
- URL: https://github.com/himkt/konoha
- Owner: himkt
- License: mit
- Created: 2018-08-22T14:00:15.000Z (over 6 years ago)
- Default Branch: main
- Last Pushed: 2024-05-15T15:02:04.000Z (8 months ago)
- Last Synced: 2024-05-16T18:04:42.728Z (8 months ago)
- Topics: janome, japanese, kytea, mecab, natural-language-processing, nlp, sentencepiece, sudachi, text-processing
- Language: Python
- Homepage: https://pypi.org/project/konoha
- Size: 1.19 MB
- Stars: 214
- Watchers: 7
- Forks: 21
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# 🌿 Konoha: Simple wrapper of Japanese Tokenizers
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/himkt/konoha/blob/main/example/Konoha_Example.ipynb)
[![GitHub stars](https://img.shields.io/github/stars/himkt/konoha?style=social)](https://github.com/himkt/konoha/stargazers)
[![Downloads](https://pepy.tech/badge/konoha)](https://pepy.tech/project/konoha)
[![Downloads](https://pepy.tech/badge/konoha/month)](https://pepy.tech/project/konoha/month)
[![Downloads](https://pepy.tech/badge/konoha/week)](https://pepy.tech/project/konoha/week)
[![Build Status](https://github.com/himkt/konoha/actions/workflows/ci.yml/badge.svg)](https://github.com/himkt/konoha/actions/workflows/ci.yml)
[![Documentation Status](https://readthedocs.org/projects/konoha/badge/?version=latest)](https://konoha.readthedocs.io/en/latest/?badge=latest)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/konoha)
[![PyPI](https://img.shields.io/pypi/v/konoha.svg)](https://pypi.python.org/pypi/konoha)
[![GitHub Issues](https://img.shields.io/github/issues/himkt/konoha.svg?cacheSeconds=60&color=yellow)](https://github.com/himkt/konoha/issues)
[![GitHub Pull Requests](https://img.shields.io/github/issues-pr/himkt/konoha.svg?cacheSeconds=60&color=yellow)](https://github.com/himkt/konoha/issues)

`Konoha` is a Python library that provides an easy-to-use, integrated interface to various Japanese tokenizers, which enables you to switch tokenizers and boost your pre-processing.

## Supported tokenizers

Konoha wraps several tokenizers, including MeCab, Janome, KyTea, Sentencepiece, and Sudachi.
Also, `konoha` provides rule-based tokenizers (whitespace, character) and a rule-based sentence splitter; see the sketch below.
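As a quick sketch of switching tokenizers, including the rule-based ones, the following assumes that `WordTokenizer` accepts the selector strings `"Whitespace"` and `"Character"`; those names are an assumption, since only `MeCab` and `Sentencepiece` appear in the examples below.

```python
from konoha import SentenceTokenizer, WordTokenizer

# Rule-based word tokenizers: no dictionary or trained model required.
# (The tokenizer names "Whitespace" and "Character" are assumed here.)
whitespace_tokenizer = WordTokenizer("Whitespace")
print(whitespace_tokenizer.tokenize("natural language processing"))

character_tokenizer = WordTokenizer("Character")
print(character_tokenizer.tokenize("自然言語処理"))

# Rule-based sentence splitter.
sentence_tokenizer = SentenceTokenizer()
print(sentence_tokenizer.tokenize("私は猫だ。名前なんてものはない。"))
```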
## Quick Start with Docker
Simply run the following on your computer:
```bash
docker run --rm -p 8000:8000 -t himkt/konoha # from DockerHub
```

Or you can build the image on your machine:
```bash
git clone https://github.com/himkt/konoha # download konoha
cd konoha && docker-compose up --build # build and launch container
```

Tokenization is done by posting a JSON object to `localhost:8000/api/v1/tokenize`.
You can also tokenize in batches by passing `texts: ["１つ目の入力", "２つ目の入力"]` to `localhost:8000/api/v1/batch_tokenize`.
(API documentation is available at `localhost:8000/redoc`; you can check it in your web browser.)
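As a sketch, calling the batch endpoint from Python might look like the following; this assumes the server above is running on `localhost:8000`, that the request also carries a `tokenizer` field as in the single-text example, and that the batch response uses the same `tokens` structure as the response shown below.

```python
import requests

# Tokenize two texts in a single request to the batch endpoint.
response = requests.post(
    "http://localhost:8000/api/v1/batch_tokenize",
    json={"tokenizer": "mecab", "texts": ["１つ目の入力", "２つ目の入力"]},
)
response.raise_for_status()

# Print the surface form of every token for each input text.
for tokens in response.json()["tokens"]:
    print([token["surface"] for token in tokens])
```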
Send a request using `curl` on your terminal.
Note that the endpoint paths changed in v4.6.4.
Please check the release note (https://github.com/himkt/konoha/releases/tag/v4.6.4).

```json
$ curl localhost:8000/api/v1/tokenize -X POST -H "Content-Type: application/json" \
-d '{"tokenizer": "mecab", "text": "ใใใฏใใณใงใ"}'{
"tokens": [
[
{
"surface": "ใใ",
"part_of_speech": "ๅ่ฉ"
},
{
"surface": "ใฏ",
"part_of_speech": "ๅฉ่ฉ"
},
{
"surface": "ใใณ",
"part_of_speech": "ๅ่ฉ"
},
{
"surface": "ใงใ",
"part_of_speech": "ๅฉๅ่ฉ"
}
]
]
}
```

## Installation
I recommend installing konoha with `pip install 'konoha[all]'`.
- Install konoha with a specific tokenizer: `pip install 'konoha[(tokenizer_name)]'`.
- Install konoha with a specific tokenizer and remote file support: `pip install 'konoha[(tokenizer_name),remote]'`

If you want to use konoha with a specific tokenizer, install konoha with that tokenizer's extra
(e.g. `konoha[mecab]`, `konoha[sudachi]`, etc.) or install the tokenizers individually.

## Example
### Word level tokenization
```python
from konoha import WordTokenizer

sentence = '自然言語処理を勉強しています'
tokenizer = WordTokenizer('MeCab')
print(tokenizer.tokenize(sentence))
# => [自然, 言語, 処理, を, 勉強, し, て, い, ます]

tokenizer = WordTokenizer('Sentencepiece', model_path="data/model.spm")
print(tokenizer.tokenize(sentence))
# => [▁, 自然, 言語, 処理, を, 勉強, し, ています]
```

For more detail, please see the `example/` directory.
### Remote files
Konoha supports loading dictionaries and models from cloud storage (currently Amazon S3).
It requires installing konoha with the `remote` option; see [Installation](#installation).

```python
from konoha import WordTokenizer

sentence = "自然言語処理を勉強しています"

# download user dictionary from S3
word_tokenizer = WordTokenizer("mecab", user_dictionary_path="s3://abc/xxx.dic")
print(word_tokenizer.tokenize(sentence))

# download system dictionary from S3
word_tokenizer = WordTokenizer("mecab", system_dictionary_path="s3://abc/yyy")
print(word_tokenizer.tokenize(sentence))

# download model file from S3
word_tokenizer = WordTokenizer("sentencepiece", model_path="s3://abc/zzz.model")
print(word_tokenizer.tokenize(sentence))
```

### Sentence level tokenization
```python
from konoha import SentenceTokenizer

sentence = "私は猫だ。名前なんてものはない。だが，「かわいい。それで十分だろう」。"
tokenizer = SentenceTokenizer()
print(tokenizer.tokenize(sentence))
# => ['私は猫だ。', '名前なんてものはない。', 'だが，「かわいい。それで十分だろう」。']
```

You can change the symbols used for sentence splitting and for bracket expressions.
1. sentence splitter
```python
sentence = "私は猫だ。名前なんてものはない．だが，「かわいい。それで十分だろう」。"

tokenizer = SentenceTokenizer(period="．")
print(tokenizer.tokenize(sentence))
# => ['私は猫だ。名前なんてものはない．', 'だが，「かわいい。それで十分だろう」。']
```

2. bracket expression
```python
import re

sentence = "私は猫だ。名前なんてものはない。だが，『かわいい。それで十分だろう』。"

tokenizer = SentenceTokenizer(
    patterns=SentenceTokenizer.PATTERNS + [re.compile(r"『.*?』")],
)
)
print(tokenizer.tokenize(sentence))
# => ['私は猫だ。', '名前なんてものはない。', 'だが，『かわいい。それで十分だろう』。']
```

## Test
```
python -m pytest
```

## Article
- [I built konoha, a library for switching tokenizers easily (in Japanese)](https://qiita.com/klis/items/bb9ffa4d9c886af0f531)
- [Implemented AllenNLP integration for the Japanese analysis tool Konoha (in Japanese)](https://qiita.com/klis/items/f1d29cb431d1bf879898)

## Acknowledgement
The Sentencepiece model used in the tests is provided by @yoheikikuta. Thanks!