https://github.com/togatoga/kanpyo
Japanese Morphological Analyzer written in Rust
- Host: GitHub
- URL: https://github.com/togatoga/kanpyo
- Owner: togatoga
- License: mit
- Created: 2023-10-11T23:02:22.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2025-12-13T07:09:16.000Z (6 months ago)
- Last Synced: 2025-12-14T21:27:08.242Z (6 months ago)
- Topics: japanese, morphological, rust, tokenizer
- Language: Rust
- Homepage:
- Size: 10.4 MB
- Stars: 106
- Watchers: 1
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Kanpyo
[crates.io](https://crates.io/crates/kanpyo)
Kanpyo is a Japanese morphological analyzer written in Rust, inspired by [ikawaha/Kagome](https://github.com/ikawaha/kagome).
## Caution
This is a work in progress. The API may change without notice.
## Installation
### With Embedded Dictionary (Recommended)
The easiest way to install `kanpyo` is with the embedded dictionary. No additional setup required.
```shell script
cargo install kanpyo --features mecab-ipadic
```
or from git:
```shell script
cargo install --git https://github.com/togatoga/kanpyo kanpyo --features mecab-ipadic
```
The dictionary will be automatically downloaded from GitHub Releases during the build process and embedded into the binary.
### Without Embedded Dictionary
If you prefer a smaller binary size or want to use a custom dictionary:
```shell script
cargo install kanpyo
```
You then need to build and install a dictionary manually. The commands below assume they are run from the root of a clone of this repository:
```shell script
cd kanpyo-dict
tar xvf resource/mecab-ipadic-2.7.0-20070801.tar.gz -C resource
cargo run --release --bin ipa-dict-builder -- --dict resource/mecab-ipadic-2.7.0-20070801
```
The dictionary is installed in one of the following directories, depending on your OS:
- Linux: `$HOME/.config/kanpyo/`
- macOS: `$HOME/Library/Application Support/kanpyo/`
- Windows: `%APPDATA%\kanpyo\`
You're ready to use `kanpyo`!
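To check where the dictionary should end up on a given machine, the per-OS directories listed above can be derived from the environment. The following is a minimal sketch, not kanpyo's actual lookup code: the function names `dict_dir_for` and `dict_dir` are hypothetical, and it assumes the paths are built directly from `HOME` and `APPDATA` exactly as written in the list.

```rust
use std::env;
use std::path::PathBuf;

/// Hypothetical helper (not part of kanpyo): builds the dictionary directory
/// from the list above for the given OS name. `lookup` reads an environment
/// variable; the function returns None if the variable is unset or the OS is
/// not one of the three listed.
fn dict_dir_for(os: &str, lookup: impl Fn(&str) -> Option<String>) -> Option<PathBuf> {
    match os {
        "linux" => lookup("HOME").map(|h| PathBuf::from(h).join(".config").join("kanpyo")),
        "macos" => lookup("HOME").map(|h| {
            PathBuf::from(h)
                .join("Library")
                .join("Application Support")
                .join("kanpyo")
        }),
        "windows" => lookup("APPDATA").map(|a| PathBuf::from(a).join("kanpyo")),
        _ => None,
    }
}

/// Resolves the directory for the platform this code was compiled for.
fn dict_dir() -> Option<PathBuf> {
    dict_dir_for(env::consts::OS, |key| env::var(key).ok())
}
```

On Linux with `HOME=/home/alice`, `dict_dir()` would return `/home/alice/.config/kanpyo`. Passing the OS name and the variable lookup as parameters keeps the mapping testable without touching the real environment.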
## Usage
```shell script
kanpyo --help
Japanese Morphological Analyzer
Usage: kanpyo [COMMAND]
Commands:
tokenize Tokenize input text
graphviz Output lattice in Graphviz format
help Print this message or the help of the given subcommand(s)
Options:
-h, --help Print help
-V, --version Print version
```
### Tokenize
```shell script
kanpyo tokenize "すもももももももものうち"
すもも 名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も 助詞,係助詞,*,*,*,*,も,モ,モ
もも 名詞,一般,*,*,*,*,もも,モモ,モモ
も 助詞,係助詞,*,*,*,*,も,モ,モ
もも 名詞,一般,*,*,*,*,もも,モモ,モモ
の 助詞,連体化,*,*,*,*,の,ノ,ノ
うち 名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS
```
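Each output line above pairs a surface form with a comma-separated feature list; the nine fields follow the mecab-ipadic layout (part of speech, three subcategories, conjugation type, conjugation form, base form, reading, pronunciation). If you want to consume this output from another Rust program, a minimal parsing sketch could look like the following. The `Token` struct and `parse_line` function are hypothetical names, not part of kanpyo's API, and the code assumes the surface and the features are separated by a tab or a space.

```rust
/// One parsed output line: the surface form and its comma-separated features.
/// Hypothetical type for illustration; not part of kanpyo's API.
#[derive(Debug, PartialEq)]
struct Token {
    surface: String,
    features: Vec<String>,
}

/// Parses a single line of `kanpyo tokenize` output.
/// Returns None for the `EOS` marker, blank lines, and lines without features.
fn parse_line(line: &str) -> Option<Token> {
    let line = line.trim();
    if line.is_empty() || line == "EOS" {
        return None;
    }
    // Assumption: the surface and the feature list are separated by
    // whitespace (a tab or a space).
    let (surface, features) = line.split_once(char::is_whitespace)?;
    Some(Token {
        surface: surface.to_string(),
        features: features.trim().split(',').map(String::from).collect(),
    })
}
```

For the first line of the example, `parse_line` yields a token whose `surface` is `すもも` and whose `features[7]` (the reading) is `スモモ`.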
#### REPL mode
```shell script
kanpyo
自然言語処理
自然 名詞,形容動詞語幹,*,*,*,*,自然,シゼン,シゼン
言語 名詞,一般,*,*,*,*,言語,ゲンゴ,ゲンゴ
処理 名詞,サ変接続,*,*,*,*,処理,ショリ,ショリ
EOS
形態素解析
形態素 名詞,一般,*,*,*,*,形態素,ケイタイソ,ケイタイソ
解析 名詞,サ変接続,*,*,*,*,解析,カイセキ,カイセキ
EOS
```
#### From piped standard input
```shell script
echo "自然言語処理" | kanpyo
自然 名詞,形容動詞語幹,*,*,*,*,自然,シゼン,シゼン
言語 名詞,一般,*,*,*,*,言語,ゲンゴ,ゲンゴ
処理 名詞,サ変接続,*,*,*,*,処理,ショリ,ショリ
EOS
```
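In the REPL transcript above, every analyzed input ends with an `EOS` line. Assuming piped input with several lines behaves the same way (one `EOS` per input line, which this README does not state explicitly), the captured output can be split back into sentences. The helper name `split_sentences` below is hypothetical.

```rust
/// Groups captured output lines into sentences, treating each `EOS` line as
/// the end of one sentence. Hypothetical helper; not part of kanpyo.
fn split_sentences(output: &str) -> Vec<Vec<&str>> {
    let mut sentences = Vec::new();
    let mut current = Vec::new();
    for line in output.lines() {
        if line.trim() == "EOS" {
            // Close the current sentence and start a new one.
            sentences.push(std::mem::take(&mut current));
        } else if !line.trim().is_empty() {
            current.push(line);
        }
    }
    // Keep trailing token lines that were not closed by an EOS marker.
    if !current.is_empty() {
        sentences.push(current);
    }
    sentences
}
```

Each inner vector holds the raw token lines of one sentence, so it can be fed line by line into whatever per-token parsing you use.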
### Graphviz
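Print the lattice in Graphviz format for debugging. The example below pipes the output to `dot`, which requires Graphviz to be installed, and renders it as a PNG image.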
```shell script
kanpyo graphviz "自然言語処理" | dot -Tpng -o lattice.png
```

## TODO
- [ ] Support various dictionaries (Sudachi, UniDic, neologd, etc.)
- [ ] Support server mode
- [ ] Support search mode
- [ ] Add tests for dictionary loading and tokenization