Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/daac-tools/vibrato

🎤 vibrato: Viterbi-based accelerated tokenizer
https://github.com/daac-tools/vibrato

japanese morphological-analysis nlp rust segmentation tokenization tokenizer

Last synced: 6 days ago
JSON representation

🎤 vibrato: Viterbi-based accelerated tokenizer

Host: GitHub
URL: https://github.com/daac-tools/vibrato
Owner: daac-tools
License: apache-2.0
Created: 2022-07-06T08:50:52.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2024-02-19T16:17:32.000Z (11 months ago)
Last Synced: 2024-04-24T13:55:27.875Z (9 months ago)
Topics: japanese, morphological-analysis, nlp, rust, segmentation, tokenization, tokenizer
Language: Rust
Homepage: https://docs.rs/vibrato
Size: 1.09 MB
Stars: 295
Watchers: 7
Forks: 14
Open Issues: 5
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE-APACHE

Awesome Lists containing this project

README

# 🎤 vibrato: VIterbi-Based acceleRAted TOkenizer

[![Crates.io](https://img.shields.io/crates/v/vibrato)](https://crates.io/crates/vibrato)
[![Documentation](https://docs.rs/vibrato/badge.svg)](https://docs.rs/vibrato)
[![Build Status](https://github.com/daac-tools/vibrato/actions/workflows/rust.yml/badge.svg)](https://github.com/daac-tools/vibrato/actions)
[![Slack](https://img.shields.io/badge/join-chat-brightgreen?logo=slack)](https://join.slack.com/t/daac-tools/shared_invite/zt-1pwwqbcz4-KxL95Nam9VinpPlzUpEGyA)

Vibrato is a fast implementation of tokenization (or morphological analysis) based on the Viterbi algorithm.

A Python wrapper is also available [here](https://github.com/daac-tools/python-vibrato).

[Wasm Demo](https://vibrato-demo.pages.dev/) (takes a little time to load the model.)

## Features

### Fast tokenization

Vibrato is a Rust reimplementation of the fast tokenizer [MeCab](https://taku910.github.io/mecab/),
although its implementation has been simplified and optimized for even faster tokenization.
Especially for language resources with a large matrix
(e.g., [`unidic-cwj-3.1.1`](https://clrd.ninjal.ac.jp/unidic/back_number.html#unidic_cwj) with a matrix of 459 MiB),
Vibrato will run faster thanks to cache-efficient id mappings.

For example, the following figure shows an experimental result of
tokenization time with MeCab and its reimplementations.
The detailed experimental settings and other results are available on [Wiki](https://github.com/daac-tools/vibrato/wiki/Speed-Comparison).

![](./figures/comparison.svg)

### MeCab compatibility

Vibrato supports options for outputting tokenized results identical to MeCab, such as ignoring whitespace.

### Training parameters

Vibrato also supports training parameters (or costs) in dictionaries from your corpora.
The detailed description can be found [here](./docs/train.md).

## Basic usage

This software is implemented in Rust.
First of all, install `rustc` and `cargo` following the [official instructions](https://www.rust-lang.org/tools/install).

### 1. Dictionary preparation

You can easily get started with Vibrato by downloading a precompiled dictionary.
[The Releases page](https://github.com/daac-tools/vibrato/releases) distributes
several precompiled dictionaries from different resources.

Here, consider to use [mecab-ipadic v2.7.0](https://taku910.github.io/mecab/).
(Specify an appropriate Vibrato release tag to `VERSION` such as `v0.5.0`.)

```
$ wget https://github.com/daac-tools/vibrato/releases/download/VERSION/ipadic-mecab-2_7_0.tar.xz
$ tar xf ipadic-mecab-2_7_0.tar.xz
```

You can also compile or train system dictionaries from your own resources.
See the [docs](./docs/) for more advanced usage.

### 2. Tokenization

To tokenize sentences using the system dictionary, run the following command.

```
$ echo '本とカレーの街神保町へようこそ。' | cargo run --release -p tokenize -- -i ipadic-mecab-2_7_0/system.dic.zst
```

The resultant tokens will be output in the Mecab format.

```
本名詞,一般,*,*,*,*,本,ホン,ホン
と助詞,並立助詞,*,*,*,*,と,ト,ト
カレー名詞,固有名詞,地域,一般,*,*,カレー,カレー,カレー
の助詞,連体化,*,*,*,*,の,ノ,ノ
街名詞,一般,*,*,*,*,街,マチ,マチ
神保名詞,固有名詞,地域,一般,*,*,神保,ジンボウ,ジンボー
町名詞,接尾,地域,*,*,*,町,マチ,マチ
へ助詞,格助詞,一般,*,*,*,へ,ヘ,エ
ようこそ感動詞,*,*,*,*,*,ようこそ,ヨウコソ,ヨーコソ
。記号,句点,*,*,*,*,。,。,。
EOS
```

If you want to output tokens separated by spaces, specify `-O wakati`.

```
$ echo '本とカレーの街神保町へようこそ。' | cargo run --release -p tokenize -- -i ipadic-mecab-2_7_0/system.dic.zst -O wakati
本とカレーの街神保町へようこそ。
```

### Notes for Vibrato APIs

The distributed models are compressed in zstd format.
If you want to load these compressed models with the `vibrato` API,
you must decompress them outside of the API.

```rust
// Requires zstd crate or ruzstd crate
let reader = zstd::Decoder::new(File::open("path/to/system.dic.zst")?)?;
let dict = Dictionary::read(reader)?;
```

## Tokenization options

### MeCab-compatible options

Vibrato is a reimplementation of the MeCab algorithm,
but with the default settings it can produce different tokens from MeCab.

For example, MeCab ignores spaces (more precisely, `SPACE` defined in `char.def`) in tokenization.

```
$ echo "mens second bag" | mecab
mens 名詞,固有名詞,組織,*,*,*,*
second 名詞,一般,*,*,*,*,*
bag 名詞,固有名詞,組織,*,*,*,*
EOS
```

However, Vibrato handles such spaces as tokens with the default settings.

```
$ echo 'mens second bag' | cargo run --release -p tokenize -- -i ipadic-mecab-2_7_0/system.dic.zst
mens 名詞,固有名詞,組織,*,*,*,*
記号,空白,*,*,*,*,*
second 名詞,固有名詞,組織,*,*,*,*
記号,空白,*,*,*,*,*
bag 名詞,固有名詞,組織,*,*,*,*
EOS
```

If you want to obtain the same results as MeCab, specify the arguments `-S` and `-M 24`.

```
$ echo 'mens second bag' | cargo run --release -p tokenize -- -i ipadic-mecab-2_7_0/system.dic.zst -S -M 24
mens 名詞,固有名詞,組織,*,*,*,*
second 名詞,一般,*,*,*,*,*
bag 名詞,固有名詞,組織,*,*,*,*
EOS
```

`-S` indicates if spaces are ignored.
`-M` indicates the maximum grouping length for unknown words.

#### Notes

There are corner cases where tokenization results in different outcomes due to cost tiebreakers.
However, this would be not an essential problem.

### User dictionary

You can use your user dictionary along with the system dictionary.
The user dictionary must be in the CSV format.

```
,,,,
```

The first four columns are always required.
The others (i.e., ``) are optional.

For example,

```
$ cat user.csv
神保町,1293,1293,334,カスタム名詞,ジンボチョウ
本とカレーの街,1293,1293,0,カスタム名詞,ホントカレーノマチ
ようこそ,3,3,-1000,感動詞,ヨーコソ,Welcome,欢迎欢迎,Benvenuto,Willkommen
```

To use the user dictionary, specify the file with the `-u` argument.

```
$ echo '本とカレーの街神保町へようこそ。' | cargo run --release -p tokenize -- -i ipadic-mecab-2_7_0/system.dic.zst -u user.csv
本とカレーの街カスタム名詞,ホントカレーノマチ
神保町カスタム名詞,ジンボチョウ
へ助詞,格助詞,一般,*,*,*,へ,ヘ,エ
ようこそ感動詞,ヨーコソ,Welcome,欢迎欢迎,Benvenuto,Willkommen
。記号,句点,*,*,*,*,。,。,。
EOS
```

## More advanced usages

The directory [docs](./docs/) provides descriptions of more advanced usages such as training or benchmarking.

## Slack

We have a Slack workspace for developers and users to ask questions and discuss a variety of topics.

* https://daac-tools.slack.com/
* Please get an invitation from [here](https://join.slack.com/t/daac-tools/shared_invite/zt-1pwwqbcz4-KxL95Nam9VinpPlzUpEGyA).

## License

Licensed under either of

* Apache License, Version 2.0
([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)
* MIT license
([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT)

at your option.

## Acknowledgment

The initial version of this software was developed by LegalOn Technologies, Inc.,
but not an officially supported LegalOn Technologies product.

## Contribution

See [the guidelines](./CONTRIBUTING.md).

## References

Technical details of Vibrato are available in the following resources:

- 神田峻介, 赤部晃一, 後藤啓介, 小田悠介.
[最小コスト法に基づく形態素解析におけるCPUキャッシュの効率化](https://www.anlp.jp/proceedings/annual_meeting/2023/pdf_dir/C2-4.pdf),
言語処理学会第29回年次大会 (NLP2023).
- 赤部晃一, 神田峻介, 小田悠介.
[CRFに基づく形態素解析器のスコア計算の分割によるモデルサイズと解析速度の調整](https://www.anlp.jp/proceedings/annual_meeting/2023/pdf_dir/C2-1.pdf),
言語処理学会第29回年次大会 (NLP2023).
- [MeCab互換な形態素解析器Vibratoの高速化技法](https://tech.legalforce.co.jp/entry/2022/09/20/133132),
LegalOn Technologies Engineering Blog (2022-09-20).