https://github.com/daac-tools/python-vaporetto
🛥 Vaporetto is a fast and lightweight pointwise prediction based tokenizer. This is a Python wrapper for Vaporetto.
- Host: GitHub
- URL: https://github.com/daac-tools/python-vaporetto
- Owner: daac-tools
- License: apache-2.0
- Created: 2022-06-09T04:26:40.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2024-09-04T00:56:03.000Z (over 1 year ago)
- Last Synced: 2024-12-12T20:44:45.939Z (about 1 year ago)
- Topics: analyzer, japanese, morphological-analysis, nlp, python, rust, segmentation, tokenization, tokenizer
- Language: Rust
- Homepage:
- Size: 426 KB
- Stars: 21
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE-APACHE
README
# 🐍 python-vaporetto 🛥
[Vaporetto](https://github.com/daac-tools/vaporetto) is a fast and lightweight pointwise prediction based tokenizer.
This is a Python wrapper for Vaporetto.
[PyPI](https://pypi.org/project/vaporetto/)
[Build status](https://github.com/daac-tools/python-vaporetto/actions)
[Documentation](https://python-vaporetto.readthedocs.io/en/latest/?badge=latest)
## Installation
### Install pre-built package from PyPI
Run the following command:
```
$ pip install vaporetto
```
### Build from source
Before building, install the Rust compiler by following [the documentation](https://www.rust-lang.org/tools/install).
vaporetto uses `pyproject.toml`, so you also need to upgrade pip to version 19 or later.
```
$ pip install --upgrade pip
```
After setting up the environment, you can install vaporetto as follows:
```
$ pip install git+https://github.com/daac-tools/python-vaporetto
```
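As a rough sketch, the pip requirement above can be checked from Python before attempting a source build. The helper name `pip_supports_pyproject` is hypothetical and not part of vaporetto; it only compares pip's major version against 19, the minimum stated above.

```python
from importlib import metadata


def pip_supports_pyproject(pip_version: str, minimum_major: int = 19) -> bool:
    """Return True if the major component of pip_version is at least minimum_major."""
    major = int(pip_version.split(".")[0])
    return major >= minimum_major


try:
    installed = metadata.version("pip")
except metadata.PackageNotFoundError:
    # pip is not visible to this interpreter (for example, a bare environment).
    installed = None

if installed is None:
    print("pip not found")
elif pip_supports_pyproject(installed):
    print(f"pip {installed} is new enough")
else:
    print(f"pip {installed} is too old; run: pip install --upgrade pip")
```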
## Example Usage
python-vaporetto does not contain model files.
To perform tokenization, first download a distribution model or train your own model by following [the Vaporetto documentation](https://github.com/daac-tools/vaporetto).
Check the version number as shown below to use compatible models:
```python
>>> import vaporetto
>>> vaporetto.VAPORETTO_VERSION
'0.6.5'
```
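When a project pins a model built for a specific Vaporetto version, the check above can be automated. The following is a minimal sketch: `parse_version` and `require_version` are hypothetical helpers, and the exact-match rule is an assumption made for illustration, not a documented compatibility policy. Consult the Vaporetto documentation for which versions a model actually works with.

```python
def parse_version(version: str) -> tuple:
    """Split a dotted version string such as '0.6.5' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def require_version(installed: str, expected: str) -> None:
    """Raise RuntimeError unless both strings denote the same version.

    Exact matching is an assumed, conservative policy, not a documented rule.
    """
    if parse_version(installed) != parse_version(expected):
        raise RuntimeError(
            f"model expects Vaporetto {expected}, but {installed} is installed"
        )


# '0.6.5' is the value shown in the example above.
require_version("0.6.5", "0.6.5")  # passes silently
```

In real code, pass `vaporetto.VAPORETTO_VERSION` as the first argument.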
Examples:
```python
# Import vaporetto module
>>> import vaporetto
# Load the model file
>>> with open('tests/data/vaporetto.model', 'rb') as fp:
...     model = fp.read()
# Create an instance of the Vaporetto
>>> tokenizer = vaporetto.Vaporetto(model, predict_tags=True)
# Tokenize
>>> tokenizer.tokenize_to_string('まぁ社長は火星猫だ')
'まぁ/名詞/マー 社長/名詞/シャチョー は/助詞/ワ 火星/名詞/カセー 猫/名詞/ネコ だ/助動詞/ダ'
>>> tokens = tokenizer.tokenize('まぁ社長は火星猫だ')
>>> len(tokens)
6
>>> tokens[0].surface()
'まぁ'
>>> tokens[0].tag(0)
'名詞'
>>> tokens[0].tag(1)
'マー'
>>> [token.surface() for token in tokens]
['まぁ', '社長', 'は', '火星', '猫', 'だ']
```
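When downstream code needs plain Python data rather than token objects, the accessors shown above can be wrapped in a small helper. This is only a sketch: `tokens_to_rows` is a hypothetical name, and the number of tags per token is passed in explicitly (two for the model in the example above) rather than queried from the library.

```python
def tokens_to_rows(tokens, n_tags):
    """Convert tokens into (surface, [tag, ...]) tuples.

    Only the surface() and tag(i) accessors shown above are used.
    n_tags is supplied by the caller; it is an assumption about the model.
    """
    return [
        (token.surface(), [token.tag(i) for i in range(n_tags)])
        for token in tokens
    ]
```

Applied to the `tokens` list from the example with `n_tags=2`, this yields one tuple per token, six in total.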
## Note for distributed models
The distributed models are compressed with zstd. The API does not decompress them for you, so to load one you must
decompress it yourself and pass the resulting bytes to the constructor.
```python
>>> import vaporetto
>>> import zstandard  # zstandard package in PyPI
>>> dctx = zstandard.ZstdDecompressor()
>>> with open('tests/data/vaporetto.model.zst', 'rb') as fp:
...     with dctx.stream_reader(fp) as dict_reader:
...         tokenizer = vaporetto.Vaporetto(dict_reader.read(), predict_tags=True)
```
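If a project handles both compressed and uncompressed model files, the decompression step can be folded into one loader. A minimal sketch, assuming that compressed files carry a `.zst` suffix; `read_model_bytes` is a hypothetical helper, not part of vaporetto:

```python
from pathlib import Path


def read_model_bytes(path):
    """Return raw model bytes, decompressing files whose name ends in '.zst'."""
    path = Path(path)
    if path.suffix == ".zst":
        # Imported lazily so uncompressed models work without zstandard installed.
        import zstandard  # zstandard package in PyPI

        dctx = zstandard.ZstdDecompressor()
        with path.open("rb") as fp, dctx.stream_reader(fp) as reader:
            return reader.read()
    # Assumption: any other suffix means the file is already uncompressed.
    return path.read_bytes()
```

The result can be passed directly to the constructor, for example `vaporetto.Vaporetto(read_model_bytes('tests/data/vaporetto.model.zst'), predict_tags=True)`.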
## Note for KyTea's models
You can also use KyTea's models as follows:
```python
>>> import vaporetto
>>> with open('path/to/jp-0.4.7-5.mod', 'rb') as fp:  # doctest: +SKIP
...     tokenizer = vaporetto.Vaporetto.create_from_kytea_model(fp.read())
```
Note: Vaporetto does not support tag prediction with KyTea's models.
## [Speed Comparison](https://github.com/daac-tools/python-vaporetto/wiki/Speed-Comparison)
## License
Licensed under either of
* Apache License, Version 2.0
([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)
* MIT license
([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT)
at your option.
## Contribution
See [the guidelines](./CONTRIBUTING.md).