https://github.com/daac-tools/python-vibrato

Viterbi-based accelerated tokenizer (Python wrapper)
https://github.com/daac-tools/python-vibrato

morphological-analysis nlp python segmentation tokenizer

Last synced: over 1 year ago
JSON representation

Viterbi-based accelerated tokenizer (Python wrapper)

Host: GitHub
URL: https://github.com/daac-tools/python-vibrato
Owner: daac-tools
License: apache-2.0
Created: 2022-12-08T03:30:10.000Z (over 3 years ago)
Default Branch: main
Last Pushed: 2024-09-04T00:55:21.000Z (almost 2 years ago)
Last Synced: 2025-02-28T20:36:41.753Z (over 1 year ago)
Topics: morphological-analysis, nlp, python, segmentation, tokenizer
Language: Rust
Homepage:
Size: 39.1 KB
Stars: 41
Watchers: 2
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE-APACHE

Awesome Lists containing this project

README

          # 🐍 python-vibrato 🎤

[Vibrato](https://github.com/daac-tools/vibrato) is a fast implementation of tokenization (or morphological analysis) based on the Viterbi algorithm.

This is a Python wrapper for Vibrato.

[![PyPI](https://img.shields.io/pypi/v/vibrato)](https://pypi.org/project/vibrato/)

[![Build Status](https://github.com/daac-tools/python-vibrato/actions/workflows/CI.yml/badge.svg)](https://github.com/daac-tools/python-vibrato/actions)

[![Documentation Status](https://readthedocs.org/projects/python-vibrato/badge/?version=latest)](https://python-vibrato.readthedocs.io/en/latest/?badge=latest)

## Installation

### Install pre-built package from PyPI

Run the following command:

```

$ pip install vibrato

```

### Build from source

You need to install the Rust compiler following [the documentation](https://www.rust-lang.org/tools/install) beforehand.

vibrato uses `pyproject.toml`, so you also need to upgrade pip to version 19 or later.

```

$ pip install --upgrade pip

```

After setting up the environment, you can install vibrato as follows:

```

$ pip install git+https://github.com/daac-tools/python-vibrato

```

## Example Usage

python-vibrato does not contain model files.

To perform tokenization, follow [the document of Vibrato](https://github.com/daac-tools/vibrato) to download distribution models or train your own models beforehand.

Check the version number as shown below to use compatible models:

```python

>>> import vibrato

>>> vibrato.VIBRATO_VERSION

'0.5.1'

```

Examples:

```python

>>> import vibrato

>>> with open('tests/data/system.dic', 'rb') as fp:

...     tokenizer = vibrato.Vibrato(fp.read())

>>> tokens = tokenizer.tokenize('社長は火星猫だ')

>>> len(tokens)

5

>>> tokens[0]

Token { surface: "社長", feature: "名詞,普通名詞,一般,*" }

>>> tokens[0].surface()

'社長'

>>> tokens[0].feature()

'名詞,普通名詞,一般,*'

>>> tokens[0].start()

0

>>> tokens[0].end()

2

```

## Note for distributed models

The distributed models are compressed in zstd format. If you want to load these compressed models,

you must decompress them outside the API.

```python

>>> import vibrato

>>> import zstandard  # zstandard package in PyPI

>>> dctx = zstandard.ZstdDecompressor()

>>> with open('tests/data/system.dic.zst', 'rb') as fp:

...     with dctx.stream_reader(fp) as dict_reader:

...         tokenizer = vibrato.Vibrato(dict_reader.read())

```

## License

Licensed under either of

 * Apache License, Version 2.0

   ([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)

 * MIT license

   ([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT)

at your option.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/daac-tools/python-vibrato

Awesome Lists containing this project

README