{"id":21961752,"url":"https://github.com/daac-tools/python-vaporetto","last_synced_at":"2025-10-11T08:44:29.765Z","repository":{"id":54504035,"uuid":"501503088","full_name":"daac-tools/python-vaporetto","owner":"daac-tools","description":"🛥 Vaporetto is a fast and lightweight pointwise prediction based tokenizer. This is a Python wrapper for Vaporetto.","archived":false,"fork":false,"pushed_at":"2024-09-04T00:56:03.000Z","size":436,"stargazers_count":21,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-12-12T20:44:45.939Z","etag":null,"topics":["analyzer","japanese","morphological-analysis","nlp","python","rust","segmentation","tokenization","tokenizer"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/daac-tools.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE-APACHE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-06-09T04:26:40.000Z","updated_at":"2024-11-21T10:20:37.000Z","dependencies_parsed_at":"2023-02-18T18:01:10.313Z","dependency_job_id":null,"html_url":"https://github.com/daac-tools/python-vaporetto","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/daac-tools%2Fpython-vaporetto","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/daac-tools%2Fpython-vaporetto/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/daac-tools%2Fpython-vaporetto/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/daac-tools%2Fpython-vaporetto/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/daac-tools","download_url":"https://codeload.github.com/daac-tools/python-vaporetto/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":230537062,"owners_count":18241515,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["analyzer","japanese","morphological-analysis","nlp","python","rust","segmentation","tokenization","tokenizer"],"created_at":"2024-11-29T10:17:50.366Z","updated_at":"2025-10-11T08:44:29.667Z","avatar_url":"https://github.com/daac-tools.png","language":"Rust","readme":"# 🐍 python-vaporetto 🛥\n\n[Vaporetto](https://github.com/daac-tools/vaporetto) is a fast and lightweight pointwise prediction based tokenizer.\nThis is a Python wrapper for Vaporetto.\n\n[![PyPI](https://img.shields.io/pypi/v/vaporetto)](https://pypi.org/project/vaporetto/)\n[![Build Status](https://github.com/daac-tools/python-vaporetto/actions/workflows/CI.yml/badge.svg)](https://github.com/daac-tools/python-vaporetto/actions)\n[![Documentation Status](https://readthedocs.org/projects/python-vaporetto/badge/?version=latest)](https://python-vaporetto.readthedocs.io/en/latest/?badge=latest)\n\n## Installation\n\n### Install pre-built package from PyPI\n\nRun the following command:\n\n```\n$ pip install vaporetto\n```\n\n### Build from source\n\nYou need to install the Rust compiler following [the documentation](https://www.rust-lang.org/tools/install) beforehand.\nvaporetto uses `pyproject.toml`, so you also need to upgrade pip to version 19 or later.\n\n```\n$ pip install --upgrade pip\n```\n\nAfter setting up the environment, you can install vaporetto as follows:\n\n```\n$ pip install git+https://github.com/daac-tools/python-vaporetto\n```\n\n## Example Usage\n\npython-vaporetto does not contain model files.\nTo perform tokenization, follow [the document of Vaporetto](https://github.com/daac-tools/vaporetto) to download distribution models or train your own models beforehand.\n\nCheck the version number as shown below to use compatible models:\n\n```python\n\u003e\u003e\u003e import vaporetto\n\u003e\u003e\u003e vaporetto.VAPORETTO_VERSION\n'0.6.5'\n\n```\n\nExamples:\n\n```python\n# Import vaporetto module\n\u003e\u003e\u003e import vaporetto\n\n# Load the model file\n\u003e\u003e\u003e with open('tests/data/vaporetto.model', 'rb') as fp:\n...     model = fp.read()\n\n# Create an instance of the Vaporetto\n\u003e\u003e\u003e tokenizer = vaporetto.Vaporetto(model, predict_tags = True)\n\n# Tokenize\n\u003e\u003e\u003e tokenizer.tokenize_to_string('まぁ社長は火星猫だ')\n'まぁ/名詞/マー 社長/名詞/シャチョー は/助詞/ワ 火星/名詞/カセー 猫/名詞/ネコ だ/助動詞/ダ'\n\n\u003e\u003e\u003e tokens = tokenizer.tokenize('まぁ社長は火星猫だ')\n\n\u003e\u003e\u003e len(tokens)\n6\n\n\u003e\u003e\u003e tokens[0].surface()\n'まぁ'\n\n\u003e\u003e\u003e tokens[0].tag(0)\n'名詞'\n\n\u003e\u003e\u003e tokens[0].tag(1)\n'マー'\n\n\u003e\u003e\u003e [token.surface() for token in tokens]\n['まぁ', '社長', 'は', '火星', '猫', 'だ']\n\n```\n\n## Note for distributed models\n\nThe distributed models are compressed in zstd format. If you want to load these compressed models,\nyou must decompress them outside the API.\n\n```python\n\u003e\u003e\u003e import vaporetto\n\u003e\u003e\u003e import zstandard  # zstandard package in PyPI\n\n\u003e\u003e\u003e dctx = zstandard.ZstdDecompressor()\n\u003e\u003e\u003e with open('tests/data/vaporetto.model.zst', 'rb') as fp:\n...    with dctx.stream_reader(fp) as dict_reader:\n...        tokenizer = vaporetto.Vaporetto(dict_reader.read(), predict_tags = True)\n\n```\n\n## Note for KyTea's models\n\nYou can also use KyTea's models as follows:\n\n```python\n\u003e\u003e\u003e with open('path/to/jp-0.4.7-5.mod', 'rb') as fp:  # doctest: +SKIP\n...     tokenizer = vaporetto.Vaporetto.create_from_kytea_model(fp.read())\n\n```\n\nNote: Vaporetto does not support tag prediction with KyTea's models.\n\n## [Speed Comparison](https://github.com/daac-tools/python-vaporetto/wiki/Speed-Comparison)\n\n## License\n\nLicensed under either of\n\n * Apache License, Version 2.0\n   ([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)\n * MIT license\n   ([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT)\n\nat your option.\n\n## Contribution\n\nSee [the guidelines](./CONTRIBUTING.md).\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdaac-tools%2Fpython-vaporetto","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdaac-tools%2Fpython-vaporetto","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdaac-tools%2Fpython-vaporetto/lists"}