{"id":13678978,"url":"https://github.com/megagonlabs/bunkai","last_synced_at":"2025-04-05T21:07:43.850Z","repository":{"id":37519242,"uuid":"360016907","full_name":"megagonlabs/bunkai","owner":"megagonlabs","description":"Sentence boundary disambiguation tool for Japanese texts (日本語文境界判定器)","archived":false,"fork":false,"pushed_at":"2024-03-26T00:49:25.000Z","size":1242,"stargazers_count":189,"open_issues_count":18,"forks_count":11,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-03-29T19:06:43.411Z","etag":null,"topics":["japanese","python","sentence-boundary-detection","sentence-tokenizer"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/bunkai/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/megagonlabs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-04-21T03:07:43.000Z","updated_at":"2025-01-15T07:22:03.000Z","dependencies_parsed_at":"2024-06-21T05:03:49.537Z","dependency_job_id":"6206600d-acfb-40fc-b4ba-025b442634c1","html_url":"https://github.com/megagonlabs/bunkai","commit_stats":{"total_commits":443,"total_committers":5,"mean_commits":88.6,"dds":"0.25056433408577883","last_synced_commit":"b08985fba120ceff663f4a45b30ba529d01f70d9"},"previous_names":[],"tags_count":19,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/megagonlabs%2Fbunkai","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/megagonlabs%2Fbunkai/tags","releases_url":"https://repos.ecosyste.ms/a
pi/v1/hosts/GitHub/repositories/megagonlabs%2Fbunkai/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/megagonlabs%2Fbunkai/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/megagonlabs","download_url":"https://codeload.github.com/megagonlabs/bunkai/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247399877,"owners_count":20932876,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["japanese","python","sentence-boundary-detection","sentence-tokenizer"],"created_at":"2024-08-02T13:01:00.468Z","updated_at":"2025-04-05T21:07:43.831Z","avatar_url":"https://github.com/megagonlabs.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# Bunkai\n\n[![PyPI version](https://badge.fury.io/py/bunkai.svg)](https://badge.fury.io/py/bunkai)\n[![Python 
Versions](https://img.shields.io/pypi/pyversions/bunkai.svg)](https://pypi.org/project/bunkai/)\n[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)\n[![Downloads](https://pepy.tech/badge/bunkai/week)](https://pepy.tech/project/bunkai)\n\n[![CI](https://github.com/megagonlabs/bunkai/actions/workflows/ci.yml/badge.svg)](https://github.com/megagonlabs/bunkai/actions/workflows/ci.yml)\n[![Typos](https://github.com/megagonlabs/bunkai/actions/workflows/typos.yml/badge.svg)](https://github.com/megagonlabs/bunkai/actions/workflows/typos.yml)\n[![CodeQL](https://github.com/megagonlabs/bunkai/actions/workflows/codeql-analysis.yml/badge.svg)](https://github.com/megagonlabs/bunkai/actions/workflows/codeql-analysis.yml)\n[![Maintainability](https://api.codeclimate.com/v1/badges/640b02fa0164c131da10/maintainability)](https://codeclimate.com/github/megagonlabs/bunkai/maintainability)\n[![Test Coverage](https://api.codeclimate.com/v1/badges/640b02fa0164c131da10/test_coverage)](https://codeclimate.com/github/megagonlabs/bunkai/test_coverage)\n[![markdownlint](https://img.shields.io/badge/markdown-lint-lightgrey)](https://github.com/markdownlint/markdownlint)\n[![jsonlint](https://img.shields.io/badge/json-lint-lightgrey)](https://github.com/dmeranda/demjson)\n[![yamllint](https://img.shields.io/badge/yaml-lint-lightgrey)](https://github.com/adrienverge/yamllint)\n\nBunkai is a sentence boundary (SB) disambiguation tool for Japanese texts.  \n    Bunkaiは日本語文境界判定器です．\n\n## Quick Start\n\n### Install\n\n```console\n$ pip install -U bunkai\n```\n\n### Disambiguation without Models\n\n```console\n$ echo -e '宿を予約しました♪!まだ2ヶ月も先だけど。早すぎかな(笑)楽しみです★\\n2文書目の先頭行です。▁改行はU+2581で表現します。' \\\n    | bunkai\n宿を予約しました♪!│まだ2ヶ月も先だけど。│早すぎかな(笑)│楽しみです★\n2文書目の先頭行です。▁│改行はU+2581で表現します。\n```\n\n- Feed a document as one line by using ``▁`` (U+2581) for line breaks.  
\n    1行は1つの文書を表します．文書中の改行は ``▁`` (U+2581) で与えてください．\n- The output shows sentence boundaries with ``│`` (U+2502).  \n    出力では文境界は``│`` (U+2502) で表示されます．\n\n### Disambiguation for Line Breaks with a Model\n\nIf you want to disambiguate sentence boundaries for line breaks, please add a ``--model`` option with the path to the model.  \n    改行記号に対しても文境界判定を行いたい場合は，``--model``オプションを与える必要があります．\n\nFirst, install the extras required for the ``--model`` option.  \n    ``--model``オプションを利用するために、まずextraパッケージをインストールしてください．\n\n```console\n$ pip install -U 'bunkai[lb]'\n```\n\nSecond, set up a model. This takes some time.  \n    次にモデルをセットアップする必要があります．セットアップには少々時間がかかります．\n\n```console\n$ bunkai --model bunkai-model-directory --setup\n```\n\nThen, specify the model directory when running Bunkai.  \n    そしてモデルを指定して動かしてください．\n\n```console\n$ echo -e \"文の途中で改行を▁入れる文章ってありますよね▁それも対象です。\" | bunkai --model bunkai-model-directory\n文の途中で改行を▁入れる文章ってありますよね▁│それも対象です。\n```\n\n### Morphological Analysis Result\n\nYou can get morphological analysis results with the ``--ma`` option.  \n``--ma``オプションを付与すると形態素解析結果が得られます．\n\nIt can be used with the ``--model`` option.  \n``--model``オプションと同時に使えます．\n\n```console\n$ echo -e '形態素解析し▁ます。結果を 表示します！' | bunkai --ma --model bunkai-model-directory\n形態素\t名詞,一般,*,*,*,*,形態素,ケイタイソ,ケイタイソ\n解析\t名詞,サ変接続,*,*,*,*,解析,カイセキ,カイセキ\nし\t動詞,自立,*,*,サ変・スル,連用形,する,シ,シ\n▁\nます\t助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス\n。\t記号,句点,*,*,*,*,。,。,。\nEOS\n結果\t名詞,副詞可能,*,*,*,*,結果,ケッカ,ケッカ\nを\t助詞,格助詞,一般,*,*,*,を,ヲ,ヲ\n \t記号,空白,*,*,*,*, ,*,*\n表示\t名詞,サ変接続,*,*,*,*,表示,ヒョウジ,ヒョージ\nし\t動詞,自立,*,*,サ変・スル,連用形,する,シ,シ\nます\t助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス\n！\t記号,一般,*,*,*,*,！,！,！\nEOS\n```\n\n### Python Library\n\nYou can also use Bunkai as a Python library.  
\n  BunkaiはPythonライブラリとしても使えます．\n\n```python\nfrom bunkai import Bunkai\nbunkai = Bunkai()\nfor sentence in bunkai(\"はい。このようにpythonライブラリとしても使えます！\"):\n    print(sentence)\n```\n\n改行を文境界判定に含める場合はセットアップしたモデルパスを指定してください．  \n    To disambiguate line breaks as well, specify the path to the model you set up.\n\n```python\nfrom pathlib import Path\n\nfrom bunkai import Bunkai\n\nbunkai = Bunkai(path_model=Path(\"bunkai-model-directory\"))\nfor sentence in bunkai(\"そうなんです▁このように▁pythonライブラリとしても▁使えます！\"):\n    print(sentence)\n\n\"\"\"\nOutput:\nそうなんです▁\nこのように▁pythonライブラリとしても▁使えます！\n\"\"\"\n```\n\nFor more information, see [examples](example).  \n    ほかの例は[examples](example)をご覧ください．\n\n## Documents\n\n- [Documents](docs)\n\n## References\n\n- Yuta Hayashibe and Kensuke Mitsuzawa.\n    Sentence Boundary Detection on Line Breaks in Japanese.\n    Proceedings of The 6th Workshop on Noisy User-generated Text (W-NUT 2020), pp.71-75.\n    November 2020.\n    [[PDF]](https://www.aclweb.org/anthology/2020.wnut-1.10.pdf)\n    [[bib]](https://www.aclweb.org/anthology/2020.wnut-1.10.bib)\n\n## License\n\nApache License 2.0\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmegagonlabs%2Fbunkai","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmegagonlabs%2Fbunkai","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmegagonlabs%2Fbunkai/lists"}