{"id":13592994,"url":"https://github.com/WorksApplications/SudachiPy","last_synced_at":"2025-04-09T02:32:03.414Z","repository":{"id":25183530,"uuid":"103384678","full_name":"WorksApplications/SudachiPy","owner":"WorksApplications","description":"Python version of Sudachi, a Japanese tokenizer.","archived":true,"fork":false,"pushed_at":"2022-10-07T07:38:45.000Z","size":685,"stargazers_count":391,"open_issues_count":17,"forks_count":50,"subscribers_count":24,"default_branch":"develop","last_synced_at":"2024-11-06T14:41:25.848Z","etag":null,"topics":["morphological-analysis","nlp-library","pos-tagging","segmentation"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/WorksApplications.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null},"funding":{"github":"WorksApplications"}},"created_at":"2017-09-13T10:10:16.000Z","updated_at":"2024-10-31T03:03:34.000Z","dependencies_parsed_at":"2023-01-14T02:16:43.551Z","dependency_job_id":null,"html_url":"https://github.com/WorksApplications/SudachiPy","commit_stats":null,"previous_names":[],"tags_count":35,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WorksApplications%2FSudachiPy","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WorksApplications%2FSudachiPy/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WorksApplications%2FSudachiPy/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WorksApplications%2FSudachiPy/manifests","owner_url":"https://repos.ecosyste.ms/a
pi/v1/hosts/GitHub/owners/WorksApplications","download_url":"https://codeload.github.com/WorksApplications/SudachiPy/tar.gz/refs/heads/develop","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247965623,"owners_count":21025407,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["morphological-analysis","nlp-library","pos-tagging","segmentation"],"created_at":"2024-08-01T16:01:15.474Z","updated_at":"2025-04-09T02:32:01.359Z","avatar_url":"https://github.com/WorksApplications.png","language":"Python","funding_links":["https://github.com/sponsors/WorksApplications"],"categories":["Python"],"sub_categories":[],"readme":"# SudachiPy\n[![PyPi version](https://img.shields.io/pypi/v/sudachipy.svg)](https://pypi.python.org/pypi/sudachipy/)\n[![](https://img.shields.io/badge/python-3.5+-blue.svg)](https://www.python.org/downloads/release/python-350/)\n[![Build Status](https://github.com/WorksApplications/SudachiPy/actions/workflows/build.yml/badge.svg)](https://github.com/WorksApplications/SudachiPy/actions/workflows/build.yml)\n[![](https://img.shields.io/github/license/WorksApplications/SudachiPy.svg)](https://github.com/WorksApplications/SudachiPy/blob/develop/LICENSE)\n\n[日本語](/docs/tutorial.md)\n\nSudachiPy is a Python version of [Sudachi](https://github.com/WorksApplications/Sudachi), a Japanese morphological analyzer.\n\n## Warning\n\nThis repository is for the 0.5.* versions of SudachiPy; versions 0.6.* and above are developed as [Sudachi.rs](https://github.com/WorksApplications/sudachi.rs).\n\n\n## TL;DR\n\n```bash\n$ pip install sudachipy 
sudachidict_core\n\n$ echo \"高輪ゲートウェイ駅\" | sudachipy\n高輪ゲートウェイ駅\t名詞,固有名詞,一般,*,*,*\t高輪ゲートウェイ駅\nEOS\n\n$ echo \"高輪ゲートウェイ駅\" | sudachipy -m A\n高輪\t名詞,固有名詞,地名,一般,*,*\t高輪\nゲートウェイ\t名詞,普通名詞,一般,*,*,*\tゲートウェー\n駅\t名詞,普通名詞,一般,*,*,*\t駅\nEOS\n\n$ echo \"空缶空罐空きカン\" | sudachipy -a\n空缶\t名詞,普通名詞,一般,*,*,*\t空き缶\t空缶\tアキカン\t0\n空罐\t名詞,普通名詞,一般,*,*,*\t空き缶\t空罐\tアキカン\t0\n空きカン\t名詞,普通名詞,一般,*,*,*\t空き缶\t空きカン\tアキカン\t0\nEOS\n```\n\n## Setup\n\nYou need SudachiPy and a dictionary.\n\n### Step 1. Install SudachiPy\n\n```bash\n$ pip install sudachipy\n```\n\n### Step 2. Get a Dictionary\n\nYou can get a dictionary as a Python package. It may take a while to download the dictionary file (around 70MB for the `core` edition).\n\n```bash\n$ pip install sudachidict_core\n```\n\nAlternatively, you can choose other dictionary editions. See [this section](#dictionary-edition) for details.\n\n\n## Usage: As a command\n\nThere is a CLI command `sudachipy`.\n\n```bash\n$ echo \"外国人参政権\" | sudachipy\n外国人参政権\t名詞,普通名詞,一般,*,*,*\t外国人参政権\nEOS\n$ echo \"外国人参政権\" | sudachipy -m A\n外国\t名詞,普通名詞,一般,*,*,*\t外国\n人\t接尾辞,名詞的,一般,*,*,*\t人\n参政\t名詞,普通名詞,一般,*,*,*\t参政\n権\t接尾辞,名詞的,一般,*,*,*\t権\nEOS\n```\n\n```bash\n$ sudachipy tokenize -h\nusage: sudachipy tokenize [-h] [-r file] [-m {A,B,C}] [-o file] [-s string]\n                          [-a] [-d] [-v]\n                          [file [file ...]]\n\nTokenize Text\n\npositional arguments:\n  file           text written in utf-8\n\noptional arguments:\n  -h, --help     show this help message and exit\n  -r file        the setting file in JSON format\n  -m {A,B,C}     the mode of splitting\n  -o file        the output file\n  -s string      sudachidict type\n  -a             print all of the fields\n  -d             print the debug information\n  -v, --version  print sudachipy version\n```\n\n### Output\n\nColumns are tab separated.\n\n- Surface\n- Part-of-Speech Tags (comma separated)\n- Normalized Form\n\nWhen you add the `-a` option, it additionally outputs\n\n- Dictionary 
Form\n- Reading Form\n- Dictionary ID\n  - `0` for the system dictionary\n  - `1` and above for the [user dictionaries](#user-dictionary)\n  - `-1\\t(OOV)` if a word is Out-of-Vocabulary (not in the dictionary)\n\n```bash\n$ echo \"外国人参政権\" | sudachipy -a\n外国人参政権\t名詞,普通名詞,一般,*,*,*\t外国人参政権\t外国人参政権\tガイコクジンサンセイケン\t0\nEOS\n```\n\n```bash\n$ echo \"阿quei\" | sudachipy -a\n阿\t名詞,普通名詞,一般,*,*,*\t阿\t阿\t\t-1\t(OOV)\nquei\t名詞,普通名詞,一般,*,*,*\tquei\tquei\t\t-1\t(OOV)\nEOS\n```\n\n\n## Usage: As a Python package\n\nHere is an example:\n\n```python\nfrom sudachipy import tokenizer\nfrom sudachipy import dictionary\n\ntokenizer_obj = dictionary.Dictionary().create()\n```\n\n```python\n# Multi-granular Tokenization\n\nmode = tokenizer.Tokenizer.SplitMode.C\n[m.surface() for m in tokenizer_obj.tokenize(\"国家公務員\", mode)]\n# =\u003e ['国家公務員']\n\nmode = tokenizer.Tokenizer.SplitMode.B\n[m.surface() for m in tokenizer_obj.tokenize(\"国家公務員\", mode)]\n# =\u003e ['国家', '公務員']\n\nmode = tokenizer.Tokenizer.SplitMode.A\n[m.surface() for m in tokenizer_obj.tokenize(\"国家公務員\", mode)]\n# =\u003e ['国家', '公務', '員']\n```\n\n\n```python\n# Morpheme information\n\nm = tokenizer_obj.tokenize(\"食べ\", mode)[0]\n\nm.surface() # =\u003e '食べ'\nm.dictionary_form() # =\u003e '食べる'\nm.reading_form() # =\u003e 'タベ'\nm.part_of_speech() # =\u003e ['動詞', '一般', '*', '*', '下一段-バ行', '連用形-一般']\n```\n\n\n```python\n# Normalization\n\ntokenizer_obj.tokenize(\"附属\", mode)[0].normalized_form()\n# =\u003e '付属'\ntokenizer_obj.tokenize(\"SUMMER\", mode)[0].normalized_form()\n# =\u003e 'サマー'\ntokenizer_obj.tokenize(\"シュミレーション\", mode)[0].normalized_form()\n# =\u003e 'シミュレーション'\n```\n\n(With the `20200330` `core` dictionary. The results may change when you use other versions.)\n\n\n## Dictionary Edition\n\n**WARNING: `sudachipy link` is no longer available in SudachiPy v0.5.2 and later.**\n\n\nThere are three editions of Sudachi Dictionary, namely, `small`, `core`, and `full`. 
See [WorksApplications/SudachiDict](https://github.com/WorksApplications/SudachiDict) for details.\n\nSudachiPy uses `sudachidict_core` by default. \n\nDictionaries are installed as Python packages `sudachidict_small`, `sudachidict_core`, and `sudachidict_full`.\n\n* [SudachiDict-small · PyPI](https://pypi.org/project/SudachiDict-small/)\n* [SudachiDict-core · PyPI](https://pypi.org/project/SudachiDict-core/)\n* [SudachiDict-full · PyPI](https://pypi.org/project/SudachiDict-full/)\n\nThe dictionary files are not in the packages themselves; they are downloaded upon installation.\n\n### Dictionary option: command line\n\nYou can specify the dictionary with the tokenize option `-s`.\n\n```bash\n$ pip install sudachidict_small\n$ echo \"外国人参政権\" | sudachipy -s small\n```\n\n```bash\n$ pip install sudachidict_full\n$ echo \"外国人参政権\" | sudachipy -s full\n```\n\n### Dictionary option: Python package\n\nYou can specify the dictionary with the `Dictionary()` arguments `config_path` or `dict_type`.\n\n```python\nclass Dictionary(config_path=None, resource_dir=None, dict_type=None)\n```\n\n1. `config_path`\n    * You can specify the file path to the setting file with `config_path` (See [Dictionary in The Setting File](#dictionary-in-the-setting-file) for details).\n    * If the dictionary file is specified in the setting file as `systemDict`, SudachiPy will use the dictionary.\n2. 
`dict_type`\n    * You can also specify the dictionary type with `dict_type`.\n    * The available arguments are `small`, `core`, or `full`.\n    * If different dictionaries are specified with `config_path` and `dict_type`, **the dictionary given by `dict_type` overrides** the one defined in the config file.\n\n```python\nfrom sudachipy import tokenizer\nfrom sudachipy import dictionary\n\n# default: sudachidict_core\ntokenizer_obj = dictionary.Dictionary().create()\n\n# The dictionary given by the `systemDict` key in the config file (/path/to/sudachi.json) will be used\ntokenizer_obj = dictionary.Dictionary(config_path=\"/path/to/sudachi.json\").create()\n\n# The dictionary specified by `dict_type` will be set.\ntokenizer_obj = dictionary.Dictionary(dict_type=\"core\").create()  # sudachidict_core (same as default)\ntokenizer_obj = dictionary.Dictionary(dict_type=\"small\").create()  # sudachidict_small\ntokenizer_obj = dictionary.Dictionary(dict_type=\"full\").create()  # sudachidict_full\n\n# The dictionary specified by `dict_type` overrides the one defined in the config file.\n# In the following code, `sudachidict_full` will be used regardless of the dictionary defined in the config file.\ntokenizer_obj = dictionary.Dictionary(config_path=\"/path/to/sudachi.json\", dict_type=\"full\").create()\n```\n\n\n### Dictionary in The Setting File\n\nAlternatively, if the dictionary file is specified in the setting file, `sudachi.json`, SudachiPy will use that file.\n\n```\n{\n    \"systemDict\" : \"relative/path/to/system.dic\",\n    ...\n}\n```\n\nThe default setting file is [sudachipy/resources/sudachi.json](https://github.com/WorksApplications/SudachiPy/blob/develop/sudachipy/resources/sudachi.json). 
You can specify your `sudachi.json` with the `-r` option.\n\n```bash\n$ sudachipy -r path/to/sudachi.json\n``` \n\n\n## User Dictionary\n\nTo use a user dictionary, `user.dic`, place [sudachi.json](https://github.com/WorksApplications/SudachiPy/blob/develop/sudachipy/resources/sudachi.json) anywhere you like, and add a `userDict` value with the relative path from `sudachi.json` to your `user.dic`.\n\n```js\n{\n    \"userDict\" : [\"relative/path/to/user.dic\"],\n    ...\n}\n```\n\nThen specify your `sudachi.json` with the `-r` option.\n\n```bash\n$ sudachipy -r path/to/sudachi.json\n``` \n\n\nYou can build a user dictionary with the subcommand `ubuild`.  \n\n**WARNING: `ubuild` in v0.3.\\* contains a bug.**\n\n```bash\n$ sudachipy ubuild -h\nusage: sudachipy ubuild [-h] [-d string] [-o file] [-s file] file [file ...]\n\nBuild User Dictionary\n\npositional arguments:\n  file        source files with CSV format (one or more)\n\noptional arguments:\n  -h, --help  show this help message and exit\n  -d string   description comment to be embedded on dictionary\n  -o file     output file (default: user.dic)\n  -s file     system dictionary path (default: system core dictionary path)\n```\n\nFor the dictionary file format, please refer to [this document](https://github.com/WorksApplications/Sudachi/blob/develop/docs/user_dict.md) (written in Japanese; an English version is not available yet).\n\n\n## Customized System Dictionary\n\n```bash\n$ sudachipy build -h\nusage: sudachipy build [-h] [-o file] [-d string] -m file file [file ...]\n\nBuild Sudachi Dictionary\n\npositional arguments:\n  file        source files with CSV format (one of more)\n\noptional arguments:\n  -h, --help  show this help message and exit\n  -o file     output file (default: system.dic)\n  -d string   description comment to be embedded on dictionary\n\nrequired named arguments:\n  -m file     connection matrix file with MeCab's matrix.def format\n```\n\nTo use your customized `system.dic`, place 
[sudachi.json](https://github.com/WorksApplications/SudachiPy/blob/develop/sudachipy/resources/sudachi.json) anywhere you like, and overwrite the `systemDict` value with the relative path from `sudachi.json` to your `system.dic`.\n\n```\n{\n    \"systemDict\" : \"relative/path/to/system.dic\",\n    ...\n}\n```\n\nThen specify your `sudachi.json` with the `-r` option.\n\n```bash\n$ sudachipy -r path/to/sudachi.json\n``` \n\n\n## For Developers\n\n### Cython Build\n\n```sh\n$ python setup.py build_ext --inplace\n```\n\n### Code Format\n\nRun `scripts/format.sh` to check if your code is formatted correctly.\n\nYou need the packages `flake8`, `flake8-import-order`, and `flake8-builtins` (See `requirements.txt`).\n\n### Test\n\nRun `scripts/test.sh` to run the tests.\n\n\n## Contact\n\nSudachi and SudachiPy are developed by [WAP Tokushima Laboratory of AI and NLP](http://nlp.worksap.co.jp/).\n\nOpen an issue, or come to our Slack workspace for questions and discussion.\n\nhttps://sudachi-dev.slack.com/ (Get invitation [here](https://join.slack.com/t/sudachi-dev/shared_invite/enQtMzg2NTI2NjYxNTUyLTMyYmNkZWQ0Y2E5NmQxMTI3ZGM3NDU0NzU4NGE1Y2UwYTVmNTViYjJmNDI0MWZiYTg4ODNmMzgxYTQ3ZmI2OWU))\n\nEnjoy tokenization!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FWorksApplications%2FSudachiPy","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FWorksApplications%2FSudachiPy","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FWorksApplications%2FSudachiPy/lists"}