{"id":28207990,"url":"https://github.com/haven-jeon/pykospacing","last_synced_at":"2025-06-12T05:30:46.901Z","repository":{"id":37549761,"uuid":"130193484","full_name":"haven-jeon/PyKoSpacing","owner":"haven-jeon","description":"Automatic Korean word spacing with Python ","archived":false,"fork":false,"pushed_at":"2024-07-04T00:06:18.000Z","size":4750,"stargazers_count":414,"open_issues_count":1,"forks_count":115,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-05-25T07:07:18.739Z","etag":null,"topics":["korean-nlp","nlp","spacing","text-processing"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/haven-jeon.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-04-19T09:41:22.000Z","updated_at":"2025-05-21T08:38:02.000Z","dependencies_parsed_at":"2023-11-21T04:28:34.271Z","dependency_job_id":"9f9e6628-ba48-4d64-b922-ad67d8f38f91","html_url":"https://github.com/haven-jeon/PyKoSpacing","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/haven-jeon/PyKoSpacing","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/haven-jeon%2FPyKoSpacing","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/haven-jeon%2FPyKoSpacing/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/haven-jeon%2FPyKoSpacing/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/haven-jeon%2FPyKoSpacing/manifests","owner_u
rl":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/haven-jeon","download_url":"https://codeload.github.com/haven-jeon/PyKoSpacing/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/haven-jeon%2FPyKoSpacing/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259404105,"owners_count":22852119,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["korean-nlp","nlp","spacing","text-processing"],"created_at":"2025-05-17T14:11:27.755Z","updated_at":"2025-06-12T05:30:46.895Z","avatar_url":"https://github.com/haven-jeon.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"PyKoSpacing \n---------------\n\nPython package for automatic Korean word spacing.\n\nR verson can be found [here](https://github.com/haven-jeon/KoSpacing).\n\n[![License: GPL v3](https://img.shields.io/badge/License-GPL%20v3-blue.svg)](http://www.gnu.org/licenses/gpl-3.0)\n\n\n#### Introduction\n\nWord spacing is one of the important parts of the preprocessing of Korean text analysis. Accurate spacing greatly affects the accuracy of subsequent text analysis. `PyKoSpacing` has fairly accurate automatic word spacing performance,especially good for online text originated from SNS or SMS.\n\nFor example.\n\n\"아버지가방에들어가신다.\" can be spaced both of below.\n\n\n1. \"아버지가 방에 들어가신다.\" means  \"My father enters the room.\"\n1. 
\"아버지 가방에 들어가신다.\" means  \"My father goes into the bag.\"\n\nCommon sense, the first is the right answer.\n\n`PyKoSpacing` is based on Deep Learning model trained from large corpus(more than 100 million NEWS articles from [Chan-Yub Park](https://github.com/mrchypark)). \n\n\n#### Performance\n\n| Test Set  | Accuracy | \n|---|---|\n| Sejong(colloquial style) Corpus(1M) | 97.1% |\n| OOOO(literary style)  Corpus(3M)   | 94.3% |\n\n- Accuracy = # correctly spaced characters/# characters in the test data.\n  - Might be increased performance if normalize compound words. \n\n\n#### Install\n\n##### PyPI Install\nPre-requisite:\n```bash\nproper installation of python3\nproper installation of pip\n\npip install tensorflow\npip install keras\n\n\nWindows-Ubuntu case: On following error.\nOn error: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.22' not found\n   sudo apt-get install libstdc++6\n   sudo add-apt-repository ppa:ubuntu-toolchain-r/test\n   sudo apt-get update\n   sudo apt-get upgrade\n   sudo apt-get dist-upgrade (This takes long time.)\n```     \nDarwin(m1) case: You should install tensorflow in a different way.(Use [Miniforge3](https://github.com/conda-forge/miniforge))\n```bash\n# Install Miniforge3 for mac\ncurl -O https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh\nchmod +x Miniforge3-MacOSX-arm64.sh\nsh Miniforge3-MacOSX-arm64.sh\n# Activate Miniforge3 virtualenv\n# You should use Python version 3.10 or less.\nsource ~/miniforge3/bin/activate\n# Install the Tensorflow dependencies \nconda install -c apple tensorflow-deps \n# Install base tensorflow \npython -m pip install tensorflow-macos \n# Install metal plugin \npython -m pip install tensorflow-metal\n```\n\nTo install from GitHub, use\n\n    pip install git+https://github.com/haven-jeon/PyKoSpacing.git\n\n\n#### Example \n\n```python\n\u003e\u003e\u003e from pykospacing import Spacing\n\u003e\u003e\u003e spacing = Spacing()\n\u003e\u003e\u003e 
spacing(\"김형호영화시장분석가는'1987'의네이버영화정보네티즌10점평에서언급된단어들을지난해12월27일부터올해1월10일까지통계프로그램R과KoNLP패키지로텍스트마이닝하여분석했다.\")\n\"김형호 영화시장 분석가는 '1987'의 네이버 영화 정보 네티즌 10점 평에서 언급된 단어들을 지난해 12월 27일부터 올해 1월 10일까지 통계 프로그램 R과 KoNLP 패키지로 텍스트마이닝하여 분석했다.\"\n\u003e\u003e\u003e # Apply a list of words that must be non-spacing\n\u003e\u003e\u003e spacing('귀밑에서턱까지잇따라난수염을구레나룻이라고한다.')\n'귀 밑에서 턱까지 잇따라 난 수염을 구레나 룻이라고 한다.'\n\u003e\u003e\u003e spacing = Spacing(rules=['구레나룻'])\n\u003e\u003e\u003e spacing('귀밑에서턱까지잇따라난수염을구레나룻이라고한다.')\n'귀 밑에서 턱까지 잇따라 난 수염을 구레나룻이라고 한다.'\n```\n\nSetting rules with csv file. (you only need to use `set_rules_by_csv()` method.)\n\n```bash\n$ cat test.csv\n인덱스,단어\n1,네이버영화\n2,언급된단어\n```\n\n```python\n\u003e\u003e\u003e from pykospacing import Spacing\n\u003e\u003e\u003e spacing = Spacing(rules=[''])\n\u003e\u003e\u003e spacing.set_rules_by_csv('./test.csv', '단어')\n\u003e\u003e\u003e spacing(\"김형호영화시장분석가는'1987'의네이버영화정보네티즌10점평에서언급된단어들을지난해12월27일부터올해1월10일까지통계프로그램R과KoNLP패키지로텍스트마이닝하여분석했다.\")\n\"김형호 영화시장 분석가는 '1987'의 네이버영화 정보 네티즌 10점 평에서 언급된단어들을 지난해 12월 27일부터 올해 1월 10일까지 통계 프로그램 R과 KoNLP 패키지로 텍스트마이닝하여 분석했다.\"\n```\n\nRun on command line(thanks [lqez](https://github.com/lqez)). \n\n```bash\n$ cat test_in.txt\n김형호영화시장분석가는'1987'의네이버영화정보네티즌10점평에서언급된단어들을지난해12월27일부터올해1월10일까지통계프로그램R과KoNLP패키지로텍스트마이닝하여분석했다.\n아버지가방에들어가신다.\n$ python -m pykospacing.pykos test_in.txt\n김형호 영화시장 분석가는 '1987'의 네이버 영화 정보 네티즌 10점 평에서 언급된 단어들을 지난해 12월 27일부터 올해 1월 10일까지 통계 프로그램 R과 KoNLP 패키지로 텍스트마이닝하여 분석했다.\n아버지가 방에 들어가신다.\n```\n\nCurrent model [have problems](https://github.com/haven-jeon/PyKoSpacing/issues/52) in some cases when the input includes English characters.\u003cbr\u003e\nPyKoSpacing provides the parameter `ignore` and `ignore_pattern` to deal with that problem.\n\n- **About `ignore` parameter** (str, optional) \u003cbr\u003e\n  - `ignore='none'`: No pre/post-processing will be applied. The output will be the same as the model output. 
\u003cbr\u003e\n  - `ignore='pre'`: Apply pre-processing which deletes characters that match `ignore_pattern`. The deleted characters are merged back in after model prediction. This option has the problem that it always puts a space *after* the deleted characters, since it doesn't know whether the deleted characters should have a space to the left, right, or both sides. \u003cbr\u003e\n  - `ignore='post'`: Apply post-processing which ignores model outputs on characters that match `ignore_pattern`. This option has the problem that English characters in the model input can still affect nearby non-English characters. \u003cbr\u003e\n  - `ignore='pre2'`: Apply pre-processing which deletes characters that match `ignore_pattern`, and predict on **both preprocessed text and original text**. This allows it to know whether to put a space to the left, right, or both sides of the deleted characters. However, this option requires predicting **twice**, which doubles the computation time. \u003cbr\u003e\n  - Default: `ignore='none'`\n\n- **About `ignore_pattern` parameter** (str, optional) \u003cbr\u003e\nYou can pass your own regex pattern to `ignore_pattern`. 
The regex pattern should be the pattern of characters you want to ignore.\u003cbr\u003e\n  - Default: ``ignore_pattern=r'[^가-힣ㄱ-ㅣ!-@[-`{-~\\s]+,*( [^가-힣ㄱ-ㅣ!-@[-`{-~\\s]+,*)*[.,!?]* *'``, which matches characters, words, or a sentence of non-Korean and non-ascii symbols.\n\nExamples of `ignore` parameter\n\n```python\n\u003e\u003e\u003e from pykospacing import Spacing\n\u003e\u003e\u003e spacing = Spacing()\n\u003e\u003e\u003e spacing(\"친구와함께bmw썬바이저를썼다.\", ignore='none')\n\"친구와 함께 bm w 썬바이저를 썼다.\"\n\u003e\u003e\u003e spacing(\"친구와함께bmw썬바이저를썼다.\", ignore='pre')\n\"친구와 함께bmw 썬바이저를 썼다.\"\n\u003e\u003e\u003e spacing(\"친구와함께bmw썬바이저를썼다.\", ignore='post')\n\"친구와 함께 bm w 썬바이저를 썼다.\"\n\u003e\u003e\u003e spacing(\"친구와함께bmw썬바이저를썼다.\", ignore='pre2')\n\"친구와 함께 bmw 썬바이저를 썼다.\"\n\n\u003e\u003e\u003e spacing(\"chicken박스를열고닭다리를꺼내입에문다.crispy한튀김옷덕에내입주변은glossy해진다.\", ignore='none')\n\"chicken박스를 열고 닭다리를 꺼내 입에 문다. crispy 한튀김 옷 덕에 내 입 주변은 glossy해진다.\"\n\u003e\u003e\u003e spacing(\"chicken박스를열고닭다리를꺼내입에문다.crispy한튀김옷덕에내입주변은glossy해진다.\", ignore='pre')\n\"chicken박스를 열고 닭다리를 꺼내 입에 문다.crispy 한 튀김옷 덕에 내 입 주변은glossy 해진다.\"\n\u003e\u003e\u003e spacing(\"chicken박스를열고닭다리를꺼내입에문다.crispy한튀김옷덕에내입주변은glossy해진다.\", ignore='post')\n\"chicken박스를 열고 닭다리를 꺼내 입에 문다. crispy 한튀김 옷 덕에 내 입 주변은 glossy해진다.\"\n\u003e\u003e\u003e spacing(\"chicken박스를열고닭다리를꺼내입에문다.crispy한튀김옷덕에내입주변은glossy해진다.\", ignore='pre2')\n\"chicken박스를 열고 닭다리를 꺼내 입에 문다. 
crispy 한 튀김옷 덕에 내 입 주변은 glossy해진다.\"\n\n\u003e\u003e\u003e spacing(\"김형호영화시장분석가는'1987'의네이버영화정보네티즌10점평에서언급된단어들을지난해12월27일부터올해1월10일까지통계프로그램R과KoNLP패키지로텍스트마이닝하여분석했다.\", ignore='none')\n\"김형호 영화시장 분석가는 '1987'의 네이버 영화 정보 네티즌 10점 평에서 언급된 단어들을 지난해 12월 27일부터 올해 1월 10일까지 통계 프로그램 R과 KoNLP 패키지로 텍스트마이닝하여 분석했다.\"\n\u003e\u003e\u003e spacing(\"김형호영화시장분석가는'1987'의네이버영화정보네티즌10점평에서언급된단어들을지난해12월27일부터올해1월10일까지통계프로그램R과KoNLP패키지로텍스트마이닝하여분석했다.\", ignore='pre')\n\"김형호 영화시장 분석가는 '1987'의 네이버 영화 정보 네티즌 10점 평에서 언급된 단어들을 지난해 12월 27일부터 올해 1월 10일까지 통계 프로그램R과KoNLP 패키지로 텍스트마이닝하여 분석했다.\"\n\u003e\u003e\u003e spacing(\"김형호영화시장분석가는'1987'의네이버영화정보네티즌10점평에서언급된단어들을지난해12월27일부터올해1월10일까지통계프로그램R과KoNLP패키지로텍스트마이닝하여분석했다.\", ignore='post')\n\"김형호 영화시장 분석가는 '1987'의 네이버 영화 정보 네티즌 10점 평에서 언급된 단어들을 지난해 12월 27일부터 올해 1월 10일까지 통계 프로그램 R과 KoNLP 패키지로 텍스트마이닝하여 분석했다.\"\n\u003e\u003e\u003e spacing(\"김형호영화시장분석가는'1987'의네이버영화정보네티즌10점평에서언급된단어들을지난해12월27일부터올해1월10일까지통계프로그램R과KoNLP패키지로텍스트마이닝하여분석했다.\", ignore='pre2')\n\"김형호 영화시장 분석가는 '1987'의 네이버 영화 정보 네티즌 10점 평에서 언급된 단어들을 지난해 12월 27일부터 올해 1월 10일까지 통계 프로그램 R과 KoNLP 패키지로 텍스트마이닝하여 분석했다.\"\n```\n\n#### Model Architecture\n\n![](kospacing_arch.png)\n\n\n#### For Training\n\n- Training code uses an architecture that is more advanced than PyKoSpacing, but also contains the learning logic of PyKoSpacing.\n  - https://github.com/haven-jeon/Train_KoSpacing\n\n#### Citation\n\n```markdowns\n@misc{heewon2018,\nauthor = {Heewon Jeon},\ntitle = {KoSpacing: Automatic Korean word spacing},\npublisher = {GitHub},\njournal = {GitHub repository},\nhowpublished = {\\url{https://github.com/haven-jeon/KoSpacing}}\n```\n\n### Star History\n\n[![Star History 
Chart](https://api.star-history.com/svg?repos=haven-jeon/PyKoSpacing\u0026type=Date)](https://star-history.com/#haven-jeon/PyKoSpacing\u0026Date)\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhaven-jeon%2Fpykospacing","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhaven-jeon%2Fpykospacing","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhaven-jeon%2Fpykospacing/lists"}