{"id":33119778,"url":"https://github.com/lovit/soyspacing","last_synced_at":"2026-01-17T18:30:47.826Z","repository":{"id":57469476,"uuid":"91172102","full_name":"lovit/soyspacing","owner":"lovit","description":"띄어쓰기 오류 교정 라이브러리입니다. CRF 와 같은 머신러닝 알고리즘이 아닌, 직관적인 접근법으로 띄어쓰기를 교정합니다.","archived":false,"fork":false,"pushed_at":"2019-09-26T14:36:14.000Z","size":2188,"stargazers_count":150,"open_issues_count":2,"forks_count":34,"subscribers_count":10,"default_branch":"master","last_synced_at":"2025-11-19T22:03:21.265Z","etag":null,"topics":["korean-nlp","nlp","noise-cancellation","spacing","text-processing"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lovit.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-05-13T12:17:08.000Z","updated_at":"2025-05-31T04:49:31.000Z","dependencies_parsed_at":"2022-09-19T09:30:44.575Z","dependency_job_id":null,"html_url":"https://github.com/lovit/soyspacing","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/lovit/soyspacing","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lovit%2Fsoyspacing","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lovit%2Fsoyspacing/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lovit%2Fsoyspacing/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lovit%2Fsoyspacing/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lovit","download_url":"https://codeload.github.com/lovit/soyspacing/tar.gz/refs/heads/maste
r","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lovit%2Fsoyspacing/sbom","scorecard":{"id":600233,"data":{"date":"2025-08-11","repo":{"name":"github.com/lovit/soyspacing","commit":"d345db4a3ce4300793016e5085f012744b78303e"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":2.6,"checks":[{"name":"Dangerous-Workflow","score":-1,"reason":"no workflows found","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"SAST","score":0,"reason":"no SAST tool detected","details":["Warn: no pull requests merged into dev branch"],"documentation":{"short":"Determines if the project uses static code analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}},{"name":"Token-Permissions","score":-1,"reason":"No tokens found","details":null,"documentation":{"short":"Determines if the project's workflows follow the principle of least 
privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Code-Review","score":0,"reason":"Found 0/30 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"Pinned-Dependencies","score":-1,"reason":"no dependencies found","details":null,"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security 
policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"License","score":0,"reason":"license file not detected","details":["Warn: project does not have a license file"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Vulnerabilities","score":10,"reason":"0 existing vulnerabilities detected","details":null,"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'master'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection 
settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}}]},"last_synced_at":"2025-08-21T00:13:29.676Z","repository_id":57469476,"created_at":"2025-08-21T00:13:29.676Z","updated_at":"2025-08-21T00:13:29.676Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28515728,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-17T18:28:00.501Z","status":"ssl_error","status_checked_at":"2026-01-17T18:28:00.150Z","response_time":85,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["korean-nlp","nlp","noise-cancellation","spacing","text-processing"],"created_at":"2025-11-15T04:00:36.383Z","updated_at":"2026-01-17T18:30:47.812Z","avatar_url":"https://github.com/lovit.png","language":"Python","funding_links":[],"categories":["soyspacing"],"sub_categories":[],"readme":"# Korean Space Error Corrector\n\nsoyspacing provides a heuristic algorithm for correcting Korean spacing errors. Compared with a Conditional Random Field, it has a smaller model size and trains faster. \n\nThis algorithm was developed together with [sunggu][sunggu_url] and [Emily Yunha Shin][eyshin_url] of [ScatterLab][scatter_url]. \n\n- version = 0.1.23 included an unfinished CRF-based spacing algorithm. \n- From version = 1.0.0, the unfinished CRF was removed and only the heuristic algorithm is provided. \n\nThe current version (1.0.15) does not ship a trained model. The appropriate spacing-correction model depends on the word distribution of the dataset it is applied to. For this reason, soyspacing provides only a trainable package rather than a trained model. 
See Usage below for how to use it, and the [slides](https://raw.githubusercontent.com/lovit/soyspacing/master/tutorials/presentation.pdf) for a more detailed explanation.\n\n## Setup\n\n```\npip install soyspacing\n```\n\n## Require\n\n- Python \u003e= 3.4 (not tested in Python 2)\n- numpy \u003e= 1.12.1\n\n## Usage \n\nTo train, pass the path of a text file. \n\n```python\nfrom soyspacing.countbase import CountSpace\n\ncorpus_fname = '../demo_model/134963_norm.txt'\nmodel = CountSpace()\nmodel.train(corpus_fname)\n```\n\nTo save a trained model, pass the model file path. The model can be saved in JSON format. Given the size of the saved file, saving and loading are somewhat easier with the non-JSON format.\n\n```python\nmodel.save_model(model_fname, json_format=False)\n```\n\nA trained model can be loaded. \n\n```python\nmodel = CountSpace()\nmodel.load_model(another_model_fname, json_format=False)\n```\n\nSpacing correction takes four parameters. If they are not given, the default values are used. \n\n```python\nverbose=False\nmc = 10  # min_count\nft = 0.3 # force_abs_threshold\nnt =-0.3 # nonspace_threshold\nst = 0.3 # space_threshold\n\nsent = '이건진짜좋은영화 라라랜드진짜좋은영화'\n\n# with parameters\nsent_corrected, tags = model.correct(\n    doc=sent,\n    verbose=verbose,\n    force_abs_threshold=ft,\n    nonspace_threshold=nt,\n    space_threshold=st,\n    min_count=mc)\n\n# without parameters\nsent_corrected, tags = model.correct(sent)\n\nprint(sent_corrected)\n# 이건 진짜 좋은 영화 라라랜드진짜 좋은 영화\n```\n\nIf there are rules that a space must always, or must never, appear before or after a particular word or eojeol (space-delimited unit), they can be applied. Prepare a text file, as below, containing each eojeol together with spacing tags for the positions before, inside, and after it. The entry for `진짜` means that a space is required before and after the word, and that `진` and `짜` must be written together with no space between them. 
Save this file as `rules.txt`.\n\n```\n가령\t101\n진짜\t101\n가게는\t1001\n가게로\t1001\n가게야\t1001\n```\n\nAfter loading the file above with `RuleDict` and applying it to the earlier example, you can see that a space is now inserted between 라라랜드 and 진짜.\n\n```python\nfrom soyspacing.countbase import RuleDict\n\nrule_dict = RuleDict('filepath')  # path to the rules file saved above\nsent_corrected, tags = model.correct(sent, rules=rule_dict)\nprint(sent_corrected)\n# 이건 진짜 좋은 영화 라라랜드 진짜 좋은 영화\n```\n\nMore detailed tutorials in Jupyter notebook format are in ./tutorials/.\n\nA [presentation file][presentation] covering related work, the principle of the proposed model, a performance comparison with CRF, and other usage tips is also provided.  \n\n## CRF based space error correction\n\nA package that corrects spacing using pycrfsuite. It also provides Template and Transformer utilities that make it easier to feed data to pycrfsuite. \n\n[Link][pycrfsuite_space]\n\n\n[scatter_url]: http://www.scatterlab.co.kr/\n[sunggu_url]: https://github.com/new21cccc\n[eyshin_url]: https://github.com/eyshin05\n[presentation]: /tutorials/presentation.pdf\n[pycrfsuite_space]: https://github.com/lovit/pycrfsuite_spacing\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flovit%2Fsoyspacing","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flovit%2Fsoyspacing","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flovit%2Fsoyspacing/lists"}