{"id":39081701,"url":"https://github.com/lovit/soykeyword","last_synced_at":"2026-01-17T18:30:49.672Z","repository":{"id":48864704,"uuid":"91730586","full_name":"lovit/soykeyword","owner":"lovit","description":"Python library for keyword extraction","archived":false,"fork":false,"pushed_at":"2021-07-08T03:45:19.000Z","size":60,"stargazers_count":39,"open_issues_count":3,"forks_count":12,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-09-29T16:04:23.767Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lovit.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-05-18T19:32:09.000Z","updated_at":"2024-04-30T05:51:07.000Z","dependencies_parsed_at":"2022-09-26T20:20:27.740Z","dependency_job_id":null,"html_url":"https://github.com/lovit/soykeyword","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/lovit/soykeyword","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lovit%2Fsoykeyword","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lovit%2Fsoykeyword/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lovit%2Fsoykeyword/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lovit%2Fsoykeyword/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lovit","download_url":"https://codeload.github.com/lovit/soykeyword/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lovit%2Fsoykeyword/sbom","sc
orecard":{"id":600231,"data":{"date":"2025-08-11","repo":{"name":"github.com/lovit/soykeyword","commit":"93597b502467d3e6161075d9e54e88be15813f1b"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":2.6,"checks":[{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Code-Review","score":0,"reason":"Found 2/28 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Dangerous-Workflow","score":-1,"reason":"no workflows found","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source 
repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"Token-Permissions","score":-1,"reason":"No tokens found","details":null,"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Pinned-Dependencies","score":-1,"reason":"no dependencies found","details":null,"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Vulnerabilities","score":10,"reason":"0 existing vulnerabilities detected","details":null,"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations 
found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"License","score":0,"reason":"license file not detected","details":["Warn: project does not have a license file"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'master'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"SAST","score":0,"reason":"SAST tool is not run on all commits -- score normalized to 0","details":["Warn: 0 commits out of 4 are checked with a SAST tool"],"documentation":{"short":"Determines if the project uses static code 
analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}}]},"last_synced_at":"2025-08-21T00:13:28.368Z","repository_id":48864704,"created_at":"2025-08-21T00:13:28.368Z","updated_at":"2025-08-21T00:13:28.368Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28515730,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-17T18:28:00.501Z","status":"ssl_error","status_checked_at":"2026-01-17T18:28:00.150Z","response_time":85,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-01-17T18:30:49.162Z","updated_at":"2026-01-17T18:30:49.663Z","avatar_url":"https://github.com/lovit.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Python library for Keyword Extraction\n\nA Python library for extracting keywords and related words, by [Lovit (Hyunjoong)][lovit] and [Hunsik Shin][hunsik].\n\nsoykeyword defines keywords and related words as follows. A **keyword** of a document set is a high-quality word that distinguishes that set from other document sets (discriminative power) and also describes the set well (high coverage). Low-frequency words are likely to appear in only one set, so they have strong discriminative power but weak coverage. The two proposed algorithms select words that have both high coverage and high discriminative power as keywords.\n\nA **related word** is a keyword that distinguishes the set of documents containing a reference word from the set of documents that do not contain it. This also means that it is a word with high co-occurrence with the reference word. The extractors select words that have both high co-occurrence and good coverage. 
\n\n\n\n## Setup\n\n- pip install soykeyword\n\n## Requires\n\n- Python \u003e= 3.4 (not tested in Python 2)\n- numpy \u003e= 1.12.1\n- scikit-learn \u003e= 0.18\n- psutil \u003e= 5.0.1\n\n## Usage\n\n### Lasso Regression Keyword Extractor\n\nTo train, pass a sparse matrix x to the extractor. index2word is a list of words indexed by word idx. If it is not passed to train(), keywords and related words are returned as word idx values instead of words.\n\n    from soykeyword.lasso import LassoKeywordExtractor\n\n    lassobased_extractor = LassoKeywordExtractor(min_tf=20, min_df=10)\n    lassobased_extractor.train(x, index2word) # x: sparse matrix\n\nWhen the document set documents is passed to extract_from_docs(), the extractor returns the keywords that distinguish that document set from the remaining documents.\n\n    keywords = lassobased_extractor.extract_from_docs(\n        documents,\n        min_num_of_keywords=30\n    )\n\nTo extract related words, pass a word to extract_from_word().\n\n    lassobased_extractor.extract_from_word(\n        '아이오아이',\n        min_num_of_keywords=30\n    )\n\nThe following is an example of the related words of '아이오아이' (I.O.I) extracted from one day of news.\n\n    [KeywordScore(word='아이오아이', frequency=270, coefficient=17.850189941320671),\n     KeywordScore(word='엠카운트다운', frequency=221, coefficient=1.200759338786378),\n     KeywordScore(word='뮤직', frequency=195, coefficient=1.081777863860977),\n     KeywordScore(word='일산동구', frequency=36, coefficient=0.98636875892070186),\n     KeywordScore(word='키미', frequency=297, coefficient=0.70877507721215616),\n     KeywordScore(word='챔피언', frequency=105, coefficient=0.51940928356916138),\n     KeywordScore(word='강렬', frequency=352, coefficient=0.36972563098092176),\n     KeywordScore(word='컴백', frequency=536, coefficient=0.30677481146665397),\n     KeywordScore(word='화려', frequency=518, coefficient=0.26764304959838653),\n     KeywordScore(word='수출', frequency=735, coefficient=0.23882691530127598),\n     KeywordScore(word='걸그룹', frequency=1060, coefficient=0.20972098801573957),\n     KeywordScore(word='방영', frequency=208, coefficient=0.19694219657704334),\n     KeywordScore(word='프로듀스101', frequency=96, 
coefficient=0.17074232136595247),\n     ...\n\nA detailed tutorial is available at [this link][lasso_tutorial].\n\n### Proportion based Keyword Extractor\n\nProportion based keyword / related word extraction selects keywords based on the ratio of word occurrence probabilities in two document sets. P(w|pos) is the occurrence proportion of word w in the document set from which keywords are extracted, and P(w|neg) is the occurrence proportion of word w in the remaining documents.\n\nscore(w) = P(w|pos) / { P(w|pos) + P(w|neg) }\n\nTwo forms of training data are supported: (sparse matrix, index2word) or text data.\n\nWhen training on text data, set min_tf, min_df, and tokenize. The following example shows the default values.\n\n    from soykeyword.proportion import CorpusbasedKeywordExtractor\n    corpusbased_extractor = CorpusbasedKeywordExtractor(\n        min_tf=20,\n        min_df=2,\n        tokenize=lambda x:x.strip().split(),\n        verbose=True\n    )\n\n    # docs: list of str like\n    corpusbased_extractor.train(docs)\n\nPass the document set documents from which keywords should be extracted.\n\n    keywords = corpusbased_extractor.extract_from_docs(\n        documents,\n        min_score=0.8,\n        min_frequency=100\n    )\n\nPass the word for which related words should be extracted.\n\n    keywords = corpusbased_extractor.extract_from_word(\n        '아이오아이',\n        min_score=0.8,\n        min_frequency=100\n    )\n\nThese are the related words of '아이오아이' (I.O.I) extracted from one day of news. 
\n\n    keywords[:10]\n\n    [KeywordScore(word='아이오아이', frequency=270, score=1.0),\n     KeywordScore(word='엠카운트다운', frequency=221, score=0.997897148491129),\n     KeywordScore(word='펜타곤', frequency=104, score=0.9936420169665052),\n     KeywordScore(word='잠깐', frequency=162, score=0.9931809154109712),\n     KeywordScore(word='엠넷', frequency=125, score=0.9910325251765126),\n     KeywordScore(word='걸크러쉬', frequency=111, score=0.9904705029926091),\n     KeywordScore(word='타이틀곡', frequency=311, score=0.987384461584851),\n     KeywordScore(word='코드', frequency=105, score=0.9871835929954923),\n     KeywordScore(word='본명', frequency=105, score=0.9863934667369743),\n     KeywordScore(word='엑스', frequency=101, score=0.9852544036088814)]\n\nIf the training data is in (sparse matrix, index2word) form, use MatrixbasedKeywordExtractor.\n\n    from soykeyword.proportion import MatrixbasedKeywordExtractor\n\n    matrixbased_extractor = MatrixbasedKeywordExtractor(\n        min_tf=20,\n        min_df=2,\n        verbose=True\n    )\n\n    matrixbased_extractor.train(x, index2word)\n\nA detailed tutorial is available at [this link][proportion_tutorial].\n\n## Libraries that work well together\n\n### soynlp\n\nA library for Korean natural language processing. It supports word extraction for handling the out-of-vocabulary problem, a tokenizer that uses the word extractor's training results, part-of-speech classification, and normalization.\n\n- https://github.com/lovit/soynlp\n- pip install soynlp\n\n### KoNLPy\n\nKoNLPy is a Python package for Korean information processing. It provides the Hannanum, Kkma, Komoran, MeCab-ko, and Twitter Korean analyzers in a Python environment.\n\n- http://konlpy.org\n- KoNLPy uses Java, so Java and JPype are required. Be sure to read the installation instructions on its homepage.\n\n### customized KoNLPy\n\nA Python wrapper package that uses templates and dictionary-based string matching together with KoNLPy to easily handle words that are not registered in KoNLPy.\n\n- https://github.com/lovit/customized_konlpy\n- pip install customized_konlpy\n\n### soyspacing\n\nRemoving spacing errors, where they exist, can make text analysis easier. soyspacing trains a spacing engine on the data to be analyzed and uses it to correct spacing errors. 
\n\n- https://github.com/lovit/soyspacing\n- pip install soyspacing\n\n[lovit]: https://github.com/lovit\n[hunsik]: https://github.com/hunsik\n[lasso_tutorial]: tutorials/keyword_extraction_using_lasso_regression.ipynb\n[proportion_tutorial]: tutorials/keyword_extraction_using_proportion_ratio.ipynb\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flovit%2Fsoykeyword","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flovit%2Fsoykeyword","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flovit%2Fsoykeyword/lists"}