{"id":45368164,"url":"https://github.com/affjljoo3581/canrevan","last_synced_at":"2026-02-21T15:07:42.593Z","repository":{"id":54772247,"uuid":"268205993","full_name":"affjljoo3581/canrevan","owner":"affjljoo3581","description":"A library for collecting large numbers of Naver News articles.","archived":false,"fork":false,"pushed_at":"2023-02-03T08:04:16.000Z","size":115,"stargazers_count":95,"open_issues_count":5,"forks_count":19,"subscribers_count":2,"default_branch":"master","last_synced_at":"2026-01-04T12:41:04.465Z","etag":null,"topics":["dataset","datasets","natural-language-processing","naver","naver-news","news","news-articles","nlp","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/affjljoo3581.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-05-31T04:03:40.000Z","updated_at":"2025-11-25T08:57:12.000Z","dependencies_parsed_at":"2023-02-18T04:45:58.597Z","dependency_job_id":null,"html_url":"https://github.com/affjljoo3581/canrevan","commit_stats":null,"previous_names":[],"tags_count":23,"template":false,"template_full_name":null,"purl":"pkg:github/affjljoo3581/canrevan","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/affjljoo3581%2Fcanrevan","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/affjljoo3581%2Fcanrevan/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/affjljoo3581%2Fcanrevan/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/affjljoo3581%2Fcanrevan/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/affjljoo3581","downloa
d_url":"https://codeload.github.com/affjljoo3581/canrevan/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/affjljoo3581%2Fcanrevan/sbom","scorecard":{"id":169736,"data":{"date":"2025-08-11","repo":{"name":"github.com/affjljoo3581/canrevan","commit":"9bb83b3abac8e0732a41bdb2b67bcfd1f1546c35"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":3.4,"checks":[{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"Token-Permissions","score":0,"reason":"detected GitHub workflow tokens with excessive permissions","details":["Warn: no topLevel permission defined: .github/workflows/build.yml:1","Warn: no topLevel permission defined: .github/workflows/publish.yml:1","Info: no jobLevel write permissions found"],"documentation":{"short":"Determines if the project's workflows follow the principle of least 
privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Code-Review","score":0,"reason":"Found 2/25 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Dangerous-Workflow","score":10,"reason":"no dangerous workflow patterns detected","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Pinned-Dependencies","score":0,"reason":"dependency not pinned by hash detected -- score normalized to 0","details":["Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/build.yml:21: update your workflow using https://app.stepsecurity.io/secureworkflow/affjljoo3581/canrevan/build.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/build.yml:24: update your workflow using https://app.stepsecurity.io/secureworkflow/affjljoo3581/canrevan/build.yml/master?enable=pin","Warn: third-party GitHubAction not pinned by hash: .github/workflows/build.yml:52: update your workflow using https://app.stepsecurity.io/secureworkflow/affjljoo3581/canrevan/build.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/publish.yml:16: update your workflow using https://app.stepsecurity.io/secureworkflow/affjljoo3581/canrevan/publish.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/publish.yml:18: update your workflow using 
https://app.stepsecurity.io/secureworkflow/affjljoo3581/canrevan/publish.yml/master?enable=pin","Warn: pipCommand not pinned by hash: .github/workflows/build.yml:30","Warn: pipCommand not pinned by hash: .github/workflows/build.yml:31","Warn: pipCommand not pinned by hash: .github/workflows/publish.yml:23","Warn: pipCommand not pinned by hash: .github/workflows/publish.yml:24","Info:   0 out of   4 GitHub-owned GitHubAction dependencies pinned","Info:   0 out of   1 third-party GitHubAction dependencies pinned","Info:   0 out of   4 pipCommand dependencies pinned"],"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Vulnerabilities","score":10,"reason":"0 existing vulnerabilities detected","details":null,"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"License","score":10,"reason":"license file detected","details":["Info: project has a license file: LICENSE:0","Info: FSF or OSI recognized license: Apache License 2.0: LICENSE:0"],"documentation":{"short":"Determines if 
the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"SAST","score":0,"reason":"SAST tool is not run on all commits -- score normalized to 0","details":["Warn: 0 commits out of 7 are checked with a SAST tool"],"documentation":{"short":"Determines if the project uses static code analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'master'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection 
settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}}]},"last_synced_at":"2025-08-16T16:05:28.570Z","repository_id":54772247,"created_at":"2025-08-16T16:05:28.571Z","updated_at":"2025-08-16T16:05:28.571Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29684122,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-21T14:31:22.911Z","status":"ssl_error","status_checked_at":"2026-02-21T14:31:22.570Z","response_time":107,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataset","datasets","natural-language-processing","naver","naver-news","news","news-articles","nlp","python"],"created_at":"2026-02-21T15:07:37.209Z","updated_at":"2026-02-21T15:07:42.578Z","avatar_url":"https://github.com/affjljoo3581.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# canrevan\n\n[![PyPI version](https://badge.fury.io/py/canrevan.svg)](https://badge.fury.io/py/canrevan)\n![build](https://github.com/affjljoo3581/canrevan/workflows/build/badge.svg)\n[![GitHub 
license](https://img.shields.io/github/license/affjljoo3581/canrevan)](https://github.com/affjljoo3581/canrevan/blob/master/LICENSE)\n[![codecov](https://codecov.io/gh/affjljoo3581/canrevan/branch/master/graph/badge.svg)](https://codecov.io/gh/affjljoo3581/canrevan)\n[![CodeFactor](https://www.codefactor.io/repository/github/affjljoo3581/canrevan/badge)](https://www.codefactor.io/repository/github/affjljoo3581/canrevan)\n\n## Introduction\n`canrevan` is a library for collecting large numbers of Naver News articles. It makes it easy\nto build a Korean news dataset.\n\nThe dataset is one of the most important parts of any NLP task. For Korean in particular, far\nless data is available to collect than for English. For example, the English\n[Wikipedia](https://ko.wikipedia.org/wiki/%EC%9C%84%ED%82%A4%EB%B0%B1%EA%B3%BC)\ndump file is 16.1GB, while the Korean one is only 651.3MB.\nA dataset the size of Wikipedia alone is therefore far from sufficient for general NLP\ntraining. So how can the dataset be scaled up? The data has to be collected from other\nsources, and **news articles** are a typical choice.\n\nIn practice, many researchers build their corpora from online news articles in addition to\nwiki data. Online news articles have the following characteristics.\n\n* There is a great deal of data. Many outlets publish a considerable volume of articles\nevery day.\n* The data is of high quality. News articles are generally well formed in both spelling and\ncontent.\n* They are relatively well structured. Online news articles implicitly follow consistent\nrules and structure, which makes them easy to normalize.\n* They span many domains. News articles are not limited to any field or topic; they cover\npolitics, society, the economy, and more.\n\n[Naver News](https://news.naver.com/) aggregates news from many outlets, so a vast number of\narticles from different outlets can be collected from a single platform. Many researchers\ndo collect articles through Naver News.\n\n`canrevan` helps you collect articles from Naver News. A single command in the terminal can\neasily collect several gigabytes of data. See [Example](#Example) for\ndetails.\n\n## Dependencies\n* tqdm\u003e=4.46.0\n* bs4\n* lxml\u003e=4.5.1\n* aiohttp\n* langumo\n\n## Installation\n### With pip\ncanrevan can be installed from PyPI with the following command.\n```console\n$ pip install canrevan\n```\n\n### From source\nAlternatively, clone the remote repository and install directly from source.\n```console\n$ git clone https://github.com/affjljoo3581/canrevan.git\n$ cd canrevan\n$ python setup.py install\n```\n\n## Example\nLook up the ids of the categories you want to collect on [Naver News](https://news.naver.com/). 
This example collects news from the politics (100) and economy (101) categories. The following command collects articles from up to 5 pages between May 1 and May 31, 2020. See ``canrevan --help`` for detailed usage.\n```console\n$ canrevan --category 100 101 --start_date 20200501 --end_date 20200531 --max_page 5\n```\nIf the articles are collected successfully, you will see output like the following.\n```\n[*] navigation pages: 310\n[*] collect article urls: 100%|█████████████████████████████████████████████████████████████| 310/310 [00:05\u003c00:00, 60.43it/s]\n[*] total collected articles: 4998\n[*] crawl news article contents: 100%|███████████████████████████████████████████████████| 4998/4998 [00:24\u003c00:00, 200.41it/s]\n[*] finish crawling 4781 news articles to [articles.txt]\n```\n\n## Format\n`canrevan` encodes each collected article with `json.encoder.encode_basestring`.\n\n    \"국방부는 18일부터 입대하는 모든 장정의 검체를 채취할 예정이며, 8주간 매주 6,300여명이 코로나19 검사를 받는다고 18일 밝혔다.\\n군이 훈련소에서 자체적으로 검체를 채취하고, 질병관리본부와 계약을 맺은 민간 업체 등이 검체 이송과 검사를 담당한다. 대규모 인원의 빠른 검사를 위해 취합검사법(Pooling)이 활용된다.\\n군 관계자는 “이태원 클럽 등으로 인해 코로나19 20대 감염 사례가 늘었다”며 “집단 생활하는 훈련병이 뒤늦게 코로나19 확진을 받으면 집단 감염이 발생할 수 있기 때문에 선제적으로 전원 검사를 시행한다”고 설명했다.\\n군은 확진자가 나온 지역에서 입소하거나 확진자와 동선이 겹칠 경우에 예방적 격리와 검사를 시행했었다.\\n현재까지 이태원 일대를 방문했다고 부대에 알린 훈련병 83명이 코로나19 검사를 받았고, 전원 음성 판정이 나왔다.\\n훈련병이 입소 후 일주일 전 확진 판정을 받으면 귀가 조치되고, 일주일이 넘은 뒤 확진을 받으면 군 소속으로 치료를 받게 된다.\\n앞서 지난달 13일 육군훈련소에 입소한 3명이 코로나19 확진 판정을 받아 귀가 조치됐다.\"\n\nEvery collected article has the format shown above. Use the `json.decoder.scanstring` function to decode it into plain text that includes newline characters.\n\n## License\n`canrevan` is licensed under Apache-2.0.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faffjljoo3581%2Fcanrevan","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faffjljoo3581%2Fcanrevan","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faffjljoo3581%2Fcanrevan/lists"}