{"id":13635645,"url":"https://github.com/SKT-AI/KoBART","last_synced_at":"2025-04-19T04:31:23.537Z","repository":{"id":38197353,"uuid":"317120814","full_name":"SKT-AI/KoBART","owner":"SKT-AI","description":"Korean BART","archived":false,"fork":false,"pushed_at":"2024-10-03T01:40:37.000Z","size":8435,"stargazers_count":447,"open_issues_count":1,"forks_count":94,"subscribers_count":23,"default_branch":"main","last_synced_at":"2024-11-09T05:34:40.958Z","etag":null,"topics":["korean-nlp","language-model","summarization"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SKT-AI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-11-30T05:33:10.000Z","updated_at":"2024-11-04T01:21:12.000Z","dependencies_parsed_at":"2024-08-02T00:02:58.663Z","dependency_job_id":null,"html_url":"https://github.com/SKT-AI/KoBART","commit_stats":null,"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SKT-AI%2FKoBART","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SKT-AI%2FKoBART/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SKT-AI%2FKoBART/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SKT-AI%2FKoBART/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/SKT-AI","download_url":"https://codeload.github.com/SKT-AI/KoBART/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https
://github.com","kind":"github","repositories_count":249606341,"owners_count":21298851,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["korean-nlp","language-model","summarization"],"created_at":"2024-08-02T00:00:48.858Z","updated_at":"2025-04-19T04:31:23.209Z","avatar_url":"https://github.com/SKT-AI.png","language":"Python","funding_links":[],"categories":["Resources"],"sub_categories":["Pre-trained Models"],"readme":"# 🤣 KoBART\n\n* [🤣 KoBART](#-kobart)\n  * [How to install](#how-to-install)\n  * [Data](#data)\n  * [Tokenizer](#tokenizer)\n  * [Model](#model)\n    * [Performances](#performances)\n      * [Classification or Regression](#classification-or-regression)\n      * [Summarization](#summarization)\n  * [Demos](#demos)\n  * [Examples](#examples)\n  * [Release](#release)\n  * [Contacts](#contacts)\n  * [License](#license)\n\n[**BART**](https://arxiv.org/pdf/1910.13461.pdf) (**B**idirectional and **A**uto-**R**egressive **T**ransformers) is trained as an `autoencoder`: noise is added to part of the input text, and the model learns to reconstruct the original. Korean BART (hereafter **KoBART**) is a Korean `encoder-decoder` language model trained on more than **40GB** of Korean text using the `Text Infilling` noise function from the paper. 
This repository releases the resulting `KoBART-base` model.\n\n![bart](imgs/bart.png)\n\n## How to install\n\n```bash\npip install git+https://github.com/SKT-AI/KoBART#egg=kobart\n```\n\n## Data\n\n| Data         | # of Sentences |\n| ------------ | -------------: |\n| Korean Wiki  |             5M |\n| Other corpus |          0.27B |\n\nIn addition to Korean Wikipedia, a variety of data was used to train the model, including news, books, and [모두의 말뭉치 (Modu Corpus) v1.0 (dialogue, news, ...)](https://corpus.korean.go.kr/).\n\n## Tokenizer\n\nThe tokenizer was trained with the `Character BPE tokenizer` from the [`tokenizers`](https://github.com/huggingface/tokenizers) package.\n\nThe `vocab` size is 30,000. Emoticons and emoji that are common in conversation, such as those below, were added to the vocabulary so that the model recognizes these tokens better.\n\u003e 😀, 😁, 😆, 😅, 🤣, .. , `:-)`, `:)`, `-)`, `(-:`...\n\nIn addition, unused tokens `\u003cunused0\u003e` through `\u003cunused99\u003e` are defined so that they can be assigned freely as needed for `subtasks`.\n\n```python\n\u003e\u003e\u003e from kobart import get_kobart_tokenizer\n\u003e\u003e\u003e kobart_tokenizer = get_kobart_tokenizer()\n\u003e\u003e\u003e kobart_tokenizer.tokenize(\"안녕하세요. 한국어 BART 입니다.🤣:)l^o\")\n['▁안녕하', '세요.', '▁한국어', '▁B', 'A', 'R', 'T', '▁입', '니다.', '🤣', ':)', 'l^o']\n```\n\n## Model\n\n| Model         | # of params |  Type   | # of layers | # of heads | ffn_dim | hidden_dims |\n| ------------- | :---------: | :-----: | ----------: | ---------: | ------: | ----------: |\n| `KoBART-base` |    124M     | Encoder |           6 |         16 |    3072 |         768 |\n|               |             | Decoder |           6 |         16 |    3072 |         768 |\n\n```python\n\u003e\u003e\u003e from transformers import BartModel\n\u003e\u003e\u003e from kobart import get_pytorch_kobart_model, get_kobart_tokenizer\n\u003e\u003e\u003e kobart_tokenizer = get_kobart_tokenizer()\n\u003e\u003e\u003e model = BartModel.from_pretrained(get_pytorch_kobart_model())\n\u003e\u003e\u003e inputs = kobart_tokenizer(['안녕하세요.'], return_tensors='pt')\n\u003e\u003e\u003e model(inputs['input_ids'])\nSeq2SeqModelOutput(last_hidden_state=tensor([[[-0.4418, -4.3673,  3.2404,  ...,  5.8832,  4.0629,  
3.5540],\n         [-0.1316, -4.6446,  2.5955,  ...,  6.0093,  2.7467,  3.0007]]],\n       grad_fn=\u003cNativeLayerNormBackward\u003e), past_key_values=((tensor([[[[-9.7980e-02, -6.6584e-01, -1.8089e+00,  ...,  9.6023e-01, -1.8818e-01, -1.3252e+00],\n```\n\n### Performances\n\n#### Classification or Regression\n\n|                 | [NSMC](https://github.com/e9t/nsmc)(acc) | [KorSTS](https://github.com/kakaobrain/KorNLUDatasets)(spearman) | [Question Pair](https://github.com/aisolab/nlp_classification/tree/master/BERT_pairwise_text_classification/qpair)(acc) |\n| --------------- | ---------------------------------------- | ---------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------- |\n| **KoBART-base** | 90.24                                    | 81.66                                                            | 94.34                                                                                                                   |\n\n#### Summarization\n\n*To be updated*\n\n## Demos\n\n* [Summarization demo](https://huggingface.co/spaces/gogamza/kobart-summarization)\n\n\u003cimg src=\"imgs/kobart_summ.png\" width=\"600\"/\u003e\n\n*The example above is a summary of a [ZDNET article](https://zdnet.co.kr/view/?no=20201125093328).*\n\n## Examples\n\n* [NSMC Classification](https://github.com/SKT-AI/KoBART/tree/main/examples)\n* [KoBART ChitChatBot](https://github.com/haven-jeon/KoBART-chatbot)\n* [KoBART Summarization](https://github.com/seujung/KoBART-summarization)\n* [KoBART Translation](https://github.com/seujung/KoBART-translation)\n* [LegalQA using Sentence**KoBART**](https://github.com/haven-jeon/LegalQA)\n* [KoBART Question Generation](https://github.com/Seoneun/KoBART-Question-Generation)\n\n*If you have an interesting example that uses KoBART, please send a PR!*\n\n## Release\n\n* v0.5.1\n  * guide default 'import statements'\n* v0.5\n  * download large files from `aws s3`\n* v0.4\n  * Update model binary\n* 
v0.3\n  * Fixed an issue where the `\u003cunk\u003e` token disappeared because of a tokenizer bug\n* v0.2\n  * Updated the `KoBART` model (better sample efficiency on subtasks)\n  * Documented the version of `모두의 말뭉치` (Modu Corpus) used\n  * Fixed a downloader bug\n  * Added `pip` installation support\n\n## Contacts\n\nPlease post `KoBART`-related issues [here](https://github.com/SKT-AI/KoBART/issues).\n\n## License\n\n`KoBART` is released under the `modified MIT` license. Please comply with the license terms when using the model and code. The full license text is available in the `LICENSE` file.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FSKT-AI%2FKoBART","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FSKT-AI%2FKoBART","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FSKT-AI%2FKoBART/lists"}