https://github.com/Beomi/KcBERT

🤗 Pretrained BERT model & WordPiece tokenizer trained on Korean comments, together with the dataset used for pretraining

# KcBERT: Korean comments BERT

**Updates on 2022.11.07**

- Released the expanded text dataset (v2022.3Q) used to train KcELECTRA v2022.
- https://github.com/Beomi/KcBERT/releases/tag/v2022.3Q
- The dataset grew from 11GB to 45GB and from 90 million to 340 million samples, roughly 4x the size of the v1 dataset.

**Updates on 2022.10.08**

- The model formerly called dev has been renamed KcELECTRA-base-v2022.
- It improves on KcELECTRA-base (v2021) by about 1 percentage point on most downstream tasks.

**Updates on 2022.09.14**

- Parts of the preprocessing code changed to follow the emoji v2.0.0 update.

**Updates on 2021.04.07**

- KcELECTRA has been released! 🤗
- Trained on a larger dataset with a larger general vocab, KcELECTRA **outperforms KcBERT on every task**.
- Try it yourself at the GitHub link below!
- https://github.com/Beomi/KcELECTRA

**Updates on 2021.03.14**

- Added the citation entry (BibTeX) for the KcBERT paper.
- Added the KcBERT-finetune performance scores to this README.

**Updates on 2020.12.04**

Parts of the tutorial code changed to follow the Huggingface Transformers v4.0.0 update.

Updated KcBERT-Large NSMC Finetuning Colab:
Open In Colab

**Updates on 2020.09.11**

A tutorial for training KcBERT on a TPU in Google Colab is now available! Click the button below.

Pretrain KcBERT on a TPU in Colab:
Open In Colab

Only the amount of text is reduced: the tutorial trains on a subset (144MB) of the full 12GB of text.

It uses the [Korpora](https://github.com/ko-nlp/Korpora) package, which makes Korean datasets and corpora easier to use.

**Updates on 2020.09.08**

The training data has been uploaded as a GitHub Release.

Because each file is limited to 2GB, the data is split into a multi-part archive.

Download it from the link below (no sign-up required; multi-part archive).

If you would rather download a single file, or want to explore the data on Kaggle, use the Kaggle dataset listed below.

- GitHub Release: https://github.com/Beomi/KcBERT/releases/tag/TrainData_v1

**Updates on 2020.08.22**

Pretraining dataset released

- Kaggle: https://www.kaggle.com/junbumlee/kcbert-pretraining-corpus-korean-news-comments (available as a single file)

The dataset cleaned for training (processed with the `clean` function below) is now public on Kaggle!

Download it and try training on a variety of tasks :)

---

Most publicly available Korean BERT models are trained on well-curated data such as Korean Wikipedia, news articles, and books. Comment-style datasets such as NSMC, by contrast, are uncurated, colloquial, and full of neologisms, and they frequently contain typos and other expressions that do not appear in formal writing.

KcBERT targets datasets with these characteristics: it is a pretrained BERT model whose tokenizer and BERT model were both trained from scratch on comments and replies collected from online news.

KcBERT can be loaded directly through the Huggingface Transformers library; no separate file download is needed.

## KcBERT Performance

- The finetuning code is available at https://github.com/Beomi/KcBERT-finetune.

| | Size | **NSMC** (acc) | **Naver NER** (F1) | **PAWS** (acc) | **KorNLI** (acc) | **KorSTS** (spearman) | **Question Pair** (acc) | **KorQuaD (Dev)** (EM/F1) |
| :-------------------- | :---: | :----------------: | :--------------------: | :----------------: | :------------------: | :-----------------------: | :-------------------------: | :---------------------------: |
| KcBERT-Base | 417M | 89.62 | 84.34 | 66.95 | 74.85 | 75.57 | 93.93 | 60.25 / 84.39 |
| KcBERT-Large | 1.2G | **90.68** | 85.53 | 70.15 | 76.99 | 77.49 | 94.06 | 62.16 / 86.64 |
| KoBERT | 351M | 89.63 | 86.11 | 80.65 | 79.00 | 79.64 | 93.93 | 52.81 / 80.27 |
| XLM-Roberta-Base | 1.03G | 89.49 | 86.26 | 82.95 | 79.92 | 79.09 | 93.53 | 64.70 / 88.94 |
| HanBERT | 614M | 90.16 | **87.31** | 82.40 | **80.89** | 83.33 | 94.19 | 78.74 / 92.02 |
| KoELECTRA-Base | 423M | **90.21** | 86.87 | 81.90 | 80.85 | 83.21 | 94.20 | 61.10 / 89.59 |
| KoELECTRA-Base-v2 | 423M | 89.70 | 87.02 | **83.90** | 80.61 | **84.30** | **94.72** | **84.34 / 92.58** |
| DistilKoBERT | 108M | 88.41 | 84.13 | 62.55 | 70.55 | 73.21 | 92.48 | 54.12 / 77.80 |

\*The size listed for HanBERT is the BERT model and the tokenizer DB combined.

\***These results were obtained with the config settings as-is; additional hyperparameter tuning may yield better performance.**

## How to use

### Requirements

- `pytorch <= 1.8.0`
- `transformers ~= 3.0.1`
- `transformers ~= 4.0.0` is also compatible.
- `emoji ~= 0.6.0`
- `soynlp ~= 0.0.493`

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# AutoModelForMaskedLM replaces the deprecated AutoModelWithLMHead class.
# Load either the Base or the Large model; both are shown for reference.

# Base Model (108M)

tokenizer = AutoTokenizer.from_pretrained("beomi/kcbert-base")

model = AutoModelForMaskedLM.from_pretrained("beomi/kcbert-base")

# Large Model (334M)

tokenizer = AutoTokenizer.from_pretrained("beomi/kcbert-large")

model = AutoModelForMaskedLM.from_pretrained("beomi/kcbert-large")
```

### Pretrain & Finetune Colab links

#### Pretrain Data

- [Dataset download (Kaggle, single file, login required)](https://www.kaggle.com/junbumlee/kcbert-pretraining-corpus-korean-news-comments)
- [Dataset download (GitHub, multi-part archive, no login required)](https://github.com/Beomi/KcBERT/releases/tag/TrainData_v1)

#### Pretrain Code

Pretrain KcBERT on a TPU in Colab:
Open In Colab

#### Finetune Samples

**KcBERT-Base** NSMC Finetuning with PyTorch-Lightning (Colab)
Open In Colab

**KcBERT-Large** NSMC Finetuning with PyTorch-Lightning (Colab)
Open In Colab

> The two notebooks above differ only in the pretrained model (base or large) and the batch size; the rest of the code is identical.

## Train Data & Preprocessing

### Raw Data

The training data consists of all **comments and replies** on **heavily commented news** articles written between 2019.01.01 and 2020.06.15.

With only the text extracted, the data amounts to **about 15.4GB, more than 110 million sentences**.

### Preprocessing

The preprocessing steps applied for PLM training are as follows.

1. Korean, English, special characters, and even emoji (🥳)!

A regular expression keeps Korean, English, special characters, and emoji in the training data.

The Hangul range is specified as `ㄱ-ㅎ가-힣`, which excludes the Hanja (Chinese characters) that fall inside `ㄱ-힣`.

2. Collapsing repeated characters within a comment

Repeated characters such as `ㅋㅋㅋㅋㅋ` are collapsed to `ㅋㅋ`.

3. Cased Model

KcBERT is a cased model that preserves letter case in English text.

4. Removing texts of 10 characters or fewer

Texts shorter than 10 characters usually consist of a single word, so they were excluded.

5. Deduplication

Comments posted more than once were merged into a single copy.

The final training data produced this way is **12.5GB, 89 million sentences**.

Install the packages with the pip command below, then clean your text with the `clean` function below; this improves downstream-task performance (fewer `[UNK]` tokens).

```bash
pip install soynlp emoji
```

Apply the `clean` function below to your text data.

```python
import re
import emoji
from soynlp.normalizer import repeat_normalize

# Anything outside the listed punctuation, ASCII, and Hangul is replaced with a space.
pattern = re.compile(r'[^ .,?!/@$%~％·∼()\x00-\x7Fㄱ-ㅣ가-힣]+')
url_pattern = re.compile(
    r'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)')

def clean(x):
    x = pattern.sub(' ', x)
    x = emoji.replace_emoji(x, replace='')  # remove emoji
    x = url_pattern.sub('', x)
    x = x.strip()
    x = repeat_normalize(x, num_repeats=2)  # shorten repeated characters

    return x
```
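
To illustrate why the Hangul range is written as two ranges rather than one, here is a small sketch that uses only the standard library. The names `WIDE_RANGE` and `NARROW_RANGE` and the sample characters are assumptions made for illustration; they are not part of the KcBERT code.

```python
import re

# Hypothetical names for illustration; not part of the KcBERT code.
WIDE_RANGE = re.compile(r'[ㄱ-힣]')         # one range, U+3131 to U+D7A3
NARROW_RANGE = re.compile(r'[ㄱ-ㅎ가-힣]')   # consonant jamo plus precomposed syllables

hanja = '漢'    # U+6F22, a CJK ideograph that sits between ㅎ and 가
hangul = '한'   # U+D55C, a precomposed Hangul syllable

# The single wide range accidentally accepts Hanja...
wide_matches_hanja = bool(WIDE_RANGE.fullmatch(hanja))
# ...while the two-part range rejects it and still accepts Hangul.
narrow_matches_hanja = bool(NARROW_RANGE.fullmatch(hanja))
narrow_matches_hangul = bool(NARROW_RANGE.fullmatch(hangul))

print(wide_matches_hanja, narrow_matches_hanja, narrow_matches_hangul)
# True False True
```

The CJK ideograph block lies between the Hangul jamo and the Hangul syllable block in Unicode, which is why a single `ㄱ-힣` range lets Hanja through.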

### Cleaned Data (Released on Kaggle)

The 12GB txt file produced by cleaning the raw data with the `clean` function above can be downloaded from the Kaggle dataset below :)

https://www.kaggle.com/junbumlee/kcbert-pretraining-corpus-korean-news-comments
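
Beyond character filtering, the preprocessing list above also collapses repeated characters, drops very short texts, and removes duplicates. Those three steps can be sketched with the standard library alone. This is a minimal sketch under assumptions, not the author's pipeline: `collapse_repeats` is a simplified regex stand-in for soynlp's `repeat_normalize`, the threshold drops texts of 10 characters or fewer, the order of the steps is a guess, and all function names and sample comments are hypothetical.

```python
import re

# Simplified stand-in for soynlp's repeat_normalize (assumption):
# a run of 3 or more identical characters is shortened to 2.
_REPEATS = re.compile(r'(.)\1{2,}')

def collapse_repeats(text: str) -> str:
    return _REPEATS.sub(r'\1\1', text)

def preprocess(comments, min_chars=10):
    """Collapse repeats, drop texts of min_chars characters or fewer, deduplicate."""
    seen = set()
    kept = []
    for comment in comments:
        text = collapse_repeats(comment.strip())
        if len(text) <= min_chars:   # too short, usually a single word
            continue
        if text in seen:             # keep only the first copy of a duplicate
            continue
        seen.add(text)
        kept.append(text)
    return kept

sample = [
    "이 영화 진짜 재밌어요 ㅋㅋㅋㅋㅋ",
    "이 영화 진짜 재밌어요 ㅋㅋㅋ",   # identical to the first once repeats are collapsed
    "굿",                            # too short
]
result = preprocess(sample)
print(result)
# ['이 영화 진짜 재밌어요 ㅋㅋ']
```

Collapsing repeats before deduplicating means comments that differ only in how many times a character is repeated count as duplicates.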

## Tokenizer Train

The tokenizer was trained with Huggingface's [Tokenizers](https://github.com/huggingface/tokenizers) library.

Specifically, it was trained with `BertWordPieceTokenizer`, using a vocab size of `30000`.

The tokenizer was trained on a `1/10` sample of the data, stratified by date so that the sample was drawn more evenly.
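
The date-stratified `1/10` sampling described above could look roughly like the following. This is a sketch under assumptions, not the author's sampling code: the `(date, text)` record layout, the function name `stratified_sample`, the fixed seed, and the rule of keeping at least one text per date are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(records, fraction=0.1, seed=42):
    """Sample `fraction` of the records from each date so every day is represented.

    `records` is an iterable of (date, text) pairs (an assumed layout).
    """
    rng = random.Random(seed)
    by_date = defaultdict(list)
    for date, text in records:
        by_date[date].append(text)

    sampled = []
    for date in sorted(by_date):
        texts = by_date[date]
        # Keep at least one text per date so small days are not dropped entirely.
        k = max(1, round(len(texts) * fraction))
        sampled.extend((date, t) for t in rng.sample(texts, k))
    return sampled

# Toy corpus: 100 comments on one day, 20 on another.
records = [("2019-01-01", f"comment {i}") for i in range(100)]
records += [("2019-01-02", f"reply {i}") for i in range(20)]

sample = stratified_sample(records)
per_day = {d: sum(1 for day, _ in sample if day == d)
           for d in ("2019-01-01", "2019-01-02")}
print(per_day)
# {'2019-01-01': 10, '2019-01-02': 2}
```

Sampling within each date keeps the day-to-day proportions of the full corpus, whereas a plain random sample of a small corpus could leave quiet days out altogether.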

## BERT Model Pretrain

- KcBERT Base config

```json
{
  "max_position_embeddings": 300,
  "hidden_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "initializer_range": 0.02,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 30000,
  "hidden_size": 768,
  "attention_probs_dropout_prob": 0.1,
  "directionality": "bidi",
  "num_attention_heads": 12,
  "intermediate_size": 3072,
  "architectures": [
    "BertForMaskedLM"
  ],
  "model_type": "bert"
}
```

- KcBERT Large config

```json
{
  "type_vocab_size": 2,
  "initializer_range": 0.02,
  "max_position_embeddings": 300,
  "vocab_size": 30000,
  "hidden_size": 1024,
  "hidden_dropout_prob": 0.1,
  "model_type": "bert",
  "directionality": "bidi",
  "pad_token_id": 0,
  "layer_norm_eps": 1e-12,
  "hidden_act": "gelu",
  "num_hidden_layers": 24,
  "num_attention_heads": 16,
  "attention_probs_dropout_prob": 0.1,
  "intermediate_size": 4096,
  "architectures": [
    "BertForMaskedLM"
  ]
}
```

The BERT model configs use the default Base and Large settings as-is (MLM 15%, etc.).
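
As a sanity check on the model sizes quoted under How to use (108M for Base, 334M for Large), the parameter count can be derived from the two configs above. The sketch below assumes the standard BERT layout (embeddings, encoder layers, and pooler, without the masked-LM head); `bert_param_count` is a hypothetical helper, not part of any library.

```python
def bert_param_count(cfg: dict) -> int:
    """Count weights and biases of a standard BERT encoder described by cfg."""
    h = cfg["hidden_size"]
    i = cfg["intermediate_size"]
    layer_norm = 2 * h  # scale and bias

    embeddings = (
        cfg["vocab_size"] * h                 # token embeddings
        + cfg["max_position_embeddings"] * h  # position embeddings
        + cfg["type_vocab_size"] * h          # segment embeddings
        + layer_norm
    )
    per_layer = (
        4 * (h * h + h)   # query, key, value, and attention output projections
        + layer_norm
        + (h * i + i)     # feed-forward up-projection
        + (i * h + h)     # feed-forward down-projection
        + layer_norm
    )
    pooler = h * h + h
    return embeddings + cfg["num_hidden_layers"] * per_layer + pooler

base = {"vocab_size": 30000, "max_position_embeddings": 300, "type_vocab_size": 2,
        "hidden_size": 768, "intermediate_size": 3072, "num_hidden_layers": 12}
large = {**base, "hidden_size": 1024, "intermediate_size": 4096, "num_hidden_layers": 24}

base_params = bert_param_count(base)
large_params = bert_param_count(large)
print(f"{base_params / 1e6:.1f}M, {large_params / 1e6:.1f}M")
# 108.9M, 334.4M
```

The number of attention heads does not appear in the formula because splitting the hidden dimension across heads leaves the projection matrix sizes unchanged.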

Training ran on a TPU `v3-8` for 3 days (Base) and N days (Large, training still in progress), and the models currently published on Huggingface are checkpoints trained for 1M (one million) steps.

The training loss falls fastest during the first 200k steps and decreases only gradually after 400k steps.

- Base Model Loss

![KcBERT-Base Pretraining Loss](./img/image-20200719183852243.38b124.png)

- Large Model Loss

![KcBERT-Large Pretraining Loss](./img/image-20200806160746694.d56fa1.png)

Training ran on a GCP TPU v3-8 and took about 2.5 days for the Base model. The Large model was trained for about 5 days, and the checkpoint with the lowest loss was selected.

## Example

### HuggingFace MASK LM

You can try masked-token prediction on the [HuggingFace kcbert-base model page](https://huggingface.co/beomi/kcbert-base?text=오늘은+날씨가+[MASK]), as shown below.

![오늘은 날씨가 "좋네요" (The weather today is "nice"), KcBERT-Base](./img/image-20200719205919389.5670d6.png)

You can of course run the same test on the [kcbert-large model page](https://huggingface.co/beomi/kcbert-large?text=오늘은+날씨가+[MASK]).

![image-20200806160624340](./img/image-20200806160624340.58f9be.png)

### NSMC Binary Classification

As a quick performance check, the models were fine-tuned on the [Naver movie review corpus (NSMC)](https://github.com/e9t/nsmc) dataset.

You can run the code that fine-tunes the Base model yourself:
Open In Colab

You can run the code that fine-tunes the Large model yourself:
Open In Colab

- On a single P100 GPU, one epoch takes 2-3 hours; on a TPU, one epoch takes under 1 hour.
- On four RTX Titan GPUs, one epoch takes 30 minutes.
- The example code is written with [pytorch-lightning](https://github.com/PyTorchLightning/pytorch-lightning).

#### Results

- KcBERT-Base model result: Val acc `.8905`

![KcBERT Base finetune on NSMC](./img/image-20200719201102895.ddbdfc.png)

- KcBERT-Large model result: Val acc `.9089`

![image-20200806190242834](./img/image-20200806190242834.56d6ee.png)

> Tests on a wider range of downstream tasks are planned and will be published.

## Citation

When citing KcBERT, please use the entry below.

```
@inproceedings{lee2020kcbert,
title={KcBERT: Korean Comments BERT},
author={Lee, Junbum},
booktitle={Proceedings of the 32nd Annual Conference on Human and Cognitive Language Technology},
pages={437--440},
year={2020}
}
```

- Proceedings download link: http://hclt.kr/dwn/?v=bG5iOmNvbmZlcmVuY2U7aWR4OjMy (or http://hclt.kr/symp/?lnb=conference)

## Acknowledgement

The GCP/TPU environment used to train the KcBERT model was supported by the [TFRC](https://www.tensorflow.org/tfrc?hl=ko) program.

Thanks to [Monologg](https://github.com/monologg/) for the generous advice during model training :)

## Reference

### Github Repos

- [BERT by Google](https://github.com/google-research/bert)
- [KoBERT by SKT](https://github.com/SKTBrain/KoBERT)
- [KoELECTRA by Monologg](https://github.com/monologg/KoELECTRA/)

- [Transformers by Huggingface](https://github.com/huggingface/transformers)
- [Tokenizers by Huggingface](https://github.com/huggingface/tokenizers)

### Papers

- [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)

### Blogs

- [Monologg's KoELECTRA training notes](https://monologg.kr/categories/NLP/ELECTRA/)
- [Training BERT from scratch on a TPU in Colab - Tensorflow/Google ver.](https://beomi.github.io/2020/02/26/Train-BERT-from-scratch-on-colab-TPU-Tensorflow-ver/)