Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/beomi/kcelectra

๐Ÿค— Korean Comments ELECTRA: ํ•œ๊ตญ์–ด ๋Œ“๊ธ€๋กœ ํ•™์Šตํ•œ ELECTRA ๋ชจ๋ธ
https://github.com/beomi/kcelectra

electra korean-nlp nlp transformers

Last synced: 2 days ago
JSON representation

๐Ÿค— Korean Comments ELECTRA: ํ•œ๊ตญ์–ด ๋Œ“๊ธ€๋กœ ํ•™์Šตํ•œ ELECTRA ๋ชจ๋ธ

Awesome Lists containing this project

README

        

# KcELECTRA: Korean comments ELECTRA

** Updates on 2022.11.07 **

- KcELECTRA v2022 ํ•™์Šต์— ์‚ฌ์šฉํ•œ, ํ™•์žฅ๋œ ํ…์ŠคํŠธ ๋ฐ์ดํ„ฐ์…‹(v2022.3Q)๋ฅผ ๊ณต๊ฐœํ•ฉ๋‹ˆ๋‹ค.
- https://github.com/Beomi/KcBERT/releases/tag/v2022.3Q
- ๊ธฐ์กด 11GB -> ์‹ ๊ทœ 45GB, ๊ธฐ์กด 0.9์–ต๊ฑด -> ์‹ ๊ทœ 3.4์–ต๊ฑด์œผ๋กœ ๊ธฐ์กด v1 ๋ฐ์ดํ„ฐ์…‹ ๋Œ€๋น„ ์•ฝ 4๋ฐฐ ์ฆ๊ฐ€ํ•œ ๋ฐ์ดํ„ฐ์…‹์ž…๋‹ˆ๋‹ค.

** Updates on 2022.10.08 **

- KcELECTRA-base-v2022 (๊ตฌ v2022-dev) ๋ชจ๋ธ ์ด๋ฆ„์ด ๋ณ€๊ฒฝ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.
- ์œ„ ๋ชจ๋ธ์˜ ์„ธ๋ถ€ ์Šค์ฝ”์–ด๋ฅผ ์ถ”๊ฐ€ํ•˜์˜€์Šต๋‹ˆ๋‹ค.
- ๊ธฐ์กด KcELECTRA-base(v2021) ๋Œ€๋น„ ๋Œ€๋ถ€๋ถ„์˜ downstream task์—์„œ ~1%p ์ˆ˜์ค€์˜ ์„ฑ๋Šฅ ํ–ฅ์ƒ์ด ์žˆ์Šต๋‹ˆ๋‹ค.

---

๊ณต๊ฐœ๋œ ํ•œ๊ตญ์–ด Transformer ๊ณ„์—ด ๋ชจ๋ธ๋“ค์€ ๋Œ€๋ถ€๋ถ„ ํ•œ๊ตญ์–ด ์œ„ํ‚ค, ๋‰ด์Šค ๊ธฐ์‚ฌ, ์ฑ… ๋“ฑ ์ž˜ ์ •์ œ๋œ ๋ฐ์ดํ„ฐ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•™์Šตํ•œ ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค. ํ•œํŽธ, ์‹ค์ œ๋กœ NSMC์™€ ๊ฐ™์€ User-Generated Noisy text domain ๋ฐ์ดํ„ฐ์…‹์€ ์ •์ œ๋˜์ง€ ์•Š์•˜๊ณ  ๊ตฌ์–ด์ฒด ํŠน์ง•์— ์‹ ์กฐ์–ด๊ฐ€ ๋งŽ์œผ๋ฉฐ, ์˜คํƒˆ์ž ๋“ฑ ๊ณต์‹์ ์ธ ๊ธ€์“ฐ๊ธฐ์—์„œ ๋‚˜ํƒ€๋‚˜์ง€ ์•Š๋Š” ํ‘œํ˜„๋“ค์ด ๋นˆ๋ฒˆํ•˜๊ฒŒ ๋“ฑ์žฅํ•ฉ๋‹ˆ๋‹ค.

KcELECTRA๋Š” ์œ„์™€ ๊ฐ™์€ ํŠน์„ฑ์˜ ๋ฐ์ดํ„ฐ์…‹์— ์ ์šฉํ•˜๊ธฐ ์œ„ํ•ด, ์˜จ๋ผ์ธ ๋‰ด์Šค์—์„œ ๋Œ“๊ธ€๊ณผ ๋Œ€๋Œ“๊ธ€์„ ์ˆ˜์ง‘ํ•ด, ํ† ํฌ๋‚˜์ด์ €์™€ ELECTRA๋ชจ๋ธ์„ ์ฒ˜์Œ๋ถ€ํ„ฐ ํ•™์Šตํ•œ Pretrained ELECTRA ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค.

๊ธฐ์กด KcBERT ๋Œ€๋น„ ๋ฐ์ดํ„ฐ์…‹ ์ฆ๊ฐ€ ๋ฐ vocab ํ™•์žฅ์„ ํ†ตํ•ด ์ƒ๋‹นํ•œ ์ˆ˜์ค€์œผ๋กœ ์„ฑ๋Šฅ์ด ํ–ฅ์ƒ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

KcELECTRA๋Š” Huggingface์˜ Transformers ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ํ†ตํ•ด ๊ฐ„ํŽธํžˆ ๋ถˆ๋Ÿฌ์™€ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. (๋ณ„๋„์˜ ํŒŒ์ผ ๋‹ค์šด๋กœ๋“œ๊ฐ€ ํ•„์š”ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.)

```
๐Ÿ’ก NOTE ๐Ÿ’ก
General Corpus๋กœ ํ•™์Šตํ•œ KoELECTRA๊ฐ€ ๋ณดํŽธ์ ์ธ task์—์„œ๋Š” ์„ฑ๋Šฅ์ด ๋” ์ž˜ ๋‚˜์˜ฌ ๊ฐ€๋Šฅ์„ฑ์ด ๋†’์Šต๋‹ˆ๋‹ค.
KcBERT/KcELECTRA๋Š” User genrated, Noisy text์— ๋Œ€ํ•ด์„œ ๋ณด๋‹ค ์ž˜ ๋™์ž‘ํ•˜๋Š” PLM์ž…๋‹ˆ๋‹ค.
```

## KcELECTRA Performance

- Finetune ์ฝ”๋“œ๋Š” https://github.com/Beomi/KcBERT-finetune ์—์„œ ์ฐพ์•„๋ณด์‹ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
- ํ•ด๋‹น Repo์˜ ๊ฐ Checkpoint ํด๋”์—์„œ Step๋ณ„ ์„ธ๋ถ€ ์Šค์ฝ”์–ด๋ฅผ ํ™•์ธํ•˜์‹ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

| | Size
(์šฉ๋Ÿ‰) | **NSMC**
(acc) | **Naver NER**
(F1) | **PAWS**
(acc) | **KorNLI**
(acc) | **KorSTS**
(spearman) | **Question Pair**
(acc) | **KorQuaD (Dev)**
(EM/F1) |
| :----------------- | :-------------: | :----------------: | :--------------------: | :----------------: | :------------------: | :-----------------------: | :-------------------------: | :---------------------------: |
| **KcELECTRA-base-v2022** | 475M | **91.97** | 87.35 | 76.50 | 82.12 | 83.67 | 95.12 | 69.00 / 90.40 |
| **KcELECTRA-base** | 475M | 91.71 | 86.90 | 74.80 | 81.65 | 82.65 | **95.78** | 70.60 / 90.11 |
| KcBERT-Base | 417M | 89.62 | 84.34 | 66.95 | 74.85 | 75.57 | 93.93 | 60.25 / 84.39 |
| KcBERT-Large | 1.2G | 90.68 | 85.53 | 70.15 | 76.99 | 77.49 | 94.06 | 62.16 / 86.64 |
| KoBERT | 351M | 89.63 | 86.11 | 80.65 | 79.00 | 79.64 | 93.93 | 52.81 / 80.27 |
| XLM-Roberta-Base | 1.03G | 89.49 | 86.26 | 82.95 | 79.92 | 79.09 | 93.53 | 64.70 / 88.94 |
| HanBERT | 614M | 90.16 | 87.31 | 82.40 | 80.89 | 83.33 | 94.19 | 78.74 / 92.02 |
| KoELECTRA-Base | 423M | 90.21 | 86.87 | 81.90 | 80.85 | 83.21 | 94.20 | 61.10 / 89.59 |
| KoELECTRA-Base-v2 | 423M | 89.70 | 87.02 | 83.90 | 80.61 | 84.30 | 94.72 | 84.34 / 92.58 |
| KoELECTRA-Base-v3 | 423M | 90.63 | **88.11** | **84.45** | **82.24** | **85.53** | 95.25 | **84.83 / 93.45** |
| DistilKoBERT | 108M | 88.41 | 84.13 | 62.55 | 70.55 | 73.21 | 92.48 | 54.12 / 77.80 |

\*HanBERT์˜ Size๋Š” Bert Model๊ณผ Tokenizer DB๋ฅผ ํ•ฉ์นœ ๊ฒƒ์ž…๋‹ˆ๋‹ค.

\***config์˜ ์„ธํŒ…์„ ๊ทธ๋Œ€๋กœ ํ•˜์—ฌ ๋Œ๋ฆฐ ๊ฒฐ๊ณผ์ด๋ฉฐ, hyperparameter tuning์„ ์ถ”๊ฐ€์ ์œผ๋กœ ํ•  ์‹œ ๋” ์ข‹์€ ์„ฑ๋Šฅ์ด ๋‚˜์˜ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.**

## How to use

### Requirements

- `pytorch ~= 1.8.0`
- `transformers ~= 4.11.3`
- `emoji ~= 0.6.0`
- `soynlp ~= 0.0.493`

### Default usage

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("beomi/KcELECTRA-base-v2022")
model = AutoModel.from_pretrained("beomi/KcELECTRA-base-v2022")

# ๊ตฌ๋ฒ„์ „(v2021)์„ ์‚ฌ์šฉํ•˜๊ธฐ ์›ํ•˜์‹ค ๊ฒฝ์šฐ
#tokenizer = AutoTokenizer.from_pretrained("beomi/KcELECTRA-base")
#model = AutoModel.from_pretrained("beomi/KcELECTRA-base")
```

> ๐Ÿ’ก ์ด์ „ KcBERT ๊ด€๋ จ ์ฝ”๋“œ๋“ค์—์„œ `AutoTokenizer`, `AutoModel` ์„ ์‚ฌ์šฉํ•œ ๊ฒฝ์šฐ `.from_pretrained("beomi/kcbert-base")` ๋ถ€๋ถ„์„ `.from_pretrained("beomi/KcELECTRA-base")` ๋กœ๋งŒ ๋ณ€๊ฒฝํ•ด์ฃผ์‹œ๋ฉด ์ฆ‰์‹œ ์‚ฌ์šฉ์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

### Pretrain & Finetune Colab ๋งํฌ ๋ชจ์Œ

#### Pretrain Data

- KcBERTํ•™์Šต์— ์‚ฌ์šฉํ•œ ๋ฐ์ดํ„ฐ + ์ดํ›„ 2021.03์›” ์ดˆ๊นŒ์ง€ ์ˆ˜์ง‘ํ•œ ๋Œ“๊ธ€
- ์•ฝ 17GB
- ๋Œ“๊ธ€-๋Œ€๋Œ“๊ธ€์„ ๋ฌถ์€ ๊ธฐ๋ฐ˜์œผ๋กœ Document ๊ตฌ์„ฑ

#### Pretrain Code

- https://github.com/KLUE-benchmark/KLUE-ELECTRA Repo๋ฅผ ํ†ตํ•œ Pretrain

#### Finetune Code

- https://github.com/Beomi/KcBERT-finetune Repo๋ฅผ ํ†ตํ•œ Finetune ๋ฐ ์Šค์ฝ”์–ด ๋น„๊ต

#### Finetune Samples

- NSMC with PyTorch-Lightning 1.3.0, GPU, Colab
Open In Colab

## Train Data & Preprocessing

### Raw Data

ํ•™์Šต ๋ฐ์ดํ„ฐ๋Š” 2019.01.01 ~ 2021.03.09 ์‚ฌ์ด์— ์ž‘์„ฑ๋œ **๋Œ“๊ธ€ ๋งŽ์€ ๋‰ด์Šค/ํ˜น์€ ์ „์ฒด ๋‰ด์Šค** ๊ธฐ์‚ฌ๋“ค์˜ **๋Œ“๊ธ€๊ณผ ๋Œ€๋Œ“๊ธ€**์„ ๋ชจ๋‘ ์ˆ˜์ง‘ํ•œ ๋ฐ์ดํ„ฐ์ž…๋‹ˆ๋‹ค.

๋ฐ์ดํ„ฐ ์‚ฌ์ด์ฆˆ๋Š” ํ…์ŠคํŠธ๋งŒ ์ถ”์ถœ์‹œ **์•ฝ 17.3GB์ด๋ฉฐ, 1์–ต8์ฒœ๋งŒ๊ฐœ ์ด์ƒ์˜ ๋ฌธ์žฅ**์œผ๋กœ ์ด๋ค„์ ธ ์žˆ์Šต๋‹ˆ๋‹ค.

> KcBERT๋Š” 2019.01-2020.06์˜ ํ…์ŠคํŠธ๋กœ, ์ •์ œ ํ›„ ์•ฝ 9์ฒœ๋งŒ๊ฐœ ๋ฌธ์žฅ์œผ๋กœ ํ•™์Šต์„ ์ง„ํ–‰ํ–ˆ์Šต๋‹ˆ๋‹ค.

### Preprocessing

PLM ํ•™์Šต์„ ์œ„ํ•ด์„œ ์ „์ฒ˜๋ฆฌ๋ฅผ ์ง„ํ–‰ํ•œ ๊ณผ์ •์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

1. ํ•œ๊ธ€ ๋ฐ ์˜์–ด, ํŠน์ˆ˜๋ฌธ์ž, ๊ทธ๋ฆฌ๊ณ  ์ด๋ชจ์ง€(๐Ÿฅณ)๊นŒ์ง€!

์ •๊ทœํ‘œํ˜„์‹์„ ํ†ตํ•ด ํ•œ๊ธ€, ์˜์–ด, ํŠน์ˆ˜๋ฌธ์ž๋ฅผ ํฌํ•จํ•ด Emoji๊นŒ์ง€ ํ•™์Šต ๋Œ€์ƒ์— ํฌํ•จํ–ˆ์Šต๋‹ˆ๋‹ค.

ํ•œํŽธ, ํ•œ๊ธ€ ๋ฒ”์œ„๋ฅผ `ใ„ฑ-ใ…Ž๊ฐ€-ํžฃ` ์œผ๋กœ ์ง€์ •ํ•ด `ใ„ฑ-ํžฃ` ๋‚ด์˜ ํ•œ์ž๋ฅผ ์ œ์™ธํ–ˆ์Šต๋‹ˆ๋‹ค.

2. ๋Œ“๊ธ€ ๋‚ด ์ค‘๋ณต ๋ฌธ์ž์—ด ์ถ•์•ฝ

`ใ…‹ใ…‹ใ…‹ใ…‹ใ…‹`์™€ ๊ฐ™์ด ์ค‘๋ณต๋œ ๊ธ€์ž๋ฅผ `ใ…‹ใ…‹`์™€ ๊ฐ™์€ ๊ฒƒ์œผ๋กœ ํ•ฉ์ณค์Šต๋‹ˆ๋‹ค.

3. Cased Model

KcBERT๋Š” ์˜๋ฌธ์— ๋Œ€ํ•ด์„œ๋Š” ๋Œ€์†Œ๋ฌธ์ž๋ฅผ ์œ ์ง€ํ•˜๋Š” Cased model์ž…๋‹ˆ๋‹ค.

4. ๊ธ€์ž ๋‹จ์œ„ 10๊ธ€์ž ์ดํ•˜ ์ œ๊ฑฐ

10๊ธ€์ž ๋ฏธ๋งŒ์˜ ํ…์ŠคํŠธ๋Š” ๋‹จ์ผ ๋‹จ์–ด๋กœ ์ด๋ค„์ง„ ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์•„ ํ•ด๋‹น ๋ถ€๋ถ„์„ ์ œ์™ธํ–ˆ์Šต๋‹ˆ๋‹ค.

5. ์ค‘๋ณต ์ œ๊ฑฐ

์ค‘๋ณต์ ์œผ๋กœ ์“ฐ์ธ ๋Œ“๊ธ€์„ ์ œ๊ฑฐํ•˜๊ธฐ ์œ„ํ•ด ์™„์ „ํžˆ ์ผ์น˜ํ•˜๋Š” ์ค‘๋ณต ๋Œ“๊ธ€์„ ํ•˜๋‚˜๋กœ ํ•ฉ์ณค์Šต๋‹ˆ๋‹ค.

6. `OOO` ์ œ๊ฑฐ

์ผ๋ถ€ ํฌํƒˆ ๋Œ“๊ธ€์˜ ๊ฒฝ์šฐ, ๋น„์†์–ด๋Š” ์ž์ฒด ํ•„ํ„ฐ๋ง์„ ํ†ตํ•ด `OOO` ๋กœ ํ‘œ์‹œํ•ฉ๋‹ˆ๋‹ค. ์ด ๋ถ€๋ถ„์„ ๊ณต๋ฐฑ์œผ๋กœ ์ œ๊ฑฐํ•˜์˜€์Šต๋‹ˆ๋‹ค.

์•„๋ž˜ ๋ช…๋ น์–ด๋กœ pip๋กœ ์„ค์น˜ํ•œ ๋’ค, ์•„๋ž˜ cleanํ•จ์ˆ˜๋กœ ํด๋ฆฌ๋‹์„ ํ•˜๋ฉด Downstream task์—์„œ ๋ณด๋‹ค ์„ฑ๋Šฅ์ด ์ข‹์•„์ง‘๋‹ˆ๋‹ค. (`[UNK]` ๊ฐ์†Œ)

```bash
pip install soynlp emoji
```

์•„๋ž˜ `clean` ํ•จ์ˆ˜๋ฅผ Text data์— ์‚ฌ์šฉํ•ด์ฃผ์„ธ์š”.

```python
import re
import emoji
from soynlp.normalizer import repeat_normalize

emojis = ''.join(emoji.UNICODE_EMOJI.keys())
pattern = re.compile(f'[^ .,?!/@$%~๏ผ…ยทโˆผ()\x00-\x7Fใ„ฑ-ใ…ฃ๊ฐ€-ํžฃ{emojis}]+')
url_pattern = re.compile(
r'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)')

import re
import emoji
from soynlp.normalizer import repeat_normalize

pattern = re.compile(f'[^ .,?!/@$%~๏ผ…ยทโˆผ()\x00-\x7Fใ„ฑ-ใ…ฃ๊ฐ€-ํžฃ]+')
url_pattern = re.compile(
r'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)')

def clean(x):
x = pattern.sub(' ', x)
x = emoji.replace_emoji(x, replace='') #emoji ์‚ญ์ œ
x = url_pattern.sub('', x)
x = x.strip()
x = repeat_normalize(x, num_repeats=2)
return x
```

> ๐Ÿ’ก Finetune Score์—์„œ๋Š” ์œ„ `clean` ํ•จ์ˆ˜๋ฅผ ์ ์šฉํ•˜์ง€ ์•Š์•˜์Šต๋‹ˆ๋‹ค.

### Cleaned Data

- KcBERT ์™ธ ์ถ”๊ฐ€ ๋ฐ์ดํ„ฐ๋Š” ์ •๋ฆฌ ํ›„ ๊ณต๊ฐœ ์˜ˆ์ •์ž…๋‹ˆ๋‹ค.

## Tokenizer, Model Train

Tokenizer๋Š” Huggingface์˜ [Tokenizers](https://github.com/huggingface/tokenizers) ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ํ†ตํ•ด ํ•™์Šต์„ ์ง„ํ–‰ํ–ˆ์Šต๋‹ˆ๋‹ค.

๊ทธ ์ค‘ `BertWordPieceTokenizer` ๋ฅผ ์ด์šฉํ•ด ํ•™์Šต์„ ์ง„ํ–‰ํ–ˆ๊ณ , Vocab Size๋Š” `30000`์œผ๋กœ ์ง„ํ–‰ํ–ˆ์Šต๋‹ˆ๋‹ค.

Tokenizer๋ฅผ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์—๋Š” ์ „์ฒด ๋ฐ์ดํ„ฐ๋ฅผ ํ†ตํ•ด ํ•™์Šต์„ ์ง„ํ–‰ํ–ˆ๊ณ , ๋ชจ๋ธ์˜ General Downstream task์— ๋Œ€์‘ํ•˜๊ธฐ ์œ„ํ•ด KoELECTRA์—์„œ ์‚ฌ์šฉํ•œ Vocab์„ ๊ฒน์น˜์ง€ ์•Š๋Š” ๋ถ€๋ถ„์„ ์ถ”๊ฐ€๋กœ ๋„ฃ์–ด์ฃผ์—ˆ์Šต๋‹ˆ๋‹ค. (์‹ค์ œ๋กœ ๋‘ ๋ชจ๋ธ์ด ๊ฒน์น˜๋Š” ๋ถ€๋ถ„์€ ์•ฝ 5000ํ† ํฐ์ด์—ˆ์Šต๋‹ˆ๋‹ค.)

TPU `v3-8` ์„ ์ด์šฉํ•ด ์•ฝ 10์ผ ํ•™์Šต์„ ์ง„ํ–‰ํ–ˆ๊ณ , ํ˜„์žฌ Huggingface์— ๊ณต๊ฐœ๋œ ๋ชจ๋ธ์€ 848k step์„ ํ•™์Šตํ•œ ๋ชจ๋ธ weight๊ฐ€ ์—…๋กœ๋“œ ๋˜์–ด์žˆ์Šต๋‹ˆ๋‹ค.

(100k step๋ณ„ Checkpoint๋ฅผ ํ†ตํ•ด ์„ฑ๋Šฅ ํ‰๊ฐ€๋ฅผ ์ง„ํ–‰ํ•˜์˜€์Šต๋‹ˆ๋‹ค. ํ•ด๋‹น ๋ถ€๋ถ„์€ `KcBERT-finetune` repo๋ฅผ ์ฐธ๊ณ ํ•ด์ฃผ์„ธ์š”.)

๋ชจ๋ธ ํ•™์Šต Loss๋Š” Step์— ๋”ฐ๋ผ ์ดˆ๊ธฐ 100-200k ์‚ฌ์ด์— ๊ธ‰๊ฒฉํžˆ Loss๊ฐ€ ์ค„์–ด๋“ค๋‹ค ํ•™์Šต ์ข…๋ฃŒ๊นŒ์ง€๋„ ์ง€์†์ ์œผ๋กœ loss๊ฐ€ ๊ฐ์†Œํ•˜๋Š” ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

![KcELECTRA-base Pretrain Loss](https://cdn.jsdelivr.net/gh/beomi/blog-img@master/2021/04/07/image-20210407201231133.png)

### KcELECTRA Pretrain Step๋ณ„ Downstream task ์„ฑ๋Šฅ ๋น„๊ต

> ๐Ÿ’ก ์•„๋ž˜ ํ‘œ๋Š” ์ „์ฒด ckpt๊ฐ€ ์•„๋‹Œ ์ผ๋ถ€์— ๋Œ€ํ•ด์„œ๋งŒ ํ…Œ์ŠคํŠธ๋ฅผ ์ง„ํ–‰ํ•œ ๊ฒฐ๊ณผ์ž…๋‹ˆ๋‹ค.

![KcELECTRA Pretrain Step๋ณ„ Downstream task ์„ฑ๋Šฅ ๋น„๊ต](https://cdn.jsdelivr.net/gh/beomi/blog-img@master/2021/04/07/image-20210407215557039.png)

- ์œ„์™€ ๊ฐ™์ด KcBERT-base, KcBERT-large ๋Œ€๋น„ **๋ชจ๋“  ๋ฐ์ดํ„ฐ์…‹์— ๋Œ€ํ•ด** KcELECTRA-base๊ฐ€ ๋” ๋†’์€ ์„ฑ๋Šฅ์„ ๋ณด์ž…๋‹ˆ๋‹ค.
- KcELECTRA pretrain์—์„œ๋„ Train step์ด ๋Š˜์–ด๊ฐ์— ๋”ฐ๋ผ ์ ์ง„์ ์œผ๋กœ ์„ฑ๋Šฅ์ด ํ–ฅ์ƒ๋˜๋Š” ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

## ์ธ์šฉํ‘œ๊ธฐ/Citation

KcELECTRA๋ฅผ ์ธ์šฉํ•˜์‹ค ๋•Œ๋Š” ์•„๋ž˜ ์–‘์‹์„ ํ†ตํ•ด ์ธ์šฉํ•ด์ฃผ์„ธ์š”.

```
@misc{lee2021kcelectra,
author = {Junbum Lee},
title = {KcELECTRA: Korean comments ELECTRA},
year = {2021},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/Beomi/KcELECTRA}}
}
```

๋…ผ๋ฌธ์„ ํ†ตํ•œ ์‚ฌ์šฉ ์™ธ์—๋Š” MIT ๋ผ์ด์„ผ์Šค๋ฅผ ํ‘œ๊ธฐํ•ด์ฃผ์„ธ์š”. โ˜บ๏ธ

## Acknowledgement

KcELECTRA Model์„ ํ•™์Šตํ•˜๋Š” GCP/TPU ํ™˜๊ฒฝ์€ [TFRC](https://www.tensorflow.org/tfrc?hl=ko) ํ”„๋กœ๊ทธ๋žจ์˜ ์ง€์›์„ ๋ฐ›์•˜์Šต๋‹ˆ๋‹ค.

๋ชจ๋ธ ํ•™์Šต ๊ณผ์ •์—์„œ ๋งŽ์€ ์กฐ์–ธ์„ ์ฃผ์‹  [Monologg](https://github.com/monologg/) ๋‹˜ ๊ฐ์‚ฌํ•ฉ๋‹ˆ๋‹ค :)

## Reference

### Github Repos

- [KcBERT by Beomi](https://github.com/Beomi/KcBERT)
- [BERT by Google](https://github.com/google-research/bert)
- [KoBERT by SKT](https://github.com/SKTBrain/KoBERT)
- [KoELECTRA by Monologg](https://github.com/monologg/KoELECTRA/)
- [Transformers by Huggingface](https://github.com/huggingface/transformers)
- [Tokenizers by Hugginface](https://github.com/huggingface/tokenizers)
- [ELECTRA train code by KLUE](https://github.com/KLUE-benchmark/KLUE-ELECTRA)

### Blogs

- [Monologg๋‹˜์˜ KoELECTRA ํ•™์Šต๊ธฐ](https://monologg.kr/categories/NLP/ELECTRA/)
- [Colab์—์„œ TPU๋กœ BERT ์ฒ˜์Œ๋ถ€ํ„ฐ ํ•™์Šต์‹œํ‚ค๊ธฐ - Tensorflow/Google ver.](https://beomi.github.io/2020/02/26/Train-BERT-from-scratch-on-colab-TPU-Tensorflow-ver/)