
# KoBERT

* [KoBERT](#kobert)
  * [Korean BERT pre-trained cased (KoBERT)](#korean-bert-pre-trained-cased-kobert)
    * [Why?](#why)
    * [Training Environment](#training-environment)
    * [Requirements](#requirements)
    * [How to install](#how-to-install)
  * [How to use](#how-to-use)
    * [Using with PyTorch](#using-with-pytorch)
    * [Using with ONNX](#using-with-onnx)
    * [Using with MXNet-Gluon](#using-with-mxnet-gluon)
    * [Tokenizer](#tokenizer)
  * [Subtasks](#subtasks)
    * [Naver Sentiment Analysis](#naver-sentiment-analysis)
    * [Korean Named Entity Recognizer built with KoBERT and CRF](#korean-named-entity-recognizer-built-with-kobert-and-crf)
    * [Korean Sentence BERT](#korean-sentence-bert)
  * [Release](#release)
  * [Contacts](#contacts)
  * [License](#license)

---

## Korean BERT pre-trained cased (KoBERT)

### Why?

* The Korean-language performance of Google's [BERT base multilingual cased](https://github.com/google-research/bert/blob/master/multilingual.md) is limited.

### Training Environment

* Architecture

```python
predefined_args = {
    'attention_cell': 'multi_head',
    'num_layers': 12,
    'units': 768,
    'hidden_size': 3072,
    'max_length': 512,
    'num_heads': 12,
    'scaled': True,
    'dropout': 0.1,
    'use_residual': True,
    'embed_size': 768,
    'embed_dropout': 0.1,
    'token_type_vocab_size': 2,
    'word_embed': None,
}
```

* Training data

| Data | Sentences | Words |
| ---------------- | ---- | ---- |
| Korean Wikipedia | 5M | 54M |

* Training environment
  * 32 × V100 GPUs, Horovod (with InfiniBand)

![TensorBoard log, 2019-04-29](imgs/2019-04-29_TensorBoard.png)

* Vocabulary
  * Size: 8,002
  * SentencePiece tokenizer trained on Korean Wikipedia
* Fewer parameters (92M < 110M); a rough count is sketched below
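
As a rough check on that figure, here is a back-of-the-envelope count derived from the `predefined_args` above; it ignores biases, layer norms, and the pooler, so it slightly undershoots the quoted 92M.

```python
# Approximate KoBERT parameter count from the architecture config above
vocab_size, units, hidden_size, max_length, num_layers = 8002, 768, 3072, 512, 12

embeddings = (vocab_size + max_length + 2) * units        # word + position + token-type embeddings
per_layer = 4 * units * units + 2 * units * hidden_size   # self-attention projections + feed-forward
total = embeddings + num_layers * per_layer

print(f"~{total / 1e6:.0f}M parameters")  # ~91M; biases, layer norms, and the pooler bring it to ~92M
```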

### Requirements

* see [requirements.txt](https://github.com/SKTBrain/KoBERT/blob/master/requirements.txt)

### How to install

* Install KoBERT as a python package

```sh
pip install git+https://git@github.com/SKTBrain/KoBERT.git@master
```

* If you want to modify source codes, please clone this repository

```sh
git clone https://github.com/SKTBrain/KoBERT.git
cd KoBERT
pip install -r requirements.txt
```

---

## How to use

### Using with PyTorch

*If you prefer the Hugging Face `transformers` API, see [here](kobert_hf).*

```python
>>> import torch
>>> from kobert import get_pytorch_kobert_model
>>> input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
>>> input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
>>> token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
>>> model, vocab = get_pytorch_kobert_model()
>>> sequence_output, pooled_output = model(input_ids, input_mask, token_type_ids)
>>> pooled_output.shape
torch.Size([2, 768])
>>> vocab
Vocab(size=8002, unk="[UNK]", reserved="['[MASK]', '[SEP]', '[CLS]']")
>>> # Last Encoding Layer
>>> sequence_output[0]
tensor([[-0.2461,  0.2428,  0.2590,  ..., -0.4861, -0.0731,  0.0756],
        [-0.2478,  0.2420,  0.2552,  ..., -0.4877, -0.0727,  0.0754],
        [-0.2472,  0.2420,  0.2561,  ..., -0.4874, -0.0733,  0.0765]],
       grad_fn=<SelectBackward>)
```

`model` is returned in `eval()` mode by default, so switch it to training mode with `model.train()` before fine-tuning.
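
For example, a minimal training-step sketch with an ad-hoc two-class head on top of the pooled output (this is only an illustration, not the NSMC fine-tuning script linked below; the classifier head, dummy labels, and optimizer settings are placeholders):

```python
import torch
from kobert import get_pytorch_kobert_model

model, vocab = get_pytorch_kobert_model()
model.train()  # switch out of the default eval() mode before fine-tuning

# Ad-hoc classification head on the 768-dim pooled output (illustrative only)
classifier = torch.nn.Linear(768, 2)
optimizer = torch.optim.Adam(list(model.parameters()) + list(classifier.parameters()), lr=5e-5)

input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
labels = torch.LongTensor([1, 0])  # dummy labels

optimizer.zero_grad()
_, pooled_output = model(input_ids, input_mask, token_type_ids)
loss = torch.nn.functional.cross_entropy(classifier(pooled_output), labels)
loss.backward()
optimizer.step()
```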

* Naver Sentiment Analysis Fine-Tuning with PyTorch
  * In Colab, we recommend enabling the GPU hardware accelerator via [Runtime] - [Change runtime type].
  * [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/SKTBrain/KoBERT/blob/master/scripts/NSMC/naver_review_classifications_pytorch_kobert.ipynb)

### Using with ONNX

```python
>>> import onnxruntime
>>> import numpy as np
>>> from kobert import get_onnx_kobert_model
>>> onnx_path = get_onnx_kobert_model()
>>> sess = onnxruntime.InferenceSession(onnx_path)
>>> input_ids = [[31, 51, 99], [15, 5, 0]]
>>> input_mask = [[1, 1, 1], [1, 1, 0]]
>>> token_type_ids = [[0, 0, 1], [0, 1, 0]]
>>> len_seq = len(input_ids[0])
>>> pred_onnx = sess.run(None, {'input_ids':np.array(input_ids),
...                             'token_type_ids':np.array(token_type_ids),
...                             'input_mask':np.array(input_mask),
...                             'position_ids':np.array(range(len_seq))})
>>> # Last Encoding Layer
>>> pred_onnx[-2][0]
array([[-0.24610452, 0.24282141, 0.25895312, ..., -0.48613444,
-0.07305173, 0.07560554],
[-0.24783179, 0.24200465, 0.25520486, ..., -0.4877185 ,
-0.0727044 , 0.07536091],
[-0.24721591, 0.24196623, 0.2560626 , ..., -0.48743123,
-0.07326943, 0.07650235]], dtype=float32)
```

_Thanks to [soeque1](https://github.com/soeque1) for helping with the ONNX conversion._
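
If you are unsure which input names the exported graph expects (the keys of the feed dict passed to `sess.run` above), `onnxruntime` can list them. A small sketch, reusing the same model file as above:

```python
# Inspect the exported graph's expected inputs; the printed names should
# match the feed dict keys used in the sess.run call above.
import onnxruntime
from kobert import get_onnx_kobert_model

sess = onnxruntime.InferenceSession(get_onnx_kobert_model())
for inp in sess.get_inputs():
    print(inp.name, inp.shape)
```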

### Using with MXNet-Gluon

```python
>>> import mxnet as mx
>>> from kobert import get_mxnet_kobert_model
>>> input_id = mx.nd.array([[31, 51, 99], [15, 5, 0]])
>>> input_mask = mx.nd.array([[1, 1, 1], [1, 1, 0]])
>>> token_type_ids = mx.nd.array([[0, 0, 1], [0, 1, 0]])
>>> model, vocab = get_mxnet_kobert_model(use_decoder=False, use_classifier=False)
>>> encoder_layer, pooled_output = model(input_id, token_type_ids)
>>> pooled_output.shape
(2, 768)
>>> vocab
Vocab(size=8002, unk="[UNK]", reserved="['[MASK]', '[SEP]', '[CLS]']")
>>> # Last Encoding Layer
>>> encoder_layer[0]
[[-0.24610372  0.24282135  0.2589539  ... -0.48613444 -0.07305248
   0.07560539]
 [-0.24783105  0.242005    0.25520545 ... -0.48771808 -0.07270523
   0.07536077]
 [-0.24721491  0.241966    0.25606337 ... -0.48743105 -0.07327032
   0.07650219]]
<NDArray 3x768 @cpu(0)>
```

* Naver Sentiment Analysis Fine-Tuning with MXNet
  * [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/SKTBrain/KoBERT/blob/master/scripts/NSMC/naver_review_classifications_gluon_kobert.ipynb)

### Tokenizer

* Pretrained [Sentencepiece](https://github.com/google/sentencepiece) tokenizer

```python
>>> from gluonnlp.data import SentencepieceTokenizer
>>> from kobert import get_tokenizer
>>> tok_path = get_tokenizer()
>>> sp = SentencepieceTokenizer(tok_path)
>>> sp('한국어 모델을 공유합니다.')
['▁한국', '어', '▁모델', '을', '▁공유', '합니다', '.']
```
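
The tokenizer returns subword strings; to build `input_ids` for the model you still need to map them to ids and add the special tokens. A sketch using the `vocab` returned by `get_pytorch_kobert_model` (this assumes gluonnlp's `Vocab.to_indices`; the `[CLS]`/`[SEP]` handling is added by hand for illustration):

```python
>>> from gluonnlp.data import SentencepieceTokenizer
>>> from kobert import get_pytorch_kobert_model, get_tokenizer
>>> sp = SentencepieceTokenizer(get_tokenizer())
>>> _, vocab = get_pytorch_kobert_model()
>>> tokens = ['[CLS]'] + sp('한국어 모델을 공유합니다.') + ['[SEP]']
>>> ids = vocab.to_indices(tokens)  # token ids ready to wrap in a LongTensor
>>> len(ids)
9
```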

---

## Subtasks

### Naver Sentiment Analysis

* Dataset:

| Model | Accuracy |
| --------------------------------------------------------------------------------------------------- | --------------------------------------------------------------- |
| [BERT base multilingual cased](https://github.com/google-research/bert/blob/master/multilingual.md) | 0.875 |
| KoBERT | **[0.901](logs/bert_naver_small_512_news_simple_20190624.txt)** |
| [KoGPT2](https://github.com/SKT-AI/KoGPT2) | 0.899 |

### Korean Named Entity Recognizer built with KoBERT and CRF

*

```text
문장을 입력하세요: SKTBrain에서 KoBERT 모델을 공개해준 덕분에 BERT-CRF 기반 객체명인식기를 쉽게 개발할 수 있었다.
len: 40, input_token:['[CLS]', '▁SK', 'T', 'B', 'ra', 'in', '에서', '▁K', 'o', 'B', 'ER', 'T', '▁모델', '을', '▁공개', '해', '준', '▁덕분에', '▁B', 'ER', 'T', '-', 'C', 'R', 'F', '▁기반', '▁', '객', '체', '명', '인', '식', '기를', '▁쉽게', '▁개발', '할', '▁수', '▁있었다', '.', '[SEP]']
len: 40, pred_ner_tag:['[CLS]', 'B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'O', 'B-POH', 'I-POH', 'I-POH', 'I-POH', 'I-POH', 'O', 'O', 'O', 'O', 'O', 'O', 'B-POH', 'I-POH', 'I-POH', 'I-POH', 'I-POH', 'I-POH', 'I-POH', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', '[SEP]']
decoding_ner_sentence: [CLS] <SKTBrain:ORG>에서 <KoBERT:POH> 모델을 공개해준 덕분에 <BERT-CRF:POH> 기반 객체명인식기를 쉽게 개발할 수 있었다.[SEP]
```

### Korean Sentence BERT

*

|Model|Cosine Pearson|Cosine Spearman|Euclidean Pearson|Euclidean Spearman|Manhattan Pearson|Manhattan Spearman|Dot Pearson|Dot Spearman|
|:------------------------:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|
|NLI|65.05|68.48|68.81|68.18|68.90|68.20|65.22|66.81|
|STS|**80.42**|**79.64**|**77.93**|77.43|**77.92**|77.44|**76.56**|**75.83**|
|STS + NLI|78.81|78.47|77.68|**77.78**|77.71|**77.83**|75.75|75.22|

---

## Release

* v0.2.3
  * support `onnx 1.8.0`
* v0.2.2
  * fix `No module named 'kobert.utils'`
* v0.2.1
  * guide default 'import statements'
* v0.2
  * download large files from `aws s3`
  * rename functions
* v0.1.2
  * Guaranteed compatibility with higher versions of transformers
  * fix pad token index id
* v0.1.1
  * Integrated the vocabulary and tokenizer
* v0.1
  * Initial model release

## Contacts

Please file `KoBERT`-related issues [here](https://github.com/SKTBrain/KoBERT/issues).

## License

`KoBERT` is released under the `Apache-2.0` license. Please comply with the license terms when using the model or code. The full license text is available in the `LICENSE` file.