KoBERT on 🤗 Huggingface Transformers 🤗 (with Bug Fixed)
https://github.com/monologg/kobert-transformers
- Host: GitHub
- URL: https://github.com/monologg/kobert-transformers
- Owner: monologg
- License: apache-2.0
- Archived: true
- Created: 2020-05-14T06:53:15.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2024-08-21T04:33:40.000Z (4 months ago)
- Last Synced: 2024-09-30T16:37:41.675Z (3 months ago)
- Topics: kobert, korean-nlp, transformers
- Language: Python
- Size: 49.8 KB
- Stars: 201
- Watchers: 5
- Forks: 45
- Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# KoBERT-Transformers
## **⚠️ Deprecated (2024.08.21) ⚠️**
- In the early days of `transformers` (4 years ago), Auto mapping with custom code was not possible, so KoBERT was provided as a standalone Python library.
- The models have since been updated so that they can be loaded directly with `AutoModel` and `AutoTokenizer` via the `trust_remote_code=True` option (see the sketch below); accordingly, this library is no longer maintained.
  - https://huggingface.co/monologg/kobert
  - https://huggingface.co/monologg/distilkobert
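As a rough sketch of the replacement path described above (assuming a recent `transformers` release; the model IDs are the Hub links listed above):

```python
# Load KoBERT straight from the Hugging Face Hub.
# trust_remote_code=True lets the custom tokenizer code hosted in the Hub repo run.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("monologg/kobert", trust_remote_code=True)
model = AutoModel.from_pretrained("monologg/kobert", trust_remote_code=True)

inputs = tokenizer("한국어 모델을 공유합니다.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, seq_len, hidden_size)
```

---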
`KoBERT` & `DistilKoBERT` on 🤗 Huggingface Transformers 🤗
The KoBERT model is identical to the one in the [official repo](https://github.com/SKTBrain/KoBERT). This repo was created **to support the full Huggingface tokenizer API**.
## Changes
### TL;DR
1. Be sure to install `transformers` `v3.0` or higher!
2. For the `tokenizer`, use `kobert_transformers/tokenization_kobert.py` from this repo!

### 1. Tokenizer compatibility
Some of the tokenization-related APIs in `Huggingface Transformers` changed starting with `v2.9.0`. The existing `tokenization_kobert.py` has been updated accordingly to match the newer versions.
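For illustration, a hedged sketch (not from the repo's docs) of the updated tokenizer interface that the revised `tokenization_kobert.py` follows; `encode_plus` with the `padding`/`truncation` keyword arguments is part of the Huggingface tokenizer API from `v3.0` on:

```python
from tokenization_kobert import KoBertTokenizer  # file copied from this repo

tokenizer = KoBertTokenizer.from_pretrained("monologg/kobert")

# encode_plus handles special tokens, padding, and truncation in one call
encoded = tokenizer.encode_plus(
    "한국어 모델을 공유합니다.",
    max_length=32,
    padding="max_length",
    truncation=True,
)
print(encoded["input_ids"][:10])      # starts with 2 ([CLS]), pads with 1 ([PAD])
print(encoded["attention_mask"][:10])
```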
### 2. padding_idx issue in Embedding
Previously, `padding_idx=0` was **hard-coded** in `BertModel`'s `BertEmbeddings` (see the code below).
```python
class BertEmbeddings(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=0)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
```

However, SentencePiece defaults to `pad_token_id=1` and `unk_token_id=0` (KoBERT follows the same convention), so a `BertModel` that keeps `padding_idx=0` can produce unintended results.
Huggingface recognized this issue and fixed it in `v2.9.0` ([related PR #3793](https://github.com/huggingface/transformers/pull/3793)): `pad_token_id=1` can now be set in the config, which resolves the problem.
```python
class BertEmbeddings(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
```

However, `v2.9.0` still did not fix this issue for `DistilBERT`, `ALBERT`, etc., so I opened a PR to handle it myself ([related PR #3965](https://github.com/huggingface/transformers/pull/3965)), and **it was finally merged and released in `v2.9.1`.**
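As an aside, a minimal sketch (my own illustration, not code from the repo) of why the hard-coded value is harmful: `nn.Embedding` zero-initializes the row at `padding_idx` and never updates it with gradients, so `padding_idx=0` freezes KoBERT's `[UNK]` embedding (id 0) instead of its `[PAD]` embedding (id 1).

```python
import torch.nn as nn

# KoBERT's vocab: id 0 = [UNK], id 1 = [PAD] (SentencePiece defaults)
emb_hardcoded = nn.Embedding(8002, 768, padding_idx=0)  # old hard-coded behavior
emb_fixed = nn.Embedding(8002, 768, padding_idx=1)      # behavior after the fix

print(emb_hardcoded.weight[0].abs().sum().item())  # 0.0 -> [UNK] row is zeroed and frozen
print(emb_fixed.weight[1].abs().sum().item())      # 0.0 -> [PAD] row is zeroed and frozen
```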
The code below shows the difference between the previous and the fixed versions.
```python
# Transformers v2.7.0
>>> from transformers import BertModel, DistilBertModel
>>> model = BertModel.from_pretrained("monologg/kobert")
>>> model.embeddings.word_embeddings
Embedding(8002, 768, padding_idx=0)
>>> model = DistilBertModel.from_pretrained("monologg/distilkobert")
>>> model.embeddings.word_embeddings
Embedding(8002, 768, padding_idx=0)

# Transformers v2.9.1
>>> from transformers import BertModel, DistilBertModel
>>> model = BertModel.from_pretrained("monologg/kobert")
>>> model.embeddings.word_embeddings
Embedding(8002, 768, padding_idx=1)
>>> model = DistilBertModel.from_pretrained("monologg/distilkobert")
>>> model.embeddings.word_embeddings
Embedding(8002, 768, padding_idx=1)
```

## KoBERT / DistilKoBERT on 🤗 Transformers 🤗
### Dependencies
- torch>=1.1.0
- transformers>=3,<5

### How to Use
```python
>>> from transformers import BertModel, DistilBertModel
>>> bert_model = BertModel.from_pretrained('monologg/kobert')
>>> distilbert_model = DistilBertModel.from_pretrained('monologg/distilkobert')
```

**To use the tokenizer, copy [`kobert_transformers/tokenization_kobert.py`](https://github.com/monologg/KoBERT-Transformers/blob/master/kobert_transformers/tokenization_kobert.py) into your project and import `KoBertTokenizer`.**
- KoBERT and DistilKoBERT use the same tokenizer.
- **The original KoBERT had an issue where special tokens were not split off correctly**; this has been fixed and is reflected here. ([Issue link](https://github.com/SKTBrain/KoBERT/issues/11))

```python
>>> from tokenization_kobert import KoBertTokenizer
>>> tokenizer = KoBertTokenizer.from_pretrained('monologg/kobert')  # same for monologg/distilkobert
>>> tokenizer.tokenize("[CLS] 한국어 모델을 공유합니다. [SEP]")
['[CLS]', '▁한국', '어', '▁모델', '을', '▁공유', '합니다', '.', '[SEP]']
>>> tokenizer.convert_tokens_to_ids(['[CLS]', '▁한국', '어', '▁모델', '을', '▁공유', '합니다', '.', '[SEP]'])
[2, 4958, 6855, 2046, 7088, 1050, 7843, 54, 3]
```

## Kobert-Transformers (Pip library)
[![PyPI](https://img.shields.io/pypi/v/kobert-transformers)](https://pypi.org/project/kobert-transformers/)
[![license](https://img.shields.io/badge/license-Apache%202.0-red)](https://github.com/monologg/DistilKoBERT/blob/master/LICENSE)
[![Downloads](https://pepy.tech/badge/kobert-transformers)](https://pepy.tech/project/kobert-transformers)

- A Python library wrapping `tokenization_kobert.py`
- Provides KoBERT and DistilKoBERT through the Huggingface Transformers library
- From `v0.5.1` onward, `transformers v3.0` or higher is installed by default. (It also works without issues up to `transformers v4.0`.)

### Install Kobert-Transformers
```bash
pip3 install kobert-transformers
```

### How to Use
```python
>>> import torch
>>> from kobert_transformers import get_kobert_model, get_distilkobert_model
>>> model = get_kobert_model()
>>> model.eval()
>>> input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
>>> attention_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
>>> token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
>>> sequence_output, pooled_output = model(input_ids, attention_mask, token_type_ids)
>>> sequence_output[0]
tensor([[-0.2461,  0.2428,  0.2590,  ..., -0.4861, -0.0731,  0.0756],
        [-0.2478,  0.2420,  0.2552,  ..., -0.4877, -0.0727,  0.0754],
        [-0.2472,  0.2420,  0.2561,  ..., -0.4874, -0.0733,  0.0765]],
       grad_fn=<SelectBackward>)
```

```python
>>> from kobert_transformers import get_tokenizer
>>> tokenizer = get_tokenizer()
>>> tokenizer.tokenize("[CLS] 한국어 모델을 공유합니다. [SEP]")
['[CLS]', '▁한국', '어', '▁모델', '을', '▁공유', '합니다', '.', '[SEP]']
>>> tokenizer.convert_tokens_to_ids(['[CLS]', '▁한국', '어', '▁모델', '을', '▁공유', '합니다', '.', '[SEP]'])
[2, 4958, 6855, 2046, 7088, 1050, 7843, 54, 3]
```
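For completeness, a short end-to-end sketch combining the two helpers shown above (my own illustration, not an example from the library's docs):

```python
import torch
from kobert_transformers import get_kobert_model, get_tokenizer

tokenizer = get_tokenizer()
model = get_kobert_model()
model.eval()

# Encode one sentence; the tokenizer adds [CLS]/[SEP] and pads with pad_token_id=1
encoded = tokenizer.encode_plus(
    "한국어 모델을 공유합니다.", max_length=16, padding="max_length", truncation=True
)
input_ids = torch.LongTensor([encoded["input_ids"]])
attention_mask = torch.LongTensor([encoded["attention_mask"]])

with torch.no_grad():
    outputs = model(input_ids, attention_mask)

print(outputs[0].shape)  # sequence output: torch.Size([1, 16, 768])
```

## Reference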
- [KoBERT](https://github.com/SKTBrain/KoBERT)
- [DistilKoBERT](https://github.com/monologg/DistilKoBERT)
- [Huggingface Transformers](https://github.com/huggingface/transformers)