# Han Transformers

This project provides pre-trained models for ancient Chinese NLP tasks, including language modeling, word segmentation, and part-of-speech tagging.

Our paper has been accepted at ROCLING 2022! Please check out our [paper](https://aclanthology.org/2022.rocling-1.21/).

## Dependencies
* transformers ≤ 4.15.0
* PyTorch

## Models

We have uploaded our models to the Hugging Face Hub; a generic loading sketch follows this list.
* Pre-trained model, trained with a masked language modeling (MLM) objective:
    * [ckiplab/bert-base-han-chinese](https://huggingface.co/ckiplab/bert-base-han-chinese)
* Fine-tuned models for word segmentation:
    * [ckiplab/bert-base-han-chinese-ws](https://huggingface.co/ckiplab/bert-base-han-chinese-ws) (Merge)
    * [ckiplab/bert-base-han-chinese-ws-shanggu](https://huggingface.co/ckiplab/bert-base-han-chinese-ws-shanggu) (上古, Old Chinese)
    * [ckiplab/bert-base-han-chinese-ws-zhonggu](https://huggingface.co/ckiplab/bert-base-han-chinese-ws-zhonggu) (中古, Middle Chinese)
    * [ckiplab/bert-base-han-chinese-ws-jindai](https://huggingface.co/ckiplab/bert-base-han-chinese-ws-jindai) (近代, Early Modern Chinese)
    * [ckiplab/bert-base-han-chinese-ws-xiandai](https://huggingface.co/ckiplab/bert-base-han-chinese-ws-xiandai) (現代, Modern Chinese)
* Fine-tuned models for part-of-speech tagging:
    * [ckiplab/bert-base-han-chinese-pos](https://huggingface.co/ckiplab/bert-base-han-chinese-pos) (Merge)
    * [ckiplab/bert-base-han-chinese-pos-shanggu](https://huggingface.co/ckiplab/bert-base-han-chinese-pos-shanggu) (上古, Old Chinese / [tag list](shanggu.md))
    * [ckiplab/bert-base-han-chinese-pos-zhonggu](https://huggingface.co/ckiplab/bert-base-han-chinese-pos-zhonggu) (中古, Middle Chinese / [tag list](zhonggu.md))
    * [ckiplab/bert-base-han-chinese-pos-jindai](https://huggingface.co/ckiplab/bert-base-han-chinese-pos-jindai) (近代, Early Modern Chinese / [tag list](jindai.md))
    * [ckiplab/bert-base-han-chinese-pos-xiandai](https://huggingface.co/ckiplab/bert-base-han-chinese-pos-xiandai) (現代, Modern Chinese / [tag list](xiandai.md))
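
All checkpoints share the standard Hugging Face interface, so any of them can be loaded by name with the `transformers` Auto classes. A minimal sketch (model names as listed above):

```python
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          AutoModelForTokenClassification)

# Pre-trained MLM checkpoint
mlm_tokenizer = AutoTokenizer.from_pretrained("ckiplab/bert-base-han-chinese")
mlm_model = AutoModelForMaskedLM.from_pretrained("ckiplab/bert-base-han-chinese")

# Era-specific fine-tuned checkpoints load the same way,
# e.g. the Old Chinese (上古) word-segmentation model:
ws_tokenizer = AutoTokenizer.from_pretrained("ckiplab/bert-base-han-chinese-ws-shanggu")
ws_model = AutoModelForTokenClassification.from_pretrained("ckiplab/bert-base-han-chinese-ws-shanggu")
```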

## Training Corpus
The copyright of the datasets belongs to the Institute of Linguistics, Academia Sinica.
* [中央研究院上古漢語標記語料庫](http://lingcorpus.iis.sinica.edu.tw/cgi-bin/kiwi/akiwi/kiwi.sh) (Academia Sinica Tagged Corpus of Old Chinese)
* [中央研究院中古漢語語料庫](http://lingcorpus.iis.sinica.edu.tw/cgi-bin/kiwi/dkiwi/kiwi.sh) (Academia Sinica Middle Chinese Corpus)
* [中央研究院近代漢語語料庫](http://lingcorpus.iis.sinica.edu.tw/cgi-bin/kiwi/pkiwi/kiwi.sh) (Academia Sinica Early Modern Chinese Corpus)
* [中央研究院現代漢語語料庫](http://asbc.iis.sinica.edu.tw) (Academia Sinica Balanced Corpus of Modern Chinese)

## Usage

### Installation
```bash
pip install transformers==4.15.0
pip install torch==1.10.2
```
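
The pinned versions above satisfy the `transformers ≤ 4.15.0` constraint; to double-check what actually got installed:

```bash
python -c "import transformers, torch; print(transformers.__version__, torch.__version__)"
# expected: 4.15.0 1.10.2
```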

### Inference

* **Pre-trained Language Model**

You can use [ckiplab/bert-base-han-chinese](https://huggingface.co/ckiplab/bert-base-han-chinese) directly with a pipeline for masked language modeling.

```python
from transformers import pipeline

# Initialize
unmasker = pipeline('fill-mask', model='ckiplab/bert-base-han-chinese')

# Input text with [MASK]
unmasker("黎[MASK]於變時雍。")

# output
[{'sequence': '黎 民 於 變 時 雍 。',
'score': 0.14885780215263367,
'token': 3696,
'token_str': '民'},
{'sequence': '黎 庶 於 變 時 雍 。',
'score': 0.0859643816947937,
'token': 2433,
'token_str': '庶'},
{'sequence': '黎 氏 於 變 時 雍 。',
'score': 0.027848130092024803,
'token': 3694,
'token_str': '氏'},
{'sequence': '黎 人 於 變 時 雍 。',
'score': 0.023678112775087357,
'token': 782,
'token_str': '人'},
{'sequence': '黎 生 於 變 時 雍 。',
'score': 0.018718384206295013,
'token': 4495,
'token_str': '生'}]
```
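
By default the pipeline returns the five highest-scoring candidates. If you only need the best substitution, the fill-mask pipeline also accepts a `top_k` argument:

```python
# Keep only the single highest-scoring candidate
unmasker("黎[MASK]於變時雍。", top_k=1)
```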

You can use [ckiplab/bert-base-han-chinese](https://huggingface.co/ckiplab/bert-base-han-chinese) to get the features of a given text in PyTorch.

```python
from transformers import AutoTokenizer, AutoModel

# Initialize tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("ckiplab/bert-base-han-chinese")
model = AutoModel.from_pretrained("ckiplab/bert-base-han-chinese")

# Input text
text = "黎民於變時雍。"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)

# get encoded token vectors: one per character (7) plus [CLS] and [SEP]
output.last_hidden_state # torch.Tensor with Size([1, 9, 768])

# get encoded sentence vector
output.pooler_output # torch.Tensor with Size([1, 768])
```
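
As an illustration (our own sketch, not part of the original project), these vectors can be used directly, e.g. to compare two sentences by mean-pooling the token vectors and taking the cosine similarity; the second sentence here is an arbitrary example:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("ckiplab/bert-base-han-chinese")
model = AutoModel.from_pretrained("ckiplab/bert-base-han-chinese")

def sentence_vector(text):
    # Mean-pool the encoded token vectors into a single sentence vector
    encoded = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        output = model(**encoded)
    return output.last_hidden_state.mean(dim=1).squeeze(0)

a = sentence_vector("黎民於變時雍。")
b = sentence_vector("克明俊德，以親九族。")
print(torch.cosine_similarity(a, b, dim=0).item())
```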

* **Word Segmentation (WS)**

In WS, [ckiplab/bert-base-han-chinese-ws](https://huggingface.co/ckiplab/bert-base-han-chinese-ws) divides written text into meaningful units, i.e. words. The task is formulated as labeling each character as either the beginning (B) or the inside (I) of a word.

```python
from transformers import pipeline

# Initialize
classifier = pipeline("token-classification", model="ckiplab/bert-base-han-chinese-ws")

# Input text
classifier("帝堯曰放勳")

# output
[{'entity': 'B',
'score': 0.9999793,
'index': 1,
'word': '帝',
'start': 0,
'end': 1},
{'entity': 'I',
'score': 0.9915047,
'index': 2,
'word': '堯',
'start': 1,
'end': 2},
{'entity': 'B',
'score': 0.99992275,
'index': 3,
'word': '曰',
'start': 2,
'end': 3},
{'entity': 'B',
'score': 0.99905187,
'index': 4,
'word': '放',
'start': 3,
'end': 4},
{'entity': 'I',
'score': 0.96299917,
'index': 5,
'word': '勳',
'start': 4,
'end': 5}]
```
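
The pipeline labels one character at a time, so the segmented words have to be reassembled from the B/I labels. A small helper (ours, not part of the package):

```python
def group_words(tokens):
    """Merge B/I-labeled characters from the WS pipeline into words."""
    words = []
    for token in tokens:
        if token["entity"] == "B" or not words:
            words.append(token["word"])   # B starts a new word
        else:
            words[-1] += token["word"]    # I extends the current word
    return words

print(group_words(classifier("帝堯曰放勳")))
# ['帝堯', '曰', '放勳']
```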

* **Part-of-Speech (PoS) Tagging**

In PoS tagging, [ckiplab/bert-base-han-chinese-pos](https://huggingface.co/ckiplab/bert-base-han-chinese-pos) recognizes parts of speech in a given text. The task is formulated as labeling each character with a part-of-speech tag.

```python
from transformers import pipeline

# Initialize
classifier = pipeline("token-classification", model="ckiplab/bert-base-han-chinese-pos")

# Input text
classifier("帝堯曰放勳")

# output
[{'entity': 'NB1',
'score': 0.99410427,
'index': 1,
'word': '帝',
'start': 0,
'end': 1},
{'entity': 'NB1',
'score': 0.98874336,
'index': 2,
'word': '堯',
'start': 1,
'end': 2},
{'entity': 'VG',
'score': 0.97059363,
'index': 3,
'word': '曰',
'start': 2,
'end': 3},
{'entity': 'NB1',
'score': 0.9864504,
'index': 4,
'word': '放',
'start': 3,
'end': 4},
{'entity': 'NB1',
'score': 0.9543974,
'index': 5,
'word': '勳',
'start': 4,
'end': 5}]
```
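
Because the PoS pipeline also labels individual characters, word-level tags can be derived by combining it with the WS model: group characters into words with the B/I labels, then take the tag of each word's first character. A sketch under that assumption (the merging strategy is ours, not from the paper):

```python
from transformers import pipeline

ws = pipeline("token-classification", model="ckiplab/bert-base-han-chinese-ws")
pos = pipeline("token-classification", model="ckiplab/bert-base-han-chinese-pos")

def tag_words(text):
    # Group characters into words via WS, then tag each word with the
    # PoS label of its first character.
    tagged = []
    for ws_tok, pos_tok in zip(ws(text), pos(text)):
        if ws_tok["entity"] == "B" or not tagged:
            tagged.append([ws_tok["word"], pos_tok["entity"]])
        else:
            tagged[-1][0] += ws_tok["word"]
    return [tuple(pair) for pair in tagged]

print(tag_words("帝堯曰放勳"))
# [('帝堯', 'NB1'), ('曰', 'VG'), ('放勳', 'NB1')]
```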

## Model Performance

### Pre-trained Language Model, **Perplexity ↓**


Rows give the MLM training data; columns give perplexity on each era's MLM testing data.

| Language Model | MLM Training Data | 上古 | 中古 | 近代 | 現代 |
| --- | --- | --- | --- | --- | --- |
| ckiplab/bert-base-han-chinese | 上古 | 24.7588 | 87.8176 | 297.1111 | 60.3993 |
| | 中古 | 67.861 | 70.6244 | 133.0536 | 23.0125 |
| | 近代 | 69.1364 | 77.4154 | 46.8308 | 20.4289 |
| | 現代 | 118.8596 | 163.6896 | 146.5959 | 4.6143 |
| | Merge | 31.1807 | 61.2381 | 49.0672 | 4.5017 |
| ckiplab/bert-base-chinese | - | 233.6394 | 405.9008 | 278.7069 | 8.8521 |

### Word Segmentation (WS), **F1 score (%) ↑**


Rows give the WS training data; columns give the F1 score on each era's testing data.

| WS Model | Training Data | 上古 | 中古 | 近代 | 現代 |
| --- | --- | --- | --- | --- | --- |
| ckiplab/bert-base-han-chinese-ws | 上古 | 97.6090 | 88.5734 | 83.2877 | 70.3772 |
| | 中古 | 92.6402 | 92.6538 | 89.4803 | 78.3827 |
| | 近代 | 90.8651 | 92.1861 | 94.6495 | 81.2143 |
| | 現代 | 87.0234 | 83.5810 | 84.9370 | 96.9446 |
| | Merge | 97.4537 | 91.9990 | 94.0970 | 96.7314 |
| ckiplab/bert-base-chinese-ws | - | 86.5698 | 82.9115 | 84.3213 | 98.1325 |

### Part-of-Speech (POS) Tagging, **F1 score (%) ↑**


Rows give the POS training data; columns give the F1 score on each era's testing data.

| POS Model | Training Data | 上古 | 中古 | 近代 | 現代 |
| --- | --- | --- | --- | --- | --- |
| ckiplab/bert-base-han-chinese-pos | 上古 | 91.2945 | - | - | - |
| | 中古 | 7.3662 | 80.4896 | 11.3371 | 10.2577 |
| | 近代 | 6.4794 | 14.3653 | 88.6580 | 0.5316 |
| | 現代 | 11.9895 | 11.0775 | 0.4033 | 93.2813 |
| | Merge | 88.8772 | 42.4369 | 86.9093 | 92.9012 |

## License

Copyright (c) 2022 [CKIP Lab](https://ckip.iis.sinica.edu.tw/) under the [GPL-3.0 License](https://www.gnu.org/licenses/gpl-3.0.html).

## Citation
Please cite our paper if you use Han-Transformers in your work:

```bibtex
@inproceedings{lin-ma-2022-hantrans,
    title     = "{H}an{T}rans: An Empirical Study on Cross-Era Transferability of {C}hinese Pre-trained Language Model",
    author    = "Lin, Chin-Tung and Ma, Wei-Yun",
    booktitle = "Proceedings of the 34th Conference on Computational Linguistics and Speech Processing (ROCLING 2022)",
    year      = "2022",
    address   = "Taipei, Taiwan",
    publisher = "The Association for Computational Linguistics and Chinese Language Processing (ACLCLP)",
    url       = "https://aclanthology.org/2022.rocling-1.21",
    pages     = "164--173",
}
```