- Host: GitHub
- URL: https://github.com/ckiplab/han-transformers
- Owner: ckiplab
- Created: 2022-06-23T13:01:32.000Z (over 2 years ago)
- Default Branch: master
- Last Pushed: 2023-01-19T08:49:37.000Z (almost 2 years ago)
- Last Synced: 2023-03-05T19:09:23.159Z (over 1 year ago)
- Size: 18.6 KB
- Stars: 4
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Han Transformers
This project provides ancient Chinese language models for NLP tasks, including language modeling, word segmentation, and part-of-speech tagging.
Our paper has been accepted to ROCLING! Please check out our [paper](https://aclanthology.org/2022.rocling-1.21/).
## Dependencies

* transformers ≤ 4.15.0
* pytorch

## Models
We have uploaded our models to the Hugging Face Hub; a minimal loading sketch follows the list below.
* Pretrained model with a masked language modeling (MLM) objective:
    * [ckiplab/bert-base-han-chinese](https://huggingface.co/ckiplab/bert-base-han-chinese)
* Fine-tuned models for Word Segmentation:
    * [ckiplab/bert-base-han-chinese-ws](https://huggingface.co/ckiplab/bert-base-han-chinese-ws) (Merge)
    * [ckiplab/bert-base-han-chinese-ws-shanggu](https://huggingface.co/ckiplab/bert-base-han-chinese-ws-shanggu) (上古, Old Chinese)
    * [ckiplab/bert-base-han-chinese-ws-zhonggu](https://huggingface.co/ckiplab/bert-base-han-chinese-ws-zhonggu) (中古, Middle Chinese)
    * [ckiplab/bert-base-han-chinese-ws-jindai](https://huggingface.co/ckiplab/bert-base-han-chinese-ws-jindai) (近代, Early Modern Chinese)
    * [ckiplab/bert-base-han-chinese-ws-xiandai](https://huggingface.co/ckiplab/bert-base-han-chinese-ws-xiandai) (現代, Modern Chinese)
* Fine-tuned models for Part-of-Speech tagging:
    * [ckiplab/bert-base-han-chinese-pos](https://huggingface.co/ckiplab/bert-base-han-chinese-pos) (Merge)
    * [ckiplab/bert-base-han-chinese-pos-shanggu](https://huggingface.co/ckiplab/bert-base-han-chinese-pos-shanggu) (上古, Old Chinese / [tag list](shanggu.md))
    * [ckiplab/bert-base-han-chinese-pos-zhonggu](https://huggingface.co/ckiplab/bert-base-han-chinese-pos-zhonggu) (中古, Middle Chinese / [tag list](zhonggu.md))
    * [ckiplab/bert-base-han-chinese-pos-jindai](https://huggingface.co/ckiplab/bert-base-han-chinese-pos-jindai) (近代, Early Modern Chinese / [tag list](jindai.md))
    * [ckiplab/bert-base-han-chinese-pos-xiandai](https://huggingface.co/ckiplab/bert-base-han-chinese-pos-xiandai) (現代, Modern Chinese / [tag list](xiandai.md))
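All of the fine-tuned checkpoints above are BERT token-classification models, so they can also be loaded directly with the standard `transformers` Auto classes rather than through a pipeline. A minimal sketch (any WS or PoS checkpoint from the list can be substituted):

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Any WS or PoS checkpoint from the list above loads the same way;
# the pretrained MLM checkpoint uses AutoModel / AutoModelForMaskedLM instead.
tokenizer = AutoTokenizer.from_pretrained("ckiplab/bert-base-han-chinese-ws")
model = AutoModelForTokenClassification.from_pretrained("ckiplab/bert-base-han-chinese-ws")
```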
## Training Corpus

The copyright of the datasets belongs to the Institute of Linguistics, Academia Sinica.
* [中央研究院上古漢語標記語料庫](http://lingcorpus.iis.sinica.edu.tw/cgi-bin/kiwi/akiwi/kiwi.sh) (Academia Sinica Old Chinese Tagged Corpus)
* [中央研究院中古漢語語料庫](http://lingcorpus.iis.sinica.edu.tw/cgi-bin/kiwi/dkiwi/kiwi.sh) (Academia Sinica Middle Chinese Corpus)
* [中央研究院近代漢語語料庫](http://lingcorpus.iis.sinica.edu.tw/cgi-bin/kiwi/pkiwi/kiwi.sh) (Academia Sinica Early Modern Chinese Corpus)
* [中央研究院現代漢語語料庫](http://asbc.iis.sinica.edu.tw) (Academia Sinica Balanced Corpus of Modern Chinese)

## Usage
### Installation
```bash
pip install transformers==4.15.0
pip install torch==1.10.2
```
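Since the code in this README targets these pinned versions, a quick sanity check that they are the versions actually imported may save debugging later; a minimal sketch:

```python
import torch
import transformers

# The versions printed here should match the pins above.
print(transformers.__version__)  # expected: 4.15.0
print(torch.__version__)         # expected: 1.10.2
```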
### Inference

* **Pre-trained Language Model**
You can use [ckiplab/bert-base-han-chinese](https://huggingface.co/ckiplab/bert-base-han-chinese) directly with a pipeline for masked language modeling.
```python
from transformers import pipeline

# Initialize
unmasker = pipeline('fill-mask', model='ckiplab/bert-base-han-chinese')

# Input text with [MASK]
unmasker("黎[MASK]於變時雍。")

# output
[{'sequence': '黎 民 於 變 時 雍 。',
'score': 0.14885780215263367,
'token': 3696,
'token_str': '民'},
{'sequence': '黎 庶 於 變 時 雍 。',
'score': 0.0859643816947937,
'token': 2433,
'token_str': '庶'},
{'sequence': '黎 氏 於 變 時 雍 。',
'score': 0.027848130092024803,
'token': 3694,
'token_str': '氏'},
{'sequence': '黎 人 於 變 時 雍 。',
'score': 0.023678112775087357,
'token': 782,
'token_str': '人'},
{'sequence': '黎 生 於 變 時 雍 。',
'score': 0.018718384206295013,
'token': 4495,
'token_str': '生'}]
```
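By default the pipeline returns the five highest-scoring candidates, as shown above. Assuming the standard `top_k` argument of the fill-mask pipeline in this transformers version, the number of candidates can be adjusted:

```python
from transformers import pipeline

# Initialize as above, then request only the three best candidates.
unmasker = pipeline('fill-mask', model='ckiplab/bert-base-han-chinese')
unmasker("黎[MASK]於變時雍。", top_k=3)
```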
You can use [ckiplab/bert-base-han-chinese](https://huggingface.co/ckiplab/bert-base-han-chinese) to get the features of a given text in PyTorch.

```python
from transformers import AutoTokenizer, AutoModel

# Initialize tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("ckiplab/bert-base-han-chinese")
model = AutoModel.from_pretrained("ckiplab/bert-base-han-chinese")

# Input text
text = "黎民於變時雍。"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)

# get encoded token vectors
output.last_hidden_state  # torch.Tensor with Size([1, 9, 768])

# get encoded sentence vector
output.pooler_output  # torch.Tensor with Size([1, 768])
```
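Besides `pooler_output`, a common alternative sentence vector is the mean of the token vectors weighted by the attention mask. This is not part of the original project, just standard practice; a minimal sketch:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("ckiplab/bert-base-han-chinese")
model = AutoModel.from_pretrained("ckiplab/bert-base-han-chinese")

encoded_input = tokenizer("黎民於變時雍。", return_tensors='pt')
with torch.no_grad():
    output = model(**encoded_input)

# Mean-pool the token vectors, ignoring padding positions.
mask = encoded_input['attention_mask'].unsqueeze(-1)                      # [1, 9, 1]
sentence_vector = (output.last_hidden_state * mask).sum(1) / mask.sum(1)  # [1, 768]
```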
* **Word Segmentation (WS)**

In WS, [ckiplab/bert-base-han-chinese-ws](https://huggingface.co/ckiplab/bert-base-han-chinese-ws) divides the written text into meaningful units, i.e. words. The task is formulated as labeling each character as the beginning (B) or inside (I) of a word; a post-processing sketch that folds these labels back into words follows the example.
```python
from transformers import pipeline

# Initialize
classifier = pipeline("token-classification", model="ckiplab/bert-base-han-chinese-ws")

# Input text
classifier("帝堯曰放勳")

# output
[{'entity': 'B',
'score': 0.9999793,
'index': 1,
'word': '帝',
'start': 0,
'end': 1},
{'entity': 'I',
'score': 0.9915047,
'index': 2,
'word': '堯',
'start': 1,
'end': 2},
{'entity': 'B',
'score': 0.99992275,
'index': 3,
'word': '曰',
'start': 2,
'end': 3},
{'entity': 'B',
'score': 0.99905187,
'index': 4,
'word': '放',
'start': 3,
'end': 4},
{'entity': 'I',
'score': 0.96299917,
'index': 5,
'word': '勳',
'start': 4,
'end': 5}]
```
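The B/I labels can be folded back into a word list with a few lines of post-processing. The `segment` helper below is an illustrative addition, not an API provided by the project:

```python
from transformers import pipeline

classifier = pipeline("token-classification", model="ckiplab/bert-base-han-chinese-ws")

def segment(text):
    # Start a new word at each 'B' label; append 'I' characters to the current word.
    words = []
    for token in classifier(text):
        if token['entity'] == 'B' or not words:
            words.append(token['word'])
        else:
            words[-1] += token['word']
    return words

print(segment("帝堯曰放勳"))  # ['帝堯', '曰', '放勳']
```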
* **Part-of-Speech (PoS) Tagging**

In PoS tagging, [ckiplab/bert-base-han-chinese-pos](https://huggingface.co/ckiplab/bert-base-han-chinese-pos) recognizes parts of speech in a given text. The task is formulated as labeling each character with its part-of-speech tag; a sketch that combines WS and PoS output into word-level tags follows the example.
```python
from transformers import pipeline

# Initialize
classifier = pipeline("token-classification", model="ckiplab/bert-base-han-chinese-pos")

# Input text
classifier("帝堯曰放勳")

# output
[{'entity': 'NB1',
'score': 0.99410427,
'index': 1,
'word': '帝',
'start': 0,
'end': 1},
{'entity': 'NB1',
'score': 0.98874336,
'index': 2,
'word': '堯',
'start': 1,
'end': 2},
{'entity': 'VG',
'score': 0.97059363,
'index': 3,
'word': '曰',
'start': 2,
'end': 3},
{'entity': 'NB1',
'score': 0.9864504,
'index': 4,
'word': '放',
'start': 3,
'end': 4},
{'entity': 'NB1',
'score': 0.9543974,
'index': 5,
'word': '勳',
'start': 4,
'end': 5}]
```
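Because both fine-tuned models label individual characters, their outputs can be combined into word-level PoS tags, for example by segmenting with the WS model and taking each word's first-character tag from the PoS model. The `tag` helper below is an illustrative sketch under that assumption, not an API of the project:

```python
from transformers import pipeline

ws = pipeline("token-classification", model="ckiplab/bert-base-han-chinese-ws")
pos = pipeline("token-classification", model="ckiplab/bert-base-han-chinese-pos")

def tag(text):
    # Group characters into words via the WS model's B/I labels,
    # then tag each word with its first character's PoS label.
    pairs = []
    pos_tags = [t['entity'] for t in pos(text)]
    for i, token in enumerate(ws(text)):
        if token['entity'] == 'B' or not pairs:
            pairs.append([token['word'], pos_tags[i]])
        else:
            pairs[-1][0] += token['word']
    return [tuple(p) for p in pairs]

print(tag("帝堯曰放勳"))  # [('帝堯', 'NB1'), ('曰', 'VG'), ('放勳', 'NB1')]
```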
## Model Performance

### Pre-trained Language Model, **Perplexity ↓**

Rows give each variant's MLM training data; columns give the MLM testing data era.

| Language Model | Training Data | 上古 | 中古 | 近代 | 現代 |
|---|---|---|---|---|---|
| ckiplab/bert-base-han-chinese | 上古 | 24.7588 | 87.8176 | 297.1111 | 60.3993 |
| | 中古 | 67.861 | 70.6244 | 133.0536 | 23.0125 |
| | 近代 | 69.1364 | 77.4154 | 46.8308 | 20.4289 |
| | 現代 | 118.8596 | 163.6896 | 146.5959 | 4.6143 |
| | Merge | 31.1807 | 61.2381 | 49.0672 | 4.5017 |
| ckiplab/bert-base-chinese | - | 233.6394 | 405.9008 | 278.7069 | 8.8521 |
### Word Segmentation (WS), **F1 score (%) ↑**

Rows give each variant's training data; columns give the testing data era.

| WS Model | Training Data | 上古 | 中古 | 近代 | 現代 |
|---|---|---|---|---|---|
| ckiplab/bert-base-han-chinese-ws | 上古 | 97.6090 | 88.5734 | 83.2877 | 70.3772 |
| | 中古 | 92.6402 | 92.6538 | 89.4803 | 78.3827 |
| | 近代 | 90.8651 | 92.1861 | 94.6495 | 81.2143 |
| | 現代 | 87.0234 | 83.5810 | 84.9370 | 96.9446 |
| | Merge | 97.4537 | 91.9990 | 94.0970 | 96.7314 |
| ckiplab/bert-base-chinese-ws | - | 86.5698 | 82.9115 | 84.3213 | 98.1325 |
### Part-of-Speech (POS) Tagging, **F1 score (%) ↑**

Rows give each variant's training data; columns give the testing data era.

| POS Model | Training Data | 上古 | 中古 | 近代 | 現代 |
|---|---|---|---|---|---|
| ckiplab/bert-base-han-chinese-pos | 上古 | 91.2945 | - | - | - |
| | 中古 | 7.3662 | 80.4896 | 11.3371 | 10.2577 |
| | 近代 | 6.4794 | 14.3653 | 88.6580 | 0.5316 |
| | 現代 | 11.9895 | 11.0775 | 0.4033 | 93.2813 |
| | Merge | 88.8772 | 42.4369 | 86.9093 | 92.9012 |
## License
Copyright (c) 2022 [CKIP Lab](https://ckip.iis.sinica.edu.tw/) under the [GPL-3.0 License](https://www.gnu.org/licenses/gpl-3.0.html).
## Citation
Please cite our paper if you use Han-Transformers in your work:

```bibtex
@inproceedings{lin-ma-2022-hantrans,
title = "{H}an{T}rans: An Empirical Study on Cross-Era Transferability of {C}hinese Pre-trained Language Model",
author = "Lin, Chin-Tung and Ma, Wei-Yun",
booktitle = "Proceedings of the 34th Conference on Computational Linguistics and Speech Processing (ROCLING 2022)",
year = "2022",
address = "Taipei, Taiwan",
publisher = "The Association for Computational Linguistics and Chinese Language Processing (ACLCLP)",
url = "https://aclanthology.org/2022.rocling-1.21",
pages = "164--173",
}
```