# Han Transformers

This project provides pre-trained models for ancient Chinese NLP tasks, including language modeling, word segmentation, and part-of-speech tagging.

Our paper has been accepted at ROCLING 2022! Please check out our [paper](https://aclanthology.org/2022.rocling-1.21/).

## Dependencies
* transformers ≤ 4.15.0
* PyTorch

## Models

We have uploaded our models to the Hugging Face Hub; a generic loading sketch follows this list.
* Pre-trained model, trained with a masked language modeling (MLM) objective:
    * [ckiplab/bert-base-han-chinese](https://huggingface.co/ckiplab/bert-base-han-chinese)
* Fine-tuned models for word segmentation:
    * [ckiplab/bert-base-han-chinese-ws](https://huggingface.co/ckiplab/bert-base-han-chinese-ws) (Merge)
    * [ckiplab/bert-base-han-chinese-ws-shanggu](https://huggingface.co/ckiplab/bert-base-han-chinese-ws-shanggu) (上古, Old Chinese)
    * [ckiplab/bert-base-han-chinese-ws-zhonggu](https://huggingface.co/ckiplab/bert-base-han-chinese-ws-zhonggu) (中古, Middle Chinese)
    * [ckiplab/bert-base-han-chinese-ws-jindai](https://huggingface.co/ckiplab/bert-base-han-chinese-ws-jindai) (近代, Early Modern Chinese)
    * [ckiplab/bert-base-han-chinese-ws-xiandai](https://huggingface.co/ckiplab/bert-base-han-chinese-ws-xiandai) (現代, Modern Chinese)
* Fine-tuned models for part-of-speech tagging:
    * [ckiplab/bert-base-han-chinese-pos](https://huggingface.co/ckiplab/bert-base-han-chinese-pos) (Merge)
    * [ckiplab/bert-base-han-chinese-pos-shanggu](https://huggingface.co/ckiplab/bert-base-han-chinese-pos-shanggu) (上古, Old Chinese / [tag list](shanggu.md))
    * [ckiplab/bert-base-han-chinese-pos-zhonggu](https://huggingface.co/ckiplab/bert-base-han-chinese-pos-zhonggu) (中古, Middle Chinese / [tag list](zhonggu.md))
    * [ckiplab/bert-base-han-chinese-pos-jindai](https://huggingface.co/ckiplab/bert-base-han-chinese-pos-jindai) (近代, Early Modern Chinese / [tag list](jindai.md))
    * [ckiplab/bert-base-han-chinese-pos-xiandai](https://huggingface.co/ckiplab/bert-base-han-chinese-pos-xiandai) (現代, Modern Chinese / [tag list](xiandai.md))
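
All checkpoints share the standard Hugging Face interface, so any of them can be loaded by name with the `transformers` Auto classes. A minimal sketch (model names as listed above):

```python
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          AutoModelForTokenClassification)

# Pre-trained MLM checkpoint
mlm_tokenizer = AutoTokenizer.from_pretrained("ckiplab/bert-base-han-chinese")
mlm_model = AutoModelForMaskedLM.from_pretrained("ckiplab/bert-base-han-chinese")

# Era-specific fine-tuned checkpoints load the same way,
# e.g. the Old Chinese (上古) word-segmentation model:
ws_tokenizer = AutoTokenizer.from_pretrained("ckiplab/bert-base-han-chinese-ws-shanggu")
ws_model = AutoModelForTokenClassification.from_pretrained("ckiplab/bert-base-han-chinese-ws-shanggu")
```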

## Training Corpus
The copyright of the datasets belongs to the Institute of Linguistics, Academia Sinica.
* [中央研究院上古漢語標記語料庫](http://lingcorpus.iis.sinica.edu.tw/cgi-bin/kiwi/akiwi/kiwi.sh) (Academia Sinica Tagged Corpus of Old Chinese)
* [中央研究院中古漢語語料庫](http://lingcorpus.iis.sinica.edu.tw/cgi-bin/kiwi/dkiwi/kiwi.sh) (Academia Sinica Middle Chinese Corpus)
* [中央研究院近代漢語語料庫](http://lingcorpus.iis.sinica.edu.tw/cgi-bin/kiwi/pkiwi/kiwi.sh) (Academia Sinica Early Modern Chinese Corpus)
* [中央研究院現代漢語語料庫](http://asbc.iis.sinica.edu.tw) (Academia Sinica Balanced Corpus of Modern Chinese)

## Usage

### Installation
```bash
pip install transformers==4.15.0
pip install torch==1.10.2
```
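
The pinned versions above satisfy the `transformers ≤ 4.15.0` constraint; to double-check what actually got installed:

```bash
python -c "import transformers, torch; print(transformers.__version__, torch.__version__)"
# expected: 4.15.0 1.10.2
```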

### Inference

* **Pre-trained Language Model**

You can use [ckiplab/bert-base-han-chinese](https://huggingface.co/ckiplab/bert-base-han-chinese) directly with a pipeline for masked language modeling.

```python
from transformers import pipeline

# Initialize
unmasker = pipeline('fill-mask', model='ckiplab/bert-base-han-chinese')

# Input text with [MASK]
unmasker("黎[MASK]於變時雍。")

# output
[{'sequence': '黎 民 於 變 時 雍 。',
'score': 0.14885780215263367,
'token': 3696,
'token_str': '民'},
{'sequence': '黎 庶 於 變 時 雍 。',
'score': 0.0859643816947937,
'token': 2433,
'token_str': '庶'},
{'sequence': '黎 氏 於 變 時 雍 。',
'score': 0.027848130092024803,
'token': 3694,
'token_str': '氏'},
{'sequence': '黎 人 於 變 時 雍 。',
'score': 0.023678112775087357,
'token': 782,
'token_str': '人'},
{'sequence': '黎 生 於 變 時 雍 。',
'score': 0.018718384206295013,
'token': 4495,
'token_str': '生'}]
```
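
By default the pipeline returns the five highest-scoring candidates. If you only need the best substitution, the fill-mask pipeline also accepts a `top_k` argument:

```python
# Keep only the single highest-scoring candidate
unmasker("黎[MASK]於變時雍。", top_k=1)
```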

You can use [ckiplab/bert-base-han-chinese](https://huggingface.co/ckiplab/bert-base-han-chinese) to get the features of a given text in PyTorch.

```python
from transformers import AutoTokenizer, AutoModel

# Initialize tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("ckiplab/bert-base-han-chinese")
model = AutoModel.from_pretrained("ckiplab/bert-base-han-chinese")

# Input text
text = "黎民於變時雍。"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)

# get encoded token vectors: one per character (7) plus [CLS] and [SEP]
output.last_hidden_state # torch.Tensor with Size([1, 9, 768])

# get encoded sentence vector
output.pooler_output # torch.Tensor with Size([1, 768])
```
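
As an illustration (our own sketch, not part of the original project), these vectors can be used directly, e.g. to compare two sentences by mean-pooling the token vectors and taking the cosine similarity; the second sentence here is an arbitrary example:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("ckiplab/bert-base-han-chinese")
model = AutoModel.from_pretrained("ckiplab/bert-base-han-chinese")

def sentence_vector(text):
    # Mean-pool the encoded token vectors into a single sentence vector
    encoded = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        output = model(**encoded)
    return output.last_hidden_state.mean(dim=1).squeeze(0)

a = sentence_vector("黎民於變時雍。")
b = sentence_vector("克明俊德，以親九族。")
print(torch.cosine_similarity(a, b, dim=0).item())
```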

* **Word Segmentation (WS)**

In WS, [ckiplab/bert-base-han-chinese-ws](https://huggingface.co/ckiplab/bert-base-han-chinese-ws) divides written text into meaningful units, i.e. words. The task is formulated as labeling each character as either the beginning (B) or the inside (I) of a word.

```python
from transformers import pipeline

# Initialize
classifier = pipeline("token-classification", model="ckiplab/bert-base-han-chinese-ws")

# Input text
classifier("帝堯曰放勳")

# output
[{'entity': 'B',
'score': 0.9999793,
'index': 1,
'word': '帝',
'start': 0,
'end': 1},
{'entity': 'I',
'score': 0.9915047,
'index': 2,
'word': '堯',
'start': 1,
'end': 2},
{'entity': 'B',
'score': 0.99992275,
'index': 3,
'word': '曰',
'start': 2,
'end': 3},
{'entity': 'B',
'score': 0.99905187,
'index': 4,
'word': '放',
'start': 3,
'end': 4},
{'entity': 'I',
'score': 0.96299917,
'index': 5,
'word': '勳',
'start': 4,
'end': 5}]
```
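
The pipeline labels one character at a time, so the segmented words have to be reassembled from the B/I labels. A small helper (ours, not part of the package):

```python
def group_words(tokens):
    """Merge B/I-labeled characters from the WS pipeline into words."""
    words = []
    for token in tokens:
        if token["entity"] == "B" or not words:
            words.append(token["word"])   # B starts a new word
        else:
            words[-1] += token["word"]    # I extends the current word
    return words

print(group_words(classifier("帝堯曰放勳")))
# ['帝堯', '曰', '放勳']
```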

* **Part-of-Speech (PoS) Tagging**

In PoS tagging, [ckiplab/bert-base-han-chinese-pos](https://huggingface.co/ckiplab/bert-base-han-chinese-pos) recognizes parts of speech in a given text. The task is formulated as labeling each character with a part-of-speech tag.

```python
from transformers import pipeline

# Initialize
classifier = pipeline("token-classification", model="ckiplab/bert-base-han-chinese-pos")

# Input text
classifier("帝堯曰放勳")

# output
[{'entity': 'NB1',
'score': 0.99410427,
'index': 1,
'word': '帝',
'start': 0,
'end': 1},
{'entity': 'NB1',
'score': 0.98874336,
'index': 2,
'word': '堯',
'start': 1,
'end': 2},
{'entity': 'VG',
'score': 0.97059363,
'index': 3,
'word': '曰',
'start': 2,
'end': 3},
{'entity': 'NB1',
'score': 0.9864504,
'index': 4,
'word': '放',
'start': 3,
'end': 4},
{'entity': 'NB1',
'score': 0.9543974,
'index': 5,
'word': '勳',
'start': 4,
'end': 5}]
```
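
Because the PoS pipeline also labels individual characters, word-level tags can be derived by combining it with the WS model: group characters into words with the B/I labels, then take the tag of each word's first character. A sketch under that assumption (the merging strategy is ours, not from the paper):

```python
from transformers import pipeline

ws = pipeline("token-classification", model="ckiplab/bert-base-han-chinese-ws")
pos = pipeline("token-classification", model="ckiplab/bert-base-han-chinese-pos")

def tag_words(text):
    # Group characters into words via WS, then tag each word with the
    # PoS label of its first character.
    tagged = []
    for ws_tok, pos_tok in zip(ws(text), pos(text)):
        if ws_tok["entity"] == "B" or not tagged:
            tagged.append([ws_tok["word"], pos_tok["entity"]])
        else:
            tagged[-1][0] += ws_tok["word"]
    return [tuple(pair) for pair in tagged]

print(tag_words("帝堯曰放勳"))
# [('帝堯', 'NB1'), ('曰', 'VG'), ('放勳', 'NB1')]
```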

## Model Performance

### Pre-trained Language Model, **Perplexity ↓**


Rows give the MLM training data; columns give perplexity on each era's MLM testing data.

| Language Model | MLM Training Data | 上古 | 中古 | 近代 | 現代 |
| --- | --- | --- | --- | --- | --- |
| ckiplab/bert-base-han-chinese | 上古 | 24.7588 | 87.8176 | 297.1111 | 60.3993 |
| | 中古 | 67.861 | 70.6244 | 133.0536 | 23.0125 |
| | 近代 | 69.1364 | 77.4154 | 46.8308 | 20.4289 |
| | 現代 | 118.8596 | 163.6896 | 146.5959 | 4.6143 |
| | Merge | 31.1807 | 61.2381 | 49.0672 | 4.5017 |
| ckiplab/bert-base-chinese | - | 233.6394 | 405.9008 | 278.7069 | 8.8521 |

### Word Segmentation (WS), **F1 score (%) ↑**


Rows give the WS training data; columns give the F1 score on each era's testing data.

| WS Model | Training Data | 上古 | 中古 | 近代 | 現代 |
| --- | --- | --- | --- | --- | --- |
| ckiplab/bert-base-han-chinese-ws | 上古 | 97.6090 | 88.5734 | 83.2877 | 70.3772 |
| | 中古 | 92.6402 | 92.6538 | 89.4803 | 78.3827 |
| | 近代 | 90.8651 | 92.1861 | 94.6495 | 81.2143 |
| | 現代 | 87.0234 | 83.5810 | 84.9370 | 96.9446 |
| | Merge | 97.4537 | 91.9990 | 94.0970 | 96.7314 |
| ckiplab/bert-base-chinese-ws | - | 86.5698 | 82.9115 | 84.3213 | 98.1325 |

### Part-of-Speech (POS) Tagging, **F1 score (%) ↑**


Rows give the POS training data; columns give the F1 score on each era's testing data.

| POS Model | Training Data | 上古 | 中古 | 近代 | 現代 |
| --- | --- | --- | --- | --- | --- |
| ckiplab/bert-base-han-chinese-pos | 上古 | 91.2945 | - | - | - |
| | 中古 | 7.3662 | 80.4896 | 11.3371 | 10.2577 |
| | 近代 | 6.4794 | 14.3653 | 88.6580 | 0.5316 |
| | 現代 | 11.9895 | 11.0775 | 0.4033 | 93.2813 |
| | Merge | 88.8772 | 42.4369 | 86.9093 | 92.9012 |

## License

Copyright (c) 2022 [CKIP Lab](https://ckip.iis.sinica.edu.tw/) under the [GPL-3.0 License](https://www.gnu.org/licenses/gpl-3.0.html).

## Citation
Please cite our paper if you use Han-Transformers in your work:

```bibtex
@inproceedings{lin-ma-2022-hantrans,
    title     = "{H}an{T}rans: An Empirical Study on Cross-Era Transferability of {C}hinese Pre-trained Language Model",
    author    = "Lin, Chin-Tung and Ma, Wei-Yun",
    booktitle = "Proceedings of the 34th Conference on Computational Linguistics and Speech Processing (ROCLING 2022)",
    year      = "2022",
    address   = "Taipei, Taiwan",
    publisher = "The Association for Computational Linguistics and Chinese Language Processing (ACLCLP)",
    url       = "https://aclanthology.org/2022.rocling-1.21",
    pages     = "164--173",
}
```