{"id":13753051,"url":"https://github.com/ShannonAI/ChineseBert","last_synced_at":"2025-05-09T20:34:42.609Z","repository":{"id":36961401,"uuid":"372481592","full_name":"ShannonAI/ChineseBert","owner":"ShannonAI","description":"Code for ACL 2021 paper \"ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information\"","archived":false,"fork":false,"pushed_at":"2023-07-26T03:15:32.000Z","size":280,"stargazers_count":542,"open_issues_count":49,"forks_count":92,"subscribers_count":6,"default_branch":"main","last_synced_at":"2024-11-16T05:32:30.698Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ShannonAI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-05-31T11:24:05.000Z","updated_at":"2024-11-12T02:39:53.000Z","dependencies_parsed_at":"2024-08-03T09:04:57.088Z","dependency_job_id":"54d594b7-c85b-427e-bfb8-06b5f098b925","html_url":"https://github.com/ShannonAI/ChineseBert","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ShannonAI%2FChineseBert","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ShannonAI%2FChineseBert/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ShannonAI%2FChineseBert/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ShannonAI%2FChineseBert/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/Gi
tHub/owners/ShannonAI","download_url":"https://codeload.github.com/ShannonAI/ChineseBert/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253321814,"owners_count":21890471,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T09:01:15.365Z","updated_at":"2025-05-09T20:34:40.521Z","avatar_url":"https://github.com/ShannonAI.png","language":"Python","funding_links":[],"categories":["BERT优化"],"sub_categories":[],"readme":"# ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information\n\nThis repository contains code, model, dataset for [ChineseBERT]() at ACL2021.\n\n**[ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information](https://arxiv.org/pdf/2106.16038.pdf)**  \n*Zijun Sun, Xiaoya Li, Xiaofei Sun, Yuxian Meng, Xiang Ao, Qing He, Fei Wu and Jiwei Li*\n\n\n## Guide  \n\n| Section | Description |\n|  ----  | ----  |\n| [Introduction](#Introduction) | Introduction to ChineseBERT |  \n| [Download](#Download) | Download links for ChineseBERT |\n| [Quick tour](#Quick-tour) | Learn how to quickly load models |\n| [Experiment](#Experiments) | Experiment results on different Chinese NLP datasets |\n| [Citation](#Citation) | Citation | \n| [Contact](#Contact) | How to contact us | \n\n## Introduction\nWe propose ChineseBERT, which incorporates both the glyph and pinyin information of Chinese\ncharacters into language model pretraining.  
\n \nFirst, for each Chinese character, we compute three kinds of embeddings.\n - **Char Embedding:** the same as the original BERT token embedding.\n - **Glyph Embedding:** captures visual features based on different fonts of a Chinese character.\n - **Pinyin Embedding:** captures phonetic features from the pinyin sequence of a Chinese character.\n \nThen, the char embedding, glyph embedding and pinyin embedding \nare concatenated and mapped to a D-dimensional embedding through a fully \nconnected layer to form the fusion embedding.   \nFinally, the fusion embedding is added to the position embedding, and the sum is fed as input to the BERT model.  \nThe following image shows an overview of the ChineseBERT architecture.\n \n![MODEL](https://raw.githubusercontent.com/ShannonAI/ChineseBert/main/images/ChineseBERT.png)\n\nChineseBERT leverages the glyph and pinyin information of Chinese \ncharacters to enhance the model's ability to capture\ncontext semantics from surface character forms and\nto disambiguate polyphonic characters in Chinese.\n\n## Download \nWe provide pre-trained ChineseBERT models for PyTorch in the Hugging Face model format. \n\n* **`ChineseBERT-base`**: 12-layer, 768-hidden, 12-heads, 147M parameters \n* **`ChineseBERT-large`**: 24-layer, 1024-hidden, 16-heads, 374M parameters   \n  \nOur models can be downloaded here:\n\n| Model | Model Hub | Google Drive |\n| --- | --- | --- |\n| **`ChineseBERT-base`**  | [564M](https://huggingface.co/ShannonAI/ChineseBERT-base) | [560M](https://drive.google.com/file/d/1CseJzc58W4s8U_eIuAnshHQmnmi7Sr5-/view?usp=sharing) |\n| **`ChineseBERT-large`**   | [1.4G](https://huggingface.co/ShannonAI/ChineseBERT-large) | [1.4G](https://drive.google.com/file/d/1-glLDbmCrPgs_odjPvacaBniY0KnC8Z5/view?usp=sharing) |\n\n\n*Note: The model hub contains the model, fonts and pinyin config files.*\n\n## Quick tour\nWe trained our models with Hugging Face, so they can be loaded easily.  
\nDownload a ChineseBERT model and save it at `[CHINESEBERT_PATH]`.  \nThe following snippet loads the model. \n```\n\u003e\u003e\u003e from models.modeling_glycebert import GlyceBertForMaskedLM\n\n\u003e\u003e\u003e chinese_bert = GlyceBertForMaskedLM.from_pretrained([CHINESEBERT_PATH])\n\u003e\u003e\u003e print(chinese_bert)\n```\nThe complete example can be found here: \n[Masked word completion with ChineseBERT](tasks/language_model/README.md)\n\nAnother example gets the representation of a sentence:\n```\n\u003e\u003e\u003e from datasets.bert_dataset import BertDataset\n\u003e\u003e\u003e from models.modeling_glycebert import GlyceBertModel\n\n\u003e\u003e\u003e tokenizer = BertDataset([CHINESEBERT_PATH])\n\u003e\u003e\u003e chinese_bert = GlyceBertModel.from_pretrained([CHINESEBERT_PATH])\n\u003e\u003e\u003e sentence = '我喜欢猫'\n\n\u003e\u003e\u003e input_ids, pinyin_ids = tokenizer.tokenize_sentence(sentence)\n\u003e\u003e\u003e length = input_ids.shape[0]\n\u003e\u003e\u003e input_ids = input_ids.view(1, length)\n\u003e\u003e\u003e pinyin_ids = pinyin_ids.view(1, length, 8)\n\u003e\u003e\u003e output_hidden = chinese_bert.forward(input_ids, pinyin_ids)[0]\n\u003e\u003e\u003e print(output_hidden)\ntensor([[[ 0.0287, -0.0126,  0.0389,  ...,  0.0228, -0.0677, -0.1519],\n         [ 0.0144, -0.2494, -0.1853,  ...,  0.0673,  0.0424, -0.1074],\n         [ 0.0839, -0.2989, -0.2421,  ...,  0.0454, -0.1474, -0.1736],\n         [-0.0499, -0.2983, -0.1604,  ..., -0.0550, -0.1863,  0.0226],\n         [ 0.1428, -0.0682, -0.1310,  ..., -0.1126,  0.0440, -0.1782],\n         [ 0.0287, -0.0126,  0.0389,  ...,  0.0228, -0.0677, -0.1519]]],\n       grad_fn=\u003cNativeLayerNormBackward\u003e)\n```\nThe complete code can be found [HERE](tasks/language_model/chinese_bert.py)\n\n## Experiments\n\n### ChnSentiCorp\nChnSentiCorp is a dataset for sentiment analysis.  
\nEvaluation Metrics: Accuracy\n\n| Model  | Dev | Test |  \n|  ----  | ----  | ----  |\n| ERNIE |  95.4 |   95.5  |\n| BERT | 95.1 |  95.4 |  \n| BERT-wwm | 95.4 | 95.3 |  \n| RoBERTa |  95.0 |  95.6 |  \n| MacBERT | 95.2 |   95.6 |  \n| ChineseBERT | **95.6** | **95.7** |  \n|   | ----  | ----  |  \n| RoBERTa-large | **95.8** | 95.8 |  \n| MacBERT-large |  95.7 |  **95.9** |  \n| ChineseBERT-large | **95.8** |  **95.9** | \n\nTraining details and code can be found [HERE](tasks/ChnSetiCorp/README.md)\n\n### THUCNews\nTHUCNews contains news in 10 categories.  \nEvaluation Metrics: Accuracy\n\n| Model  | Dev | Test |  \n|  ----  | ----  | ----  |\n| ERNIE |  95.4 |   95.5  |\n| BERT | 95.1 |  95.4 |  \n| BERT-wwm | 95.4 | 95.3 |  \n| RoBERTa |  95.0 |  95.6 |  \n| MacBERT | 95.2 |   95.6 |  \n| ChineseBERT | **95.6** | **95.7** |  \n|   | ----  | ----  |  \n| RoBERTa-large | **95.8** | 95.8 |  \n| MacBERT-large |  95.7 |  **95.9** |  \n| ChineseBERT-large | **95.8** |  **95.9** |\n\nTraining details and code can be found [HERE](tasks/THUCNew/README.md)\n\n### XNLI\nXNLI is a dataset for natural language inference.  \nEvaluation Metrics: Accuracy  \n\n| Model  | Dev | Test |  \n|  ----  | ----  | ----  |\n| ERNIE |  79.7 |   78.6  |\n| BERT | 79.0 |  78.2 |  \n| BERT-wwm | 79.4 | 78.7 |  \n| RoBERTa |  80.0 |  78.8 |  \n| MacBERT | 80.3 |  79.3 |  \n| ChineseBERT | **80.5** | **79.6** |  \n|   | ----  | ----  |  \n| RoBERTa-large | 82.1 | 81.2 |  \n| MacBERT-large |  82.4 |  81.3 |  \n| ChineseBERT-large | **82.7** |  **81.6** |\n\nTraining details and code can be found [HERE](tasks/XNLI/README.md)\n\n### BQ\nBQ Corpus is a sentence pair matching dataset.  
\nEvaluation Metrics: Accuracy\n\n| Model  | Dev | Test |  \n|  ----  | ----  | ----  |\n| ERNIE | 86.3 | 85.0  |\n| BERT | 86.1 | 85.2 |  \n| BERT-wwm | **86.4** | **85.3** |  \n| RoBERTa |  86.0 | 85.0 |  \n| MacBERT | 86.0 | 85.2 |  \n| ChineseBERT | **86.4** | 85.2 |  \n|    | ----  | ----  |\n| RoBERTa-large | 86.3 | 85.8 |  \n| MacBERT-large |  86.2 | 85.6 |  \n| ChineseBERT-large | **86.5** |  **86.0** | \n\nTraining details and code can be found [HERE](tasks/BQ/README.md)\n\n### LCQMC\nLCQMC Corpus is a sentence pair matching dataset.  \nEvaluation Metrics: Accuracy\n\n| Model  | Dev | Test |  \n|  ----  | ----  | ----  |\n| ERNIE | 89.8 |  87.2  |\n| BERT | 89.4 | 87.0 |  \n| BERT-wwm | 89.6 | 87.1 |  \n| RoBERTa |  89.0 |  86.4 |  \n| MacBERT | 89.5 | 87.0 |  \n| ChineseBERT | **89.8** | **87.4** |  \n|   | ----  | ----  |  \n| RoBERTa-large | 90.4 | 87.0 |  \n| MacBERT-large |  **90.6** | 87.6 |  \n| ChineseBERT-large | 90.5 |  **87.8** |  \n\nTraining details and code can be found [HERE](tasks/LCQMC/README.md)\n\n### TNEWS\n\nTNEWS is a 15-class short news text classification dataset. \u003cbr\u003e\nEvaluation Metrics: Accuracy\n\n| Model  | Dev | Test |  \n|  ----  | ----  | ----  |\n| ERNIE | 58.24 |  58.33 | \n| BERT | 56.09 |  56.58 | \n| BERT-wwm | 56.77 | 56.86 | \n| RoBERTa |   57.51 |  56.94 | \n| ChineseBERT | **58.64** | **58.95** | \n|   | ----  | ----  |  \n| RoBERTa-large | 58.32 | 58.61 | \n| ChineseBERT-large |  **59.06** | **59.47** | \n\nTraining details and code can be found [HERE](tasks/TNews/README.md)\n\n### CMRC\n\nCMRC is a machine reading comprehension dataset.  
\nEvaluation Metrics: EM\n\n| Model  | Dev | Test |  \n|  ----  | ----  | ----  |\n| ERNIE |  66.89 |   74.70  |\n| BERT | 66.77 |  71.60 |  \n| BERT-wwm | 66.96 | 73.95 |  \n| RoBERTa |  67.89 |  75.20 |  \n| MacBERT | - |   - |  \n| ChineseBERT | **67.95** | **95.7** |  \n|   | ----  | ----  |  \n| RoBERTa-large | 70.59 | 77.95 |  \n| ChineseBERT-large | **70.70** |  **78.05** |  \n\nTraining details and code can be found [HERE](tasks/CMRC/README.md)\n\n### OntoNotes\n\nOntoNotes 4.0 is a Chinese named entity recognition dataset with 18 named entity types. \u003cbr\u003e\n\nEvaluation Metrics: Span-Level F1\n\n| Model  |  Test Precision |  Test Recall |  Test F1 |  \n|  ----  | ----  | ----  | ----  |\n| BERT | 79.69 | 82.09 | 80.87 | \n| RoBERTa |  **80.43** | 80.30 |  80.37 | \n| ChineseBERT | 80.03 | **83.33** | **81.65** | \n|    | ----  | ----  | ----  |\n| RoBERTa-large |  80.72 | 82.07 | 81.39 |\n| ChineseBERT-large | **80.77** | **83.65** | **82.18** | \n\nTo reproduce the experiment results, please **install and use** `torch1.7.1+cu101` via `pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html`. \u003cbr\u003e\nTraining details and code can be found [HERE](tasks/OntoNotes/README.md)\n\n\n### Weibo \n\nWeibo is a Chinese named entity recognition dataset and contains 4 named entity types. 
\u003cbr\u003e\n\nEvaluation Metrics: Span-Level F1\n\n| Model  |  Test Precision |  Test Recall |  Test F1 |  \n|  ----  | ----  | ----  | ----  |\n| BERT | 67.12 | 66.88 |  67.33 |\n| RoBERTa | **68.49** | 67.81 | 68.15 |\n| ChineseBERT | 68.27 | **69.78** | **69.02** |\n|  | ----  | ----  | ----  |\n| RoBERTa-large |  66.74 | 70.02 | 68.35 |\n| ChineseBERT-large | **68.75** | **72.97** | **70.80** |\n\nTo reproduce the experiment results, please **install and use** `torch1.7.1+cu101` via `pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html`. \u003cbr\u003e\nTraining details and code can be found [HERE](tasks/Weibo/README.md)\n\n\n## Citation\n```latex\n@article{sun2021chinesebert,\n  title={ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information},\n  author={Sun, Zijun and Li, Xiaoya and Sun, Xiaofei and Meng, Yuxian and Ao, Xiang and He, Qing and Wu, Fei and Li, Jiwei},\n  journal={arXiv preprint arXiv:2106.16038},\n  year={2021}\n}\n```\n\n\n## Contact\nIf you have any questions about our paper, code, model or data,  \nplease feel free to open a GitHub issue or email us.  \nYou can send emails to **zijun_sun@shannonai.com** or **xiaoya_li@shannonai.com**\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FShannonAI%2FChineseBert","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FShannonAI%2FChineseBert","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FShannonAI%2FChineseBert/lists"}