{"id":13534936,"url":"https://github.com/brightmart/roberta_zh","last_synced_at":"2025-05-15T04:07:31.925Z","repository":{"id":38401899,"uuid":"205902991","full_name":"brightmart/roberta_zh","owner":"brightmart","description":"RoBERTa中文预训练模型: RoBERTa for Chinese ","archived":false,"fork":false,"pushed_at":"2024-07-22T15:02:06.000Z","size":315,"stargazers_count":2708,"open_issues_count":47,"forks_count":413,"subscribers_count":52,"default_branch":"master","last_synced_at":"2025-05-15T04:07:19.896Z","etag":null,"topics":["bert","chinese","gpt2","pre-trained","pre-trained-language-models","roberta"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/brightmart.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-09-02T17:11:57.000Z","updated_at":"2025-05-12T12:11:06.000Z","dependencies_parsed_at":"2024-09-20T23:01:13.550Z","dependency_job_id":null,"html_url":"https://github.com/brightmart/roberta_zh","commit_stats":{"total_commits":40,"total_committers":2,"mean_commits":20.0,"dds":"0.025000000000000022","last_synced_commit":"438476f7da1661faf45e5b8da9f55df403e44997"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brightmart%2Froberta_zh","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brightmart%2Froberta_zh/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brightmart%2Froberta_zh/releases","manifests_url":"https://repos.ecosy
ste.ms/api/v1/hosts/GitHub/repositories/brightmart%2Froberta_zh/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/brightmart","download_url":"https://codeload.github.com/brightmart/roberta_zh/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254270646,"owners_count":22042859,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert","chinese","gpt2","pre-trained","pre-trained-language-models","roberta"],"created_at":"2024-08-01T08:00:47.168Z","updated_at":"2025-05-15T04:07:26.911Z","avatar_url":"https://github.com/brightmart.png","language":"Python","funding_links":[],"categories":["Pretrained Language Model","Pretrained BERT weights:","Python","BERT优化","Contents 列表"],"sub_categories":["Repository","大语言对话模型及数据","Pre-trained Language Models 预训练语言模型"],"readme":"RoBERTa for Chinese, TensorFlow \u0026 PyTorch\n\nChinese Pre-trained RoBERTa Models \n-------------------------------------------------\nRoBERTa is an improved version of BERT. It reaches state-of-the-art results by changing the training task and the data generation method, training longer, using larger batches, and using more data. The models can be loaded directly with BERT code.\n\nThis project implements RoBERTa pre-training on a large-scale Chinese corpus in TensorFlow, and also provides PyTorch pre-trained models and loading instructions.\n\n*** 2019-10-12: added a comparison of different models on reading comprehension ***\n\n*** 2019-09-08: added download links within China, PyTorch versions, and a preliminary comparison with bert-wwm, xlnet and other models ***\n\n\n \u003ca href=\"https://www.modelfun.cn/demo\"\u003eNLP auto-labeling tool (up to 100X efficiency gain) - book a demo\u003c/a\u003e\n \nA pre-trained \u003ca href=\"https://github.com/brightmart/albert_zh\"\u003eALBERT model, Chinese version\u003c/a\u003e, is also available now.\n\nChinese Pre-trained RoBERTa Models - Download\n-------------------------------------------------\n*** 6-layer RoBERTa trial version 
***\nRoBERTa-zh-Layer6: \u003ca href=\"https://drive.google.com/file/d/1QXFqD6Qm8H9bRSbw7yZIgTGxD0O6ejUq/view?usp=sharing\"\u003e Google Drive\u003c/a\u003e or \u003ca href=\"https://pan.baidu.com/s/1TfKz-d9wvfqct8vN0c-vjg\"\u003eBaidu Netdisk\u003c/a\u003e, TensorFlow version, loads directly with BERT, about 200 MB\n\n###### ** Recommended: RoBERTa-zh-Large (validated) **\nRoBERTa-zh-Large: \u003ca href='https://drive.google.com/open?id=1W3WgPJWGVKlU9wpUYsdZuurAIFKvrl_Y'\u003e Google Drive \u003c/a\u003e or \u003ca href=\"https://pan.baidu.com/s/1Rk_QWqd7-wBTwycr91bmug\"\u003eBaidu Netdisk\u003c/a\u003e, TensorFlow version, loads directly with BERT\n\nRoBERTa-zh-Large: \u003ca href='https://drive.google.com/open?id=1yK_P8VhWZtdgzaG0gJ3zUGOKWODitKXZ'\u003e Google Drive \u003c/a\u003e or \u003ca href=\"https://pan.baidu.com/s/1MRDuVqUROMdSKr6HD9x1mw\"\u003eBaidu Netdisk\u003c/a\u003e, PyTorch version, loads directly with the PyTorch version of BERT\n\nTraining data for the 24-layer and 12-layer RoBERTa models: 30 GB of raw text, nearly 300 million sentences, and 10 billion Chinese characters (tokens), which yielded 250 million training instances;\n\nit covers news, community question answering, several encyclopedias, and more.\n\nThis project uses the same training data as the Chinese pre-trained 24-layer XLNet project, \u003ca href=\"https://github.com/brightmart/xlnet_zh\"\u003eXLNet_zh\u003c/a\u003e.\n\nRoBERTa_zh_L12: \u003ca href='https://drive.google.com/open?id=1ykENKV7dIFAqRRQbZIh0mSb7Vjc2MeFA'\u003e Google Drive\u003c/a\u003e or \u003ca href=\"https://pan.baidu.com/s/1hAs7-VSn5HZWxBHQMHKkrg\"\u003eBaidu Netdisk\u003c/a\u003e, TensorFlow version, loads directly with BERT \n \nRoBERTa_zh_L12: \u003ca href=\"https://drive.google.com/open?id=1H6f4tYlGXgug1DdhYzQVBuwIGAkAflwB\"\u003eGoogle Drive\u003c/a\u003e or \u003ca href=\"https://pan.baidu.com/s/1AGC76N7pZOzWuo8ua1AZfw\"\u003eBaidu Netdisk\u003c/a\u003e, PyTorch version, loads directly with the PyTorch version of BERT\n\n---------------------------------------------------------------\n\n\u003ca href='https://drive.google.com/file/d/1cg3tVKPyUEmiI88H3gasqYC4LV4X8dNm/view?usp=sharing'\u003eRoberta_l24_zh_base\u003c/a\u003e TensorFlow version, loads directly with BERT\n\nTraining data for the 24-layer base version: 10 GB of text, including news, community question answering, several encyclopedias, and more.\n\n\n\nWhat is RoBERTa:\n-------------------------------------------------\n    A robustly optimized method for pretraining natural language processing (NLP) systems that 
improves on Bidirectional Encoder Representations from Transformers, or BERT, the self-supervised method released by Google in 2018. \n    \n    RoBERTa produces state-of-the-art results on the widely used NLP benchmark, General Language Understanding Evaluation (GLUE). The model delivered state-of-the-art performance on the MNLI, QNLI, RTE, STS-B, and RACE tasks and a sizable performance improvement on the GLUE benchmark. With a score of 88.5, RoBERTa reached the top position on the GLUE leaderboard, matching the performance of the previous leader, XLNet-Large. \n    \n    (Introduction from Facebook blog)\n\nRelease Plan:\n-------------------------------------------------\n1、24-layer RoBERTa model (roberta_l24_zh), trained on 30 GB of text,        September 8\n\n2、12-layer RoBERTa model (roberta_l12_zh), trained on 30 GB of text,        September 8\n\n3、6-layer RoBERTa model (roberta_l6_zh), trained on 30 GB of text,         September 8\n\n4、PyTorch version of the model (roberta_l6_zh_pytorch)                September 8\n\n5、30 GB Chinese corpus in pre-training format, ready for training (bert, xlnet, gpt2)       TBD\n\n6、Test-set evaluation and performance comparison                                     September 14\n\nPerformance Evaluation and Comparison \n-------------------------------------------------\n### Internet news sentiment analysis: CCF-Sentiment-Analysis\n\n| Model | Online F1 |\n| :------- | :---------: |\n| BERT | 80.3 |\n| Bert-wwm-ext | 80.5 | \n| XLNet | 79.6 | \n| Roberta-mid | 80.5 |\n| Roberta-large (max_seq_length=512, split_num=1) | 81.25 |\n\nNote: the results are from \u003ca href=\"https://github.com/guoday/CCF-BDCI-Sentiment-Analysis-Baseline/blob/master/README.md\"\u003eguoday's open-source project\u003c/a\u003e; for the dataset and task description, see \u003ca href=\"https://www.datafountain.cn/competitions/350/ranking\"\u003eCCF Internet News Sentiment Analysis\u003c/a\u003e\n\n### Natural language inference: XNLI\n\n| Model | Dev | Test |\n| :------- | :---------: | :---------: |\n| BERT | 77.8 (77.4) | 77.8 (77.5) | \n| ERNIE | 79.7 (79.4) | 78.6 (78.2) | \n| BERT-wwm | 79.0 (78.4) | 78.2 (78.0) | \n| BERT-wwm-ext | 79.4 (78.6) | 78.7 (78.3) |\n| XLNet | 79.2  | 78.7 |\n| RoBERTa-zh-base | 79.8 |78.8  |\n| **RoBERTa-zh-Large** | **80.2 (80.0)** | **79.9 (79.5)** 
|\n\nNote: RoBERTa_l24_zh was run only twice, so performance may still improve; \n\nBERT-wwm-ext comes from \u003ca href=\"https://github.com/ymcui/Chinese-BERT-wwm\"\u003ehere\u003c/a\u003e; XLNet comes from \u003ca href=\"https://github.com/ymcui/Chinese-PreTrained-XLNet\"\u003ehere\u003c/a\u003e; RoBERTa-zh-base refers to the 12-layer Chinese RoBERTa model\n\n###  Question matching task: LCQMC (Sentence Pair Matching)\n\n| Model | Dev | Test |\n| :------- | :---------: | :---------: |\n| BERT | 89.4(88.4) | 86.9(86.4) | \n| ERNIE | 89.8 (89.6) | **87.2** (87.0) | \n| BERT-wwm |89.4 (89.2) | 87.0 (86.8) | \n| BERT-wwm-ext | - |-  |\n| RoBERTa-zh-base | 88.7 | 87.0  |\n| **RoBERTa-zh-Large** | **89.9**(89.6) | **87.2**(86.7) |\n| RoBERTa-zh-Large (200k steps) | 89.7| 87.0 |\n\nNote: RoBERTa_l24_zh was run only twice, so performance may still improve. The number of training epochs is kept consistent with the paper.\n\n### Reading comprehension tests\nFor reading comprehension tasks, the best hyperparameters found so far for both bert and roberta are epoch=2, batch=32, lr=3e-5, warmup=0.1\n\n#### cmrc2018 (reading comprehension)\n\n| models | DEV |\n| ------ | ------ |\n| sibert_base | F1:87.521(88.628) EM:67.381(69.152) |\n| sialbert_middle | F1:87.6956(87.878) EM:67.897(68.624) |\n| HIT-iFLYTEK roberta_wwm_ext_base | F1:87.521(88.628) EM:67.381(69.152) |\n| brightmart roberta_middle | F1:86.841(87.242) EM:67.195(68.313) |\n| brightmart roberta_large | **F1:88.608(89.431) EM:69.935(72.538)** |\n\n#### DRCD (reading comprehension)\n\n| models | DEV |\n| ------ | ------ |\n| siBert_base | F1:93.343(93.524) EM:87.968(88.28) |\n| siALBert_middle | F1:93.865(93.975) EM:88.723(88.961) |\n| HIT-iFLYTEK roberta_wwm_ext_base | F1:94.257(94.48) EM:89.291(89.642) |\n| brightmart roberta_large | **F1:94.933(95.057) EM:90.113(90.238)** |\n\n#### CJRC (reading comprehension with yes, no, and unknown answers)\n\n| models | DEV |\n| ------ | ------ |\n| siBert_base | F1:80.714(81.14) EM:64.44(65.04) |\n| siALBert_middle | F1:80.9838(81.299) EM:63.796(64.202) |\n| HIT-iFLYTEK roberta_wwm_ext_base | F1:81.510(81.684) EM:64.924(65.574) |\n| brightmart roberta_large | F1:80.16(80.475) EM:65.249(66.133) |\n\nThe reading comprehension comparison results are from \u003ca href=\"https://github.com/ewrfcas/bert_cn_finetune\"\u003ebert_cn_finetune\u003c/a\u003e\n\n? 
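entries will soon be updated with concrete values\n\nRoBERTa Chinese Version\n-------------------------------------------------\nThe Chinese pre-trained RoBERTa models in this project are models trained in the main spirit of the RoBERTa paper. This includes:\n\n    1、Changes to data generation and the training task: next sentence prediction is removed, and training data is taken contiguously from a single document (see Model Input Format and Next Sentence Prediction, DOC-SENTENCES, in the RoBERTa paper)\n    \n    2、Larger and more diverse data: trained on 30 GB of Chinese text containing 300 million sentences and 10 billion characters (tokens). The data comes from news, community discussions, and several encyclopedias, and covers hundreds of thousands of topics,\n    \n    so it is diverse (for even more diversity, web books, novels, story-style literature, Weibo posts, and so on could be added).\n    \n    3、Longer training: nearly 200k training steps in total, seeing nearly 1.6 billion training instances; training took 24 hours on a Cloud TPU v3-256, which corresponds to about one month on a TPU v3-8 (128 GB of memory).\n    \n    4、Larger batches: a very large batch size (8k) is used.\n    \n    5、Tuned hyperparameters, such as those of the optimizer.\n\nIn addition, the Chinese version in this project uses whole word masking (whole word mask). With whole word masking, if some of the WordPiece subwords of a complete word are masked, the remaining parts of that word are masked as well.\n\nThis project does not implement dynamic masking directly. Instead, each training sample is duplicated into several copies, each with a different mask; increasing the number of copies approximates the effect of dynamic masking.\n\n##### Instructions for Use\n\nThe models in this project were trained with a sequence length of 256, so they likely work well on inputs within that length; if your task has longer inputs (for example a sequence length of 512), performance may suffer.\n\nOne user split long sequences with a sliding window and still obtained good results; see \u003ca href=\"https://github.com/brightmart/roberta_zh/issues/16\"\u003e#issue-16\u003c/a\u003e\n\n##### Chinese Whole Word Mask\n\n| Description | Example |\n| :------- | :--------- |\n| Original text | 使用语言模型来预测下一个词的probability。 |\n| Segmented text | 使用 语言 模型 来 预测 下 一个 词 的 probability 。 |\n| Original mask input | 使 用 语 言 [MASK] 型 来 [MASK] 测 下 一 个 词 的 pro [MASK] ##lity 。 |\n| Whole word mask input | 使 用 语 言 [MASK] [MASK] 来 [MASK] [MASK] 下 一 个 词 的 [MASK] [MASK] [MASK] 。 |\n\nLoading the Model (using the sentence pair matching task LCQMC as an example)\n-------------------------------------------------\n\nDownload the \u003ca href=\"https://drive.google.com/open?id=1HXYMqsXjmA5uIfu_SFqP7r_vZZG-m_H0\"\u003eLCQMC\u003c/a\u003e dataset, which includes training, validation, and test sets. The training set contains 240,000 colloquial Chinese sentence pairs labeled 1 or 0: 1 means the two sentences are semantically similar, 0 means they are not.\n\nTensorFlow version:\n\n    1、Clone this project: git clone https://github.com/brightmart/roberta_zh\n    \n    2、Change into the project directory (roberta_zh).\n    \n      Assume you have downloaded the pre-trained RoBERTa model and unpacked it into the roberta_zh_large directory of this project, i.e. roberta_zh/roberta_zh_large\n    \n    Run the command:\n  \n    export BERT_BASE_DIR=./roberta_zh_large\n    export MY_DATA_DIR=./data/lcqmc\n    python run_classifier.py \\\n      --task_name=lcqmc_pair \\\n      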
--do_train=true \\\n      --do_eval=true \\\n      --data_dir=$MY_DATA_DIR \\\n      --vocab_file=$BERT_BASE_DIR/vocab.txt \\\n      --bert_config_file=$BERT_BASE_DIR/bert_config_large.json \\\n      --init_checkpoint=$BERT_BASE_DIR/roberta_zh_large_model.ckpt \\\n      --max_seq_length=128 \\\n      --train_batch_size=64 \\\n      --learning_rate=2e-5 \\\n      --num_train_epochs=3 \\\n      --output_dir=./checkpoint_lcqmc\n    \n    Note: task_name is lcqmc_pair. A processor for the LCQMC task has already been added to run_classifier.py and registered in processors; it selects the lcqmc task and loads the training and validation data.\n\nFor loading with PyTorch, see \u003ca href=\"https://github.com/brightmart/roberta_zh/issues/9\"\u003eissue 9\u003c/a\u003e for now; more detailed instructions will be provided soon.\n\nPre-training\n-------------------------------------------------\n#### 1) Pre-training data\nYou can train on data from the domain of your task, or select the part of a general corpus that is relevant to your domain.\n\nFor general corpora, see \u003ca href=\"https://github.com/brightmart/nlp_chinese_corpus\"\u003enlp_chinese_corpus\u003c/a\u003e, which includes several datasets with tens of millions of sentences each.\n\n#### 2) Generate pre-training data \nThis step takes data contiguously from a single document, following the DOC-SENTENCES format, and applies whole word masking (whole word mask).\n\nShell script: converts multiple txt files to tfrecord data in batch.\n\n    For example, to convert txt files 1 through 10 to tfrecords files:\n\n    nohup bash create_pretrain_data.sh 1 10 \u0026 \n                                                                                 \n    Note: in our experiments, whole word masking at a 15% rate made the model hard to train and slow to converge, so we used a 10% rate.\n\n#### 3) Run the pre-training command\nThe next sentence prediction task is removed.\n    \n    export BERT_BASE_DIR=\u003cpath_of_roberta_or_bert_model\u003e\n    nohup python3 run_pretraining.py --input_file=./tf_records_all/tf*.tfrecord  \\\n    --output_dir=my_new_model_path --do_train=True --do_eval=True --bert_config_file=$BERT_BASE_DIR/bert_config.json \\\n    --train_batch_size=8192 --max_seq_length=256 --max_predictions_per_seq=23 \\\n    --num_train_steps=200000 --num_warmup_steps=10000 --learning_rate=1e-4    \\\n    --save_checkpoints_steps=3000  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt  \u0026\n\n    Note: if you train from scratch, you do not need to specify init_checkpoint;\n    
If you continue training from an existing model, set BERT_BASE_DIR to its path and make sure that bert_config_file and init_checkpoint point to the corresponding files;\n    Domain-specific pre-training does not need to run very long.\n\nLearning Curve\n-------------------------------------------------\n\u003cimg src=\"https://github.com/brightmart/roberta_zh/blob/master/resources/RoBERTa_zh_Large_Learning_Curve.png\"  width=\"70%\" height=\"60%\" /\u003e\n\nMemory Requirements: Trade-off between Batch Size and Sequence Length\n-------------------------------------------------\n\nSystem       | Seq Length | Max Batch Size\n------------ | ---------- | --------------\n`RoBERTa-Base`  | 64         | 64\n...          | 128        | 32\n...          | 256        | 16\n...          | 320        | 14\n...          | 384        | 12\n...          | 512        | 6\n`RoBERTa-Large` | 64         | 12\n...          | 128        | 6\n...          | 256        | 2\n...          | 320        | 1\n...          | 384        | 0\n...          | 512        | 0\n\n\n\n#### QQ group for technical discussion and questions: 836811304\n\nIf you have any questions, you can raise an issue or send me an email: brightmart@hotmail.com.\n\nYou can also send a pull request to report your performance on your task, or to add instructions on how to load the models in PyTorch, and so on.\n\nIf you have ideas for building a better-performing pre-trained Chinese model, please also let me know.\n\nPlease report the accuracy on your task and how it compares with other models.\n\n\nOther project contributors:\n-------------------------------------------------\n\u003ca href=\"https://github.com/skyhawk1990\"\u003e skyhawk1990\u003c/a\u003e\n\n\n##### Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC)\n\n\n\n\nReference\n-------------------------------------------------\n1、\u003ca href=\"https://arxiv.org/pdf/1907.11692.pdf\"\u003eRoBERTa: A Robustly Optimized BERT Pretraining Approach\u003c/a\u003e\n\n2、\u003ca href=\"https://arxiv.org/pdf/1906.08101.pdf\"\u003ePre-Training with Whole Word Masking for Chinese BERT\u003c/a\u003e\n\n3、\u003ca href=\"https://arxiv.org/pdf/1810.04805.pdf\"\u003eBERT: Pre-training of Deep 
Bidirectional Transformers for Language Understanding\u003c/a\u003e\n\n4、\u003ca href=\"https://aclweb.org/anthology/C18-1166\"\u003eLCQMC: A Large-scale Chinese Question Matching Corpus\u003c/a\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbrightmart%2Froberta_zh","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbrightmart%2Froberta_zh","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbrightmart%2Froberta_zh/lists"}