{"id":13752987,"url":"https://github.com/Lisennlp/TinyBert","last_synced_at":"2025-05-09T20:34:41.434Z","repository":{"id":41062538,"uuid":"277869435","full_name":"Lisennlp/TinyBert","owner":"Lisennlp","description":"简洁易用版TinyBert：基于Bert进行知识蒸馏的预训练语言模型","archived":false,"fork":false,"pushed_at":"2020-10-24T06:08:56.000Z","size":2303,"stargazers_count":252,"open_issues_count":6,"forks_count":49,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-11-16T05:32:28.683Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Lisennlp.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-07-07T16:34:49.000Z","updated_at":"2024-10-30T01:24:16.000Z","dependencies_parsed_at":"2022-09-12T06:50:13.626Z","dependency_job_id":null,"html_url":"https://github.com/Lisennlp/TinyBert","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Lisennlp%2FTinyBert","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Lisennlp%2FTinyBert/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Lisennlp%2FTinyBert/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Lisennlp%2FTinyBert/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Lisennlp","download_url":"https://codeload.github.com/Lisennlp/TinyBert/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253321806,"owners_count":21890
469,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T09:01:14.041Z","updated_at":"2025-05-09T20:34:37.618Z","avatar_url":"https://github.com/Lisennlp.png","language":"Python","funding_links":[],"categories":["BERT优化"],"sub_categories":["大语言对话模型及数据"],"readme":"# TinyBERT\n\nThis project is a modified version of Huawei's TinyBert. It simplifies the data-loading process so that you can train on your own data more easily.  \n\n\nThe TinyBert training process:  \n- 1. Distill from the general Bert base to obtain a general student model base;  \n- 2. Fine-tune Bert on task-specific data to obtain a fine-tuned Bert base model;  \n- 3. Distill again from the model obtained in step 2 to get a fine-tuned student model base. Note that the student model base in this step must be initialized from the general student model base of step 1; (embedding loss + hidden-layer loss + attention loss)  \n- 4. Repeat step 3, but initialize the student model base from the student model obtained in step 3. (prediction-label loss on the task)\n\n\nGeneral Distillation (distilling the general pre-trained language model)\n====================\n- Pre-training\n\n```\nsh script/general_train.sh\n                             \n```\nTask Distillation (distilling the fine-tuned pre-trained language model)\n====================\n- Pre-training\n\n```\n# Stage one\nsh script/task_train.sh one\n\n# Stage two\nsh script/task_train.sh two\n                             \n```\n\n## Data format  \n\n    data/train.txt,   data/eval.txt, \n\n## Data augmentation\n\n    sh script/augmentation.sh\n\n- Parameters:\n\n    python data_augmentation.py  \\   \n                --pretrained_bert_model  /nas/pretrain-bert/pretrain-pytorch/bert-base-uncased  \\   # pre-trained bert model   \n                --data_path  data/en_data.txt    \\  # path to the data to augment   \n                --glove_embs  /nas/lishengping/datas/glove.6B.300d.txt   \\   # glove word-vector file   \n                --M  15    \\  # select M words from the text as replacement candidates   \n                --N  30    \\  # replace with one of the top-N words predicted by bert for the mask   \n                --p  0.4   
\\  # probability that a word is replaced   \n                3\u003e\u00262 2\u003e\u00261 1\u003e\u00263 | tee logs/data_augmentation.log\n\nThe paper applies data augmentation in the fine-tune stage, and its later experiments show that augmentation plays a very important role.   \n\n**The augmentation process is as follows:** For each text in the task-specific data, first tokenize it with bert's own bpe tokenizer (i.e. BertTokenizer). If a word is still a complete word after tokenization (a single-piece word), replace it with the [MASK] symbol, then use bert to predict and select N candidate words for it. If the word is not a complete word after tokenization, use Glove word vectors and cosine similarity to select the N candidate words instead. Finally, decide with probability p whether to replace the word, which produces more text data.  \n\n- Data\n\n    Chinese word vectors have not been set up yet, so for now the English glove file and English raw data are used. For Chinese, just switch the pre-trained model to a Chinese bert and the glove file to a Chinese word-vector file. Samples of the raw data and the augmented data are in data/en_data.txt and data/aug_en_data.txt\n\n\n## Evaluation  \n\nTo be continued...\n\n\n## Official version\n\n=================1st version to reproduce our results in the paper ===========================\n\n[General_TinyBERT(4layer-312dim)](https://drive.google.com/uc?export=download\u0026id=1dDigD7QBv1BmE6pWU71pFYPgovvEqOOj) \n\n[General_TinyBERT(6layer-768dim)](https://drive.google.com/uc?export=download\u0026id=1wXWR00EHK-Eb7pbyw0VP234i2JTnjJ-x)\n\n=================2nd version (2019/11/18) trained with more (book+wiki) and no `[MASK]` corpus =======\n\n[General_TinyBERT_v2(4layer-312dim)](https://drive.google.com/open?id=1PhI73thKoLU2iliasJmlQXBav3v33-8z)\n\n[General_TinyBERT_v2(6layer-768dim)](https://drive.google.com/open?id=1r2bmEsQe4jUBrzJknnNaBJQDgiRKmQjF)\n\n\nWe here also provide the distilled TinyBERT(both 4layer-312dim and 6layer-768dim) of all GLUE tasks for evaluation. Every task has its own folder where the corresponding model has been saved.\n\n[TinyBERT(4layer-312dim)](https://drive.google.com/uc?export=download\u0026id=1_sCARNCgOZZFiWTSgNbE7viW_G5vIXYg) \n\n[TinyBERT(6layer-768dim)](https://drive.google.com/uc?export=download\u0026id=1Vf0ZnMhtZFUE0XoD3hTXc6QtHwKr_PwS)\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FLisennlp%2FTinyBert","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FLisennlp%2FTinyBert","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FLisennlp%2FTinyBert/lists"}