{"id":13535289,"url":"https://github.com/terrifyzhao/bert-utils","last_synced_at":"2025-05-16T07:00:21.531Z","repository":{"id":37663744,"uuid":"168135554","full_name":"terrifyzhao/bert-utils","owner":"terrifyzhao","description":"一行代码使用BERT生成句向量，BERT做文本分类、文本相似度计算","archived":false,"fork":false,"pushed_at":"2019-10-14T06:52:57.000Z","size":5568,"stargazers_count":1658,"open_issues_count":56,"forks_count":427,"subscribers_count":27,"default_branch":"master","last_synced_at":"2025-04-08T16:06:53.972Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/terrifyzhao.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-01-29T10:24:18.000Z","updated_at":"2025-04-02T09:03:35.000Z","dependencies_parsed_at":"2022-08-08T21:15:32.955Z","dependency_job_id":null,"html_url":"https://github.com/terrifyzhao/bert-utils","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/terrifyzhao%2Fbert-utils","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/terrifyzhao%2Fbert-utils/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/terrifyzhao%2Fbert-utils/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/terrifyzhao%2Fbert-utils/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/terrifyzhao","download_url":"https://codeload.github.com/terrifyzhao/bert-utils/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","k
ind":"github","repositories_count":254485025,"owners_count":22078764,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T08:00:52.828Z","updated_at":"2025-05-16T07:00:21.503Z","avatar_url":"https://github.com/terrifyzhao.png","language":"Python","readme":"# bert-utils\n\nThis project further simplifies Google's open-source [BERT](https://github.com/google-research/bert) code, making it easy to generate sentence vectors and run text classification.\n\n---\n\n***** New July 1st, 2019 *****\n+ Changed how the sentence-vector `graph` file is generated, speeding up startup. It is no longer regenerated as a temporary file on every run: the first run of extract_feature.py creates `tmp/result/graph`,\nand later runs read that file directly. If the contents of `args.py` are modified, delete the `tmp/result/graph` file.\n+ Fixed a bug where the code crashed when two processes generated sentence vectors at the same time.\n+ Switched the text-matching dataset to QA_corpus, which is more authoritative than the Ant Financial data.\n\n---\n\n1. Download the Chinese BERT model\n\nDownload link: https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip\n\n2. Put the downloaded model in the current directory.\n\n3. Sentence vector generation\n\nGenerating sentence vectors requires no fine-tuning; the pre-trained model is used as-is. See the `main` method of `extract_feature.py` for reference, and note that the argument must be a list.\n\nThe first call loads the graph and writes a new graph file under output_dir, so it is slow; later calls are much faster.\n```\nfrom extract_feature import BertVector\nbv = BertVector()\nbv.encode(['今天天气不错'])\n```\n\n4. Text classification\n\nText classification requires fine-tuning. First place your data in the `data` directory: the training set must be named `train.csv`, the validation set `dev.csv`, and the test set `test.csv`.\n`set_mode` must be called before training; see the `main` method of `similarity.py` for reference.\n\nTraining:\n```\nfrom similarity import BertSim\nimport tensorflow as tf\n\nbs = BertSim()\nbs.set_mode(tf.estimator.ModeKeys.TRAIN)\nbs.train()\n```\n\nValidation:\n```\nfrom similarity import BertSim\nimport tensorflow as tf\n\nbs = BertSim()\nbs.set_mode(tf.estimator.ModeKeys.EVAL)\nbs.eval()\n```\n\nTesting:\n```\nfrom similarity import BertSim\nimport tensorflow as tf\n\nbs = BertSim()\nbs.set_mode(tf.estimator.ModeKeys.PREDICT)\nbs.test()\n```\n\n5. The demo ships with the QA_corpus dataset ([download page](http://icrc.hitsz.edu.cn/info/1037/1162.htm)).\nFor how the data was built, see the paper `The BQ Corpus.pdf` in the attachments.","funding_links":[],"categories":["Pretrained Language Model","BERT language model and embedding:"],"sub_categories":["Repository"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fterrifyzhao%2Fbert-utils","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fterrifyzhao%2Fbert-utils","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fterrifyzhao%2Fbert-utils/lists"}