{"id":13595116,"url":"https://github.com/bojone/SimCSE","last_synced_at":"2025-04-09T10:32:48.527Z","repository":{"id":39365914,"uuid":"361671656","full_name":"bojone/SimCSE","owner":"bojone","description":"SimCSE在中文任务上的简单实验","archived":false,"fork":false,"pushed_at":"2023-08-07T07:46:17.000Z","size":12,"stargazers_count":602,"open_issues_count":14,"forks_count":81,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-04-05T01:06:01.819Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bojone.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2021-04-26T08:19:11.000Z","updated_at":"2025-02-26T09:12:27.000Z","dependencies_parsed_at":"2024-01-16T22:18:56.182Z","dependency_job_id":"30c0eb7e-a752-4861-9e95-df09624538e2","html_url":"https://github.com/bojone/SimCSE","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bojone%2FSimCSE","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bojone%2FSimCSE/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bojone%2FSimCSE/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bojone%2FSimCSE/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bojone","download_url":"https://codeload.github.com/bojone/SimCSE/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248020
593,"owners_count":21034459,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T16:01:44.282Z","updated_at":"2025-04-09T10:32:48.505Z","avatar_url":"https://github.com/bojone.png","language":"Python","funding_links":[],"categories":["Python","文本匹配 文本检索 文本相似度"],"sub_categories":["其他_文本生成、文本对话"],"readme":"# SimCSE Chinese Tests\n\nTests of SimCSE on common Chinese datasets, covering five tasks: [ATEC](https://github.com/IceFlameWorm/NLP_Datasets/tree/master/ATEC), [BQ](http://icrc.hitsz.edu.cn/info/1037/1162.htm), [LCQMC](http://icrc.hitsz.edu.cn/Article/show/171.html), [PAWSX](https://arxiv.org/abs/1908.11828), and [STS-B](https://github.com/pluto-junzeng/CNSD).\n\n## Introduction\n\n- Blog: https://kexue.fm/archives/8348\n- Paper: [SimCSE: Simple Contrastive Learning of Sentence Embeddings](https://arxiv.org/abs/2104.08821)\n- Official implementation: https://github.com/princeton-nlp/SimCSE\n\n## Files\n\n```\n- utils.py  utility functions\n- eval.py  main evaluation script\n```\n\n## Evaluation\n\nCommand format:\n```\npython eval.py [model_type] [pooling] [task_name] [dropout_rate]\n```\n\nExample:\n```\npython eval.py BERT cls ATEC 0.3\n```\n\nAll four arguments are required. Their meanings are:\n```\n- model_type: the model; must be one of ['BERT', 'RoBERTa', 'WoBERT', 'RoFormer', 'BERT-large', 'RoBERTa-large', 'SimBERT', 'SimBERT-tiny', 'SimBERT-small'];\n- pooling: the pooling method; must be one of ['first-last-avg', 'last-avg', 'cls', 'pooler'];\n- task_name: the evaluation dataset; must be one of ['ATEC', 'BQ', 'LCQMC', 'PAWSX', 'STS-B'];\n- dropout_rate: a float giving the dropout rate; 0 disables dropout.\n```\n\n## Environment\nTested with tensorflow 1.14 + keras 2.3.1 + bert4keras 0.10.5. If errors occur with other version combinations, adjust the code according to the error messages.\n\n## Download\n\nPretrained models:\n- 
BERT: [chinese_L-12_H-768_A-12.zip](https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip)\n- RoBERTa: [chinese_roberta_wwm_ext_L-12_H-768_A-12.zip](https://github.com/ymcui/Chinese-BERT-wwm)\n- NEZHA: [NEZHA-base-WWM](https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/NEZHA-TensorFlow)\n- WoBERT: [chinese_wobert_plus_L-12_H-768_A-12.zip](https://github.com/ZhuiyiTechnology/WoBERT)\n- RoFormer: [chinese_roformer_L-12_H-768_A-12.zip](https://github.com/ZhuiyiTechnology/roformer)\n- SimBERT: [chinese_simbert_L-12_H-768_A-12.zip](https://github.com/ZhuiyiTechnology/simbert)\n- SimBERT-small: [chinese_simbert_L-6_H-384_A-12.zip](https://github.com/ZhuiyiTechnology/simbert)\n- SimBERT-tiny: [chinese_simbert_L-4_H-312_A-12.zip](https://github.com/ZhuiyiTechnology/simbert)\n\nThe semantic similarity datasets can be downloaded from the links given for each dataset, or from the Baidu Cloud link provided by the author.\n- Link: https://pan.baidu.com/s/1oXeLB_cFR9lB7CPkO5N_cQ Extraction code: nww9\n\nThe senteval_cn directory collects the evaluation datasets, and senteval_cn.zip is the same directory packed as an archive; download either one.\n\n## Related\n- BERT-whitening: https://github.com/bojone/BERT-whitening\n\n## Contact\n\nQQ group: 808623966. For the WeChat group, add the bot account spaces_ac_cn on WeChat.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbojone%2FSimCSE","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbojone%2FSimCSE","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbojone%2FSimCSE/lists"}