{"id":13754256,"url":"https://github.com/vdogmcgee/SimCSE-Chinese-Pytorch","last_synced_at":"2025-05-09T22:31:35.802Z","repository":{"id":48181655,"uuid":"390987660","full_name":"vdogmcgee/SimCSE-Chinese-Pytorch","owner":"vdogmcgee","description":"A reproduction of SimCSE on Chinese, supervised + unsupervised","archived":false,"fork":false,"pushed_at":"2021-12-14T10:01:38.000Z","size":78,"stargazers_count":272,"open_issues_count":15,"forks_count":48,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-21T04:31:21.357Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vdogmcgee.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-07-30T08:20:30.000Z","updated_at":"2025-02-06T07:01:23.000Z","dependencies_parsed_at":"2022-08-12T19:41:11.652Z","dependency_job_id":null,"html_url":"https://github.com/vdogmcgee/SimCSE-Chinese-Pytorch","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vdogmcgee%2FSimCSE-Chinese-Pytorch","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vdogmcgee%2FSimCSE-Chinese-Pytorch/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vdogmcgee%2FSimCSE-Chinese-Pytorch/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vdogmcgee%2FSimCSE-Chinese-Pytorch/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/vdogmcgee","download_url":"https://codeload.github.com/vdogmcgee/SimCSE-Chinese-Pytorch/tar.gz/refs/heads/main","host
":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253335704,"owners_count":21892714,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T09:01:52.307Z","updated_at":"2025-05-09T22:31:35.790Z","avatar_url":"https://github.com/vdogmcgee.png","language":"Python","funding_links":[],"categories":["Text Matching, Text Retrieval, Text Similarity"],"sub_categories":["Other_Text Generation, Text Dialogue"],"readme":"![](https://img.shields.io/badge/license-MIT-blue.svg) \n![](https://img.shields.io/badge/Python-3.6.12-blue.svg)\n![](https://img.shields.io/badge/torch-1.7.0-brightgreen.svg)\n![](https://img.shields.io/badge/transformers-4.4.1-brightgreen.svg)\n![](https://img.shields.io/badge/scikitlearn-0.24.0-brightgreen.svg)\n![](https://img.shields.io/badge/tqdm-4.49.0-brightgreen.svg)\n![](https://img.shields.io/badge/jsonlines-2.0.0-brightgreen.svg)\n![](https://img.shields.io/badge/loguru-0.5.3-brightgreen.svg)\n\n\n\n# SimCSE-Chinese-Pytorch\n\nA reproduction of SimCSE on Chinese, unsupervised + supervised\n\n### 1. Background\n\nI recently read the SimCSE paper and found its approach interesting, so I reimplemented and evaluated it in PyTorch.\n\n- Paper: https://arxiv.org/pdf/2104.08821.pdf\n- Official implementation: https://github.com/princeton-nlp/SimCSE\n\n### 2. Files\n\n```shell\n\u003e datasets\t\tdataset folder\n   \u003e cnsd-snli\n   \u003e STS-B\n\u003e pretrained_model\tfolder for the pretrained models\n\u003e saved_model\t\tfolder for models saved after fine-tuning\n  data_preprocess.py\tpreprocessing for the SNLI dataset\n  simcse_sup.py\t\tsupervised training\n  simcse_unsup.py\tunsupervised training\n```\n\n### 3. 
Usage\n\nPlace the public datasets and pretrained models in the directories below, and check that the paths in the code match.\n\n```python\n# Pretrained model directory\nBERT = 'pretrained_model/bert_pytorch'\nmodel_path = BERT \n# Where the fine-tuned parameters are saved\nSAVE_PATH = './saved_model/simcse_unsup.pt'\n# Dataset paths\nSNIL_TRAIN = './datasets/cnsd-snli/train.txt'\nSTS_TRAIN = './datasets/STS-B/cnsd-sts-train.txt'\nSTS_DEV = './datasets/STS-B/cnsd-sts-dev.txt'\nSTS_TEST = './datasets/STS-B/cnsd-sts-test.txt'\n```\n\nData preprocessing (run this script first):\n\n```shell\npython data_preprocess.py\n```\n\nUnsupervised training:\n\n```shell\npython simcse_unsup.py\n```\n\nSupervised training:\n\n```shell\npython simcse_sup.py\n```\n\n### 4. Downloads\n\nDatasets:\n\n- CNSD: https://github.com/pluto-junzeng/CNSD\n\nPretrained models:\n\n- [BERT](https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese.tar.gz)\n- [BERT-wwm](https://drive.google.com/file/d/1AQitrjbvCWc51SYiLN-cJq4e0WiNN4KY/view)\n- [BERT-wwm-ext](https://drive.google.com/file/d/1iNeYFhCBJWeUsIlnW_2K6SMwXkM4gLb_/view)\n- [RoBERTa-wwm-ext](https://drive.google.com/file/d/1eHM3l4fMo6DsQYGmey7UZGiTmQquHw25/view)\n\n### 5. Evaluation\n\nThe evaluation metric is the Spearman correlation coefficient.\n\nUnsupervised: batch_size=64, lr=1e-5, dropout_rate=0.3, pooling=cls, 10000 sampled examples\n\n| Model           | STS-B dev | STS-B test |\n| :-------------- | --------- | ---------- |\n| BERT            | 0.7308    | 0.6725     |\n| BERT-wwm        | 0.7229    | 0.6628     |\n| BERT-wwm-ext    | 0.7271    | 0.6669     |\n| RoBERTa-wwm-ext | 0.7558    | 0.7141     |\n\nSupervised: batch_size=64, lr=1e-5, pooling=cls\n\n| Model           | STS-B dev | STS-B test | Samples needed to converge |\n| :-------------- | --------- | ---------- | -------------------------- |\n| BERT            | 0.8016    | 0.7624     | 23040                      |\n| BERT-wwm        | 0.8022    | 0.7572     | 16640                      |\n| BERT-wwm-ext    | 0.8081    | 0.7539     | 33280                      |\n| RoBERTa-wwm-ext | 0.8135    | 0.7763     | 38400                      |\n\n### 6. References\n\n- https://arxiv.org/pdf/2104.08821.pdf\n- Su Jianlin (苏剑林). (Apr. 26, 2021). *Is it still SOTA on Chinese tasks? We added some experiments for SimCSE* (《中文任务还是SOTA吗？我们给SimCSE补充了一些实验》) [Blog post]. 
Retrieved from https://kexue.fm/archives/8348\n- https://github.com/zhengyanzhao1997/NLP-model/tree/main/model/model/Torch_model/SimCSE-Chinese\n\n\n\n\n\n\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvdogmcgee%2FSimCSE-Chinese-Pytorch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvdogmcgee%2FSimCSE-Chinese-Pytorch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvdogmcgee%2FSimCSE-Chinese-Pytorch/lists"}