https://github.com/princeton-nlp/simcse

[EMNLP 2021] SimCSE: Simple Contrastive Learning of Sentence Embeddings https://arxiv.org/abs/2104.08821
Host: GitHub
URL: https://github.com/princeton-nlp/simcse
Owner: princeton-nlp
License: mit
Created: 2021-04-16T02:57:04.000Z (about 4 years ago)
Default Branch: main
Last Pushed: 2024-10-16T14:35:29.000Z (9 months ago)
Last Synced: 2025-04-23T18:56:56.800Z (3 months ago)
Topics: nlp, sentence-embeddings
Language: Python
Homepage:
Size: 40.4 MB
Stars: 3,545
Watchers: 27
Forks: 524
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

## SimCSE: Simple Contrastive Learning of Sentence Embeddings

This repository contains the code and pre-trained models for our paper [SimCSE: Simple Contrastive Learning of Sentence Embeddings](https://arxiv.org/abs/2104.08821).

**************************** **Updates** ****************************

* 8/31: Our paper has been accepted to EMNLP! Please check out our [updated paper](https://arxiv.org/pdf/2104.08821.pdf) (with updated numbers and baselines). 

* 5/12: We updated our [unsupervised models](#model-list) with new hyperparameters and better performance.

* 5/10: We released our [sentence embedding tool](#getting-started) and [demo code](./demo).

* 4/23: We released our [training code](#training).

* 4/20: We released our [model checkpoints](#use-simcse-with-huggingface) and [evaluation code](#evaluation).

* 4/18: We released [our paper](https://arxiv.org/pdf/2104.08821.pdf). Check it out!

## Quick Links

  - [Overview](#overview)

  - [Getting Started](#getting-started)

  - [Model List](#model-list)

  - [Use SimCSE with Huggingface](#use-simcse-with-huggingface)

  - [Train SimCSE](#train-simcse)

    - [Requirements](#requirements)

    - [Evaluation](#evaluation)

    - [Training](#training)

  - [Bugs or Questions?](#bugs-or-questions)

  - [Citation](#citation)

  - [SimCSE Elsewhere](#simcse-elsewhere)

## Overview

We propose a simple contrastive learning framework that works with both unlabeled and labeled data. Unsupervised SimCSE simply takes an input sentence and predicts itself in a contrastive learning framework, with only standard dropout used as noise. Our supervised SimCSE incorporates annotated pairs from NLI datasets into contrastive learning by using `entailment` pairs as positives and `contradiction` pairs as hard negatives. The following figure is an illustration of our models.

![](figure/model.png)
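
The contrastive objective described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the repository's training code: the toy vectors stand in for two dropout-perturbed encodings of the same sentences, and the helper names and the temperature value `0.05` are assumptions chosen for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(view_a, view_b, temp=0.05):
    """Mean cross-entropy where view_b[i] is the positive for view_a[i]
    and every other view_b[j] in the batch acts as a negative."""
    losses = []
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temp for other in view_b]
        top = max(logits)  # subtract the max for numerical stability
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy stand-ins for two encodings of the same three sentences (hypothetical values).
view_a = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
view_b = [[0.9, 0.2, 0.0], [0.1, 1.0, 0.0], [0.0, 0.1, 0.9]]

aligned_loss = contrastive_loss(view_a, view_b)
# Rotating view_b pairs each sentence with the wrong partner, so the loss should rise.
shuffled_loss = contrastive_loss(view_a, view_b[1:] + view_b[:1])
print(aligned_loss < shuffled_loss)
```

In the supervised setting, the `contradiction` sentence of each pair would contribute one more logit per row as a hard negative.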

## Getting Started

We provide an easy-to-use sentence embedding tool based on our SimCSE model (see our [Wiki](https://github.com/princeton-nlp/SimCSE/wiki) for detailed usage). To use the tool, first install the `simcse` package from PyPI

```bash

pip install simcse

```

Or directly install it from our code

```bash

python setup.py install

```

Note that if you want to enable GPU encoding, you should install the correct version of PyTorch that supports CUDA. See [PyTorch official website](https://pytorch.org) for instructions.

After installing the package, you can load our model with just two lines of code:

```python

from simcse import SimCSE

model = SimCSE("princeton-nlp/sup-simcse-bert-base-uncased")

```

See [model list](#model-list) for a full list of available models. 

Then you can use our model for **encoding sentences into embeddings**

```python

embeddings = model.encode("A woman is reading.")

```

**Compute the cosine similarities** between two groups of sentences

```python
sentences_a = ['A woman is reading.', 'A man is playing a guitar.']
sentences_b = ['He plays guitar.', 'A woman is making a photo.']
similarities = model.similarity(sentences_a, sentences_b)
```
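
To make the pairwise comparison concrete, here is a minimal sketch of a cosine similarity matrix between two groups of vectors. It assumes every item in the first group is scored against every item in the second; the function name `similarity_matrix` and the 2-d vectors are hypothetical and are not part of the `simcse` package.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(embs_a, embs_b):
    """Return a len(embs_a) x len(embs_b) matrix of cosine similarities."""
    return [[cosine(a, b) for b in embs_b] for a in embs_a]

# Made-up embeddings; real sentence embeddings have hundreds of dimensions.
embs_a = [[1.0, 0.0], [0.0, 1.0]]
embs_b = [[1.0, 1.0], [0.0, 2.0], [-1.0, 0.0]]

sims = similarity_matrix(embs_a, embs_b)
for row in sims:
    print([round(value, 3) for value in row])
```

Cosine similarity ignores vector length, so `[0.0, 1.0]` and `[0.0, 2.0]` count as identical directions.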

Or build an index for a group of sentences and **search** among them

```python
sentences = ['A woman is reading.', 'A man is playing a guitar.']
model.build_index(sentences)
results = model.search("He plays guitar.")
```
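
Conceptually, searching an index means ranking the stored sentences by similarity to the query. The sketch below shows a brute-force version of that idea under stated assumptions: `brute_force_search` is a hypothetical helper, the 2-d vectors are made up, and the `simcse` package may rank, threshold, or format results differently.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def brute_force_search(query_vec, index_vecs, sentences, top_k=2):
    """Score the query against every indexed vector and keep the best top_k."""
    scored = [(sent, cosine(query_vec, vec)) for sent, vec in zip(sentences, index_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

sentences = ["A woman is reading.", "A man is playing a guitar."]
index_vecs = [[0.9, 0.1], [0.1, 0.9]]  # made-up embeddings for the two sentences
query_vec = [0.2, 0.8]                 # made-up embedding for "He plays guitar."

hits = brute_force_search(query_vec, index_vecs, sentences, top_k=1)
print(hits[0][0])
```

A brute-force scan costs one comparison per indexed sentence, which is why a dedicated similarity search library becomes attractive for large collections.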

We also support [faiss](https://github.com/facebookresearch/faiss), an efficient similarity search library. Just install the package following the [instructions](https://github.com/princeton-nlp/SimCSE/wiki/Installation) and `simcse` will automatically use `faiss` for efficient search.

**WARNING**: We have found that `faiss` does not support NVIDIA Ampere GPUs (3090 and A100) well. In that case, you should switch to other GPUs or install the CPU version of the `faiss` package.

We also provide an easy-to-build [demo website](./demo) to show how SimCSE can be used in sentence retrieval. The code is based on [DensePhrases](https://arxiv.org/abs/2012.12624)' [repo](https://github.com/princeton-nlp/DensePhrases) and [demo](http://densephrases.korea.ac.kr) (many thanks to the authors of DensePhrases).

## Model List

Our released models are listed as follows. You can import these models by using the `simcse` package or using [HuggingFace's Transformers](https://github.com/huggingface/transformers).

| Model | Avg. STS |
|:-------------------------------|:--------:|
| [princeton-nlp/unsup-simcse-bert-base-uncased](https://huggingface.co/princeton-nlp/unsup-simcse-bert-base-uncased) | 76.25 |
| [princeton-nlp/unsup-simcse-bert-large-uncased](https://huggingface.co/princeton-nlp/unsup-simcse-bert-large-uncased) | 78.41 |
| [princeton-nlp/unsup-simcse-roberta-base](https://huggingface.co/princeton-nlp/unsup-simcse-roberta-base) | 76.57 |
| [princeton-nlp/unsup-simcse-roberta-large](https://huggingface.co/princeton-nlp/unsup-simcse-roberta-large) | 78.90 |
| [princeton-nlp/sup-simcse-bert-base-uncased](https://huggingface.co/princeton-nlp/sup-simcse-bert-base-uncased) | 81.57 |
| [princeton-nlp/sup-simcse-bert-large-uncased](https://huggingface.co/princeton-nlp/sup-simcse-bert-large-uncased) | 82.21 |
| [princeton-nlp/sup-simcse-roberta-base](https://huggingface.co/princeton-nlp/sup-simcse-roberta-base) | 82.52 |
| [princeton-nlp/sup-simcse-roberta-large](https://huggingface.co/princeton-nlp/sup-simcse-roberta-large) | 83.76 |

Note that the results are slightly better than what we have reported in the current version of the paper after adopting a new set of hyperparameters (for hyperparameters, see the [training](#training) section).

**Naming rules**: `unsup` and `sup` represent "unsupervised" (trained on Wikipedia corpus) and "supervised" (trained on NLI datasets) respectively.

## Use SimCSE with Huggingface

Besides using our provided sentence embedding tool, you can also easily import our models with HuggingFace's `transformers`:

```python
import torch
from scipy.spatial.distance import cosine
from transformers import AutoModel, AutoTokenizer

# Import our models. The package will take care of downloading the models automatically
tokenizer = AutoTokenizer.from_pretrained("princeton-nlp/sup-simcse-bert-base-uncased")
model = AutoModel.from_pretrained("princeton-nlp/sup-simcse-bert-base-uncased")

# Tokenize input texts
texts = [
    "There's a kid on a skateboard.",
    "A kid is skateboarding.",
    "A kid is inside the house."
]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Get the embeddings
with torch.no_grad():
    embeddings = model(**inputs, output_hidden_states=True, return_dict=True).pooler_output

# Calculate cosine similarities
# Cosine similarities are in [-1, 1]. Higher means more similar
cosine_sim_0_1 = 1 - cosine(embeddings[0], embeddings[1])
cosine_sim_0_2 = 1 - cosine(embeddings[0], embeddings[2])

print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[1], cosine_sim_0_1))
print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[2], cosine_sim_0_2))
```

If you encounter any problems when loading the models directly through HuggingFace's API, you can also download the models manually from the above table and use `model = AutoModel.from_pretrained({PATH TO THE DOWNLOADED MODEL})`.

## Train SimCSE

In the following section, we describe how to train a SimCSE model by using our code.

### Requirements

First, install PyTorch by following the instructions from [the official website](https://pytorch.org). To faithfully reproduce our results, please use the `1.7.1` version that corresponds to your platform and CUDA version. PyTorch versions higher than `1.7.1` should also work. For example, if you use Linux and **CUDA11** ([how to check CUDA version](https://varhowto.com/check-cuda-version/)), install PyTorch with the following command,

```bash

pip install torch==1.7.1+cu110 -f https://download.pytorch.org/whl/torch_stable.html

```

If you instead use **CUDA** `<11` or **CPU**, install PyTorch by the following command,

```bash

pip install torch==1.7.1

```

Then run the following script to install the remaining dependencies,

```bash

pip install -r requirements.txt

```

### Evaluation

Our evaluation code for sentence embeddings is based on a modified version of [SentEval](https://github.com/facebookresearch/SentEval). It evaluates sentence embeddings on semantic textual similarity (STS) tasks and downstream transfer tasks. For STS tasks, our evaluation takes the "all" setting and reports Spearman's correlation. See [our paper](https://arxiv.org/pdf/2104.08821.pdf) (Appendix B) for evaluation details.
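
For readers unfamiliar with the metric, Spearman's correlation compares the ranking of predicted similarities with the ranking of human scores. The sketch below is a plain-Python illustration with made-up numbers; it is not the SentEval implementation, and the function names are hypothetical.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman's correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

predicted = [0.91, 0.35, 0.62, 0.10]  # made-up cosine similarities
gold = [4.8, 2.0, 3.5, 0.4]           # made-up human similarity scores
rho = spearman(predicted, gold)
print(round(rho, 3))
```

Because only the ordering matters, the predicted similarities do not need to be on the same scale as the human scores.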

Before evaluation, please download the evaluation datasets by running

```bash

cd SentEval/data/downstream/

bash download_dataset.sh

```

Then return to the root directory. You can evaluate any `transformers`-based pre-trained model using our evaluation code. For example,

```bash
python evaluation.py \
    --model_name_or_path princeton-nlp/sup-simcse-bert-base-uncased \
    --pooler cls \
    --task_set sts \
    --mode test
```

which is expected to output the results in a tabular format:

```
------ test ------
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness |  Avg. |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 75.30 | 84.67 | 80.19 | 85.40 | 80.82 |    84.26     |      80.39      | 81.58 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
```

Arguments for the evaluation script are as follows,

* `--model_name_or_path`: The name or path of a `transformers`-based pre-trained checkpoint. You can directly use the models in the above table, e.g., `princeton-nlp/sup-simcse-bert-base-uncased`.

* `--pooler`: Pooling method. Now we support

    * `cls` (default): Use the representation of `[CLS]` token. A linear+activation layer is applied after the representation (it's in the standard BERT implementation). If you use **supervised SimCSE**, you should use this option.

    * `cls_before_pooler`: Use the representation of `[CLS]` token without the extra linear+activation. If you use **unsupervised SimCSE**, you should use this option.

    * `avg`: Average embeddings of the last layer. If you use checkpoints of SBERT/SRoBERTa ([paper](https://arxiv.org/abs/1908.10084)), you should use this option.

    * `avg_top2`: Average embeddings of the last two layers.

    * `avg_first_last`: Average embeddings of the first and last layers. If you use vanilla BERT or RoBERTa, this works the best. Note that in the paper we reported the average of last layer and the static word embedding; we fixed this to be last and first layer average and it led to better performance. See [this issue](https://github.com/princeton-nlp/SimCSE/issues/285) for a detailed discussion.

* `--mode`: Evaluation mode

    * `test` (default): The default test mode. To faithfully reproduce our results, you should use this option.

    * `dev`: Report the development set results. Note that in STS tasks, only `STS-B` and `SICK-R` have development sets, so we only report their numbers. It also uses a fast mode for transfer tasks, so the running time is much shorter than the `test` mode (though numbers are slightly lower).

    * `fasttest`: The same as `test`, but with a fast mode for transfer tasks, so the running time is much shorter and the reported transfer numbers may be lower.

* `--task_set`: What set of tasks to evaluate on (if set, it will override `--tasks`)

    * `sts` (default): Evaluate on STS tasks, including `STS 12~16`, `STS-B` and `SICK-R`. This is the most commonly-used set of tasks to evaluate the quality of sentence embeddings.

    * `transfer`: Evaluate on transfer tasks.

    * `full`: Evaluate on both STS and transfer tasks.

    * `na`: Manually set tasks by `--tasks`.

* `--tasks`: Specify which dataset(s) to evaluate on. Will be overridden if `--task_set` is not `na`. See the code for a full list of tasks.
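
The pooling options listed above can be illustrated with a small plain-Python sketch for a single sentence. It rests on assumptions: `hidden_states` here holds only the transformer layer outputs (first layer first, last layer last), the toy numbers are made up, and `pool` and `masked_mean` are hypothetical helpers rather than the repository's code. The `cls` option is omitted because it needs the model's learned linear+activation layer.

```python
def masked_mean(token_vectors, attention_mask):
    """Average token vectors over positions where the mask is 1."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

def mix(layer_a, layer_b):
    """Token-wise average of two layers."""
    return [[(a + b) / 2 for a, b in zip(u, v)] for u, v in zip(layer_a, layer_b)]

def pool(hidden_states, attention_mask, pooler):
    first, second_last, last = hidden_states[0], hidden_states[-2], hidden_states[-1]
    if pooler == "cls_before_pooler":
        return last[0]  # [CLS] is the first token of the sequence
    if pooler == "avg":
        return masked_mean(last, attention_mask)
    if pooler == "avg_top2":
        return masked_mean(mix(second_last, last), attention_mask)
    if pooler == "avg_first_last":
        return masked_mean(mix(first, last), attention_mask)
    raise ValueError("unknown pooler: " + pooler)

# Three layers, three tokens (the last token is padding), two dimensions.
hidden_states = [
    [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]],  # first layer
    [[0.0, 2.0], [2.0, 0.0], [9.0, 9.0]],  # second layer
    [[2.0, 4.0], [4.0, 6.0], [9.0, 9.0]],  # last layer
]
attention_mask = [1, 1, 0]

for name in ["cls_before_pooler", "avg", "avg_top2", "avg_first_last"]:
    print(name, pool(hidden_states, attention_mask, name))
```

The attention mask keeps padding tokens out of the averages, which matters when sentences in a batch have different lengths.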

### Training

**Data**

For unsupervised SimCSE, we sample 1 million sentences from English Wikipedia; for supervised SimCSE, we use the SNLI and MNLI datasets. You can run `data/download_wiki.sh` and `data/download_nli.sh` to download the two datasets.

**Training scripts**

We provide example training scripts for both unsupervised and supervised SimCSE. In `run_unsup_example.sh`, we provide a single-GPU (or CPU) example for the unsupervised version, and in `run_sup_example.sh` we give a **multiple-GPU** example for the supervised version. Both scripts call `train.py` for training. We explain the arguments below:

* `--train_file`: Training file path. We support "txt" files (one line for one sentence) and "csv" files (2-column: pair data with no hard negative; 3-column: pair data with one corresponding hard negative instance). You can use our provided Wikipedia or NLI data, or you can use your own data with the same format.

* `--model_name_or_path`: Pre-trained checkpoints to start with. For now we support BERT-based models (`bert-base-uncased`, `bert-large-uncased`, etc.) and RoBERTa-based models (`roberta-base`, `roberta-large`, etc.).

* `--temp`: Temperature for the contrastive loss.

* `--pooler_type`: Pooling method. It's the same as `--pooler` in the [evaluation part](#evaluation).

* `--mlp_only_train`: We have found that for unsupervised SimCSE, it works better to train the model with MLP layer but test the model without it. You should use this argument when training unsupervised SimCSE models.

* `--hard_negative_weight`: If using hard negatives (i.e., there are 3 columns in the training file), this is the logarithm of the weight. For example, if the weight is 1, then this argument should be set as 0 (default value).

* `--do_mlm`: Whether to use the MLM auxiliary objective. If True:

  * `--mlm_weight`: Weight for the MLM objective.

  * `--mlm_probability`: Masking rate for the MLM objective.
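
One way to read "logarithm of the weight" for `--hard_negative_weight` is that the value is added to the hard negative's logit before the softmax, which is equivalent to multiplying its exponential term by the weight. The sketch below checks that equivalence numerically. It is an assumption about how the flag is applied, not a copy of `train.py`, and the function names and logit values are hypothetical.

```python
import math

def positive_probability(pos_logit, neg_logits, hard_neg_logit, hard_negative_weight=0.0):
    """Softmax probability of the positive when hard_negative_weight is
    added to the hard negative's logit."""
    logits = [pos_logit] + neg_logits + [hard_neg_logit + hard_negative_weight]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    return exps[0] / sum(exps)

def positive_probability_explicit(pos_logit, neg_logits, hard_neg_logit, weight=1.0):
    """The same quantity written with an explicit multiplicative weight."""
    numerator = math.exp(pos_logit)
    denominator = (
        numerator
        + sum(math.exp(x) for x in neg_logits)
        + weight * math.exp(hard_neg_logit)
    )
    return numerator / denominator

# Passing log(3) as the flag value matches a multiplicative weight of 3.
p_log = positive_probability(2.0, [0.5, -0.3], 1.2, hard_negative_weight=math.log(3.0))
p_mul = positive_probability_explicit(2.0, [0.5, -0.3], 1.2, weight=3.0)

# The default flag value 0 corresponds to a weight of 1.
p_default = positive_probability(2.0, [0.5, -0.3], 1.2)
p_unit = positive_probability_explicit(2.0, [0.5, -0.3], 1.2, weight=1.0)
print(abs(p_log - p_mul) < 1e-12, abs(p_default - p_unit) < 1e-12)
```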

All the other arguments are standard Huggingface `transformers` training arguments. Some of the often-used arguments are: `--output_dir`, `--learning_rate`, `--per_device_train_batch_size`. Our example scripts also evaluate the model on the STS-B development set (you need to download the dataset following the [evaluation](#evaluation) section) and save the best checkpoint.

For results in the paper, we use Nvidia 3090 GPUs with CUDA 11. Using different types of devices or different versions of CUDA or other software may lead to slightly different performance.
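
If you prepare your own `--train_file`, the 3-column "csv" layout described above (sentence, positive, hard negative) can be written with Python's standard `csv` module. This is a minimal sketch: the header names `sent0`, `sent1` and `hard_neg` are an assumption for illustration, and the example writes to an in-memory buffer instead of a real file.

```python
import csv
import io

def write_triplets(rows, handle):
    """Write (sentence, positive, hard negative) rows as a 3-column CSV."""
    writer = csv.writer(handle)
    writer.writerow(["sent0", "sent1", "hard_neg"])  # header names are an assumption
    writer.writerows(rows)

rows = [
    (
        "A kid is skateboarding.",
        "There's a kid on a skateboard.",
        "A kid is inside the house.",
    ),
    (
        "He plays guitar, loudly.",  # the comma is quoted automatically
        "A man is playing a guitar.",
        "A woman is reading.",
    ),
]

buffer = io.StringIO()
write_triplets(rows, buffer)

# Read the data back to confirm that every row still has three columns.
parsed = list(csv.reader(io.StringIO(buffer.getvalue())))
print(len(parsed), [len(row) for row in parsed])
```

Using `csv.writer` rather than joining strings by hand keeps sentences that contain commas or quotes in a single column.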

**Hyperparameters**

We use the following hyperparameters for training SimCSE:

|               | Unsup. BERT | Unsup. RoBERTa | Sup.      |
|:--------------|:-----------:|:--------------:|:---------:|
| Batch size    | 64          | 512            | 512       |
| Learning rate (base)  | 3e-5 | 1e-5 | 5e-5 |
| Learning rate (large) | 1e-5 | 3e-5 | 1e-5 |

**Convert models**

Our saved checkpoints are slightly different from Huggingface's pre-trained checkpoints. Run `python simcse_to_huggingface.py --path {PATH_TO_CHECKPOINT_FOLDER}` to convert them. After that, you can evaluate the converted checkpoint with our [evaluation](#evaluation) code or directly use it [out of the box](#use-simcse-with-huggingface).

## Bugs or questions?

If you have any questions related to the code or the paper, feel free to email Tianyu (`[email protected]`) and Xingcheng (`[email protected]`). If you encounter any problems when using the code, or want to report a bug, you can open an issue. Please try to specify the problem with details so we can help you better and quicker!

## Citation

Please cite our paper if you use SimCSE in your work:

```bibtex
@inproceedings{gao2021simcse,
   title={{SimCSE}: Simple Contrastive Learning of Sentence Embeddings},
   author={Gao, Tianyu and Yao, Xingcheng and Chen, Danqi},
   booktitle={Empirical Methods in Natural Language Processing (EMNLP)},
   year={2021}
}
```

## SimCSE Elsewhere

We thank the community for its efforts in extending SimCSE!

- [Jianlin Su](https://github.com/bojone) has provided [a Chinese version of SimCSE](https://github.com/bojone/SimCSE).

- [AK391](https://github.com/AK391) integrated SimCSE into [Huggingface Spaces](https://huggingface.co/spaces) with [Gradio](https://github.com/gradio-app/gradio). See demo: [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/akhaliq/SimCSE)

- [Nils Reimers](https://github.com/nreimers) has implemented a `sentence-transformers`-based [training code](https://colab.research.google.com/drive/1gAjXcI4uSxDE_IcvZdswFYVAo7XvPeoU?usp=sharing#scrollTo=UXUsikOc6oiB) for SimCSE.