Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/Jun-jie-Huang/WhiteningBERT
source code for paper: WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach.
https://github.com/Jun-jie-Huang/WhiteningBERT
Last synced: 2 months ago
JSON representation
source code for paper: WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach.
- Host: GitHub
- URL: https://github.com/Jun-jie-Huang/WhiteningBERT
- Owner: Jun-jie-Huang
- License: mit
- Created: 2021-04-04T16:59:38.000Z (almost 4 years ago)
- Default Branch: main
- Last Pushed: 2021-04-06T07:07:16.000Z (almost 4 years ago)
- Last Synced: 2024-08-02T12:22:54.188Z (5 months ago)
- Language: Python
- Size: 1.61 MB
- Stars: 56
- Watchers: 1
- Forks: 8
- Open Issues: 2
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-huggingface - WhiteningBERT - An easy unsupervised sentence embedding approach with whitening. (🥡 Text Representation)
README
# WhiteningBERT
Source code and data for paper [WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach](https://arxiv.org/abs/2104.01767).
## Preparation
```
git clone https://github.com/Jun-jie-Huang/WhiteningBERT.git
pip install -r requirements.txt
cd examples/evaluation
```## Usage
#### Datasets
We use seven STS datasets, including STSBenchmark, SICK-Relatedness, STS12, STS13, STS14, STS15, STS16.
The processed data can be found in [./examples/datasets/](./examples/datasets/).
#### Run
1. To run a quick demo:
```
python evaluation_stsbenchmark.py \
--pooling aver \
--layer_num 1,12 \
--whitening \
--encoder_name bert-base-cased
```Specify `--pooing` with `cls` or `aver` to choose whether use the [CLS] token or averaging all tokens. Also specify `--layer_num` to combine layers, separated by a comma.
2. To enumerate all possible combinations of two layers and automatically evaluate the combinations consequently:
```
python evaluation_stsbenchmark_layer2.py \
--pooling aver \
--whitening \
--encoder_name bert-base-cased
```3. To enumerate all possible combinations of N layers:
```
python evaluation_stsbenchmark_layerN.py \
--pooling aver \
--whitening \
--encoder_name bert-base-cased\
--combination_num 4
```4. You can also save the embeddings of the sentences
```
python evaluation_stsbenchmark_save_embed.py \
--pooling aver \
--layer_num 1,12 \
--whitening \
--encoder_name bert-base-cased \
--summary_dir ./save_embeddings
```#### A list of PLMs you can select:
- `bert-base-uncased` , ` bert-large-uncased `
- `roberta-base`, `roberta-large `
- `bert-base-multilingual-uncased`
- `sentence-transformers/LaBSE`
- `albert-base-v1 `, `albert-large-v1 `
- `microsoft/layoutlm-base-uncased `, `microsoft/layoutlm-large-uncased `
- `SpanBERT/spanbert-base-cased `, `SpanBERT/spanbert-large-cased `
- `microsoft/deberta-base `, `microsoft/deberta-large `
- `google/electra-base-discriminator`
- `google/mobilebert-uncased `
- `microsoft/DialogRPT-human-vs-rand `
- `distilbert-base-uncased`
- ......## Acknowledgements
Codes are adapted from the repos of the EMNLP19 paper [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://github.com/UKPLab/sentence-transformers) and the EMNLP20 paper [An Unsupervised Sentence Embedding Method by Mutual Information Maximization](https://github.com/yanzhangnlp/IS-BERT/)