### Evaluation of Sentence Representations in Polish
This repository contains experiments related to dense representations of sentences in Polish. It includes code for evaluating different sentence representation methods such as aggregated word embeddings or neural sentence encoders, both multilingual and language-specific. This source code has been used in the following publications:

#### [[1]](https://aclanthology.org/2020.lrec-1.207/) Evaluation of Sentence Representations in Polish

The paper presents an evaluation of eight sentence representation methods (Word2Vec, GloVe, FastText, ELMo, Flair, BERT, LASER, USE) on five Polish linguistic tasks.
Datasets for these tasks are distributed with the repository, and two of them were released specifically for this evaluation:
the [SICK (Sentences Involving Compositional Knowledge)](https://github.com/text-machine-lab/MUTT/tree/master/data/sick) corpus translated into Polish and the 8TAGS classification dataset. The pre-trained models used in this study are available for download in a separate repository: [Polish NLP Resources](https://github.com/sdadas/polish-nlp-resources).

BibTeX

```
@inproceedings{dadas-etal-2020-evaluation,
title = "Evaluation of Sentence Representations in {P}olish",
author = "Dadas, Slawomir and Pere{\l}kiewicz, Micha{\l} and Po{\'s}wiata, Rafa{\l}",
booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
month = may,
year = "2020",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "https://aclanthology.org/2020.lrec-1.207",
pages = "1674--1680",
language = "English",
ISBN = "979-10-95546-34-4",
}
```

#### [[2]](https://arxiv.org/abs/2207.12759) Training Effective Neural Sentence Encoders from Automatically Mined Paraphrases

In this publication, we show a simple method for training effective language-specific sentence encoders without manually labeled data. Our approach is to automatically construct a dataset of paraphrase pairs from sentence-aligned bilingual text corpora. We then use the collected data to fine-tune a Transformer language model with an additional recurrent pooling layer.

BibTeX

```
@inproceedings{9945218,
author={Dadas, S{\l}awomir},
booktitle={2022 IEEE International Conference on Systems, Man, and Cybernetics (SMC)},
title={Training Effective Neural Sentence Encoders from Automatically Mined Paraphrases},
year={2022},
volume={},
number={},
pages={371-378},
doi={10.1109/SMC53654.2022.9945218}
}
```
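
The example below is a minimal sketch of this idea using the [sentence-transformers](https://www.sbert.net/) library: a Transformer encoder with mean pooling is fine-tuned on mined paraphrase pairs with an in-batch negatives loss. It is not the exact setup from the paper, which additionally uses a recurrent pooling layer and pairs mined from OPUS; the base model name and the `paraphrases.tsv` path are placeholders. The full training code is available in the `examples/paraphrase_mining` directory of this repository.

```
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Placeholder base model; the paper fine-tunes its own Polish / multilingual Transformer.
word_embedding_model = models.Transformer("xlm-roberta-base", max_seq_length=128)
pooling = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode="mean")
model = SentenceTransformer(modules=[word_embedding_model, pooling])

# Each line of the (placeholder) TSV file contains one automatically mined paraphrase pair.
train_examples = []
with open("paraphrases.tsv", encoding="utf-8") as f:
    for line in f:
        s1, s2 = line.rstrip("\n").split("\t")
        train_examples.append(InputExample(texts=[s1, s2]))

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)
# In-batch negatives: every other pair in the batch serves as a negative example.
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
model.save("output/polish-paraphrase-encoder")
```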

### Updates:

- **29.12.2022** - Our supervised datasets are now available on the [Huggingface Hub](https://huggingface.co/sdadas).
- **20.01.2022** - [New code example](https://github.com/sdadas/polish-sentence-evaluation/tree/master/examples/paraphrase_mining) added: training sentence encoders on paraphrase pairs mined from OPUS parallel corpus.
- **23.10.2020** - Added pre-trained multilingual models from the [Sentence-Transformers](https://www.sbert.net/) library.
- **02.09.2020** - Added [LaBSE](https://tfhub.dev/google/LaBSE/1) multilingual sentence encoder.
- **09.05.2020** - Added new [Polish RoBERTa](https://github.com/sdadas/polish-roberta) models.
- **03.03.2020** - Added [XLM-RoBERTa (base)](https://github.com/pytorch/fairseq/tree/master/examples/xlmr) model.
- **02.02.2020** - Added detailed results of static word embedding models with dimensionalities from 300 to 800.
- **01.02.2020** - Added [Polish RoBERTa](https://github.com/sdadas/polish-nlp-resources#roberta) model and multilingual [XLM-RoBERTa (large)](https://github.com/pytorch/fairseq/tree/master/examples/xlmr) model.

### Evaluation results:


| # | Method | Language | WCCRS Hotels | WCCRS Medicine | SICK-E | SICK-R | 8TAGS |
|---|--------|----------|--------------|----------------|--------|--------|-------|
| | **Word embeddings** | | | | | | |
| 1 | Random | n/a | 65.83 | 60.64 | 72.77 | 0.628 | 31.95 |
| 2.a | Word2Vec (300d) | Polish | 78.19 | 73.23 | 75.42 | 0.746 | 70.27 |
| 2.b | Word2Vec (500d) | Polish | 81.72 | 73.98 | 76.25 | 0.764 | 70.56 |
| 2.c | Word2Vec (800d) | Polish | 82.24 | 73.88 | 75.60 | 0.772 | 70.79 |
| 3.a | GloVe (300d) | Polish | 80.05 | 72.54 | 73.81 | 0.756 | 69.78 |
| 3.b | GloVe (500d) | Polish | 80.76 | 72.54 | 75.09 | 0.761 | 70.27 |
| 3.c | GloVe (800d) | Polish | 81.79 | 74.32 | 76.48 | 0.779 | 70.63 |
| 4.a | FastText (300d) | Polish | 80.31 | 72.64 | 75.19 | 0.729 | 69.24 |
| 4.b | FastText (500d) | Polish | 80.31 | 73.88 | 76.66 | 0.755 | 70.22 |
| 4.c | FastText (800d) | Polish | 80.95 | 72.94 | 77.09 | 0.768 | 69.95 |
| | **Language models** | | | | | | |
| 5.a | ELMo (all) | Polish | 85.52 | 78.42 | 77.15 | 0.789 | 71.41 |
| 5.b | ELMo (top) | Polish | 83.20 | 78.17 | 74.05 | 0.756 | 71.41 |
| 6 | Flair | Polish | 80.82 | 75.46 | 78.43 | 0.743 | 65.62 |
| 7.a | RoBERTa-base (all) | Polish | 85.78 | 78.96 | 78.82 | 0.799 | 70.27 |
| 7.b | RoBERTa-base (top) | Polish | 84.62 | 79.36 | 76.09 | 0.750 | 70.33 |
| 7.c | RoBERTa-large (all) | Polish | 89.12 | 84.74 | 78.13 | 0.820 | 75.75 |
| 7.d | RoBERTa-large (top) | Polish | 88.93 | 83.11 | 75.56 | 0.767 | 76.67 |
| 8.a | XLM-RoBERTa-base (all) | Multilingual | 85.52 | 78.81 | 75.25 | 0.734 | 68.78 |
| 8.b | XLM-RoBERTa-base (top) | Multilingual | 82.37 | 75.26 | 64.47 | 0.579 | 69.81 |
| 8.c | XLM-RoBERTa-large (all) | Multilingual | 87.39 | 83.60 | 74.34 | 0.764 | 73.33 |
| 8.d | XLM-RoBERTa-large (top) | Multilingual | 85.07 | 78.91 | 61.50 | 0.568 | 73.35 |
| 9 | BERT | Multilingual | 76.83 | 72.54 | 73.83 | 0.698 | 65.05 |
| | **Sentence encoders** | | | | | | |
| 10 | LASER | Multilingual | 81.21 | 78.17 | 82.21 | 0.825 | 64.91 |
| 11 | USE | Multilingual | 79.47 | 73.78 | 82.14 | 0.833 | 69.92 |
| 12 | LaBSE | Multilingual | 85.52 | 80.89 | 81.57 | 0.825 | 72.35 |
| 13.a | Sentence-Transformers (distiluse-base-multilingual-cased-v2) | Multilingual | 79.99 | 75.80 | 78.90 | 0.807 | 70.86 |
| 13.b | Sentence-Transformers (xlm-r-distilroberta-base-paraphrase-v1) | Multilingual | 82.63 | 80.84 | 81.35 | 0.839 | 70.61 |
| 13.c | Sentence-Transformers (xlm-r-bert-base-nli-stsb-mean-tokens) | Multilingual | 81.02 | 79.95 | 79.09 | 0.820 | 69.12 |
| 13.d | Sentence-Transformers (distilbert-multilingual-nli-stsb-quora-ranking) | Multilingual | 80.05 | 74.64 | 79.41 | 0.817 | 69.28 |

Table: Evaluation of sentence representations on four classification tasks and one semantic relatedness task (SICK-R). For the classification tasks, we report the accuracy of each model. For semantic relatedness, we report the Pearson correlation between the true and predicted relatedness scores.
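
For reference, this is how the two reported metrics are computed; the arrays below are illustrative placeholders, not outputs of any of the evaluated models.

```
import numpy as np
from scipy.stats import pearsonr

# Classification tasks (WCCRS Hotels, WCCRS Medicine, SICK-E, 8TAGS): accuracy in percent.
y_true = np.array([1, 0, 2, 1])
y_pred = np.array([1, 0, 1, 1])
accuracy = 100.0 * np.mean(y_true == y_pred)

# Semantic relatedness (SICK-R): Pearson correlation between gold and predicted scores.
gold = np.array([4.5, 1.2, 3.8, 2.9])
pred = np.array([4.1, 1.5, 3.6, 3.2])
pearson, _ = pearsonr(gold, pred)

print(f"accuracy={accuracy:.2f}, pearson={pearson:.3f}")
```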

### Evaluated methods:

1. Randomly initialized word embeddings
2. Word2Vec ([Distributed Representations of Words and Phrases and their Compositionality](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)) model pre-trained by us. The number in parentheses indicates the dimensionality of the embeddings.
3. GloVe ([Glove: Global Vectors for Word Representation](https://www.aclweb.org/anthology/D14-1162.pdf)) model pre-trained by us. The number in parentheses indicates the dimensionality of the embeddings. [[Download]](https://github.com/sdadas/polish-nlp-resources#glove)
4. FastText ([Enriching Word Vectors with Subword Information](https://arxiv.org/pdf/1607.04606.pdf)) model pre-trained by us. The number in parentheses indicates the dimensionality of the embeddings.
5. ELMo language model described in [Deep contextualized word representations](https://arxiv.org/pdf/1802.05365.pdf) paper, pre-trained by us for Polish. In the `all` variant, we construct the word representation by concatenating all hidden states of the LM. In the `top` variant, only the top LM layer is used as word representation. [[Download]](https://github.com/sdadas/polish-nlp-resources#elmo)
6. Flair language model described in [Contextual String Embeddings for Sequence Labeling](https://www.aclweb.org/anthology/C18-1139.pdf). We concatenate the outputs of the original `pl-forward` and `pl-backward` pre-trained language models available in the [Flair framework](https://github.com/flairNLP/flair).
7. RoBERTa language model described in [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692), pre-trained by us for Polish. [[Download]](https://github.com/sdadas/polish-roberta)
8. XLM-RoBERTa is a large, multilingual language model trained by Facebook on 2.5 TB of text extracted from CommonCrawl. We evaluate two pre-trained architectures: the base and the large model. More information can be found in the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/pdf/1911.02116.pdf). [[Download]](https://github.com/pytorch/fairseq/tree/master/examples/xlmr)
9. Original BERT language model by Google described in [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805.pdf). We use the `bert-base-multilingual-cased` version. [[Download]](https://github.com/google-research/bert/blob/master/multilingual.md)
10. Multilingual sentence encoder by Facebook, presented in [Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond](https://arxiv.org/pdf/1812.10464.pdf). [[Download]](https://github.com/facebookresearch/LASER)
11. Multilingual sentence encoder by Google, presented in [Multilingual Universal Sentence Encoder for Semantic Retrieval](https://arxiv.org/pdf/1907.04307.pdf).
12. [The language-agnostic BERT sentence embedding (LaBSE)](https://arxiv.org/pdf/2007.01852.pdf).
13. Pre-trained models from the [Sentence-Transformers](https://www.sbert.net/) library.
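
As a quick illustration of the sentence encoders in group 13, the following sketch encodes a few Polish sentences with one of the multilingual checkpoints from the table and compares them with cosine similarity. This is not part of the evaluation pipeline itself; any other model name from the Sentence-Transformers library can be substituted.

```
from sentence_transformers import SentenceTransformer, util

# One of the multilingual checkpoints evaluated in row 13.a of the table above.
model = SentenceTransformer("distiluse-base-multilingual-cased-v2")

sentences = [
    "Kot siedzi na macie.",        # "The cat is sitting on the mat."
    "Na macie siedzi kot.",        # word-order variant of the same sentence
    "Jutro będzie padać deszcz.",  # "It will rain tomorrow."
]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity between the first sentence and the remaining two.
print(util.cos_sim(embeddings[0], embeddings[1:]))
```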

![results](results.png)

Figure: Evaluation of aggregation techniques for word embedding models with different dimensionalities. Baseline models use simple averaging, SIF is the weighted-averaging method proposed by Arora et al. (2017), and Max Pooling is a concatenation of the arithmetic mean and the max-pooled vector of the word embeddings.
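
A minimal sketch of the two simpler aggregation schemes compared in the figure is shown below; SIF additionally applies frequency-based word weighting and removes the first principal component, which is omitted here for brevity. The input array is a placeholder standing in for the word embeddings of a single sentence.

```
import numpy as np

def mean_pooling(word_vectors: np.ndarray) -> np.ndarray:
    # Baseline: arithmetic mean of the word vectors (shape: tokens x dim).
    return word_vectors.mean(axis=0)

def mean_max_pooling(word_vectors: np.ndarray) -> np.ndarray:
    # "Max Pooling" variant: concatenation of the mean and the element-wise
    # maximum of the word vectors, doubling the dimensionality.
    return np.concatenate([word_vectors.mean(axis=0), word_vectors.max(axis=0)])

# Placeholder input: 4 tokens with 300-dimensional embeddings.
tokens = np.random.rand(4, 300)
print(mean_pooling(tokens).shape)      # (300,)
print(mean_max_pooling(tokens).shape)  # (600,)
```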

### Usage

`evaluate_all.py` is used to evaluate all available models. \
Run `evaluate.py [model_name] [model_params]` to evaluate a single model. For example, `evaluate.py word2vec` runs the evaluation on the `word2vec_100_3_polish.bin` model.
Please note that in the case of static word embeddings and ELMo, you need to manually download the model from [Polish NLP Resources](https://github.com/sdadas/polish-nlp-resources) and place it in the `resources` directory.
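
Since the evaluation is built on SentEval (see Acknowledgements below), adding a custom representation method essentially amounts to providing a `batcher` function that maps a batch of sentences to fixed-size vectors. The snippet below sketches the upstream SentEval interface with a random-vector placeholder; the task names and parameters used in this fork may differ.

```
import numpy as np
import senteval

def prepare(params, samples):
    # Build vocabularies or load embedding models here if needed.
    return

def batcher(params, batch):
    # Return one fixed-size vector per tokenized sentence in the batch.
    # Random vectors are a placeholder for an actual representation method.
    return np.random.rand(len(batch), 300)

params = {"task_path": "resources", "usepytorch": True, "kfold": 10}
se = senteval.engine.SE(params, batcher, prepare)
# Upstream SentEval task names; the Polish fork defines its own tasks.
results = se.eval(["SICKEntailment", "SICKRelatedness"])
print(results)
```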

### Acknowledgements
This evaluation is based on [SentEval](https://github.com/facebookresearch/SentEval), which we modified to support models, tasks and preprocessing for the Polish language.
We would like to thank the authors of the SentEval toolkit for making their code available.

Two tasks in this study are based on the [Wroclaw Corpus of Consumer Reviews](https://clarin-pl.eu/dspace/handle/11321/700). We would like to thank the authors for making this data collection available.