### Evaluation of Sentence Representations in Polish
This repository contains experiments related to dense representations of sentences in Polish. It includes code for evaluating different sentence representation methods such as aggregated word embeddings or neural sentence encoders, both multilingual and language-specific. This source code has been used in the following publications:

#### [[1]](https://aclanthology.org/2020.lrec-1.207/) Evaluation of Sentence Representations in Polish

The paper presents an evaluation of eight sentence representation methods (Word2Vec, GloVe, FastText, ELMo, Flair, BERT, LASER, USE) on five Polish linguistic tasks.
Datasets for these tasks are distributed with the repository, and two of them were released specifically for this evaluation:
the [SICK (Sentences Involving Compositional Knowledge)](https://github.com/text-machine-lab/MUTT/tree/master/data/sick) corpus translated into Polish and the 8TAGS classification dataset. The pre-trained models used in this study are available for download in a separate repository: [Polish NLP Resources](https://github.com/sdadas/polish-nlp-resources).

BibTeX

```
@inproceedings{dadas-etal-2020-evaluation,
title = "Evaluation of Sentence Representations in {P}olish",
author = "Dadas, Slawomir and Pere{\l}kiewicz, Micha{\l} and Po{\'s}wiata, Rafa{\l}",
booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
month = may,
year = "2020",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "https://aclanthology.org/2020.lrec-1.207",
pages = "1674--1680",
language = "English",
ISBN = "979-10-95546-34-4",
}
```

#### [[2]](https://arxiv.org/abs/2207.12759) Training Effective Neural Sentence Encoders from Automatically Mined Paraphrases

In this publication, we show a simple method for training effective language-specific sentence encoders without manually labeled data. Our approach is to automatically construct a dataset of paraphrase pairs from sentence-aligned bilingual text corpora. We then use the collected data to fine-tune a Transformer language model with an additional recurrent pooling layer.

BibTeX

```
@inproceedings{9945218,
author={Dadas, S{\l}awomir},
booktitle={2022 IEEE International Conference on Systems, Man, and Cybernetics (SMC)},
title={Training Effective Neural Sentence Encoders from Automatically Mined Paraphrases},
year={2022},
volume={},
number={},
pages={371-378},
doi={10.1109/SMC53654.2022.9945218}
}
```
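
The example below is a minimal sketch of this idea using the [sentence-transformers](https://www.sbert.net/) library: a Transformer encoder with mean pooling is fine-tuned on mined paraphrase pairs with an in-batch negatives loss. It is not the exact setup from the paper, which additionally uses a recurrent pooling layer and pairs mined from OPUS; the base model name and the `paraphrases.tsv` path are placeholders. The full training code is available in the `examples/paraphrase_mining` directory of this repository.

```
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Placeholder base model; the paper fine-tunes its own Polish / multilingual Transformer.
word_embedding_model = models.Transformer("xlm-roberta-base", max_seq_length=128)
pooling = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode="mean")
model = SentenceTransformer(modules=[word_embedding_model, pooling])

# Each line of the (placeholder) TSV file contains one automatically mined paraphrase pair.
train_examples = []
with open("paraphrases.tsv", encoding="utf-8") as f:
    for line in f:
        s1, s2 = line.rstrip("\n").split("\t")
        train_examples.append(InputExample(texts=[s1, s2]))

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)
# In-batch negatives: every other pair in the batch serves as a negative example.
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
model.save("output/polish-paraphrase-encoder")
```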

### Updates:

- **29.12.2022** - Our supervised datasets are now available on the [Huggingface Hub](https://huggingface.co/sdadas).
- **20.01.2022** - [New code example](https://github.com/sdadas/polish-sentence-evaluation/tree/master/examples/paraphrase_mining) added: training sentence encoders on paraphrase pairs mined from OPUS parallel corpus.
- **23.10.2020** - Added pre-trained multilingual models from the [Sentence-Transformers](https://www.sbert.net/) library.
- **02.09.2020** - Added [LaBSE](https://tfhub.dev/google/LaBSE/1) multilingual sentence encoder.
- **09.05.2020** - Added new [Polish RoBERTa](https://github.com/sdadas/polish-roberta) models.
- **03.03.2020** - Added [XLM-RoBERTa (base)](https://github.com/pytorch/fairseq/tree/master/examples/xlmr) model.
- **02.02.2020** - Added detailed results of static word embedding models with dimensionalities from 300 to 800.
- **01.02.2020** - Added [Polish RoBERTa](https://github.com/sdadas/polish-nlp-resources#roberta) model and multilingual [XLM-RoBERTa (large)](https://github.com/pytorch/fairseq/tree/master/examples/xlmr) model.

### Evaluation results:


| # | Method | Language | WCCRS Hotels | WCCRS Medicine | SICK-E | SICK-R | 8TAGS |
|---|--------|----------|--------------|----------------|--------|--------|-------|
| | **Word embeddings** | | | | | | |
| 1 | Random | n/a | 65.83 | 60.64 | 72.77 | 0.628 | 31.95 |
| 2.a | Word2Vec (300d) | Polish | 78.19 | 73.23 | 75.42 | 0.746 | 70.27 |
| 2.b | Word2Vec (500d) | Polish | 81.72 | 73.98 | 76.25 | 0.764 | 70.56 |
| 2.c | Word2Vec (800d) | Polish | 82.24 | 73.88 | 75.60 | 0.772 | 70.79 |
| 3.a | GloVe (300d) | Polish | 80.05 | 72.54 | 73.81 | 0.756 | 69.78 |
| 3.b | GloVe (500d) | Polish | 80.76 | 72.54 | 75.09 | 0.761 | 70.27 |
| 3.c | GloVe (800d) | Polish | 81.79 | 74.32 | 76.48 | 0.779 | 70.63 |
| 4.a | FastText (300d) | Polish | 80.31 | 72.64 | 75.19 | 0.729 | 69.24 |
| 4.b | FastText (500d) | Polish | 80.31 | 73.88 | 76.66 | 0.755 | 70.22 |
| 4.c | FastText (800d) | Polish | 80.95 | 72.94 | 77.09 | 0.768 | 69.95 |
| | **Language models** | | | | | | |
| 5.a | ELMo (all) | Polish | 85.52 | 78.42 | 77.15 | 0.789 | 71.41 |
| 5.b | ELMo (top) | Polish | 83.20 | 78.17 | 74.05 | 0.756 | 71.41 |
| 6 | Flair | Polish | 80.82 | 75.46 | 78.43 | 0.743 | 65.62 |
| 7.a | RoBERTa-base (all) | Polish | 85.78 | 78.96 | 78.82 | 0.799 | 70.27 |
| 7.b | RoBERTa-base (top) | Polish | 84.62 | 79.36 | 76.09 | 0.750 | 70.33 |
| 7.c | RoBERTa-large (all) | Polish | 89.12 | 84.74 | 78.13 | 0.820 | 75.75 |
| 7.d | RoBERTa-large (top) | Polish | 88.93 | 83.11 | 75.56 | 0.767 | 76.67 |
| 8.a | XLM-RoBERTa-base (all) | Multilingual | 85.52 | 78.81 | 75.25 | 0.734 | 68.78 |
| 8.b | XLM-RoBERTa-base (top) | Multilingual | 82.37 | 75.26 | 64.47 | 0.579 | 69.81 |
| 8.c | XLM-RoBERTa-large (all) | Multilingual | 87.39 | 83.60 | 74.34 | 0.764 | 73.33 |
| 8.d | XLM-RoBERTa-large (top) | Multilingual | 85.07 | 78.91 | 61.50 | 0.568 | 73.35 |
| 9 | BERT | Multilingual | 76.83 | 72.54 | 73.83 | 0.698 | 65.05 |
| | **Sentence encoders** | | | | | | |
| 10 | LASER | Multilingual | 81.21 | 78.17 | 82.21 | 0.825 | 64.91 |
| 11 | USE | Multilingual | 79.47 | 73.78 | 82.14 | 0.833 | 69.92 |
| 12 | LaBSE | Multilingual | 85.52 | 80.89 | 81.57 | 0.825 | 72.35 |
| 13.a | Sentence-Transformers (distiluse-base-multilingual-cased-v2) | Multilingual | 79.99 | 75.80 | 78.90 | 0.807 | 70.86 |
| 13.b | Sentence-Transformers (xlm-r-distilroberta-base-paraphrase-v1) | Multilingual | 82.63 | 80.84 | 81.35 | 0.839 | 70.61 |
| 13.c | Sentence-Transformers (xlm-r-bert-base-nli-stsb-mean-tokens) | Multilingual | 81.02 | 79.95 | 79.09 | 0.820 | 69.12 |
| 13.d | Sentence-Transformers (distilbert-multilingual-nli-stsb-quora-ranking) | Multilingual | 80.05 | 74.64 | 79.41 | 0.817 | 69.28 |

Table: Evaluation of sentence representations on four classification tasks and one semantic relatedness task (SICK-R). For the classification tasks, we report the accuracy of each model. For semantic relatedness, we report the Pearson correlation between the true and predicted relatedness scores.
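
For reference, this is how the two reported metrics are computed; the arrays below are illustrative placeholders, not outputs of any of the evaluated models.

```
import numpy as np
from scipy.stats import pearsonr

# Classification tasks (WCCRS Hotels, WCCRS Medicine, SICK-E, 8TAGS): accuracy in percent.
y_true = np.array([1, 0, 2, 1])
y_pred = np.array([1, 0, 1, 1])
accuracy = 100.0 * np.mean(y_true == y_pred)

# Semantic relatedness (SICK-R): Pearson correlation between gold and predicted scores.
gold = np.array([4.5, 1.2, 3.8, 2.9])
pred = np.array([4.1, 1.5, 3.6, 3.2])
pearson, _ = pearsonr(gold, pred)

print(f"accuracy={accuracy:.2f}, pearson={pearson:.3f}")
```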

### Evaluated methods:

1. Randomly initialized word embeddings
2. Word2Vec ([Distributed Representations of Words and Phrases and their Compositionality](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)) model pre-trained by us. The number in parentheses indicates the dimensionality of the embeddings.
3. GloVe ([Glove: Global Vectors for Word Representation](https://www.aclweb.org/anthology/D14-1162.pdf)) model pre-trained by us. The number in parentheses indicates the dimensionality of the embeddings. [[Download]](https://github.com/sdadas/polish-nlp-resources#glove)
4. FastText ([Enriching Word Vectors with Subword Information](https://arxiv.org/pdf/1607.04606.pdf)) model pre-trained by us. The number in parentheses indicates the dimensionality of the embeddings.
5. ELMo language model described in [Deep contextualized word representations](https://arxiv.org/pdf/1802.05365.pdf) paper, pre-trained by us for Polish. In the `all` variant, we construct the word representation by concatenating all hidden states of the LM. In the `top` variant, only the top LM layer is used as word representation. [[Download]](https://github.com/sdadas/polish-nlp-resources#elmo)
6. Flair language model described in [Contextual String Embeddings for Sequence Labeling](https://www.aclweb.org/anthology/C18-1139.pdf). We concatenate the outputs of the original `pl-forward` and `pl-backward` pre-trained language models available in the [Flair framework](https://github.com/flairNLP/flair).
7. RoBERTa language model described in [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692), pre-trained by us for Polish. [[Download]](https://github.com/sdadas/polish-roberta)
8. XLM-RoBERTa is a large, multilingual language model trained by Facebook on 2.5 TB of text extracted from CommonCrawl. We evaluate two pre-trained architectures: the base and the large model. More information can be found in the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/pdf/1911.02116.pdf). [[Download]](https://github.com/pytorch/fairseq/tree/master/examples/xlmr)
9. Original BERT language model by Google described in [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805.pdf). We use the `bert-base-multilingual-cased` version. [[Download]](https://github.com/google-research/bert/blob/master/multilingual.md)
10. Multilingual sentence encoder by Facebook, presented in [Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond](https://arxiv.org/pdf/1812.10464.pdf). [[Download]](https://github.com/facebookresearch/LASER)
11. Multilingual sentence encoder by Google, presented in [Multilingual Universal Sentence Encoder for Semantic Retrieval](https://arxiv.org/pdf/1907.04307.pdf).
12. [The language-agnostic BERT sentence embedding (LaBSE)](https://arxiv.org/pdf/2007.01852.pdf).
13. Pre-trained models from the [Sentence-Transformers](https://www.sbert.net/) library.
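
As a quick illustration of the sentence encoders in group 13, the following sketch encodes a few Polish sentences with one of the multilingual checkpoints from the table and compares them with cosine similarity. This is not part of the evaluation pipeline itself; any other model name from the Sentence-Transformers library can be substituted.

```
from sentence_transformers import SentenceTransformer, util

# One of the multilingual checkpoints evaluated in row 13.a of the table above.
model = SentenceTransformer("distiluse-base-multilingual-cased-v2")

sentences = [
    "Kot siedzi na macie.",        # "The cat is sitting on the mat."
    "Na macie siedzi kot.",        # word-order variant of the same sentence
    "Jutro będzie padać deszcz.",  # "It will rain tomorrow."
]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity between the first sentence and the remaining two.
print(util.cos_sim(embeddings[0], embeddings[1:]))
```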

![results](results.png)

Figure: Evaluation of aggregation techniques for word embedding models with different dimensionalities. Baseline models use simple averaging, SIF is the weighted-averaging method proposed by Arora et al. (2017), and Max Pooling is a concatenation of the arithmetic mean and the max-pooled vector of the word embeddings.
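
A minimal sketch of the two simpler aggregation schemes compared in the figure is shown below; SIF additionally applies frequency-based word weighting and removes the first principal component, which is omitted here for brevity. The input array is a placeholder standing in for the word embeddings of a single sentence.

```
import numpy as np

def mean_pooling(word_vectors: np.ndarray) -> np.ndarray:
    # Baseline: arithmetic mean of the word vectors (shape: tokens x dim).
    return word_vectors.mean(axis=0)

def mean_max_pooling(word_vectors: np.ndarray) -> np.ndarray:
    # "Max Pooling" variant: concatenation of the mean and the element-wise
    # maximum of the word vectors, doubling the dimensionality.
    return np.concatenate([word_vectors.mean(axis=0), word_vectors.max(axis=0)])

# Placeholder input: 4 tokens with 300-dimensional embeddings.
tokens = np.random.rand(4, 300)
print(mean_pooling(tokens).shape)      # (300,)
print(mean_max_pooling(tokens).shape)  # (600,)
```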

### Usage

`evaluate_all.py` is used to evaluate all available models. \
Run `evaluate.py [model_name] [model_params]` to evaluate a single model. For example, `evaluate.py word2vec` runs the evaluation on the `word2vec_100_3_polish.bin` model.
Please note that in the case of static word embeddings and ELMo, you need to manually download the model from [Polish NLP Resources](https://github.com/sdadas/polish-nlp-resources) and place it in the `resources` directory.
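
Since the evaluation is built on SentEval (see Acknowledgements below), adding a custom representation method essentially amounts to providing a `batcher` function that maps a batch of sentences to fixed-size vectors. The snippet below sketches the upstream SentEval interface with a random-vector placeholder; the task names and parameters used in this fork may differ.

```
import numpy as np
import senteval

def prepare(params, samples):
    # Build vocabularies or load embedding models here if needed.
    return

def batcher(params, batch):
    # Return one fixed-size vector per tokenized sentence in the batch.
    # Random vectors are a placeholder for an actual representation method.
    return np.random.rand(len(batch), 300)

params = {"task_path": "resources", "usepytorch": True, "kfold": 10}
se = senteval.engine.SE(params, batcher, prepare)
# Upstream SentEval task names; the Polish fork defines its own tasks.
results = se.eval(["SICKEntailment", "SICKRelatedness"])
print(results)
```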

### Acknowledgements
This evaluation is based on [SentEval](https://github.com/facebookresearch/SentEval), which we modified to support models, tasks and preprocessing for the Polish language.
We would like to thank the authors of the SentEval toolkit for making their code available.

Two tasks in this study are based on the [Wroclaw Corpus of Consumer Reviews](https://clarin-pl.eu/dspace/handle/11321/700). We would like to thank the authors for making this data collection available.