https://github.com/ruanchaves/elmo

Supporting code for the paper "Portuguese Language Models and Word Embeddings: Evaluating on Semantic Similarity Tasks".

elmo embeddings natural-language-processing natural-language-understanding nlp portuguese portuguese-language semantic-similarity textual-entailment

Host: GitHub
URL: https://github.com/ruanchaves/elmo
Owner: ruanchaves
Created: 2019-07-17T15:55:36.000Z (about 6 years ago)
Default Branch: master
Last Pushed: 2022-12-08T03:34:01.000Z (almost 3 years ago)
Last Synced: 2025-03-24T16:41:51.133Z (7 months ago)
Topics: elmo, embeddings, natural-language-processing, natural-language-understanding, nlp, portuguese, portuguese-language, semantic-similarity, textual-entailment
Language: Jupyter Notebook
Homepage:
Size: 12.1 MB
Stars: 11
Watchers: 3
Forks: 2
Open Issues: 14
Metadata Files:
- Readme: README.md

Portuguese Language Models and Word Embeddings
=================

This repository was designed primarily to assess the quality of the [Portuguese ELMo representations made available through the AllenNLP library](https://allennlp.org/elmo) in comparison with the language models and word embeddings currently available for Portuguese.

This source code reproduces the experiments described in our paper [Portuguese Language Models and Word Embeddings: Evaluating on Semantic Similarity Tasks](https://www.springer.com/gp/book/9783030415044). It evaluates all word embeddings from [nathanshartmann/portuguese_word_embeddings](https://github.com/nathanshartmann/portuguese_word_embeddings) on the semantic textual similarity tasks of the [ASSIN datasets](https://github.com/erickrf/assin) and compares them with the results achieved by ELMo and BERT. Some of our tests concatenate ELMo with word embeddings from that repository.

* [Paper](https://www.springer.com/gp/book/9783030415044)

* [Blog post](https://ruanchaves.github.io/portuguese-language-models/)

* [PROPOR 2020 Presentation](presentations/PROPOR_2020_presentation.pdf)

* [Benchmarks](reports/evaluation.csv)
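
The concatenation of ELMo with word embeddings mentioned above can be pictured with the minimal sketch below. It is an illustration only, not the code used in this repository: the function names, the random toy inputs, the 1024-dimensional ELMo-like vectors, and the mean pooling step are all assumptions made for the example.

```python
import numpy as np


def concatenate_embeddings(contextual, static):
    """Join two per-token embedding matrices along the feature axis.

    contextual: shape (n_tokens, d1), e.g. ELMo-like vectors (assumed layout)
    static:     shape (n_tokens, d2), e.g. word2vec lookups (assumed layout)
    """
    contextual = np.asarray(contextual)
    static = np.asarray(static)
    if contextual.shape[0] != static.shape[0]:
        raise ValueError("both inputs must cover the same tokens")
    return np.concatenate([contextual, static], axis=1)


def sentence_vector(token_vectors):
    """Mean-pool token vectors into one sentence vector (assumed pooling choice)."""
    return np.asarray(token_vectors).mean(axis=0)


# Hypothetical toy input: 3 tokens, random values standing in for real embeddings.
rng = np.random.default_rng(0)
elmo_like = rng.normal(size=(3, 1024))
w2v_like = rng.normal(size=(3, 1000))

combined = concatenate_embeddings(elmo_like, w2v_like)
print(combined.shape)                   # (3, 2024)
print(sentence_vector(combined).shape)  # (2024,)
```

Each token ends up with a single vector whose length is the sum of the two input dimensions, so any downstream model sees both representations at once.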

## Benchmarks

Our full benchmarks are available under [`reports/evaluation.csv`](reports/evaluation.csv). The most relevant benchmarks for the semantic textual similarity task are reproduced below.

| Dataset         | Model                 | Embedding | Architecture | Dimensions |  PCC |  MSE |
|-----------------|-----------------------|-----------|--------------|------------|-----:|-----:|
| ASSIN 1 (pt-BR) | ELMo - wiki (reduced) |           |              |            | 0.62 | 0.47 |
|                 | ELMo - wiki (reduced) | word2vec  | CBOW         | 1000       | 0.62 | 0.47 |
|                 | [portuguese-BERT](https://github.com/neuralmind-ai/portuguese-bert) |  |  |  | 0.53 | 0.55 |
|                 | [BERT-multilingual (cased)](https://github.com/google-research/bert/blob/master/multilingual.md) |  |  |  | 0.51 | 1.94 |
| ASSIN 1 (pt-PT) | ELMo - wiki (reduced) |           |              |            | 0.63 | 0.73 |
|                 | ELMo - wiki (reduced) | word2vec  | CBOW         | 1000       | 0.64 | 0.73 |
|                 | [portuguese-BERT](https://github.com/neuralmind-ai/portuguese-bert) |  |  |  | 0.53 | 0.88 |
|                 | [BERT-multilingual (cased)](https://github.com/google-research/bert/blob/master/multilingual.md) |  |  |  | 0.52 | 0.90 |
| ASSIN 2         | ELMo - wiki (reduced) |           |              |            | 0.57 | 1.94 |
|                 | ELMo - wiki (reduced) | word2vec  | CBOW         | 1000       | 0.59 | 1.88 |
|                 | [portuguese-BERT](https://github.com/neuralmind-ai/portuguese-bert) |  |  |  | 0.64 | 1.69 |
|                 | BERT-multilingual     |           |              |            | 0.51 | 1.94 |
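
The PCC and MSE columns are read here as the Pearson correlation coefficient and the mean squared error between predicted and gold similarity scores. The sketch below shows how the two metrics are commonly computed. The function names and the toy scores are made up for illustration, and the evaluation code in this repository may compute them differently.

```python
import numpy as np


def pearson_correlation(predicted, gold):
    """Pearson correlation coefficient between two score lists."""
    predicted = np.asarray(predicted, dtype=float)
    gold = np.asarray(gold, dtype=float)
    p = predicted - predicted.mean()
    g = gold - gold.mean()
    return float((p * g).sum() / np.sqrt((p ** 2).sum() * (g ** 2).sum()))


def mean_squared_error(predicted, gold):
    """Average of the squared differences between predictions and gold scores."""
    predicted = np.asarray(predicted, dtype=float)
    gold = np.asarray(gold, dtype=float)
    return float(np.mean((predicted - gold) ** 2))


# Hypothetical similarity scores for four sentence pairs.
gold_scores = [1.0, 2.5, 4.0, 5.0]
predicted_scores = [1.5, 2.0, 3.5, 4.5]

print(pearson_correlation(predicted_scores, gold_scores))
print(mean_squared_error(predicted_scores, gold_scores))  # 0.25
```

A higher PCC is better, while a lower MSE is better. The two can disagree: predictions that track the gold ranking but sit on a shifted scale keep a high PCC and still produce a large MSE.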

In our benchmarks, the ELMo model labelled as `wiki` is the first public Portuguese ELMo model that was made available through the [AllenNLP library website](https://allennlp.org/elmo). Since then it has been replaced on the website by `wiki (reduced)`.

The `BRWAC` model was trained on [brWaC](https://www.researchgate.net/publication/326303825_The_brWaC_Corpus_A_New_Open_Resource_for_Brazilian_Portuguese). The `wiki (reduced)` model was trained on the same dataset as `wiki`, after all words occurring fewer than four times had been removed from it.
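
The frequency cutoff behind `wiki (reduced)` can be illustrated with the sketch below. It assumes that rare words are simply dropped from pre-tokenized sentences. The actual preprocessing is not described here and may differ, for example by substituting an unknown-word token. The function name and the toy corpus are hypothetical.

```python
from collections import Counter


def drop_rare_words(sentences, min_count=4):
    """Remove tokens whose corpus-wide frequency is below min_count."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return [
        [token for token in sentence if counts[token] >= min_count]
        for sentence in sentences
    ]


# Hypothetical toy corpus: "o" occurs 5 times, "gato" 4 times, the rest once.
corpus = [
    ["o", "gato", "dorme"],
    ["o", "cachorro", "late"],
    ["o", "gato", "come"],
    ["o", "gato", "corre"],
    ["o", "gato", "pula"],
]

filtered = drop_rare_words(corpus)
print(filtered[0])  # ['o', 'gato']
print(filtered[1])  # ['o']
```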

## Installation

Assuming you have installed Docker and nvidia-docker, the command below will reproduce all test results in this repository.

```
sudo bash scripts/quickstart.sh
```

Running this command builds the `ruanchaves/elmo:2.0` Docker image if it doesn't exist yet. It also downloads all NILC embeddings if they have not already been downloaded to the `embeddings/NILC` folder.

If you would also like to run BERT, extract your TensorFlow checkpoint files into the `embeddings/bert/portuguese` folder. The checkpoint must be in a form that [bert-as-service](https://github.com/hanxiao/bert-as-service) can load, so you may have to rename some of the files. Then move `sentence_similarity/bert.yaml` to `settings/bert.yaml` and regenerate `scripts/quickstart.sh` by running `python generate_start.py`.
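
Before running the BERT tests, it may help to check which files in the checkpoint folder still need renaming. The sketch below is a hypothetical helper, not part of this repository. It assumes the file names used by Google's original BERT releases (`bert_config.json`, `vocab.txt`, and a `bert_model.ckpt.*` group of files), so adjust the names if your bert-as-service setup expects others.

```python
from pathlib import Path

# Assumed file names, following Google's original BERT releases.
EXPECTED_FILES = ["bert_config.json", "vocab.txt"]
EXPECTED_CHECKPOINT_PREFIX = "bert_model.ckpt"


def missing_bert_files(model_dir):
    """Return the expected entries that are absent from model_dir."""
    model_dir = Path(model_dir)
    missing = [name for name in EXPECTED_FILES if not (model_dir / name).is_file()]
    # A TensorFlow checkpoint is split over several files sharing one prefix.
    if not any(model_dir.glob(EXPECTED_CHECKPOINT_PREFIX + ".*")):
        missing.append(EXPECTED_CHECKPOINT_PREFIX + ".*")
    return missing


if __name__ == "__main__":
    # An empty list means every assumed file name was found.
    print(missing_bert_files("embeddings/bert/portuguese"))
```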

Your results will be stored in the folder `sentence_similarity/results` by default.

## Associated Repositories

* [Pull request to nathanshartmann/portuguese_word_embeddings: Improvements to the scores of evaluated embeddings #11](https://github.com/nathanshartmann/portuguese_word_embeddings/pull/11) 

* You may want to take a look at the [ruanchaves/assin](https://github.com/ruanchaves/assin) repository. It contains tests which were performed with ensembles of fine-tuned Transformer models on the ASSIN datasets.

## Citation

```
@inproceedings{rodrigues_propor2020,
  author = {Ruan Chaves Rodrigues and Jéssica Rodrigues da Silva and Pedro Vitor Quinta de Castro and Nádia Félix Felipe da Silva and Anderson da Silva Soares},
  title = {Portuguese Language Models and Word Embeddings: Evaluating on Semantic Similarity Tasks},
  editor = {Paulo Quaresma and Renata Vieira and Sandra Aluísio and Helena Moniz and Fernando Batista and Teresa Gonçalves},
  booktitle = {Computational Processing of the Portuguese Language},
  note = {14th International Conference, PROPOR 2020, Evora, Portugal, March 2–4, 2020, Proceedings},
  publisher = {Springer International Publishing},
  address = {Springer Nature Switzerland AG},
  doi = {10.1007/978-3-030-41505-1},
  year = {2020}
}
```
