{"id":21582419,"url":"https://github.com/ruanchaves/elmo","last_synced_at":"2025-04-10T18:54:38.278Z","repository":{"id":39739776,"uuid":"197420795","full_name":"ruanchaves/elmo","owner":"ruanchaves","description":"Supporting code for the paper \"Portuguese Language Models and Word Embeddings: Evaluating on Semantic Similarity Tasks\".","archived":false,"fork":false,"pushed_at":"2022-12-08T03:34:01.000Z","size":12720,"stargazers_count":11,"open_issues_count":14,"forks_count":2,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-03-24T16:41:51.133Z","etag":null,"topics":["elmo","embeddings","natural-language-processing","natural-language-understanding","nlp","portuguese","portuguese-language","semantic-similarity","textual-entailment"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ruanchaves.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-07-17T15:55:36.000Z","updated_at":"2022-11-11T03:23:30.000Z","dependencies_parsed_at":"2023-01-25T04:30:09.530Z","dependency_job_id":null,"html_url":"https://github.com/ruanchaves/elmo","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ruanchaves%2Felmo","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ruanchaves%2Felmo/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ruanchaves%2Felmo/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ruanchaves%2Felmo/manifests","owner_url":"https://repos.e
cosyste.ms/api/v1/hosts/GitHub/owners/ruanchaves","download_url":"https://codeload.github.com/ruanchaves/elmo/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248275590,"owners_count":21076631,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["elmo","embeddings","natural-language-processing","natural-language-understanding","nlp","portuguese","portuguese-language","semantic-similarity","textual-entailment"],"created_at":"2024-11-24T14:15:47.443Z","updated_at":"2025-04-10T18:54:38.259Z","avatar_url":"https://github.com/ruanchaves.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"Portuguese Language Models and Word Embeddings\n=================\n\nThis repository has primarily been designed to assess the quality of the [Portuguese ELMo representations made available through the AllenNLP library](https://allennlp.org/elmo) in comparison with the language models and word embeddings currently available for the Portuguese language.\n\nThis source code can reproduce the experiments mentioned in our paper [Portuguese Language Models and Word Embeddings: Evaluating on Semantic Similarity Tasks](https://www.springer.com/gp/book/9783030415044). It's designed to evaluate all word embeddings from [nathanshartmann/portuguese_word_embeddings](https://github.com/nathanshartmann/portuguese_word_embeddings) on the semantic textual similarity tasks of the [ASSIN datasets](https://github.com/erickrf/assin) and also compare them with the results achieved by ELMo and BERT. 
Some of our tests concatenate ELMo with word embeddings from that repository.\n\n* [Paper](https://www.springer.com/gp/book/9783030415044)\n\n* [Blog post](https://ruanchaves.github.io/portuguese-language-models/)\n\n* [PROPOR 2020 Presentation](presentations/PROPOR_2020_presentation.pdf)\n\n* [Benchmarks](reports/evaluation.csv)\n\n## Benchmarks\n\nOur full benchmarks are available under [`reports/evaluation.csv`](reports/evaluation.csv). The most relevant benchmarks for the semantic textual similarity task are reproduced below.\n\n| Dataset           | Model                 | Embedding | Architecture | Dimensions |           PCC |           MSE |\n|-------------------|-----------------------|-----------|--------------|------------|--------------:|--------------:|\n| ASSIN 1 (pt-BR) | ELMo - wiki (reduced) |           |              |            |          0.62 |          0.47 |\n|                   | ELMo - wiki (reduced) | word2vec  | CBOW         | 1000       |          0.62 |          0.47 |\n|                   | [portuguese-BERT](https://github.com/neuralmind-ai/portuguese-bert)       |           |              |            |          0.53 |          0.55 |\n|                   | [BERT-multilingual (cased)](https://github.com/google-research/bert/blob/master/multilingual.md)     |           |              |            |          0.51 |          1.94 |\n| ASSIN 1 (pt-PT) | ELMo - wiki (reduced) |           |              |            |          0.63 |          0.73 |\n|                   | ELMo - wiki (reduced) | word2vec  | CBOW         | 1000       |          0.64 |          0.73 |\n|                   | [portuguese-BERT](https://github.com/neuralmind-ai/portuguese-bert)       |           |              |            |          0.53 |          0.88 |\n|                   | [BERT-multilingual (cased)](https://github.com/google-research/bert/blob/master/multilingual.md)     |           |              |            |          0.52 |          0.90 |\n| 
ASSIN 2           | ELMo - wiki (reduced) |           |              |            |          0.57 |          1.94 |\n|                   | ELMo - wiki (reduced) | word2vec  | CBOW         | 1000       |          0.59 |          1.88 |\n|                   | [portuguese-BERT](https://github.com/neuralmind-ai/portuguese-bert)       |           |              |            |          0.64 |          1.69 |\n|                   | BERT-multilingual     |           |              |            |          0.51 |          1.94 |\n\nIn our benchmarks, the ELMo model labelled as `wiki` is the first public Portuguese ELMo model that was made available through the [AllenNLP library website](https://allennlp.org/elmo). Since then it has been replaced on the website by `wiki (reduced)`.\n\nThe `BRWAC` model was trained on [brWaC](https://www.researchgate.net/publication/326303825_The_brWaC_Corpus_A_New_Open_Resource_for_Brazilian_Portuguese), and the `wiki (reduced)` model was trained on the same dataset as `wiki` after words occurring fewer than four times were removed from it.\n\n## Installation\n\nAssuming you have installed Docker and nvidia-docker, the command below will reproduce all test results in this repository.\n\n```\nsudo bash scripts/quickstart.sh\n```\n\nRunning this command will generate the `ruanchaves/elmo:2.0` Docker image if it doesn't exist yet, and download all NILC embeddings if they haven't already been downloaded to the `embeddings/NILC` folder.\n\nIf you would also like to run BERT, extract your TensorFlow checkpoint files under the folder `embeddings/bert/portuguese`. The checkpoint must be in a form that [bert-as-service](https://github.com/hanxiao/bert-as-service) can load: you may have to rename some of the files in order to comply. 
Move `sentence_similarity/bert.yaml` to `settings/bert.yaml` and then regenerate `scripts/quickstart.sh` by running `python generate_start.py`.\n\nYour results will be stored in the folder `sentence_similarity/results` by default.\n\n## Associated Repositories\n\n* [Pull request to nathanshartmann/portuguese_word_embeddings: Improvements to the scores of evaluated embeddings #11](https://github.com/nathanshartmann/portuguese_word_embeddings/pull/11)\n\n* You may want to take a look at the [ruanchaves/assin](https://github.com/ruanchaves/assin) repository. It contains tests performed with ensembles of fine-tuned Transformer models on the ASSIN datasets.\n\n## Citation\n\n```\n@inproceedings{rodrigues_propor2020,\n  author = {Ruan Chaves Rodrigues and Jéssica Rodrigues da Silva and Pedro Vitor Quinta de Castro and Nádia Félix Felipe da Silva and Anderson da Silva Soares },\n  title = {Portuguese Language Models and Word Embeddings: Evaluating on Semantic Similarity Tasks},\n  editor = { Paulo Quaresma and Renata Vieira and Sandra Aluísio and Helena Moniz and Fernando Batista and Teresa Gonçalves },\n  booktitle = { Computational Processing of the Portuguese Language },\n  note = { 14th International Conference, PROPOR 2020, Evora, Portugal, March 2–4, 2020, Proceedings },\n  publisher = { Springer International Publishing },\n  address = { Springer Nature Switzerland AG },\n  doi = {10.1007/978-3-030-41505-1},\n  year = {2020}}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fruanchaves%2Felmo","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fruanchaves%2Felmo","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fruanchaves%2Felmo/lists"}