https://github.com/ajdavidl/Portuguese-NLP

List of resources and tools developed with focus on Portuguese.
Host: GitHub
URL: https://github.com/ajdavidl/Portuguese-NLP
Owner: ajdavidl
Created: 2022-07-02T18:46:41.000Z (over 3 years ago)
Default Branch: main
Last Pushed: 2024-11-05T02:14:10.000Z (11 months ago)
Last Synced: 2024-11-05T03:19:56.551Z (11 months ago)
Topics: nlp, portuguese, portuguese-language
Homepage:
Size: 152 KB
Stars: 235
Watchers: 12
Forks: 26
Open Issues: 0
Metadata Files:
- Readme: README.md

# Portuguese-NLP

List of resources and tools developed with focus on Portuguese.

## Datasets

- [#PraCegoVer](https://huggingface.co/datasets/gabrielsantosrv/pracegover) -  multi-modal dataset with Portuguese captions based on posts from Instagram.

- [18th-century Portuguese medical texts](https://github.com/uebelsetzer/NLP_for_18th-century_Portuguese_medical_texts)

- [AG_news pt](https://huggingface.co/datasets/maritaca-ai/ag_news_pt) - automatic translation of the AG's corpus of news articles.

- [Alpaca data pt-br](https://huggingface.co/datasets/dominguesm/alpaca-data-pt-br) - Stanford Alpaca dataset translated into Brazilian Portuguese using the Helsinki-NLP/opus-mt-tc-big-en-pt model.

- [AspectBR](https://github.com/franciellevargas/AspectBR) - Aspect-based annotated dataset of web consumer reviews.

- [ASSIN](http://nilc.icmc.usp.br/assin/) - a dataset with semantic similarity score and entailment annotations. ([HuggingFace](https://huggingface.co/datasets/assin))

- [ASSIN 2](https://sites.google.com/view/assin2) - successor to ASSIN. ([HuggingFace](https://huggingface.co/datasets/assin2))

- [Automated Essay Score (AES) ENEM Dataset](https://github.com/kamel-usp/aes_enem) - Benchmark for automatic essay scoring in Portuguese ([HuggingFace](https://huggingface.co/datasets/kamel-usp/aes_enem_dataset))

- [Aya Dataset PT](https://huggingface.co/datasets/nicolasdec/aya_dataset_pt) - CohereForAI Aya Dataset filtered for Portuguese (PT).

- [BlogSet-BR](https://www.inf.pucrs.br/linatural/wordpress/recursos-e-ferramentas/blogset-br-english/) - a collection of posts gathered from the Blogspot platform, written by Brazilian users.

- [BLUEX](https://huggingface.co/datasets/portuguese-benchmark-datasets/BLUEX) - A benchmark based on Brazilian Leading Universities Entrance eXams.

- [BoolQ](https://huggingface.co/datasets/maritaca-ai/boolq_pt) - automatic translation of BoolQ.

- [br-quad-2.0](https://github.com/piEsposito/br-quad-2.0) - Stanford Question Answering Dataset (SQuAD) 2.0 translated into Brazilian Portuguese (PT-BR).

- [Brands.Br](https://github.com/metalmorphy/Brands.Br) - a Portuguese Reviews Corpus

- [Brazilian Court Decisions](https://huggingface.co/datasets/joelniklaus/brazilian_court_decisions) - collection of 4043 Ementa (summary) court decisions and their metadata from the Tribunal de Justiça de Alagoas (TJAL), the State Supreme Court of Alagoas (Brazil).

- [Brazilian E-Commerce](https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce) - Brazilian E-Commerce Public Dataset by Olist store.

- [Brazilian Headlines Sentiments](https://www.kaggle.com/datasets/brunoluvizotto/brazilian-headlines-sentiments) - Dataset containing sentiment analysis of Brazilian news agencies headlines.

- [Brazilian Portuguese Literature Corpus](https://www.kaggle.com/datasets/rtatman/brazilian-portuguese-literature-corpus) - 3.7 million word corpus of Brazilian literature published between 1840 and 1908.

- [Brazilian Portuguese Narrative Essays Dataset](https://www.kaggle.com/datasets/moesiof/portuguese-narrative-essays) - Dataset for Automatic Essay Scoring of Brazilian Portuguese Narrative Essays.

- [Brazilian Portuguese Sentiment Analysis Datasets](https://www.kaggle.com/datasets/fredericods/ptbr-sentiment-analysis-datasets).

- [Brazilian TCU's judgments](https://www.kaggle.com/datasets/ferraz/acordaos-tcu) - Judgments of the Brazilian Federal Court of Accounts (TCU).

- [BrWaC](https://www.inf.ufrgs.br/pln/wiki/index.php?title=BrWaC) - Brazilian Portuguese Web as Corpus.

- [BrWac2Wiki](https://github.com/aseidelo/BrWac2Wiki) - a dataset for multi-document summarization in Portuguese.

- [B2W-Reviews01](https://github.com/americanas-tech/b2w-reviews01) - product reviews.

- [Canarim](https://github.com/DominguesM/canarim) - A Large-Scale Dataset of Web Pages in the Portuguese Language ([huggingface](https://huggingface.co/datasets/dominguesm/canarim))

- [Carolina](https://sites.usp.br/corpuscarolina/) - General Corpus of Contemporary Brazilian Portuguese ([huggingface](https://huggingface.co/datasets/carolina-c4ai/corpus-carolina)).

- [Capes](https://huggingface.co/datasets/capes) - parallel corpus of theses and dissertations abstracts in English and Portuguese.

- [CC100-Portuguese](https://autonlp.ai/datasets/cc100-portuguese) - Created by Conneau & Wenzek et al. in 2020. This dataset is one of the 100 monolingual corpora processed from the January-December 2018 Common Crawl snapshots using the CC-Net repository.

- [CETENFolha](https://www.linguateca.pt/cetenfolha/index_info.html) - news from the newspaper Folha de S. Paulo.

- [CHAVE](https://www.linguateca.pt/CHAVE/) - collection for Information Retrieval and Question Answering.

- [CINTIL Corpus](http://cintil.ul.pt/cintilfeatures.html#corpus) - a linguistically interpreted corpus of Portuguese. 

- [ClinicalNER](https://github.com/fabioacl/PortugueseClinicalNER) - Clinical Named Entity Recognition in Portuguese.

- [Complexidade Textual para Estágios Escolares do Sistema Educacional Brasileiro](https://github.com/gazzola/corpus_readability_nlp_portuguese).

- [CORAA](https://github.com/nilc-nlp/CORAA) - dataset for Automatic Speech Recognition.

- [CORAA SER](https://github.com/rmarcacini/ser-coraa-pt-br) - Emotion Recognition from Brazilian Portuguese Informal Spontaneous Speech.

- [CrawlPT_dedup](https://huggingface.co/datasets/eduagarcia/CrawlPT_dedup) - CrawlPT (deduplicated) is composed of three corpora: brWaC, C100-PT, and OSCAR-2301.

- [CSTNews](https://sites.icmc.usp.br/taspardo/sucinto/cstnews.html) - a corpus with 50 clusters of news texts with their multi-document summaries, as well as several discourse and semantic annotations.

- [C-ORAL-BRASIL](http://www.c-oral-brasil.org/english-site/index.html) - This project is dedicated to the study of Brazilian Portuguese spontaneous speech and, more broadly, to the compilation of spoken corpora.

- [DANTEStocks](https://www.kaggle.com/datasets/michelmzerbinati/portuguese-tweet-corpus-annotated-with-ner) - Corpus of stock market tweets written in Brazilian Portuguese and annotated with named entities according to HAREM's taxonomy.

- [DEEPAGÉ](https://github.com/C4AI/deepage) - Answering Questions in Portuguese about the Brazilian Environment.

- [DNLT-BP](https://github.com/nilc-nlp/DNLT-BP) - Datasets of Neuropsychological Language Tests in Brazilian Portuguese.

- [ENEM Challenge](https://www.ime.usp.br/~ddm/project/enem/) - The exam consists of an essay and an objective part containing 180 multiple-choice questions.

- [ENEM-2022 and ENEM-2023](https://huggingface.co/datasets/maritaca-ai/enem) - These projects encompass all multiple-choice questions from the last two editions of the Exame Nacional do Ensino Médio (ENEM), the main standardized entrance examination adopted by Brazilian universities.

- [Essay-BR](https://github.com/rafaelanchieta/essay) - a corpus of essays in Brazilian Portuguese.

- [Extended Essay-BR](https://github.com/lplnufpi/essay-br) - Extended version of the Essay-BR corpus.

- [FACTCK.BR](https://github.com/jghm-f/FACTCK.BR) - A dataset to study Fake News in Portuguese.

- [FactNews](https://github.com/franciellevargas/FactNews) - dataset to predict sentence-level factuality of news reporting.

- [fake voices](https://huggingface.co/datasets/unfake/fake_voices) - deepfakes in Brazilian Portuguese created with the XTTS model.

- [Fake.Br](https://github.com/roneysco/Fake.br-Corpus) - aligned true and fake news written in Brazilian Portuguese ([HuggingFace](https://huggingface.co/datasets/fake-news-UFG/fakebr)).

- [Central_de_fatos](https://doi.org/10.5281/zenodo.5191798) - ([Huggingface](https://huggingface.co/datasets/fake-news-UFG/central_de_fatos)).

- [FakeNewsSet](https://github.com/kamplus/FakeNewsSetGen) - ([HuggingFace](https://huggingface.co/datasets/fake-news-UFG/FakeNewsSet)).

- [Fakepedia-Corpus](https://github.com/andersoncordeiro/Fakepedia-Corpus) - fake news dataset.

- [FakeRecogna](https://github.com/Gabriel-Lino-Garcia/FakeRecogna) - dataset comprised of real and fake news ([Huggingface](https://huggingface.co/datasets/recogna-nlp/FakeRecogna)).

- [FakeWhatsApp.Br](https://github.com/cabrau/FakeWhatsApp.Br) - An annotated Corpus of WhatsApp messages in PT-BR for automatic detection of textual misinformation.

- [FKTC](https://github.com/GoloMarcos/FKTC) - FaKe news Text Collections.

- [Floresta Sintá(c)tica](https://www.linguateca.pt/Floresta/) - treebank for Portuguese.

- [HAREM first](https://www.linguateca.pt/primeiroHAREM/harem_coleccaodourada_en.html) - evaluation contest for named entity recognizers in Portuguese.

- [HAREM second](https://www.linguateca.pt/HAREM/) - evaluation contest for named entity recognizers in Portuguese.

- [HateBR](https://github.com/franciellevargas/HateBR) - large-scale expert annotated corpus of Brazilian Instagram comments for hate speech and offensive language detection on the web and social media.

- [Historical Portuguese Corpora](http://www.nilc.icmc.usp.br/nilc/projects/hpc/) - tools and resources for manipulation of historical corpora and management of historical dictionaries.

- [IMDB pt](https://huggingface.co/datasets/maritaca-ai/imdb_pt) - automatic translation of IMDB.

- [InferBR](https://github.com/lbencke/InferBR) - Natural Language Inference dataset.

- [Iudicium Textum Dataset](http://dadosabertos.c3sl.ufpr.br/acordaos/) - contains legal documents issued by the Brazilian Federal Supreme Court in its full composition ([paper](https://www.researchgate.net/publication/336022563_Iudicium_Textum_Dataset_Uma_Base_de_Textos_Juridicos_para_NLP)).

- [LeNER-Br](https://github.com/peluz/lener-br) - a Dataset for Named Entity Recognition in Brazilian Legal Text.

- [LegalPT_dedup](https://huggingface.co/datasets/eduagarcia/LegalPT_dedup) - LegalPT (deduplicated) aggregates the maximum amount of publicly available legal data in Portuguese.

- [Lex2Kids](http://www.nilc.icmc.usp.br/leg2kids/) - lexicon of the Portuguese words most heard by children.

- [Mac-Morpho](http://www.nilc.icmc.usp.br/macmorpho/) - Brazilian Portuguese texts annotated with part-of-speech tags.

- [MilkQA](http://www.nilc.icmc.usp.br/nilc/index.php/milkqa/) - a dataset of dense questions for the task of answer selection.

- [Minutes of Central Bank of Brazil](https://github.com/ajdavidl/corpus-atas-copom/blob/main/README_English.md) - Minutes of the Monetary Policy Committee of the Central Bank of Brazil.

- [NER in Brazilian Portuguese tweets](https://www.kaggle.com/datasets/rafaelperes/ner-in-brazilian-portuguese-tweets) - Twitter messages in pt-br annotated for the entities PER, LOC and ORG.

- [NERDE](https://huggingface.co/datasets/Gpaiva/NERDE) - Documents from [CADE](https://www.gov.br/cade/pt-br)'s jurisprudence annotated for the entities ORG, PER, TEMPO, LOC, LEG (legislation), DOCS (documents), VALOR. 

- [News-Crawl-PT](https://data.statmt.org/news-crawl/pt/) - Monolingual News Crawl used for WMT.

- [News of the site Folha de São Paulo](https://www.kaggle.com/datasets/marlesson/news-of-the-site-folhauol) - news of the Brazilian Newspaper Folha de São Paulo.

- [News published in Brazil](https://www.kaggle.com/datasets/diogocaliman/notcias-publicadas-no-brasil) - news compilation of the Globo group.

 

- [OAB exams](https://github.com/legal-nlp/oab-exams) - the Brazilian equivalent of the US bar exam ([HuggingFace](https://huggingface.co/datasets/eduagarcia/oab_exams)).

- [Parallel Corpora from Revista Pesquisa FAPESP](http://www.nilc.icmc.usp.br/nilc/tools/Fapesp%20Corpora.htm) - Portuguese-English and Portuguese-Spanish bilingual collections of the online issues of the Brazilian scientific news magazine Revista Pesquisa FAPESP.

- [Pirá](https://github.com/C4AI/Pira) - A Bilingual Portuguese-English Dataset for Question-Answering about the Ocean.

- [PL-corpus](https://huggingface.co/datasets/bergoliveira/pl-corpus) - part of the UlyssesNER-Br, a corpus of Brazilian Legislative Documents for NER with quality baselines.

- [PLUE](https://github.com/ju-resplande/PLUE) - Portuguese translation of the GLUE benchmark and Scitail dataset.

- [POeTiSA](https://sites.google.com/icmc.usp.br/poetisa) - POrtuguese processing - Towards Syntactic Analysis and parsing.

- [politiquices](https://github.com/politiquices/data-releases) - Datasets related to the politiquices.pt project.

- [PorSimplesSent](https://github.com/sidleal/porsimplessent) - a corpus of aligned sentence pairs for investigating sentence readability assessment.

- [PortiLexicon-UD](https://portilexicon.icmc.usp.br/) -  a lexicon for Brazilian Portuguese according to Universal Dependencies.

- [Portuguese-Hate-Speech-Dataset](https://github.com/paulafortuna/Portuguese-Hate-Speech-Dataset) - Portuguese dataset for hate speech detection composed of 5,668 tweets with binary annotations (i.e. 'hate' vs. 'no-hate') ([HuggingFace](https://huggingface.co/datasets/hate_speech_portuguese))

- [Portuguese Legal Sentences](https://huggingface.co/datasets/rufimelo/PortugueseLegalSentences-v3) - Collection of Legal Sentences from the Portuguese Supreme Court of Justice.

- [Portuguese Presidential Elections](https://github.com/msramalho/election-watch/blob/master/datasets/01_portuguese_presidential_elections_2021_01_24.md) - This dataset contains tweets and users mostly from the Portuguese Twittersphere.

- [PraCegoVer](https://github.com/larocs/PraCegoVer) - multi-modal dataset containing images associated with Portuguese captions based on posts from Instagram.

- [Priberam Fine-Grained Opinion Corpus](http://labs.priberam.pt/Resources/Fine-Grained-Opinion-Corpus.aspx) - a Portuguese fine-grained dependency opinion mining corpus.

- [Propbank](http://143.107.183.175:21380/portlex/index.php/en/downloadsingl) - Contains instances annotated with semantic role labels (SRL). 

- [Projeto ACDC](https://www.linguateca.pt/ACDC/) - Internet Access to Corpora.

- [Puntuguese](https://github.com/Superar/Puntuguese/) - A Corpus of Puns in Portuguese with Micro-editions ([HuggingFace](https://huggingface.co/datasets/Superar/Puntuguese))

- [QA-Portuguese](https://huggingface.co/datasets/ju-resplande/qa-pt) - Adaptation from MQA dataset Portuguese split (QA entailment pairs).

- [Quati](https://huggingface.co/datasets/unicamp-dl/quati) - This dataset aims to support the development of Brazilian Portuguese (pt-br) Information Retrieval (IR) systems, providing document passages originally created in pt-br, as well as queries (topics) created by native speakers.

- [REBEL-Portuguese](https://huggingface.co/datasets/ju-resplande/rebel-pt) - Datasets of relations extracted from Wikipedia.

- [ReLi](https://www.linguateca.pt/Repositorio/ReLi/) - REsenha de LIvros (book reviews).

- [RePro](https://github.com/lucasnil/repro) - A Benchmark Dataset for Opinion Mining for Brazilian Portuguese. ([HuggingFace](https://huggingface.co/datasets/lucasnil/repro))

- [Rhetalho](https://sites.icmc.usp.br/taspardo/rhetalho.zip) - corpus annotated with Daniel Marcu's RSTTool.

- [SemClinBr](https://github.com/HAILab-PUCPR/SemClinBr) - multi-institutional and multi-specialty semantically annotated corpus for Portuguese clinical NLP tasks.

- [SESAME](https://sesame-pt.github.io/) - corpus for NER in Portuguese.

- [SIGARRA News Corpus](https://rdm.inesctec.pt/dataset/cs-2017-004) - news from the SIGARRA information system at the University of Porto.

- [SIMPLEX-PB](https://github.com/nathanshartmann/SIMPLEX-PB) - A Lexical Simplification Database and Benchmark for Portuguese.

- [SIMPLEX-PB-2.0](https://github.com/nathanshartmann/simplex-pb-2.0) - improved version of SIMPLEX-PB.

- [SIMPLEX-PB-3.0](https://github.com/nathanshartmann/simplex-pb-3.0) - new version of SIMPLEX-PB.

- [Spotify Subset](https://github.com/aryamtos/spotify-subset) - classifying language variations in Brazilian Portuguese

- [SQUAD-PT v1.1](https://github.com/nunorc/squad-v1.1-pt) - Portuguese translation of the SQuAD dataset.

- [SQUAD-PT v1.1-pt-br](https://huggingface.co/datasets/ArthurBaia/squad_v1_pt_br) - Brazilian Portuguese translation of the SQuAD dataset, translated by Deep Learning Brasil.

- [SQUAD-PT v2.0](https://github.com/cjaniake/squad_v2.0_pt) - Portuguese translation of SQuAD 2.0 dataset.

- [SST-2 pt](https://huggingface.co/datasets/maritaca-ai/sst2_pt) - Automatic translation of the Stanford Sentiment Treebank.

- [TeMário](https://www.linguateca.pt/Repositorio/TeMario/) - news texts and the corresponding human summaries for summarization purposes.

- [Textual Complexity Corpus](https://github.com/gazzola/corpus_readability_nlp_portuguese) - Textual Complexity Corpus for School Stages of the Brazilian Educational System.

- [ToLD-Br](https://huggingface.co/datasets/told-br) - Toxic Language Detection in Social Media for Brazilian Portuguese ([github](https://github.com/JAugusto97/ToLD-Br)).

- [TTS-Portuguese Corpus](https://github.com/Edresson/TTS-Portuguese-Corpus) - Text To Speech Portuguese.

- [TweetSentBR](https://bitbucket.org/HBrum/tweetsentbr/src/master/) - Tweets in Brazilian Portuguese.

- [Tweets for Sentiment Analysis](https://www.kaggle.com/datasets/augustop/portuguese-tweets-for-sentiment-analysis).

- [UD_Portuguese-Bosque](https://github.com/UniversalDependencies/UD_Portuguese-Bosque) - Universal Dependencies (UD) Portuguese treebank.

- [UD_Portuguese-CINTIL](https://github.com/UniversalDependencies/UD_Portuguese-CINTIL) - Universal Dependencies (UD) Portuguese treebank.

- [UD_Portuguese-GSD](https://github.com/UniversalDependencies/UD_Portuguese-GSD) - Universal Dependencies (UD) Portuguese treebank.

- [UD_Portuguese-PetroGold](https://github.com/UniversalDependencies/UD_Portuguese-PetroGold) - Universal Dependencies (UD) Portuguese treebank.

- [UD_Portuguese-PUD](https://github.com/UniversalDependencies/UD_Portuguese-PUD) - Universal Dependencies (UD) Portuguese treebank.

- [UlyssesNER-Br](https://github.com/ulysses-camara/ulysses-ner-br/) - Corpus of Brazilian Legislative Documents for Named Entity Recognition

- [UTLCorpus](https://github.com/RogerFig/UTLCorpus) - a corpus of online reviews in Brazilian Portuguese annotated with helpfulness classification.

- [Winograd Schema Challenge](https://github.com/gabimelo/portuguese_wsc) - Solver for the Portuguese-based Winograd Schema Challenge.

- [WizardVicuna-PTBR-Instruct-Clean](https://huggingface.co/datasets/cnmoro/WizardVicuna-PTBR-Instruct-Clean) - Wizard Vicuna PT-Br Instruct Clean dataset.

### Multilingual datasets

- [A Multilingual Dataset for Investigating Stereotypes and Negative Attitudes Towards Migrant Groups in Large Language Models](https://github.com/dsorato/stereotypes_negative_attitudes_towards_migrants_dataset)

- [askD](https://huggingface.co/datasets/ju-resplande/askD) - ELI5 dataset adapted on Medical Questions (AskDocs) subreddit.

- [English-Portuguese Sentences](http://www.manythings.org/bilingual/por/) - English-Portuguese Sentences from the Tatoeba Project.

- [EUR-Lex](https://www.sketchengine.eu/eurlex-corpus/) - multilingual corpus in all the official languages of the European Union.

- [Europarl](https://www.statmt.org/europarl/) - European Parliament Proceedings Parallel Corpus 1996-2011.

- [Europarl-ST](https://www.mllp.upv.es/europarl-st/) - Multilingual Speech Translation Corpus containing paired audio-text samples for Speech Translation, built from the debates held in the European Parliament between 2008 and 2012.

- [mc4](https://huggingface.co/datasets/mc4/viewer/pt/train) - multilingual, colossal, cleaned version of Common Crawl's web crawl corpus.

- [mfaq](https://huggingface.co/datasets/clips/mfaq) - multilingual corpus of Frequently Asked Questions parsed from the Common Crawl.

- [MKQA](https://huggingface.co/datasets/mkqa) - Multilingual Knowledge Questions & Answers ([github](https://github.com/apple/ml-mkqa)).

- [MQA](https://huggingface.co/datasets/clips/mqa) - multilingual corpus of Questions and Answers (MQA) parsed from the Common Crawl.

- [MMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) - Multilingual version of the MS MARCO passage ranking dataset.

- [mRobust](https://huggingface.co/datasets/unicamp-dl/mrobust) - Multilingual version of the TREC 2004 Robust passage ranking dataset 

- [MultiCoNER](https://huggingface.co/datasets/MultiCoNER/multiconer_v2) - a large multilingual dataset for Named Entity Recognition.

- [MuST-C](https://ict.fbk.eu/must-c/) - multilingual speech translation corpus.

- [OpenSubtitles](https://opus.nlpl.eu/OpenSubtitles.php) - collection of translated movie subtitles.

- [OSCAR](https://oscar-corpus.com/) - Open Super-large Crawled Aggregated coRpus.

- [Tatoeba](https://tatoeba.org/en/downloads) - a large database of sentences and translations. 

- [TED2020](https://opus.nlpl.eu/TED2020.php) - contains a crawl of nearly 4000 TED and TED-X transcripts from July 2020. 

- [TSAR-2022-Shared-Task](https://github.com/LaSTUS-TALN-UPF/TSAR-2022-Shared-Task) - TSAR2022 Shared Task on Lexical Simplification.

- [WikiANN](https://huggingface.co/datasets/wikiann) - multilingual named entity recognition dataset consisting of Wikipedia articles annotated with LOC (location), PER (person), and ORG (organisation) tags in the IOB2 format.

- [WikiLingua](https://github.com/esdurmus/Wikilingua) - Multilingual abstractive summarization dataset extracted from WikiHow.

- [WikiMatrix](https://github.com/facebookresearch/LASER/tree/main/tasks/WikiMatrix) - Parallel Sentences in 1620 Language Pairs from Wikipedia.

- [Wikiner](https://figshare.com/articles/dataset/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500) - Learning multilingual named entity recognition from Wikipedia.

- [WikiNEuRal](https://github.com/Babelscape/wikineural) - Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER (EMNLP 2021).

- [Wikipedia](https://huggingface.co/datasets/wikipedia) - Wikipedia dataset containing cleaned articles of all languages.

- [XFORMAL](https://github.com/Elbria/xformal-FoST) - A Benchmark for Multilingual Formality Style Transfer.

- [XLSUM](https://huggingface.co/datasets/csebuetnlp/xlsum) - 1.35 million professionally annotated article-summary pairs from BBC.
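
Several NER resources listed above (WikiANN, for example) label tokens in the IOB2 format. As a minimal sketch of how such tags can be grouped into entities, assuming the tags are already available as strings such as `B-PER` and `I-PER`: the function name `iob2_to_spans` and the example sentence are illustrative and do not come from any of the listed datasets.

```python
from typing import List, Tuple


def iob2_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group IOB2 tags into (label, start, end) spans; end is exclusive."""
    spans = []
    label, start = None, None
    for i, tag in enumerate(tags):
        # An I- tag with the same label as the open entity extends it.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Anything else closes the entity that is currently open, if any.
        if label is not None:
            spans.append((label, start, i))
            label, start = None, None
        # B- opens a new entity; a stray I- is leniently treated the same way.
        if tag.startswith("B-") or tag.startswith("I-"):
            label, start = tag[2:], i
    # Close an entity that runs to the end of the sentence.
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


# Hypothetical example sentence, tagged by hand for illustration.
tokens = ["Fernando", "Pessoa", "nasceu", "em", "Lisboa", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]
entities = [(label, " ".join(tokens[s:e])) for label, s, e in iob2_to_spans(tags)]
print(entities)
```

Datasets may store tags as integer class ids rather than strings; in that case the ids would first need to be mapped to their tag names using the dataset's own label list.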

## Lexicon

- [BATS-PT](https://github.com/NLP-CISUC/PT-LexicalSemantics/tree/master/BATS-PT) - manual translation of the lexicographic portion of the Bigger Analogy Test Set (BATS) to Portuguese

- [br.ispell](https://www.ime.usp.br/~ueda/br.ispell/summary.html) - Ispell dictionary for Brazilian Portuguese ([github](https://github.com/fititnt/br.ispell-dicionario-portugues-brasileiro)).

- [Conceptnet](https://conceptnet.io/) - an open, multilingual knowledge graph.

- [DicSin](https://github.com/fititnt/DicSin-dicionario-sinonimos-portugues-brasileiro) - Dictionary of synonyms and antonyms.

- [lexiconPT](https://github.com/sillasgonzaga/lexiconPT) - R package that provides lexicons for Portuguese Text Analysis.

- [lexicons](https://github.com/davidsbatista/lexicons) - Dictionaries of names, surnames, acronyms and their extensions, stop-words, etc.

- [LIWC](http://nilc.icmc.usp.br/portlex/index.php/en/liwc) - Linguistic Inquiry and Word Count ([dictionary](https://sites.icmc.usp.br/sandra/LIWC/LIWC2007_Portugues_win.dic))

- [Onto.PT](http://ontopt.dei.uc.pt/) - Lexical Ontology for Portuguese.

- [OpenWordnet-PT](https://github.com/own-pt/openWordnet-PT) - an open access wordnet for Portuguese ([site](http://wn.mybluemix.net/)).

- [OpLexicon](https://www.inf.pucrs.br/linatural/wordpress/recursos-e-ferramentas/oplexicon/) - a sentiment lexicon for the Portuguese language.

- [palavras](https://github.com/pythonprobr/palavras) - Word list of Brazilian Portuguese.

- [PAPEL](https://www.linguateca.pt/PAPEL/).

- [pt-br](https://github.com/fserb/pt-br) - Wordlist, verbs, conjugations, term frequencies.

- [PT-LKB](https://github.com/NLP-CISUC/PT-LexicalSemantics/) - Large Portuguese Lexical-Semantic Knowledge Base

- [PULO](http://wordnet.pt/) - Portuguese Unified Lexical Ontology.

- [SentiLex-PT](http://b2find.eudat.eu/dataset/b6bd16c2-a8ab-598f-be41-1e7aeecd60d3) - a sentiment lexicon for Portuguese.

- [Stopwords](https://github.com/stopwords-iso/stopwords-pt) - Portuguese stopwords collection.

- [Tep2](http://www.nilc.icmc.usp.br/tep2/).

- [Unitex-PB](http://www.nilc.icmc.usp.br/nilc/projects/unitex-pb/web/dicionarios.html) - lexical resources.

- [VaLexPB](https://github.com/jessemourao/VaLexPB) - a lexicon of Brazilian Portuguese verb valences.

- [VerbNet.Br 1.0](http://143.107.183.175:21380/portlex/index.php/en/projects/verbnetbringl) - verbal lexicon of Brazilian Portuguese.

- [wikidict-dsl-pt](https://github.com/open-dsl-dict/wikidict-dsl-pt) - Wikidata Bilingual DSL Dictionaries.

- [wikidict-pt](https://github.com/open-dict-data/wikidict-pt) - Wikipedia Bilingual Reference Data (Portuguese).

- [Wordnetaffectbr](https://www.inf.pucrs.br/linatural/wordpress/recursos-e-ferramentas/wordnetaffectbr/) - vocabulary of emotions words.

- [Wordnet.Br](http://www.nilc.icmc.usp.br/wordnetbr/) - Portuguese WordNet.

## Models

- [Albertina PT-BR](https://huggingface.co/PORTULAN/albertina-ptbr) - an encoder of the BERT family for the American variant of Portuguese, from Brazil.

- [Albertina PT-PT](https://huggingface.co/PORTULAN/albertina-ptpt) - an encoder of the BERT family for the European variant of Portuguese, from Portugal.

- [Alpaca-LoRA-PTBR](https://huggingface.co/dominguesm/alpaca-lora-ptbr-7b) - Low-Rank LLaMA Instruct-Tuning.

- [BART](https://huggingface.co/adalbertojunior/bart-base-portuguese) - BART pre-trained on Portuguese.

- [BERTimbau](https://huggingface.co/neuralmind/bert-base-portuguese-cased) - BERTimbau Base is a pretrained BERT model for Brazilian Portuguese that achieves state-of-the-art performance on three downstream NLP tasks: Named Entity Recognition, Sentence Textual Similarity and Recognizing Textual Entailment ([Github](https://github.com/neuralmind-ai/portuguese-bert)).

- [BioBERTpt](https://huggingface.co/pucpr/biobertpt-all) - fine-tuned BERT models trained on the clinical domain for the Portuguese language ([Github](https://github.com/HAILab-PUCPR/BioBERTpt)).

- [Bode](https://huggingface.co/recogna-nlp/bode-7b-alpaca-pt-br) - a fine-tuned LLaMA 2-based model for Portuguese prompts ([13b](https://huggingface.co/recogna-nlp/bode-13b-alpaca-pt-br)).

- [Cabrita](https://huggingface.co/22h/cabrita-lora-v0-1) - a Portuguese instruction-finetuned LLaMA ([Github](https://github.com/22-hours/cabrita)).

- [DeBERTinha](https://huggingface.co/sagui-nlp/debertinha-ptbr-xsmall) - A DeBERTa V3 XSmall adapted to the Brazilian Portuguese language ([Github](https://github.com/sagui-nlp/DeBERTinha)).

- [Electra](https://huggingface.co/dlb/electra-base-portuguese-uncased-brwac) - Electra model trained on BrWaC.

- [FinBERT-PT-BR](https://huggingface.co/lucas-leme/FinBERT-PT-BR) - a pre-trained NLP model to analyze sentiment of Brazilian Portuguese financial texts.

- [Gervasio-PT-BR](https://huggingface.co/PORTULAN/gervasio-ptbr-base) - a decoder of the GPT family for the American variant of Portuguese, from Brazil.

- [Gervasio-PT-PT](https://huggingface.co/PORTULAN/gervasio-ptpt-base) - a decoder of the GPT family for the European variant of Portuguese, from Portugal.

- [GlórIA 1.3B](https://github.com/rvlopes/GlorIA) - A Portuguese European-focused Large Language Model ([HuggingFace](https://huggingface.co/NOVA-vision-language/GlorIA-1.3B))

- [GPT2 small](https://huggingface.co/pierreguillou/gpt2-small-portuguese) - GPorTuguese-2 (Portuguese GPT-2 small) is a state-of-the-art language model for Portuguese based on the GPT-2 small model.

- [GPT-Neo small](https://huggingface.co/HeyLucasLeao/gpt-neo-small-portuguese) - a version of GPT-Neo 125M by EleutherAI fine-tuned for the Portuguese language.

- [GPT2-Bio-PT](https://huggingface.co/pucpr/gpt2-bio-pt) - a biomedical fine-tuned version of GPorTuguese-2 ([Github](https://github.com/HAILab-PUCPR/gpt2-bio-pt)).

- [NERDE-base](https://huggingface.co/Gpaiva/NERDE-base) - BERTimbau fine-tuned for NER on judicial documents.

- [roberta-pt-br](https://huggingface.co/josu/roberta-pt-br)

- [RoBERTaCrawlPT-base](https://huggingface.co/eduagarcia/RoBERTaCrawlPT-base) - a generic Portuguese Masked Language Model pretrained from scratch on the CrawlPT corpora.

- [RoBERTaLexPT-base](https://huggingface.co/eduagarcia/RoBERTaLexPT-base) - Portuguese Masked Language Model pretrained from scratch on the LegalPT and CrawlPT corpora.

- [Sabiá](https://huggingface.co/maritaca-ai/sabia-7b) - Sabiá-7B is a Portuguese language model developed by Maritaca AI.

- [Sabiá 2](https://www.maritaca.ai/en/sabia-2) - Language model trained on Portuguese text, especially in the Brazilian domain.

- [T5](https://github.com/unicamp-dl/PTT5) - T5 model pretrained on Brazilian Portuguese data.

- [tgf-xlm-roberta-base-pt-br](https://huggingface.co/thegoodfellas/tgf-xlm-roberta-base-pt-br) - a fine-tuned version of xlm-roberta-base on the BrWac dataset ([Github](https://github.com/the-good-fellas/xlm-roberta-pt-br)).

- [Wav2vec](https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-portuguese) - Fine-tuned facebook/wav2vec2-large-xlsr-53 on Portuguese using the train and validation splits of Common Voice 6.1. 

### Multilingual Models

- [Bloom](https://huggingface.co/bigscience/bloom) - BigScience Large Open-science Open-access Multilingual Language Model.

- [mBert](https://huggingface.co/bert-base-multilingual-cased) - Pretrained model on the top 104 languages with the largest Wikipedia using a masked language modeling (MLM) objective.

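- [mDeBERTa](https://huggingface.co/microsoft/mdeberta-v3-base) - multilingual version of DeBERTa, which improves on the BERT and RoBERTa models using disentangled attention and an enhanced mask decoder.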

- [mGPT](https://huggingface.co/sberbank-ai/mGPT) - Multilingual GPT model. An autoregressive GPT-like model.

- [mMiniLM](https://huggingface.co/unicamp-dl/mMiniLM-L6-v2-pt-v2) - mMiniLM-L6-v2 Reranker finetuned on mMARCO

- [mT5](https://huggingface.co/google/mt5-base) - Multilingual T5. A massively multilingual pre-trained text-to-text transformer.

- [XLM-RoBERTa](https://huggingface.co/xlm-roberta-base) - XLM-RoBERTa model pre-trained on 2.5TB of filtered CommonCrawl data containing 100 languages. 

- [LaBSE](https://huggingface.co/sentence-transformers/LaBSE) - Language-agnostic BERT Sentence Encoder (LaBSE) is a BERT-based model trained for sentence embedding for 109 languages. 

## Word Embeddings

- [fastText](https://fasttext.cc/docs/en/crawl-vectors.html) - Multi-lingual word vectors.

- [LASER](https://github.com/facebookresearch/LASER) - Language-Agnostic SEntence Representations.

- [NILC-Embeddings](http://www.nilc.icmc.usp.br/embeddings) - Word embeddings trained in Portuguese by USP.

- [MUSE](https://github.com/facebookresearch/MUSE) - Multilingual Unsupervised and Supervised Embeddings.

- [word vectors](https://github.com/Kyubyong/wordvectors) - Pre-trained word vectors of 30+ languages.

## Metrics

- [Coh-Metrix-Port](https://github.com/nilc-nlp/coh-metrix-port) - an adaptation of the Coh-Metrix text analysis tool to the Brazilian Portuguese language.

- [NILC-Metrix](https://github.com/sidleal/nilcmetrix) - gathers the metrics developed over more than a decade at the NILC lab.

## Leaderboards

- [Open PT LLM Leaderboard](https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard) - aims to provide a benchmark for evaluating Large Language Models (LLMs) in Portuguese across a variety of tasks and datasets.

## Frameworks

- [nlpnet](http://nilc.icmc.usp.br/nlpnet/)

- [NLTK](https://www.nltk.org/howto/portuguese_en.html)

- [polyglot](https://github.com/aboSamoor/polyglot)

- [spaCy](https://spacy.io/models/pt)

- [Stanza NLP](https://stanfordnlp.github.io/stanza/available_models.html)

- [udpipe](https://github.com/bnosac/udpipe)

## Institutions

- [Brasileiras em PLN](https://brasileiraspln.com/)

- [HAILab-PUCPR](https://github.com/HAILab-PUCPR) - A pioneering research group aiming to develop solutions for health care using Natural Language Processing and Machine Learning.

- [Linguateca](https://www.linguateca.pt/)

- [NILC](http://www.nilc.icmc.usp.br/nilc/index.php)

- [NLPortuguês](https://nlportugues.ime.usp.br/) - Devoted to creating NLP courses in Brazilian Portuguese.

- [NLX-Group](http://nlx.di.fc.ul.pt/)

- [PLN PUCRS](https://www.inf.pucrs.br/linatural/wordpress/)

## Tools

- [Apertium-por](https://github.com/apertium/apertium-por) - Apertium linguistic data for Portuguese.

- [Autocorrect](https://github.com/filyp/autocorrect) - Spelling corrector in Python.

- [BrGram](https://github.com/LR-POR/BrGram) - Computational grammar fragment of Brazilian Portuguese in the LFG formalism implemented in XLE.

- [Dicio API](https://github.com/ThiagoNelsi/dicio-api) - Portuguese dictionary API.

- [dict-pt-br](https://github.com/VisualText/dict-pt-br) - dictionary for Brazilian Portuguese.

- [Languagetool](https://github.com/languagetool-org/languagetool) - Style and Grammar Checker for 25+ Languages.

- [LegalNLP](https://github.com/felipemaiapolo/legalnlp) - Natural Language Processing Methods for the Brazilian Legal Language.

- [LexML Parser](https://github.com/lexml/lexml-parser-projeto-lei) - parser for legal documents.

- [LX parser](http://lxcenter.di.fc.ul.pt/tools/en/LXParserEN.html) - statistical constituency parser for Portuguese.

- [metaphone-ptbr](https://github.com/carlosjordao/metaphone-ptbr) - Metaphone algorithm for the Portuguese language.

- [mlconjug3](https://github.com/SekouDiaoNlp/mlconjug3) - a Python library to conjugate verbs in Portuguese and other languages.

- [MorphoBr](https://github.com/LR-POR/MorphoBr) - Resources for morphological analysis of Portuguese.

- [OpCluster](https://github.com/franciellevargas/Opcluster) - Automatic extraction and clustering of fine-grained opinions.

- [Phonemizer](https://github.com/bootphon/phonemizer) - Simple text to phones converter for multiple languages.

- [PorGram](https://github.com/LR-POR/PorGram) - Open source computational grammar for Portuguese in the HPSG formalism.

- [pymetaphone-br](https://github.com/Escavador/pymetaphone-br) - Metaphone algorithm package for the Portuguese language.

- [pysentimiento](https://github.com/pysentimiento/pysentimiento) - Multilingual toolkit for Sentiment Analysis and Social NLP tasks.

- [pyspellchecker](https://github.com/barrust/pyspellchecker) - Multilingual Spell Checking.

- [RBAMR](https://github.com/rafaelanchieta/rbamr) - A Rule-Based AMR Parser for Portuguese.

- [Verbecc](https://github.com/bretttolbert/verbecc) - Complete Conjugation of any Verb using Machine Learning for French, Spanish, Portuguese, Italian and Romanian.

## Other lists

- [Annotated Semantic Relationships Datasets](https://github.com/davidsbatista/Annotated-Semantic-Relationships-Datasets)

- [Linguistic datasets](https://github.com/EticaAI/linguistic-datasets-portuguese) - Linguistic Datasets for Portuguese.

- [NER-datasets for Portuguese](https://github.com/davidsbatista/NER-datasets/tree/master/Portuguese)

- [NILC](http://www.nilc.icmc.usp.br/nilc/index.php/tools-and-resources)

- [NILC 2](https://sites.google.com/view/nilc-usp/resources-and-tools)

- [NILC 3](https://sites.icmc.usp.br/taspardo/Projects.htm)

- [Opinando](https://sites.google.com/icmc.usp.br/opinando/p%C3%A1gina-inicial) - Opinion Mining for Portuguese.

- [Portuguese dataset List](https://forum.ailab.unb.br/t/datasets-em-portugues/251)

## Other links

- [OPUS](https://opus.nlpl.eu/) - OPUS is a growing collection of translated texts from the web. 

- [Statistical and Neural Machine Translation](https://statmt.org/).

![Visitor Badge](https://visitor-badge.laobi.icu/badge?page_id=ajdavidl.Portuguese-NLP)