https://github.com/ajdavidl/portuguese-nlp

# Portuguese-NLP

List of resources and tools developed with focus on Portuguese.

## Datasets

- [#PraCegoVer](https://huggingface.co/datasets/gabrielsantosrv/pracegover) - multi-modal dataset with Portuguese captions based on posts from Instagram.
- [18th-century Portuguese medical texts](https://github.com/uebelsetzer/NLP_for_18th-century_Portuguese_medical_texts)
- [AG_news pt](https://huggingface.co/datasets/maritaca-ai/ag_news_pt) - automatic translation of the AG's corpus of news articles.
- [Alpaca data pt-br](https://huggingface.co/datasets/dominguesm/alpaca-data-pt-br) - Stanford Alpaca dataset translated into Brazilian Portuguese using the Helsinki-NLP/opus-mt-tc-big-en-pt model.
- [AspectBR](https://github.com/franciellevargas/AspectBR) - Aspect-based annotated dataset of web consumer reviews.
- [ASSIN](http://nilc.icmc.usp.br/assin/) - a dataset with semantic similarity score and entailment annotations. ([HuggingFace](https://huggingface.co/datasets/assin))
- [ASSIN 2](https://sites.google.com/view/assin2) - follow-up to ASSIN. ([HuggingFace](https://huggingface.co/datasets/assin2))
- [Automated Essay Score (AES) ENEM Dataset](https://github.com/kamel-usp/aes_enem) - Benchmark for automatic essay scoring in Portuguese ([HuggingFace](https://huggingface.co/datasets/kamel-usp/aes_enem_dataset))
- [Aya Dataset PT](https://huggingface.co/datasets/nicolasdec/aya_dataset_pt) - CohereForAI Aya Dataset filtered for Portuguese (PT).
- [BlogSet-BR](https://www.inf.pucrs.br/linatural/wordpress/recursos-e-ferramentas/blogset-br-english/) - a collection of posts gathered from the Blogspot platform, written by Brazilian users.
- [BLUEX](https://huggingface.co/datasets/portuguese-benchmark-datasets/BLUEX) - A benchmark based on Brazilian Leading Universities Entrance eXams.
- [BoolQ](https://huggingface.co/datasets/maritaca-ai/boolq_pt) - automatic translation of BoolQ.
- [br-quad-2.0](https://github.com/piEsposito/br-quad-2.0) - Stanford Question Answering Dataset (SQuAD) 2.0 translated into Brazilian Portuguese (PT-BR).
- [Brands.Br](https://github.com/metalmorphy/Brands.Br) - a Portuguese Reviews Corpus
- [Brazilian Court Decisions](joelniklaus/brazilian_court_decisions) - collection of 4043 Ementa (summary) court decisions and their metadata from the Tribunal de Justiça de Alagoas (TJAL), the State Supreme Court of Alagoas (Brazil).
- [Brazilian E-Commerce](https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce) - Brazilian E-Commerce Public Dataset by Olist store.
- [Brazilian Headlines Sentiments](https://www.kaggle.com/datasets/brunoluvizotto/brazilian-headlines-sentiments) - Dataset containing sentiment analysis of Brazilian news agencies headlines.
- [Brazilian Portuguese Literature Corpus](https://www.kaggle.com/datasets/rtatman/brazilian-portuguese-literature-corpus) - 3.7 million word corpus of Brazilian literature published between 1840-1908.
- [Brazilian Portuguese Narrative Essays Dataset](https://www.kaggle.com/datasets/moesiof/portuguese-narrative-essays) - Dataset for Automatic Essay Scoring of Brazilian Portuguese Narrative Essays.
- [Brazilian Portuguese Sentiment Analysis Datasets](https://www.kaggle.com/datasets/fredericods/ptbr-sentiment-analysis-datasets).
- [Brazilian TCU's judgments](https://www.kaggle.com/datasets/ferraz/acordaos-tcu) - Judgments of Federal Court of Accounts - Brazil (TCU).
- [BrWaC](https://www.inf.ufrgs.br/pln/wiki/index.php?title=BrWaC) - Brazilian Portuguese Web as Corpus.
- [BrWac2Wiki](https://github.com/aseidelo/BrWac2Wiki) - a dataset for multi-document summarization in Portuguese.
- [B2W-Reviews01](https://github.com/americanas-tech/b2w-reviews01) - product reviews.
- [Canarim](https://github.com/DominguesM/canarim) - A Large-Scale Dataset of Web Pages in the Portuguese Language ([huggingface](https://huggingface.co/datasets/dominguesm/canarim))
- [Carolina](https://sites.usp.br/corpuscarolina/) - General Corpus of Contemporary Brazilian Portuguese ([huggingface](https://huggingface.co/datasets/carolina-c4ai/corpus-carolina)).
- [Capes](https://huggingface.co/datasets/capes) - parallel corpus of theses and dissertations abstracts in English and Portuguese.
- [CC100-Portuguese](https://autonlp.ai/datasets/cc100-portuguese) - Created by Conneau & Wenzek et al. in 2020. This dataset is one of the 100 monolingual corpora processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository.
- [CETENFolha](https://www.linguateca.pt/cetenfolha/index_info.html) - news from the newspaper Folha de S. Paulo.
- [CHAVE](https://www.linguateca.pt/CHAVE/) - collection for Information Retrieval and Question Answering.
- [CINTIL Corpus](http://cintil.ul.pt/cintilfeatures.html#corpus) - a linguistically interpreted corpus of Portuguese.
- [ClinicalNER](https://github.com/fabioacl/PortugueseClinicalNER) - Clinical Named Entity Recognition in Portuguese.
- [Complexidade Textual para Estágios Escolares do Sistema Educacional Brasileiro](https://github.com/gazzola/corpus_readability_nlp_portuguese).
- [CORAA](https://github.com/nilc-nlp/CORAA) - dataset for Automatic Speech Recognition.
- [CORAA SER](https://github.com/rmarcacini/ser-coraa-pt-br) - Emotion Recognition from Brazilian Portuguese Informal Spontaneous Speech.
- [CrawlPT_dedup](https://huggingface.co/datasets/eduagarcia/CrawlPT_dedup) - CrawlPT (deduplicated) is composed of three corpora: brWaC, CC100-PT, and OSCAR-2301.
- [CSTNews](https://sites.icmc.usp.br/taspardo/sucinto/cstnews.html) - a corpus with 50 clusters of news texts with their multi-document summaries, as well as several discourse and semantic annotations.
- [C-ORAL-BRASIL](http://www.c-oral-brasil.org/english-site/index.html) - This project is dedicated to the study of Brazilian Portuguese spontaneous speech and, more broadly, to the compilation of spoken corpora.
- [DANTEStocks](https://www.kaggle.com/datasets/michelmzerbinati/portuguese-tweet-corpus-annotated-with-ner) - Corpus of stock market tweets written in Brazilian Portuguese and annotated with named entities according to HAREM's taxonomy.
- [DEEPAGÉ](https://github.com/C4AI/deepage) - Answering Questions in Portuguese about the Brazilian Environment.
- [DNLT-BP](https://github.com/nilc-nlp/DNLT-BP) - Datasets of Neuropsychological Language Tests in Brazilian Portuguese.
- [ENEM Challenge](https://www.ime.usp.br/~ddm/project/enem/) - Consists of the writing of an essay and an objective part containing 180 multiple choice questions.
- [ENEM-2022 and ENEM-2023](https://huggingface.co/datasets/maritaca-ai/enem) - These projects encompass all multiple-choice questions from the last two editions of the Exame Nacional do Ensino Médio (ENEM), the main standardized entrance examination adopted by Brazilian universities.
- [Essay-BR](https://github.com/rafaelanchieta/essay) - Essay-BR: a corpus of essays for the Brazilian Portuguese language.
- [Extended Essay-BR](https://github.com/lplnufpi/essay-br) - Extended version of the Essay-BR corpus.
- [FACTCK.BR](https://github.com/jghm-f/FACTCK.BR) - A dataset to study Fake News in Portuguese.
- [FactNews](https://github.com/franciellevargas/FactNews) - dataset to predict sentence-level factuality of news reporting.
- [fake voices](https://huggingface.co/datasets/unfake/fake_voices) - deepfakes in Brazilian Portuguese created with the XTTS model.
- [Fake.Br](https://github.com/roneysco/Fake.br-Corpus) - aligned true and fake news written in Brazilian Portuguese ([HuggingFace](https://huggingface.co/datasets/fake-news-UFG/fakebr)).
- [Central_de_fatos](https://doi.org/10.5281/zenodo.5191798) - ([Huggingface](https://huggingface.co/datasets/fake-news-UFG/central_de_fatos)).
- [FakeNewsSet](https://github.com/kamplus/FakeNewsSetGen) - ([HuggingFace](https://huggingface.co/datasets/fake-news-UFG/FakeNewsSet)).
- [Fakepedia-Corpus](https://github.com/andersoncordeiro/Fakepedia-Corpus) - fake news dataset.
- [FakeRecogna](https://github.com/Gabriel-Lino-Garcia/FakeRecogna) - dataset comprised of real and fake news ([Huggingface](https://huggingface.co/datasets/recogna-nlp/FakeRecogna)).
- [FakeWhatsApp.Br](https://github.com/cabrau/FakeWhatsApp.Br) - An annotated Corpus of WhatsApp messages in PT-BR for automatic detection of textual misinformation.
- [FKTC](https://github.com/GoloMarcos/FKTC) - FaKe news Text Collections.
- [Floresta Sintá(c)tica](https://www.linguateca.pt/Floresta/) - treebank for Portuguese.
- [HAREM first](https://www.linguateca.pt/primeiroHAREM/harem_coleccaodourada_en.html) - evaluation contest for named entity recognizers in Portuguese.
- [HAREM second](https://www.linguateca.pt/HAREM/) - evaluation contest for named entity recognizers in Portuguese.
- [HateBR](https://github.com/franciellevargas/HateBR) - large-scale expert annotated corpus of Brazilian Instagram comments for hate speech and offensive language detection on the web and social media.
- [Historical Portuguese Corpora](http://www.nilc.icmc.usp.br/nilc/projects/hpc/) - tools and resources for manipulation of historical corpora and management of historical dictionaries.
- [IMDB pt](https://huggingface.co/datasets/maritaca-ai/imdb_pt) - Automatic translation of IMDB.
- [InferBR](https://github.com/lbencke/InferBR) - Natural Language Inference dataset.
- [Iudicium Textum Dataset](http://dadosabertos.c3sl.ufpr.br/acordaos/) - contains legal documents created by Brazilian Federal Supreme Court in its integral composition ([paper](https://www.researchgate.net/publication/336022563_Iudicium_Textum_Dataset_Uma_Base_de_Textos_Juridicos_para_NLP)).
- [LeNER-Br](https://github.com/peluz/lener-br) - a Dataset for Named Entity Recognition in Brazilian Legal Text.
- [LegalPT_dedup](https://huggingface.co/datasets/eduagarcia/LegalPT_dedup) - LegalPT (deduplicated) aggregates the maximum amount of publicly available legal data in Portuguese.
- [Leg2Kids](http://www.nilc.icmc.usp.br/leg2kids/) - lexicon of the Portuguese words most heard by children.
- [Mac-Morpho](http://www.nilc.icmc.usp.br/macmorpho/) - Brazilian Portuguese texts annotated with part-of-speech tags.
- [MilkQA](http://www.nilc.icmc.usp.br/nilc/index.php/milkqa/) - a dataset of dense questions for the task of answer selection.
- [Minutes of Central Bank of Brazil](https://github.com/ajdavidl/corpus-atas-copom/blob/main/README_English.md) - Minutes of the Monetary Policy Committee of the Central Bank of Brazil.
- [NER in Brazilian Portuguese tweets](https://www.kaggle.com/datasets/rafaelperes/ner-in-brazilian-portuguese-tweets) - Twitter messages in pt-br annotated for the entities PER, LOC and ORG.
- [NERDE](https://huggingface.co/datasets/Gpaiva/NERDE) - Documents from [CADE](https://www.gov.br/cade/pt-br)'s jurisprudence annotated for the entities ORG, PER, TEMPO, LOC, LEG (legislation), DOCS (documents), VALOR.
- [News-Crawl-PT](https://data.statmt.org/news-crawl/pt/) - Monolingual News Crawl used for WMT.
- [News of the site Folha de São Paulo](https://www.kaggle.com/datasets/marlesson/news-of-the-site-folhauol) - news of the Brazilian Newspaper Folha de São Paulo.
- [News published in Brazil](https://www.kaggle.com/datasets/diogocaliman/notcias-publicadas-no-brasil) - news compilation of the Globo group.
- [OAB exams](https://github.com/legal-nlp/oab-exams) - Brazilian equivalent of the US bar exam ([HuggingFace](https://huggingface.co/datasets/eduagarcia/oab_exams)).
- [Parallel Corpora from Revista Pesquisa FAPESP](http://www.nilc.icmc.usp.br/nilc/tools/Fapesp%20Corpora.htm) - Portuguese-English and Portuguese-Spanish bilingual collections of the online issues of the scientific news Brazilian magazine Revista Pesquisa FAPESP.
- [NURC-SP](http://tarsila.icmc.usp.br:8080/nurc/catna)
- [Pirá](https://github.com/C4AI/Pira) - A Bilingual Portuguese-English Dataset for Question-Answering about the Ocean.
- [PL-corpus](https://huggingface.co/datasets/bergoliveira/pl-corpus) - part of the UlyssesNER-Br, a corpus of Brazilian Legislative Documents for NER with quality baselines.
- [PLUE](https://github.com/ju-resplande/PLUE) - Portuguese translation of the GLUE benchmark and Scitail dataset.
- [POeTiSA](https://sites.google.com/icmc.usp.br/poetisa) - POrtuguese processing - Towards Syntactic Analysis and parsing.
- [politiquices](https://github.com/politiquices/data-releases) - Datasets related to the politiquices.pt project.
- [PorSimplesSent](https://github.com/sidleal/porsimplessent) - corpus of aligned sentence pairs for investigating sentence readability assessment.
- [PortiLexicon-UD](https://portilexicon.icmc.usp.br/) - a lexicon for Brazilian Portuguese according to Universal Dependencies.
- [Portuguese-Hate-Speech-Dataset](https://github.com/paulafortuna/Portuguese-Hate-Speech-Dataset) - Portuguese dataset for hate speech detection composed of 5,668 tweets with binary annotations (i.e. 'hate' vs. 'no-hate') ([HuggingFace](https://huggingface.co/datasets/hate_speech_portuguese))
- [Portuguese Legal Sentences](https://huggingface.co/datasets/rufimelo/PortugueseLegalSentences-v3) - Collection of Legal Sentences from the Portuguese Supreme Court of Justice.
- [Portuguese Presidential Elections](https://github.com/msramalho/election-watch/blob/master/datasets/01_portuguese_presidential_elections_2021_01_24.md) - This dataset contains tweets and users mostly from the Portuguese Twittersphere.
- [PraCegoVer](https://github.com/larocs/PraCegoVer) - multi-modal dataset containing images associated with Portuguese captions based on posts from Instagram.
- [Priberam Fine-Grained Opinion Corpus](http://labs.priberam.pt/Resources/Fine-Grained-Opinion-Corpus.aspx) - a Portuguese fine-grained dependency opinion mining corpus.
- [Propbank](http://143.107.183.175:21380/portlex/index.php/en/downloadsingl) - Contains instances annotated with semantic role labels (SRL).
- [Projeto ACDC](https://www.linguateca.pt/ACDC/) - Internet Access to Corpora.
- [Puntuguese](https://github.com/Superar/Puntuguese/) - A Corpus of Puns in Portuguese with Micro-editions ([HuggingFace](https://huggingface.co/datasets/Superar/Puntuguese))
- [QA-Portuguese](https://huggingface.co/datasets/ju-resplande/qa-pt) - Adaptation from MQA dataset Portuguese split (QA entailment pairs).
- [Quati](https://huggingface.co/datasets/unicamp-dl/quati) - This dataset aims to support the development of Brazilian Portuguese (pt-br) Information Retrieval (IR) systems, providing document passages originally written in pt-br, as well as queries (topics) created by native speakers.
- [REBEL-Portuguese](https://huggingface.co/datasets/ju-resplande/rebel-pt) - Relation datasets derived from Wikipedia.
- [ReLi](https://www.linguateca.pt/Repositorio/ReLi/) - REsenha de LIvros (book reviews).
- [RePro: A Benchmark Dataset for Opinion Mining for Brazilian Portuguese](https://github.com/lucasnil/repro) - A Benchmark Dataset for Opinion Mining for Brazilian Portuguese. ([HuggingFace](https://huggingface.co/datasets/lucasnil/repro))
- [Rhetalho](https://sites.icmc.usp.br/taspardo/rhetalho.zip) - corpus annotated with Daniel Marcu's RSTTool.
- [SemClinBr](https://github.com/HAILab-PUCPR/SemClinBr) - multi-institutional and multi-specialty semantically annotated corpus for Portuguese clinical NLP tasks.
- [SESAME](https://sesame-pt.github.io/) - corpus for NER in Portuguese.
- [SIGARRA News Corpus](https://rdm.inesctec.pt/dataset/cs-2017-004) - news corpus from the SIGARRA information system at the University of Porto.
- [SIMPLEX-PB](https://github.com/nathanshartmann/SIMPLEX-PB) - A Lexical Simplification Database and Benchmark for Portuguese.
- [SIMPLEX-PB-2.0](https://github.com/nathanshartmann/simplex-pb-2.0) - improved version of SIMPLEX-PB.
- [SIMPLEX-PB-3.0](https://github.com/nathanshartmann/simplex-pb-3.0) - new version of SIMPLEX-PB.
- [Spotify Subset](https://github.com/aryamtos/spotify-subset) - classifying language variations in Brazilian Portuguese
- [SQUAD-PT v1.1](https://github.com/nunorc/squad-v1.1-pt) - Portuguese translation of the SQuAD dataset.
- [SQUAD-PT v1.1-pt-br](https://huggingface.co/datasets/ArthurBaia/squad_v1_pt_br) - Brazilian Portuguese translation of the SQuAD dataset, translated by Deep Learning Brasil.
- [SQUAD-PT v2.0](https://github.com/cjaniake/squad_v2.0_pt) - Portuguese translation of SQuAD 2.0 dataset.
- [SST-2 pt](https://huggingface.co/datasets/maritaca-ai/sst2_pt) - Automatic translation of the Stanford Sentiment Treebank.
- [TeMário](https://www.linguateca.pt/Repositorio/TeMario/) - news texts and the corresponding human summaries for summarization purposes.
- [Textual Complexity Corpus](https://github.com/gazzola/corpus_readability_nlp_portuguese) - Textual Complexity Corpus for School Stages in the Brazilian Educational System.
- [ToLD-Br](https://huggingface.co/datasets/told-br) - Toxic Language Detection in Social Media for Brazilian Portuguese ([github](https://github.com/JAugusto97/ToLD-Br)).
- [TTS-Portuguese Corpus](https://github.com/Edresson/TTS-Portuguese-Corpus) - Text To Speech Portuguese.
- [TweetSentBR](https://bitbucket.org/HBrum/tweetsentbr/src/master/) - Tweets in Brazilian Portuguese.
- [Tweets for Sentiment Analysis](https://www.kaggle.com/datasets/augustop/portuguese-tweets-for-sentiment-analysis).
- [UD_Portuguese-Bosque](https://github.com/UniversalDependencies/UD_Portuguese-Bosque) - Universal Dependencies (UD) Portuguese treebank.
- [UD_Portuguese-CINTIL](https://github.com/UniversalDependencies/UD_Portuguese-CINTIL) - Universal Dependencies (UD) Portuguese treebank.
- [UD_Portuguese-GSD](https://github.com/UniversalDependencies/UD_Portuguese-GSD) - Universal Dependencies (UD) Portuguese treebank.
- [UD_Portuguese-PetroGold](https://github.com/UniversalDependencies/UD_Portuguese-PetroGold) - Universal Dependencies (UD) Portuguese treebank.
- [UD_Portuguese-PUD](https://github.com/UniversalDependencies/UD_Portuguese-PUD) - Universal Dependencies (UD) Portuguese treebank.
- [UlyssesNER-Br](https://github.com/ulysses-camara/ulysses-ner-br/) - Corpus of Brazilian Legislative Documents for Named Entity Recognition
- [UTLCorpus](https://github.com/RogerFig/UTLCorpus) - a corpus of online reviews in Brazilian Portuguese annotated with helpfulness classification.
- [Winograd Schema Challenge](https://github.com/gabimelo/portuguese_wsc) - Solver for the Portuguese-based Winograd Schema Challenge.
- [WizardVicuna-PTBR-Instruct-Clean](https://huggingface.co/datasets/cnmoro/WizardVicuna-PTBR-Instruct-Clean) - Wizard Vicuna PT-Br Instruct Clean dataset.

### Multilingual datasets

- [A Multilingual Dataset for Investigating Stereotypes and Negative Attitudes Towards Migrant Groups in Large Language Models](https://github.com/dsorato/stereotypes_negative_attitudes_towards_migrants_dataset)
- [askD](https://huggingface.co/datasets/ju-resplande/askD) - ELI5 dataset adapted on Medical Questions (AskDocs) subreddit.
- [English-Portuguese Sentences](http://www.manythings.org/bilingual/por/) - English-Portuguese Sentences from the Tatoeba Project.
- [EUR-Lex](https://www.sketchengine.eu/eurlex-corpus/) - multilingual corpus in all the official languages of the European Union.
- [Europarl](https://www.statmt.org/europarl/) - European Parliament Proceedings Parallel Corpus 1996-2011.
- [Europarl-ST](https://www.mllp.upv.es/europarl-st/) - Multilingual Speech Translation Corpus containing paired audio-text samples for Speech Translation, constructed from the debates held in the European Parliament between 2008 and 2012.
- [mc4](https://huggingface.co/datasets/mc4/viewer/pt/train) - a colossal, cleaned, multilingual version of Common Crawl's web crawl corpus.
- [mfaq](https://huggingface.co/datasets/clips/mfaq) - multilingual corpus of Frequently Asked Questions parsed from the Common Crawl.
- [MKQA](https://huggingface.co/datasets/mkqa) - Multilingual Knowledge Questions & Answers ([github](https://github.com/apple/ml-mkqa)).
- [MQA](https://huggingface.co/datasets/clips/mqa) - multilingual corpus of Questions and Answers (MQA) parsed from the Common Crawl.
- [MMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) - Multilingual version of the MS MARCO passage ranking dataset.
- [mRobust](https://huggingface.co/datasets/unicamp-dl/mrobust) - Multilingual version of the TREC 2004 Robust passage ranking dataset
- [MultiCoNER](https://huggingface.co/datasets/MultiCoNER/multiconer_v2) - a large multilingual dataset for Named Entity Recognition.
- [MuST-C](https://ict.fbk.eu/must-c/) - multilingual speech translation corpus.
- [OpenSubtitles](https://opus.nlpl.eu/OpenSubtitles.php) - collection of translated movie subtitles.
- [OSCAR](https://oscar-corpus.com/) - Open Super-large Crawled Aggregated coRpus.
- [Tatoeba](https://tatoeba.org/en/downloads) - a large database of sentences and translations.
- [TED2020](https://opus.nlpl.eu/TED2020.php) - contains a crawl of nearly 4000 TED and TED-X transcripts from July 2020.
- [TSAR-2022-Shared-Task](https://github.com/LaSTUS-TALN-UPF/TSAR-2022-Shared-Task) - TSAR2022 Shared Task on Lexical Simplification.
- [WikiANN](https://huggingface.co/datasets/wikiann) - multilingual named entity recognition dataset consisting of Wikipedia articles annotated with LOC (location), PER (person), and ORG (organisation) tags in the IOB2 format.
- [WikiLingua](https://github.com/esdurmus/Wikilingua) - Multilingual abstractive summarization dataset extracted from WikiHow.
- [WikiMatrix](https://github.com/facebookresearch/LASER/tree/main/tasks/WikiMatrix) - Parallel Sentences in 1620 Language Pairs from Wikipedia.
- [Wikiner](https://figshare.com/articles/dataset/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500) - Learning multilingual named entity recognition from Wikipedia.
- [WikiNEuRal](https://github.com/Babelscape/wikineural) - Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER (EMNLP 2021).
- [Wikipedia](https://huggingface.co/datasets/wikipedia) - Wikipedia dataset containing cleaned articles of all languages.
- [XFORMAL](https://github.com/Elbria/xformal-FoST) - A Benchmark for Multilingual Formality Style Transfer.
- [XLSUM](https://huggingface.co/datasets/csebuetnlp/xlsum) - 1.35 million professionally annotated article-summary pairs from BBC.

## Lexicon

- [BATS-PT](https://github.com/NLP-CISUC/PT-LexicalSemantics/tree/master/BATS-PT) - manual translation of the lexicographic portion of the Bigger Analogy Test Set (BATS) to Portuguese
- [br.ispell](https://www.ime.usp.br/~ueda/br.ispell/summary.html) - Ispell dictionary for Brazilian Portuguese ([github](https://github.com/fititnt/br.ispell-dicionario-portugues-brasileiro)).
- [Conceptnet](https://conceptnet.io/) - an open, multilingual knowledge graph.
- [DicSin](https://github.com/fititnt/DicSin-dicionario-sinonimos-portugues-brasileiro) - Dictionary of synonyms and antonyms.
- [lexiconPT](https://github.com/sillasgonzaga/lexiconPT) - R package that provides lexicons for Portuguese Text Analysis.
- [lexicons](https://github.com/davidsbatista/lexicons) - Dictionaries of names, surnames, acronyms and their extensions, stop-words, etc.
- [LIWC](http://nilc.icmc.usp.br/portlex/index.php/en/liwc) - Linguistic Inquiry and Word Count ([dictionary](https://sites.icmc.usp.br/sandra/LIWC/LIWC2007_Portugues_win.dic))
- [Onto.PT](http://ontopt.dei.uc.pt/) - Lexical Ontology for Portuguese.
- [OpenWordnet-PT](https://github.com/own-pt/openWordnet-PT) - an open access wordnet for Portuguese ([site](http://wn.mybluemix.net/)).
- [OpLexicon](https://www.inf.pucrs.br/linatural/wordpress/recursos-e-ferramentas/oplexicon/) - a sentiment lexicon for the Portuguese language.
- [palavras](https://github.com/pythonprobr/palavras) - Word list of Brazilian Portuguese.
- [PAPEL](https://www.linguateca.pt/PAPEL/).
- [pt-br](https://github.com/fserb/pt-br) - Wordlist, verbs, conjugations, term frequencies.
- [PT-LKB](https://github.com/NLP-CISUC/PT-LexicalSemantics/) - Large Portuguese Lexical-Semantic Knowledge Base
- [PULO](http://wordnet.pt/) - Portuguese Unified Lexical Ontology.
- [SentiLex-PT](http://b2find.eudat.eu/dataset/b6bd16c2-a8ab-598f-be41-1e7aeecd60d3) - a sentiment lexicon for Portuguese.
- [Stopwords](https://github.com/stopwords-iso/stopwords-pt) - Portuguese stopwords collection.
- [Tep2](http://www.nilc.icmc.usp.br/tep2/).
- [Unitex-PB](http://www.nilc.icmc.usp.br/nilc/projects/unitex-pb/web/dicionarios.html) - lexical resources.
- [VaLexPB](https://github.com/jessemourao/VaLexPB) - a lexicon of Brazilian Portuguese verb valences.
- [VerbNet.Br 1.0](http://143.107.183.175:21380/portlex/index.php/en/projects/verbnetbringl) - verbal lexicon of Brazilian Portuguese.
- [wikidict-dsl-pt](https://github.com/open-dsl-dict/wikidict-dsl-pt) - Wikidata Bilingual DSL Dictionaries.
- [Wordnetaffectbr](https://www.inf.pucrs.br/linatural/wordpress/recursos-e-ferramentas/wordnetaffectbr/) - vocabulary of emotions words.
- [Wordnet.Br](http://www.nilc.icmc.usp.br/wordnetbr/) - Portuguese WordNet.

## Models

- [Albertina PT-BR](https://huggingface.co/PORTULAN/albertina-ptbr) - It is an encoder of the BERT family for the Portuguese language - the American variant from Brazil.
- [Albertina PT-PT](https://huggingface.co/PORTULAN/albertina-ptpt) - It is an encoder of the BERT family for the Portuguese language - the European variant from Portugal.
- [Alpaca-LoRA-PTBR](https://huggingface.co/dominguesm/alpaca-lora-ptbr-7b) - Low-Rank LLaMA Instruct-Tuning.
- [BART](https://huggingface.co/adalbertojunior/bart-base-portuguese) - BART pre-trained on Portuguese.
- [BERTimbau](https://huggingface.co/neuralmind/bert-base-portuguese-cased) - BERTimbau Base is a pretrained BERT model for Brazilian Portuguese that achieves state-of-the-art performances on three downstream NLP tasks: Named Entity Recognition, Sentence Textual Similarity and Recognizing Textual Entailment ([Github](https://github.com/neuralmind-ai/portuguese-bert)).
- [BioBERTpt](https://huggingface.co/pucpr/biobertpt-all) - fine-tuned BERT models trained on the clinical domain for Portuguese language ([Github](https://github.com/HAILab-PUCPR/BioBERTpt)).
- [Cabrita](https://huggingface.co/22h/cabrita-lora-v0-1) - An instruction-finetuned LLaMA for Portuguese ([Github](https://github.com/22-hours/cabrita)).
- [DeBERTinha](https://huggingface.co/sagui-nlp/debertinha-ptbr-xsmall) - A DeBERTa V3 XSmall adapted to the Brazilian Portuguese language ([Github](https://github.com/sagui-nlp/DeBERTinha)).
- [Electra](https://huggingface.co/dlb/electra-base-portuguese-uncased-brwac) - Electra model trained on BrWaC.
- [Gervasio-PT-BR](https://huggingface.co/PORTULAN/gervasio-ptbr-base) - It is a decoder of the GPT family for the Portuguese language - the American variant from Brazil.
- [Gervasio-PT-PT](https://huggingface.co/PORTULAN/gervasio-ptpt-base) - It is a decoder of the GPT family for the Portuguese language - the European variant from Portugal.
- [GlórIA 1.3B](https://github.com/rvlopes/GlorIA) - A Large Language Model focused on European Portuguese ([HuggingFace](https://huggingface.co/NOVA-vision-language/GlorIA-1.3B))
- [GPT2 small](https://huggingface.co/pierreguillou/gpt2-small-portuguese) - GPorTuguese-2 (Portuguese GPT-2 small) is a state-of-the-art language model for Portuguese based on the GPT-2 small model.
- [GPT-Neo small](https://huggingface.co/HeyLucasLeao/gpt-neo-small-portuguese) - a version of GPT-Neo 125M by EleutherAI fine-tuned for Portuguese.
- [GPT2-Bio-PT](https://huggingface.co/pucpr/gpt2-bio-pt) - a version of GPorTuguese-2 fine-tuned on biomedical text ([Github](https://github.com/HAILab-PUCPR/gpt2-bio-pt)).
- [NERDE-base](https://huggingface.co/Gpaiva/NERDE-base) - BERTimbau fine-tuned for NER on Judicial Documents.
- [roberta-pt-br](https://huggingface.co/josu/roberta-pt-br)
- [RoBERTaCrawlPT-base](https://huggingface.co/eduagarcia/RoBERTaCrawlPT-base) - RoBERTaCrawlPT-base is a generic Portuguese Masked Language Model pretrained from scratch from the CrawlPT corpora
- [RoBERTaLexPT-base](https://huggingface.co/eduagarcia/RoBERTaLexPT-base) - Portuguese Masked Language Model pretrained from scratch from the LegalPT and CrawlPT corpora
- [Sabiá](https://huggingface.co/maritaca-ai/sabia-7b) - Sabiá-7B is a Portuguese language model developed by Maritaca AI.
- [Sabiá 2](https://www.maritaca.ai/en/sabia-2) - Language model trained on Portuguese text, especially in the Brazilian domain.
- [T5](https://github.com/unicamp-dl/PTT5) - T5 model on Brazilian Portuguese data.
- [tgf-xlm-roberta-base-pt-br](https://huggingface.co/thegoodfellas/tgf-xlm-roberta-base-pt-br) ([Github](https://github.com/the-good-fellas/xlm-roberta-pt-br))
- [Wav2vec](https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-portuguese) - Fine-tuned facebook/wav2vec2-large-xlsr-53 on Portuguese using the train and validation splits of Common Voice 6.1.

### Multilingual Models

- [Bloom](https://huggingface.co/bigscience/bloom) - BigScience Large Open-science Open-access Multilingual Language Model.
- [mBert](https://huggingface.co/bert-base-multilingual-cased) - Model pretrained on the 104 languages with the largest Wikipedias using a masked language modeling (MLM) objective.
- [mDeBERTa](https://huggingface.co/microsoft/mdeberta-v3-base)
- [mGPT](https://huggingface.co/sberbank-ai/mGPT) - Multilingual GPT model. An autoregressive GPT-like model.
- [mMiniLM](https://huggingface.co/unicamp-dl/mMiniLM-L6-v2-pt-v2) - mMiniLM-L6-v2 Reranker finetuned on mMARCO
- [mT5](https://huggingface.co/google/mt5-base) - Multilingual T5. A massively multilingual pre-trained text-to-text transformer.
- [XLM-RoBERTa](https://huggingface.co/xlm-roberta-base) - XLM-RoBERTa model pre-trained on 2.5TB of filtered CommonCrawl data containing 100 languages.
- [LaBSE](https://huggingface.co/sentence-transformers/LaBSE) - Language-agnostic BERT Sentence Encoder (LaBSE) is a BERT-based model trained for sentence embedding for 109 languages.

## Word Embeddings

- [fastText](https://fasttext.cc/docs/en/crawl-vectors.html) - Multi-lingual word vectors.
- [LASER](https://github.com/facebookresearch/LASER) - Language-Agnostic SEntence Representations.
- [NILC-Embeddings](http://www.nilc.icmc.usp.br/embeddings) - Word embeddings trained in Portuguese by USP.
- [MUSE](https://github.com/facebookresearch/MUSE) - Multilingual Unsupervised and Supervised Embeddings.
- [word vectors](https://github.com/Kyubyong/wordvectors) - Pre-trained word vectors of 30+ languages.

## Metrics

- [Coh-Metrix-Port](https://github.com/nilc-nlp/coh-metrix-port) - an adaptation of the Coh-Metrix text analysis tool to the Brazilian Portuguese language.
- [NILC-Metrix](https://github.com/sidleal/nilcmetrix) - it gathers the metrics developed over more than a decade in NILC Lab.

## Leaderboards

- [Open PT LLM Leaderboard](https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard) - Open PT LLM Leaderboard aims to provide a benchmark for the evaluation of Large Language Models (LLMs) in the Portuguese language across a variety of tasks and datasets.

## Frameworks

- [nlpnet](http://nilc.icmc.usp.br/nlpnet/)
- [NLTK](https://www.nltk.org/howto/portuguese_en.html)
- [polyglot](https://github.com/aboSamoor/polyglot)
- [spaCy](https://spacy.io/models/pt)
- [Stanza NLP](https://stanfordnlp.github.io/stanza/available_models.html)
- [udpipe](https://github.com/bnosac/udpipe)

## Institutions

- [Brasileiras em PLN](https://brasileiraspln.com/).
- [HAILab-PUCPR](https://github.com/HAILab-PUCPR) - A pioneering research group aiming to develop solutions for health care using Natural Language Processing and Machine Learning.
- [Linguateca](https://www.linguateca.pt/).
- [NILC](http://www.nilc.icmc.usp.br/nilc/index.php).
- [NLPortuguês](https://nlportugues.ime.usp.br/) - Devoted to creating NLP courses in Brazilian Portuguese.
- [NLX-Group](http://nlx.di.fc.ul.pt/).
- [PLN PUCRS](https://www.inf.pucrs.br/linatural/wordpress/).

## Tools

- [Apertium-por](https://github.com/apertium/apertium-por) - Apertium linguistic data for Portuguese.
- [Autocorrect](https://github.com/filyp/autocorrect) - Spelling corrector in Python.
- [BrGram](https://github.com/LR-POR/BrGram) - Computational grammar fragment of Brazilian Portuguese in the LFG formalism implemented in XLE.
- [Dicio API](https://github.com/ThiagoNelsi/dicio-api) - Portuguese dictionary API.
- [dict-pt-br](https://github.com/VisualText/dict-pt-br) - dictionary for Brazilian Portuguese.
- [Languagetool](https://github.com/languagetool-org/languagetool) - Style and Grammar Checker for 25+ Languages.
- [LegalNLP](https://github.com/felipemaiapolo/legalnlp) - Natural Language Processing Methods for the Brazilian Legal Language.
- [LexML Parser](https://github.com/lexml/lexml-parser-projeto-lei) - parser for legal documents.
- [LX parser](http://lxcenter.di.fc.ul.pt/tools/en/LXParserEN.html) - statistical constituency parser for Portuguese.
- [metaphone-ptbr](https://github.com/carlosjordao/metaphone-ptbr) - Metaphone algorithm for the Portuguese language.
- [mlconjug3](https://github.com/SekouDiaoNlp/mlconjug3) - a Python library to conjugate verbs in Portuguese and other languages.
- [MorphoBr](https://github.com/LR-POR/MorphoBr) - Resources for morphological analysis of Portuguese.
- [OpCluster](https://github.com/franciellevargas/Opcluster) - Automatic extraction and clustering of fine-grained opinions.
- [Phonemizer](https://github.com/bootphon/phonemizer) - Simple text to phones converter for multiple languages.
- [PorGram](https://github.com/LR-POR/PorGram) - Open source computational grammar for Portuguese in the HPSG formalism.
- [pymetaphone-br](https://github.com/Escavador/pymetaphone-br) - Metaphone algorithm package for the Portuguese language.
- [pysentimiento](https://github.com/pysentimiento/pysentimiento) - Multilingual toolkit for Sentiment Analysis and Social NLP tasks.
- [pyspellchecker](https://github.com/barrust/pyspellchecker) - Multilingual Spell Checking.
- [RBAMR](https://github.com/rafaelanchieta/rbamr) - A Rule-Based AMR Parser for Portuguese.
- [Verbecc](https://github.com/bretttolbert/verbecc) - Complete Conjugation of any Verb using Machine Learning for French, Spanish, Portuguese, Italian and Romanian.

## Other lists

- [Annotated Semantic Relationships Datasets](https://github.com/davidsbatista/Annotated-Semantic-Relationships-Datasets)
- [Linguistic datasets](https://github.com/EticaAI/linguistic-datasets-portuguese) - Linguistic Datasets for Portuguese.
- [NER-datasets for Portuguese](https://github.com/davidsbatista/NER-datasets/tree/master/Portuguese)
- [NILC](http://www.nilc.icmc.usp.br/nilc/index.php/tools-and-resources)
- [NILC 2](https://sites.google.com/view/nilc-usp/resources-and-tools)
- [NILC 3](https://sites.icmc.usp.br/taspardo/Projects.htm)
- [Opinando](https://sites.google.com/icmc.usp.br/opinando/p%C3%A1gina-inicial) - Opinion Mining for Portuguese.
- [Portuguese dataset List](https://forum.ailab.unb.br/t/datasets-em-portugues/251)

## Other links

- [OPUS](https://opus.nlpl.eu/) - OPUS is a growing collection of translated texts from the web.
- [Statistical and Neural Machine Translation](https://statmt.org/).
