An open API service indexing awesome lists of open source software.

https://github.com/andythefactory/romanian-nlp-datasets

A list of Romanian NLP Datasets
https://github.com/andythefactory/romanian-nlp-datasets

nlp nlp-data nlp-dataset nlp-datasets nlp-resources romanian romanian-language

Last synced: 5 months ago
JSON representation

A list of Romanian NLP Datasets

Awesome Lists containing this project

README

          

[![Awesome](https://awesome.re/badge-flat2.svg)](https://awesome.re)

# A list of Romanian NLP Datasets
A curated list of open source and open access Romanian Language NLP Datasets.
For the moment we don't add parallel copora to the list.

For additions or any other changes please submit a pull request.

Table of contents
=================

* [Unlabeled text Corpora](#unlabeled-text-corpora)
* [Semantic Textual Similarity / Paraphrasing](#semantic-textual-similarity--paraphrasing)
* [Natural Language Inference](#natural-language-inference)
* [Summarization](#summarization)
* [Dialect and regional speech identification](#dialect-and-regional-speech-identification)
* [Named Entity Recognition (NER)](#named-entity-recognition-ner)
* [Autorship Attribution](#autorship-attribution)
* [Sentiment Analysis](#sentiment-analysis)
* [Dependency Parsing](#dependency-parsing)
* [Diacritics Restoration / Grammar Correction](#diacritics-restoration--grammar-correction)
* [Fake News / Clickbait / Satirical News](#fake-news--clickbait--satirical-news)
* [Offensive Language](#offensive-language)
* [Questions and Answering](#questions-and-answers)
* [Spelling, Dictionaries and Gramatical Errors](#spelling-and-gramatical-errors)
* [Automatic Speech Recognition (ASR)](#automatic-speech-recognition)

## Unlabeled text Corpora

* [โ„๏ธFuLG dataset โ„๏ธ](https://huggingface.co/datasets/faur-ai/fulg)
> The FuLG dataset is a comprehensive Romanian language corpus comprising
> 150 billion tokens, carefully extracted from Common Crawl.
[![arXiv](https://img.shields.io/badge/arXiv-2004.06165-f9f107.svg)](https://arxiv.org/abs/2407.13657)

* [๐ŸŒ Oscar Common Crawl dataset ๐ŸŒ](https://huggingface.co/datasets/oscar-corpus/OSCAR-2201)
> Part of a large multilanguage corpus originated from Common Crawl.
> It's a raw, unannotated corpus. It has roughly 50 GB of Romanian text
> in 4.5 million documnets. For details check its homepage
> and the paper

[![arXiv](https://img.shields.io/badge/arXiv-2004.06165-f9f107.svg)](https://arxiv.org/abs/2004.06165)
[![Homepage](https://img.shields.io/badge/oscar%20homepage-6ca1f0)](https://oscar-project.org/)

* [๐Ÿ“š CC-100 ๐Ÿ“š](https://data.statmt.org/cc-100/)
> Similar to Oscar, part of a multilanguage corpus also based on Common Crawl
> from 2018. Romanian text is 16GB large

[![arXiv](https://img.shields.io/badge/arXiv-1911.00359-f9f107.svg)](https://arxiv.org/abs/1911.00359)
[![Homepage](https://img.shields.io/badge/cc-100%20homepage-6ca1f0)](https://data.statmt.org/cc-100/)

* [๐ŸŒ Wikipedia Corpus ๐ŸŒ](https://dumps.wikimedia.org/rowiki/)
> Romanian language wikipedia dump.

* [๐Ÿ“ฐโš–๏ธ RoTex Collection ๐Ÿ“ฐโš–๏ธ](https://github.com/aleris/ReadME-RoTex-Corpus-Builder)
> A collection of varoius unannotated corpora collected around 2018-2019.
> Includes books, scraped newspapers and juridical documents
>
* [๐Ÿ“– Romanian Language Repository ๐Ÿ“–](https://github.com/lmidriganciochina/romaniancorpus)
> A collection of written and spoken text from various
> sources: Articles, Fairy tales, Fiction, History, Theatre, News

* [๐Ÿ›๏ธ MARCELL Legislative Corpus ๐Ÿ›๏ธ](https://elrc-share.eu/repository/browse/marcell-romanian-legislative-subcorpus-v2/2da548428b9d11eb9c1a00155d026706ce94a6b59ffc4b0e9fb5cd9cebe6889e/)
> Romanian national legilation from 1881 to 2021. The corpus
> includes mainly: governmental decisions, ministerial orders,
> decisions, decrees and laws.
> Automatically annotated for Named Entities

[![ACL](https://img.shields.io/badge/ACL%20Anthology-ed1c24.svg)](http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.464.pdf)
[![Homepage](https://img.shields.io/badge/marcell%20homepage-6ca1f0)](https://marcell-project.eu/)

* [๐Ÿฆ  COVID-19 Tweets ๐Ÿฆ](https://github.com/UBC-NLP/megacov)
> Mega-COV is a billion-scale dataset from Twitter for studying COVID-19. It is
> available in over 100+ languages, Romanian being one of them. Tweets need
> to be rehydrated

[![arXiv](https://img.shields.io/badge/arXiv-2005.06012-f9f107.svg)](https://arxiv.org/abs/2005.06012)
[![Medium](https://img.shields.io/badge/Medium-12100E?style=for-the-badge&logo=medium&logoColor=white)](https://mumageed.medium.com/billion-scale-investigation-of-covid-19-impact-on-human-communication-in-104-languages-874b5a37beac)

* [COVIDSentiRO](https://github.com/Alegzandra/KES-2023/tree/main/datasets/COVIDSentiRO)
> A corpus of Romanian tweets related to COVID and vaccination against COVID, created and
> collected between January 2021 and February 2022. It contains 19319 tweets.

* [๐Ÿ“œ Minutes of the Sittings of the Chamber of Deputies of Romania ๐Ÿ“œ](https://elrc-share.eu/repository/browse/monolingual-corpus-from-minutes-of-the-sittings-of-the-chamber-of-deputies-of-romania-2016-2018-processed/759806e22e1311e9a4d400155d02670657928f90efa64c9dab1b177c2186bf6c/)
> Minutes of the Sittings of the Chamber of Deputies of Romania (2016-2018)
> Unannotated corpus

* [๐Ÿ”Š Minutes of the Sittings of the Romanian Parliament ๐Ÿ”Š](https://elrc-share.eu/repository/browse/romanian-parliament-transcripts-1996-2018-processed/779b85aee4de11e9913100155d026706ac0a5e38c6824010a537363b37b6bd0f/)
> contains 500k+ instances of speech from the parliament podium from
> 1996 to 2018. Sentence splitting and deduplication onm sentence level
> have been applied as processing steps
> Unannotated corpus

* [๐Ÿ—ฃ๏ธ Romanian Presidential Discourses ๐Ÿ—ฃ๏ธ](https://github.com/grrrrah/RomanianPresidentialDiscourses)
> Romanian presidential discouses (1990-2020) split in 4 files
> one for each president. Unannotated corpus


* [๐ŸŽญ Culture Domain Corpus ๐ŸŽญ](https://elrc-share.eu/repository/browse/monolingual-romanian-corpus-in-the-culture-domain-processed/a1d6c98e1d5911e9b7d400155d026706fd71a90bf1df4bacb3f174edccb6e9b9/)
> Monolingual Romanian corpus, including content from public websites related to culture

* [โš–๏ธ Law Domain Corpus โš–๏ธ](https://elrc-share.eu/repository/browse/monolingual-romanian-corpus-in-the-law-domain/ee9f6b0289f611e6bfe700155d0205029e7b188412ab4e56bf6fd1d7d9e8b033/)
> Monolingual (ron) corpus, containing 38063991 tokens and 854096 lexical types in the law domain.

* [๐Ÿข Public Administration Domain Corpus ๐Ÿข](https://elrc-share.eu/repository/browse/monolingual-romanian-corpus-in-the-public-administration-domain-processed/32b0c234327311e8b7d400155d026706afa147904c554dd1bad4764bd4a7aaed/)
> Monolingual Romanian corpus, containing 360833 sentences (9064764 words) in the public administration domain.

* [๐Ÿ“‹ New Civil Procedure Code ๐Ÿ“‹](https://elrc-share.eu/repository/browse/romanian-new-civil-procedure-code-processed/e4d8e13046ff11e8b7d400155d026706010657f248274e6286aeb6488d8a2ee6/)
> The New Civil Procedure Code in Romanian (monolingual) comprising 297888 words.

* [โš–๏ธ New Criminal Code โš–๏ธ](https://elrc-share.eu/repository/browse/noul-cod-penal/100b7e38d25111ea913100155d0267066e48d760d0d24b39a4900b2b09864a02/)
> The Romanian updated criminal code: text with law content.

* [๐Ÿ“ฐ Romanian News Articles Dataset ๐Ÿ“ฐ](https://github.com/mhakan20/RomanianNewsArticlesDataset)
> news articles dataset from romanian newssites
> title, summary and article

* [๐Ÿ“ฐ Old Newspapers ๐Ÿ“ฐ](https://www.kaggle.com/datasets/alvations/old-newspapers)
> multi-language corpus from online available news sources.
> It contains also 43mil words in Romanian language from Twitter, Blogs and Newspapers

[![Homepage](https://img.shields.io/badge/hc%20corpora%20homepage-6ca1f0)](http://corpora.epizy.com/corpora.html?i=1)

* [๐Ÿ“š ELTeC-Rom ๐Ÿ“š](https://github.com/COST-ELTeC/ELTeC-rom)
> The Romanian novel collection for ELTeC, the European Literary Text Collection
> Sources: Biblioteca Metropolitana din Bucuresti, Biblioteca Universitara "Mihai Eminescu" din Iasi,
> Biblioteca Judeteana din Botosani, personal micro-collections uploaded on Zenodo
> under the following labels: "Hajduks Library"; "RomanianNovel Library"; "CityMysteries Library"; "BibliotecaDHL_Iasi"

* [๐Ÿ“ง RO Business Emails ๐Ÿ“ง](https://huggingface.co/datasets/readerbench/ro-business-emails)
> Public dataset of 1447 manually annotated Romanian business-oriented emails.
> The corpus is annotated with 5 token-related labels, as well as 5 sequence-related classes

[![MDPI](https://img.shields.io/badge/MDPI-Information-00813E.svg)](https://www.mdpi.com/2078-2489/14/6/321)

* [๐Ÿ“–RO-Stories๐Ÿ“–](https://huggingface.co/datasets/readerbench/ro-stories)
> The corpus consists of texts written by Romanian authors between 19th century and present, representing stories, short-stories, fairy tales and sketches.
> The current version contains 19 authors, 1263 full texts and 12516 paragraphs of around 200 words each, preserving paragraphs integrity.

* [๐Ÿ“•ROST๐Ÿ“•](https://www.kaggle.com/datasets/sandamariaavram/rost-romanian-stories-and-other-texts)
> A dataset containing 400 Romanian texts written by 10 authors
> The dataset contains stories, short stories, fairy tales, novels, articles, and sketches written by Ion Creangฤƒ,
> Barbu ลžtefฤƒnescu Delavrancea, Mihai Eminescu, Nicolae Filimon, Emil Gรขrleanu, Petre Ispirescu, Mihai Oltean, Emilia Plugaru, Liviu Rebreanu, Ioan Slavici.

[![MDPI](https://img.shields.io/badge/MDPI-Information-00813E.svg)](https://www.mdpi.com/2227-7390/10/23/4589)

* [๐ŸณRomanian Cooking Recipes๐Ÿณ](https://huggingface.co/datasets/BlackKakapo/recipes-ro)
> 891 Cooking Recipes in Romanian Language

## Semantic Textual Similarity / Paraphrasing

* [๐Ÿ”— RO-STS ๐Ÿ”—](https://huggingface.co/datasets/ro_sts)
> Semantic Textual Similarity dataset for the Romanian language
> RO-STS contains 8,628 sentence pairs with their similarity scores

[![NeurIPS](https://img.shields.io/badge/NeurIPS-ed1c24.svg)](https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/5f93f983524def3dca464469d2cf9f3e-Abstract-round1.html)

* [๐Ÿ“– Romanian Bible Paraphrase Corpus ๐Ÿ“–](https://huggingface.co/datasets/andyP/ro-paraphrase-bible)
> A paraphrase corpus created from 10 different Romanian language Bible versions.
> The final dataset contains 904,815 similar records and 218,977 non matching records, totaling 1,123,927

* [๐Ÿ”„ Romanian paraphrase dataset ๐Ÿ”„](https://huggingface.co/datasets/BlackKakapo/paraphrase-ro)
> Around ~100k examples of paraphrases. No clear explanation on how the dataset was built

* [๐ŸŒ TaPaCo ๐ŸŒ](https://huggingface.co/datasets/tapaco/viewer/ro/train)
> A multi-language paraphrase corpus for 73 languages extracted from the Tatoeba database.
> It has ~ 2000 romanian phrases totaling 941 paraphrase groups.

[![ACL](https://img.shields.io/badge/ACL%20Anthology-ed1c24.svg)](https://aclanthology.org/2020.lrec-1.848/)
[![Homepage](https://img.shields.io/badge/cc-100%20homepage-6ca1f0)](https://zenodo.org/record/3707949)

## Natural Language Inference

* [๐Ÿง  RONLI ๐Ÿง ](https://github.com/Eduard6421/RONLI)
> We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs, which are obtained via distant supervision, and 6K validation and test sentence pairs, which are manually annotated with the correct labels.
[![ACL](https://img.shields.io/badge/ACL%20Anthology-ed1c24.svg)](https://aclanthology.org/2024.acl-long.15/)

* [~RO-NLI~](https://github.com/dumitrescustefan/RO-NLI)
> The repository seems to be just an attempt at starting to build the dataset

## Summarization
* [๐Ÿ“ RO Text Summarization ๐Ÿ“](https://huggingface.co/datasets/readerbench/ro-text-summarization)
> Around ~72k Full texts and their summary. Source seems to be news websites.
> No description or explanation available

## Dialect and regional speech identification
* [๐Ÿ—ฃ๏ธ RoDia ๐Ÿ—ฃ๏ธ](https://github.com/codrut2/RoDia)
> varied compilation of speech samples from five distinct regions of Romania, covering both urban and rural environments.
> Around 2800 records labeled with age, gender and type of dialect

[![arXiv](https://img.shields.io/badge/arXiv-2309.03378-f9f107.svg)](https://arxiv.org/abs/2309.03378)

* [๐ŸŒ MOROCO ๐ŸŒ](https://github.com/butnaruandrei/MOROCO)
> MOROCO: The Moldavian and Romanian Dialectal Corpus
> The MOROCO data set contains Moldavian and Romanian samples of text collected from the news domain.
> The samples belong to one of the following six topics: culture, finance, politics, science, sports, tech
> totaling over 32.000 labeled records

[![arXiv](https://img.shields.io/badge/arXiv-1901.06543-f9f107.svg)](https://arxiv.org/abs/1901.06543)


## Named Entity Recognition (NER)

* [โš–๏ธ LegalNERo โš–๏ธ](https://huggingface.co/datasets/joelito/legalnero)
* [๐Ÿท๏ธ RONEC ๐Ÿท๏ธ](https://huggingface.co/datasets/ronec)
* [๐ŸŒ WikiAnn ๐ŸŒ](https://huggingface.co/datasets/wikiann)
* [๐Ÿ›๏ธ SiMoNERo ๐Ÿ›๏ธ](https://github.com/UniversalDependencies/UD_Romanian-SiMoNERo)
* [๐Ÿ“œ HistNERo ๐Ÿ“œ]()
> The dataset contains 323k tokens of text, covering more than half of the 19th century
> (i.e., 1817) until the late part of the 20th century (i.e., 1990).
> The samples belong to one of the following four historical regions of Romania,
> namely Bessarabia, Moldavia, Transylvania, and Wallachia.

[![arXiv](https://img.shields.io/badge/arXiv-2405.00155-f9f107.svg)](https://arxiv.org/abs/2405.00155v1)

## Autorship Attribution

* [๐Ÿ“š ROST ๐Ÿ“š](https://www.kaggle.com/datasets/sandamariaavram/rost-romanian-stories-and-other-texts)

## Sentiment Analysis

* [๐Ÿ˜Š RO_Sent ๐Ÿ˜Š](https://huggingface.co/datasets/ro_sent)
* [๐Ÿ“– Senti_Lex ๐Ÿ“–](https://huggingface.co/datasets/senti_lex)
* [๐ŸŽญ LaROSeDa ๐ŸŽญ](https://huggingface.co/datasets/laroseda)
* [๐ŸŽฌ SART ๐ŸŽฌ](https://github.com/Alegzandra/KES-2023/tree/main/datasets/SART)
* [โค๏ธ RED โค๏ธ](https://github.com/Alegzandra/RED-Romanian-Emotions-Dataset)
* [๐ŸŒ Romanian Categorized Web Dataset ๐ŸŒ](https://github.com/bogsio/RomanianCategorizedWebDataset)
* [๐ŸŽฌ Romanian Sentiment Movie Reviews ๐ŸŽฌ](https://www.kaggle.com/datasets/gringoandy/romanian-sentiment-movie-reviews)

## Dependency Parsing

* [๐ŸŒณ CoNLL 2017 & 2018 ๐ŸŒณ](https://www.conll.org/previous-tasks)
* [๐Ÿ”— Deep Universal Dependencies ๐Ÿ”—](https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3720)
* [๐Ÿ“š Curlicat Romanian Corpus ๐Ÿ“š](https://elrc-share.eu/repository/browse/curlicat-romanian-corpus/8b6c8dca58ea11ed9c1a00155d026706fb03ef8b4c1847cfbe9cea869a82731e/)
* [๐ŸŒฒ HamleDT ๐ŸŒฒ](https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1508)
* [๐Ÿ“– RoWordNet ๐Ÿ“–](https://github.com/dumitrescustefan/RoWordNet)
* [๐ŸŒณ RoRefTrees ๐ŸŒณ](https://github.com/UniversalDependencies/UD_Romanian-RRT)

## Diacritics Restoration / Grammar Correction

* [โœ๏ธ Corpus for training and evaluating diacritics restoration systems โœ๏ธ](https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2607)
* [โœ๏ธ RONACC โœ๏ธ](https://nextcloud.readerbench.com/index.php/s/9pwymesT5sycxoM)

## Fake News / Clickbait / Satirical News

* [โŒ Fakerom โŒ](https://www.tagtog.com/fakerom/fakerom)
* [๐ŸŽฃ Clickbait dataset on Romanian SciTech News ๐ŸŽฃ](https://github.com/ralucaginga/ClickbaitSciTechRO)
* [๐Ÿ˜‚ SaRoCo ๐Ÿ˜‚](https://github.com/MihaelaGaman/SaRoCo)

## Offensive Language

* [๐Ÿšซ RO-Offense ๐Ÿšซ](https://huggingface.co/datasets/readerbench/ro-offense)

* [๐Ÿ“ฐ News RO-Offense ๐Ÿ“ฐ](https://huggingface.co/datasets/readerbench/news-ro-offense)
> manually annotated 4,052 comments on a Romanian local news website
> into one of the following classes: non-offensive, targeted insults,
> racist, homophobic, and sexist.

[![arXiv](https://img.shields.io/badge/UTCluj-RoCHI-ac2820.svg)](http://rochi.utcluj.ro/articole/10/RoCHI2022-Cojocaru-A.pdf)

* [๐Ÿ“ฑ FB RO-Offense ๐Ÿ“ฑ](https://huggingface.co/datasets/readerbench/ro-fb-offense)
> 4455 organic generated comments from Facebook live broadcasts
> annotated not binary offensive language detection tasks and for fine-grained offensive language detection

[![IEEE](https://img.shields.io/badge/IEEE-xplore-14303e.svg)](https://ieeexplore.ieee.org/document/10130824)

* [๐Ÿ” RO-Offense-Sequences ๐Ÿ”](https://huggingface.co/datasets/readerbench/ro-offense-sequences)
> 4800 Romanian comments annotated with offensive text spans
> Offensive span detection

[![MDPI](https://img.shields.io/badge/MDPI-Information-00813e.svg)](https://www.mdpi.com/2078-2489/15/1/8)

* [๐Ÿ’ข Hate Speech RO ๐Ÿ’ข](https://github.com/andra-pumnea/hate-speech-ro)
> 3860 labeled hate speech records

* [๐Ÿฆ ROFF ๐Ÿฆ](https://github.com/guzimanis/ROFF)
> Dataset consists of 5000 tweets, from which 924 were labeled as offensive (18.48 %)
> and 4076 tweets as non-offensive.

[![ACL](https://img.shields.io/badge/ACL%20Anthology-ed1c24.svg)](https://aclanthology.org/2021.ranlp-1.102.pdf)

* [โš ๏ธ CoRoSeOf โš ๏ธ](https://github.com/DianaHoefels/CoRoSeOf)
> The corpus contains 39 245 tweets, annotated by multiple annotators, following the sexist label set of a recent study.

[![ACL](https://img.shields.io/badge/ACL%20Anthology-ed1c24.svg)](https://aclanthology.org/2022.lrec-1.243.pdf)

## Questions and Answers

* [๐Ÿงฎ GSM8K RO ๐Ÿงฎ](https://huggingface.co/datasets/BlackKakapo/gsm8k-ro)
> This dataset is just the translation of the [gsm8k](https://huggingface.co/datasets/gsm8k) dataset.
> GSM8K (Grade School Math 8K) is a dataset of 8.5K high quality linguistically diverse grade school math word problems.
> There is no information on the quality of the translation

* [๐Ÿ’ป ROCODE ๐Ÿ’ป](https://huggingface.co/datasets/cosmadrian/rocode)
> RoCode, a competitive programming dataset, consisting of 2,642 problems written in Romanian,
> 11k solutions in C, C++ and Python and comprehensive testing suites for each problem. The purpose of RoCode
> is to provide a benchmark for evaluating the code intelligence of language models trained on
> Romanian / multilingual text as well as a fine-tuning set for pretrained Romanian models.

[![arXiv](https://img.shields.io/badge/arXiv-2005.06012-f9f107.svg)](https://arxiv.org/abs/2402.13222)

* [๐Ÿ’ป RoITD ๐Ÿ’ป](https://huggingface.co/datasets/dragosnicolae555/RoITD)
> Romanian IT Dataset (RoITD) resembling SQuAD 1.1.
> RoITD consists of 9575 Romanian QA pairs formulated by crowd workers. QA pairs are based on 5043 articles from Romanian Wikipedia articles describing IT and household products.
> Of the total number of questions, 5103 are possible (i.e. the correct answer can be found within the paragraph) and 4472 are not possible (i.e. the given answer is a "plausible answer" and not correct)

* [๐Ÿฉบ RoMedQA ๐Ÿฉบ](https://github.com/ana-rogoz/MedQARo)
> The dataset comprises 102,646 high-quality QA pairs from real-world clinical records of 1,011 oncology patients (796 patients with breast cancer and 215 patients with lung cancer).
> The QA pairs are the results of a manual annotation process carried out by physicians specialized in oncology and radiotherapy
> RoMedQA includes 76,416 QA pairs about breast cancer patients and 26,230 about lung cancer patients, with questions grounded in medical case summaries (epicrises).

[![arXiv](https://img.shields.io/badge/arXiv-2508.16390-f9f107.svg)](https://arxiv.org/abs/2508.16390v1)

* [โš–๏ธ JurRO โš–๏ธ](https://github.com/craciuncg/GRAF)
> Romanian legal MCQA dataset, comprising 10,836 questions from three examinations.
> Each entry essentially consists of a body in which a theoretical question is posed regarding
> a legal aspect, along with three possible answer choices labeled A, B, and C,
> out of which at mosttwo answers are correct.

[![ACL](https://img.shields.io/badge/ACL%20Anthology-ed1c24.svg)](https://aclanthology.org/2025.findings-acl.659/)

## Spelling, Dictionaries and Gramatical Errors

* [โœ๏ธ Grammar-RO โœ๏ธ](https://huggingface.co/datasets/BlackKakapo/grammar-ro)
> Synthetic dataset with ~1.9M records. Altered and correct statement as columns

* [๐Ÿ“– RoAcReL ๐Ÿ“–](https://huggingface.co/datasets/fmi-unibuc/RoAcReL)
> Romanian Archaisms Regionalisms Lexicon containing ~ 1940 Word definitions

* [๐Ÿ—บ๏ธ RoRuDi ๐Ÿ—บ๏ธ](https://huggingface.co/datasets/fmi-unibuc/RoRuDi)
> Romanian Rules for Dialects - 1940 regionalisms, meanings and the region of provenience

* [๐Ÿ“š RoLEX ๐Ÿ“š](https://github.com/adrianastan/rolex)
> The dataset was developed mainly for speech processing applications, yet its applicability
> extends beyond this domain. RoLEX includes over 330,000 curated entries with information
> regarding lemma, morphosyntactic description, syllabification, lexical stress and phonemic transcription.

[![Cambridge](https://img.shields.io/badge/Cambridge-Core-ef4135.svg)](https://adrianastan.com/papers/2022_NLE.pdf)

## Automatic Speech Recognition

* [๐Ÿ—ฃ๏ธ USPDATRO ๐Ÿ—ฃ๏ธ](https://zenodo.org/records/7898233)
> Underrepresented Speech Dataset from Open Data. Duration of the dataset is 4h 18m 55s.
> Distribution according to platforms: 83% of the content comes from YouTube, 12% from SoundCloud, and 5% from Vimeo
> Dataset covers primarily under-represented speech groups (outside the 19โ€“29 male category)

[![MDPI](https://img.shields.io/badge/MDPI-Information-00813E.svg)](https://www.mdpi.com/2076-3417/14/19/9043)