https://github.com/radi-cho/noisy-sentences-dataset
550K sentences in 5 European languages augmented with noise for training and evaluating spell correction tools or machine learning models.
dataset natural-language-processing nlp
- Host: GitHub
- URL: https://github.com/radi-cho/noisy-sentences-dataset
- Owner: radi-cho
- License: mit
- Created: 2023-02-03T09:18:55.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2023-02-03T10:11:20.000Z (almost 2 years ago)
- Last Synced: 2024-05-01T17:52:41.289Z (8 months ago)
- Topics: dataset, natural-language-processing, nlp
- Homepage:
- Size: 7.81 KB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Citation: CITATION.cff
# Noisy Sentences Dataset
[![DOI](https://zenodo.org/badge/596943135.svg)](https://zenodo.org/badge/latestdoi/596943135)
550K sentences in 5 European languages augmented with noise for training and evaluating spell correction tools or machine learning models. We have constructed our dataset to cover representatives from the language families used across Europe.
- Germanic - English, German;
- Romance - French;
- Slavic - Bulgarian;
- Turkic - Turkish.

**Use case example:** Apply language models or other techniques to compare the sentence pairs and reconstruct the original sentences from the augmented ones. You can use a single multilingual solution to solve the challenge or employ multiple models/techniques for the separate languages. Per-word dictionary lookup is also an option.
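
The per-word dictionary lookup mentioned above could be sketched roughly as follows. This is a minimal illustration, not the authors' method: the function name `correct_sentence`, the `cutoff` value, and the tiny vocabulary are all assumptions made for the example, and the matching relies on Python's standard `difflib` module. It is case-sensitive and ignores punctuation.

```python
import difflib


def correct_sentence(noisy, vocabulary, cutoff=0.7):
    """Replace each out-of-vocabulary word with its closest vocabulary entry.

    `vocabulary` is a set of known-good words (hypothetical here; in practice
    it might be built from clean text in the target language).
    """
    candidates = sorted(vocabulary)
    corrected = []
    for word in noisy.split():
        if word in vocabulary:
            # Known word: keep it unchanged.
            corrected.append(word)
            continue
        # Ask difflib for the single most similar vocabulary word.
        matches = difflib.get_close_matches(word, candidates, n=1, cutoff=cutoff)
        # Fall back to the noisy word when nothing is similar enough.
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected)


# Toy vocabulary, made up for illustration.
vocabulary = {"the", "quick", "brown", "fox", "jumps"}
print(correct_sentence("the qiuck brwn fox jumps", vocabulary))
```

A separate vocabulary per language would be one way to apply the same idea across all five languages; a single shared vocabulary risks matching a word to a lookalike from another language.
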
## Files
* **train.csv** - the training set
* **test.csv** - the test set

Because of an ongoing [Kaggle competition](https://www.kaggle.com/competitions/ml-olympiad-multilingual-spell-correction/), the testing set labels will be released in March 2023.
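
As a rough sketch of how the training file might be read and a corrector scored per language, the example below uses the standard `csv` module and the column names from the Columns section. It assumes the file is comma-separated UTF-8 with a header row. The exact-match score is only an illustrative choice, not necessarily the competition's metric, and the sample rows are made up so that the snippet runs without the dataset.

```python
import csv
import io
from collections import defaultdict


def exact_match_by_language(rows, correct):
    """Return {language: fraction of rows where correct(Text) == Expected}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in rows:
        lang = row["Language"]
        totals[lang] += 1
        if correct(row["Text"]) == row["Expected"]:
            hits[lang] += 1
    return {lang: hits[lang] / totals[lang] for lang in totals}


# Made-up rows for illustration only; they are not taken from the dataset.
sample_csv = io.StringIO(
    "Id,Language,Text,Expected\n"
    "0,en,Ths is a test,This is a test\n"
    "1,en,All good here,All good here\n"
    "2,de,Das ist gtu,Das ist gut\n"
)
rows = list(csv.DictReader(sample_csv))

# Identity baseline: return the noisy text unchanged.
scores = exact_match_by_language(rows, lambda text: text)
print(scores)  # {'en': 0.5, 'de': 0.0}
```

To run this on the real data, the `io.StringIO` object would be replaced by `open("train.csv", newline="", encoding="utf-8")`, and the identity function by an actual correction routine.
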
## Columns
* `Id` - unique id for each sentence pair
* `Language` - one of `en` (English), `bg` (Bulgarian), `tr` (Turkish), `fr` (French), `de` (German)
* `Text` - noisy text
* `Expected` - original sentence before augmentation

## Cite
```bibtex
@software{Cholakov_Noisy_Sentences_Dataset_2023,
author = {Cholakov, Radostin},
doi = {10.5281/zenodo.7602333},
month = {2},
title = {{Noisy Sentences Dataset}},
url = {https://github.com/radi-cho/noisy-sentences-dataset},
version = {0.0.1},
year = {2023}
}
```

## Credits
The sentences in our dataset are sourced from [Wikipedia](https://en.wikipedia.org/wiki/Main_Page). The contents of Wikipedia are licensed under the [Creative Commons Attribution-ShareAlike 3.0 Unported License](https://en.wikipedia.org/wiki/Wikipedia:Text_of_the_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License).