https://github.com/radi-cho/noisy-sentences-dataset

550K sentences in 5 European languages augmented with noise for training and evaluating spell correction tools or machine learning models.
dataset natural-language-processing nlp

Host: GitHub
URL: https://github.com/radi-cho/noisy-sentences-dataset
Owner: radi-cho
License: mit
Created: 2023-02-03T09:18:55.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2023-02-03T10:11:20.000Z (almost 2 years ago)
Last Synced: 2024-05-01T17:52:41.289Z (8 months ago)
Topics: dataset, natural-language-processing, nlp
Homepage:
Size: 7.81 KB
Stars: 2
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Citation: CITATION.cff

README

# Noisy Sentences Dataset

[![DOI](https://zenodo.org/badge/596943135.svg)](https://zenodo.org/badge/latestdoi/596943135)

550K sentences in 5 European languages augmented with noise for training and evaluating spell correction tools or machine learning models. The dataset is constructed to cover representatives of several language families used across Europe:

- Germanic - English, German;

- Romance - French;

- Slavic - Bulgarian;

- Turkic - Turkish.

**Use case example:** Apply language models or other techniques to compare the sentence pairs and reconstruct the original sentences from the augmented ones. You can use a single multilingual solution to solve the challenge or employ multiple models/techniques for the separate languages. Per-word dictionary lookup is also an option.
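The per-word dictionary lookup mentioned above could be sketched as follows. This is only an illustration and not part of the dataset: the `VOCAB` list, the `correct_word` and `correct_sentence` helpers, and the `0.6` similarity cutoff are all assumptions, and the matching relies on Python's standard `difflib` module. A real vocabulary would be built separately for each language, for example from the `Expected` column of the training set.

```python
import difflib

# Hypothetical toy vocabulary; a real one would be built per language,
# e.g. from the `Expected` column of train.csv or an external word list.
VOCAB = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]


def correct_word(word, vocab=VOCAB, cutoff=0.6):
    """Return the closest vocabulary word, or the word itself if none is close."""
    if word in vocab:
        return word
    # get_close_matches ranks candidates by difflib's similarity ratio.
    matches = difflib.get_close_matches(word, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_sentence(sentence):
    """Correct each whitespace-separated token independently."""
    return " ".join(correct_word(token) for token in sentence.split())


print(correct_sentence("the quikc brwon fox jmups over the lzay dog"))
```

A lookup like this ignores context, punctuation, and casing, so it is best seen as a simple baseline to compare language models against.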

## Files

*   **train.csv** - the training set

*   **test.csv** - the test set

Because of an ongoing [Kaggle competition](https://www.kaggle.com/competitions/ml-olympiad-multilingual-spell-correction/), the testing set labels will be released in March 2023.

## Columns

*   `Id` - unique id for each sentence pair

*   `Language` - one of `en` (English), `bg` (Bulgarian), `tr` (Turkish), `fr` (French), `de` (German)

*   `Text` - noisy text

*   `Expected` - original sentence before augmentation
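As a rough sketch of how these columns might be consumed, the snippet below parses rows in this layout and scores a predictor per language. Several things here are assumptions rather than facts about the dataset: that the files are comma-separated with a header row matching the column names, the inline `SAMPLE` rows (invented for illustration), the `exact_match_by_language` helper, and the choice of exact match as the metric.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows in the documented column layout. With the real
# data, replace io.StringIO(SAMPLE) with
# open("train.csv", newline="", encoding="utf-8").
SAMPLE = """Id,Language,Text,Expected
0,en,Teh cat sat.,The cat sat.
1,en,A dog barked.,A dog barked.
2,de,Guten Morgn.,Guten Morgen.
"""


def exact_match_by_language(rows, predict):
    """Fraction of rows per language where predict(Text) equals Expected."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in rows:
        lang = row["Language"]
        totals[lang] += 1
        # A bool adds as 0 or 1.
        hits[lang] += predict(row["Text"]) == row["Expected"]
    return {lang: hits[lang] / totals[lang] for lang in totals}


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
# Identity baseline: leave the noisy text unchanged.
scores = exact_match_by_language(rows, predict=lambda text: text)
print(scores)
```

The identity baseline gives a floor for any correction method: a model that scores below it is making the text worse on average.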

## Cite

```bibtex
@software{Cholakov_Noisy_Sentences_Dataset_2023,
  author = {Cholakov, Radostin},
  doi = {10.5281/zenodo.7602333},
  month = {2},
  title = {{Noisy Sentences Dataset}},
  url = {https://github.com/radi-cho/noisy-sentences-dataset},
  version = {0.0.1},
  year = {2023}
}
```

## Credits

The sentences in our dataset are sourced from [Wikipedia](https://en.wikipedia.org/wiki/Main_Page). The contents of Wikipedia are licensed under the [Creative Commons Attribution-ShareAlike 3.0 Unported License](https://en.wikipedia.org/wiki/Wikipedia:Text_of_the_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License).