https://github.com/oussamaahmia/TED-dataset



# TED-dataset
The two sub-datasets, fd-TED and par-TED, will be updated on a regular basis to keep track of the new calls for
tenders published by the EU member states.

- The [par-TED](https://drive.google.com/drive/folders/1U2W-dKc7jJBtpt1iuLqDZgNQeM8wA7ds) corpus is a multilingual (24 languages) aligned corpus: a set of unique parallel sentences, each translated into at least 23 languages.

- The [fd-TED](https://drive.google.com/drive/folders/1G-21p8vxvbXtb6hoQPjbvMnokThyk8HI) corpus is built from the full content of the documents extracted from the [TED (Tenders Electronic Daily) platform](https://ted.europa.eu). This dataset can be used as a benchmark for supervised classification or for training machine learning models applied to business intelligence applications.
We also provide a filtered version of fd-TED, created by ignoring administrative information.
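
Since each par-TED sentence is available in at least 23 of the 24 languages, a given language may be missing from a given sentence. The snippet below is a minimal sketch of how bilingual pairs for machine translation could be pulled from such aligned records. The in-memory layout (one dict per sentence, keyed by language code), the `extract_pairs` helper, and the sample sentences are all assumptions made for illustration; they do not describe the actual file format of the published corpus.

```python
# Hypothetical layout: one dict per aligned sentence, mapping a language
# code to the sentence text. The real par-TED files may be organised differently.
aligned_sentences = [
    {
        "en": "Supply of office furniture",
        "fr": "Fourniture de mobilier de bureau",
        "de": "Lieferung von Büromöbeln",
    },
    {
        # Illustrative record where the German translation is missing.
        "en": "Road maintenance services",
        "fr": "Services d'entretien routier",
    },
]


def extract_pairs(records, source_lang, target_lang):
    """Return (source, target) pairs for records that contain both languages."""
    return [
        (rec[source_lang], rec[target_lang])
        for rec in records
        if source_lang in rec and target_lang in rec
    ]


en_fr = extract_pairs(aligned_sentences, "en", "fr")
en_de = extract_pairs(aligned_sentences, "en", "de")
```

Records that lack either language are skipped rather than padded, so the resulting list stays strictly parallel.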

[comment]: <> (***NB: The currently published dataset, contains only filtered documents. The raw version will be soon available***)

For further information please refer to this [article](http://www.lrec-conf.org/proceedings/lrec2018/pdf/832.pdf).

**Citation:**
\
``@inproceedings{ahmia-etal-2018-two,
title = "Two Multilingual Corpora Extracted from the Tenders Electronic Daily for Machine Learning and Machine Translation Applications.",
author = "Ahmia, Oussama and
B{\'e}chet, Nicolas and
Marteau, Pierre-Fran{\c{c}}ois",
booktitle = "Proceedings of the Eleventh International Conference on Language Resources and Evaluation ({LREC} 2018)",
month = may,
year = "2018",
address = "Miyazaki, Japan",
publisher = "European Language Resources Association (ELRA)",
url = "https://www.aclweb.org/anthology/L18-1583",
}``