
https://github.com/oussamaahmia/TED-dataset


Host: GitHub
URL: https://github.com/oussamaahmia/TED-dataset
Owner: oussamaahmia
License: mit
Created: 2017-09-28T22:45:23.000Z (about 7 years ago)
Default Branch: master
Last Pushed: 2024-02-20T14:02:06.000Z (10 months ago)
Last Synced: 2024-07-15T13:54:47.509Z (5 months ago)
Size: 10.6 MB
Stars: 6
Watchers: 6
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE


README

# TED-dataset

The two sub-datasets, fd-TED and par-TED, will be updated on a regular basis to keep track of the new calls for tender published by the EU states.

- [par-TED](https://drive.google.com/drive/folders/1U2W-dKc7jJBtpt1iuLqDZgNQeM8wA7ds) is a multilingual aligned corpus covering 24 languages. It is a set of unique parallel sentences, each translated into at least 23 languages.

- The [fd-TED](https://drive.google.com/drive/folders/1G-21p8vxvbXtb6hoQPjbvMnokThyk8HI) corpus is built from the full content of the documents extracted from the [TED (Tenders Electronic Daily) platform](https://ted.europa.eu). This dataset can be used as a benchmark for supervised classification or for training machine learning models applied to business intelligence applications.

We also provide a filtered version of fd-TED, created by ignoring administrative information.
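To make the par-TED description more concrete, here is a minimal Python sketch of how a set of aligned sentences could be handled once loaded into memory. The in-memory layout (one dict per aligned unit, mapping a language code to a sentence), the function names `well_covered` and `sentence_pairs`, and the toy sentences are all assumptions made for illustration; they do not describe the actual file format of the published corpus, which should be checked in the download folder.

```python
# Hypothetical layout: each aligned unit maps a language code to its sentence.
# This is an illustrative assumption, not the real par-TED file format.
TOY_UNITS = [
    {"en": "Supply of office furniture.",
     "fr": "Fourniture de mobilier de bureau.",
     "de": "Lieferung von Bueromoebeln."},
    {"en": "Road maintenance services.",
     "fr": "Services d'entretien routier."},
    # Exact duplicate of the first unit, to show de-duplication of pairs.
    {"en": "Supply of office furniture.",
     "fr": "Fourniture de mobilier de bureau.",
     "de": "Lieferung von Bueromoebeln."},
    {"de": "Bauarbeiten."},
]


def well_covered(units, min_languages=23):
    """Keep the units that have a sentence in at least `min_languages` languages."""
    return [unit for unit in units if len(unit) >= min_languages]


def sentence_pairs(units, src, tgt):
    """Return unique (src, tgt) sentence pairs, in first-seen order."""
    seen = set()
    pairs = []
    for unit in units:
        if src in unit and tgt in unit:
            pair = (unit[src], unit[tgt])
            if pair not in seen:
                seen.add(pair)
                pairs.append(pair)
    return pairs


if __name__ == "__main__":
    # The toy data only has three languages, so lower the threshold here.
    print(len(well_covered(TOY_UNITS, min_languages=2)))
    for source, target in sentence_pairs(TOY_UNITS, "en", "fr"):
        print(source, "|", target)
```

With the real corpus, the default threshold of 23 mirrors the "at least 23 languages" property stated above, and `sentence_pairs` gives the bilingual view typically needed for machine translation experiments.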

 [comment]: <> (***NB: The currently published dataset, contains only filtered documents. The raw version will be soon available***)

 

For further information, please refer to the [LREC 2018 article](http://www.lrec-conf.org/proceedings/lrec2018/pdf/832.pdf).

 

 **Citation:**

``@inproceedings{ahmia-etal-2018-two,
    title = "Two Multilingual Corpora Extracted from the Tenders Electronic Daily for Machine Learning and Machine Translation Applications.",
    author = "Ahmia, Oussama  and
      B{\'e}chet, Nicolas  and
      Marteau, Pierre-Fran{\c{c}}ois",
    booktitle = "Proceedings of the Eleventh International Conference on Language Resources and Evaluation ({LREC} 2018)",
    month = may,
    year = "2018",
    address = "Miyazaki, Japan",
    publisher = "European Language Resources Association (ELRA)",
    url = "https://www.aclweb.org/anthology/L18-1583",
}``