https://github.com/esdurmus/Wikilingua
Multilingual abstractive summarization dataset extracted from WikiHow.
- Host: GitHub
- URL: https://github.com/esdurmus/Wikilingua
- Owner: esdurmus
- License: cc0-1.0
- Created: 2020-09-25T18:37:00.000Z (about 4 years ago)
- Default Branch: master
- Last Pushed: 2021-07-07T06:26:44.000Z (over 3 years ago)
- Last Synced: 2024-08-02T00:22:37.093Z (4 months ago)
- Size: 38.1 KB
- Stars: 79
- Watchers: 2
- Forks: 7
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- Text-Summarization-Repo - **WikiLingua**: A Multilingual Abstractive Summarization Dataset (2020) - [paper](https://arxiv.org/abs/2010.03093), [Colab notebook](https://colab.research.google.com/drive/1HxonmcM7EOQVal2I6oTi9QWEP257BgDP?usp=sharing) | How-to docs, 391w → 39w | 12,189 (Korean, out of 770,087 total) | 2020, [CC BY-NC-SA 3.0](https://creativecommons.org/licenses/by-nc-sa/3.0/) | (Resources / Datasets)
- StarryDivineSky - esdurmus/Wikilingua
README
# WikiLingua: A Multilingual Abstractive Summarization Dataset #
**UPDATE: We have created new Train/Test splits for all 17 languages that can be downloaded [here](https://drive.google.com/file/d/1PM7GFCy2gJL1WHqQz1dzqIDIEN6kfRoi/view?usp=sharing). These splits were created to ensure that there is no (document, summary) pair overlap across any of the 17 languages, so that they can be safely used for multilingual evaluations.**

This repo contains the dataset introduced in the following paper:

[WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization](https://arxiv.org/abs/2010.03093)

Download the dataset using [this link](https://drive.google.com/drive/folders/1PFvXUOsW_KSEzFm5ixB8J8BDB8zRRfHW?usp=sharing).

Please refer to this [Colab notebook](https://colab.research.google.com/drive/1HxonmcM7EOQVal2I6oTi9QWEP257BgDP?usp=sharing) to see how to align articles in other languages with the parallel English articles.
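As a rough illustration of what aligning a non-English article with its parallel English article can look like, here is a minimal sketch on toy in-memory data. The record layout, the key names (`english_url`, `english_section_name`, `document`, `summary`), and the example URLs are assumptions made for this sketch, not the confirmed format of the downloaded files; the Colab notebook remains the authoritative reference.

```python
# Toy records keyed by article URL, then by section name.
# ASSUMPTION: this nested layout and these key names are hypothetical.
english = {
    "https://example.org/en/how-to-x": {
        "Method 1": {
            "document": "Long English how-to steps.",
            "summary": "Short English summary.",
        },
    },
}

spanish = {
    "https://example.org/es/como-x": {
        "Método 1": {
            "document": "Pasos largos en español.",
            "summary": "Resumen corto en español.",
            # Hypothetical pointers to the parallel English section.
            "english_url": "https://example.org/en/how-to-x",
            "english_section_name": "Method 1",
        },
    },
}


def align_with_english(other, english):
    """Return (other_section, english_section) pairs that have an English counterpart."""
    pairs = []
    for sections in other.values():
        for section in sections.values():
            en_sections = english.get(section.get("english_url"), {})
            en_section = en_sections.get(section.get("english_section_name"))
            if en_section is not None:
                pairs.append((section, en_section))
    return pairs


pairs = align_with_english(spanish, english)
for es_section, en_section in pairs:
    print(es_section["summary"], "<->", en_section["summary"])
```

Sections without a matching English entry are simply skipped, so the result only contains parallel pairs.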
## Reference ##
Please cite the following paper:

```
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
```

## Description ##
The dataset includes ~770k article and summary pairs in 18 languages from WikiHow. We extracted gold-standard article-summary alignments across languages by aligning the images that are used to describe each how-to step in an article.
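The image-based alignment idea can be sketched as follows: if the same image illustrates a step in two language versions of an article, those two steps are treated as parallel. This is only a toy illustration under assumed names (`image`, `text`, and the image file names are hypothetical), not the authors' actual extraction pipeline.

```python
def align_steps_by_image(english_steps, other_steps):
    """Pair steps from two language versions that share the same image identifier."""
    # Index the English steps by their image identifier for O(1) lookup.
    by_image = {step["image"]: step for step in english_steps}
    return [
        (by_image[step["image"]], step)
        for step in other_steps
        if step["image"] in by_image
    ]


# Hypothetical toy data: each step carries the identifier of its illustration.
english_steps = [
    {"image": "step-001.jpg", "text": "Gather your materials."},
    {"image": "step-002.jpg", "text": "Mix the ingredients."},
]
german_steps = [
    {"image": "step-002.jpg", "text": "Vermische die Zutaten."},
    {"image": "step-001.jpg", "text": "Lege deine Materialien bereit."},
    {"image": "step-999.jpg", "text": "Schritt ohne englisches Gegenstück."},
]

aligned = align_steps_by_image(english_steps, german_steps)
for en_step, de_step in aligned:
    print(en_step["text"], "<->", de_step["text"])
```

Matching on the image rather than on step position means reordered or extra steps in one language do not break the pairing; steps whose image has no counterpart are left unaligned.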
The table below shows the number of article-summary pairs that have a parallel article-summary pair in English.

| Language | Num. parallel |
| ----------- | --------------|
| English | 141,457 |
| Spanish | 113,215 |
| Portuguese | 81,695 |
| French | 63,692 |
| German | 58,375 |
| Russian | 52,928 |
| Italian | 50,968 |
| Indonesian | 47,511 |
| Dutch | 31,270 |
| Arabic | 29,229 |
| Vietnamese | 19,600 |
| Chinese | 18,887 |
| Thai | 14,770 |
| Japanese | 12,669 |
| Korean | 12,189 |
| Hindi | 9,929 |
| Czech | 7,200 |
| Turkish     |    4,503      |

## License ##
- Article provided by wikiHow, a wiki building the world's largest, highest quality how-to manual. Please edit this article and find author credits at wikiHow.com. Content on wikiHow can be shared under a [Creative Commons license](http://creativecommons.org/licenses/by-nc-sa/3.0/).
- Refer to [this webpage](https://www.wikihow.com/wikiHow:Attribution) for the specific attribution guidelines.