https://github.com/esdurmus/Wikilingua
Multilingual abstractive summarization dataset extracted from WikiHow.
- Host: GitHub
- URL: https://github.com/esdurmus/Wikilingua
- Owner: esdurmus
- License: cc0-1.0
- Created: 2020-09-25T18:37:00.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2025-03-14T18:37:21.000Z (9 months ago)
- Last Synced: 2025-03-14T19:31:23.397Z (9 months ago)
- Size: 43 KB
- Stars: 87
- Watchers: 2
- Forks: 7
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- Text-Summarization-Repo - **WikiLingua**: A Multilingual Abstractive Summarization Dataset (2020) - [paper](https://arxiv.org/abs/2010.03093), [Colab notebook](https://colab.research.google.com/drive/1HxonmcM7EOQVal2I6oTi9QWEP257BgDP?usp=sharing) | How-to docs; 391w → 39w | 12,189 Korean pairs (of 770,087 total) | 2020, [CC BY-NC-SA 3.0](https://creativecommons.org/licenses/by-nc-sa/3.0/) | (Resources / Datasets)
- StarryDivineSky - esdurmus/Wikilingua
- indicnlp_catalog - WikiLingua - multilingual summarization dataset created from WikiHow. Contains 9k English-Hindi article-summary pairs. [[paper](https://arxiv.org/abs/2010.03093)] (Text Corpora / Summarization)
README
# WikiLingua: A Multilingual Abstractive Summarization Dataset #
**UPDATE:\
We have created new Train/Test splits for all 18 languages that can be downloaded [here](https://drive.google.com/file/d/1sTCB5NDPq6vUOlxR29DbvSssErvXLD1d/view?usp=sharing). These splits were created to ensure that there is no (document, summary) pair overlap across any of the 18 languages so that they can be safely used for multilingual evaluations.**
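The no-overlap property can be checked with a short script. The following is a minimal sketch, not the authors' procedure: it assumes each (document, summary) pair carries a hypothetical alignment-group id shared by its parallel versions in other languages, and the `splits` layout and `leaked_groups` helper are invented for illustration. The format of the downloaded files is not described here, so loading is left out.

```python
# Hypothetical layout: split name -> language -> set of alignment-group ids.
# A group id stands for one (document, summary) pair and its parallel
# versions in other languages; this id scheme is an assumption.
splits = {
    "train": {"english": {"g1", "g2"}, "spanish": {"g1"}},
    "test": {"english": {"g3"}, "spanish": {"g3"}},
}


def leaked_groups(splits):
    """Return group ids that occur in both train and test, in any language."""
    train_ids = set().union(*splits["train"].values())
    test_ids = set().union(*splits["test"].values())
    return train_ids & test_ids


leaks = leaked_groups(splits)
print(sorted(leaks))  # an empty list means no cross-lingual overlap
```

Pooling ids across languages before intersecting matters for multilingual evaluation: a pair that is held out in one language should not appear in the training data of another.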
This repo contains the dataset introduced in the following paper:
[WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization](https://arxiv.org/abs/2010.03093)
Download the dataset using [this link](https://drive.google.com/file/d/1sTCB5NDPq6vUOlxR29DbvSssErvXLD1d/view?usp=sharing).
## Reference ##
Please cite the following paper:
```
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
```
## Description ##
The dataset includes ~770k article and summary pairs in 18 languages from WikiHow. We extracted gold-standard article-summary alignments across languages by aligning the images that are used to describe each how-to step in an article.
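The image-based alignment idea can be illustrated with a small sketch. This is not the authors' extraction pipeline: the record layout, the image ids, the example texts, and the `align_by_image` helper are all hypothetical, and the only assumption carried over from the description is that parallel how-to steps reuse the same image.

```python
from collections import defaultdict

# Hypothetical step records: (language, article_id, image_id, summary, document).
# Field names and the image-id scheme are assumptions for illustration only.
steps = [
    ("english", "en-1", "img-001", "Boil the water.", "Fill a pot and heat it until it boils."),
    ("spanish", "es-7", "img-001", "Hierve el agua.", "Llena una olla y caliéntala hasta que hierva."),
    ("english", "en-1", "img-002", "Add the pasta.", "Pour the pasta into the boiling water."),
    ("spanish", "es-7", "img-002", "Añade la pasta.", "Vierte la pasta en el agua hirviendo."),
    ("french", "fr-3", "img-999", "Étape sans équivalent.", "Aucune image partagée."),
]


def align_by_image(steps, pivot="english"):
    """Pair each non-pivot step with the pivot-language step that shares its image."""
    by_image = defaultdict(dict)
    for lang, article_id, image_id, summary, document in steps:
        by_image[image_id][lang] = (article_id, summary, document)

    aligned = []
    for image_id, langs in by_image.items():
        if pivot not in langs:
            continue  # no pivot-language step uses this image
        for lang, record in langs.items():
            if lang != pivot:
                aligned.append((image_id, lang, langs[pivot], record))
    return aligned


pairs = align_by_image(steps)
for image_id, lang, pivot_record, record in pairs:
    print(image_id, lang, pivot_record[1], "<->", record[1])
```

In this toy input the two Spanish steps are paired with their English counterparts, while the French step is dropped because no English step shares its image.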
The table below shows the number of article-summary pairs with a parallel article-summary pair in English.
| Language    | Num. parallel |
| ----------- | ------------- |
| English | 141,457 |
| Spanish | 113,215 |
| Portuguese | 81,695 |
| French | 63,692 |
| German | 58,375 |
| Russian | 52,928 |
| Italian | 50,968 |
| Indonesian | 47,511 |
| Dutch | 31,270 |
| Arabic | 29,229 |
| Vietnamese | 19,600 |
| Chinese | 18,887 |
| Thai | 14,770 |
| Japanese | 12,669 |
| Korean | 12,189 |
| Hindi | 9,929 |
| Czech | 7,200 |
| Turkish | 4,503 |
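As a quick sanity check, the rows of the table can be summed and compared with the ~770k figure given in the description. The counts below are copied from the table; the variable names are illustrative.

```python
# Per-language counts copied from the table above.
counts = {
    "English": 141_457, "Spanish": 113_215, "Portuguese": 81_695,
    "French": 63_692, "German": 58_375, "Russian": 52_928,
    "Italian": 50_968, "Indonesian": 47_511, "Dutch": 31_270,
    "Arabic": 29_229, "Vietnamese": 19_600, "Chinese": 18_887,
    "Thai": 14_770, "Japanese": 12_669, "Korean": 12_189,
    "Hindi": 9_929, "Czech": 7_200, "Turkish": 4_503,
}

total = sum(counts.values())
print(len(counts), total)  # 18 770087
```

The total of 770,087 pairs across 18 languages agrees with the "~770k" stated in the description.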
## License ##
- Article provided by wikiHow, a wiki building the world's largest, highest quality how-to manual. Please edit this article and find author credits at wikiHow.com. Content on wikiHow can be shared under a [Creative Commons license](http://creativecommons.org/licenses/by-nc-sa/3.0/).
- Refer to [this webpage](https://www.wikihow.com/wikiHow:Attribution) for the specific attribution guidelines.