https://github.com/esdurmus/Wikilingua

Multilingual abstractive summarization dataset extracted from WikiHow.

Host: GitHub
URL: https://github.com/esdurmus/Wikilingua
Owner: esdurmus
License: cc0-1.0
Created: 2020-09-25T18:37:00.000Z (over 5 years ago)
Default Branch: master
Last Pushed: 2025-03-14T18:37:21.000Z (over 1 year ago)
Last Synced: 2025-03-14T19:31:23.397Z (over 1 year ago)
Size: 43 KB
Stars: 87
Watchers: 2
Forks: 7
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

Text-Summarization-Repo - **WikiLingua**: A Multilingual Abstractive Summarization Dataset (2020) - [paper](https://arxiv.org/abs/2010.03093), [Colab notebook](https://colab.research.google.com/drive/1HxonmcM7EOQVal2I6oTi9QWEP257BgDP?usp=sharing) | How-to docs; 391w→39w | 12,189 Korean pairs (of 770,087 total) | 2020, [CC BY-NC-SA 3.0](https://creativecommons.org/licenses/by-nc-sa/3.0/) | (Resources / Datasets)
StarryDivineSky - esdurmus/Wikilingua
indicnlp_catalog - WikiLingua - Multilingual summarization dataset created from WikiHow. Contains 9k English-Hindi article-summary pairs. [[paper](https://arxiv.org/abs/2010.03093)] (Text Corpora / Summarization)

README

# WikiLingua: A Multilingual Abstractive Summarization Dataset #

**UPDATE:** We have created new train/test splits for all 18 languages that can be downloaded [here](https://drive.google.com/file/d/1sTCB5NDPq6vUOlxR29DbvSssErvXLD1d/view?usp=sharing). These splits ensure that no (document, summary) pair overlaps across any of the 18 languages, so they can be safely used for multilingual evaluations.
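
The no-overlap property described in the update can be checked mechanically. The snippet below is a minimal sketch, not the authors' procedure: it assumes each example carries an article ID shared by its parallel versions, and the `language -> split -> IDs` layout and the name `cross_lingual_leaks` are hypothetical, since the on-disk format of the splits is not described here.

```python
from typing import Dict, List, Set

# Hypothetical layout (an assumption): language -> split name -> article IDs.
Splits = Dict[str, Dict[str, List[str]]]


def cross_lingual_leaks(splits: Splits) -> Set[str]:
    """Return IDs found in any language's train split and any language's test split."""
    train_ids: Set[str] = set()
    test_ids: Set[str] = set()
    for by_split in splits.values():
        train_ids.update(by_split.get("train", []))
        test_ids.update(by_split.get("test", []))
    return train_ids & test_ids


# Toy data standing in for the real splits.
toy_splits: Splits = {
    "english": {"train": ["a1", "a2"], "test": ["a3"]},
    "spanish": {"train": ["a1"], "test": ["a3"]},
    "korean": {"train": ["a2"], "test": ["a3"]},
}
print(sorted(cross_lingual_leaks(toy_splits)))  # an empty list means no leakage
```

Pooling IDs across languages before intersecting is what makes the check cross-lingual: an article seen in English training data would be flagged even if only its Spanish version appears in a test split.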

This repo contains the dataset introduced in the following paper:

[WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization](https://arxiv.org/abs/2010.03093)

Download the dataset using [this link](https://drive.google.com/file/d/1sTCB5NDPq6vUOlxR29DbvSssErvXLD1d/view?usp=sharing).

## Reference ##

Please cite the following paper: 

```
@inproceedings{ladhak-wiki-2020,
    title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
    author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
    booktitle={Findings of EMNLP, 2020},
    year={2020}
}
```

## Description ##

The dataset includes ~770k article and summary pairs in 18 languages from WikiHow. We extracted gold-standard article-summary alignments across languages by aligning the images that are used to describe each how-to step in an article.
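
The image-based alignment idea can be illustrated with a small sketch. This is not the authors' extraction code: it assumes each how-to step is available as an `(image_id, text)` tuple and that parallel steps in different languages reuse the same image ID. The `Step` type, the `align_steps` function, and the sample IDs are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical step record (an assumption): (image_id, step_text).
Step = Tuple[str, str]


def align_steps(english: List[Step], other: List[Step]) -> List[Tuple[str, str]]:
    """Pair step texts from two language versions that share an image ID."""
    english_by_image: Dict[str, str] = {image_id: text for image_id, text in english}
    aligned: List[Tuple[str, str]] = []
    for image_id, text in other:
        if image_id in english_by_image:
            aligned.append((english_by_image[image_id], text))
    return aligned


english_steps = [
    ("img_01", "Fill a pot with water."),
    ("img_02", "Bring the water to a boil."),
]
spanish_steps = [
    ("img_01", "Llena una olla con agua."),
    ("img_03", "Sirve el plato."),
]
# Only img_01 appears in both versions, so exactly one pair is aligned.
print(align_steps(english_steps, spanish_steps))
```

Steps whose image appears in only one language version are dropped, which is the conservative choice when the image is the only alignment signal.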

The table below shows the number of article-summary pairs with a parallel article-summary pair in English.

| Language    | Num. parallel |
| ----------- | --------------|
| English     |   141,457     |
| Spanish     |   113,215     |
| Portuguese  |    81,695     |
| French      |    63,692     |
| German      |    58,375     |
| Russian     |    52,928     |
| Italian     |    50,968     |
| Indonesian  |    47,511     |
| Dutch       |    31,270     |
| Arabic      |    29,229     |
| Vietnamese  |    19,600     |
| Chinese     |    18,887     |
| Thai        |    14,770     |
| Japanese    |    12,669     |
| Korean      |    12,189     |
| Hindi       |     9,929     |
| Czech       |     7,200     |
| Turkish     |     4,503     |
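
As a quick sanity check, the per-language counts in the table above can be summed and compared with the ~770k figure in the description. The snippet below is a small sketch: the counts are copied from the table, and the variable names are illustrative only.

```python
# Counts copied from the table above.
parallel_counts = {
    "English": 141_457, "Spanish": 113_215, "Portuguese": 81_695,
    "French": 63_692, "German": 58_375, "Russian": 52_928,
    "Italian": 50_968, "Indonesian": 47_511, "Dutch": 31_270,
    "Arabic": 29_229, "Vietnamese": 19_600, "Chinese": 18_887,
    "Thai": 14_770, "Japanese": 12_669, "Korean": 12_189,
    "Hindi": 9_929, "Czech": 7_200, "Turkish": 4_503,
}

total = sum(parallel_counts.values())
print(f"{len(parallel_counts)} languages, {total:,} pairs")  # 18 languages, 770,087 pairs

# Share of the total contributed by each language, largest first.
for language, count in sorted(parallel_counts.items(), key=lambda kv: -kv[1]):
    print(f"{language:<12}{count / total:6.1%}")
```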

## License ##

- Article provided by wikiHow, a wiki building the world's largest, highest quality how-to manual. Please edit this article and find author credits at wikiHow.com. Content on wikiHow can be shared under a [Creative Commons license](http://creativecommons.org/licenses/by-nc-sa/3.0/).

- Refer to [this webpage](https://www.wikihow.com/wikiHow:Attribution) for the specific attribution guidelines.
