{"id":13635666,"url":"https://github.com/esdurmus/Wikilingua","last_synced_at":"2025-04-19T04:31:21.845Z","repository":{"id":48884734,"uuid":"298652396","full_name":"esdurmus/Wikilingua","owner":"esdurmus","description":"Multilingual abstractive summarization dataset extracted from WikiHow. ","archived":false,"fork":false,"pushed_at":"2025-03-14T18:37:21.000Z","size":44,"stargazers_count":87,"open_issues_count":1,"forks_count":7,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-14T19:31:23.397Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"cc0-1.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/esdurmus.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-09-25T18:37:00.000Z","updated_at":"2025-03-14T18:37:25.000Z","dependencies_parsed_at":"2022-09-23T04:44:07.373Z","dependency_job_id":null,"html_url":"https://github.com/esdurmus/Wikilingua","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/esdurmus%2FWikilingua","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/esdurmus%2FWikilingua/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/esdurmus%2FWikilingua/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/esdurmus%2FWikilingua/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/esdurmus","download_url":"https://codeload.github.com/esdurmus/Wikilingua/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"g
ithub","repositories_count":249606341,"owners_count":21298851,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T00:00:49.253Z","updated_at":"2025-04-19T04:31:21.840Z","avatar_url":"https://github.com/esdurmus.png","language":null,"funding_links":[],"categories":["Resources","NLP语料和数据集","\u003ca name='TextCorpora'\u003e\u003c/a\u003eText Corpora"],"sub_categories":["Datasets","其他_文本生成、文本对话","\u003ca name='Summarization'\u003e\u003c/a\u003eSummarization"],"readme":"# WikiLingua: A Multilingual Abstractive Summarization Dataset #\n\n**UPDATE:\\\nWe have created new Train/Test splits for all 17 languages that can be downloaded [here](https://drive.google.com/file/d/1sTCB5NDPq6vUOlxR29DbvSssErvXLD1d/view?usp=sharing). 
These splits were created to ensure that there is no (document, summary) pair overlap across any of the 18 languages so that they can be safely used for multilingual evaluations.**\n\nThis repo contains the dataset introduced in the following paper:\n\n[WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization](https://arxiv.org/abs/2010.03093)\n\nDownload the dataset using [this link](https://drive.google.com/file/d/1sTCB5NDPq6vUOlxR29DbvSssErvXLD1d/view?usp=sharing).\n\n## Reference ##\nPlease cite the following paper:\n\n```\n@inproceedings{ladhak-wiki-2020,\n    title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},\n    author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},\n    booktitle={Findings of EMNLP, 2020},\n    year={2020}\n}\n```\n\n## Description ##\n\nThe dataset includes ~770k article-summary pairs in 18 languages from WikiHow. We extracted gold-standard article-summary alignments across languages by aligning the images that are used to describe each how-to step in an article.\n\nThe table below shows the number of article-summary pairs in each language that have a parallel article-summary pair in English.\n\n| Language    | Num. 
parallel |\n| ----------- | --------------|\n| English     |   141,457     |\n| Spanish     |   113,215     |\n| Portuguese  |    81,695     |\n| French      |    63,692     |\n| German      |    58,375     |\n| Russian     |    52,928     |\n| Italian     |    50,968     |\n| Indonesian  |    47,511     |\n| Dutch       |    31,270     |\n| Arabic      |    29,229     |\n| Vietnamese  |    19,600     |\n| Chinese     |    18,887     |\n| Thai        |    14,770     |\n| Japanese    |    12,669     |\n| Korean      |    12,189     |\n| Hindi       |     9,929     |\n| Czech       |     7,200     |\n| Turkish     |     4,503     |\n\n## License ##\n\n- Article provided by wikiHow \u003chttps://www.wikihow.com/Main-Page\u003e, a wiki building the world's largest, highest quality how-to manual. Please edit this article and find author credits at wikiHow.com. Content on wikiHow can be shared under a [Creative Commons license](http://creativecommons.org/licenses/by-nc-sa/3.0/).\n\n- Refer to [this webpage](https://www.wikihow.com/wikiHow:Attribution) for the specific attribution guidelines. \n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fesdurmus%2FWikilingua","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fesdurmus%2FWikilingua","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fesdurmus%2FWikilingua/lists"}