{"id":13635678,"url":"https://github.com/ThomasScialom/MLSUM","last_synced_at":"2025-04-19T04:31:24.486Z","repository":{"id":91201204,"uuid":"260230504","full_name":"ThomasScialom/MLSUM","owner":"ThomasScialom","description":"The large-scale MultiLingual SUMmarization corpus","archived":false,"fork":false,"pushed_at":"2022-05-26T07:17:23.000Z","size":38,"stargazers_count":26,"open_issues_count":0,"forks_count":4,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-11-09T05:34:45.208Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ThomasScialom.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-04-30T14:14:13.000Z","updated_at":"2024-09-04T07:25:54.000Z","dependencies_parsed_at":"2023-07-08T14:32:05.143Z","dependency_job_id":null,"html_url":"https://github.com/ThomasScialom/MLSUM","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ThomasScialom%2FMLSUM","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ThomasScialom%2FMLSUM/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ThomasScialom%2FMLSUM/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ThomasScialom%2FMLSUM/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ThomasScialom","download_url":"https://codeload.github.com/ThomasScialom/MLSUM/t
ar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249606341,"owners_count":21298851,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T00:00:49.540Z","updated_at":"2025-04-19T04:31:24.229Z","avatar_url":"https://github.com/ThomasScialom.png","language":"Python","funding_links":[],"categories":["Resources","Anthropomorphic-Taxonomy"],"sub_categories":["Datasets","Typical Intelligence Quotient (IQ)-General Intelligence evaluation benchmarks"],"readme":"# MLSUM\n\nThe original dataset as used in the paper is available on HuggingFace datasets (https://github.com/huggingface/datasets/tree/master/datasets/mlsum)\n\nUsage of dataset is restricted to non-commercial research purposes only.\nCopyright belongs to the original copyright holders.\n\nIt is also available [here](https://drive.google.com/file/d/1Z4oswHQ8yYPHqxendaC3jAzAfYEP9CbJ/view?usp=sharing).\n\n## Outputs used in the paper \n\nWe make available the [outputs](https://drive.google.com/file/d/1YFlBhEO-yLv28xtAEX28p1JxWfjBKNzn/view?usp=sharing) from our models so anyone can fairly compare. 
\nNote that if you obtain a different ROUGE score, it might be due to the library: \n- for BERT-Gen, we used the [UNILM Rouge](https://github.com/microsoft/unilm/blob/master/unilm-v1/src/cnndm/eval.py)\n- for the other models, we used the [ROUGE](https://pypi.org/project/rouge/) PyPI library, version 0.3.1\n\n## Instructions and code to rebuild the dataset from the archived web pages:\n\n#### Set up the environment  \n```shell\ncd MLSUM\nconda create --name mlsum\nconda activate mlsum\nconda install pip\npip install -r requirements.txt\n```\n\n#### Download the URLs \n\nThe list of URLs is available here:\n\n```shell\nhttps://drive.google.com/file/d/1qViYZwl82yyyTE5mKryhVWbFao4Ck-7g/view?usp=sharing\n```\n\nIn the main folder MLSUM, create a folder `data` and unzip the URL archive there. Also create an empty folder `processed`, in which the data will be stored. \n\n```shell\nmkdir data\ncd data\nunzip urls.zip\nmkdir processed\n```\n\n#### Scrape all the MLSUM data from web.archive\n\nNote that some URLs may fail to be processed for various reasons. All failed URLs are listed in the 'data/processed/*.errors.txt' files. 
\n\n```shell\npython run_all.py\n```\n\n## Reproducing the results\n\n### Training BERT-gen \n\nWe used the [UniLM code](https://github.com/microsoft/unilm/tree/master/unilm-v1#abstractive-summarization---cnn--daily-mail) for abstractive summarization with the default parameters, except:\n- `num_train_epochs` is set to 5 (instead of 30)\n- `model_recover_path` is simply the [multilingual BERT checkpoint](https://huggingface.co/bert-base-multilingual-uncased/tree/main) (instead of unilmv1-large-cased.bin)\n\n\n### Russian Score:\n\nFor Russian, the results seem to differ considerably depending on the ROUGE implementation.\nTo reproduce the scores reported in the paper, install the following ROUGE package:\n\n```shell\npip install rouge==0.3.1\n```\n\nThen the script below should give you the corresponding results: \n\n```python\nfrom rouge import Rouge\n\ndef get_rouge(hypothesis, references):\n    rouge = Rouge()\n    preprocess_exs = lambda exs : [ex.strip().lower() for ex in exs]\n    rouge_scores = rouge.get_scores(preprocess_exs(hypothesis), preprocess_exs(references), avg=True)\n    return {k: v['f'] for k, v in rouge_scores.items()}\n    \nrefs = ['Старший преподаватель института коммунального хозяйства и строительства был задержан на днях в Москве за растление школьника',\n    'Манежная площадь Москвы стала местом последнего в 2009 году убийства',\n    'Президент РФ Дмитрий Медведев с семьей проводит новогодние праздники на горнолыжном курорте “Красная Поляна”, а в воскресенье к нему в гости приехал и премьер Владимир Путин']\n\ngens = ['Миллениалы , которые не знают , уходит электричество из розетки или нет , если выключить свет , крайне обрадовались , когда недавно Илон Маск вывел на орбиту первые 60 спутников для интернет-сети Starlink . Основной посыл — началось ! 
Скоро у нас везде будет бесплатный спутниковый Интернет , до которого не дотянутся руки Роскомнадзора .', \n    'Если верить южнокорейскому изданию , ссылающемуся на анонимные источники , спецпредставитель Ким Хёк Чхоль и четверо неназванных сотрудников Министерства иностранных дел КНДР были казнены в марте в Пхеньяне на военном аэродроме Мирим . Напомним , что встреча на высшем уровне между лидерами Соединенных Штатов и Северной Кореи во вьетнамской столлице , на которую Трамп возлагал , судя по всему , немалые надежды , была закончена раньше намеченного срока . Сторонам не удалось ни о чем договориться , и никаких соглашений по ядерному разоружению Пхеньяна подписано не было .',\n    'ЦИТАТА ДНЯ Андрей ВОРОБЬЕВ : « Наша ключевая задача — сделать так , чтобы люди , вызвавшие « скорую » , могли точно знать , когда к ним приедет бригада . Такой сервис есть в Европе . Должен быть и у нас » .']\n\nprint(get_rouge(gens, refs))\n```\n\n*Output:*\n```\n{'rouge-1': 0.05170222555648688, 'rouge-2': 0.0, 'rouge-l': 0.04330277388057737}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FThomasScialom%2FMLSUM","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FThomasScialom%2FMLSUM","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FThomasScialom%2FMLSUM/lists"}