{"id":13564215,"url":"https://github.com/IlyaGusev/summarus","last_synced_at":"2025-04-03T21:30:33.317Z","repository":{"id":41488086,"uuid":"159055060","full_name":"IlyaGusev/summarus","owner":"IlyaGusev","description":"Models for automatic abstractive summarization","archived":false,"fork":false,"pushed_at":"2022-07-03T19:02:08.000Z","size":496,"stargazers_count":171,"open_issues_count":0,"forks_count":20,"subscribers_count":9,"default_branch":"master","last_synced_at":"2024-11-04T17:47:16.205Z","etag":null,"topics":["deep-learning","machine-learning","nlp","pytorch","summarization"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/IlyaGusev.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-11-25T17:07:26.000Z","updated_at":"2024-10-31T19:23:09.000Z","dependencies_parsed_at":"2022-07-19T12:54:06.391Z","dependency_job_id":null,"html_url":"https://github.com/IlyaGusev/summarus","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/IlyaGusev%2Fsummarus","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/IlyaGusev%2Fsummarus/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/IlyaGusev%2Fsummarus/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/IlyaGusev%2Fsummarus/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/IlyaGusev","download_url":"https://codeload.github.com/IlyaGusev/summarus/tar.gz/refs/heads/master","host":{"name":"G
itHub","url":"https://github.com","kind":"github","repositories_count":247082851,"owners_count":20880730,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","machine-learning","nlp","pytorch","summarization"],"created_at":"2024-08-01T13:01:28.124Z","updated_at":"2025-04-03T21:30:32.639Z","avatar_url":"https://github.com/IlyaGusev.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# summarus\n\n[![Tests Status](https://github.com/IlyaGusev/summarus/actions/workflows/python-package.yml/badge.svg)](https://github.com/IlyaGusev/summarus/actions/workflows/python-package.yml)\n[![Code Climate](https://codeclimate.com/github/IlyaGusev/summarus/badges/gpa.svg)](https://codeclimate.com/github/IlyaGusev/summarus)\n\nAbstractive and extractive summarization models, mostly for Russian language. 
Built on top of [AllenNLP](https://allennlp.org/).\n\nYou can also check out the MBART-based Russian summarization model on Hugging Face: [mbart_ru_sum_gazeta](https://huggingface.co/IlyaGusev/mbart_ru_sum_gazeta)\n\nBased on the following papers:\n* [SummaRuNNer: A Recurrent Neural Network based Sequence Model for Extractive Summarization of Documents](https://arxiv.org/abs/1611.04230)\n* [Get To The Point: Summarization with Pointer-Generator Networks](https://arxiv.org/abs/1704.04368)\n* [Self-Attentive Model for Headline Generation](https://arxiv.org/abs/1901.07786)\n* [Text Summarization with Pretrained Encoders](https://arxiv.org/abs/1908.08345)\n* [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210)\n\n## Contacts\n\n* Telegram: [@YallenGusev](https://t.me/YallenGusev)\n\n## Prerequisites\n```\npip install -r requirements.txt\n```\n\n## Commands\n\n#### train.sh\n\nScript for training a model, based on the AllenNLP 'train' command.\n\n| Argument | Required | Description                                      |\n|:---------|:---------|--------------------------------------------------|\n| -c       | true     | path to file with configuration                  |\n| -s       | true     | path to directory where model will be saved      |\n| -t       | true     | path to train dataset                            |\n| -v       | true     | path to val dataset                              |\n| -r       | false    | recover from checkpoint                          |\n\n#### predict.sh\n\nScript for model evaluation. 
The test dataset should have the same format as the train dataset.\n\n| Argument | Required | Default | Description                                                      |\n|:---------|:---------|:--------|:-----------------------------------------------------------------|\n| -t       | true     |         | path to test dataset                                             |\n| -m       | true     |         | path to tar.gz archive with model                                |\n| -p       | true     |         | name of Predictor                                                |\n| -c       | false    | 0       | CUDA device                                                      |\n| -L       | true     |         | Language (\"ru\" or \"en\")                                          |\n| -b       | false    | 32      | size of a batch with test examples to run simultaneously         |\n| -M       | false    |         | path to meteor.jar for Meteor metric                             |\n| -T       | false    |         | tokenize gold and predicted summaries before metrics calculation |\n| -D       | false    |         | save temporary files with gold and predicted summaries           |\n\n#### summarus.util.train_subword_model\n\nScript for subword model training.\n\n| Argument          | Default | Description                                                        |\n|:------------------|:--------|:-------------------------------------------------------------------|\n| --train-path      |         | path to train dataset                                              |\n| --model-path      |         | path to directory where generated subword model will be saved      |\n| --model-type      | bpe     | type of subword model, see sentencepiece                           |\n| --vocab-size      | 50000   | size of the resulting subword model vocabulary                     |\n| --config-path     |         | path to file with configuration for DatasetReader (with parse_set) |\n\n\n## 
Headline generation\n\n* First paper: [Importance of Copying Mechanism for News Headline Generation](http://www.dialog-21.ru/media/4599/gusevio-152.pdf)\n* Slides: [Importance of Copying Mechanism for News Headline Generation](https://www.dropbox.com/s/agtvl3umlc6vci5/ICMNHG-Presentation.pdf)\n* Second paper: [Advances of Transformer-Based Models for News Headline Generation](https://arxiv.org/abs/2007.05044)\n\n#### Dataset splits:\n* RIA original dataset: https://github.com/RossiyaSegodnya/ria_news_dataset\n* RIA train/val/test: https://www.dropbox.com/s/rermx1r8lx9u7nl/ria.tar.gz\n* RIA dataset preprocessed for mBART: https://www.dropbox.com/s/iq2ih8sztygvz0m/ria_data_mbart_512_200.tar.gz\n* Lenta original dataset: https://github.com/yutkin/Lenta.Ru-News-Dataset\n* Lenta train/val/test: https://www.dropbox.com/s/v9i2nh12a4deuqj/lenta.tar.gz\n* Lenta dataset preprocessed for mBART: https://www.dropbox.com/s/4oo8jazmw3izqvr/lenta_mbart_data_512_200.tar.gz\n* Telegram train dataset with split: https://www.dropbox.com/s/ykqk49a8avlmnaf/ru_all_split.tar.gz\n* Telegram test dataset with multiple references: https://github.com/dialogue-evaluation/Russian-News-Clustering-and-Headline-Generation/blob/main/data/headline_generation/headline_generation_answers.jsonl\n\n#### Models:\n* [ria_copynet_10kk](https://www.dropbox.com/s/78ni5gnbcjz59ss/ria_copynet_10kk.tar.gz)\n* [ria_pgn_24kk](https://www.dropbox.com/s/6wa1a2qzvqx5tti/ria_pgn_24kk.tar.gz)\n* [ria_mbart](https://www.dropbox.com/s/bhrfd5o5etz8hso/ria_mbart_checkpoint_4.tar.gz)\n* [rubert_telegram_headlines](https://huggingface.co/IlyaGusev/rubert_telegram_headlines)\n\nPrediction script:\n```\n./predict.sh -t \u003cpath_to_test_dataset\u003e -m ria_pgn_24kk.tar.gz -p subwords_summary -L ru \n```\n\n#### Results\n##### Train dataset: RIA, test dataset: RIA\n\n| Model                     | R-1-f | R-2-f | R-L-f | BLEU  |\n|:--------------------------|:------|:------|:------|:------|\n| ria_copynet_10kk          | 40.0 
 | 23.3  | 37.5  | -     |\n| ria_pgn_24kk              | 42.3  | 25.1  | 39.6  | -     |\n| ria_mbart                 | 42.8  | 25.5  | 39.9  | -     |\n| First Sentence            | 24.1  | 10.6  | 16.7  | -     |\n\n##### Train dataset: RIA, eval dataset: Lenta\n\n| Model                     | R-1-f | R-2-f | R-L-f | BLEU  |\n|:--------------------------|:------|:------|:------|:------|\n| ria_copynet_10kk          | 25.6  | 12.3  | 23.0  | -     |\n| ria_pgn_24kk              | 26.4  | 12.3  | 24.0  | -     |\n| ria_mbart                 | 30.3  | 14.5  | 27.1  | -     |\n| First Sentence            | 25.5  | 11.2  | 19.2  | -     |\n\n## Summarization - CNN/DailyMail\n\n#### Dataset splits:\n* CNN/DailyMail jsonl dataset: https://www.dropbox.com/s/35ezpg78rtukkgh/cnn_dm_jsonl.tar.gz\n\n#### Models:\n* [cnndm_pgn_25kk](https://www.dropbox.com/s/kctjduh84gam2pl/cnndm_pgn_25kk.tar.gz)\n\nPrediction script:\n```\n./predict.sh -t \u003cpath_to_test_dataset\u003e -m cnndm_pgn_25kk.tar.gz -p words_summary -L en -R\n```\n\n#### Results:\n\n| Model                     | R-1-f | R-2-f | R-L-f | METEOR | BLEU |\n|:--------------------------|:------|:------|:------|:-------|:-----|\n| cnndm_pgn_25kk            | 38.5  | 16.5  | 33.4  | 17.6   | -    |\n\n\n## Summarization - Gazeta, Russian news dataset\n* Paper: [Dataset for Automatic Summarization of Russian News](https://arxiv.org/abs/2006.11063)\n* Gazeta dataset: https://github.com/IlyaGusev/gazeta\n* Usage examples: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1B26oDFEKSNCcI0BPkGXgxi13pbadriyN)\n\n#### Models:\n* [gazeta_pgn_7kk](https://www.dropbox.com/s/aold2691f5amad8/gazeta_pgn_7kk.tar.gz)\n* [gazeta_pgn_7kk_cov.tar.gz](https://www.dropbox.com/s/2yk25xaizevtqw3/gazeta_pgn_7kk_cov.tar.gz)\n* [gazeta_pgn_25kk](https://www.dropbox.com/s/jmg7vk4ed9ph2ov/gazeta_pgn_25kk.tar.gz)\n* 
[gazeta_pgn_words_13kk.tar.gz](https://www.dropbox.com/s/acexr5xecc8xizx/gazeta_pgn_words_13kk.tar.gz)\n* [gazeta_summarunner_3kk](https://www.dropbox.com/s/mlo7ioxodqib1xl/gazeta_summarunner_3kk.tar.gz)\n\nPrediction scripts:\n```\n./predict.sh -t \u003cpath_to_test_dataset\u003e -m gazeta_pgn_7kk.tar.gz -p subwords_summary -L ru -T\n./predict.sh -t \u003cpath_to_test_dataset\u003e -m gazeta_summarunner_3kk.tar.gz -p subwords_summary_sentences -L ru -T\n```\n\n#### External models:\n* [gazeta_mbart (fairseq)](https://www.dropbox.com/s/b2auu9dhrm2wj0p/gazeta_mbart_checkpoint_600_160.tar.gz)\n* [gazeta_mbart (transformers)](https://huggingface.co/IlyaGusev/mbart_ru_sum_gazeta)\n* [gazeta_mbart_lowercase (fairseq)](https://www.dropbox.com/s/k3gsgokq69468jw/gazeta_mbart_lower.tar.gz)\n\n#### Results:\n\n| Model                     | R-1-f | R-2-f | R-L-f | METEOR | BLEU |\n|:--------------------------|:------|:------|:------|:-------|:-----|\n| gazeta_pgn_7kk            | 29.4  | 12.7  | 24.6  | 21.2   | 9.0  |\n| gazeta_pgn_7kk_cov        | 29.8  | 12.8  | 25.4  | 22.1   | 10.1 |\n| gazeta_pgn_25kk           | 29.6  | 12.8  | 24.6  | 21.5   | 9.3  |\n| gazeta_pgn_words_13kk     | 29.4  | 12.6  | 24.4  | 20.9   | 8.9  |\n| gazeta_summarunner_3kk    | 31.6  | 13.7  | 27.1  | 26.0   | 11.5 |\n| gazeta_mbart              | 32.6  | 14.6  | 28.2  | 25.7   | 12.4 |\n| gazeta_mbart_lower        | 32.7  | 14.7  | 28.3  | 25.8   | 12.5 |\n\n\n## Demo\n```\npython demo/server.py --include-package summarus --model-dir \u003cmodel_dir\u003e --host \u003chost\u003e --port \u003cport\u003e\n```\n\n## Citations\nHeadline generation (PGN):\n```bibtex\n@article{Gusev2019headlines,\n    author={Gusev, I.O.},\n    title={Importance of copying mechanism for news headline generation},\n    journal={Komp'juternaja Lingvistika i Intellektual'nye Tehnologii},\n    year={2019},\n    volume={2019-May},\n    number={18},\n    pages={229--236}\n}\n```\n\nHeadline generation 
(transformers):\n```bibtex\n@InProceedings{Bukhtiyarov2020headlines,\n    author={Bukhtiyarov, Alexey and Gusev, Ilya},\n    title=\"Advances of Transformer-Based Models for News Headline Generation\",\n    booktitle=\"Artificial Intelligence and Natural Language\",\n    year=\"2020\",\n    publisher=\"Springer International Publishing\",\n    address=\"Cham\",\n    pages={54--61},\n    isbn=\"978-3-030-59082-6\",\n    doi={10.1007/978-3-030-59082-6_4}\n}\n```\n\nSummarization:\n```bibtex\n@InProceedings{Gusev2020gazeta,\n    author=\"Gusev, Ilya\",\n    title=\"Dataset for Automatic Summarization of Russian News\",\n    booktitle=\"Artificial Intelligence and Natural Language\",\n    year=\"2020\",\n    publisher=\"Springer International Publishing\",\n    address=\"Cham\",\n    pages=\"{122--134}\",\n    isbn=\"978-3-030-59082-6\",\n    doi={10.1007/978-3-030-59082-6_9}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FIlyaGusev%2Fsummarus","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FIlyaGusev%2Fsummarus","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FIlyaGusev%2Fsummarus/lists"}