{"id":13753340,"url":"https://github.com/nlpyang/PreSumm","last_synced_at":"2025-05-09T20:35:41.986Z","repository":{"id":37444243,"uuid":"202547555","full_name":"nlpyang/PreSumm","owner":"nlpyang","description":"code for EMNLP 2019 paper Text Summarization with Pretrained Encoders ","archived":false,"fork":false,"pushed_at":"2024-07-25T10:16:21.000Z","size":13312,"stargazers_count":1292,"open_issues_count":149,"forks_count":463,"subscribers_count":18,"default_branch":"master","last_synced_at":"2025-04-08T01:38:36.667Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nlpyang.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-08-15T13:37:21.000Z","updated_at":"2025-04-01T10:55:33.000Z","dependencies_parsed_at":"2022-07-12T13:34:04.365Z","dependency_job_id":"d35cba34-90d5-480a-a027-35ace4fe5f0b","html_url":"https://github.com/nlpyang/PreSumm","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nlpyang%2FPreSumm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nlpyang%2FPreSumm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nlpyang%2FPreSumm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nlpyang%2FPreSumm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nlpyang","download_url":"https://codeload.github.com/nl
pyang/PreSumm/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253321889,"owners_count":21890488,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T09:01:20.533Z","updated_at":"2025-05-09T20:35:36.940Z","avatar_url":"https://github.com/nlpyang.png","language":"Python","funding_links":[],"categories":["文本摘要"],"sub_categories":[],"readme":"# PreSumm\n\n**This code is for the EMNLP 2019 paper [Text Summarization with Pretrained Encoders](https://arxiv.org/abs/1908.08345)**\n\n**Updates Jan 22 2020**: Now you can **Summarize Raw Text Input!** Switch to the dev branch, and use `-mode test_text` together with `-text_src $RAW_SRC.TXT` to input your text file. 
Please still use master branch for normal training and evaluation, dev branch should be only used for test_text mode.\n* abstractive use -task abs, extractive use -task ext\n* use `-test_from $PT_FILE$` to use your model checkpoint file.\n* Format of the source text file:\n  * For **abstractive summarization**, each line is a document.\n  * If you want to do **extractive summarization**, please insert ` [CLS] [SEP] ` as your sentence boundaries.\n* There are example input files in the [raw_data directory](https://github.com/nlpyang/PreSumm/tree/dev/raw_data)\n* If you also have reference summaries aligned with your source input, please use `-text_tgt $RAW_TGT.TXT` to keep the order for evaluation.\n\n\nResults on CNN/DailyMail (20/8/2019):\n\n\n\u003ctable class=\"tg\"\u003e\n  \u003ctr\u003e\n    \u003cth class=\"tg-0pky\"\u003eModels\u003c/th\u003e\n    \u003cth class=\"tg-0pky\"\u003eROUGE-1\u003c/th\u003e\n    \u003cth class=\"tg-0pky\"\u003eROUGE-2\u003c/th\u003e\n    \u003cth class=\"tg-0pky\"\u003eROUGE-L\u003c/th\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd class=\"tg-c3ow\" colspan=\"4\"\u003eExtractive\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd class=\"tg-0pky\"\u003eTransformerExt\u003c/td\u003e\n    \u003ctd class=\"tg-0pky\"\u003e40.90\u003c/td\u003e\n    \u003ctd class=\"tg-0pky\"\u003e18.02\u003c/td\u003e\n    \u003ctd class=\"tg-0pky\"\u003e37.17\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd class=\"tg-0pky\"\u003eBertSumExt\u003c/td\u003e\n    \u003ctd class=\"tg-0pky\"\u003e43.23\u003c/td\u003e\n    \u003ctd class=\"tg-0pky\"\u003e20.24\u003c/td\u003e\n    \u003ctd class=\"tg-0pky\"\u003e39.63\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd class=\"tg-0pky\"\u003eBertSumExt (large)\u003c/td\u003e\n    \u003ctd class=\"tg-0pky\"\u003e43.85\u003c/td\u003e\n    \u003ctd class=\"tg-0pky\"\u003e20.34\u003c/td\u003e\n    \u003ctd class=\"tg-0pky\"\u003e39.90\u003c/td\u003e\n  
\u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd class=\"tg-baqh\" colspan=\"4\"\u003eAbstractive\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd class=\"tg-0lax\"\u003eTransformerAbs\u003c/td\u003e\n    \u003ctd class=\"tg-0lax\"\u003e40.21\u003c/td\u003e\n    \u003ctd class=\"tg-0lax\"\u003e17.76\u003c/td\u003e\n    \u003ctd class=\"tg-0lax\"\u003e37.09\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd class=\"tg-0lax\"\u003eBertSumAbs\u003c/td\u003e\n    \u003ctd class=\"tg-0lax\"\u003e41.72\u003c/td\u003e\n    \u003ctd class=\"tg-0lax\"\u003e19.39\u003c/td\u003e\n    \u003ctd class=\"tg-0lax\"\u003e38.76\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd class=\"tg-0lax\"\u003eBertSumExtAbs\u003c/td\u003e\n    \u003ctd class=\"tg-0lax\"\u003e42.13\u003c/td\u003e\n    \u003ctd class=\"tg-0lax\"\u003e19.60\u003c/td\u003e\n    \u003ctd class=\"tg-0lax\"\u003e39.18\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\n**Python version**: This code is in Python3.6\n\n**Package Requirements**: torch==1.1.0 pytorch_transformers tensorboardX multiprocess pyrouge\n\n\n\n**Updates**: For encoding a text longer than 512 tokens, for example 800. 
Set `max_pos` to 800 during both preprocessing and training.\n\n\nSome code is borrowed from ONMT (https://github.com/OpenNMT/OpenNMT-py)\n\n## Trained Models\n[CNN/DM BertExt](https://drive.google.com/open?id=1kKWoV0QCbeIuFt85beQgJ4v0lujaXobJ)\n\n[CNN/DM BertExtAbs](https://drive.google.com/open?id=1-IKVCtc4Q-BdZpjXc4s70_fRsWnjtYLr)\n\n[CNN/DM TransformerAbs](https://drive.google.com/open?id=1yLCqT__ilQ3mf5YUUCw9-UToesX5Roxy)\n\n[XSum BertExtAbs](https://drive.google.com/open?id=1H50fClyTkNprWJNh10HWdGEdDdQIkzsI)\n\n## System Outputs\n\n[CNN/DM and XSum](https://drive.google.com/file/d/1kYA384UEAQkvmZ-yWZAfxw7htCbCwFzC) \n\n## Data Preparation For XSum\n[Pre-processed data](https://drive.google.com/open?id=1BWBN1coTWGBqrWoOfRc5dhojPHhatbYs)\n\n\n## Data Preparation For CNN/Dailymail\n### Option 1: download the processed data\n\n[Pre-processed data](https://drive.google.com/open?id=1DN7ClZCCXsk2KegmC6t4ClBwtAf5galI)\n\nUnzip the zipfile and put all `.pt` files into `bert_data`\n\n### Option 2: process the data yourself\n\n#### Step 1. Download Stories\nDownload and unzip the `stories` directories from [here](http://cs.nyu.edu/~kcho/DMQA/) for both CNN and Daily Mail. Put all  `.story` files in one directory (e.g. `../raw_stories`)\n\n####  Step 2. Download Stanford CoreNLP\nWe will need Stanford CoreNLP to tokenize the data. Download it [here](https://stanfordnlp.github.io/CoreNLP/) and unzip it. Then add the following command to your bash_profile:\n```\nexport CLASSPATH=/path/to/stanford-corenlp-full-2017-06-09/stanford-corenlp-3.8.0.jar\n```\nreplacing `/path/to/` with the path to where you saved the `stanford-corenlp-full-2017-06-09` directory. \n\n####  Step 3. 
Sentence Splitting and Tokenization\n\n```\npython preprocess.py -mode tokenize -raw_path RAW_PATH -save_path TOKENIZED_PATH\n```\n\n* `RAW_PATH` is the directory containing story files (`../raw_stories`), `TOKENIZED_PATH` is the target directory to save the tokenized files (`../merged_stories_tokenized`)\n\n\n####  Step 4. Format to Simpler Json Files\n \n```\npython preprocess.py -mode format_to_lines -raw_path RAW_PATH -save_path JSON_PATH -n_cpus 1 -use_bert_basic_tokenizer false -map_path MAP_PATH\n```\n\n* `RAW_PATH` is the directory containing tokenized files (`../merged_stories_tokenized`), `JSON_PATH` is the target directory to save the generated json files (`../json_data/cnndm`), `MAP_PATH` is the  directory containing the urls files (`../urls`)\n\n####  Step 5. Format to PyTorch Files\n```\npython preprocess.py -mode format_to_bert -raw_path JSON_PATH -save_path BERT_DATA_PATH  -lower -n_cpus 1 -log_file ../logs/preprocess.log\n```\n\n* `JSON_PATH` is the directory containing json files (`../json_data`), `BERT_DATA_PATH` is the target directory to save the generated binary files (`../bert_data`)\n\n## Model Training\n\n**First run: For the first time, you should use single-GPU, so the code can download the BERT model. 
Use ``-visible_gpus -1``, after downloading, you could kill the process and rerun the code with multi-GPUs.**\n\n### Extractive Setting\n\n```\npython train.py -task ext -mode train -bert_data_path BERT_DATA_PATH -ext_dropout 0.1 -model_path MODEL_PATH -lr 2e-3 -visible_gpus 0,1,2 -report_every 50 -save_checkpoint_steps 1000 -batch_size 3000 -train_steps 50000 -accum_count 2 -log_file ../logs/ext_bert_cnndm -use_interval true -warmup_steps 10000 -max_pos 512\n```\n\n### Abstractive Setting\n\n#### TransformerAbs (baseline)\n```\npython train.py -mode train -accum_count 5 -batch_size 300 -bert_data_path BERT_DATA_PATH -dec_dropout 0.1 -log_file ../../logs/cnndm_baseline -lr 0.05 -model_path MODEL_PATH -save_checkpoint_steps 2000 -seed 777 -sep_optim false -train_steps 200000 -use_bert_emb true -use_interval true -warmup_steps 8000  -visible_gpus 0,1,2,3 -max_pos 512 -report_every 50 -enc_hidden_size 512  -enc_layers 6 -enc_ff_size 2048 -enc_dropout 0.1 -dec_layers 6 -dec_hidden_size 512 -dec_ff_size 2048 -encoder baseline -task abs\n```\n#### BertAbs\n```\npython train.py  -task abs -mode train -bert_data_path BERT_DATA_PATH -dec_dropout 0.2  -model_path MODEL_PATH -sep_optim true -lr_bert 0.002 -lr_dec 0.2 -save_checkpoint_steps 2000 -batch_size 140 -train_steps 200000 -report_every 50 -accum_count 5 -use_bert_emb true -use_interval true -warmup_steps_bert 20000 -warmup_steps_dec 10000 -max_pos 512 -visible_gpus 0,1,2,3  -log_file ../logs/abs_bert_cnndm\n```\n#### BertExtAbs\n```\npython train.py  -task abs -mode train -bert_data_path BERT_DATA_PATH -dec_dropout 0.2  -model_path MODEL_PATH -sep_optim true -lr_bert 0.002 -lr_dec 0.2 -save_checkpoint_steps 2000 -batch_size 140 -train_steps 200000 -report_every 50 -accum_count 5 -use_bert_emb true -use_interval true -warmup_steps_bert 20000 -warmup_steps_dec 10000 -max_pos 512 -visible_gpus 0,1,2,3 -log_file ../logs/abs_bert_cnndm  -load_from_extractive EXT_CKPT   \n```\n* `EXT_CKPT` is the saved `.pt` checkpoint of 
the extractive model.\n\n\n\n\n## Model Evaluation\n### CNN/DM\n```\n python train.py -task abs -mode validate -batch_size 3000 -test_batch_size 500 -bert_data_path BERT_DATA_PATH -log_file ../logs/val_abs_bert_cnndm -model_path MODEL_PATH -sep_optim true -use_interval true -visible_gpus 1 -max_pos 512 -max_length 200 -alpha 0.95 -min_length 50 -result_path ../logs/abs_bert_cnndm \n```\n### XSum\n```\n python train.py -task abs -mode validate -batch_size 3000 -test_batch_size 500 -bert_data_path BERT_DATA_PATH -log_file ../logs/val_abs_bert_cnndm -model_path MODEL_PATH -sep_optim true -use_interval true -visible_gpus 1 -max_pos 512 -min_length 20 -max_length 100 -alpha 0.9 -result_path ../logs/abs_bert_cnndm \n```\n* `-mode` can be {`validate, test`}, where `validate` will inspect the model directory and evaluate the model for each newly saved checkpoint; `test` needs to be used with `-test_from`, indicating the checkpoint you want to use\n* `MODEL_PATH` is the directory of saved checkpoints\n* use `-mode validate` with `-test_all`, and the system will load all saved checkpoints and select the top ones to generate summaries (this will take a while)\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnlpyang%2FPreSumm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnlpyang%2FPreSumm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnlpyang%2FPreSumm/lists"}