{"id":13535220,"url":"https://github.com/nlpyang/BertSum","last_synced_at":"2025-04-02T00:33:02.194Z","repository":{"id":43774540,"uuid":"177497186","full_name":"nlpyang/BertSum","owner":"nlpyang","description":"Code for paper Fine-tune BERT for Extractive Summarization","archived":false,"fork":false,"pushed_at":"2022-01-11T07:58:23.000Z","size":15397,"stargazers_count":1484,"open_issues_count":49,"forks_count":423,"subscribers_count":31,"default_branch":"master","last_synced_at":"2025-04-01T14:14:26.159Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nlpyang.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-03-25T02:05:03.000Z","updated_at":"2025-03-30T13:19:58.000Z","dependencies_parsed_at":"2022-08-12T10:42:19.910Z","dependency_job_id":null,"html_url":"https://github.com/nlpyang/BertSum","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nlpyang%2FBertSum","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nlpyang%2FBertSum/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nlpyang%2FBertSum/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nlpyang%2FBertSum/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nlpyang","download_url":"https://codeload.github.com/nlpyang/BertSum/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":2467353
58,"owners_count":20825222,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T08:00:51.460Z","updated_at":"2025-04-02T00:33:02.152Z","avatar_url":"https://github.com/nlpyang.png","language":"Python",
"readme":"# BertSum\n\n**This code is for the paper [Fine-tune BERT for Extractive Summarization](https://arxiv.org/pdf/1903.10318.pdf)**\n\n**New: please see our [full paper](https://arxiv.org/abs/1908.08345) with trained models**\n\nResults on CNN/DailyMail (25/3/2019):\n\n| Models | ROUGE-1 | ROUGE-2 | ROUGE-L |\n| :--- | :--- | :--- | :--- |\n| Transformer Baseline | 40.9 | 18.02 | 37.17 |\n| BERTSUM+Classifier | 43.23 | 20.22 | 39.60 |\n| BERTSUM+Transformer | 43.25 | 20.24 | 39.63 |\n| BERTSUM+LSTM | 43.22 | 20.17 | 39.59 |\n\n**Python version**: This code is written for Python 3.6.\n\n**Package requirements**: `pytorch`, `pytorch_pretrained_bert`, `tensorboardX`, `multiprocess`, `pyrouge`\n\nSome code is borrowed from [ONMT](https://github.com/OpenNMT/OpenNMT-py).\n\n## Data Preparation for CNN/DailyMail\n### Option 1: download the processed data\n\nDownload https://drive.google.com/open?id=1x0d61LP9UAN389YN00z0Pv-7jQgirVg6\n\nUnzip the file and put all `.pt` files into `bert_data`.\n\n### Option 2: process the data yourself\n\n#### Step 1. Download Stories\nDownload and unzip the `stories` directories from [here](http://cs.nyu.edu/~kcho/DMQA/) for both CNN and Daily Mail. Put all `.story` files in one directory (e.g. `../raw_stories`).\n\n#### Step 2. Download Stanford CoreNLP\nWe will need Stanford CoreNLP to tokenize the data. Download it [here](https://stanfordnlp.github.io/CoreNLP/) and unzip it. Then add the following line to your `~/.bash_profile`:\n```\nexport CLASSPATH=/path/to/stanford-corenlp-full-2017-06-09/stanford-corenlp-3.8.0.jar\n```\nreplacing `/path/to/` with the path to where you saved the `stanford-corenlp-full-2017-06-09` directory.\n\n#### Step 3. Sentence Splitting and Tokenization\n\n```\npython preprocess.py -mode tokenize -raw_path RAW_PATH -save_path TOKENIZED_PATH\n```\n\n* `RAW_PATH` is the directory containing story files (`../raw_stories`); `TOKENIZED_PATH` is the target directory for the tokenized files (`../merged_stories_tokenized`)\n\n#### Step 4. Format to Simpler Json Files\n\n```\npython preprocess.py -mode format_to_lines -raw_path RAW_PATH -save_path JSON_PATH -map_path MAP_PATH -lower\n```\n\n* `RAW_PATH` is the directory containing tokenized files (`../merged_stories_tokenized`); `JSON_PATH` is the target directory for the generated json files (`../json_data/cnndm`); `MAP_PATH` is the directory containing the urls files (`../urls`)\n\n#### Step 5. Format to PyTorch Files\n```\npython preprocess.py -mode format_to_bert -raw_path JSON_PATH -save_path BERT_DATA_PATH -oracle_mode greedy -n_cpus 4 -log_file ../logs/preprocess.log\n```\n\n* `JSON_PATH` is the directory containing json files (`../json_data`); `BERT_DATA_PATH` is the target directory for the generated binary files (`../bert_data`)\n\n* `-oracle_mode` can be `greedy` or `combination`; `combination` is more accurate but takes much longer to process\n\n## Model Training\n\n**First run**: The first time you run the code, use a single GPU so that it can download the BERT model: change ``-visible_gpus 0,1,2 -gpu_ranks 0,1,2 -world_size 3`` to ``-visible_gpus 0 -gpu_ranks 0 -world_size 1``. Once the download has finished, you can kill the process and rerun the code with multiple GPUs.\n\nTo train the BERT+Classifier model, run:\n```\npython train.py -mode train -encoder classifier -dropout 0.1 -bert_data_path ../bert_data/cnndm -model_path ../models/bert_classifier -lr 2e-3 -visible_gpus 0,1,2 -gpu_ranks 0,1,2 -world_size 3 -report_every 50 -save_checkpoint_steps 1000 -batch_size 3000 -decay_method noam -train_steps 50000 -accum_count 2 -log_file ../logs/bert_classifier -use_interval true -warmup_steps 10000\n```\n\nTo train the BERT+Transformer model, run:\n```\npython train.py -mode train -encoder transformer -dropout 0.1 -bert_data_path ../bert_data/cnndm -model_path ../models/bert_transformer -lr 2e-3 -visible_gpus 0,1,2 -gpu_ranks 0,1,2 -world_size 3 -report_every 50 -save_checkpoint_steps 1000 -batch_size 3000 -decay_method noam -train_steps 50000 -accum_count 2 -log_file ../logs/bert_transformer -use_interval true -warmup_steps 10000 -ff_size 2048 -inter_layers 2 -heads 8\n```\n\nTo train the BERT+RNN model, run:\n```\npython train.py -mode train -encoder rnn -dropout 0.1 -bert_data_path ../bert_data/cnndm -model_path ../models/bert_rnn -lr 2e-3 -visible_gpus 0,1,2 -gpu_ranks 0,1,2 -world_size 3 -report_every 50 -save_checkpoint_steps 1000 -batch_size 3000 -decay_method noam -train_steps 50000 -accum_count 2 -log_file ../logs/bert_rnn -use_interval true -warmup_steps 10000 -rnn_size 768\n```\n\n* `-mode` can be one of {`train`, `validate`, `test`}. `validate` inspects the model directory and evaluates the model for each newly saved checkpoint; `test` must be used with `-test_from`, which indicates the checkpoint you want to use.\n\n## Model Evaluation\nAfter training has finished, run:\n```\npython train.py -mode validate -bert_data_path ../bert_data/cnndm -model_path MODEL_PATH -visible_gpus 0 -gpu_ranks 0 -batch_size 30000 -log_file LOG_FILE -result_path RESULT_PATH -test_all -block_trigram true\n```\n* `MODEL_PATH` is the directory of saved checkpoints\n* `RESULT_PATH` is where the decoded summaries will be written (default `../results/cnndm`)\n",
"funding_links":[],"categories":["BERT Text Summarization Task:","Papers","文本摘要"],"sub_categories":["Single-Document-Summarization (as references)"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnlpyang%2FBertSum","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnlpyang%2FBertSum","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnlpyang%2FBertSum/lists"}