Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/nlpyang/hiersumm
Code for paper Hierarchical Transformers for Multi-Document Summarization in ACL2019
- Host: GitHub
- URL: https://github.com/nlpyang/hiersumm
- Owner: nlpyang
- License: apache-2.0
- Created: 2019-05-30T16:38:49.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2019-08-20T14:17:33.000Z (over 5 years ago)
- Last Synced: 2024-08-02T10:27:18.033Z (4 months ago)
- Language: Python
- Size: 74.2 KB
- Stars: 228
- Watchers: 6
- Forks: 42
- Open Issues: 17
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-Multi-Document-Summarization - nlpyang/hiersumm
README
**This code is for the paper `Hierarchical Transformers for Multi-Document Summarization` (ACL 2019)**
**Python version**: This code is written for Python 3.6
**Package Requirements**: `pytorch`, `tensorboardX`, `pyrouge`
Some of the code is borrowed from OpenNMT-py (https://github.com/OpenNMT/OpenNMT-py).
**Trained Model**: A trained model is available at https://drive.google.com/open?id=1suMaf7yBZPSBtaJKLnPyQ4uZNcm0VgDz
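Once downloaded, it can help to peek inside the checkpoint before wiring it into training or evaluation. This is only a sketch: the filename and the key names are assumptions (OpenNMT-style checkpoints typically store the model weights alongside the saved training options), so print what is actually there:

```
# Inspect the downloaded checkpoint (sketch; "model.pt" and the key
# names are assumptions, not guaranteed by this repo).
import torch

ckpt = torch.load('model.pt', map_location='cpu')
print(list(ckpt.keys()))  # e.g. something like ['model', 'opt', 'optim']
```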
## Data Preparation:
The raw version of the WikiSum dataset can be built by following the steps in https://github.com/tensorflow/tensor2tensor/tree/5acf4a44cc2cbe91cd788734075376af0f8dd3f4/tensor2tensor/data_generators/wikisum
Since the raw version of the WikiSum dataset is huge (~300GB), we only provide the **ranked version** of the dataset:
https://drive.google.com/open?id=1iiPkkZHE9mLhbF5b39y3ua5MJhgGM-5u

* The dataset is ranked by the ranker as described in the paper.
* The top-40 paragraphs for each instance are extracted.
* Extract the `.pt` files into one directory.

The **sentencepiece vocab file** is at:
https://drive.google.com/open?id=1zbUILFWrgIeLEzSP93Mz5JkdVwVPY9Bf
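As a quick sanity check on the downloaded files, you can load one shard and the vocab. This is a sketch, not part of the repo: both filenames below are placeholders for whatever you actually extracted, and the structure of the examples inside a shard is an assumption.

```
# Sanity-check the ranked data and the sentencepiece vocab (sketch;
# filenames are placeholders, and the per-example structure inside a
# shard is an assumption -- print it and see).
import torch
import sentencepiece as spm

shard = torch.load('DATA_PATH/WIKI.train.0.pt')   # placeholder shard name
print(type(shard), len(shard))
print(shard[0])                                   # one training example

sp = spm.SentencePieceProcessor()
sp.Load('VOCAB_PATH/spm.model')                   # placeholder model name
print(sp.GetPieceSize())                          # vocabulary size
print(sp.EncodeAsPieces('Hierarchical Transformers for summarization'))
```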
## Model Training:

To train with the default setting as in the paper:
```
python train_abstractive.py -data_path DATA_PATH/WIKI -mode train -batch_size 10500 -seed 666 -train_steps 1000000 -save_checkpoint_steps 5000 -report_every 100 -trunc_tgt_ntoken 400 -trunc_src_nblock 24 -visible_gpus 0,1,2,3 -gpu_ranks 0,1,2,3 -world_size 4 -accum_count 4 -dec_dropout 0.1 -enc_dropout 0.1 -label_smoothing 0.1 -vocab_path VOCAB_PATH -model_path MODEL_PATH -log_file LOG_PATH -inter_layers 6,7 -inter_heads 8 -hier
```
* DATA_PATH is where you put the `.pt` files
* VOCAB_PATH is where you put the sentencepiece model file
* MODEL_PATH is the directory you want to store the checkpoints
* LOG_PATH is the file you want to write training logs in
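The `-hier` flag enables the hierarchical encoder, and `-inter_layers 6,7` names the encoder layers at which paragraphs exchange information. As a rough illustration of that idea (a sketch, not the authors' implementation; every name and shape here is made up), the snippet below runs ordinary transformer layers within each paragraph and adds an inter-paragraph attention step at the chosen layers:

```
# Sketch of the local/global alternation behind -hier (illustrative
# only; not the model defined in this repo).
import torch
import torch.nn as nn

class HierEncoderSketch(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_layers=8, inter_layers=(6, 7)):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        self.inter_layers = set(inter_layers)
        self.inter_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        # x: (batch, n_paragraphs, n_tokens, d_model)
        b, p, t, d = x.shape
        for i, layer in enumerate(self.layers):
            # local step: a transformer layer over tokens within each paragraph
            x = layer(x.reshape(b * p, t, d)).reshape(b, p, t, d)
            if i in self.inter_layers:
                # global step: mean-pooled paragraph vectors attend to each
                # other, and the update is broadcast back to every token
                para = x.mean(dim=2)                        # (b, p, d)
                mixed, _ = self.inter_attn(para, para, para)
                x = x + mixed.unsqueeze(2)
        return x

enc = HierEncoderSketch()
out = enc(torch.randn(2, 4, 50, 256))   # 2 docs, 4 paragraphs, 50 tokens each
print(out.shape)                        # torch.Size([2, 4, 50, 256])
```

The real encoder is more involved (masking, positional information, pooling), but the alternation between within-paragraph and across-paragraph attention is the idea the flag refers to.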
## Evaluation

To evaluate the trained model:
```
python train_abstractive.py -data_path DATA_PATH/WIKI -mode validate -batch_size 30000 -valid_batch_size 7500 -seed 666 -trunc_tgt_ntoken 400 -trunc_src_nblock 40 -visible_gpus 0 -gpu_ranks 0 -vocab_path VOCAB_PATH -model_path MODEL_PATH -log_file LOG_PATH -inter_layers 6,7 -inter_heads 8 -hier -report_rouge -max_wiki 100000 -dataset test -alpha 0.4 -max_length 400
```
This will load all saved checkpoints in MODEL_PATH and calculate validation losses (this can take a long time). It will then select the top checkpoints, generate summaries with them, and compute ROUGE scores.
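The `-report_rouge` flag makes the script compute ROUGE itself. For reference, standalone scoring with the `pyrouge` dependency follows the usual `Rouge155` pattern; the directories and filename patterns below are illustrative assumptions, not paths the repo creates.

```
# Standalone ROUGE scoring with pyrouge (sketch; directory layout and
# filename patterns are assumptions for illustration).
from pyrouge import Rouge155

r = Rouge155()
r.system_dir = 'results/candidate'            # generated summaries
r.model_dir = 'results/gold'                  # reference summaries
r.system_filename_pattern = r'cand.(\d+).txt'
r.model_filename_pattern = 'gold.#ID#.txt'

output = r.convert_and_evaluate()             # runs the ROUGE perl script
print(output)
print(r.output_to_dict(output)['rouge_1_f_score'])
```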