https://github.com/HHousen/TransformerSum
Models to perform neural summarization (extractive and abstractive) using machine learning transformers and a tool to convert abstractive summarization datasets to the extractive task.
- Host: GitHub
- URL: https://github.com/HHousen/TransformerSum
- Owner: HHousen
- License: gpl-3.0
- Created: 2020-04-07T04:41:40.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2023-05-03T01:34:49.000Z (almost 2 years ago)
- Last Synced: 2024-06-17T22:49:33.370Z (8 months ago)
- Topics: albert, automatic-summarization, bert, distilbert, extractive-summarization, machine-learning, pytorch-lightning, roberta, summarization, summarization-dataset, text-summarization, transformer-models
- Language: Python
- Homepage: https://transformersum.rtfd.io
- Size: 11.7 MB
- Stars: 424
- Watchers: 6
- Forks: 59
- Open Issues: 8
- Metadata Files:
  - Readme: README.md
  - License: LICENSE
README

# TransformerSum
> Models to perform neural summarization (extractive and abstractive) using machine learning transformers and a tool to convert abstractive summarization datasets to the extractive task.
`TransformerSum` is a library that aims to make it easy to *train*, *evaluate*, and *use* machine learning **transformer models** that perform **automatic summarization**. It features tight integration with [huggingface/transformers](https://github.com/huggingface/transformers), which enables the easy usage of a **wide variety of architectures** and **pre-trained models**. There is a heavy emphasis on code **readability** and **interpretability** so that both beginners and experts can build new components.

Both the extractive and abstractive model classes are written using [pytorch_lightning](https://github.com/PyTorchLightning/pytorch-lightning), which handles the PyTorch training loop logic, enabling **easy usage of advanced features** such as 16-bit precision, multi-GPU training, and [much more](https://pytorch-lightning.readthedocs.io/).

`TransformerSum` supports both the extractive and abstractive summarization of **long sequences** (4,096 to 16,384 tokens) using the [longformer](https://huggingface.co/transformers/model_doc/longformer.html) (extractive) and [LongformerEncoderDecoder](https://github.com/allenai/longformer/tree/encoderdecoder) (abstractive), which is a combination of [BART](https://huggingface.co/transformers/model_doc/bart.html) ([paper](https://arxiv.org/abs/1910.13461)) and the longformer. `TransformerSum` also contains models that can run on resource-limited devices while still maintaining high levels of accuracy. Models are automatically evaluated with the **ROUGE metric**, but human tests can be conducted by the user.
**Check out [the documentation](https://transformersum.readthedocs.io/en/latest) for usage details.**
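As a rough illustration of the intended workflow (a minimal sketch: the `ExtractiveSummarizer` class, its `predict` helper, and the import path follow the pattern described in the documentation, and the checkpoint filename below is a placeholder for a downloaded model):

```python
# Minimal sketch: load a downloaded extractive checkpoint and summarize some text.
# The checkpoint filename is a placeholder; see the pre-trained model tables below
# and the documentation for the exact, up-to-date API.
from extractive import ExtractiveSummarizer

model = ExtractiveSummarizer.load_from_checkpoint("distilroberta-base-ext-sum.ckpt")

text = (
    "The city council met on Tuesday to discuss the new transit plan. "
    "Members debated funding sources for over three hours. "
    "A final vote is expected at next month's session."
)
print(model.predict(text))  # returns the sentences scored as most summary-worthy
```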
## Features
* For extractive summarization, compatible with every [huggingface/transformers](https://github.com/huggingface/transformers) transformer encoder model.
* For abstractive summarization, compatible with every [huggingface/transformers](https://github.com/huggingface/transformers) EncoderDecoder and Seq2Seq model.
* Currently, 10+ pre-trained extractive models are available to summarize text, trained on 3 datasets (CNN-DM, WikiHow, and arXiv-PubMed).
* Contains pre-trained models that excel at summarization on resource-limited devices: On CNN-DM, ``mobilebert-uncased-ext-sum`` achieves about 97% of the performance of [BertSum](https://arxiv.org/abs/1903.10318) while containing 4.45 times fewer parameters. It achieves about 94% of the performance of [MatchSum (Zhong et al., 2020)](https://arxiv.org/abs/2004.08795), the current extractive state-of-the-art.
* Contains code to train models that excel at summarizing long sequences: The [longformer](https://huggingface.co/transformers/model_doc/longformer.html) (extractive) and [LongformerEncoderDecoder](https://github.com/allenai/longformer/tree/encoderdecoder) (abstractive) can summarize sequences of lengths up to 4,096 tokens by default, but can be trained to summarize sequences of more than 16k tokens.
* Integration with [huggingface/nlp](https://github.com/huggingface/nlp) means any summarization dataset in the `nlp` library can be used for both abstractive and extractive training.
* "Smart batching" (extractive) and trimming (abstractive) support to not perform unnecessary calculations (speeds up training).
* Use of `pytorch_lightning` for code readability.
* Extensive documentation.
* Three pooling modes (to convert word vectors to sentence embeddings): mean or max of word embeddings, in addition to the CLS token. A generic sketch of these strategies is shown below the list.
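A generic sketch of what those three pooling strategies do, in plain PyTorch (not the library's exact implementation; shapes follow the usual `(batch, seq_len, hidden)` transformer output):

```python
import torch

def pool_sentence_embedding(token_embeddings, attention_mask, mode="mean"):
    """Collapse per-token embeddings (batch, seq_len, hidden) into one vector
    per sequence using one of three common pooling strategies."""
    if mode == "cls":
        # Use the hidden state of the first ([CLS]) token.
        return token_embeddings[:, 0, :]
    mask = attention_mask.unsqueeze(-1).float()  # (batch, seq_len, 1)
    if mode == "mean":
        # Average only over real (non-padding) tokens.
        summed = (token_embeddings * mask).sum(dim=1)
        counts = mask.sum(dim=1).clamp(min=1e-9)
        return summed / counts
    if mode == "max":
        # Push padding positions to a large negative value so they never win the max.
        masked = token_embeddings.masked_fill(mask == 0, -1e9)
        return masked.max(dim=1).values
    raise ValueError(f"unknown pooling mode: {mode}")
```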
## Pre-trained Models

All pre-trained models (including larger models and other architectures) are listed in [the documentation](https://transformersum.readthedocs.io/en/latest). The tables below show a fraction of the available models.
### Extractive
| Name | Dataset | Comments | R1/R2/RL/RL-Sum | Model Download | Data Download |
|-|-|-|-|-|-|
| mobilebert-uncased-ext-sum | CNN/DM | None | 42.01/19.31/26.89/38.53 | [Model](https://drive.google.com/uc?id=1R3tRH07z_9nYW8sC8eFceBmxC7u0kP_W) | [CNN/DM Bert Uncased](https://drive.google.com/uc?id=1PWvo8jkBcfJfo7iNifqw47NtfxJSG4Hj) |
| distilroberta-base-ext-sum | CNN/DM | None | 42.87/20.02/27.46/39.31 | [Model](https://drive.google.com/uc?id=1VNoFhqfwlvgwKuJwjlHnlGcGg38cGM--) | [CNN/DM Roberta](https://drive.google.com/uc?id=1bXw0sm5G5kjVbFGQ0jb7RPC8nebVdi_T) |
| roberta-base-ext-sum | CNN/DM | None | 43.24/20.36/27.64/39.65 | [Model](https://drive.google.com/uc?id=1xlBJTO1LF5gIfDNvG33q8wVmvUB4jXYx) | [CNN/DM Roberta](https://drive.google.com/uc?id=1bXw0sm5G5kjVbFGQ0jb7RPC8nebVdi_T) |
| mobilebert-uncased-ext-sum | WikiHow | None | 30.72/8.78/19.18/28.59 | [Model](https://drive.google.com/uc?id=1EtBNClC-HkeolJFn8JmCK5c3DDDkZO7O) | [WikiHow Bert Uncased](https://drive.google.com/uc?id=1uj9LcOrtWds8knfVNFXi7o6732he2Bjn) |
| distilroberta-base-ext-sum | WikiHow | None | 31.07/8.96/19.34/28.95 | [Model](https://drive.google.com/uc?id=1RdFcoeuHd_JCj5gBQRFXFpieb-3EXkiN) | [WikiHow Roberta](https://drive.google.com/uc?id=1dNCLAAuI0JrmWk2Dox-pdmE-mp2KqSff) |
| roberta-base-ext-sum | WikiHow | None | 31.26/9.09/19.47/29.14 | [Model](https://drive.google.com/uc?id=1aCtrwms5GzsF7nY-Y3k-_N1OmLivlDQQ) | [WikiHow Roberta](https://drive.google.com/uc?id=1dNCLAAuI0JrmWk2Dox-pdmE-mp2KqSff) |
| mobilebert-uncased-ext-sum | arXiv-PubMed | None | 33.97/11.74/19.63/30.19 | [Model](https://drive.google.com/uc?id=1K3GHtdQS52Dzg9ENy6AtA5jHqqVLj5Lh) | [arXiv-PubMed Bert Uncased](https://drive.google.com/uc?id=1zBVpoFkm29DWu3L9lAO6QJDvYl3gOFnx) |
| distilroberta-base-ext-sum | arXiv-PubMed | None | 34.70/12.16/19.52/30.82 | [Model](https://drive.google.com/uc?id=1zzazmT0hpfLoH8PqF94dMhY53nHes6kR) | [arXiv-PubMed Roberta](https://drive.google.com/uc?id=16GiKBOo5zmgTzEczPatem_6kEZudAiIE) |
| roberta-base-ext-sum | arXiv-PubMed | None | 34.81/12.26/19.65/30.91 | [Model](https://drive.google.com/uc?id=1mMUeyVVZDmZFE7l4GhUfm8z6CYO-xNZi) | [arXiv-PubMed Roberta](https://drive.google.com/uc?id=16GiKBOo5zmgTzEczPatem_6kEZudAiIE) |

### Abstractive
| Name | Dataset | Comments | Model Download |
|-|-|-|-|
| longformer-encdec-8192-bart-large-abs-sum | arXiv-PubMed | None | Not yet... |
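No abstractive checkpoint has been released yet, but the abstractive side of `TransformerSum` builds on huggingface/transformers encoder-decoder models. As a generic illustration of that kind of generation (a plain `transformers` sketch using `facebook/bart-large-cnn` as a stand-in, not the library's own abstractive class):

```python
# Generic huggingface/transformers seq2seq summarization sketch.
# This only illustrates the BART-style encoder-decoder generation that
# TransformerSum's abstractive models build on; it is not the library's API.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")

article = (
    "Researchers published a study on Monday describing a new recycling process "
    "for lithium-ion batteries that recovers over 90 percent of the raw materials."
)
inputs = tokenizer(article, return_tensors="pt", truncation=True, max_length=1024)

summary_ids = model.generate(
    inputs["input_ids"],
    num_beams=4,
    max_length=60,
    early_stopping=True,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```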
## Install

Installation is made easy due to conda environments. Simply run this command from the root project directory: `conda env create --file environment.yml`, and conda will create an environment called `transformersum` with all the required packages from [environment.yml](environment.yml). The spacy `en_core_web_sm` model is required for the [convert_to_extractive.py](convert_to_extractive.py) script to detect sentence boundaries.
### Step-by-Step Instructions
1. Clone this repository: `git clone https://github.com/HHousen/transformersum.git`.
2. Change to project directory: `cd transformersum`.
3. Run installation command: `conda env create --file environment.yml`.
4. **(Optional)** If using the [convert_to_extractive.py](convert_to_extractive.py) script, download the `en_core_web_sm` spacy model: `python -m spacy download en_core_web_sm`.
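For context on why the spacy model is needed: the conversion script splits each article into sentences and then labels the sentences that best match the reference summary. A toy sketch of that idea follows (not the script's actual code; the word-overlap scoring below is a simplified stand-in for the ROUGE-based selection the real script performs):

```python
# Toy sketch of abstractive-to-extractive conversion: spaCy detects sentence
# boundaries, then a greedy loop labels the sentences whose words overlap most
# with the reference summary. The real convert_to_extractive.py is more involved.
import spacy

nlp = spacy.load("en_core_web_sm")

def greedy_extractive_labels(article: str, reference_summary: str, max_sentences: int = 3):
    sentences = [sent.text.strip() for sent in nlp(article).sents]
    summary_words = set(reference_summary.lower().split())
    selected, covered = set(), set()
    for _ in range(min(max_sentences, len(sentences))):
        best_i, best_gain = None, 0
        for i, sent in enumerate(sentences):
            if i in selected:
                continue
            # How many not-yet-covered summary words does this sentence add?
            gain = len((set(sent.lower().split()) & summary_words) - covered)
            if gain > best_gain:
                best_i, best_gain = i, gain
        if best_i is None:
            break
        selected.add(best_i)
        covered |= set(sentences[best_i].lower().split()) & summary_words
    # Binary label per sentence: 1 if it belongs to the extractive "oracle" summary.
    return [1 if i in selected else 0 for i in range(len(sentences))]
```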
## Meta
Hayden Housen – [haydenhousen.com](https://haydenhousen.com)
Distributed under the GNU General Public License v3.0. See the [LICENSE](LICENSE) for more information.
### Attributions
* Code heavily inspired by the following projects:
* Adapting BERT for Extractive Summarization: [BertSum](https://github.com/nlpyang/BertSum)
* Text Summarization with Pretrained Encoders: [PreSumm](https://github.com/nlpyang/PreSumm)
* Word/Sentence Embeddings: [sentence-transformers](https://github.com/UKPLab/sentence-transformers)
* CNN/DM Dataset: [cnn-dailymail](https://github.com/artmatsak/cnn-dailymail)
* PyTorch Lightning Classifier: [lightning-text-classification](https://github.com/ricardorei/lightning-text-classification)
* Important projects utilized:
* PyTorch: [pytorch](https://github.com/pytorch/pytorch/)
* Training code: [pytorch_lightning](https://github.com/PyTorchLightning/pytorch-lightning/)
* Transformer Models: [huggingface/transformers](https://github.com/huggingface/transformers)

## Contributing
All Pull Requests are very welcome.

Questions? Comments? Issues? Don't hesitate to open an [issue](https://github.com/HHousen/TransformerSum/issues/new) and briefly describe what you are experiencing (with any error logs if necessary). Thanks.
1. Fork it
2. Create your feature branch (`git checkout -b feature/fooBar`)
3. Commit your changes (`git commit -am 'Add some fooBar'`)
4. Push to the branch (`git push origin feature/fooBar`)
5. Create a new Pull Request