Open Source / ENTSUM: A Data Set for Entity-Centric Extractive Summarization
https://github.com/bloomberg/entsum

# EntSUM: A dataset for entity-centric summarization
This repository contains the pre-processing code for generating the training datasets used in the [paper](https://aclanthology.org/2022.acl-long.237/).

## Using this repository
The repository contains 4 notebooks:
- [preprocessing_ner_tagging.ipynb](https://bbgithub.dev.bloomberg.com/mkulkarni24/entsum/blob/master/notebooks/preprocessing_ner_tagging.ipynb): Tags all entities in the source and summary of both the CNN/DailyMail and NYT datasets using [FLAIR](https://github.com/flairNLP/flair).
- [preprocessing_coref_resolution.ipynb](https://bbgithub.dev.bloomberg.com/mkulkarni24/entsum/blob/master/notebooks/preprocessing_coref_resolution.ipynb): Takes the entities produced by the NER tagging notebook and performs [Coreference Resolution using SpanBERT](https://github.com/mandarjoshi90/coref), so that a given entity does not produce duplicate data points.
- [preprocessing_bertsum.ipynb](https://bbgithub.dev.bloomberg.com/mkulkarni24/entsum/blob/master/notebooks/preprocessing_bertsum.ipynb): Uses the files generated by the NER tagging and coreference resolution notebooks to build the training dataset used to [train a BERTSum model](https://github.com/nlpyang/BertSum).
- [preprocessing_gsum.ipynb](https://bbgithub.dev.bloomberg.com/mkulkarni24/entsum/blob/master/notebooks/preprocessing_gsum.ipynb): Uses the files generated by the NER tagging and coreference resolution notebooks to build the training dataset used to [train a GSum model](https://github.com/neulab/guided_summarization).
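
To illustrate why coreference resolution follows NER tagging, the sketch below merges tagged mentions that belong to the same coreference cluster into a single entity record. It is not the code from the notebooks: the function name `merge_mentions_by_cluster`, the input shapes, and the cluster ids are all assumptions made for this example, and it uses only the Python standard library rather than FLAIR or SpanBERT.

```python
from collections import defaultdict


def merge_mentions_by_cluster(mentions, clusters):
    """Group NER mentions so each coreference cluster yields one entity.

    mentions: list of (sentence_index, surface_text) tuples (hypothetical
              stand-in for the output of an NER tagger).
    clusters: dict mapping surface_text -> cluster id (hypothetical stand-in
              for the output of a coreference model).
    """
    merged = defaultdict(lambda: {"names": [], "sentences": []})
    for sent_idx, text in mentions:
        # Mentions without a known cluster are kept as singleton entities.
        key = clusters.get(text, text)
        entry = merged[key]
        if text not in entry["names"]:
            entry["names"].append(text)
        if sent_idx not in entry["sentences"]:
            entry["sentences"].append(sent_idx)
    return dict(merged)


# Toy input: "Barack Obama" and "Obama" refer to the same person.
mentions = [(0, "Barack Obama"), (1, "Obama"), (1, "Chicago"), (3, "Obama")]
clusters = {"Barack Obama": "c0", "Obama": "c0"}

entities = merge_mentions_by_cluster(mentions, clusters)
for key, entry in entities.items():
    print(key, entry["names"], entry["sentences"])
```

Without the merge step, "Barack Obama" and "Obama" would each become a separate data point; with it, the toy input produces two entities instead of three.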

## Datasets
CNN/DailyMail and NYT can be used to train models: entity-centric summarization datasets are built from them using the methods described in the paper and the notebooks listed above.

- [CNN/DailyMail](https://cs.nyu.edu/~kcho/DMQA/)
- [NYT](https://catalog.ldc.upenn.edu/LDC2008T19)

The EntSUM dataset is used to evaluate the effectiveness of these trained entity-centric summarization models.
- EntSUM [Zenodo](https://zenodo.org/record/6359875) | [HuggingFace](https://huggingface.co/datasets/bloomberg/entsum)

# License
The EntSUM code is distributed under the Apache License (version 2.0); see the LICENSE file at the top of the source tree for more information.

**Note:** To run the code and download the datasets, please obtain the respective license for each dataset.

# Citation
```
@inproceedings{maddela-etal-2022-entsum,
    title = "{E}nt{SUM}: A Data Set for Entity-Centric Extractive Summarization",
    author = "Maddela, Mounica and
      Kulkarni, Mayank and
      Preotiuc-Pietro, Daniel",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.237",
    pages = "3355--3366",
    abstract = "Controllable summarization aims to provide summaries that take into account user-specified aspects and preferences to better assist them with their information need, as opposed to the standard summarization setup which build a single generic summary of a document. We introduce a human-annotated data set EntSUM for controllable summarization with a focus on named entities as the aspects to control. We conduct an extensive quantitative analysis to motivate the task of entity-centric summarization and show that existing methods for controllable summarization fail to generate entity-centric summaries. We propose extensions to state-of-the-art summarization approaches that achieve substantially better results on our data set. Our analysis and results show the challenging nature of this task and of the proposed data set.",
}
```