ENTSUM: A Data Set for Entity-Centric Extractive Summarization
- Host: GitHub
- URL: https://github.com/bloomberg/entsum
- Owner: bloomberg
- License: apache-2.0
- Created: 2022-03-17T18:49:16.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2022-05-23T21:56:15.000Z (over 2 years ago)
- Last Synced: 2024-10-19T03:14:04.405Z (28 days ago)
- Topics: nlp
- Language: Jupyter Notebook
- Homepage:
- Size: 27.3 KB
- Stars: 28
- Watchers: 6
- Forks: 2
- Open Issues: 2
- Metadata Files:
  - Readme: README.md
  - License: LICENSE.md
Awesome Lists containing this project
README
# EntSUM: A dataset for entity-centric summarization
Repository for the pre-processing code used to generate the training datasets described in the [paper](https://aclanthology.org/2022.acl-long.237/).

## Using this repository
The repository contains 4 notebooks:
- [preprocessing_ner_tagging.ipynb](https://bbgithub.dev.bloomberg.com/mkulkarni24/entsum/blob/master/notebooks/preprocessing_ner_tagging.ipynb): This notebook tags all entities in the source and summary of both the CNN/DailyMail and NYT datasets using [FLAIR](https://github.com/flairNLP/flair); a minimal tagging sketch follows this list
- [preprocessing_coref_resolution.ipynb](https://bbgithub.dev.bloomberg.com/mkulkarni24/entsum/blob/master/notebooks/preprocessing_coref_resolution.ipynb): This notebook takes the entities from the previous tagging step and performs [Coreference Resolution using SpanBERT](https://github.com/mandarjoshi90/coref) so that a given entity does not produce duplicate data points
- [preprocessing_bertsum.ipynb](https://bbgithub.dev.bloomberg.com/mkulkarni24/entsum/blob/master/notebooks/preprocessing_bertsum.ipynb): This notebook uses the files generated by the NER tagging and Coreference Resolution to generate the training dataset to be used to [train a BERTSum model](https://github.com/nlpyang/BertSum)
- [preprocessing_gsum.ipynb](https://bbgithub.dev.bloomberg.com/mkulkarni24/entsum/blob/master/notebooks/preprocessing_gsum.ipynb): This notebook uses the files generated by the NER tagging and Coreference Resolution to generate the training dataset to be used to [train a GSum model](https://github.com/neulab/guided_summarization)
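The notebooks themselves sit behind the internal links above. As a rough, standalone illustration of the first step only, the sketch below tags named entities in a (source, summary) pair with FLAIR's off-the-shelf English NER tagger; the model name (`"ner"`), the example texts, and the `tag_texts` helper are assumptions made for this sketch, not code taken from the notebooks.

```python
# Minimal sketch of FLAIR NER tagging for a (source, summary) pair.
# Assumes `pip install flair`; "ner" is FLAIR's stock English 4-class tagger.
from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load("ner")  # downloads the pretrained model on first use

def tag_texts(texts):
    """Return a list of (entity text, entity type) spans for each input text."""
    results = []
    for text in texts:
        sentence = Sentence(text)
        tagger.predict(sentence)
        results.append([(span.text, span.get_label("ner").value)
                        for span in sentence.get_spans("ner")])
    return results

source = "Barack Obama visited Berlin on Tuesday to meet Angela Merkel."
summary = "Obama met Merkel in Berlin."
print(tag_texts([source, summary]))
# e.g. [[('Barack Obama', 'PER'), ('Berlin', 'LOC'), ('Angela Merkel', 'PER')], ...]
```

In the pipeline described above, spans like these would then be passed through coreference resolution before building the BERTSum/GSum training files.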
## Datasets

CNN/DailyMail and NYT are the source datasets from which entity-centric summarization training sets are built, using the methods described in the paper and the notebooks above.

- [CNN/DailyMail](https://cs.nyu.edu/~kcho/DMQA/)
- [NYT](https://catalog.ldc.upenn.edu/LDC2008T19)

The EntSUM dataset is used to evaluate the effectiveness of these trained entity-centric summarization models; a quick loading sketch follows the links below.
- EntSUM: [Zenodo](https://zenodo.org/record/6359875) | [HuggingFace](https://huggingface.co/datasets/bloomberg/entsum)
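To peek at the evaluation data, the sketch below loads the HuggingFace copy of EntSUM linked above with the standard `datasets` library and prints its splits and column names. That the release is reachable under the hub id `bloomberg/entsum` (and whether any license terms must be accepted first) is inferred from the link above; treat the printed output, not this example, as the authority on the actual schema.

```python
# Minimal sketch: inspect the EntSUM release on the HuggingFace Hub.
# Assumes `pip install datasets`; the hub id is taken from the link above and
# downloading may require first accepting the dataset's license terms.
from datasets import load_dataset

entsum = load_dataset("bloomberg/entsum")

for split_name, split in entsum.items():
    # Report what is actually shipped rather than assuming a schema.
    print(split_name, len(split), split.column_names)
```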
# License

The EntSUM code is distributed under the Apache License (version 2.0); see the LICENSE file at the top of the source tree for more information.

**Note:** To run the code and download the datasets, please obtain the appropriate license for each dataset.
# Citation
```
@inproceedings{maddela-etal-2022-entsum,
    title = "{E}nt{SUM}: A Data Set for Entity-Centric Extractive Summarization",
    author = "Maddela, Mounica  and
      Kulkarni, Mayank  and
      Preotiuc-Pietro, Daniel",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.237",
    pages = "3355--3366",
    abstract = "Controllable summarization aims to provide summaries that take into account user-specified aspects and preferences to better assist them with their information need, as opposed to the standard summarization setup which build a single generic summary of a document. We introduce a human-annotated data set EntSUM for controllable summarization with a focus on named entities as the aspects to control. We conduct an extensive quantitative analysis to motivate the task of entity-centric summarization and show that existing methods for controllable summarization fail to generate entity-centric summaries. We propose extensions to state-of-the-art summarization approaches that achieve substantially better results on our data set. Our analysis and results show the challenging nature of this task and of the proposed data set.",
}
```