https://github.com/pszemraj/summcomparer

compiles and parses the summarization gauntlet and results from various models into a dataset-like format

encoder-decoder long-document long-document-summarization summarization text-generation transformers

Host: GitHub
URL: https://github.com/pszemraj/summcomparer
Owner: pszemraj
License: mit
Created: 2023-05-22T00:55:14.000Z (about 3 years ago)
Default Branch: main
Last Pushed: 2023-05-25T00:59:13.000Z (about 3 years ago)
Last Synced: 2025-02-10T23:36:41.017Z (over 1 year ago)
Topics: encoder-decoder, long-document, long-document-summarization, summarization, text-generation, transformers
Language: Python
Homepage:
Size: 13.1 MB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE

# SummComparer

> **Comparative analysis of summarization models**

⚠️ This project is currently under active development and will continue to evolve over time. ⚠️

SummComparer is an initiative aimed at compiling, scrutinizing, and analyzing a [Summarization Gauntlet](https://www.dropbox.com/sh/axu1xlscrrexy55/AADAm01-4Zs3POyHQrgbDAsda?dl=0) with the goal of understanding/improving _what makes a summarization model do well_ in practical everyday use cases.

The latest version of the dataset is also available [on Hugging Face](https://huggingface.co/datasets/pszemraj/summcomparer-gauntlet-v0.1) and can be loaded with the `datasets` library.

---

- [SummComparer](#summcomparer)
  - [About](#about)
    - [A Case Study](#a-case-study)
  - [EDA links](#eda-links)
  - [Installation](#installation)
  - [Usage](#usage)
    - [Compiling the Gauntlet](#compiling-the-gauntlet)
    - [Working with the Dataset](#working-with-the-dataset)
      - [Input Documents](#input-documents)
      - [Exploring the Dataset](#exploring-the-dataset)

---

## About

SummComparer's main aim is to test how well various summarization models work on long documents from a wide range of topics, **none of which** are part of standard training data[^1]. This "gauntlet" of topics helps us see how well the models can summarize both familiar and unfamiliar content. By doing this, we can understand how these models might perform in real-world situations where the content is unpredictable[^2]. This also helps us identify their limitations and ideally, understand what makes them work well.

[^1]: As it turns out, the practical application of summarization models **is not** the ritual of summarizing documents _you already know the summary of_ and benchmarking their ability to regurgitate these back to you via ROUGE scores as a testament of their performance. Who knew?

[^2]: i.e. you are not trying to hit a high score on the test set of [arXiv summarization](https://paperswithcode.com/dataset/arxiv-summarization-dataset) as a measure of a "good model", but rather actually read and use the summaries in real life.

### A Case Study

Put another way, SummComparer can be thought of as a case study for the following scenario:

- You have a collection of documents that you need to summarize/understand for ``

- You don't know what domain(s) these documents belong to **because you haven't read them**, and you don't have the time or inclination to read them fully.

  - You're hoping to get a general understanding of these documents from summaries, and then plan to decide which ones to do more in-depth reading on.

- You're not sure what the ideal summaries of these documents are **because if you knew that, you wouldn't need to summarize them with a language model**.

- So: Which model(s) should you use? How can you determine if the outputs are faithful without reading the source documents? How can you determine whether the model is performing well or not?

The idea for this project was born out of necessity: to test whether a summarization model was "good" or not, I would run it on a consistent set of documents and compare the generated summaries with the outputs of other models and my growing understanding of the documents themselves.

If `` claiming to be amazing is unable to summarize the [navy seals copypasta](https://knowyourmeme.com/memes/navy-seal-copypasta), OCR'd powerpoint slides, or a [short story](https://en.wikipedia.org/wiki/The_Most_Dangerous_Game), then it's probably not going to be very useful in the real world.

## EDA links

From `pandas-profiling`:

- [summary outputs](https://gauntlet-compiled-eda-v0p1.netlify.app/)

- [input docs](https://gauntlet-inputs-eda-v0p1.netlify.app/)

## Installation

To install the necessary packages, run the following command:

```bash
pip install -r requirements.txt
```

To install the additional requirements for the scripts in `bin/`, run the following from the repository root:

```bash
pip install -r bin/requirements.txt
```

## Usage

As the dataset is already compiled, you can skip to the [Working with the Dataset](#working-with-the-dataset) section for most use cases.

### Compiling the Gauntlet

The current version supports Command Line Interface (CLI) usage. The recommended sequence of operations is as follows:

```bash
export_gauntlet.py
map_gauntlet_files.py
build_src_df.py
```

All CLI scripts utilize the `fire` package for CLI generation. For more information on how to use the CLI, run:

```bash
python .py --help
```

### Working with the Dataset

> **Note:** The current version of the dataset is in a "raw" format. It has not been cleaned or pruned of unnecessary columns. This will be addressed in a future release.

The dataset files are located in `as-dataset/` and are saved as `.parquet` files. The dataset comprises two files, which can be conceptualized as two tables in a relational database:

- `as-dataset/gauntlet_input_documents.parquet`: This file contains the input documents for the gauntlet along with metadata/`id` fields as defined in `gauntlet_master_data.json`.

- `as-dataset/summary_gauntlet_dataset_mapped_src_docs.parquet`: This file contains the output summaries for the gauntlet, with the models and hyperparameters that produced them stored as columns. Each row is one summary, and the columns prefixed with `source_doc` map it back to its source document.

You can load the data using `pandas`:

```python
import pandas as pd

df = pd.read_parquet('as-dataset/summary_gauntlet_dataset_mapped_src_docs.parquet')
df.info()
```
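Because the link back to the source documents lives in the `source_doc`-prefixed columns, it can be handy to separate those from the rest of the table. The following is a minimal sketch rather than part of the repository: the helper `split_source_columns` is hypothetical, and the toy frame uses made-up `summary` and `model_name` columns in place of the real ones.

```python
import pandas as pd


def split_source_columns(df: pd.DataFrame, prefix: str = "source_doc"):
    """Return (columns starting with `prefix`, all remaining columns)."""
    source_cols = [col for col in df.columns if str(col).startswith(prefix)]
    other_cols = [col for col in df.columns if not str(col).startswith(prefix)]
    return source_cols, other_cols


# Toy stand-in for the summary table. Only the `source_doc` prefix comes from
# the README; `summary` and `model_name` are hypothetical column names.
toy = pd.DataFrame(
    {
        "source_doc_id": ["doc-a", "doc-b"],
        "source_doc_filename": ["a.txt", "b.txt"],
        "summary": ["Summary of A.", "Summary of B."],
        "model_name": ["model-x", "model-y"],
    }
)

source_cols, other_cols = split_source_columns(toy)
print(source_cols)
print(other_cols)
```

On the real data, pass the frame returned by `pd.read_parquet` instead of `toy`.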

#### Input Documents

The `gauntlet_input_documents.parquet` file is required only if you need to examine the source documents themselves or perform any analysis using their text. Most of the necessary information is available in the `summary_gauntlet_dataset_mapped_src_docs.parquet` file.

The `gauntlet_input_documents.parquet` file contains the following columns:

```python
>>> import pandas as pd
>>> df = pd.read_parquet("as-dataset/gauntlet_input_documents.parquet").convert_dtypes()
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19 entries, 0 to 18
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   source_doc_filename  19 non-null     string
 1   source_doc_id        19 non-null     string
 2   source_doc_domain    19 non-null     string
 3   document_text        19 non-null     string
dtypes: string(4)
memory usage: 736.0 bytes
```
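As one example of an analysis that needs the document text, the sketch below totals the words per `source_doc_domain`. It is an illustration only: the helper `words_per_domain` is hypothetical, words are counted by naive whitespace splitting, and the toy rows (including the domain labels) are invented stand-ins for the real file.

```python
import pandas as pd


def words_per_domain(docs: pd.DataFrame) -> pd.Series:
    """Sum whitespace-separated word counts of `document_text` per domain."""
    word_counts = docs["document_text"].str.split().str.len()
    return word_counts.groupby(docs["source_doc_domain"]).sum()


# Toy stand-in with the four columns listed above; all values are invented.
docs = pd.DataFrame(
    {
        "source_doc_filename": ["a.txt", "b.txt", "c.txt"],
        "source_doc_id": ["doc-a", "doc-b", "doc-c"],
        "source_doc_domain": ["fiction", "fiction", "slides"],
        "document_text": ["one two three", "four five", "six"],
    }
)

totals = words_per_domain(docs)
print(totals)
```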

The `source_doc_id` column, present in both files, can be used to join them together. A script that does this for you can be found in `bin/`:

```bash
python bin/create_merged_df.py
```
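If you would rather do the join yourself, a minimal sketch with `pandas` could look like the following. It is not taken from `bin/create_merged_df.py` and may differ from what that script does. The helper `merge_summaries_with_inputs` and the `summary` column are hypothetical, and the sketch assumes `source_doc_id` is unique in the input-documents table.

```python
import pandas as pd


def merge_summaries_with_inputs(
    summaries: pd.DataFrame, inputs: pd.DataFrame
) -> pd.DataFrame:
    """Left-join each summary row to its source text via `source_doc_id`."""
    return summaries.merge(
        inputs[["source_doc_id", "document_text"]],
        on="source_doc_id",
        how="left",  # keep every summary, even if its document is missing
        validate="many_to_one",  # raise if a document id repeats in `inputs`
    )


# Toy stand-ins for the two parquet files; `summary` is a hypothetical column.
summaries = pd.DataFrame(
    {
        "source_doc_id": ["doc-a", "doc-a", "doc-b"],
        "summary": ["A, model x", "A, model y", "B, model x"],
    }
)
inputs = pd.DataFrame(
    {
        "source_doc_id": ["doc-a", "doc-b"],
        "document_text": ["Text A", "Text B"],
    }
)

merged = merge_summaries_with_inputs(summaries, inputs)
print(merged)
```

A left join keeps one row per summary, so the merged frame has the same number of rows as the summary table.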

#### Exploring the Dataset

There are numerous Exploratory Data Analysis (EDA) tools available. For initial exploration and testing, `dtale` is recommended due to its flexibility and user-friendly interface. Install it with:

```bash
pip install dtale
```

You can then launch a UI instance from the command line with:

```bash
dtale --parquet-path as-dataset/summary_gauntlet_dataset_mapped_src_docs.parquet
```

Please note that this project is a work in progress. Future updates will include data cleaning, removal of unnecessary columns, and additional features to enhance the usability and functionality of the project.
