{"id":15987292,"url":"https://github.com/pszemraj/summcomparer","last_synced_at":"2025-10-06T18:29:39.742Z","repository":{"id":168098170,"uuid":"643698243","full_name":"pszemraj/SummComparer","owner":"pszemraj","description":"compiles and parses the summarization gauntlet and results from various models into a dataset-like format","archived":false,"fork":false,"pushed_at":"2023-05-25T00:59:13.000Z","size":13780,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-10T23:36:41.017Z","etag":null,"topics":["encoder-decoder","long-document","long-document-summarization","summarization","text-generation","transformers"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pszemraj.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-05-22T00:55:14.000Z","updated_at":"2023-05-22T11:28:13.000Z","dependencies_parsed_at":"2023-05-30T12:00:20.654Z","dependency_job_id":null,"html_url":"https://github.com/pszemraj/SummComparer","commit_stats":{"total_commits":42,"total_committers":2,"mean_commits":21.0,"dds":"0.11904761904761907","last_synced_commit":"f0e2190ec5cff28503cea23af936c455bf95e090"},"previous_names":["pszemraj/summcomparer"],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pszemraj%2FSummComparer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pszemraj%2FSummComparer/tags","releases_url
":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pszemraj%2FSummComparer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pszemraj%2FSummComparer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pszemraj","download_url":"https://codeload.github.com/pszemraj/SummComparer/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247334060,"owners_count":20922134,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["encoder-decoder","long-document","long-document-summarization","summarization","text-generation","transformers"],"created_at":"2024-10-08T03:22:41.804Z","updated_at":"2025-10-06T18:29:39.636Z","avatar_url":"https://github.com/pszemraj.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# SummComparer\n\n\u003e **Comparative analysis of summarization models**\n\n⚠️ This project is currently under active development and will continue to evolve over time. 
⚠️\n\nSummComparer is an initiative aimed at compiling, scrutinizing, and analyzing a [Summarization Gauntlet](https://www.dropbox.com/sh/axu1xlscrrexy55/AADAm01-4Zs3POyHQrgbDAsda?dl=0) with the goal of understanding/improving _what makes a summarization model do well_ in practical everyday use cases.\n\nThe latest version of the dataset can also be found [on huggingface here](https://huggingface.co/datasets/pszemraj/summcomparer-gauntlet-v0.1) and loaded with `datasets`.\n\n---\n\n- [SummComparer](#summcomparer)\n  - [About](#about)\n    - [A Case Study](#a-case-study)\n  - [EDA links](#eda-links)\n  - [Installation](#installation)\n  - [Usage](#usage)\n    - [Compiling the Gauntlet](#compiling-the-gauntlet)\n    - [Working with the Dataset](#working-with-the-dataset)\n      - [Input Documents](#input-documents)\n      - [Exploring the Dataset](#exploring-the-dataset)\n\n---\n\n## About\n\nSummComparer's main aim is to test how well various summarization models work on long documents from a wide range of topics, **none of which** are part of standard training data[^1]. This \"gauntlet\" of topics helps us see how well the models can summarize both familiar and unfamiliar content. By doing this, we can understand how these models might perform in real-world situations where the content is unpredictable[^2]. This also helps us identify their limitations and ideally, understand what makes them work well.\n\n[^1]: As it turns out, the practical application of summarization models **is not** the ritual of summarizing documents _you already know the summary of_ and benchmarking their ability to regurgitate these back to you via ROUGE scores as a testament of their performance. Who knew?\n[^2]: i.e. 
you are not trying to hit a high score on the test set of [arXiv summarization](https://paperswithcode.com/dataset/arxiv-summarization-dataset) as a measure of a \"good model\", but rather actually read and use the summaries in real life.\n\n### A Case Study\n\nPut another way, SummComparer can be thought of as a case study for the following scenario:\n\n- You have a collection of documents that you need to summarize/understand for `\u003creason\u003e`\n- You don't know what domain(s) these documents belong to **because you haven't read them**, and you don't have the time or inclination to read them fully.\n  - You're hoping to get a general understanding of these documents from summaries, and then plan to decide which ones to do more in-depth reading on.\n- You're not sure what the ideal summaries of these documents are **because if you knew that, you wouldn't need to summarize them with a language model**.\n- So: Which model(s) should you use? How can you determine if the outputs are faithful without reading the source documents? 
How can you determine whether the model is performing well or not?\n\nThe idea for this project was born out of necessity: to test whether a summarization model was \"good\" or not, I would run it on a consistent set of documents and compare the generated summaries with the outputs of other models and my growing understanding of the documents themselves.\n\nIf `\u003cnew summarization model or technique\u003e` claiming to be amazing is unable to summarize the [navy seals copypasta](https://knowyourmeme.com/memes/navy-seal-copypasta), OCR'd PowerPoint slides, or a [short story](https://en.wikipedia.org/wiki/The_Most_Dangerous_Game), then it's probably not going to be very useful in the real world.\n\n## EDA links\n\nFrom `pandas-profiling`:\n\n- [summary outputs](https://gauntlet-compiled-eda-v0p1.netlify.app/)\n- [input docs](https://gauntlet-inputs-eda-v0p1.netlify.app/)\n\n## Installation\n\nTo install the necessary packages, run the following command:\n\n```bash\npip install -r requirements.txt\n```\n\nTo install the package requirements for using the scripts in `bin/`, run the following from the repository root:\n\n```bash\npip install -r bin/requirements.txt\n```\n\n## Usage\n\nAs the dataset is already compiled, you can skip to the [Working with the Dataset](#working-with-the-dataset) section for most use cases.\n\n### Compiling the Gauntlet\n\nThe gauntlet is compiled with command-line (CLI) scripts. The recommended sequence of operations is as follows:\n\n```bash\npython export_gauntlet.py\npython map_gauntlet_files.py\npython build_src_df.py\n```\n\nAll CLI scripts use the `fire` package for CLI generation. For more information on how to use the CLI, run:\n\n```bash\npython \u003cscript_name\u003e.py --help\n```\n\n### Working with the Dataset\n\n\u003e **Note:** The current version of the dataset is in a \"raw\" format. It has not been cleaned or pruned of unnecessary columns. 
This will be addressed in a future release.\n\nThe dataset files are located in `as-dataset/` and are saved as `.parquet` files. The dataset comprises two files, which can be thought of as two tables in a relational database:\n\n- `as-dataset/gauntlet_input_documents.parquet`: This file contains the input documents for the gauntlet along with metadata/`id` fields as defined in `gauntlet_master_data.json`.\n- `as-dataset/summary_gauntlet_dataset_mapped_src_docs.parquet`: This file contains the output summaries for the gauntlet with hyperparameters/models as columns. Each row is one summary and is linked to its source document by the columns prefixed with `source_doc`.\n\nYou can load the data using `pandas`:\n\n```python\nimport pandas as pd\ndf = pd.read_parquet('as-dataset/summary_gauntlet_dataset_mapped_src_docs.parquet')\ndf.info()\n```\n\n#### Input Documents\n\nThe `gauntlet_input_documents.parquet` file is required only if you need to examine the source documents themselves or perform any analysis using their text. Most of the necessary information is available in the `summary_gauntlet_dataset_mapped_src_docs.parquet` file.\n\nThe `gauntlet_input_documents.parquet` file contains the following columns:\n\n```python\n\u003e\u003e\u003e import pandas as pd\n\u003e\u003e\u003e df = pd.read_parquet(\"as-dataset/gauntlet_input_documents.parquet\").convert_dtypes()\n\u003e\u003e\u003e df.info()\n\u003cclass 'pandas.core.frame.DataFrame'\u003e\nRangeIndex: 19 entries, 0 to 18\nData columns (total 4 columns):\n #   Column               Non-Null Count  Dtype\n---  ------               --------------  -----\n 0   source_doc_filename  19 non-null     string\n 1   source_doc_id        19 non-null     string\n 2   source_doc_domain    19 non-null     string\n 3   document_text        19 non-null     string\ndtypes: string(4)\nmemory usage: 736.0 bytes\n```\n\nThe `source_doc_id` column, present in both files, can be used to join them together. 
A script that does this for you can be found in `bin/`:\n\n```bash\npython bin/create_merged_df.py\n```\n\n#### Exploring the Dataset\n\nThere are numerous Exploratory Data Analysis (EDA) tools available. For initial exploration and testing, `dtale` is recommended due to its flexibility and user-friendly interface. Install it with:\n\n```bash\npip install dtale\n```\n\nYou can then launch a UI instance from the command line with:\n\n```bash\ndtale --parquet-path as-dataset/summary_gauntlet_dataset_mapped_src_docs.parquet\n```\n\nPlease note that this project is a work in progress. Future updates will include data cleaning, removal of unnecessary columns, and additional features to enhance the usability and functionality of the project.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpszemraj%2Fsummcomparer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpszemraj%2Fsummcomparer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpszemraj%2Fsummcomparer/lists"}