https://github.com/aborroy/summary-comparison-tool
Comparing the quality of two summaries against a source Markdown document
nlp python
- Host: GitHub
- URL: https://github.com/aborroy/summary-comparison-tool
- Owner: aborroy
- License: mit
- Created: 2025-07-25T14:18:21.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2025-08-11T06:38:17.000Z (10 months ago)
- Last Synced: 2025-08-11T08:34:59.401Z (10 months ago)
- Topics: nlp, python
- Language: Python
- Homepage:
- Size: 19.5 KB
- Stars: 2
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Summary Comparison Tool
A command-line tool for comparing the quality of two summaries against a source Markdown document. It combines five metrics, some model-based and some heuristic, into a single weighted score.
## Features
- **Multi-metric evaluation**: BARTScore, semantic similarity, coverage, conciseness, and factual consistency
- **Markdown support**: Direct processing of Markdown documents with proper text extraction
- **Flexible scoring**: Weighted combination of multiple evaluation criteria
- **GPU acceleration**: CUDA support for faster processing
- **Detailed analysis**: Optional breakdown of individual metrics
## Installation
### Prerequisites
- For local deployment: Python 3.7+ and PyTorch (CPU or GPU version)
- For Docker deployment: Docker 4.40+
### Setup with local deployment
1. **Clone the repository**
```bash
git clone https://github.com/aborroy/summary-comparison-tool.git
cd summary-comparison-tool
```
2. **Create a virtual environment**
```bash
python3 -m venv venv
source venv/bin/activate
```
3. **Install Python dependencies**
```bash
pip install torch transformers sentence-transformers markdown beautifulsoup4 tqdm numpy
```
4. **Clone BARTScore dependency**
```bash
git clone https://github.com/neulab/BARTScore.git
```
### GPU Support (Optional)
For CUDA acceleration, install PyTorch with GPU support:
```bash
pip install torch --index-url https://download.pytorch.org/whl/cu118
```
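Before passing `--device cuda`, it can help to confirm that the installed PyTorch build actually sees a GPU. The following is a minimal sketch; the helper name `pick_device` is hypothetical and not part of the tool.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        # Imported lazily so the check also works when PyTorch is not installed.
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(pick_device())
```

If this prints `cpu` on a machine with an NVIDIA GPU, the installed wheel is probably a CPU-only build.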
## Usage
### Basic Comparison
```bash
python summary_comparison.py document.md "First summary text" "Second summary text"
```
### With GPU Acceleration
```bash
python summary_comparison.py document.md "First summary" "Second summary" --device cuda
```
### Detailed Analysis
```bash
python summary_comparison.py document.md "First summary" "Second summary" --detailed
```
## Running with Docker
### How to use (CPU)
```bash
docker build -t summary-compare .
docker run --rm -v "$PWD":/work -w /work summary-compare \
examples/sample_document.md "First summary" "Second summary"
```
### GPU (optional)
If you want CUDA, build with the CUDA wheel index and run with NVIDIA:
```bash
docker build --build-arg TORCH_INDEX_URL=https://download.pytorch.org/whl/cu121 \
-t summary-compare-gpu .
docker run --rm --gpus all -v "$PWD":/work -w /work summary-compare-gpu \
examples/sample_document.md "A" "B" --device cuda
```
> Note. Docker Desktop 4.44 switched the default builder to a containerized Buildx driver. With that driver, `docker build` does not load the image into your local image store unless you ask it to. On this Docker version, build with `--load`:
```bash
docker build --load -t summary-compare .
```
## Example Output
### Standard Output
```
============================================================
SUMMARY EVALUATION RESULTS
============================================================
Summary 1 Overall Score: -0.167
Summary 2 Overall Score: -0.389
* Better Summary: Summary 1
Margin: +0.222
```
### Detailed Output (with `--detailed` flag)
```
============================================================
SUMMARY EVALUATION RESULTS
============================================================
SUMMARY 1 BREAKDOWN:
BARTScore: -1.727
Semantic Similarity: 0.527
Coverage: 0.187
Conciseness: 0.800
Factual Consistency: 0.923
* Overall Score: -0.167
SUMMARY 2 BREAKDOWN:
BARTScore: -2.447
Semantic Similarity: 0.476
Coverage: 0.160
Conciseness: 1.000
Factual Consistency: 0.857
* Overall Score: -0.389
* Better Summary: Summary 1
Margin: +0.222
```
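The overall scores in the detailed output are consistent with a plain weighted sum of the five metrics, using the weights listed under Evaluation Metrics. The sketch below reproduces the arithmetic from the sample numbers; the names `WEIGHTS` and `overall_score` are illustrative, and the tool's own code may compute the score differently.

```python
# Weights as listed in the Evaluation Metrics section.
WEIGHTS = {
    "bartscore": 0.30,
    "semantic_similarity": 0.25,
    "coverage": 0.25,
    "conciseness": 0.10,
    "factual_consistency": 0.10,
}


def overall_score(metrics: dict) -> float:
    """Weighted sum of the five metric values."""
    return sum(WEIGHTS[name] * metrics[name] for name in WEIGHTS)


# Metric values copied from the detailed sample output.
summary1 = {"bartscore": -1.727, "semantic_similarity": 0.527,
            "coverage": 0.187, "conciseness": 0.800,
            "factual_consistency": 0.923}
summary2 = {"bartscore": -2.447, "semantic_similarity": 0.476,
            "coverage": 0.160, "conciseness": 1.000,
            "factual_consistency": 0.857}

score1 = overall_score(summary1)
score2 = overall_score(summary2)
print(f"{score1:.3f}")            # -0.167
print(f"{score2:.3f}")            # -0.389
print(f"{score1 - score2:+.3f}")  # +0.222
```

Because BARTScore is negative and carries the largest weight, it dominates the overall score in this example.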
## Evaluation Metrics
The tool evaluates summaries across five key dimensions:
### 1. BARTScore (Weight: 30%)
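- Scores text with a BART model, based on the log-likelihood of generating one text from the other
- Measures how well the summary captures document meaning
- Scores are log-likelihoods and therefore negative; values closer to zero indicate better semantic alignment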
### 2. Semantic Similarity (Weight: 25%)
- Sentence embedding-based similarity
- Uses sentence-transformers for deep semantic understanding
- Falls back to word overlap if sentence-transformers is unavailable
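One way to combine embedding similarity with a word-overlap fallback is sketched below. This is an illustration, not the tool's code: the function names, the choice of Jaccard overlap for the fallback, and the model name `all-MiniLM-L6-v2` are all assumptions.

```python
def word_overlap_similarity(document: str, summary: str) -> float:
    """Jaccard overlap of lowercase word sets (assumed form of the fallback)."""
    doc_words = set(document.lower().split())
    sum_words = set(summary.lower().split())
    if not doc_words or not sum_words:
        return 0.0
    return len(doc_words & sum_words) / len(doc_words | sum_words)


def semantic_similarity(document: str, summary: str,
                        model_name: str = "all-MiniLM-L6-v2") -> float:
    """Cosine similarity of sentence embeddings, or word overlap as a fallback."""
    try:
        from sentence_transformers import SentenceTransformer, util
    except ImportError:
        # sentence-transformers is not installed: use the cheap fallback.
        return word_overlap_similarity(document, summary)
    model = SentenceTransformer(model_name)  # assumed model; downloads on first use
    embeddings = model.encode([document, summary], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()
```

The two paths are not on the same scale, so scores from the fallback should not be compared with scores from the embedding model.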
### 3. Coverage Score (Weight: 25%)
- Measures how well the summary covers key document content
- Based on important keyword overlap
- Filters out common stop words for better accuracy
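A keyword-overlap coverage score along these lines could look like the sketch below. The stop-word list, the frequency-based choice of "important" keywords, and the `top_n` parameter are assumptions made for illustration; the tool may select keywords differently.

```python
import re
from collections import Counter

# A deliberately small stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "into", "is",
              "are", "for", "on", "with", "that", "this", "it", "as", "by"}


def keywords(text: str, top_n: int = 20) -> set:
    """Most frequent non-stop-word tokens longer than two letters."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS and len(t) > 2)
    return {word for word, _ in counts.most_common(top_n)}


def coverage_score(document: str, summary: str, top_n: int = 20) -> float:
    """Fraction of document keywords that also appear in the summary."""
    doc_keywords = keywords(document, top_n)
    if not doc_keywords:
        return 0.0
    summary_tokens = set(re.findall(r"[a-z]+", summary.lower()))
    return len(doc_keywords & summary_tokens) / len(doc_keywords)


doc = "Solar panels convert sunlight into electricity. Solar panels need sunlight."
print(coverage_score(doc, "Panels turn sunlight into power."))
```

In this example the document has six keywords and the summary mentions two of them, `panels` and `sunlight`.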
### 4. Conciseness Score (Weight: 10%)
- Evaluates appropriate compression ratio
- Optimal range: 10-30% of original document length
- Penalizes both over-compression and verbosity
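The compression-ratio idea can be sketched as follows. The 10-30% range comes from the list above; measuring length in words and using linear penalties outside the range are assumptions, so the numbers will not necessarily match the tool's output.

```python
def conciseness_score(document: str, summary: str,
                      low: float = 0.10, high: float = 0.30) -> float:
    """Score 1.0 inside the optimal ratio range, decaying linearly outside it."""
    doc_len = len(document.split())
    if doc_len == 0:
        return 0.0
    ratio = len(summary.split()) / doc_len
    if ratio < low:
        # Over-compressed: scale down toward 0 as the summary shrinks.
        return ratio / low
    if ratio > high:
        # Too verbose: scale down toward 0 as the ratio approaches 100%.
        return max(0.0, 1.0 - (ratio - high) / (1.0 - high))
    return 1.0


document = " ".join(["word"] * 100)
print(conciseness_score(document, " ".join(["word"] * 20)))  # inside the range
print(conciseness_score(document, " ".join(["word"] * 5)))   # over-compressed
```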
### 5. Factual Consistency (Weight: 10%)
- Checks whether facts stated in the summary appear in the source document
- Focuses on numbers and proper nouns
- Helps identify potential hallucinations
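A check of this kind can be approximated with two regular expressions, as in the sketch below. The function names and patterns are assumptions; in particular, treating every capitalized word as a proper noun also picks up sentence-initial words, which a real implementation would likely handle more carefully.

```python
import re


def extract_facts(text: str) -> set:
    """Collect numbers and capitalized words as rough stand-ins for facts."""
    numbers = set(re.findall(r"\d+(?:\.\d+)?", text))
    capitalized = set(re.findall(r"\b[A-Z][a-zA-Z]+\b", text))
    return numbers | capitalized


def factual_consistency(document: str, summary: str) -> float:
    """Fraction of summary facts that also occur in the document."""
    summary_facts = extract_facts(summary)
    if not summary_facts:
        return 1.0  # nothing to contradict
    document_facts = extract_facts(document)
    return len(summary_facts & document_facts) / len(summary_facts)


source = "Acme shipped 120 units to Berlin in 2024."
print(factual_consistency(source, "Acme shipped 150 units to Berlin."))
```

Here the summary contains three facts (`Acme`, `Berlin`, `150`), and the figure `150` does not appear in the source, so the score is two thirds.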
## Command Line Options
| Option | Description | Default |
|--------|-------------|---------|
| `md_file` | Path to Markdown source document | Required |
| `summary1` | First candidate summary text | Required |
| `summary2` | Second candidate summary text | Required |
| `--device` | Computation device (`cpu`, `cuda`, `cuda:0`) | `cpu` |
| `--detailed` | Show detailed metric breakdown | `false` |
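For reference, the options in the table map onto `argparse` roughly as shown below. This is a reconstruction from the table, not a copy of `summary_comparison.py`; the help strings and the `build_parser` name are illustrative.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented command line options."""
    parser = argparse.ArgumentParser(
        description="Compare two summaries against a source Markdown document")
    parser.add_argument("md_file", help="Path to Markdown source document")
    parser.add_argument("summary1", help="First candidate summary text")
    parser.add_argument("summary2", help="Second candidate summary text")
    parser.add_argument("--device", default="cpu",
                        help="Computation device (cpu, cuda, cuda:0)")
    # A flag without a value: absent means False, present means True.
    parser.add_argument("--detailed", action="store_true",
                        help="Show detailed metric breakdown")
    return parser


args = build_parser().parse_args(
    ["document.md", "First summary", "Second summary", "--device", "cuda"])
print(args.device, args.detailed)  # cuda False
```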
## File Structure
```
summary-comparison-tool/
├── summary_comparison.py # Main evaluation script
├── README.md # This file
├── requirements.txt # Python dependencies
├── examples/ # Example documents and summaries
│ ├── sample_document.md
│ ├── good_summary.txt
│ └── best_summary.txt
└── BARTScore/ # BARTScore checkout (clone separately)
```
## Requirements
See `requirements.txt` for the complete dependency list:
```txt
torch>=1.9.0
transformers>=4.20.0
sentence-transformers>=2.2.0
markdown>=3.4.0
beautifulsoup4>=4.11.0
tqdm>=4.64.0
numpy>=1.21.0
```
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Acknowledgments
- [BARTScore](https://github.com/neulab/BARTScore) for semantic evaluation
- [Sentence Transformers](https://www.sbert.net/) for embedding-based similarity
- [Hugging Face Transformers](https://huggingface.co/transformers/) for model infrastructure
## Related Work
- **ROUGE**: Traditional n-gram based evaluation metrics
- **BERTScore**: BERT-based semantic similarity
- **BLEURT**: Learned evaluation metric for text generation
- **Factual Consistency**: Various approaches for hallucination detection