https://github.com/blmoistawinde/fense
Fluency ENhanced Sentence-bert Evaluation (FENSE), a metric for audio caption evaluation, together with the benchmark datasets AudioCaps-Eval and Clotho-Eval.
- Host: GitHub
- URL: https://github.com/blmoistawinde/fense
- Owner: blmoistawinde
- Created: 2021-10-02T06:21:48.000Z (about 4 years ago)
- Default Branch: main
- Last Pushed: 2023-02-01T09:57:47.000Z (over 2 years ago)
- Last Synced: 2025-04-14T09:51:05.236Z (6 months ago)
- Topics: audio-captioning, audiocaption, benchmark, evaluation-metrics
- Language: Python
- Homepage: https://share.streamlit.io/blmoistawinde/fense/main/streamlit_demo/app.py
- Size: 103 MB
- Stars: 21
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
# FENSE
**F**luency **EN**hanced **S**entence-bert **E**valuation (FENSE) is a metric for audio caption evaluation, proposed in the paper ["Can Audio Captions Be Evaluated with Image Caption Metrics?"](https://arxiv.org/abs/2110.04684).
The `main` branch contains an easy-to-use interface for fast evaluation of an audio captioning system.
An online demo is available at https://share.streamlit.io/blmoistawinde/fense/main/streamlit_demo/app.py.
To get the datasets (AudioCaps-Eval and Clotho-Eval) and the code to reproduce the paper's results, please refer to the [experiment-code](https://github.com/blmoistawinde/fense/tree/experiment-code) branch.
## Installation
Clone the repository and install it with pip:
```bash
git clone https://github.com/blmoistawinde/fense.git
cd fense
pip install -e .
```

## Usage
### Single Sentence
To get the detailed scores of each component for a single sentence:

```python
from fense.evaluator import Evaluator

print("----Using tiny models----")
evaluator = Evaluator(device='cpu', sbert_model='paraphrase-MiniLM-L6-v2', echecker_model='echecker_clotho_audiocaps_tiny')

eval_cap = "An engine in idling and a man is speaking and then"
ref_cap = "A machine makes stitching sounds while people are talking in the background"

score, error_prob, penalized_score = evaluator.sentence_score(eval_cap, [ref_cap], return_error_prob=True)
print("Cand:", eval_cap)
print("Ref:", ref_cap)
print(f"SBERT sim: {score:.4f}, Error Prob: {error_prob:.4f}, Penalized score: {penalized_score:.4f}")
```
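
For intuition, the penalized score relates the two quantities above by down-weighting the SBERT similarity when the error detector is confident the caption is disfluent. The snippet below is a minimal sketch of that idea as described in the paper, not the package's internal code; the threshold and penalty weight here are assumed illustrative values.

```python
# Minimal sketch of the fluency penalty idea (not the library's internals).
# THRESHOLD and PENALTY are assumed illustrative values.
THRESHOLD = 0.9  # error probability above which a caption is treated as disfluent
PENALTY = 0.9    # fraction of the similarity removed when penalizing

def penalize(sbert_sim: float, error_prob: float) -> float:
    """Scale down the SBERT similarity if a fluency error is likely."""
    if error_prob > THRESHOLD:
        return sbert_sim * (1.0 - PENALTY)
    return sbert_sim

# e.g. penalize(0.52, 0.98) -> 0.052, while penalize(0.52, 0.10) -> 0.52
```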
### System Score
To get a system's overall score on a dataset by averaging sentence-level FENSE, you can use `eval_system.py`, with your system outputs prepared in the same format as `test_data/audiocaps_cands.csv` or `test_data/clotho_cands.csv`.
For the AudioCaps test set:
```bash
python eval_system.py --device cuda --dataset audiocaps --cands_dir ./test_data/audiocaps_cands.csv
```

For the Clotho evaluation set:
```bash
python eval_system.py --device cuda --dataset clotho --cands_dir ./test_data/clotho_cands.csv
```
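
If you prefer to stay in Python rather than use the CLI, a system-level score can also be obtained by averaging sentence-level FENSE yourself. The sketch below uses only the `sentence_score` call from the Single Sentence example; the CSV path and its `prediction`/`reference` columns are hypothetical placeholders, so adapt them to your own data (or to the `test_data` files above).

```python
import pandas as pd
from fense.evaluator import Evaluator

# Minimal sketch: average sentence-level FENSE over a candidate file,
# reusing the sentence_score API shown in the Single Sentence example.
# "my_system_outputs.csv" and its "prediction"/"reference" columns are
# hypothetical; adjust the path and column names to your own data layout.
evaluator = Evaluator(device='cpu',
                      sbert_model='paraphrase-MiniLM-L6-v2',
                      echecker_model='echecker_clotho_audiocaps_tiny')

df = pd.read_csv("my_system_outputs.csv")
penalized_scores = []
for _, row in df.iterrows():
    _, _, penalized = evaluator.sentence_score(
        row["prediction"], [row["reference"]], return_error_prob=True
    )
    penalized_scores.append(penalized)

print(f"System-level FENSE: {sum(penalized_scores) / len(penalized_scores):.4f}")
```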
## Performance Benchmark

We benchmark FENSE with different choices of SBERT model and error detector on the two benchmark datasets, AudioCaps-Eval and Clotho-Eval. (*) marks the combination reported in the paper.
### AudioCaps-Eval
| SBERT | echecker | HC | HI | HM | MM | total |
|-------|-------|------|------|------|------|--------|
| paraphrase-MiniLM-L6-v2 | none | 62.1 | 98.8 | 93.7 | 75.4 | 80.4 |
| paraphrase-MiniLM-L6-v2 | tiny | 57.6 | 94.7 | 89.5 | 82.6 | 82.3 |
| paraphrase-MiniLM-L6-v2 | base | 62.6 | 98 | 82.5 | 85.4 | 85.5 |
| paraphrase-TinyBERT-L6-v2 | none | 64 | 99.2 | 92.5 | 73.6 | 79.6 |
| paraphrase-TinyBERT-L6-v2 | tiny | 58.6 | 95.1 | 88.3 | 82.2 | 82.1 |
| paraphrase-TinyBERT-L6-v2 | base | 64.5 | 98.4 | 91.6 | 84.6 | 85.3(*) |
| paraphrase-mpnet-base-v2 | none | 63.1 | 98.8 | 94.1 | 74.1 | 80.1 |
| paraphrase-mpnet-base-v2 | tiny | 58.1 | 94.3 | 90 | 83.2 | 82.7 |
| paraphrase-mpnet-base-v2 | base | 63.5 | 98 | 92.5 | 85.9 | 85.9 |

### Clotho-Eval
| SBERT | echecker | HC | HI | HM | MM | total |
|-------|-------|------|------|------|------|--------|
| paraphrase-MiniLM-L6-v2 | none | 59.5 | 95.1 | 76.3 | 66.2 | 71.3 |
| paraphrase-MiniLM-L6-v2 | tiny | 56.7 | 90.6 | 79.3 | 70.9 | 73.3 |
| paraphrase-MiniLM-L6-v2 | base | 60 | 94.3 | 80.6 | 72.3 | 75.3 |
| paraphrase-TinyBERT-L6-v2 | none | 60 | 95.5 | 75.9 | 66.9 | 71.8 |
| paraphrase-TinyBERT-L6-v2 | tiny | 59 | 93 | 79.7 | 71.5 | 74.4 |
| paraphrase-TinyBERT-L6-v2 | base | 60.5 | 94.7 | 80.2 | 72.8 | 75.7(*) |
| paraphrase-mpnet-base-v2 | none | 56.2 | 96.3 | 77.6 | 65.2 | 70.7 |
| paraphrase-mpnet-base-v2 | tiny | 54.8 | 91.8 | 80.6 | 70.1 | 73 |
| paraphrase-mpnet-base-v2 | base | 57.1 | 95.5 | 81.9 | 71.6 | 74.9 |

## Reference
If you use FENSE in your research, please cite:
```
@misc{zhou2021audio,
      title={Can Audio Captions Be Evaluated with Image Caption Metrics?},
      author={Zelin Zhou and Zhiling Zhang and Xuenan Xu and Zeyu Xie and Mengyue Wu and Kenny Q. Zhu},
      year={2021},
      eprint={2110.04684},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}
```