Evaluation tools for image captioning, including BLEU, ROUGE-L, CIDEr, METEOR, and SPICE scores.

https://github.com/aldenhovel/bleu-rouge-meteor-cider-spice-eval4imagecaption

# About Image Captioning Metrics

1. **BLEU**

> Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). Bleu: a method for automatic evaluation of machine translation. In *Proceedings of meeting of the association for computational linguistics* (pp. 311–318).

BLEU has frequently been reported as correlating well with human judgement, and remains a benchmark for the assessment of any new evaluation metric. There are, however, a number of criticisms that have been voiced. It has been noted that, although in principle capable of evaluating translations of any language, BLEU cannot, in its present form, deal with languages lacking word boundaries. It has also been argued that although BLEU has significant advantages, there is no guarantee that an increase in BLEU score is an indicator of improved translation quality. A quick sentence-level illustration is sketched after this list.

2. **ROUGE-L**

> Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. In *Proceedings of meeting of the association for computational linguistics* (pp. 74–81).

ROUGE-L: Longest Common Subsequence (LCS) based statistics. The [longest common subsequence](https://en.wikipedia.org/wiki/Longest_common_subsequence_problem) naturally takes sentence-level structure similarity into account and automatically identifies the longest co-occurring in-sequence n-grams.

3. **METEOR**

> Banerjee, S., & Lavie, A. (2005). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In *Proceedings of meeting of the association for computational linguistics* (pp. 65–72).

The metric is based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision. It also has several features that are not found in other metrics, such as stemming and synonymy matching, along with the standard exact word matching. The metric was designed to fix some of the problems found in the more popular BLEU metric, and also produce good correlation with human judgement at the sentence or segment level. This differs from the BLEU metric in that BLEU seeks correlation at the corpus level.

4. **CIDEr**

> Vedantam, R., Zitnick, C. L., & Parikh, D. (2015). CIDEr: Consensus-based image description evaluation. In *Proceedings of IEEE conference on computer vision and pattern recognition* (pp. 4566–4575).

5. **SPICE**

> Anderson, P., Fernando, B., Johnson, M., & Gould, S. (2016). SPICE: Semantic propositional image caption evaluation. In *Proceedings European conference on computer vision* (pp. 382–398).
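
To make the BLEU discussion above concrete, here is a quick sentence-level illustration. It uses NLTK purely as an aside (NLTK is not a dependency of this repo, and the scores it produces are not comparable to the corpus-level `pycocoevalcap` numbers reported below):

```python
# Illustration only: sentence-level BLEU on a single candidate caption.
# Assumes `pip install nltk`; this repo itself scores captions via pycocoevalcap.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a man riding a motorcycle on a dirt road".split(),
    "a man rides a motor bike in the countryside".split(),
]
candidate = "a man riding a motor bike on a road".split()

# Smoothing avoids a zero score when higher-order n-grams have no matches,
# which is common for short single sentences.
score = sentence_bleu(
    references, candidate,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)
print(f"sentence-level BLEU-4: {score:.3f}")
```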

# Installation

Please check [salaniz](https://github.com/salaniz)/**[pycocoevalcap](https://github.com/salaniz/pycocoevalcap)** for how to install `pycocotools` and `pycocoevalcap`.

># Microsoft COCO Caption Evaluation
>
>Evaluation codes for MS COCO caption generation.
>
>## Description
>
>This repository provides Python 3 support for the caption evaluation metrics used for the MS COCO dataset.
>
>The code is derived from the original repository that supports Python 2.7: https://github.com/tylin/coco-caption.
>Caption evaluation depends on the COCO API that natively supports Python 3.
>
>## Requirements
>
>- Java 1.8.0
>- Python 3.6
>
>## Installation
>
>To install `pycocoevalcap` and the `pycocotools` dependency (https://github.com/cocodataset/cocoapi), run:
>
>```bash
>pip install pycocoevalcap
>```
>
>## Setup
>
>- SPICE requires the download of [Stanford CoreNLP 3.6.0](http://stanfordnlp.github.io/CoreNLP/index.html) code and models. This will be done automatically the first time the SPICE evaluation is performed.
>- Note: SPICE will try to create a cache of parsed sentences in ./spice/cache/. This dramatically speeds up repeated evaluations. The cache directory can be moved by setting 'CACHE_DIR' in ./spice. In the same file, caching can be turned off by removing the '-cache' argument to 'spice_cmd'.
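
After installing, a quick import check (a sketch, assuming the `pip install` above succeeded in the current environment) confirms that both packages are available. Note that Java is still required at evaluation time for the tokenizer, METEOR, and SPICE:

```python
# Sanity check: both packages import; Java is only needed later, when
# the PTBTokenizer, METEOR, and SPICE actually run.
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

print("pycocotools and pycocoevalcap are importable")
```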

# How To Use

This repo is mainly based on the code from `pycocotools` and `pycocoevalcap`, which is designed for evaluating MS COCO caption generation. Here the API has been simplified so that the evaluation tool can also be applied to other caption datasets, such as Flickr8k, Flickr30k, or any other dataset.

Two `json` files holding the **references and candidate captions** are required in `example/`. `example/main.py` reads these two `json` files, computes the scores automatically, and prints them.

Example `references.json` and `captions.json` (candidate captions) files are provided in `example/`. To generate your own, see the demo below:

```python
# Collect all references from the dataset as `references`: dict
# Collect all captions generated by the model as `captions`: dict

references = {
    "1": ["this is a tree", "this is an apple", ...],
    "2": ["a man is sitting", "a man in the street", ...],
    # ......
}

captions = {
    "1": ["this is a big tree"],
    "2": ["a man is sitting"],
    # ......
}
```

```python
# Save them as correctly formatted json files
import json

# Candidate captions: a list of {'image_id': ..., 'caption': ...} entries.
new_cap = []
for k, v in captions.items():
    new_cap.append({'image_id': k, 'caption': v[0]})

# References: COCO-style dict with 'images' and 'annotations'.
new_ref = {'images': [], 'annotations': []}
for k, refs in references.items():
    new_ref['images'].append({'id': k})
    for ref in refs:
        new_ref['annotations'].append({'image_id': k, 'id': k, 'caption': ref})

with open('references.json', 'w') as fgts:
    json.dump(new_ref, fgts)
with open('captions.json', 'w') as fres:
    json.dump(new_cap, fres)
```
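
Before running the evaluation, it can help to reload the two files and eyeball a few entries (an optional check, not part of the repo):

```python
# Optional: reload the saved files and print a few entries to confirm the format.
import json

with open('references.json') as f:
    ref = json.load(f)
with open('captions.json') as f:
    cap = json.load(f)

print(len(ref['images']), 'images,', len(ref['annotations']), 'reference captions')
print(len(cap), 'candidate captions')
print(cap[0])  # e.g. {'image_id': '1', 'caption': 'this is a big tree'}
```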

Then check that the saved `references.json` and `captions.json` follow the same format as the demo files `references_example.json` and `captions_example.json`:

- `references.json`

```
{
  "images": [
    {"id": "0"},
    {"id": "1"},
    ......
  ],
  "annotations": [
    {
      "image_id": "0",
      "id": "0",
      "caption": "A man with a red helmet on a small moped on a dirt road. "
    },
    {
      "image_id": "0",
      "id": "0",
      "caption": "Man riding a motor bike on a dirt road on the countryside."
    },
    {
      "image_id": "0",
      "id": "0",
      "caption": "A man riding on the back of a motorcycle."
    },
    {
      "image_id": "0",
      "id": "0",
      "caption": "A dirt path with a young person on a motor bike rests to the foreground of a verdant area with a bridge and a background of cloud-wreathed mountains. "
    },
    {
      "image_id": "0",
      "id": "0",
      "caption": "A man in a red shirt and a red hat is on a motorcycle on a hill side."
    },
    {
      "image_id": "1",
      "id": "1",
      "caption": "A woman wearing a net on her head cutting a cake. "
    },
    {
      "image_id": "1",
      "id": "1",
      "caption": "A woman cutting a large white sheet cake."
    },
    {
      "image_id": "1",
      "id": "1",
      "caption": "A woman wearing a hair net cutting a large sheet cake."
    },
    {
      "image_id": "1",
      "id": "1",
      "caption": "there is a woman that is cutting a white cake"
    },
    {
      "image_id": "1",
      "id": "1",
      "caption": "A woman marking a cake with the back of a chef's knife. "
    },
    ......
  ]
}
```

- `captions.json`
```
[
  {
    "image_id": "0",
    "caption": "a man standing on the side of a road ."
  },
  {
    "image_id": "1",
    "caption": "a person standing in front of a mirror ."
  },
  ......
]
```
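
For orientation, `example/main.py` follows the standard `pycocoevalcap` evaluation flow. Roughly (a sketch, not the repo's exact code, with file names taken from the examples above):

```python
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

# Load the COCO-style references and the candidate captions.
coco = COCO('references.json')
coco_result = coco.loadRes('captions.json')

# Restrict evaluation to the images that actually have candidate captions.
coco_eval = COCOEvalCap(coco, coco_result)
coco_eval.params['image_id'] = coco_result.getImgIds()
coco_eval.evaluate()

# coco_eval.eval maps metric names to scores, e.g. {'Bleu_1': ..., 'CIDEr': ...}
for metric, score in coco_eval.eval.items():
    print(f'{metric}: {score:.3f}')
```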

Then run `main.py` from the `example/` directory:

```bash
python main.py
```

Terminal output:

```
>>
loading annotations into memory...
Done (t=0.01s)
creating index...
index created!
Loading and preparing results...
DONE (t=0.00s)
creating index...
index created!
tokenization...
PTBTokenizer tokenized 72388 tokens at 846674.96 tokens per second.
PTBTokenizer tokenized 12514 tokens at 290819.68 tokens per second.
setting up scorers...
computing Bleu score...
{'testlen': 10476, 'reflen': 10274, 'guess': [10476, 9476, 8476, 7476], 'correct': [7043, 3379, 1518, 669]}
ratio: 1.0196612809031516
Bleu_1: 0.672
Bleu_2: 0.490
Bleu_3: 0.350
Bleu_4: 0.249
computing METEOR score...
METEOR: 0.201
computing Rouge score...
ROUGE_L: 0.472
computing CIDEr score...
CIDEr: 0.457
computing SPICE score...
Parsing reference captions
Initiating Stanford parsing pipeline
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.TokenizerAnnotator - TokenizerAnnotator: No tokenizer type provided. Defaulting to PTBTokenizer.
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator parse
[main] INFO edu.stanford.nlp.parser.common.ParserGrammar - Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ...
done [0.2 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [0.8 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [0.6 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [0.3 sec].
Threads( StanfordCoreNLP ) [01:03.436 minutes]
Parsing test captions
Threads( StanfordCoreNLP ) [3.322 seconds]
SPICE evaluation took: 1.182 min
SPICE: 0.137
Bleu_1: 0.672
Bleu_2: 0.490
Bleu_3: 0.350
Bleu_4: 0.249
METEOR: 0.201
ROUGE_L: 0.472
CIDEr: 0.457
SPICE: 0.137
```
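
If you want to skip the intermediate `json` files altogether, the individual scorer classes bundled with `pycocoevalcap` can also be called directly on caption dicts. A minimal sketch (module paths are as in the current `pycocoevalcap` release and may change; METEOR and SPICE are omitted here but follow the same `compute_score` interface):

```python
from pycocoevalcap.tokenizer.ptbtokenizer import PTBTokenizer
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

# Both dicts map an image id to a list of {'caption': ...} entries.
gts = {"1": [{"caption": "this is a tree"}, {"caption": "this is an apple"}],
       "2": [{"caption": "a man is sitting"}, {"caption": "a man in the street"}]}
res = {"1": [{"caption": "this is a big tree"}],
       "2": [{"caption": "a man is sitting"}]}

# The Java-based PTBTokenizer lowercases and tokenizes the captions;
# the scorers expect the tokenized strings it returns.
tokenizer = PTBTokenizer()
gts_tok = tokenizer.tokenize(gts)
res_tok = tokenizer.tokenize(res)

for scorer, name in [(Bleu(4), ["Bleu_1", "Bleu_2", "Bleu_3", "Bleu_4"]),
                     (Rouge(), "ROUGE_L"),
                     (Cider(), "CIDEr")]:
    score, _ = scorer.compute_score(gts_tok, res_tok)
    print(name, score)
```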

# Reference

- [tylin](https://github.com/tylin)/**[coco-caption](https://github.com/tylin/coco-caption)**
- [cocodataset](https://github.com/cocodataset)/**[cocoapi](https://github.com/cocodataset/cocoapi)**
- [salaniz](https://github.com/salaniz)/**[pycocoevalcap](https://github.com/salaniz/pycocoevalcap)**