# Conceptual Captions Dataset

Conceptual Captions is a dataset containing (image-URL, caption) pairs designed
for the training and evaluation of machine learned image captioning systems.

## Downloads
See the downloads page on the Conceptual Captions website for details.

## Motivation

Automatic image captioning is the task of producing a natural-language
utterance (usually a sentence) that correctly reflects the visual content of an
image. Up to this point, the resource most used for this task was the
[MS-COCO dataset](http://cocodataset.org), containing around 120,000
images and 5-way image-caption annotations (produced by paid annotators).

Google's Conceptual Captions dataset has more than 3 million images, paired
with natural-language captions. In contrast with the curated style of the
MS-COCO images, Conceptual Captions images and their raw descriptions are
harvested from the web, and therefore represent a wider variety of styles. The
raw descriptions are harvested from the Alt-text HTML attribute associated with
web images. We developed an automatic pipeline that extracts, filters, and
transforms candidate image/caption pairs, with the goal of achieving a balance
of cleanliness, informativeness, fluency, and learnability of the resulting
captions.

More details are available in this paper (please cite the paper if you use or discuss this dataset in your work):


```
@inproceedings{sharma2018conceptual,
  title = {Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning},
  author = {Sharma, Piyush and Ding, Nan and Goodman, Sebastian and Soricut, Radu},
  booktitle = {Proceedings of ACL},
  year = {2018},
}
```

Additionally, we provide machine-generated labels for a subset of 2.0M images from the Conceptual Captions training set.
Please cite this paper if you use the image labels in your work.


```
@article{ng2020understanding,
  title = {Understanding Guided Image Captioning Performance across Domains},
  author = {Edwin G. Ng and Bo Pang and Piyush Sharma and Radu Soricut},
  journal = {arXiv preprint arXiv:2012.02339},
  year = {2020}
}
```

## Dataset Description

The Conceptual Captions dataset release contains two splits: train (~3.3M examples) and validation (~16K examples).
See Table 1 below for more details.

Table 1: Dataset stats. Mean, StdDev, and Median refer to tokens per caption.

| Split         | Examples  | Unique Tokens | Mean | StdDev | Median |
| ------------- | --------- | ------------- | ---- | ------ | ------ |
| Train         | 3,318,333 | 51,201        | 10.3 | 4.5    | 9.0    |
| Valid         | 15,840    | 10,900        | 10.4 | 4.7    | 9.0    |
| Test (Hidden) | 12,559    | 9,645         | 10.2 | 4.6    | 9.0    |
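
The statistics above can be recomputed from the released TSV files (the column layout is described under "Data Format for Conceptual Captions" below). A minimal Python sketch, assuming whitespace-tokenized captions in the first column; the file name in the comment is a placeholder:

```python
import statistics

def caption_stats(tsv_path):
    """Recompute Table 1-style statistics from one Conceptual Captions TSV split."""
    lengths, vocab = [], set()
    with open(tsv_path, encoding="utf-8") as f:
        for line in f:
            caption = line.rstrip("\n").split("\t")[0]  # column 1: tokenized, lowercased caption
            tokens = caption.split()                    # captions are whitespace-tokenized
            lengths.append(len(tokens))
            vocab.update(tokens)
    return {
        "examples": len(lengths),
        "unique_tokens": len(vocab),
        "mean": round(statistics.mean(lengths), 1),
        "stdev": round(statistics.stdev(lengths), 1),
        "median": statistics.median(lengths),
    }

# print(caption_stats("validation.tsv"))  # placeholder local file name
```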

## Hidden Test set

We are not releasing the official test split (~12.5K examples).
Instead, we are hosting a competition dedicated to supporting submissions and evaluations of model outputs on this blind test set.

We strongly believe that this setup has several advantages: a) it allows the evaluation to be done on an unbiased, large set of images; b) it keeps the test set completely blind and eliminates suspicions of fitting to the test set, cheating, etc.; c) it provides a clean overall setup for advancing the state of the art (SoTA) on this task, including reporting reproducible results for paper publications.

## Image Labels

The image labels were obtained using the Google Cloud Vision API. Each image label has a machine-generated identifier (MID) corresponding to the label's Google Knowledge Graph entry and a confidence score for its presence in the image.
These labels were obtained by running the same model and are presented in the same fashion as the image labels made available for the T2 Guiding dataset at https://github.com/google-research-datasets/T2-Guiding.
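
For reference, labels of this kind can be requested from the Cloud Vision API with its Python client library. The sketch below is illustrative only and is not the pipeline used to produce the released labels; the image URL is a placeholder:

```python
# pip install google-cloud-vision  (requires Google Cloud credentials to be configured)
from google.cloud import vision

def detect_labels(image_url):
    """Return (description, MID, confidence) triples for one image URL."""
    client = vision.ImageAnnotatorClient()
    image = vision.Image()
    image.source.image_uri = image_url             # the API fetches the image itself
    response = client.label_detection(image=image)
    return [(label.description, label.mid, label.score)
            for label in response.label_annotations]

# print(detect_labels("https://example.com/some_image.jpg"))  # placeholder URL
```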

## Data Format for Conceptual Captions

The Conceptual Captions training and validation sets are provided as TSV (tab-separated values) text files with the following columns:

Table 2: Columns in Train/Validation TSV files.

| Column | Description |
| -------- | -------------------------------------------------------------------------------- |
| 1 | Caption. The text has been tokenized and lowercased. |
| 2 | Image URL |
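
Since the release contains image URLs rather than image bytes, the images must be fetched separately. A minimal sketch of reading the TSV and downloading a few images, assuming well-formed two-column rows; a real pipeline should add timeouts, retries, and parallel fetching, and expect some URLs to be dead:

```python
import os
import urllib.request

def download_images(tsv_path, out_dir, limit=10):
    """Read (caption, image URL) rows and fetch the first few images."""
    os.makedirs(out_dir, exist_ok=True)
    fetched = []
    with open(tsv_path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= limit:
                break
            caption, url = line.rstrip("\n").split("\t")  # column 1: caption, column 2: URL
            path = os.path.join(out_dir, f"{i:07d}.jpg")
            try:
                urllib.request.urlretrieve(url, path)     # some URLs may no longer resolve
                fetched.append((caption, path))
            except Exception:
                continue                                  # skip images that cannot be fetched
    return fetched

# fetched = download_images("train.tsv", "images/")  # placeholder file and directory names
```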

## Data Format for Image Labels

The image labels for a 2.0M subset of the training set are provided as TSV (tab-separated values) text files with the following columns:

Table 3: Columns in Image Labels TSV files.

| Column | Description |
| -------- | -------------------------------------------------------------------------------- |
| 1 | Caption. The text has been tokenized and lowercased. |
| 2 | Image URL |
| 3 | Image labels. Comma separated list in descending order of confidence. |
| 4 | MIDs. Comma separated list corresponding to the image labels list. |
| 5 | Confidence scores. Comma separated list corresponding to the image labels list. |
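
A minimal sketch for parsing this file into per-image records, assuming the five tab-separated columns above with aligned comma-separated lists; the file name in the comment is a placeholder:

```python
def read_image_labels(tsv_path):
    """Yield dicts with caption, URL, and aligned (label, MID, confidence) triples."""
    with open(tsv_path, encoding="utf-8") as f:
        for line in f:
            caption, url, labels, mids, scores = line.rstrip("\n").split("\t")
            yield {
                "caption": caption,
                "image_url": url,
                "labels": list(zip(
                    labels.split(","),                      # column 3: label names
                    mids.split(","),                        # column 4: Knowledge Graph MIDs
                    [float(s) for s in scores.split(",")],  # column 5: confidence scores
                )),
            }

# for record in read_image_labels("image_labels.tsv"):  # placeholder file name
#     print(record["image_url"], record["labels"][:3])
```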

## Contact us

If you have a technical question regarding the dataset, code or publication, please create an issue in this repository.
This is the fastest way to reach us.

If you would like to share feedback or report concerns, please email us at [email protected]