An open API service indexing awesome lists of open source software.

https://github.com/toloka/wsdmcup2023

Toloka Visual Question Answering Challenge at WSDM Cup 2023
https://github.com/toloka/wsdmcup2023

challenge competition shared-task toloka visual-question-answering wsdmcup wsdmcup2023

Last synced: 4 months ago
JSON representation

Toloka Visual Question Answering Challenge at WSDM Cup 2023

Awesome Lists containing this project

README

          

# Toloka Visual Question Answering Challenge at WSDM Cup 2023

We challenge you with a visual question answering task! **Given an image and a textual question, draw the bounding box around the object correctly responding to that question.**

| Question | Image and Answer |
| --- | --- |
| What do you use to hit the ball? | What do you use to hit the ball? |
| What do people use for cutting? | What do people use for cutting? |
| What do we use to support the immune system and get vitamin C? | What do we use to support the immune system and get vitamin C? |

## Links

- **Competition:**
- **CodaLab:**
- **Dataset:**

## Citation

Please cite the challenge results or dataset description as follows.

- Ustalov D., Pavlichenko N., Koshelev S., Likhobaba D., and Smirnova A. [Toloka Visual Question Answering Benchmark](https://arxiv.org/abs/2309.16511). 2023. arXiv: [2309.16511 [cs.CV]](https://arxiv.org/abs/2309.16511).

```bibtex
@inproceedings{TolokaWSDMCup2023,
author = {Ustalov, Dmitry and Pavlichenko, Nikita and Koshelev, Sergey and Likhobaba, Daniil and Smirnova, Alisa},
title = {{Toloka Visual Question Answering Benchmark}},
year = {2023},
eprint = {2309.16511},
eprinttype = {arxiv},
eprintclass = {cs.CV},
language = {english},
}
```

## Dataset

Our dataset consists of the images associated with textual questions. One entry (instance) in our dataset is a question-image pair labeled with the ground truth coordinates of a bounding box containing the visual answer to the given question. The images were obtained from a CC BY-licensed subset of the Microsoft Common Objects in Context dataset, [MS COCO](https://cocodataset.org/). All data labeling was performed on the Toloka crowdsourcing platform, . We release the entire dataset under the CC BY license:

- Zenodo:
- Hugging Face Hub:
- Kaggle:
- GitHub Packages:

Licensed under the Creative Commons Attribution 4.0 License. See LICENSE-CC-BY.txt file for more details.

## Zero-Shot Baselines

We provide zero-shot baselines in `zeroshot_baselines` folder. All notebooks are made to run in Colab

#### YOLOR + CLIP

This baseline was provided to participants of WSDM Cup 2023 Challenge. First, it uses a detection model, YOLOR, to generate candidate rectangles. Then, it applies CLIP to measure the similarity between the question and a part of the image bounded by each candidate rectangle. To make a prediction, it uses the candidate with the highest similarity. This baseline method achieves **IoU = 0.21** on private test subset.

Licensed under the Apache License, Version 2.0. See LICENSE-APACHE.txt file for more details.

#### OVSeg + SAM

Another zero-shot baseline, called OVSeg, utilizes SAM as a proposal generator instead of MaskFormer in the original setup. This approach achieves **IoU = 0.35** on the private test subset.

Licensed under the Creative Commons Attribution 4.0 License.

#### OFA + SAM

Last one is primarily based on OFA, combined with bounding box correction using SAM. To solve the task, we followed a two-step zero-shot setup.

First, we address the Visual Question Answering, where the model is given a prompt `{question} Name an object in the picture` along with an image. The model provides the name of a clue object to the question.

In the second step, an object corresponding to the answer from the previous step is annotated using the prompt `which region does the text "{answer}" describe?`, resulting in IoU = 0.42.

Subsequently, with the obtained bounding boxes, SAM generates the corresponding masks for the annotated object, which are then transformed into bounding boxes. This enabled us to achieve **IoU = 0.45** with this baseline.

Licensed under the Apache License, Version 2.0.

## Crowdsourcing Baseline

We evaluated how well non-expert human annotators can solve our task by running a dedicated round of crowdsourcing annotations on the [Toloka](https://toloka.ai/) crowdsourcing platform. We found them to tackle this task successfully without knowing the ground truth. On all three subsets of our data, the average IoU value was 0.87 ± 0.01, which we consider as a *strong human baseline* for our task. Krippendorff's α coefficients for the public test was 0.68 and for the private test was 0.66, showing the decent agreement between the responses; we used 1 − IoU as the distance metric when calculating the α coefficient. We selected the bounding boxes which were the most similar to the ground truth data to indicate the upper bound of non-expert annotation quality; `*_crowd_baseline.csv` files contain these responses.

Licensed under the Creative Commons Attribution 4.0 License. See LICENSE-CC-BY.txt file for more details.

## Reproduction

The final score will be evaluated on the private test dataset during Reproduction phase. We kindly ask you to create a docker image and share it with us by December 19th 23:59 AoE in [this form](https://docs.google.com/forms/d/e/1FAIpQLSfWt-c2OvfXPcOQ-J7EmIh1AOAjiojH7RT33bRgchI4evtvLw/viewform?usp=sf_link). We put an instruction how to create a docker image in `reproduction` directory.

We will run your solution on a machine with one Nvidia A100 80 GB GPU, 16 CPU cores, and 200 GB of RAM. Your Docker image must perform the inference in at most 3 hours on this machine. In other words, the docker run command must finish in 3 hours.

Don't hesitate to contact us at research@toloka.ai if you have any questions or suggestions.