https://github.com/toloka/wsdmcup2023
Toloka Visual Question Answering Challenge at WSDM Cup 2023
- Host: GitHub
- URL: https://github.com/toloka/wsdmcup2023
- Owner: Toloka
- License: apache-2.0
- Created: 2022-09-07T11:44:10.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2024-05-01T14:57:20.000Z (about 2 years ago)
- Last Synced: 2025-09-05T03:26:37.758Z (9 months ago)
- Topics: challenge, competition, shared-task, toloka, visual-question-answering, wsdmcup, wsdmcup2023
- Language: Jupyter Notebook
- Homepage: https://toloka.ai/challenges/wsdm2023/
- Size: 5.24 MB
- Stars: 31
- Watchers: 3
- Forks: 7
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE-APACHE.txt
- Citation: CITATION.cff
# Toloka Visual Question Answering Challenge at WSDM Cup 2023
We challenge you with a visual question answering task! **Given an image and a textual question, draw the bounding box around the object correctly responding to that question.**
| Question | Image and Answer |
| --- | --- |
| What do you use to hit the ball? | *(image not preserved)* |
| What do people use for cutting? | *(image not preserved)* |
| What do we use to support the immune system and get vitamin C? | *(image not preserved)* |
## Links
- **Competition:**
- **CodaLab:**
- **Dataset:**
## Citation
Please cite the challenge results or dataset description as follows.
- Ustalov D., Pavlichenko N., Koshelev S., Likhobaba D., and Smirnova A. [Toloka Visual Question Answering Benchmark](https://arxiv.org/abs/2309.16511). 2023. arXiv: [2309.16511 [cs.CV]](https://arxiv.org/abs/2309.16511).
```bibtex
@inproceedings{TolokaWSDMCup2023,
author = {Ustalov, Dmitry and Pavlichenko, Nikita and Koshelev, Sergey and Likhobaba, Daniil and Smirnova, Alisa},
title = {{Toloka Visual Question Answering Benchmark}},
year = {2023},
eprint = {2309.16511},
eprinttype = {arxiv},
eprintclass = {cs.CV},
language = {english},
}
```
## Dataset
Our dataset consists of images paired with textual questions. One entry (instance) in our dataset is a question-image pair labeled with the ground truth coordinates of a bounding box containing the visual answer to the given question. The images were obtained from a CC BY-licensed subset of the Microsoft Common Objects in Context dataset, [MS COCO](https://cocodataset.org/). All data labeling was performed on the [Toloka](https://toloka.ai/) crowdsourcing platform. We release the entire dataset under the CC BY license:
- Zenodo:
- Hugging Face Hub:
- Kaggle:
- GitHub Packages:
Licensed under the Creative Commons Attribution 4.0 License. See LICENSE-CC-BY.txt file for more details.
## Zero-Shot Baselines
We provide zero-shot baselines in the `zeroshot_baselines` folder. All notebooks are designed to run in Colab.
#### YOLOR + CLIP
This baseline was provided to the participants of the WSDM Cup 2023 Challenge. First, it uses a detection model, YOLOR, to generate candidate rectangles. Then, it applies CLIP to measure the similarity between the question and the part of the image bounded by each candidate rectangle. As its prediction, it returns the candidate with the highest similarity. This baseline achieves **IoU = 0.21** on the private test subset.
Licensed under the Apache License, Version 2.0. See LICENSE-APACHE.txt file for more details.
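The selection step of this baseline can be illustrated with a minimal sketch. It assumes that the question embedding and one embedding per candidate crop have already been computed (in the baseline these would come from CLIP, and the candidate boxes from YOLOR). The function names, the `(left, top, right, bottom)` box format, and the use of cosine similarity are assumptions made for illustration, not the code from the notebooks.

```python
import math
from typing import List, Sequence, Tuple

# Assumed box format: (left, top, right, bottom) in pixels.
Box = Tuple[float, float, float, float]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_best_box(
    question_embedding: Sequence[float],
    boxes: List[Box],
    crop_embeddings: List[Sequence[float]],
) -> Tuple[Box, float]:
    """Return the candidate box whose crop is most similar to the question.

    Hypothetical helper: the embeddings are assumed to be precomputed,
    one per candidate box, in the same order as `boxes`.
    """
    if not boxes or len(boxes) != len(crop_embeddings):
        raise ValueError("need one embedding per candidate box")
    scores = [cosine_similarity(question_embedding, e) for e in crop_embeddings]
    best = max(range(len(boxes)), key=scores.__getitem__)
    return boxes[best], scores[best]
```

With toy two-dimensional embeddings, `select_best_box([1.0, 0.0], boxes, [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]])` returns the second box, because its crop embedding points in nearly the same direction as the question embedding.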
#### OVSeg + SAM
Another zero-shot baseline, OVSeg, uses SAM as the proposal generator in place of the MaskFormer used in the original OVSeg setup. This approach achieves **IoU = 0.35** on the private test subset.
Licensed under the Creative Commons Attribution 4.0 License.
#### OFA + SAM
The last baseline is primarily based on OFA, combined with bounding box correction using SAM. To solve the task, we followed a two-step zero-shot setup.
First, we address visual question answering: the model is given the prompt `{question} Name an object in the picture` along with an image, and it returns the name of an object that serves as a clue to the question.
In the second step, the object corresponding to the answer from the previous step is located using the prompt `which region does the text "{answer}" describe?`, resulting in IoU = 0.42.
Finally, given the obtained bounding boxes, SAM generates masks for the located object, and these masks are then converted back into bounding boxes. This enabled us to achieve **IoU = 0.45** with this baseline.
Licensed under the Apache License, Version 2.0.
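Two pieces of this pipeline are simple enough to sketch without any model: filling in the two prompts quoted above, and turning a binary mask back into a bounding box. The sketch below assumes the mask is a row-major grid of 0/1 values and that the box is reported as inclusive pixel indices `(left, top, right, bottom)`. The helper name `mask_to_box` and this convention are assumptions for illustration, and the calls to OFA and SAM are deliberately left out.

```python
from typing import Optional, Sequence, Tuple

# Prompt templates quoted in the description of this baseline.
VQA_PROMPT = "{question} Name an object in the picture"
GROUNDING_PROMPT = 'which region does the text "{answer}" describe?'


def mask_to_box(mask: Sequence[Sequence[int]]) -> Optional[Tuple[int, int, int, int]]:
    """Tightest box around the non-zero pixels of a binary mask.

    Hypothetical helper. Returns (left, top, right, bottom) as inclusive
    pixel indices, or None when the mask is empty.
    """
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for row in mask for x, value in enumerate(row) if value]
    if not rows:
        return None
    return min(cols), min(rows), max(cols), max(rows)


# Step 1 prompt for a question, step 2 prompt for a returned answer.
step_one = VQA_PROMPT.format(question="What do you use to hit the ball?")
step_two = GROUNDING_PROMPT.format(answer="bat")
```

Here `"bat"` is only a made-up answer used to show how the second prompt is filled in.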
## Crowdsourcing Baseline
We evaluated how well non-expert human annotators can solve our task by running a dedicated round of crowdsourcing annotations on the [Toloka](https://toloka.ai/) crowdsourcing platform. We found that they solve this task successfully without knowing the ground truth. On all three subsets of our data, the average IoU value was 0.87 ± 0.01, which we consider a *strong human baseline* for our task. Krippendorff's α was 0.68 on the public test and 0.66 on the private test, showing decent agreement between the responses; we used 1 − IoU as the distance metric when calculating α. To indicate the upper bound of non-expert annotation quality, we selected the bounding boxes that were the most similar to the ground truth data; the `*_crowd_baseline.csv` files contain these responses.
Licensed under the Creative Commons Attribution 4.0 License. See LICENSE-CC-BY.txt file for more details.
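The quantities used in this section can be sketched as follows: the IoU of two boxes, the 1 − IoU distance, and the choice of the crowd response closest to the ground truth. The `(left, top, right, bottom)` box format and all function names are assumptions for illustration; this is not the evaluation code used for the reported numbers.

```python
from typing import List, Tuple

# Assumed box format: (left, top, right, bottom) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_distance(a: Box, b: Box) -> float:
    """Distance between two responses, defined as 1 - IoU."""
    return 1.0 - iou(a, b)


def best_crowd_box(truth: Box, responses: List[Box]) -> Box:
    """Pick the crowd response with the highest IoU against the ground truth."""
    if not responses:
        raise ValueError("no responses to choose from")
    return max(responses, key=lambda response: iou(response, truth))


def mean_iou(truths: List[Box], predictions: List[Box]) -> float:
    """Average IoU over aligned lists of ground truth and predicted boxes."""
    if not truths or len(truths) != len(predictions):
        raise ValueError("need one prediction per ground truth box")
    return sum(iou(t, p) for t, p in zip(truths, predictions)) / len(truths)
```

For example, two 10 × 10 boxes shifted by 5 pixels horizontally overlap in a 5 × 10 region, so their IoU is 50 / 150 = 1/3 and their distance is 2/3.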
## Reproduction
The final score will be evaluated on the private test dataset during the Reproduction phase. We kindly ask you to create a Docker image and share it with us by December 19th, 23:59 AoE, in [this form](https://docs.google.com/forms/d/e/1FAIpQLSfWt-c2OvfXPcOQ-J7EmIh1AOAjiojH7RT33bRgchI4evtvLw/viewform?usp=sf_link). Instructions for creating a Docker image are in the `reproduction` directory.
We will run your solution on a machine with one Nvidia A100 80 GB GPU, 16 CPU cores, and 200 GB of RAM. Your Docker image must perform the inference in at most 3 hours on this machine. In other words, the `docker run` command must finish within 3 hours.
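One way to check the time limit locally before submitting is to wrap the container run in a timeout. The sketch below is only an illustration: the image name passed in is hypothetical, the `--rm` and `--gpus all` flags are a guess at a reasonable invocation rather than the command the organizers use, and timings on your own hardware will differ from the evaluation machine. The exact command to follow is the one given in the `reproduction` directory.

```python
import subprocess
from typing import List, Sequence

# Limit stated above: the run must finish within 3 hours.
TIME_LIMIT_SECONDS = 3 * 60 * 60


def docker_run_command(image: str) -> List[str]:
    """Build an assumed `docker run` invocation for a given image name."""
    return ["docker", "run", "--rm", "--gpus", "all", image]


def finishes_in_time(
    command: Sequence[str], limit_seconds: float = TIME_LIMIT_SECONDS
) -> bool:
    """Run a command and report whether it exited successfully within the limit."""
    try:
        completed = subprocess.run(list(command), timeout=limit_seconds, check=False)
    except subprocess.TimeoutExpired:
        # The process is killed by subprocess.run when the timeout expires.
        return False
    return completed.returncode == 0
```

A local check would then look like `finishes_in_time(docker_run_command("my-team/solution:latest"))`, where the image name is a placeholder for your own.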
Don't hesitate to contact us at research@toloka.ai if you have any questions or suggestions.