https://github.com/declare-lab/sealing
[NAACL 2024] Official Implementation of paper "Self-Adaptive Sampling for Efficient Video Question Answering on Image--Text Models"
- Host: GitHub
- URL: https://github.com/declare-lab/sealing
- Owner: declare-lab
- License: mit
- Created: 2023-05-19T09:20:45.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-07-25T10:59:48.000Z (about 1 year ago)
- Last Synced: 2025-03-27T18:21:29.379Z (6 months ago)
- Topics: multimodality, naacl2024, video-question-answering, video-understanding, visual-language-models
- Language: Python
- Homepage: https://arxiv.org/pdf/2307.04192.pdf
- Size: 8.92 MB
- Stars: 11
- Watchers: 3
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# [NAACL 2024] Self-adaptive Sampling for Efficient Video Question Answering on Image--Text Models
🔥 [14/03/2024] This paper has been accepted to NAACL 2024 (Findings)!
## Introduction
This repository contains the official implementation code of the paper "[Self-adaptive Sampling for Efficient Video Question Answering on Image--Text Models](https://arxiv.org/pdf/2307.04192.pdf)".
In this work we introduce and study two simple sampling strategies (__MIF__ and __MDF__) for tuning pretrained Visual Language Models (VLMs) on Video Question Answering tasks. We first systematically test the performance of __MIF__ (**M**_ost_ **I**_mplied_ **F**_rames_) with varied backbone models serving as captioner and scorer, which collaborate to perform a "question-and-vision-aware" sampling.
Drawing on these results and analysis, we then propose the more lightweight __MDF__ (**M**_ost_ **D**_ominant_ **F**_rames_), which goes one step further by discarding the dependence on the question and executing a "question-agnostic, vision-aware" sampling. This routine significantly boosts efficiency and achieves competitive or higher performance on the tested datasets.
Once sampling completes, the sampled frames are saved in an HDF5 (.h5) file as a "dataset" for fast loading at training and test time.
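As a rough, self-contained sketch of the question-agnostic idea behind MDF (not the code in this repository), one could score each frame by how representative its visual feature is of the whole clip, keep the top-K, and cache the picks in such an HDF5 file. The feature extractor, file layout, and key names below are purely illustrative assumptions:
```python
# Hypothetical MDF-style selection: frames whose features are most similar
# to the rest of the clip are treated as the "most dominant" ones.
import numpy as np
import h5py

def select_dominant_frames(frame_feats: np.ndarray, k: int = 3) -> np.ndarray:
    """frame_feats: (num_frames, dim) features from a frozen image encoder."""
    feats = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    sim = feats @ feats.T                       # pairwise cosine similarity
    dominance = sim.sum(axis=1)                 # aggregate similarity = "dominance" score
    return np.sort(np.argsort(-dominance)[:k])  # top-k indices, kept in temporal order

def save_sampled_frames(video_id: str, frames: np.ndarray, indices: np.ndarray,
                        path: str = "sampled_frames.h5") -> None:
    """Cache the sampled frames in an HDF5 'dataset' for fast loading later."""
    with h5py.File(path, "a") as f:
        grp = f.require_group(video_id)
        grp.create_dataset("frames", data=frames[indices], compression="gzip")
        grp.create_dataset("indices", data=indices)
```
Reading the cached frames back at training or test time is then a single `h5py` lookup per video.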
We test our methods on three models (__CLIP__, __GIT__ and __All-in-one__) and four datasets (**MSVD-QA**, **MSRVTT-QA**, **TGIF-Frame**, **NExT-QA**).
The implementation on CLIP (including our refined structure **CLIP-Dec**, which significantly improves performance over **raw-CLIP**) and GIT is in the folder `clip_and_git`, while the implementation on All-in-one is under the folder `all_in_one`.

## Usage
### 1. Downloading Datasets
Please visit the corresponding repositories and follow the instructions there to download the datasets.
- [MSVD and MSRVTT](https://github.com/xudejing/video-question-answering)
- [TGIF](https://github.com/YunseokJANG/tgif-qa)
- [NExT-QA](https://github.com/doc-doc/NExT-QA)

The suggested path to store these datasets is `model/dataset/`.
### 2. Preprocessing
The sampling code is the same for all three models and lives under the folder `clip_and_git/src/preprocessing`.

* To sample via the MDF method, run the Python script as follows:
```
python extract_features.py --dataset= --dataset_root= --sampling_strategy='repr' --model_name= ... (other hyperparameters)
```
If the script raises an out-of-memory exception, please use a smaller chunksize (default=512) to shrink the input size per computation.

* To sample via the MIF method, first run uniform sampling with a large K (e.g., 16 or 32) to obtain a sparse frame sequence:
```
python extract_features.py --sampling_strategy='uni' --K 16 ...
```
Then run the script to generate the captions and the sampled indices:
```
python gen_sample.py --dataset= --dataset_root= --sampling_strategy='repr' --vlm_model= --sim_model= --task='gen_cap'
python gen_sample.py --dataset= --dataset_root= --sampling_strategy='repr' --vlm_model= --sim_model= --task='gen_inds'
```
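Conceptually, the two calls above first generate per-frame captions (`--task='gen_cap'`) and then the sampled frame indices (`--task='gen_inds'`), ranking frames by how well their captions match the question. A heavily simplified sketch of this "question-and-vision-aware" selection, with the captioner (`--vlm_model`) and scorer (`--sim_model`) left as placeholder callables, could look like:
```python
# Hypothetical MIF-style selection; `caption_frame` and `text_similarity` stand in
# for the chosen captioning VLM and text-similarity scorer.
from typing import Callable, List, Sequence

def select_implied_frames(frames: Sequence,
                          question: str,
                          caption_frame: Callable[[object], str],
                          text_similarity: Callable[[str, str], float],
                          k: int = 3) -> List[int]:
    captions = [caption_frame(f) for f in frames]               # stage 1: caption each frame
    scores = [text_similarity(c, question) for c in captions]   # stage 2: score caption vs. question
    top = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)                                          # keep temporal order
```
MDF drops the question from the second stage entirely, which is what makes it the more lightweight of the two strategies.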
### 3. Training and Inference
For experiments on CLIP and GIT, please modify our provided reference scripts (in `src/scripts`). For All-in-one, please check its attached README file for more details.

## Results (Partial)
The following results report prediction accuracy, as defined and customized for each dataset/model in our paper.

### CLIP-Dec (3 Frames)
|Sampling|MSVD-QA|MSRVTT-QA|TGIF-Frame|
|---|---|---|---|
|noDec|27.7|30.3|42.8|
|Uniform|33.8|33.7|47.2|
|MDF|__35.0__|35.2|__63.2__|
|MIF|__35.0__|__35.4__|61.8|

### GIT-Base (6 Frames)
|Sampling|MSVD-QA|MSRVTT-QA|TGIF-Frame|
|---|---|---|---|
|Reported|51.2|41.0|__69.1__|
|Uniform|52.2|41.1|67.5|
|MDF|__55.3__|42.0|__69.9__|
|MIF|54.5|__42.3__|69.6|

### AIO-Base (3 Frames)
|Sampling|MSVD-QA|MSRVTT-QA|TGIF-Frame|
|---|---|---|---|
|Reported|46.5|42.9|64.2|
|Reproduced|46.1|42.7|64.0|
|MDF|__46.9__|43.8|__66.2__|
|MIF|46.7|__44.0__|65.9|

### AIO-Base+ on NExT-QA (3 Frames)
|Method|Val|Test|
|---|---|---|
|Base|48.4|48.1|
|MIF|49.7|49.5|
|MDF|50.2|49.8|

### BLIP2-T5XXL on NExT-QA (3 Frames)
|Method|Val|Test|
|---|---|---|
|Base|60.1|59.7|
|MIF|61.5|__61.2__|
|MDF|__61.8__|61.1|

## Citation
Please cite our paper if you find this project related to your work:
```bibtex
@inproceedings{han2024self,
title={Self-Adaptive Sampling for Accurate Video Question Answering on Image Text Models},
author={Han, Wei and Chen, Hui and Kan, Min-Yen and Poria, Soujanya},
booktitle={Findings of the Association for Computational Linguistics: NAACL 2024},
pages={2522--2534},
year={2024}
}
```
## Acknowledgement
Code for AIO is adapted from the [AIO official implementation](https://github.com/showlab/all-in-one).

## Contact
If you have any enquiries about our code and paper, feel free to contact us at henryhan88888@gmail.com or chchenhui1996@gmail.com.