https://github.com/declare-lab/sealing
[NAACL 2024] Official Implementation of paper "Self-Adaptive Sampling for Efficient Video Question Answering on Image--Text Models"
- Host: GitHub
- URL: https://github.com/declare-lab/sealing
- Owner: declare-lab
- License: mit
- Created: 2023-05-19T09:20:45.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-07-25T10:59:48.000Z (about 1 year ago)
- Last Synced: 2025-03-27T18:21:29.379Z (6 months ago)
- Topics: multimodality, naacl2024, video-question-answering, video-understanding, visual-language-models
- Language: Python
- Homepage: https://arxiv.org/pdf/2307.04192.pdf
- Size: 8.92 MB
- Stars: 11
- Watchers: 3
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# [NAACL 2024] Self-adaptive Sampling for Efficient Video Question Answering on Image--Text Models
🔥 [14/03/2024] This paper has been accepted to NAACL 2024 (Findings)!
## Introduction
This repository contains the official implementation code of the paper "[Self-adaptive Sampling for Efficient Video Question Answering on Image--Text Models](https://arxiv.org/pdf/2307.04192.pdf)".
In this work we introduce and study two simple sampling strategies (__MIF__ and __MDF__) for tuning pretrained Visual Language Models (VLMs) on Video Question Answering tasks. We first systematically test the performance of __MIF__ (**M**_ost_ **I**_mplied_ **F**_rames_) with varied backbone models serving as captioner and scorer, which collaborate to perform a "question-and-vision-aware" sampling.
Drawing on these results and analysis, we then propose the more lightweight __MDF__ (**M**_ost_ **D**_ominant_ **F**_rames_), which goes one step further by discarding the dependence on the question and executing a "question-agnostic, vision-aware" sampling. This routine significantly boosts efficiency and achieves competitive or higher performance on the tested datasets.
Once sampling completes, the sampled frames are saved in an HDF5 (.h5) file as a "dataset" for fast loading at training and test time.
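As a rough, self-contained sketch of the question-agnostic idea behind MDF (not the code in this repository), one could score each frame by how representative its visual feature is of the whole clip, keep the top-K, and cache the picks in such an HDF5 file. The feature extractor, file layout, and key names below are purely illustrative assumptions:
```python
# Hypothetical MDF-style selection: frames whose features are most similar
# to the rest of the clip are treated as the "most dominant" ones.
import numpy as np
import h5py

def select_dominant_frames(frame_feats: np.ndarray, k: int = 3) -> np.ndarray:
    """frame_feats: (num_frames, dim) features from a frozen image encoder."""
    feats = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    sim = feats @ feats.T                       # pairwise cosine similarity
    dominance = sim.sum(axis=1)                 # aggregate similarity = "dominance" score
    return np.sort(np.argsort(-dominance)[:k])  # top-k indices, kept in temporal order

def save_sampled_frames(video_id: str, frames: np.ndarray, indices: np.ndarray,
                        path: str = "sampled_frames.h5") -> None:
    """Cache the sampled frames in an HDF5 'dataset' for fast loading later."""
    with h5py.File(path, "a") as f:
        grp = f.require_group(video_id)
        grp.create_dataset("frames", data=frames[indices], compression="gzip")
        grp.create_dataset("indices", data=indices)
```
Reading the cached frames back at training or test time is then a single `h5py` lookup per video.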
We test our methods on three models (__CLIP__, __GIT__ and __All-in-one__) and four datasets (**MSVD-QA**, **MSRVTT-QA**, **TGIF-Frame**, **NExT-QA**).
The implementation on CLIP (including our refined structure **CLIP-Dec**, which significantly improves performance over **raw-CLIP**) and GIT is in the folder `clip_and_git`, while the implementation on All-in-one is under the folder `all_in_one`.

## Usage
### 1. Downloading Datasets
Please visit the corresponding repositories and follow the instructions there to download the datasets.
- [MSVD and MSRVTT](https://github.com/xudejing/video-question-answering)
- [TGIF](https://github.com/YunseokJANG/tgif-qa)
- [NExT-QA](https://github.com/doc-doc/NExT-QA)

The suggested path to store these datasets is `model/dataset/`.
### 2. Preprocessing
The sampling code is the same for all three models and lives under the folder `clip_and_git/src/preprocessing`.

* To sample via the MDF method, run the Python script as follows:
```
python extract_features.py --dataset= --dataset_root= --sampling_strategy='repr' --model_name= ... (other hyperparameters)
```
If the script raises an out-of-memory exception, please use a smaller chunksize (default=512) to shrink the input size per computation.

* To sample via the MIF method, first run uniform sampling with a large K (e.g., 16 or 32) to obtain a sparse frame sequence:
```
python extract_features.py --sampling_strategy='uni' --K 16 ...
```
Then run the script to generate the captions and the sampled indices:
```
python gen_sample.py --dataset= --dataset_root= --sampling_strategy='repr' --vlm_model= --sim_model= --task='gen_cap'
python gen_sample.py --dataset= --dataset_root= --sampling_strategy='repr' --vlm_model= --sim_model= --task='gen_inds'
```
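Conceptually, the two calls above first generate per-frame captions (`--task='gen_cap'`) and then the sampled frame indices (`--task='gen_inds'`), ranking frames by how well their captions match the question. A heavily simplified sketch of this "question-and-vision-aware" selection, with the captioner (`--vlm_model`) and scorer (`--sim_model`) left as placeholder callables, could look like:
```python
# Hypothetical MIF-style selection; `caption_frame` and `text_similarity` stand in
# for the chosen captioning VLM and text-similarity scorer.
from typing import Callable, List, Sequence

def select_implied_frames(frames: Sequence,
                          question: str,
                          caption_frame: Callable[[object], str],
                          text_similarity: Callable[[str, str], float],
                          k: int = 3) -> List[int]:
    captions = [caption_frame(f) for f in frames]               # stage 1: caption each frame
    scores = [text_similarity(c, question) for c in captions]   # stage 2: score caption vs. question
    top = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)                                          # keep temporal order
```
MDF drops the question from the second stage entirely, which is what makes it the more lightweight of the two strategies.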
### 3. Training and Inference
For experiments on CLIP and GIT, please modify our provided reference scripts (in `src/scripts`). For All-in-one, please check its attached README file for more details.

## Results (Partial)
The following results report prediction accuracy, as defined and customized for each dataset/model in our paper.

### CLIP-Dec (3 Frames)
|Sampling|MSVD-QA|MSRVTT-QA|TGIF-Frame|
|---|---|---|---|
|noDec|27.7|30.3|42.8|
|Uniform|33.8|33.7|47.2|
|MDF|__35.0__|35.2|__63.2__|
|MIF|__35.0__|__35.4__|61.8|

### GIT-Base (6 Frames)
|Sampling|MSVD-QA|MSRVTT-QA|TGIF-Frame|
|---|---|---|---|
|Reported|51.2|41.0|__69.1__|
|Uniform|52.2|41.1|67.5|
|MDF|__55.3__|42.0|__69.9__|
|MIF|54.5|__42.3__|69.6|

### AIO-Base (3 Frames)
|Sampling|MSVD-QA|MSRVTT-QA|TGIF-Frame|
|---|---|---|---|
|Reported|46.5|42.9|64.2|
|Reproduced|46.1|42.7|64.0|
|MDF|__46.9__|43.8|__66.2__|
|MIF|46.7|__44.0__|65.9|

### AIO-Base+ on NExT-QA (3 Frames)
|Method|Val|Test|
|---|---|---|
|Base|48.4|48.1|
|MIF|49.7|49.5|
|MDF|50.2|49.8|

### BLIP2-T5XXL on NExT-QA (3 Frames)
|Method|Val|Test|
|---|---|---|
|Base|60.1|59.7|
|MIF|61.5|__61.2__|
|MDF|__61.8__|61.1|

## Citation
Please cite our paper if you find this project related to your work:
```bibtex
@inproceedings{han2024self,
title={Self-Adaptive Sampling for Accurate Video Question Answering on Image Text Models},
author={Han, Wei and Chen, Hui and Kan, Min-Yen and Poria, Soujanya},
booktitle={Findings of the Association for Computational Linguistics: NAACL 2024},
pages={2522--2534},
year={2024}
}
```
## Acknowledgement
Code for AIO is adapted from the [AIO official implementation](https://github.com/showlab/all-in-one).

## Contact
If you have any enquiries about our code and paper, feel free to contact us at henryhan88888@gmail.com or chchenhui1996@gmail.com.