{"id":18832925,"url":"https://github.com/declare-lab/sealing","last_synced_at":"2025-04-14T04:31:28.165Z","repository":{"id":180656723,"uuid":"642755740","full_name":"declare-lab/Sealing","owner":"declare-lab","description":"[NAACL 2024] Official Implementation of paper \"Self-Adaptive Sampling for Efficient Video Question Answering on Image--Text Models\"","archived":false,"fork":false,"pushed_at":"2024-07-25T10:59:48.000Z","size":9356,"stargazers_count":11,"open_issues_count":0,"forks_count":3,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-03-27T18:21:29.379Z","etag":null,"topics":["multimodality","naacl2024","video-question-answering","video-understanding","visual-language-models"],"latest_commit_sha":null,"homepage":"https://arxiv.org/pdf/2307.04192.pdf","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/declare-lab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-05-19T09:20:45.000Z","updated_at":"2025-02-12T06:30:53.000Z","dependencies_parsed_at":"2024-06-27T08:06:52.089Z","dependency_job_id":null,"html_url":"https://github.com/declare-lab/Sealing","commit_stats":null,"previous_names":["declare-lab/sas-vqa","declare-lab/vqa-sampling"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/declare-lab%2FSealing","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/declare-lab%2FSealing/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/declare-lab%2FSealing/r
eleases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/declare-lab%2FSealing/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/declare-lab","download_url":"https://codeload.github.com/declare-lab/Sealing/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248821711,"owners_count":21166940,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["multimodality","naacl2024","video-question-answering","video-understanding","visual-language-models"],"created_at":"2024-11-08T01:59:33.279Z","updated_at":"2025-04-14T04:31:28.146Z","avatar_url":"https://github.com/declare-lab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# [NAACL 2024] Self-adaptive Sampling for Efficient Video Question Answering on Image--Text Models\n\n🔥 [14/03/2024] This paper has been accepted to NAACL 2024 (Findings)!\n\n## Introduction\nThis repository contains the official implementation code of the paper \"[Self-adaptive Sampling for Efficient Video Question Answering on Image--Text Models](https://arxiv.org/pdf/2307.04192.pdf)\". \nIn this work, we introduce and study two simple sampling strategies (__MIF__ and __MDF__) for the tuning of Video Question Answering tasks on pretrained Visual Language Models (VLMs).\n\nWe first systematically test the performance of __MIF__ (**M**_ost_ **I**_mplied_ **F**_rames_) with varied backbone models as captioner and scorer. 
They collaborate to perform a \"question-and-vision-aware\" sampling.\nThen we draw inspiration from the results and analysis to further propose the more lightweight __MDF__ (**M**_ost_ **D**_ominant_ **F**_rames_), which goes one step further by discarding the correlation with the question and executes a \"question-agnostic, vision-aware\" sampling. This routine significantly boosts efficiency and achieves competitive or higher performance on the tested datasets.\n\n\u003cp align=\"center\"\u003e\n    \u003cimage src=\"assets/MDF.png\" width=\"324\"\u003e \n    \u003cimage src=\"assets/MIF.png\" width=\"432\"\u003e\n\u003c/p\u003e\n\nOnce a run completes, the sampled frames are saved in an HDF5 (.h5) file as a \"dataset\" for fast loading during training and test time.\nWe test our methods on three models (__CLIP__, __GIT__ and __All-in-one__) and four datasets (**MSVD-QA**, **MSRVTT-QA**, **TGIF-Frame**, **NExT-QA**).\nThe implementations for CLIP (including our refined structure **CLIP-Dec**, which significantly enhances the performance of **raw-CLIP**) and GIT are in the folder `clip_and_git`, while the implementation for All-in-one is under the folder `all_in_one`.\n\n## Usage\n### 1. Downloading Datasets\nPlease visit the corresponding repository and follow the instructions there to download the datasets.\n- [MSVD and MSRVTT](https://github.com/xudejing/video-question-answering)\n- [TGIF](https://github.com/YunseokJANG/tgif-qa)\n- [NExT-QA](https://github.com/doc-doc/NExT-QA)\n\nThe suggested path to store these datasets is \"model/dataset/\u003cdataset_name\u003e\" \n\n### 2. Preprocessing\nThe sampling code is the same for all three models and is located under the folder \"clip_and_git/src/preprocessing\". \n\n* To sample via the MDF method, run the Python script as follows:\n    ```\n    python extract_features.py --dataset=\u003cdataset_name\u003e --dataset_root=\u003croot_path\u003e --sampling_strategy='repr' --model_name=\u003cvlm_model_name\u003e ... 
(other hps)\n    ```\n    If the script raises an out-of-memory exception, please use a smaller chunksize (default=512) to shrink the input size per computation.\n\n* To sample via the MIF method, first run uniform sampling with a large K (e.g., 16 or 32) to obtain a sparse frame sequence:\n\n    ```\n    python extract_features.py --sampling_strategy='uni' --K 16 ...\n    ```\n    Then run the Python script to caption the frames and start sampling:\n    ```\n    python gen_sample.py --dataset=\u003cdataset_name\u003e --dataset_root=\u003croot_path\u003e --sampling_strategy='repr' --vlm_model=\u003cvlm_model_name\u003e --sim_model=\u003csim_model_name\u003e --task='gen_cap'\n\n    python gen_sample.py --dataset=\u003cdataset_name\u003e --dataset_root=\u003croot_path\u003e --sampling_strategy='repr' --vlm_model=\u003cvlm_model_name\u003e --sim_model=\u003csim_model_name\u003e --task='gen_inds'\n    ```\n\n### 3. Training and Inference\nFor experiments on CLIP and GIT, please modify our provided reference scripts (in `src/scripts`). 
For All-in-one, please check its attached README file for more details.\n\n## Results (Partial)\nThe following results are prediction accuracies, as defined and customized for each dataset/model in our paper.\n\n### CLIP-Dec (3 Frames)\n|Sampling|MSVD-QA|MSRVTT-QA|TGIF-Frame|\n|---|---|---|---|\n|noDec|27.7|30.3|42.8|\n|Uniform|33.8|33.7|47.2|\n|MDF|__35.0__|35.2|__63.2__|\n|MIF|__35.0__|__35.4__|61.8|\n\n### GIT-Base (6 Frames)\n|Sampling|MSVD-QA|MSRVTT-QA|TGIF-Frame|\n|---|---|---|---|\n|Report|51.2|41.0|__69.1__|\n|Uniform|52.2|41.1|67.5|\n|MDF|__55.3__|42.0|__69.9__|\n|MIF|54.5|__42.3__|69.6|\n\n### AIO-Base (3 Frames)\n|Sampling|MSVD-QA|MSRVTT-QA|TGIF-Frame|\n|---|---|---|---|\n|Report|46.5|42.9|64.2|\n|Reprd.|46.1|42.7|64.0|\n|MDF|__46.9__|43.8|__66.2__|\n|MIF|46.7|__44.0__|65.9|\n\n### AIO-Base+ on NExT-QA (3 Frames)\n|Method|Val|Test|\n|---|---|---|\n|Base|48.4|48.1|\n|MIF|49.7|49.5|\n|MDF|50.2|49.8|\n\n### BLIP2-T5XXL on NExT-QA (3 Frames)\n|Method|Val|Test|\n|---|---|---|\n|Base|60.1|59.7|\n|MIF|61.5|__61.2__|\n|MDF|__61.8__|61.1|\n\n## Citation\nPlease cite our paper if you find this project relevant to your work.\n```bibtex\n@inproceedings{han2024self,\n  title={Self-Adaptive Sampling for Accurate Video Question Answering on Image Text Models},\n  author={Han, Wei and Chen, Hui and Kan, Min-Yen and Poria, Soujanya},\n  booktitle={Findings of the Association for Computational Linguistics: NAACL 2024},\n  pages={2522--2534},\n  year={2024}\n}\n```\n## Acknowledgement\nCode for AIO is adapted from the [AIO official implementation](https://github.com/showlab/all-in-one).\n\n## Contact\nIf you have any enquiries about our code or paper, feel free to contact us at henryhan88888@gmail.com or 
chchenhui1996@gmail.com.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdeclare-lab%2Fsealing","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdeclare-lab%2Fsealing","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdeclare-lab%2Fsealing/lists"}