{"id":21305855,"url":"https://github.com/lyuchenyang/efficient-videoqa","last_synced_at":"2025-03-15T19:41:03.789Z","repository":{"id":177219155,"uuid":"649936141","full_name":"lyuchenyang/Efficient-VideoQA","owner":"lyuchenyang","description":"Code for ACL SustaiNLP 2023 paper \"Is a Video worth n × n Images? A Highly Efficient Approach to Transformer-based Video Question Answering\"","archived":false,"fork":false,"pushed_at":"2023-07-04T11:11:06.000Z","size":29,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-01-22T08:45:05.723Z","etag":null,"topics":["artificial-intelligence","deep-learning","machine-learning","multi-modal-learning","natural-language-processing","video-question-answering"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lyuchenyang.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-06-06T01:09:59.000Z","updated_at":"2024-11-18T19:41:04.000Z","dependencies_parsed_at":"2024-11-21T16:19:49.496Z","dependency_job_id":"241f61a7-f853-48fb-8c9e-975fb24d5477","html_url":"https://github.com/lyuchenyang/Efficient-VideoQA","commit_stats":null,"previous_names":["lyuchenyang/efficient-videoqa"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lyuchenyang%2FEfficient-VideoQA","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lyuchenyang%2FEfficient-VideoQA/tags","releases_url":
"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lyuchenyang%2FEfficient-VideoQA/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lyuchenyang%2FEfficient-VideoQA/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lyuchenyang","download_url":"https://codeload.github.com/lyuchenyang/Efficient-VideoQA/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243784099,"owners_count":20347409,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["artificial-intelligence","deep-learning","machine-learning","multi-modal-learning","natural-language-processing","video-question-answering"],"created_at":"2024-11-21T16:19:42.823Z","updated_at":"2025-03-15T19:41:03.751Z","avatar_url":"https://github.com/lyuchenyang.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n# Is a Video worth $n\\times n$ Images? A Highly Efficient Approach to Transformer-based Video Question Answering 💡\n\n**[Chenyang Lyu](https://lyuchenyang.github.io), [Tianbo Ji](mailto:jitianbo@ntu.edu.cn), [Yvette Graham](mailto:ygraham@tcd.ie), [Jennifer Foster](mailto:jennifer.foster@dcu.ie)**\n\nSchool of Computing, Dublin City University, Dublin, Ireland 🏫\n\n\u003c/div\u003e\n\nThis repository contains the code for the Efficient-VideoQA system, which is a highly efficient approach for Transformer-based Video Question Answering. 
The system utilizes existing vision-language pre-trained models and converts video frames into an $n\\times n$ matrix, reducing the computational requirements while maintaining the temporal structure of the original video.\n\n## Table of Contents\n\n- [1. Introduction](#1-introduction-📚)\n- [2. Dataset](#2-dataset-📊)\n- [3. Pre-processing](#3-pre-processing-🔧)\n- [4. Training](#4-training-🎓)\n- [5. Usage](#5-usage-🚀)\n- [6. Dependencies](#6-dependencies-⚙️)\n\n## 1. Introduction 📚\n\nConventional Transformer-based Video Question Answering (VideoQA) approaches generally encode frames independently through one or more image encoders, followed by interaction between frames and questions. However, such an approach incurs significant memory usage and inevitably slows down training and inference. In this work, we present a highly efficient approach for VideoQA based on existing vision-language pre-trained models. We concatenate video frames into an $n\\times n$ matrix and then convert it into one image. By doing so, we reduce the number of image encoder calls per video from $n^{2}$ to $1$ while maintaining the temporal structure of the original video.\n\n## 2. Dataset 📊\n\nPlease download the MSR-VTT dataset, including videos and corresponding annotations, from this link: [https://www.mediafire.com/folder/h14iarbs62e7p/shared](https://www.mediafire.com/folder/h14iarbs62e7p/shared). Move them under the `data/` directory.\n\nPlease download the TrafficQA dataset, including videos and corresponding annotations, from this link: [https://sutdcv.github.io/SUTD-TrafficQA/#/download](https://sutdcv.github.io/SUTD-TrafficQA/#/download). Move them under the `data/` directory.\n\n## 3. Pre-processing 🔧\n\nTo pre-process the data, use `data_preprocess.py` to extract and combine frames from videos in the MSR-VTT and TrafficQA datasets, then tokenize the annotation data into a tensor dataset.\n\n## 4. 
Training 🎓\n\nTo train the model, use the following scripts:\n\n- For the TrafficQA dataset: `python run_trafficqa_concat_image.py --do_train --do_eval --num_train_epochs 2 --learning_rate 5e-6 --train_batch_size 8 --eval_batch_size 16 --attention_heads 8 --eval_steps 50`\n- For the MSR-VTT dataset: `python run_msrvtt_concat_image.py --do_train --do_eval --num_train_epochs 3 --learning_rate 5e-6 --train_batch_size 16 --eval_batch_size 16 --attention_heads 8 --eval_steps 5000`\n\n## 5. Usage 🚀\n\nOnce the model is trained, you can use it for VideoQA tasks. Provide a video and a question, and the system will return the most probable answer based on the video. 🔎\n\n## 6. Dependencies ⚙️\n\nMake sure to install the following dependencies before running the code:\n\n- Python (\u003e=3.8) 🐍\n- PyTorch (\u003e=2.0) 🔥\n- MoviePy 🧮\n- ffmpeg 🐼\n\n## Citation 📄\n\nIf you find our paper useful, please cite it using the BibTeX entry below:\n\n```bibtex\n@article{lyu2023video,\n  title={Is a Video worth $n\\times n$ Images? A Highly Efficient Approach to Transformer-based Video Question Answering},\n  author={Lyu, Chenyang and Ji, Tianbo and Graham, Yvette and Foster, Jennifer},\n  journal={arXiv preprint arXiv:2305.09107},\n  year={2023}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flyuchenyang%2Fefficient-videoqa","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flyuchenyang%2Fefficient-videoqa","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flyuchenyang%2Fefficient-videoqa/lists"}