Projects in Awesome Lists tagged with vqa
A curated list of projects in awesome lists tagged with vqa.
https://github.com/facebookresearch/mmf
A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)
captioning deep-learning dialog hateful-memes multi-tasking multimodal pretrained-models pytorch textvqa vqa
Last synced: 14 May 2025
https://github.com/OpenGVLab/InternGPT
InternGPT (iGPT) is an open source demo platform where you can easily showcase your AI models. Now it supports DragGAN, ChatGPT, ImageBind, multimodal chat like GPT-4, SAM, interactive image editing, etc. Try it at igpt.opengvlab.com (an online demo system supporting DragGAN, ChatGPT, ImageBind, and SAM)
chatgpt click draggan foundation-model gpt gpt-4 gradio husky image-captioning imagebind internimage langchain llama llm multimodal sam segment-anything vicuna video-generation vqa
Last synced: 27 Mar 2025
https://github.com/opengvlab/interngpt
InternGPT (iGPT) is an open source demo platform where you can easily showcase your AI models. Now it supports DragGAN, ChatGPT, ImageBind, multimodal chat like GPT-4, SAM, interactive image editing, etc. Try it at igpt.opengvlab.com (an online demo system supporting DragGAN, ChatGPT, ImageBind, and SAM)
chatgpt click draggan foundation-model gpt gpt-4 gradio husky image-captioning imagebind internimage langchain llama llm multimodal sam segment-anything vicuna video-generation vqa
Last synced: 14 May 2025
https://github.com/roboflow/maestro
Streamline the fine-tuning process for multimodal models: PaliGemma 2, Florence-2, and Qwen2.5-VL
captioning fine-tuning florence-2 multimodal objectdetection paligemma phi-3-vision qwen2-vl transformers vision-and-language vqa
Last synced: 14 May 2025
https://github.com/open-compass/vlmevalkit
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
chatgpt claude clip computer-vision evaluation gemini gpt gpt-4v gpt4 large-language-models llava llm multi-modal openai openai-api pytorch qwen vit vqa
Last synced: 13 May 2025
https://github.com/bdbc-kg-nlp/qa-survey-cn
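A survey of intelligent question answering research and applications, compiled by the natural language processing research team at Beihang University's Big Data Advanced Innovation Center. It covers knowledge-graph-based question answering (KBQA), text-based question answering (TextQA), table-based question answering (TableQA), visual question answering (VisualQA), machine reading comprehension (MRC), and more, with work from academia and industry summarized separately for each task type.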
cqa kbqa nlp qa qa-survey question-answering survey tqa vqa
Last synced: 04 Feb 2026
https://github.com/BDBC-KG-NLP/QA-Survey-CN
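A survey of intelligent question answering research and applications, compiled by the natural language processing research team at Beihang University's Big Data Advanced Innovation Center. It covers knowledge-graph-based question answering (KBQA), text-based question answering (TextQA), table-based question answering (TableQA), visual question answering (VisualQA), machine reading comprehension (MRC), and more, with work from academia and industry summarized separately for each task type.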
cqa kbqa nlp qa qa-survey question-answering survey tqa vqa
Last synced: 27 Apr 2025
https://github.com/peteanderson80/bottom-up-attention
Bottom-up attention model for image captioning and VQA, based on Faster R-CNN and Visual Genome
caffe captioning-images faster-rcnn image-captioning mscoco mscoco-dataset visual-question-answering vqa
Last synced: 08 Apr 2025
https://github.com/nvlabs/prismer
The implementation of "Prismer: A Vision-Language Model with Multi-Task Experts".
image-captioning language-model multi-modal-learning multi-task-learning vision-and-language vision-language-model vqa
Last synced: 16 May 2025
https://github.com/open-compass/VLMEvalKit
Open-source evaluation toolkit for large vision-language models (LVLMs), supporting ~100 VLMs and 40+ benchmarks
chatgpt claude clip computer-vision evaluation gemini gpt gpt-4v gpt4 large-language-models llava llm multi-modal openai openai-api pytorch qwen vit vqa
Last synced: 20 Jul 2025
https://github.com/microsoft/Oscar
Oscar and VinVL
image-captioning image-text-search oscar pre-training vinvl vision-and-language vqa
Last synced: 21 Jul 2025
https://github.com/microsoft/oscar
Oscar and VinVL
image-captioning image-text-search oscar pre-training vinvl vision-and-language vqa
Last synced: 28 Sep 2025
https://github.com/hila-chefer/transformer-mm-explainability
[ICCV 2021 Oral] Official PyTorch implementation for Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers, a novel method to visualize any Transformer-based network. Includes examples for DETR and VQA.
clip detr explainability explainable-ai interpretability lxmert transformer transformers visualbert visualization vqa
Last synced: 12 Apr 2025
https://github.com/hila-chefer/Transformer-MM-Explainability
[ICCV 2021 Oral] Official PyTorch implementation for Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers, a novel method to visualize any Transformer-based network. Includes examples for DETR and VQA.
clip detr explainability explainable-ai interpretability lxmert transformer transformers visualbert visualization vqa
Last synced: 03 Apr 2025
https://github.com/hengyuan-hu/bottom-up-attention-vqa
An efficient PyTorch implementation of the winning entry of the 2017 VQA Challenge.
bottom-up-attention pytorch vqa
Last synced: 13 Apr 2025
https://github.com/cadene/vqa.pytorch
Visual Question Answering in PyTorch
clevr coco deep-learning pytorch resnet skipthoughts torch vgenome vqa
Last synced: 04 Apr 2025
https://github.com/jayleicn/clipbert
[CVPR 2021 Best Student Paper Honorable Mention, Oral] Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks.
cvpr2021 pytorch video-question-answering video-retrieval vision-and-language vqa
Last synced: 04 Apr 2025
https://github.com/Cadene/vqa.pytorch
Visual Question Answering in PyTorch
clevr coco deep-learning pytorch resnet skipthoughts torch vgenome vqa
Last synced: 01 Apr 2025
https://github.com/jayleicn/ClipBERT
[CVPR 2021 Best Student Paper Honorable Mention, Oral] Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks.
cvpr2021 pytorch video-question-answering video-retrieval vision-and-language vqa
Last synced: 12 May 2025
https://github.com/opengvlab/multi-modality-arena
Chatbot Arena meets multi-modality! Multi-Modality Arena allows you to benchmark vision-language models side-by-side while providing images as inputs. Supports MiniGPT-4, LLaMA-Adapter V2, LLaVA, BLIP-2, and many more!
chat chatbot chatgpt gradio large-language-models llms multi-modality vision-language-model vqa
Last synced: 20 Apr 2025
https://github.com/stanfordnlp/mac-network
Implementation for the paper "Compositional Attention Networks for Machine Reasoning" (Hudson and Manning, ICLR 2018)
attention clevr compositional-attention-networks machine-reasoning question-answering tensorflow vqa
Last synced: 13 May 2025
https://github.com/OpenGVLab/Multi-Modality-Arena
Chatbot Arena meets multi-modality! Multi-Modality Arena allows you to benchmark vision-language models side-by-side while providing images as inputs. Supports MiniGPT-4, LLaMA-Adapter V2, LLaVA, BLIP-2, and many more!
chat chatbot chatgpt gradio large-language-models llms multi-modality vision-language-model vqa
Last synced: 03 Apr 2025
https://github.com/davidmascharka/tbd-nets
PyTorch implementation of "Transparency by Design: Closing the Gap Between Performance and Interpretability in Visual Reasoning"
deep-learning machine-learning neural-networks pytorch visual-question-answering visualization vqa
Last synced: 06 Apr 2025
https://github.com/FuxiaoLiu/LRV-Instruction?tab=readme-ov-file
[ICLR'24] Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
chatgpt evaluation evaluation-metrics foundation-models gpt gpt-4 hallucination iclr iclr2024 llama llava multimodal object-detection prompt-engineering vicuna vision vision-and-language vqa
Last synced: 29 Mar 2025
https://github.com/cyanogenoid/pytorch-vqa
Strong baseline for visual question answering
baseline pytorch visual-question-answering vqa
Last synced: 07 Apr 2025
https://github.com/x-plug/mplug-2
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video (ICML 2023)
foundation-models image-retrieval mllm mplug multimodal multimodal-pretraining video video-question-answering video-retrieval vqa
Last synced: 09 Sep 2025
https://github.com/JackYFL/awesome-VLLMs
This repository collects papers on VLLM applications. New papers are added irregularly.
application embodied llm mllm reasoning-agent survey vllm vlm vqa
Last synced: 06 Nov 2025
https://github.com/wangleihitcs/papers
Some computer vision papers I have read, covering image-to-text generation, weakly supervised segmentation, and more
captions computer-vision cvpr eccv iccv image2text miccai natural-language-processing scene-text-detection-recognition vqa weakly-supervised-segmentation
Last synced: 02 Mar 2026
https://github.com/yuanze-lin/revive
[NeurIPS 2022] Official code for REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering
computer-vision deep-learning gpt-3 knowledge-based multimodal-deep-learning neurips2022 ok-vqa pytorch question-answering vision-and-languge vqa
Last synced: 09 Apr 2025
https://github.com/x-plug/mplug
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections. (EMNLP 2022)
image-captioning image-text image-text-retrieval multimodal pretraining pytorch transformer visual-language vqa
Last synced: 26 Jun 2025
https://github.com/j-min/dsg
Davidsonian Scene Graph (DSG) for Text-to-Image Evaluation (ICLR 2024)
dsg llm text-to-image text-to-image-evaluation text-to-image-generation vqa
Last synced: 07 Apr 2025
https://github.com/kdexd/probnmn-clevr
Code for ICML 2019 paper "Probabilistic Neural-symbolic Models for Interpretable Visual Question Answering" [long-oral]
clevr icml icml-2019 neural-module-networks probabilistic-models vqa
Last synced: 07 May 2025
https://github.com/cloud-cv/vqa
CloudCV Visual Question Answering Demo
artificial-intelligence machine-learning vqa vqa-dataset
Last synced: 14 Jun 2025
https://github.com/China-UK-ZSL/ZS-F-VQA
[Paper][ISWC 2021] Zero-shot Visual Question Answering using Knowledge Graph
commonsense commonsense-reasoning fvqa knowledge-graph visual-question-answering vqa zero-shot zs-f-vqa zsl
Last synced: 21 Jul 2025
https://github.com/ap229997/Conditional-Batch-Norm
PyTorch implementation of the NIPS 2017 paper "Modulating early visual processing by language"
cbn modulated-resnet pytorch vqa
Last synced: 11 May 2025
https://github.com/cdancette/rubi.bootstrap.pytorch
NeurIPS 2019 Paper: RUBi: Reducing Unimodal Biases for Visual Question Answering
bias bias-reduction deep-learning pytorch vqa
Last synced: 03 Jul 2025
https://github.com/lupantech/IconQA
Data and code for NeurIPS 2021 Paper "IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning".
commensense dataset mathai pytorch reasoning vqa
Last synced: 02 May 2025
https://github.com/lupantech/iconqa
Data and code for NeurIPS 2021 Paper "IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning".
commensense dataset mathai pytorch reasoning vqa
Last synced: 09 Mar 2026
https://github.com/mapluisch/llava-cli-with-multiple-images
LLaVA inference with multiple images at once for cross-image analysis.
image-concatenation image-processing inference llama2 llama2-13b llava lmm lmms pillow python python3 pytorch visual-question-answering vqa
Last synced: 20 Jun 2025
https://github.com/sutdcv/SUTD-TrafficQA
[CVPR2021] SUTD-TrafficQA: A Question Answering Benchmark and an Efficient Network for Video Reasoning over Traffic Events
annotations cvpr cvpr2021 dataset multimodal multimodal-deep-learning paper traffic-events video-qa video-reasoning vqa vqa-dataset
Last synced: 18 Mar 2025
https://github.com/sidgan/whats_in_a_question
CVPR'17 Spotlight: What’s in a Question: Using Visual Questions as a Form of Supervision
computer-vision deep-learning deep-neural-networks vqa
Last synced: 26 Aug 2025
https://github.com/lucidrains/aoa-pytorch
A PyTorch implementation of the Attention on Attention module (both self and guided variants), for Visual Question Answering
attention attention-mechanism captioning visual-question-answering vqa
Last synced: 13 Dec 2025
https://github.com/lupantech/dual-mfa-vqa
Co-attending Regions and Detections for VQA.
aaai attention-mechanism caffe faster-rcnn multi-gpu multi-modal object-detection torch visual-question-answering vqa
Last synced: 19 Feb 2026
https://github.com/aimagelab/reflectiva
[CVPR 2025] Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering
knowledge-base mllm multimodal vlm vqa
Last synced: 26 Aug 2025
https://github.com/mbzuai-oryx/kitab-bench
[ACL 2025 🔥] A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding
arabic benchmark layout-detection ocr pdf-to-text table-detection vlms vqa
Last synced: 19 Jun 2025
https://github.com/adrianbzg/llama-multimodal-vqa
Multimodal Instruction Tuning for Llama 3
chatbot chatgpt gpt-4 huggingface instruction-tuning language-models llama llama2 llama3 multimodal multimodal-instruction-tuning visual-language-learning visual-question-answering vqa
Last synced: 25 Oct 2025
https://github.com/vzhou842/easy-vqa
The Easy Visual Question Answering dataset.
dataset easy-vqa visual-question-answering vqa vqa-dataset
Last synced: 05 Aug 2025
https://github.com/kushalkafle/DVQA_dataset
DVQA Dataset: A Bar chart question answering dataset presented at CVPR 2018
bar-chart cvpr2018 dataset deep-learning question-answering vqa
Last synced: 02 May 2025
https://github.com/vzhou842/easy-vqa-keras
A Keras implementation of VQA using the easy-VQA dataset.
easy-vqa keras keras-tensorflow vqa
Last synced: 27 Mar 2025
https://github.com/vzhou842/easy-VQA-keras
A Keras implementation of VQA using the easy-VQA dataset.
easy-vqa keras keras-tensorflow vqa
Last synced: 11 Apr 2025
https://github.com/yashkant/concat-vqa
Official code for the paper "Contrast and Classify: Training Robust VQA Models" published at ICCV, 2021
Last synced: 23 Apr 2025
https://github.com/wangzheallen/stl-vqa
Good practices in VQA systems, such as POS-tag attention, structured triplet learning, and triplet attention, are very general and can be inserted into almost any vision-and-language task
deep-learning practice tensorflow vision-and-language vqa
Last synced: 08 Mar 2026
https://nextplusplus.github.io/TAT-DQA/
TAT-DQA: Towards Complex Document Understanding By Discrete Reasoning
document-understanding question-answering vqa
Last synced: 27 Oct 2025
https://github.com/raeidsaqur/mgn
Multimodal Graph Network (MGN): Code repo, examples from the paper
compositionality gnn program-synthesis vqa
Last synced: 03 Sep 2025
https://github.com/abachaa/VQA-Med-2021
VQA-Med 2021
medical-imaging radiology visual-question-answering visual-question-generation vqa vqa-dataset vqa-med
Last synced: 03 Apr 2025
https://github.com/google-research-datasets/maverics
MAVERICS (Manually-vAlidated Vq^2a Examples fRom Image-Caption datasetS) is a suite of test-only benchmarks for visual question answering (VQA).
data-creation evaluation maverics multimodal vq2a vqa vqa-dataset
Last synced: 16 Apr 2025
https://github.com/mbzuai-oryx/camel-bench
CAMEL-Bench is an Arabic benchmark for evaluating multimodal models across eight domains with 29,000 questions.
arabic benchmark large-multimodal-models mbzuai multimodal-learning visual-question-answering vqa
Last synced: 01 May 2025
https://badripatro.github.io/Question-Paraphrases/
adversarial-machine-learning adversarial-networks answers coling2018 deep-neural-networks paraphrase-generation paraphrase-identification question-answering questions-generation sentiment-analysis sentiment-classification sentiment-scores vqa vqg
Last synced: 12 May 2025
https://github.com/vzhou842/easy-vqa-demo
A web-based JavaScript demo of an easy-VQA model.
demo-app easy-vqa keras react tensorflowjs vqa
Last synced: 27 Mar 2025
https://github.com/seujung/relational-network-gluon
Gluon implementation of "A simple neural network module for relational reasoning"
deep-learning gluon mxnet relational-networks vqa
Last synced: 17 Apr 2025
https://github.com/eurus-holmes/pythia-vqa
Baseline for Visual Question Answering.
Last synced: 02 May 2025
https://github.com/abdur75648/medicalgpt
Medical Report Generation And VQA (Adapting XrayGPT to Any Modality)
chatgpt chatgpt4o llama llm llms medical-dataset medical-imaging medical-report-generation medicalgpt minigpt4 multimodal-llm vicuna vqa vqa-dataset xraygpt
Last synced: 01 May 2026
https://github.com/yang-yifan/vqa-gan
Generative Visual Question Answering in PyTorch
Last synced: 31 Oct 2025
https://github.com/pavansomisetty21/visual-question-answering-using-gemini-llm
This project explores visual question answering using the Gemini LLM, with the image supplied as a URL or as a file of any other extension
artificial-intelligence blip blip2 gemini gemini-flash generative-ai generative-model git question-answering vision-language-model vision-transformer visual-models visual-question-answering vlm vqa
Last synced: 30 Apr 2025
https://github.com/ailln/vqa-roadmap
🍌Visual Question Answering Roadmap.
roadmap visual-question-answering vqa
Last synced: 19 Mar 2026
https://github.com/amirshnll/persian-visual-question-answering
Visual Question Answering in Persian based on deep learning techniques (paper code)
deep-learning persian persian-vqa resnext resnext-101 visual-question-answering vqa
Last synced: 16 Mar 2025
https://github.com/ekinakyurek/mac-network
VQA: Memory, Attention and Composition (MAC) Network for CLEVR implemented via KnetLayers
attention clevr deep-learning knet machine-learning vqa
Last synced: 25 Mar 2025
https://github.com/chen0040/mxnet-vqa
Yet Another Visual Question Answering in MXNet
image-encoding mxnet text-encoding visual-question-answering vqa
Last synced: 03 Apr 2025
https://github.com/sumedhpendurkar/amf-vqa
attention-mechanism deep-learning multimodal-deep-learning neural-networks vqa
Last synced: 23 May 2026
https://github.com/cansik/vqa-service
VQA application that allows users to ask questions about images and receive answers.
Last synced: 08 Jul 2025
https://github.com/esborisova/scivqa
SciVQA: Scientific Visual Question Answering shared task
chart-understanding shared-task vqa
Last synced: 06 Jul 2025
https://github.com/nikhilroxtomar/visual-question-answer
An easy and simple implementation of Visual Question Answering (VQA) in TensorFlow and PyTorch (coming soon).
pytorch tensorflow visual-question-answering vqa
Last synced: 30 Apr 2026
https://github.com/msmrexe/neurosymbolic-vqa-program-generator
A comprehensive implementation of a Neurosymbolic framework for Visual Question Answering (VQA) on the CLEVR dataset. This project translates natural language questions into symbolic programs using three different learning strategies: Supervised (LSTM & Transformer), Reinforcement Learning (REINFORCE), and In-Context Learning (LLM).
clevr course-project in-context-learning large-language-models lstm neurosymbolic neurosymbolic-ai policy-gradient program-generator pytorch reinforce reinforcement-learning seq2seq supervised-learning system-2 transformer university-project visual-question-answering visual-reasoning vqa
Last synced: 07 May 2026
https://github.com/reshalfahsi/vqa-clip-lstm
Visual Question Answering Using CLIP + LSTM
clip lstm nlp pytorch pytorch-lightning visual-question-answering vizwiz-vqa vqa
Last synced: 11 May 2026
https://github.com/arulkumarann/vqa_implementation
vanilla vqa_v1 PyTorch implementation
Last synced: 03 May 2026
https://github.com/0xnu/tiny_llm_trainer
The experiment implements a tiny language model trainer using PyTorch.
large-language-model large-language-models llm llm-training pytorch text-generation text-to-speech tts visual-question-answering vqa wiki wikipedia
Last synced: 03 Apr 2025
https://github.com/cserajdeep/visual-question-answering-vqa
Visual Question Answering (VQA)
computer-vision flask keras python tensorflow vqa vqa-dataset
Last synced: 28 Apr 2026
https://github.com/orshkuri/vqa-qformer-comparison
A benchmark and analysis of QFormer, Cross Attention, and Concat models for binary Visual Question Answering (VQA) using CLIP and BERT+ViT-CLIP encoders.
bert clip deep-learning-multimodal pytorch pytorch-lightning transformers vqa
Last synced: 30 Apr 2026
https://github.com/rakshath66/ask-your-image
Ask questions about any image using AI. A smart Streamlit app powered by BLIP that answers visual questions, generates captions, and lets you download a PDF report.
ai-app blip caption-generator computer-vision deep-learning generative-ai huggingface image-captioning image-processing image-question-answering interactive-ui multimodal-ai openai pdf-generator pytorch streamlit transformers vision-language visual-question-answering vqa
Last synced: 06 May 2026
https://github.com/mahmood-anaam/violet
Violet is a Python-based library designed for generating Arabic image captions. The pipeline leverages state-of-the-art transformer models, providing an easy-to-use interface for researchers and developers working on tasks such as image captioning and visual question answering (VQA).
image-captioning okvqa python3 pytorch transformers vqa vqav2
Last synced: 07 May 2026