Projects in Awesome Lists tagged with vqa

https://github.com/facebookresearch/mmf

A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)

captioning deep-learning dialog hateful-memes multi-tasking multimodal pretrained-models pytorch textvqa vqa

Last synced: 14 May 2025

https://github.com/OpenGVLab/InternGPT

InternGPT (iGPT) is an open source demo platform where you can easily showcase your AI models. Now it supports DragGAN, ChatGPT, ImageBind, multimodal chat like GPT-4, SAM, interactive image editing, etc. Try it at igpt.opengvlab.com (an online demo system supporting DragGAN, ChatGPT, ImageBind, and SAM)

chatgpt click draggan foundation-model gpt gpt-4 gradio husky image-captioning imagebind internimage langchain llama llm multimodal sam segment-anything vicuna video-generation vqa

Last synced: 27 Mar 2025

https://github.com/opengvlab/interngpt

InternGPT (iGPT) is an open source demo platform where you can easily showcase your AI models. Now it supports DragGAN, ChatGPT, ImageBind, multimodal chat like GPT-4, SAM, interactive image editing, etc. Try it at igpt.opengvlab.com (an online demo system supporting DragGAN, ChatGPT, ImageBind, and SAM)

chatgpt click draggan foundation-model gpt gpt-4 gradio husky image-captioning imagebind internimage langchain llama llm multimodal sam segment-anything vicuna video-generation vqa

Last synced: 14 May 2025

https://github.com/roboflow/maestro

Streamline the fine-tuning process for multimodal models: PaliGemma 2, Florence-2, and Qwen2.5-VL

captioning fine-tuning florence-2 multimodal objectdetection paligemma phi-3-vision qwen2-vl transformers vision-and-language vqa

Last synced: 14 May 2025

https://github.com/open-compass/vlmevalkit

Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks

chatgpt claude clip computer-vision evaluation gemini gpt gpt-4v gpt4 large-language-models llava llm multi-modal openai openai-api pytorch qwen vit vqa

Last synced: 13 May 2025

https://github.com/bdbc-kg-nlp/qa-survey-cn

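A survey of intelligent question-answering research and applications by the natural language processing research team at the Advanced Innovation Center for Big Data, Beihang University. It covers knowledge-graph-based QA (KBQA), text-based QA (TextQA), table-based QA (TableQA), visual QA (VisualQA), and machine reading comprehension (MRC), with separate summaries of academic and industrial work for each task type.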

cqa kbqa nlp qa qa-survey question-answering survey tqa vqa

Last synced: 04 Feb 2026

https://github.com/BDBC-KG-NLP/QA-Survey-CN

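A survey of intelligent question-answering research and applications by the natural language processing research team at the Advanced Innovation Center for Big Data, Beihang University. It covers knowledge-graph-based QA (KBQA), text-based QA (TextQA), table-based QA (TableQA), visual QA (VisualQA), and machine reading comprehension (MRC), with separate summaries of academic and industrial work for each task type.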

cqa kbqa nlp qa qa-survey question-answering survey tqa vqa

Last synced: 27 Apr 2025

https://github.com/peteanderson80/bottom-up-attention

Bottom-up attention model for image captioning and VQA, based on Faster R-CNN and Visual Genome

caffe captioning-images faster-rcnn image-captioning mscoco mscoco-dataset visual-question-answering vqa

Last synced: 08 Apr 2025

https://github.com/nvlabs/prismer

The implementation of "Prismer: A Vision-Language Model with Multi-Task Experts".

image-captioning language-model multi-modal-learning multi-task-learning vision-and-language vision-language-model vqa

Last synced: 16 May 2025

https://github.com/open-compass/VLMEvalKit

Open-source evaluation toolkit for large vision-language models (LVLMs), supporting ~100 VLMs and 40+ benchmarks

chatgpt claude clip computer-vision evaluation gemini gpt gpt-4v gpt4 large-language-models llava llm multi-modal openai openai-api pytorch qwen vit vqa

Last synced: 20 Jul 2025

https://github.com/microsoft/Oscar

Oscar and VinVL

image-captioning image-text-search oscar pre-training vinvl vision-and-language vqa

Last synced: 21 Jul 2025

https://github.com/microsoft/oscar

Oscar and VinVL

image-captioning image-text-search oscar pre-training vinvl vision-and-language vqa

Last synced: 28 Sep 2025

https://github.com/hila-chefer/transformer-mm-explainability

[ICCV 2021 Oral] Official PyTorch implementation for Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers, a novel method to visualize any Transformer-based network. Includes examples for DETR and VQA.

clip detr explainability explainable-ai interpretability lxmert transformer transformers visualbert visualization vqa

Last synced: 12 Apr 2025

https://github.com/hila-chefer/Transformer-MM-Explainability

[ICCV 2021 Oral] Official PyTorch implementation for Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers, a novel method to visualize any Transformer-based network. Includes examples for DETR and VQA.

clip detr explainability explainable-ai interpretability lxmert transformer transformers visualbert visualization vqa

Last synced: 03 Apr 2025

https://github.com/hengyuan-hu/bottom-up-attention-vqa

An efficient PyTorch implementation of the winning entry of the 2017 VQA Challenge.

bottom-up-attention pytorch vqa

Last synced: 13 Apr 2025

https://github.com/cadene/vqa.pytorch

Visual Question Answering in PyTorch

clevr coco deep-learning pytorch resnet skipthoughts torch vgenome vqa

Last synced: 04 Apr 2025

https://github.com/jayleicn/clipbert

[CVPR 2021 Best Student Paper Honorable Mention, Oral] Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks.

cvpr2021 pytorch video-question-answering video-retrieval vision-and-language vqa

Last synced: 04 Apr 2025

https://github.com/Cadene/vqa.pytorch

Visual Question Answering in PyTorch

clevr coco deep-learning pytorch resnet skipthoughts torch vgenome vqa

Last synced: 01 Apr 2025

https://github.com/jayleicn/ClipBERT

[CVPR 2021 Best Student Paper Honorable Mention, Oral] Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks.

cvpr2021 pytorch video-question-answering video-retrieval vision-and-language vqa

Last synced: 12 May 2025

https://github.com/opengvlab/multi-modality-arena

Chatbot Arena meets multi-modality! Multi-Modality Arena allows you to benchmark vision-language models side-by-side while providing images as inputs. Supports MiniGPT-4, LLaMA-Adapter V2, LLaVA, BLIP-2, and many more!

chat chatbot chatgpt gradio large-language-models llms multi-modality vision-language-model vqa

Last synced: 20 Apr 2025

https://github.com/stanfordnlp/mac-network

Implementation for the paper "Compositional Attention Networks for Machine Reasoning" (Hudson and Manning, ICLR 2018)

attention clevr compositional-attention-networks machine-reasoning question-answering tensorflow vqa

Last synced: 13 May 2025

https://github.com/OpenGVLab/Multi-Modality-Arena

Chatbot Arena meets multi-modality! Multi-Modality Arena allows you to benchmark vision-language models side-by-side while providing images as inputs. Supports MiniGPT-4, LLaMA-Adapter V2, LLaVA, BLIP-2, and many more!

chat chatbot chatgpt gradio large-language-models llms multi-modality vision-language-model vqa

Last synced: 03 Apr 2025

https://github.com/davidmascharka/tbd-nets

PyTorch implementation of "Transparency by Design: Closing the Gap Between Performance and Interpretability in Visual Reasoning"

deep-learning machine-learning neural-networks pytorch visual-question-answering visualization vqa

Last synced: 06 Apr 2025

https://github.com/FuxiaoLiu/LRV-Instruction?tab=readme-ov-file

[ICLR'24] Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

chatgpt evaluation evaluation-metrics foundation-models gpt gpt-4 hallucination iclr iclr2024 llama llava multimodal object-detection prompt-engineering vicuna vision vision-and-language vqa

Last synced: 29 Mar 2025

https://github.com/cyanogenoid/pytorch-vqa

Strong baseline for visual question answering

baseline pytorch visual-question-answering vqa

Last synced: 07 Apr 2025

https://github.com/x-plug/mplug-2

mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video (ICML 2023)

foundation-models image-retrieval mllm mplug multimodal multimodal-pretraining video video-question-answering video-retrieval vqa

Last synced: 09 Sep 2025

https://github.com/JackYFL/awesome-VLLMs

This repository collects papers on VLLM applications. New papers are added on an irregular basis.

application embodied llm mllm reasoning-agent survey vllm vlm vqa

Last synced: 06 Nov 2025

https://github.com/wangleihitcs/papers

Some computer vision papers I have read, covering image-to-text generation, weakly supervised segmentation, and more

captions computer-vision cvpr eccv iccv image2text miccai natural-language-processing scene-text-detection-recognition vqa weakly-supervised-segmentation

Last synced: 02 Mar 2026

https://github.com/yuanze-lin/revive

[NeurIPS 2022] Official code for REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering

computer-vision deep-learning gpt-3 knowledge-based multimodal-deep-learning neurips2022 ok-vqa pytorch question-answering vision-and-languge vqa

Last synced: 09 Apr 2025

https://github.com/x-plug/mplug

mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections. (EMNLP 2022)

image-captioning image-text image-text-retrieval multimodal pretraining pytorch transformer visual-language vqa

Last synced: 26 Jun 2025

https://github.com/j-min/dsg

Davidsonian Scene Graph (DSG) for Text-to-Image Evaluation (ICLR 2024)

dsg llm text-to-image text-to-image-evaluation text-to-image-generation vqa

Last synced: 07 Apr 2025

https://github.com/kdexd/probnmn-clevr

Code for ICML 2019 paper "Probabilistic Neural-symbolic Models for Interpretable Visual Question Answering" [long-oral]

clevr icml icml-2019 neural-module-networks probabilistic-models vqa

Last synced: 07 May 2025

https://github.com/cloud-cv/vqa

CloudCV Visual Question Answering Demo

artificial-intelligence machine-learning vqa vqa-dataset

Last synced: 14 Jun 2025

https://github.com/China-UK-ZSL/ZS-F-VQA

[Paper][ISWC 2021] Zero-shot Visual Question Answering using Knowledge Graph

commonsense commonsense-reasoning fvqa knowledge-graph visual-question-answering vqa zero-shot zs-f-vqa zsl

Last synced: 21 Jul 2025

https://github.com/ap229997/Conditional-Batch-Norm

PyTorch implementation of NIPS 2017 paper "Modulating early visual processing by language"

cbn modulated-resnet pytorch vqa

Last synced: 11 May 2025

https://github.com/cdancette/rubi.bootstrap.pytorch

NeurIPS 2019 Paper: RUBi: Reducing Unimodal Biases for Visual Question Answering

bias bias-reduction deep-learning pytorch vqa

Last synced: 03 Jul 2025

https://github.com/lupantech/IconQA

Data and code for NeurIPS 2021 Paper "IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning".

commensense dataset mathai pytorch reasoning vqa

Last synced: 02 May 2025

https://github.com/lupantech/iconqa

Data and code for NeurIPS 2021 Paper "IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning".

commensense dataset mathai pytorch reasoning vqa

Last synced: 09 Mar 2026

https://github.com/mapluisch/llava-cli-with-multiple-images

LLaVA inference with multiple images at once for cross-image analysis.

image-concatenation image-processing inference llama2 llama2-13b llava lmm lmms pillow python python3 pytorch visual-question-answering vqa

Last synced: 20 Jun 2025

https://github.com/sutdcv/SUTD-TrafficQA

[CVPR2021] SUTD-TrafficQA: A Question Answering Benchmark and an Efficient Network for Video Reasoning over Traffic Events

annotations cvpr cvpr2021 dataset multimodal multimodal-deep-learning paper traffic-events video-qa video-reasoning vqa vqa-dataset

Last synced: 18 Mar 2025

https://github.com/sidgan/whats_in_a_question

CVPR'17 Spotlight: What’s in a Question: Using Visual Questions as a Form of Supervision

computer-vision deep-learning deep-neural-networks vqa

Last synced: 26 Aug 2025

https://github.com/lucidrains/aoa-pytorch

A PyTorch implementation of the Attention on Attention module (both self and guided variants), for Visual Question Answering

attention attention-mechanism captioning visual-question-answering vqa

Last synced: 13 Dec 2025

https://github.com/lupantech/dual-mfa-vqa

Co-attending Regions and Detections for VQA.

aaai attention-mechanism caffe faster-rcnn multi-gpu multi-modal object-detection torch visual-question-answering vqa

Last synced: 19 Feb 2026

https://github.com/aimagelab/reflectiva

[CVPR 2025] Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering

knowledge-base mllm multimodal vlm vqa

Last synced: 26 Aug 2025

https://github.com/mbzuai-oryx/kitab-bench

[ACL 2025 🔥] A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding

arabic benchmark layout-detection ocr pdf-to-text table-detection vlms vqa

Last synced: 19 Jun 2025

https://github.com/adrianbzg/llama-multimodal-vqa

Multimodal Instruction Tuning for Llama 3

chatbot chatgpt gpt-4 huggingface instruction-tuning language-models llama llama2 llama3 multimodal multimodal-instruction-tuning visual-language-learning visual-question-answering vqa

Last synced: 25 Oct 2025

https://github.com/vzhou842/easy-vqa

The Easy Visual Question Answering dataset.

dataset easy-vqa visual-question-answering vqa vqa-dataset

Last synced: 05 Aug 2025

https://github.com/kushalkafle/DVQA_dataset

DVQA Dataset: A Bar chart question answering dataset presented at CVPR 2018

bar-chart cvpr2018 dataset deep-learning question-answering vqa

Last synced: 02 May 2025

https://github.com/vzhou842/easy-vqa-keras

A Keras implementation of VQA using the easy-VQA dataset.

easy-vqa keras keras-tensorflow vqa

Last synced: 27 Mar 2025

https://github.com/vzhou842/easy-VQA-keras

A Keras implementation of VQA using the easy-VQA dataset.

easy-vqa keras keras-tensorflow vqa

Last synced: 11 Apr 2025

https://github.com/yashkant/concat-vqa

Official code for the paper "Contrast and Classify: Training Robust VQA Models" published at ICCV, 2021

concat robust vqa

Last synced: 23 Apr 2025

https://github.com/wangzheallen/stl-vqa

Good practices from this VQA system, such as POS-tag attention, structured triplet learning, and triplet attention, are very general and can be inserted into almost any vision-and-language task

deep-learning practice tensorflow vision-and-language vqa

Last synced: 08 Mar 2026

https://nextplusplus.github.io/TAT-DQA/

TAT-DQA: Towards Complex Document Understanding By Discrete Reasoning

document-understanding question-answering vqa

Last synced: 27 Oct 2025

https://github.com/raeidsaqur/mgn

Multimodal Graph Network (MGN): Code repo, examples from the paper

compositionality gnn program-synthesis vqa

Last synced: 03 Sep 2025

https://github.com/abachaa/VQA-Med-2021

VQA-Med 2021

medical-imaging radiology visual-question-answering visual-question-generation vqa vqa-dataset vqa-med

Last synced: 03 Apr 2025

https://github.com/google-research-datasets/maverics

MAVERICS (Manually-vAlidated Vq^2a Examples fRom Image-Caption datasetS) is a suite of test-only benchmarks for visual question answering (VQA).

data-creation evaluation maverics multimodal vq2a vqa vqa-dataset

Last synced: 16 Apr 2025

https://github.com/mbzuai-oryx/camel-bench

CAMEL-Bench is an Arabic benchmark for evaluating multimodal models across eight domains with 29,000 questions.

arabic benchmark large-multimodal-models mbzuai multimodal-learning visual-question-answering vqa

Last synced: 01 May 2025

https://badripatro.github.io/Question-Paraphrases/

adversarial-machine-learning adversarial-networks answers coling2018 deep-neural-networks paraphrase-generation paraphrase-identification question-answering questions-generation sentiment-analysis sentiment-classification sentiment-scores vqa vqg

Last synced: 12 May 2025

https://github.com/vzhou842/easy-vqa-demo

A web-based JavaScript demo of an easy-VQA model.

demo-app easy-vqa keras react tensorflowjs vqa

Last synced: 27 Mar 2025

https://github.com/seujung/relational-network-gluon

Gluon implementation of "A simple neural network module for relational reasoning"

deep-learning gluon mxnet relational-networks vqa

Last synced: 17 Apr 2025

https://github.com/eurus-holmes/pythia-vqa

Baseline for Visual Question Answering.

baseline multimodal vqa

Last synced: 02 May 2025

https://github.com/abdur75648/medicalgpt

Medical Report Generation And VQA (Adapting XrayGPT to Any Modality)

chatgpt chatgpt4o llama llm llms medical-dataset medical-imaging medical-report-generation medicalgpt minigpt4 multimodal-llm vicuna vqa vqa-dataset xraygpt

Last synced: 01 May 2026

https://github.com/yang-yifan/vqa-gan

Generative Visual Question Answering in PyTorch

image-synthesis pytorch vqa

Last synced: 31 Oct 2025

https://github.com/pavansomisetty21/visual-question-answering-using-gemini-llm

This project explores visual question answering using the Gemini LLM, with the image supplied as a URL or in another format

artificial-intelligence blip blip2 gemini gemini-flash generative-ai generative-model git question-answering vision-language-model vision-transformer visual-models visual-question-answering vlm vqa

Last synced: 30 Apr 2025

https://github.com/ailln/vqa-roadmap

🍌Visual Question Answering Roadmap.

roadmap visual-question-answering vqa

Last synced: 19 Mar 2026

https://github.com/amirshnll/persian-visual-question-answering

Visual Question Answering in Persian Based on deep learning techniques (paper code)

deep-learning persian persian-vqa resnext resnext-101 visual-question-answering vqa

Last synced: 16 Mar 2025

https://github.com/ekinakyurek/mac-network

VQA: Memory, Attention and Composition (MAC) Network for CLEVR implemented via KnetLayers

attention clevr deep-learning knet machine-learning vqa

Last synced: 25 Mar 2025

https://github.com/chen0040/mxnet-vqa

Yet Another Visual Question Answering in MXNet

image-encoding mxnet text-encoding visual-question-answering vqa

Last synced: 03 Apr 2025

https://github.com/sumedhpendurkar/amf-vqa

attention-mechanism deep-learning multimodal-deep-learning neural-networks vqa

Last synced: 23 May 2026

https://github.com/cansik/vqa-service

VQA application that allows users to ask questions about images and receive answers.

gradio python service vqa

Last synced: 08 Jul 2025

https://github.com/esborisova/scivqa

SciVQA: Scientific Visual Question Answering shared task

chart-understanding shared-task vqa

Last synced: 06 Jul 2025

https://github.com/nikhilroxtomar/visual-question-answer

An easy and simple implementation of Visual Question Answering (VQA) in TensorFlow and PyTorch (coming soon).

pytorch tensorflow visual-question-answering vqa

Last synced: 30 Apr 2026

https://github.com/msmrexe/neurosymbolic-vqa-program-generator

A comprehensive implementation of a Neurosymbolic framework for Visual Question Answering (VQA) on the CLEVR dataset. This project translates natural language questions into symbolic programs using three different learning strategies: Supervised (LSTM & Transformer), Reinforcement Learning (REINFORCE), and In-Context Learning (LLM).

clevr course-project in-context-learning large-language-models lstm neurosymbolic neurosymbolic-ai policy-gradient program-generator pytorch reinforce reinforcement-learning seq2seq supervised-learning system-2 transformer university-project visual-question-answering visual-reasoning vqa

Last synced: 07 May 2026

https://github.com/reshalfahsi/vqa-clip-lstm

Visual Question Answering Using CLIP + LSTM

clip lstm nlp pytorch pytorch-lightning visual-question-answering vizwiz-vqa vqa

Last synced: 11 May 2026

https://github.com/arulkumarann/vqa_implementation

vanilla vqa_v1 PyTorch implementation

pytorch vggnet vqa

Last synced: 03 May 2026

https://github.com/0xnu/tiny_llm_trainer

The experiment implements a tiny language model trainer using PyTorch.

large-language-model large-language-models llm llm-training pytorch text-generation text-to-speech tts visual-question-answering vqa wiki wikipedia

Last synced: 03 Apr 2025

https://github.com/cserajdeep/visual-question-answering-vqa

Visual Question Answering (VQA)

computer-vision flask keras python tensorflow vqa vqa-dataset

Last synced: 28 Apr 2026

https://github.com/orshkuri/vqa-qformer-comparison

A benchmark and analysis of QFormer, Cross Attention, and Concat models for binary Visual Question Answering (VQA) using CLIP and BERT+ViT-CLIP encoders.

bert clip deep-learning-multimodal pytorch pytorch-lightning transformers vqa

Last synced: 30 Apr 2026

https://github.com/rakshath66/ask-your-image

Ask questions about any image using AI. A smart Streamlit app powered by BLIP that answers visual questions, generates captions, and lets you download a PDF report.

ai-app blip caption-generator computer-vision deep-learning generative-ai huggingface image-captioning image-processing image-question-answering interactive-ui multimodal-ai openai pdf-generator pytorch streamlit transformers vision-language visual-question-answering vqa

Last synced: 06 May 2026

https://github.com/mahmood-anaam/violet

Violet is a Python-based library designed for generating Arabic image captions. The pipeline leverages state-of-the-art transformer models, providing an easy-to-use interface for researchers and developers working on tasks such as image captioning and visual question answering (VQA).

image-captioning okvqa python3 pytorch transformers vqa vqav2

Last synced: 07 May 2026