https://github.com/fork123aniket/graph-neural-network-based-visual-question-answering
Implementation of GNNs for Visual Question Answering task in PyTorch
https://github.com/fork123aniket/graph-neural-network-based-visual-question-answering
computer-vision encoder-decoder-architecture encoder-decoder-model graph-neural-networks natural-language-processing pytorch pytorch-geometric pytorch-implementation seq2seq-model visual-question-answering
Last synced: 4 months ago
JSON representation
Implementation of GNNs for Visual Question Answering task in PyTorch
- Host: GitHub
- URL: https://github.com/fork123aniket/graph-neural-network-based-visual-question-answering
- Owner: fork123aniket
- License: mit
- Created: 2022-11-04T07:44:49.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-04-30T17:57:15.000Z (about 2 years ago)
- Last Synced: 2024-11-15T18:09:59.843Z (6 months ago)
- Topics: computer-vision, encoder-decoder-architecture, encoder-decoder-model, graph-neural-networks, natural-language-processing, pytorch, pytorch-geometric, pytorch-implementation, seq2seq-model, visual-question-answering
- Language: Python
- Homepage:
- Size: 865 KB
- Stars: 3
- Watchers: 2
- Forks: 1
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Generating Multimodal Representations for VQA using Graph Neural Networks
This repository aims at providing a GNN-based implementation to reason over both input modalities and improve performance on a VQA dataset. Here, the model receives an image `im` and a text-based question `t` and outputs the answer to the question `t`. The following approach is being followed by the GNN model for the Visual Question Answering task:-
- Processing our input question into a graph Gt and image into a graph Gim using the Graph Parser.
- Passing the text graph Gt and image graph Gim into a graph neural network (GNN) to get the text and image node embeddings.
- Combining the embeddings using the Graph Matcher, which projects the text embeddings into the image embedding space and returns the combined multimodal representation of the input.
- Passing the joint representation through a sequence to sequence model to output the answer to the question.## Requirements
- `PyTorch`
- `PyTorch Geometric`
- `Numpy`
- `rsmlkit`## Usage
- This implementations trains the GNN model on the ***CLEVR*** dataset (a diagnostic dataset of `3D` shapes that tests visual and linguistic reasoning), which can be downloaded from [***here***](https://cs.stanford.edu/people/jcjohns/clevr/).
- To pre-process and prepare the dataset for training, run `Dataset.py`
- To see the GNN-based model implementation, check `Model.py`
- `Match.py` is responsible for matching nodes locally via a graph neural network and then updating correspondence scores iteratively
- To see RNN-based Encoder-Decoder implementation and how it interacts with GNN model, check `Encoder_Decoder.py`
- `Parser.py` is responsible for instantiating the models as well as taking care of loading and saving of model checkpoints
- To train the whole model pipeline, run `Train.py`## Results
The predicted answer to each question alongside its corresponsing image can be seen in the following attached output images:-
| Image | Question | GNN-Generated Answer|
| ----- |:--------:|:-------------------:|
|  | There is a tiny matte thing that is left of the big gray cylinder and in front of the cyan metal cylinder; what color is it? | Purple |
|  | There is a small blue block; are there any spheres to the left of it? | Yes |