https://github.com/reshalfahsi/vqa-clip-lstm
Visual Question Answering Using CLIP + LSTM
- Host: GitHub
- URL: https://github.com/reshalfahsi/vqa-clip-lstm
- Owner: reshalfahsi
- Created: 2024-04-21T21:48:39.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2024-05-05T14:08:48.000Z (over 1 year ago)
- Last Synced: 2025-01-15T07:14:38.235Z (9 months ago)
- Topics: clip, lstm, nlp, pytorch, pytorch-lightning, visual-question-answering, vizwiz-vqa, vqa
- Language: Jupyter Notebook
- Homepage:
- Size: 4.67 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Visual Question Answering Using CLIP + LSTM
*CLIP + LSTM architecture.*
The visual question answering (VQA) problem can be described as asking a computer to answer questions about a given image. In this project, a CLIP + LSTM architecture is used to tackle the problem. CLIP's image and text encoders encode the given image and question, respectively. The concatenated image-text representation from CLIP is then combined with the vectorized answer text via the Hadamard product before being fed to the LSTM. The answer is then generated autoregressively, token by token. The VizWiz-VQA dataset is used to train, validate, and test the model: its training set is split 99:1 for training and validation, and its validation set is used for testing. The SQuAD (exact match and F1) and BLEU metrics quantify the model's performance. At inference time, the VizWiz-VQA test set is used.
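To make the data flow concrete, below is a minimal PyTorch sketch of that pipeline, assuming precomputed 512-d CLIP features (e.g., from ViT-B/32) and an already tokenized answer. The class name, layer sizes, vocabulary size, BOS/EOS ids, and the way the fused feature is broadcast over the answer tokens are illustrative assumptions, not the notebook's exact implementation.

```python
import torch
import torch.nn as nn


class CLIPLSTMVQA(nn.Module):
    """Minimal sketch of the CLIP + LSTM decoder described above.

    Layer sizes, vocabulary size, and the broadcasting of the fused CLIP
    feature over answer tokens are assumptions for illustration only.
    """

    def __init__(self, vocab_size=10000, clip_dim=512, embed_dim=256, hidden_dim=256):
        super().__init__()
        # Project the concatenated image+question CLIP features (2 * clip_dim)
        # to the answer-embedding size so the Hadamard product is well defined.
        self.fuse = nn.Linear(2 * clip_dim, embed_dim)
        self.answer_embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feat, question_feat, answer_tokens):
        # image_feat, question_feat: (B, clip_dim) from CLIP's image/text encoders.
        # answer_tokens: (B, T) token ids of the (shifted) answer sequence.
        fused = self.fuse(torch.cat([image_feat, question_feat], dim=-1))  # (B, E)
        tokens = self.answer_embed(answer_tokens)                          # (B, T, E)
        gated = tokens * fused.unsqueeze(1)                                # Hadamard product
        out, _ = self.lstm(gated)                                          # (B, T, H)
        return self.head(out)                                              # (B, T, vocab)


@torch.no_grad()
def greedy_decode(model, image_feat, question_feat, bos_id=1, eos_id=2, max_len=20):
    """Autoregressive greedy decoding; BOS/EOS ids and max length are hypothetical."""
    tokens = torch.full((image_feat.size(0), 1), bos_id, dtype=torch.long)
    for _ in range(max_len):
        logits = model(image_feat, question_feat, tokens)      # (B, T, vocab)
        next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)  # most likely next token
        tokens = torch.cat([tokens, next_tok], dim=1)
        if (next_tok == eos_id).all():
            break
    return tokens


# Shape-only smoke test with random CLIP-sized features.
model = CLIPLSTMVQA()
answer = greedy_decode(model, torch.randn(2, 512), torch.randn(2, 512))
print(answer.shape)  # (2, <= max_len + 1)
```

The smoke test only checks shapes; in the actual project, the features would come from CLIP's `encode_image` and `encode_text`, and training would be orchestrated with PyTorch Lightning.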
## Experiment
Give yourself a delightful excursion through the lines of code for the experiment, provided in this [notebook](https://github.com/reshalfahsi/vqa-clip-lstm/blob/master/Visual_Question_Answering_Using_CLIP_LSTM.ipynb).
## Result
### Quantitative Result
Here are the evaluation metric results of the model.
Metric | Score
------------------------------ | -------------
BLEU 1-gram | 44.67%
Exact Match | 44.43%
F1-score | 44.83%
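These scores correspond to SQuAD-style exact match/F1 and 1-gram BLEU. As a reference for how such scores can be computed, here is a minimal sketch using torchmetrics with toy strings; the tooling and the toy example are assumptions, not the notebook's actual evaluation code.

```python
from torchmetrics.text import SQuAD, BLEUScore

# Toy prediction/reference pair, only to illustrate the expected input format.
preds = [{"prediction_text": "a red traffic light", "id": "q0"}]
target = [{"answers": {"text": ["a red traffic light"], "answer_start": [0]}, "id": "q0"}]

squad = SQuAD()
print(squad(preds, target))  # dict with 'exact_match' and 'f1' on a 0-100 scale

bleu = BLEUScore(n_gram=1)
print(bleu(["a red traffic light"], [["a red traffic light"]]))  # unigram BLEU in [0, 1]
```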
### Loss Curve
*Loss curves of the CLIP + LSTM model on the training and validation sets.*
### Qualitative Result
The following image exhibits the collated results of the VQA model.
*A collection of qualitative results containing question-answer-image triads.*
## Credit
- [An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/pdf/2010.11929.pdf)
- [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/pdf/2103.00020.pdf)
- [Less Is More: Linear Layers on CLIP Features as Powerful VizWiz Model](https://arxiv.org/pdf/2206.05281.pdf)
- [Long Short-Term Memory](https://www.bioinf.jku.at/publications/older/2604.pdf)
- [SQuAD: 100,000+ Questions for Machine Comprehension of Text](https://arxiv.org/pdf/1606.05250.pdf)
- [A Call for Clarity in Reporting BLEU Scores](https://arxiv.org/pdf/1804.08771)
- [CLIP](https://github.com/openai/CLIP)
- [LLaMA 2 from scratch 🦙](https://github.com/viai957/llama-inference)
- [yousefkotp's Visual Question Answering](https://github.com/yousefkotp/Visual-Question-Answering)
- [aladdinpersson's Image Captioning](https://github.com/aladdinpersson/Machine-Learning-Collection/tree/master/ML/Pytorch/more_advanced/image_captioning)
- [VizWiz-VQA](https://vizwiz.org/tasks-and-datasets/vqa/)
- [PyTorch Lightning](https://lightning.ai/docs/pytorch/latest/)