https://github.com/emrys-hong/multimodal_peft
- Host: GitHub
- URL: https://github.com/emrys-hong/multimodal_peft
- Owner: Emrys-Hong
- Created: 2023-05-30T09:12:56.000Z (almost 2 years ago)
- Default Branch: master
- Last Pushed: 2023-06-25T21:25:29.000Z (almost 2 years ago)
- Last Synced: 2025-01-15T08:19:15.896Z (4 months ago)
- Language: Python
- Size: 36.1 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: readme.md
# MultiModal PEFT Datasets
Here I list the datasets used by the following papers:
- InstructBLIP
- i-Code V2

[√] means the dataset is already downloaded and has an available dataloader in `data_loading.py`.
[x] means the dataset has not been downloaded or is unavailable.

This readme only covers the datasets mentioned in the InstructBLIP paper or the i-Code V2 paper. For the full list of implemented datasets, please check the `test_model` function in `data_loading.py`.
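For orientation, the sketch below shows one way the `[√]` datasets could be exposed through a registry-style loader. It is a minimal, hypothetical illustration: `DATASET_BUILDERS`, `register_dataset`, and `get_dataloader` are invented names and do not describe the actual interface of `data_loading.py`.
```python
# Hypothetical sketch only -- not the actual interface of data_loading.py.
from torch.utils.data import DataLoader, Dataset

# Illustrative registry mapping dataset names (the [√] entries below) to builder callables.
DATASET_BUILDERS = {}

def register_dataset(name):
    """Register a Dataset builder under a string key."""
    def wrapper(builder):
        DATASET_BUILDERS[name] = builder
        return builder
    return wrapper

def get_dataloader(name, batch_size=8, **kwargs):
    """Look up a registered builder and wrap the resulting Dataset in a DataLoader."""
    if name not in DATASET_BUILDERS:
        raise KeyError(f"'{name}' has no dataloader; available: {sorted(DATASET_BUILDERS)}")
    dataset: Dataset = DATASET_BUILDERS[name](**kwargs)
    return DataLoader(dataset, batch_size=batch_size, shuffle=True)
```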
## Speech Language Pretraining Tasks
### ASR
[x] i-Code: An Integrative and Composable Multimodal Learning Framework (from i-Code V2)
```
Transcribe the speech utterance to text
```
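As a rough illustration of the instruction format, one ASR utterance could be wrapped as in the sketch below; the field names (`audio`, `instruction`, `target`) are assumptions, not the schema used in this repo.
```python
# Illustrative only: the field names below are assumptions, not the schema in data_loading.py.
def make_asr_sample(audio_path: str, transcript: str) -> dict:
    """Pair one utterance with the ASR instruction shown above."""
    return {
        "audio": audio_path,          # path to the raw waveform
        "instruction": "Transcribe the speech utterance to text",
        "target": transcript,         # ground-truth transcription
    }

sample = make_asr_sample("path/to/utterance.flac", "the transcribed sentence")
```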
### Sentiment Analysis
[√] MOSEI (from i-Code V2)
[x] Spoken Language Understanding Evaluation (SLUE) (from i-Code V2)
```
Predict the sentiment of this segment:
```
### Emotion Recognition
[√] CMU-MOSEI (from i-Code V2)
```
Predict the emotion of this segment:
```
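CMU-MOSEI annotates each segment with a continuous sentiment score in [-3, 3] and per-emotion intensities for six emotions, so the prompts above need those labels rendered as text targets. Below is a minimal sketch of one possible conversion; the thresholds and label wording are assumptions rather than this repo's exact preprocessing.
```python
# Illustrative CMU-MOSEI label-to-text conversion; thresholds and wording are assumptions.
MOSEI_EMOTIONS = ["happiness", "sadness", "anger", "fear", "disgust", "surprise"]

def sentiment_target(score: float) -> str:
    """Map the continuous sentiment score in [-3, 3] to a coarse text label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

def emotion_target(intensities, threshold: float = 0.0) -> str:
    """List every emotion whose annotated intensity exceeds the threshold."""
    present = [name for name, value in zip(MOSEI_EMOTIONS, intensities) if value > threshold]
    return ", ".join(present) if present else "none"

print(sentiment_target(1.8))                             # positive
print(emotion_target([2.0, 0.0, 0.0, 0.0, 0.0, 0.6]))    # happiness, surprise
```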
### Speech Augmented Text Reconstruction
```
Reconstruct the following text based on the speech:
```
## Vision Language Pretraining Tasks
### Vision Captioning for Image
[x] Florence image-text pair dataset (from i-Code V2)
[√] COCO Caption (from InstructBLIP)
[x] Web CapFilt (from InstructBLIP; used in BLIP and BLIP-2)
[x] NoCaps (from InstructBLIP, held-out dataset) NoCaps contains 15,100 images with 166,100 human-written captions for novel object image captioning. The validation split is used.
[√] Flickr30K (from InstructBLIP, held-out dataset) The 1K test split is used.
[√] TextCaps (from InstructBLIP) An image captioning dataset that requires the model to comprehend and reason about text in images.
```
Generate the caption for this image:
```
### Vision Captioning for Video
[√] WebVid-10M (from i-Code V2)
```
Generate the caption for this video
```
### VQA for Image
[√] VQA v2 (from i-Code V2 and InstructBLIP) A dataset for open-ended image question answering.
[x] VizWiz (from InstructBLIP, held-out set) Contains visual questions asked by people who are blind; 8K images are used for the held-out evaluation.
[x] GQA (from InstructBLIP, held-out set) Contains image questions for scene understanding and reasoning. The balanced test-dev set is used as the held-out split.
[x] Visual Spatial Reasoning (from InstructBLIP, held-out set) VSR is a collection of image-text pairs in which the text describes the spatial relation of two objects in the image; models are required to classify the description as true or false.
[x] IconQA (from InstructBLIP, held-out set) IconQA measures the abstract diagram understanding and comprehensive cognitive reasoning abilities of models.
[√] OKVQA (from InstructBLIP) OKVQA contains visual questions that require outside knowledge to answer.
[√] A-OKVQA (from InstructBLIP) A-OKVQA is a successor of OKVQA with more challenging and diverse questions.
[x] ScienceQA (from InstructBLIP, held-out set) ScienceQA covers diverse science topics with corresponding lectures and explanations. In our setting, only the part with image context (IMG) is used.
[√] VisualDialog (from InstructBLIP, held-out set) Visual Dialog is a conversational question answering dataset.
```
Answer the following question based on the image:
```
### VQA for Video
[√] MSVD-QA (from InstructBLIP, held-out set) The test set (13K video QA pairs) of MSVD-QA is used for held-out testing.
[√] MSRVTT-QA (from InstructBLIP, held-out set) MSRVTT-QA has more complex scenes than MSVD, with 72K video QA pairs in its test set.
[x] iVQA (from InstructBLIP, held-out set) iVQA is a video QA dataset with mitigated language biases. It has 6K/2K/2K samples for train/val/test.
```
Answer the following question based on the Video:
```
### Vision Augmented Text Reconstruction
Same datasets as image captioning.
```
Reconstruct the following text based on the image: [with masked text]
```
### Others
[√] OCR-VQA (from InstructBLIP) Contains visual questions that require models to read text in the image.
[x] TextVQA (from InstructBLIP, held-out set) TextVQA requires models to comprehend visual text to answer questions.
[x] HatefulMemes (from InstructBLIP, held-out set) A binary classification dataset to judge whether a meme contains hateful content.
[√] LLaVA-Instruct-150K (from InstructBLIP) An instruction-tuning dataset with three parts: detailed caption (23K), reasoning (77K), and conversation (58K).
## Language-Only Tasks
### Text Reconstruction
```
Reconstruct masked spans in the following text:
```
## Prompt Design for Specific Datasets
MOSEI:
```
1. Video + audio + text + instruction: "predict the sentiment"
2. Video + audio + instruction: "predict the sentiment"
3. Video + instruction: "predict the sentiment"
4. Audio + instruction: "predict the sentiment"
```
LibriSpeech:
```
1. ASR
2. audio + masked text to text reconstruction
```
COCO:
```
1. Image Captioning
2. Image + masked text to text reconstruction
```
Text generation (Wikipedia + BookCorpus)
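To make the four MOSEI prompt variants above concrete, here is a hedged sketch of assembling one training record per modality combination; the dictionary keys and the random choice of combination are illustrative assumptions, not this repo's actual pipeline.
```python
import random

# Illustrative assembly of the four MOSEI prompt variants listed above; keys are assumptions.
INSTRUCTION = "predict the sentiment"

MODALITY_COMBINATIONS = [
    ("video", "audio", "text"),   # 1. video + audio + text + instruction
    ("video", "audio"),           # 2. video + audio + instruction
    ("video",),                   # 3. video + instruction
    ("audio",),                   # 4. audio + instruction
]

def make_mosei_sample(segment: dict) -> dict:
    """Build one instruction-tuning record from a randomly chosen modality combination."""
    combo = random.choice(MODALITY_COMBINATIONS)
    sample = {modality: segment[modality] for modality in combo}
    sample["instruction"] = INSTRUCTION
    sample["target"] = segment["sentiment_label"]   # e.g. "positive"
    return sample

segment = {"video": "clip.mp4", "audio": "clip.wav",
           "text": "segment transcript", "sentiment_label": "positive"}
print(make_mosei_sample(segment))
```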
### Other possible datasets
- [YouTube-8M](https://research.google.com/youtube8m/) Video classification dataset
- [Multimodal C4](https://github.com/allenai/mmc4) A billion-scale corpus of images interleaved with text
- [VALOR](https://github.com/TXH-mercury/VALOR) VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset