https://github.com/muennighoff/vilio
🥶Vilio: State-of-the-art VL models in PyTorch & PaddlePaddle
ernie-vil hateful-memes lxmert oscar transformers uniter vision-and-language vision-transformer visualbert
- Host: GitHub
- URL: https://github.com/muennighoff/vilio
- Owner: Muennighoff
- License: mit
- Created: 2020-10-28T08:17:32.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2023-06-08T08:35:51.000Z (almost 3 years ago)
- Last Synced: 2025-07-12T17:43:10.086Z (11 months ago)
- Topics: ernie-vil, hateful-memes, lxmert, oscar, transformers, uniter, vision-and-language, vision-transformer, visualbert
- Language: Python
- Homepage: https://arxiv.org/abs/2012.07788
- Size: 10.4 MB
- Stars: 90
- Watchers: 2
- Forks: 28
- Open Issues: 6
Metadata Files:
- Readme: README.md
- License: LICENSE
🥶VILIO🥶
State-of-the-art Visio-Linguistic Models 🥶
## Updates
### 06/2021 - Hateful Memes CSV Files
- The CSV files used to produce the scores in the Vilio paper are now available here
### 06/2021 - Inference on any meme
- Thanks to the initiative by katrinc, here are two notebooks for using Vilio to perform pure inference on any meme you want :)
- Just adapt the example input dataset / input model to use a different meme / pretrained model🥶
- GPU: https://www.kaggle.com/muennighoff/vilioexample-nb
- CPU: https://www.kaggle.com/muennighoff/vilioexample-nb-cpu
## Ordering
Vilio aims to replicate the organization of Hugging Face's transformers repo at:
https://github.com/huggingface/transformers
- /bash
Shell files to reproduce hateful memes results
- /data
By default, directory for loading in data & saving checkpoints
- /ernie-vil
Ernie-vil sub-repository written in PaddlePaddle
- /fts_lmdb
Scripts for handling .lmdb extracted features
- /fts_tsv
Scripts for handling .tsv extracted features
- /notebooks
Jupyter Notebooks for demonstration & reproducibility
- /py-bottom-up-attention
Sub-repository for tsv feature extraction forked & adapted from [here](https://github.com/airsplay/py-bottom-up-attention)
- /src/vilio
All implemented models (also see below for a quick overview of models)
- /utils
Pandas & ensembling scripts for data handling
- entry.py files
Scripts used to access the models and apply model-specific data preparation
- pretrain.py files
Pre-training counterparts of the entry files; point of entry for pre-training
- hm.py
Training code for the hateful memes challenge; Main point of entry
- param.py
Arguments for running hm.py
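As a rough illustration of the kind of ensembling the /utils scripts are meant for, the sketch below averages predicted probabilities from several models per meme. It is not the code shipped in /utils: the function name `average_probabilities` and the column names `id` and `proba` are assumptions made for this example, and a plain mean stands in for whatever ensembling strategy the repository actually applies.

```python
import csv
import io
from collections import defaultdict


def average_probabilities(csv_texts):
    """Average the `proba` column across several prediction CSVs, keyed by `id`.

    The column names `id` and `proba` are assumptions for this sketch.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            sums[row["id"]] += float(row["proba"])
            counts[row["id"]] += 1
    # Plain mean per id; an id missing from some files is averaged
    # over the files that do contain it.
    return {key: sums[key] / counts[key] for key in sums}


# Two hypothetical model outputs for the same two memes.
model_a = "id,proba\n1,0.9\n2,0.2\n"
model_b = "id,proba\n1,0.7\n2,0.4\n"
ensemble = average_probabilities([model_a, model_b])
```

Reading from strings keeps the example self-contained; with real prediction files, the same function could be fed the contents of each file.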
## Usage
Follow SCORE_REPRO.md for reproducing performance on the Hateful Memes Task.
Follow GETTING_STARTED.md for using the framework for your own task.
See the paper at: https://arxiv.org/abs/2012.07788
## Architectures
🥶 Vilio currently provides the following architectures with the outlined language transformers:
1. **[E - ERNIE-VIL](https://arxiv.org/abs/2006.16934)** [ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph](https://arxiv.org/abs/2006.16934)
- [ERNIE: Enhanced Language Representation with Informative Entities](https://arxiv.org/abs/1905.07129)
1. **[D - DeVLBERT](https://arxiv.org/abs/2008.06884)** [DeVLBert: Learning Deconfounded Visio-Linguistic Representations](https://arxiv.org/abs/2008.06884)
- [BERT: Bidirectional Transformers](https://arxiv.org/abs/1810.04805)
1. **[O - OSCAR](https://arxiv.org/abs/2004.06165)** [Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks](https://arxiv.org/abs/2004.06165)
- [BERT: Bidirectional Transformers](https://arxiv.org/abs/1810.04805)
1. **[U - UNITER](https://arxiv.org/abs/1909.11740)** [UNITER: UNiversal Image-TExt Representation Learning](https://arxiv.org/abs/1909.11740)
- [BERT: Bidirectional Transformers](https://arxiv.org/abs/1810.04805)
- [RoBERTa: Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692)
1. **[V - VisualBERT](https://arxiv.org/abs/1908.03557)** [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/abs/1908.03557)
- [ALBERT: A Lite BERT](https://arxiv.org/abs/1909.11942)
- [BERT: Bidirectional Transformers](https://arxiv.org/abs/1810.04805)
- [RoBERTa: Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692)
1. **[X - LXMERT](https://arxiv.org/abs/1908.07490)** [LXMERT: Learning Cross-Modality Encoder Representations from Transformers](https://arxiv.org/abs/1908.07490)
- [ALBERT: A Lite BERT](https://arxiv.org/abs/1909.11942)
- [BERT: Bidirectional Transformers](https://arxiv.org/abs/1810.04805)
- [RoBERTa: Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692)
## To-do's
- [ ] Clean up import statements and Python paths & find a better way to integrate transformers (right now, import statements only work when run from the main folder)
- [ ] Enable loading and running models just via import statements (and not having to clone the repo)
- [ ] Find a way to better include ERNIE-VIL in this repo (PaddlePaddle to Torch?)
- [ ] Move tokenization in entry files to model-specific tokenization similar to transformers
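Until the first item is resolved, one possible workaround for running code from outside the main folder is to put the cloned repository on `sys.path` before importing anything from it. The sketch below is a generic Python pattern, not something the repository provides: the helper name `add_repo_to_path` and the `./vilio` location are placeholders, and whether every import then resolves depends on how the individual scripts reference each other.

```python
import sys
from pathlib import Path


def add_repo_to_path(repo_root):
    """Put a local checkout at the front of sys.path (safe to call repeatedly)."""
    root = str(Path(repo_root).resolve())
    if root not in sys.path:
        sys.path.insert(0, root)
    return root


# "./vilio" is a placeholder; point it at wherever the repository was cloned.
vilio_root = add_repo_to_path("./vilio")
```

Inserting at the front of the list means modules from the checkout take precedence over identically named installed packages, which may or may not be what you want.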
## Attributions
The code heavily borrows from the following repositories, thanks for their great work:
- https://github.com/huggingface/transformers
- https://github.com/facebookresearch/mmf
- https://github.com/airsplay/lxmert
## Citation
```bibtex
@article{muennighoff2020vilio,
title={Vilio: State-of-the-art visio-linguistic models applied to hateful memes},
author={Muennighoff, Niklas},
journal={arXiv preprint arXiv:2012.07788},
year={2020}
}
```