https://github.com/muennighoff/vilio

🥶Vilio: State-of-the-art VL models in PyTorch & PaddlePaddle
Host: GitHub
URL: https://github.com/muennighoff/vilio
Owner: Muennighoff
License: mit
Created: 2020-10-28T08:17:32.000Z (over 5 years ago)
Default Branch: master
Last Pushed: 2023-06-08T08:35:51.000Z (about 3 years ago)
Last Synced: 2025-07-12T17:43:10.086Z (about 1 year ago)
Topics: ernie-vil, hateful-memes, lxmert, oscar, transformers, uniter, vision-and-language, vision-transformer, visualbert
Language: Python
Homepage: https://arxiv.org/abs/2012.07788
Size: 10.4 MB
Stars: 90
Watchers: 2
Forks: 28
Open Issues: 6
Metadata Files:
- Readme: README.md
- License: LICENSE

# 🥶VILIO🥶

State-of-the-art Visio-Linguistic Models 🥶

## Updates

### 06/2021 - Hateful Memes CSV Files

- The CSV files used to produce the scores in the Vilio paper are now available here

### 06/2021 - Inference on any meme

- Thanks to the initiative by katrinc, here are two notebooks for using Vilio to perform pure inference on any meme you want :)

- Just adapt the example input dataset / input model to use a different meme / pretrained model🥶

- GPU: https://www.kaggle.com/muennighoff/vilioexample-nb

- CPU: https://www.kaggle.com/muennighoff/vilioexample-nb-cpu

## Ordering

Vilio aims to replicate the organization of Hugging Face's transformers repo at:

https://github.com/huggingface/transformers

- /bash

  Shell files to reproduce the Hateful Memes results

- /data

  Default directory for loading data & saving checkpoints

- /ernie-vil

  ERNIE-ViL sub-repository written in PaddlePaddle

- /fts_lmdb

  Scripts for handling .lmdb extracted features

- /fts_tsv

  Scripts for handling .tsv extracted features

- /notebooks

  Jupyter Notebooks for demonstration & reproducibility

- /py-bottom-up-attention

  Sub-repository for tsv feature extraction, forked & adapted from [here](https://github.com/airsplay/py-bottom-up-attention)

- /src/vilio

  All implemented models (see Architectures below for a quick overview)

- /utils

  Pandas & ensembling scripts for data handling

- entry.py files

  Scripts used to access the models and apply model-specific data preparation

- pretrain.py files

  Same purpose as the entry files, but they serve as the point of entry for pre-training

- hm.py

  Training code for the Hateful Memes challenge; main point of entry

- param.py

  Args for running hm.py
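
The /utils entry above mentions ensembling scripts. As a rough illustration of the idea, and not the code shipped in this repo, here is a minimal sketch that averages predicted probabilities from several prediction files. The column names `id` and `proba`, the 0.5 threshold, and the function name `average_probabilities` are assumptions made for this example; the standard-library `csv` module is used instead of Pandas so the snippet runs without extra dependencies.

```python
import csv
import io
from collections import defaultdict


def average_probabilities(csv_texts, threshold=0.5):
    """Average the `proba` column of several prediction CSVs, keyed by `id`.

    Hypothetical helper: column names and threshold are assumptions,
    not taken from the Vilio scripts.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            sums[row["id"]] += float(row["proba"])
            counts[row["id"]] += 1

    ensembled = {}
    for meme_id, total in sums.items():
        mean = total / counts[meme_id]
        # Binary label derived from the averaged probability
        ensembled[meme_id] = {"proba": mean, "label": int(mean >= threshold)}
    return ensembled


# Two toy prediction files with made-up values
model_a = "id,proba\n1,0.9\n2,0.2\n"
model_b = "id,proba\n1,0.7\n2,0.4\n"
result = average_probabilities([model_a, model_b])
```

A plain mean is only one way to combine models; the scripts in /utils may use other schemes.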

## Usage

Follow SCORE_REPRO.md for reproducing performance on the Hateful Memes Task. 


Follow GETTING_STARTED.md for using the framework for your own task. 


See the paper at: https://arxiv.org/abs/2012.07788

## Architectures

🥶 Vilio currently provides the following architectures with the outlined language transformers:

1. **[E - ERNIE-VIL](https://arxiv.org/abs/2006.16934)** [ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph](https://arxiv.org/abs/2006.16934)

    - [ERNIE: Enhanced Language Representation with Informative Entities](https://arxiv.org/abs/1905.07129)

1. **[D - DeVLBERT](https://arxiv.org/abs/2008.06884)** [DeVLBert: Learning Deconfounded Visio-Linguistic Representations](https://arxiv.org/abs/2008.06884)

    - [BERT: Bidirectional Transformers](https://arxiv.org/abs/1810.04805)

1. **[O - OSCAR](https://arxiv.org/abs/2004.06165)** [Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks](https://arxiv.org/abs/2004.06165)

    - [BERT: Bidirectional Transformers](https://arxiv.org/abs/1810.04805)

1. **[U - UNITER](https://arxiv.org/abs/1909.11740)** [UNITER: UNiversal Image-TExt Representation Learning](https://arxiv.org/abs/1909.11740)

    - [BERT: Bidirectional Transformers](https://arxiv.org/abs/1810.04805)

    - [RoBERTa: Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692)

1. **[V - VisualBERT](https://arxiv.org/abs/1908.03557)** [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/abs/1908.03557)

    - [ALBERT: A Lite BERT](https://arxiv.org/abs/1909.11942)

    - [BERT: Bidirectional Transformers](https://arxiv.org/abs/1810.04805)

    - [RoBERTa: Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692)

1. **[X - LXMERT](https://arxiv.org/abs/1908.07490)** [LXMERT: Learning Cross-Modality Encoder Representations from Transformers](https://arxiv.org/abs/1908.07490)

    - [ALBERT: A Lite BERT](https://arxiv.org/abs/1909.11942)

    - [BERT: Bidirectional Transformers](https://arxiv.org/abs/1810.04805)

    - [RoBERTa: Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692)
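
For quick reference, the list above can be summarized as a lookup table from the one-letter code to the architecture and its listed language transformers. This is only an illustrative sketch built from the list; the names `ARCHITECTURES` and `architectures_supporting` are hypothetical and are not part of the Vilio code base.

```python
# One-letter code -> (architecture, language transformers listed in this README)
ARCHITECTURES = {
    "E": ("ERNIE-ViL", ["ERNIE"]),
    "D": ("DeVLBERT", ["BERT"]),
    "O": ("OSCAR", ["BERT"]),
    "U": ("UNITER", ["BERT", "RoBERTa"]),
    "V": ("VisualBERT", ["ALBERT", "BERT", "RoBERTa"]),
    "X": ("LXMERT", ["ALBERT", "BERT", "RoBERTa"]),
}


def architectures_supporting(language_model):
    """Return the codes of all architectures listed with the given language transformer."""
    return sorted(
        code
        for code, (_, language_models) in ARCHITECTURES.items()
        if language_model in language_models
    )


roberta_codes = architectures_supporting("RoBERTa")
albert_codes = architectures_supporting("ALBERT")
```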

## To-do's

- [ ] Clean up import statements, python paths & find a better way to integrate transformers (right now, import statements only work when run from the main folder)

- [ ] Enable loading and running models just via import statements (and not having to clone the repo)

- [ ] Find a way to better include ERNIE-VIL in this repo (PaddlePaddle to Torch?)

- [ ] Move tokenization in entry files to model-specific tokenization similar to transformers

## Attributions

The code heavily borrows from the following repositories, thanks for their great work:

- https://github.com/huggingface/transformers

- https://github.com/facebookresearch/mmf

- https://github.com/airsplay/lxmert

## Citation

```bibtex
@article{muennighoff2020vilio,
  title={Vilio: State-of-the-art visio-linguistic models applied to hateful memes},
  author={Muennighoff, Niklas},
  journal={arXiv preprint arXiv:2012.07788},
  year={2020}
}
```
