VLMs zero-to-hero

coming: january 2025...

# hello

Welcome to VLMs Zero to Hero! This series will take you on a journey from the
fundamentals of NLP and Computer Vision to the cutting edge of Vision-Language Models.

# tutorials

| **notebook** | **open in colab** | **video** | **paper** |
|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:---------:|:---------------------------------------:|
| 01.01. [Word2Vec: Distributed Representations of Words and Phrases and their Compositionality](https://github.com/SkalskiP/vlms-zero-to-hero/blob/master/01_natural_language_processing_fundamentals/01_01_word2vec_with_sub_sampling_and_negative_sampling_in_pytorch.ipynb) | [link](https://colab.research.google.com/github/SkalskiP/vlms-zero-to-hero/blob/master/01_natural_language_processing_fundamentals/01_01_word2vec_with_sub_sampling_and_negative_sampling_in_pytorch.ipynb) | soon | [link](https://arxiv.org/abs/1310.4546) |
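
For a quick taste of what the notebook covers, here is a minimal sketch of the skip-gram objective with negative sampling in PyTorch. Variable names and shapes are illustrative, not taken from the notebook.

```python
import torch
import torch.nn.functional as F

# Two embedding tables: one for center words, one for context words.
vocab_size, dim = 10_000, 100
in_embed = torch.nn.Embedding(vocab_size, dim)
out_embed = torch.nn.Embedding(vocab_size, dim)

def negative_sampling_loss(center, context, negatives):
    """center: (B,), context: (B,), negatives: (B, K) word indices."""
    v = in_embed(center)          # (B, D) center-word vectors
    u_pos = out_embed(context)    # (B, D) true context vectors
    u_neg = out_embed(negatives)  # (B, K, D) sampled negatives

    # Push the score of the true (center, context) pair up...
    pos = F.logsigmoid((v * u_pos).sum(-1))  # (B,)
    # ...and the scores of the K sampled negatives down.
    neg = F.logsigmoid(-torch.bmm(u_neg, v.unsqueeze(-1)).squeeze(-1)).sum(-1)  # (B,)
    return -(pos + neg).mean()
```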

# roadmap

## natural language processing (NLP) fundamentals

- Word2Vec: [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781) (2013) and [Distributed Representations of Words and Phrases and their Compositionality](https://arxiv.org/pdf/1310.4546) (2013)
- Seq2Seq: [Sequence to Sequence Learning with Neural Networks](https://arxiv.org/pdf/1409.3215) (2014)
- [Attention Is All You Need](https://arxiv.org/pdf/1706.03762) (2017); a minimal attention sketch follows this list
- BERT: [Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805) (2018)
- GPT: [Improving Language Understanding by Generative Pre-Training](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf) (2018)
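
The core operation introduced in "Attention Is All You Need" is scaled dot-product attention. A minimal single-head sketch in PyTorch, without masking or the multi-head projections:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """q, k, v: (batch, seq_len, d_k). Returns (batch, seq_len, d_k)."""
    d_k = q.size(-1)
    # Similarity of every query with every key, scaled to keep the softmax stable.
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (batch, seq, seq)
    weights = torch.softmax(scores, dim=-1)
    # Each output position is a weighted average of the value vectors.
    return weights @ v
```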

## computer vision (CV) fundamentals

- AlexNet: [ImageNet Classification with Deep Convolutional Neural Networks](https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf) (2012)
- VGG: [Very Deep Convolutional Networks for Large-Scale Image Recognition](https://arxiv.org/pdf/1409.1556) (2014)
- ResNet: [Deep Residual Learning for Image Recognition](https://arxiv.org/pdf/1512.03385) (2015)
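
The central idea of the ResNet paper is the residual connection: each block learns a correction F(x) and adds it back to its input, which keeps very deep networks trainable. A simplified basic block (stride 1 and a fixed channel count, for brevity):

```python
import torch.nn as nn

class BasicBlock(nn.Module):
    """Simplified ResNet basic block: out = relu(F(x) + x)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # The shortcut: adding x lets gradients flow unchanged through the block.
        return self.relu(out + x)
```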

## early vision-language models

- [Show and Tell: A Neural Image Caption Generator](https://arxiv.org/pdf/1411.4555) (2014) and [Show, Attend and Tell: Neural Image Caption Generation with Visual Attention](https://arxiv.org/pdf/1502.03044) (2015)
- ViT: [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/pdf/2010.11929) (2020)
- CLIP: [Learning Transferable Visual Models from Natural Language Supervision](https://arxiv.org/pdf/2103.00020) (2021)
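
CLIP trains its two encoders with a symmetric contrastive loss over a batch of (image, text) pairs: matching pairs sit on the diagonal of the similarity matrix and should outscore every mismatch. A minimal sketch, with the encoders abstracted away and `logit_scale` standing in for CLIP's learned temperature:

```python
import torch
import torch.nn.functional as F

def clip_loss(image_feats, text_feats, logit_scale):
    """image_feats, text_feats: (B, D) embeddings from the two encoders."""
    # L2-normalize so the dot product is a cosine similarity.
    img = F.normalize(image_feats, dim=-1)
    txt = F.normalize(text_feats, dim=-1)
    logits = logit_scale * img @ txt.t()  # (B, B) similarity matrix

    # The correct pairing is the diagonal: image i matches text i.
    targets = torch.arange(len(logits), device=logits.device)
    loss_i = F.cross_entropy(logits, targets)      # image -> text
    loss_t = F.cross_entropy(logits.t(), targets)  # text -> image
    return (loss_i + loss_t) / 2
```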

## scale and efficiency

- [Scaling Laws for Neural Language Models](https://arxiv.org/pdf/2001.08361) (2020)
- LoRA: [Low-Rank Adaptation of Large Language Models](https://arxiv.org/pdf/2106.09685) (2021); a minimal sketch follows this list
- QLoRA: [Efficient Fine-tuning of Quantized LLMs](https://arxiv.org/pdf/2305.14314) (2023)
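
LoRA keeps the pretrained weight matrix W frozen and learns a low-rank update BA, so the effective weight becomes W + (alpha/r)·BA. A minimal sketch wrapping a linear layer; following the paper, A starts random and B starts at zero so training begins from the pretrained behavior:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # BA = 0 at init
        self.scale = alpha / r

    def forward(self, x):
        # y = x W^T + scale * x (BA)^T -- only A and B receive gradients.
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
```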

## modern vision-language models

- Flamingo: [A Visual Language Model for Few-Shot Learning](https://arxiv.org/pdf/2204.14198) (2022)
- LLaVA: [Visual Instruction Tuning](https://arxiv.org/pdf/2304.08485) (2023); its vision-to-language projector is sketched after this list
- BLIP-2: [Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models](https://arxiv.org/pdf/2301.12597) (2023)
- PaliGemma: [A versatile 3B VLM for transfer](https://arxiv.org/pdf/2407.07726) (2024)
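
A recurring pattern in these models, clearest in LLaVA, is to project features from a frozen vision encoder into the language model's embedding space and feed them in as prefix tokens. A rough sketch of that connector; the dimensions and function names below are assumptions for illustration, not any model's actual API:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: the vision encoder emits 1024-d patch features,
# the language model expects 4096-d token embeddings.
vision_dim, llm_dim = 1024, 4096
projector = nn.Linear(vision_dim, llm_dim)

def build_multimodal_input(patch_feats, text_embeds):
    """patch_feats: (num_patches, vision_dim); text_embeds: (seq, llm_dim)."""
    # Map visual patches into the LLM's token space...
    vision_tokens = projector(patch_feats)  # (num_patches, llm_dim)
    # ...and hand them to the LLM as a prefix before the text tokens.
    return torch.cat([vision_tokens, text_embeds], dim=0)
```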

## extra

- BLEU: [a Method for Automatic Evaluation of Machine Translation](https://aclanthology.org/P02-1040.pdf) (2002)
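
BLEU scores a candidate sentence by modified n-gram precision against a reference, multiplied by a brevity penalty for short outputs. A toy, unsmoothed sentence-level sketch (real evaluations aggregate counts over a whole corpus):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """candidate, reference: lists of tokens. Sentence-level, unsmoothed."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        # "Modified" precision: clip each n-gram count by its reference count.
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # The brevity penalty punishes candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / max(len(candidate), 1)))
    return bp * math.exp(log_avg)
```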

# contribute and suggest more papers

Are there important papers, models, or techniques we missed? Do you have a favorite
breakthrough in vision-language research that isn't listed here? We’d love to hear
your suggestions!