VLMs zero-to-hero

coming: january 2025...

# hello

Welcome to VLMs Zero to Hero! This series will take you on a journey from the
fundamentals of NLP and Computer Vision to the cutting edge of Vision-Language Models.

# tutorials

| **notebook** | **open in colab** | **video** | **paper** |
|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:---------:|:---------------------------------------:|
| 01.01. [Word2Vec: Distributed Representations of Words and Phrases and their Compositionality](https://github.com/SkalskiP/vlms-zero-to-hero/blob/master/01_natural_language_processing_fundamentals/01_01_word2vec_with_sub_sampling_and_negative_sampling_in_pytorch.ipynb) | [link](https://colab.research.google.com/github/SkalskiP/vlms-zero-to-hero/blob/master/01_natural_language_processing_fundamentals/01_01_word2vec_with_sub_sampling_and_negative_sampling_in_pytorch.ipynb) | soon | [link](https://arxiv.org/abs/1310.4546) |
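
For a quick taste of what the notebook covers, here is a minimal sketch of the skip-gram objective with negative sampling in PyTorch. Variable names and shapes are illustrative, not taken from the notebook.

```python
import torch
import torch.nn.functional as F

# Two embedding tables: one for center words, one for context words.
vocab_size, dim = 10_000, 100
in_embed = torch.nn.Embedding(vocab_size, dim)
out_embed = torch.nn.Embedding(vocab_size, dim)

def negative_sampling_loss(center, context, negatives):
    """center: (B,), context: (B,), negatives: (B, K) word indices."""
    v = in_embed(center)          # (B, D) center-word vectors
    u_pos = out_embed(context)    # (B, D) true context vectors
    u_neg = out_embed(negatives)  # (B, K, D) sampled negatives

    # Push the score of the true (center, context) pair up...
    pos = F.logsigmoid((v * u_pos).sum(-1))  # (B,)
    # ...and the scores of the K sampled negatives down.
    neg = F.logsigmoid(-torch.bmm(u_neg, v.unsqueeze(-1)).squeeze(-1)).sum(-1)  # (B,)
    return -(pos + neg).mean()
```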

# roadmap

## natural language processing (NLP) fundamentals

- Word2Vec: [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781) (2013) and [Distributed Representations of Words and Phrases and their Compositionality](https://arxiv.org/pdf/1310.4546) (2013)
- Seq2Seq: [Sequence to Sequence Learning with Neural Networks](https://arxiv.org/pdf/1409.3215) (2014)
- [Attention Is All You Need](https://arxiv.org/pdf/1706.03762) (2017); a minimal attention sketch follows this list
- BERT: [Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805) (2018)
- GPT: [Improving Language Understanding by Generative Pre-Training](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf) (2018)
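
The core operation introduced in "Attention Is All You Need" is scaled dot-product attention. A minimal single-head sketch in PyTorch, without masking or the multi-head projections:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """q, k, v: (batch, seq_len, d_k). Returns (batch, seq_len, d_k)."""
    d_k = q.size(-1)
    # Similarity of every query with every key, scaled to keep the softmax stable.
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (batch, seq, seq)
    weights = torch.softmax(scores, dim=-1)
    # Each output position is a weighted average of the value vectors.
    return weights @ v
```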

## computer vision (CV) fundamentals

- AlexNet: [ImageNet Classification with Deep Convolutional Neural Networks](https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf) (2012)
- VGG: [Very Deep Convolutional Networks for Large-Scale Image Recognition](https://arxiv.org/pdf/1409.1556) (2014)
- ResNet: [Deep Residual Learning for Image Recognition](https://arxiv.org/pdf/1512.03385) (2015)
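
The central idea of the ResNet paper is the residual connection: each block learns a correction F(x) and adds it back to its input, which keeps very deep networks trainable. A simplified basic block (stride 1 and a fixed channel count, for brevity):

```python
import torch.nn as nn

class BasicBlock(nn.Module):
    """Simplified ResNet basic block: out = relu(F(x) + x)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # The shortcut: adding x lets gradients flow unchanged through the block.
        return self.relu(out + x)
```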

## early vision-language models

- [Show and Tell: A Neural Image Caption Generator](https://arxiv.org/pdf/1411.4555) (2014) and [Show, Attend and Tell: Neural Image Caption Generation with Visual Attention](https://arxiv.org/pdf/1502.03044) (2015)
- ViT: [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/pdf/2010.11929) (2020)
- CLIP: [Learning Transferable Visual Models from Natural Language Supervision](https://arxiv.org/pdf/2103.00020) (2021)
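
CLIP trains its two encoders with a symmetric contrastive loss over a batch of (image, text) pairs: matching pairs sit on the diagonal of the similarity matrix and should outscore every mismatch. A minimal sketch, with the encoders abstracted away and `logit_scale` standing in for CLIP's learned temperature:

```python
import torch
import torch.nn.functional as F

def clip_loss(image_feats, text_feats, logit_scale):
    """image_feats, text_feats: (B, D) embeddings from the two encoders."""
    # L2-normalize so the dot product is a cosine similarity.
    img = F.normalize(image_feats, dim=-1)
    txt = F.normalize(text_feats, dim=-1)
    logits = logit_scale * img @ txt.t()  # (B, B) similarity matrix

    # The correct pairing is the diagonal: image i matches text i.
    targets = torch.arange(len(logits), device=logits.device)
    loss_i = F.cross_entropy(logits, targets)      # image -> text
    loss_t = F.cross_entropy(logits.t(), targets)  # text -> image
    return (loss_i + loss_t) / 2
```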

## scale and efficiency

- [Scaling Laws for Neural Language Models](https://arxiv.org/pdf/2001.08361) (2020)
- LoRA: [Low-Rank Adaptation of Large Language Models](https://arxiv.org/pdf/2106.09685) (2021); a minimal sketch follows this list
- QLoRA: [Efficient Fine-tuning of Quantized LLMs](https://arxiv.org/pdf/2305.14314) (2023)
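
LoRA keeps the pretrained weight matrix W frozen and learns a low-rank update BA, so the effective weight becomes W + (alpha/r)·BA. A minimal sketch wrapping a linear layer; following the paper, A starts random and B starts at zero so training begins from the pretrained behavior:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # BA = 0 at init
        self.scale = alpha / r

    def forward(self, x):
        # y = x W^T + scale * x (BA)^T -- only A and B receive gradients.
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
```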

## modern vision-language models

- Flamingo: [A Visual Language Model for Few-Shot Learning](https://arxiv.org/pdf/2204.14198) (2022)
- LLaVA: [Visual Instruction Tuning](https://arxiv.org/pdf/2304.08485) (2023); its vision-to-language projector is sketched after this list
- BLIP-2: [Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models](https://arxiv.org/pdf/2301.12597) (2023)
- PaliGemma: [A versatile 3B VLM for transfer](https://arxiv.org/pdf/2407.07726) (2024)
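
A recurring pattern in these models, clearest in LLaVA, is to project features from a frozen vision encoder into the language model's embedding space and feed them in as prefix tokens. A rough sketch of that connector; the dimensions and function names below are assumptions for illustration, not any model's actual API:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: the vision encoder emits 1024-d patch features,
# the language model expects 4096-d token embeddings.
vision_dim, llm_dim = 1024, 4096
projector = nn.Linear(vision_dim, llm_dim)

def build_multimodal_input(patch_feats, text_embeds):
    """patch_feats: (num_patches, vision_dim); text_embeds: (seq, llm_dim)."""
    # Map visual patches into the LLM's token space...
    vision_tokens = projector(patch_feats)  # (num_patches, llm_dim)
    # ...and hand them to the LLM as a prefix before the text tokens.
    return torch.cat([vision_tokens, text_embeds], dim=0)
```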

## extra

- BLEU: [a Method for Automatic Evaluation of Machine Translation](https://aclanthology.org/P02-1040.pdf) (2002)
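
BLEU scores a candidate sentence by modified n-gram precision against a reference, multiplied by a brevity penalty for short outputs. A toy, unsmoothed sentence-level sketch (real evaluations aggregate counts over a whole corpus):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """candidate, reference: lists of tokens. Sentence-level, unsmoothed."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        # "Modified" precision: clip each n-gram count by its reference count.
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # The brevity penalty punishes candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / max(len(candidate), 1)))
    return bp * math.exp(log_avg)
```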

# contribute and suggest more papers

Are there important papers, models, or techniques we missed? Do you have a favorite
breakthrough in vision-language research that isn't listed here? We’d love to hear
your suggestions!