Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
awesome-foundation-and-multimodal-models
👁️ + 💬 + 🎧 = 🤖 Curated list of top foundation and multimodal models! [Paper + Code + Examples + Tutorials]
https://github.com/SkalskiP/awesome-foundation-and-multimodal-models
Last synced: 4 days ago
🤖 models
AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining
- [![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/haoheliu/AudioLDM_48K_Text-to-HiFiAudio_Generation) [![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/haoheliu/audioldm2-text2audio-text2music)
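A minimal text-to-audio sketch using the `AudioLDM2Pipeline` from Hugging Face `diffusers`; the `cvssp/audioldm2` checkpoint name is an assumption, not one of the links above:

```python
# Minimal text-to-audio sketch with diffusers' AudioLDM2Pipeline.
# The "cvssp/audioldm2" checkpoint is an assumption; swap in another if needed.
import scipy.io.wavfile
import torch
from diffusers import AudioLDM2Pipeline

pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "gentle rain falling on a tin roof"
audio = pipe(prompt, num_inference_steps=200, audio_length_in_s=10.0).audios[0]

# AudioLDM 2 generates 16 kHz mono audio.
scipy.io.wavfile.write("rain.wav", rate=16000, data=audio)
```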
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
- [![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/merve/BLIP2-with-transformers) [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/Salesforce/blip2-opt-6.7b) [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/salesforce/LAVIS/blob/main/examples/blip2_instructed_generation.ipynb)
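An image-captioning sketch with BLIP-2 via `transformers`; it uses the smaller `Salesforce/blip2-opt-2.7b` checkpoint (an assumption; the list links the 6.7b variant), and the COCO image URL is just a placeholder:

```python
# Image captioning with BLIP-2 via Hugging Face transformers.
import requests
import torch
from PIL import Image
from transformers import Blip2ForConditionalGeneration, Blip2Processor

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# No text prompt: the model free-runs a caption for the image.
inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(out, skip_special_tokens=True)[0].strip())
```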
OWL-ViT: Simple Open-Vocabulary Object Detection with Vision Transformers
- [![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/google-research/scenic/tree/main/scenic/projects/owl_vit) [![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/adirik/OWL-ViT) [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/google/owlvit-base-patch32)
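An open-vocabulary detection sketch via `transformers`, using the `google/owlvit-base-patch32` checkpoint linked above (the image URL and query phrases are placeholders):

```python
# Zero-shot object detection with OWL-ViT: detect arbitrary text-described classes.
import requests
import torch
from PIL import Image
from transformers import OwlViTForObjectDetection, OwlViTProcessor

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = [["a photo of a cat", "a photo of a remote control"]]

inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert logits to boxes in original-image coordinates, keeping confident hits.
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(outputs, threshold=0.1, target_sizes=target_sizes)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(texts[0][label], round(score.item(), 2), box.tolist())
```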
Depth Anything
- [![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/LiheYoung/Depth-Anything) [![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/LiheYoung/Depth-Anything) [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/LiheYoung/depth_anything_vitl14) [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/Depth%20Anything/Predicting_depth_in_an_image_with_Depth_Anything.ipynb)
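A monocular depth-estimation sketch via the `transformers` pipeline; the `LiheYoung/depth-anything-small-hf` checkpoint name is an assumption (a transformers-format port), while the list links the original `depth_anything_vitl14` weights:

```python
# Monocular depth estimation with a Depth Anything checkpoint.
import requests
from PIL import Image
from transformers import pipeline

depth = pipeline("depth-estimation", model="LiheYoung/depth-anything-small-hf")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

result = depth(image)
result["depth"].save("depth.png")       # PIL depth map, nearer = brighter
print(result["predicted_depth"].shape)  # raw torch tensor
```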
EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything
- [![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/SkalskiP/EfficientSAM) [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/merve/EfficientSAM)
ImageBind: One Embedding Space To Bind Them All
- [![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/JustinLin610/ImageBind_zeroshot_demo)
LLaVA: Large Language and Vision Assistant
- [![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/haotian-liu/LLaVA) [![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/badayvedat/LLaVA) [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/liuhaotian/llava-v1.6-34b)
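A visual question answering sketch with a LLaVA 1.5 checkpoint via `transformers`; the `llava-hf/llava-1.5-7b-hf` repo is an assumption (a transformers port), while the list links the original `liuhaotian/llava-v1.6-34b` weights:

```python
# Visual question answering with LLaVA via Hugging Face transformers.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# LLaVA 1.5 expects the USER/ASSISTANT chat format with an <image> token.
prompt = "USER: <image>\nWhat animals are in this picture? ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(out[0], skip_special_tokens=True))
```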
CLIP: Learning Transferable Visual Models From Natural Language Supervision
- [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/openai/clip-vit-large-patch14) [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/how-to-use-openai-clip-classification.ipynb)
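A zero-shot image classification sketch with the `openai/clip-vit-large-patch14` checkpoint linked above, via `transformers` (image URL and labels are placeholders):

```python
# Zero-shot classification with CLIP: rank text labels against an image.
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
labels = ["a photo of two cats", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-text similarity scores

probs = logits.softmax(dim=1)[0]  # normalize scores across the candidate labels
for label, p in zip(labels, probs):
    print(f"{label}: {p.item():.3f}")
```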
CogVLM: Visual Expert for Pretrained Language Models
- [![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/lykeven/CogVLM) [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/THUDM/CogVLM)
Fuyu-8B: A Multimodal Architecture for AI Agents
- [![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/adept/fuyu-8b-demo) [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/adept/fuyu-8b)
Ferret: Refer and Ground Anything Anywhere at Any Granularity
- [![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/apple/ml-ferret)
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
- [![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/openflamingo/OpenFlamingo) [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/openflamingo/OpenFlamingo-9B-vitl-mpt7b)
Segment Anything
- [![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/facebookresearch/segment-anything) [![YouTube](https://badges.aleen42.com/src/youtube.svg)](https://youtu.be/D-D6ZmadzPE) [![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/radames/candle-segment-anything-wasm) [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/facebook/sam-vit-base) [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/how-to-segment-anything-with-sam.ipynb)
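A point-prompted segmentation sketch with the `facebook/sam-vit-base` checkpoint linked above, via `transformers` (the image URL and prompt coordinates are placeholders):

```python
# Point-prompted segmentation with SAM via Hugging Face transformers.
import requests
import torch
from PIL import Image
from transformers import SamModel, SamProcessor

processor = SamProcessor.from_pretrained("facebook/sam-vit-base")
model = SamModel.from_pretrained("facebook/sam-vit-base")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
input_points = [[[320, 240]]]  # one (x, y) prompt point on the object of interest

inputs = processor(image, input_points=input_points, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Upscale the predicted low-res masks back to the original image size.
masks = processor.image_processor.post_process_masks(
    outputs.pred_masks, inputs["original_sizes"], inputs["reshaped_input_sizes"]
)
print(masks[0].shape, outputs.iou_scores)  # 3 candidate masks + quality scores
```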
Whisper: Robust Speech Recognition via Large-Scale Weak Supervision
- [![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/openai/whisper) [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/openai/whisper-large-v3) [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/openai/whisper/blob/master/notebooks/LibriSpeech.ipynb)
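A speech-to-text sketch with the `openai/whisper-large-v3` checkpoint linked above, via the `transformers` pipeline (the audio path is a placeholder):

```python
# Speech recognition with Whisper via the transformers ASR pipeline.
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device=0,  # first GPU; use device=-1 for CPU
)

# Accepts a path/URL to an audio file or a raw numpy array.
result = asr("speech.wav", return_timestamps=True)
print(result["text"])
```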
SigLIP: Sigmoid Loss for Language Image Pre-Training
- [![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/google-research/big_vision) [![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/merve/compare_clip_siglip) [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/openai/clip-vit-base-patch16) [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/SigLIP/Inference_with_(multilingual)_SigLIP%2C_a_better_CLIP_model.ipynb)
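A zero-shot classification sketch with SigLIP via `transformers`; the `google/siglip-base-patch16-224` checkpoint name is an assumption. Unlike CLIP's softmax over labels, SigLIP scores each image-text pair independently with a sigmoid, so probabilities need not sum to 1:

```python
# Zero-shot classification with SigLIP: per-label sigmoid scores.
import requests
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained("google/siglip-base-patch16-224")
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
labels = ["two cats", "a dog", "a plane"]

# SigLIP was trained with fixed-length text, hence padding="max_length".
inputs = processor(text=labels, images=image, padding="max_length", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits_per_image

probs = torch.sigmoid(logits)[0]  # independent per-label probabilities
for label, p in zip(labels, probs):
    print(f"{label}: {p.item():.3f}")
```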
YOLO-World: Real-Time Open-Vocabulary Object Detection
- [![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/AILab-CVC/YOLO-World) [![YouTube](https://badges.aleen42.com/src/youtube.svg)](https://youtu.be/X7gKBGVz4vs) [![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/SkalskiP/YOLO-World) [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/zero-shot-object-detection-with-yolo-world.ipynb)
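An open-vocabulary detection sketch via the `ultralytics` package (a separate implementation from the AILab-CVC repository linked above); the `yolov8s-world.pt` weights name and the image path are assumptions:

```python
# Open-vocabulary detection with YOLO-World via the ultralytics package.
from ultralytics import YOLOWorld

model = YOLOWorld("yolov8s-world.pt")

# Define the detection vocabulary at inference time; no retraining needed.
model.set_classes(["person", "backpack", "bicycle"])

results = model.predict("street.jpg", conf=0.25)
results[0].show()  # or inspect results[0].boxes for raw xyxy boxes and scores
```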
MetaCLIP: Demystifying CLIP Data
- [![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/SkalskiP/SAM_and_MetaCLIP) [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/facebook/metaclip-b32-400m) [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1V0Rv1QQJkcolTjiwJuRsqWycROvYjOwg?usp=sharing)
Nougat: Neural Optical Understanding for Academic Documents
- [![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/hf-vision/nougat-transformers) [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/facebook/nougat-small) [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/Nougat/Inference_with_Nougat_to_read_scientific_PDFs.ipynb)
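A document-OCR sketch with the `facebook/nougat-small` checkpoint linked above, via `transformers`; the input is assumed to be a PIL image of a rendered PDF page (the file path is hypothetical):

```python
# Academic-PDF OCR with Nougat: one rendered page image -> markup text.
from PIL import Image
from transformers import NougatProcessor, VisionEncoderDecoderModel

processor = NougatProcessor.from_pretrained("facebook/nougat-small")
model = VisionEncoderDecoderModel.from_pretrained("facebook/nougat-small")

page_image = Image.open("page_0.png")  # a rendered PDF page (hypothetical path)

pixel_values = processor(page_image, return_tensors="pt").pixel_values
outputs = model.generate(pixel_values, max_new_tokens=1024)

sequence = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(processor.post_process_generation(sequence, fix_markdown=False))
```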
Kosmos-2: Grounding Multimodal Large Language Models to the World
- [![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/microsoft/unilm/tree/master/kosmos-2) [![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/ydshieh/Kosmos-2) [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/microsoft/kosmos-2-patch14-224)
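A grounded-description sketch with the `microsoft/kosmos-2-patch14-224` checkpoint linked above, via `transformers` (the image URL is a placeholder):

```python
# Grounded image description with Kosmos-2: caption plus entity bounding boxes.
import requests
from PIL import Image
from transformers import AutoProcessor, Kosmos2ForConditionalGeneration

processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")
model = Kosmos2ForConditionalGeneration.from_pretrained("microsoft/kosmos-2-patch14-224")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The <grounding> tag asks the model to localize the entities it mentions.
prompt = "<grounding>An image of"
inputs = processor(text=prompt, images=image, return_tensors="pt")

generated_ids = model.generate(**inputs, max_new_tokens=64)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Split the raw output into clean text plus (entity, char_span, boxes) tuples.
caption, entities = processor.post_process_generation(generated_text)
print(caption)
print(entities)
```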
OWLv2: Scaling Open-Vocabulary Object Detection
- [![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/google-research/scenic/tree/main/scenic/projects/owl_vit) [![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/merve/owlv2) [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/google/owlv2-base-patch16-ensemble) [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/OWLv2/Zero_and_one_shot_object_detection_with_OWLv2.ipynb)
Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
- [![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/IDEA-Research/GroundingDINO) [![YouTube](https://badges.aleen42.com/src/youtube.svg)](https://youtu.be/cMa77r3YrDk) [![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/ShilongLiu/Grounding_DINO_demo) [![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/merve/Grounding_DINO_demo) [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/zero-shot-object-detection-with-grounding-dino.ipynb)
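A text-prompted detection sketch with Grounding DINO via `transformers`; the `IDEA-Research/grounding-dino-tiny` checkpoint name is an assumption (the image URL and phrases are placeholders):

```python
# Zero-shot, text-prompted detection with Grounding DINO.
import requests
import torch
from PIL import Image
from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor

model_id = "IDEA-Research/grounding-dino-tiny"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
text = "a cat. a remote control."  # lowercase phrases, each ending in a period

inputs = processor(images=image, text=text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Map logits back to phrases and boxes in original-image coordinates.
results = processor.post_process_grounded_object_detection(
    outputs, inputs.input_ids, box_threshold=0.35, text_threshold=0.25,
    target_sizes=[image.size[::-1]],
)
print(results[0]["labels"], results[0]["boxes"])
```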
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
- [![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/QwenLM/Qwen-VL) [![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/Qwen/Qwen-VL-Max) [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/Qwen/Qwen-VL)
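A chat-style visual question answering sketch with Qwen-VL-Chat; the model ships its own modeling code, hence `trust_remote_code=True`. The `Qwen/Qwen-VL-Chat` repo name and the image path are assumptions (the list links the base `Qwen/Qwen-VL`):

```python
# Chat-style VQA with Qwen-VL-Chat (model-provided remote code).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", device_map="cuda", trust_remote_code=True
).eval()

# The bundled helper interleaves image references and text into one prompt.
query = tokenizer.from_list_format([
    {"image": "street.jpg"},  # hypothetical local image path
    {"text": "Where is the bus in this picture?"},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```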