Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
awesome-foundation-and-multimodal-models
👁️ + 💬 + 🎧 = 🤖 Curated list of top foundation and multimodal models! [Paper + Code + Examples + Tutorials]
https://github.com/SkalskiP/awesome-foundation-and-multimodal-models
Last synced: 4 days ago
🤖 models
AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining
- [![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/haoheliu/AudioLDM_48K_Text-to-HiFiAudio_Generation) [![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/haoheliu/audioldm2-text2audio-text2music)
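A minimal text-to-audio sketch using the `AudioLDM2Pipeline` from Hugging Face `diffusers`; the `cvssp/audioldm2` checkpoint name is an assumption, not one of the links above:

```python
# Minimal text-to-audio sketch with diffusers' AudioLDM2Pipeline.
# The "cvssp/audioldm2" checkpoint is an assumption; swap in another if needed.
import scipy.io.wavfile
import torch
from diffusers import AudioLDM2Pipeline

pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "gentle rain falling on a tin roof"
audio = pipe(prompt, num_inference_steps=200, audio_length_in_s=10.0).audios[0]

# AudioLDM 2 generates 16 kHz mono audio.
scipy.io.wavfile.write("rain.wav", rate=16000, data=audio)
```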
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
- [![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/merve/BLIP2-with-transformers) [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/Salesforce/blip2-opt-6.7b) [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/salesforce/LAVIS/blob/main/examples/blip2_instructed_generation.ipynb)
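An image-captioning sketch with BLIP-2 via `transformers`; it uses the smaller `Salesforce/blip2-opt-2.7b` checkpoint (an assumption; the list links the 6.7b variant), and the COCO image URL is just a placeholder:

```python
# Image captioning with BLIP-2 via Hugging Face transformers.
import requests
import torch
from PIL import Image
from transformers import Blip2ForConditionalGeneration, Blip2Processor

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# No text prompt: the model free-runs a caption for the image.
inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(out, skip_special_tokens=True)[0].strip())
```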
OWL-ViT: Simple Open-Vocabulary Object Detection with Vision Transformers
- [![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/google-research/scenic/tree/main/scenic/projects/owl_vit) [![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/adirik/OWL-ViT) [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/google/owlvit-base-patch32)
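An open-vocabulary detection sketch via `transformers`, using the `google/owlvit-base-patch32` checkpoint linked above (the image URL and query phrases are placeholders):

```python
# Zero-shot object detection with OWL-ViT: detect arbitrary text-described classes.
import requests
import torch
from PIL import Image
from transformers import OwlViTForObjectDetection, OwlViTProcessor

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = [["a photo of a cat", "a photo of a remote control"]]

inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert logits to boxes in original-image coordinates, keeping confident hits.
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(outputs, threshold=0.1, target_sizes=target_sizes)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(texts[0][label], round(score.item(), 2), box.tolist())
```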
Depth Anything
- [![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/LiheYoung/Depth-Anything) [![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/LiheYoung/Depth-Anything) [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/LiheYoung/depth_anything_vitl14) [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/Depth%20Anything/Predicting_depth_in_an_image_with_Depth_Anything.ipynb)
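A monocular depth-estimation sketch via the `transformers` pipeline; the `LiheYoung/depth-anything-small-hf` checkpoint name is an assumption (a transformers-format port), while the list links the original `depth_anything_vitl14` weights:

```python
# Monocular depth estimation with a Depth Anything checkpoint.
import requests
from PIL import Image
from transformers import pipeline

depth = pipeline("depth-estimation", model="LiheYoung/depth-anything-small-hf")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

result = depth(image)
result["depth"].save("depth.png")       # PIL depth map, nearer = brighter
print(result["predicted_depth"].shape)  # raw torch tensor
```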
EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything
- [![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/SkalskiP/EfficientSAM) [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/merve/EfficientSAM)
ImageBind: One Embedding Space To Bind Them All
- [![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/JustinLin610/ImageBind_zeroshot_demo)
LLaVA: Large Language and Vision Assistant
- [![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/haotian-liu/LLaVA) [![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/badayvedat/LLaVA) [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/liuhaotian/llava-v1.6-34b)
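A visual question answering sketch with a LLaVA 1.5 checkpoint via `transformers`; the `llava-hf/llava-1.5-7b-hf` repo is an assumption (a transformers port), while the list links the original `liuhaotian/llava-v1.6-34b` weights:

```python
# Visual question answering with LLaVA via Hugging Face transformers.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# LLaVA 1.5 expects the USER/ASSISTANT chat format with an <image> token.
prompt = "USER: <image>\nWhat animals are in this picture? ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(out[0], skip_special_tokens=True))
```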
CLIP: Learning Transferable Visual Models From Natural Language Supervision
- [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/openai/clip-vit-large-patch14) [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/how-to-use-openai-clip-classification.ipynb)
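A zero-shot image classification sketch with the `openai/clip-vit-large-patch14` checkpoint linked above, via `transformers` (image URL and labels are placeholders):

```python
# Zero-shot classification with CLIP: rank text labels against an image.
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
labels = ["a photo of two cats", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-text similarity scores

probs = logits.softmax(dim=1)[0]  # normalize scores across the candidate labels
for label, p in zip(labels, probs):
    print(f"{label}: {p.item():.3f}")
```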
CogVLM: Visual Expert for Pretrained Language Models
- [![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/lykeven/CogVLM) [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/THUDM/CogVLM)
Fuyu-8B: A Multimodal Architecture for AI Agents
- [![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/adept/fuyu-8b-demo) [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/adept/fuyu-8b)
Ferret: Refer and Ground Anything Anywhere at Any Granularity
- [![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/apple/ml-ferret)
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
- [![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/openflamingo/OpenFlamingo) [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/openflamingo/OpenFlamingo-9B-vitl-mpt7b)
Segment Anything
- [![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/facebookresearch/segment-anything) [![YouTube](https://badges.aleen42.com/src/youtube.svg)](https://youtu.be/D-D6ZmadzPE) [![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/radames/candle-segment-anything-wasm) [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/facebook/sam-vit-base) [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/how-to-segment-anything-with-sam.ipynb)
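A point-prompted segmentation sketch with the `facebook/sam-vit-base` checkpoint linked above, via `transformers` (the image URL and prompt coordinates are placeholders):

```python
# Point-prompted segmentation with SAM via Hugging Face transformers.
import requests
import torch
from PIL import Image
from transformers import SamModel, SamProcessor

processor = SamProcessor.from_pretrained("facebook/sam-vit-base")
model = SamModel.from_pretrained("facebook/sam-vit-base")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
input_points = [[[320, 240]]]  # one (x, y) prompt point on the object of interest

inputs = processor(image, input_points=input_points, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Upscale the predicted low-res masks back to the original image size.
masks = processor.image_processor.post_process_masks(
    outputs.pred_masks, inputs["original_sizes"], inputs["reshaped_input_sizes"]
)
print(masks[0].shape, outputs.iou_scores)  # 3 candidate masks + quality scores
```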
Whisper: Robust Speech Recognition via Large-Scale Weak Supervision
- [![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/openai/whisper) [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/openai/whisper-large-v3) [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/openai/whisper/blob/master/notebooks/LibriSpeech.ipynb)
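A speech-to-text sketch with the `openai/whisper-large-v3` checkpoint linked above, via the `transformers` pipeline (the audio path is a placeholder):

```python
# Speech recognition with Whisper via the transformers ASR pipeline.
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device=0,  # first GPU; use device=-1 for CPU
)

# Accepts a path/URL to an audio file or a raw numpy array.
result = asr("speech.wav", return_timestamps=True)
print(result["text"])
```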
SigLIP: Sigmoid Loss for Language Image Pre-Training
- [![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/google-research/big_vision) [![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/merve/compare_clip_siglip) [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/openai/clip-vit-base-patch16) [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/SigLIP/Inference_with_(multilingual)_SigLIP%2C_a_better_CLIP_model.ipynb)
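A zero-shot classification sketch with SigLIP via `transformers`; the `google/siglip-base-patch16-224` checkpoint name is an assumption. Unlike CLIP's softmax over labels, SigLIP scores each image-text pair independently with a sigmoid, so probabilities need not sum to 1:

```python
# Zero-shot classification with SigLIP: per-label sigmoid scores.
import requests
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained("google/siglip-base-patch16-224")
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
labels = ["two cats", "a dog", "a plane"]

# SigLIP was trained with fixed-length text, hence padding="max_length".
inputs = processor(text=labels, images=image, padding="max_length", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits_per_image

probs = torch.sigmoid(logits)[0]  # independent per-label probabilities
for label, p in zip(labels, probs):
    print(f"{label}: {p.item():.3f}")
```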
YOLO-World: Real-Time Open-Vocabulary Object Detection
- [![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/AILab-CVC/YOLO-World) [![YouTube](https://badges.aleen42.com/src/youtube.svg)](https://youtu.be/X7gKBGVz4vs) [![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/SkalskiP/YOLO-World) [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/zero-shot-object-detection-with-yolo-world.ipynb)
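An open-vocabulary detection sketch via the `ultralytics` package (a separate implementation from the AILab-CVC repository linked above); the `yolov8s-world.pt` weights name and the image path are assumptions:

```python
# Open-vocabulary detection with YOLO-World via the ultralytics package.
from ultralytics import YOLOWorld

model = YOLOWorld("yolov8s-world.pt")

# Define the detection vocabulary at inference time; no retraining needed.
model.set_classes(["person", "backpack", "bicycle"])

results = model.predict("street.jpg", conf=0.25)
results[0].show()  # or inspect results[0].boxes for raw xyxy boxes and scores
```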
MetaCLIP: Demystifying CLIP Data
- [![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/SkalskiP/SAM_and_MetaCLIP) [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/facebook/metaclip-b32-400m) [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1V0Rv1QQJkcolTjiwJuRsqWycROvYjOwg?usp=sharing)
Nougat: Neural Optical Understanding for Academic Documents
- [![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/hf-vision/nougat-transformers) [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/facebook/nougat-small) [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/Nougat/Inference_with_Nougat_to_read_scientific_PDFs.ipynb)
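A document-OCR sketch with the `facebook/nougat-small` checkpoint linked above, via `transformers`; the input is assumed to be a PIL image of a rendered PDF page (the file path is hypothetical):

```python
# Academic-PDF OCR with Nougat: one rendered page image -> markup text.
from PIL import Image
from transformers import NougatProcessor, VisionEncoderDecoderModel

processor = NougatProcessor.from_pretrained("facebook/nougat-small")
model = VisionEncoderDecoderModel.from_pretrained("facebook/nougat-small")

page_image = Image.open("page_0.png")  # a rendered PDF page (hypothetical path)

pixel_values = processor(page_image, return_tensors="pt").pixel_values
outputs = model.generate(pixel_values, max_new_tokens=1024)

sequence = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(processor.post_process_generation(sequence, fix_markdown=False))
```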
Kosmos-2: Grounding Multimodal Large Language Models to the World
- [![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/microsoft/unilm/tree/master/kosmos-2) [![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/ydshieh/Kosmos-2) [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/microsoft/kosmos-2-patch14-224)
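A grounded-description sketch with the `microsoft/kosmos-2-patch14-224` checkpoint linked above, via `transformers` (the image URL is a placeholder):

```python
# Grounded image description with Kosmos-2: caption plus entity bounding boxes.
import requests
from PIL import Image
from transformers import AutoProcessor, Kosmos2ForConditionalGeneration

processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")
model = Kosmos2ForConditionalGeneration.from_pretrained("microsoft/kosmos-2-patch14-224")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The <grounding> tag asks the model to localize the entities it mentions.
prompt = "<grounding>An image of"
inputs = processor(text=prompt, images=image, return_tensors="pt")

generated_ids = model.generate(**inputs, max_new_tokens=64)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Split the raw output into clean text plus (entity, char_span, boxes) tuples.
caption, entities = processor.post_process_generation(generated_text)
print(caption)
print(entities)
```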
OWLv2: Scaling Open-Vocabulary Object Detection
- [![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/google-research/scenic/tree/main/scenic/projects/owl_vit) [![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/merve/owlv2) [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/google/owlv2-base-patch16-ensemble) [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/OWLv2/Zero_and_one_shot_object_detection_with_OWLv2.ipynb)
Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
- [![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/IDEA-Research/GroundingDINO) [![YouTube](https://badges.aleen42.com/src/youtube.svg)](https://youtu.be/cMa77r3YrDk) [![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/ShilongLiu/Grounding_DINO_demo) [![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/merve/Grounding_DINO_demo) [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/zero-shot-object-detection-with-grounding-dino.ipynb)
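A text-prompted detection sketch with Grounding DINO via `transformers`; the `IDEA-Research/grounding-dino-tiny` checkpoint name is an assumption (the image URL and phrases are placeholders):

```python
# Zero-shot, text-prompted detection with Grounding DINO.
import requests
import torch
from PIL import Image
from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor

model_id = "IDEA-Research/grounding-dino-tiny"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
text = "a cat. a remote control."  # lowercase phrases, each ending in a period

inputs = processor(images=image, text=text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Map logits back to phrases and boxes in original-image coordinates.
results = processor.post_process_grounded_object_detection(
    outputs, inputs.input_ids, box_threshold=0.35, text_threshold=0.25,
    target_sizes=[image.size[::-1]],
)
print(results[0]["labels"], results[0]["boxes"])
```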
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
- [![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/QwenLM/Qwen-VL) [![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/Qwen/Qwen-VL-Max) [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/Qwen/Qwen-VL)
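A chat-style visual question answering sketch with Qwen-VL-Chat; the model ships its own modeling code, hence `trust_remote_code=True`. The `Qwen/Qwen-VL-Chat` repo name and the image path are assumptions (the list links the base `Qwen/Qwen-VL`):

```python
# Chat-style VQA with Qwen-VL-Chat (model-provided remote code).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", device_map="cuda", trust_remote_code=True
).eval()

# The bundled helper interleaves image references and text into one prompt.
query = tokenizer.from_list_format([
    {"image": "street.jpg"},  # hypothetical local image path
    {"text": "Where is the bus in this picture?"},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```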