Awesome-Foundation-Models
A curated list of foundation models for vision and language tasks
https://github.com/uncbiag/Awesome-Foundation-Models
-
Survey
-
Before 2024
- Foundation Models for Generalist Medical Artificial Intelligence
- Large Multimodal Models: Notes on CVPR 2023 Tutorial
- Foundational Models in Medical Imaging: A Comprehensive Survey and Future Vision
- Foundational Models Defining a New Era in Vision: A Survey and Outlook
- A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models
- A Survey on Multimodal Large Language Models
- Vision-Language Models for Vision Tasks: A Survey
- A Comprehensive Survey on Pretrained Foundation Models: A History from BERT to ChatGPT
- A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT
- Multimodal Foundation Models: From Specialists to General-Purpose Assistants
- Towards Generalist Foundation Model for Radiology
- Towards Generalist Biomedical AI
- Vision-language pre-training: Basics, recent advances, and future trends
- On the Opportunities and Risks of Foundation Models
-
2024
- Towards Vision-Language Geo-Foundation Model: A Survey
- Foundation Models for Video Understanding: A Survey
- Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond
- Prospective Role of Foundation Models in Advancing Autonomous Vehicles
- Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey
- Real-World Robot Applications of Foundation Models: A Review
- Towards the Unification of Generative and Discriminative Visual Foundation Model: A Survey
- From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities
- A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
- Image Segmentation in Foundation Model Era: A Survey
- Large Multimodal Agents: A Survey
- The Uncanny Valley: A Comprehensive Analysis of Diffusion Models
- Efficient Multimodal Large Language Models: A Survey
- An Introduction to Vision-Language Modeling
- The Evolution of Multimodal Model Architectures
- Language Agents
- A Systematic Survey on Large Language Models for Algorithm Design
-
2023
- Foundation Models for Generalist Medical Artificial Intelligence
-
-
Papers by Date
-
2024
- 05/22 - https://github.com/prov-gigapath/prov-gigapath
- 06/06 - https://github.com/NX-AI/vision-lstm
- 04/14
- 05/22
- 01/19 - https://github.com/LiheYoung/Depth-Anything
- 05/09 - https://github.com/Alpha-VLLM/Lumina-T2X
- 05/08
- 05/07
- 05/03 - https://github.com/reka-ai/reka-vibe-eval
- 04/30
- 04/26
- 02/28 - https://github.com/hao-ai-lab/Consistency_LLM
- 05/17
- 07/24
- 07/17
- 07/12
- 07/31
- 07/29 - https://github.com/facebookresearch/segment-anything-2
- 03/14
- 01/30 - https://github.com/AILab-CVC/YOLO-World
- 08/14
- 01/22 - https://github.com/Stanford-AIMI/CheXagent
- 02/27
- 09/25
- 05/25
- 02/06 - https://github.com/Meituan-AutoML/MobileVLM
- 01/15
- 01/23
- 03/18
- 03/14
- 03/09 - (Chapel Hill) https://github.com/uncbiag/uniGradICON
- 08/22
- 02/21
- 02/20 - https://github.com/NUS-HPC-AI-Lab/Neural-Network-Diffusion
- 01/16
- 04/02
- 04/02
- 04/10
- 02/20
- 02/19
- 03/05
- 03/01 - https://github.com/Meituan-AutoML/VisionLLaMA
- 02/22
- 03/22
- 09/30
- 09/27
- 03/01
- 05/21 - https://github.com/microsoft/BiomedParse
- 05/14
- 05/06
- 10/01
- 09/18 - https://github.com/QwenLM/Qwen2-VL
- 09/18 - https://github.com/kyutai-labs/moshi
- 08/27
- 10/30
- 10/21
- 11/14
- 11/13
- 11/07
- 05/04 - https://github.com/YuchuanTian/U-DiT
- 10/30 - https://github.com/Haiyang-W/TokenFormer
- 06/10
- 05/31
- 05/20 - https://github.com/octo-models/octo
- 10/04
- 10/02
- 10/10 - https://github.com/AILab-CVC/UniRepLKNet
- 06/24 - https://github.com/cambrian-mllm/cambrian
- 06/13 - https://github.com/apple/ml-4m
- 10/31
- 12/04
- 12/03
- 11/21
- 04/02 - https://github.com/hellomuffin/iterated-learning-for-vlm
- 12/19 - https://github.com/Genesis-Embodied-AI/Genesis
-
2022
- Towards artificial general intelligence via a multimodal foundation model
- GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
- Foundation Transformers
- Image Segmentation Using Text and Image Prompts
- VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
- GLIPv2: Unifying Localization and VL Understanding
- A Unified Sequence Interface for Vision Tasks
- Imagen: Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
- Stable Diffusion: High-Resolution Image Synthesis with Latent Diffusion Models
- Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models (BIG-Bench: a 204-task extremely difficult and diverse benchmark for LLMs, 444 authors from 132 institutions)
- CRIS: CLIP-Driven Referring Image Segmentation
- Masked Autoencoders As Spatiotemporal Learners
- Unified Vision and Language Prompt Learning
- BEVT: BERT Pretraining of Video Transformers
- A Generalist Agent (a multi-modal, multi-task, multi-embodiment generalist agent; from DeepMind)
- FIBER: Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
- Flamingo: a Visual Language Model for Few-Shot Learning
- MetaLM: Language Models are General-Purpose Interfaces
- Point-E: A System for Generating 3D Point Clouds from Complex Prompts (text-to-image diffusion model; from OpenAI)
- Unifying Flow, Stereo and Depth Estimation
- CoCa: Contrastive Captioners are Image-Text Foundation Models
- PaLI: A Jointly-Scaled Multilingual Language-Image Model
- NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis
- PaLM: Scaling Language Modeling with Pathways
- Parti: Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
- DALL-E2: Hierarchical Text-Conditional Image Generation with CLIP Latents
- Robust and Efficient Medical Imaging with Self-Supervision
- Video Swin Transformer
- Masked Autoencoders Are Scalable Vision Learners
- InstructGPT: Training language models to follow instructions with human feedback
- OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
- Mask2Former: Masked-attention Mask Transformer for Universal Image Segmentation
- FLAVA: A Foundational Language And Vision Alignment Model
- FILIP: Fine-Grained Interactive Language-Image Pre-Training
- SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
- SLIP: Self-supervision meets Language-Image Pre-training
- GLIP: Grounded Language-Image Pre-training
-
2023
- Meta-Transformer: A Unified Framework for Multimodal Learning
- Visual Instruction Tuning (Madison and Microsoft) - https://github.com/haotian-liu/LLaVA
- Visual Prompt Multi-Modal Tracking
- Tracking Everything Everywhere All at Once
- Foundation Models for Generalist Geospatial Artificial Intelligence
- InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition
- The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World
- Retentive Network: A Successor to Transformer for Large Language Models
- Neural World Models for Computer Vision
- Recognize Anything: A Strong Image Tagging Model
- Towards Visual Foundation Models of Physical Scenes (general-purpose visual representations of physical scenes)
- LIMA: Less Is More for Alignment
- PaLM 2 Technical Report
- IMAGEBIND: One Embedding Space To Bind Them All
- Images Speak in Images: A Generalist Painter for In-Context Visual Learning
- UniDetector: Detecting Everything in the Open World: Towards Universal Object Detection
- Unmasked Teacher: Towards Training-Efficient Video Foundation Models
- Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks
- EVA-CLIP: Improved Training Techniques for CLIP at Scale
- EVA-02: A Visual Representation for Neon Genesis
- EVA-01: Exploring the Limits of Masked Visual Representation Learning at Scale
- LLaMA: Open and Efficient Foundation Language Models
- The effectiveness of MAE pre-pretraining for billion-scale pretraining
- BloombergGPT: A Large Language Model for Finance
- BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
- Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
- UNINEXT: Universal Instance Perception as Object Discovery and Retrieval
- InternVideo: General Video Foundation Models via Generative and Discriminative Learning
- InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions
- BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning
- SegGPT: Segmenting Everything In Context
- SAM: Segment Anything - https://github.com/facebookresearch/segment-anything
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces (scales linearly with sequence length; from CMU)
- FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects
- FLIP: Scaling Language-Image Pre-training via Masking
- GPT-4 Technical Report
- SEEM: Segment Everything Everywhere All at Once (Madison, HKUST, and Microsoft) - https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once
- BioCLIP: A Vision Foundation Model for the Tree of Life
- LLaMA 2: Open Foundation and Fine-Tuned Chat Models
-
2021
- Unifying Vision-and-Language Tasks via Text Generation (Chapel Hill)
- UniT: Multimodal Multitask Learning with a Unified Transformer
- WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training (large-scale Chinese multimodal pre-training model called BriVL; from Renmin University of China)
- Codex: Evaluating Large Language Models Trained on Code
- Florence: A New Foundation Model for Computer Vision
- DALL-E: Zero-Shot Text-to-Image Generation
- Multimodal Few-Shot Learning with Frozen Language Models
- Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (self-attention blocks; ICLR, from Google)
- ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
-
Before 2021
- UNITER: UNiversal Image-TExt Representation Learning
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- Attention Is All You Need
- LXMERT: Learning Cross-Modality Encoder Representations from Transformers (Chapel Hill)
- GPT-3: Language Models are Few-Shot Learners (in-context learning compared with GPT-2; from OpenAI)
-
2025
- 01/06 - https://github.com/NVIDIA/Cosmos?tab=readme-ov-file
-
-
Topics
-
Large Language Models (LLM)
- LLaMA 2: Open Foundation and Fine-Tuned Chat Models
- GPT-3: Language Models are Few-Shot Learners (in-context learning compared with GPT-2; from OpenAI)
-
Training Efficiency
-
Towards Artificial General Intelligence (AGI)
-
Large Language Models
-
Perception Tasks: Detection, Segmentation, and Pose Estimation
- SEEM: Segment Everything Everywhere All at Once (Madison, HKUST, and Microsoft)
- SegGPT: Segmenting Everything In Context
-
Vision-Language Pretraining
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (off-the-shelf frozen vision and language models; from Salesforce Research)
-
-
Papers by Topic
-
Large Benchmarks
-
Large Language/Multimodal Models
- GPT-3: Language Models are Few-Shot Learners (in-context learning compared with GPT-2; from OpenAI)
- GPT-2: Language Models are Unsupervised Multitask Learners
- LLaVA: Visual Instruction Tuning (Madison) - https://github.com/haotian-liu/LLaVA
- MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models - https://github.com/Vision-CAIR/MiniGPT-4
- GPT: Improving Language Understanding by Generative Pre-Training
- T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer - https://github.com/google-research/text-to-text-transfer-transformer
-
Vision-Language Pretraining
- RegionCLIP: Region-Based Language-Image Pretraining
- ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
- CLIP: Learning Transferable Visual Models From Natural Language Supervision
-
AI Safety and Responsibility
-
Linear Attention
- FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning - https://github.com/Dao-AILab/flash-attention
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness - https://github.com/Dao-AILab/flash-attention
-
-
Related Awesome Repositories
-
AI Safety and Responsibility
- Awesome-Diffusion-Models - https://github.com/diff-usion/Awesome-Diffusion-Models
- Awesome-Video-Diffusion-Models - https://github.com/ChenHsing/Awesome-Video-Diffusion-Models
- Awesome-Diffusion-Model-Based-Image-Editing-Methods - https://github.com/SiatMMLab/Awesome-Diffusion-Model-Based-Image-Editing-Methods
- Awesome-CV-Foundational-Models - https://github.com/awaisrauf/Awesome-CV-Foundational-Models
- Awesome-Healthcare-Foundation-Models - https://github.com/Jianing-Qiu/Awesome-Healthcare-Foundation-Models
- awesome-large-multimodal-agents - https://github.com/jun0wanan/awesome-large-multimodal-agents
- Computer Vision in the Wild (CVinW) - https://github.com/Computer-Vision-in-the-Wild/CVinW_Readings
-