Awesome-CV-Foundational-Models
https://github.com/awaisrauf/Awesome-CV-Foundational-Models
2023
- [Paper Link
- [Paper Link
- [Paper Link
- [Paper Link - AI/scaling-laws-openclip) <br> <details><summary>Read Abstract</summary>Scaling up neural networks has led to remarkable performance across a wide
- [Paper Link - trained models have substantially improved the
- [Paper Link
- [Paper Link - chatgpt) <br> <details><summary>Read Abstract</summary>ChatGPT is attracting a cross-field interest as it provides a language
- [Paper Link - 02, a next-generation Transformer-based visual representation
- [Paper Link
- [Paper Link - level robot task plans from high-level natural language
- [Paper Link - image pre-training, CLIP for short, has gained
- [Paper Link
- [Paper Link - view inputs has been proven to alleviate
- [Paper Link
- [Paper Link
- [Paper Link
- [Paper Link
- [Paper Link
- [Paper Link
- [Paper Link - zheng/GENIUS) <br> <details><summary>Read Abstract</summary>We investigate the potential of GPT-4 to perform Neural
- [Paper Link
- [Paper Link - SAM-Adapter) <br> <details><summary>Read Abstract</summary>The Segment Anything Model (SAM) has recently gained popularity in the field
- [Paper Link - Adapter) <br> <details><summary>Read Abstract</summary>How to efficiently transform large language models (LLMs) into instruction
- [Paper Link - language LLM (VL-LLM) by pre-training on
- [Paper Link
- [Paper Link - scale pre-training and instruction tuning have been successful at
- [Paper Link - VLAA/CLIPA) <br> <details><summary>Read Abstract</summary>CLIP, the first foundation model that connects images and text, has enabled
- [Paper Link
- [Paper Link
- [Paper Link - modal models have been developed for joint image and
- [Paper Link - text pairs from the web is one of the most
- [Paper Link - oryx/XrayGPT) <br> <details><summary>Read Abstract</summary>The latest breakthroughs in large vision-language models, such as Bard and
- [Paper Link - MAE) <br> <details><summary>Read Abstract</summary>Artificial intelligence (AI) has seen a tremendous surge in capabilities
- [Paper Link - mercury/COSA) <br> <details><summary>Read Abstract</summary>Due to the limited scale and quality of video-text training corpus, most
- [Paper Link - Med) <br> <details><summary>Read Abstract</summary>Obtaining large pre-trained models that can be fine-tuned to new tasks with
- [Paper Link - IVA-Lab/FastSAM) <br> <details><summary>Read Abstract</summary>The recently proposed segment anything model (SAM) has made a significant
- [Paper Link
- [Paper Link
- [Paper Link
- [Paper Link
- [Paper Link - VLAA/CLIPA) <br> <details><summary>Read Abstract</summary>The recent work CLIPA presents an inverse scaling law for CLIP training --
- [Paper Link
- [Paper Link - air/Endo-FM) <br> <details><summary>Read Abstract</summary>Foundation models have exhibited remarkable success in various applications,
- [Paper Link
- [Paper Link - DA) <br> <details><summary>Read Abstract</summary>Domain adaptation (DA) has demonstrated significant promise for real-time
- [Paper Link - Med) <br> <details><summary>Read Abstract</summary>Obtaining large pre-trained models that can be fine-tuned to new tasks with limited annotated samples has remained an open challenge for medical imaging data. While pre-trained deep networks on ImageNet and vision-language foundation models trained on web-scale data are prevailing approaches, their effectiveness on medical tasks is limited due to the significant domain shift between natural and medical images. To bridge this gap, we introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets. We have collected approximately 1.3 million medical images from 55 publicly available datasets, covering a large number of organs and modalities such as CT, MRI, X-ray, and Ultrasound. We benchmark several state-of-the-art self-supervised algorithms on this dataset and propose a novel self-supervised contrastive learning algorithm using a graph-matching formulation. The proposed approach makes three contributions: (i) it integrates prior pair-wise image similarity metrics based on local and global information; (ii) it captures the structural constraints of feature embeddings through a loss function constructed via a combinatorial graph-matching objective; and (iii) it can be trained efficiently end-to-end using modern gradient-estimation techniques for black-box solvers. We thoroughly evaluate the proposed LVM-Med on 15 downstream medical tasks ranging from segmentation and classification to object detection, and both for the in and out-of-distribution settings. LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models. For challenging tasks such as Brain Tumor Classification or Diabetic Retinopathy Grading, LVM-Med improves previous vision-language models trained on 1 billion masks by 6-7% while using only a ResNet-50.</details> (A minimal illustrative sketch of the matching-based contrastive idea appears after this list.)
- [Paper Link - 1B), the foundation Segment Anything Model
- [Paper Link
- [Paper Link - FM) <br> <details><summary>Read Abstract</summary>Pretraining with large-scale 3D volumes has a potential for improving the
- [Paper Link - powered embodied lifelong learning agent
- [Paper Link - oryx/Video-ChatGPT) <br> <details><summary>Read Abstract</summary>Conversation agents fueled by Large Language Models (LLMs) are providing a
- [Paper Link
- [Paper Link
- [Paper Link - based learning has emerged as a successful paradigm in natural
- [Paper Link - level pretraining and adaptation, which are limited for dynamic and complex video-level understanding tasks. To fill the gap, we present general video foundation models, InternVideo, by taking advantage of both generative and discriminative self-supervised video learning. Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications. Without bells and whistles, InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications. Especially, our methods can obtain 91.1% and 77.2% top-1 accuracy on the challenging Kinetics-400 and Something-Something V2 benchmarks, respectively. All of these results effectively show the generality of our InternVideo for video understanding. The code will be released at https://github.com/OpenGVLab/InternVideo.</details> (A minimal illustrative sketch of coordinating the two pretraining objectives appears after this list.)
- [Paper Link
- [Paper Link - REACT, a system paradigm that integrates ChatGPT with a pool of
- [Paper Link - anything.com) <br> <details><summary>Read Abstract</summary>We introduce the Segment Anything (SA) project: a new task, model, and
- [Paper Link - Robot-Manipulation-Prompts) <br> <details><summary>Read Abstract</summary>This paper demonstrates how OpenAI's ChatGPT can be used in a few-shot
- [Paper Link - CAIR/ChatCaptioner) <br> <details><summary>Read Abstract</summary>Video captioning aims to convey dynamic scenes from videos using natural
- [Paper Link
- [Paper Link
- [Paper Link
- [Paper Link
- [Paper Link - language models have shown impressive multi-modal generation
- [Paper Link - x-yang/Segment-and-Track-Anything) <br> <details><summary>Read Abstract</summary>This report presents a framework called Segment And Track Anything (SAMTrack)
- [Paper Link - context learning, as a new paradigm in NLP, allows the model to rapidly
- [Paper Link - only language models
- [Paper Link
- [Paper Link - generated
- [Paper Link - 4 has demonstrated extraordinary multi-modal abilities, such
- [Paper Link
- [Paper Link - 2.) <br> <details><summary>Read Abstract</summary>We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new
- [Paper Link - purpose pre-trained models ("foundation models") have enabled
- [Paper Link - pt) <br> <details><summary>Read Abstract</summary>The Segment Anything Model (SAM) has established itself as a powerful
- [Paper Link - and-language pre-training has become increasingly
- [Paper Link - grop) <br> <details><summary>Read Abstract</summary>Multi-object rearrangement is a crucial skill for service robots, and
- [Paper Link - 4, a large-scale, multimodal model which can
- [Paper Link - Anything) <br> <details><summary>Read Abstract</summary>Recently, the Segment Anything Model (SAM) gains lots of attention rapidly
- [Paper Link
- [Paper Link - Anything) <br> <details><summary>Read Abstract</summary>Controllable image captioning is an emerging multimodal topic that aims to
- [Paper Link - Modality-Arena) <br> <details><summary>Read Abstract</summary>Large Vision-Language Models (LVLMs) have recently played a dominant role in
- [Paper Link - ai/cream) <br> <details><summary>Read Abstract</summary>Advances in Large Language Models (LLMs) have inspired a surge of research
- [Paper Link - purpose foundation models have become increasingly important in the
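The LVM-Med entry above describes a self-supervised contrastive objective in which positives are chosen by a combinatorial graph-matching solver. The sketch below is a deliberately simplified, hypothetical illustration of that idea, not the paper's implementation: the Hungarian algorithm (`scipy.optimize.linear_sum_assignment`) stands in for the black-box graph-matching solver, the prior local/global similarity terms are omitted, and no gradient flows through the solver (the paper uses dedicated gradient-estimation techniques for that step).

```python
# Illustrative sketch only: matching-based positive selection for contrastive learning.
# Assumptions: two augmented views per image, Hungarian matching as a stand-in solver,
# no gradient propagated through the solver.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment


def matching_contrastive_loss(z1: torch.Tensor, z2: torch.Tensor,
                              temperature: float = 0.1) -> torch.Tensor:
    """z1, z2: (N, D) embeddings of two augmented views of the same image batch."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    sim = z1 @ z2.t()  # (N, N) pairwise cosine similarities

    # Pick one positive in view 2 for every sample in view 1 by solving a
    # linear assignment problem on the similarity matrix.
    _, col = linear_sum_assignment(sim.detach().cpu().numpy(), maximize=True)
    targets = torch.as_tensor(col, dtype=torch.long, device=sim.device)

    # InfoNCE over the matched pairs.
    return F.cross_entropy(sim / temperature, targets)


if __name__ == "__main__":
    z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
    print(matching_contrastive_loss(z1, z2))
```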
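The InternVideo entry above combines a generative objective (masked video modeling) with a discriminative one (video-language contrastive learning) and coordinates the two in a learnable manner. The following sketch is a minimal, hypothetical version of that combination under assumed tensor shapes; the class, parameter names, and weighting scheme are placeholders for illustration and do not come from the released InternVideo code.

```python
# Illustrative sketch only: learnable coordination of a masked-video-modeling loss
# and a video-text contrastive loss. Shapes and module names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CoordinatedObjectives(nn.Module):
    """Weights two pretraining losses with learnable coefficients."""

    def __init__(self, temperature: float = 0.07):
        super().__init__()
        self.log_w = nn.Parameter(torch.zeros(2))  # learnable coordination weights
        self.temperature = temperature

    def forward(self, recon, target, mask, video_emb, text_emb):
        # recon/target: (B, N, C) patch reconstructions and ground-truth patches;
        # mask: (B, N) bool, True where patches were masked out;
        # video_emb/text_emb: (B, D) pooled clip and caption embeddings.
        mvm_loss = F.mse_loss(recon[mask], target[mask])  # masked video modeling

        v = F.normalize(video_emb, dim=-1)
        t = F.normalize(text_emb, dim=-1)
        logits = v @ t.t() / self.temperature
        labels = torch.arange(v.size(0), device=v.device)
        contrastive = 0.5 * (F.cross_entropy(logits, labels) +
                             F.cross_entropy(logits.t(), labels))

        w = torch.softmax(self.log_w, dim=0)  # learnable mixing of the two terms
        return w[0] * mvm_loss + w[1] * contrastive


if __name__ == "__main__":
    B, N, C, D = 4, 16, 768, 512
    crit = CoordinatedObjectives()
    loss = crit(torch.randn(B, N, C), torch.randn(B, N, C),
                torch.rand(B, N) > 0.5, torch.randn(B, D), torch.randn(B, D))
    print(loss)
```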
Citation
- **A Survey of Large Language Models**
- **Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond**
- **Multimodal Learning with Transformers: A Survey**
- **Self-Supervised Multimodal Learning: A Survey**
- **A Survey of Vision-Language Pre-Trained Models**
- **Vision-Language Models for Vision Tasks: A Survey**
- **A Comprehensive Survey on Segment Anything Model for Vision and Beyond**
- **Vision-language pre-training: Basics, recent advances, and future trends**
- **Towards Open Vocabulary Learning: A Survey**
- **Transformer-Based Visual Segmentation: A Survey**
2021
- [Paper Link - trained representations are becoming crucial for many NLP and perception
- [Paper Link - of-the-art computer vision systems are trained to predict a fixed set
- [Paper Link - modal pre-training models have been intensively explored to bridge
- [Paper Link - language pre-training
- [Paper Link
- [Paper Link
- [Paper Link - modal language-vision models trained on hundreds of millions of
- [Paper Link - scale vision-language pre-training has shown promising
- [Paper Link - vocabulary object detection, which detects objects
- [Paper Link - regressive language models exhibit the
- [Paper Link
2022
- [Paper Link - of-the-art vision and vision-and-language models rely on large-scale
- [Paper Link
- [Paper Link - scale pre-training has led to
- [Paper Link
- [Paper Link
- [Paper Link - Language (VL) models with the Two-Tower architecture have dominated
- [Paper Link - conditioned policies for robotic navigation can be trained on large,
- [Paper Link - vocabulary image segmentation model to organize an image
- [Paper Link - language (VL) pre-training has recently received considerable
- [Paper Link
- [Paper Link - to-text Transformer,
- [Paper Link - Image Pre-training (CLIP) has made a remarkable
- [Paper Link
- [Paper Link - scale pretrained foundation models is of significant interest