Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


Awesome-CV-Foundational-Models


https://github.com/awaisrauf/Awesome-CV-Foundational-Models


  • 2023

    • [Paper Link
    • [Paper Link
    • [Paper Link
    • [Paper Link - AI/scaling-laws-openclip) <br> <details><summary>Read Abstract</summary>Scaling up neural networks has led to remarkable performance across a wide…</details>
    • [Paper Link - trained models have substantially improved the…
    • [Paper Link
    • [Paper Link - chatgpt) <br> <details><summary>Read Abstract</summary>ChatGPT is attracting a cross-field interest as it provides a language…</details>
    • [Paper Link - 02, a next-generation Transformer-based visual representation…
    • [Paper Link
    • [Paper Link - level robot task plans from high-level natural language…
    • [Paper Link - image pre-training, CLIP for short, has gained…
    • [Paper Link
    • [Paper Link - view inputs has been proven to alleviate…
    • [Paper Link
    • [Paper Link
    • [Paper Link
    • [Paper Link
    • [Paper Link
    • [Paper Link
    • [Paper Link - zheng/GENIUS) <br> <details><summary>Read Abstract</summary>We investigate the potential of GPT-4~\cite{gpt4} to perform Neural…</details>
    • [Paper Link
    • [Paper Link - SAM-Adapter) <br> <details><summary>Read Abstract</summary>The Segment Anything Model (SAM) has recently gained popularity in the field…</details>
    • [Paper Link - Adapter) <br> <details><summary>Read Abstract</summary>How to efficiently transform large language models (LLMs) into instruction…</details>
    • [Paper Link - language LLM (VL-LLM) by pre-training on…
    • [Paper Link
    • [Paper Link - scale pre-training and instruction tuning have been successful at…
    • [Paper Link - VLAA/CLIPA) <br> <details><summary>Read Abstract</summary>CLIP, the first foundation model that connects images and text, has enabled…</details> (a minimal sketch of the image-text contrastive objective behind CLIP appears after this list)
    • [Paper Link
    • [Paper Link
    • [Paper Link - modal models have been developed for joint image and…
    • [Paper Link - text pairs from the web is one of the most…
    • [Paper Link - oryx/XrayGPT) <br> <details><summary>Read Abstract</summary>The latest breakthroughs in large vision-language models, such as Bard and…</details>
    • [Paper Link - MAE) <br> <details><summary>Read Abstract</summary>Artificial intelligence (AI) has seen a tremendous surge in capabilities…</details>
    • [Paper Link - mercury/COSA) <br> <details><summary>Read Abstract</summary>Due to the limited scale and quality of video-text training corpus, most…</details>
    • [Paper Link - Med) <br> <details><summary>Read Abstract</summary>Obtaining large pre-trained models that can be fine-tuned to new tasks with…</details>
    • [Paper Link - IVA-Lab/FastSAM) <br> <details><summary>Read Abstract</summary>The recently proposed segment anything model (SAM) has made a significant…</details>
    • [Paper Link
    • [Paper Link
    • [Paper Link
    • [Paper Link
    • [Paper Link - VLAA/CLIPA) <br> <details><summary>Read Abstract</summary>The recent work CLIPA presents an inverse scaling law for CLIP training --…</details>
    • [Paper Link
    • [Paper Link - air/Endo-FM) <br> <details><summary>Read Abstract</summary>Foundation models have exhibited remarkable success in various applications,…</details>
    • [Paper Link
    • [Paper Link - DA) <br> <details><summary>Read Abstract</summary>Domain adaptation (DA) has demonstrated significant promise for real-time…</details>
    • [Paper Link - Med) <br> <details><summary>Read Abstract</summary>Obtaining large pre-trained models that can be fine-tuned to new tasks with limited annotated samples has remained an open challenge for medical imaging data. While pre-trained deep networks on ImageNet and vision-language foundation models trained on web-scale data are prevailing approaches, their effectiveness on medical tasks is limited due to the significant domain shift between natural and medical images. To bridge this gap, we introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets. We have collected approximately 1.3 million medical images from 55 publicly available datasets, covering a large number of organs and modalities such as CT, MRI, X-ray, and Ultrasound. We benchmark several state-of-the-art self-supervised algorithms on this dataset and propose a novel self-supervised contrastive learning algorithm using a graph-matching formulation. The proposed approach makes three contributions: (i) it integrates prior pair-wise image similarity metrics based on local and global information; (ii) it captures the structural constraints of feature embeddings through a loss function constructed via a combinatorial graph-matching objective; and (iii) it can be trained efficiently end-to-end using modern gradient-estimation techniques for black-box solvers. We thoroughly evaluate the proposed LVM-Med on 15 downstream medical tasks ranging from segmentation and classification to object detection, and both for the in and out-of-distribution settings. LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models. For challenging tasks such as Brain Tumor Classification or Diabetic Retinopathy Grading, LVM-Med improves previous vision-language models trained on 1 billion masks by 6-7% while using only a ResNet-50.</details>
    • [Paper Link - 1B), the foundation Segment Anything Model…
    • [Paper Link
    • [Paper Link - FM) <br> <details><summary>Read Abstract</summary>Pretraining with large-scale 3D volumes has a potential for improving the…</details>
    • [Paper Link - powered embodied lifelong learning agent…
    • [Paper Link - oryx/Video-ChatGPT) <br> <details><summary>Read Abstract</summary>Conversation agents fueled by Large Language Models (LLMs) are providing a…</details>
    • [Paper Link
    • [Paper Link
    • [Paper Link - based learning has emerged as a successful paradigm in natural…
    • [Paper Link - level pretraining and adaptation, which are limited for dynamic and complex video-level understanding tasks. To fill the gap, we present general video foundation models, InternVideo, by taking advantage of both generative and discriminative self-supervised video learning. Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications. Without bells and whistles, InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications. Especially, our methods can obtain 91.1% and 77.2% top-1 accuracy on the challenging Kinetics-400 and Something-Something V2 benchmarks, respectively. All of these results effectively show the generality of our InternVideo for video understanding. The code will be released at https://github.com/OpenGVLab/InternVideo. (a toy sketch of combining these two pretraining objectives appears after this list)
    • [Paper Link
    • [Paper Link - REACT, a system paradigm that integrates ChatGPT with a pool of…
    • [Paper Link - anything.com) <br> <details><summary>Read Abstract</summary>We introduce the Segment Anything (SA) project: a new task, model, and…</details>
    • [Paper Link - Robot-Manipulation-Prompts) <br> <details><summary>Read Abstract</summary>This paper demonstrates how OpenAI's ChatGPT can be used in a few-shot…</details>
    • [Paper Link - CAIR/ChatCaptioner) <br> <details><summary>Read Abstract</summary>Video captioning aims to convey dynamic scenes from videos using natural…</details>
    • [Paper Link
    • [Paper Link
    • [Paper Link
    • [Paper Link - based learning has emerged as a successful paradigm in natural…
    • [Paper Link
    • [Paper Link - oryx/Video-ChatGPT) <br> <details><summary>Read Abstract</summary>Conversation agents fueled by Large Language Models (LLMs) are providing a…</details>
    • [Paper Link - language models have shown impressive multi-modal generation…
    • [Paper Link - x-yang/Segment-and-Track-Anything) <br> <details><summary>Read Abstract</summary>This report presents a framework called Segment And Track Anything (SAMTrack)…</details>
    • [Paper Link - context learning, as a new paradigm in NLP, allows the model to rapidly…
    • [Paper Link - only language models…
    • [Paper Link
    • [Paper Link - Robot-Manipulation-Prompts) <br> <details><summary>Read Abstract</summary>This paper demonstrates how OpenAI's ChatGPT can be used in a few-shot…</details>
    • [Paper Link - generated…
    • [Paper Link - 4 has demonstrated extraordinary multi-modal abilities, such…
    • [Paper Link
    • [Paper Link - 2.) <br> <details><summary>Read Abstract</summary>We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new…</details>
    • [Paper Link - purpose pre-trained models ("foundation models") have enabled…
    • [Paper Link - pt) <br> <details><summary>Read Abstract</summary>The Segment Anything Model (SAM) has established itself as a powerful…</details>
    • [Paper Link - and-language pre-training has become increasingly…
    • [Paper Link - grop) <br> <details><summary>Read Abstract</summary>Multi-object rearrangement is a crucial skill for service robots, and…</details>
    • [Paper Link - 4, a large-scale, multimodal model which can…
    • [Paper Link - Anything) <br> <details><summary>Read Abstract</summary>Recently, the Segment Anything Model (SAM) gains lots of attention rapidly…</details>
    • [Paper Link
    • [Paper Link - Anything) <br> <details><summary>Read Abstract</summary>Controllable image captioning is an emerging multimodal topic that aims to…</details>
    • [Paper Link - Modality-Arena) <br> <details><summary>Read Abstract</summary>Large Vision-Language Models (LVLMs) have recently played a dominant role in…</details>
    • [Paper Link - ai/cream) <br> <details><summary>Read Abstract</summary>Advances in Large Language Models (LLMs) have inspired a surge of research…</details>
    • [Paper Link - purpose foundation models have become increasingly important in the…
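
Many of the 2023 entries above (CLIP and its follow-ups such as CLIPA and the open-CLIP scaling-law study) share the same core recipe: image-text contrastive pre-training. As a reference point only, here is a minimal PyTorch-style sketch of that symmetric contrastive (InfoNCE) objective; the encoders, embedding size, batch, and fixed temperature are placeholder assumptions, not code from any repository linked in this list.

```python
# Minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss.
# The image/text embeddings are assumed to come from any pair of encoders;
# they stand in for the actual models used by the papers listed above.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity logits between every image and every caption
    # in the batch, scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature

    # The matching pair sits on the diagonal, so the target for row i is i.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: pick the right caption for each image
    # and the right image for each caption.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random embeddings standing in for encoder outputs.
if __name__ == "__main__":
    img = torch.randn(8, 512)   # batch of 8 image embeddings
    txt = torch.randn(8, 512)   # batch of 8 paired text embeddings
    print(clip_contrastive_loss(img, txt).item())
```

In the papers themselves the temperature is usually learned and the encoders are large vision and text transformers; this sketch only fixes the shared loss they build on.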
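The InternVideo entry above describes pairing masked video modeling with video-language contrastive learning and coordinating the two resulting video representations "in a learnable manner". The sketch below shows one way such a two-objective setup can be wired together. It is a toy under explicit assumptions (tiny stand-in transformer encoders, a single sigmoid gate for the coordination step, fixed temperature) and is not the released InternVideo implementation at https://github.com/OpenGVLab/InternVideo.

```python
# Illustrative two-objective video pretrainer: masked video modeling plus
# video-text contrastive learning, with a learnable blend of the two video
# representations. All module names and shapes are assumptions for the sketch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamVideoPretrainer(nn.Module):
    def __init__(self, token_dim: int = 256, text_dim: int = 256):
        super().__init__()
        # Stand-ins for the generative (masked-modeling) and discriminative
        # (contrastive) video encoders; real models would be much larger.
        layer = lambda: nn.TransformerEncoderLayer(token_dim, nhead=4, batch_first=True)
        self.mvm_encoder = nn.TransformerEncoder(layer(), num_layers=2)
        self.contrastive_encoder = nn.TransformerEncoder(layer(), num_layers=2)
        self.decoder = nn.Linear(token_dim, token_dim)    # reconstructs masked tokens
        self.text_proj = nn.Linear(text_dim, token_dim)   # maps text features to the shared space
        self.mask_token = nn.Parameter(torch.zeros(token_dim))
        # Learnable scalar gate blending the two video representations
        # (a placeholder for the paper's coordination mechanism).
        self.gate = nn.Parameter(torch.tensor(0.0))

    def forward(self, video_tokens, text_emb, mask_ratio: float = 0.5):
        B, N, _ = video_tokens.shape

        # Masked video modeling: hide a random subset of patch tokens and
        # reconstruct them from the visible context.
        mask = torch.rand(B, N, device=video_tokens.device) < mask_ratio
        masked = torch.where(mask.unsqueeze(-1), self.mask_token, video_tokens)
        recon = self.decoder(self.mvm_encoder(masked))
        mvm_loss = F.mse_loss(recon[mask], video_tokens[mask])

        # Video-text contrastive learning: align pooled video features with
        # paired text embeddings via a symmetric InfoNCE loss.
        vid_feat = self.contrastive_encoder(video_tokens).mean(dim=1)
        txt_feat = self.text_proj(text_emb)
        logits = F.normalize(vid_feat, dim=-1) @ F.normalize(txt_feat, dim=-1).t() / 0.07
        targets = torch.arange(B, device=logits.device)
        con_loss = (F.cross_entropy(logits, targets) +
                    F.cross_entropy(logits.t(), targets)) / 2

        # Learnable coordination of the two video representations.
        g = torch.sigmoid(self.gate)
        fused = g * self.mvm_encoder(video_tokens).mean(dim=1) + (1 - g) * vid_feat
        return mvm_loss + con_loss, fused

# Toy usage: 4 clips of 16 patch tokens each, with 4 paired text embeddings.
# model = TwoStreamVideoPretrainer()
# loss, fused = model(torch.randn(4, 16, 256), torch.randn(4, 256))
```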
  • 2022

  • 2021

  • Citation