An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with multimodal-learning

A curated list of projects in awesome lists tagged with multimodal-learning.

https://github.com/mlfoundations/open_flamingo

An open-source framework for training large multimodal models.

computer-vision deep-learning flamingo in-context-learning language-model multimodal-learning pytorch

Last synced: 09 Apr 2025

https://github.com/kaiyangzhou/coop

Prompt Learning for Vision-Language Models (IJCV'22, CVPR'22)

foundation-models multimodal-learning prompt-learning

Last synced: 15 May 2025

https://github.com/KaiyangZhou/CoOp

Prompt Learning for Vision-Language Models (IJCV'22, CVPR'22)

foundation-models multimodal-learning prompt-learning

Last synced: 16 Mar 2025

https://github.com/ailab-cvc/unireplknet

[CVPR'24] UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition

architecture artificial-intelligence convolutional-neural-networks deep-learning multimodal-learning

Last synced: 15 May 2025

https://github.com/dmitryryumin/iccv-2023-papers

ICCV 2023 Papers: Discover cutting-edge research from ICCV 2023, the leading computer vision conference. Stay updated on the latest in computer vision and deep learning, with code included. ⭐ support visual intelligence development!

3d-graphics 3d-reconstruction biometrics computer-vision datasets deep-learning explainable-ai face-recognition gesture-recognition iccv iccv2023 image-processing image-synthesis multimodal-learning pattern-recognition photogrammetry pose-estimation robotics transfer-learning video-synthesis

Last synced: 16 May 2025

https://github.com/AILab-CVC/UniRepLKNet

[CVPR'24] UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition

architecture artificial-intelligence convolutional-neural-networks deep-learning multimodal-learning

Last synced: 20 Mar 2025

https://github.com/ArrowLuo/CLIP4Clip

An official implementation for "CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval"

activitynet clip didemo lsmdc msrvtt msvd multimodal multimodal-learning multimodality ranking retrieval retrieval-model search video-clip-retrieval video-text-retrieval

Last synced: 03 Apr 2025

https://github.com/huaizhengzhang/awsome-deep-learning-for-video-analysis

Papers, code and datasets about deep learning and multi-modal learning for video analysis

deep-learning machine-learning multimodal-learning paper video-analysis video-classification video-dataset

Last synced: 28 Jan 2026

https://github.com/declare-lab/multimodal-deep-learning

This repository contains various models targeting multimodal representation learning and multimodal fusion for downstream tasks such as multimodal sentiment analysis.

multimodal-deep-learning multimodal-interactions multimodal-learning multimodal-sentiment-analysis

Last synced: 25 Jan 2026

https://github.com/HuaizhengZhang/Awsome-Deep-Learning-for-Video-Analysis

Papers, code and datasets about deep learning and multi-modal learning for video analysis

deep-learning machine-learning multimodal-learning paper video-analysis video-classification video-dataset

Last synced: 20 Mar 2025

https://github.com/georgian-io/multimodal-toolkit

Multimodal model for text and tabular data with HuggingFace transformers as building block for text data

huggingface-transformers multimodal-learning natural-language-processing tabular-data transformer

Last synced: 04 Apr 2025

https://github.com/georgian-io/Multimodal-Toolkit

Multimodal model for text and tabular data with HuggingFace transformers as building block for text data

huggingface-transformers multimodal-learning natural-language-processing tabular-data transformer

Last synced: 04 Apr 2025

https://github.com/subho406/OmniNet

Official PyTorch implementation of "OmniNet: A unified architecture for multi-modal multi-task learning" | Authors: Subhojeet Pramanik, Priyanka Agrawal, Aman Hussain

artificial-intelligence deep-learning image-captioning machine-learning multimodal-learning multitask-learning neural-network nlp transformer video-recognition

Last synced: 19 Jul 2025

https://github.com/pykale/pykale

Knowledge-Aware machine LEarning (KALE): accessible machine learning from multiple sources for interdisciplinary research, part of the 🔥PyTorch ecosystem. ⭐ Star to support our work!

computer-vision data-science deep-learning domain-adaptation graph-analysis knowledge-aware-learning machine-learning medical-image-analysis meta-learning multimodal multimodal-learning python pytorch transfer-learning

Last synced: 10 May 2026

https://github.com/DmitryRyumin/ICASSP-2023-24-Papers

ICASSP 2023-2024 Papers: A complete collection of influential and exciting research papers from the ICASSP 2023-24 conferences. Explore the latest advancements in acoustics, speech and signal processing. Code included. Star the repository to support the advancement of audio and signal processing!

asr denoising domain-adaptation face-recognition generative-models icassp icassp2023 icassp2024 image-generation keyword-spotting language-modeling multimodal-learning music-generation self-supervised-learning semantic-segmentation signal-processing signal-restoration speech-recognition spoken-language-understanding vad

Last synced: 14 Jul 2025

https://github.com/dmitryryumin/icassp-2023-24-papers

ICASSP 2023-2024 Papers: A complete collection of influential and exciting research papers from the ICASSP 2023-24 conferences. Explore the latest advancements in acoustics, speech and signal processing. Code included. Star the repository to support the advancement of audio and signal processing!

asr denoising domain-adaptation face-recognition generative-models icassp icassp2023 icassp2024 image-generation keyword-spotting language-modeling multimodal-learning music-generation self-supervised-learning semantic-segmentation signal-processing signal-restoration speech-recognition spoken-language-understanding vad

Last synced: 08 Apr 2025

https://github.com/pointcept/gpt4point

[CVPR'24 Highlight] GPT4Point: A Unified Framework for Point-Language Understanding and Generation.

3d-generation llm multimodal-learning

Last synced: 05 Apr 2025

https://github.com/kyegomez/cm3leon

An open-source implementation of "Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning", an all-new multi-modal AI that uses just a decoder to generate both text and images

attention attention-is-all-you-need dalle imagegeneration multimodal multimodal-learning multimodality

Last synced: 06 Apr 2025

https://github.com/HUANGLIZI/LViT

[IEEE Transactions on Medical Imaging/TMI] This repo is the official implementation of "LViT: Language meets Vision Transformer in Medical Image Segmentation"

medical-image-analysis multimodal-learning pytorch segmentation vision-language

Last synced: 21 Jul 2025

https://github.com/huanglizi/lvit

[IEEE Transactions on Medical Imaging/TMI] This repo is the official implementation of "LViT: Language meets Vision Transformer in Medical Image Segmentation"

medical-image-analysis multimodal-learning pytorch segmentation vision-language

Last synced: 16 May 2025

https://github.com/mmaaz60/mvits_for_class_agnostic_od

[ECCV'22] Official repository of paper titled "Class-agnostic Object Detection with Multi-modal Transformer".

class-agnostic-detection multimodal-learning object-detection open-world-detection psuedo-labels pytorch

Last synced: 19 Oct 2025

https://github.com/Pointcept/GPT4Point

[CVPR'24 Highlight] GPT4Point: A Unified Framework for Point-Language Understanding and Generation.

3d-generation llm multimodal-learning

Last synced: 20 Mar 2025

https://github.com/kyegomez/navit

My implementation of "Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution"

attention-mechanism clip gpt4 multimodal multimodal-deep-learning multimodal-learning multimodality vit

Last synced: 16 May 2025

https://github.com/merveenoyan/siglip

Projects based on SigLIP (Zhai et al., 2023) and Hugging Face transformers integration 🤗

computer-vision machine-learning multimodal-learning siglip

Last synced: 04 Apr 2025

https://github.com/snap-research/mmvid

[CVPR 2022] Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning

bert deep-learning multimodal-learning multimodal-video-generation text-to-video transformer video-generation

Last synced: 02 Sep 2025

https://github.com/tencentarc/vit-lens

[CVPR 2024] ViT-Lens: Towards Omni-modal Representations

multimodal-learning

Last synced: 04 Apr 2025

https://github.com/miccunifi/SEARLE

[ICCV 2023] - Zero-shot Composed Image Retrieval with Textual Inversion

circo cirr clip composed-image-retrieval fashion-iq knowledge-distillation multimodal-learning pytorch textual-inversion

Last synced: 03 Apr 2025

https://github.com/mhw32/multimodal-vae-public

A PyTorch implementation of "Multimodal Generative Models for Scalable Weakly-Supervised Learning" (https://arxiv.org/abs/1802.05335)

generative-models machine-learning multimodal-learning variational-autoencoder

Last synced: 14 Apr 2025

https://github.com/ofa-sys/ofasys

OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist Models

audio computer-vision deep-learning motion multimodal-learning multitask-learning nlp pretrained-models pytorch transformers vision-and-language

Last synced: 24 Oct 2025

https://github.com/kyegomez/pali3

Implementation of PALI3 from the paper "PALI-3 VISION LANGUAGE MODELS: SMALLER, FASTER, STRONGER"

artificial-intelligence autogpt gpt4 machine-learning multimodal multimodal-deep-learning multimodal-learning multimodality

Last synced: 13 Apr 2025

https://github.com/zjunlp/hvpnet

[NAACL 2022 Findings] Good Visual Guidance Makes A Better Extractor: Hierarchical Visual Prefix for Multimodal Entity and Relation Extraction

bert dataset entity-extraction hvpnet information-extraction kg multimodal multimodal-knowledge-graph multimodal-learning naacl ner prefix pytorch re relation-extraction

Last synced: 12 Sep 2025

https://github.com/pliang279/mfn

[AAAI 2018] Memory Fusion Network for Multi-view Sequential Learning

machine-learning multimodal-learning

Last synced: 05 May 2025

https://github.com/pliang279/multiviz

[ICLR 2023] MultiViz: Towards Visualizing and Understanding Multimodal Models

computer-vision machine-learning multimodal-learning natural-language-processing

Last synced: 12 Apr 2025

https://github.com/declare-lab/llm-puzzletest

This repository is maintained to release dataset and models for multimodal puzzle reasoning.

gemini gemini-pro gpt-4 language-model large-language-models llm llms multimodal-deep-learning multimodal-learning

Last synced: 22 Sep 2025

https://github.com/johnarevalo/gmu-mmimdb

Source code for training Gated Multimodal Units on MM-IMDb dataset

multimodal-learning representation-learning

Last synced: 04 Sep 2025

https://github.com/pliang279/factorized

[ICLR 2019] Learning Factorized Multimodal Representations

machine-learning multimodal-learning representation-learning

Last synced: 05 May 2025

https://github.com/tiger-ai-lab/quickvideo

Quick Long Video Understanding

llm multimodal multimodal-learning video

Last synced: 13 Jun 2025

https://github.com/praveena2j/jointcrossattentional-av-fusion

ABAW3 (CVPRW): A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition

affective-computing attention-model audio-visual-learning emotion emotion-recognition multimodal-learning

Last synced: 30 Jul 2025

https://github.com/kyegomez/autort

Implementation of AutoRT: "AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents"

ai artificial-intelligence attention-is-all-you-need attention-mechanism gpt4 machine-learning ml multi-modal multimodal-learning robotics robots ros swarm swarms

Last synced: 09 Apr 2025

https://github.com/praveena2j/joint-cross-attention-for-audio-visual-fusion

IEEE T-BIOM : "Audio-Visual Fusion for Emotion Recognition in the Valence-Arousal Space Using Joint Cross-Attention"

affective-computing attention attention-model audio-visual-learning emotion-recognition multimodal-learning

Last synced: 12 Apr 2025

https://github.com/georgepar/slp

Utils and modules for Speech, Language and Multimodal processing using PyTorch and PyTorch Lightning

multimodal multimodal-deep-learning multimodal-learning natural-language-processing pytorch pytorch-lightning wandb

Last synced: 21 Sep 2025

https://github.com/cocoxili/cmpc

[IJCAI2022] Unsupervised Voice-Face Representation Learning by Cross-Modal Prototype Contrast

biometric-matching crossmodal-retrieval deep-learning multimodal-learning representation-learning voice-face-association voxceleb

Last synced: 14 Jan 2026

https://github.com/praveena2j/rjcma

ABAW6 (CVPR-W): We achieved second place in the valence-arousal challenge of ABAW6

affective-computing arousal-valence attention-model emotion emotion-recognition multimodal-learning

Last synced: 12 Apr 2025

https://github.com/3dlg-hcvc/tricolo

[WACV 2024] TriCoLo: Trimodal Contrastive Loss for Text to Shape Retrieval

3d computer-vision multimodal-learning natual-language-processing pytorch pytorch-lightning

Last synced: 05 Apr 2025

https://github.com/breezedeus/coin-clip

Coin-CLIP: a CLIP model fine-tuned on a vast collection of coin images using contrastive learning. It enhances feature extraction for coins, boosting image search accuracy. This model merges Vision Transformer (ViT) with CLIP's multimodal learning, optimized for numismatic applications.

clip coin coin-identification coin-recognition coin-retrieval multimodal-learning numismatics

Last synced: 05 Mar 2026

https://github.com/aws-samples/sample-multimodal-agent-tutorial

Build production-ready AI agents with the Strands Agents SDK and AWS services. This repository demonstrates how you can create multi-modal systems with persistent memory in minimal code. Progress from your first agent to production-ready systems through hands-on chapters.

aws bedrock-agentcore generative-ai multimodal-learning s3-storage strands-agents

Last synced: 18 Jan 2026

https://github.com/kyegomez/kosmosg

My implementation of the model KosmosG from "KOSMOS-G: Generating Images in Context with Multimodal Large Language Models"

attention-is-all-you-need attention-mechanism attention-mechanisms computer-vision multimodal multimodal-learning

Last synced: 07 May 2025

https://github.com/aehrc/cxrmate

CXRMate: Longitudinal Data and a Semantic Similarity Reward for Chest X-Ray Report Generation

chest-x-ray-report-generation chest-xray-images image-captioning medical-imaging multimodal-learning radiology-reports

Last synced: 12 Apr 2025

https://github.com/mbzuai-oryx/camel-bench

CAMEL-Bench is an Arabic benchmark for evaluating multimodal models across eight domains with 29,000 questions.

arabic benchmark large-multimodal-models mbzuai multimodal-learning visual-question-answering vqa

Last synced: 01 May 2025

https://github.com/alipay/pc2-noiseofweb

Noise of Web (NoW) is a challenging noisy correspondence learning (NCL) benchmark containing 100K image-text pairs for robust image-text matching/retrieval models.

acmmm acmmm2024 benchmark captioning-images cross-modal-retrieval dataset image-text-matching image-text-retrieval multimodal-learning noisy-correspondence

Last synced: 25 Apr 2025

https://github.com/praveena2j/recurrentjointattentionwithlstms

ICASSP 2023: "Recursive Joint Attention for Audio-Visual Fusion in Regression Based Emotion Recognition"

affective-computing attention-model audio-visual-learning emotion-recognition multimodal-learning

Last synced: 12 Apr 2025

https://github.com/buaadreamer/spn4cir

[ACM MM 2024] Improving Composed Image Retrieval via Contrastive Learning with Scaling Positives and Negatives

acmmm2024 blip blip2 clip composed-image-retrieval cross-modal-retrieval data-generation image-retrieval llama llava memory-bank multi-modal-retrieval multimodal-learning transformer

Last synced: 14 Mar 2026

https://github.com/juselara1/mlsa

TensorFlow 2.0 implementation of the M-LSA method.

deep-learning histopathology-images kernel-methods multimodal-learning

Last synced: 04 Sep 2025

https://github.com/kyegomez/convnet

Implementation of the NFNets from the paper: "ConvNets Match Vision Transformers at Scale" by Google Research

ai convolutional-layers convolutional-neural-networks deeplearning machine-learning ml multimodal-learning multimodality

Last synced: 09 Oct 2025

https://github.com/ihp-lab/mm_analysis_empathy

[ICMI 23] Multimodal Analysis and Assessment of Therapist Empathy in Motivational Interviews

motivational-interviewing multimodal-learning therapist-empathy

Last synced: 13 Apr 2025

https://github.com/ichalkiad/cryptogpcausality

This repository contains the code for the paper "Sentiment-driven statistical causality in multimodal systems", by Ioannis Chalkiadakis, Anna Zaremba, Gareth W. Peters and Michael J. Chantler.

causal-inference causality causality-analysis cryptocurrency cryptocurrency-exchanges gaussian-processes multimodal-learning multimodal-sentiment-analysis natural-language-procressing public-news sentiment-analysis statistical-models text-mining

Last synced: 27 Aug 2025

https://github.com/waybarrios/guidance-based-video-grounding

The official PyTorch implementation of the paper: "Localizing Moments in Long Video Via Multimodal Guidance"

moment-retrieval multimodal-learning pytorch video-language

Last synced: 23 Jan 2026

https://github.com/praveena2j/rjcaforspeakerverification

[FG 2024] "Audio-Visual Person Verification based on Recursive Fusion of Joint Cross-Attention"

attention attention-model audio-visual-learning multimodal-learning speaker-verification

Last synced: 24 Sep 2025

https://github.com/wxjiao/multimodal-feature-extraction

A detailed description of how to extract and align text, audio, and video features at the word level.

multimodal-learning multimodal-representation

Last synced: 22 Jul 2025

https://github.com/hanksoong/charisma-predictor

Multimodal AI pipeline to predict Big Five personality traits and assess charismatic leadership using audio, text, and video inputs.

audio-processing big-five charismatic-leadership computer-vision deep-learning facial-landmarks fusion-models mediapipe multimodal-learning nlp personality-prediction pytorch transformer

Last synced: 30 Jun 2025

https://github.com/zihao-jing/mumo

Official repo of the paper "Multimodal Molecular Representation Learning via Structural Fusion and Progressive Injection"

ai deep-learning foundation-models molecular-modeling molecular-property-prediction multimodal-learning representation-learning

Last synced: 04 Mar 2026

https://github.com/aehrc/imageclefmedical_caption_23

MedICap: Code for the participation of team CSIRO at the ImageCLEFmedical Caption task of 2023.

image-captioning medical-image-captioning medical-imaging multimodal multimodal-learning report-generation

Last synced: 31 Jan 2026

https://github.com/bryanbocao/vitag

Repository of the paper ViTag in SECON 2022 and demo (Best Demo Award).

deep-learning multimodal multimodal-association multimodal-learning

Last synced: 05 Apr 2025

https://github.com/lucaswychan/quant-lvlm

Easy-to-use large vision language model pipeline for quantitative analysis

large-vision-language-model multimodal-learning pytorch quantitative-finance

Last synced: 24 Feb 2026

https://github.com/rishikesh-jadhav/cmsc828i-advanced-techniques-in-visual-learning-recognition

This repository contains my academic work for the Fall 2023 CMSC828I course. It includes assignments, projects, and relevant documentation covering various aspects of computer vision and recognition.

computer-vision deep-learning generative-model image-segmentation implicit-neural-representation machine-learning multimodal-learning object-detection self-supervised-learning superpixel-segmentation

Last synced: 18 Jun 2025

https://github.com/weiyx16/learning_ml

HW for CS229 Machine Learning

cs229 machine-learning multimodal-learning

Last synced: 15 Mar 2025

https://liyuantsao.github.io/HoliSDiP/

The official repository of "HoliSDiP: Image Super-Resolution via Holistic Semantics and Diffusion Prior"

diffusion-models diffusion-prior generative-ai multimodal-learning real-world-super-resolution super-resolution text-to-image

Last synced: 13 Oct 2025

https://github.com/praveena2j/dynamic-crossattention

IEEE ICME : "Cross-Attention is not always needed: Dynamic Cross-Attention for Audio-Visual Dimensional Emotion Recognition"

affective-computing ai attention attention-mechanism attention-model audio-visual-learning computer-vision cross-attention emotion-recognition multimodal-learning

Last synced: 16 Oct 2025

https://github.com/qkrwoghd04/binary_classification_using_bert-vit

This project aims to develop a multimodal deep learning model for fall detection (sleep or fall)

bert image-classification late-fusion multimodal-learning text-classification vision-transformer

Last synced: 26 Jul 2025

https://github.com/harisbinzia/hatefulmemes

Racist or Sexist Meme? Classifying Memes beyond Hateful

hateful-memes multimodal-learning

Last synced: 19 Mar 2026

https://github.com/iboudhaine/deep-fake-audio-detection

A deepfake audio detection system using the ASVspoof 2019 dataset, combining acoustic and text features with a custom DistilBERT model to classify audio as real or fake.

asvspoof audio-processing deepfake-detection distilbert machine-learning multimodal-learning python pytorch signal-processing whisper

Last synced: 07 May 2026