Projects in Awesome Lists tagged with multimodal-learning
A curated list of projects in awesome lists tagged with multimodal-learning.
https://github.com/mlfoundations/open_flamingo
An open-source framework for training large multimodal models.
computer-vision deep-learning flamingo in-context-learning language-model multimodal-learning pytorch
Last synced: 09 Apr 2025
https://github.com/kaiyangzhou/coop
Prompt Learning for Vision-Language Models (IJCV'22, CVPR'22)
foundation-models multimodal-learning prompt-learning
Last synced: 15 May 2025
https://github.com/KaiyangZhou/CoOp
Prompt Learning for Vision-Language Models (IJCV'22, CVPR'22)
foundation-models multimodal-learning prompt-learning
Last synced: 16 Mar 2025
https://github.com/ailab-cvc/unireplknet
[CVPR'24] UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition
architecture artificial-intelligence convolutional-neural-networks deep-learning multimodal-learning
Last synced: 15 May 2025
https://github.com/dmitryryumin/iccv-2023-papers
ICCV 2023 Papers: Discover cutting-edge research from ICCV 2023, the leading computer vision conference. Stay updated on the latest in computer vision and deep learning, with code included. ⭐ support visual intelligence development!
3d-graphics 3d-reconstruction biometrics computer-vision datasets deep-learning explainable-ai face-recognition gesture-recognition iccv iccv2023 image-processing image-synthesis multimodal-learning pattern-recognition photogrammetry pose-estimation robotics transfer-learning video-synthesis
Last synced: 16 May 2025
https://github.com/AILab-CVC/UniRepLKNet
[CVPR'24] UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition
architecture artificial-intelligence convolutional-neural-networks deep-learning multimodal-learning
Last synced: 20 Mar 2025
https://github.com/PreferredAI/cornac
A Comparative Framework for Multimodal Recommender Systems
collaborative-filtering matrix-factorization multimodal-learning multimodality recommendation-algorithms recommendation-engine recommendation-system recommender-system
Last synced: 18 Jul 2025
https://github.com/ArrowLuo/CLIP4Clip
An official implementation for "CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval"
activitynet clip didemo lsmdc msrvtt msvd multimodal multimodal-learning multimodality ranking retrieval retrieval-model search video-clip-retrieval video-text-retrieval
Last synced: 03 Apr 2025
https://github.com/huaizhengzhang/awsome-deep-learning-for-video-analysis
Papers, code and datasets about deep learning and multi-modal learning for video analysis
deep-learning machine-learning multimodal-learning paper video-analysis video-classification video-dataset
Last synced: 28 Jan 2026
https://github.com/declare-lab/multimodal-deep-learning
This repository contains various models targeting multimodal representation learning and multimodal fusion for downstream tasks such as multimodal sentiment analysis.
multimodal-deep-learning multimodal-interactions multimodal-learning multimodal-sentiment-analysis
Last synced: 25 Jan 2026
https://github.com/HuaizhengZhang/Awsome-Deep-Learning-for-Video-Analysis
Papers, code and datasets about deep learning and multi-modal learning for video analysis
deep-learning machine-learning multimodal-learning paper video-analysis video-classification video-dataset
Last synced: 20 Mar 2025
https://github.com/henghuiding/rela
[CVPR2023 Highlight] GRES: Generalized Referring Expression Segmentation
cvpr2023 multimodal-learning referring-expression-comprehension referring-expression-segmentation referring-image-segmentation vision-language-transformer
Last synced: 04 Apr 2025
https://github.com/georgian-io/multimodal-toolkit
Multimodal model for text and tabular data with HuggingFace transformers as building block for text data
huggingface-transformers multimodal-learning natural-language-processing tabular-data transformer
Last synced: 04 Apr 2025
https://github.com/georgian-io/Multimodal-Toolkit
Multimodal model for text and tabular data with HuggingFace transformers as building block for text data
huggingface-transformers multimodal-learning natural-language-processing tabular-data transformer
Last synced: 04 Apr 2025
https://github.com/henghuiding/mevis
[ICCV 2023] MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions
mevis-dataset mose-dataset multimodal-learning referring-expression-comprehension referring-expression-segmentation referring-video-object-segmentation video-understanding
Last synced: 05 Apr 2025
https://github.com/pliang279/MultiBench
[NeurIPS 2021] Multiscale Benchmarks for Multimodal Representation Learning
computer-vision deep-learning healthcare machine-learning multimodal-learning natural-language-processing representation-learning robotics speech-processing
Last synced: 08 May 2025
https://github.com/subho406/OmniNet
Official Pytorch implementation of "OmniNet: A unified architecture for multi-modal multi-task learning" | Authors: Subhojeet Pramanik, Priyanka Agrawal, Aman Hussain
artificial-intelligence deep-learning image-captioning machine-learning multimodal-learning multitask-learning neural-network nlp transformer video-recognition
Last synced: 19 Jul 2025
https://github.com/pliang279/multibench
[NeurIPS 2021] Multiscale Benchmarks for Multimodal Representation Learning
computer-vision deep-learning healthcare machine-learning multimodal-learning natural-language-processing representation-learning robotics speech-processing
Last synced: 02 Mar 2025
https://github.com/microsoft/xpretrain
Multi-modality pre-training
computer-vision multimedia multimodal-learning nlp pre-training
Last synced: 04 Apr 2025
https://github.com/microsoft/XPretrain
Multi-modality pre-training
computer-vision multimedia multimodal-learning nlp pre-training
Last synced: 03 Apr 2025
https://henghuiding.com/MeViS/
[ICCV 2023] MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions
mevis-dataset mose-dataset multimodal-learning referring-expression-comprehension referring-expression-segmentation referring-video-object-segmentation video-understanding
Last synced: 18 Jul 2025
https://github.com/pykale/pykale
Knowledge-Aware machine LEarning (KALE): accessible machine learning from multiple sources for interdisciplinary research, part of the 🔥PyTorch ecosystem. ⭐ Star to support our work!
computer-vision data-science deep-learning domain-adaptation graph-analysis knowledge-aware-learning machine-learning medical-image-analysis meta-learning multimodal multimodal-learning python pytorch transfer-learning
Last synced: 10 May 2026
https://github.com/MMMU-Benchmark/MMMU
This repo contains evaluation code for the paper "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI"
computer-vision deep-learning deep-neural-networks evaluation foundation-models large-language-models large-multimodal-models llm llms machine-learning multimodal multimodal-deep-learning multimodal-learning multimodality natural-language-processing question-answering stem visual-question-answering
Last synced: 17 Apr 2025
https://github.com/DmitryRyumin/ICASSP-2023-24-Papers
ICASSP 2023-2024 Papers: A complete collection of influential and exciting research papers from the ICASSP 2023-24 conferences. Explore the latest advancements in acoustics, speech and signal processing. Code included. Star the repository to support the advancement of audio and signal processing!
asr denoising domain-adaptation face-recognition generative-models icassp icassp2023 icassp2024 image-generation keyword-spotting language-modeling multimodal-learning music-generation self-supervised-learning semantic-segmentation signal-processing signal-restoration speech-recognition spoken-language-understanding vad
Last synced: 14 Jul 2025
https://github.com/dmitryryumin/icassp-2023-24-papers
ICASSP 2023-2024 Papers: A complete collection of influential and exciting research papers from the ICASSP 2023-24 conferences. Explore the latest advancements in acoustics, speech and signal processing. Code included. Star the repository to support the advancement of audio and signal processing!
asr denoising domain-adaptation face-recognition generative-models icassp icassp2023 icassp2024 image-generation keyword-spotting language-modeling multimodal-learning music-generation self-supervised-learning semantic-segmentation signal-processing signal-restoration speech-recognition spoken-language-understanding vad
Last synced: 08 Apr 2025
https://github.com/pointcept/gpt4point
[CVPR'24 Highlight] GPT4Point: A Unified Framework for Point-Language Understanding and Generation.
3d-generation llm multimodal-learning
Last synced: 05 Apr 2025
https://github.com/kyegomez/cm3leon
An open-source implementation of "Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning", an all-new multi-modal AI that uses just a decoder to generate both text and images
attention attention-is-all-you-need dalle imagegeneration multimodal multimodal-learning multimodality
Last synced: 06 Apr 2025
https://github.com/HUANGLIZI/LViT
[IEEE Transactions on Medical Imaging/TMI] This repo is the official implementation of "LViT: Language meets Vision Transformer in Medical Image Segmentation"
medical-image-analysis multimodal-learning pytorch segmentation vision-language
Last synced: 21 Jul 2025
https://github.com/huanglizi/lvit
[IEEE Transactions on Medical Imaging/TMI] This repo is the official implementation of "LViT: Language meets Vision Transformer in Medical Image Segmentation"
medical-image-analysis multimodal-learning pytorch segmentation vision-language
Last synced: 16 May 2025
https://github.com/mmaaz60/mvits_for_class_agnostic_od
[ECCV'22] Official repository of paper titled "Class-agnostic Object Detection with Multi-modal Transformer".
class-agnostic-detection multimodal-learning object-detection open-world-detection psuedo-labels pytorch
Last synced: 19 Oct 2025
https://github.com/Pointcept/GPT4Point
[CVPR'24 Highlight] GPT4Point: A Unified Framework for Point-Language Understanding and Generation.
3d-generation llm multimodal-learning
Last synced: 20 Mar 2025
https://github.com/kyegomez/navit
My implementation of "Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution"
attention-mechanism clip gpt4 multimodal multimodal-deep-learning multimodal-learning multimodality vit
Last synced: 16 May 2025
https://github.com/merveenoyan/siglip
Projects based on SigLIP (Zhai et al., 2023) and Hugging Face transformers integration 🤗
computer-vision machine-learning multimodal-learning siglip
Last synced: 04 Apr 2025
https://github.com/snap-research/mmvid
[CVPR 2022] Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning
bert deep-learning multimodal-learning multimodal-video-generation text-to-video transformer video-generation
Last synced: 02 Sep 2025
https://github.com/tencentarc/vit-lens
[CVPR 2024] ViT-Lens: Towards Omni-modal Representations
Last synced: 04 Apr 2025
https://github.com/miccunifi/SEARLE
[ICCV 2023] - Zero-shot Composed Image Retrieval with Textual Inversion
circo cirr clip composed-image-retrieval fashion-iq knowledge-distillation multimodal-learning pytorch textual-inversion
Last synced: 03 Apr 2025
https://github.com/mhw32/multimodal-vae-public
A PyTorch implementation of "Multimodal Generative Models for Scalable Weakly-Supervised Learning" (https://arxiv.org/abs/1802.05335)
generative-models machine-learning multimodal-learning variational-autoencoder
Last synced: 14 Apr 2025
https://github.com/ofa-sys/ofasys
OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist Models
audio computer-vision deep-learning motion multimodal-learning multitask-learning nlp pretrained-models pytorch transformers vision-and-language
Last synced: 24 Oct 2025
https://github.com/kyegomez/pali3
Implementation of PALI3 from the paper "PALI-3 VISION LANGUAGE MODELS: SMALLER, FASTER, STRONGER"
artificial-intelligence autogpt gpt4 machine-learning multimodal multimodal-deep-learning multimodal-learning multimodality
Last synced: 13 Apr 2025
https://github.com/alipay/ant-multi-modal-framework
Research Code for Multimodal-Cognition Team in Ant Group
image-text-retrieval multimodal-learning multimodal-llm video-editing video-text-retrieval
Last synced: 10 Sep 2025
https://github.com/machine-intelligence-laboratory/TopicNet
Interface for easier topic modelling.
bigartm-library custom-score document-representation modalities multimodal-data multimodal-learning pypi topic-modeling topic-modelling
Last synced: 03 May 2025
https://github.com/galilai-group/stable-pretraining
Reliable, minimal and scalable library for pretraining foundation and world models
computer-vision computer-vision-algorithms contrastive-learning deep-learning distributed foundation-models joint-embedding joint-embedding-predictive-architecture large-language-model multimodal-learning pytorch self-supervised-learning stable-pretraining transformers
Last synced: 17 Feb 2026
https://github.com/usc-sail/fed-multimodal
[KDD 2023] FedMultimodal
action-recognition emotion-recognition federated-learning multimodal-learning
Last synced: 18 Jan 2026
https://github.com/zjunlp/hvpnet
[NAACL 2022 Findings] Good Visual Guidance Makes A Better Extractor: Hierarchical Visual Prefix for Multimodal Entity and Relation Extraction
bert dataset entity-extraction hvpnet information-extraction kg multimodal multimodal-knowledge-graph multimodal-learning naacl ner prefix pytorch re relation-extraction
Last synced: 12 Sep 2025
https://github.com/pliang279/mfn
[AAAI 2018] Memory Fusion Network for Multi-view Sequential Learning
machine-learning multimodal-learning
Last synced: 05 May 2025
https://github.com/pliang279/multiviz
[ICLR 2023] MultiViz: Towards Visualizing and Understanding Multimodal Models
computer-vision machine-learning multimodal-learning natural-language-processing
Last synced: 12 Apr 2025
https://github.com/declare-lab/llm-puzzletest
This repository is maintained to release dataset and models for multimodal puzzle reasoning.
gemini gemini-pro gpt-4 language-model large-language-models llm llms multimodal-deep-learning multimodal-learning
Last synced: 22 Sep 2025
https://github.com/kevalmorabia97/cova-web-object-detection
A Context-aware Visual Attention-based training pipeline for Object Detection from a Webpage screenshot!
attention computer-vision convolutional-neural-networks deep-learning graph-attention-networks graph-convolutional-networks information-extraction multimodal-learning object-detection pytorch visual-attention
Last synced: 07 Apr 2025
https://github.com/wxl1999/plmpapers
A paper list of pre-trained language models (PLMs).
bert deep-learning gpt machine-learning multilingual multimodal-learning natural-language-processing pre-training representation-learning
Last synced: 18 Feb 2026
https://github.com/johnarevalo/gmu-mmimdb
Source code for training Gated Multimodal Units on MM-IMDb dataset
multimodal-learning representation-learning
Last synced: 04 Sep 2025
https://github.com/pliang279/factorized
[ICLR 2019] Learning Factorized Multimodal Representations
machine-learning multimodal-learning representation-learning
Last synced: 05 May 2025
https://github.com/laion-ai/general-gpt
computer-vision large-language-models multimodal-learning nlp
Last synced: 03 Oct 2025
https://github.com/tiger-ai-lab/quickvideo
Quick Long Video Understanding
llm multimodal multimodal-learning video
Last synced: 13 Jun 2025
https://github.com/praveena2j/jointcrossattentional-av-fusion
ABAW3 (CVPRW): A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition
affective-computing attention-model audio-visual-learning emotion emotion-recognition multimodal-learning
Last synced: 30 Jul 2025
https://github.com/kyegomez/autort
Implementation of AutoRT: "AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents"
ai artificial-intelligence attention-is-all-you-need attention-mechanism gpt4 machine-learning ml multi-modal multimodal-learning robotics robots ros swarm swarms
Last synced: 09 Apr 2025
https://github.com/praveena2j/joint-cross-attention-for-audio-visual-fusion
IEEE T-BIOM: "Audio-Visual Fusion for Emotion Recognition in the Valence-Arousal Space Using Joint Cross-Attention"
affective-computing attention attention-model audio-visual-learning emotion-recognition multimodal-learning
Last synced: 12 Apr 2025
https://github.com/ai4ce/EgoPAT3D
[CVPR 2022] Egocentric Action Target Prediction in 3D
3d-computer-vision computer-vision dataset deep-learning egocentric-vision human-robot-collaboration human-robot-interaction machine-learning multimodal-learning target-prediction wearable-robotics
Last synced: 20 Mar 2025
https://github.com/ai4ce/egopat3d
[CVPR 2022] Egocentric Action Target Prediction in 3D
3d-computer-vision computer-vision dataset deep-learning egocentric-vision human-robot-collaboration human-robot-interaction machine-learning multimodal-learning target-prediction wearable-robotics
Last synced: 11 Apr 2025
https://github.com/praveena2j/cross-attentional-av-fusion
FG2021: Cross Attentional AV Fusion for Dimensional Emotion Recognition
affective-computing attention-model audio-visual-learning emotion emotion-recognition multimodal-learning
Last synced: 12 Apr 2025
https://github.com/georgepar/slp
Utils and modules for Speech, Language and Multimodal processing using PyTorch and PyTorch Lightning
multimodal multimodal-deep-learning multimodal-learning natural-language-processing pytorch pytorch-lightning wandb
Last synced: 21 Sep 2025
https://github.com/cocoxili/cmpc
[IJCAI2022] Unsupervised Voice-Face Representation Learning by Cross-Modal Prototype Contrast
biometric-matching crossmodal-retrieval deep-learning multimodal-learning representation-learning voice-face-association voxceleb
Last synced: 14 Jan 2026
https://github.com/praveena2j/rjcma
ABAW6 (CVPR-W): We achieved second place in the valence-arousal challenge of ABAW6
affective-computing arousal-valence attention-model emotion emotion-recognition multimodal-learning
Last synced: 12 Apr 2025
https://github.com/asnelt/mmae
Package for Multimodal Autoencoders in TensorFlow / Keras
autoencoder autoencoders bregman-distance deep-learning keras keras-models keras-tensorflow multimodal-deep-learning multimodal-learning tensorflow
Last synced: 26 Oct 2025
https://github.com/3dlg-hcvc/tricolo
[WACV 2024] TriCoLo: Trimodal Contrastive Loss for Text to Shape Retrieval
3d computer-vision multimodal-learning natual-language-processing pytorch pytorch-lightning
Last synced: 05 Apr 2025
https://github.com/breezedeus/coin-clip
Coin-CLIP: a CLIP model fine-tuned on a vast collection of coin images using contrastive learning. It enhances feature extraction for coins, boosting image search accuracy. This model merges Vision Transformer (ViT) with CLIP's multimodal learning, optimized for numismatic applications.
clip coin coin-identification coin-recognition coin-retrieval multimodal-learning numismatics
Last synced: 05 Mar 2026
https://github.com/aws-samples/sample-multimodal-agent-tutorial
Build production-ready AI agents with the Strands Agents SDK and AWS services. This repository demonstrates how you can create multi-modal systems with persistent memory in minimal code. Progress from your first agent to production-ready systems through hands-on chapters.
aws bedrock-agentcore generative-ai multimodal-learning s3-storage strands-agents
Last synced: 18 Jan 2026
https://github.com/kyegomez/kosmosg
My implementation of the model KosmosG from "KOSMOS-G: Generating Images in Context with Multimodal Large Language Models"
attention-is-all-you-need attention-mechanism attention-mechanisms computer-vision multimodal multimodal-learning
Last synced: 07 May 2025
https://github.com/aehrc/cxrmate
CXRMate: Longitudinal Data and a Semantic Similarity Reward for Chest X-Ray Report Generation
chest-x-ray-report-generation chest-xray-images image-captioning medical-imaging multimodal-learning radiology-reports
Last synced: 12 Apr 2025
https://github.com/mbzuai-oryx/camel-bench
CAMEL-Bench is an Arabic benchmark for evaluating multimodal models across eight domains with 29,000 questions.
arabic benchmark large-multimodal-models mbzuai multimodal-learning visual-question-answering vqa
Last synced: 01 May 2025
https://github.com/alipay/pc2-noiseofweb
Noise of Web (NoW) is a challenging noisy correspondence learning (NCL) benchmark containing 100K image-text pairs for robust image-text matching/retrieval models.
acmmm acmmm2024 benchmark captioning-images cross-modal-retrieval dataset image-text-matching image-text-retrieval multimodal-learning noisy-correspondence
Last synced: 25 Apr 2025
https://github.com/praveena2j/recurrentjointattentionwithlstms
ICASSP 2023: "Recursive Joint Attention for Audio-Visual Fusion in Regression Based Emotion Recognition"
affective-computing attention-model audio-visual-learning emotion-recognition multimodal-learning
Last synced: 12 Apr 2025
https://github.com/adityalab/mm4tsa
A professional list of Multi-Modalities For Time Series Analysis (MM4TSA) Papers and Resources.
awesome awesome-list cross-modality forecasting foundation-models large-language-models multimodal multimodal-learning multimodal-time-series resources reusing survey time-series
Last synced: 26 Mar 2025
https://github.com/skblaz/autobot
An autoML for explainable text classification.
automl automl-algorithms automl-experiments classification data-mining data-science distributed-computing ensemble-learning evolutionary-algorithms machine-learning multimodal-learning natural-language-processing nlp python representation-learning sparse-matrices text-classification transfer-learning transformers-models
Last synced: 18 Aug 2025
https://github.com/buaadreamer/spn4cir
[ACM MM 2024] Improving Composed Image Retrieval via Contrastive Learning with Scaling Positives and Negatives
acmmm2024 blip blip2 clip composed-image-retrieval cross-modal-retrieval data-generation image-retrieval llama llava memory-bank multi-modal-retrieval multimodal-learning transformer
Last synced: 14 Mar 2026
https://github.com/willxxy/ecg-byte
[arxiv 2024] ECG-Byte: A Tokenizer for End-to-End Generative Electrocardiogram Language Modeling
biosignals deep-learning ecg electrocardiogram generative-ai large-language-models machine-learning multimodal multimodal-deep-learning multimodal-large-language-models multimodal-learning physiological-signals signal-processing
Last synced: 29 Oct 2025
https://github.com/kyegomez/gen2
Implementation of "Text driven video generation" in pytorch
artificial-intelligence gpt4 multimodal multimodal-deep-learning multimodal-learning multimodality stablediffusion texttovideo
Last synced: 02 Jul 2025
https://github.com/juselara1/mlsa
Tensorflow 2.0 implementation of the M-LSA method.
deep-learning histopathology-images kernel-methods multimodal-learning
Last synced: 04 Sep 2025
https://github.com/kyegomez/convnet
Implementation of the NFNets from the paper: "ConvNets Match Vision Transformers at Scale" by Google Research
ai convolutional-layers convolutional-neural-networks deeplearning machine-learning ml multimodal-learning multimodality
Last synced: 09 Oct 2025
https://github.com/ihp-lab/mm_analysis_empathy
[ICMI 23] Multimodal Analysis and Assessment of Therapist Empathy in Motivational Interviews
motivational-interviewing multimodal-learning therapist-empathy
Last synced: 13 Apr 2025
https://github.com/ichalkiad/cryptogpcausality
This repository contains the code for the paper "Sentiment-driven statistical causality in multimodal systems", by Ioannis Chalkiadakis, Anna Zaremba, Gareth W. Peters and Michael J. Chantler.
causal-inference causality causality-analysis cryptocurrency cryptocurrency-exchanges gaussian-processes multimodal-learning multimodal-sentiment-analysis natural-language-procressing public-news sentiment-analysis statistical-models text-mining
Last synced: 27 Aug 2025
https://github.com/mims-harvard/bmi702
Biomedical Artificial Intelligence
artificial-intelligence biomedical-ai generative-ai medical-ai medicine multimodal-learning
Last synced: 17 Feb 2026
https://github.com/waybarrios/guidance-based-video-grounding
The official PyTorch implementation of the paper: "Localizing Moments in Long Video Via Multimodal Guidance"
moment-retrieval multimodal-learning pytorch video-language
Last synced: 23 Jan 2026
https://github.com/praveena2j/rjcaforspeakerverification
[FG 2024] "Audio-Visual Person Verification based on Recursive Fusion of Joint Cross-Attention"
attention attention-model audio-visual-learning multimodal-learning speaker-verification
Last synced: 24 Sep 2025
https://github.com/wxjiao/multimodal-feature-extraction
A detailed description of how to extract and align text, audio, and video features at the word level.
multimodal-learning multimodal-representation
Last synced: 22 Jul 2025
https://github.com/fork123aniket/multi-round-vlm-powered-multimodal-conversational-ai-navigation-bot
Streamlit App Combining Vision, Language, and Audio AI Models
conversational-agent conversational-ai conversational-bot conversational-interface generative-ai internvl internvl2 multimodal multimodal-data multimodal-deep-learning multimodal-large-language-models multimodal-learning vision-language vision-language-learning vision-language-model vision-language-models vision-language-navigation vision-language-transformer
Last synced: 19 Feb 2026
https://github.com/aclai-lab/multidata.jl
Multimodal datasets for Machine-Learning
dataframes-jl julia machine-learning multimodal-learning multimodal-machine-learning
Last synced: 20 Jun 2025
https://github.com/hanksoong/charisma-predictor
Multimodal AI pipeline to predict Big Five personality traits and assess charismatic leadership using audio, text, and video inputs.
audio-processing big-five charismatic-leadership computer-vision deep-learning facial-landmarks fusion-models mediapipe multimodal-learning nlp personality-prediction pytorch transformer
Last synced: 30 Jun 2025
https://github.com/zihao-jing/mumo
Official repo of the paper "Multimodal Molecular Representation Learning via Structural Fusion and Progressive Injection"
ai deep-learning foundation-models molecular-modeling molecular-property-prediction multimodal-learning representation-learning
Last synced: 04 Mar 2026
https://github.com/aehrc/imageclefmedical_caption_23
MedICap: Code for the participation of team CSIRO at the ImageCLEFmedical Caption task of 2023.
image-captioning medical-image-captioning medical-imaging multimodal multimodal-learning report-generation
Last synced: 31 Jan 2026
https://github.com/bryanbocao/vitag
Repository of the paper ViTag in SECON 2022 and demo (Best Demo Award).
deep-learning multimodal multimodal-association multimodal-learning
Last synced: 05 Apr 2025
https://github.com/lucaswychan/quant-lvlm
Easy-to-use large vision language model pipeline for quantitative analysis
large-vision-language-model multimodal-learning pytorch quantitative-finance
Last synced: 24 Feb 2026
https://github.com/fork123aniket/agentic-rag-story-generation-with-multimodal-genai
Multimodal Agentic GenAI Workflow – Seamlessly blends retrieval and generation for intelligent storytelling
agentic-ai agentic-rag agentic-workflow generative-ai generative-ai-model internvl2 multimodal multimodal-data multimodal-deep-learning multimodal-large-language-models multimodal-learning story-generation vision-language vision-language-learning vision-language-model vision-language-transformer
Last synced: 14 Oct 2025
https://github.com/rishikesh-jadhav/cmsc828i-advanced-techniques-in-visual-learning-recognition
This repository contains my academic work for the Fall 2023 CMSC828I course. It includes assignments, projects, and relevant documentation covering various aspects of computer vision and recognition.
computer-vision deep-learning generative-model image-segmentation implicit-neural-representation machine-learning multimodal-learning object-detection self-supervised-learning superpixel-segmentation
Last synced: 18 Jun 2025
https://github.com/weiyx16/learning_ml
HW for CS229 Machine Learning
cs229 machine-learning multimodal-learning
Last synced: 15 Mar 2025
https://liyuantsao.github.io/HoliSDiP/
The official repository of "HoliSDiP: Image Super-Resolution via Holistic Semantics and Diffusion Prior"
diffusion-models diffusion-prior generative-ai multimodal-learning real-world-super-resolution super-resolution text-to-image
Last synced: 13 Oct 2025
https://github.com/praveena2j/dynamic-crossattention
IEEE ICME: "Cross-Attention is not always needed: Dynamic Cross-Attention for Audio-Visual Dimensional Emotion Recognition"
affective-computing ai attention attention-mechanism attention-model audio-visual-learning computer-vision cross-attention emotion-recognition multimodal-learning
Last synced: 16 Oct 2025
https://github.com/ksm26/open-source-models-with-hugging-face
"Open Source Models with Hugging Face" course empowers you with the skills to leverage open-source models from the Hugging Face Hub for various tasks in NLP, audio, image, and multimodal domains.
audio audio-processing cloud-deployment cloud-deployment-model gradio gradio-python-llm hugging-face hugging-face-api hugging-face-hub hugging-face-instructor-embeddings hugging-face-transformers image image-processing multimodal multimodal-deep-learning multimodal-learning nlp nlp-machine-learning open-source-models transformers-library
Last synced: 28 Mar 2025
https://github.com/qkrwoghd04/binary_classification_using_bert-vit
This project aims to develop a multimodal deep learning model for fall detection (Sleep or Fall)
bert image-classification late-fusion multimodal-learning text-classification vision-transformer
Last synced: 26 Jul 2025
https://github.com/harisbinzia/hatefulmemes
Racist or Sexist Meme? Classifying Memes beyond Hateful
hateful-memes multimodal-learning
Last synced: 19 Mar 2026
https://github.com/iboudhaine/deep-fake-audio-detection
A deepfake audio detection system using the ASVspoof 2019 dataset, combining acoustic and text features with a custom DistilBERT model to classify audio as real or fake.
asvspoof audio-processing deepfake-detection distilbert machine-learning multimodal-learning python pytorch signal-processing whisper
Last synced: 07 May 2026