Projects in Awesome Lists tagged with multi-modality
A curated list of projects in awesome lists tagged with multi-modality.
https://github.com/haotian-liu/llava
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
chatbot chatgpt foundation-models gpt-4 instruction-tuning llama llama-2 llama2 llava multi-modality multimodal vision-language-model visual-language-learning
Last synced: 17 Nov 2025
https://github.com/haotian-liu/LLaVA
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
chatbot chatgpt foundation-models gpt-4 instruction-tuning llama llama-2 llama2 llava multi-modality multimodal vision-language-model visual-language-learning
Last synced: 14 Mar 2025
https://github.com/jina-ai/clip-as-service
🏄 Scalable embedding, reasoning, ranking for images and sentences with CLIP
bert bert-as-service clip-as-service clip-model cross-modal-retrieval cross-modality deep-learning image2vec multi-modality neural-search onnx openai pytorch sentence-encoding sentence2vec
Last synced: 08 May 2025
https://github.com/kyegomez/swarms
The Enterprise-Grade Production-Ready Multi-Agent Orchestration Framework. Website: https://swarms.ai
agents ai artificial-intelligence attention-mechanism chatgpt gpt4 gpt4all huggingface langchain langchain-python machine-learning multi-modal-imaging multi-modality multimodal prompt-engineering prompt-toolkit prompting swarms transformer-models tree-of-thoughts
Last synced: 23 Oct 2025
https://github.com/lucidrains/deep-daze
A simple command-line tool for text-to-image generation using OpenAI's CLIP and Siren (implicit neural representation network). The technique was originally created by https://twitter.com/advadnoun
artificial-intelligence deep-learning implicit-neural-representation multi-modality siren text-to-image transformers
Last synced: 10 Apr 2025
https://github.com/evolvinglmms-lab/otter
🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.
artificial-inteligence chatgpt deep-learning embodied-ai foundation-models gpt-4 instruction-tuning large-scale-models machine-learning multi-modality visual-language-learning
Last synced: 13 Dec 2025
https://github.com/internlm/internlm-xcomposer
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
chatgpt foundation gpt gpt-4 instruction-tuning language-model large-language-model large-vision-language-model llm mllm multi-modality multimodal supervised-finetuning vision-language-model vision-transformer visual-language-learning
Last synced: 15 May 2025
https://github.com/InternLM/InternLM-XComposer
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
chatgpt foundation gpt gpt-4 instruction-tuning language-model large-language-model large-vision-language-model llm mllm multi-modality multimodal supervised-finetuning vision-language-model vision-transformer visual-language-learning
Last synced: 07 May 2025
https://github.com/dlr-rm/3dobjecttracking
Algorithms and Publications on 3D Object Tracking
accv2020 articulated computer-vision cvpr2022 ijcv iros2023 multi-body multi-modality object-tracking paper pose-estimation real-time rgbd tpami tracking
Last synced: 30 Jun 2025
https://github.com/DLR-RM/3DObjectTracking
Algorithms and Publications on 3D Object Tracking
accv2020 articulated computer-vision cvpr2022 ijcv iros2023 multi-body multi-modality object-tracking paper pose-estimation real-time rgbd tpami tracking
Last synced: 20 Mar 2025
https://github.com/openbmb/visrag
Parsing-free RAG supported by VLMs
document-retrieval document-understanding multi-modal multi-modality rag retrieval retrieval-augmented-generation vision-language-model
Last synced: 05 Oct 2025
https://github.com/opengvlab/multi-modality-arena
Chatbot Arena meets multi-modality! Multi-Modality Arena allows you to benchmark vision-language models side-by-side while providing images as inputs. Supports MiniGPT-4, LLaMA-Adapter V2, LLaVA, BLIP-2, and many more!
chat chatbot chatgpt gradio large-language-models llms multi-modality vision-language-model vqa
Last synced: 20 Apr 2025
https://github.com/OpenGVLab/Multi-Modality-Arena
Chatbot Arena meets multi-modality! Multi-Modality Arena allows you to benchmark vision-language models side-by-side while providing images as inputs. Supports MiniGPT-4, LLaMA-Adapter V2, LLaVA, BLIP-2, and many more!
chat chatbot chatgpt gradio large-language-models llms multi-modality vision-language-model vqa
Last synced: 03 Apr 2025
https://github.com/ziqihuangg/Collaborative-Diffusion
Collaborative Diffusion (CVPR 2023)
aigc diffusion-models face-editing face-generation gen-ai image-editing image-generation latent-diffusion-models multi-modality stable-diffusion
Last synced: 28 Mar 2025
https://github.com/kyegomez/Sophia
Effortless plug-and-play optimizer to cut model training costs by 50%. A new optimizer that is 2x faster than Adam on LLMs.
artificial-intelligence chatgpt deep-learning multi-modality neural-network optimizer
Last synced: 22 Jul 2025
https://github.com/kyegomez/sophia
Effortless plug-and-play optimizer to cut model training costs by 50%. A new optimizer that is 2x faster than Adam on LLMs.
artificial-intelligence chatgpt deep-learning multi-modality neural-network optimizer
Last synced: 23 Oct 2025
https://github.com/kyegomez/Gemini
The open-source implementation of Google's Gemini, the model that will "eclipse ChatGPT"
ai artificial-intelligence gemini gpt4 machine-learning ml multi-modality multimodla
Last synced: 05 Apr 2025
https://github.com/kyegomez/gemini
The open-source implementation of Google's Gemini, the model that will "eclipse ChatGPT"
ai artificial-intelligence gemini gpt4 machine-learning ml multi-modality multimodla
Last synced: 04 Apr 2025
https://github.com/dvlab-research/visionzip
Official repository for VisionZip (CVPR 2025)
efficiency multi-modality vision-language-model vlms
Last synced: 03 Jul 2025
https://github.com/JIA-Lab-research/MGM-Omni
MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech
audio-language-model multi-modal-large-language-model multi-modality multimodal text-to-speech
Last synced: 18 Jan 2026
https://github.com/zwwwayne/mmmot
[ICCV2019] Robust Multi-Modality Multi-Object Tracking
Last synced: 08 May 2025
https://github.com/ZwwWayne/mmMOT
[ICCV2019] Robust Multi-Modality Multi-Object Tracking
Last synced: 20 Mar 2025
https://github.com/dvlab-research/uvtr
Unifying Voxel-based Representation with Transformer for 3D Object Detection (NeurIPS 2022)
3d-detection multi-modality pytorch
Last synced: 03 Jul 2025
https://github.com/RLHF-V/RLHF-V
[CVPR'24] RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
chatbot gpt-4 llama multi-modality multimodal rlhf-v visual-language-learning
Last synced: 24 Feb 2025
https://github.com/dvlab-research/UVTR
Unifying Voxel-based Representation with Transformer for 3D Object Detection (NeurIPS 2022)
3d-detection multi-modality pytorch
Last synced: 20 Mar 2025
https://github.com/dvlab-research/VisionZip
Official repo for "VisionZip: Longer is Better but Not Necessary in Vision Language Models"
efficiency multi-modality vision-language-model vlms
Last synced: 23 Sep 2025
https://github.com/sshh12/multi_token
Embed arbitrary modalities (images, audio, documents, etc.) into large language models.
large-context large-language-models large-multimodal-models llava llm multi-modality multimodal vision-language-model
Last synced: 07 May 2025
https://github.com/jina-ai/rungpt
An open-source, cloud-native serving framework for large multi-modal models (LMMs).
flamingo gpt-4 large-language-models large-multimadality-models llama llm-hosting llm-serve lmm-serve multi-modality opengpt self-hosting transformers
Last synced: 02 Apr 2026
https://github.com/kyegomez/andromeda
An all-new language model that processes ultra-long sequences (100,000+) ultra-fast
agi artificial-general-intelligence artificial-intelligence artificial-intelligence-algorithms deep-learning gpt-4 language-model large-language-models multi-modality multimodal neural-networks transformer
Last synced: 06 Apr 2025
https://github.com/kyegomez/the-compiler
Seed, Code, Harvest: Grow Your Own App with Tree of Thoughts!
agora artficial-intelligence autogpt chain-of-thought chatgpt deep-learning deep-learning-algorithms multi-modal-fusion multi-modality multimodal-deep-learning prompt-engineering reinforcement-learning tree-of-thoughts
Last synced: 04 Oct 2025
https://github.com/kyegomez/mambabyte
Implementation of MambaByte from "MambaByte: Token-free Selective State Space Model" in PyTorch and Zeta
ai artificial-intelligence gpt4v machine-learning mamba megabyte ml multi-modality tokenizer
Last synced: 04 Apr 2025
https://github.com/dvlab-research/prompt-highlighter
[CVPR 2024] Prompt Highlighter: Interactive Control for Multi-Modal LLMs
llm-inference multi-modality text-generation
Last synced: 03 Jul 2025
https://github.com/kyegomez/moe-mamba
Implementation of MoE Mamba from the paper "MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts" in PyTorch and Zeta
ai ml moe multi-modal-fusion multi-modality swarms
Last synced: 07 May 2025
https://github.com/kyegomez/MambaByte
Implementation of MambaByte from "MambaByte: Token-free Selective State Space Model" in PyTorch and Zeta
ai artificial-intelligence gpt4v machine-learning mamba megabyte ml multi-modality tokenizer
Last synced: 20 Mar 2025
https://github.com/skit-ai/speechllm
This repository contains the training, inference, and evaluation code for SpeechLLM models, along with details about the model releases on Hugging Face.
conversational-ai llm multi-modal-llms multi-modality speech
Last synced: 20 Feb 2026
https://github.com/dvlab-research/mgm-omni
An Open-source Omni Chatbot for Long Speech and Voice Clone
audio-language-model generative-ai large-language-models llm multi-modal-large-language-model multi-modality multimodal text-to-speech vision-language-model
Last synced: 01 Sep 2025
https://github.com/kyegomez/kosmos2.5
My implementation of Kosmos2.5 from the paper: "KOSMOS-2.5: A Multimodal Literate Model"
attention attention-is-all-you-need gpt3 gpt4 kosmos multi-modality multimodal multimodal-deep-learning opensource
Last synced: 06 Apr 2025
https://github.com/rentainhe/trar-vqa
[ICCV 2021] Official implementation of the paper "TRAR: Routing the Attention Spans in Transformers for Visual Question Answering"
attention clevr dynamic-network iccv2021 local-and-global multi-modal multi-modal-learning multi-modality multi-scale-features official pytorch transformer vision-and-language visual-question-answering visualization vqav2
Last synced: 28 Aug 2025
https://github.com/kyegomez/qformer
Implementation of Qformer from BLIP2 in Zeta Lego blocks.
ai artificial-intelligence attention-mechanism blip2 machine machine-learning multi-modal multi-modality
Last synced: 06 Mar 2026
https://github.com/amazon-science/crossmodal-contrastive-learning
CrossCLR: Cross-modal Contrastive Learning For Multi-modal Video Representations, ICCV 2021
computer-vision contrastive-learning multi-modality natural-language-processing transformers video video-captioning video-text-retrieval
Last synced: 04 Sep 2025
https://chenshuang-zhang.github.io/imagenet_d/
[CVPR 2024 Highlight] ImageNet-D
benchmark computer-vision dataset diffusion-models generative-models image-recognition imagenet large-language-model multi-modality out-of-distribution recognition robustness stable-diffusion synthetic-data text-to-image-synthesis vision-language-model
Last synced: 31 Mar 2025
https://github.com/trendscenter/fit
Fusion ICA Toolbox (MATLAB)
analysis cca eeg fmri gene ica iva joint-ica matlab mcca multi-modality parallel-ica pca
Last synced: 09 Apr 2025
https://github.com/kyegomez/fuyu
Implementation of Fuyu, Adept's all-new multi-modality model, in PyTorch
ai artificial-intelligence gpt4 gpt5 machine-learning multi-modal multi-modality
Last synced: 07 May 2025
https://github.com/kyegomez/athena-for-search
The World's First AI-Enabled Multi-Modality Native Search Engine
agora apacai artificial-intelligence bing chatgpt chatgpt-api data data-engineering google human-computer-interaction multi-modal-imaging multi-modality multi-modality-data search-algorithm search-engine user-interface
Last synced: 07 May 2025
https://github.com/kyegomez/mm1
PyTorch Implementation of the paper "MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training"
ai artificial-intelligence deep-learning gpt4 machine-learning ml mm1 multi-modal multi-modal-revolution multi-modality
Last synced: 15 Apr 2025
https://github.com/kyegomez/mc-vit
Implementation of the MC-ViT model from the paper "Memory Consolidation Enables Long-Context Video Understanding"
ai multi-modal multi-modal-transformers multi-modality open-source transformer transformers vit
Last synced: 17 Aug 2025
https://github.com/kyegomez/mlxtransformer
A simple implementation of a Transformer in MLX, Apple's new framework
artificial-intelligence gpt4 machine-learning multi-modal multi-modality
Last synced: 04 Oct 2025
https://github.com/kyegomez/forest-of-thoughts
A forest of autonomous agents.
ai artificial-intelligence machine-learning ml multi-modal multi-modality
Last synced: 07 May 2025
https://github.com/kyegomez/swarmos
An all-new OS that orchestrates autonomous agents as workers to execute tasks.
ai asynchronous asynchronous-programming concurrent gpt4 llms ml multi-modality multithreading operating-system os swarms
Last synced: 07 May 2025
https://github.com/kyegomez/tinygptv
A simple implementation of TinyGPTV in super simple Zeta Lego blocks
artificial-intelligence attention attention-is-all-you-need deep-learning multi-modal multi-modality transformers
Last synced: 07 May 2025
https://github.com/kyegomez/hrtx
Multi-Modal Multi-Embodied Hivemind-like Iteration of RTX-2
ai artificial-intelligence ensemble gpt4v machine-learning ml multi-modal multi-modality rt-2 rtx
Last synced: 16 Apr 2025
https://github.com/kyegomez/multimodal-tot
Multi-Modal Tree of Thoughts for DALL-E 3-like automatic self-improvement
artificial-intelligence gpt4 multi-modal multi-modality multi-modality-data
Last synced: 07 May 2025
https://github.com/kyegomez/hsss
Implementation of a Hierarchical Mamba as described in the paper: "Hierarchical State Space Models for Continuous Sequence-to-Sequence Modeling"
ai artificial-intelligence jesus machine-learning ml multi-modal multi-modality open-source pytorch rnn rnns ssms tensorflow zeta
Last synced: 07 May 2025
https://github.com/xufangzhi/moca
[Pattern Recognition] The implementation of MoCA
multi-modality textbook-question-answering
Last synced: 20 Aug 2025
https://github.com/kyegomez/visiondatasets
Open source scripts to create large scale datasets with rich detail for multi-modal models
ai artificial-intelligence function-calling gpt3 gpt4 json machine-learning ml multi-modal multi-modality pytorch tensorflow
Last synced: 07 May 2025
https://github.com/voidful/mmlm
Toward a multi-modality language model: an implementation of GPT-4o/Project Astra
gpt-4o multi-modality multi-model world-models
Last synced: 04 Sep 2025
https://github.com/kyegomez/gats
Implementation of GATS from the paper "GATS: Gather-Attend-Scatter" in PyTorch and Zeta
ai attention attention-is-all-you-need attention-mechanism gpt4 llama ml multi-modal multi-modality multimodal open-source
Last synced: 25 Oct 2025
https://github.com/kyegomez/vortexfusion
Transformers + Mambas + LSTMs, all in one model
agora ai ai-research deep-learning lstms mambas ml multi-modality ssms transformers
Last synced: 12 Oct 2025
https://github.com/anondo1969/fedsepsis
Repository for the journal article, 'FedSepsis: A Federated Multi-Modal Deep Learning-Based Internet of Medical Things Application for Early Detection of Sepsis from Electronic Health Records Using Raspberry Pi and Jetson Nano Devices', Mahbub Ul Alam, Rahim Rahmani. Sensors 23, no. 2: 970, https://doi.org/10.3390/s23020970.
clinical-decision-support-system clinicalbert deep-learning early-sepsis-detection electronic-health-records federated-learning gan health-informatics internet-of-medical-things iomt jetson-nano machine-learning multi-modality natural-language-processing nlp raspberry-pi sepsis smart-healthcare
Last synced: 08 May 2025
https://github.com/kyegomez/aoa-torch
Implementation of Attention on Attention in Zeta
ai artificial-intelligence gpt4 machine-learning multi-modal multi-modality research
Last synced: 07 May 2025
https://github.com/kyegomez/ai-reading-list
This collection brings together the highest-signal research papers in modern AI, from the invention of the Transformer to the frontier work of 2024–2025, into a single curated map of the field.
agents ai ai-workflows attention attention-models diffusion diffusion-models flow-matching flux ml multi-modality research-paper transformers
Last synced: 18 Jan 2026
https://github.com/ravi-teja-konda/tunedllavadelights
Explore the rich flavors of Indian desserts with TunedLlavaDelights. Using LLaVA fine-tuning, our project unveils detailed nutritional profiles, taste notes, and optimal consumption times for beloved sweets. Dive into a fusion of AI innovation and culinary tradition.
chatgpt dalle2 dessert finetuning gpt4 gpt4v llama2 llava multi-modality multimodal nutrition nutrition-information stable-diffusion tranformers vision-language-learning vision-language-model
Last synced: 21 Apr 2026
https://github.com/anondo1969/fedsemicoviddetector
Repository for the journal article, 'Federated Semi-Supervised Multi-Task Learning to Detect COVID-19 and Lungs Segmentation Marking Using Chest Radiography Images and Raspberry Pi Devices: An Internet of Medical Things Application', Mahbub Ul Alam, Rahim Rahmani. Sensors 21, no. 15: 5025, https://doi.org/10.3390/s21155025.
clinical-decision-support-system covid-19-detection deep-learning electronic-health-records federated-learning health-informatics internet-of-medical-things iomt lungs-segmentation-detection machine-learning multi-modality raspberry-pi smart-healthcare
Last synced: 08 May 2025
https://github.com/nagababumo/open-source-models-with-hugging-face
asr audio-detection audio-processing automatic-speech-recognition blip clip huggingface huggingface-spaces huggingface-transformers image-captioning image-classification image-retrieval multi-modality object-detection open-source segementation sentence-embeddings transformers visual-question-answering zero-shot-learning
Last synced: 10 Sep 2025
https://github.com/anondo1969/thremaltimodal-covidetector
Repository for the conference paper 'COVID-19 detection from thermal image and tabular medical data utilizing multi-modal machine learning', Mahbub Ul Alam, Jaakko Hollmén and Rahim Rahmani. IEEE 36th International Symposium on Computer-Based Medical Systems (CBMS), 2023, pp. 646-653.
covid-19-detection deep-learning internet-of-medical-things machine-learning multi-modality tabular-medical-data thermal-image
Last synced: 08 May 2025
https://github.com/jonathanjsjsc/swarm
🦟 Interactive swarm simulation where pointer swarms follow your cursor - WebGL / threejs
attention-mechanism docker java machine-learning multi-modality mybatis prompt-toolkit pso rabbitmq spring-cloud stable-diffusion swarm swarms tsp
Last synced: 02 Aug 2025
https://github.com/yuanze-lin/olympus
The official code for "Olympus: A Universal Task Router for Computer Vision Tasks"
chatbot chatgpt deeplearning foundation-models instruction-tuning llava llms mllms multi-modality multimodal pytorch vision-language-model
Last synced: 01 Apr 2025