Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with multi-modal

A curated list of projects in awesome lists tagged with multi-modal.

https://github.com/openbmb/minicpm-v

MiniCPM-V 2.6: A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone

minicpm minicpm-v multi-modal

Last synced: 14 Jan 2025
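
For orientation, below is a minimal sketch of how a MiniCPM-V checkpoint is typically driven through Hugging Face's trust_remote_code interface. The model id, dtype, and the exact chat() signature are assumptions drawn from the family's model cards and may differ between releases.

    import torch
    from PIL import Image
    from transformers import AutoModel, AutoTokenizer

    # Model id and chat() signature are assumptions based on the MiniCPM-V model cards.
    model_id = "openbmb/MiniCPM-V-2_6"
    model = AutoModel.from_pretrained(model_id, trust_remote_code=True,
                                      torch_dtype=torch.bfloat16).eval()
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

    image = Image.open("photo.jpg").convert("RGB")
    msgs = [{"role": "user", "content": [image, "Describe this image."]}]

    # The custom modeling code exposes a chat() helper for single-image Q&A.
    answer = model.chat(image=None, msgs=msgs, tokenizer=tokenizer)
    print(answer)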

https://github.com/activeloopai/deeplake

Database for AI. Store Vectors, Images, Texts, Videos, etc. Use with LLMs/LangChain. Store, query, version, & visualize any AI data. Stream data in real-time to PyTorch/TensorFlow. https://activeloop.ai

ai computer-vision cv data-science datalake datasets deep-learning image-processing langchain large-language-models llm machine-learning ml mlops multi-modal python pytorch tensorflow vector-database vector-search

Last synced: 13 Jan 2025
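
A small sketch of the Deep Lake workflow the description alludes to (load a hosted dataset, inspect its tensors, pull a sample into NumPy, stream to PyTorch). It assumes the v3-style Python API, and the dataset path is illustrative.

    import deeplake

    # Load a publicly hosted dataset (path is illustrative; v3-style API assumed).
    ds = deeplake.load("hub://activeloop/mnist-train")

    print(list(ds.tensors))              # e.g. ['images', 'labels']
    first_image = ds.images[0].numpy()   # materialize one sample as a NumPy array
    print(first_image.shape)

    # Stream straight into a PyTorch-compatible dataloader (also part of the v3 API).
    loader = ds.pytorch(batch_size=32, shuffle=True)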

https://github.com/modelscope/modelscope

ModelScope: bring the notion of Model-as-a-Service to life.

cv deep-learning machine-learning multi-modal nlp python science speech

Last synced: 13 Jan 2025
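
ModelScope's main entry point is its pipeline factory; a rough sketch follows, where the task name and model id are illustrative placeholders rather than guaranteed identifiers.

    from modelscope.pipelines import pipeline

    # Task name and model id are illustrative; browse the ModelScope hub for real ones.
    captioner = pipeline("image-captioning",
                         model="damo/ofa_image-caption_coco_large_en")
    print(captioner("photo.jpg"))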

https://github.com/thudm/cogvlm

A state-of-the-art open visual language model | multi-modal pretrained model

cross-modality language-model multi-modal pretrained-models visual-language-models

Last synced: 14 Jan 2025

https://github.com/opengvlab/internvl

[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model approaching GPT-4o's performance.

gpt gpt-4o gpt-4v image-classification image-text-retrieval llm multi-modal semantic-segmentation video-classification vision-language-model vit-22b vit-6b

Last synced: 14 Jan 2025

https://github.com/lucidrains/dalle-pytorch

Implementation / replication of DALL-E, OpenAI's Text to Image Transformer, in Pytorch

artificial-intelligence attention-mechanism deep-learning multi-modal text-to-image transformers

Last synced: 14 Jan 2025
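
A condensed sketch of the training loop from the repository's README, reduced to its two core pieces: a discrete VAE that tokenizes images, and the DALL-E transformer trained over text and image tokens. Hyperparameters and tensor shapes are illustrative.

    import torch
    from dalle_pytorch import DiscreteVAE, DALLE

    # Discrete VAE that turns images into codebook tokens (hyperparameters illustrative).
    vae = DiscreteVAE(image_size=256, num_layers=3, num_tokens=8192,
                      codebook_dim=512, hidden_dim=64)

    # DALL-E transformer over text tokens plus the VAE's image tokens.
    dalle = DALLE(dim=512, vae=vae, num_text_tokens=10000, text_seq_len=256,
                  depth=12, heads=16, dim_head=64)

    text = torch.randint(0, 10000, (2, 256))    # dummy tokenized captions
    images = torch.randn(2, 3, 256, 256)        # dummy image batch

    loss = dalle(text, images, return_loss=True)
    loss.backward()

    # After training, images can be sampled from text alone.
    samples = dalle.generate_images(text[:1])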

https://github.com/modelscope/agentscope

Start building LLM-empowered multi-agent applications in an easier way.

agent chatbot distributed-agents drag-and-drop gpt-4 gpt-4o large-language-models llama3 llm llm-agent multi-agent multi-modal

Last synced: 15 Jan 2025
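
A minimal sketch of the AgentScope conversation loop, assuming an OpenAI-style chat backend; the model config values are placeholders, and the agent classes are those shown in the project's examples.

    import agentscope
    from agentscope.agents import DialogAgent, UserAgent

    # Model config values are placeholders; point them at whichever backend you use.
    agentscope.init(model_configs=[{
        "config_name": "my_chat_model",
        "model_type": "openai_chat",
        "model_name": "gpt-4o",
    }])

    assistant = DialogAgent(name="assistant",
                            model_config_name="my_chat_model",
                            sys_prompt="You are a helpful assistant.")
    user = UserAgent(name="user")

    # Messages flow back and forth between the two agents until the user types "exit".
    msg = None
    while True:
        msg = user(msg)
        if msg.content == "exit":
            break
        msg = assistant(msg)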

https://github.com/thudm/visualglm-6b

Chinese and English bilingual multimodal conversational language model

chatglm-6b gpt multi-modal

Last synced: 16 Jan 2025
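
A short sketch of the chat interface the VisualGLM-6B checkpoint exposes via trust_remote_code; the image path and prompt are placeholders, and the exact chat() signature may vary between revisions.

    from transformers import AutoModel, AutoTokenizer

    # Checkpoint id and chat() signature assumed from the VisualGLM-6B model card.
    tokenizer = AutoTokenizer.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True)
    model = AutoModel.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True).half().cuda()

    # chat() takes an image path plus a text query and returns the reply and dialogue history.
    response, history = model.chat(tokenizer, "photo.jpg", "Describe this image.", history=[])
    print(response)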

https://github.com/modelscope/data-juicer

Making data higher-quality, juicier, and more digestible for foundation models! 🍎 🍋 🌽 ➡️ ➡️ 🍸 🍹 🍷 Providing higher-quality, richer, and more easily digestible data for large models!

chinese data-analysis data-science data-visualization dataset gpt gpt-4 instruction-tuning large-language-models llama llava llm llms multi-modal nlp opendata pre-training pytorch sora streamlit

Last synced: 20 Jan 2025

https://github.com/pku-yuangroup/video-llava

【EMNLP 2024🔥】Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

instruction-tuning large-vision-language-model multi-modal

Last synced: 16 Jan 2025

https://github.com/scisharp/llamasharp

A C#/.NET library to run LLM (🦙LLaMA/LLaVA) on your local device efficiently.

chatbot gpt llama llama-cpp llama2 llama3 llamacpp llava llm multi-modal semantic-kernel

Last synced: 20 Jan 2025

https://github.com/thudm/cogvlm2

GPT4V-level open-source multi-modal model based on Llama3-8B

cogvlm language-model multi-modal pretrained-models

Last synced: 16 Jan 2025

https://github.com/pku-yuangroup/moe-llava

Mixture-of-Experts for Large Vision-Language Models

large-vision-language-model mixture-of-experts moe multi-modal

Last synced: 18 Jan 2025

https://github.com/kav-k/gptdiscord

A robust, all-in-one GPT interface for Discord. ChatGPT-style conversations, image generation, AI moderation, custom indexes/knowledge base, YouTube summarizer, and more!

artificial-intelligence asyncio chatbot code-interpreter collaborate dalle2 digitalocean discord embeddings extractive-question-answering github gpt3 hacktoberfest help-wanted moderator-bot multi-modal openai openai-api pinecone python

Last synced: 17 Jan 2025

https://github.com/dvlab-research/lisa

Project Page for "LISA: Reasoning Segmentation via Large Language Model"

large-language-model llm multi-modal segmentation

Last synced: 14 Jan 2025

https://github.com/openmotionlab/motiongpt

[NeurIPS 2023] MotionGPT: Human Motion as a Foreign Language, a unified motion-language generation model using LLMs

3d-generation chatgpt gpt language-model motion motion-generation motiongpt multi-modal text-driven text-to-motion

Last synced: 18 Jan 2025

https://github.com/open-compass/VLMEvalKit

Open-source evaluation toolkit for large vision-language models (LVLMs), supporting ~100 VLMs and 40+ benchmarks

chatgpt claude clip computer-vision evaluation gemini gpt gpt-4v gpt4 large-language-models llava llm multi-modal openai openai-api pytorch qwen vit vqa

Last synced: 28 Nov 2024

https://github.com/lucidrains/transfusion-pytorch

PyTorch implementation of Transfusion, "Predict the Next Token and Diffuse Images with One Multi-Modal Model", from Meta AI

artificial-intelligence attention deep-learning flow-matching multi-modal transformers

Last synced: 16 Jan 2025

https://github.com/pku-yuangroup/languagebind

【ICLR 2024🔥】 Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

language-central multi-modal pretraining zero-shot

Last synced: 15 Jan 2025

https://github.com/microsoft/farmvibes-ai

FarmVibes.AI: Multi-Modal GeoSpatial ML Models for Agriculture and Sustainability

agriculture ai geospatial geospatial-analytics multi-modal remote-sensing stac sustainability weather

Last synced: 17 Jan 2025

https://github.com/salesforce/unicontrol

Unified Controllable Visual Generation Model

aigc generation multi-modal

Last synced: 17 Jan 2025

https://github.com/kyegomez/rt-2

Democratization of RT-2, from "RT-2: New model translates vision and language into action"

artificial-intelligence attention-mechanism embodied-agent gpt4 multi-modal robotics transformer

Last synced: 18 Jan 2025

https://github.com/v-iashin/SpecVQGAN

Source code for "Taming Visually Guided Sound Generation" (Oral at the BMVC 2021)

audio audio-generation bmvc evaluation-metrics gan melgan multi-modal pytorch transformer vas vggsound video video-features video-understanding vqvae

Last synced: 06 Nov 2024

https://github.com/wangsuzhen/Audio2Head

code for paper "Audio2Head: Audio-driven One-shot Talking-head Generation with Natural Head Motion" in the conference of IJCAI 2021

codes ijcai2021 multi-modal paper talking-face talking-head

Last synced: 31 Oct 2024

https://github.com/wisconsinaivision/vip-llava

[CVPR2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts

chatbot clip cvpr2024 foundation-models gpt-4 gpt-4-vision llama llama2 llava multi-modal vision-language visual-prompting

Last synced: 20 Jan 2025

https://github.com/internlm/internevo

InternEvo is an open-source, lightweight training framework that aims to support model pre-training without the need for extensive dependencies.

910b deepspeed-ulysses flash-attention gemma internlm internlm2 llama3 llava llm-framework llm-training multi-modal pipeline-parallelism pytorch ring-attention sequence-parallelism tensor-parallelism transformers-models zero3

Last synced: 18 Jan 2025

https://github.com/Haiyang-W/UniTR

[ICCV2023] Official Implementation of "UniTR: A Unified and Efficient Multi-Modal Transformer for Bird’s-Eye-View Representation"

3d 3d-object-detection 3d-segmentation backbone bev camera computer-vision iccv2023 lidar multi-modal multi-view point-cloud transformer unified

Last synced: 28 Oct 2024

https://github.com/Open3DA/LL3DA

[CVPR 2024] "LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning"; an interactive Large Language 3D Assistant.

3d 3d-models 3d-to-text cvpr2024 gpt instruction-tuning language-model llm multi-modal scene-understanding

Last synced: 19 Jan 2025

https://github.com/juliarobotics/caesar.jl

Robust robotic localization and mapping, together with NavAbility(TM). Reach out to [email protected] for help.

caesar database isam julia multi-modal non-parametric parametric-navigation-solutions robotics slam

Last synced: 13 Jan 2025

https://github.com/thu-ml/mmtrusteval

A toolbox for benchmarking trustworthiness of multimodal large language models (MultiTrust, NeurIPS 2024 Track Datasets and Benchmarks)

benchmark claude fairness gpt-4 mllm multi-modal privacy robustness safety toolbox trustworthy-ai truthfulness

Last synced: 13 Jan 2025

https://github.com/2u1/llama3.2-vision-finetune

An open-source implementation for fine-tuning the Llama3.2-Vision series by Meta.

llama3 multi-modal vision-language vision-language-model

Last synced: 14 Jan 2025

https://github.com/xlang-ai/spider2-v

[NeurIPS 2024] Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?

agent artificial-intelligence code-generation data-science-and-engineering gui llm multi-modal natural-language-processing vlm

Last synced: 13 Jan 2025

https://github.com/netease-media/grps_trtllm

[High-performance OpenAI LLM service] A pure C++, high-performance OpenAI-compatible LLM service built with GRPS + TensorRT-LLM + Tokenizers.cpp, supporting chat and function-call modes, AI agents, distributed multi-GPU inference, multi-modal inputs, and a Gradio chat UI.

ai-agent chatglm function-call internvl2 llama-index llama3 llm multi-modal openai qwen-vl qwen2 qwen2-vl tensorrt-llm

Last synced: 14 Jan 2025

https://github.com/OpenShapeLab/ShapeGPT

ShapeGPT: 3D Shape Generation with A Unified Multi-modal Language Model, a unified and user-friendly shape-language model

3d-generation caption-generation chatgpt gpt language-model multi-modal shape unified

Last synced: 28 Oct 2024

https://github.com/qcraftai/distill-bev

DistillBEV: Boosting Multi-Camera 3D Object Detection with Cross-Modal Knowledge Distillation (ICCV 2023)

3d-object-detection autonomous-driving bev cross-modal distillation knowledge-distillation lidar multi-camera multi-modal nuscenes point-cloud self-driving

Last synced: 28 Oct 2024

https://github.com/guyyariv/AudioToken

This repo contains the official PyTorch implementation of AudioToken: Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image Generation

ai-art audio-to-image audio2image deep-learning diffusion-models image-generation multi-modal stable-diffusion text2image

Last synced: 22 Nov 2024

https://github.com/ThuCCSLab/FigStep

Jailbreaking Large Vision-language Models via Typographic Visual Prompts

gpt-4 jailbreak llm multi-modal safety security vlm

Last synced: 02 Dec 2024

https://github.com/kyegomez/switchtransformers

Implementation of Switch Transformers from the paper: "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity"

ai gpt4 llama mixture-model mixture-of-experts mixture-of-models ml moe multi-modal

Last synced: 13 Jan 2025
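
To make the routing idea behind Switch Transformers concrete, here is a from-scratch top-1 (switch) mixture-of-experts layer in plain PyTorch. It illustrates the technique only and is not the API of this repository.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Top1SwitchLayer(nn.Module):
        """Top-1 (switch) routing over a set of expert MLPs; a generic illustration."""

        def __init__(self, dim: int, num_experts: int = 4, hidden: int = 256):
            super().__init__()
            self.router = nn.Linear(dim, num_experts)
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
                for _ in range(num_experts)
            ])

        def forward(self, x):                      # x: (tokens, dim)
            gates = F.softmax(self.router(x), dim=-1)
            expert_idx = gates.argmax(dim=-1)      # each token is routed to one expert
            out = torch.zeros_like(x)
            for i, expert in enumerate(self.experts):
                mask = expert_idx == i
                if mask.any():
                    # scale by the gate value so the routing decision stays differentiable
                    out[mask] = expert(x[mask]) * gates[mask, i].unsqueeze(-1)
            return out

    tokens = torch.randn(8, 64)
    layer = Top1SwitchLayer(dim=64)
    print(layer(tokens).shape)   # torch.Size([8, 64])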

https://github.com/amanchadha/iperceive

Applying Common-Sense Reasoning to Multi-Modal Dense Video Captioning and Video Question Answering | Python3 | PyTorch | CNNs | Causality | Reasoning | LSTMs | Transformers | Multi-Head Self Attention | Published in IEEE Winter Conference on Applications of Computer Vision (WACV) 2021

attention captioning captioning-videos causality common-sense convolutional-neural-networks dense-captioning distilling-the-knowledge lstm multi-modal python python3 pytorch question-answering reasoning resnets self-attention transformers video videoqa

Last synced: 05 Nov 2024

https://github.com/kyegomez/kosmos-x

The Next Generation Multi-Modality Superintelligence

computer-vision gemini gpt gpt3 gpt4 multi-modal pytorch vision

Last synced: 16 Nov 2024

https://github.com/rentainhe/trar-vqa

[ICCV 2021] Official implementation of the paper "TRAR: Routing the Attention Spans in Transformers for Visual Question Answering"

attention clevr dynamic-network iccv2021 local-and-global multi-modal multi-modal-learning multi-modality multi-scale-features official pytorch transformer vision-and-language visual-question-answering visualization vqav2

Last synced: 07 Nov 2024

https://github.com/howard-hou/VisualRWKV

VisualRWKV is the visual-enhanced version of the RWKV language model, enabling RWKV to handle various visual tasks.

large-language-models multi-modal rwkv

Last synced: 14 Nov 2024

https://github.com/Eaphan/UPIDet

Unleash the Potential of Image Branch for Cross-modal 3D Object Detection [NeurIPS2023]

3d-object-detection cross-modal multi-modal

Last synced: 28 Oct 2024

https://github.com/kyegomez/lumiere

Implementation of the text to video model LUMIERE from the paper: "A Space-Time Diffusion Model for Video Generation" by Google Research

agents ai artificial-intelligence machine-learning multi-modal swarms text-to-video ttv

Last synced: 15 Jan 2025

https://github.com/icon-lab/I2I-Mamba

Official implementation of I2I-Mamba, an image-to-image translation model based on selective state spaces

artificial-intelligence deeplearning image-synthesis image-to-image-translation mamba mamba-state-space-models medical multi-modal neural-networks pytorch ssm

Last synced: 28 Oct 2024

https://github.com/ashvardanian/usearch-images

Semantic search demo featuring UForm, USearch, UCall, and Streamlit, for visualizing and retrieving from image datasets, similar to "CLIP Retrieval"

ai clip clip-model clip-retrival demo demo-app multi-lingual multi-modal semantic-search streamlit transformer vector-search

Last synced: 07 Nov 2024
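
To make the UForm + USearch pairing concrete, a toy vector-search sketch using USearch alone; the random vectors stand in for the image/text embeddings UForm would produce, and the dimensionality and metric are illustrative.

    import numpy as np
    from usearch.index import Index

    # Random vectors stand in for UForm embeddings; ndim and metric are illustrative.
    index = Index(ndim=256, metric="cos")

    keys = np.arange(1000)
    vectors = np.random.rand(1000, 256).astype(np.float32)
    index.add(keys, vectors)

    query = np.random.rand(256).astype(np.float32)
    matches = index.search(query, 5)     # top-5 nearest neighbours
    print(matches.keys, matches.distances)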

https://github.com/kyegomez/hlt

Implementation of the transformer from the paper: "Real-World Humanoid Locomotion with Reinforcement Learning"

ai artificial-intelligence attention attention-is-all-you-need ml multi-modal robotics transformers

Last synced: 19 Dec 2024

https://github.com/kyegomez/autort

Implementation of AutoRT: "AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents"

ai artificial-intelligence attention-is-all-you-need attention-mechanism gpt4 machine-learning ml multi-modal multimodal-learning robotics robots ros swarm swarms

Last synced: 06 Jan 2025

https://github.com/kyegomez/qformer

Implementation of Qformer from BLIP2 in Zeta Lego blocks.

ai artificial-intelligence attention-mechanism blip2 machine machine-learning multi-modal multi-modality

Last synced: 01 Jan 2025

https://github.com/kyegomez/simba

A simpler PyTorch + Zeta implementation of the paper: "SiMBA: Simplified Mamba-based Architecture for Vision and Multivariate Time series"

agents imagenet llms multi-modal recurrent-networks ss4 ssm transformers

Last synced: 01 Jan 2025

https://github.com/kyegomez/mm1

PyTorch Implementation of the paper "MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training"

ai artificial-intelligence deep-learning gpt4 machine-learning ml mm1 multi-modal multi-modal-revolution multi-modality

Last synced: 01 Jan 2025

https://github.com/Toytiny/RadarNet-pytorch

PyTorch code reproduction of RadarNet (ECCV'20) for radar-based 3D object detection

3d-detection 3d-vision automotive-radar lidar-point-cloud multi-modal object-detection point-cloud

Last synced: 28 Oct 2024

https://github.com/kyegomez/fuyu

Implementation of Adept's all-new multi-modality model Fuyu, in PyTorch

ai artificial-intelligence gpt4 gpt5 machine-learning multi-modal multi-modality

Last synced: 09 Nov 2024

https://github.com/kyegomez/megavit

The open source implementation of the model from "Scaling Vision Transformers to 22 Billion Parameters"

artificial-intelligence computer-vision gpt4 multi-modal multi-modal-fusion multi-modal-learning vision-and-language vision-transformer

Last synced: 09 Nov 2024

https://github.com/kyegomez/neva

The open source implementation of "NeVA: NeMo Vision and Language Assistant"

artificial-intelligence cuda gpt4 multi-modal multi-modal-learning multithreading neva nvidia robotics

Last synced: 09 Nov 2024

https://github.com/dermatologist/kedro-multimodal

Template for multi-modal machine learning in healthcare using Kedro. Combine reports, tabular data and images using various fusion methods.

healthcare jupyter kedro kubeflow machine-learning medical multi-modal pipeline vertex-ai

Last synced: 24 Oct 2024

https://github.com/kyegomez/palm2-vadapter

Implementation of "PaLM2-VAdapter:" from the multi-modal model paper: "PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter"

ai attention attention-is-all-you-need attention-mechanisms deeplearning ml models multi-modal neural-nets transformers

Last synced: 09 Nov 2024

https://github.com/kyegomez/hrtx

Multi-Modal Multi-Embodied Hivemind-like Iteration of RTX-2

ai artificial-intelligence ensemble gpt4v machine-learning ml multi-modal multi-modality rt-2 rtx

Last synced: 01 Jan 2025

https://github.com/kyegomez/mc-vit

Implementation of the model: "(MC-ViT)" from the paper: "Memory Consolidation Enables Long-Context Video Understanding"

ai multi-modal multi-modal-transformers multi-modality open-source transformer transformers vit

Last synced: 09 Nov 2024

https://github.com/kyegomez/visionllama

Implementation of VisionLLaMA from the paper: "VisionLLaMA: A Unified LLaMA Interface for Vision Tasks" in PyTorch and Zeta

ai deep-learning multi-modal vision-models vision-transformers vit

Last synced: 09 Nov 2024