Projects in Awesome Lists tagged with multi-modal
A curated list of projects in awesome lists tagged with multi-modal.
https://github.com/openbmb/minicpm-o
MiniCPM-o 2.6: A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone
Last synced: 14 May 2025
https://github.com/openbmb/minicpm-v
MiniCPM-V 2.6: A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone
Last synced: 22 Sep 2025
https://github.com/activeloopai/deeplake
Database for AI. Store Vectors, Images, Texts, Videos, etc. Use with LLMs/LangChain. Store, query, version, & visualize any AI data. Stream data in real-time to PyTorch/TensorFlow. https://activeloop.ai
ai computer-vision cv data-science datalake datasets deep-learning image-processing langchain large-language-models llm machine-learning ml mlops multi-modal python pytorch tensorflow vector-database vector-search
Last synced: 11 Feb 2026
https://github.com/modelscope/modelscope
ModelScope: bring the notion of Model-as-a-Service to life.
cv deep-learning machine-learning multi-modal nlp python science speech
Last synced: 01 Apr 2026
https://github.com/agentscope-ai/agentscope
Start building LLM-empowered multi-agent applications in an easier way.
agent chatbot distributed-agents drag-and-drop gpt-4 gpt-4o large-language-models llama3 llm llm-agent mcp multi-agent multi-modal
Last synced: 15 Jan 2026
https://github.com/opengvlab/internvl
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model with performance approaching GPT-4o.
gpt gpt-4o gpt-4v image-classification image-text-retrieval llm multi-modal semantic-segmentation video-classification vision-language-model vit-22b vit-6b
Last synced: 12 May 2025
https://github.com/modelscope/agentscope
Start building LLM-empowered multi-agent applications in an easier way.
agent chatbot distributed-agents drag-and-drop gpt-4 gpt-4o large-language-models llama3 llm llm-agent mcp multi-agent multi-modal
Last synced: 14 May 2025
https://github.com/thudm/cogvlm
a state-of-the-art-level open visual language model | multimodal pretrained model
cross-modality language-model multi-modal pretrained-models visual-language-models
Last synced: 14 May 2025
https://github.com/THUDM/CogVLM
a state-of-the-art-level open visual language model | multimodal pretrained model
cross-modality language-model multi-modal pretrained-models visual-language-models
Last synced: 28 Mar 2025
https://github.com/OpenGVLab/InternVL
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model with performance approaching GPT-4o.
gpt gpt-4o gpt-4v image-classification image-text-retrieval llm multi-modal semantic-segmentation video-classification vision-language-model vit-22b vit-6b
Last synced: 16 Mar 2025
https://github.com/lucidrains/dalle-pytorch
Implementation / replication of DALL-E, OpenAI's Text to Image Transformer, in Pytorch
artificial-intelligence attention-mechanism deep-learning multi-modal text-to-image transformers
Last synced: 13 May 2025
https://github.com/lucidrains/DALLE-pytorch
Implementation / replication of DALL-E, OpenAI's Text to Image Transformer, in Pytorch
artificial-intelligence attention-mechanism deep-learning multi-modal text-to-image transformers
Last synced: 14 Mar 2025
https://github.com/datajuicer/data-juicer
Data processing for and with foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
data data-analysis data-pipeline data-processing data-science data-visualization foundation-models instruction-tuning large-language-models llm llms multi-modal pre-training synthetic-data
Last synced: 08 Nov 2025
https://github.com/ofa-sys/chinese-clip
Chinese version of CLIP which achieves Chinese cross-modal retrieval and representation generation.
chinese clip computer-vision contrastive-loss coreml-models deep-learning image-text-retrieval multi-modal multi-modal-learning nlp pretrained-models pytorch transformers vision-and-language-pre-training vision-language
Last synced: 29 Apr 2025
https://github.com/OFA-Sys/Chinese-CLIP
Chinese version of CLIP which achieves Chinese cross-modal retrieval and representation generation.
chinese clip computer-vision contrastive-loss coreml-models deep-learning image-text-retrieval multi-modal multi-modal-learning nlp pretrained-models pytorch transformers vision-and-language-pre-training vision-language
Last synced: 02 Apr 2025
https://github.com/marqo-ai/marqo
Ecommerce Search and Discovery - marqo.ai
ecommerce machine-learning multi-modal search-engine
Last synced: 07 Apr 2026
https://github.com/valhalla/valhalla
Open Source Routing Engine for OpenStreetMap
astar dijkstra directions isochrones multi-modal openstreetmap routing routing-engine tiled traveling-salesman
Last synced: 11 May 2025
https://github.com/modelscope/data-juicer
Data processing for and with foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
chinese data-analysis data-science data-visualization dataset gpt gpt-4 instruction-tuning large-language-models llama llava llm llms multi-modal nlp opendata pre-training pytorch streamlit synthetic-data
Last synced: 13 May 2025
https://github.com/THUDM/VisualGLM-6B
Chinese and English multimodal conversational language model | Multimodal Chinese-English bilingual dialogue language model
Last synced: 14 Mar 2025
https://github.com/thudm/visualglm-6b
Chinese and English multimodal conversational language model | Multimodal Chinese-English bilingual dialogue language model
Last synced: 14 May 2025
https://github.com/zjunlp/deepke
[EMNLP 2022] An Open Toolkit for Knowledge Graph Extraction and Construction
attribute-extraction chinese deep-learning deepke document-level few-shot information-extraction instructie kg knowledge-graph knowprompt lightner low-resource multi-modal named-entity-recognition ner nlp prompt pytorch relation-extraction
Last synced: 12 May 2025
https://github.com/zjunlp/DeepKE
[EMNLP 2022] An Open Toolkit for Knowledge Graph Extraction and Construction
attribute-extraction chinese deep-learning deepke document-level few-shot information-extraction instructie kg knowledge-graph knowprompt lightner low-resource multi-modal named-entity-recognition ner nlp prompt pytorch relation-extraction
Last synced: 18 Mar 2025
https://github.com/pku-yuangroup/video-llava
【EMNLP 2024🔥】Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
instruction-tuning large-vision-language-model multi-modal
Last synced: 02 Jul 2025
https://github.com/scisharp/llamasharp
A C#/.NET library to run LLM (🦙LLaMA/LLaVA) on your local device efficiently.
chatbot gpt llama llama-cpp llama2 llama3 llamacpp llava llm multi-modal semantic-kernel
Last synced: 14 May 2025
https://github.com/PKU-YuanGroup/Video-LLaVA
【EMNLP 2024🔥】Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
instruction-tuning large-vision-language-model multi-modal
Last synced: 20 Mar 2025
https://github.com/docarray/docarray
Represent, send, store and search multimodal data
cross-modal data-structures dataclass deep-learning docarray elasticsearch fastapi machine-learning multi-modal multimodal nearest-neighbor-search nested-data neural-search protobuf pydantic pytorch qdrant semantic-search weaviate
Last synced: 12 Jan 2026
https://github.com/SciSharp/LLamaSharp
A C#/.NET library to run LLM (🦙LLaMA/LLaVA) on your local device efficiently.
chatbot gpt llama llama-cpp llama2 llama3 llamacpp llava llm multi-modal semantic-kernel
Last synced: 24 Mar 2025
https://github.com/jina-ai/docarray
Represent, send, store and search multimodal data
cross-modal data-structures dataclass deep-learning docarray elasticsearch fastapi machine-learning multi-modal multimodal nearest-neighbor-search nested-data neural-search protobuf pydantic pytorch qdrant semantic-search weaviate
Last synced: 05 Apr 2025
https://github.com/thudm/cogvlm2
GPT4V-level open-source multi-modal model based on Llama3-8B
cogvlm language-model multi-modal pretrained-models
Last synced: 14 May 2025
https://github.com/THUDM/CogVLM2
GPT4V-level open-source multi-modal model based on Llama3-8B
cogvlm language-model multi-modal pretrained-models
Last synced: 07 May 2025
https://github.com/open-compass/vlmevalkit
Open-source evaluation toolkit of large multi-modality models (LMMs), support 220+ LMMs, 80+ benchmarks
chatgpt claude clip computer-vision evaluation gemini gpt gpt-4v gpt4 large-language-models llava llm multi-modal openai openai-api pytorch qwen vit vqa
Last synced: 13 May 2025
https://github.com/dvlab-research/lisa
Project Page for "LISA: Reasoning Segmentation via Large Language Model"
large-language-model llm multi-modal segmentation
Last synced: 14 May 2025
https://github.com/pku-yuangroup/moe-llava
Mixture-of-Experts for Large Vision-Language Models
large-vision-language-model mixture-of-experts moe multi-modal
Last synced: 14 May 2025
https://github.com/PKU-YuanGroup/MoE-LLaVA
Mixture-of-Experts for Large Vision-Language Models
large-vision-language-model mixture-of-experts moe multi-modal
Last synced: 16 Mar 2025
https://github.com/Kav-K/GPTDiscord
A robust, all-in-one GPT interface for Discord. ChatGPT-style conversations, image generation, AI-moderation, custom indexes/knowledgebase, youtube summarizer, and more!
artificial-intelligence asyncio chatbot code-interpreter collaborate dalle2 digitalocean discord embeddings extractive-question-answering github gpt3 hacktoberfest help-wanted moderator-bot multi-modal openai openai-api pinecone python
Last synced: 24 Mar 2025
https://github.com/kav-k/gptdiscord
A robust, all-in-one GPT interface for Discord. ChatGPT-style conversations, image generation, AI-moderation, custom indexes/knowledgebase, youtube summarizer, and more!
artificial-intelligence asyncio chatbot code-interpreter collaborate dalle2 digitalocean discord embeddings extractive-question-answering github gpt3 hacktoberfest help-wanted moderator-bot multi-modal openai openai-api pinecone python
Last synced: 07 Feb 2026
https://github.com/dvlab-research/LISA
Project Page for "LISA: Reasoning Segmentation via Large Language Model"
large-language-model llm multi-modal segmentation
Last synced: 03 Apr 2025
https://github.com/openmotionlab/motiongpt
[NeurIPS 2023] MotionGPT: Human Motion as a Foreign Language, a unified motion-language generation model using LLMs
3d-generation chatgpt gpt language-model motion motion-generation motiongpt multi-modal text-driven text-to-motion
Last synced: 15 May 2025
https://github.com/OpenMotionLab/MotionGPT
[NeurIPS 2023] MotionGPT: Human Motion as a Foreign Language, a unified motion-language generation model using LLMs
3d-generation chatgpt gpt language-model motion motion-generation motiongpt multi-modal text-driven text-to-motion
Last synced: 12 Mar 2025
https://github.com/intellabs/fastrag
Efficient Retrieval Augmentation and Generation Framework
benchmark colbert diffusion generative-ai information-retrieval knowledge-graph llm multi-modal nlp question-answering semantic-search sentence-transformers summarization transformers
Last synced: 14 May 2025
https://github.com/QIN2DIM/hcaptcha-challenger
🥂 Gracefully face hCaptcha challenge with MoE(ONNX) embedded solution.
clip computer-vision hcaptcha hcaptcha-solver image-segmentation multi-modal multi-modal-learning object-detection onnx onnx-models onnxruntime opencv-python playwright solver yolo yolov5 zero-shot-classification
Last synced: 28 Mar 2025
https://github.com/dirtyharrylyl/transformer-in-vision
Recent Transformer-based CV and related works.
computer-vision deep-learning multi-modal paper self-attention transformer vision-transformers visual-language
Last synced: 28 Jan 2026
https://github.com/DirtyHarryLYL/Transformer-in-Vision
Recent Transformer-based CV and related works.
computer-vision deep-learning multi-modal paper self-attention transformer vision-transformers visual-language
Last synced: 20 Mar 2025
https://github.com/vercel/modelfusion
The TypeScript library for building AI applications.
ai artificial-intelligence chatbot claude dall-e embedding gpt-3 huggingface javascript js llamacpp llm mistral multi-modal ollama openai stable-diffusion ts typescript whisper
Last synced: 15 May 2025
https://github.com/open-compass/VLMEvalKit
Open-source evaluation toolkit of large vision-language models (LVLMs), support ~100 VLMs, 40+ benchmarks
chatgpt claude clip computer-vision evaluation gemini gpt gpt-4v gpt4 large-language-models llava llm multi-modal openai openai-api pytorch qwen vit vqa
Last synced: 20 Jul 2025
https://github.com/bytedance/SALMONN
SALMONN: Speech Audio Language Music Open Neural Network
audio audio-processing bytedance iclr2024 icml-2024 large-language-models multi-modal music research speech speech-recognition tsinghua-university
Last synced: 14 Apr 2025
https://github.com/bytedance/salmonn
SALMONN: Speech Audio Language Music Open Neural Network
audio audio-processing bytedance iclr2024 icml-2024 large-language-models multi-modal music research speech speech-recognition tsinghua-university
Last synced: 13 Apr 2025
https://github.com/medmnist/medmnist
[pip install medmnist] 18x Standardized Datasets for 2D and 3D Biomedical Image Classification
2d 3d automl benchmark classification dataset decathlon deep-learning federated-learning few-shot-learning machine-learning medical medical-image-analysis medical-image-computing medical-imaging medmnist mnist multi-modal pytorch
Last synced: 14 May 2025
https://github.com/lgrammel/ai-utils.js
The TypeScript library for building AI applications.
ai artificial-intelligence chatbot claude dall-e embedding gpt-3 huggingface javascript js llamacpp llm mistral multi-modal ollama openai stable-diffusion ts typescript whisper
Last synced: 29 Dec 2025
https://github.com/lucidrains/transfusion-pytorch
Pytorch implementation of Transfusion, "Predict the Next Token and Diffuse Images with One Multi-Modal Model", from MetaAI
artificial-intelligence attention deep-learning flow-matching multi-modal transformers
Last synced: 14 May 2025
https://github.com/IntelLabs/fastRAG
Efficient Retrieval Augmentation and Generation Framework
benchmark colbert diffusion generative-ai information-retrieval knowledge-graph llm multi-modal nlp question-answering semantic-search sentence-transformers summarization transformers
Last synced: 24 Mar 2025
https://github.com/Tebmer/Awesome-Knowledge-Distillation-of-LLMs
This repository collects papers for "A Survey on Knowledge Distillation of Large Language Models". We break down KD into Knowledge Elicitation and Distillation Algorithms, and explore the Skill & Vertical Distillation of LLMs.
alignment compression data-augmentation data-synthesis feedback instruction-following kd knowledge-distillation large-language-model llm multi-modal self-distillation self-training supervised-finetuning survey
Last synced: 12 Apr 2025
https://github.com/pku-yuangroup/languagebind
【ICLR 2024🔥】 Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
language-central multi-modal pretraining zero-shot
Last synced: 12 Apr 2025
https://github.com/microsoft/farmvibes-ai
FarmVibes.AI: Multi-Modal GeoSpatial ML Models for Agriculture and Sustainability
agriculture ai geospatial geospatial-analytics multi-modal remote-sensing stac sustainability weather
Last synced: 15 May 2025
https://github.com/PKU-YuanGroup/LanguageBind
【ICLR 2024🔥】 Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
language-central multi-modal pretraining zero-shot
Last synced: 24 Jul 2025
https://github.com/openbmb/visrag
Parsing-free RAG supported by VLMs
document-retrieval document-understanding multi-modal multi-modality rag retrieval retrieval-augmented-generation vision-language-model
Last synced: 05 Oct 2025
https://github.com/salesforce/unicontrol
Unified Controllable Visual Generation Model
Last synced: 08 Oct 2025
https://github.com/salesforce/UniControl
Unified Controllable Visual Generation Model
Last synced: 28 Mar 2025
https://github.com/kyegomez/zeta
Build high-performance AI models with modular building blocks
artificial-intelligence deep-learning gpt4 llama2 longnet multi-agent-systems multi-modal multi-modal-learning multi-platform pytorch speech-recognition transformer transformers
Last synced: 14 May 2025
https://github.com/kyegomez/rt-2
Democratization of RT-2 "RT-2: New model translates vision and language into action"
artificial-intelligence attention-mechanism embodied-agent gpt4 multi-modal robotics transformer
Last synced: 12 Apr 2025
https://github.com/internlm/internevo
InternEvo is an open-source, lightweight training framework that aims to support model pre-training without the need for extensive dependencies.
910b deepspeed-ulysses flash-attention gemma internlm internlm2 llama3 llava llm-framework llm-training multi-modal pipeline-parallelism pytorch ring-attention sequence-parallelism tensor-parallelism transformers-models zero3
Last synced: 07 Oct 2025
https://github.com/dvlab-research/llmga
This project is the official implementation of 'LLMGA: Multimodal Large Language Model based Generation Assistant', ECCV2024 Oral
aigc image-design-assistant image-editing image-generation large-language-model llm mllm multi-modal
Last synced: 03 Jul 2025
https://github.com/ankur-anand/unisondb
A streaming multimodal database for Edge AI, and Edge Computing.
ai-agents database edge-computing go golang golang-database grpc grpc-go key-value multi-modal replicated row-column streaming streaming-data streaming-database unisondb wide-column-database
Last synced: 14 Jan 2026
https://github.com/harlanhong/actalker
ICCV 2025 ACTalker: an end-to-end video diffusion framework for talking head synthesis that supports both single and multi-signal control (e.g., audio, expression).
avatar diffusion-models digitalhuman face-animation multi-modal stablevideodiffusion talking-head
Last synced: 29 Jan 2026
https://github.com/v-iashin/SpecVQGAN
Source code for "Taming Visually Guided Sound Generation" (Oral at the BMVC 2021)
audio audio-generation bmvc evaluation-metrics gan melgan multi-modal pytorch transformer vas vggsound video video-features video-understanding vqvae
Last synced: 09 Apr 2025
https://github.com/wangsuzhen/Audio2Head
Code for the paper "Audio2Head: Audio-driven One-shot Talking-head Generation with Natural Head Motion" (IJCAI 2021)
codes ijcai2021 multi-modal paper talking-face talking-head
Last synced: 28 Mar 2025
https://github.com/wisconsinaivision/vip-llava
[CVPR2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
chatbot clip cvpr2024 foundation-models gpt-4 gpt-4-vision llama llama2 llava multi-modal vision-language visual-prompting
Last synced: 06 Apr 2025
https://github.com/InternLM/InternEvo
InternEvo is an open-source, lightweight training framework that aims to support model pre-training without the need for extensive dependencies.
910b deepspeed-ulysses flash-attention gemma internlm internlm2 llama3 llava llm-framework llm-training multi-modal pipeline-parallelism pytorch ring-attention sequence-parallelism tensor-parallelism transformers-models zero3
Last synced: 27 Mar 2025
https://github.com/wangxiao5791509/MultiModal_BigModels_Survey
[MIR-2023-Survey] A continuously updated paper list for multi-modal pre-trained big models
anhui-university audio big-models depth event-camera multi-modal natural-language pengchenglab point-cloud pre-training radar review rgb-text-audio self-attention survey thermal-infrared transformers
Last synced: 02 Apr 2025
https://github.com/Haiyang-W/UniTR
[ICCV2023] Official Implementation of "UniTR: A Unified and Efficient Multi-Modal Transformer for Bird’s-Eye-View Representation"
3d 3d-object-detection 3d-segmentation backbone bev camera computer-vision iccv2023 lidar multi-modal multi-view point-cloud transformer unified
Last synced: 20 Mar 2025
https://github.com/Open3DA/LL3DA
[CVPR 2024] "LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning"; an interactive Large Language 3D Assistant.
3d 3d-models 3d-to-text cvpr2024 gpt instruction-tuning language-model llm multi-modal scene-understanding
Last synced: 29 Sep 2025
https://github.com/junchen14/multi-modal-transformer
This repository collects a variety of multi-modal transformer architectures, including image transformers, video transformers, image-language transformers, video-language transformers, and self-supervised learning models. It also collects useful tutorials and tools in these related domains.
efficiency-transformer image-transformer language mlp-mixer multi-modal multi-modal-cvpr2021 transformer-readling-list video-language video-transformer vision-transformer
Last synced: 24 Jan 2026
https://github.com/FareedKhan-dev/all-rag-techniques
Implementation of all RAG techniques in a simpler way
ai llm llms multi-modal openai python rag
Last synced: 23 Mar 2025
https://github.com/juliarobotics/caesar.jl
Robust robotic localization and mapping, together with NavAbility(TM). Reach out to info@wherewhen.ai for help.
caesar database isam julia multi-modal non-parametric parametric-navigation-solutions robotics slam
Last synced: 06 Apr 2025
https://github.com/zjysteven/vlm-visualizer
Visualizing the attention of vision-language models
attention attention-mechanism llava multi-modal vision-language vision-language-model
Last synced: 17 Aug 2025
https://github.com/openmotionlab/motiongpt3
MotionGPT3: Human Motion as a Second Modality, a MoT-based framework for unified motion understanding and generation
chatgpt gpt language-model motion motiongpt motiongpt3 multi-modal text-to-motion
Last synced: 28 Apr 2026
https://github.com/2u1/llama3.2-vision-finetune
An open-source implementation for fine-tuning the Llama3.2-Vision series by Meta.
llama3 multi-modal vision-language vision-language-model
Last synced: 05 Apr 2025
https://github.com/thu-ml/mmtrusteval
A toolbox for benchmarking trustworthiness of multimodal large language models (MultiTrust, NeurIPS 2024 Track Datasets and Benchmarks)
benchmark claude fairness gpt-4 mllm multi-modal privacy robustness safety toolbox trustworthy-ai truthfulness
Last synced: 05 Apr 2025
https://github.com/kyegomez/switchtransformers
Implementation of Switch Transformers from the paper: "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity"
ai gpt4 llama mixture-model mixture-of-experts mixture-of-models ml moe multi-modal
Last synced: 09 Oct 2025
https://github.com/netease-media/grps_trtllm
Higher performance OpenAI LLM service than vLLM serve: A pure C++ high-performance OpenAI LLM service implemented with GPRS+TensorRT-LLM+Tokenizers.cpp, supporting chat and function call, AI agents, distributed multi-GPU inference, multimodal capabilities, and a Gradio chat interface.
ai-agent chatglm deepseek-r1 function-call internvideo internvl2 janus-pro llama-index llama3 llm minicpm-v multi-modal olmocr openai phi qwen2 qwen2-vl qwq tensorrt-llm
Last synced: 06 Apr 2025
https://github.com/xlang-ai/spider2-v
[NeurIPS 2024] Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?
agent artificial-intelligence code-generation data-science-and-engineering gui llm multi-modal natural-language-processing vlm
Last synced: 30 Apr 2025
https://github.com/xlang-ai/Spider2-V
[NeurIPS 2024] Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?
agent artificial-intelligence code-generation data-science-and-engineering gui llm multi-modal natural-language-processing vlm
Last synced: 23 Feb 2025
https://github.com/thu-ml/MMTrustEval
A toolbox for benchmarking trustworthiness of multimodal large language models (MultiTrust, NeurIPS 2024 Track Datasets and Benchmarks)
benchmark claude fairness gpt-4 mllm multi-modal privacy robustness safety toolbox trustworthy-ai truthfulness
Last synced: 27 Jul 2025
https://github.com/awslabs/rhubarb
A Python framework for multi-modal document understanding with Amazon Bedrock
amazon-bedrock document-processing generative-ai intelligent-document-processing multi-modal
Last synced: 02 Apr 2026
https://github.com/Event-AHU/EventVOT_Benchmark
[CVPR-2024] The First High Definition (HD) Event based Visual Object Tracking Benchmark Dataset
benchmark-dataset cross-modality event-based-tracking high-definition multi-modal rgb-event single-object-tracking visual-object-tracking visual-tracking
Last synced: 03 Mar 2025
https://github.com/NetEase-Media/grps_trtllm
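[High-performance OpenAI LLM service] A pure C++ high-performance OpenAI LLM service implemented with GPRS+TensorRT-LLM+Tokenizers.cpp, supporting chat and function call modes, AI agents, distributed multi-GPU inference, multimodality, and a Gradio chat interface.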
ai-agent chatglm deepseek-r1 function-call internvl2 llama-index llama3 llm multi-modal openai qwen-vl qwen2 qwen2-vl tensorrt-llm
Last synced: 04 Nov 2025
https://github.com/OpenShapeLab/ShapeGPT
ShapeGPT: 3D Shape Generation with A Unified Multi-modal Language Model, a unified and user-friendly shape-language model
3d-generation caption-generation chatgpt gpt language-model multi-modal shape unified
Last synced: 20 Mar 2025
https://github.com/qcraftai/distill-bev
DistillBEV: Boosting Multi-Camera 3D Object Detection with Cross-Modal Knowledge Distillation (ICCV 2023)
3d-object-detection autonomous-driving bev cross-modal distillation knowledge-distillation lidar multi-camera multi-modal nuscenes point-cloud self-driving
Last synced: 20 Mar 2025
https://github.com/juliarobotics/incrementalinference.jl
Clique recycling non-Gaussian (multi-modal) factor graph solver; also see Caesar.jl.
bayes bayes-network bayes-tree belief-propagation caesar chapman-kolmogorov factor-graphs filtering-algorithm inference isam julia-language multi-hypothesis multi-modal nonparametric optimization parametric robotics slam state-estimation sum-product
Last synced: 11 Sep 2025
https://github.com/guyyariv/AudioToken
This repo contains the official PyTorch implementation of AudioToken: Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image Generation
ai-art audio-to-image audio2image deep-learning diffusion-models image-generation multi-modal stable-diffusion text2image
Last synced: 14 Jul 2025
https://github.com/icon-lab/i2i-mamba
Official implementation of I2I-Mamba, an image-to-image translation model based on selective state spaces
artificial-intelligence deeplearning image-synthesis image-to-image-translation mamba mamba-state-space-models medical multi-modal neural-networks pytorch ssm
Last synced: 06 Apr 2025
https://github.com/ThuCCSLab/FigStep
Jailbreaking Large Vision-language Models via Typographic Visual Prompts
gpt-4 jailbreak llm multi-modal safety security vlm
Last synced: 27 Jul 2025
https://github.com/amanchadha/iperceive
Applying Common-Sense Reasoning to Multi-Modal Dense Video Captioning and Video Question Answering | Python3 | PyTorch | CNNs | Causality | Reasoning | LSTMs | Transformers | Multi-Head Self Attention | Published in IEEE Winter Conference on Applications of Computer Vision (WACV) 2021
attention captioning captioning-videos causality common-sense convolutional-neural-networks dense-captioning distilling-the-knowledge lstm multi-modal python python3 pytorch question-answering reasoning resnets self-attention transformers video videoqa
Last synced: 05 Apr 2025
https://github.com/rentainhe/trar-vqa
[ICCV 2021] Official implementation of the paper "TRAR: Routing the Attention Spans in Transformers for Visual Question Answering"
attention clevr dynamic-network iccv2021 local-and-global multi-modal multi-modal-learning multi-modality multi-scale-features official pytorch transformer vision-and-language visual-question-answering visualization vqav2
Last synced: 28 Aug 2025
https://github.com/kyegomez/kosmos-x
The Next Generation Multi-Modality Superintelligence
computer-vision gemini gpt gpt3 gpt4 multi-modal pytorch vision
Last synced: 15 Apr 2025
https://github.com/icon-lab/I2I-Mamba
Official implementation of I2I-Mamba, an image-to-image translation model based on selective state spaces
artificial-intelligence deeplearning image-synthesis image-to-image-translation mamba mamba-state-space-models medical multi-modal neural-networks pytorch ssm
Last synced: 20 Mar 2025
https://github.com/howard-hou/VisualRWKV
VisualRWKV is the visual-enhanced version of the RWKV language model, enabling RWKV to handle various visual tasks.
large-language-models multi-modal rwkv
Last synced: 07 May 2025
https://github.com/thu-ml/mla-trust
A toolbox for benchmarking Multimodal LLM Agents trustworthiness across truthfulness, controllability, safety and privacy dimensions through 34 interactive tasks
agent benchmark controllability multi-modal privacy safety toolbox trustworthy-ai truthfulness
Last synced: 06 Mar 2026
https://github.com/Event-AHU/COESOT
A large-scale benchmark dataset for color-event based visual tracking
benchmark-dataset coesot dynamic-vision-sensors event-camera multi-modal multi-modality-tracking rgb-event single-object-tracking transformer visual-object-tracking
Last synced: 23 Nov 2025
https://github.com/Eaphan/UPIDet
Unleash the Potential of Image Branch for Cross-modal 3D Object Detection [NeurIPS2023]
3d-object-detection cross-modal multi-modal
Last synced: 20 Mar 2025