An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with multimodal

A curated list of projects in awesome lists tagged with multimodal.

https://github.com/haotian-liu/llava

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.

chatbot chatgpt foundation-models gpt-4 instruction-tuning llama llama-2 llama2 llava multi-modality multimodal vision-language-model visual-language-learning

Last synced: 15 Apr 2025

https://github.com/haotian-liu/LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.

chatbot chatgpt foundation-models gpt-4 instruction-tuning llama llama-2 llama2 llava multi-modality multimodal vision-language-model visual-language-learning

Last synced: 14 Mar 2025

https://github.com/microsoft/unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

beit beit-3 bitnet deepnet document-ai foundation-models kosmos kosmos-1 layoutlm layoutxlm llm minilm mllm multimodal nlp pre-trained-model textdiffuser trocr unilm xlm-e

Last synced: 08 Apr 2025

https://github.com/deepseek-ai/janus

Janus-Series: Unified Multimodal Understanding and Generation Models

any-to-any foundation-models llm multimodal unified-model vision-language-pretraining

Last synced: 13 Apr 2025

https://github.com/nvidia/nemo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)

asr deeplearning generative-ai large-language-models machine-translation multimodal neural-networks speaker-diariazation speaker-recognition speech-synthesis speech-translation tts

Last synced: 16 Apr 2025

https://github.com/mediar-ai/screenpipe

AI app store powered by 24/7 desktop history. Open source | 100% local | dev friendly | 24/7 screen and mic recording

agents agi ai computer-vision llm machine-learning ml multimodal vision

Last synced: 08 Apr 2025

https://github.com/NVIDIA/NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)

asr deeplearning generative-ai large-language-models machine-translation multimodal neural-networks speaker-diariazation speaker-recognition speech-synthesis speech-translation tts

Last synced: 14 Mar 2025

https://github.com/rerun-io/rerun

Visualize streams of multimodal data. Free, fast, easy to use, and simple to integrate. Built in Rust.

computer-vision cpp multimodal python robotics rust visualization

Last synced: 18 Apr 2025

https://github.com/bentoml/bentoml

The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and more!

ai-inference deep-learning generative-ai inference-platform llm llm-inference llm-serving llmops machine-learning ml-engineering mlops model-inference-service model-serving multimodal python

Last synced: 19 Apr 2025

https://github.com/bentoml/BentoML

The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and much more!

ai-inference deep-learning generative-ai inference-platform llm llm-inference llm-serving llmops machine-learning ml-engineering mlops model-inference-service model-serving multimodal python

Last synced: 12 Mar 2025

https://github.com/enricoros/big-agi

AI suite powered by state-of-the-art models and providing advanced AI/AGI functions. It features AI personas, AGI functions, multi-model chats, text-to-image, voice, response streaming, code highlighting and execution, PDF import, presets for developers, and much more. Deploy on-prem or in the cloud.

agi anthropic beam chatgpt chatgpt-ui generative-ai gpt gpt-4 gpt-5 groq large-language-models mistral multimodal openai openai-api stable-diffusion ui

Last synced: 09 Apr 2025

https://github.com/skalskip/courses

This repository is a curated collection of links to various courses and resources about Artificial Intelligence (AI)

computer-vision deep-learning deep-neural-networks generative-model machine-learning mlops multimodal natural-language-processing nlp stable-diffusion transformers tutorial

Last synced: 11 Apr 2025

https://github.com/enricoros/nextjs-chatgpt-app

Generative AI suite powered by state-of-the-art models and providing advanced AI/AGI functions. It features AI personas, AGI functions, multi-model chats, text-to-image, voice, response streaming, code highlighting and execution, PDF import, presets for developers, and much more. Deploy on-prem or in the cloud.

agi anthropic beam chatgpt chatgpt-ui generative-ai gpt gpt-4 gpt-5 groq large-language-models mistral multimodal openai openai-api stable-diffusion ui

Last synced: 13 Dec 2024

https://github.com/swyxio/ai-notes

Notes for software engineers getting up to speed on new AI developments. Serves as a datastore for https://latent.space writing and product brainstorming, with cleaned-up canonical references under the /Resources folder.

ai gpt gpt-3 multimodal openai prompt-engineering stable-diffusion

Last synced: 12 Apr 2025

https://github.com/modelscope/ms-swift

Use PEFT or Full-parameter to finetune 450+ LLMs (Qwen2.5, InternLM3, GLM4, Llama3.3, Mistral, Yi1.5, Baichuan2, DeepSeek-R1, ...) and 150+ MLLMs (Qwen2.5-VL, Qwen2-Audio, Llama3.2-Vision, Llava, InternVL2.5, MiniCPM-V-2.6, GLM4v, Xcomposer2.5, Yi-VL, DeepSeek-VL2, Phi3.5-Vision, GOT-OCR2, ...).

agent deepseek-r1 deploy distill embedding grpo internvl liger llama llama3-3 llm lora multimodal open-r1 peft qwen2-5 qwen2-vl rft sft

Last synced: 09 Apr 2025

https://github.com/facebookresearch/mmf

A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)

captioning deep-learning dialog hateful-memes multi-tasking multimodal pretrained-models pytorch textvqa vqa

Last synced: 08 Apr 2025

https://github.com/ten-framework/ten-agent

TEN Agent is a conversational voice AI agent powered by TEN, integrating Deepseek, Gemini, OpenAI, RTC, and hardware like ESP32. It enables realtime AI capabilities like seeing, hearing, and speaking, and is fully compatible with platforms like Dify and Coze.

agent ai asr cpp gemini golang gpt-4 gpt-4o llm low-latency multimodal nextjs14 openai python rag real-time realtime tts vision voice-assistant

Last synced: 31 Mar 2025

https://github.com/SkalskiP/courses

This repository is a curated collection of links to various courses and resources about Artificial Intelligence (AI)

computer-vision deep-learning deep-neural-networks generative-model machine-learning mlops multimodal natural-language-processing nlp stable-diffusion transformers tutorial

Last synced: 26 Mar 2025

https://github.com/TEN-framework/TEN-Agent

TEN Agent is a conversational voice AI agent powered by TEN, integrating Deepseek, Gemini, OpenAI, RTC, and hardware like ESP32. It enables realtime AI capabilities like seeing, hearing, and speaking, and is fully compatible with platforms like Dify and Coze.

agent ai asr cpp gemini golang gpt-4 gpt-4o llm low-latency multimodal nextjs14 openai python rag real-time realtime tts vision voice-assistant

Last synced: 08 Mar 2025

https://github.com/kyegomez/tree-of-thoughts

Plug-and-Play Implementation of Tree of Thoughts: Deliberate Problem Solving with Large Language Models that Elevates Model Reasoning by at least 70%

artificial-intelligence chatgpt deep-learning gpt4 multimodal prompt prompt-engineering prompt-learning prompt-tuning

Last synced: 08 Apr 2025

https://github.com/pyspur-dev/pyspur

A visual playground for agentic workflows: Iterate over your agents 10x faster

agent agents ai builder deepseek framework gemini graph human-in-the-loop llm llms loops multimodal ollama python rag reasoning tool trace workflow

Last synced: 10 Apr 2025

https://github.com/idea-ccnl/fengshenbang-lm

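Fengshenbang-LM (封神榜大模型) is an open-source system of large models led by the Cognitive Computing and Natural Language Research Center of the IDEA Research Institute, serving as infrastructure for Chinese AIGC and cognitive intelligence.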

aigc chinese-nlp distributed-training multimodal pretrained-models pytorch transformers

Last synced: 10 Apr 2025

https://github.com/IDEA-CCNL/Fengshenbang-LM

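Fengshenbang-LM (封神榜大模型) is an open-source system of large models led by the Cognitive Computing and Natural Language Research Center of the IDEA Research Institute, serving as infrastructure for Chinese AIGC and cognitive intelligence.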

aigc chinese-nlp distributed-training multimodal pretrained-models pytorch transformers

Last synced: 26 Mar 2025

https://github.com/enricoros/big-AGI

Generative AI suite powered by state-of-the-art models and providing advanced AI/AGI functions. It features AI personas, AGI functions, multi-model chats, text-to-image, voice, response streaming, code highlighting and execution, PDF import, presets for developers, and much more. Deploy on-prem or in the cloud.

agi anthropic beam chatgpt chatgpt-ui generative-ai gpt gpt-4 gpt-5 groq large-language-models mistral multimodal openai openai-api stable-diffusion ui

Last synced: 14 Mar 2025

https://github.com/rom1504/img2dataset

Easily turn large sets of image URLs into an image dataset. Can download, resize, and package 100M URLs in 20h on one machine.

big-data dataset deep-learning download-images image image-dataset multimodal

Last synced: 08 Apr 2025

https://github.com/NExT-GPT/NExT-GPT

Code and models for NExT-GPT: Any-to-Any Multimodal Large Language Model

chatgpt foundation-models gpt-4 instruction-tuning large-language-models llm multi-modal-chatgpt multimodal visual-language-learning

Last synced: 12 Mar 2025

https://github.com/OpenGVLab/InternGPT

InternGPT (iGPT) is an open source demo platform where you can easily showcase your AI models. Now it supports DragGAN, ChatGPT, ImageBind, multimodal chat like GPT-4, SAM, interactive image editing, etc. Try it at igpt.opengvlab.com (an online demo system supporting DragGAN, ChatGPT, ImageBind, and SAM)

chatgpt click draggan foundation-model gpt gpt-4 gradio husky image-captioning imagebind internimage langchain llama llm multimodal sam segment-anything vicuna video-generation vqa

Last synced: 27 Mar 2025

https://github.com/opengvlab/interngpt

InternGPT (iGPT) is an open source demo platform where you can easily showcase your AI models. Now it supports DragGAN, ChatGPT, ImageBind, multimodal chat like GPT-4, SAM, interactive image editing, etc. Try it at igpt.opengvlab.com (an online demo system supporting DragGAN, ChatGPT, ImageBind, and SAM)

chatgpt click draggan foundation-model gpt gpt-4 gradio husky image-captioning imagebind internimage langchain llama llm multimodal sam segment-anything vicuna video-generation vqa

Last synced: 10 Apr 2025

https://github.com/PKU-Alignment/align-anything

Align Anything: Training All-modality Model with Feedback

chameleon dpo large-language-models multimodal rlhf vision-language-model

Last synced: 01 Apr 2025

https://github.com/roboflow/maestro

Streamline the fine-tuning process for multimodal models: PaliGemma 2, Florence-2, and Qwen2.5-VL

captioning fine-tuning florence-2 multimodal objectdetection paligemma phi-3-vision qwen2-vl transformers vision-and-language vqa

Last synced: 11 Apr 2025

https://github.com/rom1504/clip-retrieval

Easily compute CLIP embeddings and build a CLIP retrieval system with them

ai clip deep-learning knn multimodal semantic-search

Last synced: 11 Apr 2025

https://github.com/iterative/datachain

ETL, Analytics, Versioning for Unstructured Data

ai cv data-analytics data-wrangling embeddings llm llm-eval machine-learning mlops multimodal

Last synced: 20 Apr 2025

https://github.com/ofa-sys/ofa

Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

chinese image-captioning multimodal pretrained-models pretraining prompt prompt-tuning referring-expression-comprehension text-to-image-synthesis vision-language visual-question-answering

Last synced: 14 Apr 2025

https://github.com/OFA-Sys/OFA

Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

chinese image-captioning multimodal pretrained-models pretraining prompt prompt-tuning referring-expression-comprehension text-to-image-synthesis vision-language visual-question-answering

Last synced: 02 Apr 2025

https://github.com/stability-ai/stability-sdk

SDK for interacting with stability.ai APIs (e.g. stable diffusion inference)

ai-art generative-art latent-diffusion multimodal stable-diffusion

Last synced: 09 Apr 2025

https://github.com/Stability-AI/stability-sdk

SDK for interacting with stability.ai APIs (e.g. stable diffusion inference)

ai-art generative-art latent-diffusion multimodal stable-diffusion

Last synced: 17 Apr 2025

https://github.com/stability-AI/stability-sdk

SDK for interacting with stability.ai APIs (e.g. stable diffusion inference)

ai-art generative-art latent-diffusion multimodal stable-diffusion

Last synced: 18 Nov 2024

https://rom1504.github.io/clip-retrieval/?back=https%3A%2F%2Fknn5.laion.ai&index=laion5B&useMclip=false

Easily compute CLIP embeddings and build a CLIP retrieval system with them

ai clip deep-learning knn multimodal semantic-search

Last synced: 14 Nov 2024

https://github.com/evolvinglmms-lab/lmms-eval

Accelerating the development of large multimodal models (LMMs) with one-click evaluation module - lmms-eval.

agi evaluation large-language-models multimodal

Last synced: 10 Apr 2025

https://github.com/x-plug/mplug-docowl

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

chart-understanding document-understanding mllm multimodal multimodal-large-language-models table-understanding

Last synced: 12 Apr 2025

https://github.com/xlang-ai/osworld

[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

agent artificial-intelligence benchmark cli code-generation gui language-model large-action-model llm multimodal natural-language-processing reinforcement-learning rpa vlm

Last synced: 12 Apr 2025

https://github.com/X-PLUG/mPLUG-DocOwl

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

chart-understanding document-understanding mllm multimodal multimodal-large-language-models table-understanding

Last synced: 17 Nov 2024

https://github.com/atfortes/llm-reasoning-papers

Reasoning in Large Language Models: Papers and Resources, including Chain-of-Thought, Instruction-Tuning and Multimodality.

awesome chain-of-thought chatgpt cot gpt gpt-4 in-context-learning language-models mllm multimodal papers prompt prompt-engineering question-answering reasoning vllm

Last synced: 06 Feb 2025

https://github.com/firebase/genkit

An open source framework for building AI-powered apps with familiar code-centric patterns. Genkit makes it easy to develop, integrate, and test AI features with observability and evaluations. Genkit works with various models and platforms.

agents ai embedders genkit llm machine-learning multimodal rag vector-database

Last synced: 10 Apr 2025

https://github.com/xlang-ai/OSWorld

[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

agent artificial-intelligence benchmark cli code-generation gui language-model large-action-model llm multimodal natural-language-processing reinforcement-learning rpa vlm

Last synced: 18 Apr 2025

https://github.com/kyegomez/BitNet

Implementation of "BitNet: Scaling 1-bit Transformers for Large Language Models" in PyTorch

artificial-intelligence deep-neural-networks deeplearning gpt4 machine-learning multimodal multimodal-deep-learning

Last synced: 08 Apr 2025

https://github.com/kyegomez/bitnet

Implementation of "BitNet: Scaling 1-bit Transformers for Large Language Models" in PyTorch

artificial-intelligence deep-neural-networks deeplearning gpt4 machine-learning multimodal multimodal-deep-learning

Last synced: 14 Apr 2025

https://github.com/deepseek-ai/Janus

Janus-Series: Unified Multimodal Understanding and Generation Models

any-to-any foundation-models llm multimodal unified-model vision-language-pretraining

Last synced: 06 Dec 2024

https://github.com/emcf/thepipe

Extract clean data from anywhere, powered by vision-language models ⚡

gpt-4 gpt-4o large-language-models multimodal pdf scrapers vision-transformer web

Last synced: 10 Apr 2025

https://github.com/alan-ai/alan-sdk-cordova

Conversational AI SDK for Apache Cordova to enable text and voice conversations with actions (iOS and Android)

chatbot conversational-ai machine-learning multimodal speech-recognition text-to-speech voice-assistant voice-commands voice-interface vui

Last synced: 09 Apr 2025

https://github.com/lucidrains/coca-pytorch

Implementation of CoCa, Contrastive Captioners are Image-Text Foundation Models, in PyTorch

artificial-intelligence attention-mechanism contrastive-learning deep-learning image-to-text multimodal transformers

Last synced: 14 Apr 2025

https://github.com/unum-cloud/uform

Pocket-Sized Multimodal AI for content understanding and generation across multilingual texts, images, and 🔜 video, up to 5x faster than OpenAI CLIP and LLaVA 🖼️ & 🖋️

bert clip clustering contrastive-learning cross-attention huggingface-transformers image-search language-vision llava multi-lingual multimodal neural-network openai openclip pretrained-models pytorch representation-learning semantic-search transformer vector-search

Last synced: 10 Apr 2025

https://github.com/abilzerian/llm-prompt-library

My personal prompt library for various LLMs + scripts & tools. Suitable for models from Deepseek, OpenAI, Claude, Meta, Mistral, Google, Grok, and others.

adaptive-learning meta-prompting multimodal prompt prompt-engineering prompt-evaluation prompt-generator prompt-injection prompt-learning prompt-management prompt-optimization prompt-template prompt-toolkit prompt-tuning promptengineering prompting rag text-analysis

Last synced: 20 Apr 2025

https://github.com/OpenBMB/VisCPM

[ICLR'24 spotlight] Chinese and English Multimodal Large Model Series (Chat and Paint) | A Chinese-English bilingual multimodal large model series based on the CPM foundation models

diffusion-models large-language-models multimodal transformers

Last synced: 16 Apr 2025

https://github.com/openbmb/viscpm

[ICLR'24 spotlight] Chinese and English Multimodal Large Model Series (Chat and Paint) | A Chinese-English bilingual multimodal large model series based on the CPM foundation models

diffusion-models large-language-models multimodal transformers

Last synced: 08 Apr 2025

https://github.com/google-research-datasets/wit

WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

cc-by-sa-3 machine-learning multilingual multimodal nlp wikipedia

Last synced: 21 Feb 2025

https://github.com/rhymes-ai/Aria

Codebase for Aria - an Open Multimodal Native MoE

mixture-of-experts multimodal vision-and-language

Last synced: 20 Feb 2025

https://github.com/OFA-Sys/ONE-PEACE

A general representation model across vision, audio, language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities

audio-language contrastive-loss foundation-models multimodal representation-learning vision-and-language vision-language vision-transformer

Last synced: 29 Nov 2024

https://github.com/om-ai-lab/OmAgent

A multimodal agent framework for solving complex tasks [EMNLP'2024]

agent large-language-models multimodal

Last synced: 07 Feb 2025

https://github.com/aidc-ai/ovis

A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.

chatbot llama3 multimodal multimodal-large-language-models multimodality qwen vision-language-learning vision-language-model

Last synced: 14 Apr 2025

https://github.com/WangRongsheng/XrayGLM

🩺 The first Chinese multimodal medical large model that can read chest X-rays | The first Chinese medical multimodal model for chest radiograph summarization.

large-language-models llms medical multimodal visualglm-6b xray

Last synced: 01 Apr 2025

https://github.com/potamides/detikzify

Synthesizing Graphics Programs for Scientific Figures and Sketches with TikZ

draw graph huggingface inverse-graphics latex llama llm multimodal sketch tikz transformers vectorization visualization

Last synced: 15 Apr 2025

https://github.com/InternLM/HuixiangDou

HuixiangDou: Overcoming Group Chat Scenarios with LLM-based Technical Assistance

application assistance chatbot dsl lark llm multimodal ocr pipeline rag robot wechat

Last synced: 24 Mar 2025

https://github.com/internlm/huixiangdou

HuixiangDou: Overcoming Group Chat Scenarios with LLM-based Technical Assistance

application assistance chatbot dsl lark llm multimodal ocr pipeline rag robot wechat

Last synced: 08 Apr 2025

https://github.com/ArrowLuo/CLIP4Clip

An official implementation for "CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval"

activitynet clip didemo lsmdc msrvtt msvd multimodal multimodal-learning multimodality ranking retrieval retrieval-model search video-clip-retrieval video-text-retrieval

Last synced: 03 Apr 2025

https://github.com/showlab/show-o

Repository for Show-o, One Single Transformer to Unify Multimodal Understanding and Generation.

diffusion-models large-language-models multimodal

Last synced: 07 Apr 2025

https://github.com/lancedb/vectordb-recipes

High quality resources & applications for LLMs, multi-modal models and VectorDBs

agents ai deep-learning embeddings fine-tuning gpt gpt-4-vision langchain llama-index llms machine-learning multimodal openai rag vector-database

Last synced: 13 Apr 2025

https://github.com/xtreme1-io/xtreme1

Xtreme1 is an all-in-one data labeling and annotation platform for multimodal data training and supports 3D LiDAR point cloud, image, and LLM.

3d-annotation annotation annotation-tool computer-vision image-annotation image-classification image-labelling-tool labeling-tool multimodal point-cloud rlhf

Last synced: 20 Mar 2025

https://github.com/OpenRobotLab/PointLLM

[ECCV 2024 Best Paper Candidate] PointLLM: Empowering Large Language Models to Understand Point Clouds

3d chatbot foundation-models gpt-4 large-language-models llama multimodal objaverse point-cloud pointllm representation-learning vision-and-language

Last synced: 20 Mar 2025

https://github.com/morphik-org/morphik-core

Open source multi-modal RAG for building AI apps over private knowledge.

artificial-intelligence cache-augmented-generation colpali database litellm multimodal open-source rag rules-based-ingestion

Last synced: 10 Apr 2025

https://github.com/ailab-cvc/seed

Official implementation of SEED-LLaMA (ICLR 2024).

foundation-model multimodal vision-language

Last synced: 09 Apr 2025