Projects in Awesome Lists tagged with multimodal
A curated list of projects in awesome lists tagged with multimodal.
https://github.com/mintplex-labs/anything-llm
The all-in-one Desktop & Docker AI application with built-in RAG, AI agents, No-code agent builder, and more.
agent-framework-javascript ai-agents crewai custom-ai-agents deepseek deepseek-r1 desktop-app llama3 llm llm-application llm-webui lmstudio local-llm localai multimodal nodejs ollama rag vector-database webui
Last synced: 15 Apr 2025
https://github.com/Mintplex-Labs/anything-llm
The all-in-one Desktop & Docker AI application with built-in RAG, AI agents, No-code agent builder, and more.
agent-framework-javascript ai-agents crewai custom-ai-agents deepseek deepseek-r1 desktop-app llama3 llm llm-application llm-webui lmstudio local-llm localai multimodal nodejs ollama rag vector-database webui
Last synced: 17 Mar 2025
https://github.com/haotian-liu/llava
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
chatbot chatgpt foundation-models gpt-4 instruction-tuning llama llama-2 llama2 llava multi-modality multimodal vision-language-model visual-language-learning
Last synced: 15 Apr 2025
https://github.com/haotian-liu/LLaVA
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
chatbot chatgpt foundation-models gpt-4 instruction-tuning llama llama-2 llama2 llava multi-modality multimodal vision-language-model visual-language-learning
Last synced: 14 Mar 2025
https://github.com/jina-ai/serve
☁️ Build multimodal AI applications with cloud-native stack
cloud-native cncf deep-learning docker fastapi framework generative-ai grpc jaeger kubernetes llmops machine-learning microservice mlops multimodal neural-search opentelemetry orchestration pipeline prometheus
Last synced: 19 Apr 2025
https://github.com/jina-ai/jina
☁️ Build multimodal AI applications with cloud-native stack
cloud-native cncf deep-learning docker fastapi framework generative-ai grpc jaeger kubernetes llmops machine-learning microservice mlops multimodal neural-search opentelemetry orchestration pipeline prometheus
Last synced: 01 Feb 2025
https://github.com/microsoft/unilm
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
beit beit-3 bitnet deepnet document-ai foundation-models kosmos kosmos-1 layoutlm layoutxlm llm minilm mllm multimodal nlp pre-trained-model textdiffuser trocr unilm xlm-e
Last synced: 08 Apr 2025
https://github.com/deepseek-ai/janus
Janus-Series: Unified Multimodal Understanding and Generation Models
any-to-any foundation-models llm multimodal unified-model vision-language-pretraining
Last synced: 13 Apr 2025
https://github.com/nvidia/nemo
A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
asr deeplearning generative-ai large-language-models machine-translation multimodal neural-networks speaker-diariazation speaker-recognition speech-synthesis speech-translation tts
Last synced: 16 Apr 2025
https://github.com/mediar-ai/screenpipe
AI app store powered by 24/7 desktop history. open source | 100% local | dev friendly | 24/7 screen, mic recording
agents agi ai computer-vision llm machine-learning ml multimodal vision
Last synced: 08 Apr 2025
https://github.com/NVIDIA/NeMo
A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
asr deeplearning generative-ai large-language-models machine-translation multimodal neural-networks speaker-diariazation speaker-recognition speech-synthesis speech-translation tts
Last synced: 14 Mar 2025
https://github.com/rerun-io/rerun
Visualize streams of multimodal data. Free, fast, easy to use, and simple to integrate. Built in Rust.
computer-vision cpp multimodal python robotics rust visualization
Last synced: 18 Apr 2025
https://github.com/bentoml/bentoml
The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and more!
ai-inference deep-learning generative-ai inference-platform llm llm-inference llm-serving llmops machine-learning ml-engineering mlops model-inference-service model-serving multimodal python
Last synced: 19 Apr 2025
https://github.com/bentoml/BentoML
The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and much more!
ai-inference deep-learning generative-ai inference-platform llm llm-inference llm-serving llmops machine-learning ml-engineering mlops model-inference-service model-serving multimodal python
Last synced: 12 Mar 2025
https://github.com/enricoros/big-agi
AI suite powered by state-of-the-art models and providing advanced AI/AGI functions. It features AI personas, AGI functions, multi-model chats, text-to-image, voice, response streaming, code highlighting and execution, PDF import, presets for developers, much more. Deploy on-prem or in the cloud.
agi anthropic beam chatgpt chatgpt-ui generative-ai gpt gpt-4 gpt-5 groq large-language-models mistral multimodal openai openai-api stable-diffusion ui
Last synced: 09 Apr 2025
https://github.com/skalskip/courses
This repository is a curated collection of links to various courses and resources about Artificial Intelligence (AI)
computer-vision deep-learning deep-neural-networks generative-model machine-learning mlops multimodal natural-language-processing nlp stable-diffusion transformers tutorial
Last synced: 11 Apr 2025
https://github.com/enricoros/nextjs-chatgpt-app
Generative AI suite powered by state-of-the-art models and providing advanced AI/AGI functions. It features AI personas, AGI functions, multi-model chats, text-to-image, voice, response streaming, code highlighting and execution, PDF import, presets for developers, much more. Deploy on-prem or in the cloud.
agi anthropic beam chatgpt chatgpt-ui generative-ai gpt gpt-4 gpt-5 groq large-language-models mistral multimodal openai openai-api stable-diffusion ui
Last synced: 13 Dec 2024
https://github.com/swyxio/ai-notes
Notes for software engineers getting up to speed on new AI developments. Serves as a datastore for https://latent.space writing and product brainstorming, with cleaned-up canonical references under the /Resources folder.
ai gpt gpt-3 multimodal openai prompt-engineering stable-diffusion
Last synced: 12 Apr 2025
https://github.com/modelscope/ms-swift
Use PEFT or Full-parameter to finetune 450+ LLMs (Qwen2.5, InternLM3, GLM4, Llama3.3, Mistral, Yi1.5, Baichuan2, DeepSeek-R1, ...) and 150+ MLLMs (Qwen2.5-VL, Qwen2-Audio, Llama3.2-Vision, Llava, InternVL2.5, MiniCPM-V-2.6, GLM4v, Xcomposer2.5, Yi-VL, DeepSeek-VL2, Phi3.5-Vision, GOT-OCR2, ...).
agent deepseek-r1 deploy distill embedding grpo internvl liger llama llama3-3 llm lora multimodal open-r1 peft qwen2-5 qwen2-vl rft sft
Last synced: 09 Apr 2025
https://github.com/facebookresearch/mmf
A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)
captioning deep-learning dialog hateful-memes multi-tasking multimodal pretrained-models pytorch textvqa vqa
Last synced: 08 Apr 2025
https://github.com/ten-framework/ten-agent
TEN Agent is a conversational voice AI agent powered by TEN, integrating Deepseek, Gemini, OpenAI, RTC, and hardware like ESP32. It enables realtime AI capabilities like seeing, hearing, and speaking, and is fully compatible with platforms like Dify and Coze.
agent ai asr cpp gemini golang gpt-4 gpt-4o llm low-latency multimodal nextjs14 openai python rag real-time realtime tts vision voice-assistant
Last synced: 31 Mar 2025
https://github.com/SkalskiP/courses
This repository is a curated collection of links to various courses and resources about Artificial Intelligence (AI)
computer-vision deep-learning deep-neural-networks generative-model machine-learning mlops multimodal natural-language-processing nlp stable-diffusion transformers tutorial
Last synced: 26 Mar 2025
https://github.com/TEN-framework/TEN-Agent
TEN Agent is a conversational voice AI agent powered by TEN, integrating Deepseek, Gemini, OpenAI, RTC, and hardware like ESP32. It enables realtime AI capabilities like seeing, hearing, and speaking, and is fully compatible with platforms like Dify and Coze.
agent ai asr cpp gemini golang gpt-4 gpt-4o llm low-latency multimodal nextjs14 openai python rag real-time realtime tts vision voice-assistant
Last synced: 08 Mar 2025
https://github.com/kyegomez/swarms
The Enterprise-Grade Production-Ready Multi-Agent Orchestration Framework. Website: https://swarms.ai
agents ai artificial-intelligence attention-mechanism chatgpt gpt4 gpt4all huggingface langchain langchain-python machine-learning multi-modal-imaging multi-modality multimodal prompt-engineering prompt-toolkit prompting swarms transformer-models tree-of-thoughts
Last synced: 08 Apr 2025
https://github.com/kyegomez/tree-of-thoughts
Plug in and Play Implementation of Tree of Thoughts: Deliberate Problem Solving with Large Language Models that Elevates Model Reasoning by at least 70%
artificial-intelligence chatgpt deep-learning gpt4 multimodal prompt prompt-engineering prompt-learning prompt-tuning
Last synced: 08 Apr 2025
https://github.com/idea-ccnl/fengshenbang-lm
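Fengshenbang-LM (封神榜大模型) is an open-source large-model ecosystem led by the Cognitive Computing and Natural Language Research Center at the IDEA Research Institute, intended to serve as infrastructure for Chinese AIGC and cognitive intelligence.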
aigc chinese-nlp distributed-training multimodal pretrained-models pytorch transformers
Last synced: 10 Apr 2025
https://github.com/IDEA-CCNL/Fengshenbang-LM
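Fengshenbang-LM (封神榜大模型) is an open-source large-model ecosystem led by the Cognitive Computing and Natural Language Research Center at the IDEA Research Institute, intended to serve as infrastructure for Chinese AIGC and cognitive intelligence.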
aigc chinese-nlp distributed-training multimodal pretrained-models pytorch transformers
Last synced: 26 Mar 2025
https://github.com/enricoros/big-AGI
Generative AI suite powered by state-of-the-art models and providing advanced AI/AGI functions. It features AI personas, AGI functions, multi-model chats, text-to-image, voice, response streaming, code highlighting and execution, PDF import, presets for developers, much more. Deploy on-prem or in the cloud.
agi anthropic beam chatgpt chatgpt-ui generative-ai gpt gpt-4 gpt-5 groq large-language-models mistral multimodal openai openai-api stable-diffusion ui
Last synced: 14 Mar 2025
https://github.com/jina-ai/discoart
🪩 Create Disco Diffusion artworks in one line
clip-guided-diffusion creative-ai creative-art cross-modal dalle diffusion disco-diffusion discodiffusion generative-art imgen latent-diffusion midjourney multimodal prompts stable-diffusion
Last synced: 13 Apr 2025
https://github.com/rom1504/img2dataset
Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.
big-data dataset deep-learning download-images image image-dataset multimodal
Last synced: 08 Apr 2025
https://github.com/open-mmlab/mmpretrain
OpenMMLab Pre-training Toolbox and Benchmark
beit clip constrastive-learning convnext deep-learning image-classification mae masked-image-modeling mobilenet moco multimodal pretrained-models pytorch resnet self-supervised-learning swin-transformer vision-transformer
Last synced: 08 Apr 2025
https://github.com/next-gpt/next-gpt
Code and models for NExT-GPT: Any-to-Any Multimodal Large Language Model
chatgpt foundation-models gpt-4 instruction-tuning large-language-models llm mllm multi-modal-chatgpt multimodal visual-language-learning
Last synced: 10 Apr 2025
https://github.com/NExT-GPT/NExT-GPT
Code and models for NExT-GPT: Any-to-Any Multimodal Large Language Model
chatgpt foundation-models gpt-4 instruction-tuning large-language-models llm multi-modal-chatgpt multimodal visual-language-learning
Last synced: 12 Mar 2025
https://github.com/OpenGVLab/InternGPT
InternGPT (iGPT) is an open source demo platform where you can easily showcase your AI models. Now it supports DragGAN, ChatGPT, ImageBind, multimodal chat like GPT-4, SAM, interactive image editing, etc. Try it at igpt.opengvlab.com (an online demo system supporting DragGAN, ChatGPT, ImageBind, and SAM)
chatgpt click draggan foundation-model gpt gpt-4 gradio husky image-captioning imagebind internimage langchain llama llm multimodal sam segment-anything vicuna video-generation vqa
Last synced: 27 Mar 2025
https://github.com/opengvlab/interngpt
InternGPT (iGPT) is an open source demo platform where you can easily showcase your AI models. Now it supports DragGAN, ChatGPT, ImageBind, multimodal chat like GPT-4, SAM, interactive image editing, etc. Try it at igpt.opengvlab.com (an online demo system supporting DragGAN, ChatGPT, ImageBind, and SAM)
chatgpt click draggan foundation-model gpt gpt-4 gradio husky image-captioning imagebind internimage langchain llama llm multimodal sam segment-anything vicuna video-generation vqa
Last synced: 10 Apr 2025
https://github.com/PKU-Alignment/align-anything
Align Anything: Training All-modality Model with Feedback
chameleon dpo large-language-models multimodal rlhf vision-language-model
Last synced: 01 Apr 2025
https://github.com/microsoft/torchscale
Foundation Architecture for (M)LLMs
computer-vision machine-learning multimodal natural-language-processing pretrained-language-model speech-processing transformer translation
Last synced: 10 Apr 2025
https://github.com/docarray/docarray
Represent, send, store and search multimodal data
cross-modal data-structures dataclass deep-learning docarray elasticsearch fastapi machine-learning multi-modal multimodal nearest-neighbor-search nested-data neural-search protobuf pydantic pytorch qdrant semantic-search weaviate
Last synced: 08 Apr 2025
https://github.com/jina-ai/docarray
Represent, send, store and search multimodal data
cross-modal data-structures dataclass deep-learning docarray elasticsearch fastapi machine-learning multi-modal multimodal nearest-neighbor-search nested-data neural-search protobuf pydantic pytorch qdrant semantic-search weaviate
Last synced: 05 Apr 2025
https://github.com/X-PLUG/MobileAgent
Mobile-Agent: The Powerful Mobile Device Operation Assistant Family
agent android app automation copilot gpt4v gui harmony ios mllm mobile mobile-agents multimodal multimodal-agent multimodal-large-language-models
Last synced: 11 Nov 2024
https://github.com/roboflow/maestro
Streamline the fine-tuning process for multimodal models: PaliGemma 2, Florence-2, and Qwen2.5-VL
captioning fine-tuning florence-2 multimodal objectdetection paligemma phi-3-vision qwen2-vl transformers vision-and-language vqa
Last synced: 11 Apr 2025
https://github.com/rom1504/clip-retrieval
Easily compute clip embeddings and build a clip retrieval system with them
ai clip deep-learning knn multimodal semantic-search
Last synced: 11 Apr 2025
https://github.com/iterative/datachain
ETL, Analytics, Versioning for Unstructured Data
ai cv data-analytics data-wrangling embeddings llm llm-eval machine-learning mlops multimodal
Last synced: 20 Apr 2025
https://github.com/InternLM/InternLM-XComposer
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
chatgpt foundation gpt gpt-4 instruction-tuning language-model large-language-model large-vision-language-model llm mllm multi-modality multimodal supervised-finetuning vision-language-model vision-transformer visual-language-learning
Last synced: 14 Nov 2024
https://github.com/internlm/internlm-xcomposer
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
chatgpt foundation gpt gpt-4 instruction-tuning language-model large-language-model large-vision-language-model llm mllm multi-modality multimodal supervised-finetuning vision-language-model vision-transformer visual-language-learning
Last synced: 10 Apr 2025
https://github.com/ofa-sys/ofa
Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
chinese image-captioning multimodal pretrained-models pretraining prompt prompt-tuning referring-expression-comprehension text-to-image-synthesis vision-language visual-question-answering
Last synced: 14 Apr 2025
https://github.com/OFA-Sys/OFA
Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
chinese image-captioning multimodal pretrained-models pretraining prompt prompt-tuning referring-expression-comprehension text-to-image-synthesis vision-language visual-question-answering
Last synced: 02 Apr 2025
https://github.com/om-ai-lab/omagent
Build multimodal language agents for fast prototyping and production
agent chatbot gemini gpt gpt4 gradio language-agent large-language-models llama llava llm multimodal multimodal-agent openai python rag smart-hardware vision-and-language vlm workflow
Last synced: 12 Apr 2025
https://github.com/x-plug/mplug-owl
mPLUG-Owl: The Powerful Multi-modal Large Language Model Family
alpaca chatbot chatgpt damo dialogue gpt gpt4 gpt4-api huggingface instruction-tuning large-language-models llama mplug mplug-owl multimodal pretraining pytorch transformer video visual-recognition
Last synced: 10 Apr 2025
https://github.com/stability-ai/stability-sdk
SDK for interacting with stability.ai APIs (e.g. stable diffusion inference)
ai-art generative-art latent-diffusion multimodal stable-diffusion
Last synced: 09 Apr 2025
https://github.com/Stability-AI/stability-sdk
SDK for interacting with stability.ai APIs (e.g. stable diffusion inference)
ai-art generative-art latent-diffusion multimodal stable-diffusion
Last synced: 17 Apr 2025
https://github.com/stability-AI/stability-sdk
SDK for interacting with stability.ai APIs (e.g. stable diffusion inference)
ai-art generative-art latent-diffusion multimodal stable-diffusion
Last synced: 18 Nov 2024
https://rom1504.github.io/clip-retrieval/?back=https%3A%2F%2Fknn5.laion.ai&index=laion5B&useMclip=false
Easily compute clip embeddings and build a clip retrieval system with them
ai clip deep-learning knn multimodal semantic-search
Last synced: 14 Nov 2024
https://github.com/X-PLUG/mPLUG-Owl
mPLUG-Owl: The Powerful Multi-modal Large Language Model Family
alpaca chatbot chatgpt damo dialogue gpt gpt4 gpt4-api huggingface instruction-tuning large-language-models llama mplug mplug-owl multimodal pretraining pytorch transformer video visual-recognition
Last synced: 19 Apr 2025
https://github.com/evolvinglmms-lab/lmms-eval
Accelerating the development of large multimodal models (LMMs) with one-click evaluation module - lmms-eval.
agi evaluation large-language-models multimodal
Last synced: 10 Apr 2025
https://github.com/autodistill/autodistill
Images to inference with no labeling (use foundation models to train supervised models).
auto-labeling computer-vision deep-learning foundation-models grounding-dino image-annotation image-classification instance-segmentation labeling-tool machine-learning model-distillation multimodal object-detection pytorch segment-anything yolov5 yolov8
Last synced: 09 Apr 2025
https://github.com/x-plug/mplug-docowl
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
chart-understanding document-understanding mllm multimodal multimodal-large-language-models table-understanding
Last synced: 12 Apr 2025
https://github.com/alan-ai/alan-sdk-android
Conversational AI SDK for Android to enable text and voice conversations with actions (Java, Kotlin)
alan-ai alan-sdk alan-studio alan-voice android conversational-ai machine-learning multimodal sdk speech-recognition text-to-speech voice voice-assistant voice-commands voice-control voice-interface vui
Last synced: 11 Apr 2025
https://github.com/alan-ai/alan-sdk-flutter
Conversational AI SDK for Flutter to enable text and voice conversations with actions (iOS and Android)
alan-sdk alan-studio alan-voice chatbot conversational-ai flutter machine-learning multimodal sdk speech-recognition text-to-speech voice voice-ai voice-assistant voice-commands voice-control voice-interface vui
Last synced: 07 Apr 2025
https://github.com/xlang-ai/osworld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
agent artificial-intelligence benchmark cli code-generation gui language-model large-action-model llm multimodal natural-language-processing reinforcement-learning rpa vlm
Last synced: 12 Apr 2025
https://github.com/alan-ai/alan-sdk-ionic
Conversational AI SDK for Ionic to enable text and voice conversations with actions (React, Angular, Vue)
alan-ionic-sdk alan-studio chatbot conversational-ai ionic machine-learning multimodal sdk speech-recognition text-to-speech voice voice-ai voice-assistant voice-commands voice-control voice-interface vui
Last synced: 07 Apr 2025
https://github.com/invictus717/metatransformer
Meta-Transformer for Unified Multimodal Learning
artificial-intelligence computer-vision foundationmodel machine-learning multimedia multimodal transformers
Last synced: 08 Apr 2025
https://github.com/alibabaresearch/advancedliteratemachinery
A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.
artificial-intelligence computer-vision document document-analysis document-intelligence document-recognition document-understanding documentai end-to-end-ocr multimodal multimodal-deep-learning ocr scene-text-detection scene-text-detection-recognition scene-text-recognition text-detection text-recognition vision-language vision-language-model vision-language-transformer
Last synced: 28 Mar 2025
https://github.com/AlibabaResearch/AdvancedLiterateMachinery
A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.
artificial-intelligence computer-vision document document-analysis document-intelligence document-recognition document-understanding documentai end-to-end-ocr multimodal multimodal-deep-learning ocr scene-text-detection scene-text-detection-recognition scene-text-recognition text-detection text-recognition vision-language vision-language-model vision-language-transformer
Last synced: 10 Apr 2025
https://github.com/invictus717/MetaTransformer
Meta-Transformer for Unified Multimodal Learning
artificial-intelligence computer-vision foundationmodel machine-learning multimedia multimodal transformers
Last synced: 10 Apr 2025
https://github.com/X-PLUG/mPLUG-DocOwl
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
chart-understanding document-understanding mllm multimodal multimodal-large-language-models table-understanding
Last synced: 17 Nov 2024
https://github.com/open-mmlab/multimodal-gpt
Multimodal-GPT
flamingo gpt gpt-4 llama multimodal transformer vision-and-language
Last synced: 08 Apr 2025
https://github.com/OpenGVLab/InternVideo
[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
action-recognition benchmark contrastive-learning foundation-models instruction-tuning masked-autoencoder multimodal open-set-recognition self-supervised spatio-temporal-action-localization temporal-action-localization video-clip video-data video-dataset video-question-answering video-retrieval video-understanding vision-transformer zero-shot-classification zero-shot-retrieval
Last synced: 20 Mar 2025
https://github.com/atfortes/llm-reasoning-papers
Reasoning in Large Language Models: Papers and Resources, including Chain-of-Thought, Instruction-Tuning and Multimodality.
awesome chain-of-thought chatgpt cot gpt gpt-4 in-context-learning language-models mllm multimodal papers prompt prompt-engineering question-answering reasoning vllm
Last synced: 06 Feb 2025
https://github.com/firebase/genkit
An open source framework for building AI-powered apps with familiar code-centric patterns. Genkit makes it easy to develop, integrate, and test AI features with observability and evaluations. Genkit works with various models and platforms.
agents ai embedders genkit llm machine-learning multimodal rag vector-database
Last synced: 10 Apr 2025
https://github.com/xlang-ai/OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
agent artificial-intelligence benchmark cli code-generation gui language-model large-action-model llm multimodal natural-language-processing reinforcement-learning rpa vlm
Last synced: 18 Apr 2025
https://github.com/kyegomez/BitNet
Implementation of "BitNet: Scaling 1-bit Transformers for Large Language Models" in PyTorch
artificial-intelligence deep-neural-networks deeplearning gpt4 machine-learning multimodal multimodal-deep-learning
Last synced: 08 Apr 2025
https://github.com/kyegomez/bitnet
Implementation of "BitNet: Scaling 1-bit Transformers for Large Language Models" in PyTorch
artificial-intelligence deep-neural-networks deeplearning gpt4 machine-learning multimodal multimodal-deep-learning
Last synced: 14 Apr 2025
https://github.com/deepseek-ai/Janus
Janus-Series: Unified Multimodal Understanding and Generation Models
any-to-any foundation-models llm multimodal unified-model vision-language-pretraining
Last synced: 06 Dec 2024
https://github.com/emcf/thepipe
Extract clean data from anywhere, powered by vision-language models ⚡
gpt-4 gpt-4o large-language-models multimodal pdf scrapers vision-transformer web
Last synced: 10 Apr 2025
https://github.com/alan-ai/alan-sdk-cordova
Conversational AI SDK for Apache Cordova to enable text and voice conversations with actions (iOS and Android)
chatbot conversational-ai machine-learning multimodal speech-recognition text-to-speech voice-assistant voice-commands voice-interface vui
Last synced: 09 Apr 2025
https://github.com/lucidrains/coca-pytorch
Implementation of CoCa, Contrastive Captioners are Image-Text Foundation Models, in PyTorch
artificial-intelligence attention-mechanism contrastive-learning deep-learning image-to-text multimodal transformers
Last synced: 14 Apr 2025
https://github.com/unum-cloud/uform
Pocket-Sized Multimodal AI for content understanding and generation across multilingual texts, images, and 🔜 video, up to 5x faster than OpenAI CLIP and LLaVA 🖼️ & 🖋️
bert clip clustering contrastive-learning cross-attention huggingface-transformers image-search language-vision llava multi-lingual multimodal neural-network openai openclip pretrained-models pytorch representation-learning semantic-search transformer vector-search
Last synced: 10 Apr 2025
https://github.com/abilzerian/llm-prompt-library
My personal prompt library for various LLMs + scripts & tools. Suitable for models from Deepseek, OpenAI, Claude, Meta, Mistral, Google, Grok, and others.
adaptive-learning meta-prompting multimodal prompt prompt-engineering prompt-evaluation prompt-generator prompt-injection prompt-learning prompt-management prompt-optimization prompt-template prompt-toolkit prompt-tuning promptengineering prompting rag text-analysis
Last synced: 20 Apr 2025
https://github.com/OpenBMB/VisCPM
[ICLR'24 spotlight] Chinese and English Multimodal Large Model Series (Chat and Paint) | A Chinese-English bilingual multimodal large model series based on the CPM foundation models
diffusion-models large-language-models multimodal transformers
Last synced: 16 Apr 2025
https://github.com/openbmb/viscpm
[ICLR'24 spotlight] Chinese and English Multimodal Large Model Series (Chat and Paint) | A Chinese-English bilingual multimodal large model series based on the CPM foundation models
diffusion-models large-language-models multimodal transformers
Last synced: 08 Apr 2025
https://github.com/google-research-datasets/wit
WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.
cc-by-sa-3 machine-learning multilingual multimodal nlp wikipedia
Last synced: 21 Feb 2025
https://github.com/rhymes-ai/Aria
Codebase for Aria - an Open Multimodal Native MoE
mixture-of-experts multimodal vision-and-language
Last synced: 20 Feb 2025
https://github.com/OFA-Sys/ONE-PEACE
A general representation model across vision, audio, language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
audio-language contrastive-loss foundation-models multimodal representation-learning vision-and-language vision-language vision-transformer
Last synced: 29 Nov 2024
https://github.com/om-ai-lab/OmAgent
A multimodal agent framework for solving complex tasks [EMNLP'2024]
agent large-language-models multimodal
Last synced: 07 Feb 2025
https://github.com/aidc-ai/ovis
A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.
chatbot llama3 multimodal multimodal-large-language-models multimodality qwen vision-language-learning vision-language-model
Last synced: 14 Apr 2025
https://github.com/WangRongsheng/XrayGLM
🩺 The first Chinese multimodal medical large model that can read chest X-rays and summarize chest radiographs.
large-language-models llms medical multimodal visualglm-6b xray
Last synced: 01 Apr 2025
https://github.com/potamides/detikzify
Synthesizing Graphics Programs for Scientific Figures and Sketches with TikZ
draw graph huggingface inverse-graphics latex llama llm multimodal sketch tikz transformers vectorization visualization
Last synced: 15 Apr 2025
https://github.com/InternLM/HuixiangDou
HuixiangDou: Overcoming Group Chat Scenarios with LLM-based Technical Assistance
application assistance chatbot dsl lark llm multimodal ocr pipeline rag robot wechat
Last synced: 24 Mar 2025
https://github.com/internlm/huixiangdou
HuixiangDou: Overcoming Group Chat Scenarios with LLM-based Technical Assistance
application assistance chatbot dsl lark llm multimodal ocr pipeline rag robot wechat
Last synced: 08 Apr 2025
https://github.com/ArrowLuo/CLIP4Clip
An official implementation for "CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval"
activitynet clip didemo lsmdc msrvtt msvd multimodal multimodal-learning multimodality ranking retrieval retrieval-model search video-clip-retrieval video-text-retrieval
Last synced: 03 Apr 2025
https://github.com/showlab/show-o
Repository for Show-o, One Single Transformer to Unify Multimodal Understanding and Generation.
diffusion-models large-language-models multimodal
Last synced: 07 Apr 2025
https://github.com/lancedb/vectordb-recipes
High quality resources & applications for LLMs, multi-modal models and VectorDBs
agents ai deep-learning embeddings fine-tuning gpt gpt-4-vision langchain llama-index llms machine-learning multimodal openai rag vector-database
Last synced: 13 Apr 2025
https://github.com/xtreme1-io/xtreme1
Xtreme1 is an all-in-one data labeling and annotation platform for multimodal data training and supports 3D LiDAR point cloud, image, and LLM.
3d-annotation annotation annotation-tool computer-vision image-annotation image-classification image-labelling-tool labeling-tool multimodal point-cloud rlhf
Last synced: 20 Mar 2025
https://github.com/OpenRobotLab/PointLLM
[ECCV 2024 Best Paper Candidate] PointLLM: Empowering Large Language Models to Understand Point Clouds
3d chatbot foundation-models gpt-4 large-language-models llama multimodal objaverse point-cloud pointllm representation-learning vision-and-language
Last synced: 20 Mar 2025
https://github.com/morphik-org/morphik-core
Open source multi-modal RAG for building AI apps over private knowledge.
artificial-intelligence cache-augmented-generation colpali database litellm multimodal open-source rag rules-based-ingestion
Last synced: 10 Apr 2025
https://github.com/nomic-ai/contrastors
Train Models Contrastively in PyTorch
contrastive-learning deep-learning dense-retrieval embeddings image-embeddings multimodal multimodal-rag pytorch rag text-embeddings transformers
Last synced: 01 Apr 2025
https://github.com/allenai/papermage
Library supporting NLP and CV research on scientific papers
computer-vision machine-learning multimodal natural-language-processing pdf-processing python scientific-papers
Last synced: 06 Mar 2025
https://github.com/ailab-cvc/seed
Official implementation of SEED-LLaMA (ICLR 2024).
foundation-model multimodal vision-language
Last synced: 09 Apr 2025