Projects in Awesome Lists tagged with multimodal

https://github.com/mintplex-labs/anything-llm

The all-in-one Desktop & Docker AI application with built-in RAG, AI agents, No-code agent builder, and more.

agent-framework-javascript ai-agents crewai custom-ai-agents deepseek deepseek-r1 desktop-app llama3 llm llm-application llm-webui lmstudio local-llm localai multimodal nodejs ollama rag vector-database webui

Last synced: 15 Apr 2025

https://github.com/Mintplex-Labs/anything-llm

The all-in-one Desktop & Docker AI application with built-in RAG, AI agents, No-code agent builder, and more.

agent-framework-javascript ai-agents crewai custom-ai-agents deepseek deepseek-r1 desktop-app llama3 llm llm-application llm-webui lmstudio local-llm localai multimodal nodejs ollama rag vector-database webui

Last synced: 17 Mar 2025

https://github.com/haotian-liu/llava

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.

chatbot chatgpt foundation-models gpt-4 instruction-tuning llama llama-2 llama2 llava multi-modality multimodal vision-language-model visual-language-learning

Last synced: 15 Apr 2025

https://github.com/haotian-liu/LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.

chatbot chatgpt foundation-models gpt-4 instruction-tuning llama llama-2 llama2 llava multi-modality multimodal vision-language-model visual-language-learning

Last synced: 14 Mar 2025

https://github.com/jina-ai/serve

☁️ Build multimodal AI applications with cloud-native stack

cloud-native cncf deep-learning docker fastapi framework generative-ai grpc jaeger kubernetes llmops machine-learning microservice mlops multimodal neural-search opentelemetry orchestration pipeline prometheus

Last synced: 19 Apr 2025

https://github.com/jina-ai/jina

☁️ Build multimodal AI applications with cloud-native stack

cloud-native cncf deep-learning docker fastapi framework generative-ai grpc jaeger kubernetes llmops machine-learning microservice mlops multimodal neural-search opentelemetry orchestration pipeline prometheus

Last synced: 01 Feb 2025

https://github.com/microsoft/unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

beit beit-3 bitnet deepnet document-ai foundation-models kosmos kosmos-1 layoutlm layoutxlm llm minilm mllm multimodal nlp pre-trained-model textdiffuser trocr unilm xlm-e

Last synced: 08 Apr 2025

https://github.com/deepseek-ai/janus

Janus-Series: Unified Multimodal Understanding and Generation Models

any-to-any foundation-models llm multimodal unified-model vision-language-pretraining

Last synced: 13 Apr 2025

https://github.com/nvidia/nemo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)

asr deeplearning generative-ai large-language-models machine-translation multimodal neural-networks speaker-diariazation speaker-recognition speech-synthesis speech-translation tts

Last synced: 16 Apr 2025

https://github.com/mediar-ai/screenpipe

AI app store powered by 24/7 desktop history. Open source | 100% local | dev friendly | 24/7 screen, mic recording

agents agi ai computer-vision llm machine-learning ml multimodal vision

Last synced: 08 Apr 2025

https://github.com/NVIDIA/NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)

asr deeplearning generative-ai large-language-models machine-translation multimodal neural-networks speaker-diariazation speaker-recognition speech-synthesis speech-translation tts

Last synced: 14 Mar 2025

https://github.com/rerun-io/rerun

Visualize streams of multimodal data. Free, fast, easy to use, and simple to integrate. Built in Rust.

computer-vision cpp multimodal python robotics rust visualization

Last synced: 18 Apr 2025

https://github.com/bentoml/bentoml

The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and more!

ai-inference deep-learning generative-ai inference-platform llm llm-inference llm-serving llmops machine-learning ml-engineering mlops model-inference-service model-serving multimodal python

Last synced: 19 Apr 2025

https://github.com/bentoml/BentoML

The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and much more!

ai-inference deep-learning generative-ai inference-platform llm llm-inference llm-serving llmops machine-learning ml-engineering mlops model-inference-service model-serving multimodal python

Last synced: 12 Mar 2025

https://github.com/enricoros/big-agi

AI suite powered by state-of-the-art models and providing advanced AI/AGI functions. It features AI personas, AGI functions, multi-model chats, text-to-image, voice, response streaming, code highlighting and execution, PDF import, presets for developers, much more. Deploy on-prem or in the cloud.

agi anthropic beam chatgpt chatgpt-ui generative-ai gpt gpt-4 gpt-5 groq large-language-models mistral multimodal openai openai-api stable-diffusion ui

Last synced: 09 Apr 2025

https://github.com/skalskip/courses

This repository is a curated collection of links to various courses and resources about Artificial Intelligence (AI)

computer-vision deep-learning deep-neural-networks generative-model machine-learning mlops multimodal natural-language-processing nlp stable-diffusion transformers tutorial

Last synced: 11 Apr 2025

https://github.com/enricoros/nextjs-chatgpt-app

Generative AI suite powered by state-of-the-art models and providing advanced AI/AGI functions. It features AI personas, AGI functions, multi-model chats, text-to-image, voice, response streaming, code highlighting and execution, PDF import, presets for developers, much more. Deploy on-prem or in the cloud.

agi anthropic beam chatgpt chatgpt-ui generative-ai gpt gpt-4 gpt-5 groq large-language-models mistral multimodal openai openai-api stable-diffusion ui

Last synced: 13 Dec 2024

https://github.com/swyxio/ai-notes

Notes for software engineers getting up to speed on new AI developments. Serves as a datastore for https://latent.space writing and product brainstorming, with cleaned-up canonical references under the /Resources folder.

ai gpt gpt-3 multimodal openai prompt-engineering stable-diffusion

Last synced: 12 Apr 2025

https://github.com/modelscope/ms-swift

Use PEFT or full-parameter training to fine-tune 450+ LLMs (Qwen2.5, InternLM3, GLM4, Llama3.3, Mistral, Yi1.5, Baichuan2, DeepSeek-R1, ...) and 150+ MLLMs (Qwen2.5-VL, Qwen2-Audio, Llama3.2-Vision, Llava, InternVL2.5, MiniCPM-V-2.6, GLM4v, Xcomposer2.5, Yi-VL, DeepSeek-VL2, Phi3.5-Vision, GOT-OCR2, ...).

agent deepseek-r1 deploy distill embedding grpo internvl liger llama llama3-3 llm lora multimodal open-r1 peft qwen2-5 qwen2-vl rft sft

Last synced: 09 Apr 2025

https://github.com/facebookresearch/mmf

A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)

captioning deep-learning dialog hateful-memes multi-tasking multimodal pretrained-models pytorch textvqa vqa

Last synced: 08 Apr 2025

https://github.com/ten-framework/ten-agent

TEN Agent is a conversational voice AI agent powered by TEN, integrating Deepseek, Gemini, OpenAI, RTC, and hardware like ESP32. It enables realtime AI capabilities like seeing, hearing, and speaking, and is fully compatible with platforms like Dify and Coze.

agent ai asr cpp gemini golang gpt-4 gpt-4o llm low-latency multimodal nextjs14 openai python rag real-time realtime tts vision voice-assistant

Last synced: 31 Mar 2025

https://github.com/SkalskiP/courses

This repository is a curated collection of links to various courses and resources about Artificial Intelligence (AI)

computer-vision deep-learning deep-neural-networks generative-model machine-learning mlops multimodal natural-language-processing nlp stable-diffusion transformers tutorial

Last synced: 26 Mar 2025

https://github.com/TEN-framework/TEN-Agent

TEN Agent is a conversational voice AI agent powered by TEN, integrating Deepseek, Gemini, OpenAI, RTC, and hardware like ESP32. It enables realtime AI capabilities like seeing, hearing, and speaking, and is fully compatible with platforms like Dify and Coze.

agent ai asr cpp gemini golang gpt-4 gpt-4o llm low-latency multimodal nextjs14 openai python rag real-time realtime tts vision voice-assistant

Last synced: 08 Mar 2025

https://github.com/kyegomez/swarms

The Enterprise-Grade Production-Ready Multi-Agent Orchestration Framework. Website: https://swarms.ai

agents ai artificial-intelligence attention-mechanism chatgpt gpt4 gpt4all huggingface langchain langchain-python machine-learning multi-modal-imaging multi-modality multimodal prompt-engineering prompt-toolkit prompting swarms transformer-models tree-of-thoughts

Last synced: 08 Apr 2025

https://github.com/kyegomez/tree-of-thoughts

Plug-and-Play Implementation of Tree of Thoughts: Deliberate Problem Solving with Large Language Models that Elevates Model Reasoning by at least 70%

artificial-intelligence chatgpt deep-learning gpt4 multimodal prompt prompt-engineering prompt-learning prompt-tuning

Last synced: 08 Apr 2025

https://github.com/pyspur-dev/pyspur

A visual playground for agentic workflows: Iterate over your agents 10x faster

agent agents ai builder deepseek framework gemini graph human-in-the-loop llm llms loops multimodal ollama python rag reasoning tool trace workflow

Last synced: 10 Apr 2025

https://github.com/idea-ccnl/fengshenbang-lm

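Fengshenbang-LM (封神榜大模型) is an open-source large-model ecosystem led by the Center for Cognitive Computing and Natural Language at the IDEA research institute, serving as infrastructure for Chinese AIGC and cognitive intelligence.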

aigc chinese-nlp distributed-training multimodal pretrained-models pytorch transformers

Last synced: 10 Apr 2025

https://github.com/IDEA-CCNL/Fengshenbang-LM

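Fengshenbang-LM (封神榜大模型) is an open-source large-model ecosystem led by the Center for Cognitive Computing and Natural Language at the IDEA research institute, serving as infrastructure for Chinese AIGC and cognitive intelligence.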

aigc chinese-nlp distributed-training multimodal pretrained-models pytorch transformers

Last synced: 26 Mar 2025

https://github.com/enricoros/big-AGI

Generative AI suite powered by state-of-the-art models and providing advanced AI/AGI functions. It features AI personas, AGI functions, multi-model chats, text-to-image, voice, response streaming, code highlighting and execution, PDF import, presets for developers, much more. Deploy on-prem or in the cloud.

agi anthropic beam chatgpt chatgpt-ui generative-ai gpt gpt-4 gpt-5 groq large-language-models mistral multimodal openai openai-api stable-diffusion ui

Last synced: 14 Mar 2025

https://github.com/jina-ai/discoart

🪩 Create Disco Diffusion artworks in one line

clip-guided-diffusion creative-ai creative-art cross-modal dalle diffusion disco-diffusion discodiffusion generative-art imgen latent-diffusion midjourney multimodal prompts stable-diffusion

Last synced: 13 Apr 2025

https://github.com/rom1504/img2dataset

Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.

big-data dataset deep-learning download-images image image-dataset multimodal

Last synced: 08 Apr 2025

https://github.com/open-mmlab/mmpretrain

OpenMMLab Pre-training Toolbox and Benchmark

beit clip constrastive-learning convnext deep-learning image-classification mae masked-image-modeling mobilenet moco multimodal pretrained-models pytorch resnet self-supervised-learning swin-transformer vision-transformer

Last synced: 08 Apr 2025

https://github.com/next-gpt/next-gpt

Code and models for NExT-GPT: Any-to-Any Multimodal Large Language Model

chatgpt foundation-models gpt-4 instruction-tuning large-language-models llm mllm multi-modal-chatgpt multimodal visual-language-learning

Last synced: 10 Apr 2025

https://github.com/NExT-GPT/NExT-GPT

Code and models for NExT-GPT: Any-to-Any Multimodal Large Language Model

chatgpt foundation-models gpt-4 instruction-tuning large-language-models llm multi-modal-chatgpt multimodal visual-language-learning

Last synced: 12 Mar 2025

https://github.com/OpenGVLab/InternGPT

InternGPT (iGPT) is an open source demo platform where you can easily showcase your AI models. Now it supports DragGAN, ChatGPT, ImageBind, multimodal chat like GPT-4, SAM, interactive image editing, etc. Try it at igpt.opengvlab.com (an online demo system supporting DragGAN, ChatGPT, ImageBind, and SAM)

chatgpt click draggan foundation-model gpt gpt-4 gradio husky image-captioning imagebind internimage langchain llama llm multimodal sam segment-anything vicuna video-generation vqa

Last synced: 27 Mar 2025

https://github.com/opengvlab/interngpt

InternGPT (iGPT) is an open source demo platform where you can easily showcase your AI models. Now it supports DragGAN, ChatGPT, ImageBind, multimodal chat like GPT-4, SAM, interactive image editing, etc. Try it at igpt.opengvlab.com (an online demo system supporting DragGAN, ChatGPT, ImageBind, and SAM)

chatgpt click draggan foundation-model gpt gpt-4 gradio husky image-captioning imagebind internimage langchain llama llm multimodal sam segment-anything vicuna video-generation vqa

Last synced: 10 Apr 2025

https://github.com/PKU-Alignment/align-anything

Align Anything: Training All-modality Model with Feedback

chameleon dpo large-language-models multimodal rlhf vision-language-model

Last synced: 01 Apr 2025

https://github.com/microsoft/torchscale

Foundation Architecture for (M)LLMs

computer-vision machine-learning multimodal natural-language-processing pretrained-language-model speech-processing transformer translation

Last synced: 10 Apr 2025

https://github.com/docarray/docarray

Represent, send, store and search multimodal data

cross-modal data-structures dataclass deep-learning docarray elasticsearch fastapi machine-learning multi-modal multimodal nearest-neighbor-search nested-data neural-search protobuf pydantic pytorch qdrant semantic-search weaviate

Last synced: 08 Apr 2025

https://github.com/jina-ai/docarray

Represent, send, store and search multimodal data

cross-modal data-structures dataclass deep-learning docarray elasticsearch fastapi machine-learning multi-modal multimodal nearest-neighbor-search nested-data neural-search protobuf pydantic pytorch qdrant semantic-search weaviate

Last synced: 05 Apr 2025

https://github.com/X-PLUG/MobileAgent

Mobile-Agent: The Powerful Mobile Device Operation Assistant Family

agent android app automation copilot gpt4v gui harmony ios mllm mobile mobile-agents multimodal multimodal-agent multimodal-large-language-models

Last synced: 11 Nov 2024

https://github.com/roboflow/maestro

Streamline the fine-tuning process for multimodal models: PaliGemma 2, Florence-2, and Qwen2.5-VL

captioning fine-tuning florence-2 multimodal objectdetection paligemma phi-3-vision qwen2-vl transformers vision-and-language vqa

Last synced: 11 Apr 2025

https://github.com/rom1504/clip-retrieval

Easily compute clip embeddings and build a clip retrieval system with them

ai clip deep-learning knn multimodal semantic-search

Last synced: 11 Apr 2025

https://github.com/iterative/datachain

ETL, Analytics, Versioning for Unstructured Data

ai cv data-analytics data-wrangling embeddings llm llm-eval machine-learning mlops multimodal

Last synced: 20 Apr 2025

https://github.com/InternLM/InternLM-XComposer

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

chatgpt foundation gpt gpt-4 instruction-tuning language-model large-language-model large-vision-language-model llm mllm multi-modality multimodal supervised-finetuning vision-language-model vision-transformer visual-language-learning

Last synced: 14 Nov 2024

https://github.com/internlm/internlm-xcomposer

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

chatgpt foundation gpt gpt-4 instruction-tuning language-model large-language-model large-vision-language-model llm mllm multi-modality multimodal supervised-finetuning vision-language-model vision-transformer visual-language-learning

Last synced: 10 Apr 2025

https://github.com/ofa-sys/ofa

Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

chinese image-captioning multimodal pretrained-models pretraining prompt prompt-tuning referring-expression-comprehension text-to-image-synthesis vision-language visual-question-answering

Last synced: 14 Apr 2025

https://github.com/OFA-Sys/OFA

Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

chinese image-captioning multimodal pretrained-models pretraining prompt prompt-tuning referring-expression-comprehension text-to-image-synthesis vision-language visual-question-answering

Last synced: 02 Apr 2025

https://github.com/om-ai-lab/omagent

Build multimodal language agents for fast prototype and production

agent chatbot gemini gpt gpt4 gradio language-agent large-language-models llama llava llm multimodal multimodal-agent openai python rag smart-hardware vision-and-language vlm workflow

Last synced: 12 Apr 2025

https://github.com/x-plug/mplug-owl

mPLUG-Owl: The Powerful Multi-modal Large Language Model Family

alpaca chatbot chatgpt damo dialogue gpt gpt4 gpt4-api huggingface instruction-tuning large-language-models llama mplug mplug-owl multimodal pretraining pytorch transformer video visual-recognition

Last synced: 10 Apr 2025

https://github.com/stability-ai/stability-sdk

SDK for interacting with stability.ai APIs (e.g. stable diffusion inference)

ai-art generative-art latent-diffusion multimodal stable-diffusion

Last synced: 09 Apr 2025

https://github.com/Stability-AI/stability-sdk

SDK for interacting with stability.ai APIs (e.g. stable diffusion inference)

ai-art generative-art latent-diffusion multimodal stable-diffusion

Last synced: 17 Apr 2025

https://github.com/stability-AI/stability-sdk

SDK for interacting with stability.ai APIs (e.g. stable diffusion inference)

ai-art generative-art latent-diffusion multimodal stable-diffusion

Last synced: 18 Nov 2024

https://rom1504.github.io/clip-retrieval/?back=https%3A%2F%2Fknn5.laion.ai&index=laion5B&useMclip=false

Easily compute clip embeddings and build a clip retrieval system with them

ai clip deep-learning knn multimodal semantic-search

Last synced: 14 Nov 2024

https://github.com/X-PLUG/mPLUG-Owl

mPLUG-Owl: The Powerful Multi-modal Large Language Model Family

alpaca chatbot chatgpt damo dialogue gpt gpt4 gpt4-api huggingface instruction-tuning large-language-models llama mplug mplug-owl multimodal pretraining pytorch transformer video visual-recognition

Last synced: 19 Apr 2025

https://github.com/evolvinglmms-lab/lmms-eval

Accelerating the development of large multimodal models (LMMs) with one-click evaluation module - lmms-eval.

agi evaluation large-language-models multimodal

Last synced: 10 Apr 2025

https://github.com/autodistill/autodistill

Images to inference with no labeling (use foundation models to train supervised models).

auto-labeling computer-vision deep-learning foundation-models grounding-dino image-annotation image-classification instance-segmentation labeling-tool machine-learning model-distillation multimodal object-detection pytorch segment-anything yolov5 yolov8

Last synced: 09 Apr 2025

https://github.com/x-plug/mplug-docowl

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

chart-understanding document-understanding mllm multimodal multimodal-large-language-models table-understanding

Last synced: 12 Apr 2025

https://github.com/alan-ai/alan-sdk-android

Conversational AI SDK for Android to enable text and voice conversations with actions (Java, Kotlin)

alan-ai alan-sdk alan-studio alan-voice android conversational-ai machine-learning multimodal sdk speech-recognition text-to-speech voice voice-assistant voice-commands voice-control voice-interface vui

Last synced: 11 Apr 2025

https://github.com/alan-ai/alan-sdk-flutter

Conversational AI SDK for Flutter to enable text and voice conversations with actions (iOS and Android)

alan-sdk alan-studio alan-voice chatbot conversational-ai flutter machine-learning multimodal sdk speech-recognition text-to-speech voice voice-ai voice-assistant voice-commands voice-control voice-interface vui

Last synced: 07 Apr 2025

https://github.com/xlang-ai/osworld

[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

agent artificial-intelligence benchmark cli code-generation gui language-model large-action-model llm multimodal natural-language-processing reinforcement-learning rpa vlm

Last synced: 12 Apr 2025

https://github.com/alan-ai/alan-sdk-ionic

Conversational AI SDK for Ionic to enable text and voice conversations with actions (React, Angular, Vue)

alan-ionic-sdk alan-studio chatbot conversational-ai ionic machine-learning multimodal sdk speech-recognition text-to-speech voice voice-ai voice-assistant voice-commands voice-control voice-interface vui

Last synced: 07 Apr 2025

https://github.com/invictus717/metatransformer

Meta-Transformer for Unified Multimodal Learning

artificial-intelligence computer-vision foundationmodel machine-learning multimedia multimodal transformers

Last synced: 08 Apr 2025

https://github.com/alibabaresearch/advancedliteratemachinery

A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.

artificial-intelligence computer-vision document document-analysis document-intelligence document-recognition document-understanding documentai end-to-end-ocr multimodal multimodal-deep-learning ocr scene-text-detection scene-text-detection-recognition scene-text-recognition text-detection text-recognition vision-language vision-language-model vision-language-transformer

Last synced: 28 Mar 2025

https://github.com/AlibabaResearch/AdvancedLiterateMachinery

A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.

artificial-intelligence computer-vision document document-analysis document-intelligence document-recognition document-understanding documentai end-to-end-ocr multimodal multimodal-deep-learning ocr scene-text-detection scene-text-detection-recognition scene-text-recognition text-detection text-recognition vision-language vision-language-model vision-language-transformer

Last synced: 10 Apr 2025

https://github.com/invictus717/MetaTransformer

Meta-Transformer for Unified Multimodal Learning

artificial-intelligence computer-vision foundationmodel machine-learning multimedia multimodal transformers

Last synced: 10 Apr 2025

https://github.com/X-PLUG/mPLUG-DocOwl

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

chart-understanding document-understanding mllm multimodal multimodal-large-language-models table-understanding

Last synced: 17 Nov 2024

https://github.com/open-mmlab/multimodal-gpt

Multimodal-GPT

flamingo gpt gpt-4 llama multimodal transformer vision-and-language

Last synced: 08 Apr 2025

https://github.com/OpenGVLab/InternVideo

[ECCV2024] Video Foundation Models & Data for Multimodal Understanding

action-recognition benchmark contrastive-learning foundation-models instruction-tuning masked-autoencoder multimodal open-set-recognition self-supervised spatio-temporal-action-localization temporal-action-localization video-clip video-data video-dataset video-question-answering video-retrieval video-understanding vision-transformer zero-shot-classification zero-shot-retrieval

Last synced: 20 Mar 2025

https://github.com/atfortes/llm-reasoning-papers

Reasoning in Large Language Models: Papers and Resources, including Chain-of-Thought, Instruction-Tuning and Multimodality.

awesome chain-of-thought chatgpt cot gpt gpt-4 in-context-learning language-models mllm multimodal papers prompt prompt-engineering question-answering reasoning vllm

Last synced: 06 Feb 2025

https://github.com/firebase/genkit

An open source framework for building AI-powered apps with familiar code-centric patterns. Genkit makes it easy to develop, integrate, and test AI features with observability and evaluations. Genkit works with various models and platforms.

agents ai embedders genkit llm machine-learning multimodal rag vector-database

Last synced: 10 Apr 2025

https://github.com/xlang-ai/OSWorld

[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

agent artificial-intelligence benchmark cli code-generation gui language-model large-action-model llm multimodal natural-language-processing reinforcement-learning rpa vlm

Last synced: 18 Apr 2025

https://github.com/kyegomez/BitNet

Implementation of "BitNet: Scaling 1-bit Transformers for Large Language Models" in PyTorch

artificial-intelligence deep-neural-networks deeplearning gpt4 machine-learning multimodal multimodal-deep-learning

Last synced: 08 Apr 2025

https://github.com/kyegomez/bitnet

Implementation of "BitNet: Scaling 1-bit Transformers for Large Language Models" in PyTorch

artificial-intelligence deep-neural-networks deeplearning gpt4 machine-learning multimodal multimodal-deep-learning

Last synced: 14 Apr 2025

https://github.com/deepseek-ai/Janus

Janus-Series: Unified Multimodal Understanding and Generation Models

any-to-any foundation-models llm multimodal unified-model vision-language-pretraining

Last synced: 06 Dec 2024

https://github.com/emcf/thepipe

Extract clean data from anywhere, powered by vision-language models ⚡

gpt-4 gpt-4o large-language-models multimodal pdf scrapers vision-transformer web

Last synced: 10 Apr 2025

https://github.com/alan-ai/alan-sdk-cordova

Conversational AI SDK for Apache Cordova to enable text and voice conversations with actions (iOS and Android)

chatbot conversational-ai machine-learning multimodal speech-recognition text-to-speech voice-assistant voice-commands voice-interface vui

Last synced: 09 Apr 2025

https://github.com/lucidrains/coca-pytorch

Implementation of CoCa, Contrastive Captioners are Image-Text Foundation Models, in PyTorch

artificial-intelligence attention-mechanism contrastive-learning deep-learning image-to-text multimodal transformers

Last synced: 14 Apr 2025

https://github.com/unum-cloud/uform

Pocket-Sized Multimodal AI for content understanding and generation across multilingual texts, images, and 🔜 video, up to 5x faster than OpenAI CLIP and LLaVA 🖼️ & 🖋️

bert clip clustering contrastive-learning cross-attention huggingface-transformers image-search language-vision llava multi-lingual multimodal neural-network openai openclip pretrained-models pytorch representation-learning semantic-search transformer vector-search

Last synced: 10 Apr 2025

https://github.com/abilzerian/llm-prompt-library

My personal prompt library for various LLMs + scripts & tools. Suitable for models from Deepseek, OpenAI, Claude, Meta, Mistral, Google, Grok, and others.

adaptive-learning meta-prompting multimodal prompt prompt-engineering prompt-evaluation prompt-generator prompt-injection prompt-learning prompt-management prompt-optimization prompt-template prompt-toolkit prompt-tuning promptengineering prompting rag text-analysis

Last synced: 20 Apr 2025

https://github.com/OpenBMB/VisCPM

[ICLR'24 spotlight] Chinese and English Multimodal Large Model Series (Chat and Paint) | A Chinese-English bilingual multimodal large model series based on the CPM foundation models

diffusion-models large-language-models multimodal transformers

Last synced: 16 Apr 2025

https://github.com/openbmb/viscpm

[ICLR'24 spotlight] Chinese and English Multimodal Large Model Series (Chat and Paint) | A Chinese-English bilingual multimodal large model series based on the CPM foundation models

diffusion-models large-language-models multimodal transformers

Last synced: 08 Apr 2025

https://github.com/google-research-datasets/wit

WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

cc-by-sa-3 machine-learning multilingual multimodal nlp wikipedia

Last synced: 21 Feb 2025

https://github.com/rhymes-ai/Aria

Codebase for Aria - an Open Multimodal Native MoE

mixture-of-experts multimodal vision-and-language

Last synced: 20 Feb 2025

https://github.com/OFA-Sys/ONE-PEACE

A general representation model across vision, audio, language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities

audio-language contrastive-loss foundation-models multimodal representation-learning vision-and-language vision-language vision-transformer