Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Projects in Awesome Lists tagged with mllm
A curated list of projects in awesome lists tagged with mllm.
https://github.com/microsoft/unilm
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
beit beit-3 bitnet deepnet document-ai foundation-models kosmos kosmos-1 layoutlm layoutxlm llm minilm mllm multimodal nlp pre-trained-model textdiffuser trocr unilm xlm-e
Last synced: 16 Dec 2024
https://github.com/X-PLUG/MobileAgent
Mobile-Agent: The Powerful Mobile Device Operation Assistant Family
agent android app automation copilot gpt4v gui harmony ios mllm mobile mobile-agents multimodal multimodal-agent multimodal-large-language-models
Last synced: 11 Nov 2024
https://github.com/InternLM/InternLM-XComposer
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
chatgpt foundation gpt gpt-4 instruction-tuning language-model large-language-model large-vision-language-model llm mllm multi-modality multimodal supervised-finetuning vision-language-model vision-transformer visual-language-learning
Last synced: 14 Nov 2024
https://github.com/cambrian-mllm/cambrian
Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
chatbot clip computer-vision dino instruction-tuning large-language-models llms mllm multimodal-large-language-models representation-learning
Last synced: 19 Dec 2024
https://github.com/X-PLUG/mPLUG-DocOwl
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
chart-understanding document-understanding mllm multimodal multimodal-large-language-models table-understanding
Last synced: 17 Nov 2024
https://github.com/baai-dcai/bunny
A family of lightweight multimodal models.
chatgpt chinese english gpt-4 mllm multimodal-large-language-models vlm
Last synced: 09 Nov 2024
https://github.com/BradyFU/Woodpecker
✨✨Woodpecker: Hallucination Correction for Multimodal Large Language Models. The first work to correct hallucinations in MLLMs.
hallucination hallucinations large-language-models llm mllm multimodal-large-language-models multimodality
Last synced: 16 Nov 2024
https://github.com/foundationvision/groma
[ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization
foundation-models grounding large-language-models llama llama2 llm mllm multimodal vision-language-model
Last synced: 21 Dec 2024
https://github.com/NVlabs/EAGLE
EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
demo eagle gpt4 huggingface large-language-models llama llama3 llava llm lmm lvlm mllm nvdia
Last synced: 26 Sep 2024
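
EAGLE's central idea is to fuse features from several complementary vision encoders before handing them to the LLM. Below is a minimal PyTorch sketch of one simple fusion strategy (channel-wise concatenation followed by a linear projection); the module, dimensions, and encoder choices are illustrative assumptions, not EAGLE's actual code.

```python
# Hypothetical sketch of a "mixture of encoders" fusion (not EAGLE's actual code).
import torch
import torch.nn as nn

class MixtureOfEncoders(nn.Module):
    """Fuse per-patch features from two vision encoders by channel concatenation,
    then project into the LLM's embedding space."""
    def __init__(self, dim_a: int, dim_b: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(dim_a + dim_b, llm_dim)

    def forward(self, feats_a: torch.Tensor, feats_b: torch.Tensor) -> torch.Tensor:
        # feats_a: (batch, num_patches, dim_a); feats_b: (batch, num_patches, dim_b)
        fused = torch.cat([feats_a, feats_b], dim=-1)  # (batch, num_patches, dim_a + dim_b)
        return self.proj(fused)                        # (batch, num_patches, llm_dim)

# Example: a 1024-d CLIP-like encoder plus a 768-d ConvNeXt-like encoder for a 4096-d LLM.
fusion = MixtureOfEncoders(1024, 768, 4096)
tokens = fusion(torch.randn(1, 576, 1024), torch.randn(1, 576, 768))
print(tokens.shape)  # torch.Size([1, 576, 4096])
```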
https://github.com/Coobiw/MPP-LLaVA
Personal Project: MPP-Qwen14B & MPP-Qwen-Next (Multimodal Pipeline Parallel based on Qwen-LM). Supports [video/image/multi-image] {sft/conversations}. Don't let poverty limit your imagination! Train your own 8B/14B LLaVA-style MLLM on a 24 GB RTX 3090/4090.
deepspeed fine-tuning mllm model-parallel multimodal-large-language-models pipeline-parallelism pretraining qwen video-language-model video-large-language-models
Last synced: 16 Oct 2024
https://github.com/X-PLUG/Youku-mPLUG
Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Pre-training Dataset and Benchmarks
benchmark chinese dataset mllm multimodal multimodal-large-language-models multimodal-pretraining video video-question-answering video-retrieval youku
Last synced: 09 Nov 2024
https://github.com/baaivision/eve
[NeurIPS'24 Spotlight] EVE: Encoder-Free Vision-Language Models
clip encoder-free-vlm instruction-following large-language-models llm mllm multimodal-large-language-models vision-language-models vlm
Last synced: 20 Dec 2024
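
EVE removes the pretrained vision encoder and feeds patch-level embeddings of the raw image directly into the LLM. A minimal sketch of such an encoder-free front end is shown below, assuming a single strided convolution as the patch embedder; layer names and sizes are illustrative, not EVE's actual architecture.

```python
# Hypothetical patch-embedding front end for an encoder-free VLM (not EVE's actual code).
import torch
import torch.nn as nn

class PatchEmbedder(nn.Module):
    """Turn raw pixels into a sequence of LLM-dimensional visual tokens
    with one strided convolution instead of a pretrained vision encoder."""
    def __init__(self, llm_dim: int = 4096, patch_size: int = 14):
        super().__init__()
        self.proj = nn.Conv2d(3, llm_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, 3, H, W) -> (batch, num_patches, llm_dim)
        x = self.proj(images)                # (batch, llm_dim, H/ps, W/ps)
        return x.flatten(2).transpose(1, 2)  # (batch, num_patches, llm_dim)

tokens = PatchEmbedder()(torch.randn(1, 3, 336, 336))
print(tokens.shape)  # torch.Size([1, 576, 4096])
```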
https://github.com/gokayfem/ComfyUI_VLM_nodes
Custom ComfyUI nodes for Vision Language Models, Large Language Models, Image to Music, Text to Music, Consistent and Random Creative Prompt Generation
comfyui custom-nodes image-captioning img2sfx img2text joytag llava llm mllm nodes phi15 siglip vlm
Last synced: 22 Nov 2024
https://tiger-ai-lab.github.io/Mantis/
Official code for the paper "Mantis: Multi-Image Instruction Tuning"
fuyu language llava-llama3 lmm mantis mllm multi-image-understanding multimodal video vision vlm
Last synced: 07 Nov 2024
https://github.com/baaivision/densefusion
DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception
image-descriptions mllm multimodal-large-language-models vision-language-models visual-perception vlm
Last synced: 16 Dec 2024
https://github.com/bz-lab/AUITestAgent
AUITestAgent is the first automatic, natural language-driven GUI testing tool for mobile apps, capable of fully automating the entire process of GUI interaction and function verification.
agent automation gpt-4o gui llm mllm mobile-app multi-agent multimodal multimodal-agent testing
Last synced: 08 Nov 2024
https://github.com/thu-ml/MMTrustEval
A toolbox for benchmarking the trustworthiness of multimodal large language models (MultiTrust, NeurIPS 2024 Datasets and Benchmarks Track)
benchmark claude fairness gpt-4 mllm multi-modal privacy robustness safety toolbox trustworthy-ai truthfulness
Last synced: 02 Dec 2024
https://github.com/microsoft/eureka-ml-insights
A framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings.
ai artificial-intelligence evaluation-framework llm machine-learning mllm
Last synced: 17 Dec 2024
https://github.com/foundationvision/generateu
[CVPR2024] Generative Region-Language Pretraining for Open-Ended Object Detection
mllm multimodality object-detection open-vocabulary open-vocabulary-detection open-world
Last synced: 05 Nov 2024
https://github.com/niutrans/vision-llm-alignment
This repository contains the code for SFT, RLHF, and DPO, designed for vision-based LLMs, including the LLaVA models and the LLaMA-3.2-vision models.
alignment dpo llama3-vision llava llm mllm multi-model ppo reward rlhf sft vision
Last synced: 18 Nov 2024
https://github.com/buaadreamer/chinese-llava-med
Chinese Medical Multimodal Large Model (Large Chinese Language-and-Vision Assistant for BioMedicine)
ai chinese gpt4v huggingface-datasets llama-factory llava medical minigpt4 mllm multimodal qwen1-5 transformers
Last synced: 06 Dec 2024
https://github.com/kwaivgi/uniaa
Unified Multi-modal IAA Baseline and Benchmark
benchmark dataset image-aesthetic-assessment llava mllm
Last synced: 09 Nov 2024
https://github.com/waltonfuture/Diff-eRank
Code for Diff-eRank, a rank-based evaluation metric for LLMs and MLLMs (NeurIPS 2024): https://arxiv.org/abs/2401.17139
evaluation-metrics llm llm-inference machine-learning mllm neurips-2024
Last synced: 26 Nov 2024
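
Diff-eRank evaluates models by how the effective rank of their hidden representations changes. As a rough illustration, the sketch below computes effective rank using the common Roy–Vetterli definition (exponential of the entropy of the normalized singular values); the paper's exact normalization and comparison protocol may differ.

```python
# Hedged sketch: effective rank of a representation matrix, in the spirit of Diff-eRank.
import numpy as np

def effective_rank(reps: np.ndarray) -> float:
    """reps: (num_tokens, hidden_dim) matrix of hidden representations."""
    s = np.linalg.svd(reps - reps.mean(axis=0), compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

# Synthetic check: shrinking most singular values lowers the effective rank,
# which is the kind of change a Diff-eRank-style comparison tracks between two models.
flat = np.random.randn(512, 768)
compressed = flat @ np.diag(np.linspace(1.0, 0.01, 768))
print(effective_rank(flat), effective_rank(compressed))
```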
https://github.com/buaadreamer/mllm-finetuning-demo
Example code for fine-tuning multimodal LLMs with LLaMA-Factory (Demo of Finetuning Multimodal LLM with LLaMA-Factory)
finetune-llm huggingface-datasets llama-factory llava lora mllm paligemma pretraining supervised-finetuning transformers yi-vl
Last synced: 06 Dec 2024
https://github.com/hewei2001/reachqa
Code & Dataset for Paper: "Distill Visual Chart Reasoning Ability from LLMs to MLLMs"
Last synced: 04 Dec 2024
https://github.com/showlab/visincontext
Official implementation of Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning
efficient in-context-learning llm mllm
Last synced: 09 Nov 2024
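
The underlying trick here is to render long in-context text as images so the model consumes it through (cheaper) visual tokens. A rough Pillow sketch of such rendering follows; the layout parameters are arbitrary assumptions, not the project's actual preprocessing.

```python
# Hedged sketch: render text into an image so it can be fed to a model as visual tokens.
from PIL import Image, ImageDraw

def render_text_to_image(text: str, width: int = 448, line_height: int = 16,
                         chars_per_line: int = 64) -> Image.Image:
    lines = [text[i:i + chars_per_line] for i in range(0, len(text), chars_per_line)]
    image = Image.new("RGB", (width, line_height * max(len(lines), 1)), "white")
    draw = ImageDraw.Draw(image)
    for row, line in enumerate(lines):
        draw.text((4, row * line_height), line, fill="black")  # default PIL font
    return image

# render_text_to_image(long_document).save("context.png")  # `long_document` is a placeholder
```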
https://github.com/xirui-li/MOSSBench
An implementation for MLLM oversensitivity evaluation
alignment attack mllm oversensitivity vlm
Last synced: 02 Dec 2024
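
MOSSBench probes oversensitivity, i.e. models refusing benign multimodal prompts. The toy sketch below shows the kind of refusal-rate metric involved; `model` and the refusal markers are hypothetical stand-ins, not MOSSBench's actual API or scoring rule.

```python
# Toy refusal-rate metric over benign image-text prompts (hypothetical, not MOSSBench's code).
from typing import Callable, Iterable, Tuple

REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "i am unable")

def refusal_rate(model: Callable[[str, str], str],
                 benign_samples: Iterable[Tuple[str, str]]) -> float:
    """benign_samples: (image_path, question) pairs that are safe to answer."""
    samples = list(benign_samples)
    refusals = sum(
        any(marker in model(image, question).lower() for marker in REFUSAL_MARKERS)
        for image, question in samples
    )
    return refusals / max(len(samples), 1)
```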
https://github.com/buaadreamer/qwen2-vl-history
A LLaMA-Factory fine-tuning case study of Qwen2-VL in the culture-and-tourism domain (historical literature and museums)
beauty history llama-factory mllm multimodal-large-language-models museum qwen2-vl supervised-finetuning
Last synced: 06 Dec 2024
https://github.com/freedomintelligence/trim
We introduce a new approach, Token Reduction using CLIP Metric (TRIM), aimed at improving the efficiency of MLLMs without sacrificing their performance.
llm mllm multimodal vision-and-language vision-language-model vlm
Last synced: 17 Nov 2024
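
TRIM drops the visual tokens that matter least for the text query, using CLIP similarity as the signal. The sketch below simply keeps the top-k visual tokens most similar to a text embedding; TRIM's actual selection rule and its handling of discarded tokens may differ.

```python
# Hedged sketch of CLIP-similarity-based visual token reduction, in the spirit of TRIM.
import torch
import torch.nn.functional as F

def reduce_visual_tokens(visual_tokens: torch.Tensor,
                         text_embedding: torch.Tensor,
                         keep: int) -> torch.Tensor:
    # visual_tokens: (num_patches, dim); text_embedding: (dim,)
    sims = F.cosine_similarity(visual_tokens, text_embedding.unsqueeze(0), dim=-1)
    top = sims.topk(keep).indices.sort().values  # keep original patch order
    return visual_tokens[top]

kept = reduce_visual_tokens(torch.randn(576, 512), torch.randn(512), keep=128)
print(kept.shape)  # torch.Size([128, 512])
```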
https://github.com/tychenjiajun/exif-ai
A Node.js CLI and library that uses OpenAI, Ollama, ZhipuAI, Google Gemini or Coze to write AI-generated image descriptions and/or tags to EXIF metadata based on the image's content.
ai cli cli-tool coze exif gemini image jpeg jpg llm metadata mllm ollama openai openai-api photo zhipu
Last synced: 11 Oct 2024
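
exif-ai itself is a Node.js tool; as a rough illustration of the same idea in Python, the sketch below writes a caption into a JPEG's EXIF ImageDescription tag with Pillow. The caption-generating function in the usage comment is a hypothetical placeholder.

```python
# Hedged Python sketch of writing a generated caption into EXIF (not exif-ai's code).
from PIL import Image

def write_description(path: str, description: str) -> None:
    image = Image.open(path)
    exif = image.getexif()
    exif[0x010E] = description  # 0x010E is the EXIF ImageDescription tag
    image.save(path, exif=exif)

# write_description("photo.jpg", caption_from_vision_model("photo.jpg"))  # hypothetical captioner
```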
https://github.com/pipixin321/awesome-video-mllms
:fire: :fire: :fire: Awesome MLLMs/Benchmarks for Short/Long/Streaming Video Understanding :video_camera:
awesome-list benchmarks large-language-models mllm video video-understanding
Last synced: 09 Dec 2024