{"id":19591591,"url":"https://github.com/becauseofai/modernai","last_synced_at":"2026-03-19T10:39:24.673Z","repository":{"id":199075022,"uuid":"702087823","full_name":"becauseofAI/ModernAI","owner":"becauseofAI","description":"Awesome Modern Artificial Intelligence.","archived":false,"fork":false,"pushed_at":"2024-02-21T09:12:51.000Z","size":129,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-02-26T13:47:49.258Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/becauseofAI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-10-08T13:12:04.000Z","updated_at":"2023-11-26T14:10:41.000Z","dependencies_parsed_at":"2023-11-15T16:46:07.333Z","dependency_job_id":"9aacd33a-b0d2-49e8-93a0-e6acf7173038","html_url":"https://github.com/becauseofAI/ModernAI","commit_stats":null,"previous_names":["becauseofai/awesome-modern-artificial-intelligence","becauseofai/modernai"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/becauseofAI/ModernAI","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/becauseofAI%2FModernAI","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/becauseofAI%2FModernAI/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/becauseofAI%2FModernAI/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/becauseofAI%2FModernAI/manifests","
owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/becauseofAI","download_url":"https://codeload.github.com/becauseofAI/ModernAI/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/becauseofAI%2FModernAI/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29954964,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-28T22:53:01.873Z","status":"ssl_error","status_checked_at":"2026-02-28T22:52:50.699Z","response_time":90,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-11T08:29:42.165Z","updated_at":"2026-02-28T23:31:50.127Z","avatar_url":"https://github.com/becauseofAI.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# \u003cp align=\"center\"\u003eModernAI: Awesome Modern Artificial Intelligence\u003c/p\u003e  \n\u003cdiv align=\"center\"\u003e\u003cimg src=\"https://emtemp.gcom.cloud/ngw/globalassets/en/articles/images/hype-cycle-for-artificial-intelligence-2023.png\"/\u003e\u003c/div\u003e  \n\n\n## \u003cp align=\"center\"\u003e🔥Hot update in progress ...\u003c/p\u003e\n\n## Large Model Evolutionary Graph\n\u003cdetails\u003e\n\u003csummary\u003eLLM\u003c/summary\u003e\n\u003cdiv align=\"center\"\u003e\u003cimg 
src=\"https://github.com/Mooler0410/LLMsPracticalGuide/blob/main/imgs/tree.jpg\"/\u003e\u003c/div\u003e  \n\u003c/details\u003e\n\u003cdetails\u003e\n\u003csummary\u003eMLLM (LLaMA-based)\u003c/summary\u003e\n\u003cdiv align=\"center\"\u003e\u003cimg src=\"https://github.com/RUCAIBox/LLMSurvey/blob/main/assets/llama-0628-final.png\"/\u003e\u003c/div\u003e  \n\u003c/details\u003e\n\n## Survey\n1. Agent AI: Surveying the Horizons of Multimodal Interaction [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.03568.pdf)\n2. MM-LLMs: Recent Advances in MultiModal Large Language Models [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.13601.pdf) \n\n## Large Language Model (LLM)\n1. OLMo: Accelerating the Science of Language Models [arXiv 2402] [[paper]](https://arxiv.org/pdf/2402.00838.pdf) [[code]](https://github.com/allenai/OLMo)\n\n## Chinese Large Language Model (CLLM)\n1. https://github.com/LinkSoul-AI/Chinese-Llama-2-7b\n2. https://github.com/ymcui/Chinese-LLaMA-Alpaca-2\n3. https://github.com/LlamaFamily/Llama2-Chinese\n\n## Large Vision Backbone\n1. AIM: Scalable Pre-training of Large Autoregressive Image Models [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.08541.pdf) [[code]](https://github.com/apple/ml-aim) \n\n## Large Vision Model (LVM)\n1. Sequential Modeling Enables Scalable Learning for Large Vision Models [arXiv 2312] [[paper]](https://arxiv.org/pdf/2312.00785.pdf) [[code]](https://github.com/ytongbai/LVM) (💥Visual GPT Time?)\n\n## Large Vision-Language Model (VLM)\n1. UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.06397v1.pdf) [code]\n\n## Vision Foundation Model (VFM)\n1. SAM: Segment Anything Model [ICCV 2023 Best Paper Honorable Mention] [[paper]](https://arxiv.org/pdf/2304.02643.pdf) [[code]](https://github.com/facebookresearch/segment-anything) \n2. 
SSA: Semantic Segment Anything [github 2023] [paper] [[code]](https://github.com/fudan-zvg/Semantic-Segment-Anything)\n3. SEEM: Segment Everything Everywhere All at Once [arXiv 2304] [[paper]](https://arxiv.org/pdf/2304.06718.pdf) [[code]](https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once)\n5. RAM: Recognize Anything - A Strong Image Tagging Model [arXiv 2306] [[paper]](https://arxiv.org/pdf/2306.03514.pdf) [[code]](https://github.com/xinyu1205/Recognize_Anything-Tag2Text) \n6. Semantic-SAM: Segment and Recognize Anything at Any Granularity [arXiv 2307] [[paper]](https://browse.arxiv.org/pdf/2307.04767.pdf) [[code]](https://github.com/UX-Decoder/Semantic-SAM)\n7. UNINEXT: Universal Instance Perception as Object Discovery and Retrieval [CVPR 2023] [[paper]](https://arxiv.org/pdf/2303.06674.pdf) [[code]](https://github.com/MasterBin-IIAU/UNINEXT)\n8. APE: Aligning and Prompting Everything All at Once for Universal Visual Perception [arXiv 2312] [[paper]](https://arxiv.org/pdf/2312.02153.pdf) [[code]](https://github.com/shenyunhang/APE)\n9. GLEE: General Object Foundation Model for Images and Videos at Scale [arXiv 2312] [[paper]](https://arxiv.org/pdf/2312.09158.pdf) [[code]](https://github.com/FoundationVision/GLEE)\n10. OMG-Seg: Is One Model Good Enough For All Segmentation? [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.10229.pdf) [[code]](https://github.com/lxtGH/OMG-Seg)\n11. Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.10891.pdf) [[code]](https://github.com/LiheYoung/Depth-Anything)\n12. ClipSAM: CLIP and SAM Collaboration for Zero-Shot Anomaly Segmentation [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.12665.pdf) [[code]](https://github.com/Lszcoding/ClipSAM) \n13. PA-SAM: Prompt Adapter SAM for High-Quality Image Segmentation [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.13051.pdf) [[code]](https://github.com/xzz2/pa-sam)\n14. 
YOLO-World: **Real-Time Open-Vocabulary** Object Detection [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.17270.pdf) [[code]](https://github.com/AILab-CVC/YOLO-World)\n\n## Multimodal Large Language Model (MLLM) / Large Multimodal Model (LMM) \n\n| Model | Vision | Projector | LLM | OKVQA | GQA | VSR | IconVQA | VizWiz | HM | VQA\u003csup\u003ev2\u003c/sup\u003e | SQA\u003csup\u003eI\u003c/sup\u003e | VQA\u003csup\u003eT\u003c/sup\u003e | POPE | MME\u003csup\u003eP\u003c/sup\u003e | MME\u003csup\u003eC\u003c/sup\u003e | MMB | MMB\u003csup\u003eCN\u003c/sup\u003e | SEED\u003csup\u003eI\u003c/sup\u003e | LLaVA\u003csup\u003eW\u003c/sup\u003e | MM-Vet | QBench |\n| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |\n| MiniGPT-v2 | EVA-Clip-g | Linear | LLaMA-2-7B | 56.9\u003csup\u003e**2**\u003c/sup\u003e | 60.3 | 60.6\u003csup\u003e**2**\u003c/sup\u003e | 47.7\u003csup\u003e**2**\u003c/sup\u003e | 32.9 | 58.2\u003csup\u003e**2**\u003c/sup\u003e | | | | | | | | | | | | |\n| MiniGPT-v2-Chat | EVA-Clip-g | Linear | LLaMA-2-7B | 57.8\u003csup\u003e**1**\u003c/sup\u003e | 60.1 | 62.9\u003csup\u003e**1**\u003c/sup\u003e | 51.5\u003csup\u003e**1**\u003c/sup\u003e | 53.6 | 58.8\u003csup\u003e**1**\u003c/sup\u003e | | | | | | | | | | | | |\n| Qwen-VL-Chat | | | Qwen-7B | | 57.5\u003csup\u003e∗\u003c/sup\u003e | | | 38.9 | | 78.2\u003csup\u003e∗\u003c/sup\u003e | 68.2 | 61.5 | | 1487.5| 360.7\u003csup\u003e**2**\u003c/sup\u003e | 60.6 | 56.7 | 58.2 | | | |\n| LLaVA-1.5 | | | Vicuna-1.5-7B | | 62.0\u003csup\u003e∗\u003c/sup\u003e | | | 50.0 | | 78.5\u003csup\u003e∗\u003c/sup\u003e | 66.8 | 58.2 | 85.9\u003csup\u003e**1**\u003c/sup\u003e | 1510.7 | 316.1\u003csup\u003e+\u003c/sup\u003e | 64.3 | 58.3 | 58.6 | 63.4 | 30.5 | 58.7 |\n| LLaVA-1.5 +ShareGPT4V| | | Vicuna-1.5-7B | | | | | 57.2 | | 80.6\u003csup\u003e**2**\u003c/sup\u003e | 68.4 | | | 1567.4\u003csup\u003e**2**\u003c/sup\u003e | 
376.4\u003csup\u003e**1**\u003c/sup\u003e | 68.8 | 62.2 | 69.7\u003csup\u003e**1**\u003c/sup\u003e | 72.6 | 37.6 | 63.4\u003csup\u003e**1**∗\u003c/sup\u003e |\n| LLaVA-1.5 | | | Vicuna-1.5-13B | | 63.3\u003csup\u003e**1**\u003c/sup\u003e | | | 53.6 | | 80.0\u003csup\u003e∗\u003c/sup\u003e | 71.6 | 61.3 | 85.9\u003csup\u003e**1**\u003c/sup\u003e | 1531.3 | 295.4\u003csup\u003e+\u003c/sup\u003e | 67.7 | 63.6 | 61.6 | 70.7 | 35.4 | 62.1\u003csup\u003e**2**∗\u003c/sup\u003e |\n| VILA-7B | | | LLaMA-2-7B | | 62.3\u003csup\u003e∗\u003c/sup\u003e | | | 57.8 | | 79.9\u003csup\u003e∗\u003c/sup\u003e | 68.2 | 64.4 | 85.5\u003csup\u003e**2**∗\u003c/sup\u003e | 1533.0 | | 68.9 | 61.7 | 61.1 | 69.7 | 34.9 | |\n| VILA-13B | | | LLaMA-2-13B | | 63.3\u003csup\u003e**1**∗\u003c/sup\u003e | | | 60.6\u003csup\u003e**2**\u003c/sup\u003e | | 80.8\u003csup\u003e**1**∗\u003c/sup\u003e | 73.7\u003csup\u003e**1**∗\u003c/sup\u003e | 66.6\u003csup\u003e**1**∗\u003c/sup\u003e | 84.2 | 1570.1\u003csup\u003e**1**∗\u003c/sup\u003e | | 70.3\u003csup\u003e**2**∗\u003c/sup\u003e | 64.3\u003csup\u003e**2**∗\u003c/sup\u003e | 62.8\u003csup\u003e**2**∗\u003c/sup\u003e | 73.0\u003csup\u003e**2**∗\u003c/sup\u003e | 38.8\u003csup\u003e**2**∗\u003c/sup\u003e | |\n| VILA-13B +ShareGPT4V| | | LLaMA-2-13B | | 63.2\u003csup\u003e**2**∗\u003c/sup\u003e | | | 62.4\u003csup\u003e**1**\u003c/sup\u003e | | 80.6\u003csup\u003e**2**∗\u003c/sup\u003e | 73.1\u003csup\u003e**2**∗\u003c/sup\u003e | 65.3\u003csup\u003e**2**∗\u003c/sup\u003e | 84.8 | 1556.5 | | 70.8\u003csup\u003e**1**∗\u003c/sup\u003e | 65.4\u003csup\u003e**1**∗\u003c/sup\u003e | 61.4 | 78.4\u003csup\u003e**1**∗\u003c/sup\u003e | 45.7\u003csup\u003e**1**∗\u003c/sup\u003e | |\n| SPHINX | | | | | | | | | | | | | | | | | | | | | |\n| SPHINX-Plus | | | | | | | | | | | | | | | | | | | | | |\n| SPHINX-Plus-2K | | | | | | | | | | | | | | | | | | | | | |\n| SPHINX-MoE | | | | | | | | | | | | | | | | | | | | | |\n| InternVL | | | | | | | | | | | | | | | | | | | | 
| |\n| LLaVA-1.6 | | | | | | | | | | | | | | | | | | | | | |\n| | | | | | | | | | | | | | | | | | | | | | |\n\n\u003e\\+ indicates ShareGPT4V's (Chen et al., 2023e) re-implemented test results.  \n\u003e∗ indicates that the training images of the datasets are observed during training.\n\n\u003cdetails\u003e\n\u003csummary\u003eParadigm Comparison\u003c/summary\u003e\n\u003cdiv align=\"center\"\u003e\u003cimg src=\"https://user-images.githubusercontent.com/31701434/275126977-d7a482ac-fa57-4643-a7a8-a210bd3a43d5.png\"/\u003e\u003c/div\u003e  \n\u003c/details\u003e  \n\n1. LAVIS: A Library for Language-Vision Intelligence [ACL 2023] [[paper]](https://browse.arxiv.org/pdf/2209.09019.pdf) [[code]](https://github.com/salesforce/LAVIS)\n2. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models [ICML 2023] [[paper]](https://browse.arxiv.org/pdf/2301.12597.pdf) [[code]](https://github.com/salesforce/LAVIS/tree/main/projects/blip2)\n3. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning [arXiv 2305] [[paper]](https://browse.arxiv.org/pdf/2305.06500.pdf) [[code]](https://github.com/salesforce/LAVIS/tree/main/projects/instructblip)\n4. MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models [arXiv 2304] [[paper]](https://browse.arxiv.org/pdf/2304.10592.pdf) [[code]](https://github.com/Vision-CAIR/MiniGPT-4)\n5. MiniGPT-v2: Large Language Model as a Unified Interface for Vision-Language Multi-task Learning [github 2310] [[paper]](https://github.com/Vision-CAIR/MiniGPT-4/blob/main/MiniGPTv2.pdf) [[code]](https://github.com/Vision-CAIR/MiniGPT-4)\n6. VisualGLM-6B: Chinese and English multimodal conversational language model [ACL 2022] [[paper]](https://browse.arxiv.org/pdf/2103.10360.pdf) [[code]](https://github.com/THUDM/VisualGLM-6B)\n7. 
Kosmos-2: Grounding Multimodal Large Language Models to the World [arXiv 2306] [[paper]](https://arxiv.org/pdf/2306.14824.pdf) [[code]](https://github.com/microsoft/unilm/tree/master/kosmos-2)  \n8. NExT-GPT: Any-to-Any Multimodal LLM [arXiv 2309] [[paper]](https://browse.arxiv.org/pdf/2309.05519.pdf) [[code]](https://github.com/NExT-GPT/NExT-GPT) \n9. LLaVA/-1.5: Large Language and Vision Assistant [NeurIPS 2023] [[paper]](https://browse.arxiv.org/pdf/2304.08485.pdf) [arXiv 2310] [[paper]](https://browse.arxiv.org/pdf/2310.03744.pdf) [[code]](https://github.com/haotian-liu/LLaVA)\n10. 🦉mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality [arXiv 2304] [[paper]](https://arxiv.org/pdf/2304.14178.pdf) [[code]](https://github.com/X-PLUG/mPLUG-Owl)\n11. 🦉mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration [arXiv 2311] [[paper]](https://arxiv.org/pdf/2311.04257.pdf) [[code]](https://github.com/X-PLUG/mPLUG-Owl/tree/main/mPLUG-Owl2)\n12. VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks [arXiv 2305] [[paper]](https://arxiv.org/pdf/2305.11175.pdf) [[code]](https://github.com/OpenGVLab/VisionLLM)\n13. 🦅Shikra: Unleashing Multimodal LLM’s Referential Dialogue Magic [arXiv 2306] [[paper]](https://arxiv.org/pdf/2306.15195.pdf) [[code]](https://github.com/shikras/shikra)\n14. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond [arXiv 2308] [[paper]](https://arxiv.org/pdf/2308.12966.pdf) [[code]](https://github.com/QwenLM/Qwen-VL)\n15. LaVIT: Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [arXiv 2309] [[paper]](https://arxiv.org/pdf/2309.04669.pdf) [[code]](https://github.com/jy0205/LaVIT)\n16. AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model [arXiv 2309] [[paper]](https://browse.arxiv.org/pdf/2309.16058.pdf) [code]\n17. 
InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition [arXiv 2309] [[paper]](https://arxiv.org/pdf/2309.15112.pdf) [[code]](https://github.com/InternLM/InternLM-XComposer)\n18. MiniGPT-5: Interleaved Vision-and-Language **Generation** via Generative Vokens [arXiv 2310] [[paper]](https://arxiv.org/pdf/2310.02239.pdf) [[code]](https://github.com/eric-ai-lab/MiniGPT-5)\n19. CogVLM: Visual Expert for Large Language Models [github 2310] [[paper]](https://github.com/THUDM/CogVLM/blob/main/assets/cogvlm-paper.pdf) [[code]](https://github.com/THUDM/CogVLM)\n20. 🐦Woodpecker: Hallucination Correction for Multimodal Large Language Models [arXiv 2310] [[paper]](https://arxiv.org/pdf/2310.16045.pdf) [[code]](https://github.com/BradyFU/Woodpecker)\n21. SoM: Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V [arXiv 2310] [[paper]](https://arxiv.org/pdf/2310.11441.pdf) [[code]](https://github.com/microsoft/SoM)\n22. Ferret: Refer and Ground Anything Any-Where at Any Granularity [arXiv 2310] [[paper]](https://arxiv.org/pdf/2310.07704v1.pdf) [[code]](https://github.com/apple/ml-ferret) \n23. 🦦OtterHD: A High-Resolution Multi-modality Model [arXiv 2311] [[paper]](https://arxiv.org/pdf/2311.04219.pdf) [[code]](https://github.com/Luodian/Otter)\n24. NExT-Chat: An LMM for Chat, Detection and Segmentation [arXiv 2311] [[paper]](https://arxiv.org/pdf/2311.04498.pdf) [[project]](https://next-chatv.github.io/)\n25. Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models [arXiv 2311] [[paper]](https://arxiv.org/pdf/2311.06783.pdf) [[code]](https://github.com/Q-Future/Q-Instruct)\n26. InfMLLM: A Unified Framework for Visual-Language Tasks [arXiv 2311] [[paper]](https://arxiv.org/pdf/2311.06791.pdf) [[code]](https://github.com/mightyzau/InfMLLM)\n27. 
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks (FLD-5B) [arXiv 2311] [[paper]](https://arxiv.org/pdf/2311.06242.pdf) [code] [dataset]\n28. 🦁LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge [arXiv 2311] [[paper]](https://arxiv.org/pdf/2311.11860.pdf) [[code]](https://github.com/rshaojimmy/JiuTian)\n29. 🐵Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models [arXiv 2311] [[paper]](https://arxiv.org/pdf/2311.06607.pdf) [[code]](https://github.com/Yuliang-Liu/Monkey)\n30. CG-VLM: Contrastive Vision-Language Alignment Makes Efficient Instruction Learner [arXiv 2311] [[paper]](https://arxiv.org/pdf/2311.17945.pdf) [[code]](https://github.com/lizhaoliu-Lec/CG-VLM)\n31. 🐲PixelLM: Pixel Reasoning with Large Multimodal Model [arXiv 2312] [[paper]](https://arxiv.org/pdf/2312.02228.pdf) [[code]](https://github.com/MaverickRen/PixelLM)\n32. 🐝Honeybee: Locality-enhanced Projector for Multimodal LLM [arXiv 2312] [[paper]](https://arxiv.org/pdf/2312.06742.pdf) [[code]](https://github.com/kakaobrain/honeybee)\n33. VILA: On Pre-training for Visual Language Models [arXiv 2312] [[paper]](https://arxiv.org/pdf/2312.07533.pdf) [code]\n34. CogAgent: A Visual Language Model for GUI Agents [arXiv 2312] [[paper]](https://arxiv.org/pdf/2312.08914.pdf) [[code]](https://github.com/THUDM/CogVLM) (**supports 1120×1120 resolution**)\n35. PixelLLM: Pixel Aligned Language Models [arXiv 2312] [[paper]](https://arxiv.org/pdf/2312.09237.pdf) [code]\n36. 🦅Osprey: Pixel Understanding with Visual Instruction Tuning [arXiv 2312] [[paper]](https://arxiv.org/pdf/2312.10032.pdf) [[code]](https://github.com/CircleRadon/Osprey)\n37. Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action [arXiv 2312] [[paper]](https://arxiv.org/pdf/2312.17172.pdf) [[code]](https://github.com/allenai/unified-io-2)\n38. 
VistaLLM: Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model [arXiv 2312] [[paper]](https://arxiv.org/pdf/2312.12423.pdf) [code]\n39. Emu2: Generative Multimodal Models are In-Context Learners [arXiv 2312] [[paper]](https://arxiv.org/pdf/2312.13286.pdf) [[code]](https://github.com/baaivision/Emu)\n40. V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs [arXiv 2312] [[paper]](https://arxiv.org/pdf/2312.14135.pdf) [[code]](https://github.com/penghao-wu/vstar)\n41. BakLLaVA-1: BakLLaVA 1 is a **Mistral 7B** base augmented with the LLaVA 1.5 architecture [github 2310] [paper] [[code]](https://github.com/SkunkworksAI/BakLLaVA)\n42. LEGO: Language Enhanced **Multi-modal Grounding** Model [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.06071.pdf) [[code]](https://github.com/lzw-lzw/lego)\n43. MMVP: Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.06209.pdf) [[code]](https://github.com/tsb0601/MMVP)\n44. ModaVerse: Efficiently Transforming Modalities with LLMs [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.06395.pdf) [code]\n45. MoE-LLaVA: Mixture of Experts for Large Vision-Language Models [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.15947.pdf) [[code]](https://github.com/PKU-YuanGroup/MoE-LLaVA) \n46. LLaVA-MoLE: Sparse Mixture of LoRA Experts for Mitigating Data Conflicts in Instruction Finetuning MLLMs [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.16160.pdf) [code]\n47. 🎓InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Models [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.16420.pdf) [[code]](https://github.com/InternLM/InternLM-XComposer)\n48. MouSi: **Poly-Visual-Expert** Vision-Language Models [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.17221.pdf) [[code]](https://github.com/FudanNLPLAB/MouSi)\n49. 
Yi Vision Language Model [[HF 2401]](https://huggingface.co/01-ai/Yi-VL-34B)\n\n## Multimodal Small Language Model (MSLM) / Small Multimodal Model (SMM) \n1. Vary-toy: Small Language Model Meets with Reinforced Vision Vocabulary [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.12503.pdf) [[code]](https://github.com/Ucas-HaoranWei/Vary-toy)\n\n## Image Generation with MLLM\n1. Generating Images with Multimodal Language Models [NeurIPS 2023] [[paper]](https://arxiv.org/pdf/2305.17216.pdf) [[code]](https://github.com/kohjingyu/gill)\n2. DreamLLM: Synergistic Multimodal Comprehension and Creation [arXiv 2309] [[paper]](https://arxiv.org/pdf/2309.11499.pdf) [[code]](https://github.com/RunpeiDong/DreamLLM)\n3. Guiding Instruction-based Image Editing via Multimodal Large Language Models [arXiv 2309] [[paper]](https://arxiv.org/pdf/2309.17102.pdf) [[code]](https://github.com/tsujuifu/pytorch_mgie)\n4. KOSMOS-G: Generating Images in Context with Multimodal Large Language Models [arXiv 2310] [[paper]](https://arxiv.org/pdf/2310.02992.pdf) [[code]](https://github.com/microsoft/unilm/tree/master/kosmos-g)\n5. LLMGA: Multimodal Large Language Model based Generation Assistant [arXiv 2311] [[paper]](https://arxiv.org/pdf/2311.16500v2.pdf) [[code]](https://github.com/dvlab-research/LLMGA)\n\n## Modern Autonomous Driving (MAD)\n### End-to-End Solution\n1. UniAD: Planning-oriented Autonomous Driving [CVPR 2023] [[paper]](https://arxiv.org/pdf/2212.10156.pdf) [[code]](https://github.com/OpenDriveLab/UniAD)\n2. Scene as Occupancy [arXiv 2306] [[paper]](https://arxiv.org/pdf/2306.02851.pdf) [[code]](https://github.com/OpenDriveLab/OccNet)\n3. FusionAD: Multi-modality Fusion for Prediction and Planning Tasks of Autonomous Driving [arXiv 2308] [[paper]](https://arxiv.org/pdf/2308.01006.pdf) [[code]](https://github.com/westlake-autolab/FusionAD)\n4. 
BEVGPT: Generative Pre-trained Large Model for Autonomous Driving Prediction, Decision-Making, and Planning [arXiv 2310] [[paper]](https://arxiv.org/pdf/2310.10357.pdf) [code]\n5. UniVision: A Unified Framework for Vision-Centric 3D Perception [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.06994.pdf) [[code]](https://github.com/Cc-Hy/UniVision) \n### with Large Language Model\n1. Drive Like a Human: Rethinking Autonomous Driving with Large Language Models [arXiv 2307] [[paper]](https://arxiv.org/pdf/2307.07162.pdf) [[code]](https://github.com/PJLab-ADG/DriveLikeAHuman)\n2. LINGO-1: Exploring Natural Language for Autonomous Driving (Vision-Language-Action Models, VLAMs) [Wayve 2309] [[blog]](https://wayve.ai/thinking/lingo-natural-language-autonomous-driving/)\n3. DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model [arXiv 2310] [[paper]](https://arxiv.org/pdf/2310.01412.pdf) [code]\n\n## Embodied AI (EAI) and Robo Agent\n1. VIMA: General Robot Manipulation with Multimodal Prompts [arXiv 2210] [[paper]](https://arxiv.org/pdf/2210.03094.pdf) [[code]](https://github.com/vimalabs/VIMA)\n2. PaLM-E: An Embodied Multimodal Language Model  [arXiv 2303] [[paper]](https://arxiv.org/pdf/2303.03378.pdf) [code]\n3. VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models [arXiv 2307] [CoRL 2023] [[paper]](https://arxiv.org/pdf/2307.05973.pdf) [[code]](https://github.com/huangwl18/VoxPoser)\n4. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [arXiv 2307] [[paper]](https://arxiv.org/pdf/2307.15818.pdf) [[project]](https://robotics-transformer2.github.io/)\n5. RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentations and Action Chunking [arXiv 2309] [[paper]](https://arxiv.org/pdf/2309.01918.pdf) [[code]](https://github.com/robopen/roboagent/)\n6. 
MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.10727.pdf) [[code]](https://github.com/MLLM-Tool/MLLM-Tool)\n\n## Neural Radiance Fields (NeRF)\n1. EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision [arXiv 2311] [[paper]](https://arxiv.org/pdf/2311.02077.pdf) [[code]](https://github.com/NVlabs/EmerNeRF) \n\n## Diffusion Model\n1. ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Real Image [arXiv 2310] [[paper]](https://arxiv.org/pdf/2310.17994.pdf) [[code]](https://github.com/kylesargent/ZeroNVS)\n2. Vlogger: Make Your Dream A Vlog [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.09414.pdf) [[code]](https://github.com/zhuangshaobin/Vlogger)\n3. BootPIG: Bootstrapping Zero-shot Personalized Image Generation Capabilities in Pretrained Diffusion Models [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.13974.pdf) [code]\n\n## World Model\n1. CWM: Unifying (Machine) Vision via Counterfactual World Modeling [arXiv 2306] [[paper]](https://arxiv.org/pdf/2306.01828.pdf) [[code]](https://github.com/neuroailab/CounterfactualWorldModels)\n2. MILE: Model-Based Imitation Learning for Urban Driving [Wayve 2210] [NeurIPS 2022] [[paper]](https://arxiv.org/pdf/2210.07729.pdf) [[code]](https://github.com/wayveai/mile) [[blog]](https://wayve.ai/thinking/learning-a-world-model-and-a-driving-policy/)\n3. GAIA-1: A Generative World Model for Autonomous Driving [Wayve 2310] [arXiv 2309] [[paper]](https://arxiv.org/pdf/2309.17080.pdf) [code]\n4. ADriver-I: A General World Model for Autonomous Driving [arXiv 2311] [[paper]](https://arxiv.org/pdf/2311.13549.pdf) [code]\n5. OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving [arXiv 2311] [[paper]](https://arxiv.org/pdf/2311.16038.pdf) [[code]](https://github.com/wzzheng/OccWorld)\n6. 
LWM: World Model on Million-Length Video and Language with RingAttention [arXiv 2402] [[paper]](https://arxiv.org/pdf/2402.08268.pdf) [[code]](https://github.com/LargeWorldModel/LWM)\n\n## Artificial Intelligence Generated Content (AIGC)\n### Text-to-Image\n### Text-to-Video\n1. Sora: Video generation models as world simulators [openai 2402] [[technical report]](https://openai.com/research/video-generation-models-as-world-simulators) (💥Visual GPT Time?)\n### Text-to-3D\n### Image-to-3D\n\n## Artificial General Intelligence (AGI)\n\n## New Method\n1. [Instruction Tuning] FLAN: Finetuned Language Models are Zero-Shot Learners [ICLR 2022] [[paper]](https://arxiv.org/pdf/2109.01652.pdf) [[code]](https://github.com/google-research/flan) \n\n## New Dataset\n1. DriveLM: Drive on Language [paper] [[project]](https://github.com/OpenDriveLab/DriveLM)\n2. MagicDrive: Street View Generation with Diverse 3D Geometry Control [arXiv 2310] [[paper]](https://arxiv.org/pdf/2310.02601.pdf) [[code]](https://github.com/cure-lab/MagicDrive) \n3. Open X-Embodiment: Robotic Learning Datasets and RT-X Models [[paper]](https://robotics-transformer-x.github.io/paper.pdf) [[project]](https://robotics-transformer-x.github.io/) [[blog]](https://www.deepmind.com/blog/scaling-up-learning-across-many-different-robot-types)\n4. To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning (LVIS-Instruct4V) [arXiv 2311] [[paper]](https://arxiv.org/pdf/2311.07574.pdf) [[code]](https://github.com/X2FD/LVIS-INSTRUCT4V) [[dataset]](https://huggingface.co/datasets/X2FD/LVIS-Instruct4V)\n5. Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks (FLD-5B) [arXiv 2311] [[paper]](https://arxiv.org/pdf/2311.06242.pdf) [code] [dataset]\n6. 
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions [[paper]](https://arxiv.org/pdf/2311.12793.pdf) [[code]](https://github.com/InternLM/InternLM-XComposer/tree/main/projects/ShareGPT4V) [[dataset]](https://huggingface.co/datasets/Lin-Chen/ShareGPT4V)\n\n## New Vision Backbone\n1. Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.09417.pdf) [[code]](https://github.com/hustvl/Vim)\n2. VMamba: Visual State Space Model [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.10166.pdf) [[code]](https://github.com/MzeroMiko/VMamba)\n\n## Benchmark\n1. Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.10529.pdf) [[code]](https://github.com/umd-huang-lab/Mementos)\n\n## Platform and API\n1. SenseNova (SenseTime open platform) [[url]](https://platform.sensenova.cn/)\n\n## SOTA Downstream Task\n### Zero-shot Object Detection: Visual Grounding, Open-set, Open-vocabulary, Open-world\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbecauseofai%2Fmodernai","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbecauseofai%2Fmodernai","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbecauseofai%2Fmodernai/lists"}