https://github.com/becauseofai/modernai
Awesome Modern Artificial Intelligence.
- Host: GitHub
- URL: https://github.com/becauseofai/modernai
- Owner: becauseofAI
- License: apache-2.0
- Created: 2023-10-08T13:12:04.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-02-21T09:12:51.000Z (over 2 years ago)
- Last Synced: 2025-02-26T13:47:49.258Z (over 1 year ago)
- Size: 126 KB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# ModernAI: Awesome Modern Artificial Intelligence
## 🔥Hot update in progress ...
## Large Model Evolutionary Graph
LLM
MLLM (LLaMA-based)
## Survey
1. Agent AI: Surveying the Horizons of Multimodal Interaction [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.03568.pdf)
2. MM-LLMs: Recent Advances in MultiModal Large Language Models [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.13601.pdf)
## Large Language Model (LLM)
1. OLMo: Accelerating the Science of Language Models [arXiv 2402] [[paper]](https://arxiv.org/pdf/2402.00838.pdf) [[code]](https://github.com/allenai/OLMo)
## Chinese Large Language Model (CLLM)
1. https://github.com/LinkSoul-AI/Chinese-Llama-2-7b
2. https://github.com/ymcui/Chinese-LLaMA-Alpaca-2
3. https://github.com/LlamaFamily/Llama2-Chinese
## Large Vision Backbone
1. AIM: Scalable Pre-training of Large Autoregressive Image Models [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.08541.pdf) [[code]](https://github.com/apple/ml-aim)
## Large Vision Model (LVM)
1. Sequential Modeling Enables Scalable Learning for Large Vision Models [arXiv 2312] [[paper]](https://arxiv.org/pdf/2312.00785.pdf) [[code]](https://github.com/ytongbai/LVM) (💥Visual GPT Time?)
## Large Vision-Language Model (VLM)
1. UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.06397v1.pdf) [code]
## Vision Foundation Model (VFM)
1. SAM: Segment Anything Model [ICCV 2023 Best Paper Honorable Mention] [[paper]](https://arxiv.org/pdf/2304.02643.pdf) [[code]](https://github.com/facebookresearch/segment-anything)
2. SSA: Semantic segment anything [github 2023] [paper] [[code]](https://github.com/fudan-zvg/Semantic-Segment-Anything)
3. SEEM: Segment Everything Everywhere All at Once [arXiv 2304] [[paper]](https://arxiv.org/pdf/2304.06718.pdf) [[code]](https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once)
5. RAM: Recognize Anything - A Strong Image Tagging Model [arXiv 2306] [[paper]](https://arxiv.org/pdf/2306.03514.pdf) [[code]](https://github.com/xinyu1205/Recognize_Anything-Tag2Text)
6. Semantic-SAM: Segment and Recognize Anything at Any Granularity [arXiv 2307] [[paper]](https://browse.arxiv.org/pdf/2307.04767.pdf) [[code]](https://github.com/UX-Decoder/Semantic-SAM)
7. UNINEXT: Universal Instance Perception as Object Discovery and Retrieval [CVPR 2023] [[paper]](https://arxiv.org/pdf/2303.06674.pdf) [[code]](https://github.com/MasterBin-IIAU/UNINEXT)
8. APE: Aligning and Prompting Everything All at Once for Universal Visual Perception [arXiv 2312] [[paper]](https://arxiv.org/pdf/2312.02153.pdf) [[code]](https://github.com/shenyunhang/APE)
9. GLEE: General Object Foundation Model for Images and Videos at Scale [arXiv 2312] [[paper]](https://arxiv.org/pdf/2312.09158.pdf) [[code]](https://github.com/FoundationVision/GLEE)
10. OMG-Seg: Is One Model Good Enough For All Segmentation? [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.10229.pdf) [[code]](https://github.com/lxtGH/OMG-Seg)
11. Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.10891.pdf) [[code]](https://github.com/LiheYoung/Depth-Anything)
12. ClipSAM: CLIP and SAM Collaboration for Zero-Shot Anomaly Segmentation [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.12665.pdf) [[code]](https://github.com/Lszcoding/ClipSAM)
13. PA-SAM: Prompt Adapter SAM for High-Quality Image Segmentation [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.13051.pdf) [[code]](https://github.com/xzz2/pa-sam)
14. YOLO-World: **Real-Time Open-Vocabulary** Object Detection [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.17270.pdf) [[code]](https://github.com/AILab-CVC/YOLO-World)
## Multimodal Large Language Model (MLLM) / Large Multimodal Model (LMM)
| Model | Vision | Projector | LLM | OKVQA | GQA | VSR | IconVQA | VizWiz | HM | VQAv2 | SQA<sup>I</sup> | VQA<sup>T</sup> | POPE | MME<sup>P</sup> | MME<sup>C</sup> | MMB | MMB<sup>CN</sup> | SEED<sup>I</sup> | LLaVA<sup>W</sup> | MM-Vet | QBench |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| MiniGPT-v2 | EVA-Clip-g | Linear | LLaMA-2-7B | 56.9**2** | 60.3 | 60.6**2** | 47.7**2** | 32.9 | 58.2**2** | | | | | | | | | | | | |
| MiniGPT-v2-Chat | EVA-Clip-g | Linear | LLaMA-2-7B | 57.8**1** | 60.1 | 62.9**1** | 51.5**1** | 53.6 | 58.8**1** | | | | | | | | | | | | |
| Qwen-VL-Chat | | | Qwen-7B | | 57.5∗ | | | 38.9 | | 78.2∗ | 68.2 | 61.5 | | 1487.5| 360.7**2** | 60.6 | 56.7 | 58.2 | | | |
| LLaVA-1.5 | | | Vicuna-1.5-7B | | 62.0∗ | | | 50.0 | | 78.5∗ | 66.8 | 58.2 | 85.9**1** | 1510.7 | 316.1+ | 64.3 | 58.3 | 58.6 | 63.4 | 30.5 | 58.7 |
| LLaVA-1.5 +ShareGPT4V| | | Vicuna-1.5-7B | | | | | 57.2 | | 80.6**2** | 68.4 | | | 1567.4**2** | 376.4**1** | 68.8 | 62.2 | 69.7**1** | 72.6 | 37.6 | 63.4**1**∗ |
| LLaVA-1.5 | | | Vicuna-1.5-13B | | 63.3**1** | | | 53.6 | | 80.0∗ | 71.6 | 61.3 | 85.9**1** | 1531.3 | 295.4+ | 67.7 | 63.6 | 61.6 | 70.7 | 35.4 | 62.1**2**∗ |
| VILA-7B | | | LLaMA-2-7B | | 62.3∗ | | | 57.8 | | 79.9∗ | 68.2 | 64.4 | 85.5**2**∗ | 1533.0 | | 68.9 | 61.7 | 61.1 | 69.7 | 34.9 | |
| VILA-13B | | | LLaMA-2-13B | | 63.3**1**∗ | | | 60.6**2** | | 80.8**1**∗ | 73.7**1**∗ | 66.6**1**∗ | 84.2 | 1570.1**1**∗ | | 70.3**2**∗ | 64.3**2**∗ | 62.8**2**∗ | 73.0**2**∗ | 38.8**2**∗ | |
| VILA-13B +ShareGPT4V| | | LLaMA-2-13B | | 63.2**2**∗ | | | 62.4**1** | | 80.6**2**∗ | 73.1**2**∗ | 65.3**2**∗ | 84.8 | 1556.5 | | 70.8**1**∗ | 65.4**1**∗ | 61.4 | 78.4**1**∗ | 45.7**1**∗ | |
| SPHINX | | | | | | | | | | | | | | | | | | | | | |
| SPHINX-Plus | | | | | | | | | | | | | | | | | | | | | |
| SPHINX-Plus-2K | | | | | | | | | | | | | | | | | | | | | |
| SPHINX-MoE | | | | | | | | | | | | | | | | | | | | | |
| InternVL | | | | | | | | | | | | | | | | | | | | | |
| LLaVA-1.6 | | | | | | | | | | | | | | | | | | | | | |
> \+ indicates test results re-implemented by ShareGPT4V (Chen et al., 2023e).
> ∗ indicates that the training images of the dataset were observed during training.
Paradigm Comparison
1. LAVIS: A Library for Language-Vision Intelligence [ACL 2023] [[paper]](https://browse.arxiv.org/pdf/2209.09019.pdf) [[code]](https://github.com/salesforce/LAVIS)
2. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models [ICML 2023] [[paper]](https://browse.arxiv.org/pdf/2301.12597.pdf) [[code]](https://github.com/salesforce/LAVIS/tree/main/projects/blip2)
3. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning [arXiv 2305] [[paper]](https://browse.arxiv.org/pdf/2305.06500.pdf) [[code]](https://github.com/salesforce/LAVIS/tree/main/projects/instructblip)
4. MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models [arXiv 2304] [[paper]](https://browse.arxiv.org/pdf/2304.10592.pdf) [[code]](https://github.com/Vision-CAIR/MiniGPT-4)
5. MiniGPT-v2: Large Language Model as a Unified Interface for Vision-Language Multi-task Learning [github 2310] [[paper]](https://github.com/Vision-CAIR/MiniGPT-4/blob/main/MiniGPTv2.pdf) [[code]](https://github.com/Vision-CAIR/MiniGPT-4)
6. VisualGLM-6B: Chinese and English multimodal conversational language model [ACL 2022] [[paper]](https://browse.arxiv.org/pdf/2103.10360.pdf) [[code]](https://github.com/THUDM/VisualGLM-6B)
7. Kosmos-2: Grounding Multimodal Large Language Models to the World [arXiv 2306] [[paper]](https://arxiv.org/pdf/2306.14824.pdf) [[code]](https://github.com/microsoft/unilm/tree/master/kosmos-2)
8. NExT-GPT: Any-to-Any Multimodal LLM [arXiv 2309] [[paper]](https://browse.arxiv.org/pdf/2309.05519.pdf) [[code]](https://github.com/NExT-GPT/NExT-GPT)
9. LLaVA/-1.5: Large Language and Vision Assistant [NeurIPS 2023] [[paper]](https://browse.arxiv.org/pdf/2304.08485.pdf) [arXiv 2310] [[paper]](https://browse.arxiv.org/pdf/2310.03744.pdf) [[code]](https://github.com/haotian-liu/LLaVA)
10. 🦉mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality [arXiv 2304] [[paper]](https://arxiv.org/pdf/2304.14178.pdf) [[code]](https://github.com/X-PLUG/mPLUG-Owl)
11. 🦉mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration [arXiv 2311] [[paper]](https://arxiv.org/pdf/2311.04257.pdf) [[code]](https://github.com/X-PLUG/mPLUG-Owl/tree/main/mPLUG-Owl2)
12. VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks [arXiv 2305] [[paper]](https://arxiv.org/pdf/2305.11175.pdf) [[code]](https://github.com/OpenGVLab/VisionLLM)
13. 🦅Shikra: Unleashing Multimodal LLM’s Referential Dialogue Magic [arXiv 2306] [[paper]](https://arxiv.org/pdf/2306.15195.pdf) [[code]](https://github.com/shikras/shikra)
14. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond [arXiv 2308] [[paper]](https://arxiv.org/pdf/2308.12966.pdf) [[code]](https://github.com/QwenLM/Qwen-VL)
15. LaVIT: Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [arXiv 2309] [[paper]](https://arxiv.org/pdf/2309.04669.pdf) [[code]](https://github.com/jy0205/LaVIT)
16. AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model [arXiv 2309] [[paper]](https://browse.arxiv.org/pdf/2309.16058.pdf) [code]
17. InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition [arXiv 2309] [[paper]](https://arxiv.org/pdf/2309.15112.pdf) [[code]](https://github.com/InternLM/InternLM-XComposer)
18. MiniGPT-5: Interleaved Vision-and-Language **Generation** via Generative Vokens [arXiv 2310] [[paper]](https://arxiv.org/pdf/2310.02239.pdf) [[code]](https://github.com/eric-ai-lab/MiniGPT-5)
19. CogVLM: Visual Expert for Large Language Models [github 2310] [[paper]](https://github.com/THUDM/CogVLM/blob/main/assets/cogvlm-paper.pdf) [[code]](https://github.com/THUDM/CogVLM)
20. 🐦Woodpecker: Hallucination Correction for Multimodal Large Language Models [arXiv 2310] [[paper]](https://arxiv.org/pdf/2310.16045.pdf) [[code]](https://github.com/BradyFU/Woodpecker)
21. SoM: Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V [arXiv 2310] [[paper]](https://arxiv.org/pdf/2310.11441.pdf) [[code]](https://github.com/microsoft/SoM)
22. Ferret: Refer and Ground Anything Any-Where at Any Granularity [arXiv 2310] [[paper]](https://arxiv.org/pdf/2310.07704v1.pdf) [[code]](https://github.com/apple/ml-ferret)
23. 🦦OtterHD: A High-Resolution Multi-modality Model [arXiv 2311] [[paper]](https://arxiv.org/pdf/2311.04219.pdf) [[code]](https://github.com/Luodian/Otter)
24. NExT-Chat: An LMM for Chat, Detection and Segmentation [arXiv 2311] [[paper]](https://arxiv.org/pdf/2311.04498.pdf) [[project]](https://next-chatv.github.io/)
25. Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models [arXiv 2311] [[paper]](https://arxiv.org/pdf/2311.06783.pdf) [[code]](https://github.com/Q-Future/Q-Instruct)
26. InfMLLM: A Unified Framework for Visual-Language Tasks [arXiv 2311] [[paper]](https://arxiv.org/pdf/2311.06791.pdf) [[code]](https://github.com/mightyzau/InfMLLM)
27. Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks (FLD-5B) [arXiv 2311] [[paper]](https://arxiv.org/pdf/2311.06242.pdf) [code] [dataset]
28. 🦁LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge [arXiv 2311] [[paper]](https://arxiv.org/pdf/2311.11860.pdf) [[code]](https://github.com/rshaojimmy/JiuTian)
29. 🐵Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models [arXiv 2311] [[paper]](https://arxiv.org/pdf/2311.06607.pdf) [[code]](https://github.com/Yuliang-Liu/Monkey)
30. CG-VLM: Contrastive Vision-Language Alignment Makes Efficient Instruction Learner [arXiv 2311] [[paper]](https://arxiv.org/pdf/2311.17945.pdf) [[code]](https://github.com/lizhaoliu-Lec/CG-VLM)
31. 🐲PixelLM: Pixel Reasoning with Large Multimodal Model [arXiv 2312] [[paper]](https://arxiv.org/pdf/2312.02228.pdf) [[code]](https://github.com/MaverickRen/PixelLM)
32. 🐝Honeybee: Locality-enhanced Projector for Multimodal LLM [arXiv 2312] [[paper]](https://arxiv.org/pdf/2312.06742.pdf) [[code]](https://github.com/kakaobrain/honeybee)
33. VILA: On Pre-training for Visual Language Models [arXiv 2312] [[paper]](https://arxiv.org/pdf/2312.07533.pdf) [code]
34. CogAgent: A Visual Language Model for GUI Agents [arXiv 2312] [[paper]](https://arxiv.org/pdf/2312.08914.pdf) [[code]](https://github.com/THUDM/CogVLM) (**supports 1120×1120 resolution**)
35. PixelLLM: Pixel Aligned Language Models [arXiv 2312] [[paper]](https://arxiv.org/pdf/2312.09237.pdf) [code]
36. 🦅Osprey: Pixel Understanding with Visual Instruction Tuning [arXiv 2312] [[paper]](https://arxiv.org/pdf/2312.10032.pdf) [[code]](https://github.com/CircleRadon/Osprey)
37. Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action [arXiv 2312] [[paper]](https://arxiv.org/pdf/2312.17172.pdf) [[code]](https://github.com/allenai/unified-io-2)
38. VistaLLM: Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model [arXiv 2312] [[paper]](https://arxiv.org/pdf/2312.12423.pdf) [code]
39. Emu2: Generative Multimodal Models are In-Context Learners [arXiv 2312] [[paper]](https://arxiv.org/pdf/2312.13286.pdf) [[code]](https://github.com/baaivision/Emu)
40. V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs [arXiv 2312] [[paper]](https://arxiv.org/pdf/2312.14135.pdf) [[code]](https://github.com/penghao-wu/vstar)
41. BakLLaVA-1: BakLLaVA 1 is a **Mistral 7B** base augmented with the LLaVA 1.5 architecture [github 2310] [paper] [[code]](https://github.com/SkunkworksAI/BakLLaVA)
42. LEGO: Language Enhanced **Multi-modal Grounding** Model [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.06071.pdf) [[code]](https://github.com/lzw-lzw/lego)
43. MMVP: Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.06209.pdf) [[code]](https://github.com/tsb0601/MMVP)
44. ModaVerse: Efficiently Transforming Modalities with LLMs [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.06395.pdf) [code]
45. MoE-LLaVA: Mixture of Experts for Large Vision-Language Models [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.15947.pdf) [[code]](https://github.com/PKU-YuanGroup/MoE-LLaVA)
46. LLaVA-MoLE: Sparse Mixture of LoRA Experts for Mitigating Data Conflicts in Instruction Finetuning MLLMs [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.16160.pdf) [code]
47. 🎓InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Models [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.16420.pdf) [[code]](https://github.com/InternLM/InternLM-XComposer)
48. MouSi: **Poly-Visual-Expert** Vision-Language Models [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.17221.pdf) [[code]](https://github.com/FudanNLPLAB/MouSi)
49. Yi Vision Language Model [[HF 2401]](https://huggingface.co/01-ai/Yi-VL-34B)
## Multimodal Small Language Model (MSLM) / Small Multimodal Model (SMM)
1. Vary-toy: Small Language Model Meets with Reinforced Vision Vocabulary [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.12503.pdf) [[code]](https://github.com/Ucas-HaoranWei/Vary-toy)
## Image Generation with MLLM
1. Generating Images with Multimodal Language Models [NeurIPS 2023] [[paper]](https://arxiv.org/pdf/2305.17216.pdf) [[code]](https://github.com/kohjingyu/gill)
2. DreamLLM: Synergistic Multimodal Comprehension and Creation [arXiv 2309] [[paper]](https://arxiv.org/pdf/2309.11499.pdf) [[code]](https://github.com/RunpeiDong/DreamLLM)
3. Guiding Instruction-based Image Editing via Multimodal Large Language Models [arXiv 2309] [[paper]](https://arxiv.org/pdf/2309.17102.pdf) [[code]](https://github.com/tsujuifu/pytorch_mgie)
4. KOSMOS-G: Generating Images in Context with Multimodal Large Language Models [arXiv 2310] [[paper]](https://arxiv.org/pdf/2310.02992.pdf) [[code]](https://github.com/microsoft/unilm/tree/master/kosmos-g)
5. LLMGA: Multimodal Large Language Model based Generation Assistant [arXiv 2311] [[paper]](https://arxiv.org/pdf/2311.16500v2.pdf) [[code]](https://github.com/dvlab-research/LLMGA)
## Modern Autonomous Driving (MAD)
### End-to-End Solution
1. UniAD: Planning-oriented Autonomous Driving [CVPR 2023] [[paper]](https://arxiv.org/pdf/2212.10156.pdf) [[code]](https://github.com/OpenDriveLab/UniAD)
2. Scene as Occupancy [arXiv 2306] [[paper]](https://arxiv.org/pdf/2306.02851.pdf) [[code]](https://github.com/OpenDriveLab/OccNet)
3. FusionAD: Multi-modality Fusion for Prediction and Planning Tasks of Autonomous Driving [arXiv 2308] [[paper]](https://arxiv.org/pdf/2308.01006.pdf) [[code]](https://github.com/westlake-autolab/FusionAD)
4. BEVGPT: Generative Pre-trained Large Model for Autonomous Driving Prediction, Decision-Making, and Planning [arXiv 2310] [[paper]](https://arxiv.org/pdf/2310.10357.pdf) [code]
5. UniVision: A Unified Framework for Vision-Centric 3D Perception [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.06994.pdf) [[code]](https://github.com/Cc-Hy/UniVision)
### with Large Language Model
1. Drive Like a Human: Rethinking Autonomous Driving with Large Language Models [arXiv 2307] [[paper]](https://arxiv.org/pdf/2307.07162.pdf) [[code]](https://github.com/PJLab-ADG/DriveLikeAHuman)
2. LINGO-1: Exploring Natural Language for Autonomous Driving (Vision-Language-Action Models, VLAMs) [Wayve 2309] [[blog]](https://wayve.ai/thinking/lingo-natural-language-autonomous-driving/)
3. DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model [arXiv 2310] [[paper]](https://arxiv.org/pdf/2310.01412.pdf) [code]
## Embodied AI (EAI) and Robo Agent
1. VIMA: General Robot Manipulation with Multimodal Prompts [arXiv 2210] [[paper]](https://arxiv.org/pdf/2210.03094.pdf) [[code]](https://github.com/vimalabs/VIMA)
2. PaLM-E: An Embodied Multimodal Language Model [arXiv 2303] [[paper]](https://arxiv.org/pdf/2303.03378.pdf) [code]
3. VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models [arXiv 2307] [CoRL 2023] [[paper]](https://arxiv.org/pdf/2307.05973.pdf) [[code]](https://github.com/huangwl18/VoxPoser)
4. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [arXiv 2307] [[paper]](https://arxiv.org/pdf/2307.15818.pdf) [[project]](https://robotics-transformer2.github.io/)
5. RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentations and Action Chunking [arXiv 2309] [[paper]](https://arxiv.org/pdf/2309.01918.pdf) [[code]](https://github.com/robopen/roboagent/)
6. MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.10727.pdf) [[code]](https://github.com/MLLM-Tool/MLLM-Tool)
## Neural Radiance Fields (NeRF)
1. EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision [arXiv 2311] [[paper]](https://arxiv.org/pdf/2311.02077.pdf) [[code]](https://github.com/NVlabs/EmerNeRF)
## Diffusion Model
1. ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Real Image [arXiv 2310] [[paper]](https://arxiv.org/pdf/2310.17994.pdf) [[code]](https://github.com/kylesargent/ZeroNVS)
2. Vlogger: Make Your Dream A Vlog [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.09414.pdf) [[code]](https://github.com/zhuangshaobin/Vlogger)
3. BootPIG: Bootstrapping Zero-shot Personalized Image Generation Capabilities in Pretrained Diffusion Models [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.13974.pdf) [code]
## World Model
1. CWM: Unifying (Machine) Vision via Counterfactual World Modeling [arXiv 2306] [[paper]](https://arxiv.org/pdf/2306.01828.pdf) [[code]](https://github.com/neuroailab/CounterfactualWorldModels)
2. MILE: Model-Based Imitation Learning for Urban Driving [Wayve 2210] [NeurIPS 2022] [[paper]](https://arxiv.org/pdf/2210.07729.pdf) [[code]](https://github.com/wayveai/mile) [[blog]](https://wayve.ai/thinking/learning-a-world-model-and-a-driving-policy/)
3. GAIA-1: A Generative World Model for Autonomous Driving [Wayve 2310] [arXiv 2309] [[paper]](https://arxiv.org/pdf/2309.17080.pdf) [code]
4. ADriver-I: A General World Model for Autonomous Driving [arXiv 2311] [[paper]](https://arxiv.org/pdf/2311.13549.pdf) [code]
5. OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving [arXiv 2311] [[paper]](https://arxiv.org/pdf/2311.16038.pdf) [[code]](https://github.com/wzzheng/OccWorld)
6. LWM: World Model on Million-Length Video and Language with RingAttention [arXiv 2402] [[paper]](https://arxiv.org/pdf/2402.08268.pdf) [[code]](https://github.com/LargeWorldModel/LWM)
## Artificial Intelligence Generated Content (AIGC)
### Text-to-Image
### Text-to-Video
1. Sora: Video generation models as world simulators [openai 2402] [[technical report]](https://openai.com/research/video-generation-models-as-world-simulators) (💥Visual GPT Time?)
### Text-to-3D
### Image-to-3D
## Artificial General Intelligence (AGI)
## New Method
1. [Instruction Tuning] FLAN: Finetuned Language Models are Zero-Shot Learners [ICLR 2022] [[paper]](https://arxiv.org/pdf/2109.01652.pdf) [[code]](https://github.com/google-research/flan)
## New Dataset
1. DriveLM: Drive on Language [paper] [[project]](https://github.com/OpenDriveLab/DriveLM)
2. MagicDrive: Street View Generation with Diverse 3D Geometry Control [arXiv 2310] [[paper]](https://arxiv.org/pdf/2310.02601.pdf) [[code]](https://github.com/cure-lab/MagicDrive)
3. Open X-Embodiment: Robotic Learning Datasets and RT-X Models [[paper]](https://robotics-transformer-x.github.io/paper.pdf) [[project]](https://robotics-transformer-x.github.io/) [[blog]](https://www.deepmind.com/blog/scaling-up-learning-across-many-different-robot-types)
4. To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning (LVIS-Instruct4V) [arXiv 2311] [[paper]](https://arxiv.org/pdf/2311.07574.pdf) [[code]](https://github.com/X2FD/LVIS-INSTRUCT4V) [[dataset]](https://huggingface.co/datasets/X2FD/LVIS-Instruct4V)
5. Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks (FLD-5B) [arXiv 2311] [[paper]](https://arxiv.org/pdf/2311.06242.pdf) [code] [dataset]
6. ShareGPT4V: Improving Large Multi-Modal Models with Better Captions [[paper]](https://arxiv.org/pdf/2311.12793.pdf) [[code]](https://github.com/InternLM/InternLM-XComposer/tree/main/projects/ShareGPT4V) [[dataset]](https://huggingface.co/datasets/Lin-Chen/ShareGPT4V)
## New Vision Backbone
1. Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.09417.pdf) [[code]](https://github.com/hustvl/Vim)
2. VMamba: Visual State Space Model [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.10166.pdf) [[code]](https://github.com/MzeroMiko/VMamba)
## Benchmark
1. Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.10529.pdf) [[code]](https://github.com/umd-huang-lab/Mementos)
## Platform and API
1. SenseNova: SenseTime's SenseNova open platform [[url]](https://platform.sensenova.cn/)
## SOTA Downstream Task
### Zero-shot Object Detection: Visual Grounding, Open-set, Open-vocabulary, Open-world