https://github.com/becauseofai/modernai

Awesome Modern Artificial Intelligence.
Host: GitHub
URL: https://github.com/becauseofai/modernai
Owner: becauseofAI
License: apache-2.0
Created: 2023-10-08T13:12:04.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2024-02-21T09:12:51.000Z (over 2 years ago)
Last Synced: 2025-02-26T13:47:49.258Z (over 1 year ago)
Size: 126 KB
Stars: 2
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# ModernAI: Awesome Modern Artificial Intelligence

  

## 🔥Hot update in progress ...


## Large Model Evolutionary Graph

LLM

  

MLLM (LLaMA-based)

  

## Survey

1. Agent AI: Surveying the Horizons of Multimodal Interaction [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.03568.pdf)

2. MM-LLMs: Recent Advances in MultiModal Large Language Models [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.13601.pdf) 

## Large Language Model (LLM)

1. OLMo: Accelerating the Science of Language Models [arXiv 2402] [[paper]](https://arxiv.org/pdf/2402.00838.pdf) [[code]](https://github.com/allenai/OLMo)

## Chinese Large Language Model (CLLM)

1. https://github.com/LinkSoul-AI/Chinese-Llama-2-7b

2. https://github.com/ymcui/Chinese-LLaMA-Alpaca-2

3. https://github.com/LlamaFamily/Llama2-Chinese

## Large Vision Backbone

1. AIM: Scalable Pre-training of Large Autoregressive Image Models [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.08541.pdf) [[code]](https://github.com/apple/ml-aim) 

## Large Vision Model (LVM)

1. Sequential Modeling Enables Scalable Learning for Large Vision Models [arXiv 2312] [[paper]](https://arxiv.org/pdf/2312.00785.pdf) [[code]](https://github.com/ytongbai/LVM) (💥Visual GPT Time?)

## Large Vision-Language Model (VLM)

1. UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.06397v1.pdf) [code]

## Vision Foundation Model (VFM)

1. SAM: Segment Anything Model [ICCV 2023 Best Paper Honorable Mention] [[paper]](https://arxiv.org/pdf/2304.02643.pdf) [[code]](https://github.com/facebookresearch/segment-anything) 

2. SSA: Semantic segment anything [github 2023] [paper] [[code]](https://github.com/fudan-zvg/Semantic-Segment-Anything)

3. SEEM: Segment Everything Everywhere All at Once [arXiv 2304] [[paper]](https://arxiv.org/pdf/2304.06718.pdf) [[code]](https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once)

4. RAM: Recognize Anything - A Strong Image Tagging Model [arXiv 2306] [[paper]](https://arxiv.org/pdf/2306.03514.pdf) [[code]](https://github.com/xinyu1205/Recognize_Anything-Tag2Text)

5. Semantic-SAM: Segment and Recognize Anything at Any Granularity [arXiv 2307] [[paper]](https://browse.arxiv.org/pdf/2307.04767.pdf) [[code]](https://github.com/UX-Decoder/Semantic-SAM)

6. UNINEXT: Universal Instance Perception as Object Discovery and Retrieval [CVPR 2023] [[paper]](https://arxiv.org/pdf/2303.06674.pdf) [[code]](https://github.com/MasterBin-IIAU/UNINEXT)

7. APE: Aligning and Prompting Everything All at Once for Universal Visual Perception [arXiv 2312] [[paper]](https://arxiv.org/pdf/2312.02153.pdf) [[code]](https://github.com/shenyunhang/APE)

8. GLEE: General Object Foundation Model for Images and Videos at Scale [arXiv 2312] [[paper]](https://arxiv.org/pdf/2312.09158.pdf) [[code]](https://github.com/FoundationVision/GLEE)

9. OMG-Seg: Is One Model Good Enough For All Segmentation? [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.10229.pdf) [[code]](https://github.com/lxtGH/OMG-Seg)

10. Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.10891.pdf) [[code]](https://github.com/LiheYoung/Depth-Anything)

11. ClipSAM: CLIP and SAM Collaboration for Zero-Shot Anomaly Segmentation [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.12665.pdf) [[code]](https://github.com/Lszcoding/ClipSAM)

12. PA-SAM: Prompt Adapter SAM for High-Quality Image Segmentation [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.13051.pdf) [[code]](https://github.com/xzz2/pa-sam)

13. YOLO-World: **Real-Time Open-Vocabulary** Object Detection [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.17270.pdf) [[code]](https://github.com/AILab-CVC/YOLO-World)

## Multimodal Large Language Model (MLLM) / Large Multimodal Model (LMM) 

| Model | Vision | Projector | LLM | OKVQA | GQA | VSR | IconVQA | VizWiz | HM | VQA^v2 | SQA^I | VQA^T | POPE | MME^P | MME^C | MMB | MMB^CN | SEED^I | LLaVA^W | MM-Vet | QBench |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| MiniGPT-v2 | EVA-Clip-g | Linear | LLaMA-2-7B | 56.9^**2** | 60.3 | 60.6^**2** | 47.7^**2** | 32.9 | 58.2^**2** | | | | | | | | | | | | |
| MiniGPT-v2-Chat | EVA-Clip-g | Linear | LLaMA-2-7B | 57.8^**1** | 60.1 | 62.9^**1** | 51.5^**1** | 53.6 | 58.8^**1** | | | | | | | | | | | | |
| Qwen-VL-Chat | | | Qwen-7B | | 57.5^∗ | | | 38.9 | | 78.2^∗ | 68.2 | 61.5 | | 1487.5 | 360.7^**2** | 60.6 | 56.7 | 58.2 | | | |
| LLaVA-1.5 | | | Vicuna-1.5-7B | | 62.0^∗ | | | 50.0 | | 78.5^∗ | 66.8 | 58.2 | 85.9^**1** | 1510.7 | 316.1⁺ | 64.3 | 58.3 | 58.6 | 63.4 | 30.5 | 58.7 |
| LLaVA-1.5 +ShareGPT4V | | | Vicuna-1.5-7B | | | | | 57.2 | | 80.6^**2** | 68.4 | | | 1567.4^**2** | 376.4^**1** | 68.8 | 62.2 | 69.7^**1** | 72.6 | 37.6 | 63.4^**1**∗ |
| LLaVA-1.5 | | | Vicuna-1.5-13B | | 63.3^**1** | | | 53.6 | | 80.0^∗ | 71.6 | 61.3 | 85.9^**1** | 1531.3 | 295.4⁺ | 67.7 | 63.6 | 61.6 | 70.7 | 35.4 | 62.1^**2**∗ |
| VILA-7B | | | LLaMA-2-7B | | 62.3^∗ | | | 57.8 | | 79.9^∗ | 68.2 | 64.4 | 85.5^**2**∗ | 1533.0 | | 68.9 | 61.7 | 61.1 | 69.7 | 34.9 | |
| VILA-13B | | | LLaMA-2-13B | | 63.3^**1**∗ | | | 60.6^**2** | | 80.8^**1**∗ | 73.7^**1**∗ | 66.6^**1**∗ | 84.2 | 1570.1^**1**∗ | | 70.3^**2**∗ | 64.3^**2**∗ | 62.8^**2**∗ | 73.0^**2**∗ | 38.8^**2**∗ | |
| VILA-13B +ShareGPT4V | | | LLaMA-2-13B | | 63.2^**2**∗ | | | 62.4^**1** | | 80.6^**2**∗ | 73.1^**2**∗ | 65.3^**2**∗ | 84.8 | 1556.5 | | 70.8^**1**∗ | 65.4^**1**∗ | 61.4 | 78.4^**1**∗ | 45.7^**1**∗ | |
| SPHINX | | | | | | | | | | | | | | | | | | | | | |
| SPHINX-Plus | | | | | | | | | | | | | | | | | | | | | |
| SPHINX-Plus-2K | | | | | | | | | | | | | | | | | | | | | |
| SPHINX-MoE | | | | | | | | | | | | | | | | | | | | | |
| InternVL | | | | | | | | | | | | | | | | | | | | | |
| LLaVA-1.6 | | | | | | | | | | | | | | | | | | | | | |

>\+ indicates ShareGPT4V's (Chen et al., 2023e) re-implemented test results.  

>∗ indicates that the training images of the datasets are observed during training.

Paradigm Comparison

  

  

1. LAVIS: A Library for Language-Vision Intelligence [ACL 2023] [[paper]](https://browse.arxiv.org/pdf/2209.09019.pdf) [[code]](https://github.com/salesforce/LAVIS)

2. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models [ICML 2023] [[paper]](https://browse.arxiv.org/pdf/2301.12597.pdf) [[code]](https://github.com/salesforce/LAVIS/tree/main/projects/blip2)

3. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning [arXiv 2305] [[paper]](https://browse.arxiv.org/pdf/2305.06500.pdf) [[code]](https://github.com/salesforce/LAVIS/tree/main/projects/instructblip)

4. MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models [arXiv 2304] [[paper]](https://browse.arxiv.org/pdf/2304.10592.pdf) [[code]](https://github.com/Vision-CAIR/MiniGPT-4)

5. MiniGPT-v2: Large Language Model as a Unified Interface for Vision-Language Multi-task Learning [github 2310] [[paper]](https://github.com/Vision-CAIR/MiniGPT-4/blob/main/MiniGPTv2.pdf) [[code]](https://github.com/Vision-CAIR/MiniGPT-4)

6. VisualGLM-6B: Chinese and English multimodal conversational language model [ACL 2022] [[paper]](https://browse.arxiv.org/pdf/2103.10360.pdf) [[code]](https://github.com/THUDM/VisualGLM-6B)

7. Kosmos-2: Grounding Multimodal Large Language Models to the World [arXiv 2306] [[paper]](https://arxiv.org/pdf/2306.14824.pdf) [[code]](https://github.com/microsoft/unilm/tree/master/kosmos-2)  

8. NExT-GPT: Any-to-Any Multimodal LLM [arXiv 2309] [[paper]](https://browse.arxiv.org/pdf/2309.05519.pdf) [[code]](https://github.com/NExT-GPT/NExT-GPT) 

9. LLaVA/-1.5: Large Language and Vision Assistant [NeurIPS 2023] [[paper]](https://browse.arxiv.org/pdf/2304.08485.pdf) [arXiv 2310] [[paper]](https://browse.arxiv.org/pdf/2310.03744.pdf) [[code]](https://github.com/haotian-liu/LLaVA)

10. 🦉mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality [arXiv 2304] [[paper]](https://arxiv.org/pdf/2304.14178.pdf) [[code]](https://github.com/X-PLUG/mPLUG-Owl)

11. 🦉mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration [arXiv 2311] [[paper]](https://arxiv.org/pdf/2311.04257.pdf) [[code]](https://github.com/X-PLUG/mPLUG-Owl/tree/main/mPLUG-Owl2)

12. VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks [arXiv 2305] [[paper]](https://arxiv.org/pdf/2305.11175.pdf) [[code]](https://github.com/OpenGVLab/VisionLLM)

13. 🦅Shikra: Unleashing Multimodal LLM’s Referential Dialogue Magic [arXiv 2306] [[paper]](https://arxiv.org/pdf/2306.15195.pdf) [[code]](https://github.com/shikras/shikra)

14. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond [arXiv 2308] [[paper]](https://arxiv.org/pdf/2308.12966.pdf) [[code]](https://github.com/QwenLM/Qwen-VL)

15. LaVIT: Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [arXiv 2309] [[paper]](https://arxiv.org/pdf/2309.04669.pdf) [[code]](https://github.com/jy0205/LaVIT)

16. AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model [arXiv 2309] [[paper]](https://browse.arxiv.org/pdf/2309.16058.pdf) [code]

17. InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition [arXiv 2309] [[paper]](https://arxiv.org/pdf/2309.15112.pdf) [[code]](https://github.com/InternLM/InternLM-XComposer)

18. MiniGPT-5: Interleaved Vision-and-Language **Generation** via Generative Vokens [arXiv 2310] [[paper]](https://arxiv.org/pdf/2310.02239.pdf) [[code]](https://github.com/eric-ai-lab/MiniGPT-5)

19. CogVLM: Visual Expert for Large Language Models [github 2310] [[paper]](https://github.com/THUDM/CogVLM/blob/main/assets/cogvlm-paper.pdf) [[code]](https://github.com/THUDM/CogVLM)

20. 🐦Woodpecker: Hallucination Correction for Multimodal Large Language Models [arXiv 2310] [[paper]](https://arxiv.org/pdf/2310.16045.pdf) [[code]](https://github.com/BradyFU/Woodpecker)

21. SoM: Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V [arXiv 2310] [[paper]](https://arxiv.org/pdf/2310.11441.pdf) [[code]](https://github.com/microsoft/SoM)

22. Ferret: Refer and Ground Anything Any-Where at Any Granularity [arXiv 2310] [[paper]](https://arxiv.org/pdf/2310.07704v1.pdf) [[code]](https://github.com/apple/ml-ferret) 

23. 🦦OtterHD: A High-Resolution Multi-modality Model [arXiv 2311] [[paper]](https://arxiv.org/pdf/2311.04219.pdf) [[code]](https://github.com/Luodian/Otter)

24. NExT-Chat: An LMM for Chat, Detection and Segmentation [arXiv 2311] [[paper]](https://arxiv.org/pdf/2311.04498.pdf) [[project]](https://next-chatv.github.io/)

25. Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models [arXiv 2311] [[paper]](https://arxiv.org/pdf/2311.06783.pdf) [[code]](https://github.com/Q-Future/Q-Instruct)

26. InfMLLM: A Unified Framework for Visual-Language Tasks [arXiv 2311] [[paper]](https://arxiv.org/pdf/2311.06791.pdf) [[code]](https://github.com/mightyzau/InfMLLM)

27. Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks (FLD-5B) [arXiv 2311] [[paper]](https://arxiv.org/pdf/2311.06242.pdf) [code] [dataset]

28. 🦁LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge [arXiv 2311] [[paper]](https://arxiv.org/pdf/2311.11860.pdf) [[code]](https://github.com/rshaojimmy/JiuTian)

29. 🐵Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models [arXiv 2311] [[paper]](https://arxiv.org/pdf/2311.06607.pdf) [[code]](https://github.com/Yuliang-Liu/Monkey)

30. CG-VLM: Contrastive Vision-Language Alignment Makes Efficient Instruction Learner [arXiv 2311] [[paper]](https://arxiv.org/pdf/2311.17945.pdf) [[code]](https://github.com/lizhaoliu-Lec/CG-VLM)

31. 🐲PixelLM: Pixel Reasoning with Large Multimodal Model [arXiv 2312] [[paper]](https://arxiv.org/pdf/2312.02228.pdf) [[code]](https://github.com/MaverickRen/PixelLM)

32. 🐝Honeybee: Locality-enhanced Projector for Multimodal LLM [arXiv 2312] [[paper]](https://arxiv.org/pdf/2312.06742.pdf) [[code]](https://github.com/kakaobrain/honeybee)

33. VILA: On Pre-training for Visual Language Models [arXiv 2312] [[paper]](https://arxiv.org/pdf/2312.07533.pdf) [code]

34. CogAgent: A Visual Language Model for GUI Agents [arXiv 2312] [[paper]](https://arxiv.org/pdf/2312.08914.pdf) [[code]](https://github.com/THUDM/CogVLM) (**supports 1120×1120 resolution**)

35. PixelLLM: Pixel Aligned Language Models [arXiv 2312] [[paper]](https://arxiv.org/pdf/2312.09237.pdf) [code]

36. 🦅Osprey: Pixel Understanding with Visual Instruction Tuning [arXiv 2312] [[paper]](https://arxiv.org/pdf/2312.10032.pdf) [[code]](https://github.com/CircleRadon/Osprey)

37. Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action [arXiv 2312] [[paper]](https://arxiv.org/pdf/2312.17172.pdf) [[code]](https://github.com/allenai/unified-io-2)

38. VistaLLM: Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model [arXiv 2312] [[paper]](https://arxiv.org/pdf/2312.12423.pdf) [code]

39. Emu2: Generative Multimodal Models are In-Context Learners [arXiv 2312] [[paper]](https://arxiv.org/pdf/2312.13286.pdf) [[code]](https://github.com/baaivision/Emu)

40. V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs [arXiv 2312] [[paper]](https://arxiv.org/pdf/2312.14135.pdf) [[code]](https://github.com/penghao-wu/vstar)

41. BakLLaVA-1: BakLLaVA 1 is a **Mistral 7B** base augmented with the LLaVA 1.5 architecture [github 2310] [paper] [[code]](https://github.com/SkunkworksAI/BakLLaVA)

42. LEGO: Language Enhanced **Multi-modal Grounding** Model [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.06071.pdf) [[code]](https://github.com/lzw-lzw/lego)

43. MMVP: Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.06209.pdf) [[code]](https://github.com/tsb0601/MMVP)

44. ModaVerse: Efficiently Transforming Modalities with LLMs [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.06395.pdf) [code]

45. MoE-LLaVA: Mixture of Experts for Large Vision-Language Models [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.15947.pdf) [[code]](https://github.com/PKU-YuanGroup/MoE-LLaVA) 

46. LLaVA-MoLE: Sparse Mixture of LoRA Experts for Mitigating Data Conflicts in Instruction Finetuning MLLMs [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.16160.pdf) [code]

47. 🎓InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Models [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.16420.pdf) [[code]](https://github.com/InternLM/InternLM-XComposer)

48. MouSi: **Poly-Visual-Expert** Vision-Language Models [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.17221.pdf) [[code]](https://github.com/FudanNLPLAB/MouSi)

49. Yi Vision Language Model [[HF 2401]](https://huggingface.co/01-ai/Yi-VL-34B)


## Multimodal Small Language Model (MSLM) / Small Multimodal Model (SMM) 

1. Vary-toy: Small Language Model Meets with Reinforced Vision Vocabulary [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.12503.pdf) [[code]](https://github.com/Ucas-HaoranWei/Vary-toy)

## Image Generation with MLLM

1. Generating Images with Multimodal Language Models [NeurIPS 2023] [[paper]](https://arxiv.org/pdf/2305.17216.pdf) [[code]](https://github.com/kohjingyu/gill)

2. DreamLLM: Synergistic Multimodal Comprehension and Creation [arXiv 2309] [[paper]](https://arxiv.org/pdf/2309.11499.pdf) [[code]](https://github.com/RunpeiDong/DreamLLM)

3. Guiding Instruction-based Image Editing via Multimodal Large Language Models [arXiv 2309] [[paper]](https://arxiv.org/pdf/2309.17102.pdf) [[code]](https://github.com/tsujuifu/pytorch_mgie)

4. KOSMOS-G: Generating Images in Context with Multimodal Large Language Models [arXiv 2310] [[paper]](https://arxiv.org/pdf/2310.02992.pdf) [[code]](https://github.com/microsoft/unilm/tree/master/kosmos-g)

5. LLMGA: Multimodal Large Language Model based Generation Assistant [arXiv 2311] [[paper]](https://arxiv.org/pdf/2311.16500v2.pdf) [[code]](https://github.com/dvlab-research/LLMGA)

## Modern Autonomous Driving (MAD)

### End-to-End Solution

1. UniAD: Planning-oriented Autonomous Driving [CVPR 2023] [[paper]](https://arxiv.org/pdf/2212.10156.pdf) [[code]](https://github.com/OpenDriveLab/UniAD)

2. Scene as Occupancy [arXiv 2306] [[paper]](https://arxiv.org/pdf/2306.02851.pdf) [[code]](https://github.com/OpenDriveLab/OccNet)

3. FusionAD: Multi-modality Fusion for Prediction and Planning Tasks of Autonomous Driving [arXiv 2308] [[paper]](https://arxiv.org/pdf/2308.01006.pdf) [[code]](https://github.com/westlake-autolab/FusionAD)

4. BEVGPT: Generative Pre-trained Large Model for Autonomous Driving Prediction, Decision-Making, and Planning [arXiv 2310] [[paper]](https://arxiv.org/pdf/2310.10357.pdf) [code]

5. UniVision: A Unified Framework for Vision-Centric 3D Perception [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.06994.pdf) [[code]](https://github.com/Cc-Hy/UniVision) 

### with Large Language Model

1. Drive Like a Human: Rethinking Autonomous Driving with Large Language Models [arXiv 2307] [[paper]](https://arxiv.org/pdf/2307.07162.pdf) [[code]](https://github.com/PJLab-ADG/DriveLikeAHuman)

2. LINGO-1: Exploring Natural Language for Autonomous Driving (Vision-Language-Action Models, VLAMs) [Wayve 2309] [[blog]](https://wayve.ai/thinking/lingo-natural-language-autonomous-driving/)

3. DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model [arXiv 2310] [[paper]](https://arxiv.org/pdf/2310.01412.pdf) [code]

## Embodied AI (EAI) and Robo Agent

1. VIMA: General Robot Manipulation with Multimodal Prompts [arXiv 2210] [[paper]](https://arxiv.org/pdf/2210.03094.pdf) [[code]](https://github.com/vimalabs/VIMA)

2. PaLM-E: An Embodied Multimodal Language Model [arXiv 2303] [[paper]](https://arxiv.org/pdf/2303.03378.pdf) [code]

3. VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models [arXiv 2307] [CoRL 2023] [[paper]](https://arxiv.org/pdf/2307.05973.pdf) [[code]](https://github.com/huangwl18/VoxPoser)

4. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [arXiv 2307] [[paper]](https://arxiv.org/pdf/2307.15818.pdf) [[project]](https://robotics-transformer2.github.io/)

5. RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentations and Action Chunking [arXiv 2309] [[paper]](https://arxiv.org/pdf/2309.01918.pdf) [[code]](https://github.com/robopen/roboagent/)

6. MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.10727.pdf) [[code]](https://github.com/MLLM-Tool/MLLM-Tool)

## Neural Radiance Fields (NeRF)

1. EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision [arXiv 2311] [[paper]](https://arxiv.org/pdf/2311.02077.pdf) [[code]](https://github.com/NVlabs/EmerNeRF) 

## Diffusion Model

1. ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Real Image [arXiv 2310] [[paper]](https://arxiv.org/pdf/2310.17994.pdf) [[code]](https://github.com/kylesargent/ZeroNVS)

2. Vlogger: Make Your Dream A Vlog [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.09414.pdf) [[code]](https://github.com/zhuangshaobin/Vlogger)

3. BootPIG: Bootstrapping Zero-shot Personalized Image Generation Capabilities in Pretrained Diffusion Models [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.13974.pdf) [code]

## World Model

1. CWM: Unifying (Machine) Vision via Counterfactual World Modeling [arXiv 2306] [[paper]](https://arxiv.org/pdf/2306.01828.pdf) [[code]](https://github.com/neuroailab/CounterfactualWorldModels)

2. MILE: Model-Based Imitation Learning for Urban Driving [Wayve 2210] [NeurIPS 2022] [[paper]](https://arxiv.org/pdf/2210.07729.pdf) [[code]](https://github.com/wayveai/mile) [[blog]](https://wayve.ai/thinking/learning-a-world-model-and-a-driving-policy/)

3. GAIA-1: A Generative World Model for Autonomous Driving [Wayve 2310] [arXiv 2309] [[paper]](https://arxiv.org/pdf/2309.17080.pdf) [code]

4. ADriver-I: A General World Model for Autonomous Driving [arXiv 2311] [[paper]](https://arxiv.org/pdf/2311.13549.pdf) [code]

5. OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving [arXiv 2311] [[paper]](https://arxiv.org/pdf/2311.16038.pdf) [[code]](https://github.com/wzzheng/OccWorld)

6. LWM: World Model on Million-Length Video and Language with RingAttention [arXiv 2402] [[paper]](https://arxiv.org/pdf/2402.08268.pdf) [[code]](https://github.com/LargeWorldModel/LWM)

## Artificial Intelligence Generated Content (AIGC)

### Text-to-Image

### Text-to-Video

1. Sora: Video generation models as world simulators [openai 2402] [[technical report]](https://openai.com/research/video-generation-models-as-world-simulators) (💥Visual GPT Time?)

### Text-to-3D

### Image-to-3D

## Artificial General Intelligence (AGI)

## New Method

1. [Instruction Tuning] FLAN: Finetuned Language Models are Zero-Shot Learners [ICLR 2022] [[paper]](https://arxiv.org/pdf/2109.01652.pdf) [[code]](https://github.com/google-research/flan) 

## New Dataset

1. DriveLM: Drive on Language [paper] [[project]](https://github.com/OpenDriveLab/DriveLM)

2. MagicDrive: Street View Generation with Diverse 3D Geometry Control [arXiv 2310] [[paper]](https://arxiv.org/pdf/2310.02601.pdf) [[code]](https://github.com/cure-lab/MagicDrive) 

3. Open X-Embodiment: Robotic Learning Datasets and RT-X Models [[paper]](https://robotics-transformer-x.github.io/paper.pdf) [[project]](https://robotics-transformer-x.github.io/) [[blog]](https://www.deepmind.com/blog/scaling-up-learning-across-many-different-robot-types)

4. To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning (LVIS-Instruct4V) [arXiv 2311] [[paper]](https://arxiv.org/pdf/2311.07574.pdf) [[code]](https://github.com/X2FD/LVIS-INSTRUCT4V) [[dataset]](https://huggingface.co/datasets/X2FD/LVIS-Instruct4V)

5. Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks (FLD-5B) [arXiv 2311] [[paper]](https://arxiv.org/pdf/2311.06242.pdf) [code] [dataset]

6. ShareGPT4V: Improving Large Multi-Modal Models with Better Captions [[paper]](https://arxiv.org/pdf/2311.12793.pdf) [[code]](https://github.com/InternLM/InternLM-XComposer/tree/main/projects/ShareGPT4V) [[dataset]](https://huggingface.co/datasets/Lin-Chen/ShareGPT4V)

## New Vision Backbone

1. Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.09417.pdf) [[code]](https://github.com/hustvl/Vim)

2. VMamba: Visual State Space Model [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.10166.pdf) [[code]](https://github.com/MzeroMiko/VMamba)

## Benchmark

1. Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences [arXiv 2401] [[paper]](https://arxiv.org/pdf/2401.10529.pdf) [[code]](https://github.com/umd-huang-lab/Mementos)

## Platform and API

1. SenseNova: SenseTime's SenseNova open platform [[url]](https://platform.sensenova.cn/)

## SOTA Downstream Task

### Zero-shot Object Detection: Visual Grounding, Open-set, Open-vocabulary, Open-world