https://github.com/cheryyunl/awesome-generalist-agents
A curated list of papers for generalist agents
List: awesome-generalist-agents
- Host: GitHub
- URL: https://github.com/cheryyunl/awesome-generalist-agents
- Owner: cheryyunl
- Created: 2025-01-14T23:15:07.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2025-01-23T01:38:27.000Z (10 months ago)
- Last Synced: 2025-01-23T02:29:39.645Z (10 months ago)
- Size: 138 KB
- Stars: 44
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Awesome-Generalist-Agents
A curated list of papers for generalist AI agents in both virtual and physical worlds.
- [Awesome-Generalist-Agents](#awesome-generalist-agents)
- [Generalist Agents in Both Virtual and Physical Worlds](#generalist-agents-in-both-virtual-and-physical-worlds)
- [Generalist Embodied Agents](#generalist-embodied-agents)
- [Generalist Web Agents](#generalist-web-agents)
- [Datasets & Benchmarks](#datasets--benchmarks)
---
## Generalist Agents in Both Virtual and Physical Worlds
| Date | keywords | Paper | Publication | Others |
| :-----: | :------------------: | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :---------: | :---------:
| May 2022 | Gato | [A Generalist Agent](https://arxiv.org/abs/2205.06175) | TMLR'22 | [Report](https://deepmind.google/discover/blog/a-generalist-agent/) |
| Feb 2024 | Interactive Agent Foundation Model | [An Interactive Agent Foundation Model](https://arxiv.org/abs/2402.05929) | ArXiv'24 | [Report](https://www.microsoft.com/en-us/research/publication/interactive-agent-foundation-model/) |
| Feb 2025 | Magma | [Magma: A Foundation Model for Multimodal AI Agents](https://arxiv.org/abs/2502.13130) | ArXiv'25 | [Project](https://github.com/microsoft/Magma) |
## Generalist Embodied Agents
### Large Vision-Language (Action) Models
| Date | keywords | Paper | Publication | Others |
| :-----: | :------------------: | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :---------: | :---------:
| Dec 2022 | RT-1 | [RT-1: Robotics Transformer for Real-World Control at Scale](https://arxiv.org/abs/2212.06817) | RSS'23 | [Project](https://robotics-transformer1.github.io/) |
| Mar 2023 | PaLM-E | [PaLM-E: An Embodied Multimodal Language Model](https://arxiv.org/abs/2303.03378) | ArXiv'23 | [Project](https://palm-e.github.io/) |
| July 2023 | RT-2 | [RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control](https://arxiv.org/abs/2307.15818) | ArXiv'23 | [Project](https://robotics-transformer2.github.io/) |
| Nov 2023 | LEO | [An Embodied Generalist Agent in 3D World](https://arxiv.org/abs/2311.12871) | ICML'24 | [Project](https://embodied-generalist.github.io/) |
| Nov 2023 | RoboFlamingo | [Vision-Language Foundation Models as Effective Robot Imitators](https://arxiv.org/abs/2311.01378) | ArXiv'23 | [Project](https://roboflamingo.github.io/) |
| Dec 2023 | GR-1 | [Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation](https://arxiv.org/abs/2312.13139) | ArXiv'23 | [Project](https://gr1-manipulation.github.io/) |
| Mar 2024 | 3D-VLA | [3D-VLA: A 3D Vision-Language-Action Generative World Model](https://arxiv.org/abs/2403.09631) | ICML'24 | [Project](https://vis-www.cs.umass.edu/3dvla) |
| May 2024 | Octo | [Octo: An Open-Source Generalist Robot Policy](https://arxiv.org/abs/2405.12213) | ArXiv'24 | [Project](https://octo-models.github.io/) |
| Jun 2024 | OpenVLA | [OpenVLA: An Open-Source Vision-Language-Action Model](https://arxiv.org/abs/2406.09246) | CoRL'24 | [Project](https://openvla.github.io/) |
| Jun 2024 | RoboUniView | [RoboUniView: Visual-Language Model with Unified View Representation for Robotic Manipulation](https://arxiv.org/abs/2406.18977) | ArXiv'24 | [Project](https://liufanfanlff.github.io/RoboUniview.github.io/) |
| Jul 2024 | Embodied-CoT | [Robotic Control via Embodied Chain-of-Thought Reasoning](https://arxiv.org/abs/2407.08693) | ArXiv'24 | [Project](https://embodied-cot.github.io/) |
| Jun 2024 | LLARVA | [LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning](https://arxiv.org/abs/2406.11815) | ArXiv'24 | [Project](https://llarva24.github.io/) |
| Sep 2024 | TinyVLA | [TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation](https://arxiv.org/abs/2409.12514) | ArXiv'24 | [Project](https://tiny-vla.github.io/) |
| Oct 2024 | GR-2 | [GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation](https://arxiv.org/abs/2410.06158) | ArXiv'24 | [Project](https://gr2-manipulation.github.io/) |
| Oct 2024 | LAPA | [Latent Action Pretraining from Videos](https://arxiv.org/abs/2410.11758) | ArXiv'24 | [Project](https://latentactionpretraining.github.io/) |
| Oct 2024 | π0 | [π0: A Vision-Language-Action Flow Model for General Robot Control](https://arxiv.org/abs/2410.24164) | ArXiv'24 | [Project](https://www.physicalintelligence.company/blog/pi0) |
| Oct 2024 | RDT-1B | [RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation](https://arxiv.org/abs/2410.07864) | ArXiv'24 | [Project](https://rdt-robotics.github.io/rdt-robotics/) |
| Nov 2024 | CogACT | [CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation](https://arxiv.org/abs/2411.xxxxx) | ArXiv'24 | [Project](https://cogact.github.io/) |
| Nov 2024 | DeeR-VLA | [DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution](https://arxiv.org/abs/2411.02359) | ArXiv'24 | [Project](https://github.com/yueyang130/DeeR-VLA) |
| Nov 2024 | RT-Affordance | [RT-Affordance: Affordances are Versatile Intermediate Representations for Robot Manipulation](https://arxiv.org/abs/2411.02704) | ArXiv'24 | [Project](https://snasiriany.me/rt-affordance) |
| Dec 2024 | Diffusion-VLA | [Diffusion-VLA: Scaling Robot Foundation Models via Unified Diffusion and Autoregression](https://arxiv.org/abs/2412.03293) | ArXiv'24 | [Project](https://diffusion-vla.github.io/) |
| Dec 2024 | RoboVLMs | [Towards Generalist Robot Policies: What Matters in Building Vision-Language-Action Models](https://arxiv.org/abs/2412.14058) | ArXiv'24 | [Project](https://robovlms.github.io/) |
| Dec 2024 | Moto | [Moto: Latent Motion Token as the Bridging Language for Robot Manipulation](https://arxiv.org/abs/2412.04445) | ArXiv'24 | [Project](https://chenyi99.github.io/moto/) |
| Dec 2024 | TraceVLA | [TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies](https://arxiv.org/abs/2412.10345) | ArXiv'24 | [Project](https://tracevla.github.io/) |
| Dec 2024 | NaVILA | [NaVILA: Legged Robot Vision-Language-Action Model for Navigation](https://arxiv.org/abs/2412.04453) | ArXiv'24 | [Project](https://navila-bot.github.io/) |
| Jan 2025 | FAST | [FAST: Efficient Action Tokenization for Vision-Language-Action Models](https://www.pi.website/download/fast.pdf) | ArXiv'25 | [Project](https://www.pi.website/research/fast) |
| Feb 2025 | DexVLA | [DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control](https://arxiv.org/abs/2502.05855) | ArXiv'25 | [Project](https://dex-vla.github.io/) |
### Generalist Robotics Policies
| Date | keywords | Paper | Publication | Others |
| :-----: | :------------------: | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :---------: | :---------:
| Apr 2021 | Mt-Opt | [Mt-Opt: Continuous Multi-Task Robotic Reinforcement Learning at Scale](https://arxiv.org/abs/2104.08212) | ArXiv'21 | [Project](https://karolhausman.github.io/mt-opt/) |
| Jan 2023 | UniPi | [Learning Universal Policies via Text-Guided Video Generation](https://arxiv.org/abs/2302.00111) | NeurIPS'23 | [Project](https://universal-policy.github.io/unipi/) |
| Mar 2023 | MOO | [Open-World Object Manipulation using Pre-trained Vision-Language Models](https://arxiv.org/abs/2303.00905) | CoRL'23 | [Project](https://robot-moo.github.io/) |
| Jun 2023 | RoboCat | [RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation](https://arxiv.org/abs/2306.11706) | ArXiv'23 | [Report](https://deepmind.google/discover/blog/robocat-a-self-improving-robotic-agent/) |
| Sep 2023 | RoboAgent | [RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentations and Action Chunking](https://arxiv.org/abs/2309.01918) | ICRA'24 | [Project](https://robopen.github.io/) |
| Feb 2024 | Extreme Cross-Embodiment | [Pushing the Limits of Cross-Embodiment Learning for Manipulation and Navigation](https://arxiv.org/abs/2402.19432) | RSS'24 | [Project](https://extreme-cross-embodiment.github.io/) |
| Jun 2024 | RoboPoint | [RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics](https://arxiv.org/abs/2406.10721) | CoRL'24 | [Project](https://robo-point.github.io/) |
| Aug 2024 | Crossformer | [Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation](https://arxiv.org/abs/2408.11812) | CoRL'24 | [Project](https://crossformer-model.github.io/) |
| Sep 2024 | HPT | [Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers](https://arxiv.org/abs/2409.20537) | NeurIPS'24 | [Project](https://liruiw.github.io/hpt/) |
| Sep 2024 | RUMs | [Robot Utility Models: General Policies for Zero-Shot Deployment in New Environments](https://arxiv.org/abs/2409.05865) | ArXiv'24 | [Project](https://robotutilitymodels.com/) |
| Sep 2024 | FLaRe | [FLaRe: Achieving Masterful and Adaptive Robot Policies with Large-Scale Reinforcement Learning Fine-Tuning](https://arxiv.org/abs/2409.16578) | ArXiv'24 | [Project](https://robot-flare.github.io/) |
| Sep 2024 | Neural MP | [Neural MP: A Generalist Neural Motion Planner](https://arxiv.org/abs/2409.05864) | ArXiv'24 | [Project](https://mihdalal.github.io/neuralmotionplanner/) |
| Oct 2024 | Law in IL | [Data Scaling Laws in Imitation Learning for Robotic Manipulation](https://arxiv.org/abs/2410.18647) | ArXiv'24 | [Project](https://data-scaling-laws.github.io/) |
| Dec 2024 | RING | [The One RING: a Robotic Indoor Navigation Generalist](https://arxiv.org/abs/2412.14401) | ArXiv'24 | [Project](https://one-ring-policy.allen.ai/) |
| Jan 2025 | FUSE | [Beyond Sight: Finetuning Generalist Robot Policies with Heterogeneous Sensors via Language Grounding](https://arxiv.org/abs/2501.04693) | ArXiv'25 | [Project](https://fuse-model.github.io/) |
### Multimodal World Models
| Date | keywords | Paper | Publication | Others |
| :-----: | :------------------: | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :---------: | :---------:
| Mar 2018 | World Models | [World Models](https://arxiv.org/abs/1803.10122) | ArXiv'18 | [Project](https://worldmodels.github.io/) |
| Jan 2023 | DreamerV3 | [Mastering Diverse Domains through World Models](https://arxiv.org/abs/2301.04104) | ArXiv'23 | [Project](https://danijar.com/project/dreamerv3/) |
| Aug 2023 | Human World Model | [Structured World Models from Human Videos](https://arxiv.org/abs/2308.10901) | RSS'23 | [Project](https://human-world-model.github.io/) |
| Feb 2024 | World Models | [The Essential Role of Causality in Foundation World Models for Embodied AI](https://arxiv.org/abs/2402.06665) | ArXiv'24 | [Project]() |
| Nov 2024 | WHALE | [WHALE: Towards Generalizable and Scalable World Models for Embodied Decision-making](https://arxiv.org/abs/2411.05619) | ArXiv'24 | [Project]() |
## Generalist Web Agents
### Generalist Agents for Simulated Worlds
| Date | keywords | Paper | Publication | Others |
| :-----: | :------------------: | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :---------: | :---------:
| Feb 2024 | Agent-Pro | [Agent-Pro: Learning to Evolve via Policy-Level Reflection and Optimization](https://arxiv.org/abs/2402.17574) | ACL'24 | [Project](https://github.com/zwq2018/Agent-Pro) |
| Dec 2023 | LARP | [LARP: Language-Agent Role Play for Open-World Games](https://arxiv.org/abs/2312.17653) | ArXiv'23 | [Project](https://miao-ai-lab.github.io/LARP/) |
| Mar 2024 | SIMA | [Scaling Instructable Agents Across Many Simulated Worlds](https://arxiv.org/abs/2404.10179) | ArXiv'24 | [Report](https://deepmind.google/discover/blog/sima-generalist-ai-agent-for-3d-virtual-environments/) |
| Aug 2024 | Optimus-1 | [Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks](https://arxiv.org/abs/2408.03615) | ArXiv'24 | [Project](https://cybertronagent.github.io/Optimus-1.github.io/) |
### Generalist Agents for Realistic Tasks
| Date | keywords | Paper | Publication | Others |
| :-----: | :------------------: | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :---------: | :---------:
| Feb 2023 | Toolformer | [Toolformer: Language Models Can Teach Themselves to Use Tools](https://arxiv.org/abs/2302.04761) | NeurIPS'23 | [Project]() |
| Mar 2023 | RCI | [Language Models can Solve Computer Tasks](https://arxiv.org/abs/2303.17491) | ArXiv'23 | [Project](https://posgnu.github.io/rci-web/) |
| Mar 2023 | HuggingGPT | [HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face](https://arxiv.org/abs/2303.17580) | ArXiv'23 | [Project](https://huggingface.co/spaces/microsoft/HuggingGPT) |
| May 2023 | Pix2Act | [From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces](https://arxiv.org/abs/2306.00245) | NeurIPS'23 | [Project](https://github.com/google-deepmind/pix2act) |
| Jul 2023 | WebAgent | [A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis](https://arxiv.org/abs/2307.12856) | ICLR'24 | [Project]() |
| Sep 2023 | LASER | [LLM Agent with State-Space Exploration for Web Navigation](https://arxiv.org/abs/2309.08172) | ArXiv'23 | [Project](https://github.com/Mayer123/LASER) |
| Sep 2023 | Auto-GUI | [You Only Look at Screens: Multimodal Chain-of-Action Agents](https://arxiv.org/abs/2309.11436) | ACL'24 | [Project](https://github.com/cooelf/Auto-GUI) |
| Sep 2023 | Agents | [Agents: An Open-source Framework for Autonomous Language Agents](https://arxiv.org/abs/2309.07870) | ArXiv'23 | [Project](https://github.com/aiwaves-cn/agents) |
| Oct 2023 | AgentTuning | [AgentTuning: Enabling Generalized Agent Abilities for LLMs](https://arxiv.org/abs/2310.12823) | ArXiv'23 | [Project](https://thudm.github.io/AgentTuning/) |
| Dec 2023 | CogAgent | [CogAgent: A Visual Language Model for GUI Agents](https://arxiv.org/abs/2312.08914) | CVPR'24 | [Project](https://github.com/THUDM/CogAgent) |
| Dec 2023 | AppAgent | [AppAgent: Multimodal Agents as Smartphone Users](https://arxiv.org/abs/2312.13771) | ArXiv'23 | [Project](https://appagent-official.github.io/) |
| Dec 2023 | CLOVA | [CLOVA: A Closed-LOop Visual Assistant with Tool Usage and Update](https://arxiv.org/abs/2312.10908) | CVPR'24 | [Project](https://clova-tool.github.io/) |
| Jan 2024 | SeeAct | [GPT-4V(ision) is a Generalist Web Agent, if Grounded](https://arxiv.org/abs/2401.01614) | ICML'24 | [Project](https://osu-nlp-group.github.io/SeeAct/) |
| Jan 2024 | Mobile-Agent | [Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception](https://arxiv.org/abs/2401.16158) | ArXiv'24 | [Project](https://github.com/X-PLUG/MobileAgent) |
| Jan 2024 | WebVoyager | [WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models](https://arxiv.org/abs/2401.13919) | ACL'24 | [Project](https://github.com/web-voyager/webvoyager) |
| Jan 2024 | SeeClick | [SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents](https://arxiv.org/abs/2401.10935) | ArXiv'24 | [Project](https://github.com/njucckevin/SeeClick)|
| Feb 2024 | OS-Copilot | [OS-Copilot: Towards Generalist Computer Agents with Self-Improvement](https://arxiv.org/abs/2402.07456) | ArXiv'24 | [Project](https://os-copilot.github.io/) |
| Feb 2024 | ScreenAgent | [ScreenAgent: A Vision Language Model-driven Computer Control Agent](https://arxiv.org/abs/2402.07945) | ArXiv'24 | [Project](https://github.com/niuzaisheng/ScreenAgent) |
| Feb 2024 | Middleware | [Middleware for LLMs: Tools Are Instrumental for Language Agents in Complex Environments](https://arxiv.org/abs/2402.14672) | EMNLP'24 | [Project]() |
| Apr 2024 | WILBUR | [WILBUR: Adaptive In-Context Learning for Robust and Accurate Web Agents](https://arxiv.org/abs/2404.05107) | ArXiv'24 | [Project](https://michael-lutz.github.io/WILBUR/) |
| Jul 2024 | OmniParser | [OmniParser for Pure Vision Based GUI Agent](https://arxiv.org/abs/2408.00203) | ArXiv'24 | [Project](https://microsoft.github.io/OmniParser/) |
| Aug 2024 | Agent Q | [Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents](https://arxiv.org/abs/2408.07199) | ArXiv'24 | [Project](https://github.com/sentient-engineering/agent-q) |
| Oct 2024 | OS-ATLAS | [OS-ATLAS: A Foundation Action Model for Generalist GUI Agents](https://arxiv.org/abs/2410.23218) | ArXiv'24 | [Project](https://osatlas.github.io/) |
| Nov 2024 | ShowUI | [ShowUI: One Vision-Language-Action Model for GUI Visual Agent](https://arxiv.org/abs/2411.17465) | ArXiv'24 | [Project](https://github.com/showlab/ShowUI/) |
| Jan 2025 | InfiGUIAgent | [InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection](https://arxiv.org/abs/2501.04575) | ArXiv'25 | [Project](https://github.com/Reallm-Labs/InfiGUIAgent) |
| Jan 2025 | UI-TARS | [UI-TARS: Pioneering Automated GUI Interaction with Native Agents](https://arxiv.org/abs/2501.12326) | ArXiv'25 | [Project](https://github.com/bytedance/UI-TARS) |
## Datasets & Benchmarks
### For Embodied Agents
| Date | keywords | Paper | Publication | Others |
| :-----: | :------------------: | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :---------: | :---------:
| Jun 2023 | LIBERO | [LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning](https://arxiv.org/abs/2306.03310) | NeurIPS'23 | [Project](https://libero-project.github.io/main.html) |
| Oct 2023 | Open X-Embodiment | [Open X-Embodiment: Robotic Learning Datasets and RT-X Models](https://arxiv.org/abs/2310.08864) | ArXiv'24 | [Project](https://robotics-transformer-x.github.io/) |
| Oct 2023 | GenSim | [GenSim: Generating Robotic Simulation Tasks via Large Language Models](https://arxiv.org/abs/2310.01361) | ICLR'24 | [Project](https://gen-sim.github.io/) |
| Aug 2024 | ARIO | [All Robots in One: A New Standard and Unified Dataset for Versatile, General-Purpose Embodied Agents](https://arxiv.org/abs/2408.10899) | ArXiv'24 | [Project](https://imaei.github.io/project_pages/ario/) |
| May 2024 | Simpler | [Evaluating Real-World Robot Manipulation Policies in Simulation](https://arxiv.org/abs/2405.05941) | ArXiv'24 | [Project](https://simpler-env.github.io/) |
| Jun 2024 | ManiSkill3 | [ManiSkill3: GPU Parallelized Robotics Simulation and Rendering for Generalizable Embodied AI](https://arxiv.org/abs/2410.00425) | ArXiv'24 | [Project](https://www.maniskill.ai/home) |
| Jul 2024 | RoboCasa | [RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots](https://arxiv.org/abs/2406.02523) | ArXiv'24 | [Project](https://robocasa.ai/) |
| Jul 2024 | GRUtopia | [GRUtopia: Dream General Robots in a City at Scale](https://arxiv.org/abs/2407.10943) | ArXiv'24 | [Project](https://github.com/OpenRobotLab/GRUtopia) |
| Oct 2024 | Genesis | [Genesis: A Generative and Universal Physics Engine for Robotics and Beyond](https://arxiv.org/abs/2410.00425) | ArXiv'24 | [Project](https://genesis-embodied-ai.github.io/) |
| Oct 2024 | GenSim2 | [GenSim2: Scaling Robot Data Generation with Multi-modal and Reasoning LLMs](https://arxiv.org/abs/2410.03645) | CoRL'24 | [Project](https://gensim2.github.io/) |
| Dec 2024 | RoboMIND | [RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation](https://arxiv.org/abs/2412.13877) | ArXiv'24 | [Project](https://x-humanoid-robomind.github.io/) |
| Dec 2024 | VLABench | [VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks](https://arxiv.org/abs/2412.18194) | ArXiv'24 | [Project](https://vlabench.github.io/) |
| Jan 2025 | MuJoCo Playground | [MuJoCo Playground](https://playground.mujoco.org/assets/playground_technical_report.pdf) | Report'25 | [Project](https://playground.mujoco.org/) |
### For Web Agents
| Date | keywords | Paper | Publication | Others |
| :-----: | :------------------: | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :---------: | :---------:
| Jul 2022 | WebShop | [Towards Scalable Real-World Web Interaction with Grounded Language Agents](https://arxiv.org/abs/2207.01206) | NeurIPS'22 | [Project](https://webshop-pnlp.github.io/) |
| May 2023 | Mobile-Env | [Mobile-Env: An Evaluation Platform and Benchmark for Interactive Agents in LLM Era](https://arxiv.org/abs/2305.08144) | ArXiv'23 | [Project](https://github.com/X-LANCE/Mobile-Env) |
| Jun 2023 | Mind2Web | [Mind2Web: Towards a Generalist Agent for the Web](https://arxiv.org/abs/2306.06070) | NeurIPS'23 | [Project](https://osu-nlp-group.github.io/Mind2Web/) |
| Jul 2023 | WebArena | [WebArena: A Realistic Web Environment for Building Autonomous Agents](https://arxiv.org/abs/2307.13854) | ICLR'24 | [Project](https://webarena.dev/) |
| Jul 2023 | ToolBench | [ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs](https://arxiv.org/abs/2307.16789) | ICLR'24 | [Project](https://github.com/OpenBMB/ToolBench) |
| Jul 2023 | AITW | [Android in the Wild: A Large-Scale Dataset for Android Device Control](https://arxiv.org/abs/2307.10088) | ArXiv'23 | [Project](https://github.com/google-research/google-research/tree/master/android_in_the_wild) |
| Aug 2023 | AgentBench | [AgentBench: Evaluating LLMs as Agents](https://arxiv.org/abs/2308.03688) | ArXiv'23 | [Project](https://llmbench.ai/) |
| Jan 2024 | VWA | [VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks](https://arxiv.org/abs/2401.13649) | ACL'24 | [Project](https://jykoh.com/vwa) |
| Jan 2025 | A3 | [A3: Android Agent Arena for Mobile GUI Agents](https://arxiv.org/abs/2501.01149) | ArXiv'25 | [Project](https://yuxiangchai.github.io/Android-Agent-Arena/) |
| Feb 2024 | TravelPlanner | [TravelPlanner: A Benchmark for Real-World Planning with Language Agents](https://arxiv.org/abs/2402.01622) | ICML'24 | [Project](https://osu-nlp-group.github.io/TravelPlanner/) |
| Feb 2024 | OmniACT | [OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web](https://arxiv.org/abs/2402.17553) | ArXiv'24 | [Dataset](https://huggingface.co/datasets/Writer/omniact) |
| Mar 2024 | WorkArena | [WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?](https://arxiv.org/abs/2403.07718) | ArXiv'24 | [Project](https://servicenow.github.io/WorkArena/) |
| Apr 2024 | OSWorld | [OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments](https://arxiv.org/abs/2404.07864) | ArXiv'24 | [Project](https://os-world.github.io/) |
| Jul 2024 | MMAU | [MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains](https://arxiv.org/abs/2407.18961) | ArXiv'24 | [Project](https://github.com/apple/axlearn/tree/main/docs/research/mmau) |
| Sep 2024 | WindowsAgentArena | [Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale](https://arxiv.org/abs/2409.08264) | ArXiv'24 | [Project](https://github.com/microsoft/WindowsAgentArena) |
### General Benchmarks
| Date | keywords | Paper | Publication | Others |
| :-----: | :------------------: | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :---------: | :---------:
| Aug 2024 | VisualAgentBench | [VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents](https://arxiv.org/abs/2408.06327) | ArXiv'24 | [Project](https://github.com/THUDM/VisualAgentBench/) |
## 🌷
This list is updated continuously, and contributions are always welcome. If you find an interesting paper that is not included in this collection, feel free to open a pull request.
For any questions or suggestions, please contact [Yongyuan Liang](https://cheryyunl.github.io/) or [Ruihan Yang](https://rchalyang.github.io/).