# awesome-generalist-agents
A curated list of papers for generalist agents
https://github.com/cheryyunl/awesome-generalist-agents

## Generalist Agents in Both Virtual and Physical Worlds
- A Generalist Agent
- An Interactive Agent Foundation Model
- Magma: A Foundation Model for Multimodal AI Agents

## Generalist Embodied Agents

### Large Vision-Language (Action) Models
- PaLM-E: An Embodied Multimodal Language Model
- RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
- An Embodied Generalist Agent in 3D World
- Vision-Language Foundation Models as Effective Robot Imitators
- Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation
- 3D-VLA: A 3D Vision-Language-Action Generative World Model
- Octo: An Open-Source Generalist Robot Policy
- RoboUniView: Visual-Language Model with Unified View Representation for Robotic Manipulation
- Robotic Control via Embodied Chain-of-Thought Reasoning
- LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning
- TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation
- GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation
- Latent Action Pretraining from Videos
- π0: A Vision-Language-Action Flow Model for General Robot Control
- RDT-1B: A Diffusion Foundation Model for Bimanual Manipulation
- CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
- DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution
- RT-Affordance: Affordances are Versatile Intermediate Representations for Robot Manipulation
- Diffusion-VLA: Scaling Robot Foundation Models via Unified Diffusion and Autoregression
- Towards Generalist Robot Policies: What Matters in Building Vision-Language-Action Models
- Moto: Latent Motion Token as the Bridging Language for Robot Manipulation
- TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies
- NaVILA: Legged Robot Vision-Language-Action Model for Navigation
- FAST: Efficient Action Tokenization for Vision-Language-Action Models
- DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control
- OpenVLA: An Open-Source Vision-Language-Action Model

### Generalist Robotics Policies
- MT-Opt: Continuous Multi-Task Robotic Reinforcement Learning at Scale
- Learning Universal Policies via Text-Guided Video Generation
- Open-World Object Manipulation using Pre-trained Vision-Language Models
- RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation
- RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentations and Action Chunking
- Pushing the Limits of Cross-Embodiment Learning for Manipulation and Navigation
- RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics
- Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation
- Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers
- Robot Utility Models: General Policies for Zero-Shot Deployment in New Environments
- FLaRe: Achieving Masterful and Adaptive Robot Policies with Large-Scale Reinforcement Learning Fine-Tuning
- Neural MP: A Generalist Neural Motion Planner
- Data Scaling Laws in Imitation Learning for Robotic Manipulation
- The One RING: A Robotic Indoor Navigation Generalist
- Beyond Sight: Finetuning Generalist Robot Policies with Heterogeneous Sensors via Language Grounding

### Multimodal World Models

## Generalist Web Agents

### Generalist Agents for Simulated Worlds
- Agent-Pro: Learning to Evolve via Policy-Level Reflection and Optimization
- LARP: Language-Agent Role Play for Open-World Games
- Scaling Instructable Agents Across Many Simulated Worlds
- Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks

### Generalist Agents for Realistic Tasks
- Toolformer: Language Models Can Teach Themselves to Use Tools
- Language Models can Solve Computer Tasks
- HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
- From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces
- A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis
- LLM Agent with State-Space Exploration for Web Navigation ([code](https://github.com/Mayer123/LASER))
- You Only Look at Screens: Multimodal Chain-of-Action Agents
- Agents: An Open-source Framework for Autonomous Language Agents
- AgentTuning: Enabling Generalized Agent Abilities for LLMs
- CogAgent: A Visual Language Model for GUI Agents
- AppAgent: Multimodal Agents as Smartphone Users
- CLOVA: A Closed-LOop Visual Assistant with Tool Usage and Update
- GPT-4V(ision) is a Generalist Web Agent, if Grounded
- Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception
- WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models
- SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
- OS-Copilot: Towards Generalist Computer Agents with Self-Improvement
- ScreenAgent: A Vision Language Model-driven Computer Control Agent
- Middleware for LLMs: Tools Are Instrumental for Language Agents in Complex Environments
- WILBUR: Adaptive In-Context Learning for Robust and Accurate Web Agents
- OmniParser for Pure Vision Based GUI Agent
- Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents
- OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
- ShowUI: One Vision-Language-Action Model for GUI Visual Agent
- InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection
- UI-TARS: Pioneering Automated GUI Interaction with Native Agents

## Datasets & Benchmarks

### For Embodied Agents
- GenSim: Generating Robotic Simulation Tasks via Large Language Models
- All Robots in One: A New Standard and Unified Dataset for Versatile, General-Purpose Embodied Agents
- LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning
- Open X-Embodiment: Robotic Learning Datasets and RT-X Models
- Evaluating Real-World Robot Manipulation Policies in Simulation
- ManiSkill3: GPU Parallelized Robotics Simulation and Rendering for Generalizable Embodied AI
- Genesis: A Generative and Universal Physics Engine for Robotics and Beyond
- GenSim2: Scaling Robot Data Generation with Multi-modal and Reasoning LLMs
- RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation
- VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks
- MuJoCo Playground
- GRUtopia: Dream General Robots in a City at Scale

### For Web Agents
- WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents
- Mobile-Env: An Evaluation Platform and Benchmark for Interactive Agents in LLM Era
- Mind2Web: Towards a Generalist Agent for the Web
- WebArena: A Realistic Web Environment for Building Autonomous Agents
- ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs
- Android in the Wild: A Large-Scale Dataset for Android Device Control
- AgentBench: Evaluating LLMs as Agents
- VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks
- A3: Android Agent Arena for Mobile GUI Agents
- TravelPlanner: A Benchmark for Real-World Planning with Language Agents
- OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web
- WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
- MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains
- Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale

### General Benchmarks