# awesome-generalist-agents
A curated list of papers for generalist agents
https://github.com/cheryyunl/awesome-generalist-agents

## Generalist Agents in Both Virtual and Physical Worlds
- A Generalist Agent
- An Interactive Agent Foundation Model
- Magma: A Foundation Model for Multimodal AI Agents

## Generalist Embodied Agents

### Large Vision-Language (Action) Models
- PaLM-E: An Embodied Multimodal Language Model
- RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
- An Embodied Generalist Agent in 3D World
- Vision-Language Foundation Models as Effective Robot Imitators
- Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation
- 3D-VLA: A 3D Vision-Language-Action Generative World Model
- Octo: An Open-Source Generalist Robot Policy
- RoboUniView: Visual-Language Model with Unified View Representation for Robotic Manipulation
- Robotic Control via Embodied Chain-of-Thought Reasoning
- LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning
- TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation
- GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation
- Latent Action Pretraining from Videos
- π0: A Vision-Language-Action Flow Model for General Robot Control
- RDT-1B: A Diffusion Foundation Model for Bimanual Manipulation
- CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
- DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution
- RT-Affordance: Affordances are Versatile Intermediate Representations for Robot Manipulation
- Diffusion-VLA: Scaling Robot Foundation Models via Unified Diffusion and Autoregression
- Towards Generalist Robot Policies: What Matters in Building Vision-Language-Action Models
- Moto: Latent Motion Token as the Bridging Language for Robot Manipulation
- TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies
- NaVILA: Legged Robot Vision-Language-Action Model for Navigation
- FAST: Efficient Action Tokenization for Vision-Language-Action Models
- DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control
- OpenVLA: An Open-Source Vision-Language-Action Model

### Generalist Robotics Policies
- MT-Opt: Continuous Multi-Task Robotic Reinforcement Learning at Scale
- Learning Universal Policies via Text-Guided Video Generation
- Open-World Object Manipulation using Pre-trained Vision-Language Models
- RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation
- RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentations and Action Chunking
- Pushing the Limits of Cross-Embodiment Learning for Manipulation and Navigation
- RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics
- Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation
- Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers
- Robot Utility Models: General Policies for Zero-Shot Deployment in New Environments
- FLaRe: Achieving Masterful and Adaptive Robot Policies with Large-Scale Reinforcement Learning Fine-Tuning
- Neural MP: A Generalist Neural Motion Planner
- Data Scaling Laws in Imitation Learning for Robotic Manipulation
- The One RING: A Robotic Indoor Navigation Generalist
- Beyond Sight: Finetuning Generalist Robot Policies with Heterogeneous Sensors via Language Grounding

### Multimodal World Models

## Generalist Web Agents

### Generalist Agents for Simulated Worlds
- Agent-Pro: Learning to Evolve via Policy-Level Reflection and Optimization
- LARP: Language-Agent Role Play for Open-World Games
- Scaling Instructable Agents Across Many Simulated Worlds
- Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks

### Generalist Agents for Realistic Tasks
- Toolformer: Language Models Can Teach Themselves to Use Tools
- Language Models can Solve Computer Tasks
- HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
- From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces
- A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis
- LLM Agent with State-Space Exploration for Web Navigation ([code](https://github.com/Mayer123/LASER))
- You Only Look at Screens: Multimodal Chain-of-Action Agents
- Agents: An Open-source Framework for Autonomous Language Agents
- AgentTuning: Enabling Generalized Agent Abilities for LLMs
- CogAgent: A Visual Language Model for GUI Agents
- AppAgent: Multimodal Agents as Smartphone Users
- CLOVA: A Closed-LOop Visual Assistant with Tool Usage and Update
- GPT-4V(ision) is a Generalist Web Agent, if Grounded
- Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception
- WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models
- SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
- OS-Copilot: Towards Generalist Computer Agents with Self-Improvement
- ScreenAgent: A Vision Language Model-driven Computer Control Agent
- Middleware for LLMs: Tools Are Instrumental for Language Agents in Complex Environments
- WILBUR: Adaptive In-Context Learning for Robust and Accurate Web Agents
- OmniParser for Pure Vision Based GUI Agent
- Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents
- OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
- ShowUI: One Vision-Language-Action Model for GUI Visual Agent
- InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection
- UI-TARS: Pioneering Automated GUI Interaction with Native Agents

## Datasets & Benchmarks

### For Embodied Agents
- GenSim: Generating Robotic Simulation Tasks via Large Language Models
- All Robots in One: A New Standard and Unified Dataset for Versatile, General-Purpose Embodied Agents
- LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning
- Open X-Embodiment: Robotic Learning Datasets and RT-X Models
- Evaluating Real-World Robot Manipulation Policies in Simulation
- ManiSkill3: GPU Parallelized Robotics Simulation and Rendering for Generalizable Embodied AI
- Genesis: A Generative and Universal Physics Engine for Robotics and Beyond
- GenSim2: Scaling Robot Data Generation with Multi-modal and Reasoning LLMs
- RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation
- VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks
- MuJoCo Playground
- GRUtopia: Dream General Robots in a City at Scale

### For Web Agents
- WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents
- Mobile-Env: An Evaluation Platform and Benchmark for Interactive Agents in LLM Era
- Mind2Web: Towards a Generalist Agent for the Web
- WebArena: A Realistic Web Environment for Building Autonomous Agents
- ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs
- Android in the Wild: A Large-Scale Dataset for Android Device Control
- AgentBench: Evaluating LLMs as Agents
- VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks
- A3: Android Agent Arena for Mobile GUI Agents
- TravelPlanner: A Benchmark for Real-World Planning with Language Agents
- OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web
- WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
- MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains
- Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale

### General Benchmarks