# awesome-ai-agent-testing

🤖 A curated list of resources for testing AI agents - frameworks, methodologies, benchmarks, tools, and best practices for ensuring reliable, safe, and effective autonomous AI systems.
https://github.com/chaosync-org/awesome-ai-agent-testing
## Foundations

### Academic Papers
- Evaluating AI Agent Performance With Benchmarks - Comprehensive guide on evaluating AI agents in real-world scenarios with practical examples and metrics.
- 𝜏-Bench: Benchmarking AI agents for the real-world - Novel benchmark introducing task-based evaluation for AI agents' real-world performance and reliability.
- Generative Agents: Interactive Simulacra of Human Behavior - Stanford's groundbreaking paper on creating believable AI agents that simulate complex human behavior patterns.
- ReAct: Synergizing Reasoning and Acting in Language Models - Framework combining reasoning and acting in language models for improved agent performance.
- Voyager: An Open-Ended Embodied Agent with Large Language Models - Minecraft-based agent demonstrating continuous learning and skill acquisition.
- WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents - Benchmark for evaluating web-based shopping agents with real product data.
- AgentBench: Evaluating LLMs as Agents - Comprehensive benchmark suite for evaluating LLM-based agents across diverse environments.
- Holistic Evaluation of Language Models (HELM) - Stanford's comprehensive evaluation framework with multi-metric assessment.
- Multi-Agent Security: Securing Networks of AI Agents - Framework for analyzing risks in multi-agent systems, including collusion and emergent attacks.
- Safety Devolution in AI Agents - Study showing how adding tools/retrieval can degrade safety performance.
### Surveys and Reviews
- A Survey on Evaluation of Large Language Model Based Agents - Systematic review of evaluation methods for LLM-based agents.
- Testing and Debugging AI Agents: A Survey - Survey focusing specifically on testing and debugging methodologies for AI agents.
- A Survey of LLM-based Autonomous Agents - Extensive survey covering construction, application, and evaluation of LLM-based autonomous agents.
- Benchmarking of AI Agents: A Perspective - Industry perspective on the critical role of benchmarking in accelerating AI agent adoption.
- What is AI Agent Evaluation? - IBM's comprehensive overview of AI agent evaluation methodologies and their importance.
### Books and Textbooks
- Artificial Intelligence: A Modern Approach - Classic textbook with chapters on agent testing and evaluation.
- Reinforcement Learning: An Introduction - Foundational text covering agent learning and evaluation in RL contexts.