# awesome-ai-agent-testing

🤖 A curated list of resources for testing AI agents - frameworks, methodologies, benchmarks, tools, and best practices for ensuring reliable, safe, and effective autonomous AI systems.
https://github.com/chaosync-org/awesome-ai-agent-testing
## Foundations

### Academic Papers
- Evaluating AI Agent Performance With Benchmarks - Comprehensive guide on evaluating AI agents in real-world scenarios with practical examples and metrics.
- 𝜏-Bench: Benchmarking AI agents for the real-world - Novel benchmark introducing task-based evaluation for AI agents' real-world performance and reliability.
- Generative Agents: Interactive Simulacra of Human Behavior - Stanford's groundbreaking paper on creating believable AI agents that simulate complex human behavior patterns.
- ReAct: Synergizing Reasoning and Acting in Language Models - Framework combining reasoning and acting in language models for improved agent performance.
- Voyager: An Open-Ended Embodied Agent with Large Language Models - Minecraft-based agent demonstrating continuous learning and skill acquisition.
- WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents - Benchmark for evaluating web-based shopping agents with real product data.
- AgentBench: Evaluating LLMs as Agents - Comprehensive benchmark suite for evaluating LLM-based agents across diverse environments.
- Holistic Evaluation of Language Models (HELM) - Stanford's comprehensive evaluation framework with multi-metric assessment.
- Multi-Agent Security: Securing Networks of AI Agents - Framework for analyzing risks in multi-agent systems, including collusion and emergent attacks.
- Safety Devolution in AI Agents - Study showing how adding tools/retrieval can degrade safety performance.
### Surveys and Reviews
- A Survey on Evaluation of Large Language Model Based Agents - Systematic review of evaluation methods for LLM-based agents.
- Testing and Debugging AI Agents: A Survey - Survey focusing specifically on testing and debugging methodologies for AI agents.
- A Survey of LLM-based Autonomous Agents - Extensive survey covering construction, application, and evaluation of LLM-based autonomous agents.
- Benchmarking of AI Agents: A Perspective - Industry perspective on the critical role of benchmarking in accelerating AI agent adoption.
- What is AI Agent Evaluation? - IBM's comprehensive overview of AI agent evaluation methodologies and their importance.
### Books and Textbooks
- Artificial Intelligence: A Modern Approach - Classic textbook with chapters on agent testing and evaluation.
- Reinforcement Learning: An Introduction - Foundational text covering agent learning and evaluation in RL contexts.