awesome-ai-agent-testing
🤖 A curated list of resources for testing AI agents - frameworks, methodologies, benchmarks, tools, and best practices for ensuring reliable, safe, and effective autonomous AI systems
https://github.com/chaosync-org/awesome-ai-agent-testing
Performance Testing
Scalability Testing
- Kubernetes - Container orchestration
- Ray - Distributed AI framework
Load Testing
- Apache JMeter - Comprehensive testing tool
- Locust - Scalable load testing framework (see the sketch after this list)
- k6 - Modern load testing tool
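Because Locust scenarios are plain Python, they adapt easily to agent APIs. A minimal sketch, assuming a hypothetical HTTP-fronted agent that exposes a `/agent/chat` JSON endpoint (the path and payload are placeholders):

```python
# Minimal Locust sketch for load-testing an HTTP-fronted agent.
# The /agent/chat endpoint and JSON payload are hypothetical placeholders.
from locust import HttpUser, task, between


class AgentUser(HttpUser):
    # Each simulated user pauses 1-3 seconds between requests.
    wait_time = between(1, 3)

    @task
    def ask_agent(self):
        # Send a simple query and mark anything other than HTTP 200 as a failure.
        with self.client.post(
            "/agent/chat",
            json={"message": "Summarize today's open support tickets."},
            catch_response=True,
        ) as response:
            if response.status_code != 200:
                response.failure(f"Unexpected status: {response.status_code}")
```

Run it with `locust -f <file> --host http://localhost:8000` (host and port are placeholders); response-time percentiles and failure rates then come from Locust's built-in reporting.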
Latency Analysis
- Jaeger - Distributed tracing system
- OpenTelemetry - Observability framework (see the tracing sketch after this list)
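For latency analysis, per-step spans make it visible whether time goes to planning, tool calls, or generation. A minimal OpenTelemetry sketch, assuming `opentelemetry-api` is installed; the `plan`, `call_tool`, and `generate` functions are hypothetical stand-ins, and without an SDK and exporter configured the spans are no-ops:

```python
# Sketch: time each agent phase with nested OpenTelemetry spans so per-step
# latency shows up in Jaeger or any other OTLP-compatible backend.
from opentelemetry import trace

tracer = trace.get_tracer("agent.testing")


# Hypothetical stand-ins for the agent's internals; replace with real calls.
def plan(query: str) -> str:
    return f"search:{query}"


def call_tool(step: str) -> str:
    return "3 open tickets"


def generate(step: str, observation: str) -> str:
    return f"{step} -> {observation}"


def handle_request(user_query: str) -> str:
    # One parent span per request, with a child span per agent phase.
    with tracer.start_as_current_span("agent.request") as span:
        span.set_attribute("agent.query_length", len(user_query))
        with tracer.start_as_current_span("agent.plan"):
            step = plan(user_query)
        with tracer.start_as_current_span("agent.tool_call"):
            observation = call_tool(step)
        with tracer.start_as_current_span("agent.generate"):
            return generate(step, observation)
```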
Testing Frameworks
Commercial Solutions
- Vertex AI Gen AI Evaluation Service - Google Cloud's agent evaluation service.
- Arize AI - ML observability platform with agent testing capabilities.
- Galileo AI - Comprehensive evaluation platform for AI agents.
- Athina AI - Specialized platform for LLM and agent evaluation.
- Confident AI - LLM evaluation and testing platform.
Multi-Agent Testing Frameworks
- MASON - Multi-agent simulation toolkit with testing capabilities.
- NetLogo - Multi-agent programmable modeling environment.
- Repast - Agent-based modeling and simulation platform.
- JADE Test Suite - Testing framework for JADE multi-agent systems.
Category-Specific Testing Tools
- Habitat - Embodied AI platform
- RoboSuite - Robot learning benchmark
- ToolBench - Large-scale tool-use evaluation
- AgentBench - Comprehensive agent evaluation platform
- LegalBench - Legal reasoning evaluation
- τ-bench (TAU-bench) - Real-world task benchmark
- Botium - Open-source testing framework for chatbots and voice assistants
- Rasa Test - Testing framework for Rasa conversational AI
- VoiceBench - Evaluation suite for voice assistants
- API-Bank - Tool-augmented LLM evaluation
- HealthBench - Medical AI agent evaluation
- FinBench - Financial AI evaluation
- SIMA Benchmark - 3D virtual environment testing
Open Source Frameworks
- LangChain - **15k+ stars** - Framework for developing applications powered by language models with extensive testing utilities.
- MetaGPT - **35k+ stars** - Multi-agent meta-programming framework.
- AutoGen - **20k+ stars** - Microsoft's framework for building conversational agents with comprehensive testing tools.
- CAMEL - Communicative Agents for "Mind" Exploration of Large Scale Language Model Society.
- LangSmith Evaluation - Comprehensive evaluation toolkit with automatic LLM-as-a-judge scoring.
- AgentVerse - Framework for building and testing multi-agent systems.
Language-Specific Tools
- DeepEval - Open-source LLM evaluation framework for testing complex agent behaviors (see the example test after this list).
- CheckList - Behavioral testing methodology and tool for NLP models.
- Jest-Agents - Jest extension for agent testing.
- Agent-Testing-Library - Testing utilities for JS agents.
- Cypress-AI - E2E testing for web-based agents.
- JUnit-Agents - JUnit extensions for agent testing.
- AgentTestKit - Comprehensive testing toolkit for Java agents.
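As one concrete example from the list above, a pytest-style DeepEval check might look like the sketch below; it assumes `deepeval` is installed and a judge model is configured (by default DeepEval expects an OpenAI API key), and the input/output strings are placeholders:

```python
# Sketch of a DeepEval test that scores an agent answer for relevancy
# with an LLM-as-a-judge metric and fails below a chosen threshold.
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def test_agent_answer_relevancy():
    test_case = LLMTestCase(
        input="What is the refund window for annual plans?",
        actual_output="Annual plans can be refunded within 30 days of purchase.",
    )
    # Fails the test if the judged relevancy score drops below 0.7.
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```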
Chaos Engineering and Fault Injection
Chaos Testing Tools
- Gremlin - Enterprise chaos engineering platform.
- LitmusChaos - Cloud-native chaos engineering.
- Chaos Toolkit - Open source chaos engineering toolkit.
- IBM Adversarial Robustness Toolbox (ART) - Python library for ML security testing.
- Chaos Monkey - Netflix's resiliency tool.
Resilience Testing
- Hystrix - Latency and fault tolerance library.
- Resilience4j - Fault tolerance library.
Fault Injection Libraries
- Fault-Injection-Library - Generic fault injection for testing.
- PyFI - Python fault injection library.
- Chaos Engineering Toolkit - Comprehensive chaos engineering tools.
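The pattern these libraries support can also be reproduced without any framework: wrap an agent's tool calls so they occasionally fail or stall, then assert that retries and fallbacks behave. A generic, purely illustrative sketch (none of the names below come from the libraries listed above):

```python
# Generic fault-injection sketch: a decorator that randomly raises or delays
# a tool call so tests can exercise the agent's retry and fallback paths.
import functools
import random
import time


def inject_faults(failure_rate: float = 0.2, max_delay_s: float = 2.0):
    """Wrap a callable so it sometimes raises or responds slowly."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if random.random() < failure_rate:
                raise TimeoutError("injected fault: simulated tool outage")
            time.sleep(random.uniform(0.0, max_delay_s))  # simulated latency spike
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_faults(failure_rate=0.3)
def search_tool(query: str) -> str:
    # Hypothetical tool the agent under test depends on.
    return f"results for {query!r}"
```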
Observability and Monitoring
Production Monitoring Platforms
Logging Standards
- OpenTelemetry GenAI Convention - Emerging standard for AI observability (see the attribute sketch below).
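The GenAI semantic conventions are still marked experimental, so attribute names may shift between releases; the sketch below reflects the draft convention and uses illustrative values:

```python
# Sketch: tag an LLM-call span with (experimental) OpenTelemetry GenAI
# semantic-convention attributes; values are illustrative placeholders.
from opentelemetry import trace

tracer = trace.get_tracer("agent.llm")

with tracer.start_as_current_span("chat gpt-4o") as span:
    span.set_attribute("gen_ai.operation.name", "chat")
    span.set_attribute("gen_ai.request.model", "gpt-4o")
    span.set_attribute("gen_ai.usage.input_tokens", 412)
    span.set_attribute("gen_ai.usage.output_tokens", 97)
```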
Foundations
Academic Papers
- Generative Agents: Interactive Simulacra of Human Behavior - Stanford's groundbreaking paper on creating believable AI agents that simulate complex human behavior patterns.
- ReAct: Synergizing Reasoning and Acting in Language Models - Framework combining reasoning and acting in language models for improved agent performance.
- WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents - Benchmark for evaluating web-based shopping agents with real product data.
- Voyager: An Open-Ended Embodied Agent with Large Language Models - Minecraft-based agent demonstrating continuous learning and skill acquisition.
- AgentBench: Evaluating LLMs as Agents - Comprehensive benchmark suite for evaluating LLM-based agents across diverse environments.
- Holistic Evaluation of Language Models (HELM) - Stanford's comprehensive evaluation framework with multi-metric assessment.
- Evaluating AI Agent Performance With Benchmarks - Comprehensive guide on evaluating AI agents in real-world scenarios with practical examples and metrics.
- τ-Bench: Benchmarking AI agents for the real-world - Novel benchmark introducing task-based evaluation for AI agents' real-world performance and reliability.
- Multi-Agent Security: Securing Networks of AI Agents - Framework for risks in multi-agent systems including collusion and emergent attacks.
Surveys and Reviews
- A Survey of LLM-based Autonomous Agents - Extensive survey covering construction, application, and evaluation of LLM-based autonomous agents.
- Benchmarking of AI Agents: A Perspective - Industry perspective on the critical role of benchmarking in accelerating AI agent adoption.
- What is AI Agent Evaluation? - IBM's comprehensive overview of AI agent evaluation methodologies and their importance.
- A Survey on Evaluation of Large Language Model Based Agents - Systematic review of evaluation methods for LLM-based agents.
- Testing and Debugging AI Agents: A Survey - Survey focusing specifically on testing and debugging methodologies for AI agents.
Books and Textbooks
- Artificial Intelligence: A Modern Approach - Classic textbook with chapters on agent testing and evaluation.
- Reinforcement Learning: An Introduction - Foundational text covering agent learning and evaluation in RL contexts.
- Multi-Agent Systems: Algorithmic, Game-Theoretic, and Logical Foundations - Comprehensive coverage of multi-agent system testing.
Simulation Environments
Dynamic Testing Environments
- Meta-World - Benchmark for multi-task RL
- OpenAI Gym - Toolkit for developing RL agents (see the Gymnasium sketch after this list)
- RLlib - Scalable RL with dynamic environments
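A minimal evaluation-loop sketch using Gymnasium, the maintained fork of OpenAI Gym; the random action is a placeholder for the agent under test:

```python
# Sketch: run one CartPole episode and report the return.
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=42)
total_reward = 0.0

for _ in range(500):
    action = env.action_space.sample()  # replace with the agent's policy
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += float(reward)
    if terminated or truncated:
        break

env.close()
print(f"Episode return: {total_reward}")
```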
Game-Based Environments
- Dota 2 Bot API - Complex multi-agent environment
- MineRL - Minecraft competitions for RL
- StarCraft II LE - StarCraft II Learning Environment
- OpenSpiel - Collection of game environments
Virtual Worlds
Practical Resources
Case Studies
- OpenAI GPT-4 Tool Use Evaluation - Systematic evaluation of tool-using capabilities.
- DEFCON LLM Red Team Challenge 2023 - Public testing of LLM vulnerabilities.
- Air Canada Chatbot Hallucination Case - Chatbot provided incorrect refund-policy information, leading to legal liability for the airline.
- Meta's Cicero Diplomacy Agent - Human-level performance in strategy game.
- AutoGPT Loop Failures - Early autonomous agent experiments.
Tutorials and Guides
- AI Agents Testing 101 - Beginner's guide to agent testing
- Building Your First Test Suite - Hands-on tutorial
- Agent Testing Best Practices - Industry guidelines
- From Manual to Automated Testing - Automation guide
- Distributed Agent Testing - Testing at scale
- Multi-Agent System Testing - Complex scenarios
- Performance Optimization - Tuning guide
- Security Testing Deep Dive - Advanced security
Code Repositories
- Agent Testing Examples - Collection of test cases
- Testing Templates - Reusable test templates
- Benchmark Implementations - Reference implementations
- CI/CD Pipelines - Automation examples
Videos and Courses
- AI Agent Testing Fundamentals - 6-hour comprehensive course
- Practical Agent Testing - Hands-on Coursera course
- Advanced Testing Techniques - MIT OpenCourseWare
- Multi-Agent Testing - Specialized course
- Testing AI Agents at Scale - NeurIPS 2024 - Industry insights
- Safety Testing for Production - ICML 2024 - Safety focus
- Chaos Engineering for AI - KubeCon 2024 - Infrastructure testing
AI Agent Categories
Benchmarks and Evaluation
Leaderboards
- LMSYS Chatbot Arena - Live competitive evaluation platform
- HELM Benchmark - Holistic evaluation of language models
- BIG-bench - Beyond the Imitation Game benchmark
- AgentBench Leaderboard - Multi-environment agent rankings
Datasets
- SWE-Bench - Software engineering agent benchmark for code generation.
- SMAC - StarCraft Multi-Agent Challenge.
- TruthfulQA - 817 questions testing agent truthfulness across domains.
- WebShop - E-commerce environment for grounded language agents.
- Hanabi - Cooperative multi-agent card game.
- GAIA Benchmark - General AI Assistant benchmark for fundamental agent capabilities.
- WorkBench - Dataset focusing on workplace tasks like email and scheduling.
- ALFWorld - Text-based embodied agents in interactive environments.
- ScienceWorld - Science experiments and reasoning tasks.
- TextWorld - Text-based game environments for RL agents.
- MARL Benchmark - Multi-agent reinforcement learning tasks.
Evaluation Frameworks
- EleutherAI LM Evaluation Harness - Framework for few-shot evaluation (see the usage sketch after this list)
- OpenAI Evals - Framework for evaluating LLMs
- HELM - Holistic Evaluation of Language Models
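As an illustration, the EleutherAI harness can be driven from Python as well as from its `lm_eval` CLI; the sketch below uses illustrative model and task names, and the exact API may differ between harness versions:

```python
# Sketch: evaluate a small Hugging Face model on one task with lm-eval
# (pip install lm-eval); results are keyed by task name.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",                                    # Hugging Face backend
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["lambada_openai"],
    num_fewshot=0,
)
print(results["results"]["lambada_openai"])
```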
Industry Applications
Autonomous Vehicles
- ISO 26262 Compliance - Functional safety standard
- SOTIF Guidelines - Safety of the intended functionality
- Simulation Platforms - Testing environments
Healthcare
- FDA AI/ML Guidance - Regulatory framework
- Healthcare AI Testing Standards - Industry standards
- Clinical Validation Framework - Validation methods
Finance
- Financial AI Testing Guidelines - Industry standards
- Regulatory Compliance Testing - Compliance frameworks
- Backtesting Frameworks - Historical validation
Customer Service
- Conversational AI Testing - Best practices
- Customer Satisfaction Metrics - KPI frameworks
- Multilingual Testing - Language coverage
Safety and Security Testing
Adversarial Testing
- CleverHans - Library for adversarial example generation
- TextAttack - Framework for adversarial attacks on NLP models (see the sketch after this list)
- PAIR - Prompt Automatic Iterative Refinement
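As an example of the tools above, a TextAttack robustness run with the TextFooler recipe against a Hugging Face classifier might look like the sketch below; the model and dataset names are illustrative, and the first run downloads both:

```python
# Sketch: attack a sentiment classifier with TextFooler and report results
# for a handful of examples.
import transformers
from textattack import Attacker, AttackArgs
from textattack.attack_recipes import TextFoolerJin2019
from textattack.datasets import HuggingFaceDataset
from textattack.models.wrappers import HuggingFaceModelWrapper

model_name = "textattack/bert-base-uncased-SST-2"
model = transformers.AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
model_wrapper = HuggingFaceModelWrapper(model, tokenizer)

attack = TextFoolerJin2019.build(model_wrapper)
dataset = HuggingFaceDataset("glue", "sst2", "validation")
attacker = Attacker(attack, dataset, AttackArgs(num_examples=10))
attacker.attack_dataset()
```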
Red Teaming
- Microsoft PyRIT - Python Risk Identification Tool for GenAI
- LLM Guard - Security toolkit for LLMs
- Anthropic Red Team Dataset - Curated red team prompts
- AI Safety Benchmark - Comprehensive safety evaluation
Safety Evaluation
- Safety Gym - OpenAI's constrained RL environments
- AI Safety Gridworlds - DeepMind's safety testing environments
- Alignment Research Center Evals - Alignment-focused evaluations