awesome-ai-agent-testing
🤖 A curated list of resources for testing AI agents - frameworks, methodologies, benchmarks, tools, and best practices for ensuring reliable, safe, and effective autonomous AI systems
https://github.com/chaosync-org/awesome-ai-agent-testing
Performance Testing
Scalability Testing
- Kubernetes - Container orchestration
- Ray - Distributed AI framework
Load Testing
- Apache JMeter - Comprehensive testing tool
- Locust - Scalable load testing framework (see the sketch after this list)
- k6 - Modern load testing tool
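Because Locust scenarios are plain Python, they adapt easily to agent APIs. A minimal sketch, assuming a hypothetical HTTP-fronted agent that exposes a `/agent/chat` JSON endpoint (the path and payload are placeholders):

```python
# Minimal Locust sketch for load-testing an HTTP-fronted agent.
# The /agent/chat endpoint and JSON payload are hypothetical placeholders.
from locust import HttpUser, task, between


class AgentUser(HttpUser):
    # Each simulated user pauses 1-3 seconds between requests.
    wait_time = between(1, 3)

    @task
    def ask_agent(self):
        # Send a simple query and mark anything other than HTTP 200 as a failure.
        with self.client.post(
            "/agent/chat",
            json={"message": "Summarize today's open support tickets."},
            catch_response=True,
        ) as response:
            if response.status_code != 200:
                response.failure(f"Unexpected status: {response.status_code}")
```

Run it with `locust -f <file> --host http://localhost:8000` (host and port are placeholders); response-time percentiles and failure rates then come from Locust's built-in reporting.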
Latency Analysis
- Jaeger - Distributed tracing system
- OpenTelemetry - Observability framework (see the tracing sketch after this list)
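For latency analysis, per-step spans make it visible whether time goes to planning, tool calls, or generation. A minimal OpenTelemetry sketch, assuming `opentelemetry-api` is installed; the `plan`, `call_tool`, and `generate` functions are hypothetical stand-ins, and without an SDK and exporter configured the spans are no-ops:

```python
# Sketch: time each agent phase with nested OpenTelemetry spans so per-step
# latency shows up in Jaeger or any other OTLP-compatible backend.
from opentelemetry import trace

tracer = trace.get_tracer("agent.testing")


# Hypothetical stand-ins for the agent's internals; replace with real calls.
def plan(query: str) -> str:
    return f"search:{query}"


def call_tool(step: str) -> str:
    return "3 open tickets"


def generate(step: str, observation: str) -> str:
    return f"{step} -> {observation}"


def handle_request(user_query: str) -> str:
    # One parent span per request, with a child span per agent phase.
    with tracer.start_as_current_span("agent.request") as span:
        span.set_attribute("agent.query_length", len(user_query))
        with tracer.start_as_current_span("agent.plan"):
            step = plan(user_query)
        with tracer.start_as_current_span("agent.tool_call"):
            observation = call_tool(step)
        with tracer.start_as_current_span("agent.generate"):
            return generate(step, observation)
```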
Testing Frameworks
Commercial Solutions
- Vertex AI Gen AI Evaluation Service - Google Cloud's agent evaluation service.
- Arize AI - ML observability platform with agent testing capabilities.
- Galileo AI - Comprehensive evaluation platform for AI agents.
- Athina AI - Specialized platform for LLM and agent evaluation.
- Confident AI - LLM evaluation and testing platform.
Multi-Agent Testing Frameworks
- MASON - Multi-agent simulation toolkit with testing capabilities.
- NetLogo - Multi-agent programmable modeling environment.
- Repast - Agent-based modeling and simulation platform.
- JADE Test Suite - Testing framework for JADE multi-agent systems.
Category-Specific Testing Tools
- Habitat - Embodied AI platform
- RoboSuite - Robot learning benchmark
- ToolBench - Large-scale tool-use evaluation
- AgentBench - Comprehensive agent evaluation platform
- LegalBench - Legal reasoning evaluation
- τ-bench (TAU-bench) - Real-world task benchmark
- Botium - Open-source testing framework for chatbots and voice assistants
- Rasa Test - Testing framework for Rasa conversational AI
- VoiceBench - Evaluation suite for voice assistants
- API-Bank - Tool-augmented LLM evaluation
- HealthBench - Medical AI agent evaluation
- FinBench - Financial AI evaluation
- SIMA Benchmark - 3D virtual environment testing
Open Source Frameworks
- LangChain - **15k+ stars** - Framework for developing applications powered by language models with extensive testing utilities.
- MetaGPT - **35k+ stars** - Multi-agent meta-programming framework.
- AutoGen - **20k+ stars** - Microsoft's framework for building conversational agents with comprehensive testing tools.
- CAMEL - Communicative Agents for "Mind" Exploration of Large Scale Language Model Society.
- LangSmith Evaluation - Comprehensive evaluation toolkit with automatic LLM-as-a-judge scoring.
- AgentVerse - Framework for building and testing multi-agent systems.
Language-Specific Tools
- DeepEval - Open-source LLM evaluation framework for testing complex agent behaviors (see the example test after this list).
- CheckList - Behavioral testing methodology and tool for NLP models.
- Jest-Agents - Jest extension for agent testing.
- Agent-Testing-Library - Testing utilities for JS agents.
- Cypress-AI - E2E testing for web-based agents.
- JUnit-Agents - JUnit extensions for agent testing.
- AgentTestKit - Comprehensive testing toolkit for Java agents.
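As one concrete example from the list above, a pytest-style DeepEval check might look like the sketch below; it assumes `deepeval` is installed and a judge model is configured (by default DeepEval expects an OpenAI API key), and the input/output strings are placeholders:

```python
# Sketch of a DeepEval test that scores an agent answer for relevancy
# with an LLM-as-a-judge metric and fails below a chosen threshold.
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def test_agent_answer_relevancy():
    test_case = LLMTestCase(
        input="What is the refund window for annual plans?",
        actual_output="Annual plans can be refunded within 30 days of purchase.",
    )
    # Fails the test if the judged relevancy score drops below 0.7.
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```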
Chaos Engineering and Fault Injection
Chaos Testing Tools
- Gremlin - Enterprise chaos engineering platform.
- LitmusChaos - Cloud-native chaos engineering.
- Chaos Toolkit - Open source chaos engineering toolkit.
- IBM Adversarial Robustness Toolbox (ART) - Python library for ML security testing.
- Chaos Monkey - Netflix's resiliency tool.
Resilience Testing
- Hystrix - Latency and fault tolerance library.
- Resilience4j - Fault tolerance library.
Fault Injection Libraries
- Fault-Injection-Library - Generic fault injection for testing.
- PyFI - Python fault injection library.
- Chaos Engineering Toolkit - Comprehensive chaos engineering tools.
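The pattern these libraries support can also be reproduced without any framework: wrap an agent's tool calls so they occasionally fail or stall, then assert that retries and fallbacks behave. A generic, purely illustrative sketch (none of the names below come from the libraries listed above):

```python
# Generic fault-injection sketch: a decorator that randomly raises or delays
# a tool call so tests can exercise the agent's retry and fallback paths.
import functools
import random
import time


def inject_faults(failure_rate: float = 0.2, max_delay_s: float = 2.0):
    """Wrap a callable so it sometimes raises or responds slowly."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if random.random() < failure_rate:
                raise TimeoutError("injected fault: simulated tool outage")
            time.sleep(random.uniform(0.0, max_delay_s))  # simulated latency spike
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_faults(failure_rate=0.3)
def search_tool(query: str) -> str:
    # Hypothetical tool the agent under test depends on.
    return f"results for {query!r}"
```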
Observability and Monitoring
Production Monitoring Platforms
Logging Standards
- OpenTelemetry GenAI Convention - Emerging standard for AI observability (see the attribute sketch below).
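The GenAI semantic conventions are still marked experimental, so attribute names may shift between releases; the sketch below reflects the draft convention and uses illustrative values:

```python
# Sketch: tag an LLM-call span with (experimental) OpenTelemetry GenAI
# semantic-convention attributes; values are illustrative placeholders.
from opentelemetry import trace

tracer = trace.get_tracer("agent.llm")

with tracer.start_as_current_span("chat gpt-4o") as span:
    span.set_attribute("gen_ai.operation.name", "chat")
    span.set_attribute("gen_ai.request.model", "gpt-4o")
    span.set_attribute("gen_ai.usage.input_tokens", 412)
    span.set_attribute("gen_ai.usage.output_tokens", 97)
```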
Foundations
Academic Papers
- Generative Agents: Interactive Simulacra of Human Behavior - Stanford's groundbreaking paper on creating believable AI agents that simulate complex human behavior patterns.
- ReAct: Synergizing Reasoning and Acting in Language Models - Framework combining reasoning and acting in language models for improved agent performance.
- WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents - Benchmark for evaluating web-based shopping agents with real product data.
- Voyager: An Open-Ended Embodied Agent with Large Language Models - Minecraft-based agent demonstrating continuous learning and skill acquisition.
- AgentBench: Evaluating LLMs as Agents - Comprehensive benchmark suite for evaluating LLM-based agents across diverse environments.
- Holistic Evaluation of Language Models (HELM) - Stanford's comprehensive evaluation framework with multi-metric assessment.
- Evaluating AI Agent Performance With Benchmarks - Comprehensive guide on evaluating AI agents in real-world scenarios with practical examples and metrics.
- τ-Bench: Benchmarking AI agents for the real-world - Novel benchmark introducing task-based evaluation for AI agents' real-world performance and reliability.
- Multi-Agent Security: Securing Networks of AI Agents - Framework for risks in multi-agent systems including collusion and emergent attacks.
Surveys and Reviews
- A Survey of LLM-based Autonomous Agents - Extensive survey covering construction, application, and evaluation of LLM-based autonomous agents.
- Benchmarking of AI Agents: A Perspective - Industry perspective on the critical role of benchmarking in accelerating AI agent adoption.
- What is AI Agent Evaluation? - IBM's comprehensive overview of AI agent evaluation methodologies and their importance.
- A Survey on Evaluation of Large Language Model Based Agents - Systematic review of evaluation methods for LLM-based agents.
- Testing and Debugging AI Agents: A Survey - Survey focusing specifically on testing and debugging methodologies for AI agents.
Books and Textbooks
- Artificial Intelligence: A Modern Approach - Classic textbook with chapters on agent testing and evaluation.
- Reinforcement Learning: An Introduction - Foundational text covering agent learning and evaluation in RL contexts.
- Multi-Agent Systems: Algorithmic, Game-Theoretic, and Logical Foundations - Comprehensive coverage of multi-agent system testing.
Simulation Environments
Dynamic Testing Environments
- Meta-World - Benchmark for multi-task RL
- OpenAI Gym - Toolkit for developing RL agents (see the Gymnasium sketch after this list)
- RLlib - Scalable RL with dynamic environments
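A minimal evaluation-loop sketch using Gymnasium, the maintained fork of OpenAI Gym; the random action is a placeholder for the agent under test:

```python
# Sketch: run one CartPole episode and report the return.
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=42)
total_reward = 0.0

for _ in range(500):
    action = env.action_space.sample()  # replace with the agent's policy
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += float(reward)
    if terminated or truncated:
        break

env.close()
print(f"Episode return: {total_reward}")
```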
Game-Based Environments
- Dota 2 Bot API - Complex multi-agent environment
- MineRL - Minecraft competitions for RL
- StarCraft II LE - StarCraft II Learning Environment
- OpenSpiel - Collection of game environments
Virtual Worlds
Practical Resources
Case Studies
- OpenAI GPT-4 Tool Use Evaluation - Systematic evaluation of tool-using capabilities.
- DEFCON LLM Red Team Challenge 2023 - Public testing of LLM vulnerabilities.
- Air Canada Chatbot Hallucination Case - Chatbot provided incorrect refund-policy information, leading to legal liability for the airline.
- Meta's Cicero Diplomacy Agent - Human-level performance in strategy game.
- AutoGPT Loop Failures - Early autonomous agent experiments.
Tutorials and Guides
- AI Agents Testing 101 - Beginner's guide to agent testing
- Building Your First Test Suite - Hands-on tutorial
- Agent Testing Best Practices - Industry guidelines
- From Manual to Automated Testing - Automation guide
- Distributed Agent Testing - Testing at scale
- Multi-Agent System Testing - Complex scenarios
- Performance Optimization - Tuning guide
- Security Testing Deep Dive - Advanced security
Code Repositories
- Agent Testing Examples - Collection of test cases
- Testing Templates - Reusable test templates
- Benchmark Implementations - Reference implementations
- CI/CD Pipelines - Automation examples
Videos and Courses
- AI Agent Testing Fundamentals - 6-hour comprehensive course
- Practical Agent Testing - Hands-on Coursera course
- Advanced Testing Techniques - MIT OpenCourseWare
- Multi-Agent Testing - Specialized course
- Testing AI Agents at Scale - NeurIPS 2024 - Industry insights
- Safety Testing for Production - ICML 2024 - Safety focus
- Chaos Engineering for AI - KubeCon 2024 - Infrastructure testing
AI Agent Categories
Benchmarks and Evaluation
Leaderboards
- LMSYS Chatbot Arena - Live competitive evaluation platform
- HELM Benchmark - Holistic evaluation of language models
- BIG-bench - Beyond the Imitation Game benchmark
- AgentBench Leaderboard - Multi-environment agent rankings
Datasets
- SWE-Bench - Software engineering agent benchmark for code generation.
- SMAC - StarCraft Multi-Agent Challenge.
- TruthfulQA - 817 questions testing agent truthfulness across domains.
- WebShop - E-commerce environment for grounded language agents.
- Hanabi - Cooperative multi-agent card game.
- GAIA Benchmark - General AI Assistant benchmark for fundamental agent capabilities.
- WorkBench - Dataset focusing on workplace tasks like email and scheduling.
- ALFWorld - Text-based embodied agents in interactive environments.
- ScienceWorld - Science experiments and reasoning tasks.
- TextWorld - Text-based game environments for RL agents.
- MARL Benchmark - Multi-agent reinforcement learning tasks.
Evaluation Frameworks
- EleutherAI LM Evaluation Harness - Framework for few-shot evaluation (see the usage sketch after this list)
- OpenAI Evals - Framework for evaluating LLMs
- HELM - Holistic Evaluation of Language Models
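As an illustration, the EleutherAI harness can be driven from Python as well as from its `lm_eval` CLI; the sketch below uses illustrative model and task names, and the exact API may differ between harness versions:

```python
# Sketch: evaluate a small Hugging Face model on one task with lm-eval
# (pip install lm-eval); results are keyed by task name.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",                                    # Hugging Face backend
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["lambada_openai"],
    num_fewshot=0,
)
print(results["results"]["lambada_openai"])
```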
Industry Applications
Autonomous Vehicles
- ISO 26262 Compliance - Functional safety standard
- SOTIF Guidelines - Safety of the intended functionality
- Simulation Platforms - Testing environments
Healthcare
- FDA AI/ML Guidance - Regulatory framework
- Healthcare AI Testing Standards - Industry standards
- Clinical Validation Framework - Validation methods
Finance
- Financial AI Testing Guidelines - Industry standards
- Regulatory Compliance Testing - Compliance frameworks
- Backtesting Frameworks - Historical validation
Customer Service
- Conversational AI Testing - Best practices
- Customer Satisfaction Metrics - KPI frameworks
- Multilingual Testing - Language coverage
Safety and Security Testing
Adversarial Testing
- CleverHans - Library for adversarial example generation
- TextAttack - Framework for adversarial attacks on NLP models (see the sketch after this list)
- PAIR - Prompt Automatic Iterative Refinement
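As an example of the tools above, a TextAttack robustness run with the TextFooler recipe against a Hugging Face classifier might look like the sketch below; the model and dataset names are illustrative, and the first run downloads both:

```python
# Sketch: attack a sentiment classifier with TextFooler and report results
# for a handful of examples.
import transformers
from textattack import Attacker, AttackArgs
from textattack.attack_recipes import TextFoolerJin2019
from textattack.datasets import HuggingFaceDataset
from textattack.models.wrappers import HuggingFaceModelWrapper

model_name = "textattack/bert-base-uncased-SST-2"
model = transformers.AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
model_wrapper = HuggingFaceModelWrapper(model, tokenizer)

attack = TextFoolerJin2019.build(model_wrapper)
dataset = HuggingFaceDataset("glue", "sst2", "validation")
attacker = Attacker(attack, dataset, AttackArgs(num_examples=10))
attacker.attack_dataset()
```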
Red Teaming
- Microsoft PyRIT - Python Risk Identification Tool for GenAI
- LLM Guard - Security toolkit for LLMs
- Anthropic Red Team Dataset - Curated red team prompts
- AI Safety Benchmark - Comprehensive safety evaluation
Safety Evaluation
- Safety Gym - OpenAI's constrained RL environments
- AI Safety Gridworlds - DeepMind's safety testing environments
- Alignment Research Center Evals - Alignment-focused evaluations