{"id":27742448,"url":"https://github.com/ericflo/agentoptim","last_synced_at":"2025-09-08T12:38:46.331Z","repository":{"id":281444409,"uuid":"942372320","full_name":"ericflo/agentoptim","owner":"ericflo","description":"AgentOptim is a focused-but-powerful set of MCP tools that allows an MCP-aware agent to optimize a prompt in a data-driven way.","archived":false,"fork":false,"pushed_at":"2025-03-13T07:55:09.000Z","size":1383,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-28T16:57:12.290Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ericflo.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":null,"code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-03-04T02:25:48.000Z","updated_at":"2025-03-11T06:22:23.000Z","dependencies_parsed_at":"2025-03-09T06:17:51.339Z","dependency_job_id":"cf41c7a9-ae81-4aa9-9678-deca23b49ad8","html_url":"https://github.com/ericflo/agentoptim","commit_stats":null,"previous_names":["ericflo/agentoptim"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ericflo/agentoptim","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ericflo%2Fagentoptim","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ericflo%2Fagentoptim/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ericflo%2Fagentoptim/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposi
tories/ericflo%2Fagentoptim/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ericflo","download_url":"https://codeload.github.com/ericflo/agentoptim/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ericflo%2Fagentoptim/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":274187113,"owners_count":25237721,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-08T02:00:09.813Z","response_time":121,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-04-28T16:41:06.911Z","updated_at":"2025-09-08T12:38:46.314Z","avatar_url":"https://github.com/ericflo.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n```\n  █████  ██████  ███████ ███    ██ ████████  ██████  ██████  ████████ ██ ███    ███ \n ██   ██ ██      ██      ████   ██    ██    ██    ██ ██   ██    ██    ██ ████  ████ \n ███████ ██  ███ █████   ██ ██  ██    ██    ██    ██ ██████     ██    ██ ██ ████ ██ \n ██   ██ ██   ██ ██      ██  ██ ██    ██    ██    ██ ██         ██    ██ ██  ██  ██ \n██   ██  █████  ███████ ██   ████    ██     ██████  ██         ██    ██ ██      ██\n```\n### 📚 Your Complete Toolkit for AI Conversation Evaluation and Optimization\n\n# 🔍 AgentOptim v2.1.1 ✨\n\n[![PyPI 
Version](https://img.shields.io/badge/pypi-v2.1.1-blue)](https://pypi.org/project/agentoptim/)\n[![Python Version](https://img.shields.io/badge/python-3.8%2B-brightgreen)](https://www.python.org/downloads/)\n[![License](https://img.shields.io/badge/license-MIT-green)](LICENSE)\n[![Test Coverage](https://img.shields.io/badge/coverage-91%25-brightgreen)](https://github.com/ericflo/agentoptim)\n[![MCP Compatible](https://img.shields.io/badge/MCP-compatible-blue)](https://github.com/anthropics/mcp)\n[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](CONTRIBUTING.md)\n[![Stars](https://img.shields.io/github/stars/ericflo/agentoptim?style=social)](https://github.com/ericflo/agentoptim)\n\n**Your Complete Toolkit for AI Conversation Evaluation and Optimization**\n### Measure, Compare, and Improve AI Conversations with Precision\n\n[Quickstart](docs/QUICKSTART.md) • \n[Documentation](docs/API_REFERENCE.md) • \n[Examples](examples/) • \n[Contributing](CONTRIBUTING.md)\n\n\u003c/div\u003e\n\nAgentOptim is a powerful toolkit built on the Model Context Protocol (MCP) that enables AI engineers, prompt engineers, and developers to systematically evaluate, optimize, and compare AI conversation quality. With its streamlined 2-tool architecture, AgentOptim provides a data-driven approach to measuring and improving agent interactions through:\n\n- **Objective evaluation criteria** to assess conversation quality\n- **Consistent measurement** across different models and approaches\n- **Quantitative insights** to identify improvement opportunities\n- **Parallel processing** for efficient large-scale evaluation\n- **Standardized metrics** to track progress over time\n\n## 📋 Evaluation Results Storage\n\nAgentOptim provides persistent storage for evaluation results, allowing you to retrieve past evaluation results by ID and list all evaluation runs. 
This feature is fully integrated and tested with comprehensive documentation.\n\n### Key Features\n\n- **Persistent Storage**: Evaluations are stored on disk and can be retrieved at any time\n- **Consistent IDs**: Each evaluation has a unique ID that remains consistent when retrieved\n- **Pagination Support**: Browse through large numbers of evaluations with pagination\n- **Rich Metadata**: Each evaluation stores its timestamp, EvalSet details, judge model, and full results\n- **Powerful Filtering**: List evaluations filtered by EvalSet ID\n- **Complete Access**: Get both summary metrics and detailed judgments for each evaluation\n\n### API Usage\n\n```python\n# Run evaluation and store results\neval_result = await manage_eval_runs_tool(\n    action=\"run\",\n    evalset_id=\"6f8d9e2a-5b4c-4a3f-8d1e-7f9a6b5c4d3e\",\n    conversation=[\n        {\"role\": \"user\", \"content\": \"How do I reset my password?\"},\n        {\"role\": \"assistant\", \"content\": \"To reset your password, go to the login page...\"}\n    ]\n)\n\n# Get the run ID for future reference\neval_run_id = eval_result[\"id\"]\n\n# Later, retrieve the evaluation by ID\npast_eval = await manage_eval_runs_tool(\n    action=\"get\",\n    eval_run_id=eval_run_id\n)\n\n# List all evaluation runs\nall_runs = await manage_eval_runs_tool(\n    action=\"list\",\n    page=1,\n    page_size=10\n)\n```\n\n
Whether you're fine-tuning production agents, comparing prompt strategies, or benchmarking different AI models, AgentOptim gives you the tools to make data-driven decisions about conversation quality.\n\n## 🚀 What's New in v2.1.1!\n\nVersion 2.1.1 adds delightful CLI enhancements that make AgentOptim even more user-friendly and productive:\n\n- ✨ **Enhanced User Experience** - Interactive conversation creation, colorful output, and smart command suggestions\n- 📊 **Intelligent Progress Visualization** - Real-time progress tracking with adaptive ETA estimation\n- 💡 **Productivity Features** - Command chaining, auto-completion, and contextual help system\n- 🔧 **Advanced Error Handling** - Actionable troubleshooting suggestions with executable commands\n- 🧩 **Personalization** - Theme support, skill level adaptation, and time-based interactions\n\nVersion 2.1.0 completed our architectural simplification by removing the legacy compatibility layer and delivering a clean, modern API:\n\n- **Removed compatibility layer** - No more legacy code or backward compatibility\n- **Streamlined API** - Just 2 powerful tools for all your evaluation needs\n- **Improved test coverage** - Enhanced reliability with comprehensive testing\n- **Comprehensive documentation** - API reference, architecture guide, quickstart, and more\n- **12+ detailed examples** - From basic usage to advanced techniques\n- **Performance enhancements** - Optimized for speed and reduced memory usage\n- **Expanded model support** - Works seamlessly with OpenAI, Claude, and LM Studio models\n\n## 🔄 Core Architecture: The 2-Tool Evaluation System\n\n```mermaid\n%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#3498db', 'primaryTextColor': '#fff', 'lineColor': '#2980b9', 'tertiaryColor': '#f5f5f5'}}}%%\nflowchart TD\n    User([AI Engineer/Developer]) --\u003e 
|\"\u0026nbsp;\u0026nbsp;Creates evaluation criteria\u0026nbsp;\u0026nbsp;\"| A[\"🛠️ manage_evalset_tool\"]\n    User --\u003e |\"\u0026nbsp;\u0026nbsp;Manages evaluations\u0026nbsp;\u0026nbsp;\"| B[\"🔬 manage_eval_runs_tool\"]\n    \n    subgraph Creation [\"📝 Evaluation Creation\"]\n        A --\u003e |\"\u0026nbsp;\u0026nbsp;Stores\u0026nbsp;\u0026nbsp;\"| C[(\"📊 EvalSets\u003cbr/\u003eCriteria, Templates, Metadata\")]\n    end\n    \n    subgraph Execution [\"⚙️ Evaluation Execution\"]\n        B --\u003e |\"\u0026nbsp;\u0026nbsp;Processes\u0026nbsp;\u0026nbsp;\"| E[\"🧩 Conversations\u003cbr/\u003e(User + AI interactions)\"]\n        E --\u003e |\"\u0026nbsp;\u0026nbsp;Analyzed by\u0026nbsp;\u0026nbsp;\"| D[\"🧠 Judge Models\u003cbr/\u003e(Claude/GPT/Local)\"]\n        D --\u003e |\"\u0026nbsp;\u0026nbsp;Produces\u0026nbsp;\u0026nbsp;\"| F[\"📈 Results\u003cbr/\u003eJudgments, Confidence\u003cbr/\u003e\u0026 Summary metrics\"]\n        F --\u003e |\"\u0026nbsp;\u0026nbsp;Stored as\u0026nbsp;\u0026nbsp;\"| G[(\"📋 EvalRuns\u003cbr/\u003ePersistent Results Storage\")]\n        B --\u003e |\"\u0026nbsp;\u0026nbsp;Retrieves\u0026nbsp;\u0026nbsp;\"| G\n    end\n    \n    C --\u003e |\"\u0026nbsp;\u0026nbsp;Provides criteria for\u0026nbsp;\u0026nbsp;\"| B\n    G --\u003e |\"\u0026nbsp;\u0026nbsp;Enables historical analysis\u0026nbsp;\u0026nbsp;\"| User\n    \n    classDef primary fill:#3498db,stroke:#2980b9,color:white,stroke-width:2px;\n    classDef tool1 fill:#2ecc71,stroke:#27ae60,color:white,stroke-width:2px;\n    classDef tool2 fill:#e74c3c,stroke:#c0392b,color:white,stroke-width:2px;\n    classDef storage fill:#9b59b6,stroke:#8e44ad,color:white,stroke-width:2px;\n    classDef model fill:#f39c12,stroke:#e67e22,color:white,stroke-width:2px;\n    classDef result fill:#1abc9c,stroke:#16a085,color:white,stroke-width:2px;\n    classDef conversation fill:#34495e,stroke:#2c3e50,color:white,stroke-width:2px;\n    classDef creation 
fill:#f5f5f5,stroke:#bdc3c7,color:#333,stroke-width:2px;\n    classDef execution fill:#f5f5f5,stroke:#bdc3c7,color:#333,stroke-width:2px;\n    \n    class User primary;\n    class A tool1;\n    class B tool2;\n    class C,G storage;\n    class D model;\n    class F result;\n    class E conversation;\n    class Creation creation;\n    class Execution execution;\n\n    %% Add tooltip descriptions\n    linkStyle 0 stroke:#2ecc71,stroke-width:2px;\n    linkStyle 1 stroke:#e74c3c,stroke-width:2px;\n    linkStyle 2,3,4,5,6,7,8 stroke:#7f8c8d,stroke-width:2px;\n```\n\nAgentOptim's architecture is built on two powerful tools that work together seamlessly:\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003e📊 manage_evalset_tool\u003c/b\u003e - Create and manage evaluation criteria sets\u003c/summary\u003e\n\n```python\n# Create an EvalSet with evaluation criteria\nevalset_result = await manage_evalset_tool(\n    action=\"create\",\n    name=\"Response Quality\",\n    questions=[\n        \"Is the response helpful?\",\n        \"Is the response clear?\",\n        \"Is the response accurate?\"\n    ],\n    short_description=\"Basic quality assessment\",\n    long_description=\"This EvalSet measures response quality across key dimensions. Use it to evaluate general helpfulness, clarity and accuracy of assistant responses.\" + \" \" * 50\n)\n\n# Get the EvalSet ID\nevalset_id = evalset_result[\"evalset\"][\"id\"]\n```\n\nThis tool allows you to:\n- Define yes/no questions to evaluate conversational quality\n- Organize evaluation criteria for different use cases\n- Create, get, update, list, and delete EvalSets\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003e🔬 manage_eval_runs_tool\u003c/b\u003e - Run, store, and retrieve evaluations\u003c/summary\u003e\n\n```python\n# 1. 
Run a new evaluation\nresults = await manage_eval_runs_tool(\n    action=\"run\",\n    evalset_id=evalset_id,\n    conversation=[\n        {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n        {\"role\": \"user\", \"content\": \"How do I reset my password?\"},\n        {\"role\": \"assistant\", \"content\": \"To reset your password, please...\"}\n    ]\n)\n\n# Check the results and note the evaluation ID\neval_id = results[\"id\"]\nprint(f\"Score: {results['summary']['yes_percentage']}%\")\nprint(f\"Evaluation ID: {eval_id}\")\n\n# 2. Later, retrieve the evaluation by ID\npast_eval = await manage_eval_runs_tool(\n    action=\"get\",\n    eval_run_id=eval_id\n)\n\n# 3. List all previous evaluations (paginated)\nall_evals = await manage_eval_runs_tool(\n    action=\"list\",\n    page=1,\n    page_size=10\n)\n```\n\nThis tool allows you to:\n- Run evaluations on conversations and store the results\n- Retrieve past evaluation results for analysis\n- List all previous evaluations with pagination\n- Track evaluation history over time\n\u003c/details\u003e\n\n## 📚 Documentation Roadmap\n\nWe're expanding our documentation to make AgentOptim more accessible and powerful. 
Here's our roadmap:\n\n- [x] **Core Documentation**\n  - [x] [README.md](README.md) - Project overview and quick start\n  - [x] [MIGRATION_GUIDE.md](docs/MIGRATION_GUIDE.md) - Migrating from v1.x to v2.x\n  - [x] [API_REFERENCE.md](docs/API_REFERENCE.md) - Comprehensive API documentation\n  - [x] [ARCHITECTURE.md](docs/ARCHITECTURE.md) - Detailed system architecture and design decisions\n  - [x] [CHANGELOG.md](CHANGELOG.md) - Detailed version history and changes\n\n- [ ] **Tutorials**\n  - [x] [TUTORIAL.md](docs/TUTORIAL.md) - Getting started with AgentOptim\n  - [x] [QUICKSTART.md](docs/QUICKSTART.md) - Get up and running in under 5 minutes\n  - [ ] [ADVANCED_TUTORIAL.md](docs/ADVANCED_TUTORIAL.md) - Advanced usage patterns and techniques\n  - [ ] [BEST_PRACTICES.md](docs/BEST_PRACTICES.md) - Recommendations for effective evaluations\n  - [ ] [CUSTOMIZATION_GUIDE.md](docs/CUSTOMIZATION_GUIDE.md) - Creating custom evaluation templates\n\n- [ ] **Use Case Guides**\n  - [ ] [AGENT_OPTIMIZATION.md](docs/AGENT_OPTIMIZATION.md) - Improving agent responses\n  - [ ] [COMPARATIVE_ANALYSIS.md](docs/COMPARATIVE_ANALYSIS.md) - Comparing different models or approaches\n  - [ ] [QUALITY_MONITORING.md](docs/QUALITY_MONITORING.md) - Monitoring response quality over time\n  - [ ] [MULTI_MODAL_EVALUATION.md](docs/MULTI_MODAL_EVALUATION.md) - Evaluating multi-modal conversations\n  - [ ] [ETHICAL_EVALUATIONS.md](docs/ETHICAL_EVALUATIONS.md) - Evaluating for ethical considerations\n  - [ ] [BIAS_DETECTION.md](docs/BIAS_DETECTION.md) - Detecting bias in model responses\n\n- [ ] **Technical Guides**\n  - [ ] [INTEGRATION_GUIDE.md](docs/INTEGRATION_GUIDE.md) - Integrating with existing systems\n  - [ ] [PERFORMANCE_TUNING.md](docs/PERFORMANCE_TUNING.md) - Optimizing for speed and efficiency\n  - [ ] [CUSTOM_MODELS.md](docs/CUSTOM_MODELS.md) - Using different judge models\n  - [ ] [SECURITY_GUIDE.md](docs/SECURITY_GUIDE.md) - Best practices for secure deployment\n  - [ ] 
[SCALING_GUIDE.md](docs/SCALING_GUIDE.md) - Scaling evaluations for production use\n  - [ ] [TROUBLESHOOTING.md](docs/TROUBLESHOOTING.md) - Common issues and their solutions\n\n- [x] **Example Library**\n  - [x] [usage_example.py](examples/usage_example.py) - Basic usage\n  - [x] [evalset_example.py](examples/evalset_example.py) - Comprehensive features\n  - [x] [support_response_evaluation.py](examples/support_response_evaluation.py) - Support response quality\n  - [x] [conversation_comparison.py](examples/conversation_comparison.py) - Comparing different conversation approaches\n  - [x] [prompt_testing.py](examples/prompt_testing.py) - Testing different system prompts\n  - [x] [multilingual_evaluation.py](examples/multilingual_evaluation.py) - Evaluating responses in different languages\n  - [x] [custom_template_example.py](examples/custom_template_example.py) - Creating custom templates\n  - [x] [batch_evaluation.py](examples/batch_evaluation.py) - Evaluating multiple conversations efficiently\n  - [x] [automated_reporting.py](examples/automated_reporting.py) - Generating evaluation reports\n  - [x] [conversation_benchmark.py](examples/conversation_benchmark.py) - Benchmarking conversation quality\n  - [x] [model_comparison.py](examples/model_comparison.py) - Comparing different judge models\n  - [x] [response_improvement.py](examples/response_improvement.py) - Iterative response improvement\n\n## 💻 Quick Example\n\n```python\nimport asyncio\nfrom agentoptim import manage_evalset_tool, manage_eval_runs_tool\n\nasync def main():\n    # 1️⃣ Create an EvalSet with quality criteria\n    evalset_result = await manage_evalset_tool(\n        action=\"create\",\n        name=\"Helpfulness Evaluation\",\n        questions=[\n            \"Is the response helpful for the user's needs?\",\n            \"Does the response directly address the user's question?\",\n            \"Is the response clear and easy to understand?\",\n            \"Is the response accurate?\",\n     
       \"Does the response provide complete information?\"\n        ],\n        short_description=\"Basic helpfulness evaluation\"\n    )\n    \n    # Get the EvalSet ID\n    evalset_id = evalset_result[\"evalset\"][\"id\"]\n    print(f\"Created evaluation set with ID: {evalset_id}\")\n    \n    # 2️⃣ Define a conversation to evaluate\n    conversation = [\n        {\"role\": \"system\", \"content\": \"You are a helpful AI assistant.\"},\n        {\"role\": \"user\", \"content\": \"How do I reset my password?\"},\n        {\"role\": \"assistant\", \"content\": \"To reset your password, please go to the login page and click on 'Forgot Password'. You'll receive an email with instructions to create a new password.\"}\n    ]\n    \n    # 3️⃣ Run the evaluation\n    results = await manage_eval_runs_tool(\n        action=\"run\",\n        evalset_id=evalset_id,\n        conversation=conversation\n    )\n    \n    # 4️⃣ View the results\n    print(f\"Overall score: {results['summary']['yes_percentage']}%\")\n    print(f\"Evaluation saved with ID: {results['id']}\")  # This ID is auto-generated\n    for item in results[\"results\"]:\n        print(f\"✅ {item['question']}\" if item[\"judgment\"] else f\"❌ {item['question']}\")\n    \n    # 5️⃣ Retrieve the evaluation later using the ID\n    retrieved_results = await manage_eval_runs_tool(\n        action=\"get\",\n        eval_run_id=results['id']\n    )\n    print(f\"\\nRetrieved evaluation (ID: {retrieved_results['eval_run']['id']})\")\n    print(f\"Score: {retrieved_results['eval_run']['summary']['yes_percentage']}%\")\n\nif __name__ == \"__main__\":\n    asyncio.run(main())\n```\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003e📘 View output\u003c/b\u003e\u003c/summary\u003e\n\n```\nCreated evaluation set with ID: 6f8d9e2a-5b4c-4a3f-8d1e-7f9a6b5c4d3e\n\nOverall score: 100.0%\nEvaluation saved with ID: 9f8d7e6a-5b4c-4a3f-8d1e-7f9a6b5c4d3e\n\n✅ Is the response helpful for the user's needs?\n✅ Does the response 
directly address the user's question?\n✅ Is the response clear and easy to understand?\n✅ Is the response accurate?\n✅ Does the response provide complete information?\n\nRetrieved evaluation (ID: 9f8d7e6a-5b4c-4a3f-8d1e-7f9a6b5c4d3e)\nScore: 100.0%\n```\n\n**Note:** IDs are automatically generated by the system:\n- EvalSet IDs are created when you define evaluation criteria\n- Evaluation run IDs are created when you run an evaluation\n- You can always use `latest` to retrieve the most recent evaluation: `agentoptim run get latest`\n\n\u003c/details\u003e\n\nFor more comprehensive examples, check out our [examples directory](examples/) with 12+ detailed use cases.\n\n## 🔧 Installation and Setup\n\n### 📥 Installation\n\n```bash\npip install agentoptim\n```\n\n### 🚀 Using the AgentOptim CLI\n\nAgentOptim provides a powerful and delightful command-line interface for evaluation and optimization:\n\n```bash\n# Start the MCP server\nagentoptim server\n\n# EvalSet Management\nagentoptim evalset create --wizard          # Create a new evaluation set interactively\nagentoptim evalset list                     # List all evaluation sets with their IDs\nagentoptim evalset get \u003cid\u003e                 # Get details about a specific evaluation set\n\n# Run Management \nagentoptim run create \u003cevalset-id\u003e conversation.json   # Run an evaluation (generates ID automatically)\nagentoptim run get latest                   # Get the most recent evaluation result\nagentoptim run list                         # List all your evaluation runs\nagentoptim run get \u003cid\u003e                     # Get a specific evaluation by ID\n\n# Interactive Mode\nagentoptim run create \u003cevalset-id\u003e --interactive       # Create and evaluate a conversation interactively\n\n# Results Export\nagentoptim run export latest --format html --output report.html  # Export as HTML report\nagentoptim run export latest --format markdown --charts          # Export as Markdown with charts\nagentoptim 
run export latest --format csv --output results.csv   # Export as CSV data\n\n# Comparison\nagentoptim run compare latest latest-1      # Compare latest two evaluation runs\nagentoptim run compare latest-1 latest-2 --detailed   # Compare with detailed reasoning\nagentoptim run compare latest latest-1 --format html --output diff.html  # HTML comparison\n\n# Input Options\nagentoptim run create \u003cevalset-id\u003e --text response.txt # Evaluate a text file\n\n# Model Selection\nagentoptim run create \u003cevalset-id\u003e conversation.json --model \"gpt-4o\"   # Specify model\nagentoptim run create \u003cevalset-id\u003e conversation.json --provider openai   # Use OpenAI\n\n# Developer \u0026 Automation\nagentoptim dev cache                        # View cache statistics\nagentoptim run list --format json -q        # Machine-readable output for scripts\nagentoptim run get latest --format json -q  # Quiet mode for scripting\n\n# Command Completion\nagentoptim --install-completion             # Install shell tab completion\n```\n\nAll commands use auto-generated IDs - you don't need to remember them, and you can always use `latest` to refer to the most recent run!\n\nRun `agentoptim --help` for complete CLI documentation.\n\n### 🧠 CLI Power User Features\n\nAgentOptim includes several features designed for power users and automation:\n\n- **Command Timer**: Set `AGENTOPTIM_SHOW_TIMER=1` to see execution time for commands\n- **Command Suggestions**: Get helpful corrections when mistyping commands\n- **Shell Completion**: Install tab completion with `--install-completion`\n- **Latest Run References**: Use `latest`, `latest-1`, `latest-2`, etc., to refer to recent runs\n- **Progress Visualization**: Watch real-time progress during evaluations\n- **Export Formats**: Generate professional reports in HTML, Markdown, CSV, and more\n- **Quiet Mode**: Use `-q` or `--quiet` to suppress output for scripting/automation\n- **Auto-Open Reports**: Exported files automatically open 
in your browser\n\n### 🔄 CLI Migration Guide\n\n\u003e **Note:** In version 2.1.1, we've introduced a new, more intuitive CLI command structure. \n\u003e If you're updating from a previous version, you'll need to update your scripts and commands.\n\n| Old Command | New Command |\n|-------------|-------------|\n| `agentoptim list` | `agentoptim evalset list` |\n| `agentoptim get \u003cid\u003e` | `agentoptim evalset get \u003cid\u003e` |\n| `agentoptim create ...` | `agentoptim evalset create ...` |\n| `agentoptim update \u003cid\u003e ...` | `agentoptim evalset update \u003cid\u003e ...` |\n| `agentoptim delete \u003cid\u003e` | `agentoptim evalset delete \u003cid\u003e` |\n| `agentoptim eval \u003cid\u003e \u003cfile\u003e` | `agentoptim run create \u003cid\u003e \u003cfile\u003e` |\n| `agentoptim runs run \u003cid\u003e \u003cfile\u003e` | `agentoptim run create \u003cid\u003e \u003cfile\u003e` |\n| `agentoptim runs get \u003cid\u003e` | `agentoptim run get \u003cid\u003e` |\n| `agentoptim runs list` | `agentoptim run list` |\n| `agentoptim runs list --page-size 20` | `agentoptim run list --limit 20` |\n| `agentoptim eval \u003cid\u003e --no-reasoning` | `agentoptim run create \u003cid\u003e --brief` |\n| `agentoptim eval \u003cid\u003e --parallel 5` | `agentoptim run create \u003cid\u003e --concurrency 5` |\n| `agentoptim stats` | `agentoptim dev cache` |\n\nYou can also use the shorthand aliases for frequently used commands:\n- `agentoptim es` instead of `agentoptim evalset`\n- `agentoptim r` instead of `agentoptim run`\n\n### 🚀 Starting the MCP Server\n\nStart the AgentOptim server with:\n\n```bash\n# Simplest way to start the server\nagentoptim server\n\n# Alternative using Python module\npython -m agentoptim server\n```\n\nWhen started with no options, the server:\n- Runs on the default port (40000)\n- Uses the default judge model (meta-llama-3.1-8b-instruct)\n- Includes reasoning details in evaluation results\n\n### ⚙️ Configuration Options\n\nControl 
AgentOptim's behavior with these environment variables:\n\n| Environment Variable | Purpose | Default |\n|----------------------|---------|---------|\n| `AGENTOPTIM_DEBUG=1` | Enable detailed debug logging | Disabled (0) |\n| `AGENTOPTIM_JUDGE_MODEL=model-name` | Set default judge model | meta-llama-3.1-8b-instruct |\n| `AGENTOPTIM_OMIT_REASONING=1` | Omit reasoning in results | Disabled (0) |\n| `AGENTOPTIM_PORT=port` | Set custom port number | 40000 |\n\n**Example with custom settings:**\n```bash\n# Run with GPT-4o-mini as judge, omit reasoning details\nAGENTOPTIM_JUDGE_MODEL=gpt-4o-mini AGENTOPTIM_OMIT_REASONING=1 agentoptim server\n```\n\n### 🔌 Configuring Claude Code\n\nTo use AgentOptim with Claude Code, add it to your `config.json` file as an MCP server. Here are configuration examples for different LLM providers:\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003e📱 Local Models with LM Studio\u003c/b\u003e (recommended for getting started)\u003c/summary\u003e\n\n```json\n{\n  \"mcpServers\": {\n    \"optim\": {\n      \"command\": \"agentoptim\",\n      \"args\": [],\n      \"options\": {\n        \"env\": {\n          \"AGENTOPTIM_JUDGE_MODEL\": \"meta-llama-3.1-8b-instruct\"\n        }\n      }\n    }\n  }\n}\n```\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003e☁️ OpenAI Models\u003c/b\u003e (for GPT-4, GPT-4o, etc.)\u003c/summary\u003e\n\n```json\n{\n  \"mcpServers\": {\n    \"optim\": {\n      \"command\": \"agentoptim\",\n      \"args\": [],\n      \"options\": {\n        \"env\": {\n          \"OPENAI_API_KEY\": \"your_openai_api_key_here\",\n          \"AGENTOPTIM_JUDGE_MODEL\": \"gpt-4o-mini\"\n        }\n      }\n    }\n  }\n}\n```\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003e🧠 Anthropic Models\u003c/b\u003e (for Claude 3 Opus, Sonnet, Haiku)\u003c/summary\u003e\n\n```json\n{\n  \"mcpServers\": {\n    \"optim\": {\n      \"command\": \"agentoptim\",\n      \"args\": [],\n      
\"options\": {\n        \"env\": {\n          \"ANTHROPIC_API_KEY\": \"your_anthropic_api_key_here\",\n          \"AGENTOPTIM_JUDGE_MODEL\": \"claude-3-sonnet-20240229\"\n        }\n      }\n    }\n  }\n}\n```\n\u003c/details\u003e\n\nAfter adding the configuration, launch Claude Code with:\n\n```bash\nclaude --mcp-server=optim\n```\n\n### 🧩 Model Selection and API Providers\n\nAgentOptim supports multiple AI providers and models for your evaluations:\n\n#### CLI Provider Selection\n\nUse the `--provider` flag to easily select different AI providers:\n\n```bash\n# Use OpenAI models (sets API base URL and default model)\nagentoptim run create \u003cevalset-id\u003e conversation.json --provider openai\n\n# Use Anthropic models\nagentoptim run create \u003cevalset-id\u003e conversation.json --provider anthropic\n\n# Use local models (default)\nagentoptim run create \u003cevalset-id\u003e conversation.json --provider local\n```\n\nEach provider sets appropriate defaults:\n- `openai`: Uses OpenAI API with gpt-4o-mini as default model\n- `anthropic`: Uses Anthropic API with claude-3-5-haiku as default model\n- `local`: Uses localhost:1234/v1 with meta-llama-3.1-8b-instruct as default model\n\n#### Model Selection Priority\n\nAgentOptim determines which model to use for evaluations through this order:\n\n| Priority | Method | Example |\n|----------|--------|---------|\n| 1️⃣ Highest | CLI model flag | `agentoptim run create \u003cid\u003e conv.json --model gpt-4o-mini` |\n| 2️⃣ Second | Environment variable | `AGENTOPTIM_JUDGE_MODEL=claude-3-haiku-20240307 agentoptim` |\n| 3️⃣ Third | Provider default | Based on selected `--provider` |\n| 4️⃣ Default | Built-in fallback | `meta-llama-3.1-8b-instruct` |\n\n**💡 Pro Tips:**\n- Use `--provider` for quick switching between OpenAI, Anthropic, and local models\n- For fine-grained control, use the `--model` flag to specify exact models\n- Set API keys via `OPENAI_API_KEY` or `ANTHROPIC_API_KEY` environment variables\n- For consistent team usage, 
configure model and provider in Claude Code settings\n\n## 🏆 Key Use Cases\n\nAgentOptim solves critical challenges in AI conversation development:\n\n\u003cdiv align=\"center\"\u003e\n\u003ctable\u003e\n\u003ctr\u003e\n\u003ctd align=\"center\" width=\"20%\"\u003e\u003ch3\u003e📊\u003c/h3\u003e\u003cb\u003eQuality Assurance\u003c/b\u003e\u003c/td\u003e\n\u003ctd\u003e\n\u003cb\u003eProblem:\u003c/b\u003e Inconsistent quality across AI conversations\u003cbr/\u003e\n\u003cb\u003eSolution:\u003c/b\u003e Standardized evaluation criteria ensure your AI meets quality benchmarks for helpfulness, clarity, accuracy, and tone\u003cbr/\u003e\n\u003cb\u003eExample:\u003c/b\u003e \u003ca href=\"examples/conversation_benchmark.py\"\u003econversation_benchmark.py\u003c/a\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\n\u003ctr\u003e\n\u003ctd align=\"center\"\u003e\u003ch3\u003e🔍\u003c/h3\u003e\u003cb\u003eA/B Testing\u003c/b\u003e\u003c/td\u003e\n\u003ctd\u003e\n\u003cb\u003eProblem:\u003c/b\u003e Choosing between different conversation approaches\u003cbr/\u003e\n\u003cb\u003eSolution:\u003c/b\u003e Side-by-side evaluations of different prompts, models or response styles\u003cbr/\u003e\n\u003cb\u003eExample:\u003c/b\u003e \u003ca href=\"examples/prompt_testing.py\"\u003eprompt_testing.py\u003c/a\u003e, \u003ca href=\"examples/conversation_comparison.py\"\u003econversation_comparison.py\u003c/a\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\n\u003ctr\u003e\n\u003ctd align=\"center\"\u003e\u003ch3\u003e📈\u003c/h3\u003e\u003cb\u003eContinuous Improvement\u003c/b\u003e\u003c/td\u003e\n\u003ctd\u003e\n\u003cb\u003eProblem:\u003c/b\u003e Unsure where to focus improvement efforts\u003cbr/\u003e\n\u003cb\u003eSolution:\u003c/b\u003e Detailed reporting highlights specific weaknesses in agent responses\u003cbr/\u003e\n\u003cb\u003eExample:\u003c/b\u003e \u003ca 
href=\"examples/response_improvement.py\"\u003eresponse_improvement.py\u003c/a\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\n\u003ctr\u003e\n\u003ctd align=\"center\"\u003e\u003ch3\u003e🌐\u003c/h3\u003e\u003cb\u003eMultilingual Testing\u003c/b\u003e\u003c/td\u003e\n\u003ctd\u003e\n\u003cb\u003eProblem:\u003c/b\u003e Ensuring quality across languages\u003cbr/\u003e\n\u003cb\u003eSolution:\u003c/b\u003e Language-specific evaluation criteria and multilingual judge models\u003cbr/\u003e\n\u003cb\u003eExample:\u003c/b\u003e \u003ca href=\"examples/multilingual_evaluation.py\"\u003emultilingual_evaluation.py\u003c/a\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\n\u003ctr\u003e\n\u003ctd align=\"center\"\u003e\u003ch3\u003e🔄\u003c/h3\u003e\u003cb\u003eRegression Testing\u003c/b\u003e\u003c/td\u003e\n\u003ctd\u003e\n\u003cb\u003eProblem:\u003c/b\u003e New updates breaking existing functionality\u003cbr/\u003e\n\u003cb\u003eSolution:\u003c/b\u003e Automated quality checks to ensure changes don't degrade performance\u003cbr/\u003e\n\u003cb\u003eExample:\u003c/b\u003e \u003ca href=\"examples/batch_evaluation.py\"\u003ebatch_evaluation.py\u003c/a\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/table\u003e\n\u003c/div\u003e\n\nView our [examples directory](examples/) for complete implementations of these use cases and more.\n\n## 💯 Why AgentOptim?\n\n\u003cdiv align=\"center\"\u003e\n\u003ctable\u003e\n\u003ctr\u003e\n\u003cth width=\"25%\" align=\"center\"\u003e🛠️ Simple 2-Tool API\u003c/th\u003e\n\u003cth width=\"25%\" align=\"center\"\u003e🤖 Multiple Judge Models\u003c/th\u003e\n\u003cth width=\"25%\" align=\"center\"\u003e⚡ Parallel Evaluation\u003c/th\u003e\n\u003cth width=\"25%\" align=\"center\"\u003e🔌 MCP Native\u003c/th\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"center\"\u003eJust two intuitive tools for all evaluation needs\u003c/td\u003e\n\u003ctd align=\"center\"\u003eOpenAI, Claude, LM Studio \u0026 custom models\u003c/td\u003e\n\u003ctd 
align=\"center\"\u003e40% faster evaluations with automatic parallelization\u003c/td\u003e\n\u003ctd align=\"center\"\u003eSeamless integration with Model Context Protocol\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/table\u003e\n\u003c/div\u003e\n\n### Comparison with Alternatives\n\n\u003ctable\u003e\n\u003ctr\u003e\n\u003cth\u003eFeature\u003c/th\u003e\n\u003cth\u003eAgentOptim\u003c/th\u003e\n\u003cth\u003eRAGAS\u003c/th\u003e\n\u003cth\u003ePromptfoo\u003c/th\u003e\n\u003cth\u003eCustom Scripts\u003c/th\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd\u003e\u003cb\u003eArchitecture\u003c/b\u003e\u003c/td\u003e\n\u003ctd\u003e✨ 2-tool MCP interface\u003c/td\u003e\n\u003ctd\u003ePython library\u003c/td\u003e\n\u003ctd\u003eCLI \u0026 configs\u003c/td\u003e\n\u003ctd\u003eCustom code\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd\u003e\u003cb\u003eSetup Time\u003c/b\u003e\u003c/td\u003e\n\u003ctd\u003e✨ Minutes\u003c/td\u003e\n\u003ctd\u003eHours\u003c/td\u003e\n\u003ctd\u003eHours\u003c/td\u003e\n\u003ctd\u003eDays\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd\u003e\u003cb\u003eJudge Models\u003c/b\u003e\u003c/td\u003e\n\u003ctd\u003e✨ OpenAI, Claude, LM Studio \u0026 custom\u003c/td\u003e\n\u003ctd\u003eLimited\u003c/td\u003e\n\u003ctd\u003eOpenAI only\u003c/td\u003e\n\u003ctd\u003eVaries\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd\u003e\u003cb\u003eConversation Format\u003c/b\u003e\u003c/td\u003e\n\u003ctd\u003e✨ Standard chat format\u003c/td\u003e\n\u003ctd\u003eRAG-specific\u003c/td\u003e\n\u003ctd\u003eLimited\u003c/td\u003e\n\u003ctd\u003eCustom\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd\u003e\u003cb\u003eParallel Evaluation\u003c/b\u003e\u003c/td\u003e\n\u003ctd\u003e✨ Automatic\u003c/td\u003e\n\u003ctd\u003e❌ Manual\u003c/td\u003e\n\u003ctd\u003e⚠️ Limited\u003c/td\u003e\n\u003ctd\u003e❌ 
Custom\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd\u003e\u003cb\u003eCaching\u003c/b\u003e\u003c/td\u003e\n\u003ctd\u003e✨ Automatic\u003c/td\u003e\n\u003ctd\u003e❌ Manual\u003c/td\u003e\n\u003ctd\u003e⚠️ Limited\u003c/td\u003e\n\u003ctd\u003e❌ Custom\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd\u003e\u003cb\u003eTemplate System\u003c/b\u003e\u003c/td\u003e\n\u003ctd\u003e✨ Full Jinja2\u003c/td\u003e\n\u003ctd\u003e❌ Limited\u003c/td\u003e\n\u003ctd\u003e⚠️ Basic\u003c/td\u003e\n\u003ctd\u003e✨ Custom\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd\u003e\u003cb\u003eExamples \u0026 Docs\u003c/b\u003e\u003c/td\u003e\n\u003ctd\u003e✨ 12+ examples\u003c/td\u003e\n\u003ctd\u003e⚠️ Limited\u003c/td\u003e\n\u003ctd\u003e⚠️ Several\u003c/td\u003e\n\u003ctd\u003e❌ N/A\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd\u003e\u003cb\u003eMCP Compatible\u003c/b\u003e\u003c/td\u003e\n\u003ctd\u003e✨ Native\u003c/td\u003e\n\u003ctd\u003e❌ No\u003c/td\u003e\n\u003ctd\u003e❌ No\u003c/td\u003e\n\u003ctd\u003e❌ Manual\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/table\u003e\n\nAgentOptim provides the simplest and most powerful approach for evaluating LLM conversations, with a focus on ease of use, flexibility, and performance. 
It's designed specifically for conversation evaluation, unlike general-purpose tools with limited features.\n\n## 📖 Additional Resources\n\nFor more information about using AgentOptim v2.1.0, please refer to:\n\n- [Quickstart](docs/QUICKSTART.md) - Get up and running in under 5 minutes\n- [Tutorial](docs/TUTORIAL.md) - A step-by-step guide to evaluating conversations\n- [API Reference](docs/API_REFERENCE.md) - Complete API documentation\n- [Architecture](docs/ARCHITECTURE.md) - Detailed system architecture\n- [Developer Guide](docs/DEVELOPER_GUIDE.md) - Technical details for developers\n- [Workflow Guide](docs/WORKFLOW.md) - Practical examples and workflows\n- [Examples Directory](examples/) - Comprehensive example scripts\n- [Contributing Guidelines](CONTRIBUTING.md) - How to contribute to AgentOptim\n\n## ⚡ Ready for Production\n\n```mermaid\n%%{init: {'theme': 'neutral', 'themeVariables': { 'primaryColor': '#3498db', 'primaryTextColor': '#fff', 'primaryBorderColor': '#2980b9', 'lineColor': '#2980b9', 'secondaryColor': '#f1c40f', 'tertiaryColor': '#2ecc71'}}}%%\ngraph TD\n    subgraph Key_Metrics [\"🚀 AgentOptim v2.1.0 Key Metrics\"]\n        direction LR\n        API[\"API Simplicity\u003cbr/\u003e95%\"] \n        Style[\"Coding Style\u003cbr/\u003eConsistency\u003cbr/\u003e98%\"]\n        Setup[\"Setup Time\u003cbr/\u003e5 minutes\u003cbr/\u003e90%\"]\n        Test[\"Test Coverage\u003cbr/\u003e91%\"]\n        Speed[\"Performance\u003cbr/\u003e40% Faster\u003cbr/\u003e93%\"]\n        Flex[\"Integration\u003cbr/\u003eFlexibility\u003cbr/\u003e85%\"]\n    end\n    \n    subgraph Production_Ready [\"✅ Production Ready Features\"]\n        Security[\"🔒 Secure\u003cbr/\u003e- No data retention\u003cbr/\u003e- Local model support\"]\n        Support[\"📚 Well Documented\u003cbr/\u003e- 12+ example scripts\u003cbr/\u003e- API reference\u003cbr/\u003e- Tutorials\"]\n        Maintain[\"⚙️ Maintainable\u003cbr/\u003e- Clean architecture\u003cbr/\u003e- 2-tool 
design\u003cbr/\u003e- Modern codebase\"]\n        Scale[\"⚡ Scalable\u003cbr/\u003e- Parallel evaluation\u003cbr/\u003e- Efficient caching\u003cbr/\u003e- Performance optimized\"]\n    end\n    \n    Key_Metrics --\u003e Production_Ready\n    \n    classDef metric fill:#3498db,stroke:#2980b9,color:white,stroke-width:2px,rx:10,ry:10;\n    classDef production fill:#2ecc71,stroke:#27ae60,color:white,stroke-width:2px,rx:10,ry:10;\n    classDef container fill:#f5f5f5,stroke:#bdc3c7,color:#333,stroke-width:2px,rx:10,ry:10;\n    \n    class API,Style,Setup,Test,Speed,Flex metric;\n    class Security,Support,Maintain,Scale production;\n    class Key_Metrics,Production_Ready container;\n\n    %% GitHub Mermaid doesn't fully support gradients and advanced styling\n    %% Using the class-based styling instead for better compatibility\n```\n\nAgentOptim v2.1.0 is ready for production use with:\n\n- **Streamlined API**: Just 2 tools for a simple integration experience\n- **Comprehensive documentation**: Quick start to advanced techniques\n- **Robust reliability**: 91% test coverage ensures dependable operation\n- **Proven performance**: 40% faster than previous versions\n- **Flexible integration**: Works with all major LLM providers\n\n## 📜 License\n\nMIT License\n\n---\n\n\u003cdiv align=\"center\"\u003e\n    \u003cp\u003eMade with ❤️ for AI engineers and developers\u003c/p\u003e\n    \u003cp\u003e© 2025 AgentOptim Team\u003c/p\u003e\n\u003c/div\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fericflo%2Fagentoptim","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fericflo%2Fagentoptim","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fericflo%2Fagentoptim/lists"}