{"id":42649251,"url":"https://github.com/openadaptai/openadapt-evals","last_synced_at":"2026-04-01T17:07:55.758Z","repository":{"id":333048683,"uuid":"1135998197","full_name":"OpenAdaptAI/openadapt-evals","owner":"OpenAdaptAI","description":"Evaluation infrastructure for GUI agent benchmarks","archived":false,"fork":false,"pushed_at":"2026-03-03T03:45:00.000Z","size":51666,"stargazers_count":1,"open_issues_count":5,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-03T03:51:39.527Z","etag":null,"topics":["benchmarks","evaluation","gui-automation","openadapt","python"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/openadapt-evals/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/OpenAdaptAI.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-01-16T22:35:21.000Z","updated_at":"2026-03-03T03:40:58.000Z","dependencies_parsed_at":null,"dependency_job_id":"8e74c62c-052d-45c5-93eb-c6a03ff615ce","html_url":"https://github.com/OpenAdaptAI/openadapt-evals","commit_stats":null,"previous_names":["openadaptai/openadapt-evals"],"tags_count":60,"template":false,"template_full_name":null,"purl":"pkg:github/OpenAdaptAI/openadapt-evals","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenAdaptAI%2Fopenadapt-evals","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenAdaptAI%2Fopenadapt-evals/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenAdaptAI%2Fopenadapt-evals/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenAdaptAI%2Fopenadapt-evals/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/OpenAdaptAI","download_url":"https://codeload.github.com/OpenAdaptAI/openadapt-evals/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenAdaptAI%2Fopenadapt-evals/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30204109,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-06T19:07:06.838Z","status":"ssl_error","status_checked_at":"2026-03-06T18:57:34.882Z","response_time":250,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmarks","evaluation","gui-automation","openadapt","python"],"created_at":"2026-01-29T07:19:14.156Z","updated_at":"2026-04-01T17:07:55.738Z","avatar_url":"https://github.com/OpenAdaptAI.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# OpenAdapt Evals\n\n[![Tests](https://github.com/OpenAdaptAI/openadapt-evals/actions/workflows/test.yml/badge.svg)](https://github.com/OpenAdaptAI/openadapt-evals/actions/workflows/test.yml)\n[![Build](https://github.com/OpenAdaptAI/openadapt-evals/actions/workflows/release.yml/badge.svg)](https://github.com/OpenAdaptAI/openadapt-evals/actions/workflows/release.yml)\n[![PyPI](https://img.shields.io/pypi/v/openadapt-evals.svg)](https://pypi.org/project/openadapt-evals/)\n[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n\nEvaluation infrastructure for GUI agent benchmarks, built for [OpenAdapt](https://github.com/OpenAdaptAI/OpenAdapt).\n\n## What is OpenAdapt Evals?\n\nOpenAdapt Evals is a unified framework for evaluating GUI automation agents against standardized benchmarks such as [Windows Agent Arena (WAA)](https://microsoft.github.io/WindowsAgentArena/). It provides benchmark adapters, agent interfaces, cloud VM infrastructure (Azure and AWS) for parallel evaluation, and result visualization -- everything needed to go from \"I have a GUI agent\" to \"here are its benchmark scores.\"\n\n## Benchmark Viewer\n\n![Benchmark Viewer Animation](https://raw.githubusercontent.com/OpenAdaptAI/openadapt-evals/main/animations/benchmark-viewer.webp)\n\n\u003cdetails\u003e\n\u003csummary\u003eMore screenshots\u003c/summary\u003e\n\n**Task Detail View** -- step-by-step replay with screenshots, actions, and execution logs:\n\n![Task Detail View](https://raw.githubusercontent.com/OpenAdaptAI/openadapt-evals/main/docs/screenshots/desktop_task_detail.png)\n\n**Cost Tracking Dashboard** -- real-time VM cost monitoring with tiered sizing and spot instances:\n\n![Cost Dashboard](https://raw.githubusercontent.com/OpenAdaptAI/openadapt-evals/main/screenshots/cost_dashboard_preview.png)\n\n\u003c/details\u003e\n\n## Key Features\n\n- **Benchmark adapters** for WAA (live, mock, and local modes), with an extensible base for OSWorld, WebArena, and others\n- **Task setup handlers** -- `verify_apps` and `install_apps` ensure required applications are present on the Windows VM before evaluation begins\n- **Agent interfaces** including `ApiAgent` (Claude / GPT), `ClaudeComputerUseAgent` (with coordinate clamping and fail-safe recovery), `RetrievalAugmentedAgent`, `RandomAgent`, and `PolicyAgent`\n- **Multi-cloud VM infrastructure** with `AzureVMManager`, `AWSVMManager`, `PoolManager`, `SSHTunnelManager`, and `VMMonitor` for running evaluations at scale on Azure or AWS\n- **End-to-end eval pipeline** (`scripts/run_eval_pipeline.py`) -- orchestrates demo generation, VM lifecycle, SSH tunnels, and ZS/DC evaluation in a single command\n- **Deterministic desktop parity mode** -- `--clean-desktop` suppresses OneDrive/toast/popover noise, `--force-tray-icons` keeps network/audio tray controls visible, and run metadata records requested/observed environment flags\n- **Standalone GRPO trainer** -- self-contained RL training loop with zero external ML dependencies, callback hooks (`on_model_loaded`, `on_before_collect`, `on_rollout_complete`, `on_step_complete`), configurable `vision_loss_mode` (exclude/include/checkpoint), and optional Outlines constrained decoding that forces `Thought: ...\\nAction: CLICK/TYPE/WAIT/DONE` format\n- **Demo executor** -- tiered demo execution that replays demonstrations with adaptive grounding. Keyboard/type actions execute deterministically (no VLM needed), click actions use a grounder to find elements by description. Validated: **0.00 → 1.00** on notepad-hello\n- **Correction flywheel** -- agent fails → human demos correct approach → agent retries with demo → score improves. Full pipeline: `DemoLibrary` stores demos, `DemoExecutor` replays them with adaptation, per-step milestone tracking captures transient states\n- **RL training environment** -- `RLEnvironment` wrapper provides a Gymnasium-style `reset`/`step`/`evaluate` interface for online RL (GRPO, PPO) with per-step milestone high-water marks and dense rewards\n- **Annotation pipeline** -- VLM-based screenshot annotation (`annotation.py`, `vlm.py`) migrated from openadapt-ml so the full record-annotate-evaluate workflow runs within this repo\n- **4-layer WAA probe** -- `probe --detailed` checks screenshot capture, accessibility tree, action pipeline, and scoring independently; supports `--json` and `--layers` filtering\n- **Demo recording and review** -- VNC-based demo capture with auto-persistence (incremental `meta.json`, hardlinked PNGs), JPEG thumbnail deduplication, and markdown review artifact generation\n- **CLI tools** -- `oa-vm` for VM and pool management (50+ commands), benchmark CLI for running evals\n- **Cost optimization** -- tiered VM sizing, spot instance support, and real-time cost tracking\n- **Results visualization** -- HTML viewer with step-by-step screenshot replay, execution logs, and domain breakdowns\n- **Trace export** for converting evaluation trajectories into training data\n- **Configuration via pydantic-settings** with automatic `.env` loading\n\n## Installation\n\n```bash\npip install openadapt-evals\n```\n\nWith optional dependencies:\n\n```bash\npip install openadapt-evals[training]   # GRPO trainer + Outlines constrained decoding\npip install openadapt-evals[azure]      # Azure VM management\npip install openadapt-evals[aws]        # AWS EC2 management\npip install openadapt-evals[retrieval]  # Demo retrieval agent\npip install openadapt-evals[viewer]     # Live results viewer\npip install openadapt-evals[all]        # Everything\n```\n\n## Quick Start\n\n### Run a mock evaluation (no VM required)\n\n```bash\nopenadapt-evals mock --tasks 10\n```\n\n### Run a live evaluation against a WAA server\n\n```bash\n# Start with a single VM (Azure by default)\noa-vm pool-create --workers 1\noa-vm pool-wait\n\n# Or use AWS\noa-vm pool-create --cloud aws --workers 1\noa-vm pool-wait --cloud aws\n\n# Run evaluation\nopenadapt-evals run --agent api-claude --task notepad_1\n\n# View results\nopenadapt-evals view --run-name live_eval\n\n# Clean up (stop billing)\noa-vm pool-cleanup -y\n```\n\n### Python API\n\n```python\nfrom openadapt_evals import (\n    ApiAgent,\n    WAALiveAdapter,\n    WAALiveConfig,\n    evaluate_agent_on_benchmark,\n    compute_metrics,\n)\n\nadapter = WAALiveAdapter(WAALiveConfig(server_url=\"http://localhost:5001\"))\nagent = ApiAgent(provider=\"anthropic\")\n\nresults = evaluate_agent_on_benchmark(agent, adapter, task_ids=[\"notepad_1\"])\nmetrics = compute_metrics(results)\nprint(f\"Success rate: {metrics['success_rate']:.1%}\")\n```\n\n### Demo-conditioned evaluation\n\nRecord demos on a remote VM via VNC, annotate with a VLM, then run demo-conditioned eval:\n\n```bash\n# 1. Pre-flight check: verify all required apps are installed\npython scripts/record_waa_demos.py record-waa \\\n  --tasks 04d9aeaf,0a0faba3 \\\n  --server http://localhost:5001 \\\n  --verify\n\n# 2. Record demos interactively (perform actions on VNC, press Enter after each step)\npython scripts/record_waa_demos.py record-waa \\\n  --tasks 04d9aeaf,0a0faba3 \\\n  --server http://localhost:5001 \\\n  --output waa_recordings/\n\n# 3. Annotate recordings with VLM\npython scripts/record_waa_demos.py annotate \\\n  --recordings waa_recordings/ \\\n  --output annotated_demos/ \\\n  --provider openai\n\n# 4. Run demo-conditioned eval\npython scripts/record_waa_demos.py eval \\\n  --demo_dir annotated_demos/ \\\n  --tasks 04d9aeaf,0a0faba3\n```\n\n### End-to-end eval pipeline\n\nFor a fully automated flow (demo generation, VM lifecycle, SSH tunnels, ZS and DC evaluation):\n\n```bash\n# Run for all recordings that have demos\npython scripts/run_eval_pipeline.py\n\n# Specific task(s)\npython scripts/run_eval_pipeline.py --tasks 04d9aeaf\n\n# Dry run\npython scripts/run_eval_pipeline.py --tasks 04d9aeaf --dry-run\n\n# AWS instead of Azure\npython scripts/run_eval_pipeline.py --cloud aws --vm-name waa-pool-00\n\n# Deterministic desktop parity + pinned image version metadata\npython scripts/run_eval_pipeline.py \\\n  --tasks 04d9aeaf \\\n  --clean-desktop \\\n  --force-tray-icons \\\n  --waa-image-version win11-24h2-2026-03-04\n```\n\n### Dedicated grounder endpoint (UI-Venus)\n\nFor higher click accuracy, serve [UI-Venus-1.5-8B](https://huggingface.co/inclusionAI/UI-Venus-1.5-8B) on a GPU and point the DemoExecutor or PlannerGrounderAgent at it. This replaces general VLM grounding (GPT-4.1-mini) with a purpose-built GUI grounding model.\n\n```bash\n# 1. On a GPU machine (A10G 24GB, RTX 4090, etc.):\nbash scripts/serve_ui_venus.sh\n# Serves at http://0.0.0.0:8000 by default\n\n# 2. Verify it's running:\ncurl http://gpu-host:8000/v1/models\n# Should list \"UI-Venus-1.5-8B\"\n\n# 3. Run the correction flywheel with the dedicated grounder:\npython scripts/run_correction_flywheel.py \\\n    --task-config example_tasks/clear-browsing-data-chrome.yaml \\\n    --demo-dir ./demos \\\n    --grounder-endpoint http://gpu-host:8000\n\n# 4. Or run the full evaluation with the grounder:\npython scripts/run_full_eval.py \\\n    --server-url http://localhost:5001 \\\n    --grounder-endpoint http://gpu-host:8000\n```\n\nThe endpoint uses the UI-Venus native bounding-box prompt format (`[x1,y1,x2,y2]`) and is compatible with vLLM, Ollama, or any OpenAI-compatible server. Both `DemoExecutor` and `PlannerGrounderAgent` use the same prompt format for consistency.\n\n### GRPO training with TRL (recommended)\n\nThe recommended path for RL training of VLM desktop agents uses TRL's `GRPOTrainer` with dense milestone rewards from WAA environments. This replaces the standalone GRPO trainer with a battle-tested implementation that supports Unsloth, vLLM, constrained decoding, and automatic telemetry.\n\n```bash\n# Basic training against a live WAA VM\npython scripts/train_trl_grpo.py \\\n    --task-dir ./example_tasks \\\n    --server-url http://localhost:5001 \\\n    --model Qwen/Qwen2.5-VL-7B-Instruct \\\n    --output ./grpo_output\n\n# With Unsloth (2x VRAM efficiency) + constrained decoding\npython scripts/train_trl_grpo.py \\\n    --task-dir ./example_tasks \\\n    --server-url http://localhost:5001 \\\n    --model Qwen/Qwen2.5-VL-7B-Instruct \\\n    --use-unsloth \\\n    --constrained-decoding \\\n    --output ./grpo_output\n\n# Mock mode (validates full pipeline without VM or GPU)\npython scripts/train_trl_grpo.py \\\n    --task-dir ./example_tasks \\\n    --mock \\\n    --output ./grpo_output_mock\n\n# With Weave tracing for experiment tracking\npython scripts/train_trl_grpo.py \\\n    --task-dir ./example_tasks \\\n    --server-url http://localhost:5001 \\\n    --model Qwen/Qwen2.5-VL-7B-Instruct \\\n    --weave-project openadapt-grpo \\\n    --output ./grpo_output\n```\n\nKey flags: `--constrained-decoding` (Outlines regex, eliminates unparseable output), `--vision-loss-mode` (exclude/include/checkpoint), `--weave-project` (Weave tracing), `--use-vllm` (faster generation), `--loss-type` (grpo/dapo/dr_grpo).\n\n### Parallel evaluation\n\n```bash\n# Create a pool of VMs and distribute tasks (Azure)\noa-vm pool-create --workers 5\noa-vm pool-wait\noa-vm pool-run --tasks 50\n\n# Same workflow on AWS\noa-vm pool-create --cloud aws --workers 5\noa-vm pool-wait --cloud aws\noa-vm pool-run --cloud aws --tasks 50\n\n# Or use Azure ML orchestration\nopenadapt-evals azure --workers 10 --waa-path /path/to/WindowsAgentArena\n```\n\n## Architecture\n\n```\nopenadapt_evals/\n├── agents/               # Agent implementations\n│   ├── base.py           #   BenchmarkAgent ABC\n│   ├── api_agent.py      #   ApiAgent (Claude, GPT)\n│   ├── claude_computer_use_agent.py  # ClaudeComputerUseAgent (coord clamping, fail-safe)\n│   ├── retrieval_agent.py#   RetrievalAugmentedAgent\n│   └── policy_agent.py   #   PolicyAgent (trained models)\n├── adapters/             # Benchmark adapters\n│   ├── base.py           #   BenchmarkAdapter ABC + data classes\n│   ├── rl_env.py         #   RLEnvironment (Gymnasium-style wrapper for GRPO/PPO)\n│   └── waa/              #   WAA live, mock, and local adapters\n├── infrastructure/       # Cloud VM and pool management\n│   ├── azure_vm.py       #   AzureVMManager\n│   ├── aws_vm.py         #   AWSVMManager\n│   ├── vm_provider.py    #   VMProvider protocol (multi-cloud abstraction)\n│   ├── pool.py           #   PoolManager\n│   ├── probe.py          #   4-layer WAA probe (screenshot, a11y, action, score)\n│   ├── ssh_tunnel.py     #   SSHTunnelManager\n│   └── vm_monitor.py     #   VMMonitor dashboard\n├── evaluation/           # Shared evaluation utilities\n│   └── metrics.py        #   fuzzy_match and scoring functions\n├── benchmarks/           # Evaluation runner, CLI, viewers\n│   ├── runner.py         #   evaluate_agent_on_benchmark()\n│   ├── cli.py            #   Benchmark CLI (run, mock, live, view, probe)\n│   ├── vm_cli.py         #   VM/Pool CLI (oa-vm, 50+ commands)\n│   ├── viewer.py         #   HTML results viewer\n│   ├── pool_viewer.py    #   Pool results viewer\n│   └── trace_export.py   #   Training data export\n├── waa_deploy/           # WAA Docker image \u0026 task setup\n│   ├── evaluate_server.py#   Flask server (port 5050): /setup, /evaluate, /task\n│   ├── Dockerfile        #   QEMU + Windows 11 + pre-downloaded apps\n│   └── tools_config.json #   App installer URLs and configs\n├── annotation.py         # VLM-based demo annotation pipeline\n├── vlm.py                # VLM provider abstraction (OpenAI, Anthropic)\n├── server/               # WAA server extensions\n├── config.py             # Settings (pydantic-settings, .env)\n└── __init__.py\nscripts/\n├── run_eval_pipeline.py      # End-to-end eval: demo gen + VM + ZS/DC eval\n├── record_waa_demos.py       # Record demos via VNC\n├── generate_demo_review.py   # Markdown review artifacts with thumbnails\n├── run_grpo_rollout.py       # Example: collect RL rollouts from WAA\n├── refine_demo.py            # Two-pass LLM demo refinement\n└── run_dc_eval.py            # Demo-conditioned evaluation\n```\n\n### How it fits together\n\n```\nLOCAL MACHINE                          CLOUD VM (Azure or AWS, Ubuntu)\n┌─────────────────────┐                ┌──────────────────────────────┐\n│  oa-vm CLI          │   SSH Tunnel   │  Docker                      │\n│  (pool management)  │ ─────────────\u003e │  ├─ evaluate_server (:5050)  │\n│                     │  :5001 → :5000 │  │  └─ /setup, /evaluate     │\n│  openadapt-evals    │  :5051 → :5050 │  ├─ Samba share (/tmp/smb/)  │\n│  (benchmark runner) │  :8006 → :8006 │  └─ QEMU (Win 11)           │\n│                     │                │     ├─ WAA Flask API (:5000) │\n│                     │                │     ├─ \\\\host.lan\\Data\\      │\n│                     │                │     └─ Agent                 │\n└─────────────────────┘                └──────────────────────────────┘\n```\n\nBoth backends use the same `VMProvider` protocol. Pass `--cloud azure` (default) or `--cloud aws` to any pool command. AWS now supports nested virtualization on C8i/M8i/R8i instances (from ~$0.19/hr, Feb 2026), though GPU families (g5, g6) still require metal instances. Azure uses `Standard_D8ds_v5` ($0.38/hr).\n\n![Windows 11 on AWS EC2](https://raw.githubusercontent.com/OpenAdaptAI/openadapt-evals/main/docs/aws-waa-windows-desktop.png)\n\n### UNIX Socket Bridge (Docker Port 5050 Workaround)\n\nThe WAA Docker container runs QEMU with `--cap-add NET_ADMIN` for TAP networking, which breaks Docker's standard port forwarding for port 5050 (`evaluate_server.py`). The workaround is a two-stage socat proxy using a UNIX socket:\n\n```bash\n# Stage 1: Bridge container network namespace to a UNIX socket\nCONTAINER_PID=$(docker inspect --format '{{.State.Pid}}' \u003ccontainer_name\u003e)\nnsenter -t $CONTAINER_PID -n socat UNIX-LISTEN:/tmp/waa-bridge.sock,fork TCP:localhost:5050\n\n# Stage 2: Expose the UNIX socket as a TCP port on the VM host\nsocat TCP-LISTEN:5051,fork,reuseaddr UNIX-CONNECT:/tmp/waa-bridge.sock\n```\n\nThis makes `VM_HOST:5051` forward to container port 5050. Port 5000 (WAA Flask API) uses standard Docker port forwarding and works normally.\n\n**After a container restart**, remove the stale socket (`rm -f /tmp/waa-bridge.sock`) and re-run both stages with the new container PID.\n\nFor the full networking architecture, SSH tunnel setup, and data flow diagrams, see [docs/gpu_e2e_validation/architecture.md](docs/gpu_e2e_validation/architecture.md).\n\n## WAA Task Setup \u0026 App Management\n\nThe evaluate server (`waa_deploy/evaluate_server.py`) runs on the Docker Linux side (port 5050) and orchestrates task setup on the Windows VM. The `/setup` endpoint accepts a list of setup handlers:\n\n```bash\n# Check if required apps are installed on the Windows VM\ncurl -X POST http://localhost:5051/setup \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"config\": [{\"type\": \"verify_apps\", \"parameters\": {\"apps\": [\"libreoffice-calc\"]}}]}'\n# → 200 if all present, 422 if any missing\n\n# Install missing apps via two-phase pipeline\ncurl -X POST http://localhost:5051/setup \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"config\": [{\"type\": \"install_apps\", \"parameters\": {\"apps\": [\"libreoffice-calc\"]}}]}'\n```\n\n### Two-phase install pipeline\n\nLarge installers (e.g. LibreOffice 350MB MSI) can't be downloaded within the WAA server's 120s command timeout. The `install_apps` handler solves this with a two-phase approach:\n\n1. **Download on Linux** -- the evaluate server downloads the installer to the Samba share (`/tmp/smb/` = `\\\\host.lan\\Data\\` on Windows), with no timeout constraint\n2. **Install on Windows** -- a small PowerShell script is written to the Samba share and executed via the WAA server, running only `msiexec` (fast, no download)\n\nThe Dockerfile also pre-downloads LibreOffice at build time with dynamic version discovery, so first-boot installs work without depending on mirror availability.\n\n### Automatic app verification\n\nWhen a task config includes `related_apps`, the live adapter automatically prepends a `verify_apps` step before the task's setup config. The `--verify` flag on `record_waa_demos.py` provides a pre-flight check across all tasks before starting a recording session.\n\n![LibreOffice Calc running inside Windows 11 QEMU VM via noVNC in Chrome](screenshots/waa_libreoffice_desktop.png)\n\n## CLI Reference\n\n### Benchmark CLI (`openadapt-evals`)\n\n| Command    | Description                                   |\n|------------|-----------------------------------------------|\n| `run`        | Run live evaluation (localhost:5001 default)   |\n| `mock`       | Run with mock adapter (no VM required)         |\n| `live`       | Run against a WAA server (full control)        |\n| `eval-suite` | Automated full-cycle evaluation (ZS + DC)      |\n| `azure`      | Run parallel evaluation on Azure ML            |\n| `probe`      | Check WAA readiness (`--detailed` for 4-layer diagnostics, `--json`, `--layers`) |\n| `view`       | Generate HTML viewer for results               |\n| `estimate`   | Estimate Azure costs                           |\n\n### VM/Pool CLI (`oa-vm`)\n\n| Command         | Description                              |\n|-----------------|------------------------------------------|\n| `pool-create`   | Create N VMs with Docker and WAA         |\n| `pool-wait`     | Wait until WAA is ready on all workers   |\n| `pool-run`      | Distribute tasks across pool workers     |\n| `pool-status`   | Show status of all pool VMs              |\n| `pool-pause`    | Deallocate pool VMs (stop billing)       |\n| `pool-resume`   | Restart deallocated pool VMs             |\n| `pool-cleanup`  | Delete all pool VMs and resources        |\n| `image-create`  | Create golden image from a pool VM       |\n| `image-list`    | List available golden images             |\n| `vm monitor`    | Dashboard with SSH tunnels               |\n| `vm setup-waa`  | Deploy WAA container on a VM             |\n| `smoke-test-aws`| Verify AWS credentials, AMI, VPC, lifecycle |\n\nAll pool commands accept `--cloud azure` (default) or `--cloud aws`.\n\nRun `oa-vm --help` for the full list of 50+ commands.\n\n## Configuration\n\nSettings are loaded automatically from environment variables or a `.env` file in the project root via [pydantic-settings](https://docs.pydantic.dev/latest/concepts/pydantic_settings/).\n\n```bash\n# .env\nANTHROPIC_API_KEY=sk-ant-...\nOPENAI_API_KEY=sk-...\n\n# Azure (for --cloud azure VM management)\nAZURE_SUBSCRIPTION_ID=...\nAZURE_ML_RESOURCE_GROUP=...\nAZURE_ML_WORKSPACE_NAME=...\n```\n\n### AWS authentication\n\nAWS credentials are resolved via [boto3's default credential chain](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html). **SSO (IAM Identity Center) is recommended** for interactive use:\n\n```bash\n# One-time setup — opens a guided wizard\naws configure sso\n# Prompts for: SSO start URL, region, account, role name, profile name\n\n# Login (opens browser, caches short-lived token)\naws sso login\n\n# Verify it works\noa-vm smoke-test-aws\n\n# All oa-vm --cloud aws commands now work automatically\noa-vm pool-create --cloud aws --workers 1\n```\n\n\u003cdetails\u003e\n\u003csummary\u003eExample \u003ccode\u003e~/.aws/config\u003c/code\u003e for SSO\u003c/summary\u003e\n\n```ini\n[default]\nsso_session = my-org\nsso_account_id = 111122223333\nsso_role_name = PowerUserAccess\nregion = us-east-1\n\n[sso-session my-org]\nsso_start_url = https://my-org.awsapps.com/start\nsso_region = us-east-1\nsso_registration_scopes = sso:account:access\n```\n\n\u003c/details\u003e\n\nStatic keys (`AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` in `.env`) also work but are not recommended for interactive use -- they don't expire and are a security risk if leaked.\n\nSee [`openadapt_evals/config.py`](openadapt_evals/config.py) for all available settings.\n\n## Custom Agents\n\nImplement the `BenchmarkAgent` interface to evaluate your own agent:\n\n```python\nfrom openadapt_evals import BenchmarkAgent, BenchmarkAction, BenchmarkObservation, BenchmarkTask\n\nclass MyAgent(BenchmarkAgent):\n    def act(\n        self,\n        observation: BenchmarkObservation,\n        task: BenchmarkTask,\n        history: list[tuple[BenchmarkObservation, BenchmarkAction]] | None = None,\n    ) -\u003e BenchmarkAction:\n        # Your agent logic here\n        return BenchmarkAction(type=\"click\", x=0.5, y=0.5)\n\n    def reset(self) -\u003e None:\n        pass\n```\n\n## Contributing\n\nWe welcome contributions. To get started:\n\n```bash\ngit clone https://github.com/OpenAdaptAI/openadapt-evals.git\ncd openadapt-evals\nuv sync --extra dev\nuv run pytest tests/ -v\n```\n\nSee [CLAUDE.md](https://github.com/OpenAdaptAI/openadapt-evals/blob/main/CLAUDE.md) for development conventions and architecture details.\n\n## Related Projects\n\n| Project | Description |\n|---------|-------------|\n| [OpenAdapt](https://github.com/OpenAdaptAI/OpenAdapt) | Desktop automation with demo-conditioned AI agents |\n| [openadapt-ml](https://github.com/OpenAdaptAI/openadapt-ml) | Training and policy runtime |\n| [openadapt-capture](https://github.com/OpenAdaptAI/openadapt-capture) | Screen recording and demo sharing |\n| [openadapt-consilium](https://github.com/OpenAdaptAI/openadapt-consilium) | Multi-model consensus library |\n| [openadapt-grounding](https://github.com/OpenAdaptAI/openadapt-grounding) | UI element localization |\n\n## License\n\n[MIT](https://opensource.org/licenses/MIT)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopenadaptai%2Fopenadapt-evals","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fopenadaptai%2Fopenadapt-evals","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopenadaptai%2Fopenadapt-evals/lists"}