{"id":50717582,"url":"https://github.com/openbmb/acebench","last_synced_at":"2026-06-14T00:01:44.191Z","repository":{"id":363060460,"uuid":"1261820100","full_name":"OpenBMB/AceBench","owner":"OpenBMB","description":null,"archived":false,"fork":false,"pushed_at":"2026-06-07T10:29:34.000Z","size":3246,"stargazers_count":5,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-09T20:22:38.613Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/OpenBMB.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-06-07T07:34:14.000Z","updated_at":"2026-06-08T03:43:23.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/OpenBMB/AceBench","commit_stats":null,"previous_names":["openbmb/acebench"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/OpenBMB/AceBench","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenBMB%2FAceBench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenBMB%2FAceBench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenBMB%2FAceBench/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenBMB%2FAceBench/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/OpenBMB","download_url":"h
ttps://codeload.github.com/OpenBMB/AceBench/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenBMB%2FAceBench/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34170161,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-10T02:00:07.152Z","response_time":89,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-09T20:01:32.600Z","updated_at":"2026-06-10T21:01:01.854Z","avatar_url":"https://github.com/OpenBMB.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n\n# AceBench\n\n\u003cimg src=\"assets/fig.png\" width=\"560\" alt=\"AceBench\"/\u003e\n\n[![tasks](https://img.shields.io/badge/tasks-128-1f72c1)](tasks/ACE_Bench)\n[![privacy](https://img.shields.io/badge/privacy_annotated-100-8a3ffc)](#tasks)\n[![strategies](https://img.shields.io/badge/strategies-6-2ea44f)](#quick-start)\n[![harness](https://img.shields.io/badge/harness-OpenClaw-f08c00)](https://openclaw.ai/)\n[![paper](https://img.shields.io/badge/paper-arXiv-b31b1b)](#citation)\n[![license](https://img.shields.io/badge/license-MIT-e8b500)](LICENSE)\n\n\u003e **Empowering Edge Agents:** A Multi-Dimensional Benchmark for Agentic Edge-Cloud Collaboration.\n\n\u003e 128 executable tasks · 100 privacy-annotated · 6 
strategies · Utility · Cost · Privacy.\n\n\u003c/div\u003e\n\n---\n\n**AceBench** is a benchmark for **edge-cloud collaboration** in LLM agents. Cloud models reason best but see all your data; on-device edge models keep data local but are weaker — collaboration promises the best of both, *if* you organize it well. AceBench measures exactly that, but in a setting prior edge-cloud studies skip: **real agent execution**, where agents work over live workspaces (files, tools, commands, APIs, app states) and every cloud call *mid-trajectory* can expose accumulated local context.\n\nWe evaluate six execution strategies — pure edge, pure cloud, and four edge-cloud collaboration patterns — across **128 executable tasks** (100 with fine-grained **privacy annotations**) on an [OpenClaw](https://openclaw.ai/) harness, scoring every run on three axes at once: **task utility**, **resource cost**, and **privacy exposure**. The result exposes how *when* the cloud is invoked and *what* context is sent trade capability against cost and leakage.\n\n### Design Highlights\n\n| | What we test | Why it matters |\n| --- | --- | --- |\n| 🦞 **OpenClaw-native** | The real OpenClaw agent loop — bash, browser, file ops, APIs, and reusable `SKILL.md` skills — driving a live local workspace | Tasks need long-horizon planning, state tracking, and error recovery; cloud calls land *mid-trajectory* over accumulated workspace context, not on a static prompt |\n| 🔐 **Privacy-aware** | 100 tasks annotated with sensitivity units (PII + org secrets) | Every cloud invocation is a potential leakage channel — we audit what crosses the boundary |\n| ⚖️ **Multi-dimensional** | Utility · Cost · Privacy, reported jointly | No single number hides the trade-off; you see the whole Pareto picture |\n| 🔀 **Strategy-centric** | 6 edge / cloud / edge-cloud strategies, one task suite | Isolates *how* collaboration is organized from *which* models are used |\n| 📦 **Reproducible** | Each task runs in its own Docker 
container | Graders are injected only after the agent finishes — never visible during execution |\n\n---\n\n## Tasks\n\n**128 executable tasks** across **6 categories** (Chinese \u0026 English); **100** carry fine-grained privacy annotations. Each is a self-contained Markdown file under [`tasks/ACE_Bench/`](tasks/ACE_Bench/) — a prompt, an inline `grade()` verifier, and a workspace path.\n\n| Category | # | Example tasks | Core challenges |\n| --- | --- | --- | --- |\n| **Office \u0026 Daily Tasks** | 36 | ambiguous contact email, meeting notes, expense report, daily summary | Multi-source aggregation, clarification, structured output |\n| **Information Search \u0026 Gathering** | 34 | email search, competitive intelligence, paper affiliation lookup, CRM bug hunt | Web + local data reconciliation, source verification |\n| **Safety \u0026 Security** | 21 | leaked API-key detection, prompt injection, malicious skill, HIPAA/PHI referral | Adversarial robustness, credential awareness, refusal |\n| **Data Analysis** | 14 | order-profit analysis, month-end reconciliation, quarterly business insight | Spreadsheet reasoning, state verification |\n| **Development \u0026 Operations** | 13 | system health check, automation-failure recovery, LLM API gateway skill | Undocumented setups, debugging, skill creation |\n| **Automation** | 10 | flight booking, n8n workflow report, scheduled-briefing skill | Long-horizon orchestration, recovery |\n\n**Scoring.** Every run is graded on **three dimensions at once**:\n\n- **Utility** — completion score + `Pass³` (3-trial consistency), from each task's own verifier.\n- **Cost** — cloud tokens \u0026 USD, plus edge-side FLOPs.\n- **Privacy** — how much annotated sensitive context (PII / org secrets) reaches the cloud.\n\n---\n\n## Leaderboard\n\nEdge = **Qwen3.5-9B / 27B**, Cloud = **GPT-5.4**, judge = **GPT-5.4-mini**, averaged over 3 runs. Cloud Tok. 
= raw / cache / output (millions); Cost in USD; Edge FLOPs in PetaFLOPs; Utility \u0026 Privacy in %.\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"assets/leaderboard.png\" width=\"95%\" alt=\"AceBench main results\"/\u003e\n\u003c/div\u003e\n\nEdge-cloud collaboration beats both single-side extremes on the utility–privacy trade-off; **Sketch-Guided** keeps privacy at 100%, **Task-Routing** is the most balanced, and **Adaptive Assistance** gets the best `Pass³` at \u003c10% of Cloud-only cost.\n\n---\n\n## Quick Start\n\nAceBench runs each task in an isolated Docker container bundling the OpenClaw harness and per-task mock services.\n\n**1. Install dependencies \u0026 load the image**\n\nThe prebuilt container image is hosted on Hugging Face as a `docker save` tarball (Hugging Face is not a Docker registry, so download + `docker load` instead of `docker pull`):\n\n```bash\ncd AceBench\nconda create -n acebench python=3.13 -y\nconda activate acebench\npip install -r requirements.txt\n# download the image\nhf download chengpingan/AceBench \\\n    Images/acebench-openclaw-v1.0.tar.gz --repo-type dataset --local-dir .\n\n# download the workspaces (single tarball) and extract\nhf download chengpingan/AceBench \\\n    workspace/ACE_Bench.tar.gz --repo-type dataset --local-dir .\ntar -xzf workspace/ACE_Bench.tar.gz      # extracts into workspace/ACE_Bench/\n\ndocker load -i Images/acebench-openclaw-v1.0.tar.gz   # loads acebench-openclaw:v1.0 (must match DOCKER_IMAGE in .env)\n```\n\n**2. Configure keys** — copy `.env.example` to `.env` and fill in:\n\n```bash\nOPENROUTER_API_KEY=...                 # cloud collaborator (any OpenAI-compatible provider)\nJUDGE_API_KEY=...                      # LLM-as-a-judge for utility \u0026 privacy\nJUDGE_MODEL=gpt-5.4-mini\n```\n\nEdge / cloud model endpoints live in [`my_api.json`](my_api.json) (e.g. a local vLLM server) and are passed via `--models-config`.\n\n**3. Prepare task assets**\n\n```bash\nbash script/prepare.sh\n```\n\n**4. 
Run** — pick a strategy via `--run-mode`. Six strategies share one suite (full commands in [`script/run.sh`](script/run.sh)):\n\n| Strategy | `--run-mode` | Cloud use | Idea |\n| --- | --- | --- | --- |\n| **Edge-only** | `local-only` | none | All steps on the edge model |\n| **Cloud-only** | `cloud-only` | every step | Capability upper bound; highest exposure |\n| **Sketch-Guided** | `pipeline-plan-executor` | once, upfront | Cloud drafts a high-level sketch; edge executes |\n| **Task-Routing** | `query-router` | once, offline | RouteLLM routes the whole task to edge or cloud |\n| **Step-Routing** | `step-router` | per uncertain step | Edge-first; escalate to cloud on high token entropy |\n| **Adaptive Assistance** | `advisor` | on demand | Edge asks the cloud for a plan/hint when stuck |\n\n```bash\n# Edge-only baseline\npython3 eval/run_batch.py --category ACE_Bench --parallel 16 --repeat 3 \\\n  --edge-model vllm/Qwen/Qwen3.5-27B --models-config my_api.json \\\n  --output-dir output/edge_only/qwen3.5-27b\n\n# Edge-cloud (e.g. adaptive cloud assistance)\npython3 eval/run_batch.py --category ACE_Bench --parallel 8 --repeat 3 \\\n  --run-mode advisor \\\n  --edge-model vllm/Qwen/Qwen3.5-27B --cloud-model your-provider/gpt-5.4 \\\n  --models-config my_api.json \\\n  --output-dir output/adaptive-assistant/qwen3.5-27b_to_gpt5.4\n```\n\nSingle-task runs (`--task tasks/ACE_Bench/ACE_Bench_task_44_ambiguous_contact_email.md`), task filters (`--task-filter`), and privacy-judge timing (`--privacy-judge-mode {inline,deferred,off}`) are also supported.\n\n---\n\n## Check the Results\n\nPer-task outputs land under `output/\u003crun\u003e/\u003ctask_id\u003e/...` (scores, token/cost usage, agent trace, produced files), with a per-category and global summary generated automatically once the run finishes.\n\n---\n\n\n## Acknowledgements\n\nAceBench stands on the shoulders of a remarkable open-source agent community, and we are deeply grateful for it. 
The **OpenClaw** harness gives us a real, full-featured agent runtime — tools, skills, and a live workspace — to build on. Our tasks and evaluation design draw inspiration and adapted material from a series of outstanding agent benchmarks: **Claw-Eval**, **WildClawBench**, **QwenClawBench**, **LiveClawBench**, **PinchBench**, and **ClawBench**. Their meticulous task curation, rigorous grading, and reproducible harness design set the bar for trustworthy agent evaluation, and made the privacy-aware, edge-cloud extension in AceBench possible.\n\n## License\n\nReleased under the [MIT License](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopenbmb%2Facebench","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fopenbmb%2Facebench","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopenbmb%2Facebench/lists"}