{"id":50676559,"url":"https://github.com/EuniAI/TerminalWorld","last_synced_at":"2026-06-25T14:00:48.929Z","repository":{"id":359608982,"uuid":"1231907165","full_name":"EuniAI/TerminalWorld","owner":"EuniAI","description":"Benchmarking Agents on Real-World Terminal Tasks","archived":false,"fork":false,"pushed_at":"2026-05-31T15:26:52.000Z","size":275,"stargazers_count":17,"open_issues_count":0,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-31T17:06:40.580Z","etag":null,"topics":["agent","benchmark","cli","dataset","evaluation","llm","terminal"],"latest_commit_sha":null,"homepage":"https://terminalworld.ai/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/EuniAI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-07T12:00:12.000Z","updated_at":"2026-05-31T15:26:56.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/EuniAI/TerminalWorld","commit_stats":null,"previous_names":["euniai/terminalworld"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/EuniAI/TerminalWorld","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EuniAI%2FTerminalWorld","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EuniAI%2FTerminalWorld/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EuniAI%2FTerminalWorld/releases","manifests_url":"https://re
pos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EuniAI%2FTerminalWorld/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/EuniAI","download_url":"https://codeload.github.com/EuniAI/TerminalWorld/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EuniAI%2FTerminalWorld/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34778079,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-25T02:00:05.521Z","response_time":101,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agent","benchmark","cli","dataset","evaluation","llm","terminal"],"created_at":"2026-06-08T16:00:30.851Z","updated_at":"2026-06-25T14:00:48.920Z","avatar_url":"https://github.com/EuniAI.png","language":"Python","funding_links":[],"categories":["3）参考实现与开源工具（GitHub）"],"sub_categories":["评测框架与 Agent Benchmarks"],"readme":"# TerminalWorld\n\n**TerminalWorld** is a scalable data engine that automatically reverse-engineers high-fidelity evaluation tasks from in-the-wild terminal recordings. 
Processing 80,870 asciinema recordings, it yields a benchmark of **1,530 validated terminal tasks** spanning 19 real-world categories and 1,280 unique commands — authentic and scalable by construction.\n\n\u003e **Paper:** [*TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks*](https://arxiv.org/pdf/2605.22535)  \n\u003e **Code:** [https://github.com/EuniAI/TerminalWorld](https://github.com/EuniAI/TerminalWorld)  \n\u003e **Dataset:** [https://huggingface.co/datasets/EuniAI/TerminalWorld](https://huggingface.co/datasets/EuniAI/TerminalWorld)  \n\u003e **Website:** [https://terminalworld.ai/](https://terminalworld.ai/)\n\n\u003e **Note:**  \n\u003e 🔒 The `main` branch is frozen as the stable release for external review.  \n\u003e 🚀 Active development and latest updates are on the [`dev`](../../tree/dev) branch.\n\n---\n\n## Overview\n\nExisting terminal benchmarks rely on manual expert curation, which introduces an adversarial bias and cannot scale with evolving developer practices. TerminalWorld addresses this by reverse-engineering evaluation tasks from real developer recordings shared on [asciinema.org](https://asciinema.org), inheriting their authenticity by construction.\n\nThe pipeline operates in four stages:\n\n```\nasciinema recordings (80,870)\n        │\n        ▼  1. Data Retrieval \u0026 Filtering\n   9,492 high-quality recordings\n        │\n        ▼  2. Task Synthesis\n   instruction.md + solve.sh (reference solution)\n        │\n        ▼  3. Environment Reproduction\n   Dockerfile + docker-compose.yaml (5,035 reproduced)\n        │\n        ▼  4. 
Test Suite Generation \u0026 Validation\n   1,530 validated tasks (AllPassing / Nop / Partial trials)\n```\n\n---\n\n## Repository Structure\n\n```\nTerminalWorld/\n├── data_retrieval/          # Stage 1a: crawl and download asciinema recordings\n│   ├── scrape_pages.py      #   Index explore feeds (public / recent / featured / popular)\n│   ├── download_recordings.py #  Download .txt transcripts + info.json metadata\n│   ├── parsers.py           #   HTML parsers for listing and detail pages\n│   ├── config.py            #   Shared crawler configuration\n│   └── stats.py             #   Dataset statistics\n│\n├── data_filtering/          # Stage 1b: filter recordings by quality criteria\n│   ├── detect_pii.py        #   Flag PII, credentials, and malicious commands\n│   ├── classify_tui.py      #   Detect TUI tool invocations (vim, htop, tmux, …)\n│   ├── detect_external_urls.py # Identify and verify external repository links\n│   ├── analyze_duration.py  #   Extract and analyze recording durations\n│   ├── score_value.py       #   Two-stage LLM quality scoring (feasibility + value)\n│   └── filter_recordings.py #   Combine all filters into a single filtering pass\n│\n├── task_synthesis/          # Stage 2: synthesize task instruction and reference solution\n│   ├── generate_instruction.py  # LLM-based outcome-oriented instruction generation\n│   ├── extract_solution.py      # LLM-based reference solution extraction from transcript\n│   └── generate_task_metadata.py\n│\n├── environment_building/    # Stage 3: reproduce executable Docker environments\n│   ├── build_environment.py  # LLM agent: synthesize and refine Dockerfile\n│   ├── analyze_recording.py  # Parse transcript to extract environment signals\n│   ├── batch_build.py        # Parallel batch environment reproduction\n│   └── monitor.py\n│\n└── test_generation/         # Stage 4: generate and validate test suites\n    ├── generate_tests.py     # LLM agent: snapshot-guided test suite generation\n    ├── 
refine_task.py        # Trial-based refinement loop (AllPassing/Nop/Partial)\n    ├── batch_refine.py       # Parallel batch refinement\n    └── skill/                # Agent skill definitions and task format references\n```\n\n---\n\n## Pipeline Details\n\n### Stage 1 — Data Retrieval \u0026 Filtering\n\nWe index asciinema's public explore feeds and download the plain-text transcript (`recording.txt`) and metadata (`info.json`) for each recording. Raw cast files and generated media are intentionally **not** collected, in accordance with asciinema's terms of service.\n\nRecordings are then filtered by five sequential criteria:\n1. **Privacy \u0026 Safety** — exclude PII, exposed credentials, and malicious/destructive commands (`detect_pii.py`)\n2. **CLI-only** — discard recordings that invoke TUI applications (vim, nano, htop, …) (`classify_tui.py`)\n3. **Docker reproducibility** — remove recordings dependent on inaccessible URLs, Windows environments, or proprietary software (`detect_external_urls.py`)\n4. **Minimum length** — eliminate excessively short or aborted sessions (`analyze_duration.py`)\n5. 
**LLM quality scoring** — filter opaque or purely exploratory sessions using a two-stage scoring framework: rule-based feasibility pre-check followed by LLM scoring on three dimensions — *state-action alignment*, *task complexity*, and *signal clarity* (`score_value.py`)\n\n**Result:** 80,870 → **9,492** high-quality recordings.\n\n### Stage 2 — Task Synthesis\n\nAn LLM distills each transcript into two artifacts:\n- **Instruction** (`instruction.md`): outcome-oriented natural language goal; describes *what* to achieve, never *how*; specifies required output paths and formats.\n- **Reference solution** (`solve.sh`): clean, executable bash script extracted from the transcript; redirects final results to explicit file paths (e.g., `/app/result.txt`) for deterministic verification.\n\n### Stage 3 — Environment Reproduction\n\nAn LLM agent synthesizes a `Dockerfile` (and `docker-compose.yaml` for multi-service tasks) by inferring dependencies from the reference solution. If the recording includes an external repository link, the agent clones and scans it to infer precise requirements. Fake binaries, stubbed dependencies, and bypasses of real software installation are explicitly prohibited.\n\nThe agent then enters an execution-feedback loop: build the image → parse build logs → launch the container → execute the reference solution step by step → feed runtime anomalies back for targeted repair.\n\n**Result:** 9,492 → **5,035** reproduced environments.\n\n### Stage 4 — Test Suite Generation \u0026 Validation\n\nThe agent captures pre- and post-execution filesystem snapshots in the Docker environment and generates state-based test assertions calibrated to the actual final state. 
Tests target persistent artifacts (file existence, content hashes, structured outputs) and avoid brittle non-deterministic checks (timestamps, process IDs).\n\nEach test suite is then refined through three execution trials in fresh containers:\n\n| Trial | Execution | Requirement | Guarantees |\n|-------|-----------|-------------|------------|\n| **AllPassing** | Full reference solution | All tests pass | Task *solvability* |\n| **Nop** | Nothing (empty state) | All tests fail | Task *non-triviality* |\n| **Partial** | Truncated / ablated solution | At least one test fails | Test *discriminability* |\n\nA task is admitted only if all three trials meet their requirements. Failed suites are iteratively repaired; tasks that exceed the computational budget are discarded.\n\n**Result:** 5,035 → **1,530** validated tasks.\n\n---\n\n## The TerminalWorld Benchmark\n\n| | Full Set | Verified Subset |\n|--|--|--|\n| **Tasks** | 1,530 | 200 |\n| **Categories** | 19 |  |\n| **Unique commands** | 1,280 | — |\n| **Commands absent from Terminal-Bench** | 91% | — |\n| **Human review** | Automated validation | ✓ (4 expert annotators) |\n\nThe **Verified** subset of 200 tasks was manually reviewed by four authors with 3+ years of terminal development experience. 
Each task was executed end-to-end inside the Docker environment to verify functional correctness and artifact alignment.\n\nBenchmarking on the Verified subset across 8 frontier LLMs and 6 agent frameworks shows that the best system (Claude Opus 4.7 + Terminus-2) achieves a **62.5% pass rate**, with TerminalWorld scores only weakly correlated with Terminal-Bench scores (Pearson *r* = 0.20), confirming that in-the-wild recordings probe capabilities that expert-curated benchmarks miss.\n\n---\n\n## Ethics \u0026 Data Policy\n\n- We collect only publicly listed `.txt` transcripts and their coupled metadata via asciinema's standard download links, in full compliance with `robots.txt`.\n- Raw `.cast` files and generated media are **never downloaded or redistributed**.\n- The released benchmark contains only synthesized artifacts (instructions, environments, tests) with hyperlinks back to the original recordings — no original transcripts are redistributed.\n- Recordings are filtered for PII, credentials, and malicious content before any synthesis occurs.\n\nSee the paper's ethics section for a full discussion of copyright compliance and the right-to-be-forgotten architecture.\n\n## Citation\n\nIf you use TerminalWorld in your research, please cite:\n\n```bibtex\n@article{chu2026terminalworld,\n  title   = {TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks},\n  author  = {Zhaoyang Chu and Jiarui Hu and Xingyu Jiang and Pengyu Zou and Han Li and Chao Peng and Peter O'Hearn and Earl T. Barr and Mark Harman and Federica Sarro and He Ye},\n  journal = {arXiv preprint arXiv:2605.22535},\n  year    = {2026}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FEuniAI%2FTerminalWorld","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FEuniAI%2FTerminalWorld","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FEuniAI%2FTerminalWorld/lists"}