{"id":44430787,"url":"https://github.com/internlm/emembench","last_synced_at":"2026-02-12T13:00:44.991Z","repository":{"id":334661830,"uuid":"1140528452","full_name":"InternLM/EMemBench","owner":"InternLM","description":"Official Repository of EMemBench: Interactive Benchmarking of Episodic Memory for VLM Agents","archived":false,"fork":false,"pushed_at":"2026-01-26T11:35:52.000Z","size":4840,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-01-26T20:45:10.293Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/InternLM.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-01-23T12:00:13.000Z","updated_at":"2026-01-26T11:35:55.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/InternLM/EMemBench","commit_stats":null,"previous_names":["internlm/emembench"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/InternLM/EMemBench","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/InternLM%2FEMemBench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/InternLM%2FEMemBench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/InternLM%2FEMemBench/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/InternLM%2FEMemBench/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/InternLM","download_url":"https://codeload.github.com/InternLM/EMemBench/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/InternLM%2FEMemBench/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29366558,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-12T08:51:36.827Z","status":"ssl_error","status_checked_at":"2026-02-12T08:51:26.849Z","response_time":55,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-02-12T13:00:15.067Z","updated_at":"2026-02-12T13:00:44.986Z","avatar_url":"https://github.com/InternLM.png","language":"Python","readme":"\u003cp align=\"center\"\u003e\n  \u003ch1 align=\"center\"\u003eEMemBench: Interactive Benchmarking of Episodic Memory for VLM Agents\u003c/h1\u003e\n    \u003cp align=\"center\"\u003e\n    \u003ca 
href=\"https://lixinze777.github.io/\"\u003e\u003cstrong\u003eXinze Li\u003c/strong\u003e\u003c/a\u003e\n    ·\n    \u003cstrong\u003eZiyue Zhu\u003c/strong\u003e\n    ·\n    \u003cstrong\u003eSiyuan Liu\u003c/strong\u003e\n    ·\n    \u003ca href=\"https://mayubo2333.github.io\"\u003e\u003cstrong\u003eYubo Ma\u003c/strong\u003e\u003c/a\u003e\n    ·\n    \u003ca href=\"https://yuhangzang.github.io/\"\u003e\u003cstrong\u003eYuhang Zang\u003c/strong\u003e\u003c/a\u003e\n    ·\n    \u003ca href=\"https://sites.google.com/view/yixin-homepage\"\u003e\u003cstrong\u003eYixin Cao\u003c/strong\u003e\u003c/a\u003e\n    ·\n    \u003ca href=\"https://personal.ntu.edu.sg/axsun/\"\u003e\u003cstrong\u003eAixin Sun\u003c/strong\u003e\u003c/a\u003e\n  \u003c/p\u003e\n\n\nEMemBench is a **programmatic benchmark framework** for evaluating **episodic (experience-grounded) memory** in interactive agents.  \nInstead of using a fixed, static QA set, EMemBench generates questions **from each agent’s own interaction trajectory** and computes **verifiable ground-truth answers** from underlying game signals.\n\nThis repo provides an end-to-end pipeline for:\n- **Jericho** (text-only interactive fiction)\n- **Crafter** (visual, partially observed survival \u0026 crafting)\n\n\u003e EMemBench is not a single fixed dataset. It is a **benchmark generator + evaluation harness**: run an agent → log → generate QA with programmatic GT → answer \u0026 score.\n\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"paper/emembench concept.png\" width=\"600\" /\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n  \u003cem\u003eFigure 1: EMemBench overview. An agent interacts with game environment to produce an episode trajectory. We log agent-observable signals and all underlying game signals. A carefully designed algorithm converts each episode into a QA set with calculated ground truths, and the same agent then answers these questions using only agent-observable context plus its own memory.\u003c/em\u003e\n\u003c/p\u003e\n\n\n---\n\n## Key Ideas\n\n- **Trajectory-conditioned QA**: questions are derived from the agent’s **own** interaction trace.\n- **Programmatic, verifiable ground truth**: answers are computed from game signals / structured logs.\n- **Query Horizon Control (QHC)**: templates can optionally restrict evidence selection and answer computation to a **prefix window** (e.g., steps 1..50) to reduce confounds from variable episode lengths.  \n  - **Legacy naming note**: current code passes QHC values via flags named `--difficulties` / `--difficulty`, and writes to folders like `DIF_-1`, `DIF_50`. 
---

## Quickstart: End-to-End Pipelines

### A) Jericho (Text) — one command

From the `text_game/` directory (or the repo root, depending on your working directory):

```bash
python run_text_game_pipeline.py \
  --model gpt-5.1 \
  --max-steps 200 \
  --history-turns 30 \
  --difficulties -1 50 \
  --max-per-type 2 \
  --logs-root logs \
  --qa-root generated_qa
```

What it does (per game):
1. **Play & log** → `logs/<game>/*_logs.jsonl`
2. **Generate QA** (one set per QHC value) → `generated_qa/<game>/<run_name>/DIF_*`
3. **Answer & evaluate** → `eval/<game>/<run_name>/...`

**Notes**
- `--history-turns` controls how many recent turns are included in the policy prompt during play.
- The list of games is defined in `run_text_game_pipeline.py` (edit `JERICHO_GAMES` to run more or fewer titles).

---

### B) Crafter (Visual) — one command (multi-seed)

From the `visual_game/` directory (or the repo root):

```bash
python run_visual_game_pipeline.py \
  --seeds 1 42 43 100 123 \
  --steps 500 \
  --history-turns 10 \
  --difficulties -1 50 \
  --qa-source paraphrase \
  --qa-temperature 0.0 \
  --qa-max-tokens 4096 \
  --batch-size 8 \
  --frames-mode mosaic
```

Override the answering model (optional):

```bash
python run_visual_game_pipeline.py \
  --seeds 42 \
  --qa-model gpt-5.1
```

**Notes**
- `--frames-mode` controls how frames are packaged into evaluation prompts (`mosaic` is typically the most economical).
- Outputs are grouped by seed: `log/seed{SEED}/{RUN_NAME}/...`

---

## Outputs

### Logs
- Jericho: `logs/<game>/*_logs.jsonl`
- Crafter: `log/seed{SEED}/{RUN_NAME}/logs.jsonl` plus `frames/` and `map_seed{SEED}.txt`

### QA artifacts
- `qa_context.json`: agent-observable context used to build evaluation prompts
- `qa.jsonl`: one QA per line (question, metadata, GT answer, evidence pointers, etc.); see the inspection sketch below

### Evaluation
- Per-question predictions: `answers.jsonl` (or equivalent)
- Aggregated metrics: `index.json` (or equivalent)
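A small sketch for sanity-checking a generated `qa.jsonl`, assuming a Crafter-style output path. The path components (`seed42`, `my_run`) and whatever keys get printed are illustrative; the README only guarantees that each line carries a question, metadata, a GT answer, and evidence pointers, so adjust to the schema your run actually produced.

```python
# Quick inspection of a generated QA file. The path below is hypothetical
# (seed 42, run name "my_run", QHC=50); record keys depend on your run.
import json
from pathlib import Path

qa_path = Path("generated_qa") / "seed42" / "my_run" / "DIF_50" / "qa.jsonl"

with qa_path.open() as f:
    records = [json.loads(line) for line in f if line.strip()]

print(f"{len(records)} questions in {qa_path}")
for rec in records[:3]:
    # Print whatever fields each record actually contains.
    print(sorted(rec.keys()))
```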
---

## Upstream Environments

- Jericho: https://github.com/microsoft/jericho
- Crafter: https://github.com/danijar/crafter

## ✒️ Citation

```
@misc{li2026emembenchinteractivebenchmarkingepisodic,
      title={EMemBench: Interactive Benchmarking of Episodic Memory for VLM Agents},
      author={Xinze Li and Ziyue Zhu and Siyuan Liu and Yubo Ma and Yuhang Zang and Yixin Cao and Aixin Sun},
      year={2026},
      eprint={2601.16690},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2601.16690},
}
```

## 📄 License

![Code License](https://img.shields.io/badge/Code%20License-Apache_2.0-green.svg) ![Data License](https://img.shields.io/badge/Data%20License-CC%20By%20NC%204.0-red.svg)

**Usage and License Notices**: The code and data are intended and licensed for research use only. The code is released under Apache 2.0, and the data under Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). Use should also abide by the OpenAI terms of use: https://openai.com/policies/terms-of-use