{"id":47851744,"url":"https://github.com/xiaowu0162/LongMemEval","last_synced_at":"2026-06-07T12:00:53.727Z","repository":{"id":257914140,"uuid":"870849341","full_name":"xiaowu0162/LongMemEval","owner":"xiaowu0162","description":"Benchmarking Chat Assistants on Long-Term Interactive Memory (ICLR 2025)","archived":false,"fork":false,"pushed_at":"2026-05-11T22:49:24.000Z","size":1908,"stargazers_count":753,"open_issues_count":34,"forks_count":55,"subscribers_count":9,"default_branch":"main","last_synced_at":"2026-05-12T00:27:58.423Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/xiaowu0162.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-10-10T19:24:15.000Z","updated_at":"2026-05-11T22:49:30.000Z","dependencies_parsed_at":"2024-11-21T00:17:51.666Z","dependency_job_id":"c82649c5-3750-4d3b-88f2-7d2629e19ca3","html_url":"https://github.com/xiaowu0162/LongMemEval","commit_stats":null,"previous_names":["xiaowu0162/longmemeval"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/xiaowu0162/LongMemEval","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xiaowu0162%2FLongMemEval","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xiaowu0162%2FLongMemEval/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xiaowu0162%2FLongMemEval/releases","manifests_url":"
https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xiaowu0162%2FLongMemEval/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/xiaowu0162","download_url":"https://codeload.github.com/xiaowu0162/LongMemEval/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xiaowu0162%2FLongMemEval/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34020187,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-07T02:00:07.652Z","response_time":124,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-04-03T22:00:28.540Z","updated_at":"2026-06-07T12:00:53.718Z","avatar_url":"https://github.com/xiaowu0162.png","language":"Python","funding_links":[],"categories":["📏 Benchmarks","Benchmarks \u0026 Evaluation"],"sub_categories":["💬 Plain-Text Benchmarks"],"readme":"# LongMemEval\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://xiaowu0162.github.io/long-mem-eval/\"\u003e\u003cimg src=\"https://img.shields.io/badge/🌐-Website-red\" height=\"23\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://arxiv.org/pdf/2410.10813.pdf\"\u003e\u003cimg src=\"https://img.shields.io/badge/📝-Paper (ICLR 2025)-blue\" height=\"23\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://huggingface.co/datasets/xiaowu0162/longmemeval-cleaned\" \u003e\u003cimg 
src=\"https://img.shields.io/badge/🤗-Data-green\" height=\"23\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n🖋 [Di Wu](https://xiaowu0162.github.io/), [Hongwei Wang](https://hongweiw.net/), [Wenhao Yu](https://wyu97.github.io/), [Yuwei Zhang](https://zhang-yu-wei.github.io/), [Kai-Wei Chang](https://web.cs.ucla.edu/~kwchang/), and [Dong Yu](https://sites.google.com/view/dongyu888/)\n\nWe introduce LongMemEval, a comprehensive, challenging, and scalable benchmark for testing the long-term memory of chat assistants. \n\n## ⚠️ News\n\n* [2026/05] Check out [LongMemEval-V2](https://github.com/xiaowu0162/LongMemEval-V2): long-term memory in agentic context.\n* [2025/09] We have further cleaned up the history sessions to prevent interference with answer correctness. The updated benchmark can be found [here](https://huggingface.co/datasets/xiaowu0162/longmemeval-cleaned/) and the change logs can be found [here](https://docs.google.com/spreadsheets/d/16cHPu2B4XhgC-VvolIoWNs8wwm0Zkbpgu8H9x-qhxWg/edit?usp=sharing). You may also access the file at this Google Drive [link](https://drive.google.com/file/d/1zo5C2sKsN3-2TUZt7kiRd2wsZLmyd-4y/view?usp=sharing).\n* [2025/02] LongMemEval is accepted at ICLR 2025. \n* [2024/10] Benchmark released.\n\n## 🧠 LongMemEval Overview\n\nWe release 500 high-quality questions to test five core long-term memory abilities:\n* Information Extraction\n* Multi-Session Reasoning\n* Knowledge Updates\n* Temporal Reasoning\n* Abstention\n\n![Example Questions in LongMemEval](assets/longmemeval_examples.png)\n\nInspired by the \"needle-in-a-haystack\" test, we design an attribute-controlled pipeline to compile a coherent, extensible, and timestamped chat history for each question. 
LongMemEval requires chat systems to parse the dynamic interactions online for memorization, and answer the question after all the interaction sessions.\n\n## 🛠️ Setup\n\n### Data\n\nThe LongMemEval dataset is officially released at [huggingface](https://huggingface.co/datasets/xiaowu0162/longmemeval-cleaned). Please download and uncompress the data to the `data/` folder. \n```\nmkdir -p data/\ncd data/\nwget https://huggingface.co/datasets/xiaowu0162/longmemeval-cleaned/resolve/main/longmemeval_oracle.json\nwget https://huggingface.co/datasets/xiaowu0162/longmemeval-cleaned/resolve/main/longmemeval_s_cleaned.json\nwget https://huggingface.co/datasets/xiaowu0162/longmemeval-cleaned/resolve/main/longmemeval_m_cleaned.json\ncd ..\n```\n\n### Environment\n\nWe recommend using a conda environment for the project. You may follow the steps below to set up.\n\n#### Evaluation only\n\nIf you only need to calculate the metrics on the outputs produced by your own system, you can install this minimal requirement set which allows you to run `src/evaluation/evaluate_qa.py`.\n\n```\nconda create -n longmemeval-lite python=3.9\nconda activate longmemeval-lite\npip install -r requirements-lite.txt\n```\n\n#### Full support\n\nIf you also would like to run the memory systems introduced in the paper, please set up this environment instead. \n\n```\nconda create -n longmemeval python=3.9\nconda activate longmemeval\npip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu121\npip install -r requirements-full.txt\n```\nWe have tested this environment on a Linux machine with CUDA 12.1. If you use a different platform, you may need to modify the requirements.\n\n## 📜 Dataset Format\n\nThree files are included in the data package:\n* `longmemeval_s.json`: The LongMemEval_S introduced in the paper. Concatenating all the chat history roughly consumes 115k tokens (~40 history sessions) for Llama 3. 
\n* `longmemeval_m.json`: The LongMemEval_M introduced in the paper. Each chat history contains roughly 500 sessions. \n* `longmemeval_oracle.json`: LongMemEval with oracle retrieval. Only the evidence sessions are included in the history. \n\nWithin each file, there are 500 evaluation instances, each of which contains the following fields:\n* `question_id`: the unique id for each question.\n* `question_type`: one of `single-session-user`, `single-session-assistant`, `single-session-preference`, `temporal-reasoning`, `knowledge-update`, and `multi-session`. If `question_id` ends with `_abs`, then the question is an `abstention` question. \n* `question`: the question content.\n* `answer`: the expected answer from the model.\n* `question_date`: the date of the question.\n* `haystack_session_ids`: a list of the ids of the history sessions (sorted by timestamp for `longmemeval_s.json` and `longmemeval_m.json`; not sorted for `longmemeval_oracle.json`). \n* `haystack_dates`: a list of the timestamps of the history sessions. \n* `haystack_sessions`: a list of the actual contents of the user-assistant chat history sessions. Each session is a list of turns. Each turn is a dict with the format `{\"role\": user/assistant, \"content\": message content}`. For the turns that contain the required evidence, an additional field `has_answer: true` is provided. This label is used for turn-level memory recall accuracy evaluation.\n* `answer_session_ids`: a list of session ids that represent the evidence sessions. This is used for session-level memory recall accuracy evaluation.\n\n## 📊 Testing Your System\n\nTo test on LongMemEval, you may directly feed the timestamped history to your own chat system, collect the output, and evaluate with the evaluation script we provide. To do so, save the outputs in `jsonl` format with each line containing two fields: `question_id` and `hypothesis`. 
Then, you may run the evaluation script through the following command:\n\n```\nexport OPENAI_API_KEY=YOUR_API_KEY\nexport OPENAI_ORGANIZATION=YOUR_ORGANIZATION     # may be omitted if your key belongs to only one organization\ncd src/evaluation\npython3 evaluate_qa.py gpt-4o your_hypothesis_file ../../data/longmemeval_oracle.json\n```\n\nRunning this script will save the evaluation logs into a file called `[your_hypothesis_file].log`. In this file, each line will contain a new field called `autoeval_label`. While `evaluate_qa.py` already reports the averaged scores, you can also aggregate the scores from the log using the following command:\n\n```\n(assuming you are in the src/evaluation folder)\n\npython3 print_qa_metrics.py gpt-4o your_hypothesis_file.log ../../data/longmemeval_oracle.json\n```\n\n## 💬 Creating Custom Chat Histories \n\nLongMemEval supports compiling a chat history of arbitrary length for a question instance, so that you can easily scale up the difficulty over `LongMemEval_M`.\n\n### Downloading the Corpus\n\nPlease download the compressed data from [this link](https://drive.google.com/file/d/1loTKBdywbCfYL5h5zwfnVcqlh7QwnBQm/view?usp=sharing) and uncompress the data under `data/custom_history`. The released data contains four parts:\n* `1_attr_bg/data_1_attr_bg.json`: user attributes and backgrounds. \n* `2_questions`: questions, answers, evidence statements, as well as the evidence sessions. \n* `5_filler_sess/data_5_filler_sess.json`: filler sessions sourced from ShareGPT and UltraChat. \n* `6_session_cache/data_6_session_cache.json`: simulated user sessions based on facts extracted from the backgrounds. \n\n### Reproducing LongMemEval's History Compilation\n\nYou may run `python sample_haystack_and_timestamp.py task n_questions min_n_haystack_filler max_n_haystack_filler enforce_json_length` to reproduce the history compilation in LongMemEval. \n\n* `task` is the name of the task. 
In the released data, we used different task names from the paper. Here is the mapping.\n\n    | Name in Data               | Official Name                |\n    |----------------------------|------------------------------|\n    | single_hop                 | single-session-user          |\n    | implicit_preference_v2     | single-session-preference    |\n    | assistant_previnfo         | single-session-assistant     |\n    | two_hop                    | multi-session                |\n    | multi_session_synthesis    | multi-session                |\n    | temp_reasoning_implicit    | temporal-reasoning           |\n    | temp_reasoning_explicit    | temporal-reasoning           |\n    | knowledge_update           | knowledge-update             |\n\n* `n_questions` is the maximum number of questions used. \n* `min_n_haystack_filler` and `max_n_haystack_filler` set limits on the number of sessions included in the history. 80 is used for `longmemeval_s` and 500 is used for `longmemeval_m`. \n* `enforce_json_length` is used to limit the length of the chat history. 115000 is used for `longmemeval_s`. For `longmemeval_m`, this criterion is not used, and you can set it to a large number. \n\n### Constructing Your Custom History\n\nTo construct your own chat history, you may follow the format in `2_questions` and `6_session_cache` to create the questions and evidence sessions. Then, you may run `sample_haystack_and_timestamp.py` with a similar command. \n\n## 🚀 Running Memory System Experiments\n\nWe provide the experiment code for memory retrieval and retrieval-augmented question answering under the folders `src/retrieval` and `src/generation`.\n\n### Preparation\n\nIf you would like to test OpenAI models as the reader, please provide your OpenAI organization ID and key on lines 33 and 34 of `src/generation/run_generation.sh`. \n\nIf you want to test an open-weight reader LLM, we support it through an OpenAI API emulator served locally via `vllm`. 
To start the server, use the following command:\n```\ncd src/utils\nbash serve_vllm.sh GPU MODEL PORT TP_SIZE\n```\n* `GPU` is a comma-separated list of the GPUs you want to use.\n* `MODEL` is the alias of the model you want to use. You can view or configure it in `serve_vllm.sh`. \n* `PORT` is the port the server will listen on. It defaults to 8001. If you change the port, make sure it is reflected in `src/generation/run_generation.sh` line 38.\n* `TP_SIZE` is the tensor parallel size. It must be smaller than or equal to the number of GPUs specified in `GPU`.\n* If you need to limit the maximum number of tokens due to memory requirements, you can use the following command:\n```\nbash serve_vllm_with_maxlen.sh GPU MODEL MAXLEN PORT TP_SIZE\n```\n\n### Long-Context Generation\n\nTo run the long-context generation baseline where the model is provided with the full history, you can use the following commands:\n```\ncd src/generation\nbash run_generation.sh DATA_FILE MODEL full-history-session TOPK [HISTORY_FORMAT] [USERONLY] [READING_METHOD]\n```\n* `DATA_FILE` is the path to one of the released json files. Note that `longmemeval_s.json` and `longmemeval_oracle.json` are designed to fit into a model with 128k context, but `longmemeval_m.json` is too long for long-context testing.\n* `MODEL` is the alias of the model you want to use. You can view or configure it in `run_generation.sh`.\n* `TOPK` is the maximum number of history sessions provided to the reader. We recommend setting it to a large number (e.g., 1000) to ensure that all the history sessions are included.\n* `HISTORY_FORMAT` is the format to present the history. It can take either `json` or `nl`. We recommend using `json`.\n* `USERONLY` removes the assistant-side messages in the prompt. We recommend setting it to `false`. \n* `READING_METHOD` is the reading style. It can take the value `direct`, `con`, or `con-separate`. 
We recommend `con`, which instructs the model to first extract useful information and then reason over it.\n\nA log file will be generated under the folder `generation_logs/`. You can then follow the instructions above in \"Testing Your System\" to evaluate the QA correctness.\n\n### Memory Retrieval\n\nTo run memory indexing and retrieval on LongMemEval, follow the instructions below.\n\n#### Baseline Retrieval\n\nYou may use the following command to run the baseline retrieval:\n```\ncd src/retrieval\nbash run_retrieval.sh IN_FILE RETRIEVER GRANULARITY\n```\n* `IN_FILE`: the path \n* `RETRIEVER`: `flat-bm25`, `flat-contriever`, `flat-stella` (Stella V5 1.5B), or `flat-gte` (gte-Qwen2-7B-instruct). For `flat-stella`, our code requires downloading the model manually from [the original repository](https://huggingface.co/dunzhang/stella_en_1.5B_v5).\n* `GRANULARITY`: the value granularity of the memory index; we support `turn` or `session`.\n\nNote that for the dense embedding models, we support multi-GPU retrieval. By default, the code will utilize all the available GPUs. \n\nThe script will output the retrieval results under `retrieval_logs/` and print out the evaluation metrics. You may also print out the metrics from the log with:\n```\npython3 src/evaluation/print_retrieval_metrics.py log_file\n```\n\nAlso note that for evaluating the retrieval, we always skip the 30 abstention instances. This is because these instances generally refer to non-existing events and do not have a ground truth answer location.\n\n#### Index Expansion\n\nTo run the experiments with key expansion, download the released key expansion outputs from this [link](https://drive.google.com/file/d/1Y6vImnr_WQZXv4bKzpqPdHMbyoat8b_P/view?usp=sharing) and place the data under the directory `LongMemEval/index_expansion_logs/`. 
Then, you may run the following command:\n```\ncd src/retrieval\nbash run_retrieval.sh IN_FILE RETRIEVER GRANULARITY EXPANSION_TYPE JOIN_MODE CACHE\n```\n* `EXPANSION_TYPE`: we support `session-summ`, `session-keyphrase`, `session-userfact`, `turn-keyphrase`, `turn-userfact`\n* `JOIN_MODE`: we support three modes:\n    * `separate`: add a new (key, value) pair with the expansion as the key.\n    * `merge`: merge the expansion and the original key to get the new key.\n    * `replace`: replace the original key with the expansion content.\n* `CACHE`: the path to the cache file corresponding to `EXPANSION_TYPE`.\n\nWe release the code for generating the key expansion outputs offline under `src/index_expansion` for your reference.\n\n#### Time-Aware Query Expansion\n\nWe provide the implementation for pruning the search space by extracting timestamped events from the sessions, inferring a time range from the query, and using the range to narrow down the search space. First, download the extracted timestamped events [here](https://drive.google.com/file/d/1ekTMgHPffoaBQV3ZOuv-qV9fW6wJj4eW/view?usp=sharing) and unzip the data under `LongMemEval/index_expansion_logs/`. Next, run the command:\n```\ncd src/index_expansion\npython3 temp_query_search_pruning.py TIMESTAMP_EVENT_FILE RETRIEVAL_LOG GRANULARITY\n```\n* `TIMESTAMP_EVENT_FILE` is the extracted timestamped events file you downloaded.\n* `RETRIEVAL_LOG` is the output from any retrieval experiment.\n* `GRANULARITY` is `session` or `turn` and should be consistent with the other two arguments. For the paper, we used `session` as the granularity.\n\n### Retrieval-Augmented Generation\n\nYou may use the following command for question answering with the retrieved memory.\n\n```\ncd src/generation\nbash run_generation.sh RETRIEVAL_LOG_FILE MODEL EXP TOPK [HISTORY_FORMAT] [USERONLY] [READING_METHOD]\n```\n* `RETRIEVAL_LOG_FILE` should be the output from the retrieval step. 
Specifically, this step relies on the `retrieval_results` field added in the retrieval step.\n* `EXP` should be in the form `[RETRIEVER]-[GRANULARITY]` such as `flat-stella-session`. \n* The other parameters are the same as introduced above.\n\n## Citation\n\nIf you find the work useful, please cite:\n\n```\n@article{wu2024longmemeval,\n      title={LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory}, \n      author={Di Wu and Hongwei Wang and Wenhao Yu and Yuwei Zhang and Kai-Wei Chang and Dong Yu},\n      year={2024},\n      eprint={2410.10813},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL},\n      url={https://arxiv.org/abs/2410.10813}, \n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxiaowu0162%2FLongMemEval","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fxiaowu0162%2FLongMemEval","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxiaowu0162%2FLongMemEval/lists"}