https://github.com/memtensor/halumem

HaluMem is the first operation-level hallucination evaluation benchmark tailored to agent memory systems.
# HaluMem: A Comprehensive Benchmark for Evaluating Hallucinations in Memory Systems
## 📊 Why We Define the HaluMem Evaluation Tasks
*Fig 1. HaluMem vs. Existing Benchmarks for Memory Systems.*
- **Limitations of Existing Frameworks**
Most existing evaluation frameworks treat memory systems as **black-box models**, assessing performance only through **end-to-end QA accuracy**.
However, this approach has **two major limitations**:
1. It lacks a **hallucination evaluation** specifically designed for the characteristics of memory systems.
2. It fails to examine the **core operational steps** in how memory is processed, such as retrieval and updating.
- **Motivation for HaluMem**
To address these issues, we introduce **HaluMem**, a comprehensive benchmark that defines fine-grained evaluation tasks tailored for memory systems.
------
## 🧩 What Is HaluMem?
The paper *“HaluMem: A Comprehensive Benchmark for Evaluating Hallucinations in Memory Systems”* presents the **first operation-level hallucination benchmark** designed explicitly for memory systems.
HaluMem decomposes the memory workflow into **three fundamental operations**:
- **🧩 Memory Extraction**
Evaluates whether the system can **accurately identify and store factual information** from dialogue sessions while avoiding **hallucinated or irrelevant memories**.
This task measures both **memory completeness** (how well the reference memory points are captured) and **memory accuracy** (how precisely the extracted memories reflect the ground truth).
- **🔄 Memory Update**
Evaluates whether the system can **correctly modify or overwrite existing memories** when new dialogue provides updated or contradictory information, ensuring internal **consistency and temporal coherence** within the memory base.
- **💬 Memory Question Answering**
Evaluates the system’s **end-to-end capability** to integrate multiple memory processes (including **extraction**, **update**, **retrieval**, and **response generation**) to produce factual, context-aware, and hallucination-free answers.
Each operation includes carefully designed evaluation tasks to **reveal hallucination behaviors** at different stages of memory handling.
------
## 🏆 Leaderboard
We conducted a comprehensive evaluation of several state-of-the-art memory systems on HaluMem, including [Mem0](https://github.com/mem0ai/mem0) (both standard and graph versions), [Memobase](https://github.com/memodb-io/memobase), [MemOS](https://github.com/MemTensor/MemOS), [Supermemory](https://github.com/supermemoryai/supermemory), and [Zep](https://github.com/getzep/zep). For a comprehensive understanding of the methodology and metric calculations, please refer to our [paper](https://arxiv.org/abs/2511.03506).
---
### 🥇 1. Main Evaluation Results
The main evaluation assesses memory systems across **three core tasks**:
1. **Memory Extraction**: Evaluates **Memory Integrity** (Recall of reference memory points) and **Memory Accuracy** (Precision of extracted memory points).
2. **Memory Updating**: Evaluates accuracy, hallucination, and omission rates during memory updates.
3. **Question Answering**: Evaluates accuracy, hallucination, and omission rates in downstream QA tasks.
*Fig 2. Hallucination Evaluation Process.*
> **📝 Metric Legend:**
> * **R**: Recall (Higher is better ↑)
> * **Target P**: Target Memory Precision (Higher is better ↑)
> * **Acc.**: Memory Accuracy (Higher is better ↑)
> * **FMR**: False Memory Resistance (Higher is better ↑)
> * **F1**: Memory Extraction F1-score (Higher is better ↑)
> * **C**: Correct Rate / Accuracy (Higher is better ↑)
> * **H**: Hallucination Rate (Lower is better ↓)
> * **O**: Omission Rate (Lower is better ↓)
> * *(Values in parentheses in "Target P" and "Acc." columns represent the count of extracted memories)*
*The tables are sorted based on a holistic performance assessment, prioritizing **F1-score**, **QA Accuracy**, and **Updating Accuracy**.*
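For intuition about the legend, the toy sketch below computes recall, precision, and F1 over memory points. It uses exact string matching, which is an illustrative simplification: the benchmark matches memory points semantically rather than by string equality, so only the formulas carry over.

```python
# Toy sketch of operation-level extraction metrics over memory points.
# Matching here is exact string equality; the real benchmark's matching
# is semantic, so treat this as an illustration of the formulas only.

def extraction_metrics(reference: list[str], extracted: list[str]) -> dict:
    ref, ext = set(reference), set(extracted)
    matched = ref & ext
    recall = len(matched) / len(ref) if ref else 0.0     # R: coverage of reference points
    precision = len(matched) / len(ext) if ext else 0.0  # P: share of extracted points that are real
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"R": recall, "P": precision, "F1": f1}

reference = ["works at Huaxin", "likes hiking", "moved to Shanghai"]
extracted = ["works at Huaxin", "likes hiking", "owns a cat"]  # last one is hallucinated
print(extraction_metrics(reference, extracted))
```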
#### 🔹 HaluMem-Medium Dataset
*Columns R↑–F1↑ cover Memory Extraction; the Upd. columns cover Memory Updating; the QA columns cover Question Answering.*

| System | R↑ | Weighted R↑ | Target P↑ | Acc.↑ | FMR↑ | F1↑ | Upd. C↑ | Upd. H↓ | Upd. O↓ | QA C↑ | QA H↓ | QA O↓ |
| ----------- | ------ | ------ | -------------- | -------------- | ------ | ------ | ------ | ----- | ------ | ------ | ------ | ------ |
| MemOS | 74.07% | 84.81% | 86.25% (45190) | 59.55% (71793) | 44.94% | 79.70% | 62.11% | 0.42% | 37.48% | 67.23% | 15.17% | 17.59% |
| Zep | - | - | - | - | - | - | 47.28% | 0.42% | 52.31% | 55.47% | 21.92% | 22.62% |
| Mem0-Graph | 43.28% | 65.52% | 87.20% (10567) | 61.86% (16230) | 55.70% | 57.85% | 24.50% | 0.26% | 75.24% | 54.66% | 19.28% | 26.06% |
| Mem0 | 42.91% | 65.03% | 86.26% (10556) | 60.86% (16291) | 56.80% | 57.31% | 25.50% | 0.45% | 74.02% | 53.02% | 19.17% | 27.81% |
| Supermemory | 41.53% | 64.76% | 90.32% (14134) | 60.83% (22551) | 51.77% | 56.90% | 16.37% | 1.15% | 82.47% | 54.07% | 22.24% | 23.69% |
| Memobase | 14.55% | 25.88% | 92.24% (5443) | 32.29% (17081) | 80.78% | 25.13% | 5.20% | 0.55% | 94.25% | 35.33% | 29.97% | 34.71% |
#### 🔸 HaluMem-Long Dataset
*Columns R↑–F1↑ cover Memory Extraction; the Upd. columns cover Memory Updating; the QA columns cover Question Answering.*

| System | R↑ | Weighted R↑ | Target P↑ | Acc.↑ | FMR↑ | F1↑ | Upd. C↑ | Upd. H↓ | Upd. O↓ | QA C↑ | QA H↓ | QA O↓ |
| ----------- | ------ | ------ | -------------- | -------------- | ------ | ------ | ------ | ----- | ------ | ------ | ------ | ------ |
| MemOS | 81.90% | 89.56% | 82.32% (48246) | 43.77% (99462) | 28.85% | 82.11% | 65.25% | 0.29% | 34.47% | 64.44% | 16.61% | 18.95% |
| Supermemory | 53.02% | 70.73% | 85.82% (24483) | 29.71% (77134) | 36.86% | 65.54% | 17.01% | 0.58% | 82.42% | 53.77% | 22.21% | 24.02% |
| Zep | - | - | - | - | - | - | 37.35% | 0.48% | 62.14% | 50.19% | 22.51% | 27.30% |
| Memobase | 6.18% | 14.68% | 88.56% (3077) | 25.61% (11795) | 85.39% | 11.55% | 4.10% | 0.36% | 95.38% | 33.60% | 29.46% | 36.96% |
| Mem0-Graph | 2.24% | 10.76% | 87.32% (785) | 41.26% (1866) | 88.36% | 4.36% | 1.47% | 0.04% | 98.40% | 32.44% | 21.82% | 45.74% |
| Mem0 | 3.23% | 11.89% | 88.01% (1134) | 46.01% (2433) | 87.65% | 6.22% | 1.45% | 0.03% | 98.51% | 28.11% | 17.29% | 54.60% |
> **⚠️ Note on Zep:** Since Zep does not provide a *Get Dialogue Memory API*, metrics related to memory extraction cannot be computed.
### 🎯 2. Typewise Accuracy: Event, Persona, and Relationship
This section reports the **extraction accuracy** of each memory system across three specific memory categories: **Event**, **Persona**, and **Relationship**.
The statistics include all memory points derived from both the *Memory Extraction* and *Memory Updating* tasks, excluding distractor memories.
#### 🔹 HaluMem-Medium Dataset
| System | Event | Persona | Relationship |
| ----------- | ------- | ------- | ------------ |
| MemOS | 63.41% | 59.77% | 62.40% |
| Zep | 44.83%* | 49.75%* | 38.81%* |
| Mem0 | 29.69% | 33.74% | 27.77% |
| Mem0-Graph | 30.02% | 33.71% | 26.60% |
| Supermemory | 28.66% | 32.11% | 20.67% |
| Memobase | 5.12% | 13.38% | 6.79% |
#### 🔸 HaluMem-Long Dataset
| System | Event | Persona | Relationship |
| ----------- | ------- | ------- | ------------ |
| MemOS | 70.92% | 68.35% | 71.68% |
| Supermemory | 38.48% | 40.85% | 32.61% |
| Zep | 35.76%* | 39.07%* | 31.16%* |
| Memobase | 4.09% | 5.32% | 4.21% |
| Mem0 | 0.92% | 3.01% | 2.18% |
| Mem0-Graph | 1.10% | 2.00% | 1.59% |
> **⚠️ Note:** The memory entries for **Zep** include only those derived from the *Memory Updating* task.
### ❓ 3. Performance Across Question Types
Evaluation of memory system performance across different types of questions.
*Fig 3. Radar Chart.*
#### 🔹 HaluMem-Medium Dataset
| System | Basic Fact Recall | Dynamic Update | Multi-hop Inference | Generalization & Application | Memory Conflict | Memory Boundary |
| ----------- | ----------------- | -------------- | ------------------- | ---------------------------- | --------------- | --------------- |
| MemOS | 67.02% | 45.00% | 41.92% | 45.17% | 86.35% | 80.43% |
| Zep | 47.81% | 38.05% | 28.68% | 32.50% | 77.05% | 77.05% |
| Supermemory | 43.68% | 31.42% | 29.07% | 33.75% | 82.26% | 70.17% |
| Mem0-Graph | 44.81% | 28.32% | 22.87% | 34.38% | 71.22% | 84.66% |
| Mem0 | 40.18% | 22.57% | 24.42% | 33.12% | 70.22% | 85.02% |
| Memobase | 21.03% | 18.58% | 13.18% | 18.82% | 38.09% | 73.79% |
#### 🔸 HaluMem-Long Dataset
| System | Basic Fact Recall | Dynamic Update | Multi-hop Inference | Generalization & Application | Memory Conflict | Memory Boundary |
| ----------- | ----------------- | -------------- | ------------------- | ---------------------------- | --------------- | --------------- |
| MemOS | 65.42% | 47.22% | 38.38% | 45.58% | 84.40% | 71.98% |
| Supermemory | 43.30% | 29.65% | 31.01% | 37.26% | 78.66% | 69.20% |
| Zep | 40.18% | 29.65% | 18.60% | 29.36% | 67.87% | 78.14% |
| Memobase | 18.52% | 7.52% | 15.12% | 17.57% | 29.40% | 80.56% |
| Mem0-Graph | 11.89% | 17.78% | 11.11% | 16.48% | 28.01% | 82.50% |
| Mem0 | 7.76% | 8.41% | 6.20% | 10.04% | 18.86% | 86.35% |
### ⏱️ 4. Time Consumption & Latency
This section reports the time consumption of all memory systems during the evaluation process for **dialogue addition** and **memory retrieval**, as well as their **total runtime**.
#### 🔹 HaluMem-Medium Dataset
| System | Dialogue Addition Time (min) | Memory Retrieval Time (min) | Total Time (min) |
| ----------- | ---------------------------- | --------------------------- | ---------------- |
| Supermemory | 273.21 | 95.53 | 368.74 |
| Memobase | 293.30 | 139.95 | 433.25 |
| MemOS | 1028.84 | 20.52 | 1049.37 |
| Mem0 | 2768.14 | 41.66 | 2809.80 |
| Mem0-Graph | 2840.07 | 54.65 | 2894.72 |
| Zep | - | 53.34 | - |
#### 🔸 HaluMem-Long Dataset
| System | Dialogue Addition Time (min) | Memory Retrieval Time (min) | Total Time (min) |
| ----------- | ---------------------------- | --------------------------- | ---------------- |
| Memobase | 239.29 | 136.19 | 375.48 |
| Mem0 | 691.62 | 39.15 | 730.77 |
| Mem0-Graph | 870.32 | 62.42 | 932.74 |
| MemOS | 1524.39 | 20.96 | 1545.34 |
| Supermemory | 1672.53 | 137.02 | 1809.55 |
| Zep | - | 50.22 | - |
> **⚠️ Note on Zep:** Latency results for dialogue addition are unavailable because Zep lacks a synchronous *Add Dialogue API*, and processing time cannot be measured accurately through its asynchronous interface.
------
## 💻 Usage & Resources
### ⚙️ Evaluation Code
The **HaluMem** benchmark includes a complete evaluation suite located in the [`eval/`](./eval) directory.
It supports **multiple memory systems** and provides standardized pipelines for testing hallucination resistance and memory performance.
#### 🚀 Quick Start
1. **Navigate to the evaluation directory**
```bash
cd eval
```
2. **Install dependencies**
```bash
poetry install --with eval
```
3. **Configure environment variables**
Copy `.env-example` to `.env`, then fill in the required API keys and runtime parameters.
```bash
cp .env-example .env
```
4. **Run evaluation (example: Mem0 system)**
```bash
# Step 1: Extract memories and perform QA retrieval
python eval_memzero.py
# Step 2: Evaluate memory extraction, update, and QA tasks
python evaluation.py --frame memzero --version default
```
* For the **Graph** version of Mem0, use `eval_memzero_graph.py`.
* For **MemOS**, use `eval_memos.py`.
* Other supported systems follow the same naming pattern.
5. **View results**
All evaluation outputs (task scores, FMR, aggregated metrics) are saved in the `results/` directory.
For full command details, configuration options, and examples, see [`eval/README.md`](./eval/README.md).
---
### 📦 Dataset Access
The complete **HaluMem dataset** is publicly available on **Hugging Face**:
🔗 [https://huggingface.co/datasets/IAAR-Shanghai/HaluMem](https://huggingface.co/datasets/IAAR-Shanghai/HaluMem)
Available versions:
* `Halu-Medium` — multi-turn dialogues with moderate context (~160k tokens per user)
* `Halu-Long` — extended 1M-token context with distractor interference
-----
> [!TIP]
>
> 🧩 **Recommended Workflow**
>
> 1. Download the dataset from Hugging Face.
> 2. Configure evaluation parameters in `eval/.env`.
> 3. Run evaluation scripts to compute metrics for your memory system.
> 4. Check results in the `results/` folder and compare across models.
>
> For reproducibility and further setup, refer to [`eval/README.md`](./eval/README.md).
-----
## 📚 Dataset Overview
HaluMem consists of two dataset versions:
| Dataset | #Users | #Dialogues | Avg. Sessions/User | Avg. Context Length | #Memory Points | #QA Pairs |
| --------------- | ------ | ---------- | ------------------ | ------------------- | -------------- | --------- |
| **Halu-Medium** | 20 | 30,073 | 70 | ~160k tokens | 14,948 | 3,467 |
| **Halu-Long** | 20 | 53,516 | 120 | ~1M tokens | 14,948 | 3,467 |
- **Halu-Medium** provides multi-turn human-AI dialogue sessions for evaluating memory hallucinations in standard-length contexts.
- **Halu-Long** extends context length to **1M tokens** per user, introducing large-scale **interference and distractor content** (e.g., factual QA and math problems) to assess robustness and hallucination resistance.
------
## 🧱 Dataset Structure
Each user’s data is stored as a **JSON object** containing:
| Field | Description |
| -------------- | ------------------------------------------------------------ |
| `uuid` | Unique user identifier |
| `persona_info` | Persona profile including background, traits, goals, and motivations |
| `sessions` | List of multi-turn conversational sessions |
Each `session` includes:
| Field | Description |
| ------------------------ | ----------------------------------------------------- |
| `start_time`, `end_time` | Session timestamps |
| `dialogue_turn_num` | Total turns in the dialogue |
| `dialogue` | Sequence of utterances between `user` and `assistant` |
| `memory_points` | List of extracted memory elements from the session |
| `questions` | QA pairs used for memory reasoning and evaluation |
| `dialogue_token_length` | Tokenized length of the full dialogue |
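As a quick illustration of this layout, the sketch below parses a minimal synthetic user record (an abbreviated stand-in following the field tables above, not real benchmark data) and tallies memory points and QA pairs across sessions:

```python
# Sketch: iterate a HaluMem-style user record and count memory points /
# QA pairs. The record below is a minimal synthetic stand-in that follows
# the schema described above, not actual benchmark data.
import json

user_json = """
{
  "uuid": "u-0001",
  "persona_info": {"name": "Martin Mark"},
  "sessions": [
    {
      "start_time": "Dec 15, 2025, 06:11:23",
      "end_time": "Dec 15, 2025, 08:41:23",
      "dialogue_turn_num": 1,
      "dialogue": [{"role": "user", "content": "placeholder", "dialogue_turn": 0}],
      "memory_points": [{"index": 1, "memory_type": "Event Memory"}],
      "questions": [{"question": "placeholder", "answer": "placeholder"}],
      "dialogue_token_length": 42
    }
  ]
}
"""

user = json.loads(user_json)
n_memories = sum(len(s["memory_points"]) for s in user["sessions"])
n_questions = sum(len(s["questions"]) for s in user["sessions"])
print(user["uuid"], n_memories, n_questions)  # u-0001 1 1
```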
#### Memory Point Structure
Each memory point captures a **specific fact or event** derived from dialogue.
| Field | Description |
| ------------------- | ------------------------------------------------------------ |
| `index` | Memory ID within the session |
| `memory_content` | Text description of the memory |
| `memory_type` | Type (e.g., *Persona Memory*, *Event Memory*, *Relationship Memory*) |
| `memory_source` | Origin: `primary`, `secondary`, `interference`, or `system` |
| `is_update` | Indicates if it modifies an existing memory |
| `original_memories` | Previous related memories (if updated) |
| `importance` | Relative salience score (0–1) |
| `timestamp` | Time of creation or update |
#### Memory Point Example
```json
{
"index": 1,
"memory_content": "Martin Mark is considering a career change due to the impact of his current role on his mental health.",
"memory_type": "Event Memory",
"memory_source": "secondary",
"is_update": "True",
"original_memories": [
"Martin Mark is considering a career change due to health impacts from his current role."
],
"timestamp": "Dec 15, 2025, 08:41:23",
"importance": 0.75
}
```
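Because update-type memory points reference what they supersede through `original_memories`, older versions can be retired when reconstructing a user's current memory state. A minimal sketch, assuming (as in the example above) that `original_memories` entries match the superseded `memory_content` strings exactly:

```python
# Sketch: apply update-type memory points so only the latest version of
# each memory survives. Assumes `original_memories` holds the exact
# `memory_content` strings being superseded (an assumption, not a spec).

def apply_updates(points: list[dict]) -> set[str]:
    active: set[str] = set()
    for p in points:
        if str(p.get("is_update")) == "True":
            for old in p.get("original_memories", []):
                active.discard(old)       # retire the superseded memory
        active.add(p["memory_content"])   # store the new / updated memory
    return active

points = [
    {"memory_content": "Martin works at Huaxin Consulting", "is_update": "False"},
    {"memory_content": "Martin is considering a career change",
     "is_update": "True",
     "original_memories": ["Martin works at Huaxin Consulting"]},
]
print(apply_updates(points))  # {'Martin is considering a career change'}
```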
#### Dialogue Structure
Each dialogue turn includes:
```json
[
{
"role": "user",
"content": "I've been reflecting a lot on my career lately, especially how my current role as a director at Huaxin Consulting is impacting my mental health. It's becoming increasingly clear that I need to make a change.",
"timestamp": "Dec 15, 2025, 06:11:23",
"dialogue_turn": 0
},
{
"role": "assistant",
"content": "It's great that you're taking the time to reflect on your career, Martin. Recognizing the impact on your mental health is a crucial step. Balancing professional responsibilities with health is essential, especially given your commitment to improving healthcare access globally. Have you considered how a career change might not only address your health concerns but also align with your humanitarian goals and personal well-being?",
"timestamp": "Dec 15, 2025, 06:11:23",
"dialogue_turn": 0
}
]
```
#### Question–Answer Structure
Each question tests **memory retrieval**, **reasoning**, or **hallucination control**:
```json
{
"question": "What type of new physical activity might Martin be interested in trying after April 10, 2026?",
"answer": "Other extreme sports.",
"evidence": [
{
"memory_content": "Martin has developed a newfound appreciation for extreme sports...",
"memory_type": "Persona Memory"
}
],
"difficulty": "medium",
"question_type": "Generalization & Application"
}
```
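As a rough illustration of how such a record could drive scoring, the sketch below classifies a system response as correct, omission, or hallucination against the gold `answer`. The substring check is a hypothetical stand-in, not the benchmark's actual judging procedure:

```python
# Toy grader for a HaluMem-style QA record. Classifies a system response
# as correct, omission (no answer), or hallucination (wrong answer).
# The substring check is illustrative only; it is not how the benchmark
# actually judges answers.

def grade(response: str, gold_answer: str) -> str:
    if not response.strip():
        return "omission"
    if gold_answer.lower().rstrip(".") in response.lower():
        return "correct"
    return "hallucination"

gold = "Other extreme sports."
print(grade("He might try other extreme sports next.", gold))  # correct
print(grade("", gold))                                         # omission
print(grade("He plans to take up chess.", gold))               # hallucination
```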
------
## 🧬 Dataset Construction Process
The **HaluMem dataset** was built through a **six-stage, carefully controlled pipeline** that combines **programmatic generation**, **LLM-assisted refinement**, and **human validation** to ensure realism, consistency, and reliability.
*Fig 4. Framework of the HaluMem Construction Pipeline.*
1. **🧑💼 Stage 1: Persona Construction**
Each dataset user begins with a richly detailed **virtual persona** consisting of three layers — *core profile information* (e.g., demographics, education, goals), *dynamic state information* (e.g., occupation, health, relationships), and *preference information* (e.g., food, music, hobbies).
Personas were initially generated via rule-based templates seeded from **Persona Hub (1B+ personas)** and then refined using **GPT-4o**, ensuring logical coherence and natural diversity.
2. **📈 Stage 2: Life Skeleton Planning**
A structured **life skeleton** defines each user’s evolving timeline, linking major career milestones and life events to the progression of dynamic and preference states.
Controlled probabilistic mechanisms ensure realistic variation and coherent event evolution, forming a narrative blueprint for downstream data generation.
3. **📜 Stage 3: Event Flow Generation**
The abstract life skeleton is converted into a **chronological event flow**, including:
- **Init Events** — derived from initial persona profiles
- **Career Events** — multi-stage professional or health-related developments
- **Daily Events** — lifestyle or preference changes
Together, these events form each user’s **memory timeline**, providing a consistent and interpretable narrative structure.
4. **🧠 Stage 4: Session Summaries & Memory Points**
Each event is transformed into a **session summary** simulating a human–AI interaction. From these summaries, **structured memory points** are extracted, categorized into *Persona*, *Event*, and *Relationship* memories.
Update-type memories maintain traceability by linking to their replaced versions, ensuring temporal consistency.
5. **💬 Stage 5: Multi-turn Session Generation**
The summaries are expanded into **full dialogues** containing **adversarial distractor memories** — subtly incorrect facts introduced by the AI to simulate hallucination challenges.
Additional irrelevant QAs are inserted to increase contextual complexity without altering original memories, mimicking real-world long-context noise.
6. **❓ Stage 6: Question Generation**
Based on the sessions and memory points, **six types of evaluation questions** are automatically generated, covering both factual recall and reasoning tasks.
Each question includes difficulty level, reasoning type, and direct evidence links to the supporting memory points.
7. **🧾 Human Annotation & Quality Verification**
A team of 8 annotators manually reviewed over **50% of Halu-Medium**, scoring each session’s memory points and QA pairs on **correctness**, **relevance**, and **consistency**.
Results demonstrate high data quality:
- ✅ **Accuracy:** 95.70%
- 📎 **Relevance:** 9.58 / 10
- 🔁 **Consistency:** 9.45 / 10
------
> [!NOTE]
>
> 🧩 **In Summary:**
> HaluMem provides a **comprehensive and standardized benchmark** for investigating hallucinations in memory systems.
> By covering **core memory operations**, scaling **context length**, and introducing **distractor interference**, it establishes a robust foundation for **systematic hallucination research** in large language model memory architectures.
------
## Citation
```
@misc{chen2025halumemevaluatinghallucinationsmemory,
title={HaluMem: Evaluating Hallucinations in Memory Systems of Agents},
author={Ding Chen and Simin Niu and Kehang Li and Peng Liu and Xiangping Zheng and Bo Tang and Xinchi Li and Feiyu Xiong and Zhiyu Li},
year={2025},
eprint={2511.03506},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2511.03506},
}
```