https://github.com/vectorize-io/hindsight-benchmarks
Hindsight Benchmarks Results
- Host: GitHub
- URL: https://github.com/vectorize-io/hindsight-benchmarks
- Owner: vectorize-io
- Created: 2025-12-01T10:05:36.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2026-03-19T20:09:10.000Z (3 months ago)
- Last Synced: 2026-03-20T11:20:37.874Z (3 months ago)
- Topics: agentic-ai, ai-agents, memory
- Language: Python
- Homepage: https://benchmarks.hindsight.vectorize.io/
- Size: 68.5 MB
- Stars: 2
- Watchers: 0
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
# Hindsight Benchmarks
This repository contains:
- **Industry Benchmarks**: LongMemEval and LoComo benchmark results for the Hindsight memory system
- **Model Leaderboard**: Comparative performance metrics for LLMs on fact extraction tasks
- **Visualization Tools**: Interactive web interface to explore results
## Repository Structure
```
hindsight-benchmarks/
├── benchmark-runner/              # Python CLI tools for benchmarking
│   ├── src/hindsight_benchmark/   # Fast benchmark (speed/cost/reliability)
│   ├── quality_benchmark/         # Quality benchmark (accuracy via Hindsight)
│   │   ├── run_quality_benchmark.py
│   │   ├── locomo_quality.json
│   │   └── README.md
│   ├── datasets/
│   ├── benchmark_models.json
│   └── pyproject.toml
├── visualizer/                    # Next.js web application
│   ├── app/
│   │   ├── page.tsx               # Home page with both sections
│   │   ├── longmemeval/           # Industry benchmark pages
│   │   ├── locomo/                # Industry benchmark pages
│   │   └── leaderboard/           # Model leaderboard pages
│   ├── components/
│   └── lib/
└── results/                       # Shared results directory
    ├── longmemeval.json.gz        # Industry benchmark
    ├── locomo.json.gz             # Industry benchmark
    ├── model-results/             # Fast benchmark results
    └── quality/                   # Quality benchmark results
```
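The industry benchmark results are stored as gzip-compressed JSON, so they can be inspected without running the visualizer. The sketch below uses only the Python standard library; the `load_results` helper is illustrative, and the internal layout of the JSON is not documented here, so the example only reports the top-level type rather than assuming any field names.

```python
import gzip
import json
from pathlib import Path


def load_results(path: Path):
    """Decompress and parse a gzip-compressed JSON results file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    # Path assumes the script runs from the repository root.
    results_path = Path("results/longmemeval.json.gz")
    if results_path.exists():
        data = load_results(results_path)
        # Inspect the top-level shape before relying on any field names.
        print(type(data).__name__)
```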
## Quick Start
### Viewing Results
```bash
cd visualizer
npm install
npm run dev
# Open http://localhost:9998
```
### Running Model Benchmarks
```bash
cd benchmark-runner
uv run hindsight-benchmark --dataset simple
# Results saved to ../results/model-results/
```
---
## Model Leaderboard
### Overview
The Model Leaderboard compares LLM performance using two complementary benchmarks:
#### 1. Fast Benchmark (Speed, Cost, Reliability)
Direct model testing for operational metrics:
- **Speed (25%)**: Response latency and throughput
- **Cost (20%)**: Pricing per million tokens
- **Reliability (15%)**: Schema conformance rate
#### 2. Quality Benchmark (Accuracy)
Tests model performance within Hindsight using LoComo conversations:
- **Quality (40%)**: Answer accuracy on conversation recall tasks
- Measures real-world memory system performance
- Runs through Hindsight API to test model in context
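The four weights above sum to 100%, so a leaderboard score can be read as a weighted average of the four dimensions. The sketch below assumes each dimension has already been normalized to a 0-1 score where higher is better; that normalization, the `WEIGHTS` dict, and the `composite_score` helper are illustrative assumptions, not the repository's actual scoring code.

```python
# Weights taken from the leaderboard description; they sum to 1.0.
WEIGHTS = {"quality": 0.40, "speed": 0.25, "cost": 0.20, "reliability": 0.15}


def composite_score(scores: dict[str, float]) -> float:
    """Weighted average of per-dimension scores, each assumed to be in [0, 1]."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical model: accurate and reliable, but slow and fairly expensive.
example = {"quality": 0.9, "speed": 0.4, "cost": 0.5, "reliability": 1.0}
print(round(composite_score(example), 3))
```

Because quality carries the largest weight, a model that is fast and cheap but inaccurate cannot rank highly under this kind of scheme.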
### Fast Benchmark
Models are tested on 20 diverse conversation scenarios. Each test requires:
- Extracting structured facts from a conversation
- Returning results in a specific JSON schema format
- Producing valid JSON output with the correct schema structure
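A schema conformance rate of this kind can be computed by parsing each model output and checking its structure. The sketch below is a minimal illustration using only the standard library; the fact schema (`subject`, `predicate`, `object`) and the helper names are hypothetical and are not the schema used by `hindsight-benchmark`.

```python
import json

# Hypothetical schema: {"facts": [{"subject": str, "predicate": str, "object": str}, ...]}
REQUIRED_FACT_KEYS = {"subject", "predicate", "object"}


def conforms(raw_output: str) -> bool:
    """Return True if raw_output is valid JSON matching the hypothetical schema."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    facts = data.get("facts")
    if not isinstance(facts, list):
        return False
    return all(
        isinstance(fact, dict)
        and REQUIRED_FACT_KEYS.issubset(fact)
        and all(isinstance(fact[key], str) for key in REQUIRED_FACT_KEYS)
        for fact in facts
    )


def conformance_rate(outputs: list[str]) -> float:
    """Fraction of model outputs that pass the schema check."""
    return sum(conforms(o) for o in outputs) / len(outputs) if outputs else 0.0


outputs = [
    '{"facts": [{"subject": "Ana", "predicate": "lives in", "object": "Lisbon"}]}',
    '{"facts": "Ana lives in Lisbon"}',  # valid JSON, wrong structure
    "Ana lives in Lisbon",               # not JSON at all
]
print(conformance_rate(outputs))
```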
**Running:**
```bash
cd benchmark-runner
uv run hindsight-benchmark --dataset simple
# Results saved to ../results/model-results/
```
Model configurations are defined in `benchmark-runner/benchmark_models.json`.
### Quality Benchmark
Measures accuracy by running a LoComo conversation through Hindsight with the model configured.
**Prerequisites:**
- Hindsight API running with the model to test
- hindsight-client Python package installed
- OpenAI API key for LLM judge (recommended)
**Running:**
```bash
# Terminal 1: start Hindsight with the model under test (the server keeps running)
cd /path/to/hindsight-wt1
export HINDSIGHT_API_LLM_MODEL=gpt-4o-mini
python -m hindsight_api

# Terminal 2: run the quality benchmark against the running API
cd /path/to/hindsight-benchmarks/benchmark-runner/quality_benchmark
python run_quality_benchmark.py \
  --api-url http://localhost:8888 \
  --model-id gpt-4o-mini \
  --provider-id openai \
  --judge-api-key "$OPENAI_API_KEY"
```
See [`benchmark-runner/quality_benchmark/README.md`](benchmark-runner/quality_benchmark/README.md) for detailed instructions.
### Viewing Results
Navigate to `/leaderboard` in the visualizer to see:
- Interactive sortable table of all models
- Score breakdowns by dimension
- Non-viable models section
- Detailed metrics for each model
---
Explore the results yourself on the [Benchmarks Visualizer](https://benchmarks.hindsight.vectorize.io/)
## LongMemEval
### Overview
LongMemEval is a comprehensive benchmark designed to evaluate long-term memory capabilities in conversational AI systems. It tests the system's ability to retrieve and reason about information across multiple conversation sessions.
**Explore the Dataset:** You can explore the LongMemEval dataset using the [LongMemEval Inspector](https://nicoloboschi.github.io/longmemeval-inspector/inspector.html).
### State-of-the-Art Comparison
The table below shows performance across different memory systems on the LongMemEval benchmark (S setting, 500 questions):
| Method | Single-session User | Single-session Assistant | Single-session Preference | Knowledge Update | Temporal Reasoning | Multi-session | Overall |
|--------|---------------------|--------------------------|---------------------------|------------------|--------------------|---------------|---------|
| Full-context (GPT-4o) | 81.4% | 94.6% | 20.0% | 78.2% | 45.1% | 44.3% | 60.2% |
| Full-context (OSS-20B) | 38.6% | 80.4% | 20.0% | 60.3% | 31.6% | 21.1% | 39.0% |
| Zep (GPT-4o) | 92.9% | 80.4% | 56.7% | 83.3% | 62.4% | 57.9% | 71.2% |
| Supermemory (GPT-4o) | 97.1% | 96.4% | 70.0% | 88.5% | 76.7% | 71.4% | 81.6% |
| Supermemory (GPT-5) | 97.1% | 100.0% | 76.7% | 87.2% | 81.2% | 75.2% | 84.6% |
| Supermemory (Gemini-3) | 98.6% | 98.2% | 70.0% | 89.7% | 82.0% | 76.7% | 85.2% |
| Hindsight (OSS-20B) | 95.7% | 94.6% | 66.7% | 84.6% | 79.7% | 79.7% | 83.6% |
| Hindsight (OSS-120B) | **100.0%** | 98.2% | **86.7%** | 92.3% | 85.7% | 81.2% | 89.0% |
| **Hindsight (Gemini-3)** | 97.1% | 96.4% | 80.0% | **94.9%** | **91.0%** | **87.2%** | **91.4%** |
**Key Highlights:**
- **Hindsight with Gemini-3 Pro achieves 91.4% overall accuracy**, the best result across all systems and model backbones
- **Hindsight with OSS-120B achieves 89.0%**, outperforming Supermemory with GPT-4o (81.6%) and GPT-5 (84.6%)
- **+44.6 percentage point improvement**: Hindsight with OSS-20B (83.6%) vs Full-context OSS-20B baseline (39.0%) demonstrates that the memory architecture, not model size, drives performance
- The largest gains appear in long-horizon categories: multi-session improves from 21.1% to 79.7%, temporal reasoning from 31.6% to 79.7%
- Even with a smaller open-source 20B model, Hindsight (83.6%) surpasses Full-context GPT-4o (60.2%) and edges past Supermemory with GPT-4o (81.6%)
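The percentage-point gains quoted in these highlights can be checked directly against the table. The short sketch below copies the OSS-20B figures from the table and subtracts the Full-context baseline from the Hindsight result; the variable names are illustrative.

```python
# Accuracy (%) copied from the LongMemEval table.
full_context_oss_20b = {"Temporal Reasoning": 31.6, "Multi-session": 21.1, "Overall": 39.0}
hindsight_oss_20b = {"Temporal Reasoning": 79.7, "Multi-session": 79.7, "Overall": 83.6}

# Percentage-point gain = memory-system accuracy minus baseline accuracy.
gains = {
    category: round(hindsight_oss_20b[category] - full_context_oss_20b[category], 1)
    for category in full_context_oss_20b
}

for category, gain in gains.items():
    print(f"{category}: +{gain} points")
```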
**Cost Efficiency:** Costs stay low thanks to token reduction techniques in the Retain pipeline and **LLM-free memory recalls**: retrieving memories makes no LLM calls, so recall operations add no LLM cost in production.
**Infrastructure:** A local MacBook with PostgreSQL; no specialized cloud infrastructure is required.
### Reproducibility
To reproduce these results, visit the main Hindsight repository:
**[github.com/vectorize-io/hindsight](https://github.com/vectorize-io/hindsight)**
Follow the benchmark instructions in the repository documentation.
---
## LoComo Benchmark Results
### Overview
LoComo (Long Conversation Memory) is a benchmark designed to test memory systems on long, multi-turn conversations with questions requiring recall of specific details from earlier in the dialogue.
### State-of-the-Art Comparison
The table below shows accuracy (%) by question type and overall for prior memory systems and Hindsight with different backbone models:
| Method | Single-Hop | Multi-Hop | Open Domain | Temporal | Overall |
|--------|------------|-----------|-------------|----------|---------|
| Backboard | 89.36 | 75.00 | 91.20 | 91.90 | 90.00 |
| Memobase (v0.0.37) | 70.92 | 46.88 | 77.17 | 85.05 | 75.78 |
| Zep | 74.11 | 66.04 | 67.71 | 79.79 | 75.14 |
| Mem0-Graph | 65.71 | 47.19 | 75.71 | 58.13 | 68.44 |
| Mem0 | 67.13 | 51.15 | 72.93 | 55.51 | 66.88 |
| LangMem | 62.23 | 47.92 | 71.12 | 23.43 | 58.10 |
| OpenAI | 63.79 | 42.92 | 62.29 | 21.71 | 52.90 |
| Hindsight (OSS-20B) | 74.11 | 64.58 | 90.96 | 76.32 | 83.18 |
| Hindsight (OSS-120B) | 76.79 | 62.50 | 93.68 | 79.44 | 85.67 |
| **Hindsight (Gemini-3)** | 86.17 | 70.83 | **95.12** | 83.80 | 89.61 |
**Key Highlights:**
- Across all backbone sizes, Hindsight outperforms prior open memory systems such as Memobase, Zep, Mem0, and LangMem in overall accuracy
- **Hindsight raises overall accuracy from 75.78% (Memobase) to 83.18%** with OSS-20B and **85.67%** with OSS-120B
- **Hindsight with Gemini-3 attains 89.61% overall accuracy** and the highest Open Domain score (95.12%), closely matching Backboard's 90.00% overall performance
- These results suggest that the gains from Hindsight's memory architecture on LongMemEval transfer to realistic, multi-session human conversations
**Note:** We skipped the **Adversarial category** as it is almost impossible to evaluate reliably due to the subjective and ambiguous nature of the questions in that category.
### Important Note on Benchmark Validity
While Hindsight achieves solid performance on LoComo, **we do not consider this benchmark to be a reliable indicator of memory system quality** due to significant flaws in the dataset design and evaluation methodology.
**Known Issues with LoComo:**
1. **Missing and Flawed Ground Truth** - Some categories have missing ground truth answers, speaker attribution errors, and inconsistencies in what is marked as correct
2. **Ambiguous Questions** - Many questions have multiple valid interpretations and lack sufficient specificity to have a single correct answer
3. **Insufficient Challenge** - Conversations are short (16k-26k tokens) and fit within modern LLM context windows, so they do not genuinely test memory retrieval capabilities
4. **Limited Evaluation Scope** - Lacks critical tests for knowledge updates and temporal reasoning that are essential for real-world memory systems
5. **Data Quality Issues** - Multimodal errors (image references without descriptions), poor conversation design, and unrealistic dialogue patterns
**References:**
- <https://blog.getzep.com/lies-damn-lies-statistics-is-mem0-really-sota-in-agent-memory/>
- <https://www.kdjingpai.com/en/ai-zhinengtijiyiban/>
For these reasons, we recommend focusing on **LongMemEval** as a more reliable indicator of memory system performance. LongMemEval provides better-quality ground truth, more realistic conversation scenarios, and a broader evaluation of memory capabilities.
### Reproducibility
To reproduce these results, visit the main Hindsight repository:
**[github.com/vectorize-io/hindsight](https://github.com/vectorize-io/hindsight)**
---
## Exploring Results
To visualize the benchmark results:
```bash
cd visualizer
npm install
npm run dev
```
Then open http://localhost:9998 in your browser.
The visualizer provides:
- 📊 Interactive benchmark overview with category breakdowns
- 🔍 Advanced filtering (by category, correctness, item ID)
- 📝 Detailed question-level analysis with reasoning and retrieved memories
- 🎯 Beautiful, responsive UI built with Next.js and Tailwind CSS
For deployment options and more details, see [visualizer/README.md](./visualizer/README.md).