{"id":34744024,"url":"https://github.com/vectorize-io/hindsight-benchmarks","last_synced_at":"2026-05-28T07:32:26.437Z","repository":{"id":328935128,"uuid":"1107563330","full_name":"vectorize-io/hindsight-benchmarks","owner":"vectorize-io","description":"Hindsight Benchmarks Results","archived":false,"fork":false,"pushed_at":"2026-03-19T20:09:10.000Z","size":71841,"stargazers_count":2,"open_issues_count":1,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-20T11:20:37.874Z","etag":null,"topics":["agentic-ai","ai-agents","memory"],"latest_commit_sha":null,"homepage":"https://benchmarks.hindsight.vectorize.io/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vectorize-io.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-12-01T10:05:36.000Z","updated_at":"2026-03-19T20:09:15.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/vectorize-io/hindsight-benchmarks","commit_stats":null,"previous_names":["vectorize-io/hindsight-benchmarks"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/vectorize-io/hindsight-benchmarks","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vectorize-io%2Fhindsight-benchmarks","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vectorize-io%2Fhindsight-benchmarks/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vect
orize-io%2Fhindsight-benchmarks/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vectorize-io%2Fhindsight-benchmarks/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/vectorize-io","download_url":"https://codeload.github.com/vectorize-io/hindsight-benchmarks/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vectorize-io%2Fhindsight-benchmarks/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33599465,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-28T02:00:06.440Z","response_time":99,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agentic-ai","ai-agents","memory"],"created_at":"2025-12-25T04:29:01.871Z","updated_at":"2026-05-28T07:32:26.431Z","avatar_url":"https://github.com/vectorize-io.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Hindsight Benchmarks\n\nThis repository contains:\n- **Industry Benchmarks**: LongMemEval and LoComo benchmark results for the Hindsight memory system\n- **Model Leaderboard**: Comparative performance metrics for LLMs on fact extraction tasks\n- **Visualization Tools**: Interactive web interface to explore results\n\n## Repository Structure\n\n```\nhindsight-benchmarks/\n├── benchmark-runner/          # Python CLI tools for 
benchmarking\n│   ├── src/hindsight_benchmark/  # Fast benchmark (speed/cost/reliability)\n│   ├── quality_benchmark/        # Quality benchmark (accuracy via Hindsight)\n│   │   ├── run_quality_benchmark.py\n│   │   ├── locomo_quality.json\n│   │   └── README.md\n│   ├── datasets/\n│   ├── benchmark_models.json\n│   └── pyproject.toml\n├── visualizer/               # Next.js web application\n│   ├── app/\n│   │   ├── page.tsx          # Home page with both sections\n│   │   ├── longmemeval/      # Industry benchmark pages\n│   │   ├── locomo/          # Industry benchmark pages\n│   │   └── leaderboard/     # Model leaderboard pages\n│   ├── components/\n│   └── lib/\n└── results/                  # Shared results directory\n    ├── longmemeval.json.gz   # Industry benchmark\n    ├── locomo.json.gz        # Industry benchmark\n    ├── model-results/        # Fast benchmark results\n    └── quality/              # Quality benchmark results\n```\n\n## Quick Start\n\n### Viewing Results\n\n```bash\ncd visualizer\nnpm install\nnpm run dev\n# Open http://localhost:9998\n```\n\n### Running Model Benchmarks\n\n```bash\ncd benchmark-runner\nuv run hindsight-benchmark --dataset simple\n# Results saved to ../results/model-results/\n```\n\n---\n\n## Model Leaderboard\n\n### Overview\n\nThe Model Leaderboard compares LLM performance using two complementary benchmarks:\n\n#### 1. Fast Benchmark (Speed, Cost, Reliability)\nDirect model testing for operational metrics:\n- **Speed (25%)**: Response latency and throughput\n- **Cost (20%)**: Pricing per million tokens\n- **Reliability (15%)**: Schema conformance rate\n\n#### 2. Quality Benchmark (Accuracy)\nTests model performance within Hindsight using LoComo conversations:\n- **Quality (40%)**: Answer accuracy on conversation recall tasks\n- Measures real-world memory system performance\n- Runs through Hindsight API to test model in context\n\n### Fast Benchmark\n\nModels are tested on 20 diverse conversation scenarios. 
Each test requires:\n- Extracting structured facts from a conversation\n- Returning results in a specific JSON schema format\n- Valid JSON output with correct schema structure\n\n**Running:**\n```bash\ncd benchmark-runner\nuv run hindsight-benchmark --dataset simple\n# Results saved to ../results/model-results/\n```\n\nModel configurations are defined in `benchmark-runner/benchmark_models.json`.\n\n### Quality Benchmark\n\nMeasures accuracy by running a LoComo conversation through Hindsight with the model configured.\n\n**Prerequisites:**\n- Hindsight API running with the model to test\n- hindsight-client Python package installed\n- OpenAI API key for LLM judge (recommended)\n\n**Running:**\n```bash\ncd benchmark-runner/quality_benchmark\n\n# Start Hindsight with your model\ncd /path/to/hindsight-wt1\nexport HINDSIGHT_API_LLM_MODEL=gpt-4o-mini\npython -m hindsight_api\n\n# Run quality benchmark\ncd /path/to/hindsight-benchmarks/benchmark-runner/quality_benchmark\npython run_quality_benchmark.py \\\n    --api-url http://localhost:8888 \\\n    --model-id gpt-4o-mini \\\n    --provider-id openai \\\n    --judge-api-key $OPENAI_API_KEY\n```\n\nSee [`benchmark-runner/quality_benchmark/README.md`](benchmark-runner/quality_benchmark/README.md) for detailed instructions.\n\n### Viewing Results\n\nNavigate to `/leaderboard` in the visualizer to see:\n- Interactive sortable table of all models\n- Score breakdowns by dimension\n- Non-viable models section\n- Detailed metrics for each model\n\n---\n\nExplore the results yourself on the [Benchmarks Visualizer](https://benchmarks.hindsight.vectorize.io/)\n\n\u003ca href='https://benchmarks.hindsight.vectorize.io/'\u003e\u003cimg width=\"500\" height=\"350\" alt=\"Screenshot 2025-12-16 at 10 31 02\" src=\"https://github.com/user-attachments/assets/151e5a37-419f-4804-a0a3-6284964b62d9\" /\u003e\u003c/a\u003e\n\n\n## LongMemEval\n\n### Overview\n\nLongMemEval is a comprehensive benchmark designed to evaluate long-term memory 
capabilities in conversational AI systems. It tests the system's ability to retrieve and reason about information across multiple conversation sessions.\n\n**Explore the Dataset:** You can explore the LongMemEval dataset using the [LongMemEval Inspector](https://nicoloboschi.github.io/longmemeval-inspector/inspector.html).\n\n### State-of-the-Art Comparison\n\nThe table below shows performance across different memory systems on the LongMemEval benchmark (S setting, 500 questions):\n\n| Method | Single-session User | Single-session Assistant | Single-session Preference | Knowledge Update | Temporal Reasoning | Multi-session | Overall |\n|--------|---------------------|--------------------------|---------------------------|------------------|--------------------|---------------|---------|\n| Full-context (GPT-4o) | 81.4% | 94.6% | 20.0% | 78.2% | 45.1% | 44.3% | 60.2% |\n| Full-context (OSS-20B) | 38.6% | 80.4% | 20.0% | 60.3% | 31.6% | 21.1% | 39.0% |\n| Zep (GPT-4o) | 92.9% | 80.4% | 56.7% | 83.3% | 62.4% | 57.9% | 71.2% |\n| Supermemory (GPT-4o) | 97.1% | 96.4% | 70.0% | 88.5% | 76.7% | 71.4% | 81.6% |\n| Supermemory (GPT-5) | 97.1% | 100.0% | 76.7% | 87.2% | 81.2% | 75.2% | 84.6% |\n| Supermemory (Gemini-3) | 98.6% | 98.2% | 70.0% | 89.7% | 82.0% | 76.7% | 85.2% |\n| Hindsight (OSS-20B) | 95.7% | 94.6% | 66.7% | 84.6% | 79.7% | 79.7% | 83.6% |\n| Hindsight (OSS-120B) | **100.0%** | 98.2% | **86.7%** | 92.3% | 85.7% | 81.2% | 89.0% |\n| **Hindsight (Gemini-3)** | 97.1% | 96.4% | 80.0% | **94.9%** | **91.0%** | **87.2%** | **91.4%** |\n\n**Key Highlights:**\n- **Hindsight with Gemini-3 Pro achieves 91.4% overall accuracy**, the best result across all systems and model backbones\n- **Hindsight with OSS-120B achieves 89.0%**, outperforming Supermemory with GPT-4o (81.6%) and GPT-5 (84.6%)\n- **+44.6 percentage point improvement**: Hindsight with OSS-20B (83.6%) vs Full-context OSS-20B baseline (39.0%) demonstrates that the memory architecture, not model size, drives 
performance\n- The largest gains appear in long-horizon categories: multi-session improves from 21.1% to 79.7%, temporal reasoning from 31.6% to 79.7%\n- Even with a smaller open-source 20B model, Hindsight (83.6%) surpasses both Full-context GPT-4o (60.2%) and Supermemory with GPT-4o (81.6%)\n\n**Cost Efficiency:** Costs stay low thanks to token reduction techniques in the Retain pipeline and **LLM-free memory recalls** - retrieving memories incurs zero LLM cost, so recall operations in production add no LLM spend.\n\n**Infrastructure:** Local MacBook with PostgreSQL - no specialized cloud infrastructure required\n\n### Reproducibility\n\nTo reproduce these results, visit the main Hindsight repository:\n\n**[github.com/vectorize-io/hindsight](https://github.com/vectorize-io/hindsight)**\n\nFollow the benchmark instructions in the repository documentation.\n\n---\n\n## LoComo Benchmark Results\n\n### Overview\n\nLoComo (Long Conversation Memory) is a benchmark designed to test memory systems on long, multi-turn conversations with questions requiring recall of specific details from earlier in the dialogue.\n\n### State-of-the-Art Comparison\n\nThe table below shows accuracy (%) by question type and overall for prior memory systems and Hindsight with different backbone models:\n\n| Method | Single-Hop | Multi-Hop | Open Domain | Temporal | Overall |\n|--------|------------|-----------|-------------|----------|---------|\n| Backboard | 89.36 | 75.00 | 91.20 | 91.90 | 90.00 |\n| Memobase (v0.0.37) | 70.92 | 46.88 | 77.17 | 85.05 | 75.78 |\n| Zep | 74.11 | 66.04 | 67.71 | 79.79 | 75.14 |\n| Mem0-Graph | 65.71 | 47.19 | 75.71 | 58.13 | 68.44 |\n| Mem0 | 67.13 | 51.15 | 72.93 | 55.51 | 66.88 |\n| LangMem | 62.23 | 47.92 | 71.12 | 23.43 | 58.10 |\n| OpenAI | 63.79 | 42.92 | 62.29 | 21.71 | 52.90 |\n| Hindsight (OSS-20B) | 74.11 | 64.58 | 90.96 | 76.32 | 83.18 |\n| Hindsight (OSS-120B) | 76.79 | 62.50 | 93.68 | 79.44 | 85.67 |\n| **Hindsight (Gemini-3)** 
| **86.17** | **70.83** | **95.12** | **83.80** | **89.61** |\n\n**Key Highlights:**\n- Across all backbone sizes, Hindsight outperforms prior open memory systems such as Memobase, Zep, Mem0, and LangMem in overall accuracy\n- **Hindsight raises overall accuracy from 75.78% (Memobase) to 83.18%** with OSS-20B and **85.67%** with OSS-120B\n- **Hindsight with Gemini-3 attains 89.61% overall accuracy** and the highest Open Domain score (95.12%), closely matching Backboard's 90.00% overall performance\n- These results demonstrate that the gains from Hindsight's memory architecture on LongMemEval transfer to realistic, multi-session human conversations\n\n**Note:** We skipped the **Adversarial category** because its questions are too subjective and ambiguous to evaluate reliably.\n\n### Important Note on Benchmark Validity\n\nWhile Hindsight achieves solid performance on LoComo, **we do not consider this benchmark to be a reliable indicator of memory system quality** due to significant flaws in the dataset design and evaluation methodology.\n\n**Known Issues with LoComo:**\n\n1. **Missing and Flawed Ground Truth** - Some categories have missing ground truth answers, speaker attribution errors, and inconsistencies in what is marked as correct\n2. **Ambiguous Questions** - Many questions have multiple valid interpretations and are not specific enough to have a single correct answer\n3. **Insufficient Challenge** - Conversations are short (16k-26k tokens) and fit within modern LLM context windows, so they do not genuinely test memory retrieval capabilities\n4. **Limited Evaluation Scope** - Lacks critical tests for knowledge updates and temporal reasoning that are essential for real-world memory systems\n5. 
**Data Quality Issues** - Multimodal errors (image references without descriptions), poor conversation design, and unrealistic dialogue patterns\n\n**References:**\n\n- [https://blog.getzep.com/lies-damn-lies-statistics-is-mem0-really-sota-in-agent-memory/]\n- [https://www.kdjingpai.com/en/ai-zhinengtijiyiban/]\n\nFor these reasons, we recommend focusing on **LongMemEval** as a more reliable indicator of memory system performance. LongMemEval provides better-quality ground truth, more realistic conversation scenarios, and a broader evaluation of memory capabilities.\n\n### Reproducibility\n\nTo reproduce these results, visit the main Hindsight repository:\n\n**[github.com/vectorize-io/hindsight](https://github.com/vectorize-io/hindsight)**\n\n---\n\n## Exploring Results\n\nTo visualize the benchmark results:\n\n```bash\ncd visualizer\nnpm install\nnpm run dev\n```\n\nThen open http://localhost:9998 in your browser.\n\nThe visualizer provides:\n- 📊 Interactive benchmark overview with category breakdowns\n- 🔍 Advanced filtering (by category, correctness, item ID)\n- 📝 Detailed question-level analysis with reasoning and retrieved memories\n- 🎯 Beautiful, responsive UI built with Next.js and Tailwind CSS\n\nFor deployment options and more details, see [visualizer/README.md](./visualizer/README.md).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvectorize-io%2Fhindsight-benchmarks","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvectorize-io%2Fhindsight-benchmarks","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvectorize-io%2Fhindsight-benchmarks/lists"}