{"id":50632285,"url":"https://github.com/SalesforceAIResearch/LiveResearchBench","last_synced_at":"2026-06-23T19:00:44.014Z","repository":{"id":324109903,"uuid":"1078486332","full_name":"SalesforceAIResearch/LiveResearchBench","owner":"SalesforceAIResearch","description":"A live benchmark and evaluation framework for open-ended deep research in the wild.","archived":false,"fork":false,"pushed_at":"2025-11-13T20:37:52.000Z","size":1974,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-11-13T22:22:53.810Z","etag":null,"topics":["agentic-ai","deep-research","deep-research-agent","evaluation-framework","multiagent-systems"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2510.14240","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SalesforceAIResearch.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.txt","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":"CODEOWNERS","security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-10-17T20:13:04.000Z","updated_at":"2025-11-13T20:37:56.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/SalesforceAIResearch/LiveResearchBench","commit_stats":null,"previous_names":["salesforceairesearch/liveresearchbench"],"tags_count":null,"template":false,"template_full_name":"SalesforceAIResearch/oss-template","purl":"pkg:github/SalesforceAIResearch/LiveResearchBench","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub
/repositories/SalesforceAIResearch%2FLiveResearchBench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SalesforceAIResearch%2FLiveResearchBench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SalesforceAIResearch%2FLiveResearchBench/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SalesforceAIResearch%2FLiveResearchBench/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/SalesforceAIResearch","download_url":"https://codeload.github.com/SalesforceAIResearch/LiveResearchBench/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SalesforceAIResearch%2FLiveResearchBench/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34702919,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-23T02:00:07.161Z","response_time":65,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agentic-ai","deep-research","deep-research-agent","evaluation-framework","multiagent-systems"],"created_at":"2026-06-06T23:00:22.905Z","updated_at":"2026-06-23T19:00:44.008Z","avatar_url":"https://github.com/SalesforceAIResearch.png","language":"Python","funding_links":[],"categories":["Scientific \u0026 Research Environments"],"sub_categories":[],"readme":"# LiveResearchBench\n\nThis is the codebase 
for [LiveResearchBench: A Live Benchmark for User-Centric Deep Research in the Wild](https://arxiv.org/abs/2510.14240), including both the **LiveResearchBench benchmark** and the **DeepEval evaluation framework** for assessing deep research agents.\nLiveResearchBench provides expert-curated, real-world tasks spanning daily life, enterprise, and academia, each requiring extensive, real-time web search, multi-source reasoning, and cross-domain synthesis.\nDeepEval offers human-aligned protocols for reliable, systematic evaluation of agentic systems on open-ended deep research tasks.\n\n## 📌 Quick Links\n[![Project Page](https://img.shields.io/badge/🌐_Project_Page-blue?style=for-the-badge)](https://livedeepresearch.github.io/)\n[![Paper](https://img.shields.io/badge/📖_Paper-red?style=for-the-badge)](https://arxiv.org/abs/2510.14240)\n[![Dataset](https://img.shields.io/badge/🤗_Dataset-green?style=for-the-badge)](https://huggingface.co/datasets/Salesforce/LiveResearchBench)\n\n## Updates\n- **[2025.10.31]** DeepEval is here! Our evaluation framework for deep research agents is now live.\n- **[2025.10.23]** LiveResearchBench dataset is now available on [Hugging Face](https://huggingface.co/datasets/Salesforce/LiveResearchBench)!\n- **[2025.10.16]** 📢 LiveResearchBench is officially out on [arXiv](https://arxiv.org/abs/2510.14240)!\n\n## 🔍 About LiveResearchBench\nDeep research, the production of comprehensive, citation-grounded reports by searching and synthesizing information from hundreds of live web sources, marks an important frontier for agentic systems. To rigorously evaluate this ability, four principles are essential: tasks should be (1) **user-centric**, reflecting realistic information needs, (2) **dynamic**, requiring up-to-date information beyond parametric knowledge, (3) **unambiguous**, ensuring consistent interpretation across users, and (4) **multi-faceted and search-intensive**, requiring search over numerous web sources and in-depth analysis. 
Existing benchmarks fall short of these principles, often focusing on narrow domains or posing ambiguous questions that hinder fair comparison. Guided by these principles, we introduce LiveResearchBench, a benchmark of 100 expert-curated tasks spanning daily life, enterprise, and academia, each requiring extensive, dynamic, real-time web search and synthesis. Built with over 1,500 hours of human labor, LiveResearchBench provides a rigorous basis for systematic evaluation. To evaluate citation-grounded long-form reports, we introduce **DeepEval**, a comprehensive suite covering both content- and report-level quality, including coverage, presentation, citation accuracy and association, consistency and depth of analysis. DeepEval integrates four complementary evaluation protocols, each designed to ensure stable assessment and high agreement with human judgments. Using LiveResearchBench and DeepEval, we conduct a comprehensive evaluation of 17 frontier deep research systems, including single-agent web search, single-agent deep research, and multi-agent systems. 
Our analysis reveals current strengths, recurring failure modes, and key system components needed to advance reliable, insightful deep research.\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"./imgs/task_domain_dist.png\" width=\"100%\"\u003e \u003cbr\u003e\n  Domain distribution and task coverage of LiveResearchBench.\n\u003c/p\u003e\n\n📖 **For more details on the dataset structure, fields, and usage**, see [**DATASET.md**](docs/DATASET.md).\n\n## 🔍 About DeepEval\n\nDeepEval evaluates research reports across diverse criteria using state-of-the-art LLM judges (LLM-as-a-judge):\n\n- **Presentation \u0026 Organization** (Checklist-based)\n- **Factual \u0026 Logical Consistency** (Pointwise-additive)\n- **Coverage \u0026 Comprehensiveness** (Checklist-based)\n- **Analysis Depth** (Pairwise comparison)\n- **Citation Association** (Pointwise-additive)\n\nEach criterion uses the evaluation protocol that our human alignment studies found most appropriate for it, ensuring high-quality, reliable assessments.\n\n## Quick Start\n\n### Installation\n\n```bash\n# Clone the repository\ngit clone https://github.com/SalesforceAIResearch/LiveResearchBench.git\ncd LiveResearchBench\n\n# Create virtual environment and install dependencies with uv\nuv venv\nsource .venv/bin/activate\nuv sync\n\n# Configure API keys\ncp .env.example .env\n# Edit .env and add your API keys:\n#   OPENAI_API_KEY=your-key\n#   GEMINI_API_KEY=your-key\n#   HF_TOKEN=your-hf-token\n```\n\n### Basic usage to evaluate long-form reports\n\n**1. 
Preprocess Reports** (Create a JSON index mapping queries to report file locations)\n\n**Expected input directory structure:**\n```\n/path/to/model_outputs/\n├── model_name_1/\n│   ├── qid_\u003cqid\u003e_report.md\n│   ├── qid_\u003cqid\u003e_report.md\n│   └── ...\n├── model_name_2/\n│   ├── qid_\u003cqid\u003e_report.md\n│   └── ...\n```\n\n```bash\n# Process all models in a directory (recommended: use --use-realtime for live benchmark queries)\npython preprocess.py /path/to/model_outputs --use-realtime\n\n# Or process specific models (subdirectories) only\npython preprocess.py /path/to/model_outputs -m gpt-5-search gemini-pro --use-realtime \n\n# With custom output directory\npython preprocess.py /path/to/model_outputs -o extracted_reports/ --use-realtime\n\n# Optional: Use static queries without placeholder replacement (not recommended for live evaluation)\npython preprocess.py /path/to/model_outputs\n```\n\n**Expected output structure:**\n\nAfter preprocessing, a JSON file will be created in `extracted_reports/` (or the specified output directory) with the naming pattern `reports_{timestamp}.json`.\n\nThe JSON structure includes:\n\n```python\n{\n  \"metadata\": {\n    \"timestamp\": \"20250101_120000\",              # When the preprocessing was performed\n    \"total_reports\": 300,                         # Total number of reports contained and processed\n    \"total_models\": 3,                           # Number of model outputs included\n    \"base_path\": \"/path/to/model_outputs\",       # Base path to report outputs directory\n    \"use_realtime\": true                         # Whether to replace query placeholders with real-time values\n  },\n  \"reports\": [\n    {\n      \"model_name\": \"model-name-1\",              # Model/system name (subdirectory name)\n      \"query_id\": \"abc123xyz\",                   # Query identifier from the benchmark\n      \"query\": \"Research query text...\",         # Query loaded and processed from 
LiveResearchBench dataset\n      \"report_file_path\": \"/path/to/model_outputs/model-name-1/qid_abc123xyz_report.md\"\n    },\n    {\n      \"model_name\": \"model-name-2\",\n      \"query_id\": \"def456uvw\",\n      \"query\": \"Another research query...\",\n      \"report_file_path\": \"/path/to/model_outputs/model-name-2/qid_def456uvw_report.md\"\n    }\n    # ... more report entries\n  ]\n}\n```\n\n**2. Grade Single File**\n\n```bash\n# Single criterion (--input is the JSON file created by preprocessing)\npython main.py \\\n    --input extracted_reports/reports_20250101_120000.json \\\n    --criteria presentation \\\n    --provider gemini --model gemini-2.5-pro \\\n    --verbose\n\n# Multiple criteria\npython main.py \\\n    --input extracted_reports/reports_20250101_120000.json \\\n    --criteria presentation,consistency,citation,coverage,depth \\\n    --provider openai --model gpt-5 \\\n    --verbose\n```\n\n## Evaluation Protocols\n\nBased on our human alignment study, we adopt three different protocols:\n\n### 1. Checklist-Based (Presentation, Coverage)\n- **Binary scoring**: 0 (fail) or 1 (pass) for each checklist item\n- **Presentation**: 10 fixed quality questions\n- **Coverage**: Custom checklist per query\n\n### 2. Pointwise/Additive (Consistency, Citation)\n- **Error counting**: Identify and count specific issues\n- **Score**: 10-100 based on number of issues found\n- **Consistency**: Count logical/factual contradictions\n- **Citation**: Count missing citations\n\n### 3. 
Pairwise Comparison (Depth)\n- **Side-by-side comparison**: Compare two reports directly\n- **5 dimensions**: Granularity, Insight, Critique, Evidence, Density\n- **Position-swap averaging**: Mitigates position bias by comparing in both directions\n\n## Directory Structure\n\n```\nLiveResearchBench/\n├── liveresearchbench/          # Main Python package\n│   ├── common/                 # Shared utilities\n│   ├── graders/                # Grading implementations\n│   ├── criteria/               # Criterion definitions\n│   └── batch_evaluator.py      # Batch grading orchestrator\n├── preprocess.py               # Preprocessing script\n├── main.py                     # Main grading script\n├── average_results.py          # Multi-provider averaging\n├── scripts/                    # Bash convenience scripts\n├── configs/                    # Configuration files\n├── data/                       # Data files (checklists, etc.)\n└── tests/                      # Test suite\n    ├── test_basic.py           # Unit tests\n    ├── test_mock_grading.py    # Integration tests\n    └── run_all_tests.sh        # Test runner\n```\n\n### Multi-Provider Grading \n\nFor more reliable results, grade with both GPT-5 and Gemini-2.5-Pro, then average.\n\n```bash\n# Step 1: Grade with OpenAI\npython main.py \\\n    --input extracted_reports/reports_20250101_120000.json \\\n    --criteria presentation,consistency,citation,coverage,depth \\\n    --provider openai --model gpt-5-2025-08-07 \\\n    --verbose\n\n# Step 2: Grade with Gemini\npython main.py \\\n    --input extracted_reports/reports_20250101_120000.json \\\n    --criteria presentation,consistency,citation,coverage,depth \\\n    --provider gemini --model gemini-2.5-pro \\\n    --verbose\n\n# Step 3: Average the summary files\npython average_results.py \\\n    --input-a results/reports_20250101_120000_graded_openai_gpt-5-2025-08-07/summary_*.json \\\n    --input-b 
results/reports_20250101_120000_graded_gemini_gemini-2.5-pro/summary_*.json \\\n    --output results/averaged/summary_multi_judge.json\n```\n\n### Resume Interrupted Runs\n\n**✨ Automatic Resume**: The system saves results incrementally to JSONL files as it grades. If your grading run is interrupted, simply **run the same command again**; it will automatically resume from where it left off.\n\n```bash\n# Original run (gets interrupted)\npython main.py \\\n    --input extracted_reports/my_reports.json \\\n    --criteria presentation,consistency,citation,coverage,depth \\\n    --provider gemini \\\n    --model gemini-2.5-pro\n\n# To resume: Just run the SAME command again!\npython main.py \\\n    --input extracted_reports/my_reports.json \\\n    --criteria presentation,consistency,citation,coverage,depth \\\n    --provider gemini \\\n    --model gemini-2.5-pro\n\n# The system will automatically:\n#   1. Load existing results from the incremental/ folder\n#   2. Skip already graded reports\n#   3. Continue grading only unfinished reports\n```\n\n**How It Works**: Each graded report is immediately saved to criterion-specific JSONL files in `results/{input}_graded_{provider}_{model}/incremental/`. 
When you re-run the command, these files are automatically detected and loaded.\n\n**Force Re-grade**: To ignore existing results and start fresh:\n```bash\npython main.py evaluate \\\n    --input extracted_reports/my_reports.json \\\n    --criteria presentation,coverage \\\n    --provider gemini \\\n    --force-regrade  # Ignores incremental saves\n```\n\n**Monitor Progress**: While grading is running, check real-time progress:\n```bash\n# Count completed reports\nwc -l results/*/incremental/*.jsonl\n\n# View latest result\ntail -1 results/my_reports_graded_gemini/incremental/presentation_results.jsonl | jq\n```\n\n### Filtering\n\nGrade specific subsets of reports:\n\n```bash\n# Filter by experiment name (modify JSON before grading)\n# Or process only specific JSON files in batch_config.yaml\n```\n\n## API Keys\n\nSet in `.env` file:\n\n```bash\n# OpenAI (for GPT-5)\nOPENAI_API_KEY=sk-...\n\n# Google Gemini\nGEMINI_API_KEY=AIza...\n```\n\n## Output Format\n\n### Output Directory Structure\n\nGrading results are organized by input file name, with incremental saves and final timestamped result files:\n\n```\nresults/\n└── reports_{json_file_name}_graded_{provider}_{model_name}/\n    ├── incremental/\n    │   ├── presentation_results.jsonl    # Real-time saves per criterion\n    │   ├── coverage_results.jsonl\n    │   ├── consistency_results.jsonl\n    │   ├── citation_results.jsonl\n    │   └── depth_results.jsonl\n    ├── summary_{evaluation_timestamp}.json          # Final summary stats\n    └── detailed_results_{evaluation_timestamp}.json # Complete results with all criteria\n```\n\n### Summary File (`summary.json`)\n\nContains aggregated statistics:\n\n```python\n{\n  \"metadata\": {\n    \"provider\": \"openai\",\n    \"model\": \"gpt-5-2025-08-07\",\n    \"graded_at\": \"2025-10-31T01:38:49.407295\",\n    \"total_reports\": 10,\n    \"criteria_evaluated\": [\"presentation\", \"coverage\", \"consistency\", \"citation\"]\n  },\n  \"results_by_model\": {\n    
\"model-name-1\": {\n      \"presentation\": {\n        \"mean\": 85.5,           # Average pass rate across all reports\n        \"count\": 5,             # Number of reports graded\n        \"min\": 70.0,\n        \"max\": 100.0\n      },\n      \"coverage\": { ... },\n      \"consistency\": { ... }\n    },\n    \"model-name-2\": { ... }\n  },\n  \"overall_results\": {\n    \"presentation\": {\n      \"mean\": 82.3,             # Average across all models/reports\n      \"count\": 10,\n      \"min\": 60.0,\n      \"max\": 100.0\n    },\n    // ... other criteria\n  }\n}\n```\n\n### Detailed Results File (`detailed_results.json`)\n\nEach report is augmented with grading results:\n\n```json\n{\n  \"reports\": [\n    {\n      \"query_id\": \"qid_123\",\n      \"query\": \"What is...\",\n      \"report_file_path\": \"/path/to/report.md\",\n      \n      \"presentation_grading_results\": {\n        \"provider\": \"gemini\",\n        \"model\": \"gemini-2.5-pro\",\n        \"graded_at\": \"2025-...\",\n        \"evaluations\": {\n          \"p1\": {\"score\": 1, \"justification\": \"...\"},\n          \"p2\": {\"score\": 0, \"justification\": \"...\"}\n        },\n        \"summary\": {\n          \"total_criteria\": 10,\n          \"passed_count\": 8,\n          \"average_pass_rate\": 80.0\n        }\n      },\n      \n      \"consistency_grading_results\": {\n        \"provider\": \"gemini\",\n        \"model\": \"gemini-2.5-pro\",\n        \"graded_at\": \"2025-...\",\n        \"specific_issues\": [\"Issue 1...\", \"Issue 2...\"],\n        \"total_issues\": 2,\n        \"score\": 90,\n        \"justification\": \"...\"\n      }\n    }\n  ]\n}\n```\n## License\n\nThe dataset is released for research purposes only under CC-BY-NC 4.0 and should not be used to develop models that compete with OpenAI. 
The evaluation code is released under Apache 2.0.\n\n## Citation\n\nIf you find LiveResearchBench helpful, please consider citing:\n\n```bibtex\n@article{sfr2025liveresearchbench,\n      title={LiveResearchBench: A Live Benchmark for User-Centric Deep Research in the Wild}, \n      author={Jiayu Wang and Yifei Ming and Riya Dulepet and Qinglin Chen and Austin Xu and Zixuan Ke and Frederic Sala and Aws Albarghouthi and Caiming Xiong and Shafiq Joty},\n  year={2025},\n  url={https://arxiv.org/abs/2510.14240}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FSalesforceAIResearch%2FLiveResearchBench","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FSalesforceAIResearch%2FLiveResearchBench","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FSalesforceAIResearch%2FLiveResearchBench/lists"}