{"id":34694066,"url":"https://github.com/mixpeek/multimodal-benchmarks","last_synced_at":"2026-03-12T22:33:13.756Z","repository":{"id":327788010,"uuid":"1110777898","full_name":"mixpeek/multimodal-benchmarks","owner":"mixpeek","description":"Open evaluation suite for multimodal retrieval systems with benchmarks for financial documents, medical devices, and educational content","archived":false,"fork":false,"pushed_at":"2025-12-10T16:08:21.000Z","size":1391,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-12-26T10:28:32.123Z","etag":null,"topics":["benchmark","document-retrieval","embeddings","evaluation","hybrid-search","information-retrieval","multimodal-retrieval","nlp","ocr","rag","semantic-search","table-extraction","vector-search"],"latest_commit_sha":null,"homepage":"https://mxp.co/benchmarks","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mixpeek.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-12-05T17:49:12.000Z","updated_at":"2025-12-10T16:08:25.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/mixpeek/multimodal-benchmarks","commit_stats":null,"previous_names":["mixpeek/multimodal-benchmarks"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/mixpeek/multimodal-benchmarks","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mixpeek%2Fmultimodal-benchmarks","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mixpeek%2Fmultimodal-benchmarks/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mixpeek%2Fmultimodal-benchmarks/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mixpeek%2Fmultimodal-benchmarks/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mixpeek","download_url":"https://codeload.github.com/mixpeek/multimodal-benchmarks/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mixpeek%2Fmultimodal-benchmarks/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30446445,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-12T21:31:01.033Z","status":"ssl_error","status_checked_at":"2026-03-12T21:30:43.161Z","response_time":114,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmark","document-retrieval","embeddings","evaluation","hybrid-search","information-retrieval","multimodal-retrieval","nlp","ocr","rag","semantic-search","table-extraction","vector-search"],"created_at":"2025-12-24T22:13:53.219Z","updated_at":"2026-03-12T22:33:13.749Z","avatar_url":"https://github.com/mixpeek.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"![Mixpeek Benchmarks](assets/header.png)\n\n# Multimodal Benchmarks\n\nThe open evaluation suite for multimodal retrieval systems.\n\nStandard datasets, queries, and relevance judgments for benchmarking retrieval across video, image, audio, and document modalities—particularly in regulated and high-stakes domains.\n\n## 🎯 Quick Start\n\nChoose your benchmark and get started in 60 seconds:\n\n| Benchmark | Domain | Learn More | Leaderboard |\n|-----------|--------|------------|-------------|\n| **[Financial Documents](finance/)** | SEC filings, earnings reports | **[mxp.co/finance](https://mxp.co/finance)** | [View →](finance/LEADERBOARD.md) |\n| **[Medical Devices](device/)** | IFUs, regulatory docs | **[mxp.co/device](https://mxp.co/device)** | [View →](device/LEADERBOARD.md) |\n| **[Curriculum Search](learning/)** | Educational videos, lectures | **[mxp.co/learning](https://mxp.co/learning)** | [View →](learning/LEADERBOARD.md) |\n\n### Run Any Benchmark\n\n```bash\n# Finance benchmark\ncd finance \u0026\u0026 python run.py --quick\n\n# Medical device benchmark\ncd device \u0026\u0026 python run.py --quick\n\n# Curriculum benchmark\ncd learning \u0026\u0026 python run.py --quick\n```\n\nEach runs in ~1 second with demo data. See [QUICKSTART.md](QUICKSTART.md) for full guide.\n\n## Why This Exists\n\nMost retrieval benchmarks assume text-only search on clean web data. Real-world multimodal retrieval is harder:\n\n- **Medical device IFUs** with nested tables, diagrams, and regulatory language\n- **SEC filings** with embedded charts, footnotes, and cross-references\n- **Educational videos** requiring temporal understanding and code-lecture alignment\n- **Regulatory documents** spanning technical specs, clinical data, and safety reports\n\nThis repo provides ground-truth evaluation sets for these verticals—so you can measure what actually matters.\n\n## 📊 Benchmarks Overview\n\nAll benchmarks are **available now** and include sample queries with human-annotated relevance judgments.\n\n| Benchmark | Best NDCG@10 | Status | Documentation |\n|-----------|--------------|--------|---------------|\n| **[Finance](finance/)** | 0.78 | ✅ Available | [README](finance/README.md) · [Leaderboard](finance/LEADERBOARD.md) |\n| **[Device](device/)** | 0.78 | ✅ Available | [README](device/README.md) · [Leaderboard](device/LEADERBOARD.md) |\n| **[Learning](learning/)** | 0.84 | ✅ Available | [README](learning/README.md) · [Leaderboard](learning/LEADERBOARD.md) |\n\n## 📁 Structure\n\n```\nbenchmarks/\n├── shared/                      # Shared utilities\n│   ├── metrics.py              # Standard evaluation metrics\n│   ├── evaluator.py            # Benchmark runner\n│   └── __init__.py\n│\n├── finance/                     # Financial document benchmark\n│   ├── run.py                  # Main benchmark script\n│   ├── README.md               # Full documentation\n│   ├── LEADERBOARD.md          # Results leaderboard\n│   └── results/                # Benchmark results\n│\n├── device/                      # Medical device benchmark\n│   ├── run.py\n│   ├── README.md\n│   ├── LEADERBOARD.md\n│   └── results/\n│\n└── learning/                    # Curriculum search benchmark\n    ├── run.py\n    ├── README.md\n    ├── LEADERBOARD.md\n    └── results/\n```\n\n## 🚀 Quick Start\n\n### 1. Install Dependencies\n\n```bash\n# Install shared dependencies\npip install numpy\n```\n\n### 2. Run a Benchmark\n\n```bash\n# Run with demo data (no setup required)\ncd finance \u0026\u0026 python run.py --quick\n\n# Run with your own data\ncd finance \u0026\u0026 python run.py --data-dir /path/to/documents\n```\n\n### 3. Evaluate Your Retriever\n\nAll benchmarks use a standard interface:\n\n```python\nfrom shared import BenchmarkEvaluator, Query, RelevanceJudgment\n\n# Your retrieval function\ndef my_retriever(query: str) -\u003e list[str]:\n    # Returns ranked list of document IDs\n    ...\n\n# Create evaluator\nevaluator = BenchmarkEvaluator(\n    name=\"my-system\",\n    retriever_fn=my_retriever,\n    k_values=[5, 10, 20]\n)\n\n# Run benchmark\nqueries = [...]  # Load your queries\njudgments = [...]  # Load ground truth\nreport = evaluator.run(queries, judgments)\n\n# Print results\nevaluator.print_summary(report)\nevaluator.save_report(report, \"results.json\")\n```\n\n## 📏 Standard Metrics\n\nAll benchmarks use consistent evaluation metrics:\n\n- **NDCG@k** - Ranking quality (primary metric)\n- **Recall@k** - Coverage of relevant documents\n- **MRR** - Position of first relevant result\n- **Precision@k** - Accuracy at cutoff\n- **MAP** - Mean Average Precision\n- **Latency (p95)** - 95th percentile response time\n\nDetailed metric definitions in [shared/metrics.py](shared/metrics.py)\n\n## 🏆 Leaderboards\n\nEach benchmark maintains its own leaderboard:\n\n- **[Financial Documents →](finance/LEADERBOARD.md)** - Best: 0.78 NDCG@10\n- **[Medical Devices →](device/LEADERBOARD.md)** - Best: 0.78 NDCG@10\n- **[Curriculum Search →](learning/LEADERBOARD.md)** - Best: 0.84 NDCG@10\n\n### Submit Your Results\n\nBeat the baseline? Submit your results:\n\n1. Run benchmark: `cd finance \u0026\u0026 python run.py`\n2. Results in: `finance/results/benchmark_results.json`\n3. Open PR with results + system description\n4. Appear on the leaderboard!\n\nSee individual benchmark READMEs for detailed submission instructions.\n\n## 📚 Documentation\n\n- **[Quick Start Guide](QUICKSTART.md)** - Get started in 60 seconds\n- **[Finance Benchmark](finance/README.md)** - SEC filings, financial docs\n- **[Device Benchmark](device/README.md)** - Medical device IFUs, regulatory docs\n- **[Learning Benchmark](learning/README.md)** - Educational videos, lectures\n\n## Contributing a Benchmark\n\nWe welcome contributions from researchers and practitioners working on vertical-specific retrieval.\n\n### Requirements\n\n1. **Minimum 100 queries** with relevance judgments\n2. **Clear licensing** for underlying data\n3. **Reproducible baseline** using at least one open retriever\n4. **Documentation** describing the domain and evaluation protocol\n\n### Submission Process\n\n1. Fork this repo\n2. Add your benchmark under a new directory\n3. Include all required files (see structure above)\n4. Open a PR with benchmark description\n\nSee [CONTRIBUTING.md](CONTRIBUTING.md) for full guidelines.\n\n## Citation\n\nIf you use these benchmarks in your research:\n\n```bibtex\n@misc{mixpeek-multimodal-benchmarks,\n  title={Multimodal Benchmarks: Evaluation Suite for Vertical Retrieval Systems},\n  author={Mixpeek},\n  year={2025},\n  url={https://github.com/mixpeek/multimodal-benchmarks}\n}\n```\n\n## License\n\nBenchmark code: MIT License\n\nDatasets: Individual licensing per benchmark (see each benchmark's `LICENSE` file)\n\n---\n\nBuilt by [Mixpeek](https://mixpeek.com) — Multimodal AI infrastructure for regulated industries.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmixpeek%2Fmultimodal-benchmarks","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmixpeek%2Fmultimodal-benchmarks","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmixpeek%2Fmultimodal-benchmarks/lists"}