https://github.com/mixpeek/multimodal-benchmarks
Open evaluation suite for multimodal retrieval systems with benchmarks for financial documents, medical devices, and educational content
- Host: GitHub
- URL: https://github.com/mixpeek/multimodal-benchmarks
- Owner: mixpeek
- Created: 2025-12-05T17:49:12.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-12-10T16:08:21.000Z (6 months ago)
- Last Synced: 2025-12-26T10:28:32.123Z (6 months ago)
- Topics: benchmark, document-retrieval, embeddings, evaluation, hybrid-search, information-retrieval, multimodal-retrieval, nlp, ocr, rag, semantic-search, table-extraction, vector-search
- Language: Python
- Homepage: https://mxp.co/benchmarks
- Size: 1.33 MB
- Stars: 2
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md

# Multimodal Benchmarks
The open evaluation suite for multimodal retrieval systems.
Standard datasets, queries, and relevance judgments for benchmarking retrieval across video, image, audio, and document modalities, particularly in regulated and high-stakes domains.
## Quick Start
Choose your benchmark and get started in 60 seconds:
| Benchmark | Domain | Learn More | Leaderboard |
|-----------|--------|------------|-------------|
| **[Financial Documents](finance/)** | SEC filings, earnings reports | **[mxp.co/finance](https://mxp.co/finance)** | [View →](finance/LEADERBOARD.md) |
| **[Medical Devices](device/)** | IFUs, regulatory docs | **[mxp.co/device](https://mxp.co/device)** | [View →](device/LEADERBOARD.md) |
| **[Curriculum Search](learning/)** | Educational videos, lectures | **[mxp.co/learning](https://mxp.co/learning)** | [View →](learning/LEADERBOARD.md) |
### Run Any Benchmark
```bash
# Run from the repository root; each subshell leaves your working directory unchanged
# Finance benchmark
(cd finance && python run.py --quick)
# Medical device benchmark
(cd device && python run.py --quick)
# Curriculum benchmark
(cd learning && python run.py --quick)
```
Each benchmark runs in about 1 second on the demo data. See [QUICKSTART.md](QUICKSTART.md) for the full guide.
## Why This Exists
Most retrieval benchmarks assume text-only search on clean web data. Real-world multimodal retrieval is harder:
- **Medical device IFUs** with nested tables, diagrams, and regulatory language
- **SEC filings** with embedded charts, footnotes, and cross-references
- **Educational videos** requiring temporal understanding and code-lecture alignment
- **Regulatory documents** spanning technical specs, clinical data, and safety reports
This repo provides ground-truth evaluation sets for these verticals, so you can measure what actually matters.
## Benchmarks Overview
All benchmarks are **available now** and include sample queries with human-annotated relevance judgments.
| Benchmark | Best NDCG@10 | Status | Documentation |
|-----------|--------------|--------|---------------|
| **[Finance](finance/)** | 0.78 | ✅ Available | [README](finance/README.md) · [Leaderboard](finance/LEADERBOARD.md) |
| **[Device](device/)** | 0.78 | ✅ Available | [README](device/README.md) · [Leaderboard](device/LEADERBOARD.md) |
| **[Learning](learning/)** | 0.84 | ✅ Available | [README](learning/README.md) · [Leaderboard](learning/LEADERBOARD.md) |
## Structure
```
benchmarks/
├── shared/                 # Shared utilities
│   ├── metrics.py          # Standard evaluation metrics
│   ├── evaluator.py        # Benchmark runner
│   └── __init__.py
│
├── finance/                # Financial document benchmark
│   ├── run.py              # Main benchmark script
│   ├── README.md           # Full documentation
│   ├── LEADERBOARD.md      # Results leaderboard
│   └── results/            # Benchmark results
│
├── device/                 # Medical device benchmark
│   ├── run.py
│   ├── README.md
│   ├── LEADERBOARD.md
│   └── results/
│
└── learning/               # Curriculum search benchmark
    ├── run.py
    ├── README.md
    ├── LEADERBOARD.md
    └── results/
```
## Quick Start
### 1. Install Dependencies
```bash
# Install shared dependencies
pip install numpy
```
### 2. Run a Benchmark
```bash
# Run with demo data (no setup required)
(cd finance && python run.py --quick)
# Run with your own data
(cd finance && python run.py --data-dir /path/to/documents)
```
### 3. Evaluate Your Retriever
All benchmarks use a standard interface:
```python
from shared import BenchmarkEvaluator, Query, RelevanceJudgment

# Your retrieval function
def my_retriever(query: str) -> list[str]:
    # Returns ranked list of document IDs
    ...

# Create evaluator
evaluator = BenchmarkEvaluator(
    name="my-system",
    retriever_fn=my_retriever,
    k_values=[5, 10, 20],
)

# Run benchmark
queries = [...]  # Load your queries
judgments = [...]  # Load ground truth
report = evaluator.run(queries, judgments)

# Print results
evaluator.print_summary(report)
evaluator.save_report(report, "results.json")
```
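As a rough illustration of the retriever contract (a function that takes a query string and returns a ranked list of document IDs), here is a minimal term-overlap retriever. The `CORPUS` dictionary and its document IDs are made up for this sketch and are not part of the benchmark data; a real system would query an index or embedding store instead.

```python
from collections import Counter

# Hypothetical in-memory corpus (document ID -> text), for illustration only.
CORPUS = {
    "doc-1": "quarterly revenue grew on strong cloud sales",
    "doc-2": "device sterilization instructions and safety warnings",
    "doc-3": "lecture on gradient descent with code walkthrough",
}


def my_retriever(query: str) -> list[str]:
    """Rank document IDs by how often the query terms occur in each document."""
    terms = query.lower().split()
    scores: dict[str, int] = {}
    for doc_id, text in CORPUS.items():
        counts = Counter(text.lower().split())
        score = sum(counts[term] for term in terms)
        if score > 0:
            scores[doc_id] = score
    # Highest score first; ties are broken by document ID so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    print(my_retriever("safety warnings for device"))
```

Any callable with this signature should be usable as `retriever_fn`, assuming the evaluator only needs the ranked IDs.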
## Standard Metrics
All benchmarks use consistent evaluation metrics:
- **NDCG@k** - Ranking quality (primary metric)
- **Recall@k** - Coverage of relevant documents
- **MRR** - Position of first relevant result
- **Precision@k** - Accuracy at cutoff
- **MAP** - Mean Average Precision
- **Latency (p95)** - 95th percentile response time
Detailed metric definitions are in [shared/metrics.py](shared/metrics.py).
## Leaderboards
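For readers unfamiliar with the ranking metrics listed above, the following sketch shows their textbook per-query definitions. It is not the contents of `shared/metrics.py`, and the function names are chosen for this example. It assumes linear gain for NDCG (some implementations use `2**rel - 1` instead) and a ranked list without duplicate IDs. MRR and MAP are the means of the reciprocal rank and the average precision over all queries.

```python
import math


def dcg_at_k(gains: list[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions (log2 discount)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked: list[str], relevance: dict[str, float], k: int) -> float:
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked]
    ideal = sorted(relevance.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents found in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)


def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top k positions holding a relevant document."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked: list[str], relevant: set[str]) -> float:
    """Mean of precision values at each rank where a relevant document appears."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


if __name__ == "__main__":
    # Toy example: graded relevance for one query (IDs are illustrative).
    relevance = {"b": 2.0, "d": 1.0, "e": 1.0}
    relevant = set(relevance)
    ranked = ["a", "b", "c", "d"]
    print("NDCG@3:", ndcg_at_k(ranked, relevance, 3))
    print("Recall@3:", recall_at_k(ranked, relevant, 3))
    print("RR:", reciprocal_rank(ranked, relevant))
    print("AP:", average_precision(ranked, relevant))
```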
Each benchmark maintains its own leaderboard:
- **[Financial Documents →](finance/LEADERBOARD.md)** - Best: 0.78 NDCG@10
- **[Medical Devices →](device/LEADERBOARD.md)** - Best: 0.78 NDCG@10
- **[Curriculum Search →](learning/LEADERBOARD.md)** - Best: 0.84 NDCG@10
### Submit Your Results
Beat the baseline? Submit your results:
1. Run benchmark: `cd finance && python run.py`
2. Results in: `finance/results/benchmark_results.json`
3. Open PR with results + system description
4. Appear on the leaderboard!
See individual benchmark READMEs for detailed submission instructions.
## Documentation
- **[Quick Start Guide](QUICKSTART.md)** - Get started in 60 seconds
- **[Finance Benchmark](finance/README.md)** - SEC filings, financial docs
- **[Device Benchmark](device/README.md)** - Medical device IFUs, regulatory docs
- **[Learning Benchmark](learning/README.md)** - Educational videos, lectures
## Contributing a Benchmark
We welcome contributions from researchers and practitioners working on vertical-specific retrieval.
### Requirements
1. **Minimum 100 queries** with relevance judgments
2. **Clear licensing** for underlying data
3. **Reproducible baseline** using at least one open retriever
4. **Documentation** describing the domain and evaluation protocol
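Before opening a PR, it may help to sanity-check the first requirement locally. The sketch below assumes a hypothetical layout in which queries are a dict of query ID to text and judgments are `(query_id, doc_id, relevance)` tuples. The repository's actual data format may differ, so treat `validate_benchmark` as an illustration, not a provided tool.

```python
def validate_benchmark(
    queries: dict[str, str],
    judgments: list[tuple[str, str, int]],
    min_queries: int = 100,
) -> list[str]:
    """Return a list of human-readable problems; an empty list means the checks pass."""
    problems: list[str] = []
    judged_ids = {query_id for query_id, _doc_id, _rel in judgments}

    # Requirement: a minimum number of queries.
    if len(queries) < min_queries:
        problems.append(
            f"only {len(queries)} queries; at least {min_queries} required"
        )

    # Every query should have at least one relevance judgment.
    unjudged = sorted(set(queries) - judged_ids)
    if unjudged:
        problems.append(f"{len(unjudged)} queries have no relevance judgments")

    # Judgments that point at unknown queries usually indicate an ID mismatch.
    orphaned = sorted(judged_ids - set(queries))
    if orphaned:
        problems.append(f"{len(orphaned)} judged query IDs are not in the query set")

    return problems


if __name__ == "__main__":
    demo_queries = {f"q{i}": "example query" for i in range(100)}
    demo_judgments = [(f"q{i}", "doc-1", 1) for i in range(100)]
    print(validate_benchmark(demo_queries, demo_judgments) or "all checks passed")
```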
### Submission Process
1. Fork this repo
2. Add your benchmark under a new directory
3. Include all required files (see structure above)
4. Open a PR with benchmark description
See [CONTRIBUTING.md](CONTRIBUTING.md) for full guidelines.
## Citation
If you use these benchmarks in your research:
```bibtex
@misc{mixpeek-multimodal-benchmarks,
title={Multimodal Benchmarks: Evaluation Suite for Vertical Retrieval Systems},
author={Mixpeek},
year={2025},
url={https://github.com/mixpeek/multimodal-benchmarks}
}
```
## License
Benchmark code: MIT License
Datasets: Individual licensing per benchmark (see each benchmark's `LICENSE` file)
---
Built by [Mixpeek](https://mixpeek.com): Multimodal AI infrastructure for regulated industries.