https://github.com/mohsenhariri/scorio
Statistical evaluation, comparison, and ranking of Large Language Models
https://github.com/mohsenhariri/scorio
bayesian bayesian-inference evaluation large-language-models llm ranking statistics
Last synced: 4 months ago
JSON representation
Statistical evaluation, comparison, and ranking of Large Language Models
- Host: GitHub
- URL: https://github.com/mohsenhariri/scorio
- Owner: mohsenhariri
- License: other
- Created: 2025-10-01T21:24:41.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2026-02-09T00:37:07.000Z (4 months ago)
- Last Synced: 2026-02-09T04:20:28.850Z (4 months ago)
- Topics: bayesian, bayesian-inference, evaluation, large-language-models, llm, ranking, statistics
- Language: Python
- Homepage: https://mohsenhariri.github.io/scorio/
- Size: 292 KB
- Stars: 8
- Watchers: 0
- Forks: 1
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Citation: CITATION.cff
Awesome Lists containing this project
README
# Scorio
[](https://arxiv.org/abs/2510.04265)
[](https://arxiv.org/abs/2510.04265)
[](https://iclr.cc/virtual/2026/poster/10009669)
[](#license)
[](https://www.python.org/downloads/)
[](https://julialang.org/downloads/)
[](https://scorio.readthedocs.io/en/latest/)
[](https://mohsenhariri.github.io/scorio/julia/)
---
## News
- **February 2026** ✨: New paper released: ["Ranking Reasoning LLMs under Test-Time Scaling"](https://arxiv.org/abs/2510.04265)
- **February 2026** 🎉: Our paper ["Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation"](https://iclr.cc/virtual/2026/poster/10009669) has been accepted to **ICLR 2026**!
- **February 2026** 🔜: Reasoning traces will be released in ~2 weeks.
---
## Packages
This repository contains two packages:
1. **`scorio`** - Python implementation
2. **`Scorio.jl`** - Julia implementation
---
## Quick Start
### Python (scorio)
#### Installation
```bash
# Install from PyPI
pip install scorio
# Install latest from GitHub
pip install "git+https://github.com/mohsenhariri/scorio.git"
# Install a specific tag
pip install "git+https://github.com/mohsenhariri/scorio.git@v0.2.0"
# Install from local repository
pip install -e .
```
#### Basic Usage
```python
import numpy as np
from scorio import eval
# Outcomes R: shape (M, N) with integer categories in {0, ..., C}
R = np.array([[0, 1, 2, 2, 1],
[1, 1, 0, 2, 2]])
# Rubric weights w: length C+1
# Here: 0=incorrect(0.0), 1=partial(0.5), 2=correct(1.0)
w = np.array([0.0, 0.5, 1.0])
# Optional prior outcomes R0: shape (M, D)
R0 = np.array([[0, 2],
[1, 2]])
# Bayesian evaluation with prior
mu, sigma = eval.bayes(R, w, R0)
print(f"μ = {mu:.6f}, σ = {sigma:.6f}")
# Expected: μ ≈ 0.575, σ ≈ 0.084275
# Bayesian evaluation without prior
mu2, sigma2 = eval.bayes(R, w)
print(f"μ = {mu2:.6f}, σ = {sigma2:.6f}")
# Expected: μ ≈ 0.5625, σ ≈ 0.091998
# Simple average
accuracy = eval.avg(R)
print(f"Average: {accuracy:.6f}")
```
### Julia (Scorio.jl)
#### Installation
```julia
using Pkg
# From local development
Pkg.develop(path="./julia/Scorio.jl")
# Or from Julia General Registry
# Pkg.add("Scorio")
```
#### Basic Usage
```julia
using Scorio
# Outcomes R: shape (M, N) with integer categories in {0, ..., C}
R = [0 1 2 2 1;
1 1 0 2 2]
# Rubric weights w: length C+1
# Here: 0=incorrect(0.0), 1=partial(0.5), 2=correct(1.0)
w = [0.0, 0.5, 1.0]
# Optional prior outcomes R0: shape (M, D)
R0 = [0 2;
1 2]
# Bayesian evaluation with prior
mu, sigma = bayes(R, w, R0)
println("μ = $mu, σ = $sigma")
# Expected: μ ≈ 0.575, σ ≈ 0.084275
# Bayesian evaluation without prior
mu2, sigma2 = bayes(R, w)
println("μ = $mu2, σ = $sigma2")
# Expected: μ ≈ 0.5625, σ ≈ 0.091998
# Simple average
accuracy = avg(R)
println("Average: $accuracy")
```
---
### Evaluation Functions
#### `bayes(R, w, R0=None)`
Bayesian performance evaluation with uncertainty quantification using the Bayes@N framework.
- **`R`**: `M × N` integer matrix with entries in `{0, ..., C}` (outcomes for M questions over N trials)
- **`w`**: length `C+1` float vector of rubric weights mapping categories to scores
- **`R0`** (optional): `M × D` integer matrix of prior outcomes
- **Returns**: `(mu, sigma)` - posterior estimate and uncertainty
## Data and Shape Conventions
- **Categories**: Encode outcomes per trial as integers in `{0, ..., C}`
- **Weights**: Choose rubric weights `w` of length `C+1` (e.g., `[0, 1]` for binary outcomes)
- **Shapes**:
- `R` is `M × N` (M questions, N trials)
- `R0` is `M × D` (M questions, D prior trials)
- Both must share the same `M` and category set
---
## Requirements
### Python
- Python 3.10+
- NumPy 2.0+
### Julia
- Julia 1.6 or higher
---
## Documentation
[mohsenhariri.github.io/scorio](https://mohsenhariri.github.io/scorio/)
| APIs | Documentation | Status |
|----------|--------------|--------|
| **Python** | [scorio.readthedocs.io](https://scorio.readthedocs.io/en/latest/) | [](https://scorio.readthedocs.io/en/latest/) |
| **Julia** | [mohsenhariri.github.io/scorio/julia](https://mohsenhariri.github.io/scorio/julia/) | [](https://mohsenhariri.github.io/scorio/julia/) |
---
## Citation
If you use Scorio in your research, please cite the relevant papers:
### Bayesian Evaluation Framework
```bibtex
@inproceedings{hariri2026don,
title={Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation},
author={Hariri, Mohsen and Samandar, Amirhossein and Hinczewski, Michael and Chaudhary, Vipin},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://arxiv.org/abs/2510.04265}
}
```
### Ranking Methods
```bibtex
@article{hariri2026ranking,
title={Ranking Reasoning LLMs under Test-Time Scaling},
author={Hariri, Mohsen and Hinczewski, Michael and Ma, Jing and Chaudhary, Vipin},
journal={arXiv preprint arXiv:2510.04265},
year={2026},
url={https://arxiv.org/abs/2510.04265}
}
```
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
---
## Links
- **Landing Page**: [mohsenhariri.github.io/scorio](https://mohsenhariri.github.io/scorio/)
- **Python Docs**: [scorio.readthedocs.io](https://scorio.readthedocs.io/en/latest/)
- **Julia Docs**: [mohsenhariri.github.io/scorio/julia](https://mohsenhariri.github.io/scorio/julia/)
- **Repository**: [github.com/mohsenhariri/scorio](https://github.com/mohsenhariri/scorio)
- **Issues**: [github.com/mohsenhariri/scorio/issues](https://github.com/mohsenhariri/scorio/issues)
- **Papers**:
- [Don't Pass@k (ICLR 2026)](https://iclr.cc/virtual/2026/poster/10009669) | [arXiv](https://arxiv.org/abs/2510.04265)
- [Ranking Reasoning LLMs](https://arxiv.org/abs/2510.04265)