https://github.com/mohsenhariri/scorio

Statistical evaluation, comparison, and ranking of Large Language Models
https://github.com/mohsenhariri/scorio

bayesian bayesian-inference evaluation large-language-models llm ranking statistics

Last synced: 4 months ago
JSON representation

Statistical evaluation, comparison, and ranking of Large Language Models

Host: GitHub
URL: https://github.com/mohsenhariri/scorio
Owner: mohsenhariri
License: other
Created: 2025-10-01T21:24:41.000Z (9 months ago)
Default Branch: main
Last Pushed: 2026-02-09T00:37:07.000Z (4 months ago)
Last Synced: 2026-02-09T04:20:28.850Z (4 months ago)
Topics: bayesian, bayesian-inference, evaluation, large-language-models, llm, ranking, statistics
Language: Python
Homepage: https://mohsenhariri.github.io/scorio/
Size: 292 KB
Stars: 8
Watchers: 0
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Citation: CITATION.cff

Awesome Lists containing this project

README

          # Scorio

[![arXiv (Bayes Evaluation)](https://img.shields.io/badge/arXiv-2510.04265-b31b1b.svg)](https://arxiv.org/abs/2510.04265)

[![arXiv (Bayes Ranking)](https://img.shields.io/badge/arXiv-2510.04265-b31b1b.svg)](https://arxiv.org/abs/2510.04265)

[![ICLR 2026](https://img.shields.io/badge/ICLR-2026-blue.svg)](https://iclr.cc/virtual/2026/poster/10009669)

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](#license)

[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)

[![Julia 1.6+](https://img.shields.io/badge/julia-1.6+-9558B2.svg)](https://julialang.org/downloads/)

[![Python Docs](https://readthedocs.org/projects/scorio/badge/?version=latest)](https://scorio.readthedocs.io/en/latest/)

[![Julia Docs](https://img.shields.io/badge/docs-Julia-9558B2.svg)](https://mohsenhariri.github.io/scorio/julia/)

---

## News

- **February 2026** ✨: New paper released: ["Ranking Reasoning LLMs under Test-Time Scaling"](https://arxiv.org/abs/2510.04265)

- **February 2026** 🎉: Our paper ["Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation"](https://iclr.cc/virtual/2026/poster/10009669) has been accepted to **ICLR 2026**!

- **February 2026** 🔜: Reasoning traces will be released in ~2 weeks.

---

## Packages

This repository contains two packages:

1. **`scorio`** - Python implementation

2. **`Scorio.jl`** - Julia implementation

---

## Quick Start

### Python (scorio)

#### Installation

```bash

# Install from PyPI

pip install scorio

# Install latest from GitHub 

pip install "git+https://github.com/mohsenhariri/scorio.git"

# Install a specific tag

pip install "git+https://github.com/mohsenhariri/scorio.git@v0.2.0"

# Install from local repository

pip install -e .

```

#### Basic Usage

```python

import numpy as np

from scorio import eval

# Outcomes R: shape (M, N) with integer categories in {0, ..., C}

R = np.array([[0, 1, 2, 2, 1],

              [1, 1, 0, 2, 2]])

# Rubric weights w: length C+1

# Here: 0=incorrect(0.0), 1=partial(0.5), 2=correct(1.0)

w = np.array([0.0, 0.5, 1.0])

# Optional prior outcomes R0: shape (M, D)

R0 = np.array([[0, 2],

               [1, 2]])

# Bayesian evaluation with prior

mu, sigma = eval.bayes(R, w, R0)

print(f"μ = {mu:.6f}, σ = {sigma:.6f}")

# Expected: μ ≈ 0.575, σ ≈ 0.084275

# Bayesian evaluation without prior

mu2, sigma2 = eval.bayes(R, w)

print(f"μ = {mu2:.6f}, σ = {sigma2:.6f}")

# Expected: μ ≈ 0.5625, σ ≈ 0.091998

# Simple average

accuracy = eval.avg(R)

print(f"Average: {accuracy:.6f}")

```

### Julia (Scorio.jl)

#### Installation

```julia

using Pkg

# From local development

Pkg.develop(path="./julia/Scorio.jl")

# Or from Julia General Registry

# Pkg.add("Scorio")

```

#### Basic Usage

```julia

using Scorio

# Outcomes R: shape (M, N) with integer categories in {0, ..., C}

R = [0 1 2 2 1;

     1 1 0 2 2]

# Rubric weights w: length C+1

# Here: 0=incorrect(0.0), 1=partial(0.5), 2=correct(1.0)

w = [0.0, 0.5, 1.0]

# Optional prior outcomes R0: shape (M, D)

R0 = [0 2;

      1 2]

# Bayesian evaluation with prior

mu, sigma = bayes(R, w, R0)

println("μ = $mu, σ = $sigma")

# Expected: μ ≈ 0.575, σ ≈ 0.084275

# Bayesian evaluation without prior

mu2, sigma2 = bayes(R, w)

println("μ = $mu2, σ = $sigma2")

# Expected: μ ≈ 0.5625, σ ≈ 0.091998

# Simple average

accuracy = avg(R)

println("Average: $accuracy")

```

---

### Evaluation Functions

#### `bayes(R, w, R0=None)`

Bayesian performance evaluation with uncertainty quantification using the Bayes@N framework.

- **`R`**: `M × N` integer matrix with entries in `{0, ..., C}` (outcomes for M questions over N trials)

- **`w`**: length `C+1` float vector of rubric weights mapping categories to scores

- **`R0`** (optional): `M × D` integer matrix of prior outcomes

- **Returns**: `(mu, sigma)` - posterior estimate and uncertainty

## Data and Shape Conventions

- **Categories**: Encode outcomes per trial as integers in `{0, ..., C}`

- **Weights**: Choose rubric weights `w` of length `C+1` (e.g., `[0, 1]` for binary outcomes)

- **Shapes**: 

  - `R` is `M × N` (M questions, N trials)

  - `R0` is `M × D` (M questions, D prior trials)

  - Both must share the same `M` and category set

---

## Requirements

### Python

- Python 3.10+

- NumPy 2.0+

### Julia

- Julia 1.6 or higher

---

## Documentation

[mohsenhariri.github.io/scorio](https://mohsenhariri.github.io/scorio/)

| APIs | Documentation | Status |

|----------|--------------|--------|

| **Python** | [scorio.readthedocs.io](https://scorio.readthedocs.io/en/latest/) | [![ReadTheDocs](https://readthedocs.org/projects/scorio/badge/?version=latest)](https://scorio.readthedocs.io/en/latest/) |

| **Julia** | [mohsenhariri.github.io/scorio/julia](https://mohsenhariri.github.io/scorio/julia/) | [![GitHub Pages](https://img.shields.io/badge/docs-stable-blue.svg)](https://mohsenhariri.github.io/scorio/julia/) |

---

## Citation

If you use Scorio in your research, please cite the relevant papers:

### Bayesian Evaluation Framework

```bibtex

@inproceedings{hariri2026don,

  title={Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation},

  author={Hariri, Mohsen and Samandar, Amirhossein and Hinczewski, Michael and Chaudhary, Vipin},

  booktitle={The Fourteenth International Conference on Learning Representations},

  year={2026},

  url={https://arxiv.org/abs/2510.04265}

}

```

### Ranking Methods

```bibtex

@article{hariri2026ranking,

  title={Ranking Reasoning LLMs under Test-Time Scaling},

  author={Hariri, Mohsen and Hinczewski, Michael and Ma, Jing and Chaudhary, Vipin},

  journal={arXiv preprint arXiv:2510.04265},

  year={2026},

  url={https://arxiv.org/abs/2510.04265}

}

```

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

---

## Links

- **Landing Page**: [mohsenhariri.github.io/scorio](https://mohsenhariri.github.io/scorio/)

- **Python Docs**: [scorio.readthedocs.io](https://scorio.readthedocs.io/en/latest/)

- **Julia Docs**: [mohsenhariri.github.io/scorio/julia](https://mohsenhariri.github.io/scorio/julia/)

- **Repository**: [github.com/mohsenhariri/scorio](https://github.com/mohsenhariri/scorio)

- **Issues**: [github.com/mohsenhariri/scorio/issues](https://github.com/mohsenhariri/scorio/issues)

- **Papers**:

  - [Don't Pass@k (ICLR 2026)](https://iclr.cc/virtual/2026/poster/10009669) | [arXiv](https://arxiv.org/abs/2510.04265)

  - [Ranking Reasoning LLMs](https://arxiv.org/abs/2510.04265)

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/mohsenhariri/scorio

Awesome Lists containing this project

README