{"id":39645876,"url":"https://github.com/mohsenhariri/scorio","last_synced_at":"2026-02-16T20:34:57.130Z","repository":{"id":318397870,"uuid":"1068102354","full_name":"mohsenhariri/scorio","owner":"mohsenhariri","description":"Statistical evaluation, comparison, and ranking of Large Language Models","archived":false,"fork":false,"pushed_at":"2026-02-09T00:37:07.000Z","size":299,"stargazers_count":8,"open_issues_count":0,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-02-09T04:20:28.850Z","etag":null,"topics":["bayesian","bayesian-inference","evaluation","large-language-models","llm","ranking","statistics"],"latest_commit_sha":null,"homepage":"https://mohsenhariri.github.io/scorio/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mohsenhariri.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-10-01T21:24:41.000Z","updated_at":"2026-02-09T00:35:27.000Z","dependencies_parsed_at":null,"dependency_job_id":"c104c770-eb4a-415c-bc73-8bc9e124a8f0","html_url":"https://github.com/mohsenhariri/scorio","commit_stats":null,"previous_names":["mohsenhariri/bayes-kit","mohsenhariri/scorio"],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/mohsenhariri/scorio","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mohsenhariri%2Fscorio","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mohsenhariri%2Fscorio/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mohsenhariri%2Fscorio/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mohsenhariri%2Fscorio/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mohsenhariri","download_url":"https://codeload.github.com/mohsenhariri/scorio/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mohsenhariri%2Fscorio/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29517613,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-16T18:37:19.720Z","status":"ssl_error","status_checked_at":"2026-02-16T18:36:46.920Z","response_time":115,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bayesian","bayesian-inference","evaluation","large-language-models","llm","ranking","statistics"],"created_at":"2026-01-18T09:05:28.312Z","updated_at":"2026-02-16T20:34:57.119Z","avatar_url":"https://github.com/mohsenhariri.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Scorio\n\n[![arXiv (Bayes Evaluation)](https://img.shields.io/badge/arXiv-2510.04265-b31b1b.svg)](https://arxiv.org/abs/2510.04265)\n[![arXiv (Bayes Ranking)](https://img.shields.io/badge/arXiv-2510.04265-b31b1b.svg)](https://arxiv.org/abs/2510.04265)\n[![ICLR 2026](https://img.shields.io/badge/ICLR-2026-blue.svg)](https://iclr.cc/virtual/2026/poster/10009669)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](#license)\n[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)\n[![Julia 1.6+](https://img.shields.io/badge/julia-1.6+-9558B2.svg)](https://julialang.org/downloads/)\n[![Python Docs](https://readthedocs.org/projects/scorio/badge/?version=latest)](https://scorio.readthedocs.io/en/latest/)\n[![Julia Docs](https://img.shields.io/badge/docs-Julia-9558B2.svg)](https://mohsenhariri.github.io/scorio/julia/)\n\n---\n\n## News\n\n- **February 2026** ✨: New paper released: [\"Ranking Reasoning LLMs under Test-Time Scaling\"](https://arxiv.org/abs/2510.04265)\n\n- **February 2026** 🎉: Our paper [\"Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation\"](https://iclr.cc/virtual/2026/poster/10009669) has been accepted to **ICLR 2026**!\n\n- **February 2026** 🔜: Reasoning traces will be released in ~2 weeks.\n\n---\n\n## Packages\n\nThis repository contains two packages:\n\n1. **`scorio`** - Python implementation\n2. **`Scorio.jl`** - Julia implementation\n\n---\n\n## Quick Start\n\n### Python (scorio)\n\n#### Installation\n\n```bash\n# Install from PyPI\npip install scorio\n\n# Install latest from GitHub \npip install \"git+https://github.com/mohsenhariri/scorio.git\"\n\n# Install a specific tag\npip install \"git+https://github.com/mohsenhariri/scorio.git@v0.2.0\"\n\n# Install from local repository\npip install -e .\n\n```\n\n#### Basic Usage\n\n```python\nimport numpy as np\nfrom scorio import eval\n\n# Outcomes R: shape (M, N) with integer categories in {0, ..., C}\nR = np.array([[0, 1, 2, 2, 1],\n              [1, 1, 0, 2, 2]])\n\n# Rubric weights w: length C+1\n# Here: 0=incorrect(0.0), 1=partial(0.5), 2=correct(1.0)\nw = np.array([0.0, 0.5, 1.0])\n\n# Optional prior outcomes R0: shape (M, D)\nR0 = np.array([[0, 2],\n               [1, 2]])\n\n# Bayesian evaluation with prior\nmu, sigma = eval.bayes(R, w, R0)\nprint(f\"μ = {mu:.6f}, σ = {sigma:.6f}\")\n# Expected: μ ≈ 0.575, σ ≈ 0.084275\n\n# Bayesian evaluation without prior\nmu2, sigma2 = eval.bayes(R, w)\nprint(f\"μ = {mu2:.6f}, σ = {sigma2:.6f}\")\n# Expected: μ ≈ 0.5625, σ ≈ 0.091998\n\n# Simple average\naccuracy = eval.avg(R)\nprint(f\"Average: {accuracy:.6f}\")\n```\n\n### Julia (Scorio.jl)\n\n#### Installation\n\n```julia\nusing Pkg\n\n# From local development\nPkg.develop(path=\"./julia/Scorio.jl\")\n\n# Or from Julia General Registry\n# Pkg.add(\"Scorio\")\n```\n\n#### Basic Usage\n\n```julia\nusing Scorio\n\n# Outcomes R: shape (M, N) with integer categories in {0, ..., C}\nR = [0 1 2 2 1;\n     1 1 0 2 2]\n\n# Rubric weights w: length C+1\n# Here: 0=incorrect(0.0), 1=partial(0.5), 2=correct(1.0)\nw = [0.0, 0.5, 1.0]\n\n# Optional prior outcomes R0: shape (M, D)\nR0 = [0 2;\n      1 2]\n\n# Bayesian evaluation with prior\nmu, sigma = bayes(R, w, R0)\nprintln(\"μ = $mu, σ = $sigma\")\n# Expected: μ ≈ 0.575, σ ≈ 0.084275\n\n# Bayesian evaluation without prior\nmu2, sigma2 = bayes(R, w)\nprintln(\"μ = $mu2, σ = $sigma2\")\n# Expected: μ ≈ 0.5625, σ ≈ 0.091998\n\n# Simple average\naccuracy = avg(R)\nprintln(\"Average: $accuracy\")\n```\n\n---\n\n\n### Evaluation Functions\n\n#### `bayes(R, w, R0=None)`\nBayesian performance evaluation with uncertainty quantification using the Bayes@N framework.\n\n- **`R`**: `M × N` integer matrix with entries in `{0, ..., C}` (outcomes for M questions over N trials)\n- **`w`**: length `C+1` float vector of rubric weights mapping categories to scores\n- **`R0`** (optional): `M × D` integer matrix of prior outcomes\n- **Returns**: `(mu, sigma)` - posterior estimate and uncertainty\n\n\n## Data and Shape Conventions\n\n- **Categories**: Encode outcomes per trial as integers in `{0, ..., C}`\n- **Weights**: Choose rubric weights `w` of length `C+1` (e.g., `[0, 1]` for binary outcomes)\n- **Shapes**: \n  - `R` is `M × N` (M questions, N trials)\n  - `R0` is `M × D` (M questions, D prior trials)\n  - Both must share the same `M` and category set\n\n---\n\n## Requirements\n\n### Python\n- Python 3.10+\n- NumPy 2.0+\n\n### Julia\n- Julia 1.6 or higher\n\n---\n\n## Documentation\n\n[mohsenhariri.github.io/scorio](https://mohsenhariri.github.io/scorio/)\n\n| APIs | Documentation | Status |\n|----------|--------------|--------|\n| **Python** | [scorio.readthedocs.io](https://scorio.readthedocs.io/en/latest/) | [![ReadTheDocs](https://readthedocs.org/projects/scorio/badge/?version=latest)](https://scorio.readthedocs.io/en/latest/) |\n| **Julia** | [mohsenhariri.github.io/scorio/julia](https://mohsenhariri.github.io/scorio/julia/) | [![GitHub Pages](https://img.shields.io/badge/docs-stable-blue.svg)](https://mohsenhariri.github.io/scorio/julia/) |\n\n\n---\n\n## Citation\n\nIf you use Scorio in your research, please cite the relevant papers:\n\n### Bayesian Evaluation Framework\n\n```bibtex\n@inproceedings{hariri2026don,\n  title={Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation},\n  author={Hariri, Mohsen and Samandar, Amirhossein and Hinczewski, Michael and Chaudhary, Vipin},\n  booktitle={The Fourteenth International Conference on Learning Representations},\n  year={2026},\n  url={https://arxiv.org/abs/2510.04265}\n}\n```\n\n### Ranking Methods\n\n```bibtex\n@article{hariri2026ranking,\n  title={Ranking Reasoning LLMs under Test-Time Scaling},\n  author={Hariri, Mohsen and Hinczewski, Michael and Ma, Jing and Chaudhary, Vipin},\n  journal={arXiv preprint arXiv:2510.04265},\n  year={2026},\n  url={https://arxiv.org/abs/2510.04265}\n}\n```\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n---\n\n## Links\n\n- **Landing Page**: [mohsenhariri.github.io/scorio](https://mohsenhariri.github.io/scorio/)\n- **Python Docs**: [scorio.readthedocs.io](https://scorio.readthedocs.io/en/latest/)\n- **Julia Docs**: [mohsenhariri.github.io/scorio/julia](https://mohsenhariri.github.io/scorio/julia/)\n- **Repository**: [github.com/mohsenhariri/scorio](https://github.com/mohsenhariri/scorio)\n- **Issues**: [github.com/mohsenhariri/scorio/issues](https://github.com/mohsenhariri/scorio/issues)\n- **Papers**:\n  - [Don't Pass@k (ICLR 2026)](https://iclr.cc/virtual/2026/poster/10009669) | [arXiv](https://arxiv.org/abs/2510.04265)\n  - [Ranking Reasoning LLMs](https://arxiv.org/abs/2510.04265)\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmohsenhariri%2Fscorio","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmohsenhariri%2Fscorio","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmohsenhariri%2Fscorio/lists"}