https://github.com/ctrl-gaurav/beyondbench

[ICLR 2026 Accepted paper] BeyondBench: Contamination-Resistant Evaluation of Reasoning in Language Models
evaluation evaluation-framework framework llms reasoning reasoning-language-models slms
Last synced: 22 days ago
Host: GitHub
URL: https://github.com/ctrl-gaurav/beyondbench
Owner: ctrl-gaurav
License: apache-2.0
Created: 2026-02-05T07:28:43.000Z (5 months ago)
Default Branch: main
Last Pushed: 2026-04-10T05:27:40.000Z (3 months ago)
Last Synced: 2026-04-10T07:38:00.737Z (3 months ago)
Topics: evaluation, evaluation-framework, framework, llms, reasoning, reasoning-language-models, slms
Language: Python
Homepage: https://ctrl-gaurav.github.io/BeyondBench/
Size: 686 KB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 6
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Security: SECURITY.md
README

          


  





[![Paper](https://img.shields.io/badge/📄_Paper-ArXiv%3A2509.24210-red?style=for-the-badge&logo=arxiv)](https://arxiv.org/abs/2509.24210)

[![Conference](https://img.shields.io/badge/🏆_ICLR-2026-blue?style=for-the-badge)](https://iclr.cc/)

[![PyPI](https://img.shields.io/pypi/v/beyondbench.svg?style=for-the-badge&logo=pypi&logoColor=white)](https://pypi.org/project/beyondbench/)

[![Downloads](https://img.shields.io/pepy/dt/beyondbench?style=for-the-badge&logo=pypi&logoColor=white&label=Downloads)](https://pepy.tech/project/beyondbench)

[![Monthly Downloads](https://img.shields.io/pypi/dm/beyondbench?style=for-the-badge&logo=pypi&logoColor=white&label=Downloads%2Fmonth)](https://pypi.org/project/beyondbench/)

[![Python](https://img.shields.io/badge/Python-3.10%2B-blue?style=for-the-badge&logo=python&logoColor=white)](https://pypi.org/project/beyondbench/)

[![CI](https://img.shields.io/github/actions/workflow/status/ctrl-gaurav/BeyondBench/test.yml?branch=main&style=for-the-badge&logo=github&label=CI)](https://github.com/ctrl-gaurav/BeyondBench/actions/workflows/test.yml)

[![License](https://img.shields.io/badge/License-Apache_2.0-green.svg?style=for-the-badge)](LICENSE)

[![Stars](https://img.shields.io/github/stars/ctrl-gaurav/BeyondBench?style=for-the-badge&logo=github)](https://github.com/ctrl-gaurav/BeyondBench/stargazers)

*Contamination-Resistant Evaluation of Reasoning in Language Models*

**🏆 101+ Models Evaluated • 🧠 79 Reasoning Tasks • 🎯 138 Variations • 📊 >10¹⁵ Unique Instances**

[**🌟 Explore Leaderboard**](https://ctrl-gaurav.github.io/BeyondBench/) | [**📖 Read Paper**](https://arxiv.org/abs/2509.24210) | [**📦 PyPI**](https://pypi.org/project/beyondbench/) | [**📚 Documentation**](docs/DOCUMENTATION.md)



---

## 📢 Latest News

| Date | Update |
|------|--------|
| **Apr 17, 2026** | v0.2.1 released: critical PyPI packaging fix (missing subpackages in wheel). See [Changelog](CHANGELOG.md) |
| **Apr 16, 2026** | v0.2.0 released: multi-GPU parallel eval, 1000+ tests, response caching, plugin SDK, Gradio dashboard. See [Changelog](CHANGELOG.md) |
| **Mar 6, 2026** | v0.1.0 released: FastAPI serve, CLI improvements, CI/CD, comprehensive tests. See [Changelog](CHANGELOG.md) |
| **Feb 25, 2026** | v0.0.2 released: critical bug fixes, much more stable! See [Changelog](CHANGELOG.md) |
| **Feb 25, 2026** | v0.0.1 released: 44 tasks, 117 variations, 101+ models |
| **Jan 2026** | Paper accepted at **ICLR 2026** |
| **Jan 2026** | Interactive leaderboard website launched |
| **Sep 2025** | Paper submitted: [arXiv:2509.24210](https://arxiv.org/abs/2509.24210) |

---

## 💡 What is BeyondBench?

BeyondBench evaluates the reasoning capabilities of language models without relying on static benchmarks. It **dynamically generates** novel problems across **79 distinct reasoning tasks** with **138 variations**, so a model cannot score well by recalling memorized solutions and has to reason through each fresh instance.
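
To make the idea of dynamic generation concrete, here is a minimal sketch of a seeded generator for a `sum`-style task. It is not BeyondBench's actual generator: the function names, parameter ranges, and prompt wording are all hypothetical, chosen only to show how every generated instance can carry its own computed ground truth.

```python
import random


def generate_sum_instance(seed, length=8, low=-100, high=100):
    """Build one hypothetical `sum` problem from a seed (illustrative only)."""
    rng = random.Random(seed)  # a private RNG keeps each seed reproducible
    numbers = [rng.randint(low, high) for _ in range(length)]
    return {
        "prompt": f"Compute the sum of the following list: {numbers}",
        "numbers": numbers,
        "answer": sum(numbers),  # ground truth is computed, never looked up
    }


def is_correct(instance, model_answer):
    """Compare a model's raw text answer against the computed ground truth."""
    try:
        return int(model_answer.strip()) == instance["answer"]
    except ValueError:
        return False  # non-integer output counts as wrong


def instance_space_size(length=8, low=-100, high=100):
    """Number of distinct lists this toy generator can produce."""
    return (high - low + 1) ** length


if __name__ == "__main__":
    instance = generate_sum_instance(seed=42)
    print(instance["prompt"])
    print(is_correct(instance, str(instance["answer"])))
    print(instance_space_size() > 10**15)
```

Even with these toy settings, a single task admits 201^8 (roughly 2.7 × 10^18) distinct lists, which illustrates how a parameterized generator can dwarf any fixed problem set.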











### 🌟 Key Highlights

#### 🔄 **Dynamic Problem Generation**

- Problem space >10^15 unique instances

- Contamination-resistant: instances are generated at evaluation time, not drawn from a fixed set

- Fresh problems on every evaluation

#### 🎯 **Three Difficulty Levels**

- **Easy**: 44 fundamental reasoning tasks

- **Medium**: 15 tasks with 59 variations

- **Hard**: 20 tasks with 78 variations

#### 🤖 **Multi-Backend Support**

- OpenAI, Gemini, Anthropic APIs

- vLLM for high-throughput local inference

- HuggingFace Transformers

#### 📊 **Comprehensive Metrics**

- Accuracy across difficulty levels

- Instruction-following compliance

- Token efficiency analysis
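
The three metric families above can be sketched as a small aggregation over per-response records. This is an assumption-laden illustration, not BeyondBench's scoring code: the `Record` fields and the "tokens per correct answer" efficiency measure are hypothetical choices.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One evaluated response; the field names are hypothetical."""
    correct: bool            # final answer matched the ground truth
    followed_format: bool    # response obeyed the requested output format
    completion_tokens: int   # tokens the model generated


def summarize(records):
    """Aggregate accuracy, instruction-following rate, and token cost."""
    if not records:
        raise ValueError("need at least one record")
    n = len(records)
    n_correct = sum(r.correct for r in records)
    total_tokens = sum(r.completion_tokens for r in records)
    return {
        "accuracy": n_correct / n,
        "instruction_following": sum(r.followed_format for r in records) / n,
        "avg_tokens": total_tokens / n,
        # One possible efficiency measure: tokens spent per correct answer.
        "tokens_per_correct": total_tokens / n_correct if n_correct else float("inf"),
    }
```

Tracking format compliance separately from correctness matters because a response can contain the right answer while ignoring the requested output format, or the reverse.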

#### 🛡️ **Contamination-Resistant**

- No static benchmark memorization

- Novel problem generation

- Fair model comparison

#### ⚡ **Extensive Coverage**

- 101+ models evaluated

- Open-source and proprietary

- Regular updates with new models

---

## 🚀 Installation

### From PyPI

```bash

pip install beyondbench

```

### From Source

```bash

git clone https://github.com/ctrl-gaurav/BeyondBench.git

cd BeyondBench

pip install -e .

```

### With Optional Dependencies

```bash
# All API clients (OpenAI, Gemini, Anthropic)
# Quote the extras so shells such as zsh do not expand the brackets
pip install "beyondbench[all-apis]"

# vLLM support (requires CUDA)
pip install "beyondbench[vllm]"

# Everything
pip install "beyondbench[full]"
```

```bash
# Performance optimization
pip install "beyondbench[vllm]"  # vLLM with prefix caching
pip install bitsandbytes         # 4-bit/8-bit quantization
```

---

## ⚡ Quick Start

### Interactive Wizard

```bash

beyondbench

```

### Command Line

```bash

# Evaluate GPT-4o on the easy suite

beyondbench evaluate --model-id gpt-4o --api-provider openai --suite easy

# Evaluate a local model with vLLM

beyondbench evaluate --model-id meta-llama/Llama-3.2-3B-Instruct --backend vllm --suite all

# Evaluate Claude on hard tasks

beyondbench evaluate --model-id claude-sonnet-4-20250514 --api-provider anthropic --suite hard

# List available tasks

beyondbench list-tasks

```
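
When sweeping over several models or suites, it can help to build these commands programmatically. The sketch below uses only the flags that appear in the commands above (`--model-id`, `--api-provider`, `--backend`, `--suite`); the helper name `build_evaluate_command` is hypothetical, and the script prints the commands instead of running them.

```python
import shlex


def build_evaluate_command(model_id, suite, api_provider=None, backend=None):
    """Assemble a `beyondbench evaluate` argv list from the flags shown above."""
    argv = ["beyondbench", "evaluate", "--model-id", model_id]
    if api_provider is not None:
        argv += ["--api-provider", api_provider]
    if backend is not None:
        argv += ["--backend", backend]
    argv += ["--suite", suite]
    return argv


if __name__ == "__main__":
    for suite in ("easy", "hard"):
        argv = build_evaluate_command("gpt-4o", suite, api_provider="openai")
        # Printed for inspection; pass argv to subprocess.run to launch it.
        print(shlex.join(argv))
```

Keeping the command as a list avoids shell-quoting problems if it is later handed to `subprocess.run`.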

### Python API

```python
from beyondbench import EvaluationEngine, ModelHandler, TaskRegistry

# Initialize the model handler
model = ModelHandler(
    model_id="gpt-4o",
    api_provider="openai",
    api_key="your-api-key",
)

# Run the evaluation
engine = EvaluationEngine(model_handler=model, output_dir="./results")
results = engine.run_evaluation(suite="easy", datapoints=100)

# Print results
print(f"Average Accuracy: {results['summary']['avg_accuracy']:.2%}")
```
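
Building on the example above, the following sketch collects the average accuracy for several suites in one call. It relies only on `run_evaluation(suite=..., datapoints=...)` and the `results['summary']['avg_accuracy']` key shown above. The suite name `"medium"` is an assumption (the CLI examples show only `easy`, `hard`, and `all`), and `StubEngine` with its made-up scores exists purely so the sketch runs without BeyondBench installed.

```python
def accuracy_by_suite(engine, suites=("easy", "medium", "hard"), datapoints=100):
    """Run each suite and collect its average accuracy.

    `engine` is any object exposing run_evaluation(suite=..., datapoints=...)
    that returns a dict shaped like the example above.
    """
    scores = {}
    for suite in suites:
        results = engine.run_evaluation(suite=suite, datapoints=datapoints)
        scores[suite] = results["summary"]["avg_accuracy"]
    return scores


class StubEngine:
    """Stand-in engine; NOT part of BeyondBench, scores are made up."""

    def run_evaluation(self, suite, datapoints):
        fake = {"easy": 0.9, "medium": 0.7, "hard": 0.5}
        return {"summary": {"avg_accuracy": fake[suite]}}


if __name__ == "__main__":
    for suite, acc in accuracy_by_suite(StubEngine()).items():
        print(f"{suite}: {acc:.2%}")
```

In real use, an `EvaluationEngine` built as in the example above would be passed in place of `StubEngine()`.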

### API Server

```bash

# Start the BeyondBench API server

beyondbench serve --host 0.0.0.0 --port 8000

# API docs at http://localhost:8000/docs

```
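
The server's routes are not listed here, so one way to discover them is to read the OpenAPI schema behind the `/docs` page. The sketch below assumes the server keeps FastAPI's default schema location, `/openapi.json`; if that default has been changed, the URL will differ. Both helper names are hypothetical.

```python
import json
from urllib.request import urlopen


def list_routes(schema):
    """Return sorted (METHOD, path) pairs from an OpenAPI schema dict."""
    http_methods = {"get", "post", "put", "patch", "delete", "head", "options"}
    routes = []
    for path, operations in schema.get("paths", {}).items():
        for method in operations:
            if method in http_methods:  # skip non-operation keys such as "parameters"
                routes.append((method.upper(), path))
    return sorted(routes)


def fetch_schema(base_url="http://localhost:8000"):
    """Download the schema; /openapi.json is FastAPI's default location."""
    with urlopen(f"{base_url}/openapi.json", timeout=5) as response:
        return json.load(response)
```

With the server running, `list_routes(fetch_schema())` returns the available endpoints without hard-coding any of them.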

### Configuration Files

```bash

# Create a config interactively

beyondbench init

# Run from config file

beyondbench run-config beyondbench/configs/default.yaml

```

### Results Viewer

```bash

# List past results

beyondbench results list

# Show detailed results

beyondbench results show ./beyondbench_results/final_results.json

# Compare two evaluations

beyondbench results compare result_a.json result_b.json

# Get task info

beyondbench info sorting

```
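
For scripted comparisons outside the CLI, a results file can also be read directly. This sketch assumes that a saved results JSON has the same `summary.avg_accuracy` layout as the dict returned by the Python API; that layout is not confirmed here, so check a real file before relying on it. The function names are hypothetical.

```python
import json
from pathlib import Path


def load_avg_accuracy(path):
    """Read summary.avg_accuracy from a results JSON file (assumed layout)."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return data["summary"]["avg_accuracy"]


def accuracy_delta(path_a, path_b):
    """Return the accuracy of run B minus the accuracy of run A."""
    return load_avg_accuracy(path_b) - load_avg_accuracy(path_a)
```

A positive delta means the second run scored higher than the first.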

---

## 🔌 Supported Backends

| Backend | Models | Features |
|---------|--------|----------|
| **OpenAI** | GPT-4o, GPT-4o-mini, GPT-5, GPT-5-mini | Reasoning effort control |
| **Gemini** | Gemini 2.5 Pro, Gemini 2.5 Flash | Thinking budget configuration |
| **Anthropic** | Claude Sonnet 4, Claude Opus 4 | Latest Claude models |
| **vLLM** | Any HuggingFace model | Batch processing, tensor parallelism |
| **Transformers** | Any HuggingFace model | CPU/GPU inference |

---

## 📊 Results

### 🏆 Leaderboard (Top Models)

| 🏅 Rank | 🤖 Model | 📊 Overall | 🎯 Instruction Following |
|---------|----------|------------|--------------------------|
| 🥇 | GPT-5* | 83.56% | 96.15% |
| 🥈 | GPT-5-Nano* | 82.04% | 93.58% |
| 🥉 | GPT-5-Mini* | 81.67% | 94.23% |
| 4 | o3* | 80.36% | 94.96% |
| 5 | o4-Mini* | 79.04% | 95.30% |

\* Models marked with \* use reasoning/thinking tokens. Full results for 101+ models are available in the [paper](https://arxiv.org/abs/2509.24210) and on the [leaderboard](https://ctrl-gaurav.github.io/BeyondBench/).

### 🔍 Key Findings

- **Reasoning Gap**: Even top models show 20-30% performance drops on hard reasoning tasks

- **Scaling Effects**: Larger models generally perform better, but the relationship is not always linear

- **Instruction vs. Accuracy**: High accuracy does not guarantee perfect instruction-following

---

## ⚡ Performance

| Feature | Improvement |
|---------|-------------|
| **Multi-GPU Parallel Evaluation** | Up to 8x speedup on 8 GPUs |
| **Response Caching** | Near-instant repeat evaluations |
| **vLLM Prefix Caching** | 2-3x faster for shared-prefix tasks |
| **Quantization Support** | 4-bit/8-bit via bitsandbytes, GPTQ, AWQ |
| **Model Warm-up** | Eliminates cold-start overhead |
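
To illustrate why response caching makes repeat evaluations nearly instant, here is a minimal in-memory sketch. It is a hypothetical design, not BeyondBench's cache: the class name, key scheme, and `generate` callback are assumptions made for the example.

```python
import hashlib
import json


class ResponseCache:
    """In-memory cache keyed by a hash of the full request (hypothetical design)."""

    def __init__(self):
        self._store = {}
        self.hits = 0

    @staticmethod
    def make_key(model_id, prompt, **params):
        # sort_keys makes the key independent of keyword-argument order
        payload = json.dumps(
            {"model": model_id, "prompt": prompt, "params": params}, sort_keys=True
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, generate, model_id, prompt, **params):
        """Return a cached response, calling `generate` only on a miss."""
        key = self.make_key(model_id, prompt, **params)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        response = generate(prompt)
        self._store[key] = response
        return response
```

A cache like this only pays off when the same prompts recur, for example when an evaluation is re-run with a fixed seed; freshly generated problems will always miss.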

---

## 🧩 Task Suites

### Easy Suite (44 Tasks)

| Category | Tasks |
|----------|-------|
| **Arithmetic** | sum, multiplication, subtraction, division, absolute_difference, weighted_sum, parity_check, dot_product |
| **Statistics** | mean, median, mode, running_average, moving_average, variance, standard_deviation |
| **Counting** | odd_count, even_count, count_negative, count_unique, count_greater_than_previous, count_palindromic, count_perfect_squares, count_multiples, local_maxima_count, element_frequency |
| **Extrema** | find_maximum, find_minimum, second_maximum, second_minimum, range, index_of_maximum, max_adjacent_difference, sum_of_max_indices |
| **Sequences** | sorting, longest_increasing_subsequence, alternating_sum, sum_of_digits, cumulative_sum |
| **List Operations** | reverse_list, rotate_list, interleave_lists |
| **Set Operations** | set_intersection, set_difference |
| **Comparison** | comparison |
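
Easy-suite tasks have answers that a few lines of code can compute exactly, which is what makes automatic grading possible. The sketch below gives plausible reference solvers for three task names from the table. The exact task definitions are not spelled out here, so the sign convention, strictness, and duplicate handling are assumptions.

```python
def alternating_sum(values):
    """a0 - a1 + a2 - ... (starting with a plus sign is an assumption)."""
    return sum(v if i % 2 == 0 else -v for i, v in enumerate(values))


def count_greater_than_previous(values):
    """Count elements strictly greater than their immediate predecessor."""
    return sum(1 for prev, cur in zip(values, values[1:]) if cur > prev)


def second_maximum(values):
    """Second-largest distinct value (treating duplicates as one is an assumption)."""
    distinct = sorted(set(values), reverse=True)
    if len(distinct) < 2:
        raise ValueError("need at least two distinct values")
    return distinct[1]


if __name__ == "__main__":
    print(alternating_sum([5, 3, 8, 2]))
    print(count_greater_than_previous([1, 3, 2, 4, 4, 5]))
    print(second_maximum([4, 9, 9, 1]))
```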

### Medium Suite (15 Tasks, 59 Variations)

| Task | Variations |
|------|------------|
| **Fibonacci Sequence** | 6 (Tribonacci, Lucas numbers, modified recursive) |
| **Algebraic Sequence** | 10 (Polynomial, arithmetic, quadratic) |
| **Geometric Sequence** | 10 (Exponential, compound growth, factorial) |
| **Prime Sequence** | 11 (Prime gaps, twin primes, Sophie Germain) |
| **Complex Pattern** | 12 (Interleaved, conditional, multi-rule) |
| **Arithmetic Progression** | 1 (Varying common differences) |
| **Harmonic Sequence** | 1 (Reciprocal sequences) |
| **Collatz Sequence** | 1 (3n+1 conjecture) |
| **Polynomial Evaluation** | 1 (Evaluate at given point) |
| **Matrix Operations** | 1 (2x2 multiply, determinant, inverse) |
| **Number Base Conversion** | 1 (Decimal, binary, hexadecimal) |
| **Logical Operations** | 1 (AND, OR, NOT, XOR) |
| **Pattern Completion** | 1 (Numeric pattern inference) |
| **GCD/LCM** | 1 (Greatest common divisor, least common multiple) |
| **Combinatorics** | 1 (Permutations and combinations) |
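
Several of the sequence families in this table follow standard definitions, so their ground truth is easy to compute. As a sketch (the helper names are hypothetical and this is not BeyondBench's generator), Fibonacci, Lucas, and Tribonacci sequences are all linear recurrences that differ only in their seed terms, and a Collatz trajectory follows the 3n+1 rule:

```python
def linear_recurrence(seed_terms, count):
    """Each new term is the sum of the previous len(seed_terms) terms."""
    terms = list(seed_terms)
    k = len(seed_terms)
    while len(terms) < count:
        terms.append(sum(terms[-k:]))
    return terms[:count]


def collatz(n):
    """Collatz trajectory from n down to 1: halve if even, else 3n + 1."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    sequence = [n]
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        sequence.append(n)
    return sequence


if __name__ == "__main__":
    print(linear_recurrence([0, 1], 8))     # Fibonacci seeds
    print(linear_recurrence([2, 1], 8))     # Lucas seeds
    print(linear_recurrence([0, 0, 1], 8))  # Tribonacci seeds
    print(collatz(6))
```

Changing only the seed terms or the window size yields a new sequence family, which is one plausible way a single task can branch into several variations.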

### Hard Suite (20 Tasks, 78 Variations)

| Task | Variations | Complexity |
|------|------------|------------|
| **Tower of Hanoi** | 6 | O(2^n) moves |
| **N-Queens** | 4 | NP-complete |
| **Graph Coloring** | 10 | NP-complete |
| **Boolean SAT** | 5 | NP-complete |
| **Sudoku** | 8 | Constraint satisfaction |
| **Cryptarithmetic** | 12 | Constraint satisfaction |
| **Matrix Chain** | 5 | Dynamic programming |
| **Modular Systems** | 5 | Number theory |
| **Constraint Optimization** | 5 | Operations research |
| **Shortest Path** | 1 | Dijkstra's algorithm |
| **Knapsack** | 1 | 0/1 dynamic programming |
| **Traveling Salesman** | 1 | NP-hard combinatorial |
| **Longest Common Subsequence** | 1 | Dynamic programming |
| **Minimax Game** | 1 | Game tree search |
| **Regex Matching** | 1 | Pattern matching |
| **Topological Sort** | 1 | DAG ordering |
| **Interval Scheduling** | 1 | Greedy algorithm |
| **Coin Change** | 1 | Dynamic programming |
| **Edit Distance** | 1 | String algorithms |
| **Logic Grid Puzzles** | 8 | Deductive reasoning |
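
Many hard-suite problems have more than one valid solution, so checking an answer means verifying it against the rules, not comparing it to a stored string. The sketch below does this for Tower of Hanoi. It is an illustration under assumed conventions (pegs named `A`, `B`, `C`; moves given as `(source, target)` pairs), not BeyondBench's verifier.

```python
def verify_hanoi(num_disks, moves, source="A", target="C", pegs=("A", "B", "C")):
    """Check that `moves` legally transfers every disk from source to target."""
    state = {peg: [] for peg in pegs}
    state[source] = list(range(num_disks, 0, -1))  # largest disk at the bottom
    for src, dst in moves:
        if src not in state or dst not in state or not state[src]:
            return False  # unknown peg or empty source peg
        disk = state[src][-1]
        if state[dst] and state[dst][-1] < disk:
            return False  # larger disk placed on a smaller one
        state[dst].append(state[src].pop())
    return state[target] == list(range(num_disks, 0, -1))


def solve_hanoi(num_disks, source="A", target="C", spare="B"):
    """Classic recursive solution, used here to produce a known-good answer."""
    if num_disks == 0:
        return []
    return (
        solve_hanoi(num_disks - 1, source, spare, target)
        + [(source, target)]
        + solve_hanoi(num_disks - 1, spare, target, source)
    )


if __name__ == "__main__":
    moves = solve_hanoi(4)
    print(len(moves), verify_hanoi(4, moves))
```

The recursive solution uses 2^n - 1 moves, which matches the O(2^n) growth listed in the table, while the verifier accepts any legal sequence that reaches the goal state.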

---

## 📚 Documentation

- [**Full Documentation**](docs/DOCUMENTATION.md) — Complete API reference and configuration guide

- [**Usage Guide**](docs/USAGE.md) — Detailed usage examples for all backends

### Environment Variables

```bash

export OPENAI_API_KEY="sk-..."

export GEMINI_API_KEY="..."

export ANTHROPIC_API_KEY="sk-ant-..."

```

---

## 🤝 Contributing

We welcome contributions! See the [Contributing Guide](CONTRIBUTING.md) for details.

```bash

git clone https://github.com/ctrl-gaurav/BeyondBench.git

cd BeyondBench

pip install -e ".[dev]"

pre-commit install

pytest tests/ -v

```

### 🛠️ Ways to Contribute

- **🐛 Bug Reports**: Found an issue? [Report it here](https://github.com/ctrl-gaurav/BeyondBench/issues)

- **✨ Feature Requests**: Have ideas? [Share them here](https://github.com/ctrl-gaurav/BeyondBench/issues)

- **🔧 Code Contributions**: Submit PRs for improvements

- **📚 Documentation**: Help improve our docs

- **🤖 Model Submissions**: Suggest models for evaluation

---

## 📝 Citation

If you use BeyondBench in your research, please cite our paper (accepted at **ICLR 2026**):

```bibtex

@misc{srivastava2025beyondbenchbenchmarkfreeevaluationreasoning,

      title={BeyondBench: Contamination-Resistant Evaluation of Reasoning in Language Models},

      author={Gaurav Srivastava and Aafiya Hussain and Zhenyu Bi and Swastik Roy and Priya Pitre and Meng Lu and Morteza Ziyadi and Xuan Wang},

      year={2025},

      eprint={2509.24210},

      archivePrefix={arXiv},

      primaryClass={cs.CL},

      url={https://arxiv.org/abs/2509.24210},

}

```

---

## 📞 Contact & Support

- **📧 Email**: [gks@vt.edu](mailto:gks@vt.edu), [xuanw@vt.edu](mailto:xuanw@vt.edu)

- **🐛 Issues**: [GitHub Issues](https://github.com/ctrl-gaurav/BeyondBench/issues)

- **💬 Discussions**: [GitHub Discussions](https://github.com/ctrl-gaurav/BeyondBench/discussions)

---

## 📜 License

This project is licensed under the **Apache License 2.0** - see the [LICENSE](LICENSE) file for details.

---



## 🚀 Ready to Explore the Future of AI Evaluation?







**Made with ❤️ by the BeyondBench Team**

[![Virginia Tech](https://img.shields.io/badge/Virginia_Tech-CS_Department-maroon?style=flat&logo=data:image/svg+xml;base64,PHN2ZyB3aWR0aD0iMjQiIGhlaWdodD0iMjQiIHZpZXdCb3g9IjAgMCAyNCAyNCIgZmlsbD0ibm9uZSIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KPHBhdGggZD0iTTEyIDJMMTMuMDkgOC4yNkwyMCA5TDEzLjA5IDE1Ljc0TDEyIDIyTDEwLjkxIDE1Ljc0TDQgOUwxMC45MSA4LjI2TDEyIDJaIiBmaWxsPSJjdXJyZW50Q29sb3IiLz4KPC9zdmc+)](https://cs.vt.edu/)

[![Amazon AGI](https://img.shields.io/badge/Amazon-AGI-orange?style=flat&logo=amazon)](https://www.amazon.science/)

*Advancing the frontier of AI reasoning evaluation, one benchmark at a time* 🌟



---



| 🏠 [**Home**](https://ctrl-gaurav.github.io/BeyondBench/) | 📊 [**Leaderboard**](https://ctrl-gaurav.github.io/BeyondBench/#leaderboard) | 📖 [**Paper**](https://arxiv.org/abs/2509.24210) | 💻 [**Code**](https://github.com/ctrl-gaurav/BeyondBench) |
|:---:|:---:|:---:|:---:|
| Main website | Interactive rankings | Research paper | Source code |



> **🎯 Transform your understanding of AI capabilities.** BeyondBench reveals what language models can truly reason about, beyond memorization. [**Start exploring now →**](https://ctrl-gaurav.github.io/BeyondBench/)

---