{"id":50744128,"url":"https://github.com/ctrl-gaurav/beyondbench","last_synced_at":"2026-06-10T19:01:16.772Z","repository":{"id":342480214,"uuid":"1150363086","full_name":"ctrl-gaurav/BeyondBench","owner":"ctrl-gaurav","description":"[ICLR 2026 Accepted paper] BeyondBench: Contamination-Resistant Evaluation of Reasoning in Language Models","archived":false,"fork":false,"pushed_at":"2026-04-10T05:27:40.000Z","size":702,"stargazers_count":1,"open_issues_count":6,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-10T07:38:00.737Z","etag":null,"topics":["evaluation","evaluation-framework","framework","llms","reasoning","reasoning-language-models","slms"],"latest_commit_sha":null,"homepage":"https://ctrl-gaurav.github.io/BeyondBench/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ctrl-gaurav.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-02-05T07:28:43.000Z","updated_at":"2026-04-10T05:27:48.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/ctrl-gaurav/BeyondBench","commit_stats":null,"previous_names":["ctrl-gaurav/beyondbench"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/ctrl-gaurav/BeyondBench","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ctrl-gaurav%2FBeyondBench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposit
ories/ctrl-gaurav%2FBeyondBench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ctrl-gaurav%2FBeyondBench/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ctrl-gaurav%2FBeyondBench/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ctrl-gaurav","download_url":"https://codeload.github.com/ctrl-gaurav/BeyondBench/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ctrl-gaurav%2FBeyondBench/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34165482,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-10T02:00:07.152Z","response_time":89,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["evaluation","evaluation-framework","framework","llms","reasoning","reasoning-language-models","slms"],"created_at":"2026-06-10T19:01:15.633Z","updated_at":"2026-06-10T19:01:16.743Z","avatar_url":"https://github.com/ctrl-gaurav.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"assets/banner.svg\" alt=\"BeyondBench Banner\" width=\"100%\"\u003e\n\u003c/p\u003e\n\n\u003cdiv 
align=\"center\"\u003e\n\n[![Paper](https://img.shields.io/badge/📄_Paper-ArXiv%3A2509.24210-red?style=for-the-badge\u0026logo=arxiv)](https://arxiv.org/abs/2509.24210)\n[![Conference](https://img.shields.io/badge/🏆_ICLR-2026-blue?style=for-the-badge)](https://iclr.cc/)\n[![PyPI](https://img.shields.io/pypi/v/beyondbench.svg?style=for-the-badge\u0026logo=pypi\u0026logoColor=white)](https://pypi.org/project/beyondbench/)\n[![Downloads](https://img.shields.io/pepy/dt/beyondbench?style=for-the-badge\u0026logo=pypi\u0026logoColor=white\u0026label=Downloads)](https://pepy.tech/project/beyondbench)\n[![Monthly Downloads](https://img.shields.io/pypi/dm/beyondbench?style=for-the-badge\u0026logo=pypi\u0026logoColor=white\u0026label=Downloads%2Fmonth)](https://pypi.org/project/beyondbench/)\n[![Python](https://img.shields.io/badge/Python-3.10%2B-blue?style=for-the-badge\u0026logo=python\u0026logoColor=white)](https://pypi.org/project/beyondbench/)\n[![CI](https://img.shields.io/github/actions/workflow/status/ctrl-gaurav/BeyondBench/test.yml?branch=main\u0026style=for-the-badge\u0026logo=github\u0026label=CI)](https://github.com/ctrl-gaurav/BeyondBench/actions/workflows/test.yml)\n[![License](https://img.shields.io/badge/License-Apache_2.0-green.svg?style=for-the-badge)](LICENSE)\n[![Stars](https://img.shields.io/github/stars/ctrl-gaurav/BeyondBench?style=for-the-badge\u0026logo=github)](https://github.com/ctrl-gaurav/BeyondBench/stargazers)\n\n*Contamination-Resistant Evaluation of Reasoning in Language Models*\n\n**🏆 101+ Models Evaluated \u0026bull; 🧠 79 Reasoning Tasks \u0026bull; 🎯 138 Variations \u0026bull; 📊 \u003e10\u003csup\u003e15\u003c/sup\u003e Unique Instances**\n\n[**🌟 Explore Leaderboard**](https://ctrl-gaurav.github.io/BeyondBench/) | [**📖 Read Paper**](https://arxiv.org/abs/2509.24210) | [**📦 PyPI**](https://pypi.org/project/beyondbench/) | [**📚 Documentation**](docs/DOCUMENTATION.md)\n\n\u003c/div\u003e\n\n---\n\n## 📢 Latest News\n\n| Date | Update 
|\n|------|--------|\n| **Apr 17, 2026** | v0.2.1 released — critical PyPI packaging fix (missing subpackages in wheel). See [Changelog](CHANGELOG.md) |\n| **Apr 16, 2026** | v0.2.0 released — multi-GPU parallel eval, 1000+ tests, response caching, plugin SDK, Gradio dashboard. See [Changelog](CHANGELOG.md) |\n| **Mar 6, 2026** | v0.1.0 released \u0026mdash; FastAPI serve, CLI improvements, CI/CD, comprehensive tests. See [Changelog](CHANGELOG.md) |\n| **Feb 25, 2026** | v0.0.2 released \u0026mdash; critical bug fixes, much more stable! See [Changelog](CHANGELOG.md) |\n| **Feb 25, 2026** | v0.0.1 released \u0026mdash; 44 tasks, 117 variations, 101+ models |\n| **Jan 2026** | Paper accepted at **ICLR 2026** |\n| **Jan 2026** | Interactive leaderboard website launched |\n| **Sep 2025** | Paper submitted: [arXiv:2509.24210](https://arxiv.org/abs/2509.24210) |\n\n---\n\n## 💡 What is BeyondBench?\n\nBeyondBench evaluates the reasoning capabilities of language models **without relying on static benchmarks**. 
Our system **dynamically generates** novel problems across **79 distinct reasoning tasks** with **138 variations**, ensuring that models cannot memorize solutions and must demonstrate **true reasoning abilities**.\n\n\u003cdiv align=\"center\"\u003e\n\u003ca href=\"https://ctrl-gaurav.github.io/BeyondBench/\"\u003e\n\u003cimg src=\"https://img.shields.io/badge/🎯_Visit_Leaderboard-Live_Demo-brightgreen?style=for-the-badge\u0026logo=rocket\" alt=\"Visit Leaderboard\"\u003e\n\u003c/a\u003e\n\u003c/div\u003e\n\n### 🌟 Key Highlights\n\n\u003ctable\u003e\n\u003ctr\u003e\n\u003ctd width=\"33%\"\u003e\n\n#### 🔄 **Dynamic Problem Generation**\n- Problem space \u003e10^15 unique instances\n- Zero risk of data contamination\n- Fresh problems on every evaluation\n\n\u003c/td\u003e\n\u003ctd width=\"33%\"\u003e\n\n#### 🎯 **Three Difficulty Levels**\n- **Easy**: 44 fundamental reasoning tasks\n- **Medium**: 15 tasks with 59 variations\n- **Hard**: 20 tasks with 78 variations\n\n\u003c/td\u003e\n\u003ctd width=\"33%\"\u003e\n\n#### 🤖 **Multi-Backend Support**\n- OpenAI, Gemini, Anthropic APIs\n- vLLM for high-throughput local inference\n- HuggingFace Transformers\n\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"33%\"\u003e\n\n#### 📊 **Comprehensive Metrics**\n- Accuracy across difficulty levels\n- Instruction-following compliance\n- Token efficiency analysis\n\n\u003c/td\u003e\n\u003ctd width=\"33%\"\u003e\n\n#### 🛡️ **Contamination-Resistant**\n- No static benchmark memorization\n- Novel problem generation\n- Fair model comparison\n\n\u003c/td\u003e\n\u003ctd width=\"33%\"\u003e\n\n#### ⚡ **Extensive Coverage**\n- 101+ models evaluated\n- Open-source and proprietary\n- Regular updates with new models\n\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/table\u003e\n\n---\n\n## 🚀 Installation\n\n### From PyPI\n\n```bash\npip install beyondbench\n```\n\n### From Source\n\n```bash\ngit clone https://github.com/ctrl-gaurav/BeyondBench.git\ncd BeyondBench\npip install -e 
.\n```\n\n### With Optional Dependencies\n\n```bash\n# All API clients (OpenAI, Gemini, Anthropic)\npip install beyondbench[all-apis]\n\n# vLLM support (requires CUDA)\npip install beyondbench[vllm]\n\n# Everything\npip install beyondbench[full]\n```\n\n```bash\n# Performance optimization\npip install beyondbench[vllm]  # vLLM with prefix caching\npip install bitsandbytes       # 4-bit/8-bit quantization\n```\n\n---\n\n## ⚡ Quick Start\n\n### Interactive Wizard\n\n```bash\nbeyondbench\n```\n\n### Command Line\n\n```bash\n# Evaluate GPT-4o on the easy suite\nbeyondbench evaluate --model-id gpt-4o --api-provider openai --suite easy\n\n# Evaluate a local model with vLLM\nbeyondbench evaluate --model-id meta-llama/Llama-3.2-3B-Instruct --backend vllm --suite all\n\n# Evaluate Claude on hard tasks\nbeyondbench evaluate --model-id claude-sonnet-4-20250514 --api-provider anthropic --suite hard\n\n# List available tasks\nbeyondbench list-tasks\n```\n\n### Python API\n\n```python\nfrom beyondbench import EvaluationEngine, ModelHandler, TaskRegistry\n\n# Initialize model handler\nmodel = ModelHandler(\n    model_id=\"gpt-4o\",\n    api_provider=\"openai\",\n    api_key=\"your-api-key\"\n)\n\n# Run evaluation\nengine = EvaluationEngine(model_handler=model, output_dir=\"./results\")\nresults = engine.run_evaluation(suite=\"easy\", datapoints=100)\n\n# Print results\nprint(f\"Average Accuracy: {results['summary']['avg_accuracy']:.2%}\")\n```\n\n### API Server\n\n```bash\n# Start the BeyondBench API server\nbeyondbench serve --host 0.0.0.0 --port 8000\n\n# API docs at http://localhost:8000/docs\n```\n\n### Configuration Files\n\n```bash\n# Create a config interactively\nbeyondbench init\n\n# Run from config file\nbeyondbench run-config beyondbench/configs/default.yaml\n```\n\n### Results Viewer\n\n```bash\n# List past results\nbeyondbench results list\n\n# Show detailed results\nbeyondbench results show ./beyondbench_results/final_results.json\n\n# Compare two 
evaluations\nbeyondbench results compare result_a.json result_b.json\n\n# Get task info\nbeyondbench info sorting\n```\n\n---\n\n## 🔌 Supported Backends\n\n| Backend | Models | Features |\n|---------|--------|----------|\n| **OpenAI** | GPT-4o, GPT-4o-mini, GPT-5, GPT-5-mini | Reasoning effort control |\n| **Gemini** | Gemini 2.5 Pro, Gemini 2.5 Flash | Thinking budget configuration |\n| **Anthropic** | Claude Sonnet 4, Claude Opus 4 | Latest Claude models |\n| **vLLM** | Any HuggingFace model | Batch processing, tensor parallelism |\n| **Transformers** | Any HuggingFace model | CPU/GPU inference |\n\n---\n\n## 📊 Results\n\n### 🏆 Leaderboard (Top Models)\n\n\u003ctable\u003e\n\u003cthead\u003e\n\u003ctr\u003e\n\u003cth align=\"center\"\u003e🏅 Rank\u003c/th\u003e\n\u003cth align=\"left\"\u003e🤖 Model\u003c/th\u003e\n\u003cth align=\"center\"\u003e📊 Overall\u003c/th\u003e\n\u003cth align=\"center\"\u003e🎯 Instruction Following\u003c/th\u003e\n\u003c/tr\u003e\n\u003c/thead\u003e\n\u003ctbody\u003e\n\u003ctr\u003e\u003ctd align=\"center\"\u003e🥇\u003c/td\u003e\u003ctd\u003e\u003cstrong\u003eGPT-5*\u003c/strong\u003e\u003c/td\u003e\u003ctd align=\"center\"\u003e\u003cstrong\u003e83.56%\u003c/strong\u003e\u003c/td\u003e\u003ctd align=\"center\"\u003e96.15%\u003c/td\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003ctd align=\"center\"\u003e🥈\u003c/td\u003e\u003ctd\u003e\u003cstrong\u003eGPT-5-Nano*\u003c/strong\u003e\u003c/td\u003e\u003ctd align=\"center\"\u003e\u003cstrong\u003e82.04%\u003c/strong\u003e\u003c/td\u003e\u003ctd align=\"center\"\u003e93.58%\u003c/td\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003ctd align=\"center\"\u003e🥉\u003c/td\u003e\u003ctd\u003e\u003cstrong\u003eGPT-5-Mini*\u003c/strong\u003e\u003c/td\u003e\u003ctd align=\"center\"\u003e\u003cstrong\u003e81.67%\u003c/strong\u003e\u003c/td\u003e\u003ctd align=\"center\"\u003e94.23%\u003c/td\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003ctd 
align=\"center\"\u003e4\u003c/td\u003e\u003ctd\u003e\u003cstrong\u003eo3*\u003c/strong\u003e\u003c/td\u003e\u003ctd align=\"center\"\u003e\u003cstrong\u003e80.36%\u003c/strong\u003e\u003c/td\u003e\u003ctd align=\"center\"\u003e94.96%\u003c/td\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003ctd align=\"center\"\u003e5\u003c/td\u003e\u003ctd\u003e\u003cstrong\u003eo4-Mini*\u003c/strong\u003e\u003c/td\u003e\u003ctd align=\"center\"\u003e\u003cstrong\u003e79.04%\u003c/strong\u003e\u003c/td\u003e\u003ctd align=\"center\"\u003e95.30%\u003c/td\u003e\u003c/tr\u003e\n\u003c/tbody\u003e\n\u003c/table\u003e\n\n\u003csub\u003e*Models marked with * use reasoning/thinking tokens. Full results for 101+ models available in the [paper](https://arxiv.org/abs/2509.24210) and on the [leaderboard](https://ctrl-gaurav.github.io/BeyondBench/).\u003c/sub\u003e\n\n### 🔍 Key Findings\n\n- **Reasoning Gap**: Even top models show 20-30% performance drops on hard reasoning tasks\n- **Scaling Effects**: Larger models generally perform better, but the relationship is not always linear\n- **Instruction vs. 
Accuracy**: High accuracy does not guarantee perfect instruction-following\n\n---\n\n## ⚡ Performance\n\n| Feature | Improvement |\n|---------|-------------|\n| **Multi-GPU Parallel Evaluation** | Up to 8x speedup on 8 GPUs |\n| **Response Caching** | Near-instant repeat evaluations |\n| **vLLM Prefix Caching** | 2-3x faster for shared-prefix tasks |\n| **Quantization Support** | 4-bit/8-bit via bitsandbytes, GPTQ, AWQ |\n| **Model Warm-up** | Eliminates cold-start overhead |\n\n---\n\n## 🧩 Task Suites\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cstrong\u003eEasy Suite (44 Tasks)\u003c/strong\u003e\u003c/summary\u003e\n\n| Category | Tasks |\n|----------|-------|\n| **Arithmetic** | sum, multiplication, subtraction, division, absolute_difference, weighted_sum, parity_check, dot_product |\n| **Statistics** | mean, median, mode, running_average, moving_average, variance, standard_deviation |\n| **Counting** | odd_count, even_count, count_negative, count_unique, count_greater_than_previous, count_palindromic, count_perfect_squares, count_multiples, local_maxima_count, element_frequency |\n| **Extrema** | find_maximum, find_minimum, second_maximum, second_minimum, range, index_of_maximum, max_adjacent_difference, sum_of_max_indices |\n| **Sequences** | sorting, longest_increasing_subsequence, alternating_sum, sum_of_digits, cumulative_sum |\n| **List Operations** | reverse_list, rotate_list, interleave_lists |\n| **Set Operations** | set_intersection, set_difference |\n| **Comparison** | comparison |\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cstrong\u003eMedium Suite (15 Tasks, 59 Variations)\u003c/strong\u003e\u003c/summary\u003e\n\n| Task | Variations |\n|------|------------|\n| **Fibonacci Sequence** | 6 (Tribonacci, Lucas numbers, modified recursive) |\n| **Algebraic Sequence** | 10 (Polynomial, arithmetic, quadratic) |\n| **Geometric Sequence** | 10 (Exponential, compound growth, factorial) |\n| **Prime Sequence** | 11 (Prime 
gaps, twin primes, Sophie Germain) |\n| **Complex Pattern** | 12 (Interleaved, conditional, multi-rule) |\n| **Arithmetic Progression** | 1 (Varying common differences) |\n| **Harmonic Sequence** | 1 (Reciprocal sequences) |\n| **Collatz Sequence** | 1 (3n+1 conjecture) |\n| **Polynomial Evaluation** | 1 (Evaluate at given point) |\n| **Matrix Operations** | 1 (2x2 multiply, determinant, inverse) |\n| **Number Base Conversion** | 1 (Decimal, binary, hexadecimal) |\n| **Logical Operations** | 1 (AND, OR, NOT, XOR) |\n| **Pattern Completion** | 1 (Numeric pattern inference) |\n| **GCD/LCM** | 1 (Greatest common divisor, least common multiple) |\n| **Combinatorics** | 1 (Permutations and combinations) |\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cstrong\u003eHard Suite (20 Tasks, 78 Variations)\u003c/strong\u003e\u003c/summary\u003e\n\n| Task | Variations | Complexity |\n|------|------------|------------|\n| **Tower of Hanoi** | 6 | O(2^n) moves |\n| **N-Queens** | 4 | NP-complete |\n| **Graph Coloring** | 10 | NP-complete |\n| **Boolean SAT** | 5 | NP-complete |\n| **Sudoku** | 8 | Constraint satisfaction |\n| **Cryptarithmetic** | 12 | Constraint satisfaction |\n| **Matrix Chain** | 5 | Dynamic programming |\n| **Modular Systems** | 5 | Number theory |\n| **Constraint Optimization** | 5 | Operations research |\n| **Shortest Path** | 1 | Dijkstra's algorithm |\n| **Knapsack** | 1 | 0/1 dynamic programming |\n| **Traveling Salesman** | 1 | NP-hard combinatorial |\n| **Longest Common Subsequence** | 1 | Dynamic programming |\n| **Minimax Game** | 1 | Game tree search |\n| **Regex Matching** | 1 | Pattern matching |\n| **Topological Sort** | 1 | DAG ordering |\n| **Interval Scheduling** | 1 | Greedy algorithm |\n| **Coin Change** | 1 | Dynamic programming |\n| **Edit Distance** | 1 | String algorithms |\n| **Logic Grid Puzzles** | 8 | Deductive reasoning |\n\n\u003c/details\u003e\n\n---\n\n## 📚 Documentation\n\n- [**Full 
Documentation**](docs/DOCUMENTATION.md) \u0026mdash; Complete API reference and configuration guide\n- [**Usage Guide**](docs/USAGE.md) \u0026mdash; Detailed usage examples for all backends\n\n### Environment Variables\n\n```bash\nexport OPENAI_API_KEY=\"sk-...\"\nexport GEMINI_API_KEY=\"...\"\nexport ANTHROPIC_API_KEY=\"sk-ant-...\"\n```\n\n---\n\n## 🤝 Contributing\n\nWe welcome contributions! See the [Contributing Guide](CONTRIBUTING.md) for details.\n\n```bash\ngit clone https://github.com/ctrl-gaurav/BeyondBench.git\ncd BeyondBench\npip install -e \".[dev]\"\npre-commit install\npytest tests/ -v\n```\n\n### 🛠️ Ways to Contribute\n- **🐛 Bug Reports**: Found an issue? [Report it here](https://github.com/ctrl-gaurav/BeyondBench/issues)\n- **✨ Feature Requests**: Have ideas? [Share them here](https://github.com/ctrl-gaurav/BeyondBench/issues)\n- **🔧 Code Contributions**: Submit PRs for improvements\n- **📚 Documentation**: Help improve our docs\n- **🤖 Model Submissions**: Suggest models for evaluation\n\n---\n\n## 📝 Citation\n\nIf you use BeyondBench in your research, please cite our paper (accepted at **ICLR 2026**):\n\n```bibtex\n@misc{srivastava2025beyondbenchbenchmarkfreeevaluationreasoning,\n      title={BeyondBench: Contamination-Resistant Evaluation of Reasoning in Language Models},\n      author={Gaurav Srivastava and Aafiya Hussain and Zhenyu Bi and Swastik Roy and Priya Pitre and Meng Lu and Morteza Ziyadi and Xuan Wang},\n      year={2025},\n      eprint={2509.24210},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL},\n      url={https://arxiv.org/abs/2509.24210},\n}\n```\n\n---\n\n## 📞 Contact \u0026 Support\n\n- **📧 Email**: [gks@vt.edu](mailto:gks@vt.edu), [xuanw@vt.edu](mailto:xuanw@vt.edu)\n- **🐛 Issues**: [GitHub Issues](https://github.com/ctrl-gaurav/BeyondBench/issues)\n- **💬 Discussions**: [GitHub Discussions](https://github.com/ctrl-gaurav/BeyondBench/discussions)\n\n---\n\n## 📜 License\n\nThis project is licensed under the **Apache 
License 2.0** - see the [LICENSE](LICENSE) file for details.\n\n---\n\n\u003cdiv align=\"center\"\u003e\n\n## 🚀 Ready to Explore the Future of AI Evaluation?\n\n\u003ca href=\"https://ctrl-gaurav.github.io/BeyondBench/\"\u003e\n\u003cimg src=\"https://img.shields.io/badge/🎯_Explore_Leaderboard-Visit_Now-brightgreen?style=for-the-badge\u0026logo=rocket\" alt=\"Explore Leaderboard\"\u003e\n\u003c/a\u003e\n\n**Made with ❤️ by the BeyondBench Team**\n\n[![Virginia Tech](https://img.shields.io/badge/Virginia_Tech-CS_Department-maroon?style=flat\u0026logo=data:image/svg+xml;base64,PHN2ZyB3aWR0aD0iMjQiIGhlaWdodD0iMjQiIHZpZXdCb3g9IjAgMCAyNCAyNCIgZmlsbD0ibm9uZSIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KPHBhdGggZD0iTTEyIDJMMTMuMDkgOC4yNkwyMCA5TDEzLjA5IDE1Ljc0TDEyIDIyTDEwLjkxIDE1Ljc0TDQgOUwxMC45MSA4LjI2TDEyIDJaIiBmaWxsPSJjdXJyZW50Q29sb3IiLz4KPC9zdmc+)](https://cs.vt.edu/)\n[![Amazon AGI](https://img.shields.io/badge/Amazon-AGI-orange?style=flat\u0026logo=amazon)](https://www.amazon.science/)\n\n*Advancing the frontier of AI reasoning evaluation, one benchmark at a time* 🌟\n\n\u003c/div\u003e\n\n---\n\n\u003cdiv align=\"center\"\u003e\n\n| 🏠 [**Home**](https://ctrl-gaurav.github.io/BeyondBench/) | 📊 [**Leaderboard**](https://ctrl-gaurav.github.io/BeyondBench/#leaderboard) | 📖 [**Paper**](https://arxiv.org/abs/2509.24210) | 💻 [**Code**](https://github.com/ctrl-gaurav/BeyondBench) |\n|:---:|:---:|:---:|:---:|\n| Main website | Interactive rankings | Research paper | Source code |\n\n\u003c/div\u003e\n\n\u003e **🎯 Transform your understanding of AI capabilities.** BeyondBench reveals what language models can truly reason about, beyond memorization. 
[**Start exploring now →**](https://ctrl-gaurav.github.io/BeyondBench/)\n\n---\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://github.com/ctrl-gaurav/BeyondBench\"\u003e\n    \u003cimg src=\"assets/logo.svg\" alt=\"BeyondBench Logo\" width=\"100\"\u003e\n  \u003c/a\u003e\n\u003c/p\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fctrl-gaurav%2Fbeyondbench","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fctrl-gaurav%2Fbeyondbench","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fctrl-gaurav%2Fbeyondbench/lists"}