{"id":26671074,"url":"https://github.com/googlecloudplatform/evalbench","last_synced_at":"2025-06-22T09:39:28.575Z","repository":{"id":284254440,"uuid":"804976804","full_name":"GoogleCloudPlatform/evalbench","owner":"GoogleCloudPlatform","description":"EvalBench is a flexible framework designed to measure the quality of generative AI (GenAI) workflows around database specific tasks.","archived":false,"fork":false,"pushed_at":"2025-03-25T00:21:13.000Z","size":1242,"stargazers_count":6,"open_issues_count":3,"forks_count":0,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-03-25T00:27:44.369Z","etag":null,"topics":["databases","eval","evaluation-framework","nl2sql","text2sql"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/GoogleCloudPlatform.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"docs/contributing.md","funding":null,"license":"LICENSE","code_of_conduct":"docs/code-of-conduct.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-05-23T16:32:21.000Z","updated_at":"2025-03-24T23:59:44.000Z","dependencies_parsed_at":"2025-03-25T00:37:56.921Z","dependency_job_id":null,"html_url":"https://github.com/GoogleCloudPlatform/evalbench","commit_stats":null,"previous_names":["googlecloudplatform/evalbench"],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudPlatform%2Fevalbench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudPlatform%2Fevalbench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Google
CloudPlatform%2Fevalbench/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudPlatform%2Fevalbench/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/GoogleCloudPlatform","download_url":"https://codeload.github.com/GoogleCloudPlatform/evalbench/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245562595,"owners_count":20635901,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["databases","eval","evaluation-framework","nl2sql","text2sql"],"created_at":"2025-03-25T23:32:26.508Z","updated_at":"2025-04-12T04:07:37.340Z","avatar_url":"https://github.com/GoogleCloudPlatform.png","language":"Python","readme":"# EvalBench\n\nEvalBench is a flexible framework designed to measure the quality of generative AI (GenAI) workflows around database-specific tasks. It currently provides a comprehensive set of tools and modules to evaluate models on NL2SQL tasks, including the capability to run and score DQL, DML, and DDL queries across multiple supported databases. 
Its modular, plug-and-play architecture allows you to seamlessly integrate custom components while leveraging a robust evaluation pipeline, result storage, scoring strategies, and dashboarding capabilities.\n\n---\n\n## Getting Started \u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/GoogleCloudPlatform/evalbench/blob/main/docs/examples/sqlite_example.ipynb)\n\nFollow the steps below to run EvalBench on your local VM.\n\u003e *Note*: EvalBench requires Python 3.10 or higher.\n\n### 1. Clone the Repository\n\nClone the EvalBench repository from GitHub:\n\n```bash\ngit clone git@github.com:GoogleCloudPlatform/evalbench.git\n```\n\n### 2. Set Up a Virtual Environment\n\nNavigate to the repository directory and create a virtual environment:\n\n```bash\ncd evalbench\npython3 -m venv venv\nsource venv/bin/activate\n```\n\n### 3. Install Dependencies\n\nInstall the required Python dependencies:\n\n```bash\npip install -r requirements.txt\n```\n\nDue to a proto conflict between google-cloud packages, you may need to force-reinstall googleapis-common-protos:\n\n```bash\npip install --force-reinstall googleapis-common-protos==1.64.0\n```\n\n### 4. Configure GCP Authentication (For Vertex AI | Gemini Examples)\n\nIf gcloud is not already installed, follow the steps in the [gcloud installation guide](https://cloud.google.com/sdk/docs/install#installation_instructions).\n\nThen, authenticate using the Google Cloud CLI:\n\n```bash\ngcloud auth application-default login\n```\n\nThis step sets up the necessary credentials for accessing Vertex AI resources on your GCP project.\n\nYou can set your GCP project ID and region globally:\n\n```bash\nexport EVAL_GCP_PROJECT_ID=your_project_id_here\nexport EVAL_GCP_PROJECT_REGION=your_region_here\n```\n\n### 5. Set Your Evaluation Configuration\n\nFor a quick start, let's run NL2SQL on some SQLite DQL queries.\n\n1. 
First, read through [datasets/bat/example_run_config.yaml](/datasets/bat/example_run_config.yaml) to see the configuration settings we will run.\n\nNow, configure your evaluation by setting the `EVAL_CONFIG` environment variable. For example, to run a configuration using the `db_blog` dataset on SQLite:\n\n```bash\nexport EVAL_CONFIG=datasets/bat/example_run_config.yaml\n```\n\n### 6. Run EvalBench\n\nStart the evaluation process using the provided shell script:\n\n```bash\n./evalbench/run.sh\n```\n\n---\n\n## Overview\n\nEvalBench's architecture is built around a modular design that supports diverse evaluation needs:\n- **Modular and Plug-and-Play:** Easily integrate custom scoring modules, data processors, and dashboard components.\n- **Flexible Evaluation Pipeline:** Seamlessly run DQL, DML, and DDL tasks while using a consistent base pipeline.\n- **Result Storage and Reporting:** Store results in various formats (e.g., CSV, BigQuery) and visualize performance with built-in dashboards.\n- **Customizability:** Configure and extend EvalBench to measure the performance of GenAI workflows tailored to your specific requirements.\n\nEvalBench lets you quickly create experiments and A/B test improvements (available when the BigQuery reporting mode is set in the run config):\n\n\u003cimg width=\"911\" alt=\"Evalbench Reporting\" src=\"https://github.com/user-attachments/assets/0881c43e-b359-472b-a7fd-e1fee6a9adf3\" /\u003e\n\nThis includes measuring and quantifying improvements on specific databases or dialects:\n\n\u003cimg width=\"911\" alt=\"Evalbench Reporting by Databases / Dialects\" src=\"https://github.com/user-attachments/assets/e2172be1-045a-473d-92aa-304121843e7d\" /\u003e\n\nYou can also dig deeper into the exact details of improvements and regressions, including highlighted changes, their impact on the score, and an LLM-annotated explanation of the scoring changes when an LLM rater is used.\n\n\u003cimg width=\"911\" alt=\"Evalbench Reporting by 
Databases / Dialects\" src=\"https://github.com/user-attachments/assets/861696b5-42f1-44c7-a7d0-710f7a32918f\" /\u003e\n\u003cbr\u003e\u003cbr\u003e\n\nA complete guide to EvalBench's available functionality can be found in the [run-config documentation](/docs/configs/run-config.md).\n\nPlease explore the repository to learn more about customizing your evaluation workflows, integrating new metrics, and leveraging the full potential of EvalBench.\n\n---\n\nFor additional documentation, examples, and support, please refer to the [EvalBench documentation](https://github.com/GoogleCloudPlatform/evalbench). Enjoy evaluating your GenAI models!","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgooglecloudplatform%2Fevalbench","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgooglecloudplatform%2Fevalbench","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgooglecloudplatform%2Fevalbench/lists"}