{"id":29671428,"url":"https://github.com/tinybirdco/llm-benchmark","last_synced_at":"2025-07-22T20:07:50.801Z","repository":{"id":292023778,"uuid":"976040899","full_name":"tinybirdco/llm-benchmark","owner":"tinybirdco","description":"We assessed the ability of popular LLMs to generate accurate and efficient SQL from natural language prompts. Using a 200 million record dataset from the GH Archive uploaded to Tinybird, we asked the LLMs to generate SQL based on 50 prompts. ","archived":false,"fork":false,"pushed_at":"2025-07-14T10:56:07.000Z","size":9917,"stargazers_count":38,"open_issues_count":3,"forks_count":4,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-07-14T12:17:37.713Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tinybirdco.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-01T11:29:39.000Z","updated_at":"2025-07-14T10:56:11.000Z","dependencies_parsed_at":"2025-05-07T19:19:36.421Z","dependency_job_id":"987344f4-4ce7-4873-8ba9-eff804efdf1a","html_url":"https://github.com/tinybirdco/llm-benchmark","commit_stats":null,"previous_names":["tinybirdco/llm-benchmark"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/tinybirdco/llm-benchmark","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tinybirdco%2Fllm-benchmark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tinybirdco%2Fllm-benchmark/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tinybirdco%2Fllm-benchmark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tinybirdco%2Fllm-benchmark/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tinybirdco","download_url":"https://codeload.github.com/tinybirdco/llm-benchmark/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tinybirdco%2Fllm-benchmark/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266563915,"owners_count":23948689,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-22T02:00:09.085Z","response_time":66,"last_error":null,"robots_txt_status":null,"robots_txt_updated_at":null,"robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-07-22T20:07:50.108Z","updated_at":"2025-07-22T20:07:50.783Z","avatar_url":"https://github.com/tinybirdco.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# LLM SQL Generation Benchmark\n\nA tool for benchmarking various Large Language Models (LLMs) on their ability to generate correct analytical SQL queries for Tinybird.\n\nSee results: https://llm-benchmark.tinybird.live/\n\n![LLM SQL Benchmark](src/public/llm-sql-benchmark.png)\n\n## Overview\n\nThis benchmark evaluates how well different LLMs can generate analytical SQL queries based on natural language questions about data in Tinybird. It measures:\n\n- SQL query correctness\n- Execution success\n- Performance metrics (time to first token, total duration, token usage)\n- Error handling and recovery\n\nThe benchmark includes an automated retry mechanism that feeds execution errors back to the model for correction.\n\n## Supported Providers \u0026 Models\n\nThe benchmark currently supports the following providers and models through [OpenRouter](https://openrouter.ai/):\n\n- **X.AI**: Grok-3 Beta, Grok-3 Mini Beta, Grok-4\n- **Anthropic**: Claude 3.5 Sonnet, Claude 3.7 Sonnet, Claude Opus 4, Claude Sonnet 4\n- **DeepSeek**: DeepSeek Chat v3 0324, DeepSeek Chat v3 0324 Free\n- **Google**: Gemini 2.0 Flash 001, Gemini 2.5 Flash, Gemini 2.5 Pro\n- **Meta**: Llama 4 Maverick, Llama 4 Scout, Llama 3.3 70B Instruct\n- **Mistral**: Ministral 8B, Mistral Small 3.1 24B Instruct, Mistral Nemo, Magistral Small 2506, Devstral Medium, Devstral Small\n- **OpenAI**: GPT-4.1, GPT-4.1 Nano, GPT-4o Mini, O3, O3 Mini, O3 Pro, O4 Mini, O4 Mini High\n\nIt can be extended to other models, see [how to benchmark a new model](#how-to-benchmark-a-new-model)\n\n## Methodology\n\nThe benchmark is based on this [github_events](https://github.com/tinybirdco/llm-benchmark/blob/main/src/tinybird/datasources/github_events.datasource) table schema deployed to a [free Tinybird account](https://www.tinybird.co/pricing).\n\nThe corpus consists of 200M rows from the public [GitHub archive](https://www.gharchive.org/). For ease of ingestion, they are provided as [Parquet files](https://storage.googleapis.com/dev-alrocar-public/github). Each file has 50M rows, so just 4 files are ingested for the benchmark. For this specific benchmark, scale is not critical since we are comparing models against each other, and performance data can be easily extrapolated.\n\nEach LLM must produce SQL from 50 natural language prompts or questions about public GitHub activity. You can find the complete list of questions in the `DESCRIPTION` of each Tinybird endpoint [here](https://github.com/tinybirdco/llm-benchmark/tree/main/src/tinybird/endpoints). These endpoints are deployed to Tinybird to be used as a baseline for output correctness.\n\nThe [benchmark](https://github.com/tinybirdco/llm-benchmark/blob/96a738aafafbae32a0a72e6f149da7ebab452130/src/benchmark/index.ts#L181) is a Node.js application that:\n\n1. Runs the Tinybird endpoints to get a [results-human.json](https://github.com/tinybirdco/llm-benchmark/blob/main/src/benchmark/results-human.json) as output baseline.\n2. Iterates through a [list of models](https://github.com/tinybirdco/llm-benchmark/blob/main/src/benchmark-config.json) to extract SQL generation performance metrics.\n3. Runs the generated queries and validates output correctness.\n\nEach model receives a [system prompt](https://github.com/tinybirdco/llm-benchmark/blob/main/src/benchmark/prompt.ts) and the [github_events.datasource](https://github.com/tinybirdco/llm-benchmark/blob/main/src/tinybird/datasources/github_events.datasource) schema as context and must produce SQL that returns an output when executed over the SQL API of Tinybird.\n\nIf the SQL produced is not valid, the LLM call is retried up to three times, passing the error from the SQL API as context. Output is stored in a [results.json](https://github.com/tinybirdco/llm-benchmark/blob/main/src/benchmark/results.json) file.\n\nOnce all models have been processed, `results-human.json` and `results.json` are compared to extract a metric for output correctness and stored in [validation-results.json](https://github.com/tinybirdco/llm-benchmark/blob/main/src/benchmark/validation-results.json).\n\nRead this [blog post](https://tbrd.co/LKHKD7c) to learn more about how the benchmark measures output performance and correctness.\n\n## Results presentation\n\nResults produced by the benchmark are stored in JSON files and presented in a web application deployed to https://llm-benchmark.tinybird.live/\n\nYou can find the source code [here](https://github.com/tinybirdco/llm-benchmark/tree/main/src/src)\n\n## How to benchmark a new model\n\nRun the benchmark locally and extend it to any other model following these instructions.\n\n### Prerequisites\n\n- Node.js 18+ and npm\n- OpenRouter API key\n- Tinybird workspace token and API access\n\n### Installation\n\n1. Clone this repository\n2. Install dependencies:\n\n```bash\ncd llm-benchmark/src\nnpm install\n```\n\n3. Prepare the Tinybird Workspace:\n\n```bash\ncurl https://tinybird.co | sh\ncd llm-benchmark/src/tinybird\ntb login\ntb --cloud deploy\ntb --cloud datasource append github_events https://storage.googleapis.com/dev-alrocar-public/github/01.parquet\n```\n\n4. Create a `.env` file with required credentials:\n\n```\nOPENROUTER_API_KEY=your_openrouter_api_key\nTINYBIRD_WORKSPACE_TOKEN=your_tinybird_token\nTINYBIRD_API_HOST=your_tinybird_api_host\n```\n\n### Usage\n\nRun the benchmark:\n\n```bash\nnpm run benchmark\n```\n\nThis will:\n1. Load the configured models from `benchmark-config.json`\n2. Run each model against a set of predefined questions\n3. Execute generated SQL queries against your Tinybird workspace\n4. Store results in `benchmark/results.json`\n\n### Test a new model\n\nEdit `benchmark-config.json` to customize which providers and models to test.\n\nRun the new model so the results file is updated:\n\n```bash\nnpm run benchmark -- --model=\"openai/o3\" --debug\n```\n\n## Results Analysis\n\nResults are saved in JSON format with detailed information about each query inside the `benchmark` folder in this repository. To visualize the results, you can start the Next.js application:\n\n```bash\ncd llm-benchmark/src\nnpm install\nnpm run dev\n```\n\n## Attribution\n\nThe GitHub dataset used in this benchmark is based on work by:\n\nMilovidov A., 2020. Everything You Ever Wanted To Know About GitHub (But Were Afraid To Ask), https://ghe.clickhouse.tech/\n\n## Contributing\n\nContributions are welcome! Please feel free to submit a Pull Request. You can propose changes, extend the documentation, and share ideas by creating pull requests and issues on the GitHub repository.\n\n## License\n\nThis project is open-source and available under the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license or [Apache 2](https://www.apache.org/licenses/LICENSE-2.0) license. Attribution is required when using or adapting this content.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftinybirdco%2Fllm-benchmark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftinybirdco%2Fllm-benchmark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftinybirdco%2Fllm-benchmark/lists"}