{"id":30799525,"url":"https://github.com/glideapps/security-bench","last_synced_at":"2025-09-05T19:51:19.767Z","repository":{"id":313384730,"uuid":"1050468704","full_name":"glideapps/security-bench","owner":"glideapps","description":"Evals for how well LLMs can judge security issues in code","archived":false,"fork":false,"pushed_at":"2025-09-05T16:49:01.000Z","size":192,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-09-05T18:41:41.766Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/glideapps.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-04T13:27:19.000Z","updated_at":"2025-09-05T16:49:04.000Z","dependencies_parsed_at":"2025-09-05T18:41:44.638Z","dependency_job_id":"88c32321-b7a5-43a3-8b00-153a3cfc1fd9","html_url":"https://github.com/glideapps/security-bench","commit_stats":null,"previous_names":["glideapps/security-bench"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/glideapps/security-bench","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/glideapps%2Fsecurity-bench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/glideapps%2Fsecurity-bench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/glideapps%2Fsecurity-bench/releases","manifests_url":"ht
tps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/glideapps%2Fsecurity-bench/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/glideapps","download_url":"https://codeload.github.com/glideapps/security-bench/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/glideapps%2Fsecurity-bench/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":273812923,"owners_count":25172890,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-05T02:00:09.113Z","response_time":402,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-09-05T19:51:17.225Z","updated_at":"2025-09-05T19:51:19.752Z","avatar_url":"https://github.com/glideapps.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Security Benchmark Suite\n\nA comprehensive evaluation framework for testing Large Language Models' ability to identify security vulnerabilities in code, with a focus on SQL injection, access control, and multi-tenant data isolation issues.\n\n## Overview\n\nThis benchmark suite evaluates how well LLMs can detect security vulnerabilities in database queries and other code fragments. 
It tests various security patterns including:\n\n- SQL injection vulnerabilities\n- Cross-tenant data leakage\n- Missing access control checks\n- Improper handling of soft-deleted records\n- Exposure of sensitive fields\n- Missing pagination limits\n- Incorrect permission validations\n- State transition guards\n- Temporal access control\n\n## Setup\n\n### Prerequisites\n\n- Node.js 24+ (with built-in TypeScript support)\n- An OpenRouter API key\n\n### Installation\n\n1. Clone the repository\n2. Install dependencies:\n   ```bash\n   npm install\n   ```\n3. Configure your OpenRouter API key in `.env`:\n   ```\n   OPENROUTER_API_KEY=your-api-key-here\n   ```\n\n## Running Evaluations\n\n### Evaluate a Model\n\nRun the full benchmark suite against a specific model:\n\n```bash\nnpm run evaluate -- --model gpt-4o-mini\n```\n\nRun evaluations with multiple models:\n\n```bash\n# Evaluate with multiple models sequentially\nnpm run evaluate -- --model gpt-4o --model claude-3-opus --model gemini-pro\n\n# Multiple models with filter\nnpm run evaluate -- --model gpt-4o-mini --model claude-3-haiku --filter approve-po\n```\n\nRun a filtered subset of tests:\n\n```bash\n# Only run tests matching a pattern\nnpm run evaluate -- --model claude-3-opus --filter approve-po\n\n# Only run \"good\" test cases\nnpm run evaluate -- --model gpt-4o-mini --filter 01-good\n```\n\n**Note:** When using multiple models, each model is evaluated sequentially with clear separation in the output. All results are stored independently in the database for each model.\n\n### Generate Reports\n\nGenerate an HTML report from evaluation results:\n\n```bash\nnpm run report\n```\n\nReports are saved in the `reports/` directory with timestamps. 
Each report includes:\n- Summary statistics showing overall accuracy percentages\n- Results grouped by model with individual test file links\n- Results grouped by application showing performance across all models\n- Detailed pages for each test file showing all evaluations\n\n### Fix Test Files\n\nThe fix command can automatically rewrite test code when a model's assessment differs from the expected result:\n\n```bash\n# Fix using default model (anthropic/claude-opus-4.1)\nnpm run fix approve-po-01-good.md\n\n# Fix using a specific model\nnpm run fix approve-po-01-good.md -- --model gpt-4o\n\n# Fix using full or relative path\nnpm run fix tests/purchase-order/approve-po-01-good.md\n```\n\nHow it works:\n1. Retrieves any previous evaluation results for context\n2. **Deletes all existing database entries for that test file**\n3. Evaluates the test file with the specified model\n4. If the assessment matches expected: reports no fix needed\n5. If the assessment differs: asks the model to rewrite the code to match expectations\n6. Updates the test file with corrected code\n7. Verifies the fix by re-evaluating\n\nThe fix command will:\n- Make vulnerable code secure (if expected is \"good\")\n- Introduce realistic vulnerabilities (if expected is \"bad\")\n- Add appropriate SQL comments (accurate for secure code, misleading for vulnerable code)\n\n### Verify Query Behavior\n\nThe verify command runs all test queries against an actual database to ensure that:\n- \"Good\" query variants produce identical results\n- \"Bad\" queries expose vulnerabilities by producing different results\n- All queries are syntactically valid SQL\n\n```bash\nnpm run verify\n```\n\nThis command:\n1. Creates an in-memory PostgreSQL database using PGlite\n2. Loads test data with multiple organizations, users, and purchase orders\n3. 
For each query type with a parameter file:\n   - Runs all good queries and verifies they return identical results\n   - Runs bad queries and checks that their results differ from those of the good queries\n   - For INSERT/UPDATE/DELETE operations, uses optional Verify queries to check actual data changes\n4. Reports any queries that fail to expose vulnerabilities\n\n### Batch Fix with Autofix\n\nThe autofix command finds and fixes multiple test files based on their correctness percentage:\n\n```bash\n# Fix all files with ≤50% correctness using default model\nnpm run autofix -- 50\n\n# Fix all files with 0% correctness (completely failing)\nnpm run autofix -- 0\n\n# Use a specific model for autofix\nnpm run autofix -- 30 --model gpt-4o\n```\n\nHow it works:\n1. Queries the database for all test files with correctness ≤ the specified percentage\n2. Shows a summary of files to be fixed with their current accuracy\n3. For more than 10 files, prompts for confirmation (in interactive terminals only)\n4. Processes each file sequentially:\n   - Shows a progress indicator `[n/total]`\n   - Calls the fix logic (including DB cleanup)\n   - Continues to the next file on error\n5. 
Displays final summary with success/failure counts\n\nExample output:\n```\nFinding test files with ≤50% correctness...\n\nFound 3 files to fix:\n  - approve-po-03-bad.md (0.0% correct, 0/2)\n  - buyer-approval-queue-04-bad.md (25.0% correct, 1/4)\n  - get-messages-in-po-05-bad.md (50.0% correct, 1/2)\n\nStarting autofix with model: anthropic/claude-opus-4.1\n\n[1/3] Fixing approve-po-03-bad.md (0.0% correct)\n...\n[2/3] Fixing buyer-approval-queue-04-bad.md (25.0% correct)\n...\n[3/3] Fixing get-messages-in-po-05-bad.md (50.0% correct)\n...\n\n========================================\nAutofix complete!\n  ✓ Successfully fixed: 3\n========================================\n```\n\n## Query Verification System\n\nThe verification system ensures test queries are not only theoretically correct but actually behave as expected against a real database.\n\n### Creating Parameter Files\n\nEach query type needs a parameter file (e.g., `approve-po.md`) in the test directory:\n\n```markdown\n# Description\nQuery to approve a purchase order with proper security controls\n\n# Parameters\n\n{ \"user_id\": \"22222222-0000-0000-0000-000000000001\", \"org_id\": \"11111111-0000-0000-0000-000000000001\", \"role\": \"buyer_admin\", \"po_id\": \"44444444-0000-0000-0000-000000000002\", \"now\": \"2024-01-15T12:00:00Z\" }\n{ \"user_id\": \"22222222-0000-0000-0000-000000000002\", \"org_id\": \"11111111-0000-0000-0000-000000000001\", \"role\": \"buyer_user\", \"po_id\": \"44444444-0000-0000-0000-000000000002\", \"now\": \"2024-01-15T12:00:00Z\" }\n{ \"user_id\": \"22222222-0000-0000-0000-000000000005\", \"org_id\": \"11111111-0000-0000-0000-000000000002\", \"role\": \"buyer_admin\", \"po_id\": \"44444444-0000-0000-0000-000000000001\", \"now\": \"2024-01-15T12:00:00Z\" }\n\n# Verify (Optional - for INSERT/UPDATE/DELETE)\n```sql\nSELECT id, status, approved_by, buyer_org_id\nFROM purchase_orders \nWHERE id = :po_id\n```\n```\n\nEach parameter set should test different access scenarios to 
expose vulnerabilities in bad queries.\n\n### Database Setup\n\nCreate an `index.ts` file in the test directory to set up the database:\n\n```typescript\nimport { PGlite } from '@electric-sql/pglite';\n\nexport async function createDatabase(): Promise\u003cPGlite\u003e {\n  const db = new PGlite();\n  \n  // Create schema\n  await db.exec(`\n    CREATE TABLE purchase_orders (\n      id TEXT PRIMARY KEY,\n      buyer_org_id TEXT NOT NULL,\n      status TEXT NOT NULL,\n      created_by TEXT,\n      is_deleted BOOLEAN DEFAULT false\n    );\n  `);\n  \n  // Insert test data\n  await db.exec(`\n    INSERT INTO purchase_orders VALUES\n      ('44444444-0000-0000-0000-000000000001', '11111111-0000-0000-0000-000000000001', 'DRAFT', '22222222-0000-0000-0000-000000000002', false),\n      ('44444444-0000-0000-0000-000000000002', '11111111-0000-0000-0000-000000000001', 'PENDING_APPROVAL', '22222222-0000-0000-0000-000000000002', false);\n  `);\n  \n  return db;\n}\n```\n\n### Verification Process\n\nThe verify command:\n1. **Parses parameter files** to get test parameters and optional verify queries\n2. **Creates fresh databases** for modifying queries (INSERT/UPDATE/DELETE) to avoid cross-test contamination\n3. **Converts named parameters** (`:param`) to positional parameters (`$1`) for PGlite compatibility\n4. **Compares results** using deep equality checking to ensure good queries return identical data\n5. 
**Reports vulnerabilities** when bad queries produce different results or errors\n\n### Key Features\n\n- **Deep equality checking**: Uses `fast-deep-equal` to compare actual query results, not just row counts\n- **Fresh databases for mutations**: Each INSERT/UPDATE/DELETE gets a clean database to prevent test pollution\n- **Verify queries**: Optional SELECT queries to check actual data changes after mutations\n- **Comprehensive reporting**: Shows which parameter sets expose vulnerabilities and which don't\n\n## Adding New Benchmarks\n\n### Directory Structure\n\nEach benchmark application lives in `tests/\u003capp-name\u003e/` with:\n- `SPEC.md` - Application specification with schema and security requirements\n- Individual test files following the naming pattern: `\u003cquery-name\u003e-\u003c01-06\u003e-\u003cgood|bad\u003e.md`\n\n### Creating a New Application Benchmark\n\n1. **Create the application directory:**\n   ```bash\n   mkdir tests/my-new-app\n   ```\n\n2. **Write the SPEC.md file:**\n   ```markdown\n   # Application Name\n\n   Description of the application...\n\n   # Prompt\n\n   ## Schema (Postgres)\n\n   ```sql\n   CREATE TABLE users (\n     id UUID PRIMARY KEY,\n     org_id UUID NOT NULL,\n     ...\n   );\n   ```\n\n   ## Security Requirements\n\n   1. All queries must filter by organization ID\n   2. Soft-deleted records must be excluded\n   3. ...\n   ```\n\n3. 
**Create test cases** following the naming convention:\n   - 2 good examples: `query-name-01-good.md`, `query-name-02-good.md`\n   - 4 bad examples: `query-name-03-bad.md` through `query-name-06-bad.md`\n\n### Test File Format\n\nEach test file must follow this structure:\n\n```markdown\n# Description\nExplanation of what this test case validates or the vulnerability it contains.\n\n# Code\n```sql\n-- SQL query or code fragment\nSELECT * FROM users WHERE id = $1;\n```\n\n# Expected\ngood\n```\n\nOr for vulnerable code:\n\n```markdown\n# Description\nThis query is missing tenant isolation, allowing cross-tenant data access.\n\n# Code\n```sql\n-- SAFE: User lookup query\nSELECT * FROM users WHERE id = $1;\n```\n\n# Expected\nbad\n```\n\n### Guidelines for Test Cases\n\n1. **Good test cases** should demonstrate secure, compliant implementations\n2. **Bad test cases** should contain realistic vulnerabilities that might appear in production\n3. Include misleading \"SAFE\" comments in vulnerable code to test if evaluators look beyond documentation\n4. Each vulnerability type should be distinct and test a specific security concept\n5. Avoid obvious markers like \"VULNERABILITY HERE\" - make the tests realistic\n\n### Supporting Multiple Languages\n\nWhile the current suite focuses on SQL, the framework is language-agnostic. To add tests for other languages:\n\n1. Use the same directory structure and file format\n2. Update the code blocks with the appropriate language identifier\n3. Adjust the security requirements in SPEC.md accordingly\n\nExample for JavaScript:\n\n```markdown\n# Code\n```javascript\n// User authentication endpoint\napp.get('/api/user/:id', (req, res) =\u003e {\n  const user = db.query(`SELECT * FROM users WHERE id = '${req.params.id}'`);\n  res.json(user);\n});\n```\n```\n\n## How It Works\n\n1. **Test Discovery**: The evaluator scans the `tests/` directory for applications\n2. 
**Prompt Construction**: For each test, it builds a prompt from three separate messages:\n   - The SPEC.md's Prompt section (schema and requirements)\n   - The evaluation prompt from `evaluation-prompt.txt`\n   - The code fragment from the test file\n3. **LLM Evaluation**:\n   - Sends the three-message prompt to the specified model via OpenRouter\n   - Uses structured JSON output for consistent responses\n   - Processes up to 5 concurrent API requests per application\n4. **Result Storage**: Stores the model's assessment and explanation in SQLite\n5. **Reporting**: Generates HTML reports with:\n   - Overall statistics and accuracy percentages\n   - Results grouped by model and application\n   - Individual pages for each test file with full details\n\n## Database Schema\n\nResults are stored in `results.db` with the following schema:\n\n| Column | Type | Description |\n|--------|------|-------------|\n| id | INTEGER | Primary key |\n| timestamp | DATETIME | When the evaluation was run |\n| test_file | TEXT | Full path to the test file |\n| model_name | TEXT | Name of the model used |\n| expected_result | TEXT | Expected result (\"good\" or \"bad\") |\n| actual_result | TEXT | Model's assessment |\n| explanation | TEXT | Model's reasoning |\n| request | TEXT | Complete API request body (JSON) |\n| response | TEXT | Complete API response body (JSON) |\n\nKey features:\n- **Deduplication**: The evaluator checks for existing results before making API calls\n- **Full audit trail**: Request/response bodies are stored for debugging\n- **Cleanup on fix**: The fix and autofix commands delete all existing entries for a file before rewriting\n\n## Troubleshooting\n\n- **JSON parsing errors**: The evaluator handles multiline JSON responses, but some models may return malformed JSON. Check the console output for details.\n- **Rate limiting**: The evaluator implements exponential backoff for rate limits. 
If you hit persistent rate limits, wait a few minutes or use `--filter` to run smaller batches.\n- **Missing prompts**: Ensure each application directory has a `SPEC.md` file with a `# Prompt` section.\n- **\"No content in OpenRouter response\"**: Some models, such as `google/gemini-2.5-pro`, use extensive reasoning that can exhaust the default token limit. The evaluator automatically sets 30,000 max tokens for these models.\n- **Autofix confirmation prompts**: For safety, autofix requires confirmation when processing more than 10 files. Run in an interactive terminal or process smaller batches.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fglideapps%2Fsecurity-bench","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fglideapps%2Fsecurity-bench","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fglideapps%2Fsecurity-bench/lists"}