https://github.com/glideapps/security-bench
Evals for how well LLMs can judge security issues in code
# Security Benchmark Suite
A comprehensive evaluation framework for testing Large Language Models' ability to identify security vulnerabilities in code, with a focus on SQL injection, access control, and multi-tenant data isolation issues.
## Overview
This benchmark suite evaluates how well LLMs can detect security vulnerabilities in database queries and other code fragments. It tests various security patterns including:
- SQL injection vulnerabilities
- Cross-tenant data leakage
- Missing access control checks
- Improper handling of soft-deleted records
- Exposure of sensitive fields
- Missing pagination limits
- Incorrect permission validations
- State transition guards
- Temporal access control
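As a hypothetical illustration of several of these patterns (not taken from the benchmark's test files), compare an interpolated, unscoped query with a parameterized, tenant-scoped one:

```typescript
// Hypothetical illustration only; column names mirror the purchase_orders
// schema used elsewhere in this README.
const poId = "44444444-0000-0000-0000-000000000001";
const orgId = "11111111-0000-0000-0000-000000000001";

// BAD: string interpolation (SQL injection) with no tenant or soft-delete filter.
const bad = `SELECT * FROM purchase_orders WHERE id = '${poId}'`;

// GOOD: parameterized, scoped to the caller's organization, excluding soft-deleted rows.
const good = {
  text: `SELECT * FROM purchase_orders
         WHERE id = $1 AND buyer_org_id = $2 AND is_deleted = false`,
  values: [poId, orgId],
};
```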
## Setup
### Prerequisites
- Node.js 24+ (with built-in TypeScript support)
- An OpenRouter API key
### Installation
1. Clone the repository
2. Install dependencies:
```bash
npm install
```
3. Configure your OpenRouter API key in `.env`:
```
OPENROUTER_API_KEY=your-api-key-here
```
## Running Evaluations
### Evaluate a Model
Run the full benchmark suite against a specific model:
```bash
npm run evaluate -- --model gpt-4o-mini
```
Run evaluations with multiple models:
```bash
# Evaluate with multiple models sequentially
npm run evaluate -- --model gpt-4o --model claude-3-opus --model gemini-pro
# Multiple models with filter
npm run evaluate -- --model gpt-4o-mini --model claude-3-haiku --filter approve-po
```
Run a filtered subset of tests:
```bash
# Only run tests matching a pattern
npm run evaluate -- --model claude-3-opus --filter approve-po
# Only run "good" test cases
npm run evaluate -- --model gpt-4o-mini --filter 01-good
```
**Note:** When using multiple models, each model is evaluated sequentially with clear separation in the output. All results are stored independently in the database for each model.
### Generate Reports
Generate an HTML report from evaluation results:
```bash
npm run report
```
Reports are saved in the `reports/` directory with timestamps. Each report includes:
- Summary statistics showing overall accuracy percentages
- Results grouped by model with individual test file links
- Results grouped by application showing performance across all models
- Detailed pages for each test file showing all evaluations
### Fix Test Files
The fix command can automatically rewrite test code when a model's assessment differs from the expected result:
```bash
# Fix using default model (anthropic/claude-opus-4.1)
npm run fix approve-po-01-good.md
# Fix using a specific model
npm run fix approve-po-01-good.md -- --model gpt-4o
# Fix using full or relative path
npm run fix tests/purchase-order/approve-po-01-good.md
```
How it works:
1. Retrieves any previous evaluation results for context
2. **Deletes all existing database entries for that test file**
3. Evaluates the test file with the specified model
4. If the assessment matches expected: reports no fix needed
5. If the assessment differs: asks the model to rewrite the code to match expectations
6. Updates the test file with corrected code
7. Verifies the fix by re-evaluating
The fix command will:
- Make vulnerable code secure (if expected is "good")
- Introduce realistic vulnerabilities (if expected is "bad")
- Add appropriate SQL comments (accurate for secure code, misleading for vulnerable code)
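Put together, the steps above amount to something like the following sketch. Every helper passed in via `deps` is a hypothetical stand-in for whatever the repository actually implements; this is a paraphrase of the documented flow, not its code:

```typescript
// Rough paraphrase of the fix flow described above; every `deps` function is hypothetical.
type Assessment = { assessment: "good" | "bad"; explanation: string };

async function fix(
  testFile: string,
  model: string,
  deps: {
    getPreviousResults(file: string): Promise<Assessment[]>;
    deleteResultsFor(file: string): Promise<void>;
    expectedResultOf(file: string): "good" | "bad";
    evaluate(file: string, model: string, history?: Assessment[]): Promise<Assessment>;
    rewriteToMatch(file: string, expected: "good" | "bad", failing: Assessment): Promise<string>;
    writeTestFile(file: string, code: string): Promise<void>;
  }
): Promise<"no fix needed" | "fixed" | "still failing"> {
  const history = await deps.getPreviousResults(testFile);      // 1. prior results for context
  await deps.deleteResultsFor(testFile);                        // 2. clear existing DB entries
  const expected = deps.expectedResultOf(testFile);             //    from the "# Expected" section
  const first = await deps.evaluate(testFile, model, history);  // 3. evaluate the test file
  if (first.assessment === expected) return "no fix needed";    // 4. already matches
  const code = await deps.rewriteToMatch(testFile, expected, first); // 5. rewrite the code
  await deps.writeTestFile(testFile, code);                     // 6. update the test file
  const second = await deps.evaluate(testFile, model);          // 7. verify by re-evaluating
  return second.assessment === expected ? "fixed" : "still failing";
}
```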
### Verify Query Behavior
The verify command runs all test queries against an actual database to ensure that:
- "Good" query variants produce identical results
- "Bad" queries expose vulnerabilities by producing different results
- All queries are syntactically valid SQL
```bash
npm run verify
```
This command:
1. Creates an in-memory PostgreSQL database using PGlite
2. Loads test data with multiple organizations, users, and purchase orders
3. For each query type with a parameter file:
- Runs all good queries and verifies they return identical results
- Runs bad queries and checks they differ from good queries
- For INSERT/UPDATE/DELETE operations, uses optional Verify queries to check actual data changes
4. Reports any queries that fail to expose vulnerabilities
### Batch Fix with Autofix
The autofix command finds and fixes multiple test files based on their correctness percentage:
```bash
# Fix all files with ≤50% correctness using default model
npm run autofix -- 50
# Fix all files with 0% correctness (completely failing)
npm run autofix -- 0
# Use a specific model for autofix
npm run autofix -- 30 --model gpt-4o
```
How it works:
1. Queries the database for all test files with correctness ≤ the specified percentage
2. Shows a summary of files to be fixed with their current accuracy
3. For >10 files, prompts for confirmation (in interactive terminals only)
4. Processes each file sequentially:
- Shows progress indicator `[n/total]`
- Calls the fix logic (including DB cleanup)
- Continues on error
5. Displays final summary with success/failure counts
Example output:
```
Finding test files with ≤50% correctness...
Found 3 files to fix:
- approve-po-03-bad.md (0.0% correct, 0/2)
- buyer-approval-queue-04-bad.md (25.0% correct, 1/4)
- get-messages-in-po-05-bad.md (50.0% correct, 1/2)
Starting autofix with model: anthropic/claude-opus-4.1
[1/3] Fixing approve-po-03-bad.md (0.0% correct)
...
[2/3] Fixing buyer-approval-queue-04-bad.md (25.0% correct)
...
[3/3] Fixing get-messages-in-po-05-bad.md (50.0% correct)
...
========================================
Autofix complete!
✓ Successfully fixed: 3
========================================
```
## Query Verification System
The verification system ensures that test queries are not only theoretically correct but also behave as expected against a real database.
### Creating Parameter Files
Each query type needs a parameter file (e.g., `approve-po.md`) in the test directory:
````markdown
# Description
Query to approve a purchase order with proper security controls
# Parameters
{ "user_id": "22222222-0000-0000-0000-000000000001", "org_id": "11111111-0000-0000-0000-000000000001", "role": "buyer_admin", "po_id": "44444444-0000-0000-0000-000000000002", "now": "2024-01-15T12:00:00Z" }
{ "user_id": "22222222-0000-0000-0000-000000000002", "org_id": "11111111-0000-0000-0000-000000000001", "role": "buyer_user", "po_id": "44444444-0000-0000-0000-000000000002", "now": "2024-01-15T12:00:00Z" }
{ "user_id": "22222222-0000-0000-0000-000000000005", "org_id": "11111111-0000-0000-0000-000000000002", "role": "buyer_admin", "po_id": "44444444-0000-0000-0000-000000000001", "now": "2024-01-15T12:00:00Z" }
# Verify (Optional - for INSERT/UPDATE/DELETE)
```sql
SELECT id, status, approved_by, buyer_org_id
FROM purchase_orders
WHERE id = :po_id
```
````
Each parameter set should test different access scenarios to expose vulnerabilities in bad queries.
### Database Setup
Create an `index.ts` file in the test directory to set up the database:
```typescript
import { PGlite } from '@electric-sql/pglite';
export async function createDatabase(): Promise<PGlite> {
  const db = new PGlite();

  // Create schema
  await db.exec(`
    CREATE TABLE purchase_orders (
      id TEXT PRIMARY KEY,
      buyer_org_id TEXT NOT NULL,
      status TEXT NOT NULL,
      created_by TEXT,
      is_deleted BOOLEAN DEFAULT false
    );
  `);

  // Insert test data
  await db.exec(`
    INSERT INTO purchase_orders VALUES
      ('44444444-0000-0000-0000-000000000001', '11111111-0000-0000-0000-000000000001', 'DRAFT', '22222222-0000-0000-0000-000000000002', false),
      ('44444444-0000-0000-0000-000000000002', '11111111-0000-0000-0000-000000000001', 'PENDING_APPROVAL', '22222222-0000-0000-0000-000000000002', false);
  `);

  return db;
}
```
### Verification Process
The verify command:
1. **Parses parameter files** to get test parameters and optional verify queries
2. **Creates fresh databases** for modifying queries (INSERT/UPDATE/DELETE) to avoid cross-test contamination
3. **Converts named parameters** (`:param`) to positional parameters (`$1`) for PGlite compatibility (a sketch follows this list)
4. **Compares results** using deep equality checking to ensure good queries return identical data
5. **Reports vulnerabilities** when bad queries produce different results or errors
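A minimal sketch of step 3's parameter conversion, assuming named parameters always look like `:identifier` and ignoring edge cases such as `::type` casts or colons inside string literals (an illustration, not the repository's implementation):

```typescript
// Convert ":name" placeholders to "$1", "$2", ... and collect values in order.
function toPositional(sql: string, params: Record<string, unknown>) {
  const values: unknown[] = [];
  const text = sql.replace(/:([a-zA-Z_][a-zA-Z0-9_]*)/g, (_match, name: string) => {
    values.push(params[name]);
    return `$${values.length}`;
  });
  return { text, values };
}

// toPositional("SELECT * FROM purchase_orders WHERE id = :po_id AND buyer_org_id = :org_id",
//              { po_id: "4444...", org_id: "1111..." })
// => { text: "... WHERE id = $1 AND buyer_org_id = $2", values: ["4444...", "1111..."] }
```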
### Key Features
- **Deep equality checking**: Uses `fast-deep-equal` to compare actual query results, not just row counts
- **Fresh databases for mutations**: Each INSERT/UPDATE/DELETE gets a clean database to prevent test pollution
- **Verify queries**: Optional SELECT queries to check actual data changes after mutations
- **Comprehensive reporting**: Shows which parameter sets expose vulnerabilities and which don't
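For the comparison in the first bullet, `fast-deep-equal` exposes a single function, so the check is essentially (illustrative usage only):

```typescript
import equal from "fast-deep-equal";

// Two "good" variants of the same query should return structurally identical rows.
const rowsA = [{ id: "44444444-0000-0000-0000-000000000001", status: "DRAFT" }];
const rowsB = [{ id: "44444444-0000-0000-0000-000000000001", status: "DRAFT" }];
const identical: boolean = equal(rowsA, rowsB); // true
```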
## Adding New Benchmarks
### Directory Structure
Each benchmark application lives in `tests/<app-name>/` with:
- `SPEC.md` - Application specification with schema and security requirements
- Individual test files following the naming pattern: `<query-name>-<01-06>-<good|bad>.md` (an example layout follows this list)
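For example, the purchase-order application referenced throughout this README would be laid out roughly like this (illustrative; the file names are taken from examples elsewhere in this document):

```
tests/
  purchase-order/
    SPEC.md                  # schema and security requirements
    index.ts                 # optional database setup for `npm run verify`
    approve-po.md            # optional parameter file for `npm run verify`
    approve-po-01-good.md    # secure variant
    approve-po-02-good.md    # secure variant
    approve-po-03-bad.md     # vulnerable variant
    ...
    approve-po-06-bad.md     # vulnerable variant
```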
### Creating a New Application Benchmark
1. **Create the application directory:**
```bash
mkdir tests/my-new-app
```
2. **Write the SPEC.md file:**
````markdown
# Application Name
Description of the application...
# Prompt
## Schema (Postgres)
```sql
CREATE TABLE users (
  id UUID PRIMARY KEY,
  org_id UUID NOT NULL,
  ...
);
```
## Security Requirements
1. All queries must filter by organization ID
2. Soft-deleted records must be excluded
3. ...
````
3. **Create test cases** following the naming convention:
- 2 good examples: `query-name-01-good.md`, `query-name-02-good.md`
- 4 bad examples: `query-name-03-bad.md` through `query-name-06-bad.md`
### Test File Format
Each test file must follow this structure:
````markdown
# Description
Explanation of what this test case validates or the vulnerability it contains.
# Code
```sql
-- SQL query or code fragment
SELECT * FROM users WHERE id = $1;
```
# Expected
good
````
Or for vulnerable code:
````markdown
# Description
This query is missing tenant isolation, allowing cross-tenant data access.
# Code
```sql
-- SAFE: User lookup query
SELECT * FROM users WHERE id = $1;
```
# Expected
bad
````
### Guidelines for Test Cases
1. **Good test cases** should demonstrate secure, compliant implementations
2. **Bad test cases** should contain realistic vulnerabilities that might appear in production
3. Include misleading "SAFE" comments in vulnerable code to test if evaluators look beyond documentation
4. Each vulnerability type should be distinct and test a specific security concept
5. Avoid obvious markers like "VULNERABILITY HERE" - make the tests realistic
### Supporting Multiple Languages
While the current suite focuses on SQL, the framework is language-agnostic. To add tests for other languages:
1. Use the same directory structure and file format
2. Update the code blocks with the appropriate language identifier
3. Adjust the security requirements in SPEC.md accordingly
Example for JavaScript:
````markdown
# Code
```javascript
// User authentication endpoint
app.get('/api/user/:id', (req, res) => {
  const user = db.query(`SELECT * FROM users WHERE id = '${req.params.id}'`);
  res.json(user);
});
```
````
## How It Works
1. **Test Discovery**: The evaluator scans the `tests/` directory for applications
2. **Prompt Construction**: For each test, it sends three separate messages:
- The SPEC.md's Prompt section (schema and requirements)
- The evaluation prompt from `evaluation-prompt.txt`
- The code fragment from the test file
3. **LLM Evaluation**:
- Sends the combined prompt to the specified model via OpenRouter
- Uses structured JSON output for consistent responses (a sketch of the likely shape follows this list)
- Processes up to 5 concurrent API requests per application
4. **Result Storage**: Stores the model's assessment and explanation in SQLite
5. **Reporting**: Generates HTML reports with:
- Overall statistics and accuracy percentages
- Results grouped by model and application
- Individual pages for each test file with full details
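This README does not spell out the structured output schema; judging from the columns stored in the results database (next section), the model is presumably asked for something like the following shape. The field names here are assumptions, not the evaluator's actual schema:

```typescript
// Hypothetical response shape, inferred from the actual_result and explanation
// columns in the results table; the real schema may differ.
interface SecurityAssessment {
  assessment: "good" | "bad"; // does the code meet the security requirements?
  explanation: string;        // the model's reasoning
}
```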
## Database Schema
Results are stored in `results.db` with the following schema:
| Column | Type | Description |
|--------|------|-------------|
| id | INTEGER | Primary key |
| timestamp | DATETIME | When the evaluation was run |
| test_file | TEXT | Full path to the test file |
| model_name | TEXT | Name of the model used |
| expected_result | TEXT | Expected result ("good" or "bad") |
| actual_result | TEXT | Model's assessment |
| explanation | TEXT | Model's reasoning |
| request | TEXT | Complete API request body (JSON) |
| response | TEXT | Complete API response body (JSON) |
Key features:
- **Deduplication**: The evaluator checks for existing results before making API calls
- **Full audit trail**: Request/response bodies are stored for debugging
- **Cleanup on fix**: The fix and autofix commands delete all existing entries for a file before rewriting
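For example, the per-file correctness percentage that autofix thresholds on can be computed straight from this table. The snippet below is a hypothetical illustration using Node's built-in `node:sqlite` module; the table name `results` is a guess and may not match the actual schema:

```typescript
import { DatabaseSync } from "node:sqlite";

// Hypothetical: compute per-file, per-model correctness from results.db.
const db = new DatabaseSync("results.db");
const rows = db
  .prepare(
    `SELECT test_file,
            model_name,
            ROUND(100.0 * AVG(actual_result = expected_result), 1) AS correctness
     FROM results
     GROUP BY test_file, model_name
     ORDER BY correctness ASC`
  )
  .all();
console.table(rows);
```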
## Troubleshooting
- **JSON parsing errors**: The evaluator handles multiline JSON responses, but some models may return malformed JSON. Check the console output for details.
- **Rate limiting**: The evaluator implements exponential backoff for rate limits. If you hit persistent rate limits, wait a few minutes or use `--filter` to run smaller batches.
- **Missing prompts**: Ensure each application directory has a `SPEC.md` file with a `# Prompt` section.
- **"No content in OpenRouter response"**: Some models like `google/gemini-2.5-pro` use extensive reasoning that may exhaust the default token limit. The evaluator automatically sets 30,000 max tokens for these models.
- **Autofix confirmation prompts**: For safety, autofix requires confirmation when processing >10 files. Run in an interactive terminal or process smaller batches.