https://github.com/glideapps/security-bench
Evals for how well LLMs can judge security issues in code
# Security Benchmark Suite
A comprehensive evaluation framework for testing Large Language Models' ability to identify security vulnerabilities in code, with a focus on SQL injection, access control, and multi-tenant data isolation issues.
## Overview
This benchmark suite evaluates how well LLMs can detect security vulnerabilities in database queries and other code fragments. It tests various security patterns including:
- SQL injection vulnerabilities
- Cross-tenant data leakage
- Missing access control checks
- Improper handling of soft-deleted records
- Exposure of sensitive fields
- Missing pagination limits
- Incorrect permission validations
- State transition guards
- Temporal access control
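As a hypothetical illustration of several of these patterns (not taken from the benchmark's test files), compare an interpolated, unscoped query with a parameterized, tenant-scoped one:

```typescript
// Hypothetical illustration only; column names mirror the purchase_orders
// schema used elsewhere in this README.
const poId = "44444444-0000-0000-0000-000000000001";
const orgId = "11111111-0000-0000-0000-000000000001";

// BAD: string interpolation (SQL injection) with no tenant or soft-delete filter.
const bad = `SELECT * FROM purchase_orders WHERE id = '${poId}'`;

// GOOD: parameterized, scoped to the caller's organization, excluding soft-deleted rows.
const good = {
  text: `SELECT * FROM purchase_orders
         WHERE id = $1 AND buyer_org_id = $2 AND is_deleted = false`,
  values: [poId, orgId],
};
```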
## Setup
### Prerequisites
- Node.js 24+ (with built-in TypeScript support)
- An OpenRouter API key
### Installation
1. Clone the repository
2. Install dependencies:
```bash
npm install
```
3. Configure your OpenRouter API key in `.env`:
```
OPENROUTER_API_KEY=your-api-key-here
```
## Running Evaluations
### Evaluate a Model
Run the full benchmark suite against a specific model:
```bash
npm run evaluate -- --model gpt-4o-mini
```
Run evaluations with multiple models:
```bash
# Evaluate with multiple models sequentially
npm run evaluate -- --model gpt-4o --model claude-3-opus --model gemini-pro
# Multiple models with filter
npm run evaluate -- --model gpt-4o-mini --model claude-3-haiku --filter approve-po
```
Run a filtered subset of tests:
```bash
# Only run tests matching a pattern
npm run evaluate -- --model claude-3-opus --filter approve-po
# Only run "good" test cases
npm run evaluate -- --model gpt-4o-mini --filter 01-good
```
**Note:** When using multiple models, each model is evaluated sequentially with clear separation in the output. All results are stored independently in the database for each model.
### Generate Reports
Generate an HTML report from evaluation results:
```bash
npm run report
```
Reports are saved in the `reports/` directory with timestamps. Each report includes:
- Summary statistics showing overall accuracy percentages
- Results grouped by model with individual test file links
- Results grouped by application showing performance across all models
- Detailed pages for each test file showing all evaluations
### Fix Test Files
The fix command can automatically rewrite test code when a model's assessment differs from the expected result:
```bash
# Fix using default model (anthropic/claude-opus-4.1)
npm run fix approve-po-01-good.md
# Fix using a specific model
npm run fix approve-po-01-good.md -- --model gpt-4o
# Fix using full or relative path
npm run fix tests/purchase-order/approve-po-01-good.md
```
How it works:
1. Retrieves any previous evaluation results for context
2. **Deletes all existing database entries for that test file**
3. Evaluates the test file with the specified model
4. If the assessment matches expected: reports no fix needed
5. If the assessment differs: asks the model to rewrite the code to match expectations
6. Updates the test file with corrected code
7. Verifies the fix by re-evaluating
The fix command will:
- Make vulnerable code secure (if expected is "good")
- Introduce realistic vulnerabilities (if expected is "bad")
- Add appropriate SQL comments (accurate for secure code, misleading for vulnerable code)
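Put together, the steps above amount to something like the following sketch. Every helper passed in via `deps` is a hypothetical stand-in for whatever the repository actually implements; this is a paraphrase of the documented flow, not its code:

```typescript
// Rough paraphrase of the fix flow described above; every `deps` function is hypothetical.
type Assessment = { assessment: "good" | "bad"; explanation: string };

async function fix(
  testFile: string,
  model: string,
  deps: {
    getPreviousResults(file: string): Promise<Assessment[]>;
    deleteResultsFor(file: string): Promise<void>;
    expectedResultOf(file: string): "good" | "bad";
    evaluate(file: string, model: string, history?: Assessment[]): Promise<Assessment>;
    rewriteToMatch(file: string, expected: "good" | "bad", failing: Assessment): Promise<string>;
    writeTestFile(file: string, code: string): Promise<void>;
  }
): Promise<"no fix needed" | "fixed" | "still failing"> {
  const history = await deps.getPreviousResults(testFile);      // 1. prior results for context
  await deps.deleteResultsFor(testFile);                        // 2. clear existing DB entries
  const expected = deps.expectedResultOf(testFile);             //    from the "# Expected" section
  const first = await deps.evaluate(testFile, model, history);  // 3. evaluate the test file
  if (first.assessment === expected) return "no fix needed";    // 4. already matches
  const code = await deps.rewriteToMatch(testFile, expected, first); // 5. rewrite the code
  await deps.writeTestFile(testFile, code);                     // 6. update the test file
  const second = await deps.evaluate(testFile, model);          // 7. verify by re-evaluating
  return second.assessment === expected ? "fixed" : "still failing";
}
```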
### Verify Query Behavior
The verify command runs all test queries against an actual database to ensure that:
- "Good" query variants produce identical results
- "Bad" queries expose vulnerabilities by producing different results
- All queries are syntactically valid SQL
```bash
npm run verify
```
This command:
1. Creates an in-memory PostgreSQL database using PGlite
2. Loads test data with multiple organizations, users, and purchase orders
3. For each query type with a parameter file:
- Runs all good queries and verifies they return identical results
- Runs bad queries and checks they differ from good queries
- For INSERT/UPDATE/DELETE operations, uses optional Verify queries to check actual data changes
4. Reports any queries that fail to expose vulnerabilities
### Batch Fix with Autofix
The autofix command finds and fixes multiple test files based on their correctness percentage:
```bash
# Fix all files with ≤50% correctness using default model
npm run autofix -- 50
# Fix all files with 0% correctness (completely failing)
npm run autofix -- 0
# Use a specific model for autofix
npm run autofix -- 30 --model gpt-4o
```
How it works:
1. Queries the database for all test files with correctness ≤ the specified percentage
2. Shows a summary of files to be fixed with their current accuracy
3. For >10 files, prompts for confirmation (in interactive terminals only)
4. Processes each file sequentially:
- Shows progress indicator `[n/total]`
- Calls the fix logic (including DB cleanup)
- Continues on error
5. Displays final summary with success/failure counts
Example output:
```
Finding test files with ≤50% correctness...
Found 3 files to fix:
- approve-po-03-bad.md (0.0% correct, 0/2)
- buyer-approval-queue-04-bad.md (25.0% correct, 1/4)
- get-messages-in-po-05-bad.md (50.0% correct, 1/2)
Starting autofix with model: anthropic/claude-opus-4.1
[1/3] Fixing approve-po-03-bad.md (0.0% correct)
...
[2/3] Fixing buyer-approval-queue-04-bad.md (25.0% correct)
...
[3/3] Fixing get-messages-in-po-05-bad.md (50.0% correct)
...
========================================
Autofix complete!
✓ Successfully fixed: 3
========================================
```
## Query Verification System
The verification system ensures that test queries are not only theoretically correct but also behave as expected against a real database.
### Creating Parameter Files
Each query type needs a parameter file (e.g., `approve-po.md`) in the test directory:
````markdown
# Description
Query to approve a purchase order with proper security controls
# Parameters
{ "user_id": "22222222-0000-0000-0000-000000000001", "org_id": "11111111-0000-0000-0000-000000000001", "role": "buyer_admin", "po_id": "44444444-0000-0000-0000-000000000002", "now": "2024-01-15T12:00:00Z" }
{ "user_id": "22222222-0000-0000-0000-000000000002", "org_id": "11111111-0000-0000-0000-000000000001", "role": "buyer_user", "po_id": "44444444-0000-0000-0000-000000000002", "now": "2024-01-15T12:00:00Z" }
{ "user_id": "22222222-0000-0000-0000-000000000005", "org_id": "11111111-0000-0000-0000-000000000002", "role": "buyer_admin", "po_id": "44444444-0000-0000-0000-000000000001", "now": "2024-01-15T12:00:00Z" }
# Verify (Optional - for INSERT/UPDATE/DELETE)
```sql
SELECT id, status, approved_by, buyer_org_id
FROM purchase_orders
WHERE id = :po_id
```
````
Each parameter set should test different access scenarios to expose vulnerabilities in bad queries.
### Database Setup
Create an `index.ts` file in the test directory to set up the database:
```typescript
import { PGlite } from '@electric-sql/pglite';
export async function createDatabase(): Promise<PGlite> {
  const db = new PGlite();

  // Create schema
  await db.exec(`
    CREATE TABLE purchase_orders (
      id TEXT PRIMARY KEY,
      buyer_org_id TEXT NOT NULL,
      status TEXT NOT NULL,
      created_by TEXT,
      is_deleted BOOLEAN DEFAULT false
    );
  `);

  // Insert test data
  await db.exec(`
    INSERT INTO purchase_orders VALUES
      ('44444444-0000-0000-0000-000000000001', '11111111-0000-0000-0000-000000000001', 'DRAFT', '22222222-0000-0000-0000-000000000002', false),
      ('44444444-0000-0000-0000-000000000002', '11111111-0000-0000-0000-000000000001', 'PENDING_APPROVAL', '22222222-0000-0000-0000-000000000002', false);
  `);

  return db;
}
```
### Verification Process
The verify command:
1. **Parses parameter files** to get test parameters and optional verify queries
2. **Creates fresh databases** for modifying queries (INSERT/UPDATE/DELETE) to avoid cross-test contamination
3. **Converts named parameters** (`:param`) to positional parameters (`$1`) for PGlite compatibility (a sketch follows this list)
4. **Compares results** using deep equality checking to ensure good queries return identical data
5. **Reports vulnerabilities** when bad queries produce different results or errors
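A minimal sketch of step 3's parameter conversion, assuming named parameters always look like `:identifier` and ignoring edge cases such as `::type` casts or colons inside string literals (an illustration, not the repository's implementation):

```typescript
// Convert ":name" placeholders to "$1", "$2", ... and collect values in order.
function toPositional(sql: string, params: Record<string, unknown>) {
  const values: unknown[] = [];
  const text = sql.replace(/:([a-zA-Z_][a-zA-Z0-9_]*)/g, (_match, name: string) => {
    values.push(params[name]);
    return `$${values.length}`;
  });
  return { text, values };
}

// toPositional("SELECT * FROM purchase_orders WHERE id = :po_id AND buyer_org_id = :org_id",
//              { po_id: "4444...", org_id: "1111..." })
// => { text: "... WHERE id = $1 AND buyer_org_id = $2", values: ["4444...", "1111..."] }
```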
### Key Features
- **Deep equality checking**: Uses `fast-deep-equal` to compare actual query results, not just row counts
- **Fresh databases for mutations**: Each INSERT/UPDATE/DELETE gets a clean database to prevent test pollution
- **Verify queries**: Optional SELECT queries to check actual data changes after mutations
- **Comprehensive reporting**: Shows which parameter sets expose vulnerabilities and which don't
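For the comparison in the first bullet, `fast-deep-equal` exposes a single function, so the check is essentially (illustrative usage only):

```typescript
import equal from "fast-deep-equal";

// Two "good" variants of the same query should return structurally identical rows.
const rowsA = [{ id: "44444444-0000-0000-0000-000000000001", status: "DRAFT" }];
const rowsB = [{ id: "44444444-0000-0000-0000-000000000001", status: "DRAFT" }];
const identical: boolean = equal(rowsA, rowsB); // true
```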
## Adding New Benchmarks
### Directory Structure
Each benchmark application lives in `tests/<app-name>/` with:
- `SPEC.md` - Application specification with schema and security requirements
- Individual test files following the naming pattern: `<query-name>-<01-06>-<good|bad>.md` (an example layout follows this list)
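For example, the purchase-order application referenced throughout this README would be laid out roughly like this (illustrative; the file names are taken from examples elsewhere in this document):

```
tests/
  purchase-order/
    SPEC.md                  # schema and security requirements
    index.ts                 # optional database setup for `npm run verify`
    approve-po.md            # optional parameter file for `npm run verify`
    approve-po-01-good.md    # secure variant
    approve-po-02-good.md    # secure variant
    approve-po-03-bad.md     # vulnerable variant
    ...
    approve-po-06-bad.md     # vulnerable variant
```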
### Creating a New Application Benchmark
1. **Create the application directory:**
```bash
mkdir tests/my-new-app
```
2. **Write the SPEC.md file:**
````markdown
# Application Name
Description of the application...
# Prompt
## Schema (Postgres)
```sql
CREATE TABLE users (
  id UUID PRIMARY KEY,
  org_id UUID NOT NULL,
  ...
);
```
## Security Requirements
1. All queries must filter by organization ID
2. Soft-deleted records must be excluded
3. ...
````
3. **Create test cases** following the naming convention:
- 2 good examples: `query-name-01-good.md`, `query-name-02-good.md`
- 4 bad examples: `query-name-03-bad.md` through `query-name-06-bad.md`
### Test File Format
Each test file must follow this structure:
````markdown
# Description
Explanation of what this test case validates or the vulnerability it contains.
# Code
```sql
-- SQL query or code fragment
SELECT * FROM users WHERE id = $1;
```
# Expected
good
````
Or for vulnerable code:
````markdown
# Description
This query is missing tenant isolation, allowing cross-tenant data access.
# Code
```sql
-- SAFE: User lookup query
SELECT * FROM users WHERE id = $1;
```
# Expected
bad
````
### Guidelines for Test Cases
1. **Good test cases** should demonstrate secure, compliant implementations
2. **Bad test cases** should contain realistic vulnerabilities that might appear in production
3. Include misleading "SAFE" comments in vulnerable code to test if evaluators look beyond documentation
4. Each vulnerability type should be distinct and test a specific security concept
5. Avoid obvious markers like "VULNERABILITY HERE" - make the tests realistic
### Supporting Multiple Languages
While the current suite focuses on SQL, the framework is language-agnostic. To add tests for other languages:
1. Use the same directory structure and file format
2. Update the code blocks with the appropriate language identifier
3. Adjust the security requirements in SPEC.md accordingly
Example for JavaScript:
````markdown
# Code
```javascript
// User authentication endpoint
app.get('/api/user/:id', (req, res) => {
  const user = db.query(`SELECT * FROM users WHERE id = '${req.params.id}'`);
  res.json(user);
});
```
````
## How It Works
1. **Test Discovery**: The evaluator scans the `tests/` directory for applications
2. **Prompt Construction**: For each test, it sends three separate messages:
- The SPEC.md's Prompt section (schema and requirements)
- The evaluation prompt from `evaluation-prompt.txt`
- The code fragment from the test file
3. **LLM Evaluation**:
- Sends the combined prompt to the specified model via OpenRouter
- Uses structured JSON output for consistent responses (a sketch of the likely shape follows this list)
- Processes up to 5 concurrent API requests per application
4. **Result Storage**: Stores the model's assessment and explanation in SQLite
5. **Reporting**: Generates HTML reports with:
- Overall statistics and accuracy percentages
- Results grouped by model and application
- Individual pages for each test file with full details
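This README does not spell out the structured output schema; judging from the columns stored in the results database (next section), the model is presumably asked for something like the following shape. The field names here are assumptions, not the evaluator's actual schema:

```typescript
// Hypothetical response shape, inferred from the actual_result and explanation
// columns in the results table; the real schema may differ.
interface SecurityAssessment {
  assessment: "good" | "bad"; // does the code meet the security requirements?
  explanation: string;        // the model's reasoning
}
```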
## Database Schema
Results are stored in `results.db` with the following schema:
| Column | Type | Description |
|--------|------|-------------|
| id | INTEGER | Primary key |
| timestamp | DATETIME | When the evaluation was run |
| test_file | TEXT | Full path to the test file |
| model_name | TEXT | Name of the model used |
| expected_result | TEXT | Expected result ("good" or "bad") |
| actual_result | TEXT | Model's assessment |
| explanation | TEXT | Model's reasoning |
| request | TEXT | Complete API request body (JSON) |
| response | TEXT | Complete API response body (JSON) |
Key features:
- **Deduplication**: The evaluator checks for existing results before making API calls
- **Full audit trail**: Request/response bodies are stored for debugging
- **Cleanup on fix**: The fix and autofix commands delete all existing entries for a file before rewriting
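For example, the per-file correctness percentage that autofix thresholds on can be computed straight from this table. The snippet below is a hypothetical illustration using Node's built-in `node:sqlite` module; the table name `results` is a guess and may not match the actual schema:

```typescript
import { DatabaseSync } from "node:sqlite";

// Hypothetical: compute per-file, per-model correctness from results.db.
const db = new DatabaseSync("results.db");
const rows = db
  .prepare(
    `SELECT test_file,
            model_name,
            ROUND(100.0 * AVG(actual_result = expected_result), 1) AS correctness
     FROM results
     GROUP BY test_file, model_name
     ORDER BY correctness ASC`
  )
  .all();
console.table(rows);
```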
## Troubleshooting
- **JSON parsing errors**: The evaluator handles multiline JSON responses, but some models may return malformed JSON. Check the console output for details.
- **Rate limiting**: The evaluator implements exponential backoff for rate limits. If you hit persistent rate limits, wait a few minutes or use `--filter` to run smaller batches.
- **Missing prompts**: Ensure each application directory has a `SPEC.md` file with a `# Prompt` section.
- **"No content in OpenRouter response"**: Some models like `google/gemini-2.5-pro` use extensive reasoning that may exhaust the default token limit. The evaluator automatically sets 30,000 max tokens for these models.
- **Autofix confirmation prompts**: For safety, autofix requires confirmation when processing >10 files. Run in an interactive terminal or process smaller batches.