https://github.com/mervinpraison/praisonaibench
A simple, powerful LLM benchmarking tool built with PraisonAI Agents
- Host: GitHub
- URL: https://github.com/mervinpraison/praisonaibench
- Owner: MervinPraison
- Created: 2025-08-29T15:04:36.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2025-09-23T16:16:06.000Z (8 months ago)
- Last Synced: 2025-09-23T18:16:23.549Z (8 months ago)
- Language: HTML
- Homepage:
- Size: 604 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# PraisonAI Bench
🚀 **A simple, powerful LLM benchmarking tool built with PraisonAI Agents**
Benchmark any LiteLLM-compatible model with automatic HTML extraction, model-specific output organization, and flexible test suite management.
## 🎯 Testing Modes
| Feature | Single Test | Test Suite (YAML) |
|---------|-------------|-------------------|
| **📝 Description** | Run one prompt | Run multiple tests from YAML file |
| **🔧 Command** | `praisonaibench --test "prompt"` | `praisonaibench --suite tests.yaml` |
| **📊 Evaluation** | ✅ Enabled (Browser + LLM Judge) | ✅ Enabled (Browser + LLM Judge) |
| **🎨 HTML Extraction** | ✅ Auto-extracted | ✅ Auto-extracted |
| **📁 Output** | Single JSON result | Batch JSON results |
| **🖼️ Screenshots** | ✅ Generated | ✅ Generated |
| **⚡ Console Errors** | ✅ Detected | ✅ Detected |
| **🤖 LLM Judge** | ✅ gpt-5.1 quality scoring | ✅ gpt-5.1 quality scoring |
| **🔄 Retry Logic** | ✅ 3 attempts | ✅ 3 attempts |
| **⚡ Parallel Execution** | N/A | ✅ `--concurrent N` |
| **💰 Cost Tracking** | ✅ Token & cost per test | ✅ Cumulative cost summary |
| **📊 Export Formats** | JSON | JSON, CSV |
| **📈 HTML Reports** | N/A | ✅ `--report` |
| **🎯 Use Case** | Quick testing | Comprehensive benchmarking |
### 🔍 What's Included in Evaluation?
The **hybrid evaluation system** combines static HTML validation, browser-based functional checks, an optional expected-result comparison, and an LLM judge:
| Component | What It Does | Score Weight |
|-----------|--------------|--------------|
| **📝 HTML Validation** | Static structure validation, DOCTYPE, required tags | 15% |
| **🌐 Functional** | Browser rendering, console errors, render time | 40% |
| **🎯 Expected Result** | Objective comparison (optional, for factual tasks) | 20%* |
| **🎨 Quality (LLM)** | Code quality, completeness, best practices | 25% |
| **📊 Overall** | Combined score (0-100) with pass/fail (≥70) | 100% |
*When the `expected` field is not provided, the remaining weights are rescaled to sum to 100% (HTML: 18.75%, Functional: 50%, LLM: 31.25%).
**Example Output (with expected)**:
```
HTML Validation: 90/100 ✅ Valid structure
Functional: 85/100 (renders ✅, 1 error, <1s)
Expected: 95/100 (95% similarity with expected result)
Quality: 80/100 (good structure, minor issues)
Overall: 87/100 ✅ PASSED
```
**Example Output (without expected)**:
```
HTML Validation: 90/100 ✅ Valid structure
Functional: 85/100 (renders ✅, 1 error, <1s)
Expected: N/A (not provided)
Quality: 80/100 (good structure, minor issues)
Overall: 85/100 ✅ PASSED
```
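The weighting can be sketched in a few lines of Python. This only illustrates the arithmetic from the table, using the sub-scores from the example outputs above; it is not the tool's actual implementation. The names `WEIGHTS` and `overall_score` are hypothetical, and how the tool rounds the final number is not specified here.

```python
# Illustrative only: weights come from the table above; all names are hypothetical.
WEIGHTS = {"html": 0.15, "functional": 0.40, "expected": 0.20, "quality": 0.25}
PASS_THRESHOLD = 70


def overall_score(scores):
    """Weighted average of the sub-scores that are present (each 0-100).

    Components missing from `scores` (for example, no `expected` field) are
    dropped and the remaining weights are rescaled so they sum to 1.
    """
    active = {name: w for name, w in WEIGHTS.items() if scores.get(name) is not None}
    if not active:
        raise ValueError("no sub-scores provided")
    total_weight = sum(active.values())
    return sum(scores[name] * w for name, w in active.items()) / total_weight


with_expected = overall_score({"html": 90, "functional": 85, "expected": 95, "quality": 80})
without_expected = overall_score({"html": 90, "functional": 85, "quality": 80})

print(f"with expected:    {with_expected:.3f}")
print(f"without expected: {without_expected:.3f}")
print("passed:", without_expected >= PASS_THRESHOLD)
```

Dropping the 20% `expected` weight leaves 80%, so each remaining weight is divided by 0.8, which is where 18.75%, 50%, and 31.25% come from.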
## ✨ Key Features
- 🎯 **Any LLM Model** - OpenAI, Anthropic, Google, XAI, local models via LiteLLM
- 🔄 **Single Agent Design** - Your prompt becomes the instruction (no complex configs)
- 💾 **Auto HTML Extraction** - Automatically saves HTML code from responses
- 📁 **Smart Organization** - Model-specific output folders (`output/gpt-4o/`, `output/xai/grok-code-fast-1/`)
- 🎛️ **Flexible Testing** - Run single tests, full suites, or filter specific tests
- ⚡ **Parallel Execution** - Run tests concurrently with `--concurrent N` for faster benchmarking
- 💰 **Cost & Token Tracking** - Automatic token usage and cost calculation for all supported models
- 📊 **Multiple Export Formats** - Export results as JSON or CSV for easy analysis
- 📈 **HTML Dashboard Reports** - Beautiful visual reports with interactive charts using `--report`
- 🛠️ **Modern Tooling** - Built with `pyproject.toml` and `uv` package manager
- 📋 **Comprehensive Results** - Complete metrics with timing, success rates, costs, and metadata
- 🔌 **Plugin System** - Extensible evaluators for any language (Python, TypeScript, Go, etc.) via plugins
## 🚀 Quick Start
### 1. Install from PyPI (Recommended)
```bash
pip install praisonaibench
```
[praisonaibench on PyPI](https://pypi.org/project/praisonaibench/)
### 2. Install with uv
```bash
# Clone the repository
git clone https://github.com/MervinPraison/praisonaibench
cd praisonaibench
# Install with uv
uv sync
# Or install in development mode
uv pip install -e .
```
### 3. Alternative: Install with pip (from source)
```bash
git clone https://github.com/MervinPraison/praisonaibench
cd praisonaibench
pip install -e .
```
### 4. Set Your API Keys
```bash
# OpenAI
export OPENAI_API_KEY=your_openai_key
# XAI (Grok)
export XAI_API_KEY=your_xai_key
# Anthropic
export ANTHROPIC_API_KEY=your_anthropic_key
# Google
export GOOGLE_API_KEY=your_google_key
```
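Before a long run it can help to check which keys are actually set. The sketch below is a hypothetical pre-flight helper, not part of PraisonAI Bench; it only uses the environment variable names listed above. You only need the keys for the providers you plan to benchmark.

```python
import os

# Environment variable names taken from the export commands above.
PROVIDER_KEYS = {
    "OpenAI": "OPENAI_API_KEY",
    "XAI": "XAI_API_KEY",
    "Anthropic": "ANTHROPIC_API_KEY",
    "Google": "GOOGLE_API_KEY",
}


def missing_keys(env=None):
    """Return the provider names whose API key variable is unset or empty."""
    env = os.environ if env is None else env
    return [provider for provider, var in PROVIDER_KEYS.items() if not env.get(var)]


for provider in missing_keys():
    print(f"Note: {PROVIDER_KEYS[provider]} is not set ({provider} models will fail)")
```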
### 5. Run Your First Benchmark
```python
from praisonaibench import Bench

# Create benchmark suite
bench = Bench()

# Run a simple test
result = bench.run_single_test("What is 2+2?")
print(result['response'])

# Run with specific model
result = bench.run_single_test(
    "Create a rotating cube HTML file",
    model="xai/grok-code-fast-1"
)

# Get summary
summary = bench.get_summary()
print(summary)
```
## 📁 Project Structure
```
praisonaibench/
├── pyproject.toml              # Modern Python packaging
├── src/praisonaibench/         # Source code
│   ├── __init__.py             # Main imports
│   ├── bench.py                # Core benchmarking engine
│   ├── agent.py                # LLM agent wrapper
│   ├── cli.py                  # Command line interface
│   └── version.py              # Version info
├── examples/                   # Example configurations
│   ├── threejs_simulation_suite.yaml
│   └── config_example.yaml
└── output/                     # Generated results
    ├── gpt-4o/                 # Model-specific HTML files
    ├── xai/grok-code-fast-1/
    └── benchmark_results_*.json
```
## 💻 CLI Usage
### Basic Commands
```bash
# Single test with default model
praisonaibench --test "Explain quantum computing"
# Single test with specific model
praisonaibench --test "Write a poem" --model gpt-4o
# Use any LiteLLM-compatible model
praisonaibench --test "Create HTML" --model xai/grok-code-fast-1
praisonaibench --test "Write code" --model gemini/gemini-1.5-flash-8b
praisonaibench --test "Analyze data" --model claude-3-sonnet-20240229
```
### Test Suites
```bash
# Run entire test suite
praisonaibench --suite examples/threejs_simulation_suite.yaml
# Run specific test from suite
praisonaibench --suite examples/threejs_simulation_suite.yaml --test-name "rotating_cube_simulation"
# Run suite with specific model (overrides individual test models)
praisonaibench --suite tests.yaml --model xai/grok-code-fast-1
# Run tests in parallel (3 concurrent workers)
praisonaibench --suite tests.yaml --concurrent 3
```
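As a rough mental model of `--concurrent N` combined with the 3-attempt retry listed in the feature table, the sketch below runs independent tests on a thread pool and retries each failing call. It shows the general technique under assumptions, not the tool's actual code: `run_with_retry`, `run_suite`, and the `run_test` callable are hypothetical names, and the result fields loosely follow the JSON results format.

```python
from concurrent.futures import ThreadPoolExecutor


def run_with_retry(run_test, test, max_attempts=3):
    """Call run_test(test) up to max_attempts times and return a result dict."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            response = run_test(test)
            return {"test_name": test["name"], "status": "success",
                    "response": response, "attempts": attempt}
        except Exception as exc:  # a real tool would catch narrower errors
            last_error = exc
    return {"test_name": test["name"], "status": "error",
            "error": str(last_error), "attempts": max_attempts}


def run_suite(run_test, tests, concurrent=3):
    """Run tests on `concurrent` worker threads, preserving input order."""
    with ThreadPoolExecutor(max_workers=concurrent) as pool:
        return list(pool.map(lambda t: run_with_retry(run_test, t), tests))


# Demo with a fake runner that fails once for the test named "flaky".
calls = {}


def fake_run(test):
    calls[test["name"]] = calls.get(test["name"], 0) + 1
    if test["name"] == "flaky" and calls[test["name"]] == 1:
        raise RuntimeError("transient failure")
    return f"answer to {test['prompt']}"


results = run_suite(fake_run, [{"name": "math", "prompt": "2+2"},
                               {"name": "flaky", "prompt": "hello"}])
for r in results:
    print(r["test_name"], r["status"], r["attempts"])
```

Threads suit this workload because each test spends most of its time waiting on a network call to the model provider.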
### Cross-Model Comparison
```bash
# Compare across multiple models
praisonaibench --cross-model "Write a poem" --models gpt-4o,gpt-3.5-turbo,xai/grok-code-fast-1
```
### Extract HTML from Results
```bash
# Extract HTML from existing benchmark results
praisonaibench --extract output/benchmark_results_20250829_160426.json
# → Processes JSON file and saves any HTML content to .html files
# Works with any benchmark results JSON file
praisonaibench --extract my_results.json
```
### HTML Generation Examples
```bash
# Generate Three.js simulation (auto-saves HTML)
praisonaibench --test "Create a rotating cube HTML with Three.js" --model gpt-4o
# → Saves to: output/gpt-4o/test_cube.html
# Run Three.js test suite
praisonaibench --suite examples/threejs_simulation_suite.yaml --model xai/grok-code-fast-1
# → Saves to: output/xai/grok-code-fast-1/rotating_cube_simulation.html
```
### Cost & Token Tracking
Automatically track token usage and costs for all LLM API calls:
```bash
# Run tests with automatic cost tracking
praisonaibench --suite tests.yaml --model gpt-4o
# Output includes per-test costs:
💰 Cost: $0.002400 (1250 tokens)
# Summary shows total costs:
📊 Summary:
   Total tests: 4
   Success rate: 100.0%
   Average time: 8.42s
💰 Cost Summary:
   Total tokens: 5,420
   Total cost: $0.0124
   By model:
     gpt-4o: $0.0124 (5,420 tokens)
```
**Pricing data is included for:**
- OpenAI (GPT-4o, GPT-4, GPT-3.5, O1)
- Anthropic (Claude 3 family)
- Google (Gemini 1.5 family)
- XAI (Grok models)
- Groq (optimised models)
Token usage is extracted from API responses when available, or estimated from text length. Costs are calculated using official provider pricing (updated December 2024).
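The calculation described above is simple to sketch. The example below is illustrative only: the prices are placeholder values rather than real provider rates, the 4-characters-per-token fallback is a common rule of thumb rather than the tool's documented estimator, and `estimate_tokens` and `estimate_cost` are hypothetical names.

```python
# Placeholder prices in USD per 1M tokens; these are NOT real provider rates.
PRICES_PER_MILLION = {
    "example-model": {"input": 2.50, "output": 10.00},
}


def estimate_tokens(text):
    """Rough fallback when the API reports no usage: about 4 characters per token."""
    return max(1, len(text) // 4)


def estimate_cost(model, input_tokens, output_tokens):
    """Return the cost in USD, or None if the model has no price entry."""
    price = PRICES_PER_MILLION.get(model)
    if price is None:
        return None
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


prompt = "What is 15 * 23?"
response = "15 * 23 = 345"
in_tok, out_tok = estimate_tokens(prompt), estimate_tokens(response)
cost = estimate_cost("example-model", in_tok, out_tok)
print(f"{in_tok + out_tok} tokens, ${cost:.6f}")
```

Input and output tokens are priced separately because most providers charge more for generated tokens than for prompt tokens.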
### CSV Export
Export benchmark results to CSV for spreadsheet analysis:
```bash
# Export to CSV format
praisonaibench --suite tests.yaml --format csv
# Results saved to: output/csv/benchmark_results_20241211_123456.csv
```
**CSV includes:**
- Test names and status
- Model information
- Execution times
- Token usage (input/output/total)
- Costs per test
- Evaluation scores
- Prompts and response lengths
- Error messages (if any)
Perfect for:
- Spreadsheet analysis in Excel/Google Sheets
- Data visualization tools
- Statistical analysis
- Sharing results with non-technical stakeholders
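For analysis outside a spreadsheet, the export can also be read with Python's standard `csv` module. The sketch below aggregates results per model; the column names (`model`, `status`, `cost`) and the `error` status value are assumptions about the export, so check the header row of your own file first.

```python
import csv
import io
from collections import defaultdict

# Stand-in for an exported file; the column names here are assumptions.
SAMPLE_CSV = """test_name,model,status,execution_time,cost
math_test,gpt-4o,success,2.1,0.0012
creative_test,gpt-4o,success,8.4,0.0031
math_test,xai/grok-code-fast-1,error,1.0,0.0000
"""


def summarize_by_model(csv_file):
    """Aggregate test count, successes, and total cost per model."""
    summary = defaultdict(lambda: {"tests": 0, "passed": 0, "cost": 0.0})
    for row in csv.DictReader(csv_file):
        entry = summary[row["model"]]
        entry["tests"] += 1
        entry["passed"] += int(row["status"] == "success")
        entry["cost"] += float(row["cost"] or 0)
    return dict(summary)


summary = summarize_by_model(io.StringIO(SAMPLE_CSV))
for model, stats in summary.items():
    print(f"{model}: {stats['passed']}/{stats['tests']} passed, ${stats['cost']:.4f}")
```

To use a real export, open it with `open(path, newline="")` and pass the file object to `summarize_by_model` in place of the `io.StringIO` stand-in.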
### HTML Dashboard Reports
Generate interactive HTML reports with charts and tables, modeled on the praisonaibench-ui React application:
```bash
# Generate enhanced HTML report after running tests
praisonaibench --suite tests.yaml --report
# Generate report from existing results (without re-running tests)
praisonaibench --report-from output/json/benchmark_results_20241211_123456.json
# Compare multiple test results
praisonaibench --compare result1.json result2.json result3.json
# Reports saved to: output/reports/
```
**Enhanced Report Includes:**
**📊 Dashboard Tab:**
- Summary cards with key metrics (tests, models, success rate, avg time, cost, tokens)
- Interactive charts:
  - Status distribution (success/failure)
  - Execution time by model
  - Evaluation scores (radar chart)
  - Errors & warnings
**🏆 Leaderboard Tab:**
- Model rankings with multiple criteria:
  - Overall Score (default)
  - Functional Score
  - Quality Score
  - Pass Rate
  - Speed (fastest first)
- Top 3 models highlighted with medals
- Detailed metrics per model (functional, quality, pass rate, time)
- Click criteria to re-rank dynamically
**⚖️ Comparison Tab:**
- Detailed side-by-side model comparison
- Comprehensive metrics table:
  - Overall score, functional score, quality score
  - Pass rate with color coding
  - Average execution time
  - Total errors and warnings count
- Full model names and stats
**📋 Results Tab:**
- Complete test results table
- Individual test status, scores, time, tokens, cost
- Sortable columns
- Status indicators
**Features:**
- 🎨 Modern UI with gradient headers and smooth transitions
- 📱 Fully responsive design
- ⚡ Fast and lightweight (no external dependencies)
- 🔄 Tab navigation for organized viewing
- 📊 Chart.js powered visualizations
- 🎯 Based on praisonaibench-ui React application
- 💾 Standalone HTML - works offline
- 📧 Easy to share via email or host on web
**Comparison Reports:**
Multi-run comparison shows:
- Side-by-side success rates
- Performance trends
- Cost and token usage evolution
- Model improvements over time
Open any generated HTML file in any modern browser!
## 📋 Test Suite Format
### Basic Test Suite (`tests.yaml`)
```yaml
tests:
  - name: "math_test"
    prompt: "What is 15 * 23?"
    expected: "345" # Optional: for objective comparison

  - name: "python_test"
    language: "python" # Use plugin evaluator
    prompt: "Write Python factorial function"
    expected: "120"

  - name: "creative_test"
    prompt: "Write a short story about a robot"
    # No expected field - subjective task

  - name: "model_specific_test"
    prompt: "Explain quantum physics"
    model: "gpt-4o"
```
**Using the `expected` field**:
- ✅ **Use for**: Factual questions, math problems, code output, deterministic tasks
- ❌ **Skip for**: Creative tasks, open-ended questions, visual/interactive content
- When provided: Adds 20% objective scoring based on similarity
- When omitted: Weights automatically normalize (no penalty)
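A quick structural check before a long run can catch typos in a suite file. The sketch below is a hypothetical helper, not part of PraisonAI Bench. It assumes every test needs a unique `name` and a non-empty `prompt`, which is inferred from the examples here rather than from a documented schema. It works on the parsed dictionary, such as the one PyYAML's `yaml.safe_load` returns, so the example itself needs no third-party package.

```python
def validate_suite(suite):
    """Return a list of problems found in a parsed suite dict (empty if none)."""
    tests = suite.get("tests")
    if not isinstance(tests, list) or not tests:
        return ["suite must contain a non-empty 'tests' list"]

    problems = []
    seen = set()
    for index, test in enumerate(tests):
        name = test.get("name")
        if not name:
            problems.append(f"test #{index}: missing 'name'")
        elif name in seen:
            problems.append(f"test #{index}: duplicate name '{name}'")
        else:
            seen.add(name)
        if not str(test.get("prompt") or "").strip():
            problems.append(f"test #{index}: missing 'prompt'")
    return problems


# Equivalent to what a YAML loader would return for a small tests.yaml.
suite = {
    "tests": [
        {"name": "math_test", "prompt": "What is 15 * 23?", "expected": "345"},
        {"name": "math_test", "prompt": ""},
    ]
}
for problem in validate_suite(suite):
    print(problem)
```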
### Advanced Test Suite with Full Config Support
```yaml
# Global LLM configuration (applies to all tests)
config:
  max_tokens: 4000
  temperature: 0.7
  top_p: 0.9
  frequency_penalty: 0.0
  presence_penalty: 0.0
  # Any LiteLLM-compatible parameter is supported!

tests:
  - name: "creative_writing"
    prompt: "Write a detailed sci-fi story"
    model: "gpt-4o"

  - name: "code_generation"
    prompt: "Create a Python web scraper"
    model: "xai/grok-code-fast-1"
```
### Three.js HTML Generation Suite
```yaml
# examples/threejs_simulation_suite.yaml
tests:
  - name: "rotating_cube_simulation"
    prompt: |
      Create a complete HTML file with Three.js that displays a rotating 3D cube.
      The cube should have different colored faces, rotate continuously, and include proper lighting.
      The HTML file should be self-contained with Three.js loaded from CDN.
      Include camera controls for user interaction.
      Save the output as 'rotating_cube.html'.

  - name: "particle_system"
    prompt: |
      Create an HTML file with Three.js showing an animated particle system.
      Include 1000+ particles with random colors, movement, and physics.
      Add mouse interaction to influence particle behavior.

  - name: "terrain_simulation"
    prompt: |
      Create a Three.js HTML file with a procedurally generated terrain landscape.
      Include realistic textures, lighting, and a first-person camera.
      Add fog effects and animated elements.

  - name: "solar_system"
    prompt: |
      Create a Three.js solar system simulation in HTML.
      Include the sun, planets with realistic orbits, textures, and lighting.
      Add controls to speed up/slow down time.
```
## 🔧 Configuration
### Basic Configuration (`config.yaml`)
```yaml
# Default model (can be overridden per test)
default_model: "gpt-4o"
# Output settings
output_format: "json"
save_results: true
output_dir: "output"
# Performance settings
max_retries: 3
timeout: 60
```
### Supported Models
PraisonAI Bench supports **any LiteLLM-compatible model**:
```yaml
# OpenAI Models
- gpt-4o
- gpt-4o-mini
- gpt-3.5-turbo
# Anthropic Models
- claude-3-opus-20240229
- claude-3-sonnet-20240229
- claude-3-haiku-20240307
# Google Models
- gemini/gemini-1.5-pro
- gemini/gemini-1.5-flash
- gemini/gemini-1.5-flash-8b
# XAI Models
- xai/grok-beta
- xai/grok-code-fast-1
# Local Models (via LM Studio, Ollama, etc.)
- ollama/llama2
- openai/gpt-3.5-turbo # with OPENAI_API_BASE set
```
## 📊 Results & Output
### Automatic HTML Extraction
When LLM responses contain HTML code blocks, they're automatically extracted and saved:
```
output/
├── gpt-4o/
│   ├── rotating_cube_simulation.html
│   └── particle_system.html
├── xai/
│   └── grok-code-fast-1/
│       ├── terrain_simulation.html
│       └── solar_system.html
└── benchmark_results_20250829_160426.json
```
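The folder layout follows directly from the model name: a provider prefix such as `xai/` becomes a nested directory. A minimal sketch of that mapping is below. `html_output_path` is a hypothetical helper, and naming the file after the test is an assumption based on the tree above.

```python
from pathlib import Path


def html_output_path(model, test_name, output_dir="output"):
    """Build output/<model>/<test_name>.html; slashes in the model name nest folders."""
    return Path(output_dir).joinpath(*model.split("/"), f"{test_name}.html")


print(html_output_path("gpt-4o", "particle_system").as_posix())
print(html_output_path("xai/grok-code-fast-1", "solar_system").as_posix())
```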
### JSON Results Format
```json
[
  {
    "test_name": "rotating_cube_simulation",
    "prompt": "Create a complete HTML file with Three.js...",
    "response": "\n\n...",
    "model": "xai/grok-code-fast-1",
    "agent_name": "BenchAgent",
    "execution_time": 8.24,
    "status": "success",
    "timestamp": "2025-08-29 16:04:26"
  }
]
```
### Summary Statistics
```bash
📊 Summary:
Total tests: 4
Success rate: 100.0%
Average time: 12.34s
Results saved to: output/benchmark_results_20250829_160426.json
```
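The same figures can be recomputed from a saved results file, which is handy when combining several runs. The sketch below relies only on the `status` and `execution_time` fields shown in the JSON format above. The `summarize` helper is hypothetical, and treating any status other than `"success"` as a failure is an assumption.

```python
import json

# Stand-in for the contents of a benchmark_results_*.json file.
RESULTS_JSON = """[
  {"test_name": "rotating_cube_simulation", "model": "xai/grok-code-fast-1",
   "execution_time": 8.24, "status": "success"},
  {"test_name": "particle_system", "model": "xai/grok-code-fast-1",
   "execution_time": 12.0, "status": "error"}
]"""


def summarize(results):
    """Compute total tests, success rate (%), and average execution time (s)."""
    total = len(results)
    if total == 0:
        return {"total": 0, "success_rate": 0.0, "average_time": 0.0}
    successes = sum(1 for r in results if r.get("status") == "success")
    average = sum(r.get("execution_time", 0.0) for r in results) / total
    return {"total": total,
            "success_rate": 100.0 * successes / total,
            "average_time": average}


summary = summarize(json.loads(RESULTS_JSON))
print(f"Total tests: {summary['total']}")
print(f"Success rate: {summary['success_rate']:.1f}%")
print(f"Average time: {summary['average_time']:.2f}s")
```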
## 🎯 Advanced Features
### 🔄 **Universal Model Support**
- Works with **any LiteLLM-compatible model**
- No hardcoded model restrictions
- Automatic API key detection
### 💾 **Smart HTML Handling**
- Auto-detects HTML in multiple formats:
  - Markdown-wrapped HTML (fenced `html` code blocks)
  - Truncated HTML blocks (incomplete responses)
  - Raw HTML content (direct HTML responses)
- Extracts and saves as `.html` files automatically
- Organizes by model in separate folders
- Extract HTML from existing benchmark results with `--extract`
- Perfect for Three.js, React, or any web development benchmarks
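The three detection cases can be approximated with regular expressions tried in order of specificity. This is a sketch of the idea under assumptions, not the extractor PraisonAI Bench ships; `extract_html` and the pattern names are hypothetical. The fence marker is built with `"`" * 3` only so that the example does not contain a literal code fence.

```python
import re

FENCE = "`" * 3  # the three-backtick Markdown fence marker

# 1. Complete fenced block: fence + "html", content, closing fence.
FENCED = re.compile(FENCE + r"html\s*\n(.*?)" + FENCE, re.DOTALL | re.IGNORECASE)
# 2. Fence opened but never closed (the response was cut off).
TRUNCATED = re.compile(FENCE + r"html\s*\n(.*)\Z", re.DOTALL | re.IGNORECASE)
# 3. No fence at all: take everything from the doctype or <html> tag onward.
RAW = re.compile(r"(<!DOCTYPE html.*|<html.*)\Z", re.DOTALL | re.IGNORECASE)


def extract_html(response):
    """Return the first HTML payload found in an LLM response, or None."""
    for pattern in (FENCED, TRUNCATED, RAW):
        match = pattern.search(response)
        if match:
            return match.group(1).strip()
    return None


complete = f"Here you go:\n{FENCE}html\n<html><body>Hi</body></html>\n{FENCE}\nEnjoy!"
truncated = f"{FENCE}html\n<html><body>Cut off"
raw = "<!DOCTYPE html>\n<html></html>"

for sample in (complete, truncated, raw, "plain text answer"):
    print(repr(extract_html(sample)))
```

Order matters: the complete-fence pattern runs first so that trailing commentary after the closing fence is not swept into the saved file.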
### 🎛️ **Flexible Test Management**
- Run entire suites or filter specific tests
- Override models per test or globally
- Cross-model comparisons with detailed metrics
### ⚡ **Modern Development**
- Built with `pyproject.toml` (no legacy `setup.py`)
- Optimized for `uv` package manager
- Fast dependency resolution and installation
### 🏗️ **Simple Architecture**
- **Single Agent Design** - Your prompt becomes the instruction
- **No Complex Configs** - Just write your test prompts
- **Minimal Dependencies** - Only what you need
## 🔌 Plugin System
**Extensible evaluators for any language or task** - Create plugins in one file.
### Create Plugin (One File)
```python
from praisonaibench import BaseEvaluator


class MyEvaluator(BaseEvaluator):
    def get_language(self):
        return 'mylang'  # e.g., 'python', 'typescript', 'go'

    def evaluate(self, code, test_name, prompt, expected=None):
        return {
            'score': 85,      # 0-100
            'passed': True,   # score >= 70
            'feedback': [{'level': 'success', 'message': '✅ Works!'}],
            'details': {}
        }
```
**Setup** (`pyproject.toml`):
```toml
[project]
name = "praisonaibench-mylang"
version = "0.1.0"
dependencies = ["praisonaibench>=0.1.0"]
[project.entry-points."praisonaibench.evaluators"]
mylang = "my_evaluator:MyEvaluator"
```
**Install**: `pip install -e .` or `uv pip install -e .`
### Use Plugin
```yaml
# tests.yaml
tests:
  - name: "python_test"
    language: "python" # Auto-discovered
    prompt: "Write Python hello world"
    expected: "Hello World"
```
**Run**: `praisonaibench --suite tests.yaml`
### Features
- ✅ **One file** (~50 lines) per plugin
- ✅ **Auto-discovery** - No config needed
- ✅ **Backwards compatible** - HTML evaluation unchanged
- ✅ **Language detection** - Auto-detects from code blocks or explicit `language` field
- ✅ **Any task** - Programming languages, text summarization, translation, etc.
**Example**: `examples/plugins/python_evaluator.py`
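To make the interface more concrete, here is a sketch of an evaluator that runs the generated Python code and compares its standard output with `expected`. It is not the bundled `python_evaluator.py`. The class name `StdoutEvaluator`, the 0/40/100 scoring, and the 10-second timeout are all assumptions chosen for illustration, and the fallback base class exists only so the snippet runs without the package installed. Running model-generated code is risky, so a real evaluator should sandbox it.

```python
import subprocess
import sys

try:
    from praisonaibench import BaseEvaluator
except ImportError:  # stand-in so this sketch runs without the package
    class BaseEvaluator:
        pass


class StdoutEvaluator(BaseEvaluator):
    """Hypothetical evaluator: run Python code and compare stdout to `expected`."""

    def get_language(self):
        return "python"

    def evaluate(self, code, test_name, prompt, expected=None):
        try:
            proc = subprocess.run([sys.executable, "-c", code],
                                  capture_output=True, text=True, timeout=10)
        except subprocess.TimeoutExpired:
            return self._result(0, "error", "timed out after 10s", {})
        output = proc.stdout.strip()
        if proc.returncode != 0:
            return self._result(0, "error", proc.stderr.strip()[-200:], {"stdout": output})
        if expected is None or expected.strip() == output:
            return self._result(100, "success", "ran cleanly", {"stdout": output})
        message = f"expected {expected!r}, got {output!r}"
        return self._result(40, "warning", message, {"stdout": output})

    @staticmethod
    def _result(score, level, message, details):
        # Same result shape as the minimal plugin example above.
        return {"score": score, "passed": score >= 70,
                "feedback": [{"level": level, "message": message}],
                "details": details}


result = StdoutEvaluator().evaluate("print('Hello World')", "python_test",
                                    "Write Python hello world", expected="Hello World")
print(result["score"], result["passed"])
```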
## 🚀 Use Cases
### Web Development Benchmarking
```bash
# Test HTML/CSS/JS generation across models
praisonaibench --suite web_dev_suite.yaml --model gpt-4o
```
### Code Generation Comparison
```bash
# Compare coding abilities
praisonaibench --cross-model "Write a Python web scraper" --models gpt-4o,claude-3-sonnet-20240229,xai/grok-code-fast-1
```
### Creative Content Testing
```bash
# Test creative writing
praisonaibench --test "Write a sci-fi short story" --model gemini/gemini-1.5-pro
```
## 🤝 Contributing
1. Fork the repository
2. Create a feature branch: `git checkout -b feature-name`
3. Install dependencies: `uv sync`
4. Make your changes
5. Run tests: `uv run pytest`
6. Submit a pull request
## 📄 License
MIT License - see LICENSE file for details.
---
**Perfect for developers who need powerful, flexible LLM benchmarking with zero complexity!** 🚀