https://github.com/semcod/qualbench
https://github.com/semcod/qualbench
Last synced: 8 days ago
JSON representation
- Host: GitHub
- URL: https://github.com/semcod/qualbench
- Owner: semcod
- License: other
- Created: 2026-04-04T16:15:57.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2026-04-04T20:06:19.000Z (2 months ago)
- Last Synced: 2026-04-04T20:11:00.219Z (2 months ago)
- Language: Python
- Size: 289 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
Awesome Lists containing this project
README
# QualBench — CI for AI-Generated Code
## AI Cost Tracking
 
This project uses AI-generated code. Total cost: **$2.8500** with **19** AI commits.
Generated on 2026-04-09 using [openrouter/qwen/qwen3-coder-next](https://openrouter.ai/models/openrouter/qwen/qwen3-coder-next)
---
> **Correct code is not the same as mergeable code.**
> eslint + code review, but for AI. Add to your pipeline in 2 minutes.
[](LICENSE)
[-green.svg)](dataset/)
[](action/)
---
## 60 seconds to your first score
```bash
pip install qualbench
qualbench quickstart
```
No config, no API keys. QualBench evaluates your current diff and prints a Quality Score.
## Add to CI in 2 minutes
```yaml
# .github/workflows/qualbench.yml
name: QualBench
on: [pull_request]
jobs:
quality-check:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: semcod/qualbench-action@v1
with:
tool: prollama
fail_on_score: 70
```
Every AI-generated PR gets a quality review comment. Set `fail_on_score` and the pipeline fails if quality is below your threshold.
```
🧠 QualBench Review
Quality Score: 78/100
❌ Complexity increased (+12%)
⚠ Security: 1 new medium-severity finding
✔ Tests pass, no regressions
Verdict: needs_review
```
## CI/CD Examples
### GitHub Action (recommended)
```yaml
# .github/workflows/qualbench.yml
name: QualBench
on: [pull_request]
jobs:
quality-check:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: semcod/qualbench-action@v1
with:
tool: prollama
fail_on_score: 70
```
### GitLab CI
```yaml
# .gitlab-ci.yml
qualbench:
stage: test
image: python:3.12-slim
before_script:
- pip install qualbench
script:
- qualbench run --tool prollama --json --fail-on-score 70
only:
- merge_requests
```
### Azure DevOps
```yaml
# azure-pipelines.yml
steps:
- task: UsePythonVersion@0
inputs:
versionSpec: '3.12'
- script: |
pip install qualbench
qualbench run --tool prollama --json --fail-on-score 70
displayName: 'QualBench Quality Check'
```
### Jenkins
```groovy
// Jenkinsfile
stage('Quality Check') {
steps {
sh '''
pip install qualbench
qualbench run --tool prollama --fail-on-score 70
'''
}
}
```
### CircleCI
```yaml
# .circleci/config.yml
version: 2.1
jobs:
quality:
docker:
- image: python:3.12-slim
steps:
- checkout
- run: pip install qualbench
- run: qualbench run --tool prollama --fail-on-score 70
workflows:
quality-check:
jobs:
- quality
```
## The problem
AI coding tools resolve 70–80% of benchmark tasks. But most AI-generated PRs are not mergeable without human fixes. Every existing benchmark asks "do tests pass?" — nobody asks "would a senior developer approve this PR?"
## Six dimensions of production readiness
| Dimension | What it measures | Weight |
|-----------|-----------------|--------|
| **Correctness** | All tests pass, no regressions | 25% |
| **Mergeability** | Would a senior dev merge this? (1–5) | 25% |
| **Security** | New vulnerabilities introduced | 15% |
| **Code quality** | Complexity delta, dead code | 15% |
| **Iterations** | Attempts to reach acceptable output | 10% |
| **Cost efficiency** | USD per successful patch | 10% |
**Verdicts:** `ready_to_merge` (≥85), `needs_review` (65–84), `not_merge_ready` (<65).
## CLI
```bash
qualbench run --tool prollama # score current diff
qualbench run --tool prollama --json # portable JSON output
qualbench run --mode cheap # lowest-cost models
qualbench quickstart # first score in 60 seconds
qualbench compare my_tool # vs leaderboard
qualbench info # dataset summary
qualbench doctor # check dependencies
```
## One portable format everywhere
CLI, API, GitHub Action — same JSON schema. See [docs/schema.md](docs/schema.md).
## Adding your tool
```bash
cp runners/template.py runners/my_tool.py
# Implement run() → return portable schema
qualbench run --tool my_tool
# Submit PR with results
```
## License
Licensed under Apache-2.0.
## Status
_Last updated by [taskill](https://github.com/oqlos/taskill) at 2026-04-25 13:46 UTC_
| Metric | Value |
|---|---|
| HEAD | `c199cf4` |
| Coverage | — |
| Failing tests | — |
| Commits in last cycle | 26 |
> Repository received a mix of fixes, refactors and configuration updates: tests were hardened (mocks added), a JSON test fix was applied, many vallm/style issues and magic numbers were addressed, and release-related features (v0.3.0, Supervisor AI, new runners) were added. PyQual configuration thresholds and gates were adjusted and documentation/TODOs were updated after a PyQual run.