https://github.com/geekychris/hitorro-prbench

HiTorro PR Bench - AI review bot benchmarking with replay engine, golden datasets, and F1 scoring
Host: GitHub
URL: https://github.com/geekychris/hitorro-prbench
Owner: geekychris
Created: 2026-04-20T00:55:53.000Z (3 months ago)
Default Branch: main
Last Pushed: 2026-04-20T23:49:43.000Z (3 months ago)
Last Synced: 2026-04-21T01:34:37.807Z (3 months ago)
Topics: benchmark, github, hitorro, infra
Language: Java
Size: 108 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md
# HiTorro PR Bench

A standalone benchmarking platform for evaluating AI code review bots. PR Bench replays real pull requests as synthetic PRs in mirror repositories, triggers AI reviewer workflows, collects their comments, and measures quality against a human-curated golden dataset using precision, recall, and F1 metrics.

Built with Java 21, Spring Boot 3.2, React 18, and H2 (file-based).

## Feature Highlights

**PR Benchmarking Pipeline**

- Define benchmark suites from real GitHub PRs with known review comments

- Register AI review bots with their GitHub Actions workflow definitions

- Replay PRs as synthetic PRs in mirror repos, injecting bot workflows automatically

- Collect all bot-generated comments (inline reviews, PR reviews, issue comments)

- Grade comments as VALID, INVALID, DUPLICATE, or NEEDS_REVIEW -- individually or in bulk

- Build a golden dataset by promoting exemplary comments

- Measure precision/recall/F1 per bot, compare runs with McNemar's significance test, and track trends over time

**Repository Management**

- Browse and bulk-import repos from GitHub (personal, org, or by visibility)

- Tag repos locally with automatic sync to GitHub Topics

- AI-generated descriptions via Ollama (scans README, file tree, commits, package manifests)

- Docs scanning across all markdown/rst/adoc files in a repo

- Faceted search and filtering by owner, language, tag, fork status, description

- Markdown report generation with summary tables, doc links, and statistics

## Architecture

### System Overview

```mermaid

graph TB

    subgraph Frontend

        React[React 18 SPA<br/>Vite + TypeScript]

    end

    subgraph Backend["Spring Boot 3.2 (port 8090)"]

        Controllers[REST Controllers]

        Orchestrator[RunOrchestrator]

        Replay[ReplayEngine]

        Collector[CommentCollector]

        Similarity[SimilarityService]

        Reporting[ReportingService]

        Ollama[OllamaService]

        GitTools[hitorro-gittools]

    end

    subgraph Storage

        H2[(H2 File DB<br/>./data/prbench)]

    end

    subgraph External

        GitHub[GitHub API]

        OllamaServer[Ollama LLM<br/>localhost:11434]

        Mirror[Mirror Repos<br/>on GitHub]

    end

    React -->|REST API| Controllers

    Controllers --> Orchestrator

    Orchestrator --> Replay

    Orchestrator --> Collector

    Controllers --> Similarity

    Controllers --> Reporting

    Controllers --> Ollama

    Replay --> GitTools

    Replay --> GitHub

    Collector --> GitHub

    Ollama --> OllamaServer

    GitTools -->|clone/push/branch| Mirror

    Controllers --> H2

```

### Benchmark Run Flow

```mermaid

sequenceDiagram

    participant UI as React UI

    participant API as RunController

    participant Orch as RunOrchestrator

    participant RE as ReplayEngine

    participant GH as GitHub API

    participant CC as CommentCollector

    UI->>API: POST /api/runs (suiteId, botIds, concurrency)

    API->>Orch: executeRun(run) [async]

    API-->>UI: 200 OK (run created)

    Note over Orch: Snapshot bot configs

    loop For each SuitePr x Bot (semaphore-controlled)

        Orch->>RE: replay(suitePr, bot, repo)

        RE->>RE: Clone/fetch mirror repo

        RE->>RE: Create base branch at base SHA

        RE->>RE: Create head branch at head SHA

        RE->>RE: Inject bot workflow file + commit

        RE->>GH: Push branches, create PR

        GH-->>RE: PR number + URL

        RE-->>Orch: ReplayResult

        Note over Orch: Poll for bot completion

        loop Until checks/reviews complete or timeout

            Orch->>GH: Check run status / list reviews

        end

        Orch->>CC: collectReplayPrComments(replayPr)

        CC->>GH: List review comments, reviews, issue comments

        CC->>CC: Normalize text, compute Winnowing hashes

        CC-->>Orch: Comments saved

    end

    Orch->>Orch: Mark run COMPLETED

    UI->>API: GET /api/runs/{id}/progress

    API-->>UI: Status counts per replay PR

```
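
The "semaphore-controlled" loop in the diagram can be pictured with a small sketch. This is not the project's `RunOrchestrator` code; `BoundedReplaySketch`, `ReplayTask`, and `runAll` are hypothetical names, and the only point illustrated is how a `java.util.concurrent.Semaphore` caps the number of SuitePr x Bot replays in flight.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

final class BoundedReplaySketch {

    // Hypothetical unit of work: one suite PR replayed for one bot.
    record ReplayTask(long suitePrId, long botId) {}

    // Runs every task with at most `concurrency` in flight; returns the peak observed.
    static int runAll(List<ReplayTask> tasks, int concurrency, Consumer<ReplayTask> replay) {
        if (concurrency < 1) {
            throw new IllegalArgumentException("concurrency must be at least 1");
        }
        Semaphore permits = new Semaphore(concurrency);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (ReplayTask task : tasks) {
                permits.acquire(); // the submitting loop blocks while the limit is reached
                futures.add(pool.submit(() -> {
                    try {
                        peak.accumulateAndGet(inFlight.incrementAndGet(), Math::max);
                        replay.accept(task);
                    } finally {
                        inFlight.decrementAndGet();
                        permits.release(); // free the slot for the next replay
                    }
                }));
            }
            for (Future<?> future : futures) {
                future.get(); // surfaces any failure from a replay
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while running replays", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("replay failed", e.getCause());
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }
}
```

With a concurrency of 2 (the documented default), six tasks from three PRs and two bots would run at most two at a time.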

### Repository Management Components

```mermaid

graph LR

    subgraph "Repo Management"

        Import[Import from GitHub]

        Tags[Tag Management]

        Desc[Description Management]

        Docs[Docs Scanner]

        Report[Report Generator]

    end

    subgraph "External"

        GH[GitHub API]

        LLM[Ollama LLM]

    end

    Import -->|Browse user/org repos| GH

    Tags -->|Sync as Topics| GH

    Desc -->|Push description max 350 chars| GH

    Desc -->|Generate via scan| LLM

    Docs -->|Git tree API recursive| GH

    Report -->|Markdown with tables| Output[Markdown Output]

    LLM -.->|README + file tree + commits + manifest| GH

```

### Database Schema (Key Tables)

```mermaid

erDiagram

    exemplar_repos ||--o{ benchmark_suites : has

    benchmark_suites ||--o{ suite_prs : contains

    benchmark_suites ||--o{ benchmark_runs : has

    benchmark_runs }o--o{ bots : uses

    benchmark_runs ||--o{ replay_prs : creates

    benchmark_runs ||--o{ bot_snapshots : freezes

    replay_prs }o--|| suite_prs : replays

    replay_prs }o--|| bots : uses

    replay_prs ||--o{ review_comments : has

    suite_prs ||--o{ original_comments : has

    suite_prs ||--o{ golden_dataset_entries : has

    review_comments ||--o{ gradings : graded_by

    review_comments ||--o{ comment_similarities : compared_in

    original_comments ||--o{ comment_similarities : compared_in

    exemplar_repos {

        bigint id PK

        varchar name

        varchar github_url

        varchar owner

        varchar repo_name

        varchar mirror_org

        varchar mirror_repo_name

        varchar default_branch

    }

    benchmark_suites {

        bigint id PK

        varchar name

        bigint exemplar_repo_id FK

    }

    suite_prs {

        bigint id PK

        bigint suite_id FK

        int original_pr_number

        varchar base_commit_sha

        varchar head_commit_sha

        int files_changed

    }

    bots {

        bigint id PK

        varchar name

        varchar workflow_file_name

        clob workflow_content

        varchar wait_strategy

        int timeout_seconds

    }

    benchmark_runs {

        bigint id PK

        bigint suite_id FK

        varchar status

        int concurrency

        boolean golden_dataset_enabled

    }

    replay_prs {

        bigint id PK

        bigint run_id FK

        bigint suite_pr_id FK

        bigint bot_id FK

        int mirror_pr_number

        varchar status

    }

    review_comments {

        bigint id PK

        bigint replay_pr_id FK

        varchar source

        varchar comment_type

        varchar file_path

        int line_number

        clob body_normalized

        varchar fingerprint_hash

    }

    golden_dataset_entries {

        bigint id PK

        bigint suite_pr_id FK

        varchar file_path

        int line_number

        varchar issue_type

        clob canonical_body

        boolean active

    }

    gradings {

        bigint id PK

        bigint comment_id

        varchar verdict

        varchar severity

        int stars

    }

    comment_similarities {

        bigint id PK

        varchar strategy

        double score

        boolean is_match

    }

```

## Getting Started

### Prerequisites

- **Java 21** (JDK)

- **Maven 3.8+**

- **Node.js 18+** and npm (for the React frontend)

- **GitHub Personal Access Token** with `repo` scope

- **Ollama** (optional, for AI-generated descriptions) -- install from [ollama.com](https://ollama.com)

- **hitorro-gittools 3.0.0** in your local Maven repository

### Build

```bash

# Build the backend

mvn clean package -DskipTests

# Install frontend dependencies

cd react-app && npm install

```

### Run

The included `run.sh` starts both the backend and frontend dev server:

```bash

# Set your GitHub token

export GITHUB_TOKEN=ghp_your_token_here

# Start both servers

./run.sh

```

This starts:

- **Backend API** at `http://localhost:8090`

- **React dev server** at `http://localhost:3001`

- **Swagger UI** at `http://localhost:8090/swagger-ui.html`

- **H2 Console** at `http://localhost:8090/h2-console`

Alternatively, run each component separately:

```bash

# Backend only

java -jar target/hitorro-pr-bench-1.0.0.jar

# Frontend only (in react-app/)

npm run dev

```

## Configuration

### application.yml

| Property | Default | Description |
|:---------|:--------|:------------|
| `server.port` | `8090` | Backend HTTP port |
| `spring.datasource.url` | `jdbc:h2:file:./data/prbench` | H2 database file path |
| `app.github.token` | `${GITHUB_TOKEN:}` | GitHub PAT (env var or direct) |
| `app.github.api-url` | `https://api.github.com` | GitHub API base URL |
| `app.github.poll-interval-seconds` | `30` | Interval for polling bot completion |
| `app.github.default-bot-timeout-seconds` | `600` | Default timeout waiting for a bot |
| `app.workspace.base-path` | `~/.pr-bench/workspaces` | Local directory for git clones |
| `app.run.default-concurrency` | `2` | Default parallel replay PRs per run |
| `app.run.max-concurrency` | `10` | Maximum allowed concurrency |
| `app.similarity.text-similarity-threshold` | `0.8` | Jaro-Winkler threshold for a match |
| `app.similarity.winnowing-k` | `5` | Winnowing k-gram size |
| `app.similarity.winnowing-w` | `4` | Winnowing window size |
| `app.ollama.url` | `http://localhost:11434` | Ollama server URL |
| `app.ollama.model` | `llama3.2` | Ollama model for description generation |

### GitHub Token

Set via environment variable (recommended) or directly in `application.yml`:

```bash

export GITHUB_TOKEN=ghp_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

```

The token needs `repo` scope for reading/writing repositories, creating PRs, and managing topics.

### Ollama (Optional)

For AI-generated repository descriptions:

```bash

# Install Ollama

curl -fsSL https://ollama.com/install.sh | sh

# Pull the default model

ollama pull llama3.2

# Ollama runs on localhost:11434 by default -- no further config needed

```

## Feature Walkthrough

### Repository Import and Management

1. **Browse GitHub** -- navigate to Repositories, click "Browse GitHub" to see all repos accessible via your token (personal and org)

2. **Import** -- import individual repos or use "Import All" for bulk import

3. **Tag** -- add tags to repos; tags auto-sync to GitHub as Topics (visible on the repo page under "About")

4. **Describe** -- write descriptions manually or click "Generate Description" to have Ollama scan the repo (README, file tree, commits, package manifest) and produce a summary

5. **Scan Docs** -- discovers all markdown, rst, and adoc files in the repo tree

6. **Report** -- generate a markdown report with summary table, per-repo details, doc links, and statistics
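
The "Scan Docs" step amounts to filtering the repository's file tree by extension. The sketch below assumes the tree is already available as a list of paths and that only `.md`, `.rst`, and `.adoc` count as documentation; `DocsScanSketch` and its methods are hypothetical names, not the project's scanner.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class DocsScanSketch {

    // Assumed extension list; the real scanner may recognise more (for example .markdown).
    private static final Set<String> DOC_EXTENSIONS = Set.of(".md", ".rst", ".adoc");

    // True when the file name (not a directory name) ends in a documentation extension.
    static boolean isDocFile(String path) {
        String lower = path.toLowerCase(Locale.ROOT);
        int dot = lower.lastIndexOf('.');
        int slash = lower.lastIndexOf('/');
        return dot > slash && DOC_EXTENSIONS.contains(lower.substring(dot));
    }

    // Keeps only documentation paths, sorted for stable output.
    static List<String> findDocs(List<String> treePaths) {
        return treePaths.stream().filter(DocsScanSketch::isDocFile).sorted().toList();
    }
}
```

The extension check is case-insensitive, so a file such as `docs/api.RST` would still be picked up.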

### Benchmark Suite Setup

1. **Create Exemplar Repo** -- register the GitHub repo containing the PRs you want to benchmark

2. **Create Suite** -- name a benchmark suite and associate it with the exemplar repo

3. **Add PRs** -- select merged PRs to include; the system records base/head SHAs, changed files, and metadata

4. **Collect Original Comments** -- fetch human review comments from the original PRs for comparison

### Bot Configuration

1. **Create Bot** -- provide a name, GitHub Actions workflow YAML, and a wait strategy

2. **Wait Strategies** -- `CHECKS` (wait for check runs to complete), `REVIEWS` (wait for a review to appear), or `BOTH` (wait for both)

3. **Timeout** -- how long to wait before giving up (default 600 seconds)
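
How the three wait strategies and the timeout might combine can be sketched as a polling loop. This is an illustration under assumptions, not the orchestrator's actual logic: the two `BooleanSupplier` arguments stand in for GitHub API lookups, and `BotWaitSketch`, `isDone`, and `waitForBot` are hypothetical names.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

final class BotWaitSketch {

    enum WaitStrategy { CHECKS, REVIEWS, BOTH }

    // Decides whether the bot counts as finished under the given strategy.
    static boolean isDone(WaitStrategy strategy, boolean checksComplete, boolean reviewPresent) {
        return switch (strategy) {
            case CHECKS -> checksComplete;
            case REVIEWS -> reviewPresent;
            case BOTH -> checksComplete && reviewPresent;
        };
    }

    // Polls until the strategy is satisfied (returns true) or the timeout elapses (returns false).
    static boolean waitForBot(WaitStrategy strategy,
                              BooleanSupplier checksComplete,
                              BooleanSupplier reviewPresent,
                              Duration pollInterval,
                              Duration timeout) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (isDone(strategy, checksComplete.getAsBoolean(), reviewPresent.getAsBoolean())) {
                return true;
            }
            if (System.nanoTime() >= deadline) {
                return false; // give up; the caller would mark the replay as timed out
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

With the documented defaults, the interval would be 30 seconds and the timeout 600 seconds.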

### Running a Benchmark

1. **Create Run** -- select a suite, pick bots, set concurrency (max 10 parallel)

2. **Execution** -- the orchestrator creates synthetic PRs in the mirror repo, injects each bot's workflow, pushes, opens PRs, then polls for completion

3. **Monitor** -- the progress endpoint shows status counts (PENDING, CREATING_BRANCHES, WAITING_FOR_BOTS, COLLECTING_COMMENTS, COMPLETED, FAILED)

4. **Cleanup** -- after analysis, clean up mirror branches and close synthetic PRs
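
The status counts reported during monitoring can be thought of as a tally over the replay PRs of a run. The sketch below uses the status names listed above; the response shape of the progress endpoint is not documented here, so `RunProgressSketch`, `countByStatus`, and `isFinished` are hypothetical and only show the idea.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

final class RunProgressSketch {

    enum ReplayStatus {
        PENDING, CREATING_BRANCHES, WAITING_FOR_BOTS, COLLECTING_COMMENTS, COMPLETED, FAILED
    }

    // Tallies replay PRs per status, including zero entries for unused statuses.
    static Map<ReplayStatus, Integer> countByStatus(List<ReplayStatus> statuses) {
        Map<ReplayStatus, Integer> counts = new EnumMap<>(ReplayStatus.class);
        for (ReplayStatus status : ReplayStatus.values()) {
            counts.put(status, 0);
        }
        for (ReplayStatus status : statuses) {
            counts.merge(status, 1, Integer::sum);
        }
        return counts;
    }

    // A run is treated as finished once every replay PR is COMPLETED or FAILED.
    static boolean isFinished(Map<ReplayStatus, Integer> counts) {
        int total = counts.values().stream().mapToInt(Integer::intValue).sum();
        int terminal = counts.getOrDefault(ReplayStatus.COMPLETED, 0)
                     + counts.getOrDefault(ReplayStatus.FAILED, 0);
        return total > 0 && terminal == total;
    }
}
```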

### Grading and Golden Dataset

1. **Grading Queue** -- review ungraded bot comments one by one or in bulk

2. **Verdicts** -- VALID (real issue found), INVALID (false positive), DUPLICATE, NEEDS_REVIEW

3. **Severity and Stars** -- rate comment quality with severity levels and star ratings

4. **Promote to Golden** -- promote validated comments to the golden dataset as ground truth

5. **Export/Import** -- export golden dataset entries for sharing or backup

### Similarity Analysis

Four strategies compare bot comments against original human comments:

| Strategy | Description |
|:---------|:------------|
| EXACT_MATCH | Normalized text is identical |
| FILE_LINE | Same file path and line number |
| NORMALIZED_TEXT | Jaro-Winkler similarity above threshold (default 0.8) |
| WINNOWING | Jaccard similarity of Winnowing fingerprint hash sets |
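
For readers unfamiliar with the NORMALIZED_TEXT metric, here is a textbook Jaro-Winkler sketch with the usual prefix scale of 0.1 and a prefix cap of 4 characters. It is not taken from the project's `TextNormalizer`; `JaroWinklerSketch` is a hypothetical class, and treating "above threshold" as a strict `>` comparison is an assumption.

```java
final class JaroWinklerSketch {

    // Jaro similarity: matches within a sliding window, penalised by transpositions.
    static double jaro(String a, String b) {
        if (a.equals(b)) return 1.0;
        int la = a.length(), lb = b.length();
        if (la == 0 || lb == 0) return 0.0;
        int window = Math.max(0, Math.max(la, lb) / 2 - 1);
        boolean[] matchedA = new boolean[la];
        boolean[] matchedB = new boolean[lb];
        int matches = 0;
        for (int i = 0; i < la; i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(lb - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!matchedB[j] && a.charAt(i) == b.charAt(j)) {
                    matchedA[i] = true;
                    matchedB[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;
        int outOfOrder = 0;
        int k = 0;
        for (int i = 0; i < la; i++) {
            if (!matchedA[i]) continue;
            while (!matchedB[k]) k++;
            if (a.charAt(i) != b.charAt(k)) outOfOrder++;
            k++;
        }
        double m = matches;
        return (m / la + m / lb + (m - outOfOrder / 2.0) / m) / 3.0;
    }

    // Jaro-Winkler: boosts the Jaro score for a shared prefix of up to 4 characters.
    static double jaroWinkler(String a, String b) {
        double jaro = jaro(a, b);
        int limit = Math.min(4, Math.min(a.length(), b.length()));
        int prefix = 0;
        while (prefix < limit && a.charAt(prefix) == b.charAt(prefix)) prefix++;
        return jaro + prefix * 0.1 * (1.0 - jaro);
    }

    // Assumed reading of "above threshold": strictly greater than.
    static boolean isMatch(String a, String b, double threshold) {
        return jaroWinkler(a, b) > threshold;
    }
}
```

The classic pair `MARTHA` / `MARHTA` scores about 0.961 under this definition, so it would count as a match at the default threshold of 0.8.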

Before comparison, text normalization strips markdown, URLs, and punctuation, then lowercases the text.
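
A rough sketch of normalization followed by Winnowing and Jaccard comparison is given here. The regular expressions, the use of `String.hashCode()` as the k-gram hash, and the choice to score two empty fingerprint sets as 0.0 are all assumptions; `SimilaritySketch` is a hypothetical class and may differ from the project's `TextNormalizer`. The `k` and `w` parameters correspond to the `winnowing-k` and `winnowing-w` settings (defaults 5 and 4).

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class SimilaritySketch {

    // Drops URLs, then anything that is not a letter, digit, or whitespace, then lowercases.
    static String normalize(String text) {
        String s = text.replaceAll("https?://\\S+", " ");
        s = s.replaceAll("[^\\p{L}\\p{N}\\s]", " ");
        s = s.toLowerCase(Locale.ROOT);
        return s.trim().replaceAll("\\s+", " ");
    }

    // Winnowing: hash every k-gram, then keep the minimum hash of each window of w hashes.
    static Set<Integer> winnow(String text, int k, int w) {
        Set<Integer> fingerprints = new HashSet<>();
        if (text.isEmpty()) return fingerprints;
        if (text.length() < k) {
            fingerprints.add(text.hashCode()); // too short for a k-gram: hash the whole text
            return fingerprints;
        }
        int n = text.length() - k + 1;
        int[] hashes = new int[n];
        for (int i = 0; i < n; i++) {
            hashes[i] = text.substring(i, i + k).hashCode();
        }
        int windows = Math.max(1, n - w + 1);
        for (int start = 0; start < windows; start++) {
            int end = Math.min(n, start + w);
            int min = hashes[start];
            for (int i = start + 1; i < end; i++) {
                min = Math.min(min, hashes[i]);
            }
            fingerprints.add(min);
        }
        return fingerprints;
    }

    // Jaccard similarity: size of the intersection divided by size of the union.
    static double jaccard(Set<Integer> a, Set<Integer> b) {
        if (a.isEmpty() && b.isEmpty()) return 0.0; // assumed convention for two empty sets
        Set<Integer> shared = new HashSet<>(a);
        shared.retainAll(b);
        int union = a.size() + b.size() - shared.size();
        return shared.size() / (double) union;
    }
}
```

For example, `normalize("**Fix** the NPE: see https://example.com/x, please!")` yields `fix the npe see please` under these rules.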

### Reporting

- **Run Report** -- per-bot totals, verdict breakdowns, grading stats

- **Golden Comparison** -- precision, recall, F1 per bot against the golden dataset

- **Significance Test** -- McNemar's chi-squared test (with continuity correction) between two runs

- **Trend Charts** -- F1/precision/recall over time for a bot (rendered with Recharts)
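
The golden comparison metrics follow the standard definitions. In the sketch below, a true positive is a bot comment that matches a golden entry, a false positive is a bot comment with no golden match, and a false negative is a golden entry the bot missed. How the project decides that a comment "matches" is not shown here, and `F1Sketch` is a hypothetical class.

```java
final class F1Sketch {

    record Scores(double precision, double recall, double f1) {}

    // precision = TP / (TP + FP), recall = TP / (TP + FN), F1 = harmonic mean of the two.
    static Scores score(int truePositives, int falsePositives, int falseNegatives) {
        int predicted = truePositives + falsePositives;
        int expected = truePositives + falseNegatives;
        double precision = predicted == 0 ? 0.0 : truePositives / (double) predicted;
        double recall = expected == 0 ? 0.0 : truePositives / (double) expected;
        double f1 = precision + recall == 0.0
                ? 0.0
                : 2.0 * precision * recall / (precision + recall);
        return new Scores(precision, recall, f1);
    }
}
```

A bot with 8 matched comments, 2 unmatched comments, and 4 missed golden entries would score precision 0.8, recall of about 0.667, and F1 of about 0.727.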

## API Reference

### Setup (`/api/setup`)

| Method | Path | Description |
|:-------|:-----|:------------|
| GET | `/status` | GitHub token status and connectivity |
| POST | `/token` | Set GitHub token at runtime |
| GET | `/rate-limit` | Current GitHub API rate limit |

### Repositories (`/api/repos`)

| Method | Path | Description |
|:-------|:-----|:------------|
| GET | `/` | List repos (filter: tag, search, language, owner, hasNotes, isFork) |
| POST | `/` | Create repo by GitHub URL |
| POST | `/import` | Import single repo from GitHub |
| POST | `/import-all` | Bulk import repos |
| GET | `/{id}` | Get repo by ID |
| PUT | `/{id}` | Update repo |
| DELETE | `/{id}` | Delete repo |
| GET | `/{id}/github-status` | Compare local vs live GitHub state |
| POST | `/{id}/sync-to-github` | Push description + tags to GitHub |
| POST | `/{id}/tags` | Add tag (syncs to GitHub Topics) |
| DELETE | `/{id}/tags/{tag}` | Remove tag |
| POST | `/bulk-tag` | Add tag to multiple repos |
| GET | `/meta/tags` | List all tags |
| GET | `/meta/stats` | Faceted stats (by owner, language, tag, fork) |
| POST | `/{id}/notes` | Set description (pushes to GitHub, max 350 chars) |
| POST | `/{id}/generate-description` | AI-generate description via Ollama |
| POST | `/meta/generate-descriptions` | Bulk AI-generate for all repos without descriptions |
| GET | `/github/browse` | Browse GitHub repos accessible via token |
| GET | `/github/orgs` | List user's GitHub organizations |
| GET | `/github/orgs/{org}/repos` | List repos in an organization |
| GET | `/{id}/prs` | List PRs from GitHub for a repo |
| POST | `/{id}/scan-docs` | Scan repo for documentation files |
| POST | `/meta/scan-docs` | Bulk scan repos for docs |
| GET | `/{id}/docs` | Get scanned docs for a repo |
| POST | `/meta/report` | Generate markdown report for selected repos |

### Benchmark Suites (`/api/suites`)

| Method | Path | Description |
|:-------|:-----|:------------|
| GET | `/` | List suites (filter: repoId) |
| POST | `/` | Create suite |
| GET | `/{id}` | Get suite |
| DELETE | `/{id}` | Delete suite |
| GET | `/{id}/prs` | List PRs in suite |
| POST | `/{id}/prs` | Add PR to suite |
| DELETE | `/{suiteId}/prs/{prId}` | Remove PR from suite |
| POST | `/suite-prs/{id}/collect-original-comments` | Collect human comments from original PR |

### Bots (`/api/bots`)

| Method | Path | Description |
|:-------|:-----|:------------|
| GET | `/` | List all bots |
| POST | `/` | Create bot (name, workflow YAML, wait strategy, timeout) |
| GET | `/{id}` | Get bot |
| PUT | `/{id}` | Update bot |
| DELETE | `/{id}` | Delete bot |

### Runs (`/api/runs`)

| Method | Path | Description |
|:-------|:-----|:------------|
| GET | `/` | List runs (filter: suiteId) |
| POST | `/` | Create and start run (suiteId, botIds, concurrency) |
| GET | `/{id}` | Get run details |
| GET | `/{id}/progress` | Status counts per replay PR |
| POST | `/{id}/cancel` | Cancel a running benchmark |
| GET | `/{id}/replay-prs` | List replay PRs for run |
| GET | `/{id}/bot-snapshots` | Bot config snapshots taken at run start |
| POST | `/{id}/cleanup` | Close mirror PRs and delete branches |
| GET | `/replay-prs/{id}/comments` | Comments collected for a replay PR |
| GET | `/{runId}/similarities` | Compute pairwise similarity analysis |
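
As an illustration of calling the run endpoint from Java, the sketch below builds a `POST /api/runs` request with the JDK's `java.net.http` API. The field names `suiteId`, `botIds`, and `concurrency` come from the table above; the exact JSON shape the server expects is an assumption, and `StartRunSketch` is a hypothetical helper.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.List;
import java.util.stream.Collectors;

final class StartRunSketch {

    // Builds the request body by hand to keep the sketch free of JSON libraries.
    static String runJson(long suiteId, List<Long> botIds, int concurrency) {
        String ids = botIds.stream().map(String::valueOf).collect(Collectors.joining(","));
        return "{\"suiteId\":" + suiteId
                + ",\"botIds\":[" + ids + "]"
                + ",\"concurrency\":" + concurrency + "}";
    }

    // Prepares (but does not send) the request that creates and starts a run.
    static HttpRequest startRunRequest(String baseUrl, long suiteId, List<Long> botIds, int concurrency) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/runs"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(runJson(suiteId, botIds, concurrency)))
                .build();
    }
}
```

Sending it would be a matter of `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())` against `http://localhost:8090`, the default backend address.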

### Grading (`/api`)

| Method | Path | Description |
|:-------|:-----|:------------|
| POST | `/gradings` | Create grading (verdict, severity, stars, notes) |
| PUT | `/gradings/{id}` | Update grading |
| DELETE | `/gradings/{id}` | Delete grading |
| GET | `/comments/{id}/gradings` | Get gradings for a comment |
| POST | `/gradings/bulk` | Bulk grade multiple comments |
| GET | `/grading-queue` | Ungraded comments queue |
| GET | `/grading-progress` | Grading completion stats |

### Golden Dataset (`/api/golden-dataset`)

| Method | Path | Description |
|:-------|:-----|:------------|
| GET | `/` | List entries (filter: suiteId) |
| POST | `/promote` | Promote a comment to golden dataset |
| PUT | `/{id}` | Update entry |
| DELETE | `/{id}` | Remove entry |
| GET | `/export` | Export golden dataset as JSON |

### Reports (`/api/reports`)

| Method | Path | Description |
|:-------|:-----|:------------|
| GET | `/runs/{runId}` | Run report with per-bot stats |
| GET | `/runs/{runId}/comparison` | Compare against golden dataset (P/R/F1) |
| GET | `/bots/{botId}/trend` | F1 trend over recent runs |
| GET | `/runs/{runAId}/significance` | McNemar's test between two runs |
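
For context on the significance endpoint, McNemar's test with continuity correction uses only the discordant pairs: items that run A got right and run B got wrong, and the reverse. The sketch below assumes the paired items are golden entries detected or missed by each run, which the README does not spell out; `McNemarSketch` is a hypothetical class, and 3.841 is the chi-squared critical value for one degree of freedom at the 0.05 level.

```java
final class McNemarSketch {

    // Chi-squared critical value for 1 degree of freedom at alpha = 0.05.
    static final double CRITICAL_VALUE_ALPHA_05 = 3.841;

    // Statistic with continuity correction: (|b - c| - 1)^2 / (b + c).
    static double statistic(int onlyRunAHit, int onlyRunBHit) {
        int discordant = onlyRunAHit + onlyRunBHit;
        if (discordant == 0) {
            return 0.0; // the runs never disagree, so there is nothing to test
        }
        double diff = Math.abs(onlyRunAHit - onlyRunBHit) - 1.0;
        return diff * diff / discordant;
    }

    // True when the difference between the two runs is significant at the 0.05 level.
    static boolean isSignificant(int onlyRunAHit, int onlyRunBHit) {
        return statistic(onlyRunAHit, onlyRunBHit) > CRITICAL_VALUE_ALPHA_05;
    }
}
```

With 15 entries found only by run A and 5 found only by run B, the statistic is (10 - 1)^2 / 20 = 4.05, which exceeds 3.841.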

### Issue Types (`/api/issue-types`)

| Method | Path | Description |
|:-------|:-----|:------------|
| GET | `/` | List all issue types |
| GET | `/categories` | List issue type categories |
| GET | `/{id}` | Get issue type |
| GET | `/code/{code}` | Get issue type by code |
| POST | `/` | Create issue type |
| PUT | `/{id}` | Update issue type |
| DELETE | `/{id}` | Delete issue type |

Pre-seeded issue types: NULL_DEREF, SQL_INJECTION, XSS, RESOURCE_LEAK, RACE_CONDITION, ERROR_HANDLING, PERFORMANCE, CODE_STYLE, NAMING, DEAD_CODE, COMPLEXITY, DOCUMENTATION.

## Frontend Pages

The React SPA provides these pages via sidebar navigation:

| Page | Route | Description |
|:-----|:------|:------------|
| Dashboard | `/` | Overview stats and recent activity |
| Repositories | `/repos` | Browse, import, tag, describe, and manage repos |
| Report & Docs | `/repo-report` | Generate markdown reports and browse scanned docs |
| Suites | `/suites` | Create and manage benchmark suites |
| Suite Detail | `/suites/:id` | View/add PRs in a suite, collect original comments |
| Bots | `/bots` | Define AI review bots with workflow YAML |
| Runs | `/runs` | Start benchmark runs, view status |
| Run Detail | `/runs/:id` | Monitor replay PRs, view progress, trigger cleanup |
| Replay PR Detail | `/replay-prs/:id` | View collected comments for a single replay PR |
| Golden Dataset | `/golden-dataset` | Curate ground-truth entries, export/import |
| Grading Queue | `/grading-queue` | Grade bot comments (VALID/INVALID/DUPLICATE) |
| Run Report | `/reports/:runId` | Per-bot stats, verdict breakdowns, golden comparison |
| Trends | `/trend` | F1/precision/recall charts over time (Recharts) |
| Issue Types | `/issue-types` | Manage the issue type taxonomy |
| Settings | `/settings` | GitHub token, Ollama status, app config |

**Tech stack:** React 18, TypeScript, Vite 5, TanStack Query 5, Recharts 2, React Router 6.

## Integration with hitorro-gittools

The `ReplayEngine` depends on `hitorro-gittools` (v3.0.0) for all git operations during PR replay:

- **Clone** -- clones the mirror repository to the local workspace (`~/.pr-bench/workspaces/{org}/{repo}`)

- **Fetch** -- fetches latest refs before creating branches

- **Branch creation** -- creates base and head branches at the exact commit SHAs from the original PR

- **Checkout** -- switches between branches during replay

- **Push** -- pushes base and head branches to the mirror remote

- **Raw git commands** -- uses `gitService.getRunner().runOrThrow()` to stage and commit injected workflow files

The `GitService` and `GitCredentials` classes from hitorro-gittools handle authentication using the GitHub PAT.
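
To make the replay steps concrete, the sketch below lists plain `git` commands that would achieve roughly the same effect. It does not use the hitorro-gittools API, whose method signatures are not documented here. The branch names, the commit message, and the choice to commit the workflow on the head branch are assumptions; `ReplayGitCommandsSketch` is a hypothetical helper that only builds argument lists and runs nothing.

```java
import java.util.List;

final class ReplayGitCommandsSketch {

    // Returns the git invocations for one replay, in order, as argument lists.
    static List<List<String>> replayCommands(String baseBranch, String baseSha,
                                             String headBranch, String headSha,
                                             String workflowFileName) {
        // GitHub Actions only picks up workflows stored under .github/workflows/.
        String workflowPath = ".github/workflows/" + workflowFileName;
        return List.of(
                List.of("git", "fetch", "origin"),
                List.of("git", "branch", "-f", baseBranch, baseSha),
                List.of("git", "branch", "-f", headBranch, headSha),
                List.of("git", "checkout", headBranch),
                // The workflow file must be written into the working tree before this step.
                List.of("git", "add", workflowPath),
                List.of("git", "commit", "-m", "Inject bot workflow " + workflowFileName),
                List.of("git", "push", "origin", baseBranch, headBranch));
    }
}
```

Each list could be handed to `ProcessBuilder` with the workspace clone as its working directory; the synthetic PR is then opened through the GitHub API from the head branch into the base branch.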

## Testing

```bash

# Run all tests

mvn test

# Run with verbose output

mvn test -Dtest.verbose=true

```

The project uses:

- **JUnit 5** (Jupiter) via `spring-boot-starter-test`

- **Spring Boot Test** for integration testing with auto-configured H2

- **Flyway** migrations run automatically in test context

## Project Structure

```

hitorro-pr-bench/

|-- pom.xml                              # Maven build (Spring Boot 3.2 parent)

|-- run.sh                               # Start backend + frontend

|-- src/

|   |-- main/

|   |   |-- java/com/hitorro/prbench/

|   |   |   |-- PrBenchApplication.java  # Entry point (@EnableAsync, @EnableScheduling)

|   |   |   |-- controller/

|   |   |   |   |-- RepoController.java          # Repository management + GitHub browsing

|   |   |   |   |-- SuiteController.java          # Benchmark suite CRUD + PR selection

|   |   |   |   |-- BotController.java            # AI bot definitions

|   |   |   |   |-- RunController.java            # Benchmark run lifecycle

|   |   |   |   |-- GradingController.java        # Comment grading + queue

|   |   |   |   |-- GoldenDatasetController.java  # Golden dataset management

|   |   |   |   |-- ReportController.java         # Reporting endpoints

|   |   |   |   |-- IssueTypeController.java      # Issue type taxonomy

|   |   |   |   |-- SetupController.java          # Token + connectivity setup

|   |   |   |-- service/

|   |   |   |   |-- RunOrchestrator.java   # Async run execution with semaphore concurrency

|   |   |   |   |-- ReplayEngine.java      # Git-based PR replay via hitorro-gittools

|   |   |   |   |-- CommentCollector.java  # GitHub comment fetching + normalization

|   |   |   |   |-- SimilarityService.java # Pairwise comment comparison (4 strategies)

|   |   |   |   |-- ReportingService.java  # P/R/F1, McNemar's test, trend data

|   |   |   |   |-- OllamaService.java     # LLM description generation

|   |   |   |   |-- GitHubApiService.java  # GitHub REST API client

|   |   |   |   |-- TextNormalizer.java    # Text normalization + Winnowing + Jaro-Winkler

|   |   |   |-- entity/                    # JPA entities (13 tables)

|   |   |   |-- repository/               # Spring Data JPA repositories

|   |   |-- resources/

|   |       |-- application.yml            # App configuration

|   |       |-- db/migration/

|   |           |-- V1__core_tables.sql    # Core schema (12 tables)

|   |           |-- V2__webhook_events.sql # Webhooks, bot snapshots, issue types, schedules

|   |-- test/

|-- react-app/

|   |-- package.json                       # React 18 + Vite 5 + TanStack Query

|   |-- src/

|   |   |-- App.tsx                        # Router + sidebar navigation

|   |   |-- pages/                         # 15 page components

|-- data/

|   |-- prbench.mv.db                     # H2 database file (created at runtime)

```