https://github.com/ivkond/litmus
LLM coding agent benchmark TUI: run scenarios across Claude Code, Codex, Aider, KiloCode and more, then compare results
agents ai aider benchmark claude code-generation codex coding-agents evaluation llm python testing textual tui
- Host: GitHub
- URL: https://github.com/ivkond/litmus
- Owner: ivkond
- License: mit
- Created: 2026-03-24T19:02:45.000Z (about 2 months ago)
- Default Branch: main
- Last Pushed: 2026-03-31T12:28:24.000Z (about 2 months ago)
- Last Synced: 2026-04-04T03:56:20.833Z (about 1 month ago)
- Topics: agents, ai, aider, benchmark, claude, code-generation, codex, coding-agents, evaluation, llm, python, testing, textual, tui
- Language: Python
- Size: 155 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
- Agents: AGENTS.md
# Litmus 🧪
**Terminal UI for running LLM agent scenarios and comparing their performance.**
Litmus executes coding tasks across multiple AI agents and models, runs tests against the results, and produces detailed evaluation reports, all from a single TUI.
## What it does
1. **Detects agents** installed on your system (Claude Code, Codex, Aider, Cursor Agent, KiloCode, OpenCode)
2. **Runs scenarios**: each scenario is a coding task with tests and scoring criteria
3. **Evaluates results**: an LLM judge scores agent and model performance across 20 criteria each
4. **Generates reports**: HTML reports with per-scenario breakdowns, logs, and scores
## Supported agents
| Agent | Binary | Model listing |
|-------|--------|---------------|
| Claude Code | `claude` | Built-in list |
| Codex | `codex` | Built-in list |
| OpenCode | `opencode` | `opencode models` |
| KiloCode | `kilocode` | `kilocode models` |
| Aider | `aider` | `aider --list-models` |
| Cursor Agent | `agent` | `agent models` |
Litmus auto-detects which agents are available and queries their model lists.
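
As an illustration of what detection can look like, here is a minimal sketch that checks `PATH` for the binaries in the table above. The `AGENT_BINARIES` mapping and the `detect_agents` helper are hypothetical names made up for this example, not Litmus's actual API, and the real detection logic may differ.

```python
import shutil

# Binary names come from the table above; the mapping itself is illustrative.
AGENT_BINARIES = {
    "Claude Code": "claude",
    "Codex": "codex",
    "OpenCode": "opencode",
    "KiloCode": "kilocode",
    "Aider": "aider",
    "Cursor Agent": "agent",
}


def detect_agents(binaries=AGENT_BINARIES):
    """Return {agent name: resolved path} for every binary found on PATH."""
    found = {}
    for name, binary in binaries.items():
        path = shutil.which(binary)  # None when the binary is not installed
        if path is not None:
            found[name] = path
    return found


if __name__ == "__main__":
    for name, path in detect_agents().items():
        print(f"{name}: {path}")
```

Running the script prints one line per agent found on the current machine and nothing when none are installed.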
## Quick start
Requires **Python 3.12+**.
```bash
pip install litmus-llm
litmus init # create a workspace with a sample scenario
litmus # open the TUI
```
Or run without installing via [uv](https://docs.astral.sh/uv/):
```bash
uvx --from litmus-llm litmus
```
### Development setup
```bash
git clone https://github.com/ivkond/litmus.git
cd litmus
uv sync
uv run litmus
```
### TUI workflow
1. **Models**: select agents and models to test
2. **Scenarios**: pick which coding tasks to run
3. **Run**: watch execution progress in real time
4. **Analysis**: review LLM-judged scores
5. **Reports**: browse generated HTML reports
## How it works
Each scenario lives in its own directory under `template/` and contains:
```
template/1-data-structure/
prompt.txt # Task description sent to the agent
task.txt # Detailed requirements
scoring.csv # Evaluation criteria
project/ # Starter code with tests
```
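
When writing a new scenario, a quick layout check can catch a missing file before a run. The sketch below assumes all four entries shown above are required, which the README does not state explicitly. `missing_entries` is a hypothetical helper, not part of Litmus.

```python
from pathlib import Path

# Entries taken from the layout above. Treating every one of them as
# mandatory is an assumption made for this sketch.
REQUIRED_FILES = ("prompt.txt", "task.txt", "scoring.csv")
REQUIRED_DIRS = ("project",)


def missing_entries(scenario_dir):
    """List the expected files and directories that are absent."""
    root = Path(scenario_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    missing += [name + "/" for name in REQUIRED_DIRS if not (root / name).is_dir()]
    return missing
```

For example, `missing_entries("template/1-data-structure")` returns an empty list when the scenario directory is complete.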
Execution pipeline per scenario:
```
uv sync -> agent call -> pytest -> collect logs
```
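
The pipeline above can be sketched as a list of commands run in order, with output captured for each step. This is an assumption-laden illustration rather than the code in `run.py`: `run_pipeline` is a hypothetical helper, the demo uses stand-in commands instead of real agent CLIs, and continuing after a failed step (so that failing tests still produce logs) is a design choice of this sketch.

```python
import subprocess
import sys


def run_pipeline(steps, cwd=None):
    """Run (label, argv) steps in order and collect a log entry for each."""
    logs = []
    for label, argv in steps:
        proc = subprocess.run(argv, cwd=cwd, capture_output=True, text=True)
        logs.append(
            {
                "step": label,
                "returncode": proc.returncode,
                "stdout": proc.stdout,
                "stderr": proc.stderr,
            }
        )
    return logs


if __name__ == "__main__":
    # Stand-in commands so the sketch runs anywhere. A real run would use
    # `uv sync`, the selected agent's CLI, and `pytest` in these slots.
    demo_steps = [
        ("sync", [sys.executable, "-c", "print('dependencies synced')"]),
        ("agent", [sys.executable, "-c", "print('agent finished')"]),
        ("test", [sys.executable, "-c", "print('tests passed')"]),
    ]
    for entry in run_pipeline(demo_steps):
        print(entry["step"], entry["returncode"], entry["stdout"].strip())
```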
After all runs complete, an LLM judge evaluates the results using 20 agent criteria (tool efficiency, error recovery, reasoning depth...) and 20 model criteria (code correctness, instruction following, hallucination resistance...).
## Configuration
On first launch, Litmus generates a config file with detected agents and their settings. Configure the analysis model (any OpenAI-compatible API) through the TUI settings screen.
## Scenario packs
Litmus supports exporting and importing scenario archives (`.litmus-pack` ZIP files) for sharing test suites between machines or teams.
## Project structure
```
src/litmus/
__init__.py # Entry point, workspace init
app.py # Main app, menu screen
agents.py # Agent registry, detection, model listing
run.py # Scenario execution engine
analysis.py # LLM-powered evaluation (20+20 criteria)
report.py # HTML report generation
pack/ # Scenario export/import
screens/ # TUI screens (models, scenarios, run, results, analysis)
```
## Tech stack
- [Textual](https://textual.textualize.io/): TUI framework
- [Rich](https://rich.readthedocs.io/): terminal formatting
- [Pydantic](https://docs.pydantic.dev/): structured evaluation models
- [OpenAI SDK](https://github.com/openai/openai-python): LLM judge (any compatible API)
## License
[MIT](LICENSE)