https://github.com/dokimos-dev/dokimos
LLM and agent evaluation for Java & Kotlin. Runs in JUnit and CI. Spring AI, LangChain4j, Koog.
- Host: GitHub
- URL: https://github.com/dokimos-dev/dokimos
- Owner: dokimos-dev
- License: mit
- Created: 2025-12-13T21:20:12.000Z (6 months ago)
- Default Branch: master
- Last Pushed: 2026-06-02T10:57:57.000Z (8 days ago)
- Last Synced: 2026-06-02T12:22:01.709Z (8 days ago)
- Topics: agent-evaluation, agentic-ai, evaluation, evaluation-framework, evaluation-metrics, java, junit, junit-extension, koog, kotlin, langchain4j, llm, llm-evaluation, llm-evaluation-framework, llm-evaluation-metrics, rag, rag-evaluation, retrieval-augmented-generation, spring-ai, spring-ai-evaluation
- Language: Java
- Homepage: https://dokimos.dev/overview
- Size: 2.74 MB
- Stars: 36
- Watchers: 1
- Forks: 3
- Open Issues: 4
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Agents: AGENTS.md
Awesome Lists containing this project
- awesome-java - Dokimos
README
# Dokimos
The LLM evaluation framework for Java and Kotlin
---
Dokimos is an evaluation framework for LLM applications in Java and Kotlin. It helps you evaluate responses, track quality over time, and catch regressions before they reach production.
It integrates with **JUnit**, **LangChain4j**, **Spring AI**, **Spring AI Alibaba**, **Koog**, and **Embabel** so you can run evaluations as part of your existing test suite and CI/CD pipeline. It evaluates both LLM responses and agent behavior, including tool calls and execution traces.
## Why Dokimos?
- **JUnit integration**: Run evaluations as parameterized tests in your existing test suite.
- **Framework agnostic**: Works with LangChain4j, Spring AI, Spring AI Alibaba, Koog, and Embabel, or any LLM client. Powered by any LLM.
- **Built-in evaluators**: Hallucination detection, faithfulness, contextual relevance, LLM-as-a-judge, and more.
- **Agent evaluation**: Evaluate AI agents with tool call validation, task completion, argument hallucination detection, and tool reliability checks.
- **Cost & latency tracking**: Capture per-call tokens, cost, and latency across all five adapters, with a pluggable `PriceTable` seam (you supply the prices) and per-run roll-ups.
- **Custom evaluators**: Build your own metrics by extending `BaseEvaluator` or using `LLMJudgeEvaluator`.
- **Dataset support**: Load test cases from JSON, CSV, or define them programmatically.
- **CI/CD ready**: Runs locally or in any CI/CD environment. Fail builds when quality drops.
- **Kotlin as a first-class citizen**: Compose all tests with a convenient Kotlin DSL.
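To make the cost and latency bullet above more concrete, here is a minimal, self-contained sketch of how a per-run roll-up could be computed from per-call token counts and a user-supplied price map. Everything in it (`CostRollupSketch`, `CallUsage`, `ModelPrice`, the per-million-token pricing unit, the model name) is hypothetical and written against the JDK only; it is not Dokimos's `PriceTable` interface, whose signature is not shown in this README.

```java
import java.util.List;
import java.util.Map;

// Illustration only: hypothetical types, NOT part of the Dokimos API.
class CostRollupSketch {

    // Usage captured for a single LLM call.
    record CallUsage(String model, long inputTokens, long outputTokens, long latencyMillis) {}

    // Assumed pricing unit: US dollars per one million tokens.
    record ModelPrice(double inputPerMillion, double outputPerMillion) {}

    // Cost of one call: tokens / 1e6 * price per million, for input plus output.
    static double costOf(CallUsage call, Map<String, ModelPrice> prices) {
        ModelPrice price = prices.get(call.model());
        if (price == null) {
            throw new IllegalArgumentException("No price configured for model: " + call.model());
        }
        return call.inputTokens() / 1_000_000.0 * price.inputPerMillion()
                + call.outputTokens() / 1_000_000.0 * price.outputPerMillion();
    }

    // Per-run roll-up: total cost across all calls.
    static double totalCost(List<CallUsage> calls, Map<String, ModelPrice> prices) {
        return calls.stream().mapToDouble(c -> costOf(c, prices)).sum();
    }

    // Per-run roll-up: mean latency, or 0.0 when there are no calls.
    static double averageLatencyMillis(List<CallUsage> calls) {
        return calls.stream().mapToLong(CallUsage::latencyMillis).average().orElse(0.0);
    }

    public static void main(String[] args) {
        // "example-model" and its prices are made-up values supplied by the caller.
        Map<String, ModelPrice> prices = Map.of("example-model", new ModelPrice(2.0, 4.0));
        List<CallUsage> calls = List.of(
                new CallUsage("example-model", 1_000_000, 500_000, 200),
                new CallUsage("example-model", 0, 500_000, 400));
        System.out.println("Total cost: " + totalCost(calls, prices));
        System.out.println("Average latency (ms): " + averageLatencyMillis(calls));
    }
}
```

Keeping prices in a caller-supplied map mirrors the idea that the framework records usage while you decide what each token costs.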
## Quick Start
Add the dependency to your `pom.xml` (check [Maven Central](https://central.sonatype.com/artifact/dev.dokimos/dokimos-core) for the latest version):
```xml
<dependency>
    <groupId>dev.dokimos</groupId>
    <artifactId>dokimos-core</artifactId>
    <version>${dokimos.version}</version>
</dependency>
```
### Run a standalone evaluator
Evaluate a single response directly:
#### Java
```java
Evaluator evaluator = ExactMatchEvaluator.builder()
    .name("Exact Match")
    .threshold(1.0)
    .build();

EvalTestCase testCase = EvalTestCase.of("What is 2+2?", "4", "4");
EvalResult result = evaluator.evaluate(testCase);

System.out.println("Passed: " + result.success()); // true
System.out.println("Score: " + result.score());    // 1.0
```
#### Kotlin
```kotlin
val evaluator = exactMatch {
    name = "Exact Match"
    threshold = 1.0
}

val testCase = EvalTestCase.of("What is 2+2?", "4", "4")
val result = evaluator.evaluate(testCase)

println("Passed: ${result.success()}") // true
println("Score: ${result.score()}")    // 1.0
```
### Write a JUnit test
Use `@DatasetSource` to run evaluations as parameterized tests:
#### Java
```java
JudgeLM judgeLM = prompt -> openAiClient.generate(prompt);

Evaluator correctnessEvaluator = LLMJudgeEvaluator.builder()
    .name("Correctness")
    .criteria("Is the answer correct and complete?")
    .evaluationParams(List.of(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT))
    .judge(judgeLM)
    .build();

@ParameterizedTest
@DatasetSource("classpath:datasets/qa.json")
void testQAResponses(Example example) {
    String response = assistant.chat(example.input());
    EvalTestCase testCase = example.toTestCase(response);
    Assertions.assertEval(testCase, correctnessEvaluator);
}
```
#### Kotlin
```kotlin
val judgeLM = JudgeLM { prompt -> openAiClient.generate(prompt) }

val correctnessEvaluator = llmJudge(judgeLM) {
    name = "Correctness"
    criteria = "Is the answer correct and complete?"
    params(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT)
}

class QaTests {
    @ParameterizedTest
    @DatasetSource("classpath:datasets/qa.json")
    fun testQAResponses(example: Example) {
        val response = assistant.chat(example.input())
        val testCase = example.toTestCase(response)
        Assertions.assertEval(testCase, correctnessEvaluator)
    }
}
```
### Evaluate a dataset in bulk
Run experiments across entire datasets with aggregated metrics:
#### Java
```java
JudgeLM judgeLM = prompt -> openAiClient.generate(prompt);

Evaluator correctnessEvaluator = LLMJudgeEvaluator.builder()
    .name("Correctness")
    .criteria("Is the answer correct?")
    .evaluationParams(List.of(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT))
    .judge(judgeLM)
    .build();

Dataset dataset = Dataset.builder()
    .name("QA Dataset")
    .addExample(Example.of("What is 2+2?", "4"))
    .addExample(Example.of("Capital of France?", "Paris"))
    .build();

ExperimentResult result = Experiment.builder()
    .name("QA Evaluation")
    .dataset(dataset)
    .task(example -> Map.of("output", yourLLM.generate(example.input())))
    .evaluators(List.of(correctnessEvaluator))
    .build()
    .run();

// Check results
System.out.println("Pass rate: " + result.passRate());
System.out.println("Correctness avg: " + result.averageScore("Correctness"));

// Export to multiple formats
result.exportHtml(Path.of("report.html"));
result.exportJson(Path.of("results.json"));
```
#### Kotlin
```kotlin
val judgeLM = JudgeLM { prompt -> openAiClient.generate(prompt) }

val result = experiment {
    name = "QA Evaluation"
    dataset {
        name = "QA Dataset"
        example {
            input = "What is 2+2?"
            expected = "4"
        }
        example {
            input = "Capital of France?"
            expected = "Paris"
        }
    }
    task { example ->
        mapOf("output" to yourLLM.generate(example.input()))
    }
    evaluators {
        llmJudge(judgeLM) {
            name = "Correctness"
            criteria = "Is the answer correct?"
            params(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT)
        }
    }
}.run()

// Check results
println("Pass rate: ${result.passRate()}")
println("Correctness avg: ${result.averageScore("Correctness")}")

// Export to multiple formats
result.exportHtml(Path.of("report.html"))
result.exportJson(Path.of("results.json"))
```
See more patterns in the [dokimos-examples](./dokimos-examples) module.
## Features
**Dataset-driven evaluation**
Load test cases from JSON, CSV, or build them programmatically. Version your datasets alongside your code.
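As a rough illustration of what keeping test cases in a versioned file involves, the sketch below parses a two-column CSV into input/expected pairs using only the JDK. The column layout (`input,expectedOutput`), the `QaRow` record, and the parser itself are assumptions made for this example; Dokimos ships its own dataset loaders, and their file schema is not shown in this README.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: hypothetical types, NOT the Dokimos dataset loader.
class CsvDatasetSketch {

    // One test case: the prompt and the answer we expect.
    record QaRow(String input, String expectedOutput) {}

    // Parses a two-column CSV with a header row.
    // Simplifying assumption: no quoted fields and no commas in the first column.
    static List<QaRow> parse(String csv) {
        List<QaRow> rows = new ArrayList<>();
        boolean headerPending = true;
        for (String line : csv.split("\\R")) {
            if (line.isBlank()) {
                continue; // tolerate empty lines
            }
            if (headerPending) {
                headerPending = false; // skip the header row
                continue;
            }
            String[] parts = line.split(",", 2);
            if (parts.length != 2) {
                throw new IllegalArgumentException("Expected two columns but got: " + line);
            }
            rows.add(new QaRow(parts[0].trim(), parts[1].trim()));
        }
        return rows;
    }

    public static void main(String[] args) {
        String csv = "input,expectedOutput\nWhat is 2+2?,4\nCapital of France?,Paris\n";
        parse(csv).forEach(System.out::println);
    }
}
```

Because the file is plain text, a change to the test cases shows up in code review like any other diff.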
**Built-in evaluators**
Ready-to-use evaluators for hallucination detection, faithfulness, contextual relevance, and LLM-as-a-judge patterns.
**Agent evaluation**
Evaluate AI agents that use tools: validate tool call correctness, check task completion, detect argument hallucinations, and assess tool definition quality.
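The idea behind tool call validation can be sketched without any framework: compare the calls an agent actually made against the calls you expected. The `ToolCall` record and the order-insensitive scoring rule below are assumptions chosen for illustration; they are not Dokimos's trace model or its built-in agent evaluators.

```java
import java.util.List;
import java.util.Map;

// Illustration only: hypothetical types, NOT the Dokimos agent evaluators.
class ToolCallCheckSketch {

    // A tool invocation: the tool name plus the arguments passed to it.
    record ToolCall(String name, Map<String, Object> arguments) {}

    // Fraction of expected calls that appear among the actual calls with the
    // same name and the same arguments. Order is ignored, and repeated
    // identical calls are not counted separately (a deliberate simplification).
    static double matchScore(List<ToolCall> expected, List<ToolCall> actual) {
        if (expected.isEmpty()) {
            return 1.0; // nothing was required, so nothing is missing
        }
        long matched = expected.stream().filter(actual::contains).count();
        return (double) matched / expected.size();
    }

    public static void main(String[] args) {
        List<ToolCall> expected = List.of(
                new ToolCall("getWeather", Map.of("city", "Paris")),
                new ToolCall("bookFlight", Map.of("to", "Paris")));
        List<ToolCall> actual = List.of(
                new ToolCall("getWeather", Map.of("city", "Paris")),
                new ToolCall("bookFlight", Map.of("to", "Rome"))); // wrong argument
        System.out.println("Tool call match score: " + matchScore(expected, actual));
    }
}
```

A wrong argument counts as a miss here, which is roughly the kind of failure that argument hallucination checks are meant to surface.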
**Experiment tracking**
Aggregate results across runs, calculate pass rates, and export to JSON, HTML, Markdown, or CSV.
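To show what aggregation and a CI quality gate amount to, here is a small JDK-only sketch that computes a pass rate and a per-evaluator average, then throws when the pass rate falls below a minimum. The `Outcome` record, the method names, and the 0.9 gate are hypothetical; in Dokimos itself the README's `ExperimentResult.passRate()` and `averageScore(...)` play this role.

```java
import java.util.List;

// Illustration only: hypothetical types, NOT the Dokimos ExperimentResult.
class QualityGateSketch {

    // One evaluator verdict for one test case.
    record Outcome(String evaluator, double score, boolean passed) {}

    // Share of outcomes that passed; 0.0 for an empty run.
    static double passRate(List<Outcome> outcomes) {
        if (outcomes.isEmpty()) {
            return 0.0;
        }
        long passed = outcomes.stream().filter(Outcome::passed).count();
        return (double) passed / outcomes.size();
    }

    // Mean score for one evaluator; NaN when that evaluator produced no outcomes.
    static double averageScore(List<Outcome> outcomes, String evaluator) {
        return outcomes.stream()
                .filter(o -> o.evaluator().equals(evaluator))
                .mapToDouble(Outcome::score)
                .average()
                .orElse(Double.NaN);
    }

    // Throws so that a test runner or CI job marks the build as failed.
    static void requirePassRate(List<Outcome> outcomes, double minimum) {
        double rate = passRate(outcomes);
        if (rate < minimum) {
            throw new IllegalStateException(
                    "Pass rate " + rate + " is below the required minimum " + minimum);
        }
    }

    public static void main(String[] args) {
        List<Outcome> outcomes = List.of(
                new Outcome("Correctness", 1.0, true),
                new Outcome("Correctness", 0.5, false),
                new Outcome("Correctness", 1.0, true),
                new Outcome("Correctness", 0.5, true));
        System.out.println("Pass rate: " + passRate(outcomes));
        System.out.println("Correctness avg: " + averageScore(outcomes, "Correctness"));
        requirePassRate(outcomes, 0.9); // throws, because 3 of 4 passed
    }
}
```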
**Extensible**
Build custom evaluators by extending `BaseEvaluator`, or use `LLMJudgeEvaluator` with your own criteria for quick semantic checks.
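The core of most custom metrics is a scoring function plus a threshold. The standalone sketch below scores a response by the fraction of required keywords it contains and passes when the score reaches the threshold. It deliberately does not extend `BaseEvaluator`, because that class's abstract methods are not shown in this README; `KeywordCoverageSketch`, `MetricResult`, and the assumption that a threshold is a minimum passing score are all hypothetical.

```java
import java.util.List;
import java.util.Locale;

// Illustration only: a standalone metric, NOT a Dokimos BaseEvaluator subclass.
class KeywordCoverageSketch {

    // Score in [0, 1] plus the pass/fail verdict derived from the threshold.
    record MetricResult(double score, boolean success) {}

    private final List<String> requiredKeywords;
    private final double threshold;

    KeywordCoverageSketch(List<String> requiredKeywords, double threshold) {
        this.requiredKeywords = List.copyOf(requiredKeywords);
        this.threshold = threshold;
    }

    // Score = fraction of required keywords found (case-insensitive substring match).
    MetricResult evaluate(String actualOutput) {
        String haystack = actualOutput.toLowerCase(Locale.ROOT);
        double score;
        if (requiredKeywords.isEmpty()) {
            score = 1.0; // nothing required, so full coverage
        } else {
            long hits = requiredKeywords.stream()
                    .filter(k -> haystack.contains(k.toLowerCase(Locale.ROOT)))
                    .count();
            score = (double) hits / requiredKeywords.size();
        }
        return new MetricResult(score, score >= threshold);
    }

    public static void main(String[] args) {
        KeywordCoverageSketch metric =
                new KeywordCoverageSketch(List.of("refund", "14 days"), 0.5);
        System.out.println(metric.evaluate("You can request a Refund within 30 days."));
    }
}
```

A deterministic metric like this is cheap to run on every build, while `LLMJudgeEvaluator` suits criteria that cannot be expressed as a simple rule.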
## Modules
| Module                      | Description                                                          |
|-----------------------------|----------------------------------------------------------------------|
| `dokimos-core`              | Core framework with datasets, evaluators, and experiments (required) |
| `dokimos-kotlin`            | Convenient Kotlin DSL for all core building blocks                   |
| `dokimos-junit`             | JUnit integration with `@DatasetSource` for parameterized tests      |
| `dokimos-langchain4j`       | LangChain4j support for evaluating RAG systems and agents            |
| `dokimos-spring-ai`         | Spring AI integration using `ChatClient` and `ChatModel` as judges   |
| `dokimos-spring-ai-alibaba` | Spring AI Alibaba graph-agent integration: capture a run as a trace  |
| `dokimos-koog`              | Koog integration using `AIAgent` as judge                            |
| `dokimos-embabel`           | Embabel agent integration: capture a run as a trace (Java 21+)       |
| `dokimos-server`            | Optional API and web UI for tracking experiments over time           |
| `dokimos-server-client`     | Client library for reporting to the Dokimos server                   |
| `dokimos-mcp-server`        | MCP server exposing evaluation tools to any MCP client               |
## Installation
### Maven
Add the modules you need (check [Maven Central](https://central.sonatype.com/artifact/dev.dokimos/dokimos-core) for the latest version):
```xml
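<dependency>
    <groupId>dev.dokimos</groupId>
    <artifactId>dokimos-core</artifactId>
    <version>${dokimos.version}</version>
</dependency>
<dependency>
    <groupId>dev.dokimos</groupId>
    <artifactId>dokimos-junit</artifactId>
    <version>${dokimos.version}</version>
    <scope>test</scope>
</dependency>
<dependency>
    <groupId>dev.dokimos</groupId>
    <artifactId>dokimos-langchain4j</artifactId>
    <version>${dokimos.version}</version>
</dependency>
<dependency>
    <groupId>dev.dokimos</groupId>
    <artifactId>dokimos-spring-ai</artifactId>
    <version>${dokimos.version}</version>
</dependency>
<dependency>
    <groupId>dev.dokimos</groupId>
    <artifactId>dokimos-spring-ai-alibaba</artifactId>
    <version>${dokimos.version}</version>
</dependency>
<dependency>
    <groupId>dev.dokimos</groupId>
    <artifactId>dokimos-koog</artifactId>
    <version>${dokimos.version}</version>
</dependency>
<dependency>
    <groupId>dev.dokimos</groupId>
    <artifactId>dokimos-embabel</artifactId>
    <version>${dokimos.version}</version>
</dependency>
<dependency>
    <groupId>dev.dokimos</groupId>
    <artifactId>dokimos-kotlin</artifactId>
    <version>${dokimos.version}</version>
</dependency>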
```
### Gradle
```groovy
dependencies {
    // Double quotes are required so that Groovy interpolates $dokimosVersion
    implementation "dev.dokimos:dokimos-core:$dokimosVersion"
    testImplementation "dev.dokimos:dokimos-junit:$dokimosVersion"
    implementation "dev.dokimos:dokimos-langchain4j:$dokimosVersion"
    implementation "dev.dokimos:dokimos-spring-ai:$dokimosVersion"
    implementation "dev.dokimos:dokimos-spring-ai-alibaba:$dokimosVersion"
    implementation "dev.dokimos:dokimos-koog:$dokimosVersion"
    implementation "dev.dokimos:dokimos-embabel:$dokimosVersion" // requires Java 21
    implementation "dev.dokimos:dokimos-kotlin:$dokimosVersion"
}
```
No additional repository configuration is needed.
## Integrations
### JUnit
Use `@DatasetSource` to load test cases and `LLMJudgeEvaluator` with custom criteria:
#### Java
```java
// Create a judge from any LLM client
JudgeLM judgeLM = prompt -> openAiClient.generate(prompt);

@ParameterizedTest
@DatasetSource("classpath:support-tickets.json")
void testSupportResponses(Example example) {
    String response = supportBot.answer(example.input());
    EvalTestCase testCase = example.toTestCase(response);

    Evaluator evaluator = LLMJudgeEvaluator.builder()
        .name("Helpfulness")
        .criteria("Is the response helpful and addresses the customer's issue?")
        .evaluationParams(List.of(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT))
        .judge(judgeLM)
        .threshold(0.7)
        .build();

    Assertions.assertEval(testCase, evaluator);
}
```
#### Kotlin
```kotlin
val judgeLM = JudgeLM { prompt -> openAiClient.generate(prompt) }

class SupportTests {
    @ParameterizedTest
    @DatasetSource("classpath:support-tickets.json")
    fun testSupportResponses(example: Example) {
        val response = supportBot.answer(example.input())
        val testCase = example.toTestCase(response)

        val evaluator = llmJudge(judgeLM) {
            name = "Helpfulness"
            criteria = "Is the response helpful and addresses the customer's issue?"
            params(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT)
            threshold = 0.7
        }

        Assertions.assertEval(testCase, evaluator)
    }
}
```
### LangChain4j
Evaluate RAG pipelines and AI assistants built with LangChain4j:
#### Java
```java
// Create a judge from any LLM client
JudgeLM judgeLM = prompt -> chatLanguageModel.generate(prompt);

Evaluator faithfulness = FaithfulnessEvaluator.builder()
    .judge(judgeLM)
    .contextKey("retrievedContext")
    .threshold(0.8)
    .build();

Experiment.builder()
    .dataset(dataset)
    .task(example -> {
        Result result = assistant.chat(example.input());
        return Map.of(
            "output", result.content(),
            "retrievedContext", result.sources()
        );
    })
    .evaluators(List.of(faithfulness))
    .build()
    .run();
```
#### Kotlin
```kotlin
val judgeLM = JudgeLM { prompt -> chatLanguageModel.generate(prompt) }

val result = experiment {
    dataset(dataset)
    task { example ->
        val result = assistant.chat(example.input())
        mapOf(
            "output" to result.content(),
            "retrievedContext" to result.sources()
        )
    }
    evaluators {
        faithfulness(judgeLM) {
            contextKey = "retrievedContext"
            threshold = 0.8
        }
    }
}.run()
```
### Spring AI
Use Spring AI's `ChatModel` as an evaluation judge:
#### Java
```java
JudgeLM judge = SpringAiSupport.asJudge(chatModel);

Evaluator evaluator = LLMJudgeEvaluator.builder()
    .name("Accuracy")
    .criteria("Is the response factually accurate?")
    .evaluationParams(List.of(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT))
    .judge(judge)
    .threshold(0.8)
    .build();
```
#### Kotlin
```kotlin
val judge = SpringAiSupport.asJudge(chatModel)

val evaluator = llmJudge(judge) {
    name = "Accuracy"
    criteria = "Is the response factually accurate?"
    params(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT)
    threshold = 0.8
}
```
### Koog (Kotlin only)
```kotlin
// Koog agent as judge
val judge = asJudge(aiAgent::run)

val correctness = llmJudge(judge) {
    name = "Correctness"
    criteria = "Is the response correct and concise?"
    params(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT)
    threshold = 0.8
}

val result = experiment {
    name = "Koog QA Evaluation"
    dataset {
        name = "Koog QA"
        example {
            input = "What is 2+2?"
            expected = "4"
        }
    }
    task { example -> mapOf("output" to aiAgent.runBlocking(example.input())) }
    evaluators { evaluator(correctness) }
}.run()

println("Pass rate: ${result.passRate()}")
```
### Spring AI Alibaba
Capture a Spring AI Alibaba graph-agent run as an `AgentTrace` and score its tool calls. Targets the current 1.1.x line (`spring-ai-alibaba-agent-framework`). See the [Spring AI Alibaba integration guide](https://dokimos.dev/integrations/spring-ai-alibaba).
### Embabel (Java 21+)
Capture an Embabel agent run as an `AgentTrace` through an `AgenticEventListener`. Requires Java 21, since Embabel ships Java 21 bytecode. See the [Embabel integration guide](https://dokimos.dev/integrations/embabel).
## Experiment Server
The Dokimos server is an optional component for tracking experiment results over time. It provides a web UI for viewing runs, comparing results, and debugging failures.
```bash
curl -O https://raw.githubusercontent.com/dokimos-dev/dokimos/master/docker-compose.yml
docker compose up -d
```
Open [http://localhost:8080](http://localhost:8080) to view the dashboard.
See the [server documentation](https://dokimos.dev/server/overview) for deployment options.
## Roadmap
- More built-in evaluators: misuse detection
- CLI for running evaluations outside of tests
- Server-side dataset versioning and management
See the [full roadmap](https://dokimos.dev/overview/#whats-next) on the docs site.
## Get Help
- **Questions**: [GitHub Discussions](https://github.com/dokimos-dev/dokimos/discussions)
- **Bugs**: [GitHub Issues](https://github.com/dokimos-dev/dokimos/issues)
- **Contributing**: See [CONTRIBUTING.md](./CONTRIBUTING.md)
## License
MIT License. See [LICENSE](./LICENSE) for details.
---