https://github.com/dokimos-dev/dokimos

LLM and agent evaluation for Java & Kotlin. Runs in JUnit and CI. Spring AI, LangChain4j, Koog.
agent-evaluation agentic-ai evaluation evaluation-framework evaluation-metrics java junit junit-extension koog kotlin langchain4j llm llm-evaluation llm-evaluation-framework llm-evaluation-metrics rag rag-evaluation retrieval-augmented-generation spring-ai spring-ai-evaluation

Last synced: about 1 month ago

Host: GitHub
URL: https://github.com/dokimos-dev/dokimos
Owner: dokimos-dev
License: mit
Created: 2025-12-13T21:20:12.000Z (7 months ago)
Default Branch: master
Last Pushed: 2026-06-02T10:57:57.000Z (about 2 months ago)
Last Synced: 2026-06-02T12:22:01.709Z (about 2 months ago)
Topics: agent-evaluation, agentic-ai, evaluation, evaluation-framework, evaluation-metrics, java, junit, junit-extension, koog, kotlin, langchain4j, llm, llm-evaluation, llm-evaluation-framework, llm-evaluation-metrics, rag, rag-evaluation, retrieval-augmented-generation, spring-ai, spring-ai-evaluation
Language: Java
Homepage: https://dokimos.dev/overview
Size: 2.74 MB
Stars: 36
Watchers: 1
Forks: 3
Open Issues: 4
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Agents: AGENTS.md

Awesome Lists containing this project

awesome-java - Dokimos
fucking-awesome-java - Dokimos - Evaluation framework for LLM and AI-agent applications that scores responses, validates tool calls and execution traces, and catches quality regressions in CI. (Projects / Artificial Intelligence)
awesome-java - Dokimos - Evaluation framework for LLM and AI-agent applications that scores responses, validates tool calls and execution traces, and catches quality regressions in CI. (Projects / Artificial Intelligence)

README

          


  



# Dokimos

The LLM evaluation framework for Java and Kotlin








---

Dokimos is an evaluation framework for LLM applications in Java and Kotlin. It helps you evaluate responses, track quality over time, and catch regressions before they reach production.

It integrates with **JUnit**, **LangChain4j**, **Spring AI**, **Spring AI Alibaba**, **Koog**, and **Embabel** so you can run evaluations as part of your existing test suite and CI/CD pipeline. It evaluates both LLM responses and agent behavior, including tool calls and execution traces.

## Why Dokimos?

- **JUnit integration**: Run evaluations as parameterized tests in your existing test suite.

- **Framework agnostic**: Works with LangChain4j, Spring AI, Spring AI Alibaba, Koog, and Embabel, or any other LLM client, and with any underlying LLM.

- **Built-in evaluators**: Hallucination detection, faithfulness, contextual relevance, LLM-as-a-judge, and more.

- **Agent evaluation**: Evaluate AI agents with tool call validation, task completion, argument hallucination detection, and tool reliability checks.

- **Cost & latency tracking**: Capture per-call tokens, cost, and latency across all five adapters, with a pluggable `PriceTable` seam (you supply the prices) and per-run roll-ups.

- **Custom evaluators**: Build your own metrics by extending `BaseEvaluator` or using `LLMJudgeEvaluator`.

- **Dataset support**: Load test cases from JSON, CSV, or define them programmatically.

- **CI/CD ready**: Runs locally or in any CI/CD environment. Fail builds when quality drops.

- **Kotlin as a first-class citizen**: Compose all tests with a convenient Kotlin DSL.
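To make the cost and latency bullet above more concrete, here is a minimal, self-contained sketch of what a caller-supplied price table and a per-run roll-up could look like. The types `CallUsage`, `ModelPrice`, `SketchPriceTable`, and `CostRollup` are hypothetical and exist only for illustration; they are not the Dokimos `PriceTable` API, whose actual shape is described in the project documentation.

```java
import java.util.List;

// Hypothetical types for illustration only; not the Dokimos API.
record CallUsage(String model, long inputTokens, long outputTokens, long latencyMillis) {}

// Prices in USD per one million tokens, supplied by the caller.
record ModelPrice(double inputPerMillion, double outputPerMillion) {}

// The "seam": the caller decides how a model name maps to a price.
interface SketchPriceTable {
    ModelPrice priceFor(String model);
}

final class CostRollup {
    private CostRollup() {}

    // Cost of one call: token counts scaled by the caller-supplied prices.
    static double costOf(CallUsage call, SketchPriceTable prices) {
        ModelPrice price = prices.priceFor(call.model());
        return call.inputTokens() / 1_000_000.0 * price.inputPerMillion()
                + call.outputTokens() / 1_000_000.0 * price.outputPerMillion();
    }

    // Per-run roll-up: sum of the per-call costs.
    static double totalCost(List<CallUsage> calls, SketchPriceTable prices) {
        return calls.stream().mapToDouble(call -> costOf(call, prices)).sum();
    }

    // Mean latency across the run; 0.0 when no calls were recorded.
    static double averageLatencyMillis(List<CallUsage> calls) {
        return calls.stream().mapToLong(CallUsage::latencyMillis).average().orElse(0.0);
    }
}
```

Keeping prices behind an interface means the roll-up logic never hard-codes vendor pricing, which tends to change more often than the code does.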

## Quick Start

Add the dependency to your `pom.xml` (check [Maven Central](https://central.sonatype.com/artifact/dev.dokimos/dokimos-core) for the latest version):

```xml
<dependency>
    <groupId>dev.dokimos</groupId>
    <artifactId>dokimos-core</artifactId>
    <version>${dokimos.version}</version>
</dependency>
```

### Run a standalone evaluator

Evaluate a single response directly:

#### Java

```java
Evaluator evaluator = ExactMatchEvaluator.builder()
    .name("Exact Match")
    .threshold(1.0)
    .build();

EvalTestCase testCase = EvalTestCase.of("What is 2+2?", "4", "4");
EvalResult result = evaluator.evaluate(testCase);

System.out.println("Passed: " + result.success());  // true
System.out.println("Score: " + result.score());     // 1.0
```

#### Kotlin

```kotlin
val evaluator = exactMatch {
    name = "Exact Match"
    threshold = 1.0
}

val testCase = EvalTestCase.of("What is 2+2?", "4", "4")
val result = evaluator.evaluate(testCase)

println("Passed: ${result.success()}")  // true
println("Score: ${result.score()}")     // 1.0
```

### Write a JUnit test

Use `@DatasetSource` to run evaluations as parameterized tests:

#### Java

```java
JudgeLM judgeLM = prompt -> openAiClient.generate(prompt);

Evaluator correctnessEvaluator = LLMJudgeEvaluator.builder()
    .name("Correctness")
    .criteria("Is the answer correct and complete?")
    .evaluationParams(List.of(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT))
    .judge(judgeLM)
    .build();

@ParameterizedTest
@DatasetSource("classpath:datasets/qa.json")
void testQAResponses(Example example) {
    String response = assistant.chat(example.input());
    EvalTestCase testCase = example.toTestCase(response);
    Assertions.assertEval(testCase, correctnessEvaluator);
}
```

#### Kotlin

```kotlin
val judgeLM = JudgeLM { prompt -> openAiClient.generate(prompt) }

val correctnessEvaluator = llmJudge(judgeLM) {
    name = "Correctness"
    criteria = "Is the answer correct and complete?"
    params(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT)
}

class QaTests {
    @ParameterizedTest
    @DatasetSource("classpath:datasets/qa.json")
    fun testQAResponses(example: Example) {
        val response = assistant.chat(example.input())
        val testCase = example.toTestCase(response)
        Assertions.assertEval(testCase, correctnessEvaluator)
    }
}
```

### Evaluate a dataset in bulk

Run experiments across entire datasets with aggregated metrics:

#### Java

```java
JudgeLM judgeLM = prompt -> openAiClient.generate(prompt);

Evaluator correctnessEvaluator = LLMJudgeEvaluator.builder()
    .name("Correctness")
    .criteria("Is the answer correct?")
    .evaluationParams(List.of(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT))
    .judge(judgeLM)
    .build();

Dataset dataset = Dataset.builder()
    .name("QA Dataset")
    .addExample(Example.of("What is 2+2?", "4"))
    .addExample(Example.of("Capital of France?", "Paris"))
    .build();

ExperimentResult result = Experiment.builder()
    .name("QA Evaluation")
    .dataset(dataset)
    .task(example -> Map.of("output", yourLLM.generate(example.input())))
    .evaluators(List.of(correctnessEvaluator))
    .build()
    .run();

// Check results
System.out.println("Pass rate: " + result.passRate());
System.out.println("Correctness avg: " + result.averageScore("Correctness"));

// Export to multiple formats
result.exportHtml(Path.of("report.html"));
result.exportJson(Path.of("results.json"));
```

#### Kotlin

```kotlin
val judgeLM = JudgeLM { prompt -> openAiClient.generate(prompt) }

val result = experiment {
    name = "QA Evaluation"
    dataset {
        name = "QA Dataset"
        example {
            input = "What is 2+2?"
            expected = "4"
        }
        example {
            input = "Capital of France?"
            expected = "Paris"
        }
    }
    task { example ->
        mapOf("output" to yourLLM.generate(example.input()))
    }
    evaluators {
        llmJudge(judgeLM) {
            name = "Correctness"
            criteria = "Is the answer correct?"
            params(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT)
        }
    }
}.run()

println("Pass rate: ${result.passRate()}")
println("Correctness avg: ${result.averageScore("Correctness")}")

result.exportHtml(Path.of("report.html"))
result.exportJson(Path.of("results.json"))
```

See more patterns in the [dokimos-examples](./dokimos-examples) module.

## Features

**Dataset-driven evaluation**

Load test cases from JSON, CSV, or build them programmatically. Version your datasets alongside your code.

**Built-in evaluators**

Ready-to-use evaluators for hallucination detection, faithfulness, contextual relevance, and LLM-as-a-judge patterns.

**Agent evaluation**

Evaluate AI agents that use tools: validate tool call correctness, check task completion, detect argument hallucinations, and assess tool definition quality.
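As a rough illustration of what tool call validation means, the sketch below scores a trace by the fraction of expected tool calls that the agent actually made with the same name and arguments. `SketchToolCall` and `ToolCallCheck` are hypothetical names introduced here; the Dokimos agent evaluators work on their own trace types and may score differently.

```java
import java.util.List;
import java.util.Map;

// Hypothetical type for illustration only; not the Dokimos trace model.
record SketchToolCall(String name, Map<String, Object> arguments) {}

final class ToolCallCheck {
    private ToolCallCheck() {}

    // Fraction of expected calls found in the actual trace.
    // A call matches only if both the tool name and all arguments are equal.
    // Simplification: order and duplicate counts are ignored.
    static double score(List<SketchToolCall> expected, List<SketchToolCall> actual) {
        if (expected.isEmpty()) {
            return actual.isEmpty() ? 1.0 : 0.0;
        }
        long matched = expected.stream().filter(actual::contains).count();
        return (double) matched / expected.size();
    }

    static boolean passes(List<SketchToolCall> expected, List<SketchToolCall> actual, double threshold) {
        return score(expected, actual) >= threshold;
    }
}
```

A call with the right tool name but a wrong argument value counts as a miss here, which is the simplest way to surface arguments the agent made up.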

**Experiment tracking**

Aggregate results across runs, calculate pass rates, and export to JSON, HTML, Markdown, or CSV.
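The aggregation described above boils down to simple arithmetic over per-case scores. The following sketch, which uses the hypothetical types `ScoredCase` and `QualityGate` rather than the Dokimos `ExperimentResult` class, shows one way a pass rate and a per-evaluator average could be computed and turned into a build-failing check.

```java
import java.util.List;

// Hypothetical type for illustration only; Dokimos has its own result classes.
record ScoredCase(String evaluator, double score, double threshold) {
    boolean passed() {
        return score >= threshold;
    }
}

final class QualityGate {
    private QualityGate() {}

    // Share of cases whose score met its threshold; 0.0 for an empty run.
    static double passRate(List<ScoredCase> cases) {
        if (cases.isEmpty()) {
            return 0.0;
        }
        long passed = cases.stream().filter(ScoredCase::passed).count();
        return (double) passed / cases.size();
    }

    // Mean score for one evaluator; 0.0 if that evaluator produced no scores.
    static double averageScore(List<ScoredCase> cases, String evaluator) {
        return cases.stream()
                .filter(c -> c.evaluator().equals(evaluator))
                .mapToDouble(ScoredCase::score)
                .average()
                .orElse(0.0);
    }

    // Throws so that a test runner or CI job fails when quality drops.
    static void requirePassRate(List<ScoredCase> cases, double minimum) {
        double rate = passRate(cases);
        if (rate < minimum) {
            throw new AssertionError("Pass rate " + rate + " is below the required " + minimum);
        }
    }
}
```

Calling something like `requirePassRate` at the end of a test is enough to make a CI job fail on a regression, since an uncaught `AssertionError` fails the test.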

**Extensible**

Build custom evaluators by extending `BaseEvaluator`, or use `LLMJudgeEvaluator` with your own criteria for quick semantic checks.
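The scoring logic inside a custom evaluator is often just a small deterministic function. As an example of the kind of metric you might wrap, the sketch below computes keyword coverage: the share of required keywords that appear in a response. `KeywordCoverage` is a hypothetical helper, and how it plugs into `BaseEvaluator` is deliberately left out because the base class API is not shown in this README.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; wiring into BaseEvaluator is not shown.
final class KeywordCoverage {
    private KeywordCoverage() {}

    // Returns a score in [0, 1]: the fraction of required keywords that occur
    // in the output, compared case-insensitively as plain substrings.
    static double score(String actualOutput, List<String> requiredKeywords) {
        if (requiredKeywords.isEmpty()) {
            return 1.0;
        }
        String haystack = actualOutput.toLowerCase(Locale.ROOT);
        long hits = requiredKeywords.stream()
                .filter(keyword -> haystack.contains(keyword.toLowerCase(Locale.ROOT)))
                .count();
        return (double) hits / requiredKeywords.size();
    }
}
```

A deterministic metric like this is cheap to run on every build, while an LLM judge is better suited to criteria that cannot be reduced to string checks.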

## Modules

| Module                      | Description                                                          |
|-----------------------------|----------------------------------------------------------------------|
| `dokimos-core`              | Core framework with datasets, evaluators, and experiments (required) |
| `dokimos-kotlin`            | Convenient Kotlin DSL for all core building blocks                   |
| `dokimos-junit`             | JUnit integration with `@DatasetSource` for parameterized tests      |
| `dokimos-langchain4j`       | LangChain4j support for evaluating RAG systems and agents            |
| `dokimos-spring-ai`         | Spring AI integration using `ChatClient` and `ChatModel` as judges   |
| `dokimos-spring-ai-alibaba` | Spring AI Alibaba graph-agent integration: capture a run as a trace  |
| `dokimos-koog`              | Koog integration using `AIAgent` as judge                            |
| `dokimos-embabel`           | Embabel agent integration: capture a run as a trace (Java 21+)       |
| `dokimos-server`            | Optional API and web UI for tracking experiments over time           |
| `dokimos-server-client`     | Client library for reporting to the Dokimos server                   |
| `dokimos-mcp-server`        | MCP server exposing evaluation tools to any MCP client               |

## Installation

### Maven

Add the modules you need (check [Maven Central](https://central.sonatype.com/artifact/dev.dokimos/dokimos-core) for the latest version):

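```xml
<dependencies>
    <!-- Core framework (required) -->
    <dependency>
        <groupId>dev.dokimos</groupId>
        <artifactId>dokimos-core</artifactId>
        <version>${dokimos.version}</version>
    </dependency>

    <!-- JUnit integration -->
    <dependency>
        <groupId>dev.dokimos</groupId>
        <artifactId>dokimos-junit</artifactId>
        <version>${dokimos.version}</version>
        <scope>test</scope>
    </dependency>

    <!-- LangChain4j integration -->
    <dependency>
        <groupId>dev.dokimos</groupId>
        <artifactId>dokimos-langchain4j</artifactId>
        <version>${dokimos.version}</version>
    </dependency>

    <!-- Spring AI integration -->
    <dependency>
        <groupId>dev.dokimos</groupId>
        <artifactId>dokimos-spring-ai</artifactId>
        <version>${dokimos.version}</version>
    </dependency>

    <!-- Spring AI Alibaba integration -->
    <dependency>
        <groupId>dev.dokimos</groupId>
        <artifactId>dokimos-spring-ai-alibaba</artifactId>
        <version>${dokimos.version}</version>
    </dependency>

    <!-- Koog integration -->
    <dependency>
        <groupId>dev.dokimos</groupId>
        <artifactId>dokimos-koog</artifactId>
        <version>${dokimos.version}</version>
    </dependency>

    <!-- Embabel integration (requires Java 21) -->
    <dependency>
        <groupId>dev.dokimos</groupId>
        <artifactId>dokimos-embabel</artifactId>
        <version>${dokimos.version}</version>
    </dependency>

    <!-- Kotlin DSL -->
    <dependency>
        <groupId>dev.dokimos</groupId>
        <artifactId>dokimos-kotlin</artifactId>
        <version>${dokimos.version}</version>
    </dependency>
</dependencies>
```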

### Gradle

```groovy
dependencies {
    // Double quotes are required so that Groovy interpolates $dokimosVersion
    implementation "dev.dokimos:dokimos-core:$dokimosVersion"
    testImplementation "dev.dokimos:dokimos-junit:$dokimosVersion"
    implementation "dev.dokimos:dokimos-langchain4j:$dokimosVersion"
    implementation "dev.dokimos:dokimos-spring-ai:$dokimosVersion"
    implementation "dev.dokimos:dokimos-spring-ai-alibaba:$dokimosVersion"
    implementation "dev.dokimos:dokimos-koog:$dokimosVersion"
    implementation "dev.dokimos:dokimos-embabel:$dokimosVersion" // requires Java 21
    implementation "dev.dokimos:dokimos-kotlin:$dokimosVersion"
}
```

No additional repository configuration is needed; the artifacts are published to Maven Central.

## Integrations

### JUnit

Use `@DatasetSource` to load test cases and `LLMJudgeEvaluator` with custom criteria:

#### Java

```java
// Create a judge from any LLM client
JudgeLM judgeLM = prompt -> openAiClient.generate(prompt);

@ParameterizedTest
@DatasetSource("classpath:support-tickets.json")
void testSupportResponses(Example example) {
    String response = supportBot.answer(example.input());
    EvalTestCase testCase = example.toTestCase(response);

    Evaluator evaluator = LLMJudgeEvaluator.builder()
        .name("Helpfulness")
        .criteria("Is the response helpful and addresses the customer's issue?")
        .evaluationParams(List.of(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT))
        .judge(judgeLM)
        .threshold(0.7)
        .build();

    Assertions.assertEval(testCase, evaluator);
}
```

#### Kotlin

```kotlin
val judgeLM = JudgeLM { prompt -> openAiClient.generate(prompt) }

class SupportTests {
    @ParameterizedTest
    @DatasetSource("classpath:support-tickets.json")
    fun testSupportResponses(example: Example) {
        val response = supportBot.answer(example.input())
        val testCase = example.toTestCase(response)

        val evaluator = llmJudge(judgeLM) {
            name = "Helpfulness"
            criteria = "Is the response helpful and addresses the customer's issue?"
            params(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT)
            threshold = 0.7
        }

        Assertions.assertEval(testCase, evaluator)
    }
}
```

### LangChain4j

Evaluate RAG pipelines and AI assistants built with LangChain4j:

#### Java

```java
// Create a judge from any LLM client
JudgeLM judgeLM = prompt -> chatLanguageModel.generate(prompt);

Evaluator faithfulness = FaithfulnessEvaluator.builder()
    .judge(judgeLM)
    .contextKey("retrievedContext")
    .threshold(0.8)
    .build();

Experiment.builder()
    .dataset(dataset)
    .task(example -> {
        Result result = assistant.chat(example.input());
        return Map.of(
            "output", result.content(),
            "retrievedContext", result.sources()
        );
    })
    .evaluators(List.of(faithfulness))
    .build()
    .run();
```

#### Kotlin

```kotlin
val judgeLM = JudgeLM { prompt -> chatLanguageModel.generate(prompt) }

val result = experiment {
    dataset(dataset)
    task { example ->
        val result = assistant.chat(example.input())
        mapOf(
            "output" to result.content(),
            "retrievedContext" to result.sources()
        )
    }
    evaluators {
        faithfulness(judgeLM) {
            contextKey = "retrievedContext"
            threshold = 0.8
        }
    }
}.run()
```

### Spring AI

Use Spring AI's `ChatModel` as an evaluation judge:

#### Java

```java
JudgeLM judge = SpringAiSupport.asJudge(chatModel);

Evaluator evaluator = LLMJudgeEvaluator.builder()
    .name("Accuracy")
    .criteria("Is the response factually accurate?")
    .evaluationParams(List.of(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT))
    .judge(judge)
    .threshold(0.8)
    .build();
```

#### Kotlin

```kotlin
val judge = SpringAiSupport.asJudge(chatModel)

val evaluator = llmJudge(judge) {
    name = "Accuracy"
    criteria = "Is the response factually accurate?"
    params(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT)
    threshold = 0.8
}
```

### Koog (Kotlin only)

```kotlin
// Koog agent as judge
val judge = asJudge(aiAgent::run)

val correctness = llmJudge(judge) {
    name = "Correctness"
    criteria = "Is the response correct and concise?"
    params(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT)
    threshold = 0.8
}

val result = experiment {
    name = "Koog QA Evaluation"
    dataset {
        name = "Koog QA"
        example {
            input = "What is 2+2?"
            expected = "4"
        }
    }
    task { example -> mapOf("output" to aiAgent.runBlocking(example.input())) }
    evaluators { evaluator(correctness) }
}.run()

println("Pass rate: ${result.passRate()}")
```

### Spring AI Alibaba

Capture a Spring AI Alibaba graph-agent run as an `AgentTrace` and score its tool calls. Targets the current 1.1.x line (`spring-ai-alibaba-agent-framework`). See the [Spring AI Alibaba integration guide](https://dokimos.dev/integrations/spring-ai-alibaba).

### Embabel (Java 21+)

Capture an Embabel agent run as an `AgentTrace` through an `AgenticEventListener`. Requires Java 21, since Embabel ships Java 21 bytecode. See the [Embabel integration guide](https://dokimos.dev/integrations/embabel).

## Experiment Server

The Dokimos server is an optional component for tracking experiment results over time. It provides a web UI for viewing runs, comparing results, and debugging failures.

```bash
curl -O https://raw.githubusercontent.com/dokimos-dev/dokimos/master/docker-compose.yml
docker compose up -d
```

Open [http://localhost:8080](http://localhost:8080) to view the dashboard.

See the [server documentation](https://dokimos.dev/server/overview) for deployment options.

## Roadmap

- More built-in evaluators, such as misuse detection

- CLI for running evaluations outside of tests

- Server-side dataset versioning and management

See the [full roadmap](https://dokimos.dev/overview/#whats-next) on the docs site.

## Get Help

- **Questions**: [GitHub Discussions](https://github.com/dokimos-dev/dokimos/discussions)

- **Bugs**: [GitHub Issues](https://github.com/dokimos-dev/dokimos/issues)

- **Contributing**: See [CONTRIBUTING.md](./CONTRIBUTING.md)

## License

MIT License. See [LICENSE](./LICENSE) for details.

---


