{"id":34627710,"url":"https://github.com/dokimos-dev/dokimos","last_synced_at":"2026-06-09T15:00:48.959Z","repository":{"id":330161705,"uuid":"1115936312","full_name":"dokimos-dev/dokimos","owner":"dokimos-dev","description":"LLM and agent evaluation for Java \u0026 Kotlin. Runs in JUnit and CI. Spring AI, LangChain4j, Koog.","archived":false,"fork":false,"pushed_at":"2026-06-02T10:57:57.000Z","size":2868,"stargazers_count":36,"open_issues_count":4,"forks_count":3,"subscribers_count":1,"default_branch":"master","last_synced_at":"2026-06-02T12:22:01.709Z","etag":null,"topics":["agent-evaluation","agentic-ai","evaluation","evaluation-framework","evaluation-metrics","java","junit","junit-extension","koog","kotlin","langchain4j","llm","llm-evaluation","llm-evaluation-framework","llm-evaluation-metrics","rag","rag-evaluation","retrieval-augmented-generation","spring-ai","spring-ai-evaluation"],"latest_commit_sha":null,"homepage":"https://dokimos.dev/overview","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dokimos-dev.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2025-12-13T21:20:12.000Z","updated_at":"2026-06-02T10:58:02.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/dokimos-dev/dokimos","commit_stats":null,"previous_names":["dokimos-dev/dokimos","dokimos-io/dokimos"],"tags_count":21,"template":false,"template_full_name":null,"purl":"pkg:github/dokimos-d
ev/dokimos","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dokimos-dev%2Fdokimos","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dokimos-dev%2Fdokimos/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dokimos-dev%2Fdokimos/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dokimos-dev%2Fdokimos/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dokimos-dev","download_url":"https://codeload.github.com/dokimos-dev/dokimos/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dokimos-dev%2Fdokimos/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34112225,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-09T02:00:06.510Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agent-evaluation","agentic-ai","evaluation","evaluation-framework","evaluation-metrics","java","junit","junit-extension","koog","kotlin","langchain4j","llm","llm-evaluation","llm-evaluation-framework","llm-evaluation-metrics","rag","rag-evaluation","retrieval-augmented-generation","spring-ai","spring-ai-evaluation"],"created_at":"2025-12-24T16:10:41.496Z","updated_at":"2026-06-09T15:00:48.942Z","avatar_url":"https://github.com/dokimos-dev.png","la
nguage":"Java","funding_links":[],"categories":["人工智能","Projects"],"sub_categories":["Spring Cloud框架","Artificial Intelligence"],"readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/dokimos-dev/dokimos/master/docs/static/img/logo.jpeg\" alt=\"Dokimos Logo\" width=\"150\"\u003e\n\u003c/p\u003e\n\n\u003ch1 align=\"center\"\u003eDokimos\u003c/h1\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cstrong\u003eThe LLM evaluation framework for Java and Kotlin\u003c/strong\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://dokimos.dev/overview\"\u003eDocumentation\u003c/a\u003e •\n  \u003ca href=\"https://dokimos.dev/category/getting-started\"\u003eGetting Started\u003c/a\u003e •\n  \u003ca href=\"./dokimos-examples\"\u003eExamples\u003c/a\u003e •\n  \u003ca href=\"https://github.com/dokimos-dev/dokimos/issues\"\u003eIssues\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://central.sonatype.com/artifact/dev.dokimos/dokimos-core\"\u003e\u003cimg src=\"https://img.shields.io/maven-central/v/dev.dokimos/dokimos-core?label=Maven%20Central\" alt=\"Maven Central\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/dokimos-dev/dokimos/actions\"\u003e\u003cimg src=\"https://img.shields.io/github/actions/workflow/status/dokimos-dev/dokimos/ci.yml?branch=master\" alt=\"Build Status\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://opensource.org/licenses/MIT\"\u003e\u003cimg src=\"https://img.shields.io/badge/License-MIT-blue.svg\" alt=\"License\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://www.oracle.com/java/\"\u003e\u003cimg src=\"https://img.shields.io/badge/Java-17%2B-orange\" alt=\"Java 17+\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n---\n\nDokimos is an evaluation framework for LLM applications in Java and Kotlin. 
It helps you evaluate responses, track quality over time, and catch regressions before they reach production.\n\nIt integrates with **JUnit**, **LangChain4j**, **Spring AI**, **Spring AI Alibaba**, **Koog**, and **Embabel** so you can run evaluations as part of your existing test suite and CI/CD pipeline. It evaluates both LLM responses and agent behavior, including tool calls and execution traces.\n\n## Why Dokimos?\n\n- **JUnit integration**: Run evaluations as parameterized tests in your existing test suite.\n- **Framework-agnostic**: Works with LangChain4j, Spring AI, Spring AI Alibaba, Koog, and Embabel, or any other LLM client, backed by any LLM.\n- **Built-in evaluators**: Hallucination detection, faithfulness, contextual relevance, LLM-as-a-judge, and more.\n- **Agent evaluation**: Evaluate AI agents with tool call validation, task completion, argument hallucination detection, and tool reliability checks.\n- **Cost \u0026 latency tracking**: Capture per-call tokens, cost, and latency across all five adapters, with a pluggable `PriceTable` seam (you supply the prices) and per-run roll-ups.\n- **Custom evaluators**: Build your own metrics by extending `BaseEvaluator` or using `LLMJudgeEvaluator`.\n- **Dataset support**: Load test cases from JSON or CSV, or define them programmatically.\n- **CI/CD ready**: Runs locally or in any CI/CD environment. 
Fail builds when quality drops.\n- **Kotlin as first-class citizen**: Compose all tests with a convenient Kotlin DSL.\n\n## Quick Start\n\nAdd the dependency to your `pom.xml` (check [Maven Central](https://central.sonatype.com/artifact/dev.dokimos/dokimos-core) for the latest version):\n\n```xml\n\u003cdependency\u003e\n    \u003cgroupId\u003edev.dokimos\u003c/groupId\u003e\n    \u003cartifactId\u003edokimos-core\u003c/artifactId\u003e\n    \u003cversion\u003e${dokimos.version}\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\n### Run a standalone evaluator\n\nEvaluate a single response directly:\n\n#### Java\n\n```java\nEvaluator evaluator = ExactMatchEvaluator.builder()\n    .name(\"Exact Match\")\n    .threshold(1.0)\n    .build();\n\nEvalTestCase testCase = EvalTestCase.of(\"What is 2+2?\", \"4\", \"4\");\nEvalResult result = evaluator.evaluate(testCase);\n\nSystem.out.println(\"Passed: \" + result.success());  // true\nSystem.out.println(\"Score: \" + result.score());     // 1.0\n```\n\n#### Kotlin\n\n```kotlin\nval evaluator = exactMatch {\n    name = \"Exact Match\"\n    threshold = 1.0\n}\n\nval testCase = EvalTestCase.of(\"What is 2+2?\", \"4\", \"4\")\nval result = evaluator.evaluate(testCase)\n\nprintln(\"Passed: ${result.success()}\")  // true\nprintln(\"Score: ${result.score()}\")     // 1.0\n```\n\n### Write a JUnit test\n\nUse `@DatasetSource` to run evaluations as parameterized tests:\n\n#### Java\n\n```java\nJudgeLM judgeLM = prompt -\u003e openAiClient.generate(prompt);\n\nEvaluator correctnessEvaluator = LLMJudgeEvaluator.builder()\n    .name(\"Correctness\")\n    .criteria(\"Is the answer correct and complete?\")\n    .evaluationParams(List.of(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT))\n    .judge(judgeLM)\n    .build();\n\n@ParameterizedTest\n@DatasetSource(\"classpath:datasets/qa.json\")\nvoid testQAResponses(Example example) {\n    String response = assistant.chat(example.input());\n    EvalTestCase testCase = 
example.toTestCase(response);\n\n    Assertions.assertEval(testCase, correctnessEvaluator);\n}\n```\n\n#### Kotlin\n\n```kotlin\nval judgeLM = JudgeLM { prompt -\u003e openAiClient.generate(prompt) }\n\nval correctnessEvaluator = llmJudge(judgeLM) {\n    name = \"Correctness\"\n    criteria = \"Is the answer correct and complete?\"\n    params(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT)\n}\n\nclass QaTests {\n    @ParameterizedTest\n    @DatasetSource(\"classpath:datasets/qa.json\")\n    fun testQAResponses(example: Example) {\n        val response = assistant.chat(example.input())\n        val testCase = example.toTestCase(response)\n\n        Assertions.assertEval(testCase, correctnessEvaluator)\n    }\n}\n```\n\n### Evaluate a dataset in bulk\n\nRun experiments across entire datasets with aggregated metrics:\n\n#### Java\n\n```java\nJudgeLM judgeLM = prompt -\u003e openAiClient.generate(prompt);\n\nEvaluator correctnessEvaluator = LLMJudgeEvaluator.builder()\n    .name(\"Correctness\")\n    .criteria(\"Is the answer correct?\")\n    .evaluationParams(List.of(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT))\n    .judge(judgeLM)\n    .build();\n\nDataset dataset = Dataset.builder()\n    .name(\"QA Dataset\")\n    .addExample(Example.of(\"What is 2+2?\", \"4\"))\n    .addExample(Example.of(\"Capital of France?\", \"Paris\"))\n    .build();\n\nExperimentResult result = Experiment.builder()\n    .name(\"QA Evaluation\")\n    .dataset(dataset)\n    .task(example -\u003e Map.of(\"output\", yourLLM.generate(example.input())))\n    .evaluators(List.of(correctnessEvaluator))\n    .build()\n    .run();\n\n// Check results\nSystem.out.println(\"Pass rate: \" + result.passRate());\nSystem.out.println(\"Correctness avg: \" + result.averageScore(\"Correctness\"));\n\n// Export to multiple formats\nresult.exportHtml(Path.of(\"report.html\"));\nresult.exportJson(Path.of(\"results.json\"));\n```\n\n#### Kotlin\n\n```kotlin\nval judgeLM = JudgeLM { prompt 
-\u003e openAiClient.generate(prompt) }\n\nval result = experiment {\n    name = \"QA Evaluation\"\n    dataset {\n        name = \"QA Dataset\"\n        example {\n            input = \"What is 2+2?\"\n            expected = \"4\"\n        }\n        example {\n            input = \"Capital of France?\"\n            expected = \"Paris\"\n        }\n    }\n\n    task { example -\u003e\n        mapOf(\"output\" to yourLLM.generate(example.input()))\n    }\n\n    evaluators {\n        llmJudge(judgeLM) {\n            name = \"Correctness\"\n            criteria = \"Is the answer correct?\"\n            params(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT)\n        }\n    }\n}.run()\n\nprintln(\"Pass rate: ${result.passRate()}\")\nprintln(\"Correctness avg: ${result.averageScore(\"Correctness\")}\")\n\nresult.exportHtml(Path.of(\"report.html\"))\nresult.exportJson(Path.of(\"results.json\"))\n```\n\nSee more patterns in the [dokimos-examples](./dokimos-examples) module.\n\n## Features\n\n**Dataset driven evaluation**\nLoad test cases from JSON, CSV, or build them programmatically. 
Version your datasets alongside your code.\n\n**Built in evaluators**\nReady to use evaluators for hallucination detection, faithfulness, contextual relevance, and LLM as a judge patterns.\n\n**Agent evaluation**\nEvaluate AI agents that use tools: validate tool call correctness, check task completion, detect argument hallucinations, and assess tool definition quality.\n\n**Experiment tracking**\nAggregate results across runs, calculate pass rates, and export to JSON, HTML, Markdown, or CSV.\n\n**Extensible**\nBuild custom evaluators by extending `BaseEvaluator`, or use `LLMJudgeEvaluator` with your own criteria for quick semantic checks.\n\n## Modules\n\n| Module                  | Description                                                          |\n|-------------------------|----------------------------------------------------------------------|\n| `dokimos-core`          | Core framework with datasets, evaluators, and experiments (required) |\n| `dokimos-kotlin`        | Convenient Kotlin DSL for all core building blocks.                  |\n| `dokimos-junit`         | JUnit integration with `@DatasetSource` for parameterized tests      |\n| `dokimos-langchain4j`   | LangChain4j support for evaluating RAG systems and agents            |\n| `dokimos-spring-ai`     | Spring AI integration using `ChatClient` and `ChatModel` as judges   |\n| `dokimos-spring-ai-alibaba` | Spring AI Alibaba graph-agent integration: capture a run as a trace |\n| `dokimos-koog`          | Koog integration using `AIAgent` as judge.                           
|\n| `dokimos-embabel`       | Embabel agent integration: capture a run as a trace (Java 21+)       |\n| `dokimos-server`        | Optional API and web UI for tracking experiments over time           |\n| `dokimos-server-client` | Client library for reporting to the Dokimos server                   |\n| `dokimos-mcp-server`    | MCP server exposing evaluation tools to any MCP client               |\n\n## Installation\n\n### Maven\n\nAdd the modules you need (check [Maven Central](https://central.sonatype.com/artifact/dev.dokimos/dokimos-core) for the latest version):\n\n```xml\n\u003cdependencies\u003e\n    \u003c!-- Core framework (required) --\u003e\n    \u003cdependency\u003e\n        \u003cgroupId\u003edev.dokimos\u003c/groupId\u003e\n        \u003cartifactId\u003edokimos-core\u003c/artifactId\u003e\n        \u003cversion\u003e${dokimos.version}\u003c/version\u003e\n    \u003c/dependency\u003e\n\n    \u003c!-- JUnit integration --\u003e\n    \u003cdependency\u003e\n        \u003cgroupId\u003edev.dokimos\u003c/groupId\u003e\n        \u003cartifactId\u003edokimos-junit\u003c/artifactId\u003e\n        \u003cversion\u003e${dokimos.version}\u003c/version\u003e\n        \u003cscope\u003etest\u003c/scope\u003e\n    \u003c/dependency\u003e\n\n    \u003c!-- LangChain4j integration --\u003e\n    \u003cdependency\u003e\n        \u003cgroupId\u003edev.dokimos\u003c/groupId\u003e\n        \u003cartifactId\u003edokimos-langchain4j\u003c/artifactId\u003e\n        \u003cversion\u003e${dokimos.version}\u003c/version\u003e\n    \u003c/dependency\u003e\n\n    \u003c!-- Spring AI integration --\u003e\n    \u003cdependency\u003e\n        \u003cgroupId\u003edev.dokimos\u003c/groupId\u003e\n        \u003cartifactId\u003edokimos-spring-ai\u003c/artifactId\u003e\n        \u003cversion\u003e${dokimos.version}\u003c/version\u003e\n    \u003c/dependency\u003e\n\n    \u003c!-- Spring AI Alibaba integration --\u003e\n    \u003cdependency\u003e\n        
\u003cgroupId\u003edev.dokimos\u003c/groupId\u003e\n        \u003cartifactId\u003edokimos-spring-ai-alibaba\u003c/artifactId\u003e\n        \u003cversion\u003e${dokimos.version}\u003c/version\u003e\n    \u003c/dependency\u003e\n\n    \u003c!-- Koog integration --\u003e\n    \u003cdependency\u003e\n        \u003cgroupId\u003edev.dokimos\u003c/groupId\u003e\n        \u003cartifactId\u003edokimos-koog\u003c/artifactId\u003e\n        \u003cversion\u003e${dokimos.version}\u003c/version\u003e\n    \u003c/dependency\u003e\n\n    \u003c!-- Embabel integration (requires Java 21) --\u003e\n    \u003cdependency\u003e\n        \u003cgroupId\u003edev.dokimos\u003c/groupId\u003e\n        \u003cartifactId\u003edokimos-embabel\u003c/artifactId\u003e\n        \u003cversion\u003e${dokimos.version}\u003c/version\u003e\n    \u003c/dependency\u003e\n\n    \u003c!-- Kotlin integration, applicable to all modules --\u003e\n    \u003cdependency\u003e\n        \u003cgroupId\u003edev.dokimos\u003c/groupId\u003e\n        \u003cartifactId\u003edokimos-kotlin\u003c/artifactId\u003e\n        \u003cversion\u003e${dokimos.version}\u003c/version\u003e\n    \u003c/dependency\u003e\n\n\u003c/dependencies\u003e\n```\n\n\u003cdetails\u003e\n\u003csummary\u003eGradle\u003c/summary\u003e\n\n```groovy\ndependencies {\n    // Double quotes are required so that Groovy interpolates $dokimosVersion\n    implementation \"dev.dokimos:dokimos-core:$dokimosVersion\"\n    testImplementation \"dev.dokimos:dokimos-junit:$dokimosVersion\"\n    implementation \"dev.dokimos:dokimos-langchain4j:$dokimosVersion\"\n    implementation \"dev.dokimos:dokimos-spring-ai:$dokimosVersion\"\n    implementation \"dev.dokimos:dokimos-spring-ai-alibaba:$dokimosVersion\"\n    implementation \"dev.dokimos:dokimos-koog:$dokimosVersion\"\n    implementation \"dev.dokimos:dokimos-embabel:$dokimosVersion\" // requires Java 21\n    implementation \"dev.dokimos:dokimos-kotlin:$dokimosVersion\"\n}\n```\n\n\u003c/details\u003e\n\nNo additional repository configuration is needed.\n\n## Integrations\n\n### JUnit\n\nUse `@DatasetSource` to 
load test cases and `LLMJudgeEvaluator` with custom criteria:\n\n#### Java\n\n```java\n// Create a judge from any LLM client\nJudgeLM judgeLM = prompt -\u003e openAiClient.generate(prompt);\n\n@ParameterizedTest\n@DatasetSource(\"classpath:support-tickets.json\")\nvoid testSupportResponses(Example example) {\n    String response = supportBot.answer(example.input());\n    EvalTestCase testCase = example.toTestCase(response);\n\n    Evaluator evaluator = LLMJudgeEvaluator.builder()\n        .name(\"Helpfulness\")\n        .criteria(\"Is the response helpful and addresses the customer's issue?\")\n        .evaluationParams(List.of(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT))\n        .judge(judgeLM)\n        .threshold(0.7)\n        .build();\n\n    Assertions.assertEval(testCase, evaluator);\n}\n```\n\n#### Kotlin\n\n```kotlin\nval judgeLM = JudgeLM { prompt -\u003e openAiClient.generate(prompt) }\n\nclass SupportTests {\n    @ParameterizedTest\n    @DatasetSource(\"classpath:support-tickets.json\")\n    fun testSupportResponses(example: Example) {\n        val response = supportBot.answer(example.input())\n        val testCase = example.toTestCase(response)\n\n        val evaluator = llmJudge(judgeLM) {\n            name = \"Helpfulness\"\n            criteria = \"Is the response helpful and addresses the customer's issue?\"\n            params(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT)\n            threshold = 0.7\n        }\n\n        Assertions.assertEval(testCase, evaluator)\n    }\n}\n```\n\n### LangChain4j\n\nEvaluate RAG pipelines and AI assistants built with LangChain4j:\n\n#### Java\n\n```java\n// Create a judge from any LLM client\nJudgeLM judgeLM = prompt -\u003e chatLanguageModel.generate(prompt);\n\nEvaluator faithfulness = FaithfulnessEvaluator.builder()\n    .judge(judgeLM)\n    .contextKey(\"retrievedContext\")\n    .threshold(0.8)\n    .build();\n\nExperiment.builder()\n    .dataset(dataset)\n    .task(example -\u003e 
{\n        Result\u003cString\u003e result = assistant.chat(example.input());\n        return Map.of(\n            \"output\", result.content(),\n            \"retrievedContext\", result.sources()\n        );\n    })\n    .evaluators(List.of(faithfulness))\n    .build()\n    .run();\n```\n\n#### Kotlin\n\n```kotlin\nval judgeLM = JudgeLM { prompt -\u003e chatLanguageModel.generate(prompt) }\n\nval result = experiment {\n    dataset(dataset)\n    task { example -\u003e\n        val result = assistant.chat(example.input())\n        mapOf(\n            \"output\" to result.content(),\n            \"retrievedContext\" to result.sources()\n        )\n    }\n    evaluators {\n        faithfulness(judgeLM) {\n            contextKey = \"retrievedContext\"\n            threshold = 0.8\n        }\n    }\n}.run()\n```\n\n### Spring AI\n\nUse Spring AI's `ChatModel` as an evaluation judge:\n\n#### Java\n\n```java\nJudgeLM judge = SpringAiSupport.asJudge(chatModel);\n \nEvaluator evaluator = LLMJudgeEvaluator.builder()\n    .name(\"Accuracy\")\n    .criteria(\"Is the response factually accurate?\")\n    .evaluationParams(List.of(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT))\n    .judge(judge)\n    .threshold(0.8)\n    .build();\n```\n\n#### Kotlin\n\n```kotlin\nval judge = SpringAiSupport.asJudge(chatModel)\n\nval evaluator = llmJudge(judge) {\n    name = \"Accuracy\"\n    criteria = \"Is the response factually accurate?\"\n    params(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT)\n    threshold = 0.8\n}\n```\n\n### Koog (Kotlin only)\n\n```kotlin\n// Koog agent as judge\nval judge = asJudge(aiAgent::run)\n\nval correctness = llmJudge(judge) {\n    name = \"Correctness\"\n    criteria = \"Is the response correct and concise?\"\n    params(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT)\n    threshold = 0.8\n}\n\nval result = experiment {\n    name = \"Koog QA Evaluation\"\n    dataset {\n        name = \"Koog QA\"\n        example {\n           
 input = \"What is 2+2?\"\n            expected = \"4\"\n        }\n    }\n    task { example -\u003e mapOf(\"output\" to aiAgent.runBlocking(example.input())) }\n    evaluators { evaluator(correctness) }\n}.run()\n\nprintln(\"Pass rate: ${result.passRate()}\")\n```\n\n### Spring AI Alibaba\n\nCapture a Spring AI Alibaba graph-agent run as an `AgentTrace` and score its tool calls. Targets the current 1.1.x line (`spring-ai-alibaba-agent-framework`). See the [Spring AI Alibaba integration guide](https://dokimos.dev/integrations/spring-ai-alibaba).\n\n### Embabel (Java 21+)\n\nCapture an Embabel agent run as an `AgentTrace` through an `AgenticEventListener`. Requires Java 21, since Embabel ships Java 21 bytecode. See the [Embabel integration guide](https://dokimos.dev/integrations/embabel).\n\n## Experiment Server\n\nThe Dokimos server is an optional component for tracking experiment results over time. It provides a web UI for viewing runs, comparing results, and debugging failures.\n\n```bash\ncurl -O https://raw.githubusercontent.com/dokimos-dev/dokimos/master/docker-compose.yml\ndocker compose up -d\n```\n\nOpen [http://localhost:8080](http://localhost:8080) to view the dashboard.\n\nSee the [server documentation](https://dokimos.dev/server/overview) for deployment options.\n\n## Roadmap\n\n- More built in evaluators: misuse detection\n- CLI for running evaluations outside of tests\n- Server-side Dataset versioning and management\n\nSee the [full roadmap](https://dokimos.dev/overview/#whats-next) on the docs site.\n\n## Get Help\n\n- **Questions**: [GitHub Discussions](https://github.com/dokimos-dev/dokimos/discussions)\n- **Bugs**: [GitHub Issues](https://github.com/dokimos-dev/dokimos/issues)\n- **Contributing**: See [CONTRIBUTING.md](./CONTRIBUTING.md)\n\n## License\n\nMIT License. 
See [LICENSE](./LICENSE) for details.\n\n---\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://dokimos.dev/overview\"\u003eDocumentation\u003c/a\u003e •\n  \u003ca href=\"https://github.com/dokimos-dev/dokimos\"\u003eGitHub\u003c/a\u003e\n\u003c/p\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdokimos-dev%2Fdokimos","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdokimos-dev%2Fdokimos","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdokimos-dev%2Fdokimos/lists"}