{"id":49843313,"url":"https://github.com/dynatrace-oss/dt-evals","last_synced_at":"2026-05-14T08:02:11.400Z","repository":{"id":357652118,"uuid":"1198248710","full_name":"dynatrace-oss/dt-evals","owner":"dynatrace-oss","description":"AI evaluators CLI for your AI apps and Agents - Dynatrace AI Observability","archived":false,"fork":false,"pushed_at":"2026-05-13T18:53:54.000Z","size":988,"stargazers_count":5,"open_issues_count":31,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-13T19:10:17.423Z","etag":null,"topics":["agents","ai","evals","evaluations","llm-as-judge","observability"],"latest_commit_sha":null,"homepage":"","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dynatrace-oss.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":"CODEOWNERS","security":null,"support":"SUPPORT.md","governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-01T08:47:16.000Z","updated_at":"2026-05-13T10:08:33.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/dynatrace-oss/dt-evals","commit_stats":null,"previous_names":["dynatrace-oss/dt-evals"],"tags_count":17,"template":false,"template_full_name":"dynatrace-oss/template-project","purl":"pkg:github/dynatrace-oss/dt-evals","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dynatrace-oss%2Fdt-evals","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dynatrace-oss%2Fdt-evals/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/G
itHub/repositories/dynatrace-oss%2Fdt-evals/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dynatrace-oss%2Fdt-evals/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dynatrace-oss","download_url":"https://codeload.github.com/dynatrace-oss/dt-evals/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dynatrace-oss%2Fdt-evals/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33015817,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-13T13:14:54.681Z","status":"online","status_checked_at":"2026-05-14T02:00:06.663Z","response_time":57,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agents","ai","evals","evaluations","llm-as-judge","observability"],"created_at":"2026-05-14T08:02:10.240Z","updated_at":"2026-05-14T08:02:11.388Z","avatar_url":"https://github.com/dynatrace-oss.png","language":"TypeScript","funding_links":[],"categories":["Observability"],"sub_categories":["Streaming Operations"],"readme":"# dt-evals\n\n[![npm version](https://img.shields.io/npm/v/@dynatrace-oss/dt-evals/alpha?style=flat-square\u0026label=npm\u0026color=cb3837)](https://www.npmjs.com/package/@dynatrace-oss/dt-evals)\n[![npm 
downloads](https://img.shields.io/npm/dm/@dynatrace-oss/dt-evals?style=flat-square\u0026color=cb3837)](https://www.npmjs.com/package/@dynatrace-oss/dt-evals)\n[![Build](https://github.com/dynatrace-oss/dt-evals/actions/workflows/ci-cli.yml/badge.svg?branch=main)](https://github.com/dynatrace-oss/dt-evals/actions/workflows/ci-cli.yml)\n[![License](https://img.shields.io/badge/license-Apache_2.0-blue?style=flat-square)](LICENSE)\n[![Node](https://img.shields.io/node/v/@dynatrace-oss/dt-evals/alpha?style=flat-square)](package.json)\n[![Lib on npm](https://img.shields.io/npm/v/@dynatrace-oss/dt-eval-lib/alpha?style=flat-square\u0026label=lib\u0026color=cb3837)](https://www.npmjs.com/package/@dynatrace-oss/dt-eval-lib)\n\nEnd-to-end LLM evaluation toolkit for Dynatrace AI Observability.\n\n`dt-evals` is the main interface. It pulls live `gen_ai.*` spans from your Dynatrace environment, masks sensitive data in memory, scores real production interactions with an LLM judge, and writes structured evaluation results back to Dynatrace as business events — keeping evals, traces, metrics, alerts, and dashboards in one place.\n\n![dt-evals welcome](assets/dt-evals-welcome.gif)\n\n## Packages\n\n| Package | Description |\n|---------|-------------|\n| [`dt-evals`](dt-eval-cli) | CLI — configure, run, schedule, inspect, and deploy evals |\n| [`dt-eval-lib`](dt-eval-lib) | TypeScript library — run judge-based evals in code, tests, and CI |\n| [`dt-eval-deploy`](dt-eval-deploy) | Deployment resources — Docker image and serverless runners |\n\n\u003e **Early Development**: This project is in active development. If you encounter any bugs or issues, please [file a GitHub issue](https://github.com/dynatrace-oss/dt-evals/issues/new). 
Contributions and feedback are welcome!\n\n## Requirements\n\n- Node.js `\u003e=20`\n- A Dynatrace environment with GenAI spans (`gen_ai.*` OTEL attributes)\n- [`dtctl`](https://docs.dynatrace.com/docs/deliver/dynatrace-cli) installed for first-time setup (OAuth token generation)\n- Credentials for your judge provider (OpenAI, Anthropic, Google, AWS Bedrock, or Azure OpenAI)\n\n## Install\n\n```bash\nnpm install -g @dynatrace-oss/dt-evals\n```\n\nOr run without installing:\n\n```bash\nnpx @dynatrace-oss/dt-evals \u003ccommand\u003e\n```\n\n## Quick Start\n\n```bash\n# 1. Run the doctor — authenticates via dtctl (browser OAuth), checks permissions,\n#    generates a platform token, and writes it to your .env\ndt-evals doctor\n\n# 2. Configure your service and judge provider\ndt-evals configure\n\n# 3. Run evals on the last hour of traces\ndt-evals run --since 1h --sample 10\n```\n\n---\n\n## CLI Reference\n\n### `doctor`\n\nDiagnose your environment end-to-end. Uses `dtctl` for browser-based OAuth, checks all required Dynatrace permissions, generates a scoped platform token, and writes it to your `.env`. 
Run this once on first setup or whenever something breaks.\n\n```bash\n# Full interactive check (recommended for first-time setup)\ndt-evals doctor\n\n# Generate a platform token only (skips the full health check)\ndt-evals doctor create-token\n\n# Use an existing dtctl context\ndt-evals doctor --context my-env\ndt-evals doctor create-token --context my-env\n\n# Point at a specific environment URL\ndt-evals doctor --env-url https://abc12345.apps.dynatrace.com\n\n# Skip token generation (if you already have DT_API_TOKEN set)\ndt-evals doctor --skip-token\n\n# Config, provider, and run history only — no dtctl auth required\ndt-evals doctor --skip-auth\n```\n\n**What it checks:**\n\n| Section | Checks |\n|---------|--------|\n| Dependencies | Node.js ≥18, dtctl installed and version |\n| Authentication | dtctl context selection or creation, browser OAuth flow |\n| Permissions | DQL read, bizevent write, metrics ingest, GenAI span count (last 24h) |\n| Platform Token | Creates a scoped API token, writes `DT_API_TOKEN` and `DT_ENV_URL` to `.env` |\n| AI Provider | API key presence and provider reachability |\n| Config \u0026 Runs | Config schema validation, last run status, failure rate over 7 days |\n\nEach section produces a pass/warn/fail result with actionable steps for anything that needs attention.\n\n---\n\n### `configure`\n\nSet up Dynatrace and judge provider credentials. 
Writes to `.dt-eval.yaml` in the current directory or `~/.dt-eval/config.yaml` globally.\n\n```bash\n# Interactive wizard\ndt-evals configure\n\n# Non-interactive\ndt-evals configure \\\n  --env-url https://your-env.live.dynatrace.com \\\n  --api-token \"$DT_API_TOKEN\" \\\n  --provider openai \\\n  --api-key \"$OPENAI_API_KEY\" \\\n  --model gpt-4.1\n\n# Show resolved config with secrets redacted\ndt-evals configure --show\n```\n\n---\n\n### `validate`\n\nCheck config schema, Dynatrace connectivity, and judge provider reachability before running.\n\n```bash\ndt-evals validate\n```\n\n---\n\n### `run`\n\nEvaluate recent GenAI traces from Dynatrace.\n\n```bash\n# Run all enabled evaluators over the last 2 hours, 20% sample\ndt-evals run --since 2h --sample 20\n\n# Run a single evaluator\ndt-evals run --since 6h --metric faithfulness\n\n# Preview what would run — no judge calls, no result writes\ndt-evals run --since 1h --sample 5 --dry-run\n\n# CI mode — JSON output, exit 1 on threshold breach\ndt-evals run --since 6h --metric relevance --ci\n\n# Parallel workers for faster throughput\ndt-evals run --since 2h --sample 20 --concurrency 8 --debug\n```\n\n**Flags:**\n\n| Flag | Description |\n|------|-------------|\n| `--since \u003cduration\u003e` | Trace lookback window, e.g. `1h`, `6h`, `24h` |\n| `--sample \u003cpercent\u003e` | Override sampling: percentage of traces to evaluate (0–100). 
When omitted, uses the strategy from your config file (default: random 5%) |\n| `--metric \u003cname\u003e` | Run only one evaluator |\n| `--dry-run` | Fetch and transform traces, skip judge calls and writes |\n| `--ci` | JSON result output and exit code `1` on threshold breach |\n| `--concurrency \u003cn\u003e` | Number of parallel evaluation workers |\n| `--debug` | Per-step timing logs |\n| `--config \u003cpath\u003e` | Path to a specific config file |\n\n**GitHub Actions example:**\n\n```yaml\n- name: Run LLM eval gate\n  run: npx @dynatrace-oss/dt-evals run --since 6h --metric faithfulness --ci\n  env:\n    DT_ENV_URL: ${{ secrets.DT_ENV_URL }}\n    DT_API_TOKEN: ${{ secrets.DT_API_TOKEN }}\n    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}\n```\n\n---\n\n### `evaluators`\n\nInspect, test, and manage built-in and custom evaluators.\n\n```bash\n# List all available evaluators\ndt-evals evaluators list\n\n# Show details for one evaluator (prompt, required fields, scoring scale)\ndt-evals evaluators show faithfulness\n\n# Send a test trace through the judge for an evaluator\ndt-evals evaluators test relevance\n\n# Add a custom evaluator interactively\ndt-evals evaluators add\n\n# Remove a custom evaluator\ndt-evals evaluators delete my-custom-eval\n```\n\n---\n\n### `runs`\n\nView and export local run history from `~/.dt-eval/runs.json`.\n\n```bash\n# List recent runs\ndt-evals runs list --limit 20\n\n# Inspect a single run in detail\ndt-evals runs show run-2026-04-10T12-00-00-ab12cd34\n\n# Export run history\ndt-evals runs export --format csv --output runs.csv\ndt-evals runs export --format json --output runs.json\n```\n\n---\n\n### `schedule`\n\nConfigure recurring evaluation runs stored in `~/.dt-eval/schedules.json`.\n\n```bash\n# Create a schedule\ndt-evals schedule add --name hourly-rag --cron \"0 * * * *\" --since 1h --sample 10\n\n# List schedules\ndt-evals schedule list\n\n# Trigger a schedule immediately\ndt-evals schedule run 
\u003cschedule-id\u003e\n\n# Pause or resume\ndt-evals schedule disable \u003cschedule-id\u003e\ndt-evals schedule enable \u003cschedule-id\u003e\n\n# Remove\ndt-evals schedule delete \u003cschedule-id\u003e\n```\n\n---\n\n### `status`\n\nShow resolved config, connectivity state, and last run summary.\n\n```bash\ndt-evals status\n```\n\n---\n\n### `deploy`\n\nPackage and deploy the eval runner as a serverless function for continuous scheduled evaluation.\n\n```bash\ndt-evals deploy --provider aws      # AWS Lambda\ndt-evals deploy --provider gcp      # Google Cloud Run\ndt-evals deploy --provider azure    # Azure Functions\ndt-evals deploy --teardown          # Destroy deployed resources\n```\n\nSee [`dt-eval-deploy`](dt-eval-deploy) for Docker-based deployment.\n\n---\n\n## Required Dynatrace Permissions\n\n### dt-evals CLI\n\nThe platform token (or OAuth scope) used by the CLI needs the following permissions:\n\n| Scope | Required for | Notes |\n|-------|-------------|-------|\n| `storage:spans:read` | `dt-evals run` | Fetches GenAI OTel spans via DQL (`fetch spans`) |\n| `storage:events:read` | `dt-evals run` with drift | Reads historical evaluation results for drift baseline (`fetch bizevents`) |\n| `storage:events:write` | `dt-evals run` | Writes evaluation results back as business events |\n| `metrics:ingest` | Optional | Writes evaluation metrics to Dynatrace metrics API |\n\nRun `dt-evals doctor create-token` to generate a token with exactly these scopes via OAuth.\n\n**Manually create a token** in Dynatrace → Settings → Access Tokens with the scopes above, then set:\n\n```bash\nDT_ENV_URL=https://your-env.apps.dynatrace.com\nDT_API_TOKEN=dt0c01.xxxxx\n```\n\n### dt-ai-ingest (Python library)\n\n| Scope | Required for |\n|-------|-------------|\n| `storage:events:write` | Sending evaluation results as business events |\n| `openTelemetryTrace.ingest` | Exporting OTel traces from MLflow / Langfuse |\n\n---\n\n## Built-in Evaluators\n\n13 built-in LLM judge 
evaluators plus statistical drift detection.\n\n| Evaluator | Measures |\n|-----------|----------|\n| `toxicity` | Harmful, offensive, or unsafe output |\n| `faithfulness` | Answer grounded in provided context |\n| `hallucination` | Unsupported or fabricated claims |\n| `relevance` | Answer addresses the user request |\n| `coherence` | Structure, clarity, and logical flow |\n| `factual-accuracy` | Accuracy against a reference answer |\n| `answer-completeness` | All parts of the request answered |\n| `context-relevance` | Retrieval quality for supplied context |\n| `pii-leakage` | PII present in the output |\n| `prompt-injection` | Injection attempts in the input |\n| `bias` | Harmful bias or unfair framing |\n| `summarization-quality` | Summary faithfulness, coverage, conciseness |\n| `conciseness` | Avoids filler and unnecessary padding |\n| `drift` | Score regression against a 7-day baseline |\n\n---\n\n## Supported Providers\n\n| Provider | Default model | Credentials (env vars) |\n|----------|--------------|-------|\n| `openai` | `gpt-5.4` | `OPENAI_API_KEY` |\n| `anthropic` | `claude-sonnet-4-7` | `ANTHROPIC_API_KEY` |\n| `vertex` | `gemini-3-pro` | `GOOGLE_API_KEY` |\n| `gemini` | `gemini-3.1-flash-live` | `GOOGLE_API_KEY` |\n| `bedrock` | `anthropic.claude-opus-4-7` | `AWS_ACCESS_KEY_ID` + `AWS_SECRET_ACCESS_KEY` |\n| `azure-openai` | user-provided deployment name | `AZURE_OPENAI_API_KEY` + `AZURE_OPENAI_ENDPOINT` + `AZURE_OPENAI_API_VERSION` |\n\nOverride the model with `--model \u003cid\u003e` or set `judge.model` in config.\n\n---\n\n## Configuration\n\nConfig resolves in this order: environment variables → project `.dt-eval.yaml` → global `~/.dt-eval/config.yaml` → built-in defaults.\n\n```yaml\nschemaVersion: 1\nname: travel-assistant-prod\n\ndynatrace:\n  environmentUrl: https://your-env.live.dynatrace.com\n  apiToken: dt0c01.xxxxx\n\njudge:\n  provider: openai\n  model: gpt-4.1\n  timeout: 30000\n  maxRetries: 2\n\nscope:\n  service: travel-assistant\n  since: 1h\n  # 
sampling is optional — defaults to random 5% when omitted\n  sampling:\n    strategy: random\n    percent: 10\n\nmetrics:\n  enabled:\n    - faithfulness\n    - hallucination\n    - relevance\n    - drift\n\nalerts:\n  thresholds:\n    faithfulness: 0.7\n    relevance: 0.7\n```\n\n**Bedrock example:**\n\n```yaml\njudge:\n  provider: bedrock\n  model: us.anthropic.claude-3-5-haiku-20241022-v1:0\n  region: us-east-1\n  # or use AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY env vars\n  apiKey: \u003cAWS_ACCESS_KEY_ID\u003e\n  secretKey: \u003cAWS_SECRET_ACCESS_KEY\u003e\n```\n\n**Azure OpenAI example:**\n\n```yaml\njudge:\n  provider: azure-openai\n  model: my-gpt4-deployment\n  baseUrl: https://my-resource.openai.azure.com/\n  apiVersion: 2025-04-01-preview\n  # or use AZURE_OPENAI_API_KEY + AZURE_OPENAI_ENDPOINT + AZURE_OPENAI_API_VERSION env vars\n```\n\nKey environment variables:\n\n```bash\nDT_ENV_URL=https://your-env.live.dynatrace.com\nDT_API_TOKEN=dt0c01.xxxxx\n\nJUDGE_PROVIDER=openai\nJUDGE_MODEL=gpt-4.1\n\n# OpenAI\nOPENAI_API_KEY=sk-...\n# Anthropic\nANTHROPIC_API_KEY=sk-ant-...\n# Google (Vertex / Gemini)\nGOOGLE_API_KEY=...\n# AWS Bedrock\nAWS_ACCESS_KEY_ID=...\nAWS_SECRET_ACCESS_KEY=...\nAWS_REGION=us-east-1\n# Azure OpenAI\nAZURE_OPENAI_API_KEY=...\nAZURE_OPENAI_ENDPOINT=https://my-resource.openai.azure.com/\nAZURE_OPENAI_API_VERSION=2025-04-01-preview\n```\n\n---\n\n## Results in Dynatrace\n\nEvaluation results land as business events with `event.type == \"gen_ai.evaluation.result\"`, correlating to the original trace.\n\n```dql\nfetch bizevents\n| filter event.type == \"gen_ai.evaluation.result\"\n| summarize avg_score = avg(gen_ai.evaluation.score.value), by: { gen_ai.evaluation.name }\n| sort avg_score asc\n```\n\n---\n\n## Development\n\n```bash\n# Install all workspace dependencies\nnpm install\n\n# Test dt-eval-lib\nmake test-lib\n\n# Build dt-eval-lib\nmake build-lib\n\n# Build the Go engine\nmake build-engine\n\n# Lint all Markdown\nmake 
markdownlint\n```\n\nRun the CLI locally without a build:\n\n```bash\ncd dt-eval-cli\nnpm run dev -- configure\nnpm run dev -- run --since 1h --dry-run\n```\n\n---\n\n## License\n\nApache License 2.0 — see [LICENSE](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdynatrace-oss%2Fdt-evals","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdynatrace-oss%2Fdt-evals","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdynatrace-oss%2Fdt-evals/lists"}