An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with evals

A curated list of projects in awesome lists tagged with evals.

https://github.com/mastra-ai/mastra

The TypeScript AI agent framework. ⚡ Assistants, RAG, observability. Supports any LLM: GPT-4, Claude, Gemini, Llama.

agents ai chatbots evals javascript llm mcp nextjs nodejs reactjs tts typescript workflows

Last synced: 14 May 2025

https://github.com/agentops-ai/agentops

Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks, including OpenAI Agents SDK, CrewAI, LangChain, AutoGen, AG2, and CamelAI.

agent agentops agents-sdk ai anthropic autogen cost-estimation crewai evals evaluation-metrics groq langchain llm mistral ollama openai openai-agents

Last synced: 17 Nov 2025

https://github.com/kiln-ai/kiln

The easiest tool for fine-tuning LLM models, synthetic data generation, and collaborating on datasets.

ai chain-of-thought collaboration dataset-generation evals evaluation fine-tuning machine-learning macos ml ollama openai prompt prompt-engineering python rlhf synthetic-data windows

Last synced: 23 Apr 2025

https://github.com/lmnr-ai/lmnr

Laminar - open-source all-in-one platform for engineering AI products. Create a data flywheel for your AI app. Traces, Evals, Datasets, Labels. YC S24.

agents ai ai-observability aiops analytics developer-tools evals evaluation llm-evaluation llm-observability llm-workflow llmops monitoring observability open-source pipeline-builder rag rust-lang self-hosted

Last synced: 14 Apr 2026

https://github.com/harbor-framework/harbor

Harbor is a framework for running agent evaluations and creating and using RL environments.

evals rl-environments terminal-bench

Last synced: 30 Apr 2026

https://github.com/mattpocock/evalite

Test your LLM-powered apps with TypeScript. No API key required.

ai evals typescript

Last synced: 14 May 2025

https://github.com/laude-institute/harbor

Harbor is a framework for running agent evaluations and creating and using RL environments.

evals rl-environments terminal-bench

Last synced: 01 Feb 2026

https://github.com/agentevals-dev/agentevals

agentevals is a framework-agnostic evaluations solution based on OpenTelemetry traces

agentevals agents evals

Last synced: 26 May 2026

https://github.com/voratiq/voratiq

Agent ensembles to design, generate, and select the best code for every task.

agents ai cli code-generation evals multi-agent orchestration-framework sandboxing spec-driven-development

Last synced: 21 Apr 2026

https://github.com/agentevals-dev/evaluators

Collection of evaluators for agentevals

agentevals evals evaluators

Last synced: 07 Apr 2026

https://github.com/kylejeong2/mcpvals

An MCP Evaluation Library

evals mcp

Last synced: 05 Mar 2026

https://github.com/nuxt/nuxt-evals

Evals for Nuxt to test AI model competency at Nuxt.

ai evals nuxt

Last synced: 06 Mar 2026

https://github.com/maragudk/gai

Go Artificial Intelligence (GAI) helps you work with foundational models, large language models, and other AI models.

ai embeddings eval evals go llm

Last synced: 28 Aug 2025

https://github.com/aianytime/rag-evaluator

A library for evaluating Retrieval-Augmented Generation (RAG) systems (the traditional way).

eval evals rag

Last synced: 30 Apr 2025

https://github.com/vstorm-co/awesome-pydantic-ai

An opinionated list of awesome Pydantic-AI frameworks, libraries, software and resources.

agents awesome collections evals llm llm-agent logfire pydantic-ai pydantic-v2 python python-framework python-library python-resources

Last synced: 05 Feb 2026

https://github.com/geval-labs/geval

Decision orchestration and reconciliation for AI changes.

ai-agents aievals evals evaluation geval llm-evaluation llms open-source

Last synced: 01 Apr 2026

https://github.com/browser-use/stress-tests

A collection of particularly difficult test scenarios for evaluating browser-use.

browser-use browsers evals forms headless html playwright puppeteer

Last synced: 29 Apr 2026

https://github.com/mclenhard/mcp-evals

A Node.js package and GitHub Action for evaluating MCP (Model Context Protocol) tool implementations using LLM-based scoring. This helps ensure your MCP server's tools are working correctly and performing well.

ai evals mcp

Last synced: 05 May 2025

https://github.com/openlayer-ai/templates

Our curated collection of templates. Use these patterns to set up your AI projects for evaluation with Openlayer.

ai evals examples

Last synced: 18 Oct 2025

https://github.com/dynatrace-oss/dt-evals

AI evaluators CLI for your AI apps and Agents - Dynatrace AI Observability

agents ai evals evaluations llm-as-judge observability

Last synced: 14 May 2026

https://github.com/agent-pattern-labs/iso

Isomorphic agent tooling: author once, run anywhere. Build, lint, route, fan out, eval, trace, guard, contract, and ledger AI-agent workflows across Cursor, Claude Code, Codex, and OpenCode.

agent-evals agent-harness agent-orchestration agents ai-agents claude-code codex cursor evals iso-contract iso-guard iso-ledger llm monorepo observability opencode prompt-engineering runtime-control typescript workflow-automation

Last synced: 21 May 2026

https://github.com/surus-lat/benchy

A benchmarking engine for evaluating AI systems on task-specific performance.

ai benchmarks engine evals

Last synced: 05 Apr 2026

https://github.com/andrewginns/agents-mcp-usage

Demonstrates agentic use of Model Context Protocol (MCP) server tools with several agent frameworks.

adk-python agents agents-sdk evals evaluation gemini langgraph llm logfire mcp mcp-server openai pydantic-ai streamlit tool

Last synced: 25 Dec 2025

https://github.com/rootly-ai-labs/gmcq-benchmark

Evaluation benchmark for language models to understand code to close pull requests.

ai benchmark evals evaluation-metrics llm sre

Last synced: 25 Feb 2026

https://github.com/exospherehost/ai-reliability-standards

Architectural standards and best practices for building reliable AI Agents and LLM workflows. Defining the framework for AI Reliability Engineering (AIRE).

ai ai-agents ai-reliability aiops durable-execution enterprise evals evaluation observability reliability-engineering sre

Last synced: 15 Feb 2026

https://github.com/wolfeidau/mcp-evals

A Go library and CLI for evaluating Model Context Protocol (MCP) servers using Claude.

ai claude evals go mcp

Last synced: 15 May 2026

https://github.com/razroo/iso

Isomorphic agent tooling: author once, run on frontier or 7B. Build, lint, fan out, eval, and trace AI agent harnesses across Cursor, Claude Code, Codex, and OpenCode.

agent-harness agents ai-agents claude-code codex cursor evals llm markdown-linter monorepo observability opencode prompt-engineering typescript

Last synced: 25 Apr 2026

https://github.com/maximhq/maxim-docs

Maxim Docs

ai evals genai llm

Last synced: 03 Mar 2026

https://github.com/sbalnojan/ai-chaos-awesome

Awesome list for AI chaos engineering: experiments, evaluations, guardrails & observability for LLM/RAG.

ai-chaos-engineering awesome awesome-list chaos-engineering evals llm mlops rag red-teaming reliability

Last synced: 07 Sep 2025

https://github.com/valohai/valohai-llm

Track and report LLM and GenAI evaluations to Valohai LLM

evals genai llm

Last synced: 06 Apr 2026

https://github.com/gokayfem/dspy-ollama-colab

DSPy with Ollama and llama.cpp on Google Colab

agents colab-notebook dspy evals evaluation llamacpp llm ollama vlm

Last synced: 06 May 2026

https://github.com/homemade-software-inc/completion-kit

Your prompts need tests too. Run prompts against real datasets, score outputs with LLM judges, version everything, and compare runs to see what got better.

ai anthropic evals genai llm llm-as-judge llm-eval llm-evaluation mcp ollama openai prompt-engineering prompt-testing rails rails-engine ruby ruby-on-rails

Last synced: 21 May 2026

https://github.com/svilupp/layercode-gym

Unofficial utilities for Layercode Voice Agents. Run hundreds of voice AI conversations concurrently. Test with text, audio files, or AI-driven personas.

evals generative-ai layercode voice-ai-agents

Last synced: 08 Mar 2026

https://github.com/raphaelpor/katt

Katt is a lightweight testing framework for running AI Evals.

agent ai cli copilot evals

Last synced: 19 Feb 2026

https://github.com/rogerchappel/qasmoke

Tiny fixture-driven QA smoke tests for LLM and prompt regressions.

cli evals fixtures llm local-first qa regression-testing smoke-test typescript

Last synced: 26 May 2026

https://github.com/olesyastorchakprojects/agentic_reasoning_playground

Agentic diagnostic assistant for distributed-system incidents: multi-turn RAG, hypothesis updates, evidence packing, golden evals, and failure-attributed run reports.

agentic-workflows ai-agents distributed-systems evals evaluation-metrics golden-dataset incident-diagnosis llm-evaluation opentelemetry rag rust

Last synced: 29 May 2026

https://github.com/maragudk/evals-action

A GitHub Action to parse LLM eval results, display and aggregate them.

evals github-action llm

Last synced: 11 Feb 2026

https://github.com/ghost146767/openai-agents-python

🤖 Build efficient multi-agent workflows with the OpenAI Agents SDK, supporting OpenAI APIs and 100+ other LLMs for flexible solutions.

agent agent-runtime agentops ai4science api chatgpt cli crewai cursor cursor-agent-tools dspy evals evaluation-metrics framework language-model ollama openai python

Last synced: 30 Apr 2026

https://github.com/maragudk/gai-starter-kit

Get started with LLMs, FTS and vector search, RAG, and more, in Go!

ai evals fts go llm rag sqlite vector-search

Last synced: 11 May 2026

https://github.com/auraoneai/judge-bench

Bias probes and reproducible diagnostics for LLM-as-judge evaluation workflows.

ai-evaluation benchmark evals llm-as-judge

Last synced: 28 May 2026

https://github.com/lennart-finke/picturebooks

Which objects are visible through the holes in a picture book? This visual task is easy for adults, doable for primary schoolers, but hard for vision transformers.

evals inspect vision-transformer

Last synced: 14 Oct 2025

https://github.com/blackwell-systems/mcp-assert

Deterministic correctness testing for MCP servers. Assert your tools return the right results, not just any results. No LLM-as-judge.

ai-agents assertions ci deterministic-testing developer-tools evals evaluation golang language-server-protocol mcp mcp-server model-context-protocol testing

Last synced: 10 May 2026

https://github.com/auraoneai/evalkit-playground

Browser playground for scoring rubrics and responses with no install or account.

ai-evaluation evalkit evals playground

Last synced: 28 May 2026

https://github.com/auraoneai/iaa-kit

Modern inter-annotator agreement metrics with bootstrap intervals, ordinal support, and missing-data handling.

ai-evaluation evals inter-annotator-agreement statistics

Last synced: 28 May 2026

https://github.com/auraoneai/synthetic-disagreement

Synthetic reviewer disagreement generators for testing IAA and adjudication workflows.

ai-evaluation evals inter-annotator-agreement synthetic-data

Last synced: 28 May 2026

https://github.com/marvinvista/callback-alpha

Open-source Codex skills and evals for practical B2B revenue work.

codex evals go-to-market openai openai-codex revenue-operations revops sales

Last synced: 03 May 2026

https://github.com/fswair/vowel

YAML Based Eval Specification Language for LLMs and Developers.

evals llms pydantic-evals specification yaml

Last synced: 28 Feb 2026

https://github.com/auraoneai/contamination-audit

Local contamination checks for eval data overlap, hashes, and n-gram leakage.

ai-evaluation data-contamination evals leakage

Last synced: 28 May 2026

https://github.com/auraoneai/eval-adapter

Adapters between rubric-spec and common evaluation framework inputs.

adapters ai-evaluation evals rubric

Last synced: 28 May 2026

https://github.com/fdionisi/evals

A dead-simple evaluation framework for AI models

evals mcp

Last synced: 15 May 2026

https://github.com/auraoneai/datasheet-ci

GitHub Action for validating dataset cards and required metadata in pull requests.

ai-evaluation dataset-card evals github-actions

Last synced: 28 May 2026

https://github.com/auraoneai/evalkit-action

GitHub Action for running EvalKit validation, scoring, and reporting in CI.

ai-evaluation evalkit evals github-actions

Last synced: 28 May 2026

https://github.com/auraoneai/rubric-spec

Portable rubric schema, validator, linter, diff, adapters, and conformance tests for AI evaluation.

ai-evaluation evals json-schema rubric

Last synced: 28 May 2026

https://github.com/auraoneai/eval-run-manifest

Portable manifest envelope for eval run provenance, artifacts, and reproducibility.

ai-evaluation evals manifest provenance

Last synced: 28 May 2026

https://github.com/ben-ranford/cellin

Build long-lived multimodal memory, dream over it, and retrieve context with transparent weighting.

agent-memory evals knowledge-graph llm-memory memory multimodal python retrieval

Last synced: 08 Apr 2026

https://github.com/auraoneai/eval-conformance-suite

Executable rubric-spec v1 conformance checks and embeddable SVG badges.

ai-evaluation conformance evals rubric

Last synced: 28 May 2026

https://github.com/lavanyashukla/ai-starter-templates

Production-ready AI starter templates — agents, SDR outbound, RAG, evals, RLHF.

agents ai best-practices boilerplate evals examples llm production rag rlhf starter-template

Last synced: 20 Apr 2026

https://github.com/largonarco/eval

LLM system evaluations for a mock system

evals llm-as-a-judge openai

Last synced: 13 Feb 2026

https://github.com/urmzd/generative-artifact-protocol

Generative Artifact Protocol (GAP) — an open standard for token-efficient artifact updates and streaming. Rust apply engine + Python eval framework.

apply-engine artifacts diff evals gap generative-artifact-protocol llm open-standard protocol python rust sse streaming text-diff token-efficient wasm

Last synced: 19 Apr 2026

https://github.com/alucek/pii-masking-rlenv

RL environment built using Verifiers for PII masking

evals fine-tuning llms reinforcement-learning rl rl-environment

Last synced: 17 May 2026

https://github.com/auraoneai/open

Open tools for the human-judgment layer of AI evaluation: EvalKit (Python package + CLI), Robotics ReviewKit, and the Buying Toolkit.

ai-safety auraone evals evaluation human-feedback lerobot llm openx rlds robotics rubrics teleoperation

Last synced: 28 May 2026

https://github.com/auraoneai/rubric-pr-bot

Rubric diffs and lint feedback for pull requests that change evaluation criteria.

ai-evaluation evals github-app rubric

Last synced: 28 May 2026

https://github.com/auraoneai/judge-card

A disclosure format for judge prompts, calibration results, known bias, and recommended use envelopes.

ai-evaluation evals llm-as-judge model-card

Last synced: 28 May 2026

https://github.com/jancervenka/czech-simpleqa

How well can language models answer questions in Czech?

ai artificial-intelligence claude evals gpt language-model llm

Last synced: 11 Mar 2025

https://github.com/gafnts/agentic-kie-evals

Benchmarking agentic and single-pass extraction strategies across LLM providers on the Kleister NDA dataset

agentic-ai agentic-kie document-ai evals key-information-extraction kie langsmith

Last synced: 30 Apr 2026