Projects in Awesome Lists tagged with evals

https://github.com/mastra-ai/mastra

The TypeScript AI agent framework. ⚡ Assistants, RAG, observability. Supports any LLM: GPT-4, Claude, Gemini, Llama.

agents ai chatbots evals javascript llm mcp nextjs nodejs reactjs tts typescript workflows

Last synced: 14 May 2025

https://github.com/arize-ai/phoenix

AI Observability & Evaluation

agents ai-monitoring ai-observability aiengineering anthropic datasets evals langchain llamaindex llm-eval llm-evaluation llmops llms openai prompt-engineering smolagents

Last synced: 27 May 2026

https://github.com/agentops-ai/agentops

Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks including OpenAI Agents SDK, CrewAI, Langchain, Autogen, AG2, and CamelAI

agent agentops agents-sdk ai anthropic autogen cost-estimation crewai evals evaluation-metrics groq langchain llm mistral ollama openai openai-agents

Last synced: 17 Nov 2025

https://github.com/kiln-ai/kiln

The easiest tool for fine-tuning LLM models, synthetic data generation, and collaborating on datasets.

ai chain-of-thought collaboration dataset-generation evals evaluation fine-tuning machine-learning macos ml ollama openai prompt prompt-engineering python rlhf synthetic-data windows

Last synced: 23 Apr 2025

https://github.com/truera/trulens

Evaluation and Tracking for LLM Experiments and AI Agents

agent-evaluation agentops ai-agents ai-monitoring ai-observability evals explainable-ml llm-eval llm-evaluation llmops llms machine-learning neural-networks

Last synced: 10 Mar 2026

https://github.com/mcpjam/inspector

Testing and evaluation platform to chat, inspect, and debug MCP servers, MCP apps, and ChatGPT apps.

anthropic chatgpt cicd debugger evals evaluation inspector mcp mcp-apps mcp-clients mcp-inspector mcp-server mcp-tools modelcontextprotocol oauth oauth2 openai openai-apps-sdk opensource tracing

Last synced: 09 Jun 2026

https://github.com/lmnr-ai/lmnr

Laminar - open-source all-in-one platform for engineering AI products. Create a data flywheel for your AI app. Traces, Evals, Datasets, Labels. YC S24.

agents ai ai-observability aiops analytics developer-tools evals evaluation llm-evaluation llm-observability llm-workflow llmops monitoring observability open-source pipeline-builder rag rust-lang self-hosted

Last synced: 14 Apr 2026

https://github.com/harbor-framework/harbor

Harbor is a framework for running agent evaluations and creating and using RL environments.

evals rl-environments terminal-bench

Last synced: 30 Apr 2026

https://github.com/superlinear-ai/raglite

🥤 RAGLite is a Python toolkit for Retrieval-Augmented Generation (RAG) with PostgreSQL or SQLite

chainlit colbert evals hybrid-search late-chunking late-interaction llm markdown pdf pgvector postgres postgresql query-adapter rag reranker reranking retrieval-augmented-generation sqlite tsvector vector-search

Last synced: 14 May 2025

https://github.com/mattpocock/evalite

Test your LLM-powered apps with TypeScript. No API key required.

ai evals typescript

Last synced: 14 May 2025

https://github.com/laude-institute/harbor

Harbor is a framework for running agent evaluations and creating and using RL environments.

evals rl-environments terminal-bench

Last synced: 01 Feb 2026

https://github.com/agentevals-dev/agentevals

agentevals is a framework-agnostic evaluations solution based on OpenTelemetry traces

agentevals agents evals

Last synced: 26 May 2026

https://github.com/dustalov/evalica

Evalica, your favourite evaluation toolkit

arena bradley-terry elo evalica evals evaluation hacktoberfest leaderboard library llm pagerank pairwise-comparison pyo3 python ranking rating rust serbia statistics winrate

Last synced: 11 Mar 2026

https://github.com/voratiq/voratiq

Agent ensembles to design, generate, and select the best code for every task.

agents ai cli code-generation evals multi-agent orchestration-framework sandboxing spec-driven-development

Last synced: 21 Apr 2026

https://github.com/agentevals-dev/evaluators

Collection of evaluators for agentevals

agentevals evals evaluators

Last synced: 07 Apr 2026

https://github.com/kylejeong2/mcpvals

An MCP Evaluation Library

evals mcp

Last synced: 05 Mar 2026

https://github.com/nuxt/nuxt-evals

Evals for Nuxt to test AI model competency at Nuxt.

ai evals nuxt

Last synced: 06 Mar 2026

https://github.com/vero-labs-ai/vero-eval

Open source framework for evaluating AI Agents

dataset-generation datasets evals evaluation evaluation-framework evaluation-metrics langgraph llm-evaluation llm-evaluation-framework python rag-evaluation rag-testing synthetic-dataset-generation testing testing-framework testing-library user-persona

Last synced: 07 Apr 2026

https://github.com/maragudk/gai

Go Artificial Intelligence (GAI) helps you work with foundational models, large language models, and other AI models.

ai embeddings eval evals go llm

Last synced: 28 Aug 2025

https://github.com/aianytime/rag-evaluator

A library for evaluating Retrieval-Augmented Generation (RAG) systems using traditional methods.

eval evals rag

Last synced: 30 Apr 2025

https://github.com/nirantk/rag-to-riches

evals rag search

Last synced: 12 Apr 2025

https://github.com/vstorm-co/awesome-pydantic-ai

An opinionated list of awesome Pydantic-AI frameworks, libraries, software and resources.

agents awesome collections evals llm llm-agent logfire pydantic-ai pydantic-v2 python python-framework python-library python-resources

Last synced: 05 Feb 2026

https://github.com/browser-use/stress-tests

A collection of particularly difficult test scenarios for evaluating browser-use.

browser-use browsers evals forms headless html playwright puppeteer

Last synced: 29 Apr 2026

https://github.com/geval-labs/geval

Decision orchestration and reconciliation for AI changes.

ai-agents aievals evals evaluation geval llm-evaluation llms open-source

Last synced: 01 Apr 2026

https://github.com/mclenhard/mcp-evals

A Node.js package and GitHub Action for evaluating MCP (Model Context Protocol) tool implementations using LLM-based scoring. This helps ensure your MCP server's tools are working correctly and performing well.

ai evals mcp

Last synced: 05 May 2025

https://github.com/getlarge/themoltnet

Trusted context for AI agents

agentic-ai autonomous-agents claude coding-agent context-engineering context-lifecycle decentralized-identity evals

Last synced: 11 Jun 2026

https://github.com/Moai-Team-LLC/agentic-product-standard

The canonical standard for building production-grade agentic products — autonomy ladder, composition patterns, the 7-layer harness, eval discipline — plus a Claude Code skill set that operationalizes it.

agent-architecture agentic-ai ai-agents claude-code evals llm mcp prompt-engineering standard

Last synced: 16 Jun 2026

https://github.com/root-signals/scorable-sdk

Scorable SDK

evals evaluation llm llm-as-a-judge observability

Last synced: 01 Mar 2026

https://github.com/openlayer-ai/templates

Our curated collection of templates. Use these patterns to set up your AI projects for evaluation with Openlayer.

ai evals examples

Last synced: 18 Oct 2025

https://github.com/aryaminus/controlkeel

Agent control plane for governed AI coding: validate changes, enforce policy gates, track findings, proofs, and evals based on your habits.

agents ai-agents ai-governance benchmark code-review compliance compliance-as-code devsecops elixir evals llm mcp model-context-protocol observability phoenix policy-as-code security skills tooling

Last synced: 13 Jun 2026

https://github.com/dynatrace-oss/dt-evals

AI evaluators CLI for your AI apps and Agents - Dynatrace AI Observability

agents ai evals evaluations llm-as-judge observability

Last synced: 14 May 2026

https://github.com/agent-pattern-labs/iso

Isomorphic agent tooling: author once, run anywhere. Build, lint, route, fan out, eval, trace, guard, contract, and ledger AI-agent workflows across Cursor, Claude Code, Codex, and OpenCode.

agent-evals agent-harness agent-orchestration agents ai-agents claude-code codex cursor evals iso-contract iso-guard iso-ledger llm monorepo observability opencode prompt-engineering runtime-control typescript workflow-automation

Last synced: 21 May 2026

https://github.com/root-signals/root-signals-mcp

MCP for Root Signals Evaluation Platform

agentic-ai evals llm-as-a-judge mcp model-context-protocol pydantic-ai

Last synced: 03 May 2025

https://github.com/andrewginns/agents-mcp-usage

Demonstrates agentic use of Model Context Protocol (MCP) server tools with several agent frameworks

adk-python agents agents-sdk evals evaluation gemini langgraph llm logfire mcp mcp-server openai pydantic-ai streamlit tool

Last synced: 25 Dec 2025

https://github.com/surus-lat/benchy

A benchmarking engine for evaluating AI systems on task-specific performance.

ai benchmarks engine evals

Last synced: 05 Apr 2026

https://github.com/wolfeidau/mcp-evals

A Go library and CLI for evaluating Model Context Protocol (MCP) servers using Claude.

ai claude evals go mcp

Last synced: 15 May 2026

https://github.com/razroo/iso

Isomorphic agent tooling: author once, run on frontier or 7B. Build, lint, fan out, eval, and trace AI agent harnesses across Cursor, Claude Code, Codex, and OpenCode.

agent-harness agents ai-agents claude-code codex cursor evals llm markdown-linter monorepo observability opencode prompt-engineering typescript

Last synced: 25 Apr 2026

https://github.com/avi350751/test-llm-with-deepeval

A hands-on exploration of Deepeval — an open-source framework for evaluating and red-teaming large language models (LLMs). This repository documents my journey of testing, benchmarking, and improving LLM reliability using custom prompts, metrics, and pipelines.

deepeval evals llmtesting

Last synced: 08 Jun 2026

https://github.com/maximhq/maxim-docs

Maxim Docs

ai evals genai llm

Last synced: 03 Mar 2026

https://github.com/ictup/production-rag-assistant

Production-ready RAG backend with FastAPI, pgvector, hybrid retrieval, eval gates, observability, RBAC, async exports, and OpenAI providers.

ai-engineering alembic docker evals fastapi hybrid-search llm observability openai pgvector postgresql production-ready prometheus python rag rbac retrieval-augmented-generation semantic-search sse vector-search

Last synced: 06 Jun 2026

https://github.com/rootly-ai-labs/gmcq-benchmark

Evaluation benchmark for language models to understand code to close pull requests.

ai benchmark evals evaluation-metrics llm sre

Last synced: 25 Feb 2026

https://github.com/exospherehost/ai-reliability-standards

Architectural standards and best practices for building reliable AI Agents and LLM workflows. Defining the framework for AI Reliability Engineering (AIRE).

ai ai-agents ai-reliability aiops durable-execution enterprise evals evaluation observability reliability-engineering sre

Last synced: 15 Feb 2026

https://github.com/priyanshuchawda/tracepilot-gemini-cli

TracePilot: Gemini CLI fork with Phoenix/OpenInference tracing, MCP self-introspection, safety gates, redaction, evals, and a verified broken-repo repair loop

ai-agents arize-phoenix evals gemini-cli mcp openinference opentelemetry typescript

Last synced: 17 Jun 2026

https://github.com/valohai/valohai-llm

Track and report LLM and GenAI evaluations to Valohai LLM

evals genai llm

Last synced: 06 Apr 2026

https://github.com/gokayfem/dspy-ollama-colab

DSPy with Ollama and llama.cpp on Google Colab

agents colab-notebook dspy evals evaluation llamacpp llm ollama vlm

Last synced: 06 May 2026

https://github.com/svilupp/layercode-gym

Unofficial utilities for Layercode Voice Agents. Run hundreds of voice AI conversations concurrently. Test with text, audio files, or AI-driven personas.

evals generative-ai layercode voice-ai-agents

Last synced: 08 Mar 2026

https://github.com/raphaelpor/katt

Katt is a lightweight testing framework for running AI Evals.

agent ai cli copilot evals

Last synced: 19 Feb 2026

https://github.com/rogerchappel/qasmoke

Tiny fixture-driven QA smoke tests for LLM and prompt regressions.

cli evals fixtures llm local-first qa regression-testing smoke-test typescript

Last synced: 26 May 2026

https://github.com/olesyastorchakprojects/agentic_reasoning_playground

Agentic diagnostic assistant for distributed-system incidents: multi-turn RAG, hypothesis updates, evidence packing, golden evals, and failure-attributed run reports.

agentic-workflows ai-agents distributed-systems evals evaluation-metrics golden-dataset incident-diagnosis llm-evaluation opentelemetry rag rust

Last synced: 29 May 2026

https://github.com/cleanlab/tlm

Score the trustworthiness of outputs from any LLM in real-time

ai-agents ai-safety confidence-estimation data-extraction data-labeling error-detection evals evaluation guardrails hallucination hallucination-detection human-in-the-loop-ai llm llm-as-a-judge llm-evaluation rag structured-outputs trustworthy-ai uncertainty-quantification verifiers

Last synced: 23 Feb 2026

https://github.com/grnbtqdbyx-create/trace-to-skill

Check whether a repo is Codex-ready, then turn failed AI coding-agent runs into reusable AGENTS.md rules, skills, and eval gates.

agent-benchmark agent-evals agent-skills agent-workflows agents-md agents-md-linter ai-agents ai-code-review ai-coding-agents claude-code codex codex-cli codex-readiness evals github-action mcp mcp-security open-source-maintainers openai-codex prompt-injection

Last synced: 31 May 2026

https://github.com/maragudk/evals-action

A GitHub Action to parse LLM eval results, display and aggregate them.

evals github-action llm

Last synced: 11 Feb 2026

https://github.com/ghost146767/openai-agents-python

🤖 Build efficient multi-agent workflows with the OpenAI Agents SDK, supporting OpenAI APIs and 100+ other LLMs for flexible solutions.

agent agent-runtime agentops ai4science api chatgpt cli crewai cursor cursor-agent-tools dspy evals evaluation-metrics framework language-model ollama openai python

Last synced: 30 Apr 2026

https://github.com/maragudk/gai-starter-kit

Get started with LLMs, FTS and vector search, RAG, and more, in Go!

ai evals fts go llm rag sqlite vector-search

Last synced: 11 May 2026

https://github.com/sbalnojan/ai-chaos-awesome

Awesome list for AI chaos engineering: experiments, evaluations, guardrails & observability for LLM/RAG.

ai-chaos-engineering awesome awesome-list chaos-engineering evals llm mlops rag red-teaming reliability

Last synced: 07 Sep 2025

https://github.com/auraoneai/rubric-pr-bot

Rubric diffs and lint feedback for pull requests that change evaluation criteria.

ai-evaluation evals github-app rubric

Last synced: 28 May 2026

https://github.com/auraoneai/judge-card

A disclosure format for judge prompts, calibration results, known bias, and recommended use envelopes.

ai-evaluation evals llm-as-judge model-card

Last synced: 28 May 2026

https://github.com/auraoneai/judge-bench

Bias probes and reproducible diagnostics for LLM-as-judge evaluation workflows.

ai-evaluation benchmark evals llm-as-judge

Last synced: 28 May 2026

https://github.com/fdionisi/evals

A deadly simple evaluation framework for AI models

evals mcp

Last synced: 15 May 2026

https://github.com/fswair/vowel

YAML Based Eval Specification Language for LLMs and Developers.

evals llms pydantic-evals specification yaml

Last synced: 28 Feb 2026

https://github.com/auraoneai/evalkit-playground

Browser playground for scoring rubrics and responses with no install or account.

ai-evaluation evalkit evals playground

Last synced: 28 May 2026

https://github.com/auraoneai/iaa-kit

Modern inter-annotator agreement metrics with bootstrap intervals, ordinal support, and missing-data handling.

ai-evaluation evals inter-annotator-agreement statistics

Last synced: 28 May 2026

https://github.com/alucek/pii-masking-rlenv

RL environment built using Verifiers for PII masking

evals fine-tuning llms reinforcement-learning rl rl-environment

Last synced: 17 May 2026

https://github.com/auraoneai/synthetic-disagreement

Synthetic reviewer disagreement generators for testing IAA and adjudication workflows.

ai-evaluation evals inter-annotator-agreement synthetic-data

Last synced: 28 May 2026

https://github.com/auraoneai/contamination-audit

Local contamination checks for eval data overlap, hashes, and n-gram leakage.

ai-evaluation data-contamination evals leakage

Last synced: 28 May 2026

https://github.com/auraoneai/eval-adapter

Adapters between rubric-spec and common evaluation framework inputs.

adapters ai-evaluation evals rubric

Last synced: 28 May 2026

https://github.com/ben-ranford/cellin

Build long-lived multimodal memory, dream over it, and retrieve context with transparent weighting

agent-memory evals knowledge-graph llm-memory memory multimodal python retrieval

Last synced: 08 Apr 2026

https://github.com/auraoneai/datasheet-ci

GitHub Action for validating dataset cards and required metadata in pull requests.

ai-evaluation dataset-card evals github-actions

Last synced: 28 May 2026

https://github.com/lavanyashukla/ai-starter-templates

Production-ready AI starter templates — agents, SDR outbound, RAG, evals, RLHF.

agents ai best-practices boilerplate evals examples llm production rag rlhf starter-template

Last synced: 20 Apr 2026

https://github.com/auraoneai/evalkit-action

GitHub Action for running EvalKit validation, scoring, and reporting in CI.

ai-evaluation evalkit evals github-actions

Last synced: 28 May 2026

https://github.com/auraoneai/rubric-spec

Portable rubric schema, validator, linter, diff, adapters, and conformance tests for AI evaluation.

ai-evaluation evals json-schema rubric

Last synced: 28 May 2026

https://github.com/auraoneai/eval-run-manifest

Portable manifest envelope for eval run provenance, artifacts, and reproducibility.

ai-evaluation evals manifest provenance

Last synced: 28 May 2026

https://github.com/auraoneai/eval-conformance-suite

Executable rubric-spec v1 conformance checks and embeddable SVG badges.

ai-evaluation conformance evals rubric

Last synced: 28 May 2026

https://github.com/largonarco/eval

LLM system evaluations for a mock system

evals llm-as-a-judge openai

Last synced: 13 Feb 2026

https://github.com/urmzd/generative-artifact-protocol

Generative Artifact Protocol (GAP) — an open standard for token-efficient artifact updates and streaming. Rust apply engine + Python eval framework.

apply-engine artifacts diff evals gap generative-artifact-protocol llm open-standard protocol python rust sse streaming text-diff token-efficient wasm

Last synced: 19 Apr 2026

https://github.com/gafnts/agentic-kie-evals

Benchmarking agentic and single-pass extraction strategies across LLM providers on the Kleister NDA dataset

agentic-ai agentic-kie document-ai evals key-information-extraction kie langsmith

Last synced: 30 Apr 2026

https://github.com/jtmuller5/vibe-checker

The TypeScript LLM Evaluation File

ai devtools evals evaluation-metrics evaluations gemini gemini-api gemini-flash javascript llm nodejs testing typescript vitest

Last synced: 01 May 2026

https://github.com/op12no2/patchwork

An informal cumulative and competitive frontier model eval using a JavaScript chess engine

agents ai chess chess-engine competition eval evals evaluation frontier javascript llm

Last synced: 09 Jun 2026

https://github.com/marvinvista/callback-alpha

Open-source Codex skills and evals for practical B2B revenue work.

codex evals go-to-market openai openai-codex revenue-operations revops sales

Last synced: 03 May 2026

https://github.com/lennart-finke/picturebooks

Which objects are visible through the holes in a picture book? This visual task is easy for adults, doable for primary schoolers, but hard for vision transformers.

evals inspect vision-transformer

Last synced: 14 Oct 2025