Projects in Awesome Lists tagged with evals
A curated list of projects in awesome lists tagged with evals.
https://github.com/mastra-ai/mastra
The TypeScript AI agent framework. ⚡ Assistants, RAG, observability. Supports any LLM: GPT-4, Claude, Gemini, Llama.
agents ai chatbots evals javascript llm mcp nextjs nodejs reactjs tts typescript workflows
Last synced: 14 May 2025
https://github.com/arize-ai/phoenix
AI Observability & Evaluation
agents ai-monitoring ai-observability aiengineering anthropic datasets evals langchain llamaindex llm-eval llm-evaluation llmops llms openai prompt-engineering smolagents
Last synced: 27 May 2026
https://github.com/Arize-ai/phoenix
AI Observability & Evaluation
agents ai-monitoring ai-observability aiengineering anthropic datasets evals langchain llamaindex llm-eval llm-evaluation llmops llms openai prompt-engineering smolagents
Last synced: 26 Mar 2025
https://github.com/agentops-ai/agentops
Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks including OpenAI Agents SDK, CrewAI, Langchain, Autogen, AG2, and CamelAI
agent agentops agents-sdk ai anthropic autogen cost-estimation crewai evals evaluation-metrics groq langchain llm mistral ollama openai openai-agents
Last synced: 17 Nov 2025
https://github.com/AgentOps-AI/agentops
Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks including OpenAI Agents SDK, CrewAI, Langchain, Autogen, AG2, and CamelAI
agent agentops agents-sdk ai anthropic autogen cost-estimation crewai evals evaluation-metrics groq langchain llm mistral ollama openai openai-agents
Last synced: 26 Mar 2025
https://github.com/kiln-ai/kiln
The easiest tool for fine-tuning LLM models, synthetic data generation, and collaborating on datasets.
ai chain-of-thought collaboration dataset-generation evals evaluation fine-tuning machine-learning macos ml ollama openai prompt prompt-engineering python rlhf synthetic-data windows
Last synced: 23 Apr 2025
https://github.com/truera/trulens
Evaluation and Tracking for LLM Experiments and AI Agents
agent-evaluation agentops ai-agents ai-monitoring ai-observability evals explainable-ml llm-eval llm-evaluation llmops llms machine-learning neural-networks
Last synced: 10 Mar 2026
https://github.com/lmnr-ai/lmnr
Laminar - open-source all-in-one platform for engineering AI products. Create a data flywheel for your AI app. Traces, Evals, Datasets, Labels. YC S24.
agents ai ai-observability aiops analytics developer-tools evals evaluation llm-evaluation llm-observability llm-workflow llmops monitoring observability open-source pipeline-builder rag rust-lang self-hosted
Last synced: 14 Apr 2026
https://github.com/harbor-framework/harbor
Harbor is a framework for running agent evaluations and creating and using RL environments.
evals rl-environments terminal-bench
Last synced: 30 Apr 2026
https://github.com/superlinear-ai/raglite
🥤 RAGLite is a Python toolkit for Retrieval-Augmented Generation (RAG) with PostgreSQL or SQLite
chainlit colbert evals hybrid-search late-chunking late-interaction llm markdown pdf pgvector postgres postgresql query-adapter rag reranker reranking retrieval-augmented-generation sqlite tsvector vector-search
Last synced: 14 May 2025
https://github.com/mattpocock/evalite
Test your LLM-powered apps with TypeScript. No API key required.
Last synced: 14 May 2025
https://github.com/laude-institute/harbor
Harbor is a framework for running agent evaluations and creating and using RL environments.
evals rl-environments terminal-bench
Last synced: 01 Feb 2026
https://github.com/agentevals-dev/agentevals
agentevals is a framework-agnostic evaluations solution based on OpenTelemetry traces
Last synced: 26 May 2026
https://github.com/dustalov/evalica
Evalica, your favourite evaluation toolkit
arena bradley-terry elo evalica evals evaluation hacktoberfest leaderboard library llm pagerank pairwise-comparison pyo3 python ranking rating rust serbia statistics winrate
Last synced: 11 Mar 2026
https://github.com/voratiq/voratiq
Agent ensembles to design, generate, and select the best code for every task.
agents ai cli code-generation evals multi-agent orchestration-framework sandboxing spec-driven-development
Last synced: 21 Apr 2026
https://github.com/agentevals-dev/evaluators
Collection of evaluators for agentevals
Last synced: 07 Apr 2026
https://github.com/nuxt/nuxt-evals
Evals for Nuxt to test AI model competency at Nuxt.
Last synced: 06 Mar 2026
https://github.com/maragudk/gai
Go Artificial Intelligence (GAI) helps you work with foundational models, large language models, and other AI models.
ai embeddings eval evals go llm
Last synced: 28 Aug 2025
https://github.com/vero-labs-ai/vero-eval
Open source framework for evaluating AI Agents
dataset-generation datasets evals evaluation evaluation-framework evaluation-metrics langgraph llm-evaluation llm-evaluation-framework python rag-evaluation rag-testing synthetic-dataset-generation testing testing-framework testing-library user-persona
Last synced: 07 Apr 2026
https://github.com/aianytime/rag-evaluator
A library for evaluating Retrieval-Augmented Generation (RAG) systems (The traditional ways).
Last synced: 30 Apr 2025
https://github.com/vstorm-co/awesome-pydantic-ai
An opinionated list of awesome Pydantic-AI frameworks, libraries, software and resources.
agents awesome collections evals llm llm-agent logfire pydantic-ai pydantic-v2 python python-framework python-library python-resources
Last synced: 05 Feb 2026
https://github.com/geval-labs/geval
Decision orchestration and reconciliation for AI changes.
ai-agents aievals evals evaluation geval llm-evaluation llms open-source
Last synced: 01 Apr 2026
https://github.com/browser-use/stress-tests
A collection of particularly difficult test scenarios for evaluating browser-use.
browser-use browsers evals forms headless html playwright puppeteer
Last synced: 29 Apr 2026
https://github.com/mclenhard/mcp-evals
A Node.js package and GitHub Action for evaluating MCP (Model Context Protocol) tool implementations using LLM-based scoring. This helps ensure your MCP server's tools are working correctly and performing well.
Last synced: 05 May 2025
https://github.com/getlarge/themoltnet
Trusted context for AI agents
agentic-ai autonomous-agents claude coding-agent context-engineering context-lifecycle decentralized-identity evals
Last synced: 26 May 2026
https://github.com/root-signals/scorable-sdk
Scorable SDK
evals evaluation llm llm-as-a-judge observability
Last synced: 01 Mar 2026
https://github.com/openlayer-ai/templates
Our curated collection of templates. Use these patterns to set up your AI projects for evaluation with Openlayer.
Last synced: 18 Oct 2025
https://github.com/dynatrace-oss/dt-evals
AI evaluators CLI for your AI apps and Agents - Dynatrace AI Observability
agents ai evals evaluations llm-as-judge observability
Last synced: 14 May 2026
https://github.com/agent-pattern-labs/iso
Isomorphic agent tooling: author once, run anywhere. Build, lint, route, fan out, eval, trace, guard, contract, and ledger AI-agent workflows across Cursor, Claude Code, Codex, and OpenCode.
agent-evals agent-harness agent-orchestration agents ai-agents claude-code codex cursor evals iso-contract iso-guard iso-ledger llm monorepo observability opencode prompt-engineering runtime-control typescript workflow-automation
Last synced: 21 May 2026
https://github.com/root-signals/root-signals-mcp
MCP for Root Signals Evaluation Platform
agentic-ai evals llm-as-a-judge mcp model-context-protocol pydantic-ai
Last synced: 03 May 2025
https://github.com/surus-lat/benchy
A benchmarking engine for evaluating AI systems on task-specific performance.
Last synced: 05 Apr 2026
https://github.com/andrewginns/agents-mcp-usage
Demonstrate Agentic use of Model Context Protocol (MCP) server tools with several Agent Frameworks
adk-python agents agents-sdk evals evaluation gemini langgraph llm logfire mcp mcp-server openai pydantic-ai streamlit tool
Last synced: 25 Dec 2025
https://github.com/rootly-ai-labs/gmcq-benchmark
Evaluation benchmark for language models to understand code to close pull requests.
ai benchmark evals evaluation-metrics llm sre
Last synced: 25 Feb 2026
https://github.com/exospherehost/ai-reliability-standards
Architectural standards and best practices for building reliable AI Agents and LLM workflows. Defining the framework for AI Reliability Engineering (AIRE).
ai ai-agents ai-reliability aiops durable-execution enterprise evals evaluation observability reliability-engineering sre
Last synced: 15 Feb 2026
https://github.com/razroo/iso
Isomorphic agent tooling: author once, run on frontier or 7B. Build, lint, fan out, eval, and trace AI agent harnesses across Cursor, Claude Code, Codex, and OpenCode.
agent-harness agents ai-agents claude-code codex cursor evals llm markdown-linter monorepo observability opencode prompt-engineering typescript
Last synced: 25 Apr 2026
https://github.com/sbalnojan/ai-chaos-awesome
Awesome list for AI chaos engineering: experiments, evaluations, guardrails & observability for LLM/RAG.
ai-chaos-engineering awesome awesome-list chaos-engineering evals llm mlops rag red-teaming reliability
Last synced: 07 Sep 2025
https://github.com/valohai/valohai-llm
Track and report LLM and GenAI evaluations to Valohai LLM
Last synced: 06 Apr 2026
https://github.com/gokayfem/dspy-ollama-colab
DSPy with Ollama and llama.cpp on Google Colab
agents colab-notebook dspy evals evaluation llamacpp llm ollama vlm
Last synced: 06 May 2026
https://github.com/homemade-software-inc/completion-kit
Your prompts need tests too. Run prompts against real datasets, score outputs with LLM judges, version everything, and compare runs to see what got better.
ai anthropic evals genai llm llm-as-judge llm-eval llm-evaluation mcp ollama openai prompt-engineering prompt-testing rails rails-engine ruby ruby-on-rails
Last synced: 21 May 2026
https://github.com/svilupp/layercode-gym
Unofficial utilities for Layercode Voice Agents. Run hundreds of voice AI conversations concurrently. Test with text, audio files, or AI-driven personas.
evals generative-ai layercode voice-ai-agents
Last synced: 08 Mar 2026
https://github.com/rogerchappel/qasmoke
Tiny fixture-driven QA smoke tests for LLM and prompt regressions.
cli evals fixtures llm local-first qa regression-testing smoke-test typescript
Last synced: 26 May 2026
https://github.com/olesyastorchakprojects/agentic_reasoning_playground
Agentic diagnostic assistant for distributed-system incidents: multi-turn RAG, hypothesis updates, evidence packing, golden evals, and failure-attributed run reports.
agentic-workflows ai-agents distributed-systems evals evaluation-metrics golden-dataset incident-diagnosis llm-evaluation opentelemetry rag rust
Last synced: 29 May 2026
https://github.com/cleanlab/tlm
Score the trustworthiness of outputs from any LLM in real-time
ai-agents ai-safety confidence-estimation data-extraction data-labeling error-detection evals evaluation guardrails hallucination hallucination-detection human-in-the-loop-ai llm llm-as-a-judge llm-evaluation rag structured-outputs trustworthy-ai uncertainty-quantification verifiers
Last synced: 23 Feb 2026
https://github.com/grnbtqdbyx-create/trace-to-skill
Check whether a repo is Codex-ready, then turn failed AI coding-agent runs into reusable AGENTS.md rules, skills, and eval gates.
agent-benchmark agent-evals agent-skills agent-workflows agents-md agents-md-linter ai-agents ai-code-review ai-coding-agents claude-code codex codex-cli codex-readiness evals github-action mcp mcp-security open-source-maintainers openai-codex prompt-injection
Last synced: 31 May 2026
https://github.com/maragudk/evals-action
A GitHub Action to parse LLM eval results, display and aggregate them.
Last synced: 11 Feb 2026
https://github.com/ghost146767/openai-agents-python
🤖 Build efficient multi-agent workflows with the OpenAI Agents SDK, supporting OpenAI APIs and 100+ other LLMs for flexible solutions.
agent agent-runtime agentops ai4science api chatgpt cli crewai cursor cursor-agent-tools dspy evals evaluation-metrics framework language-model ollama openai python
Last synced: 30 Apr 2026
https://github.com/maragudk/gai-starter-kit
Get started with LLMs, FTS and vector search, RAG, and more, in Go!
ai evals fts go llm rag sqlite vector-search
Last synced: 11 May 2026
https://github.com/auraoneai/judge-bench
Bias probes and reproducible diagnostics for LLM-as-judge evaluation workflows.
ai-evaluation benchmark evals llm-as-judge
Last synced: 28 May 2026
https://github.com/jtmuller5/vibe-checker
The TypeScript LLM Evaluation File
ai devtools evals evaluation-metrics evaluations gemini gemini-api gemini-flash javascript llm nodejs testing typescript vitest
Last synced: 01 May 2026
https://github.com/lennart-finke/picturebooks
Which objects are visible through the holes in a picture book? This visual task is easy for adults, doable for primary schoolers, but hard for vision transformers.
evals inspect vision-transformer
Last synced: 14 Oct 2025
https://github.com/blackwell-systems/mcp-assert
Deterministic correctness testing for MCP servers. Assert your tools return the right results, not just any results. No LLM-as-judge.
ai-agents assertions ci deterministic-testing developer-tools evals evaluation golang language-server-protocol mcp mcp-server model-context-protocol testing
Last synced: 10 May 2026
https://github.com/auraoneai/evalkit-playground
Browser playground for scoring rubrics and responses with no install or account.
ai-evaluation evalkit evals playground
Last synced: 28 May 2026
https://github.com/auraoneai/iaa-kit
Modern inter-annotator agreement metrics with bootstrap intervals, ordinal support, and missing-data handling.
ai-evaluation evals inter-annotator-agreement statistics
Last synced: 28 May 2026
https://github.com/auraoneai/synthetic-disagreement
Synthetic reviewer disagreement generators for testing IAA and adjudication workflows.
ai-evaluation evals inter-annotator-agreement synthetic-data
Last synced: 28 May 2026
https://github.com/marvinvista/callback-alpha
Open-source Codex skills and evals for practical B2B revenue work.
codex evals go-to-market openai openai-codex revenue-operations revops sales
Last synced: 03 May 2026
https://github.com/fswair/vowel
YAML Based Eval Specification Language for LLMs and Developers.
evals llms pydantic-evals specification yaml
Last synced: 28 Feb 2026
https://github.com/auraoneai/contamination-audit
Local contamination checks for eval data overlap, hashes, and n-gram leakage.
ai-evaluation data-contamination evals leakage
Last synced: 28 May 2026
https://github.com/auraoneai/eval-adapter
Adapters between rubric-spec and common evaluation framework inputs.
adapters ai-evaluation evals rubric
Last synced: 28 May 2026
https://github.com/fdionisi/evals
A dead simple evaluation framework for AI models
Last synced: 15 May 2026
https://github.com/auraoneai/datasheet-ci
GitHub Action for validating dataset cards and required metadata in pull requests.
ai-evaluation dataset-card evals github-actions
Last synced: 28 May 2026
https://github.com/auraoneai/evalkit-action
GitHub Action for running EvalKit validation, scoring, and reporting in CI.
ai-evaluation evalkit evals github-actions
Last synced: 28 May 2026
https://github.com/auraoneai/rubric-spec
Portable rubric schema, validator, linter, diff, adapters, and conformance tests for AI evaluation.
ai-evaluation evals json-schema rubric
Last synced: 28 May 2026
https://github.com/auraoneai/eval-run-manifest
Portable manifest envelope for eval run provenance, artifacts, and reproducibility.
ai-evaluation evals manifest provenance
Last synced: 28 May 2026
https://github.com/ben-ranford/cellin
Build long-lived multimodal memory, dream over it, and retrieve context with transparent weighting
agent-memory evals knowledge-graph llm-memory memory multimodal python retrieval
Last synced: 08 Apr 2026
https://github.com/auraoneai/eval-conformance-suite
Executable rubric-spec v1 conformance checks and embeddable SVG badges.
ai-evaluation conformance evals rubric
Last synced: 28 May 2026
https://github.com/lavanyashukla/ai-starter-templates
Production-ready AI starter templates — agents, SDR outbound, RAG, evals, RLHF.
agents ai best-practices boilerplate evals examples llm production rag rlhf starter-template
Last synced: 20 Apr 2026
https://github.com/largonarco/eval
LLM system evaluations for a mock system
Last synced: 13 Feb 2026
https://github.com/urmzd/generative-artifact-protocol
Generative Artifact Protocol (GAP) — an open standard for token-efficient artifact updates and streaming. Rust apply engine + Python eval framework.
apply-engine artifacts diff evals gap generative-artifact-protocol llm open-standard protocol python rust sse streaming text-diff token-efficient wasm
Last synced: 19 Apr 2026
https://github.com/alucek/pii-masking-rlenv
RL Environment built using Verifiers for PII information masking
evals fine-tuning llms reinforcement-learning rl rl-environment
Last synced: 17 May 2026
https://github.com/auraoneai/open
Open tools for the human-judgment layer of AI evaluation: EvalKit (Python package + CLI), Robotics ReviewKit, and the Buying Toolkit.
ai-safety auraone evals evaluation human-feedback lerobot llm openx rlds robotics rubrics teleoperation
Last synced: 28 May 2026
https://github.com/auraoneai/rubric-pr-bot
Rubric diffs and lint feedback for pull requests that change evaluation criteria.
ai-evaluation evals github-app rubric
Last synced: 28 May 2026
https://github.com/auraoneai/judge-card
A disclosure format for judge prompts, calibration results, known bias, and recommended use envelopes.
ai-evaluation evals llm-as-judge model-card
Last synced: 28 May 2026
https://github.com/jancervenka/czech-simpleqa
How well can language models answer questions in Czech?
ai artificial-intelligence claude evals gpt language-model llm
Last synced: 11 Mar 2025
https://github.com/gafnts/agentic-kie-evals
Benchmarking agentic and single-pass extraction strategies across LLM providers on the Kleister NDA dataset
agentic-ai agentic-kie document-ai evals key-information-extraction kie langsmith
Last synced: 30 Apr 2026