Projects in Awesome Lists tagged with evaluations
A curated list of projects in awesome lists tagged with evaluations.
https://github.com/scale3-labs/langtrace
Langtrace 🔍 is an open-source, OpenTelemetry-based end-to-end observability tool for LLM applications, providing real-time tracing, evaluations, and metrics for popular LLMs, LLM frameworks, vector databases, and more. Integrate using TypeScript or Python. 🚀💻📊
ai datasets evaluations gpt langchain llm llm-framework llmops observability open-source open-telemetry openai prompt-engineering tracing
Last synced: 15 May 2025
https://github.com/Scale3-Labs/langtrace
Langtrace 🔍 is an open-source, OpenTelemetry-based end-to-end observability tool for LLM applications, providing real-time tracing, evaluations, and metrics for popular LLMs, LLM frameworks, vector databases, and more. Integrate using TypeScript or Python. 🚀💻📊
ai datasets evaluations gpt langchain llm llm-framework llmops observability open-source open-telemetry openai prompt-engineering tracing
Last synced: 30 Oct 2025
https://github.com/log10-io/log10
Python client library for improving your LLM app accuracy
agents ai anthropic artificial-intelligence autonomous-agents debugging evaluations feedback fine-tuning llmops llms logging monitoring openai python rlhf
Last synced: 11 Apr 2025
https://github.com/dreadnode/airtbench-code
Code Repository for: AIRTBench: Measuring Autonomous AI Red Teaming Capabilities in Language Models
agents ai ai-agents artificial-intelligence benchmark benchmark-datasets benchmarking ctf cyber-evals cybersecurity evaluations hacking llm offensive-security research security
Last synced: 30 Apr 2026
https://github.com/boxbeam/crunch
The fastest Java expression compiler/evaluator
evaluating-mathematical-expressions evaluations
Last synced: 06 Apr 2025
https://github.com/llm-evaluation-s-always-fatiguing/leaf-playground
A framework for building scenario simulation projects in which human and LLM-based agents can participate, with a user-friendly web UI to visualize the simulation and support for automatic evaluation at the agent-action level.
agent agent-based-simulation agents automation chatgpt evaluations llm-evaluation
Last synced: 02 Mar 2025
https://github.com/greynewell/mcpbr
Evaluate MCP servers with Model Context Protocol Benchmark Runner
ai-tools anthropic benchmarking benchmarks claude-code claude-code-plugin claude-code-skills cli cybergym evaluations llm-agents mcp-server swe-bench
Last synced: 12 Feb 2026
https://github.com/fwdai/reticle
Postman for AI - design, evaluate, and debug LLM interactions with full transparency.
agentic-ai ai ai-agents ai-testing ai-tool ai-tools desktop desktop-app developer-tools evaluations llm llm-tools prompt-engineering tauri
Last synced: 04 Apr 2026
https://github.com/yisaienkov/evaluations
This library implements various metrics (including Kaggle Competition, Medicine) for evaluating ML, DL, AI models, and algorithms. 📐📊📈📉📏
evaluations kaggle kaggle-competition metrics metrics-library pypi python python-library python3
Last synced: 13 Apr 2025
https://github.com/evaluation-context-protocol/ecp
ECP is a standardized interface for orchestrating, auditing, and enforcing authority limits in AI agent evaluations. It moves evaluation from "brittle Python scripts" to a deterministic infrastructure protocol.
evaluation-metrics evaluations llm-evaluation model-evaluation
Last synced: 25 Apr 2026
https://github.com/dynatrace-oss/dt-evals
AI evaluators CLI for your AI apps and Agents - Dynatrace AI Observability
agents ai evals evaluations llm-as-judge observability
Last synced: 14 May 2026
https://github.com/fkapsahili/entrag
EntRAG - Enterprise RAG Benchmark
benchmark dataset evaluation evaluations generative-ai knowledge-graph llm llm-evaluation rag rag-evaluation retrieval retrieval-augmented-generation
Last synced: 09 Mar 2026
https://github.com/bhadresh-laiya/program-evaluation.com
Do a program evaluation that really counts! It will help other students and will really make universities and colleges take students' experiences to heart!
blade-template built colleges counts evaluation evaluation-data evaluations laravel-framework laravel6 program students students-experiences universities using
Last synced: 09 Feb 2026
https://github.com/dreadnode/AIRTBench-Code
Code Repository for: AIRTBench: Measuring Autonomous AI Red Teaming Capabilities in Language Models
agents ai ai-agents artificial-intelligence benchmark benchmark-datasets benchmarking ctf cyber-evals cybersecurity evaluations hacking llm offensive-security research security
Last synced: 24 Jun 2025
https://github.com/itprodirect/claims-intelligence-foundation
Reusable claims AI foundation for schemas, prompts, evals, Tinker training, W&B tracking, and integration into claims-ops-workbench.
aws claims-ai claims-ops evaluations insurance insurance-ai machine-learning mlops prompt-engineering python tinker weights-and-biases
Last synced: 04 Apr 2026
https://github.com/mizcausevic-dev/ai-operations-console
React + TypeScript control plane for prompt operations, evaluations, model routing, guardrail incidents, and AI workflow governance
ai-operations control-plane evaluations frontend guardrails prompt-engineering react typescript vite
Last synced: 01 Jun 2026
https://github.com/jtmuller5/vibe-checker
The TypeScript LLM Evaluation File
ai devtools evals evaluation-metrics evaluations gemini gemini-api gemini-flash javascript llm nodejs testing typescript vitest
Last synced: 01 May 2026
https://github.com/parthapray/llm_evaluation_metrics_localized
This repo contains code for localized LLM evaluation metrics via a framework that uses Ollama, edge resources, and novel derived metrics.
evaluation evaluation-framework evaluation-metrics evaluations flask large-language-models metrics ollama-api restful-api
Last synced: 18 Apr 2026