An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with evaluation

A curated list of projects in awesome lists tagged with evaluation.

https://github.com/langfuse/langfuse

🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23

analytics autogen evaluation langchain large-language-models llama-index llm llm-evaluation llm-observability llmops monitoring observability open-source openai playground prompt-engineering prompt-management self-hosted ycombinator

Last synced: 13 May 2025
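
As a rough illustration of how tracing can look with the Langfuse Python SDK's observe decorator (a minimal sketch assuming the v2 SDK, with LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY and LANGFUSE_HOST set in the environment; the answer() function is illustrative):

    from langfuse.decorators import observe

    @observe()  # records a trace for this call and sends it to Langfuse
    def answer(question: str) -> str:
        # call your LLM of choice here; inputs, outputs and latency are captured
        return "42"

    answer("What is the meaning of life?")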

https://github.com/explodinggradients/ragas

Supercharge Your LLM Application Evaluations 🚀

evaluation llm llmops

Last synced: 02 Apr 2025
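
A minimal sketch of an evaluation run, assuming the evaluate()/metrics interface from the ragas quickstart (older 0.1-style API; an OpenAI key is expected in the environment and the single-row dataset is illustrative):

    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import faithfulness, answer_relevancy

    # toy evaluation set; column names follow the ragas quickstart
    data = Dataset.from_dict({
        "question": ["Where is the Eiffel Tower?"],
        "answer": ["The Eiffel Tower is in Paris."],
        "contexts": [["The Eiffel Tower is located in Paris, France."]],
    })

    result = evaluate(data, metrics=[faithfulness, answer_relevancy])
    print(result)  # per-metric scores for the dataset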

https://github.com/promptfoo/promptfoo

Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.

ci ci-cd cicd evaluation evaluation-framework llm llm-eval llm-evaluation llm-evaluation-framework llmops pentesting prompt-engineering prompt-testing prompts rag red-teaming testing vulnerability-scanners

Last synced: 14 Mar 2025

https://github.com/open-compass/opencompass

OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA 2, Qwen, GLM, Claude, etc.) across 100+ datasets.

benchmark chatgpt evaluation large-language-model llama2 llama3 llm openai

Last synced: 14 May 2025

https://github.com/marker-inc-korea/autorag

AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation

analysis automl benchmarking document-parser embeddings evaluation llm llm-evaluation llm-ops open-source ops optimization pipeline python qa rag rag-evaluation retrieval-augmented-generation

Last synced: 12 May 2025

https://github.com/knetic/govaluate

Arbitrary expression evaluation for golang

evaluation expression go parsing

Last synced: 25 Mar 2025

https://github.com/kiln-ai/kiln

The easiest tool for fine-tuning LLMs, generating synthetic data, and collaborating on datasets.

ai chain-of-thought collaboration dataset-generation evals evaluation fine-tuning machine-learning macos ml ollama openai prompt prompt-engineering python rlhf synthetic-data windows

Last synced: 23 Apr 2025

https://github.com/cluebenchmark/superclue

SuperCLUE: A comprehensive benchmark for general-purpose Chinese foundation models.

chatgpt chinese evaluation foundation-models gpt-4

Last synced: 15 May 2025

https://github.com/ianarawjo/chainforge

An open-source visual programming environment for battle-testing prompts to LLMs.

ai evaluation large-language-models llmops llms prompt-engineering

Last synced: 14 May 2025

https://github.com/evolvinglmms-lab/lmms-eval

Accelerating the development of large multimodal models (LMMs) with lmms-eval, a one-click evaluation module.

agi evaluation large-language-models multimodal

Last synced: 13 May 2025

https://github.com/uptrain-ai/uptrain

UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, embedding use-cases), perform root cause analysis on failure cases and give insights on how to resolve them.

autoevaluation evaluation experimentation hallucination-detection jailbreak-detection llm-eval llm-prompting llm-test llmops machine-learning monitoring openai-evals prompt-engineering root-cause-analysis

Last synced: 14 May 2025
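
A minimal sketch of running two preconfigured checks, assuming the EvalLLM/Evals interface shown in UpTrain's README (the data record and API key are illustrative placeholders):

    from uptrain import EvalLLM, Evals

    data = [{
        "question": "Which continent is France in?",
        "context": "France is a country in Western Europe.",
        "response": "France is in Europe.",
    }]

    eval_llm = EvalLLM(openai_api_key="sk-...")  # grading is done by an LLM under the hood
    results = eval_llm.evaluate(
        data=data,
        checks=[Evals.CONTEXT_RELEVANCE, Evals.FACTUAL_ACCURACY],
    )
    print(results)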

https://github.com/open-compass/vlmevalkit

Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks.

chatgpt claude clip computer-vision evaluation gemini gpt gpt-4v gpt4 large-language-models llava llm multi-modal openai openai-api pytorch qwen vit vqa

Last synced: 13 May 2025

https://github.com/huggingface/evaluate

🤗 Evaluate: A library for easily evaluating machine learning models and datasets.

evaluation machine-learning

Last synced: 14 May 2025
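
For example, loading and computing a metric typically looks like this (a small sketch using the accuracy metric from the Hub):

    import evaluate

    accuracy = evaluate.load("accuracy")  # fetches the metric script from the Hugging Face Hub
    print(accuracy.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 0]))
    # {'accuracy': 0.75}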

https://github.com/ContinualAI/avalanche

Avalanche: an End-to-End Library for Continual Learning based on PyTorch.

benchmarks continual-learning continualai deep-learning evaluation framework library lifelong-learning metrics pytorch strategies training

Last synced: 06 May 2025
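
A minimal sketch of a continual-learning loop, assuming a recent Avalanche release (SplitMNIST, SimpleMLP and the Naive strategy come from the library; hyperparameters are illustrative):

    import torch
    from avalanche.benchmarks.classic import SplitMNIST
    from avalanche.models import SimpleMLP
    from avalanche.training.supervised import Naive

    benchmark = SplitMNIST(n_experiences=5)   # MNIST split into 5 experiences
    model = SimpleMLP(num_classes=10)
    strategy = Naive(
        model,
        torch.optim.SGD(model.parameters(), lr=0.01),
        torch.nn.CrossEntropyLoss(),
        train_mb_size=32,
        train_epochs=1,
    )

    for experience in benchmark.train_stream:  # train sequentially on each experience
        strategy.train(experience)
        strategy.eval(benchmark.test_stream)   # evaluate on the full test stream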

https://github.com/Helicone/helicone

🧊 Open source LLM-Observability Platform for Developers. One-line integration for monitoring, metrics, evals, agent tracing, prompt management, playground, etc. Supports OpenAI SDK, Vercel AI SDK, Anthropic SDK, LiteLLM, LlamaIndex, LangChain, and more. 🍓 YC W23

agent-monitoring analytics evaluation gpt langchain large-language-models llama-index llm llm-cost llm-evaluation llm-observability llmops monitoring open-source openai playground prompt-engineering prompt-management ycombinator

Last synced: 31 Mar 2025
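
A minimal sketch of the one-line OpenAI integration: requests are routed through the Helicone proxy by overriding the base URL and adding an auth header (the URL and header name follow Helicone's docs, but treat them as assumptions to verify; the key is a placeholder):

    from openai import OpenAI

    client = OpenAI(
        base_url="https://oai.helicone.ai/v1",  # Helicone proxy in front of OpenAI
        default_headers={"Helicone-Auth": "Bearer <HELICONE_API_KEY>"},
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Hello"}],
    )
    print(resp.choices[0].message.content)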

https://github.com/langwatch/langwatch

The open LLM Ops platform - Traces, Analytics, Evaluations, Datasets and Prompt Optimization ✨

ai analytics datasets dspy evaluation gpt llm llmops low-code observability openai prompt-engineering

Last synced: 13 May 2025

https://github.com/lmnr-ai/lmnr

Laminar - open-source all-in-one platform for engineering AI products. Create a data flywheel for your AI app. Traces, Evals, Datasets, Labels. YC S24.

agents ai ai-observability aiops analytics developer-tools evals evaluation llm-evaluation llm-observability llm-workflow llmops monitoring observability open-source pipeline-builder rag rust-lang self-hosted

Last synced: 14 May 2025

https://github.com/Cloud-CV/EvalAI

Evaluating the state of the art in AI

ai ai-challenges angular7 angularjs artificial-intelligence challenge django docker evalai evaluation leaderboard machine-learning python reproducibility reproducible-research

Last synced: 26 Mar 2025

https://github.com/xinshuoweng/ab3dmot

(IROS 2020, ECCVW 2020) Official Python Implementation for "3D Multi-Object Tracking: A Baseline and New Evaluation Metrics"

2d-mot-evaluation 3d-mot 3d-multi 3d-multi-object-tracking 3d-tracking computer-vision evaluation evaluation-metrics kitti kitti-3d machine-learning multi-object-tracking real-time robotics tracking

Last synced: 15 May 2025

https://github.com/tatsu-lab/alpaca_eval

An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.

deep-learning evaluation foundation-models instruction-following large-language-models leaderboard nlp rlhf

Last synced: 13 May 2025

https://github.com/MLGroupJLU/LLM-eval-survey

The official GitHub page for the survey paper "A Survey on Evaluation of Large Language Models".

benchmark evaluation large-language-models llm llms model-assessment

Last synced: 04 Apr 2025

https://github.com/huggingface/lighteval

Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends

evaluation evaluation-framework evaluation-metrics huggingface

Last synced: 15 Apr 2025

https://github.com/lunary-ai/lunary

The production toolkit for LLMs. Observability, prompt management and evaluations.

ai evaluation hacktoberfest langchain llm logs monitoring observability openai prompts self-hosted testing

Last synced: 29 Apr 2025

https://github.com/google/fuzzbench

FuzzBench - Fuzzer benchmarking as a service.

benchmark-framework benchmarking evaluation fuzzing security

Last synced: 14 May 2025

https://github.com/modelscope/evalscope

A streamlined and customizable framework for efficient large model evaluation and performance benchmarking

evaluation llm performance rag vlm

Last synced: 14 May 2025

https://github.com/prometheus-eval/prometheus-eval

Evaluate your LLM's response with Prometheus and GPT4 💯

evaluation gpt4 litellm llm llm-as-a-judge llm-as-evaluator llmops python vllm

Last synced: 05 Apr 2025

https://github.com/prbonn/semantic-kitti-api

SemanticKITTI API for visualizing dataset, processing data, and evaluating results.

dataset deep-learning evaluation labels large-scale-dataset machine-learning semantic-scene-completion semantic-segmentation

Last synced: 15 May 2025

https://github.com/intellabs/rag-fit

Framework for enhancing LLMs for RAG tasks using fine-tuning.

evaluation fine-tuning information-retrieval llm nlp question-answering rag semantic-search

Last synced: 15 May 2025

https://github.com/CBLUEbenchmark/CBLUE

CBLUE (Chinese Biomedical Language Understanding Evaluation): a benchmark for Chinese medical information processing.

acl2022 benchmark biomedical-tasks chinese chineseblue corpus dataset evaluation

Last synced: 01 Apr 2025

https://github.com/dbolya/tide

A General Toolbox for Identifying Object Detection Errors

error-detection errors evaluation instance-segmentation object-detection toolbox

Last synced: 08 Apr 2025
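
For example, evaluating COCO-format detections might look like this (a sketch following the usage shown in the TIDE README; the results path is a placeholder):

    from tidecv import TIDE, datasets

    tide = TIDE()
    tide.evaluate(datasets.COCO(),                           # COCO ground truth
                  datasets.COCOResult("bbox_results.json"),  # your detections in COCO result format
                  mode=TIDE.BOX)
    tide.summarize()   # prints the dAP breakdown by error type (Cls, Loc, Dupe, Bkg, Miss, ...)
    tide.plot()        # saves summary plots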

https://github.com/tecnickcom/tcexam

TCExam is a CBA (Computer-Based Assessment) system (e-exam, CBT - Computer-Based Testing) for universities, schools, and companies that enables educators and trainers to author, schedule, deliver, and report on surveys, quizzes, tests, and exams.

cba cbt computer-based-assessment computer-based-testing e-exam essay evaluation exam mcma mcsa multiple-choice school tcexam testing university

Last synced: 15 May 2025

https://ucinlp.github.io/autoprompt/

AutoPrompt: Automatic Prompt Construction for Masked Language Models.

evaluation language-model nlp

Last synced: 15 May 2025

https://github.com/ucinlp/autoprompt

AutoPrompt: Automatic Prompt Construction for Masked Language Models.

evaluation language-model nlp

Last synced: 15 May 2025

https://github.com/langchain-ai/langsmith-sdk

LangSmith Client SDK Implementations

evaluation language-model observability

Last synced: 13 May 2025

https://github.com/caserec/CaseRecommender

Case Recommender: A Flexible and Extensible Python Framework for Recommender Systems

evaluation python ranking rating-prediction recommendation-system recommender-systems top-k

Last synced: 25 Nov 2024

https://github.com/zzzprojects/eval-expression.net

C# Eval Expression | Evaluate, Compile, and Execute C# code and expressions at runtime.

csharp dotnet eval eval-expression evaluation evaluator

Last synced: 15 May 2025

https://github.com/ModelTC/llmc

[EMNLP 2024 Industry Track] This is the official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit".

awq benchmark deployment evaluation internlm2 large-language-models lightllm llama3 llm lvlm mixtral omniquant post-training-quantization pruning quantization quarot smoothquant spinquant tool vllm

Last synced: 23 Apr 2025

https://github.com/thu-keg/evaluationpapers4chatgpt

Resource, Evaluation and Detection Papers for ChatGPT

chatgpt detection evaluation large-language-models resource

Last synced: 13 May 2025

https://github.com/danthedeckie/simpleeval

Simple Safe Sandboxed Extensible Expression Evaluator for Python

evaluation library python

Last synced: 29 Mar 2025
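
For example (a small sketch of the documented simple_eval helper, with custom names and functions):

    from simpleeval import simple_eval

    print(simple_eval("21 + 21"))                                           # 42
    print(simple_eval("x + y", names={"x": 20, "y": 22}))                   # 42
    print(simple_eval("square(5)", functions={"square": lambda n: n * n}))  # 25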

https://github.com/X-PLUG/CValues

Research on evaluating and aligning the values of Chinese large language models.

benchmark chinese-llms evaluation human-values llms multi-choice responsibility safety

Last synced: 09 May 2025

https://github.com/davidstutz/superpixel-benchmark

An extensive evaluation and comparison of 28 state-of-the-art superpixel algorithms on 5 datasets.

benchmark computer-vision evaluation image-procesing opencv superpixel-algorithms superpixels

Last synced: 05 Apr 2025

https://github.com/alipay/ant-application-security-testing-benchmark

The xAST evaluation benchmark makes security tools no longer a "black box".

application benchmark dast evaluation iast sast sca security testing

Last synced: 15 May 2025

https://github.com/audiolabs/webmushra

MUSHRA-compliant experiment software based on the Web Audio API

audio bs1534 evaluation js mushra

Last synced: 15 May 2025

https://github.com/microsoft/genaiops-promptflow-template

GenAIOps with Prompt Flow is a "GenAIOps template and guidance" to help you build LLM-infused apps using Prompt Flow. It offers a range of features including Centralized Code Hosting, Lifecycle Management, Variant and Hyperparameter Experimentation, A/B Deployment, and reporting for all runs and experiments.

aistudio azure azuremachinelearning cloud docker evaluation experimentation genai genaiops largelanguagemodels llm llmops machine-learning mlops mlops-template orchestration prompt promptengineering promptflow python

Last synced: 15 May 2025

https://github.com/cvangysel/pytrec_eval

pytrec_eval is an Information Retrieval evaluation tool for Python, based on the popular trec_eval.

evaluation information-retrieval

Last synced: 29 Apr 2025
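
A minimal sketch of scoring a run against relevance judgements (the qrel and run dictionaries are illustrative; metric names follow trec_eval):

    import pytrec_eval

    qrel = {"q1": {"d1": 1, "d2": 0}}                # query -> doc -> relevance label
    run = {"q1": {"d1": 1.2, "d2": 0.4, "d3": 0.1}}  # query -> doc -> system score

    evaluator = pytrec_eval.RelevanceEvaluator(qrel, {"map", "ndcg"})
    print(evaluator.evaluate(run))                   # per-query map and ndcg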

https://github.com/rentruewang/bocoel

Bayesian Optimization as a Coverage Tool for Evaluating LLMs. Accurate evaluation (benchmarking) that's 10 times faster with just a few lines of modular code.

bayesian-optimization benchmarking evaluation language-model llm machine-learning

Last synced: 05 Apr 2025

https://github.com/shmsw25/FActScore

A package to evaluate factuality of long-form generation. Original implementation of our EMNLP 2023 paper "FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation"

emnlp2023 evaluation factuality language language-modeling

Last synced: 02 Feb 2025
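
A rough sketch of how scoring might be invoked, assuming the FactScorer interface outlined in the project README (the key path, topics, and generations are illustrative):

    from factscore.factscorer import FactScorer

    fs = FactScorer(openai_key="api.key")  # path to a file containing an OpenAI key
    out = fs.get_score(
        topics=["Albert Einstein"],
        generations=["Albert Einstein was a theoretical physicist born in Ulm in 1879."],
    )
    print(out["score"])  # fraction of atomic facts judged supported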

https://github.com/belambert/asr-evaluation

Python module for evaluating ASR hypotheses (e.g. word error rate, word recognition rate).

asr error-rate evaluation speech-recognition

Last synced: 27 Nov 2024

https://github.com/Wscats/compile-hero

🔰 A Visual Studio Code extension for compiling languages.

automatic compile es6 evaluation gulp jade javascript json jsx less pug sass scss typescript

Last synced: 24 Mar 2025

https://github.com/clovaai/generative-evaluation-prdc

Code base for the precision, recall, density, and coverage metrics for generative models. ICML 2020.

deep-learning diversity evaluation evaluation-metrics fidelity generative-adversarial-network generative-model icml icml-2020 icml2020 machine-learning precision recall

Last synced: 09 Apr 2025
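
For example (a sketch following the compute_prdc usage in the README, with random features standing in for real embeddings):

    import numpy as np
    from prdc import compute_prdc

    real_features = np.random.normal(size=(1000, 64))
    fake_features = np.random.normal(loc=0.1, size=(1000, 64))

    metrics = compute_prdc(real_features=real_features,
                           fake_features=fake_features,
                           nearest_k=5)
    print(metrics)  # {'precision': ..., 'recall': ..., 'density': ..., 'coverage': ...}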

https://github.com/microsoft/rag-experiment-accelerator

The RAG Experiment Accelerator is a versatile tool designed to streamline experiments and evaluations with Azure Cognitive Search and the RAG pattern.

acs azure chunking dense embedding evaluation experiment genai indexing information-retrieval llm openai rag sparse vectors

Last synced: 16 May 2025

https://github.com/evfro/polara

Recommender system and evaluation framework for top-n recommendation tasks that respects the polarity of feedback. Fast, flexible, and easy to use. Written in Python on top of the scientific Python stack.

collaborative-filtering evaluation matrix-factorization recommender-system tensor-factorization top-n-recommendations

Last synced: 04 Apr 2025

https://github.com/devmount/germanwordembeddings

Toolkit to obtain and preprocess German text corpora, train models, and evaluate them with generated test sets. Built with Gensim and TensorFlow.

deep-learning deep-neural-networks evaluation gensim german-language model natural-language-processing neural-network nlp training word-embeddings word2vec

Last synced: 06 Apr 2025