An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with evaluation

A curated list of projects in awesome lists tagged with evaluation.

https://github.com/langfuse/langfuse

🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23

analytics autogen evaluation langchain large-language-models llama-index llm llm-evaluation llm-observability llmops monitoring observability open-source openai playground prompt-engineering prompt-management self-hosted ycombinator

Last synced: 13 May 2025
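
As a rough illustration of how tracing can look with the Langfuse Python SDK's observe decorator (a minimal sketch assuming the v2 SDK, with LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY and LANGFUSE_HOST set in the environment; the answer() function is illustrative):

    from langfuse.decorators import observe

    @observe()  # records a trace for this call and sends it to Langfuse
    def answer(question: str) -> str:
        # call your LLM of choice here; inputs, outputs and latency are captured
        return "42"

    answer("What is the meaning of life?")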

https://github.com/explodinggradients/ragas

Supercharge Your LLM Application Evaluations 🚀

evaluation llm llmops

Last synced: 02 Apr 2025
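
A minimal sketch of an evaluation run, assuming the evaluate()/metrics interface from the ragas quickstart (older 0.1-style API; an OpenAI key is expected in the environment and the single-row dataset is illustrative):

    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import faithfulness, answer_relevancy

    # toy evaluation set; column names follow the ragas quickstart
    data = Dataset.from_dict({
        "question": ["Where is the Eiffel Tower?"],
        "answer": ["The Eiffel Tower is in Paris."],
        "contexts": [["The Eiffel Tower is located in Paris, France."]],
    })

    result = evaluate(data, metrics=[faithfulness, answer_relevancy])
    print(result)  # per-metric scores for the dataset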

https://github.com/promptfoo/promptfoo

Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.

ci ci-cd cicd evaluation evaluation-framework llm llm-eval llm-evaluation llm-evaluation-framework llmops pentesting prompt-engineering prompt-testing prompts rag red-teaming testing vulnerability-scanners

Last synced: 14 Mar 2025

https://github.com/open-compass/opencompass

OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA 2, Qwen, GLM, Claude, etc.) across 100+ datasets.

benchmark chatgpt evaluation large-language-model llama2 llama3 llm openai

Last synced: 14 May 2025

https://github.com/marker-inc-korea/autorag

AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation

analysis automl benchmarking document-parser embeddings evaluation llm llm-evaluation llm-ops open-source ops optimization pipeline python qa rag rag-evaluation retrieval-augmented-generation

Last synced: 12 May 2025

https://github.com/knetic/govaluate

Arbitrary expression evaluation for golang

evaluation expression go parsing

Last synced: 25 Mar 2025

https://github.com/kiln-ai/kiln

The easiest tool for fine-tuning LLMs, generating synthetic data, and collaborating on datasets.

ai chain-of-thought collaboration dataset-generation evals evaluation fine-tuning machine-learning macos ml ollama openai prompt prompt-engineering python rlhf synthetic-data windows

Last synced: 23 Apr 2025

https://github.com/cluebenchmark/superclue

SuperCLUE: A comprehensive benchmark for general-purpose Chinese foundation models.

chatgpt chinese evaluation foundation-models gpt-4

Last synced: 15 May 2025

https://github.com/ianarawjo/chainforge

An open-source visual programming environment for battle-testing prompts to LLMs.

ai evaluation large-language-models llmops llms prompt-engineering

Last synced: 14 May 2025

https://github.com/evolvinglmms-lab/lmms-eval

Accelerating the development of large multimodal models (LMMs) with lmms-eval, a one-click evaluation module.

agi evaluation large-language-models multimodal

Last synced: 13 May 2025

https://github.com/uptrain-ai/uptrain

UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, embedding use-cases), perform root cause analysis on failure cases and give insights on how to resolve them.

autoevaluation evaluation experimentation hallucination-detection jailbreak-detection llm-eval llm-prompting llm-test llmops machine-learning monitoring openai-evals prompt-engineering root-cause-analysis

Last synced: 14 May 2025
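
A minimal sketch of running two preconfigured checks, assuming the EvalLLM/Evals interface shown in UpTrain's README (the data record and API key are illustrative placeholders):

    from uptrain import EvalLLM, Evals

    data = [{
        "question": "Which continent is France in?",
        "context": "France is a country in Western Europe.",
        "response": "France is in Europe.",
    }]

    eval_llm = EvalLLM(openai_api_key="sk-...")  # grading is done by an LLM under the hood
    results = eval_llm.evaluate(
        data=data,
        checks=[Evals.CONTEXT_RELEVANCE, Evals.FACTUAL_ACCURACY],
    )
    print(results)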

https://github.com/open-compass/vlmevalkit

Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks.

chatgpt claude clip computer-vision evaluation gemini gpt gpt-4v gpt4 large-language-models llava llm multi-modal openai openai-api pytorch qwen vit vqa

Last synced: 13 May 2025

https://github.com/huggingface/evaluate

🤗 Evaluate: A library for easily evaluating machine learning models and datasets.

evaluation machine-learning

Last synced: 14 May 2025
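
For example, loading and computing a metric typically looks like this (a small sketch using the accuracy metric from the Hub):

    import evaluate

    accuracy = evaluate.load("accuracy")  # fetches the metric script from the Hugging Face Hub
    print(accuracy.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 0]))
    # {'accuracy': 0.75}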

https://github.com/ContinualAI/avalanche

Avalanche: an End-to-End Library for Continual Learning based on PyTorch.

benchmarks continual-learning continualai deep-learning evaluation framework library lifelong-learning metrics pytorch strategies training

Last synced: 06 May 2025
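
A minimal sketch of a continual-learning loop, assuming a recent Avalanche release (SplitMNIST, SimpleMLP and the Naive strategy come from the library; hyperparameters are illustrative):

    import torch
    from avalanche.benchmarks.classic import SplitMNIST
    from avalanche.models import SimpleMLP
    from avalanche.training.supervised import Naive

    benchmark = SplitMNIST(n_experiences=5)   # MNIST split into 5 experiences
    model = SimpleMLP(num_classes=10)
    strategy = Naive(
        model,
        torch.optim.SGD(model.parameters(), lr=0.01),
        torch.nn.CrossEntropyLoss(),
        train_mb_size=32,
        train_epochs=1,
    )

    for experience in benchmark.train_stream:  # train sequentially on each experience
        strategy.train(experience)
        strategy.eval(benchmark.test_stream)   # evaluate on the full test stream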

https://github.com/Helicone/helicone

🧊 Open source LLM-Observability Platform for Developers. One-line integration for monitoring, metrics, evals, agent tracing, prompt management, playground, etc. Supports OpenAI SDK, Vercel AI SDK, Anthropic SDK, LiteLLM, LlamaIndex, LangChain, and more. 🍓 YC W23

agent-monitoring analytics evaluation gpt langchain large-language-models llama-index llm llm-cost llm-evaluation llm-observability llmops monitoring open-source openai playground prompt-engineering prompt-management ycombinator

Last synced: 31 Mar 2025
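
A minimal sketch of the one-line OpenAI integration: requests are routed through the Helicone proxy by overriding the base URL and adding an auth header (the URL and header name follow Helicone's docs, but treat them as assumptions to verify; the key is a placeholder):

    from openai import OpenAI

    client = OpenAI(
        base_url="https://oai.helicone.ai/v1",  # Helicone proxy in front of OpenAI
        default_headers={"Helicone-Auth": "Bearer <HELICONE_API_KEY>"},
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Hello"}],
    )
    print(resp.choices[0].message.content)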

https://github.com/langwatch/langwatch

The open LLM Ops platform - Traces, Analytics, Evaluations, Datasets and Prompt Optimization ✨

ai analytics datasets dspy evaluation gpt llm llmops low-code observability openai prompt-engineering

Last synced: 13 May 2025

https://github.com/lmnr-ai/lmnr

Laminar - open-source all-in-one platform for engineering AI products. Create a data flywheel for your AI app. Traces, Evals, Datasets, Labels. YC S24.

agents ai ai-observability aiops analytics developer-tools evals evaluation llm-evaluation llm-observability llm-workflow llmops monitoring observability open-source pipeline-builder rag rust-lang self-hosted

Last synced: 14 May 2025

https://github.com/Cloud-CV/EvalAI

Evaluating the state of the art in AI

ai ai-challenges angular7 angularjs artificial-intelligence challenge django docker evalai evaluation leaderboard machine-learning python reproducibility reproducible-research

Last synced: 26 Mar 2025

https://github.com/xinshuoweng/ab3dmot

(IROS 2020, ECCVW 2020) Official Python Implementation for "3D Multi-Object Tracking: A Baseline and New Evaluation Metrics"

2d-mot-evaluation 3d-mot 3d-multi 3d-multi-object-tracking 3d-tracking computer-vision evaluation evaluation-metrics kitti kitti-3d machine-learning multi-object-tracking real-time robotics tracking

Last synced: 15 May 2025

https://github.com/tatsu-lab/alpaca_eval

An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.

deep-learning evaluation foundation-models instruction-following large-language-models leaderboard nlp rlhf

Last synced: 13 May 2025

https://github.com/MLGroupJLU/LLM-eval-survey

The official GitHub page for the survey paper "A Survey on Evaluation of Large Language Models".

benchmark evaluation large-language-models llm llms model-assessment

Last synced: 04 Apr 2025

https://github.com/huggingface/lighteval

Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends

evaluation evaluation-framework evaluation-metrics huggingface

Last synced: 15 Apr 2025

https://github.com/lunary-ai/lunary

The production toolkit for LLMs. Observability, prompt management and evaluations.

ai evaluation hacktoberfest langchain llm logs monitoring observability openai prompts self-hosted testing

Last synced: 29 Apr 2025

https://github.com/google/fuzzbench

FuzzBench - Fuzzer benchmarking as a service.

benchmark-framework benchmarking evaluation fuzzing security

Last synced: 14 May 2025

https://github.com/modelscope/evalscope

A streamlined and customizable framework for efficient large model evaluation and performance benchmarking

evaluation llm performance rag vlm

Last synced: 14 May 2025

https://github.com/prometheus-eval/prometheus-eval

Evaluate your LLM's response with Prometheus and GPT4 💯

evaluation gpt4 litellm llm llm-as-a-judge llm-as-evaluator llmops python vllm

Last synced: 05 Apr 2025

https://github.com/prbonn/semantic-kitti-api

SemanticKITTI API for visualizing dataset, processing data, and evaluating results.

dataset deep-learning evaluation labels large-scale-dataset machine-learning semantic-scene-completion semantic-segmentation

Last synced: 15 May 2025

https://github.com/intellabs/rag-fit

Framework for enhancing LLMs for RAG tasks using fine-tuning.

evaluation fine-tuning information-retrieval llm nlp question-answering rag semantic-search

Last synced: 15 May 2025

https://github.com/CBLUEbenchmark/CBLUE

CBLUE (Chinese Biomedical Language Understanding Evaluation): a benchmark for Chinese medical information processing.

acl2022 benchmark biomedical-tasks chinese chineseblue corpus dataset evaluation

Last synced: 01 Apr 2025

https://github.com/dbolya/tide

A General Toolbox for Identifying Object Detection Errors

error-detection errors evaluation instance-segmentation object-detection toolbox

Last synced: 08 Apr 2025
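
For example, evaluating COCO-format detections might look like this (a sketch following the usage shown in the TIDE README; the results path is a placeholder):

    from tidecv import TIDE, datasets

    tide = TIDE()
    tide.evaluate(datasets.COCO(),                           # COCO ground truth
                  datasets.COCOResult("bbox_results.json"),  # your detections in COCO result format
                  mode=TIDE.BOX)
    tide.summarize()   # prints the dAP breakdown by error type (Cls, Loc, Dupe, Bkg, Miss, ...)
    tide.plot()        # saves summary plots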

https://github.com/tecnickcom/tcexam

TCExam is a CBA (Computer-Based Assessment) system (e-exam, CBT - Computer-Based Testing) for universities, schools, and companies that enables educators and trainers to author, schedule, deliver, and report on surveys, quizzes, tests, and exams.

cba cbt computer-based-assessment computer-based-testing e-exam essay evaluation exam mcma mcsa multiple-choice school tcexam testing university

Last synced: 15 May 2025

https://ucinlp.github.io/autoprompt/

AutoPrompt: Automatic Prompt Construction for Masked Language Models.

evaluation language-model nlp

Last synced: 15 May 2025

https://github.com/ucinlp/autoprompt

AutoPrompt: Automatic Prompt Construction for Masked Language Models.

evaluation language-model nlp

Last synced: 15 May 2025

https://github.com/langchain-ai/langsmith-sdk

LangSmith Client SDK Implementations

evaluation language-model observability

Last synced: 13 May 2025

https://github.com/caserec/CaseRecommender

Case Recommender: A Flexible and Extensible Python Framework for Recommender Systems

evaluation python ranking rating-prediction recommendation-system recommender-systems top-k

Last synced: 25 Nov 2024

https://github.com/zzzprojects/eval-expression.net

C# Eval Expression | Evaluate, Compile, and Execute C# code and expressions at runtime.

csharp dotnet eval eval-expression evaluation evaluator

Last synced: 15 May 2025

https://github.com/ModelTC/llmc

[EMNLP 2024 Industry Track] This is the official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit".

awq benchmark deployment evaluation internlm2 large-language-models lightllm llama3 llm lvlm mixtral omniquant post-training-quantization pruning quantization quarot smoothquant spinquant tool vllm

Last synced: 23 Apr 2025

https://github.com/thu-keg/evaluationpapers4chatgpt

Resource, Evaluation and Detection Papers for ChatGPT

chatgpt detection evaluation large-language-models resource

Last synced: 13 May 2025

https://github.com/danthedeckie/simpleeval

Simple Safe Sandboxed Extensible Expression Evaluator for Python

evaluation library python

Last synced: 29 Mar 2025
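
For example (a small sketch of the documented simple_eval helper, with custom names and functions):

    from simpleeval import simple_eval

    print(simple_eval("21 + 21"))                                           # 42
    print(simple_eval("x + y", names={"x": 20, "y": 22}))                   # 42
    print(simple_eval("square(5)", functions={"square": lambda n: n * n}))  # 25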

https://github.com/X-PLUG/CValues

Research on evaluating and aligning the values of Chinese large language models.

benchmark chinese-llms evaluation human-values llms multi-choice responsibility safety

Last synced: 09 May 2025

https://github.com/davidstutz/superpixel-benchmark

An extensive evaluation and comparison of 28 state-of-the-art superpixel algorithms on 5 datasets.

benchmark computer-vision evaluation image-procesing opencv superpixel-algorithms superpixels

Last synced: 05 Apr 2025

https://github.com/alipay/ant-application-security-testing-benchmark

The xAST evaluation benchmark makes security tools no longer a "black box".

application benchmark dast evaluation iast sast sca security testing

Last synced: 15 May 2025

https://github.com/audiolabs/webmushra

MUSHRA-compliant experiment software based on the Web Audio API

audio bs1534 evaluation js mushra

Last synced: 15 May 2025

https://github.com/microsoft/genaiops-promptflow-template

GenAIOps with Prompt Flow is a "GenAIOps template and guidance" to help you build LLM-infused apps using Prompt Flow. It offers a range of features including Centralized Code Hosting, Lifecycle Management, Variant and Hyperparameter Experimentation, A/B Deployment, and reporting for all runs and experiments.

aistudio azure azuremachinelearning cloud docker evaluation experimentation genai genaiops largelanguagemodels llm llmops machine-learning mlops mlops-template orchestration prompt promptengineering promptflow python

Last synced: 15 May 2025

https://github.com/cvangysel/pytrec_eval

pytrec_eval is an Information Retrieval evaluation tool for Python, based on the popular trec_eval.

evaluation information-retrieval

Last synced: 29 Apr 2025
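
A minimal sketch of scoring a run against relevance judgements (the qrel and run dictionaries are illustrative; metric names follow trec_eval):

    import pytrec_eval

    qrel = {"q1": {"d1": 1, "d2": 0}}                # query -> doc -> relevance label
    run = {"q1": {"d1": 1.2, "d2": 0.4, "d3": 0.1}}  # query -> doc -> system score

    evaluator = pytrec_eval.RelevanceEvaluator(qrel, {"map", "ndcg"})
    print(evaluator.evaluate(run))                   # per-query map and ndcg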

https://github.com/rentruewang/bocoel

Bayesian Optimization as a Coverage Tool for Evaluating LLMs. Accurate evaluation (benchmarking) that's 10 times faster with just a few lines of modular code.

bayesian-optimization benchmarking evaluation language-model llm machine-learning

Last synced: 05 Apr 2025

https://github.com/shmsw25/FActScore

A package to evaluate factuality of long-form generation. Original implementation of our EMNLP 2023 paper "FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation"

emnlp2023 evaluation factuality language language-modeling

Last synced: 02 Feb 2025
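
A rough sketch of how scoring might be invoked, assuming the FactScorer interface outlined in the project README (the key path, topics, and generations are illustrative):

    from factscore.factscorer import FactScorer

    fs = FactScorer(openai_key="api.key")  # path to a file containing an OpenAI key
    out = fs.get_score(
        topics=["Albert Einstein"],
        generations=["Albert Einstein was a theoretical physicist born in Ulm in 1879."],
    )
    print(out["score"])  # fraction of atomic facts judged supported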

https://github.com/belambert/asr-evaluation

Python module for evaluating ASR hypotheses (e.g. word error rate, word recognition rate).

asr error-rate evaluation speech-recognition

Last synced: 27 Nov 2024

https://github.com/Wscats/compile-hero

🔰 A Visual Studio Code extension for compiling languages.

automatic compile es6 evaluation gulp jade javascript json jsx less pug sass scss typescript

Last synced: 24 Mar 2025

https://github.com/clovaai/generative-evaluation-prdc

Code base for the precision, recall, density, and coverage metrics for generative models. ICML 2020.

deep-learning diversity evaluation evaluation-metrics fidelity generative-adversarial-network generative-model icml icml-2020 icml2020 machine-learning precision recall

Last synced: 09 Apr 2025
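
For example (a sketch following the compute_prdc usage in the README, with random features standing in for real embeddings):

    import numpy as np
    from prdc import compute_prdc

    real_features = np.random.normal(size=(1000, 64))
    fake_features = np.random.normal(loc=0.1, size=(1000, 64))

    metrics = compute_prdc(real_features=real_features,
                           fake_features=fake_features,
                           nearest_k=5)
    print(metrics)  # {'precision': ..., 'recall': ..., 'density': ..., 'coverage': ...}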

https://github.com/microsoft/rag-experiment-accelerator

The RAG Experiment Accelerator is a versatile tool designed to streamline experiments and evaluations with Azure Cognitive Search and the RAG pattern.

acs azure chunking dense embedding evaluation experiment genai indexing information-retrieval llm openai rag sparse vectors

Last synced: 16 May 2025

https://github.com/evfro/polara

Recommender system and evaluation framework for top-n recommendation tasks that respects the polarity of feedback. Fast, flexible, and easy to use. Written in Python on top of the scientific Python stack.

collaborative-filtering evaluation matrix-factorization recommender-system tensor-factorization top-n-recommendations

Last synced: 04 Apr 2025

https://github.com/devmount/germanwordembeddings

Toolkit to obtain and preprocess German text corpora, train models, and evaluate them with generated test sets. Built with Gensim and TensorFlow.

deep-learning deep-neural-networks evaluation gensim german-language model natural-language-processing neural-network nlp training word-embeddings word2vec

Last synced: 06 Apr 2025