Projects in Awesome Lists tagged with evaluation
A curated list of projects in awesome lists tagged with evaluation.
https://github.com/langfuse/langfuse
🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
analytics autogen evaluation langchain large-language-models llama-index llm llm-evaluation llm-observability llmops monitoring observability open-source openai playground prompt-engineering prompt-management self-hosted ycombinator
Last synced: 13 May 2025
https://github.com/explodinggradients/ragas
Supercharge Your LLM Application Evaluations 🚀
Last synced: 02 Apr 2025
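As a flavor of the API, here is a minimal sketch of a ragas evaluation run. It assumes a ragas 0.1-era API and an OPENAI_API_KEY for the default judge model; metric names and dataset columns have shifted across versions, so treat this as illustrative rather than definitive.

```python
# Minimal sketch, assuming ragas ~0.1 and OPENAI_API_KEY set for the default judge.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# A toy dataset with the columns these metrics expect.
data = Dataset.from_dict({
    "question": ["When was the Eiffel Tower built?"],
    "answer": ["It was completed in 1889."],
    "contexts": [["The Eiffel Tower was completed in 1889 for the World's Fair."]],
})

result = evaluate(data, metrics=[faithfulness, answer_relevancy])
print(result)  # per-metric scores for the dataset
```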
https://github.com/promptfoo/promptfoo
Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
ci ci-cd cicd evaluation evaluation-framework llm llm-eval llm-evaluation llm-evaluation-framework llmops pentesting prompt-engineering prompt-testing prompts rag red-teaming testing vulnerability-scanners
Last synced: 14 Mar 2025
https://github.com/open-compass/opencompass
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.
benchmark chatgpt evaluation large-language-model llama2 llama3 llm openai
Last synced: 14 May 2025
https://github.com/marker-inc-korea/autorag
AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation
analysis automl benchmarking document-parser embeddings evaluation llm llm-evaluation llm-ops open-source ops optimization pipeline python qa rag rag-evaluation retrieval-augmented-generation
Last synced: 12 May 2025
https://github.com/knetic/govaluate
Arbitrary expression evaluation for golang
evaluation expression go parsing
Last synced: 25 Mar 2025
https://github.com/michaelgrupp/evo
Python package for the evaluation of odometry and SLAM
benchmark euroc evaluation kitti mapping metrics odometry robotics ros ros2 slam trajectory trajectory-analysis trajectory-evaluation tum
Last synced: 16 May 2025
https://github.com/helicone/helicone
🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓
agent-monitoring analytics evaluation gpt langchain large-language-models llama-index llm llm-cost llm-evaluation llm-observability llmops monitoring open-source openai playground prompt-engineering prompt-management ycombinator
Last synced: 08 May 2025
https://github.com/kiln-ai/kiln
The easiest tool for fine-tuning LLMs, generating synthetic data, and collaborating on datasets.
ai chain-of-thought collaboration dataset-generation evals evaluation fine-tuning machine-learning macos ml ollama openai prompt prompt-engineering python rlhf synthetic-data windows
Last synced: 23 Apr 2025
https://github.com/sdiehl/write-you-a-haskell
Building a modern functional compiler from first principles. (http://dev.stephendiehl.com/fun/)
book compiler evaluation functional-language functional-programming haskel hindley-milner intermediate-representation lambda-calculus pdf-book type type-checking type-inference type-system type-theory
Last synced: 15 May 2025
https://github.com/cluebenchmark/superclue
SuperCLUE: A Comprehensive Benchmark for General-Purpose Foundation Models in Chinese
chatgpt chinese evaluation foundation-models gpt-4
Last synced: 15 May 2025
https://github.com/viebel/klipse
Klipse is a JavaScript plugin for embedding interactive code snippets in tech blogs.
brainfuck clojure clojurescript code-evaluation codemirror-editor common-lisp evaluation interactive-snippets javascript klipse-plugin lua ocaml prolog python react reactjs reasonml ruby scheme
Last synced: 14 May 2025
https://github.com/zzw922cn/Automatic_Speech_Recognition
End-to-end Automatic Speech Recognition for Mandarin and English in TensorFlow
audio automatic-speech-recognition chinese-speech-recognition cnn data-preprocessing deep-learning end-to-end evaluation feature-vector layer-normalization lstm paper phonemes rnn rnn-encoder-decoder speech-recognition tensorflow timit-dataset
Last synced: 02 Apr 2025
https://github.com/microsoft/promptbench
A unified evaluation framework for large language models
adversarial-attacks benchmark chatgpt evaluation large-language-models prompt prompt-engineering robustness
Last synced: 13 May 2025
https://github.com/ianarawjo/chainforge
An open-source visual programming environment for battle-testing prompts to LLMs.
ai evaluation large-language-models llmops llms prompt-engineering
Last synced: 14 May 2025
https://github.com/evolvinglmms-lab/lmms-eval
Accelerating the development of large multimodal models (LMMs) with lmms-eval, a one-click evaluation module.
agi evaluation large-language-models multimodal
Last synced: 13 May 2025
https://github.com/uptrain-ai/uptrain
UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, embedding use-cases), perform root cause analysis on failure cases and give insights on how to resolve them.
autoevaluation evaluation experimentation hallucination-detection jailbreak-detection llm-eval llm-prompting llm-test llmops machine-learning monitoring openai-evals prompt-engineering root-cause-analysis
Last synced: 14 May 2025
https://github.com/open-compass/vlmevalkit
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks.
chatgpt claude clip computer-vision evaluation gemini gpt gpt-4v gpt4 large-language-models llava llm multi-modal openai openai-api pytorch qwen vit vqa
Last synced: 13 May 2025
https://github.com/huggingface/evaluate
🤗 Evaluate: A library for easily evaluating machine learning models and datasets.
Last synced: 14 May 2025
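The library follows a load-then-compute pattern; a minimal sketch:

```python
import evaluate

# Load a metric by name from the Hub, then score predictions against references.
accuracy = evaluate.load("accuracy")
results = accuracy.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 0])
print(results)  # {'accuracy': 0.75}
```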
https://github.com/ContinualAI/avalanche
Avalanche: an End-to-End Library for Continual Learning based on PyTorch.
benchmarks continual-learning continualai deep-learning evaluation framework library lifelong-learning metrics pytorch strategies training
Last synced: 06 May 2025
https://github.com/cloud-cv/evalai
☁️ 🚀 📊 📈 Evaluating state of the art in AI
ai ai-challenges angularjs artificial-intelligence challenge codecov coveralls django docker evalai evaluation leaderboard machine-learning python reproducibility reproducible-research travis-ci
Last synced: 14 May 2025
https://github.com/langwatch/langwatch
The open LLM Ops platform - Traces, Analytics, Evaluations, Datasets and Prompt Optimization ✨
ai analytics datasets dspy evaluation gpt llm llmops low-code observability openai prompt-engineering
Last synced: 13 May 2025
https://github.com/lmnr-ai/lmnr
Laminar - open-source all-in-one platform for engineering AI products. Create a data flywheel for your AI app. Traces, Evals, Datasets, Labels. YC S24.
agents ai ai-observability aiops analytics developer-tools evals evaluation llm-evaluation llm-observability llm-workflow llmops monitoring observability open-source pipeline-builder rag rust-lang self-hosted
Last synced: 14 May 2025
https://github.com/xinshuoweng/ab3dmot
(IROS 2020, ECCVW 2020) Official Python Implementation for "3D Multi-Object Tracking: A Baseline and New Evaluation Metrics"
2d-mot-evaluation 3d-mot 3d-multi 3d-multi-object-tracking 3d-tracking computer-vision evaluation evaluation-metrics kitti kitti-3d machine-learning multi-object-tracking real-time robotics tracking
Last synced: 15 May 2025
https://github.com/tatsu-lab/alpaca_eval
An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
deep-learning evaluation foundation-models instruction-following large-language-models leaderboard nlp rlhf
Last synced: 13 May 2025
https://github.com/MLGroupJLU/LLM-eval-survey
The official GitHub page for the survey paper "A Survey on Evaluation of Large Language Models".
benchmark evaluation large-language-models llm llms model-assessment
Last synced: 04 Apr 2025
https://github.com/sepandhaghighi/pycm
Multi-class confusion matrix library in Python
accuracy ai artificial-intelligence classification confusion-matrix data data-analysis data-mining data-science deep-learning deeplearning evaluation machine-learning mathematics matrix ml multiclass-classification neural-network statistical-analysis statistics
Last synced: 13 May 2025
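Basic usage is a two-vector constructor, as in this short sketch (mirroring the example in the project's README):

```python
from pycm import ConfusionMatrix

actual = [2, 0, 2, 2, 0, 1]
predicted = [0, 0, 2, 2, 0, 2]

# Build the confusion matrix directly from label vectors.
cm = ConfusionMatrix(actual_vector=actual, predict_vector=predicted)
cm.print_matrix()      # class-by-class confusion matrix
print(cm.Overall_ACC)  # overall accuracy: 4 of 6 correct here
```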
https://github.com/huggingface/lighteval
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
evaluation evaluation-framework evaluation-metrics huggingface
Last synced: 15 Apr 2025
https://github.com/maluuba/nlg-eval
Evaluation code for various unsupervised automated metrics for Natural Language Generation.
bleu bleu-score cider dialog dialogue evaluation machine-translation meteor natural-language-generation natural-language-processing nlg nlp rouge rouge-l skip-thought-vectors skip-thoughts task-oriented-dialogue
Last synced: 15 May 2025
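A sketch of the intended usage, assuming the metric resources have been downloaded via the project's setup step:

```python
from nlgeval import NLGEval

nlgeval = NLGEval()  # loads metric resources on first use

# Score one hypothesis against a list of reference strings.
metrics = nlgeval.compute_individual_metrics(
    ["the cat sat on the mat"],      # references
    "a cat was sitting on the mat",  # hypothesis
)
print(metrics)  # BLEU, METEOR, ROUGE-L, CIDEr, ...
```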
https://github.com/lunary-ai/lunary
The production toolkit for LLMs. Observability, prompt management and evaluations.
ai evaluation hacktoberfest langchain llm logs monitoring observability openai prompts self-hosted testing
Last synced: 29 Apr 2025
https://github.com/abo-abo/lispy
Short and sweet LISP editing
clojure common-lisp emacs-lisp evaluation navigation python refactoring scheme
Last synced: 15 May 2025
https://github.com/ethicalml/xai
XAI - An eXplainability toolbox for machine learning
ai artificial-intelligence bias bias-evaluation downsampling evaluation explainability explainable-ai explainable-ml feature-importance imbalance interpretability machine-learning machine-learning-explainability ml upsampling xai xai-library
Last synced: 15 May 2025
https://github.com/google/fuzzbench
FuzzBench - Fuzzer benchmarking as a service.
benchmark-framework benchmarking evaluation fuzzing security
Last synced: 14 May 2025
https://github.com/toshas/torch-fidelity
High-fidelity performance metrics for generative models in PyTorch
evaluation frechet-inception-distance gan generative-model inception-score kernel-inception-distance metrics perceptual-path-length precision pytorch reproducibility reproducible-research
Last synced: 14 May 2025
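The package exposes a single high-level entry point; a sketch with placeholder directory paths:

```python
import torch_fidelity

# Compare a directory of generated samples against a reference set.
metrics = torch_fidelity.calculate_metrics(
    input1="generated_images/",  # placeholder path
    input2="real_images/",       # placeholder path
    isc=True,  # Inception Score
    fid=True,  # Frechet Inception Distance
    kid=True,  # Kernel Inception Distance
)
print(metrics)
```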
https://github.com/Xnhyacinth/Awesome-LLM-Long-Context-Modeling
📰 Must-read papers and blogs on LLM based Long Context Modeling 🔥
agent awsome-list benchmark blogs compress evaluation large-language-models length-extrapolation llm long-context-modeling long-term-memory papers rag ssm survey transformer
Last synced: 05 Dec 2024
https://github.com/modelscope/evalscope
A streamlined and customizable framework for efficient large model evaluation and performance benchmarking
evaluation llm performance rag vlm
Last synced: 14 May 2025
https://github.com/prometheus-eval/prometheus-eval
Evaluate your LLM's response with Prometheus and GPT4 💯
evaluation gpt4 litellm llm llm-as-a-judge llm-as-evaluator llmops python vllm
Last synced: 05 Apr 2025
https://github.com/prbonn/semantic-kitti-api
SemanticKITTI API for visualizing dataset, processing data, and evaluating results.
dataset deep-learning evaluation labels large-scale-dataset machine-learning semantic-scene-completion semantic-segmentation
Last synced: 15 May 2025
https://github.com/intellabs/rag-fit
Framework for enhancing LLMs for RAG tasks using fine-tuning.
evaluation fine-tuning information-retrieval llm nlp question-answering rag semantic-search
Last synced: 15 May 2025
https://github.com/PaesslerAG/gval
Expression evaluation in golang
evaluate-expressions evaluation expression-evaluator expression-language go godoc golang gval parser parsing
Last synced: 14 Mar 2025
https://github.com/CBLUEbenchmark/CBLUE
CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark
acl2022 benchmark biomedical-tasks chinese chineseblue corpus dataset evaluation
Last synced: 01 Apr 2025
https://github.com/dbolya/tide
A General Toolbox for Identifying Object Detection Errors
error-detection errors evaluation instance-segmentation object-detection toolbox
Last synced: 08 Apr 2025
https://github.com/bochinski/iou-tracker
Python implementation of the IOU Tracker
demo-script detrac evaluation iou-tracker mot python tracker tracking-by-detection ua-detrac
Last synced: 05 May 2025
https://github.com/codingseb/expressionevaluator
A Simple Math and Pseudo-C# Expression Evaluator in One C# File. Can also execute small C#-like scripts.
calculations csharp-script eval evaluate evaluate-expressions evaluation evaluator execute executescript expression expression-evaluator expression-parser fluid math mathematical-expressions mathematical-expressions-evaluator parser reflection script scripting
Last synced: 14 Apr 2025
https://github.com/tecnickcom/tcexam
TCExam is a CBA (Computer-Based Assessment) system (e-exam, CBT - Computer-Based Testing) for universities, schools, and companies that enables educators and trainers to author, schedule, deliver, and report on surveys, quizzes, tests, and exams.
cba cbt computer-based-assessment computer-based-testing e-exam essay evaluation exam mcma mcsa multiple-choice school tcexam testing university
Last synced: 15 May 2025
https://github.com/ucinlp/autoprompt
AutoPrompt: Automatic Prompt Construction for Masked Language Models.
Last synced: 15 May 2025
https://github.com/howiehwong/trustllm
[ICML 2024] TrustLLM: Trustworthiness in Large Language Models
ai benchmark dataset evaluation large-language-models llm natural-language-processing nlp pypi-package toolkit trustworthy-ai trustworthy-machine-learning
Last synced: 14 May 2025
https://github.com/langchain-ai/langsmith-sdk
LangSmith Client SDK Implementations
evaluation language-model observability
Last synced: 13 May 2025
https://github.com/caserec/CaseRecommender
Case Recommender: A Flexible and Extensible Python Framework for Recommender Systems
evaluation python ranking rating-prediction recommendation-system recommender-systems top-k
Last synced: 25 Nov 2024
https://github.com/zzzprojects/eval-expression.net
C# Eval Expression | Evaluate, Compile, and Execute C# code and expression at runtime.
csharp dotnet eval eval-expression evaluation evaluator
Last synced: 15 May 2025
https://github.com/ModelTC/llmc
[EMNLP 2024 Industry Track] This is the official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit".
awq benchmark deployment evaluation internlm2 large-language-models lightllm llama3 llm lvlm mixtral omniquant post-training-quantization pruning quantization quarot smoothquant spinquant tool vllm
Last synced: 23 Apr 2025
https://github.com/thu-keg/evaluationpapers4chatgpt
Resource, Evaluation and Detection Papers for ChatGPT
chatgpt detection evaluation large-language-models resource
Last synced: 13 May 2025
https://github.com/danthedeckie/simpleeval
Simple Safe Sandboxed Extensible Expression Evaluator for Python
Last synced: 29 Mar 2025
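Its single-function API in brief:

```python
from simpleeval import simple_eval

# Evaluate expressions in a sandbox, optionally exposing names and functions.
print(simple_eval("21 + 21"))                                           # 42
print(simple_eval("x + y", names={"x": 1, "y": 2}))                     # 3
print(simple_eval("double(8)", functions={"double": lambda n: n * 2}))  # 16
```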
https://github.com/X-PLUG/CValues
Research on evaluating and aligning the values of Chinese large language models.
benchmark chinese-llms evaluation human-values llms multi-choice responsibility safety
Last synced: 09 May 2025
https://github.com/AmenRa/ranx
⚡️A Blazing-Fast Python Library for Ranking Evaluation, Comparison, and Fusion 🐍
comparison data-fusion evaluation evaluation-metrics information-retrieval information-retrieval-evaluation information-retrieval-metrics metasearch numba python rank-fusion ranking-metrics recommender-systems score-fusion
Last synced: 31 Mar 2025
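A minimal sketch of the evaluation API (qrels hold graded relevance judgments; runs hold system scores):

```python
from ranx import Qrels, Run, evaluate

# Toy judgments and a toy ranking for a single query.
qrels = Qrels({"q_1": {"doc_a": 1, "doc_b": 2}})
run = Run({"q_1": {"doc_b": 0.9, "doc_a": 0.8, "doc_c": 0.7}})

scores = evaluate(qrels, run, ["ndcg@5", "map", "mrr"])
print(scores)  # metric name -> averaged score
```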
https://github.com/MMMU-Benchmark/MMMU
This repo contains evaluation code for the paper "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI"
computer-vision deep-learning deep-neural-networks evaluation foundation-models large-language-models large-multimodal-models llm llms machine-learning multimodal multimodal-deep-learning multimodal-learning multimodality natural-language-processing question-answering stem visual-question-answering
Last synced: 17 Apr 2025
https://github.com/davidstutz/superpixel-benchmark
An extensive evaluation and comparison of 28 state-of-the-art superpixel algorithms on 5 datasets.
benchmark computer-vision evaluation image-procesing opencv superpixel-algorithms superpixels
Last synced: 05 Apr 2025
https://github.com/alipay/ant-application-security-testing-benchmark
The xAST evaluation benchmark makes security tools no longer a "black box".
application benchmark dast evaluation iast sast sca security testing
Last synced: 15 May 2025
https://github.com/audiolabs/webmushra
A MUSHRA-compliant experiment software based on the Web Audio API
audio bs1534 evaluation js mushra
Last synced: 15 May 2025
https://github.com/sb-ai-lab/replay
A Comprehensive Framework for Building End-to-End Recommendation Systems with State-of-the-Art Models
algorithms collaborative-filtering deep-learning distributed-computing evaluation machine-learning matrix-factorization pyspark pytorch recommendation-algorithms recommender-system recsys transformers
Last synced: 15 May 2025
https://github.com/microsoft/genaiops-promptflow-template
GenAIOps with Prompt Flow is a GenAIOps template and guidance for building LLM-infused apps with Prompt Flow. It offers centralized code hosting, lifecycle management, variant and hyperparameter experimentation, A/B deployment, and reporting for all runs and experiments.
aistudio azure azuremachinelearning cloud docker evaluation experimentation genai genaiops largelanguagemodels llm llmops machine-learning mlops mlops-template orchestration prompt promptengineering promptflow python
Last synced: 15 May 2025
https://github.com/hbaniecki/adversarial-explainable-ai
💡 Adversarial attacks on explanations and how to defend them
adversarial adversarial-attacks adversarial-examples adversarial-machine-learning attacks counterfactual deep defense evaluation explainability explainable-ai iml interpretability interpretable interpretable-machine-learning model responsible-ai robustness security xai
Last synced: 25 Mar 2025
https://github.com/cvangysel/pytrec_eval
pytrec_eval is an Information Retrieval evaluation tool for Python, based on the popular trec_eval.
evaluation information-retrieval
Last synced: 29 Apr 2025
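A minimal sketch of the evaluator API:

```python
import pytrec_eval

# Toy relevance judgments and a toy system run for one query.
qrel = {"q1": {"d1": 1, "d2": 0, "d3": 2}}
run = {"q1": {"d1": 1.2, "d2": 1.0, "d3": 0.5}}

evaluator = pytrec_eval.RelevanceEvaluator(qrel, {"map", "ndcg"})
print(evaluator.evaluate(run))  # {'q1': {'map': ..., 'ndcg': ...}}
```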
https://github.com/AstraZeneca/rexmex
A general purpose recommender metrics library for fair evaluation.
coverage deep-learning evaluation machine-learning metric metrics mrr personalization precision rank ranking recall recommender recommender-system recsys rsquared
Last synced: 27 Mar 2025
https://github.com/rentruewang/bocoel
Bayesian Optimization as a Coverage Tool for Evaluating LLMs. Accurate evaluation (benchmarking) that's 10 times faster with just a few lines of modular code.
bayesian-optimization benchmarking evaluation language-model llm machine-learning
Last synced: 05 Apr 2025
https://github.com/shmsw25/FActScore
A package to evaluate factuality of long-form generation. Original implementation of our EMNLP 2023 paper "FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation"
emnlp2023 evaluation factuality language language-modeling
Last synced: 02 Feb 2025
https://github.com/athina-ai/athina-evals
Python SDK for running evaluations on LLM generated responses
evaluation evaluation-framework evaluation-metrics llm-eval llm-evaluation llm-evaluation-toolkit llm-ops llmops
Last synced: 15 Apr 2025
https://github.com/FuxiaoLiu/LRV-Instruction
[ICLR'24] Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
chatgpt evaluation evaluation-metrics foundation-models gpt gpt-4 hallucination iclr iclr2024 llama llava multimodal object-detection prompt-engineering vicuna vision vision-and-language vqa
Last synced: 29 Mar 2025
https://github.com/belambert/asr-evaluation
Python module for evaluating ASR hypotheses (e.g. word error rate, word recognition rate).
asr error-rate evaluation speech-recognition
Last synced: 27 Nov 2024
https://github.com/Wscats/compile-hero
🔰Visual Studio Code Extension For Compiling Language
automatic compile es6 evaluation gulp jade javascript json jsx less pug sass scss typescript
Last synced: 24 Mar 2025
https://github.com/clovaai/generative-evaluation-prdc
Code base for the precision, recall, density, and coverage metrics for generative models. ICML 2020.
deep-learning diversity evaluation evaluation-metrics fidelity generative-adversarial-network generative-model icml icml-2020 icml2020 machine-learning precision recall
Last synced: 09 Apr 2025
https://github.com/microsoft/rag-experiment-accelerator
The RAG Experiment Accelerator is a versatile tool designed to streamline experiments and evaluations using Azure Cognitive Search and the RAG pattern.
acs azure chunking dense embedding evaluation experiment genai indexing information-retrieval llm openai rag sparse vectors
Last synced: 16 May 2025
https://github.com/evfro/polara
Recommender system and evaluation framework for top-n recommendations tasks that respects polarity of feedbacks. Fast, flexible and easy to use. Written in python, boosted by scientific python stack.
collaborative-filtering evaluation matrix-factorization recommender-system tensor-factorization top-n-recommendations
Last synced: 04 Apr 2025
https://github.com/appinho/SARosPerceptionKitti
ROS package for the Perception (Sensor Processing, Detection, Tracking and Evaluation) of the KITTI Vision Benchmark Suite
cpp dbscan deep-learning deeplab evaluation kitti kitti-dataset multi-object-tracking object-detection python ros ros-kinetic ros-node ros-nodes ros-packages rosbag rviz semantic-segmentation sensor-fusion unscented-kalman-filter
Last synced: 05 May 2025
https://github.com/devmount/germanwordembeddings
Toolkit to obtain and preprocess German text corpora, train models and evaluate them with generated testsets. Built with Gensim and Tensorflow.
deep-learning deep-neural-networks evaluation gensim german-language model natural-language-processing neural-network nlp training word-embeddings word2vec
Last synced: 06 Apr 2025