https://github.com/cleanlab/tlm

Score the trustworthiness of outputs from any LLM in real-time

ai-agents ai-safety confidence-estimation data-extraction data-labeling error-detection evals evaluation guardrails hallucination hallucination-detection human-in-the-loop-ai llm llm-as-a-judge llm-evaluation rag structured-outputs trustworthy-ai uncertainty-quantification verifiers



Host: GitHub
URL: https://github.com/cleanlab/tlm
Owner: cleanlab
License: apache-2.0
Created: 2025-12-17T21:42:29.000Z (8 months ago)
Default Branch: main
Last Pushed: 2026-01-15T21:04:20.000Z (7 months ago)
Last Synced: 2026-01-15T23:50:42.512Z (7 months ago)
Topics: ai-agents, ai-safety, confidence-estimation, data-extraction, data-labeling, error-detection, evals, evaluation, guardrails, hallucination, hallucination-detection, human-in-the-loop-ai, llm, llm-as-a-judge, llm-evaluation, rag, structured-outputs, trustworthy-ai, uncertainty-quantification, verifiers
Language: Python
Homepage:
Size: 4.71 MB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 19
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE

README

# Trustworthy Language Model (TLM)

The [Trustworthy Language Model](https://cleanlab.ai/blog/trustworthy-language-model/) scores the **trustworthiness** of outputs from *any* LLM in *real-time*.

Automatically detect hallucinated/incorrect responses in: Q&A (RAG), Chatbots, Agents, Structured Outputs, Data Extraction, Tool Calling, Classification/Tagging, Data Labeling, and other LLM applications.

Use TLM to:
- Guardrail AI mistakes before they are served to users
- Escalate untrustworthy AI responses to humans
- Discover incorrect LLM (or human) generated outputs in datasets/logs
- Boost AI accuracy

Powered by *uncertainty estimation* techniques, TLM **works out of the box** and does **not** require data preparation/labeling work or custom model training/serving infrastructure.

Learn more and see precision/recall benchmarks with frontier models (from OpenAI, Anthropic, Google, etc.):

[Blog](https://cleanlab.ai/blog/), [Research Paper](https://aclanthology.org/2024.acl-long.283/)

## Usage

See [notebooks](notebooks) for Jupyter notebooks with example usage.
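
The guardrail and escalation use cases listed earlier both reduce to thresholding a trustworthiness score. The sketch below illustrates only that routing pattern and is not the TLM API: `guardrail`, `Decision`, `dummy_scorer`, the assumed `[0, 1]` score range, and the `0.7` threshold are all hypothetical names and values chosen for illustration. In practice the scorer would be replaced by a real call as shown in the notebooks.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical scorer signature: (prompt, response) -> score in [0, 1].
# This is a stand-in for illustration, NOT the actual TLM interface.
TrustScorer = Callable[[str, str], float]


@dataclass
class Decision:
    response: str
    score: float
    action: str  # "serve" or "escalate"


def guardrail(
    prompt: str,
    response: str,
    scorer: TrustScorer,
    threshold: float = 0.7,  # assumed cutoff; tune it for your application
) -> Decision:
    """Serve the response if its score meets the threshold, else escalate."""
    score = scorer(prompt, response)
    action = "serve" if score >= threshold else "escalate"
    return Decision(response=response, score=score, action=action)


def dummy_scorer(prompt: str, response: str) -> float:
    """Placeholder scorer: treats responses containing 'not sure' as untrustworthy."""
    return 0.2 if "not sure" in response.lower() else 0.9


if __name__ == "__main__":
    for answer in ["Paris", "I'm not sure, maybe Lyon"]:
        decision = guardrail("What is the capital of France?", answer, dummy_scorer)
        print(decision.action, decision.score, decision.response)
```

Keeping the scorer as a plain callable means the routing logic can be tested independently of whichever model or scoring service produces the score.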
