https://github.com/cleanlab/tlm
Score the trustworthiness of outputs from any LLM in real-time
- Host: GitHub
- URL: https://github.com/cleanlab/tlm
- Owner: cleanlab
- License: apache-2.0
- Created: 2025-12-17T21:42:29.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2026-01-15T21:04:20.000Z (5 months ago)
- Last Synced: 2026-01-15T23:50:42.512Z (5 months ago)
- Topics: ai-agents, ai-safety, confidence-estimation, data-extraction, data-labeling, error-detection, evals, evaluation, guardrails, hallucination, hallucination-detection, human-in-the-loop-ai, llm, llm-as-a-judge, llm-evaluation, rag, structured-outputs, trustworthy-ai, uncertainty-quantification, verifiers
- Language: Python
- Homepage:
- Size: 4.71 MB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 19
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
# Trustworthy Language Model (TLM)
The [Trustworthy Language Model](https://cleanlab.ai/blog/trustworthy-language-model/) scores the **trustworthiness** of outputs from *any* LLM in *real-time*.
Automatically detect hallucinated/incorrect responses in: Q&A (RAG), Chatbots, Agents, Structured Outputs, Data Extraction, Tool Calling, Classification/Tagging, Data Labeling, and other LLM applications.
Use TLM to:
- Guardrail AI mistakes before they are served to users
- Escalate cases where AI is untrustworthy to humans
- Discover incorrect LLM (or human) generated outputs in datasets/logs
- Boost AI accuracy
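The first three use cases above come down to acting on a per-response trust score. The following is a minimal sketch of that pattern, not TLM's actual API: `ScoredResponse`, `guardrail`, `least_trustworthy`, the fallback message, and the `0.7` threshold are all hypothetical, and it assumes the score is a float in `[0, 1]` where higher means more trustworthy.

```python
from dataclasses import dataclass


@dataclass
class ScoredResponse:
    """An LLM response paired with a trust score (hypothetical container)."""
    prompt: str
    response: str
    trust_score: float  # assumption: in [0, 1], higher = more trustworthy


# Hypothetical message served when a response is withheld.
FALLBACK_MESSAGE = "I'm not sure about this one. Routing you to a human agent."


def guardrail(item: ScoredResponse, threshold: float = 0.7) -> tuple[str, bool]:
    """Return (text_to_serve, escalated).

    Responses scoring below the threshold are replaced by a fallback
    and marked for human escalation. The threshold is illustrative only.
    """
    if item.trust_score >= threshold:
        return item.response, False
    return FALLBACK_MESSAGE, True


def least_trustworthy(items: list[ScoredResponse], k: int = 10) -> list[ScoredResponse]:
    """Return the k lowest-scoring items, e.g. to review logs or datasets first."""
    return sorted(items, key=lambda it: it.trust_score)[:k]
```

In practice the threshold would be tuned on your own data, trading off how many good responses are withheld against how many bad ones slip through.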
Powered by *uncertainty estimation* techniques, TLM **works out of the box** and does **not** require data preparation/labeling work or custom model training/serving infrastructure.
Learn more and see precision/recall benchmarks with frontier models (from OpenAI, Anthropic, Google, etc.):
[Blog](https://cleanlab.ai/blog/), [Research Paper](https://aclanthology.org/2024.acl-long.283/)
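
To run a similar precision/recall check on your own labeled data, one option is to treat "trust score below a threshold" as a prediction that the response is incorrect. The sketch below is an illustration only and is not the benchmark methodology from the links above. The function name `precision_recall` and its parameters are hypothetical, and it assumes you already have a trust score and a correctness label for each response.

```python
def precision_recall(scores, is_correct, threshold):
    """Precision and recall for flagging incorrect responses.

    A response is flagged when its trust score is below `threshold`.
    `scores` and `is_correct` are parallel sequences (hypothetical inputs).
    """
    flagged = [s < threshold for s in scores]
    # True positive: flagged and actually incorrect.
    tp = sum(f and not c for f, c in zip(flagged, is_correct))
    # False positive: flagged but actually correct.
    fp = sum(f and c for f, c in zip(flagged, is_correct))
    # False negative: not flagged but actually incorrect.
    fn = sum((not f) and (not c) for f, c in zip(flagged, is_correct))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

Sweeping `threshold` over a range of values traces out a precision/recall curve for the scores.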
## Usage
See [notebooks](notebooks) for Jupyter notebooks with example usage.