{"id":28042830,"url":"https://github.com/cvs-health/uqlm","last_synced_at":"2025-05-11T15:01:34.500Z","repository":{"id":290469842,"uuid":"968339993","full_name":"cvs-health/uqlm","owner":"cvs-health","description":"UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection","archived":false,"fork":false,"pushed_at":"2025-05-06T15:08:42.000Z","size":12193,"stargazers_count":14,"open_issues_count":5,"forks_count":7,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-05-06T15:59:40.705Z","etag":null,"topics":["ai-evaluation","ai-safety","confidence-estimation","confidence-score","hallucination","hallucination-detection","hallucination-evaluation","hallucination-mitigation","llm","llm-evaluation","llm-hallucination","llm-safety","uncertainty-estimation","uncertainty-quantification"],"latest_commit_sha":null,"homepage":"https://cvs-health.github.io/uqlm/latest/index.html","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cvs-health.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-04-17T23:12:21.000Z","updated_at":"2025-05-06T15:36:25.000Z","dependencies_parsed_at":"2025-04-29T02:29:15.147Z","dependency_job_id":"fd19d9ac-f687-4378-a030-463ba0305f6d","html_url":"https://github.com/cvs-health/uqlm","commit_stats":null,"previous_names":["cvs-health/uqlm"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cvs-health%2Fuqlm","tags_url":"https:/
/repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cvs-health%2Fuqlm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cvs-health%2Fuqlm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cvs-health%2Fuqlm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cvs-health","download_url":"https://codeload.github.com/cvs-health/uqlm/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253584490,"owners_count":21931547,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-evaluation","ai-safety","confidence-estimation","confidence-score","hallucination","hallucination-detection","hallucination-evaluation","hallucination-mitigation","llm","llm-evaluation","llm-hallucination","llm-safety","uncertainty-estimation","uncertainty-quantification"],"created_at":"2025-05-11T15:01:08.554Z","updated_at":"2025-05-11T15:01:34.395Z","avatar_url":"https://github.com/cvs-health.png","language":"Python","funding_links":[],"categories":["*Ops for AI","Recently Updated","Measuring Hallucinations in LLMs","A01_文本生成_文本对话","Tools","Evaluation \u0026 Testing"],"sub_categories":["LLMOps","[May 10, 2025](/content/2025/05/10/README.md)","[Retrieval-Based Prompt Selection for Code-Related Few-Shot Learning](https://people.ece.ubc.ca/amesbah/resources/papers/cedar-icse23.pdf)","大语言对话模型及数据","Services","Sandboxing \u0026 Execution"],"readme":"\u003cp align=\"center\"\u003e\n  \u003cimg 
src=\"https://raw.githubusercontent.com/cvs-health/uqlm/develop/assets/images/uqlm_flow_ds.png\" /\u003e\n\u003c/p\u003e\n\n\n# uqlm: Uncertainty Quantification for Language Models\n\n[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)\n[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)\n[![Build Status](https://github.com/cvs-health/uqlm/actions/workflows/ci.yaml/badge.svg)](https://github.com/cvs-health/uqlm/actions)\n[![](https://img.shields.io/badge/arXiv-2504.19254-B31B1B.svg)](https://arxiv.org/abs/2504.19254)\n\nUQLM is a Python library for Large Language Model (LLM) hallucination detection using state-of-the-art uncertainty quantification techniques. \n\n## Installation\nThe latest version can be installed from PyPI:\n\n```bash\npip install uqlm\n```\n\n## Hallucination Detection\nUQLM provides a suite of response-level scorers for quantifying the uncertainty of Large Language Model (LLM) outputs. Each scorer returns a confidence score between 0 and 1, where higher scores indicate a lower likelihood of errors or hallucinations.  
We categorize these scorers into four main types:\n\n\n\n| Scorer Type            | Added Latency                                      | Added Cost                               | Compatibility                                             | Off-the-Shelf / Effort                                  |\n|------------------------|----------------------------------------------------|------------------------------------------|-----------------------------------------------------------|---------------------------------------------------------|\n| [Black-Box Scorers](#black-box-scorers-consistency-based)      | ⏱️ Medium-High (multiple generations \u0026 comparisons)           | 💸 High (multiple LLM calls)             | 🌍 Universal (works with any LLM)                         | ✅ Off-the-shelf |\n| [White-Box Scorers](#white-box-scorers-token-probability-based)      | ⚡ Minimal (token probabilities already returned)   | ✔️ None (no extra LLM calls)             | 🔒 Limited (requires access to token probabilities)       | ✅ Off-the-shelf            |\n| [LLM-as-a-Judge Scorers](#llm-as-a-judge-scorers) | ⏳ Low-Medium (additional judge calls add latency)    | 💵 Low-High (depends on number of judges)| 🌍 Universal (any LLM can serve as judge)                     |✅  Off-the-shelf        |\n| [Ensemble Scorers](#ensemble-scorers)       | 🔀 Flexible (combines various scorers)       | 🔀 Flexible (combines various scorers)      | 🔀 Flexible (combines various scorers)                    | ✅  Off-the-shelf (beginner-friendly); 🛠️ Can be tuned (best for advanced users)    |\n\n\nBelow we provide illustrative code snippets and details about available scorers for each type.\n\n### Black-Box Scorers (Consistency-Based)\n\nThese scorers assess uncertainty by measuring the consistency of multiple responses generated from the same prompt. 
They are compatible with any LLM, intuitive to use, and don't require access to internal model states or token probabilities.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/cvs-health/uqlm/develop/assets/images/black_box_graphic.png\" /\u003e\n\u003c/p\u003e\n\n**Example Usage:**\nBelow is a sample of code illustrating how to use the `BlackBoxUQ` class to conduct hallucination detection.\n\n```python\nfrom langchain_google_vertexai import ChatVertexAI\nllm = ChatVertexAI(model='gemini-pro')\n\nfrom uqlm import BlackBoxUQ\nbbuq = BlackBoxUQ(llm=llm, scorers=[\"semantic_negentropy\"], use_best=True)\n\nresults = await bbuq.generate_and_score(prompts=prompts, num_responses=5)\nresults.to_df()\n```\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/cvs-health/uqlm/develop/assets/images/black_box_output4.png\" /\u003e\n\u003c/p\u003e\n\nAbove, `use_best=True` implements mitigation so that the uncertainty-minimized response is selected. Note that although we use `ChatVertexAI` in this example, any [LangChain Chat Model](https://js.langchain.com/docs/integrations/chat/) may be used. 
For a more detailed demo, refer to our [Black-Box UQ Demo](./examples/black_box_demo.ipynb).\n\n\n**Available Scorers:**\n\n*   Non-Contradiction Probability ([Chen \u0026 Mueller, 2023](https://arxiv.org/abs/2308.16175); [Lin et al., 2025](https://arxiv.org/abs/2305.19187); [Manakul et al., 2023](https://arxiv.org/abs/2303.08896))\n*   Semantic Entropy ([Farquhar et al., 2024](https://www.nature.com/articles/s41586-024-07421-0); [Kuhn et al., 2023](https://arxiv.org/abs/2302.09664))\n*   Exact Match ([Cole et al., 2023](https://arxiv.org/abs/2305.14613); [Chen \u0026 Mueller, 2023](https://arxiv.org/abs/2308.16175))\n*   BERT-score ([Manakul et al., 2023](https://arxiv.org/abs/2303.08896); [Zhang et al., 2020](https://arxiv.org/abs/1904.09675))\n*   BLEURT-score ([Sellam et al., 2020](https://arxiv.org/abs/2004.04696))\n*   Cosine Similarity ([Shorinwa et al., 2024](https://arxiv.org/abs/2412.05563); [HuggingFace](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2))\n\n### White-Box Scorers (Token-Probability-Based)\n\nThese scorers leverage token probabilities to estimate uncertainty.  They are significantly faster and cheaper than black-box methods, but require access to the LLM's internal probabilities, meaning they are not necessarily compatible with all LLMs/APIs.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/cvs-health/uqlm/develop/assets/images/white_box_graphic.png\" /\u003e\n\u003c/p\u003e\n\n**Example Usage:**\nBelow is a sample of code illustrating how to use the `WhiteBoxUQ` class to conduct hallucination detection. 
\n\n```python\nfrom langchain_google_vertexai import ChatVertexAI\nllm = ChatVertexAI(model='gemini-pro')\n\nfrom uqlm import WhiteBoxUQ\nwbuq = WhiteBoxUQ(llm=llm, scorers=[\"min_probability\"])\n\nresults = await wbuq.generate_and_score(prompts=prompts)\nresults.to_df()\n```\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/cvs-health/uqlm/develop/assets/images/white_box_output2.png\" /\u003e\n\u003c/p\u003e\n\nAgain, any [LangChain Chat Model](https://js.langchain.com/docs/integrations/chat/) may be used in place of `ChatVertexAI`. For a more detailed demo, refer to our [White-Box UQ Demo](./examples/white_box_demo.ipynb).\n\n\n**Available Scorers:**\n\n*   Minimum token probability ([Manakul et al., 2023](https://arxiv.org/abs/2303.08896))\n*   Length-Normalized Joint Token Probability ([Malinin \u0026 Gales, 2021](https://arxiv.org/abs/2002.07650))\n\n### LLM-as-a-Judge Scorers\n\nThese scorers use one or more LLMs to evaluate the reliability of the original LLM's response.  They offer high customizability through prompt engineering and the choice of judge LLM(s).\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/cvs-health/uqlm/develop/assets/images/judges_graphic.png\" /\u003e\n\u003c/p\u003e\n\n**Example Usage:**\nBelow is a sample of code illustrating how to use the `LLMPanel` class to conduct hallucination detection using a panel of LLM judges. 
\n\n```python\nfrom langchain_google_vertexai import ChatVertexAI\nllm1 = ChatVertexAI(model='gemini-1.0-pro')\nllm2 = ChatVertexAI(model='gemini-1.5-flash-001')\nllm3 = ChatVertexAI(model='gemini-1.5-pro-001')\n\nfrom uqlm import LLMPanel\npanel = LLMPanel(llm=llm1, judges=[llm1, llm2, llm3])\n\nresults = await panel.generate_and_score(prompts=prompts)\nresults.to_df()\n```\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/cvs-health/uqlm/develop/assets/images/panel_output2.png\" /\u003e\n\u003c/p\u003e\n\nNote that although we use `ChatVertexAI` in this example, we can use any [LangChain Chat Model](https://js.langchain.com/docs/integrations/chat/) as judges. For a more detailed demo illustrating how to customize a panel of LLM judges, refer to our [LLM-as-a-Judge Demo](./examples/judges_demo.ipynb).\n\n\n**Available Scorers:**\n\n*   Categorical LLM-as-a-Judge ([Manakul et al., 2023](https://arxiv.org/abs/2303.08896); [Chen \u0026 Mueller, 2023](https://arxiv.org/abs/2308.16175); [Luo et al., 2023](https://arxiv.org/abs/2303.15621))\n*   Continuous LLM-as-a-Judge ([Xiong et al., 2024](https://arxiv.org/abs/2306.13063))\n*   Panel of LLM Judges ([Verga et al., 2024](https://arxiv.org/abs/2404.18796))\n\n### Ensemble Scorers\n\nThese scorers leverage a weighted average of multiple individual scorers to provide a more robust uncertainty/confidence estimate. They offer high flexibility and customizability, allowing you to tailor the ensemble to specific use cases.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/cvs-health/uqlm/develop/assets/images/uqensemble_generate_score.png\" /\u003e\n\u003c/p\u003e\n\n**Example Usage:**\nBelow is a sample of code illustrating how to use the `UQEnsemble` class to conduct hallucination detection. 
\n\n```python\nfrom langchain_google_vertexai import ChatVertexAI\nllm = ChatVertexAI(model='gemini-pro')\n\nfrom uqlm import UQEnsemble\n## ---Option 1: Off-the-Shelf Ensemble---\n# uqe = UQEnsemble(llm=llm)\n# results = await uqe.generate_and_score(prompts=prompts, num_responses=5)\n\n## ---Option 2: Tuned Ensemble---\nscorers = [ # specify which scorers to include\n    \"exact_match\", \"noncontradiction\", # black-box scorers\n    \"min_probability\", # white-box scorer\n    llm # use same LLM as a judge\n]\nuqe = UQEnsemble(llm=llm, scorers=scorers)\n\n# Tune on tuning prompts with provided ground truth answers\ntune_results = await uqe.tune(\n    prompts=tuning_prompts, ground_truth_answers=ground_truth_answers\n)\n# ensemble is now tuned - generate responses on new prompts\nresults = await uqe.generate_and_score(prompts=prompts)\nresults.to_df()\n```\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/cvs-health/uqlm/develop/assets/images/uqensemble_output2.png\" /\u003e\n\u003c/p\u003e\n\nAs with the other examples, any [LangChain Chat Model](https://js.langchain.com/docs/integrations/chat/) may be used in place of `ChatVertexAI`. 
For more detailed demos, refer to our [Off-the-Shelf Ensemble Demo](./examples/ensemble_off_the_shelf_demo.ipynb) (quick start) or our [Ensemble Tuning Demo](./examples/ensemble_tuning_demo.ipynb) (advanced).\n\n\n**Available Scorers:**\n\n*   BS Detector ([Chen \u0026 Mueller, 2023](https://arxiv.org/abs/2308.16175))\n*   Generalized UQ Ensemble ([Bouchard \u0026 Chauhan, 2025](https://arxiv.org/abs/2504.19254))\n\n## Documentation\nCheck out our [documentation site](https://cvs-health.github.io/uqlm/latest/index.html) for detailed instructions on using this package, including API reference and more.\n\n## Example notebooks\nExplore the following demo notebooks to see how to use UQLM for various hallucination detection methods:\n\n- [Black-Box Uncertainty Quantification](https://github.com/cvs-health/uqlm/blob/develop/examples/black_box_demo.ipynb): A notebook demonstrating hallucination detection with black-box (consistency) scorers.\n- [White-Box Uncertainty Quantification](https://github.com/cvs-health/uqlm/blob/develop/examples/white_box_demo.ipynb): A notebook demonstrating hallucination detection with white-box (token probability-based) scorers.\n- [LLM-as-a-Judge](https://github.com/cvs-health/uqlm/blob/develop/examples/judges_demo.ipynb): A notebook demonstrating hallucination detection with LLM-as-a-Judge.\n- [Tunable UQ Ensemble](https://github.com/cvs-health/uqlm/blob/develop/examples/ensemble_tuning_demo.ipynb): A notebook demonstrating hallucination detection with a tunable ensemble of UQ scorers ([Bouchard \u0026 Chauhan, 2025](https://arxiv.org/abs/2504.19254)).\n- [Off-the-Shelf UQ Ensemble](https://github.com/cvs-health/uqlm/blob/develop/examples/ensemble_off_the_shelf_demo.ipynb): A notebook demonstrating hallucination detection using the BS Detector ([Chen \u0026 Mueller, 2023](https://arxiv.org/abs/2308.16175)) off-the-shelf ensemble.\n\n\n## Associated Research\nA technical description of the `uqlm` scorers and extensive experiment results are 
contained in **[this paper](https://arxiv.org/abs/2504.19254)**. If you use our framework or toolkit, we would appreciate citations to the following paper:\n\n```bibtex\n@misc{bouchard2025uncertaintyquantificationlanguagemodels,\n      title={Uncertainty Quantification for Language Models: A Suite of Black-Box, White-Box, LLM Judge, and Ensemble Scorers}, \n      author={Dylan Bouchard and Mohit Singh Chauhan},\n      year={2025},\n      eprint={2504.19254},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL},\n      url={https://arxiv.org/abs/2504.19254}, \n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcvs-health%2Fuqlm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcvs-health%2Fuqlm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcvs-health%2Fuqlm/lists"}