https://github.com/erikernst4/callm

A framework for evaluating confidence-augmented systems, built on PyTorch Lightning. Supports both local HuggingFace models and GCP Vertex AI (Gemini) backends across multiple benchmarks.

Host: GitHub
URL: https://github.com/erikernst4/callm
Owner: erikernst4
License: apache-2.0
Created: 2025-09-03T20:02:22.000Z (9 months ago)
Default Branch: main
Last Pushed: 2026-05-23T03:45:07.000Z (20 days ago)
Last Synced: 2026-05-23T05:24:58.433Z (20 days ago)
Topics: calibration, confidence, llm, metrics, uncertainty-quantification
Language: Python
Homepage:
Size: 2.63 MB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# callm: Confidence Calibration for LLMs

A framework for evaluating confidence-augmented systems, built on [PyTorch Lightning](https://lightning.ai/).

Supports both local HuggingFace models and GCP Vertex AI (Gemini) backends across multiple benchmarks.

## Supported Benchmarks

| Benchmark | Task type | Semantic-equivalence evaluation needed? |
|---|---|---|
| **TriviaQA** | Open-ended QA | Yes: uses an evaluator LLM |
| **MMLU** | Multiple-choice | No: exact match on answer letter |
| **Classification** | Image/Audio/Text classification | No: exact match on class label |
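
To make "exact match on answer letter" concrete, here is a minimal sketch of how a multiple-choice response could be scored. The function names and the parsing rule (a leading letter, optionally in parentheses or followed by punctuation) are assumptions for illustration, not callm's actual extractor logic.

```python
import re
from typing import Optional


def extract_choice_letter(text: str, choices: str = "ABCD") -> Optional[str]:
    """Return a leading answer letter such as 'B', '(b)' or 'b.', else None.

    Hypothetical helper: only a letter at the very start of the response
    counts, so free-text answers like 'Berlin' are not mistaken for 'B'.
    """
    pattern = rf"^\(?([{choices}])\)?(?:[.:)\s]|$)"
    match = re.match(pattern, text.strip().upper())
    return match.group(1) if match else None


def is_exact_match(prediction: str, gold_letter: str) -> bool:
    """Correct only if a letter was found and it equals the gold letter."""
    letter = extract_choice_letter(prediction)
    return letter is not None and letter == gold_letter.strip().upper()
```

A rule this strict treats unparseable responses as incorrect, which keeps the correctness label binary for the calibration metrics.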

## Calibration Metrics

| Metric | Description |
|---|---|
| **ECE** | Expected Calibration Error (L1, 10 bins) |
| **AUC** | Area Under the ROC Curve |
| **BS** | Brier Score (MSE between confidence and correctness) |
| **CE** | Binary Cross-Entropy |
| **n-ECUAS** | Expected Cost for Uncertainty-Augmented Systems (parameterised by n = 0, 1, 2, …) |
| **γ-ECUAS** | Gamma-ECUAS: selective prediction at operating point γ |
| **AURC** | Area Under the Risk-Coverage curve |
| **FPR@95** | False Positive Rate at 95% recall |
| **Error Rate** | Overall prediction error rate |
| **LogLog** | LogLog Score (Classification) |
| **NER / NBS / NCE** | Normalized versions of Error Rate, Brier Score, and Cross-Entropy |
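
For orientation, the three simplest metrics in the table can be sketched in plain Python from a list of confidences in [0, 1] and binary correctness labels. This is a reference sketch under stated assumptions (equal-width bins, function names chosen here), not the code in `callm/metrics/confidences.py`.

```python
import math
from typing import Sequence


def brier_score(conf: Sequence[float], correct: Sequence[int]) -> float:
    """Mean squared error between confidence and 0/1 correctness."""
    return sum((c - y) ** 2 for c, y in zip(conf, correct)) / len(conf)


def binary_cross_entropy(conf: Sequence[float], correct: Sequence[int],
                         eps: float = 1e-12) -> float:
    """Mean negative log-likelihood; confidences are clipped to avoid log(0)."""
    total = 0.0
    for c, y in zip(conf, correct):
        c = min(max(c, eps), 1.0 - eps)
        total -= y * math.log(c) + (1 - y) * math.log(1.0 - c)
    return total / len(conf)


def expected_calibration_error(conf: Sequence[float], correct: Sequence[int],
                               n_bins: int = 10) -> float:
    """L1 ECE with equal-width bins (the binning scheme is an assumption)."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(conf, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(y for _, y in items) / len(items)
        ece += (len(items) / len(conf)) * abs(avg_conf - accuracy)
    return ece
```

ECE measures only calibration, while the Brier score and cross-entropy also reward discrimination, so the three can rank systems differently.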

## Quick Start

### 1. Install dependencies

```bash
uv sync
```

### 2. Configure environment (optional)

```bash
cp .env.example .env
```

Then edit `.env`:

```env
HF_TOKEN=your_huggingface_token_here               # needed for gated models (e.g. Llama, Mistral)
GOOGLE_APPLICATION_CREDENTIALS=path/to/creds.json   # needed for GCP / Gemini models
```

### 3. Run unit tests

```bash
uv run pytest callm/tests/ -v
```

## Usage

The CLI is built on top of `LightningCLI` and exposes three subcommands:

### `validate` — Run LLM inference and extract answers + confidences

```bash
# TriviaQA with a local HuggingFace model (default config)
uv run python main.py validate \
  --model.init_args.model_name=google/flan-t5-small \
  --data.init_args.batch_size=8

# MMLU with a local model
uv run python main.py validate \
  -c configs/config_mmlu_base_validation.yaml \
  --model.init_args.model_name=mistralai/Ministral-3-8B-Instruct-2512

# TriviaQA with a GCP Gemini model
uv run python main.py validate \
  -c configs/config_gcp_validation.yaml
```

Outputs are saved to `lightning_logs//llm_outputs.csv`.

### `evaluation` — Evaluate correctness of LLM outputs via a judge model

For benchmarks that require semantic-equivalence checking (TriviaQA):

```bash
uv run python main.py evaluation \
  --llm_outputs_path=lightning_logs//llm_outputs.csv

# Or recalculate metrics from an existing evaluation CSV:
uv run python main.py evaluation \
  --use_existing_csv \
  --llm_outputs_path=lightning_logs//llm_outputs.csv
```

### `evaluate_csv` — Compute metrics from a saved evaluation CSV

```bash
uv run python main.py evaluate_csv \
  --csv_path=lightning_logs/_evaluation/version_0/evaluation_results.csv
```

## Configuration

All runs are configured via YAML. Pre-built configs live in `configs/`:

| Config | Backend | Benchmark |
|---|---|---|
| `config_base_validation.yaml` | HuggingFace | TriviaQA |
| `config_gcp_validation.yaml` | GCP (Gemini) | TriviaQA |
| `config_base_evaluation.yaml` | HuggingFace | TriviaQA (evaluator) |
| `config_gcp_evaluation.yaml` | GCP (Gemini) | TriviaQA (evaluator) |
| `config_mmlu_base_validation.yaml` | HuggingFace | MMLU |
| `config_mmlu_gcp_validation.yaml` | GCP (Gemini) | MMLU |

Any config value can be overridden from the CLI — see the [LightningCLI docs](https://lightning.ai/docs/pytorch/stable/cli/lightning_cli.html).

## Project Structure

```
callm/
├── models/
│   ├── base.py              # Shared Lightning module base
│   ├── llm.py               # HuggingFace LLM (local GPU)
│   ├── gcp_llm.py           # GCP Vertex AI / Gemini LLM
│   ├── evaluator.py         # Semantic-equivalence evaluator (HF)
│   └── gcp_evaluator.py     # Semantic-equivalence evaluator (GCP)
├── data/
│   ├── triviaqa/            # TriviaQA data modules
│   ├── mmlu/                # MMLU data modules
│   ├── answers_data.py      # Shared answer-loading utilities
│   ├── classification.py    # Classification data module
│   └── simulation.py        # Simulated confidence data module
├── extractors/
│   ├── base.py              # Base + posterior extractors
│   ├── triviaqa.py          # TriviaQA verbalized-confidence extractor
│   └── mmlu.py              # MMLU answer/confidence extractors
├── prompts/
│   ├── base.py              # Prompt / ChatPrompt base classes
│   ├── triviaqa.py          # TriviaQA prompt templates
│   └── mmlu.py              # MMLU prompt templates
├── metrics/
│   ├── confidences.py       # Calibration metrics (ECE, AUC, BS, CE, n-ECUAS, …)
│   ├── classification.py    # Classification-specific metric variants
│   ├── constants.py         # Metric constants and registry
│   └── utils.py             # Metric lookup helpers
├── tests/                   # Unit & integration tests
├── config.py                # Shared config utilities
└── utils.py                 # Model loading & tokenizer helpers
configs/                     # YAML run configurations
scripts/                     # Analysis & paper-figure scripts
cli.py                       # CalibrationCLI (extends LightningCLI)
main.py                      # Entrypoint
```

## Confidence Extraction Methods

| Extractor | How confidence is obtained |
|---|---|
| **SequencePosteriorExtractor** | Product of the token probabilities (equivalently, the sum of token log-probabilities) of the generated answer |
| **IsTruePosteriorExtractor** | Log-prob of the "True" token after an "Is this true?" follow-up |
| **VerbalizedConfidenceExtractor** | Parsed from the model's own verbalized confidence value |

MMLU variants (`MMLUSequencePosteriorExtractor`, `MMLUVerbalizedExtractor`, etc.) adapt these strategies to multiple‑choice format.
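
As a rough illustration of two of these strategies, the sketch below turns per-token log-probabilities into a sequence-level confidence and parses a verbalized confidence from generated text. Both function names and the `Confidence: ...` output format are assumptions for this example; the extractors in `callm/extractors/` may use different prompts and parsing rules.

```python
import math
import re
from typing import Optional, Sequence


def sequence_posterior(token_logprobs: Sequence[float]) -> float:
    """Joint probability of the generated answer: exp(sum of token log-probs)."""
    return math.exp(sum(token_logprobs))


def parse_verbalized_confidence(text: str) -> Optional[float]:
    """Parse 'Confidence: 85%' or 'Confidence: 0.85' into a value in [0, 1].

    Returns None when no confidence statement is found (assumed format).
    """
    match = re.search(r"confidence\s*[:=]?\s*(\d+(?:\.\d+)?)\s*(%?)",
                      text, re.IGNORECASE)
    if match is None:
        return None
    value = float(match.group(1))
    # Treat an explicit percent sign, or any value above 1, as a percentage.
    if match.group(2) == "%" or value > 1.0:
        value /= 100.0
    return min(max(value, 0.0), 1.0)
```

Summing log-probabilities before exponentiating avoids the numerical underflow that multiplying many small probabilities directly can cause. Longer answers still receive lower joint probability, so some setups normalize by length; whether callm does so is not stated here.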

## License

See [LICENSE](LICENSE).
