https://github.com/erikernst4/callm
A framework for evaluating confidence-augmented systems, built on PyTorch Lightning. Supports both local HuggingFace models and GCP Vertex AI (Gemini) backends across multiple benchmarks.
calibration confidence llm metrics uncertainty-quantification
- Host: GitHub
- URL: https://github.com/erikernst4/callm
- Owner: erikernst4
- License: apache-2.0
- Created: 2025-09-03T20:02:22.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2026-05-23T03:45:07.000Z (20 days ago)
- Last Synced: 2026-05-23T05:24:58.433Z (20 days ago)
- Topics: calibration, confidence, llm, metrics, uncertainty-quantification
- Language: Python
- Homepage:
- Size: 2.63 MB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# callm — Confidence Calibration for LLMs
A framework for evaluating confidence-augmented systems, built on [PyTorch Lightning](https://lightning.ai/).
Supports both local HuggingFace models and GCP Vertex AI (Gemini) backends across multiple benchmarks.
## Supported Benchmarks
| Benchmark | Task type | Semantic‑equivalence evaluation needed? |
|---|---|---|
| **TriviaQA** | Open‑ended QA | Yes — uses an evaluator LLM |
| **MMLU** | Multiple‑choice | No — exact match on answer letter |
| **Classification** | Image/Audio/Text Classification | No — exact match on class label |
## Calibration Metrics
| Metric | Description |
|---|---|
| **ECE** | Expected Calibration Error (L1, 10 bins) |
| **AUC** | Area Under the ROC Curve |
| **BS** | Brier Score (MSE between confidence and correctness) |
| **CE** | Binary Cross‑Entropy |
| **n‑ECUAS** | Expected Cost for Uncertainty-Augmented Systems (parameterised by n = 0, 1, 2, …) |
| **γ‑ECUAS** | Gamma‑ECUAS — selective prediction at operating point γ |
| **AURC** | Area Under the Risk‑Coverage curve |
| **FPR@95** | False Positive Rate at 95% recall |
| **Error Rate** | Overall prediction error rate |
| **LogLog** | LogLog Score (Classification) |
| **NER / NBS / NCE** | Normalized versions of Error Rate, Brier Score, and Cross-Entropy |
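To make the table concrete, here is a minimal NumPy sketch of three of the metrics (Brier Score, Binary Cross-Entropy, and L1 ECE with 10 bins). It is an illustration only, not the code in `callm/metrics/`: the function names are hypothetical, and equal-width binning is an assumption, since the table does not say how the bins are built.

```python
import numpy as np


def brier_score(conf, correct):
    """Mean squared error between confidence and 0/1 correctness."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    return float(np.mean((conf - correct) ** 2))


def binary_cross_entropy(conf, correct, eps=1e-12):
    """Average negative log-likelihood of the correctness labels."""
    # Clip so that confidences of exactly 0 or 1 do not produce log(0).
    conf = np.clip(np.asarray(conf, dtype=float), eps, 1 - eps)
    correct = np.asarray(correct, dtype=float)
    return float(-np.mean(correct * np.log(conf) + (1 - correct) * np.log(1 - conf)))


def expected_calibration_error(conf, correct, n_bins=10):
    """L1 ECE over equal-width bins on [0, 1] (binning scheme is an assumption)."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Bin index per sample; a confidence of exactly 1.0 goes into the last bin.
    bin_ids = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            # |accuracy - mean confidence| in the bin, weighted by bin frequency.
            gap = abs(correct[mask].mean() - conf[mask].mean())
            ece += mask.mean() * gap
    return float(ece)


# Toy data: four answers with their confidences and 0/1 correctness.
conf = [0.9, 0.8, 0.6, 0.3]
correct = [1, 1, 0, 0]
print(brier_score(conf, correct))
print(binary_cross_entropy(conf, correct))
print(expected_calibration_error(conf, correct))
```

On this toy input the Brier Score is 0.125 and the ECE is about 0.3, because each sample lands alone in its own bin and the four gaps (0.1, 0.2, 0.6, 0.3) are each weighted by 1/4.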
## Quick Start
### 1. Install dependencies
```bash
uv sync
```
### 2. Configure environment (optional)
```bash
cp .env.example .env
```
Then edit `.env`:
```env
HF_TOKEN=your_huggingface_token_here # needed for gated models (e.g. Llama, Mistral)
GOOGLE_APPLICATION_CREDENTIALS=path/to/creds.json # needed for GCP / Gemini models
```
### 3. Run unit tests
```bash
uv run pytest callm/tests/ -v
```
## Usage
The CLI is built on top of `LightningCLI` and exposes three subcommands:
### `validate` — Run LLM inference and extract answers + confidences
```bash
# TriviaQA with a local HuggingFace model (default config)
uv run python main.py validate \
  --model.init_args.model_name=google/flan-t5-small \
  --data.init_args.batch_size=8

# MMLU with a local model
uv run python main.py validate \
  -c configs/config_mmlu_base_validation.yaml \
  --model.init_args.model_name=mistralai/Ministral-3-8B-Instruct-2512

# TriviaQA with a GCP Gemini model
uv run python main.py validate \
  -c configs/config_gcp_validation.yaml
```
Outputs are saved to `lightning_logs//llm_outputs.csv`.
### `evaluation` — Evaluate correctness of LLM outputs via a judge model
For benchmarks that require semantic-equivalence checking (TriviaQA):
```bash
uv run python main.py evaluation \
  --llm_outputs_path=lightning_logs//llm_outputs.csv

# Or recalculate metrics from an existing evaluation CSV:
uv run python main.py evaluation \
  --use_existing_csv \
  --llm_outputs_path=lightning_logs//llm_outputs.csv
```
### `evaluate_csv` — Compute metrics from a saved evaluation CSV
```bash
uv run python main.py evaluate_csv \
  --csv_path=lightning_logs/_evaluation/version_0/evaluation_results.csv
```
## Configuration
All runs are configured via YAML. Pre-built configs live in `configs/`:
| Config | Backend | Benchmark |
|---|---|---|
| `config_base_validation.yaml` | HuggingFace | TriviaQA |
| `config_gcp_validation.yaml` | GCP (Gemini) | TriviaQA |
| `config_base_evaluation.yaml` | HuggingFace | TriviaQA (evaluator) |
| `config_gcp_evaluation.yaml` | GCP (Gemini) | TriviaQA (evaluator) |
| `config_mmlu_base_validation.yaml` | HuggingFace | MMLU |
| `config_mmlu_gcp_validation.yaml` | GCP (Gemini) | MMLU |
Any config value can be overridden from the CLI — see the [LightningCLI docs](https://lightning.ai/docs/pytorch/stable/cli/lightning_cli.html).
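As a rough mental model of what a dotted override such as `--data.init_args.batch_size=8` does, the sketch below sets a nested key in a config dictionary. It only illustrates the semantics; it is not how LightningCLI parses arguments internally, and `apply_override` is a hypothetical helper, not part of callm.

```python
import copy


def apply_override(config, dotted_key, value):
    """Return a copy of `config` with the nested key at `dotted_key` set to `value`."""
    updated = copy.deepcopy(config)  # leave the base config untouched
    node = updated
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        # Walk down the nesting, creating intermediate dicts if they are missing.
        node = node.setdefault(key, {})
    node[leaf] = value
    return updated


# Base values mirror the flags used in the `validate` example.
base = {
    "model": {"init_args": {"model_name": "google/flan-t5-small"}},
    "data": {"init_args": {"batch_size": 8}},
}
run = apply_override(base, "data.init_args.batch_size", 16)
print(run["data"]["init_args"]["batch_size"])
```

The YAML file supplies the base values and each CLI flag replaces exactly one leaf, so a run differs from its config only where a flag was passed.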
## Project Structure
```
callm/
├── models/
│   ├── base.py              # Shared Lightning module base
│   ├── llm.py               # HuggingFace LLM (local GPU)
│   ├── gcp_llm.py           # GCP Vertex AI / Gemini LLM
│   ├── evaluator.py         # Semantic-equivalence evaluator (HF)
│   └── gcp_evaluator.py     # Semantic-equivalence evaluator (GCP)
├── data/
│   ├── triviaqa/            # TriviaQA data modules
│   ├── mmlu/                # MMLU data modules
│   ├── answers_data.py      # Shared answer-loading utilities
│   ├── classification.py    # Classification data module
│   └── simulation.py        # Simulated confidence data module
├── extractors/
│   ├── base.py              # Base + posterior extractors
│   ├── triviaqa.py          # TriviaQA verbalized-confidence extractor
│   └── mmlu.py              # MMLU answer/confidence extractors
├── prompts/
│   ├── base.py              # Prompt / ChatPrompt base classes
│   ├── triviaqa.py          # TriviaQA prompt templates
│   └── mmlu.py              # MMLU prompt templates
├── metrics/
│   ├── confidences.py       # Calibration metrics (ECE, AUC, BS, CE, n-ECUAS, …)
│   ├── classification.py    # Classification-specific metric variants
│   ├── constants.py         # Metric constants and registry
│   └── utils.py             # Metric lookup helpers
├── tests/                   # Unit & integration tests
├── config.py                # Shared config utilities
└── utils.py                 # Model loading & tokenizer helpers
configs/                     # YAML run configurations
scripts/                     # Analysis & paper-figure scripts
cli.py                       # CalibrationCLI (extends LightningCLI)
main.py                      # Entrypoint
```
## Confidence Extraction Methods
| Extractor | How confidence is obtained |
|---|---|
| **SequencePosteriorExtractor** | Product of the token probabilities of the generated answer (equivalently, the exponential of the summed token log‑probabilities) |
| **IsTruePosteriorExtractor** | Log‑prob of the "True" token after an "Is this true?" follow‑up |
| **VerbalizedConfidenceExtractor** | Parsed from the model's own verbalized confidence value |
MMLU variants (`MMLUSequencePosteriorExtractor`, `MMLUVerbalizedExtractor`, etc.) adapt these strategies to multiple‑choice format.
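The two simplest strategies can be sketched in a few lines of Python. This is a hedged illustration rather than the code in `callm/extractors/`: the function names are hypothetical, and the `Confidence: <number>` output format is an assumption about what the prompt asks the model to produce.

```python
import math
import re


def sequence_confidence(token_logprobs):
    """Joint probability of the answer: exp of the summed token log-probs."""
    return math.exp(sum(token_logprobs))


def parse_verbalized_confidence(text):
    """Return a confidence in [0, 1] parsed from model output, or None if absent.

    Assumes a 'Confidence: <number>' or 'Confidence: <number>%' pattern.
    """
    match = re.search(
        r"confidence\s*[:=]\s*(\d+(?:\.\d+)?)\s*(%?)", text, re.IGNORECASE
    )
    if match is None:
        return None
    value = float(match.group(1))
    # Treat explicit percentages, and bare values above 1, as a 0-100 scale.
    if match.group(2) == "%" or value > 1.0:
        value /= 100.0
    return min(max(value, 0.0), 1.0)


# Two tokens with probability 0.5 each give a sequence probability of 0.25.
print(sequence_confidence([math.log(0.5), math.log(0.5)]))
print(parse_verbalized_confidence("Answer: Paris\nConfidence: 85%"))
```

Multiplying token probabilities penalizes longer answers, so sequence posteriors are most comparable between answers of similar length.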
## License
See [LICENSE](LICENSE).