{"id":50458316,"url":"https://github.com/erikernst4/callm","last_synced_at":"2026-06-01T03:32:47.858Z","repository":{"id":359713046,"uuid":"1050002682","full_name":"erikernst4/callm","owner":"erikernst4","description":"A framework for evaluating confidence augmented systems, built on PyTorch Lightning. Supports both local HuggingFace models and GCP Vertex AI (Gemini) backends across multiple benchmarks.","archived":false,"fork":false,"pushed_at":"2026-05-23T03:45:07.000Z","size":2762,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-23T05:24:58.433Z","etag":null,"topics":["calibration","confidence","llm","metrics","uncertainty-quantification"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/erikernst4.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-03T20:02:22.000Z","updated_at":"2026-05-23T03:45:11.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/erikernst4/callm","commit_stats":null,"previous_names":["erikernst4/callm"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/erikernst4/callm","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/erikernst4%2Fcallm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/erikernst4%2Fcallm/tags","releases_url":"https://repos.ec
osyste.ms/api/v1/hosts/GitHub/repositories/erikernst4%2Fcallm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/erikernst4%2Fcallm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/erikernst4","download_url":"https://codeload.github.com/erikernst4/callm/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/erikernst4%2Fcallm/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33759178,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-01T02:00:06.963Z","response_time":115,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["calibration","confidence","llm","metrics","uncertainty-quantification"],"created_at":"2026-06-01T03:32:45.853Z","updated_at":"2026-06-01T03:32:47.852Z","avatar_url":"https://github.com/erikernst4.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# callm — Confidence Calibration for LLMs\n\nA framework for evaluating confidence augmented systems, built on [PyTorch Lightning](https://lightning.ai/).\nSupports both local HuggingFace models and GCP Vertex AI (Gemini) backends across multiple benchmarks.\n\n## Supported Benchmarks\n\n| Benchmark | Task type | Semantic‑equivalence evaluation needed? 
|\n|---|---|---|\n| **TriviaQA** | Open‑ended QA | Yes — uses an evaluator LLM |\n| **MMLU** | Multiple‑choice | No — exact match on answer letter |\n| **Classification** | Image/Audio/Text Classification | No — exact match on class label |\n\n## Calibration Metrics\n\n| Metric | Description |\n|---|---|\n| **ECE** | Expected Calibration Error (L1, 10 bins) |\n| **AUC** | Area Under the ROC Curve |\n| **BS** | Brier Score (MSE between confidence and correctness) |\n| **CE** | Binary Cross‑Entropy |\n| **n‑ECUAS** | Expected Cost for Uncertainty-Augmented Systems (parameterised by n = 0, 1, 2, …) |\n| **γ‑ECUAS** | Gamma‑ECUAS — selective prediction at operating point γ |\n| **AURC** | Area Under the Risk‑Coverage curve |\n| **FPR@95** | False Positive Rate at 95% recall |\n| **Error Rate** | Overall prediction error rate |\n| **LogLog** | LogLog Score (Classification) |\n| **NER / NBS / NCE** | Normalized versions of Error Rate, Brier Score, and Cross-Entropy |\n\n## Quick Start\n\n### 1. Install dependencies\n\n```bash\nuv sync\n```\n\n### 2. Configure environment (optional)\n\n```bash\ncp .env.example .env\n```\n\nThen edit `.env`:\n\n```env\nHF_TOKEN=your_huggingface_token_here               # needed for gated models (e.g. Llama, Mistral)\nGOOGLE_APPLICATION_CREDENTIALS=path/to/creds.json   # needed for GCP / Gemini models\n```\n\n### 3. 
Run unit tests\n\n```bash\nuv run pytest callm/tests/ -v\n```\n\n## Usage\n\nThe CLI is built on top of `LightningCLI` and exposes three subcommands:\n\n### `validate` — Run LLM inference and extract answers + confidences\n\n```bash\n# TriviaQA with a local HuggingFace model (default config)\nuv run python main.py validate \\\n  --model.init_args.model_name=google/flan-t5-small \\\n  --data.init_args.batch_size=8\n\n# MMLU with a local model\nuv run python main.py validate \\\n  -c configs/config_mmlu_base_validation.yaml \\\n  --model.init_args.model_name=mistralai/Ministral-3-8B-Instruct-2512\n\n# TriviaQA with a GCP Gemini model\nuv run python main.py validate \\\n  -c configs/config_gcp_validation.yaml\n```\n\nOutputs are saved to `lightning_logs/\u003crun\u003e/llm_outputs.csv`.\n\n### `evaluation` — Evaluate correctness of LLM outputs via a judge model\n\nFor benchmarks that require semantic-equivalence checking (TriviaQA):\n\n```bash\nuv run python main.py evaluation \\\n  --llm_outputs_path=lightning_logs/\u003crun\u003e/llm_outputs.csv\n\n# Or recalculate metrics from an existing evaluation CSV:\nuv run python main.py evaluation \\\n  --use_existing_csv \\\n  --llm_outputs_path=lightning_logs/\u003crun\u003e/llm_outputs.csv\n```\n\n### `evaluate_csv` — Compute metrics from a saved evaluation CSV\n\n```bash\nuv run python main.py evaluate_csv \\\n  --csv_path=lightning_logs/\u003crun\u003e_evaluation/version_0/evaluation_results.csv\n```\n\n## Configuration\n\nAll runs are configured via YAML. 
Pre-built configs live in `configs/`:\n\n| Config | Backend | Benchmark |\n|---|---|---|\n| `config_base_validation.yaml` | HuggingFace | TriviaQA |\n| `config_gcp_validation.yaml` | GCP (Gemini) | TriviaQA |\n| `config_base_evaluation.yaml` | HuggingFace | TriviaQA (evaluator) |\n| `config_gcp_evaluation.yaml` | GCP (Gemini) | TriviaQA (evaluator) |\n| `config_mmlu_base_validation.yaml` | HuggingFace | MMLU |\n| `config_mmlu_gcp_validation.yaml` | GCP (Gemini) | MMLU |\n\nAny config value can be overridden from the CLI — see the [LightningCLI docs](https://lightning.ai/docs/pytorch/stable/cli/lightning_cli.html).\n\n## Project Structure\n\n```\ncallm/\n├── models/\n│   ├── base.py              # Shared Lightning module base\n│   ├── llm.py               # HuggingFace LLM (local GPU)\n│   ├── gcp_llm.py           # GCP Vertex AI / Gemini LLM\n│   ├── evaluator.py         # Semantic-equivalence evaluator (HF)\n│   └── gcp_evaluator.py     # Semantic-equivalence evaluator (GCP)\n├── data/\n│   ├── triviaqa/            # TriviaQA data modules\n│   ├── mmlu/                # MMLU data modules\n│   ├── answers_data.py      # Shared answer-loading utilities\n│   ├── classification.py    # Classification data module\n│   └── simulation.py        # Simulated confidence data module\n├── extractors/\n│   ├── base.py              # Base + posterior extractors\n│   ├── triviaqa.py          # TriviaQA verbalized-confidence extractor\n│   └── mmlu.py              # MMLU answer/confidence extractors\n├── prompts/\n│   ├── base.py              # Prompt / ChatPrompt base classes\n│   ├── triviaqa.py          # TriviaQA prompt templates\n│   └── mmlu.py              # MMLU prompt templates\n├── metrics/\n│   ├── confidences.py       # Calibration metrics (ECE, AUC, BS, CE, n-ECUAS, …)\n│   ├── classification.py    # Classification-specific metric variants\n│   ├── constants.py         # Metric constants and registry\n│   └── utils.py             # Metric lookup helpers\n├── tests/   
                # Unit \u0026 integration tests\n├── config.py                # Shared config utilities\n└── utils.py                 # Model loading \u0026 tokenizer helpers\nconfigs/                     # YAML run configurations\nscripts/                     # Analysis \u0026 paper-figure scripts\ncli.py                       # CalibrationCLI (extends LightningCLI)\nmain.py                      # Entrypoint\n```\n\n## Confidence Extraction Methods\n\n| Extractor | How confidence is obtained |\n|---|---|\n| **SequencePosteriorExtractor** | Product of token probabilities (i.e. the exponentiated sum of token log‑probabilities) of the generated answer |\n| **IsTruePosteriorExtractor** | Log‑probability of the \"True\" token after an \"Is this true?\" follow‑up |\n| **VerbalizedConfidenceExtractor** | Parsed from the model's own verbalized confidence value |\n\nMMLU variants (`MMLUSequencePosteriorExtractor`, `MMLUVerbalizedExtractor`, etc.) adapt these strategies to multiple‑choice format.\n\n## License\n\nSee [LICENSE](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ferikernst4%2Fcallm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ferikernst4%2Fcallm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ferikernst4%2Fcallm/lists"}