{"id":26179721,"url":"https://github.com/dholzmueller/probmetrics","last_synced_at":"2026-03-09T14:33:44.514Z","repository":{"id":275557120,"uuid":"925254273","full_name":"dholzmueller/probmetrics","owner":"dholzmueller","description":"Post-hoc calibration methods and metrics for classification","archived":false,"fork":false,"pushed_at":"2026-03-02T23:15:47.000Z","size":138,"stargazers_count":52,"open_issues_count":0,"forks_count":5,"subscribers_count":3,"default_branch":"main","last_synced_at":"2026-03-03T00:34:23.026Z","etag":null,"topics":["calibration","classification","machine-learning","metrics"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dholzmueller.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-01-31T14:28:37.000Z","updated_at":"2026-03-02T23:11:50.000Z","dependencies_parsed_at":null,"dependency_job_id":"9eea4513-3daa-4890-b5da-283ae7d56a77","html_url":"https://github.com/dholzmueller/probmetrics","commit_stats":null,"previous_names":["dholzmueller/probmetrics"],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/dholzmueller/probmetrics","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dholzmueller%2Fprobmetrics","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dholzmueller%2Fprobmetrics/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dho
lzmueller%2Fprobmetrics/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dholzmueller%2Fprobmetrics/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dholzmueller","download_url":"https://codeload.github.com/dholzmueller/probmetrics/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dholzmueller%2Fprobmetrics/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30299108,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-09T13:46:43.843Z","status":"ssl_error","status_checked_at":"2026-03-09T13:46:42.821Z","response_time":61,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["calibration","classification","machine-learning","metrics"],"created_at":"2025-03-11T21:53:10.595Z","updated_at":"2026-03-09T14:33:44.504Z","avatar_url":"https://github.com/dholzmueller.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![test](https://github.com/dholzmueller/probmetrics/actions/workflows/testing.yml/badge.svg)](https://github.com/dholzmueller/probmetrics/actions/workflows/testing.yml)\n[![Downloads](https://img.shields.io/pypi/dm/probmetrics)](https://pypistats.org/packages/probmetrics)\n\n\n# Probmetrics: Classification metrics and post-hoc 
calibration\n\nThis package (PyTorch-based) currently contains\n- classification metrics, especially also \nmetrics for assessing the quality of probabilistic predictions, and\n- post-hoc calibration methods, especially\n  - a fast and accurate implementation of temperature scaling.\n  - an implementation of structured matrix scaling (SMS), \n    a regularized version of matrix scaling that outperforms other \n    logistic-based calibration functions.\n\nIt accompanies our papers\n[Rethinking Early Stopping: Refine, Then Calibrate](https://arxiv.org/abs/2501.19195) and [Structured Matrix Scaling for Multi-Class Calibration](https://arxiv.org/abs/2511.03685) \nand [A Variational Estimator for Lp Calibration Errors](https://arxiv.org/abs/2602.24230).\nPlease cite us if you use this repository for research purposes.\nThe experiments from the papers can be found here: \n- Rethinking Early Stopping:\n  - [vision experiments](https://github.com/eugeneberta/RefineThenCalibrate-Vision).\n  - [tabular experiments](https://github.com/dholzmueller/pytabkit).\n  - [theory](https://github.com/eugeneberta/RefineThenCalibrate-Theory).\n- Structured Matrix Scaling: \n  [all experiments](https://github.com/eugeneberta/LogisticCalibrationBenchmark).\n- A Variational Estimator for Lp Calibration Errors: \n  [all experiments](https://github.com/ElSacho/Evaluating_Lp_Calibration_Errors).\n\n## Installation\n\nProbmetrics is available via\n```bash\npip install probmetrics\n```\nTo obtain all functionality, install `probmetrics[extra,dev,dirichletcal]`.\n- extra installs more packages for our CatBoost/LightGBM-based $L_p$ calibration \n  error metrics, smooth ECE (only works with scikit-learn versions \u003c= 1.6), \n  Venn-Abers calibration, \n  centered isotonic regression, \n  and the temperature scaling implementation in NetCal.\n- dev installs more packages for development (esp. 
testing)\n- dirichletcal installs Dirichlet calibration, \n  which however only works for Python 3.12 upwards.\n\n## Using post-hoc calibration methods\n\nYou can create a calibrator as follows:\n```python\nfrom probmetrics.calibrators import get_calibrator\n\ncalib = get_calibrator('logistic')\n```\n\nThese are the main supported methods:\n- `'logistic'` defaults to structured matrix scaling (SMS) for multiclass \n  and quadratic scaling for binary calibration. \n  We recommend using `'logistic'` for best results, \n  especially on multiclass problems. \n  It can be slow for larger numbers of classes. Only runs on CPU. \n  For the SAGA version (not the default), \n  the first call is slower due to numba compilation.\n- `'svs'`: Structured vector scaling (SVS) for multiclass problems, \n  faster than SMS for multiclass while being almost as good in many cases.\n- `'affine-scaling'`: Affine scaling for binary problems, \n  underperforms `'logistic'` (quadratic scaling) in our benchmarks but preserves AUC.\n- `'temp-scaling'`: Our \n  [highly efficient implementation of temperature scaling](https://arxiv.org/abs/2501.19195)\n  that, unlike some other implementations, \n  does not suffer from optimization issues. \n  Temperature scaling is not as expressive as matrix or vector scaling variants,\n  but it is faster and has the least overfitting risk.\n- `'ts-mix'`: Same as `'temp-scaling'` but with Laplace smoothing \n  (slightly preferable for logloss). Can also be achieved using \n  `get_calibrator('temp-scaling', calibrate_with_mixture=True)`\n- `'isotonic'` Isotonic regression from scikit-learn. 
\n  Isotonic variants can be good for binary classification with enough data (around 10K samples or more).\n- `'ivap'`: Inductive Venn-Abers predictor (a version of isotonic regression, slow but a bit better).\n- `'cir'`: Centered isotonic regression (slightly better and slower than isotonic).\n- `'dircal'`: Dirichlet calibration (slow, `'logistic'` performs better in our experiments).\n- `'dircal-cv'`: Dirichlet calibration optimized with cross-validation (very slow).\n\nMore details on parameters and other methods can be found in the `get_calibrator` function \n[here](https://github.com/dholzmueller/probmetrics/blob/main/probmetrics/calibrators.py).\n\n### Usage with `numpy`\n\n```python\nimport numpy as np\n\nprobas = np.asarray([[0.1, 0.9]])  # shape = (n_samples, n_classes)\nlabels = np.asarray([1])  # shape = (n_samples,)\ncalib.fit(probas, labels)\ncalibrated_probas = calib.predict_proba(probas)\n```\n\n### Usage with PyTorch\n\nThe PyTorch version can be used directly with GPU tensors, \nwhich is leveraged by our temperature scaling implementation \nbut not by most other methods.\nFor temperature scaling, a GPU can speed up fitting, \nbut the CPU version can be faster \nfor smaller validation sets (around 1K-10K samples).\n\n```python\nfrom probmetrics.distributions import CategoricalProbs\nimport torch\n\nprobas = torch.as_tensor([[0.1, 0.9]])\nlabels = torch.as_tensor([1])\n\n# if you have logits, you can use CategoricalLogits instead\ncalib.fit_torch(CategoricalProbs(probas), labels)\nresult = calib.predict_proba_torch(CategoricalProbs(probas))\ncalibrated_probas = result.get_probs()\n```\n\n\n## Using our refinement and calibration metrics\n\nWe provide estimators for refinement error \n(loss after post-hoc calibration)\nand calibration error \n(loss improvement through post-hoc calibration). 
\nThey can be used as follows:\n\n```python\nimport torch\nfrom probmetrics.metrics import Metrics\n\n# compute multiple metrics at once \n# this is more efficient than computing them individually\nmetrics = Metrics.from_names(['logloss', \n                              'refinement_logloss_ts-mix_all', \n                              'calib-err_logloss_ts-mix_all'])\ny_true = torch.tensor(...)\ny_logits = torch.tensor(...)\nresults = metrics.compute_all_from_labels_logits(y_true, y_logits)\nprint(results['refinement_logloss_ts-mix_all'].item())\n```\n\n## Using more metrics\n\nIn general, while some metrics can be \nflexibly configured using the corresponding classes,\nmany metrics are available through their name. \nHere are some relevant classification metrics:\n```python\nfrom probmetrics.metrics import Metrics\n\nmetrics = Metrics.from_names([\n    'logloss',\n    'brier',  # for binary, this is 2x the brier from sklearn\n    'accuracy', 'class-error',\n    'auroc-ovr', # one-vs-rest\n    'auroc-ovo-sklearn', # one-vs-one (can be slow!)\n    # calibration metrics\n    'ece-15', 'rmsce-15', 'mce-15', 'smece',\n    'refinement_logloss_ts-mix_all', \n    'calib-err_logloss_ts-mix_all',\n    'refinement_brier_ts-mix_all', \n    'calib-err_brier_ts-mix_all',\n    'calib-err_proper-L1-binary-as-1d_WS_CatboostClassifier_all',\n    'calib-err_proper-L2-binary-as-1d_WS_CatboostClassifier_all',\n    'calib-err_proper-Linf-binary-as-1d_WS_CatboostClassifier_all',\n])\n```\n\nThe following function returns a list of all metric names:\n```python\nfrom probmetrics.metrics import Metrics, MetricType\nMetrics.get_available_names(metric_type=MetricType.CLASS)\n```\n\nWhile there are some classes for regression metrics, they are not implemented.\n\n## Advanced calibration, confidence, and top-class metrics\n\nBeyond standard metrics, you can evaluate proper Lp calibration errors for \nany p, as well as isolate specific types of errors like over-confidence, \nunder-confidence, and 
top-class errors. \n\n**Note:** Over- and under-confidence metrics are designed for binary classification.\nTo use those for multi-class, please use `TopClassLoss(OverConfidenceLoss(your_metric))`.\n\n```python\nfrom probmetrics.metrics import (\n  Metrics,\n  ProperLpLoss,\n  BrierLoss,\n  OverConfidenceLoss,\n  UnderConfidenceLoss,\n  TopClassLoss\n)\n\n# Evaluate proper Lp calibration errors for any p\nlp_loss_l1 = ProperLpLoss(p=1)  # Evaluate E[ \\| Y - E[Y|f(X)] \\|_1 ] \nlp_loss_l2 = ProperLpLoss(p=2)  # Evaluate E[ \\| Y - E[Y|f(X)] \\|_2 ] \n\n# Evaluate over-confidence and under-confidence \n# (Initialize via string name or by passing a metric object)\nover_brier = OverConfidenceLoss.from_name(\"brier\")\nunder_L1 = UnderConfidenceLoss.from_name(\"proper-L1\")\n\n# Evaluate top-class error with any accompanying loss\ntopclass_brier = TopClassLoss(BrierLoss(binary_as_multiclass=False))\ntopclass_L1 = TopClassLoss.from_name(\"proper-L1\")\n\n# Compose wrappers (e.g., top-class with underconfidence for proper-L1)\nunder_topclass_l1 = TopClassLoss(UnderConfidenceLoss.from_name(\"proper-L1\"))\nover_topclass_brier = TopClassLoss(OverConfidenceLoss(BrierLoss()))\n\n# Some of these metrics are also available by name, for example:\nmetrics = Metrics.from_names([\n    'proper-L1-binary-as-1d', # estimates E[ \\| Y - E[Y|f(X)] \\|_1 ], treating binary \n                              # predictions as scalars with shape (n, 1)\n    'proper-L2', # estimates E[ \\| Y - E[Y|f(X)] \\|_2 ], treating binary predictions \n                 # as vectors with shape (n, 2)\n    \"topclass-proper-L1-binary-as-1d\", # Estimate L1 calibration error of top class \n    \"topclass-under-proper-L1-binary-as-1d\", # Estimate L1-underconfidence of top class \n    \"topclass-over-proper-L1-binary-as-1d\", # Estimate L1-overconfidence of top class\n])\n\n```\n\nOnce those losses are defined, you can evaluate the calibration error by doing:\n\n```python\nfrom probmetrics.metrics 
import MetricsWithCalibration, CombinedMetrics\nfrom probmetrics.classifiers import WS_CatboostClassifier, WS_LGBMClassifier\nfrom probmetrics.splitters import CVSplitter\n\nloss = ProperLpLoss(p=2) \n\nmetrics = MetricsWithCalibration(loss,\n                            calibrator=WS_CatboostClassifier(), # The classifier used to recalibrate the predictions\n                            val_splitter=CVSplitter(n_cv=5) # cross-validation splitter\n                            )\n\n# or use combined metrics to evaluate multiple metrics \n# while fitting the post-hoc calibrator only once\ncombined_losses = CombinedMetrics( \n                                    [\n                                    ProperLpLoss(p=1), \n                                    OverConfidenceLoss.from_name(\"brier\"), \n                                    OverConfidenceLoss.from_name(\"proper-L1\") , \n                                    UnderConfidenceLoss.from_name(\"proper-L1\" ), \n                                    UnderConfidenceLoss( BrierLoss() ),\n                                    BrierLoss()\n                                    ]\n                                  )\n\nmetrics = MetricsWithCalibration(combined_losses,\n                            calibrator=WS_LGBMClassifier(), \n                            val_splitter=CVSplitter(n_cv=5)\n                            )\n\ny_true = torch.tensor(...)\ny_prob = torch.tensor(...)\nresults = metrics.compute_all_from_labels_probs(y_true, y_prob)\n```\n\nThe `calibrator` argument is a class used to recalibrate the original predictions. \nAny estimator that inherits from sklearn.base.ClassifierMixin (i.e., follows the \nscikit-learn classifier API) and implements `predict_proba()` can be used.\nWe recommend using `WS_CatboostClassifier` with default parameters. 
\nThe \"WS\" stands for \"Warm Start\", as predictions are initialized at the \noriginal predicted $f(x)$ values (see the paper [A Variational Estimator for Lp \nCalibration Errors](https://arxiv.org/abs/2602.24230) for additional information). \n\n\n### Binary vs. multiclass formatting\n\nThe library internally stores predictions in a multiclass format \nwith shape `(n_samples, n_classes)`.\nFor binary classification, for some metrics \nyou can control whether to treat the output as a two-column distribution \nor a single-column probability using the `binary_as_multiclass` parameter.\nFor example, for `BrierLoss()`, using `binary_as_multiclass=False` \nwill yield the scikit-learn formula, while `binary_as_multiclass=True` \nwill yield twice the value.\n\nSetting `binary_as_multiclass=False` tells the loss function to treat \n`(n_samples, 2)` predictions as a single-column `(n_samples, 1)` probability.\nThe loss then internally transforms the data to \nbinary labels $Y \\in \\{0, 1\\}$ and the probability\ncolumn $f(X) \\in [0, 1]$ for the calculation.\n\nThose features are also valid with the `TopClassLoss`.\nThe `TopClassLoss` wrapper focuses the loss calculation on the class with \nthe highest predicted probability. 
The behavior changes based on your binary setting,\nfor instance:\n\n| Configuration                                                | Estimate                                                                         | Description                                                                                                                                                                                                                                 |\n|:-------------------------------------------------------------|:---------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| `TopClassLoss(ProperLpLoss(p=1))`                            | $\\mathbb{E}[ \\lvert Z - \\mathbb{E}[Z \\mid \\max f(X)] \\rvert ]$                   | Scalar probability: $\\max f(X)$ is the scalar probability of the top class of $f(X)$; $Z \\in \\{0, 1\\}$ equals $1$ if the label is what the top-class predicted and $0$ otherwise. Evaluates the absolute error of the top-class prediction. |\n| `TopClassLoss(ProperLpLoss(p=1, binary_as_multiclass=True))` | $\\mathbb{E}[ \\Vert \\mathbf{Z} - \\mathbb{E}[\\mathbf{Z} \\mid \\max f(X)] \\Vert_1 ]$ | Vectorized: $\\mathbf{Z}$ is a one-hot vector. Calculates the $L_1$ norm of the error vector.                                                                                                                                                
|\n\nWhen used inside `MetricsWithCalibration`, `TopClassLoss` will choose the top-class \nbased on $f(X)$ instead of $g(f(X))$ so the loss difference uses the same choice of top class for both terms.\n\n## Contributors\n- David Holzmüller\n- Eugène Berta\n- Sacha Braun\n\n## Releases\n\n- v1.2.0 by [@elsacho](https://github.com/elsacho): Added new proper loss functions:\n  - ProperLpLoss(p=p): Metrics to evaluate $E[ \\Vert f(X) - E[Y|f(X)] \\Vert_p ]$ where $f(X)$ are the \n    predictions of the classifier, $p \u003e= 1$, including `p=float(\"inf\")`\n  - TopClassLoss: A wrapper to variationally evaluate top-class errors.\n  - OverConfidenceLoss \u0026 UnderConfidenceLoss: Wrappers to variationally evaluate \n    over/under-confidence in binary predictors.\n  - MetricsWithCalibration can now handle arbitrary classifiers and Lp-type losses.\n  - New classifiers:  Added `WS_CatboostClassifier` and `WS_LGBMClassifier` for \n    evaluating calibration errors.\n  - removed sklearn \u003c 1.7 constraint.\n- v1.1.0 by [@eugeneberta](https://github.com/eugeneberta): Improvements to the SVS and SMS calibrators:\n  - logit pre-processing with `'ts-mix'` is now automatic, \n    and the global scaling parameter $\\alpha$ is fixed to 1. This yields:\n    - improved performance on our tabular and computer vision benchmarks \n      (see the arxiv v2 of the SMS paper, coming soon).\n    - faster convergence.\n    - ability to compute the duality gap in closed form for stopping SAGA solvers, \n      which we implement in this version.\n  - improved L-BFGS solvers, much faster than in the previous version. 
\n    Now used in SVS and SMS by default.\n  - the default binary calibrator in `LogisticCalibrator` is now quadratic scaling \n    instead of affine scaling, this can be changed back by using \n    `LogisticCalibrator(binary_type='affine')`.\n- v1.0.0 by [@eugeneberta](https://github.com/eugeneberta): New post-hoc calibrators like `'logistic'` \n  including structured matrix scaling (SMS), \n  structured vector scaling (SVS), \n  affine scaling, and quadratic scaling.\n- v0.0.2 by [@dholzmueller](https://github.com/dholzmueller):\n  - Removed numpy\u003c2.0 constraint\n  - allow 1D vectors in CategoricalLogits / CategoricalProbs\n  - add TorchCal temperature scaling\n  - minor fixes in AutoGluon temperature scaling \n    that shouldn't affect the performance in practice\n- v0.0.1 by [@dholzmueller](https://github.com/dholzmueller):\n  Initial release with classification metrics, \n  calibration/refinement metrics, and some post-hoc calibration methods.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdholzmueller%2Fprobmetrics","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdholzmueller%2Fprobmetrics","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdholzmueller%2Fprobmetrics/lists"}