https://github.com/meesho/spark_calibration

Spark Calibration - A python package for calibrating probabilities predicted by ML model involving large training & test datasets as spark dataframes
Host: GitHub
URL: https://github.com/meesho/spark_calibration
Owner: Meesho
License: apache-2.0
Created: 2023-10-03T13:20:38.000Z (almost 3 years ago)
Default Branch: main
Last Pushed: 2025-12-10T03:59:06.000Z (8 months ago)
Last Synced: 2026-02-17T06:25:24.521Z (5 months ago)
Topics: calibration, classification, pysparkml, python, spark
Language: Python
Homepage:
Size: 44.9 KB
Stars: 18
Watchers: 1
Forks: 3
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

# Model calibration with PySpark



This package provides a `Betacal` class that fits a beta calibration model on PySpark dataframes and uses it to predict calibrated scores.

## Setup

The `spark-calibration` package is [available on PyPI](https://pypi.org/project/spark-calibration/) and can be installed with this command:

```
pip install spark-calibration
```

## Usage

### Training

`train_df` should be a PySpark dataframe containing:

- A column with raw model scores (default name: `score`)

- A column with binary labels (default name: `label`)

- (Optional) A column with sample weights (default name: `weight`)

You can specify different column names when calling `fit()`.

Some tree-based models, such as LightGBM, may output scores that fall outside the [0, 1] range, including negative values. In that case, apply a sigmoid function to the scores before fitting so that they lie in (0, 1).
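As a minimal sketch of that normalization step, the logistic sigmoid maps any real-valued raw score into (0, 1). The helper names `sigmoid` and `add_sigmoid_column` and the column name `raw_score` are hypothetical and not part of `spark_calibration`; the PySpark helper relies only on standard `pyspark.sql.functions` calls and assumes `pyspark` is installed.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic sigmoid for a single raw score."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # avoids overflow for large negative x
    return z / (1.0 + z)


def add_sigmoid_column(df, raw_col="raw_score", out_col="score"):
    """Return df with out_col = sigmoid(raw_col) as a Spark column expression.

    raw_col is an illustrative name (assumption); out_col defaults to `score`,
    the default score column name expected by fit().
    """
    from pyspark.sql import functions as F  # imported lazily; requires pyspark

    return df.withColumn(out_col, 1.0 / (1.0 + F.exp(-F.col(raw_col))))


print(sigmoid(0.0))  # 0.5
```

With a dataframe that holds raw margins in `raw_score`, `train_df = add_sigmoid_column(train_df)` would produce a `score` column suitable for `fit()`.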

```python
from spark_calibration import Betacal
from spark_calibration import display_classification_calib_metrics
from spark_calibration import plot_calibration_curve

# Initialize model
bc = Betacal(parameters="abm")

# Load training data (assumes an active SparkSession named `spark`)
train_df = spark.read.parquet("s3://train/")

# Fit the model
bc.fit(train_df)

# Or specify custom column names
# bc.fit(train_df, score_col="raw_score", label_col="actual_label")

# Fit with sample weights (optional)
# bc.fit(train_df, weight_col="sample_weight")

# Access model parameters
print(f"Model coefficients: a={bc.a}, b={bc.b}, c={bc.c}")

# Or use get_params() method
params = bc.get_params()
print(f"Model parameters: {params}")
```

The model learns three parameters:

- a: Coefficient for log(score)

- b: Coefficient for log(1-score) 

- c: Intercept term
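To make the role of the three parameters concrete, here is a plain-Python sketch of the beta calibration map, reading the list above literally: the calibrated probability is the sigmoid of `a * log(score) + b * log(1 - score) + c`. The sign convention for `b` is an assumption; the beta calibration literature often uses `-log(1 - score)` as the second feature, which is the same map with the sign of `b` flipped, so check the package source before relying on it. The function name `beta_calibrate` and the clipping constant `eps` are hypothetical.

```python
import math


def beta_calibrate(score: float, a: float, b: float, c: float, eps: float = 1e-12) -> float:
    """Apply sigmoid(a*log(s) + b*log(1-s) + c) to one raw score.

    The sign convention for b follows the parameter list literally (assumption).
    eps clips the score away from 0 and 1 so both logarithms stay finite.
    """
    s = min(max(score, eps), 1.0 - eps)
    logit = a * math.log(s) + b * math.log(1.0 - s) + c
    # Numerically stable sigmoid of the logit
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


# With a=1, b=-1, c=0 the logit is log(s / (1 - s)), so the map is the identity.
print(round(beta_calibrate(0.3, a=1.0, b=-1.0, c=0.0), 6))  # 0.3
```

Under this reading, `c` shifts all calibrated scores up or down, while `a` and `b` reshape the curve near 0 and near 1 respectively.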

### Saving and Loading Models

You can save the trained model to disk and load it later:

```python
# Save model
save_path = bc.save("/path/to/save/")

# Load model
loaded_model = Betacal.load("/path/to/save/")
```

### Prediction

`test_df` should be a PySpark dataframe containing a column with raw model scores. By default this column is expected to be named `score`, but you can pass a different column name to `predict()`. `predict()` adds a new column, `prediction`, that holds the calibrated score.

```python
test_df = spark.read.parquet("s3://test/")

# Using default column name 'score'
test_df = bc.predict(test_df)

# Or specify a custom score column name
# test_df = bc.predict(test_df, score_col="raw_score")
```

### Pre & Post Calibration Classification Metrics

`test_df` should have `score`, `prediction` and `label` columns.

The `display_classification_calib_metrics` function displays `brier_score_loss`, `log_loss`, `area_under_PR_curve` and `area_under_ROC_curve` before and after calibration.

```python
display_classification_calib_metrics(test_df)
```

#### Output

```
model brier score loss: 0.08072683729933376
calibrated model brier score loss: 0.01014015353257748
delta: -87.44%

model log loss: 0.3038106859864252
calibrated model log loss: 0.053275633947890755
delta: -82.46%

model aucpr: 0.03471287564672635
calibrated model aucpr: 0.03471240518472563
delta: -0.0%

model roc_auc: 0.7490639506966398
calibrated model roc_auc: 0.7490649764289607
delta: 0.0%
```
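For reference, the sketch below shows what the two calibration-sensitive metrics measure and how the `delta` lines in the sample output can be reproduced as a relative change, which is consistent with the numbers shown (0.0807 to 0.0101 is about -87.44%). It is plain Python on small lists, not the package's Spark implementation, and the function names `brier_score`, `log_loss` and `pct_delta` are hypothetical.

```python
import math


def brier_score(probs, labels):
    """Mean squared difference between predicted probability and 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def log_loss(probs, labels, eps=1e-15):
    """Mean negative log-likelihood; the eps clipping is an assumption to avoid log(0)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def pct_delta(before, after):
    """Relative change from the uncalibrated to the calibrated metric, in percent."""
    return (after - before) / before * 100.0


# Reproduce the Brier score delta from the sample output above.
print(f"{pct_delta(0.08072683729933376, 0.01014015353257748):.2f}%")  # -87.44%
```

Because a monotonic recalibration of the scores leaves their ranking almost unchanged, ranking metrics such as ROC AUC and PR AUC barely move, while Brier score and log loss improve substantially.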

### Plot the Calibration Curve

`plot_calibration_curve` computes the true and predicted probabilities (pre- and post-calibration) using a quantile binning strategy with 50 bins and plots the calibration curve.

```python
plot_calibration_curve(test_df)
```
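The quantile binning behind a calibration curve can be sketched in plain Python as follows: sort rows by predicted probability, split them into bins with roughly equal counts, and compare each bin's mean prediction with its observed positive rate. This illustrates the idea only; it is not the package's Spark implementation, and `quantile_calibration_bins` is a hypothetical name.

```python
def quantile_calibration_bins(probs, labels, n_bins=50):
    """Return (mean predicted probability, observed positive rate) per quantile bin."""
    pairs = sorted(zip(probs, labels))  # order rows by predicted probability
    n = len(pairs)
    n_bins = min(n_bins, n)             # never create empty bins
    curve = []
    for i in range(n_bins):
        # Integer boundaries give bins whose sizes differ by at most one row
        lo, hi = i * n // n_bins, (i + 1) * n // n_bins
        chunk = pairs[lo:hi]
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        frac_pos = sum(y for _, y in chunk) / len(chunk)
        curve.append((mean_pred, frac_pos))
    return curve


probs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
# Prints "0.25 0.25" then "0.75 0.75": a perfectly calibrated toy example
for mean_pred, frac_pos in quantile_calibration_bins(probs, labels, n_bins=2):
    print(f"{mean_pred:.2f} {frac_pos:.2f}")
```

Points that lie on the diagonal (mean prediction equal to observed rate) indicate good calibration; running the same binning on the `score` and `prediction` columns gives the pre- and post-calibration curves.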
