https://github.com/NorskRegnesentral/shapr

Explaining the output of machine learning models with more accurately estimated Shapley values

Host: GitHub
URL: https://github.com/NorskRegnesentral/shapr
Owner: NorskRegnesentral
License: other
Created: 2018-05-16T07:05:13.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2024-05-29T19:29:48.000Z (5 months ago)
Last Synced: 2024-05-29T20:02:54.826Z (5 months ago)
Topics: explainable-ai, explainable-ml, rcpp, rcpparmadillo, rstats, shapley
Language: R
Homepage: https://norskregnesentral.github.io/shapr/
Size: 40.5 MB
Stars: 135
Watchers: 8
Forks: 30
Open Issues: 31
Metadata Files:
- Readme: README.Rmd
- Contributing: .github/CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md

README

---
output: github_document
bibliography: ./inst/REFERENCES.bib
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%",
  tidy = "styler"
)
```

# shapr 

[![CRAN_Status_Badge](https://www.r-pkg.org/badges/version-last-release/shapr)](https://cran.r-project.org/package=shapr)

[![CRAN_Downloads_Badge](https://cranlogs.r-pkg.org/badges/grand-total/shapr)](https://cran.r-project.org/package=shapr)

[![R build status](https://github.com/NorskRegnesentral/shapr/workflows/R-CMD-check/badge.svg)](https://github.com/NorskRegnesentral/shapr/actions?query=workflow%3AR-CMD-check)

[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html)

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/license/mit/)

[![DOI](https://joss.theoj.org/papers/10.21105/joss.02027/status.svg)](https://doi.org/10.21105/joss.02027)

## Brief NEWS

### Breaking change (June 2023)

As of version 0.2.3.9000, the development version of `shapr` (master branch on GitHub from June 2023) has been substantially restructured, introducing a new syntax for explaining models and, with it, a range of breaking changes. In essence, a single function (`explain()`) now replaces the previous two-function workflow (`shapr()` followed by `explain()`).

The CRAN version of `shapr` (v0.2.2) still uses the old syntax. 

See the [NEWS](https://github.com/NorskRegnesentral/shapr/blob/master/NEWS.md) for details. 

The examples below use the new syntax.

[Here](https://github.com/NorskRegnesentral/shapr/blob/cranversion_0.2.2/README.md) is a version of this README with the syntax of the CRAN version (v0.2.2).

### Python wrapper

As of version 0.2.3.9100 (master branch on GitHub from June 2023), we provide a Python wrapper (`shaprpy`) which allows explaining Python models with the methodology implemented in `shapr`, directly from Python. The wrapper is available [here](https://github.com/NorskRegnesentral/shapr/tree/master/python). See also the details in the [NEWS](https://github.com/NorskRegnesentral/shapr/blob/master/NEWS.md).

## Introduction

The most common machine learning task is to train a model which is able to predict an unknown outcome (response variable) based on a set of known input variables/features.

When using such models for real-life applications, it is often crucial to understand why a certain set of features leads to exactly that prediction.

Explaining predictions from complex, or seemingly simple, machine learning models is a practical and ethical question, as well as a legal issue. Can I trust the model? Is it biased? Can I explain it to others? We want to explain individual predictions from a complex machine learning model by learning simple, interpretable explanations.

The Shapley value framework is the only prediction explanation framework with a solid theoretical foundation (@lundberg2017unified). Unless the true distribution of the features is known, and there are fewer than, say, 10-15 features, these Shapley values need to be estimated/approximated.
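To make the computational cost concrete, here is a minimal base R sketch that computes exact Shapley values by enumerating every coalition. It is not how `shapr` computes its estimates; the function `exact_shapley()`, the value function `v()` and the weights `w` are hypothetical and chosen only for illustration. With `m` features the loop visits `2^(m - 1)` coalitions per feature, which is why exact computation quickly becomes infeasible.

```r
# Exact Shapley values by brute-force enumeration (base R only).
# `v` is a hypothetical value function taking an integer vector of feature indices.
exact_shapley <- function(v, m) {
  phi <- numeric(m)
  for (j in seq_len(m)) {
    others <- setdiff(seq_len(m), j)
    # Each mask encodes one subset S of the remaining m - 1 features
    for (mask in 0:(2^(m - 1) - 1)) {
      in_S <- (mask %/% 2^(seq_along(others) - 1)) %% 2 == 1
      S <- others[in_S]
      # Shapley weight |S|! (m - |S| - 1)! / m!
      weight <- factorial(length(S)) * factorial(m - length(S) - 1) / factorial(m)
      phi[j] <- phi[j] + weight * (v(c(S, j)) - v(S))
    }
  }
  phi
}

# Toy value function: additive contributions plus an interaction between features 1 and 2
w <- c(1, 2, 3)
v <- function(S) sum(w[S]) + 2 * all(c(1, 2) %in% S)

phi <- exact_shapley(v, m = 3)
print(phi)
```

The interaction term is split equally between features 1 and 2, and the values sum to `v(1:3) - v(integer(0))`, which is the efficiency property of Shapley values.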

Popular methods like Shapley Sampling Values (@vstrumbelj2014explaining), SHAP/Kernel SHAP (@lundberg2017unified), and to some extent TreeSHAP (@lundberg2018consistent), assume that the features are independent when approximating the Shapley values for prediction explanation. This may lead to very inaccurate Shapley values, and consequently wrong interpretations of the predictions. @aas2019explaining extends and improves the Kernel SHAP method of @lundberg2017unified to account for the dependence between the features, resulting in significantly more accurate approximations to the Shapley values. 
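The effect of the independence assumption can be seen in a small base R sketch. It assumes a hypothetical linear model `f` and two standard normal features with correlation `rho = 0.9`; none of these names come from `shapr`. The contribution function for the coalition containing only feature 1 is the expected prediction given the observed value of feature 1, and it differs markedly depending on whether the dependence is taken into account.

```r
# Two standard normal features with correlation rho (illustrative values)
mu <- c(0, 0)
rho <- 0.9
Sigma <- matrix(c(1, rho, rho, 1), nrow = 2)

# Hypothetical linear model and the observed value of feature 1
f <- function(x1, x2) x1 + x2
x1_star <- 2

# Conditional mean of feature 2 given feature 1, for a bivariate Gaussian
mu_cond <- mu[2] + Sigma[2, 1] / Sigma[1, 1] * (x1_star - mu[1])

# Since f is linear, the expected prediction is f evaluated at the expected x2
v_dependent <- f(x1_star, mu_cond)   # uses E[x2 | x1 = x1_star]
v_independent <- f(x1_star, mu[2])   # ignores the dependence, uses E[x2]

print(c(dependent = v_dependent, independent = v_independent))
```

Under these assumptions the conditional version gives 3.8 while the independence version gives 2, so any Shapley value built on the latter would misattribute the part of the prediction that feature 1 carries through its correlation with feature 2.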

[See the paper for details](https://arxiv.org/abs/1903.10464).

This package implements the methodology of @aas2019explaining.

The following methodology/features are currently implemented:

-   Native support for explaining predictions from models fitted with the following functions: `stats::glm`, `stats::lm`, `ranger::ranger`, `xgboost::xgboost`/`xgboost::xgb.train` and `mgcv::gam`.

-   Accounting for feature dependence 

    * assuming the features are Gaussian (`approach = 'gaussian'`, @aas2019explaining)

    * with a Gaussian copula (`approach = 'copula'`, @aas2019explaining)

    * using the Mahalanobis distance based empirical (conditional) distribution approach (`approach = 'empirical'`, @aas2019explaining)

    * using conditional inference trees (`approach = 'ctree'`, @redelmeier2020explaining). 

    * using the endpoint match method for time series (`approach = 'timeseries'`, @jullum2021efficient)

    * using the joint distribution approach for models with purely categorical data (`approach = 'categorical'`, @redelmeier2020explaining)

    * assuming all features are independent (`approach = 'independence'`, mainly for benchmarking)

-   Combining any of the above methods.

-   Explain *forecasts* from time series models at different horizons with `explain_forecast()` (R only)

-   Batch computation to reduce memory consumption significantly

-   Parallelized computation using the [future](https://future.futureverse.org/) framework. (R only)

-   Progress bar showing computation progress, using the [`progressr`](https://progressr.futureverse.org/) package. Must be activated by the user.

-   Optional use of the AICc criterion of @hurvich1998smoothing when optimizing the bandwidth parameter in the empirical (conditional) approach of @aas2019explaining.

-   Functionality for visualizing the explanations. (R only)

-   Support for models not supported natively.

Note that the prediction outcome must be numeric.

All approaches except `approach = 'categorical'` work for numeric features, but unless the models are very Gaussian-like, we recommend `approach = 'ctree'` or `approach = 'empirical'`, especially if there are discretely distributed features.

When the model contains both numeric and categorical features, we recommend `approach = 'ctree'`.

For models with a small number of categorical features (without many levels) and a decent training set, we recommend `approach = 'categorical'`.

For (binary) classification based on time series models, we suggest using `approach = 'timeseries'`.

To explain forecasts of time series models (at different horizons), we recommend using `explain_forecast()` instead of `explain()`. 

The former has a more suitable input syntax for explaining those kinds of forecasts.

See the [vignette](https://norskregnesentral.github.io/shapr/articles/understanding_shapr.html) for details and further examples.

Unlike SHAP and TreeSHAP, we decompose probability predictions directly to ease the interpretability, i.e. not via log odds transformations.

## Installation

To install the current stable release from CRAN (note that it uses the old explanation syntax), use

```{r, eval = FALSE}
install.packages("shapr")
```

To install the current development version (with the new explanation syntax), use

```{r, eval = FALSE}
remotes::install_github("NorskRegnesentral/shapr")
```

If you would like to install all packages of the models we currently support, use

```{r, eval = FALSE}
remotes::install_github("NorskRegnesentral/shapr", dependencies = TRUE)
```

If you would also like to build and view the vignette locally, use 

```{r, eval = FALSE}
remotes::install_github("NorskRegnesentral/shapr", dependencies = TRUE, build_vignettes = TRUE)
vignette("understanding_shapr", "shapr")
```

You can always check out the latest version of the vignette [here](https://norskregnesentral.github.io/shapr/articles/understanding_shapr.html). 

## Example

`shapr` supports computation of Shapley values with any predictive model which takes a set of numeric features and produces a numeric outcome. 

The following example shows how a simple `xgboost` model is trained using the *airquality* dataset, and how `shapr` explains the individual predictions. 

```{r basic_example, warning = FALSE}
library(xgboost)
library(shapr)

data("airquality")
data <- data.table::as.data.table(airquality)
data <- data[complete.cases(data), ]

x_var <- c("Solar.R", "Wind", "Temp", "Month")
y_var <- "Ozone"

# The first six rows are explained; the rest are used for training
ind_x_explain <- 1:6
x_train <- data[-ind_x_explain, ..x_var]
y_train <- data[-ind_x_explain, get(y_var)]
x_explain <- data[ind_x_explain, ..x_var]

# Looking at the dependence between the features
cor(x_train)

# Fitting a basic xgboost model to the training data
model <- xgboost(
  data = as.matrix(x_train),
  label = y_train,
  nrounds = 20,
  verbose = FALSE
)

# Specifying the phi_0, i.e. the expected prediction without any features
p0 <- mean(y_train)

# Computing the actual Shapley values with kernelSHAP accounting for feature dependence using
# the empirical (conditional) distribution approach with bandwidth parameter sigma = 0.1 (default)
explanation <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "empirical",
  prediction_zero = p0
)

# Printing the Shapley values for the test data.
# For more information about the interpretation of the values in the table, see ?shapr::explain.
print(explanation$shapley_values)

# Finally we plot the resulting explanations
plot(explanation)
```

See the [vignette](https://norskregnesentral.github.io/shapr/articles/understanding_shapr.html) for further examples.
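For a model class without native support, the following is a sketch under stated assumptions rather than a definitive recipe. It assumes that the development version of `explain()` accepts a user-supplied prediction function through a `predict_model` argument, as described in the vignette; check `?shapr::explain` for the installed version. The wrapper `my_predict_model()` and the `loess` model are illustrative choices, and the call to `explain()` is left commented out so that the block runs with base R alone.

```r
# Illustrative only: a base R model class not in the list of natively supported models
data("airquality")
data <- airquality[complete.cases(airquality), ]
x_var <- c("Wind", "Temp")

ind_x_explain <- 1:6
train <- data[-ind_x_explain, ]
x_train <- train[, x_var]
x_explain <- data[ind_x_explain, x_var]

model <- stats::loess(Ozone ~ Wind + Temp, data = train)

# Hypothetical wrapper: takes the model and new data, returns a numeric vector
my_predict_model <- function(x, newdata) {
  as.numeric(predict(x, newdata = as.data.frame(newdata)))
}

# The wrapper can be checked on its own before handing it to shapr
preds <- my_predict_model(model, x_train)
print(head(preds))

# Assumed usage with the development version of shapr (not run here):
# explanation <- shapr::explain(
#   model = model,
#   x_explain = x_explain,
#   x_train = x_train,
#   approach = "empirical",
#   prediction_zero = mean(train$Ozone),
#   predict_model = my_predict_model
# )
```

Checking that the wrapper returns one finite numeric prediction per row is a cheap way to catch problems before running the more expensive explanation step.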

## Contribution

All feedback and suggestions are very welcome. Details on how to contribute can be found [here](https://norskregnesentral.github.io/shapr/CONTRIBUTING.html). If you have any questions or comments, feel free to open an issue [here](https://github.com/NorskRegnesentral/shapr/issues).

Please note that the `shapr` project is released with a [Contributor Code of Conduct](https://norskregnesentral.github.io/shapr/CODE_OF_CONDUCT.html). By contributing to this project, you agree to abide by its terms.

## References