{"id":17111315,"url":"https://github.com/nredell/shapflex","last_synced_at":"2025-04-13T02:32:32.511Z","repository":{"id":215776468,"uuid":"182635617","full_name":"nredell/shapFlex","owner":"nredell","description":"An R package for computing asymmetric Shapley values to assess causality in any trained machine learning model","archived":false,"fork":false,"pushed_at":"2020-06-09T21:10:53.000Z","size":2230,"stargazers_count":74,"open_issues_count":4,"forks_count":7,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-03-26T20:49:33.306Z","etag":null,"topics":["causal-inference","causal-networks","causality","ensemble","feature-importance","iml","interpretable-machine-learning","machine-learning","package","r","r-package","shap","shapley","shapley-value","shapley-values"],"latest_commit_sha":null,"homepage":"","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nredell.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2019-04-22T06:26:53.000Z","updated_at":"2025-01-20T08:00:32.000Z","dependencies_parsed_at":"2024-01-06T14:11:11.170Z","dependency_job_id":null,"html_url":"https://github.com/nredell/shapFlex","commit_stats":null,"previous_names":["nredell/shapflex"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nredell%2FshapFlex","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nredell%2FshapFlex/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nredell%2FshapFlex/releases","manifests_url":"https://repos.ecosyste
.ms/api/v1/hosts/GitHub/repositories/nredell%2FshapFlex/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nredell","download_url":"https://codeload.github.com/nredell/shapFlex/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248657819,"owners_count":21140842,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["causal-inference","causal-networks","causality","ensemble","feature-importance","iml","interpretable-machine-learning","machine-learning","package","r","r-package","shap","shapley","shapley-value","shapley-values"],"created_at":"2024-10-14T16:51:16.193Z","updated_at":"2025-04-13T02:32:32.129Z","avatar_url":"https://github.com/nredell.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![lifecycle](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://www.tidyverse.org/lifecycle/#experimental)\n[![Travis Build Status](https://travis-ci.org/nredell/shapFlex.svg?branch=master)](https://travis-ci.org/nredell/shapFlex)\n[![codecov](https://codecov.io/github/nredell/shapFlex/branch/master/graphs/badge.svg)](https://codecov.io/github/nredell/shapFlex)\n\n                                                                               \n# package::shapFlex \u003cimg src=\"./tools/shapFlex_logo.png\" alt=\"shapFlex logo\" align=\"right\" height=\"138.5\" style=\"display: inline-block;\"\u003e\n\nThe purpose of `shapFlex`, short for Shapley flexibility, is to compute stochastic feature-level Shapley values which \ncan be used to (a) 
interpret and/or (b) assess the fairness of any machine learning model while \n**incorporating causal constraints into the model's feature space**. **[Shapley values](https://christophm.github.io/interpretable-ml-book/shapley.html)** \nare an intuitive and theoretically sound model-agnostic diagnostic tool to understand both **global feature importance** across all instances in a data set \nand instance/row-level **local feature importance** in black-box machine learning models.\n\n![](./tools/shap_diagram.PNG)\n\n* **Any ML model** + **Causal hypotheses among features** + **Shapley algorithm** = **Causal ML model interpretability**\n\nThis package implements the algorithm described in \n[Štrumbelj and Kononenko's (2014) sampling-based Shapley approximation algorithm](https://link.springer.com/article/10.1007%2Fs10115-013-0679-x) \nto compute the stochastic Shapley values for a given model feature and the algorithm described in \n[Frye, Feige, \u0026 Rowat's (2019) Asymmetric Shapley values: incorporating causal knowledge into model-agnostic explainability](https://arxiv.org/pdf/1910.06358.pdf) \nto incorporate prior knowledge into the Shapley value calculation. Asymmetric Shapley values can be tuned by the researcher to avoid splitting \nthe Shapley feature effects uniformly across related/correlated features--as is done in the symmetric case--and focus on the unique effect of a target \nfeature after having conditioned on other pre-specified \"causal\" feature effects.\n\n* **Flexibility**: \n    + Shapley values can be estimated for \u003cu\u003eany machine learning model\u003c/u\u003e using a simple user-defined \n    `predict()` wrapper function.\n    + Shapley values can be estimated by incorporating prior knowledge about causality in the feature space; this is especially \n    useful for interpreting time series models with a temporal dependence.\n\n* **Speed**:\n    + The code itself hasn't necessarily been optimized for speed. 
The speed advantage of `shapFlex` comes in the form of giving the user the ability \n to \u003cu\u003eselect 1 or more target features of interest\u003c/u\u003e and avoid having to compute Shapley values for all model features. This is especially \n useful in high-dimensional models as the computation of a Shapley value is exponential in the number of features.\n\n\n## README Contents\n\n* **[Install](#install)**\n* **[Vignettes](#vignettes)**\n* **Examples**\n    + **[Symmetric Shapley values](#symmetric-shapley-values)**\n    + **[Asymmetric causal Shapley values (EXPERIMENTAL)](#asymmetric-causal-shapley-values)**\n    + **[R2 decomposition](#r2-decomposition)**\n* **[Cite](#cite)**\n* **[References](#references)**\n* **[Roadmap](#roadmap)**\n\n\n## Install\n\n* Development\n\n``` r\ndevtools::install_github(\"nredell/shapFlex\")\nlibrary(shapFlex)\n```\n\n## Vignettes\n\n**[Consistency between stochastic and tree-based Shapley values.](https://nredell.github.io/shapFlex/doc/consistency.html)** \n\n## Examples\n\n### Symmetric Shapley values\n\n* TBD\n\n### Asymmetric causal Shapley values\n\n**EXPERIMENTAL**\n\nBelow is an example of how `shapFlex` can be used to compute Shapley values for a subset of model \nfeatures from a Random Forest model based on 3 sets of assumptions about causality amongst the model features:\n\n**1. Symmetric:** Default. No causal knowledge is incorporated into the Shapley calculations.\n\n**2. Asymmetric with weights = .5:** Agnostic causality. Similar to the symmetric algorithm. The difference is \nthat, in the asymmetric algorithm, the entire set of causal effects is conditioned on as a group; the \nsymmetric algorithm would condition on random subsets of the causal features.\n\n**3. Asymmetric with weights = 1:** Pure causality. The Shapley estimates for the causal targets are \nbased on the actual/true/known feature values of the causal effects. 
Put another way, the estimates for \nthe causal targets have been conditioned on the causal effects which decreases their magnitude. \nThe Shapley estimates for the causal effects will then increase correspondingly to satisfy the Shapley property \nthat the sum of the feature-level effects equals the model prediction.\n\n``` r\nlibrary(shapFlex)\nlibrary(dplyr)\nlibrary(ggplot2)\nlibrary(randomForest)\n\n# Input data: Adult aka Census Income dataset.\ndata(\"data_adult\", package = \"shapFlex\")\ndata \u003c- data_adult\n#------------------------------------------------------------------------------\n# Train a machine learning model; currently limited to single outcome regression and binary classification.\noutcome_name \u003c- \"income\"\noutcome_col \u003c- which(names(data) == outcome_name)\n\nmodel_formula \u003c- formula(paste0(outcome_name,  \"~ .\"))\n\nset.seed(1)\nmodel \u003c- randomForest::randomForest(model_formula, data = data, ntree = 300)\n#------------------------------------------------------------------------------\n# A user-defined prediction function that takes 2 positional arguments and returns\n# a 1-column data.frame of predictions for each instance to be explained: (1) A trained\n# ML model object and (2) a data.frame of model features; transformations of the input\n# data such as converting the data.frame to a matrix should occur within this wrapper.\npredict_function \u003c- function(model, data) {\n  \n  # We'll predict the probability of the outcome being \u003e50k.\n  data_pred \u003c- data.frame(\"y_pred\" = predict(model, data, type = \"prob\")[, 2])\n  return(data_pred)\n}\n#------------------------------------------------------------------------------\n# shapFlex setup.\nexplain \u003c- data[1:300, -outcome_col]  # Compute Shapley feature-level predictions for 300 instances.\n\nreference \u003c- data[, -outcome_col]  # An optional reference population to compute the baseline prediction.\n\nsample_size \u003c- 60  # Number of Monte 
Carlo samples.\n\ntarget_features \u003c- c(\"marital_status\", \"education\", \"relationship\",  \"native_country\",\n                     \"age\", \"sex\", \"race\", \"hours_per_week\")  # Optional: A subset of features.\n\ncausal \u003c- data.frame(\n  \"cause\" = c(\"age\", \"sex\", \"race\", \"native_country\",\n              \"age\", \"sex\", \"race\", \"native_country\", \"age\",\n              \"sex\", \"race\", \"native_country\"),\n  \"effect\" = c(rep(\"marital_status\", 4), rep(\"education\", 4), rep(\"relationship\", 4))\n                     )\n```\n\n* Plot the causal setup.\n\n``` r\nlibrary(ggraph)  # Provides ggraph(), geom_edge_link(), geom_node_label(), and theme_graph().\n\nset.seed(1)\np \u003c- ggraph(causal, layout = \"kk\")\np \u003c- p + geom_edge_link(aes(start_cap = label_rect(node1.name),\n                            end_cap = label_rect(node2.name)),\n                        arrow = arrow(length = unit(5, 'mm'), type = \"closed\"),\n                        color = \"grey25\")\np \u003c- p + geom_node_label(aes(label = name), fontface = \"bold\")\np \u003c- p + scale_x_continuous(expand = expand_scale(0.2))\np \u003c- p + theme_graph(foreground = 'white', fg_text_colour = 'white')\np\n```\n\n![](./tools/causal_diagram.png)\n\n* Calculate the Shapley values from our model under various degrees of belief in the causal structure.\n\n``` r\n# 1: Non-causal symmetric Shapley values: ~10 seconds to run.\nset.seed(1)\nexplained_non_causal \u003c- shapFlex::shapFlex(explain = explain,\n                                           reference = reference,\n                                           model = model,\n                                           predict_function = predict_function,\n                                           target_features = target_features,\n                                           sample_size = sample_size)\n#------------------------------------------------------------------------------\n# 2: Causal asymmetric Shapley values with full causal weights of 1: ~30 seconds to run.\nset.seed(1)\nexplained_full \u003c- 
shapFlex::shapFlex(explain = explain,\n                                     reference = reference,\n                                     model = model,\n                                     predict_function = predict_function,\n                                     target_features = target_features,\n                                     causal = causal,\n                                     causal_weights = rep(1, nrow(causal)),  # Pure causal weights\n                                     sample_size = sample_size)\n#------------------------------------------------------------------------------\n# 3: Causal asymmetric Shapley values with agnostic causal weights of .5: ~30 seconds to run.\nset.seed(1)\nexplained_half \u003c- shapFlex::shapFlex(explain = explain,\n                                     reference = reference,\n                                     model = model,\n                                     predict_function = predict_function,\n                                     target_features = target_features,\n                                     causal = causal,\n                                     causal_weights = rep(.5, nrow(causal)),  # Approximates symmetric calc.\n                                     sample_size = sample_size)\n```\n\n* Reshape the data for plotting.\n\n``` r\nexplained_non_causal_sum \u003c- explained_non_causal %\u003e%\n  dplyr::group_by(feature_name) %\u003e%\n  dplyr::summarize(\"shap_effect\" = mean(shap_effect, na.rm = TRUE))\nexplained_non_causal_sum$type \u003c- \"Symmetric\"\n\nexplained_full_sum \u003c- explained_full %\u003e%\n  dplyr::group_by(feature_name) %\u003e%\n  dplyr::summarize(\"shap_effect\" = mean(shap_effect, na.rm = TRUE))\nexplained_full_sum$type \u003c- \"Pure causal (1)\"\n\nexplained_half_sum \u003c- explained_half %\u003e%\n  dplyr::group_by(feature_name) %\u003e%\n  dplyr::summarize(\"shap_effect\" = mean(shap_effect, na.rm = TRUE))\nexplained_half_sum$type \u003c- \"Agnostic causal 
(.5)\"\n#------------------------------------------------------------------------------\n# Plot the Shapley feature effects for the target features.\n\ndata_plot \u003c- dplyr::bind_rows(explained_non_causal_sum, explained_full_sum, explained_half_sum)\n\n# Re-order the target features so the causal outcomes are first.\ndata_plot$feature_name \u003c- factor(data_plot$feature_name, levels = target_features, ordered = TRUE)\n\np \u003c- ggplot(data_plot, aes(feature_name, shap_effect, fill = ordered(type)))\np \u003c- p + geom_col(position = position_dodge())\np \u003c- p + theme_bw() + theme(\n  plot.title = element_text(size = 14, face = \"bold\"),\n  axis.title = element_text(size = 12, face = \"bold\"),\n  axis.text.x = element_text(angle = 90, hjust = 1, vjust = .5, size = 12),\n  axis.text.y = element_text(size = 12)\n)\np \u003c- p + xlab(NULL) + ylab(\"Average Shapley effect (baseline is .23)\") + labs(fill = \"Algorithm\") +\n  ggtitle(\"Average Shapley Feature Effects Based on 3 Causal Assumptions\")\np\n```\n![](./tools/shap_avg_feature_effects.jpeg)\n\n***\n\n### R2 decomposition\n\nThe code below illustrates how to decompose a regression model's R^2 to get global measures \nof feature importance for any black box model. 
The `shapFlex::r2()` function will also work with Shapley \nvalues computed from other packages.\n\n``` r\nlibrary(shapFlex)\nlibrary(dplyr)\nlibrary(tidyr)\nlibrary(ggplot2)\nlibrary(randomForest)\n\ndata(\"imports85\", package = \"randomForest\")\ndata \u003c- imports85\n\ndata \u003c- data[, -2]  # This column has excessive missing data.\ndata \u003c- data[complete.cases(data), ]\n#------------------------------------------------------------------------------\n# Train a machine learning model; currently limited to single outcome regression and binary classification.\n\noutcome_col \u003c- which(names(data) == \"price\")\noutcome_name \u003c- names(data)[outcome_col]\n\nmodel_formula \u003c- formula(paste0(outcome_name,  \"~ .\"))\n\nmodel \u003c- randomForest::randomForest(model_formula, data = data, ntree = 300)\n#------------------------------------------------------------------------------\n# A user-defined prediction function that takes 2 positional arguments and returns\n# a 1-column data.frame of predictions for each instance to be explained: (1) A trained\n# ML model object and (2) a data.frame of model features; transformations of the input\n# data such as converting the data.frame to a matrix should occur within this wrapper.\npredict_function \u003c- function(model, data) {\n\n  data_pred \u003c- data.frame(\"y_pred\" = predict(model, data))\n  return(data_pred)\n}\n#------------------------------------------------------------------------------\n# shapFlex setup.\n\n# Compute Shapley feature-level predictions for all 193 instances in the dataset.\nexplain \u003c- data[, -outcome_col]\n\nreference \u003c- NULL  # The optional reference group is not needed because we're using the population.\n\nsample_size \u003c- 60  # Number of Monte Carlo samples.\n\ntarget_features \u003c- NULL  # Default; compute Shapley values for all features.\n#------------------------------------------------------------------------------\n# Symmetric Shapley values with no causal 
specifications; ~10 seconds to run.\nset.seed(224)\ndata_shap \u003c- shapFlex::shapFlex(explain = explain,\n                                reference = reference,\n                                model = model,\n                                predict_function = predict_function,\n                                target_features = target_features,\n                                sample_size = sample_size)\n\nhead(data_shap, 10)\n```\n![](./tools/shapFlex_output.PNG)\n\n* Reshape the data for `r2()`.\n\n``` r\ndata_shap_wide \u003c- tidyr::pivot_wider(data_shap, id_cols = \"index\",\n                                     names_from = \"feature_name\", values_from = \"shap_effect\")\n\ndata_shap_wide$index \u003c- NULL\n\nhead(data_shap_wide)\n```\n![](./tools/shapFlex_output_wide.PNG)\n\n``` r\ny \u003c- data[, outcome_name]\nintercept \u003c- unique(data_shap$intercept)\n\nshapFlex::r2(data_shap_wide, y, intercept)\n```\n\n![](./tools/r2.PNG)\n\n***\n\n## Cite\n\nAt the moment, the best citation for this package is related to the `shapFlex::r2()` function.\n\nRedell, N. (2019). [Shapley decomposition of R^2 in machine learning models](https://arxiv.org/abs/1908.09718). arXiv preprint arXiv:1908.09718.\n\n\n## References\n\nŠtrumbelj, E. \u0026 Kononenko, I. (2014) Explaining prediction models and individual predictions with feature contributions. Knowl Inf Syst (2014) 41: 647. https://doi.org/10.1007/s10115-013-0679-x\n\n\n## Roadmap\n\n* Thorough unit testing with many different causal setups and simulated, ground truth data.\n\n* Vignettes detailing how the algorithms work--in pictures.\n\n* Think about how `lavaan` and `piecewiseSEM` models might be incorporated.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnredell%2Fshapflex","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnredell%2Fshapflex","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnredell%2Fshapflex/lists"}