{"id":13737217,"url":"https://github.com/BCG-X-Official/facet","last_synced_at":"2025-05-08T13:33:15.601Z","repository":{"id":37894265,"uuid":"285236885","full_name":"BCG-X-Official/facet","owner":"BCG-X-Official","description":"Human-explainable AI.","archived":false,"fork":false,"pushed_at":"2024-02-01T13:13:55.000Z","size":52936,"stargazers_count":500,"open_issues_count":10,"forks_count":46,"subscribers_count":12,"default_branch":"2.1.x","last_synced_at":"2024-05-23T10:02:51.053Z","etag":null,"topics":["data-analytics","data-science","explainable-ai","hyperparameter-tuning","interpretability","machine-learning","model-selection","python","shap-vector-decomposition","simulation","statistics"],"latest_commit_sha":null,"homepage":"https://bcg-x-official.github.io/facet","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/BCG-X-Official.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-08-05T09:07:03.000Z","updated_at":"2024-06-21T16:45:22.414Z","dependencies_parsed_at":"2024-01-06T14:51:08.949Z","dependency_job_id":"5ca61301-a835-47b6-8722-e07a4b9c7590","html_url":"https://github.com/BCG-X-Official/facet","commit_stats":null,"previous_names":["bcg-x-official/facet","bcg-gamma/facet"],"tags_count":25,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BCG-X-Official%2Ffacet","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BCG-X-Official%2Ffacet/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/Gi
tHub/repositories/BCG-X-Official%2Ffacet/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BCG-X-Official%2Ffacet/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/BCG-X-Official","download_url":"https://codeload.github.com/BCG-X-Official/facet/tar.gz/refs/heads/2.1.x","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253077544,"owners_count":21850342,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-analytics","data-science","explainable-ai","hyperparameter-tuning","interpretability","machine-learning","model-selection","python","shap-vector-decomposition","simulation","statistics"],"created_at":"2024-08-03T03:01:37.737Z","updated_at":"2025-05-08T13:33:13.289Z","avatar_url":"https://github.com/BCG-X-Official.png","language":"Jupyter Notebook","funding_links":[],"categories":["Jupyter Notebook"],"sub_categories":[],"readme":".. 
image:: sphinx/source/_images/Gamma_Facet_Logo_RGB_LB.svg\n\nFACET is an open source library for human-explainable AI.\nIt combines sophisticated model inspection and model-based simulation to enable better \nexplanations of your supervised machine learning models.\n\nFACET is composed of the following key components:\n\n+-----------------+-----------------------------------------------------------------------+\n| |spacer|        | **Model Inspection**                                                  |\n|                 |                                                                       |\n| |inspect|       | FACET introduces a new algorithm to quantify dependencies and         |\n|                 | interactions between features in ML models.                           |\n|                 | This new tool for human-explainable AI adds a new, global             |\n|                 | perspective to the observation-level explanations provided by the     |\n|                 | popular `SHAP \u003chttps://shap.readthedocs.io/en/stable/\u003e`__ approach.   |\n|                 | To learn more about FACET’s model inspection capabilities, see the    |\n|                 | getting started example below.                                        |\n+-----------------+-----------------------------------------------------------------------+\n| |spacer|        | **Model Simulation**                                                  |\n|                 |                                                                       |\n| |sim|           | FACET’s model simulation algorithms use ML models for                 |\n|                 | *virtual experiments* to help identify scenarios that optimise        |\n|                 | predicted outcomes.                                                   
|\n|                 | To quantify the uncertainty in simulations, FACET utilises a range    |\n|                 | of bootstrapping algorithms including stationary and stratified       |\n|                 | bootstraps.                                                           |\n|                 | For an example of FACET’s bootstrap simulations, see the              |\n|                 | quickstart example below.                                             |\n+-----------------+-----------------------------------------------------------------------+\n| |spacer|        | **Enhanced Machine Learning Workflow**                                |\n|                 |                                                                       |\n| |pipe|          | FACET offers an efficient and transparent machine learning            |\n|                 | workflow, enhancing                                                   |\n|                 | `scikit-learn \u003chttps://scikit-learn.org/stable/index.html\u003e`__'s       |\n|                 | tried and tested pipelining paradigm with new capabilities for model  |\n|                 | selection, inspection, and simulation.                                |\n|                 | FACET also introduces                                                 |\n|                 | `sklearndf \u003chttps://github.com/BCG-X-Official/sklearndf\u003e`__                |\n|                 | [`documentation \u003chttps://bcg-x-official.github.io/sklearndf/index.html\u003e`__]|\n|                 | an augmented version of *scikit-learn* with enhanced support for      |\n|                 | *pandas* data frames that ensures end-to-end traceability of features.|\n+-----------------+-----------------------------------------------------------------------+\n\n.. Begin-Badges\n\n|pypi| |conda| |azure_build| |azure_code_cov|\n|python_versions| |code_style| |made_with_sphinx_doc| |License_badge|\n\n.. 
End-Badges\n\n\nInstallation\n------------\n\nFACET supports both PyPI and Anaconda.\nWe recommend installing FACET into a dedicated environment.\n\nAnaconda\n~~~~~~~~\n\n.. code-block:: sh\n\n    conda create -n facet\n    conda activate facet\n    conda install -c bcg_gamma -c conda-forge gamma-facet\n\n\nPip\n~~~\n\nmacOS and Linux:\n^^^^^^^^^^^^^^^^\n\n.. code-block:: sh\n\n    python -m venv facet\n    source facet/bin/activate\n    pip install gamma-facet\n\nWindows:\n^^^^^^^^\n\n.. code-block:: dosbatch\n\n    python -m venv facet\n    facet\\Scripts\\activate.bat\n    pip install gamma-facet\n\n\nQuickstart\n----------\n\nThe following quickstart guide provides a minimal example workflow to get you\nup and running with FACET.\nFor additional tutorials and the API reference,\nsee the `FACET documentation \u003chttps://bcg-x-official.github.io/facet/docs-version/2-0\u003e`__.\n\nChanges and additions in new versions are summarized in the\n`release notes \u003chttps://bcg-x-official.github.io/facet/docs-version/2-0/release_notes.html\u003e`__.\n\n\nEnhanced Machine Learning Workflow\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nTo demonstrate the model inspection capability of FACET, we first create a\npipeline to fit a learner. In this simple example we will use the\n`diabetes dataset \u003chttps://web.stanford.edu/~hastie/Papers/LARS/diabetes.data\u003e`__\nwhich contains age, sex, BMI and blood pressure along with 6 blood serum\nmeasurements as features. This dataset was used in this\n`publication \u003chttps://statweb.stanford.edu/~tibs/ftp/lars.pdf\u003e`__.\nA transformed version of this dataset is also available in scikit-learn\n`here \u003chttps://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset\u003e`__.\n\nIn this quickstart we will train a Random Forest regressor using 5-fold CV\nrepeated 10 times to predict disease progression after one year. With the use of\n*sklearndf* we can create a *pandas* DataFrame compatible workflow. 
However,\nFACET provides additional enhancements to keep track of our feature matrix\nand target vector using a sample object (`Sample`) and easily compare\nhyperparameter configurations and even multiple learners with the `LearnerSelector`.\n\n.. code-block:: Python\n\n    # standard imports\n    import pandas as pd\n    from sklearn.model_selection import RepeatedKFold, GridSearchCV\n\n    # some helpful imports from sklearndf\n    from sklearndf.pipeline import RegressorPipelineDF\n    from sklearndf.regression import RandomForestRegressorDF\n\n    # relevant FACET imports\n    from facet.data import Sample\n    from facet.selection import LearnerSelector, ParameterSpace\n\n    # declaring url with data\n    data_url = 'https://web.stanford.edu/~hastie/Papers/LARS/diabetes.data'\n\n    # importing data from url\n    diabetes_df = pd.read_csv(data_url, delimiter='\\t').rename(\n        # renaming columns for better readability\n        columns={\n            'S1': 'TC', # total serum cholesterol\n            'S2': 'LDL', # low-density lipoproteins\n            'S3': 'HDL', # high-density lipoproteins\n            'S4': 'TCH', # total cholesterol / HDL\n            'S5': 'LTG', # possibly log of serum triglycerides level\n            'S6': 'GLU', # blood sugar level\n            'Y': 'Disease_progression' # disease progression one year after baseline\n        }\n    )\n\n    # create FACET sample object\n    diabetes_sample = Sample(observations=diabetes_df, target_name=\"Disease_progression\")\n\n    # create a (trivial) pipeline for a random forest regressor\n    rnd_forest_reg = RegressorPipelineDF(\n        regressor=RandomForestRegressorDF(n_estimators=200, random_state=42)\n    )\n\n    # define parameter space for models which are \"competing\" against each other\n    rnd_forest_ps = ParameterSpace(rnd_forest_reg)\n    rnd_forest_ps.regressor.min_samples_leaf = [8, 11, 15]\n    rnd_forest_ps.regressor.max_depth = [4, 5, 6]\n\n    # create repeated k-fold CV iterator\n    rkf_cv = 
RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)\n\n    # rank your candidate models by performance\n    selector = LearnerSelector(\n        searcher_type=GridSearchCV,\n        parameter_space=rnd_forest_ps,\n        cv=rkf_cv,\n        n_jobs=-3,\n        scoring=\"r2\"\n    ).fit(diabetes_sample)\n\n    # get summary report\n    selector.summary_report()\n\n.. image:: sphinx/source/_images/ranker_summary.png\n   :width: 600\n\nWe can see based on this minimal workflow that a value of 11 for minimum\nsamples in the leaf and 5 for maximum tree depth was the best performing\nof the three considered values.\nThis approach easily extends to additional hyperparameters for the learner,\nand for multiple learners.\n\n\nModel Inspection\n~~~~~~~~~~~~~~~~\n\nFACET implements several model inspection methods for\n`scikit-learn \u003chttps://scikit-learn.org/stable/index.html\u003e`__ estimators.\nFACET enhances model inspection by providing global metrics that complement\nthe local perspective of SHAP (see\n`[arXiv:2107.12436] \u003chttps://arxiv.org/abs/2107.12436\u003e`__ for a formal description).\n\nThe key global metrics for each pair of features in a model are:\n\n- **Synergy**\n\n  The degree to which the model combines information from one feature with\n  another to predict the target. For example, let's assume we are predicting\n  cardiovascular health using age and gender and the fitted model includes\n  a complex interaction between them. This means these two features are\n  synergistic for predicting cardiovascular health. Further, both features\n  are important to the model and removing either one would significantly\n  impact performance. Let's assume age brings more information to the joint\n  contribution than gender. This asymmetric contribution means the synergy for\n  (age, gender) is less than the synergy for (gender, age). 
To think about it another\n  way, imagine the prediction is a coordinate you are trying to reach.\n  From your starting point, age gets you much closer to this point than\n  gender, however, you need both to get there. Synergy reflects the fact\n  that gender gets more help from age (higher synergy from the perspective\n  of gender) than age does from gender (lower synergy from the perspective of\n  age) to reach the prediction. *This leads to an important point: synergy\n  is a naturally asymmetric property of the global information two interacting\n  features contribute to the model predictions.* Synergy is expressed as a\n  percentage ranging from 0% (full autonomy) to 100% (full synergy).\n\n- **Redundancy**\n\n  The degree to which a feature in a model duplicates the information of a\n  second feature to predict the target. For example, let's assume we had\n  house size and number of bedrooms for predicting house price. These\n  features capture similar information as the more bedrooms the larger\n  the house and likely a higher price on average. The redundancy for\n  (number of bedrooms, house size) will be greater than the redundancy\n  for (house size, number of bedrooms). This is because house size\n  \"knows\" more of what number of bedrooms does for predicting house price\n  than vice-versa. Hence, there is greater redundancy from the perspective\n  of number of bedrooms. Another way to think about it is removing house\n  size will be more detrimental to model performance than removing number\n  of bedrooms, as house size can better compensate for the absence of\n  number of bedrooms. This also implies that house size would be a more\n  important feature than number of bedrooms in the model. 
*The important\n  point here is that, like synergy, redundancy is a naturally asymmetric\n  property of the global information feature pairs have for predicting\n  an outcome.* Redundancy is expressed as a percentage ranging from 0%\n  (full uniqueness) to 100% (full redundancy).\n\n.. code-block:: Python\n\n    # fit the model inspector\n    from facet.inspection import LearnerInspector\n    inspector = LearnerInspector(\n        pipeline=selector.best_estimator_,\n        n_jobs=-3\n    ).fit(diabetes_sample)\n\n**Synergy**\n\n.. code-block:: Python\n\n    # visualise synergy as a matrix\n    from pytools.viz.matrix import MatrixDrawer\n    synergy_matrix = inspector.feature_synergy_matrix()\n    MatrixDrawer(style=\"matplot%\").draw(synergy_matrix, title=\"Synergy Matrix\")\n\n.. image:: sphinx/source/_images/synergy_matrix.png\n    :width: 600\n\nFor any feature pair (A, B), the first feature (A) is the row, and the second\nfeature (B) the column. For example, looking across the row for `LTG` (possibly the\nlog of serum triglycerides level) there is hardly any synergy with other features in\nthe model (≤ 1%).\nHowever, looking down the column for `LTG` (i.e., from the perspective of other features\nrelative to `LTG`) we find that many features (the rows) are aided by synergy\nwith `LTG` (up to 27% in the case of `LDL`). We conclude that:\n\n- `LTG` is a strongly autonomous feature, displaying minimal synergy with other\n  features for predicting disease progression after one year.\n- The contribution of other features to predicting disease progression after one\n  year is partly enabled by the presence of `LTG`.\n\nHigh synergy between pairs of features must be considered carefully when investigating\nimpact, as the values of both features jointly determine the outcome. It would not make\nmuch sense to consider `LDL` without the context provided by `LTG` given close\nto 27% synergy of `LDL` with `LTG` for predicting progression after one year.\n\n**Redundancy**\n\n.. 
code-block:: Python\n\n    # visualise redundancy as a matrix\n    redundancy_matrix = inspector.feature_redundancy_matrix()\n    MatrixDrawer(style=\"matplot%\").draw(redundancy_matrix, title=\"Redundancy Matrix\")\n\n.. image:: sphinx/source/_images/redundancy_matrix.png\n    :width: 600\n\n\nFor any feature pair (A, B), the first feature (A) is the row, and the second feature\n(B) the column. For example, if we look at the feature pair (`LDL`, `TC`) from the\nperspective of `LDL` (Low-Density Lipoproteins), then we look up the row for `LDL`\nand the column for `TC` and find 38% redundancy. This means that 38% of the information\nin `LDL` to predict disease progression is duplicated in `TC`. This\nredundancy is the same when looking \"from the perspective\" of `TC` for (`TC`, `LDL`),\nbut need not be symmetrical in all cases (see `LTG` vs. `TCH`).\n\nIf we look at `TCH`, it has between 22% and 32% redundancy each with `LTG` and `HDL`, but\nthe same does not hold between `LTG` and `HDL`, meaning `TCH` shares different\ninformation with each of the two features.\n\n\n**Clustering redundancy**\n\nAs detailed above, redundancy and synergy for a feature pair are computed from the\n\"perspective\" of one of the features in the pair, and so yield two distinct\nvalues. However, a symmetric version can also be computed that not only provides\na simplified perspective but also allows the use of (1 - metric) as a\nfeature distance. With this distance, hierarchical single-linkage clustering\nis applied to create a dendrogram visualization. This helps to identify\ngroups of low-distance features which activate \"in tandem\" to predict the\noutcome. Such information can then be used to either reduce clusters of\nhighly redundant features to a subset or highlight clusters of highly\nsynergistic features that should always be considered together.\n\nLet's look at the example for redundancy.\n\n.. 
code-block:: Python\n\n    # visualise redundancy using a dendrogram\n    from pytools.viz.dendrogram import DendrogramDrawer\n    redundancy = inspector.feature_redundancy_linkage()\n    DendrogramDrawer().draw(data=redundancy, title=\"Redundancy Dendrogram\")\n\n.. image:: sphinx/source/_images/redundancy_dendrogram.png\n    :width: 600\n\nBased on the dendrogram we can see that the feature pairs (`LDL`, `TC`)\nand (`HDL`, `TCH`) each represent a cluster in the dendrogram and that `LTG` and `BMI`\nhave the highest importance. As potential next actions we could explore the impact of\nremoving `TCH` and one of `TC` or `LDL` to further simplify the model and obtain a\nreduced set of independent features.\n\nPlease see the\n`API reference \u003chttps://bcg-x-official.github.io/facet/apidoc/facet.html\u003e`__\nfor more detail.\n\n\nModel Simulation\n~~~~~~~~~~~~~~~~\n\nTaking the `BMI` feature as an example of an important and highly independent feature,\nwe do the following for the simulation:\n\n- We use FACET's `ContinuousRangePartitioner` to split the range of observed values of\n  `BMI` into intervals of equal size. Each partition is represented by the central value\n  of that partition.\n- For each partition, the simulator creates an artificial copy of the original sample\n  assuming the variable to be simulated has the same value across all observations,\n  namely the value representing the partition. Using the best estimator\n  acquired from the selector, the simulator now re-predicts all targets using the model\n  trained on the full sample and determines the uplift of the target variable\n  resulting from this.\n- The FACET `SimulationDrawer` allows us to visualise the result, both in a\n  *matplotlib* and a plain-text style.\n\n.. 
code-block:: Python\n\n    # FACET imports\n    from facet.validation import BootstrapCV\n    from facet.simulation import UnivariateUpliftSimulator\n    from facet.data.partition import ContinuousRangePartitioner\n    from facet.simulation.viz import SimulationDrawer\n\n    # create bootstrap CV iterator\n    bscv = BootstrapCV(n_splits=1000, random_state=42)\n\n    SIM_FEAT = \"BMI\"\n    simulator = UnivariateUpliftSimulator(\n        model=selector.best_estimator_,\n        sample=diabetes_sample,\n        n_jobs=-3\n    )\n\n    # split the simulation range into equal sized partitions\n    partitioner = ContinuousRangePartitioner()\n\n    # run the simulation\n    simulation = simulator.simulate_feature(feature_name=SIM_FEAT, partitioner=partitioner)\n\n    # visualise results\n    SimulationDrawer().draw(data=simulation, title=SIM_FEAT)\n\n.. image:: sphinx/source/_images/simulation_output.png\n\nWe would conclude from the figure that higher values of `BMI` are associated with\nan increase in disease progression after one year, and that for a `BMI` of 28\nand above, there is a significant increase in disease progression after one year\nof at least 26 points.\n\nContributing\n------------\n\nFACET is stable and is being supported long-term.\n\nContributions to FACET are welcome and appreciated.\nFor any bug reports or feature requests/enhancements please use the appropriate\n`GitHub form \u003chttps://github.com/BCG-X-Official/facet/issues\u003e`_, and if you wish to do so,\nplease open a PR addressing the issue.\n\nWe do ask that for any major changes please discuss these with us first via an issue or\nusing our team email: FacetTeam@bcg.com.\n\nFor further information on contributing please see our\n`contribution guide \u003chttps://bcg-x-official.github.io/facet/contribution_guide.html\u003e`__.\n\n\nLicense\n-------\n\nFACET is licensed under Apache 2.0 as described in the\n`LICENSE \u003chttps://github.com/BCG-X-Official/facet/blob/develop/LICENSE\u003e`_ 
file.\n\n\nAcknowledgements\n----------------\n\nFACET is built on top of two popular packages for Machine Learning:\n\n-   The `scikit-learn \u003chttps://scikit-learn.org/stable/index.html\u003e`__ learners and\n    pipelining make up the implementation of the underlying algorithms. Moreover, we tried\n    to design the FACET API to align with the scikit-learn API.\n-   The `SHAP \u003chttps://shap.readthedocs.io/en/latest/\u003e`__ implementation is used to\n    estimate the Shapley vectors which FACET then decomposes into synergy, redundancy,\n    and independence vectors.\n\n\nBCG GAMMA\n---------\n\nIf you would like to know more about the team behind FACET please see the\n`about us \u003chttps://bcg-x-official.github.io/facet/about_us.html\u003e`__ page.\n\nWe are always on the lookout for passionate and talented data scientists to join the\nBCG GAMMA team. If you would like to know more you can find out about\n`BCG GAMMA \u003chttps://www.bcg.com/en-gb/beyond-consulting/bcg-gamma/default\u003e`_,\nor have a look at\n`career opportunities \u003chttps://www.bcg.com/en-gb/beyond-consulting/bcg-gamma/careers\u003e`_.\n\n.. |pipe| image:: sphinx/source/_images/icons/pipe_icon.png\n   :width: 100px\n   :class: facet_icon\n\n.. |inspect| image:: sphinx/source/_images/icons/inspect_icon.png\n   :width: 100px\n   :class: facet_icon\n\n.. |sim| image:: sphinx/source/_images/icons/sim_icon.png\n   :width: 100px\n   :class: facet_icon\n\n.. |spacer| unicode:: 0x2003 0x2003 0x2003 0x2003 0x2003 0x2003\n\n.. Begin-Badges\n\n.. |conda| image:: https://anaconda.org/bcg_gamma/gamma-facet/badges/version.svg\n    :target: https://anaconda.org/BCG_Gamma/gamma-facet\n\n.. |pypi| image:: https://badge.fury.io/py/gamma-facet.svg\n    :target: https://pypi.org/project/gamma-facet/\n\n.. 
|azure_build| image:: https://dev.azure.com/gamma-facet/facet/_apis/build/status/BCG-X-Official.facet?repoName=BCG-X-Official%2Ffacet\u0026branchName=develop\n   :target: https://dev.azure.com/gamma-facet/facet/_build?definitionId=7\u0026_a=summary\n\n.. |azure_code_cov| image:: https://img.shields.io/azure-devops/coverage/gamma-facet/facet/7/2.0.x\n   :target: https://dev.azure.com/gamma-facet/facet/_build?definitionId=7\u0026_a=summary\n\n.. |python_versions| image:: https://img.shields.io/badge/python-3.7|3.8|3.9-blue.svg\n   :target: https://www.python.org/downloads/release/python-380/\n\n.. |code_style| image:: https://img.shields.io/badge/code%20style-black-000000.svg\n   :target: https://github.com/psf/black\n\n.. |made_with_sphinx_doc| image:: https://img.shields.io/badge/Made%20with-Sphinx-1f425f.svg\n   :target: https://bcg-x-official.github.io/facet/index.html\n\n.. |license_badge| image:: https://img.shields.io/badge/License-Apache%202.0-olivegreen.svg\n   :target: https://opensource.org/licenses/Apache-2.0\n\n.. End-Badges\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FBCG-X-Official%2Ffacet","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FBCG-X-Official%2Ffacet","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FBCG-X-Official%2Ffacet/lists"}