https://github.com/parrt/random-forest-importances

Code to compute permutation and drop-column importances in Python scikit-learn models

Last synced: 3 months ago

Host: GitHub
URL: https://github.com/parrt/random-forest-importances
Owner: parrt
License: mit
Created: 2018-03-22T19:20:13.000Z (over 7 years ago)
Default Branch: master
Last Pushed: 2025-03-24T16:48:44.000Z (3 months ago)
Last Synced: 2025-04-03T19:59:30.178Z (3 months ago)
Language: Jupyter Notebook
Homepage:
Size: 14.6 MB
Stars: 610
Watchers: 21
Forks: 132
Open Issues: 10
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

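StarryDivineSky - parrt/random-forest-importances - learn machine learning models, especially random forest models. It uses permutation importance and drop-column importance to make up for the shortcomings of scikit-learn's default Gini-importance-based method. Permutation importance measures a feature's importance by shuffling the feature's values and observing the change in model performance, while drop-column importance measures it by removing the feature and observing the change in model performance. The project contains a Python package named `rfpimp` that can be used to compute these importance measures, and provides sample code and notebooks demonstrating how to use the package to analyze feature importance. (Feature engineering)
awesome-python-machine-learning-resources - GitHub - 14% open · ⏱️ 30.01.2021): (Model interpretability)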

README

# Feature importances for scikit-learn machine learning models

By Terence Parr and Kerem Turgutlu. See [Explained.ai](http://explained.ai) for more stuff.

The scikit-learn Random Forest feature importance strategy is the mean decrease in impurity (or gini importance) mechanism, which is unreliable.

To get reliable results, use permutation importance, provided in the `rfpimp` package in the `src` dir. Install with:

`pip install rfpimp`

We include permutation and drop-column importance measures that work with any sklearn model.  Yes, `rfpimp` is an increasingly-ill-suited name, but we still like it.

## Description

See *Beware Default Random Forest Importances* for a deeper discussion of the issues surrounding feature importances in random forests (authored by Terence Parr, Kerem Turgutlu, Christopher Csiszar, and Jeremy Howard).

The mean-decrease-in-impurity importance of a feature is computed by measuring how effective the feature is at reducing uncertainty (classifiers) or variance (regressors) when creating decision trees within random forests. The problem is that this mechanism, while fast, does not always give an accurate picture of importance. Strobl et al. pointed out in *Bias in random forest variable importance measures: Illustrations, sources and a solution* that “the variable importance measures of Breiman's original random forest method ... are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories.”
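For reference, scikit-learn exposes the mean-decrease-in-impurity values through the fitted forest's `feature_importances_` attribute. The following is a minimal sketch on synthetic data; the column names `signal` and `random` and the data itself are made up for illustration and are not part of the rent data set.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic, illustrative data (an assumption, not the rent data)
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "signal": rng.normal(size=n),   # actually drives the target
    "random": rng.random(size=n),   # pure noise, unrelated to y
})
y = 3.0 * X["signal"] + rng.normal(scale=0.1, size=n)

rf = RandomForestRegressor(n_estimators=50, random_state=0)
rf.fit(X, y)

# Impurity-based importances; scikit-learn normalizes them to sum to 1
mdi = pd.Series(rf.feature_importances_, index=X.columns)
print(mdi.sort_values(ascending=False))
```

Because the values are normalized, they only rank features relative to each other; a noise column such as `random` can still receive a nonzero share.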

A more reliable method is permutation importance, which measures the importance of a feature as follows. Record a baseline accuracy (classifier) or R² score (regressor) by passing a validation set or the out-of-bag (OOB) samples through the random forest. Permute the column values of a single predictor feature, then pass all test samples back through the random forest and recompute the accuracy or R². The importance of that feature is the difference between the baseline and the recomputed score, that is, the drop in overall accuracy or R² caused by permuting the column. The permutation mechanism is much more computationally expensive than the mean decrease in impurity mechanism, but the results are more reliable.
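The procedure can be sketched in a few lines of plain scikit-learn. This is an illustrative sketch only, not the `rfpimp` implementation: the helper name `permutation_importances_sketch`, the synthetic data, and the column names are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def permutation_importances_sketch(model, X_valid, y_valid, seed=0):
    """Hypothetical helper: drop in validation R^2 when each column is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = r2_score(y_valid, model.predict(X_valid))
    drops = {}
    for col in X_valid.columns:
        X_perm = X_valid.copy()
        # Shuffle one column, breaking its relationship to the target
        X_perm[col] = rng.permutation(X_perm[col].to_numpy())
        # Importance = baseline score minus score with the column permuted
        drops[col] = baseline - r2_score(y_valid, model.predict(X_perm))
    return pd.Series(drops).sort_values(ascending=False)


# Synthetic, illustrative data (an assumption, not the rent data)
rng = np.random.default_rng(42)
n = 600
X = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "random": rng.random(size=n),
})
y = 5.0 * X["x1"] + 1.0 * X["x2"] + rng.normal(scale=0.1, size=n)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0)
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

imp = permutation_importances_sketch(rf, X_valid, y_valid)
print(imp)
```

Note that the model is trained once and never refit; only the validation data is altered, which is what distinguishes this from drop-column importance.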

## Sample code

See the [notebooks directory](https://github.com/parrt/random-forest-importances/blob/master/notebooks) for things like [Collinear features](https://github.com/parrt/random-forest-importances/blob/master/notebooks/collinear.ipynb) and [Plotting feature importances](https://github.com/parrt/random-forest-importances/blob/master/notebooks/pimp_plots.ipynb).

Here's some sample Python code that uses the `rfpimp` package contained in the `src` directory.  The data can be found in rent.csv, which is a subset of the data from Kaggle's Two Sigma Connect: Rental Listing Inquiries competition.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split

from rfpimp import *

# Adjust this path to wherever your copy of rent.csv lives
df_orig = pd.read_csv("/Users/parrt/github/random-forest-importances/notebooks/data/rent.csv")

# ---- Regression: predict (log) price ----
df = df_orig.copy()

# attenuate effect of outliers in price
df['price'] = np.log(df['price'])

df_train, df_test = train_test_split(df, test_size=0.20)

features = ['bathrooms','bedrooms','longitude','latitude',
            'price']
df_train = df_train[features]
df_test = df_test[features]

X_train, y_train = df_train.drop('price',axis=1), df_train['price']
X_test, y_test = df_test.drop('price',axis=1), df_test['price']

# Add column of random numbers as a sanity-check feature
X_train['random'] = np.random.random(size=len(X_train))
X_test['random'] = np.random.random(size=len(X_test))

rf = RandomForestRegressor(n_estimators=100, n_jobs=-1)
rf.fit(X_train, y_train)

imp = importances(rf, X_test, y_test) # permutation
viz = plot_importances(imp)
viz.view()

# ---- Classification: predict interest_level ----
df_train, df_test = train_test_split(df_orig, test_size=0.20)

features = ['bathrooms','bedrooms','price','longitude','latitude',
            'interest_level']
df_train = df_train[features]
df_test = df_test[features]

X_train, y_train = df_train.drop('interest_level',axis=1), df_train['interest_level']
X_test, y_test = df_test.drop('interest_level',axis=1), df_test['interest_level']

# Add column of random numbers
X_train['random'] = np.random.random(size=len(X_train))
X_test['random'] = np.random.random(size=len(X_test))

rf = RandomForestClassifier(n_estimators=100,
                            min_samples_leaf=5,
                            n_jobs=-1,
                            oob_score=True)
rf.fit(X_train, y_train)

imp = importances(rf, X_test, y_test, n_samples=-1)
viz = plot_importances(imp)
viz.view()
```

### Feature correlation

See [Feature collinearity heatmap](notebooks/rfpimp-collinear.ipynb). The notebook computes Spearman's correlation matrix for the features and plots it as a heatmap.
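As a minimal sketch of the underlying computation, Spearman's correlation matrix can be obtained with plain pandas via `DataFrame.corr(method="spearman")`. This is not the `rfpimp` helper used in the notebook, and the data below is synthetic; the column names only mimic the rent data for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic, illustrative data (an assumption, not the rent data)
rng = np.random.default_rng(0)
n = 300
bedrooms = rng.integers(0, 5, size=n)
df = pd.DataFrame({
    "bedrooms": bedrooms,
    "bathrooms": bedrooms + rng.integers(0, 2, size=n),  # tied to bedrooms
    "random": rng.random(size=n),                        # unrelated noise
})

# Rank-based (Spearman) correlation between every pair of columns
corr = df.corr(method="spearman")
print(corr.round(2))
```

The result is a symmetric data frame with ones on the diagonal; strongly related columns such as `bedrooms` and `bathrooms` show up as values close to 1.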



### Feature dependencies

The features we use in machine learning are rarely completely independent, which makes interpreting feature importance tricky. We could compute correlation coefficients, but that only identifies linear relationships. A way to at least identify if a feature, x, is dependent on other features is to train a model using x as a dependent variable and all other features as independent variables. Because random forests give us an easy out-of-bag error estimate, the feature dependence functions rely on random forest models. The R² score of that model indicates how easy it is to predict feature x using the other features. The higher the score, the more dependent feature x is.
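The idea above can be sketched with scikit-learn alone. This is an illustration under stated assumptions, not the package's own dependence function: the helper name `dependence_scores_sketch` and the synthetic columns `a`, `b`, and `random` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def dependence_scores_sketch(X, n_estimators=50, seed=0):
    """Hypothetical helper: OOB R^2 for predicting each column from the others."""
    scores = {}
    for col in X.columns:
        rf = RandomForestRegressor(n_estimators=n_estimators,
                                   oob_score=True,
                                   random_state=seed)
        # Treat `col` as the target and every other column as a predictor
        rf.fit(X.drop(columns=col), X[col])
        scores[col] = rf.oob_score_  # R^2 measured on out-of-bag samples
    return pd.Series(scores).sort_values(ascending=False)


# Synthetic, illustrative data: b is almost a function of a, random is noise
rng = np.random.default_rng(1)
n = 400
a = rng.normal(size=n)
X = pd.DataFrame({
    "a": a,
    "b": 2.0 * a + rng.normal(scale=0.1, size=n),
    "random": rng.random(size=n),
})

dep = dependence_scores_sketch(X)
print(dep)
```

A score near 1 flags a feature that the others can largely reproduce; a score near or below 0 suggests the feature carries information the others do not.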

You can also get a feature dependence matrix / heatmap: a non-symmetric data frame in which each row treats one feature as the model target and each column holds the importance of another feature for predicting that target.
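One way to build such a matrix, sketched here with scikit-learn's `permutation_importance` rather than the package's own function, is to fit one forest per target feature and record the importances of the remaining features in that row. The helper name `dependence_matrix_sketch`, the hold-out split, and the synthetic data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split


def dependence_matrix_sketch(X, seed=0):
    """Hypothetical helper. Row r, column c: importance of c when predicting r."""
    X_train, X_valid = train_test_split(X, test_size=0.25, random_state=seed)
    D = pd.DataFrame(np.nan, index=X.columns, columns=X.columns)
    for target in X.columns:
        others = [c for c in X.columns if c != target]
        rf = RandomForestRegressor(n_estimators=50, random_state=seed)
        rf.fit(X_train[others], X_train[target])
        result = permutation_importance(rf, X_valid[others], X_valid[target],
                                        n_repeats=5, random_state=seed)
        # Fill this target's row; the diagonal entry stays NaN
        D.loc[target, others] = result.importances_mean
    return D


# Synthetic, illustrative data: b depends on a nonlinearly
rng = np.random.default_rng(2)
n = 400
a = rng.normal(size=n)
X = pd.DataFrame({
    "a": a,
    "b": a ** 2 + rng.normal(scale=0.05, size=n),
    "random": rng.random(size=n),
})

D = dependence_matrix_sketch(X)
print(D.round(3))
```

Each row comes from a separate model, so there is no reason for the entry at row r, column c to equal the entry at row c, column r.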
