# Feature importances for scikit-learn machine learning models

By <a href="http://explained.ai/">Terence Parr</a> and <a href="https://www.linkedin.com/in/kerem-turgutlu-12906b65/">Kerem Turgutlu</a>. See [Explained.ai](http://explained.ai) for more stuff.

The scikit-learn Random Forest feature importances strategy is the <i>mean decrease in impurity</i> (or <i>gini importance</i>) mechanism, which is unreliable.
To get reliable results, use permutation importance, provided in the `rfpimp` package in the `src` dir. Install with:

`pip install rfpimp`

We include permutation and drop-column importance measures that work with any sklearn model.
Yes, `rfpimp` is an increasingly ill-suited name, but we still like it.
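To make the drop-column idea concrete, here is a minimal illustrative sketch (the function name `dropcol_importances_sketch` is ours for illustration, not the `rfpimp` API): retrain the model without each column in turn and record how much the validation score drops.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor

def dropcol_importances_sketch(model, X_train, y_train, X_valid, y_valid):
    # Baseline: fit on all columns, score (R^2 for regressors) on validation data
    base = clone(model)
    base.fit(X_train, y_train)
    baseline = base.score(X_valid, y_valid)
    imp = {}
    for col in X_train.columns:
        # Retrain from scratch without this column; the score drop is its importance
        m = clone(model)
        m.fit(X_train.drop(columns=col), y_train)
        imp[col] = baseline - m.score(X_valid.drop(columns=col), y_valid)
    return pd.Series(imp).sort_values(ascending=False)

# Tiny synthetic demo: one informative feature, one pure-noise feature
rng = np.random.default_rng(0)
X = pd.DataFrame({'x0': rng.normal(size=400), 'noise': rng.normal(size=400)})
y = 3 * X['x0'] + 0.1 * rng.normal(size=400)
rf = RandomForestRegressor(n_estimators=50, random_state=0)
imp = dropcol_importances_sketch(rf, X.iloc[:300], y.iloc[:300], X.iloc[300:], y.iloc[300:])
```

Retraining once per column makes this the most expensive importance measure here, which is why `rfpimp` also offers the cheaper permutation variant described below.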
## Description

See <a href="http://explained.ai/rf-importance/index.html">Beware Default Random Forest Importances</a> for a deeper discussion of the issues surrounding feature importances in random forests (authored by <a href="http://parrt.cs.usfca.edu">Terence Parr</a>, <a href="https://www.linkedin.com/in/kerem-turgutlu-12906b65/">Kerem Turgutlu</a>, <a href="https://www.linkedin.com/in/cpcsiszar/">Christopher Csiszar</a>, and <a href="http://www.fast.ai/about/#jeremy">Jeremy Howard</a>).

The mean-decrease-in-impurity importance of a feature is computed by measuring how effective the feature is at reducing uncertainty (classifiers) or variance (regressors) when creating decision trees within random forests. The problem is that this mechanism, while fast, does not always give an accurate picture of importance. Strobl <i>et al</i> pointed out in <a href="https://link.springer.com/article/10.1186%2F1471-2105-8-25">Bias in random forest variable importance measures: Illustrations, sources and a solution</a> that &ldquo;<i>the variable importance measures of Breiman's original random forest method ... are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories</i>.&rdquo;
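You can see the default mechanism at work with a small sketch (the synthetic data and the `random` noise column are our additions for illustration): even a column of pure noise typically receives a nonzero share of the mean-decrease-in-impurity credit reported by `feature_importances_`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n = 500
X = pd.DataFrame({
    'signal': rng.normal(size=n),   # truly predictive feature
    'random': rng.normal(size=n),   # pure noise, yet it still earns MDI credit
})
y = 2 * X['signal'] + 0.5 * rng.normal(size=n)

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X, y)

# sklearn's default importances: mean decrease in impurity, normalized to sum to 1
mdi = pd.Series(rf.feature_importances_, index=X.columns)
print(mdi)
```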
A more reliable method is <i>permutation importance</i>, which measures the importance of a feature as follows. Record a baseline accuracy (classifier) or R<sup>2</sup> score (regressor) by passing a validation set or the out-of-bag (OOB) samples through the random forest. Permute the column values of a single predictor feature and then pass all test samples back through the random forest and recompute the accuracy or R<sup>2</sup>. The importance of that feature is the drop from the baseline, i.e., the baseline score minus the score obtained on the permuted data. The permutation mechanism is much more computationally expensive than the mean decrease in impurity mechanism, but the results are more reliable.

## Sample code

See the [notebooks directory](https://github.com/parrt/random-forest-importances/blob/master/notebooks) for things like [Collinear features](https://github.com/parrt/random-forest-importances/blob/master/notebooks/collinear.ipynb) and [Plotting feature importances](https://github.com/parrt/random-forest-importances/blob/master/notebooks/pimp_plots.ipynb).

Here's some sample Python code that uses the `rfpimp` package contained in the `src` directory. The data can be found in <a href="https://github.com/parrt/random-forest-importances/blob/master/notebooks/data/rent.csv">rent.csv</a>, which is a subset of the data from Kaggle's <a href="https://www.kaggle.com/c/two-sigma-connect-rental-listing-inquiries">Two Sigma Connect: Rental Listing Inquiries</a> competition.

```python
from rfpimp import *
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.model_selection import train_test_split

df_orig = pd.read_csv("notebooks/data/rent.csv")  # adjust path to your checkout

df = df_orig.copy()

# attenuate effect of outliers in price
df['price'] = np.log(df['price'])

df_train, df_test = train_test_split(df, test_size=0.20)

features = ['bathrooms','bedrooms','longitude','latitude',
            'price']
df_train = df_train[features]
df_test = df_test[features]

X_train, y_train = df_train.drop('price',axis=1), df_train['price']
X_test, y_test = df_test.drop('price',axis=1), df_test['price']
# Add column of random numbers
X_train['random'] = np.random.random(size=len(X_train))
X_test['random'] = np.random.random(size=len(X_test))

rf = RandomForestRegressor(n_estimators=100, n_jobs=-1)
rf.fit(X_train, y_train)

imp = importances(rf, X_test, y_test) # permutation
viz = plot_importances(imp)
viz.view()


df_train, df_test = train_test_split(df_orig, test_size=0.20)
features = ['bathrooms','bedrooms','price','longitude','latitude',
            'interest_level']
df_train = df_train[features]
df_test = df_test[features]

X_train, y_train = df_train.drop('interest_level',axis=1), df_train['interest_level']
X_test, y_test = df_test.drop('interest_level',axis=1), df_test['interest_level']
# Add column of random numbers
X_train['random'] = np.random.random(size=len(X_train))
X_test['random'] = np.random.random(size=len(X_test))

rf = RandomForestClassifier(n_estimators=100,
                            min_samples_leaf=5,
                            n_jobs=-1,
                            oob_score=True)
rf.fit(X_train, y_train)

imp = importances(rf, X_test, y_test, n_samples=-1)
viz = plot_importances(imp)
viz.view()
```

### Feature correlation

See [Feature collinearity heatmap](notebooks/rfpimp-collinear.ipynb). We can get the Spearman's correlation matrix:

<img src="article/images/corrheatmap.svg" width="70%">
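The matrix behind such a heatmap can be computed directly with pandas; a small sketch with made-up rent-like numbers (the values here are illustrative, not from rent.csv):

```python
import pandas as pd

df = pd.DataFrame({
    'bathrooms': [1, 2, 2, 1, 3],
    'bedrooms':  [1, 3, 2, 1, 4],
    'price':     [1000, 3200, 2500, 1100, 4800],
})
# Spearman works on ranks, so it also picks up monotonic nonlinear relationships
corr = df.corr(method='spearman')
print(corr.round(2))
```

Any pair with correlation near 1.0 is a candidate for the collinearity handling discussed in the notebook.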
### Feature dependencies

The features we use in machine learning are rarely completely independent, which makes interpreting feature importance tricky. We could compute correlation coefficients, but that only identifies linear relationships. A way to at least identify if a feature, x, is dependent on other features is to train a model using x as a dependent variable and all other features as independent variables. Because random forests give us an easy out-of-bag error estimate, the feature dependence functions rely on random forest models. The R<sup>2</sup> score from that model indicates how easy it is to predict feature x using the other features; the higher the score, the more dependent feature x is.

You can also get a feature dependence matrix / heatmap that returns a non-symmetric data frame where each row gives the importance of each variable to the row's variable used as a model target. Example:

<img src="article/images/cancer_dep.svg" width="100%">
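The single-feature dependence check described above can be sketched in a few lines (the function name `dependence_score_sketch` is ours for illustration, not the `rfpimp` API):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def dependence_score_sketch(df, target_col):
    """Fit a random forest to predict one feature from all the others;
    the out-of-bag R^2 says how predictable (dependent) that feature is."""
    X = df.drop(columns=target_col)
    y = df[target_col]
    rf = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=0)
    rf.fit(X, y)
    return rf.oob_score_  # near 1 => highly dependent; near/below 0 => independent

# Synthetic demo: b is (almost) a function of a, while c is independent noise
rng = np.random.default_rng(1)
a = rng.normal(size=500)
df = pd.DataFrame({'a': a,
                   'b': 2 * a + 0.1 * rng.normal(size=500),
                   'c': rng.normal(size=500)})
```

Using OOB samples means no separate validation split is needed for this check, at the cost of requiring enough trees for stable OOB estimates.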