https://github.com/forestry-labs/rforestry

bagging boosting causal-inference interpretability machine-learning monotonic-constraints r r-package random-forest random-forests trees


[![R](https://github.com/forestry-labs/Rforestry/actions/workflows/R.yaml/badge.svg)](https://github.com/forestry-labs/Rforestry/actions/workflows/R.yaml)

## Rforestry: Random Forests, Linear Trees, and Gradient Boosting for Inference and Interpretability

Sören Künzel, Theo Saarinen, Simon Walter, Edward Liu, Sam Antonyan, Allen Tang, Jasjeet Sekhon

## Introduction

Rforestry is a fast implementation of Honest Random Forests, Gradient Boosting, and Linear Random Forests, with an emphasis on inference and interpretability.

## How to install
1. The GFortran compiler has to be up to date. GFortran binaries can be found [here](https://gcc.gnu.org/wiki/GFortranBinaries).
2. The [devtools](https://github.com/r-lib/devtools) package has to be installed. You can install it using `install.packages("devtools")`.
3. The package contains compiled code, and you must have a development environment to install the development version. You can use `devtools::has_devel()` to check whether you do. If no development environment exists, Windows users should download and install [Rtools](https://cran.r-project.org/bin/windows/Rtools/), and macOS users should download and install [Xcode](https://apps.apple.com/us/app/xcode/id497799835).
4. The latest development version can then be installed using `devtools::install_github("forestry-labs/Rforestry")`. Windows users need to skip 64-bit compilation with `devtools::install_github("forestry-labs/Rforestry", INSTALL_opts = c('--no-multiarch'))` due to an outstanding gcc issue.

## Documentation

For the Python package, see the documentation [here](https://random-forestry.readthedocs.io/en/latest/) and install from PyPI [here](https://pypi.org/project/random-forestry/).
For the R package, see the documentation [here](https://forestry-labs.github.io/Rforestry/) and install from CRAN [here](https://cran.r-project.org/web/packages/Rforestry/index.html).
For the source code for both packages, see the GitHub repository [here](https://github.com/forestry-labs/Rforestry).

## Usage

```R
library(Rforestry)

set.seed(292315)
test_idx <- sample(nrow(iris), 3)
x_train <- iris[-test_idx, -1]
y_train <- iris[-test_idx, 1]
x_test <- iris[test_idx, -1]

rf <- forestry(x = x_train, y = y_train, nthread = 2)

predict(rf, x_test)
```

## Monotonic Constraints

The parameter `monotonicConstraints` strictly enforces monotonicity of partition
averages when evaluating potential splits on the indicated features.
It takes one entry per feature: `1` for a monotone increasing constraint,
`-1` for a monotone decreasing constraint, and `0` for no constraint.

```R
library(Rforestry)

set.seed(49)
x <- rnorm(150)+5
y <- .15*x + .5*sin(3*x)
data_train <- data.frame(x1 = x, x2 = rnorm(150)+5, y = y + rnorm(150, sd = .4))

monotone_rf <- forestry(x = data_train[,-3],
                        y = data_train$y,
                        monotonicConstraints = c(1,1),
                        nodesizeStrictSpl = 5,
                        nthread = 1,
                        ntree = 25)

predict(monotone_rf, newdata = data_train[,-3])
```
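
One way to check the constraint is to predict along a grid of one feature while holding the other fixed. The following is a minimal sketch that refits the forest from the example above; the grid, the names `grid`, `grid_preds`, and `is_nondecreasing`, and the numerical tolerance are illustrative choices, not part of the package.

```R
library(Rforestry)

set.seed(49)
x <- rnorm(150) + 5
y <- .15 * x + .5 * sin(3 * x)
data_train <- data.frame(x1 = x, x2 = rnorm(150) + 5, y = y + rnorm(150, sd = .4))

monotone_rf <- forestry(x = data_train[, -3],
                        y = data_train$y,
                        monotonicConstraints = c(1, 1),
                        nodesizeStrictSpl = 5,
                        nthread = 1,
                        ntree = 25)

# Vary x1 over its observed range while holding x2 at its mean (illustrative grid)
grid <- data.frame(x1 = seq(min(data_train$x1), max(data_train$x1), length.out = 50),
                   x2 = mean(data_train$x2))
grid_preds <- predict(monotone_rf, newdata = grid)

# TRUE if predictions never decrease along x1 (small tolerance for rounding)
is_nondecreasing <- all(diff(grid_preds) >= -1e-8)
print(is_nondecreasing)
```

If the increasing constraint on `x1` holds, the printed value should be `TRUE`; repeating the check with `x1` fixed and `x2` varying tests the second constraint.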

## OOB Predictions

We can return the predictions for the training data set using only the trees in
which each observation was out-of-bag (OOB). Note that when there are few trees, or a
high proportion of the observations sampled, there may be some observations
which are not out-of-bag for any trees. The predictions for these are returned as `NaN`.

```R
library(Rforestry)

# Train a forest
rf <- forestry(x = iris[,-1],
               y = iris[,1],
               nthread = 2,
               ntree = 500)

# Get the OOB predictions for the training set
oob_preds <- predict(rf, aggregation = "oob")

# This should be equal to the OOB error
mean((oob_preds - iris[,1])^2)
getOOB(rf)
```
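
The note about observations that are never out-of-bag can be made concrete with a small base R simulation. This sketch assumes each tree draws a bootstrap sample of size `n` with replacement, which may differ from the sampling settings of a given forest; `never_oob_fraction` and `theoretical_never_oob` are hypothetical helper names used only for illustration.

```R
set.seed(1)

# Fraction of observations that are in-bag for every one of `ntree` trees,
# assuming each tree uses a bootstrap sample of size n drawn with replacement.
never_oob_fraction <- function(ntree, n) {
  # in_bag is an n x ntree logical matrix: TRUE if observation i was drawn for tree b
  in_bag <- vapply(seq_len(ntree), function(b) {
    tabulate(sample.int(n, size = n, replace = TRUE), nbins = n) > 0
  }, logical(n))
  mean(rowSums(in_bag) == ntree)
}

# Probability that a single observation is in-bag for all trees
theoretical_never_oob <- function(ntree, n) (1 - (1 - 1 / n)^n)^ntree

n <- nrow(iris)
for (ntree in c(1, 5, 25, 500)) {
  cat(sprintf("ntree = %3d  simulated = %.4f  theoretical = %.4f\n",
              ntree, never_oob_fraction(ntree, n), theoretical_never_oob(ntree, n)))
}
```

Under this assumption the probability shrinks geometrically in the number of trees, so `NaN` OOB predictions mainly arise for very small forests.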

If OOB predictions are going to be used, we advise using OOB honesty during
training (`OOBhonest = TRUE`). In this version of honesty, the OOB observations for each tree
are used as the honest (averaging) set.

OOB honesty also changes how predictions are constructed. When predicting for
observations that are out-of-sample (using `predict(..., aggregation = "average")`),
all the trees in the forest are used. When predicting for an observation that was
in-sample (using `predict(..., aggregation = "oob")`), only the trees for which that
observation was not in the averaging set are used.

`aggregation = "oob"` therefore ensures that an observation's outcome value
is never used to construct its own prediction, even when the observation is in sample.
This property does not hold in standard honesty, which relies on an asymptotic subsampling argument.
OOB honesty, when used in combination with `aggregation = "oob"` at the prediction stage, cannot overfit IID data
at either the training or prediction stage. The outputs of such models are also more stable and more easily
interpretable, which one can observe by querying the model with interpretation tools such as
ALEs, PDPs, and LIME.

```R
library(Rforestry)

# Train a forest
rf <- forestry(x = iris[,-1],
               y = iris[,1],
               nthread = 2,
               ntree = 500,
               OOBhonest = TRUE)

# Get the OOB predictions for the training set
oob_preds <- predict(rf, aggregation = "oob")

# This should be equal to the OOB error
mean((oob_preds - iris[,1])^2)
getOOB(rf)
```
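
To see the difference between the two aggregation modes, one can compare the training-set error under each. This is a minimal sketch using only the calls shown above; the helper `mse` and the variable names are illustrative, and `na.rm = TRUE` is used in case some OOB predictions are `NaN`.

```R
library(Rforestry)

set.seed(1)
rf <- forestry(x = iris[, -1],
               y = iris[, 1],
               nthread = 2,
               ntree = 500,
               OOBhonest = TRUE)

# Predictions on the training data using all trees
in_sample_preds <- predict(rf, iris[, -1])

# Predictions on the training data using only trees where each row was not in the averaging set
oob_preds <- predict(rf, aggregation = "oob")

# Mean squared error against the training outcome, ignoring any NaN predictions
mse <- function(pred) mean((pred - iris[, 1])^2, na.rm = TRUE)
errors <- c(average = mse(in_sample_preds), oob = mse(oob_preds))
print(errors)
```

One would typically expect the `"average"` error on training rows to be the more optimistic of the two, since those predictions can draw on trees that saw each row's outcome.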

## Saving + Loading a model

The package includes two functions, `saveForestry` and `loadForestry`, for saving
a trained model to disk and loading it back.
The following code shows how to use them.

```R
library(Rforestry)

# Train a forest
forest <- forestry(x = iris[,-1],
                   y = iris[,1],
                   nthread = 2,
                   ntree = 500,
                   OOBhonest = TRUE)

# Get predictions before saving the forest
y_pred_before <- predict(forest, iris[,-1])

# Save the forest
saveForestry(forest, filename = file.path("forest.Rda"))

# Delete the forest
rm(forest)

# Load the forest
forest_after <- loadForestry(file.path("forest.Rda"))

# Predict after loading the forest
y_pred_after <- predict(forest_after, iris[,-1])

```
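
A quick sanity check after a save and load round trip is to compare predictions from the original and the reloaded forest. The sketch below writes to a temporary file instead of the working directory; the use of `tempfile()` and `all.equal()` is an illustrative choice, not something the package requires.

```R
library(Rforestry)

set.seed(1)
forest <- forestry(x = iris[, -1],
                   y = iris[, 1],
                   nthread = 2,
                   ntree = 50)

y_pred_before <- predict(forest, iris[, -1])

# Save to a temporary file so the working directory stays clean
model_path <- tempfile(fileext = ".Rda")
saveForestry(forest, filename = model_path)
rm(forest)

forest_after <- loadForestry(model_path)
y_pred_after <- predict(forest_after, iris[, -1])

# Compare predictions before and after the round trip
print(all.equal(y_pred_before, y_pred_after))

# Remove the temporary file
unlink(model_path)
```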

## Ridge Random Forest

A fast implementation of random forests using ridge penalized splitting and
ridge regression for predictions.
In order to use this version of random forests, set the `linear` option to `TRUE`.

```R
library(Rforestry)

set.seed(49)
n <- 100
a <- rnorm(n)
b <- rnorm(n)
c <- rnorm(n)
y <- 4*a + 5.5*b - .78*c
x <- data.frame(a,b,c)
forest <- forestry(x, y, linear = TRUE, nthread = 2)
predict(forest, x)
```
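
To get a feel for when the `linear` option helps, one can compare it with the default forest on held-out data drawn from the same linear model. This is a rough sketch, not a benchmark; the helpers `simulate` and `rmse` and the train/test split are assumptions made for illustration.

```R
library(Rforestry)

set.seed(49)

# Draw n observations from the same linear model used in the example above
simulate <- function(n) {
  x <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n))
  list(x = x, y = 4 * x$a + 5.5 * x$b - .78 * x$c)
}

train <- simulate(100)
test <- simulate(100)

forest_linear <- forestry(train$x, train$y, linear = TRUE, nthread = 2)
forest_default <- forestry(train$x, train$y, nthread = 2)

# Root mean squared error on the held-out set
rmse <- function(model) sqrt(mean((predict(model, test$x) - test$y)^2))
errors <- c(linear = rmse(forest_linear), default = rmse(forest_default))
print(errors)
```

Because the outcome here is exactly linear in the features, one would expect the ridge version to have the lower held-out error, though results depend on the seed and tuning parameters.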