https://github.com/winvector/cvrtsencoder

Spectral encoding of categorical variables using model residual trajectories

---
output: github_document
---

```{r, echo = FALSE, warning=FALSE, message=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = " # ",
  fig.path = "tools/README-"
)
```

[`CVRTSEncoder`](https://github.com/WinVector/CVRTSEncoder) is a categorical variable encoding for supervised learning.

This package is still in research and development. Functionality and interfaces may change.

The package re-encodes a set of categorical variables jointly as a spectral projection of the trajectory of modeling residuals. The result is intended as a succinct, numeric, linear representation of those variables that is useful for supervised learning.

The concept is a y-aware encoding of the trajectory of non-linear model residuals in terms of the target categorical variables.

The idea extends the [`vtreat`](https://github.com/WinVector/vtreat) coding [concepts](https://github.com/WinVector/vtreat/blob/master/extras/vtreat.pdf), the re-encoding concepts of [JavaLogistic](https://github.com/WinVector/Logistic), and the y-aware scaling concepts of Nina Zumel and John Mount:

* [Principal Components Regression, Pt.1: The Standard Method](http://www.win-vector.com/blog/2016/05/pcr_part1_xonly/)
* [Principal Components Regression, Pt. 2: Y-Aware Methods](http://www.win-vector.com/blog/2016/05/pcr_part2_yaware/)
* [Principal Components Regression, Pt. 3: Picking the Number of Components](http://www.win-vector.com/blog/2016/05/pcr_part3_pickk/)
* [y-aware scaling in context](http://www.win-vector.com/blog/2016/06/y-aware-scaling-in-context/)
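
The y-aware scaling idea from those articles can be sketched in a few lines of base R: each numeric variable is centered and rescaled by the slope of a single-variable regression of the outcome on that variable, so a unit change in any rescaled variable corresponds to a unit change in the predicted outcome. The helper name `y_aware_scale` and the use of `mtcars` are assumptions made for illustration; this is not code from `CVRTSEncoder` or `vtreat`.

```r
# Hypothetical helper (not part of any package): y-aware scaling of one variable.
y_aware_scale <- function(x, y) {
  slope <- coef(lm(y ~ x))[["x"]]  # slope of the single-variable regression
  slope * (x - mean(x))            # center, then express x in units of y
}

d <- mtcars
vars <- c("wt", "hp", "disp")      # arbitrary numeric columns for the example
scaled <- as.data.frame(lapply(d[vars], y_aware_scale, y = d$mpg))

# After scaling, regressing the outcome on any one scaled variable gives slope 1.
slopes <- vapply(
  scaled,
  function(x) coef(lm(d$mpg ~ x))[["x"]],
  numeric(1)
)
print(slopes)
```

Because every rescaled column is now measured in outcome units, a later principal components step weights variables by how strongly they relate to the outcome rather than by their raw variance.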


The core idea is that other models factor the quantity to be explained into an explained portion and a residual portion (each relative to the given model). Each of these components is possibly useful for modeling.
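
As a minimal illustration of that factoring (a sketch in base R, not a description of how the package computes its encoding), a linear model on the built-in `iris` data splits a 0/1 outcome into a fitted part and a residual part that sum back to the original values. The choice of `lm`, the columns used, and the per-level mean at the end are all assumptions made for the example.

```r
# Illustrative only: factor an outcome into explained and residual portions.
d <- iris
d$y <- as.numeric(d$Species == "versicolor")  # 0/1 outcome

# A simple model on two numeric variables (an arbitrary choice).
m <- lm(y ~ Sepal.Length + Petal.Length, data = d)

explained <- as.numeric(fitted(m))     # portion the model accounts for
residual  <- as.numeric(residuals(m))  # portion the model leaves unexplained

# The two portions add back up to the original outcome.
stopifnot(isTRUE(all.equal(explained + residual, d$y)))

# The residual can then be summarized against a categorical variable,
# here by its mean within each level of rounded Petal.Width.
level <- as.character(round(d$Petal.Width))
resid_by_level <- tapply(residual, level, mean)
print(resid_by_level)
```

A level whose mean residual is far from zero marks a group the first model systematically misses, which is the kind of signal a residual-based encoding can pass on to a later model.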

```{r example}
library("CVRTSEncoder")
library("wrapr")

data <- iris
avars <- c("Sepal.Length", "Petal.Length")
evars <- c("Sepal.Width", "Petal.Width")
dep_var <- "Species"
dep_target <- "versicolor"
# Round the width columns and store them as strings, so they act as categorical variables.
for(vi in evars) {
  data[[vi]] <- as.character(round(data[[vi]]))
}
str(data)

cross_enc <- estimate_residual_encoding_c(
  data = data,
  avars = avars,
  evars = evars,
  dep_var = dep_var,
  dep_target = dep_target,
  n_comp = 4
)
enc <- prepare(cross_enc$coder, data)
data <- cbind(data, enc)
data %.>%
  head(.) %.>%
  knitr::kable(.)

f0 <- wrapr::mk_formula(dep_var, avars, outcome_target = dep_target)
print(f0)

model0 <- glm(f0, data = data, family = binomial)
summary(model0)

data$pred0 <- predict(model0, newdata = data, type = "response")
table(data$Species, data$pred0>0.5)

newvars <- c(avars, colnames(enc))
f <- wrapr::mk_formula(dep_var, newvars, outcome_target = dep_target)
print(f)

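model <- glmnet::cv.glmnet(
  as.matrix(data[, newvars, drop = FALSE]),
  as.numeric(data[[dep_var]] == dep_target),
  family = "binomial")
# coef() and predict() on a cv.glmnet fit select the penalty through `s`
coef(model, s = "lambda.min")
# type = "response" returns probabilities, so the 0.5 threshold matches pred0
data$pred <- as.numeric(predict(
  model,
  newx = as.matrix(data[, newvars, drop = FALSE]),
  s = "lambda.min",
  type = "response"))
table(data$Species, data$pred > 0.5)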
```