https://github.com/winvector/cvrtsencoder
Spectral encoding of categorical variables using model residual trajectories
- Host: GitHub
- URL: https://github.com/winvector/cvrtsencoder
- Owner: WinVector
- License: gpl-3.0
- Created: 2019-04-13T03:15:57.000Z (over 6 years ago)
- Default Branch: main
- Last Pushed: 2020-06-22T20:57:03.000Z (over 5 years ago)
- Last Synced: 2024-12-27T02:14:03.770Z (12 months ago)
- Language: R
- Homepage: https://winvector.github.io/CVRTSEncoder/
- Size: 2.86 MB
- Stars: 0
- Watchers: 4
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.Rmd
- License: LICENSE
README
---
output: github_document
---
```{r, echo = FALSE, warning=FALSE, message=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = " # ",
  fig.path = "tools/README-"
)
```
[`CVRTSEncoder`](https://github.com/WinVector/CVRTSEncoder) is a categorical variable encoder for supervised learning.
This package is still in research and development mode; functionality and interfaces may change.
It re-encodes a set of categorical variables jointly as a spectral projection of the trajectory of modeling residuals. The intent is a succinct numeric linear representation of the set of categorical variables that is useful for supervised learning.
The concept is a y-aware encoding of the trajectory of non-linear model residuals in terms of the target categorical variables.
The idea is an extension of the [`vtreat`](https://github.com/WinVector/vtreat) coding [concepts](https://github.com/WinVector/vtreat/blob/master/extras/vtreat.pdf), the re-encoding concepts of [JavaLogistic](https://github.com/WinVector/Logistic), and of the y-aware scaling concepts of Nina Zumel and John Mount:
* [Principal Components Regression, Pt.1: The Standard Method](http://www.win-vector.com/blog/2016/05/pcr_part1_xonly/)
* [Principal Components Regression, Pt. 2: Y-Aware Methods](http://www.win-vector.com/blog/2016/05/pcr_part2_yaware/)
* [Principal Components Regression, Pt. 3: Picking the Number of Components](http://www.win-vector.com/blog/2016/05/pcr_part3_pickk/)
* [y-aware scaling in context](http://www.win-vector.com/blog/2016/06/y-aware-scaling-in-context/).
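
For reference, the y-aware scaling idea from the posts above can be illustrated with a small generic sketch (this is plain base R, not CVRTSEncoder's code): each numeric variable is rescaled by the slope of a univariate regression of the outcome on that variable, so the variable is expressed in units of expected change in the outcome.

```r
# Generic illustration of y-aware scaling (not part of CVRTSEncoder):
# rescale x by the slope of a univariate regression of y on x, centered at its mean.
y_aware_scale <- function(x, y) {
  slope <- coef(lm(y ~ x))[["x"]]
  slope * (x - mean(x))
}

d <- iris
d$is_versicolor <- as.numeric(d$Species == "versicolor")
scaled_sepal <- y_aware_scale(d$Sepal.Length, d$is_versicolor)
head(scaled_sepal)
```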
The core idea: other models factor the quantity to be explained into an explainable portion and a residual portion (with respect to the given model), and each of these components is potentially useful for modeling.
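
To make that split concrete, here is a tiny base-R illustration (independent of this package, and not its implementation): any fitted model partitions the outcome into fitted values plus residuals, and either piece can serve as input to further modeling.

```r
# Illustration only: a fitted model splits the outcome into
# an explained part (fitted values) and a residual part.
m <- lm(Sepal.Length ~ Petal.Length, data = iris)
explained <- fitted(m)
residual  <- resid(m)
all.equal(unname(explained + residual), iris$Sepal.Length)  # TRUE: y = explained + residual
```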
```{r example}
library("CVRTSEncoder")
library("wrapr")
# start from the iris data and convert two of the measurements
# into character (categorical) variables by rounding
data <- iris
avars <- c("Sepal.Length", "Petal.Length")  # numeric variables kept as-is
evars <- c("Sepal.Width", "Petal.Width")    # variables to re-encode
dep_var <- "Species"                        # dependent variable
dep_target <- "versicolor"                  # outcome class of interest
for(vi in evars) {
  data[[vi]] <- as.character(round(data[[vi]]))
}
str(data)
# jointly re-encode the categorical variables as coordinates of a
# spectral projection of the residual trajectory
cross_enc <- estimate_residual_encoding_c(
  data = data,
  avars = avars,
  evars = evars,
  dep_var = dep_var,
  dep_target = dep_target,
  n_comp = 4  # number of components to retain
)
# apply the learned encoding and add the new numeric columns to the data
enc <- prepare(cross_enc$coder, data)
data <- cbind(data, enc)

data %.>%
  head(.) %.>%
  knitr::kable(.)
# baseline: logistic regression using only the original numeric variables
f0 <- wrapr::mk_formula(dep_var, avars, outcome_target = dep_target)
print(f0)
model0 <- glm(f0, data = data, family = binomial)
summary(model0)
data$pred0 <- predict(model0, newdata = data, type = "response")
table(data$Species, data$pred0 > 0.5)
# augmented model: original numeric variables plus the encoded columns,
# fit with cross-validated regularized logistic regression
newvars <- c(avars, colnames(enc))
f <- wrapr::mk_formula(dep_var, newvars, outcome_target = dep_target)
print(f)
model <- glmnet::cv.glmnet(
  as.matrix(data[, newvars, drop = FALSE]),
  as.numeric(data[[dep_var]] == dep_target),
  family = "binomial")
coef(model, s = "lambda.min")
data$pred <- as.numeric(predict(
  model,
  newx = as.matrix(data[, newvars, drop = FALSE]),
  s = "lambda.min",
  type = "response"))
table(data$Species, data$pred > 0.5)
```
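
Assuming the example chunk above has been run, the stored predictions allow a quick (in-sample only) comparison of the baseline model and the model that also uses the encoded columns; the column names `pred0` and `pred` are those created above.

```r
# Compare in-sample accuracy of the two models from the example chunk.
# A real evaluation would of course use held-out or cross-validated data.
is_target <- data$Species == dep_target
acc0 <- mean((data$pred0 > 0.5) == is_target)  # numeric variables only
acc  <- mean((data$pred  > 0.5) == is_target)  # numeric variables plus encoding
c(baseline = acc0, with_encoding = acc)
```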