https://github.com/winvector/cvrtsencoder

Spectral encoding of categorical variables using model residual trajectories
https://github.com/winvector/cvrtsencoder

Last synced: 3 months ago
JSON representation

Spectral encoding of categorical variables using model residual trajectories

Host: GitHub
URL: https://github.com/winvector/cvrtsencoder
Owner: WinVector
License: gpl-3.0
Created: 2019-04-13T03:15:57.000Z (almost 7 years ago)
Default Branch: main
Last Pushed: 2020-06-22T20:57:03.000Z (over 5 years ago)
Last Synced: 2024-12-27T02:14:03.770Z (about 1 year ago)
Language: R
Homepage: https://winvector.github.io/CVRTSEncoder/
Size: 2.86 MB
Stars: 0
Watchers: 4
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.Rmd
- License: LICENSE

Awesome Lists containing this project

README

          ---

output: github_document

---

```{r, echo = FALSE, warning=FALSE, message=FALSE}

knitr::opts_chunk$set(

  collapse = TRUE,

  comment = " # ",

  fig.path = "tools/README-"

)

```

[`CVRTSEncoder`](https://github.com/WinVector/CVRTSEncoder) is a categorical variable encoding for supervised learning.

This package is still in a research and development mode.  Functionality and interfaces may change.

Re-encode a set of categorical variables jointly as a spectral projection of the trajectory of modeling residuals.  This is intended as a succinct numeric linear representation of a set of categorical variables in a manner that is useful for supervised learning.

The concept is y-aware encoding the trajectory of non-linear model residuals in terms of target categorical variables.

The idea is an extension of the [`vtreat`](https://github.com/WinVector/vtreat) coding [concepts](https://github.com/WinVector/vtreat/blob/master/extras/vtreat.pdf), the re-encoding concepts of [JavaLogistic](https://github.com/WinVector/Logistic), and of the y-aware scaling concepts of Nina Zumel and John Mount:

  * [Principal Components Regression, Pt.1: The Standard Method](http://www.win-vector.com/blog/2016/05/pcr_part1_xonly/)

  * [Principal Components Regression, Pt. 2: Y-Aware Methods](http://www.win-vector.com/blog/2016/05/pcr_part2_yaware/)

  * [Principal Components Regression, Pt. 3: Picking the Number of Components](http://www.win-vector.com/blog/2016/05/pcr_part3_pickk/)

  * [y-aware scaling in context](http://www.win-vector.com/blog/2016/06/y-aware-scaling-in-context/).

  

  

The core idea is: other models factor the quantity to be explained into an explainable versus residual portion (with respect to the given model).  Each of these components are possibly useful for modeling.

```{r example}

library("CVRTSEncoder")

library("wrapr")

data <- iris

avars <- c("Sepal.Length", "Petal.Length")

evars <- c("Sepal.Width", "Petal.Width")

dep_var <- "Species"

dep_target <- "versicolor"

for(vi in evars) {

  data[[vi]] <- as.character(round(data[[vi]]))

}

str(data)

cross_enc <- estimate_residual_encoding_c(

  data = data,

  avars = avars,

  evars = evars,

  dep_var = dep_var,

  dep_target = dep_target,

  n_comp = 4

)

enc <- prepare(cross_enc$coder, data)

data <- cbind(data, enc)

data %.>%

  head(.) %.>% 

  knitr::kable(.)

f0 <- wrapr::mk_formula(dep_var, avars, outcome_target = dep_target)

print(f0)

model0 <- glm(f0, data = data, family = binomial)

summary(model0)

data$pred0 <- predict(model0, newdata = data, type = "response")

table(data$Species, data$pred0>0.5)

newvars <- c(avars, colnames(enc))

f <- wrapr::mk_formula(dep_var, newvars, outcome_target = dep_target)

print(f)

model <- glmnet::cv.glmnet(as.matrix(data[, newvars, drop = FALSE]), 

                           as.numeric(data[[dep_var]]==dep_target), 

                           family = "binomial")

coef(model, lambda = "lambda.min")

data$pred <- as.numeric(predict(model, newx = as.matrix(data[, newvars, drop = FALSE]), s = "lambda.min"))

table(data$Species, data$pred>0.5)

```

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/winvector/cvrtsencoder

Awesome Lists containing this project

README