An open API service indexing awesome lists of open source software.

https://github.com/mncube/swaprinc

Swap Principal Components into Regression Models
https://github.com/mncube/swaprinc

Last synced: 30 days ago
JSON representation

Swap Principal Components into Regression Models

Awesome Lists containing this project

README

          

---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```

# swaprinc

The objective of swaprinc is to streamline the comparison between a regression
model using original variables and a model in which some of these variables have
been swapped out for principal components.

## Installation

You can install the released version of swaprinc from [CRAN](https://CRAN.R-project.org) with:

```{r eval=FALSE}
install.packages("swaprinc")
```

You can install the development version of swaprinc from [GitHub](https://github.com/) with:

```{r eval=FALSE}
# install.packages("devtools")
devtools::install_github("mncube/swaprinc")
```

## A Simple Example

In the simple example provided, a regression model estimates the relationship
between x1 and y, while controlling for variables x2 through x10.

By using the default engine, "stats", the statistical model is fitted with
stats::lm, and by using the default prc_eng, "stats", principal components are
extracted with stats::prcomp.

The "raw model" is specified by the formula parameter, which is passed to stats::lm.
The pca_vars and n_pca_components parameters indicate that variables x2 to x10
will be used to extract three principal components. Subsequently, the "PCA model"
is passed to stats::lm as follows: y ~ x1 + PC1 + PC2 + PC3.

By setting the lpca_center and lpca_scale parameters to 'pca', the data in pca_vars
will be centered and scaled according to the guidelines in the
[Step-by-Step PCA](https://cran.r-project.org/package=LearnPCA/vignettes/Vig_03_Step_By_Step_PCA.pdf)
vignette before being passed to stats::prcomp. The miss_handler parameter, set to
'omit', ensures that only complete cases are included by subsetting the data frame
rows with stats::complete.cases.

```{r simple}
library(swaprinc)

# Create a small simulated dataset
set.seed(40)
n <- 50
x1 <- rnorm(n)
x2 <- rnorm(n, 5, 15)
x3 <- rnorm(n, -5.5, 20)
x4 <- rnorm(n, 3, 3) + x3*1.5
x5 <- rnorm(n, -2, 4) + x3*.25
x6 <- rnorm(n, -5, 5) + x4
x7 <- rnorm(n, -2, 6)
x8 <- rnorm(n, 2, 7)
x9 <- rnorm(n, -2, 3) +x2*.4
x10 <- rnorm(n, 5, 4)
y <- 1 + 2 * x1 + 3 * x2 + 2.5*x4 - 3.5*x5 + 2*x6 + 1.5*x7 + x8 + 2*x9 + x10 + rnorm(n)
data <- data.frame(y, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10)

# Run swaprinc with
swaprinc_result <- swaprinc(data,
formula = "y ~ x1 + x2 + x3 + x4 + x5 + x5 + x6 + x7 + x8 + x9 + x10",
pca_vars = c("x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9", "x10"),
n_pca_components = 3,
lpca_center = "pca",
lpca_scale = "pca",
miss_handler = "omit")

# Summarize raw model
summary(swaprinc_result$model_raw)

# Summarize pca model
summary(swaprinc_result$model_pca)

# Get model comparisons
print(swaprinc_result$comparison)

```

## The Motivating Example

A common challenge in applied statistics and data science involves performing
logistic regression with a set of categorical independent variables. In this
motivating example, swaprinc is employed to compare a 'raw' logistic regression
model containing seven categorical independent variables with a 'pca' logistic
regression model. The latter model replaces six of the independent variables with
their first three principal components, using Gifi::princals to extract principal
components. For a comprehensive tutorial on Gifi, refer to
[Nonlinear Principal Components Analysis: Multivariate Analysis with Optimal Scaling (MVAOS)](https://www.css.cornell.edu/faculty/dgr2/_static/files/R_html/NonlinearPCA.html#2_Package).

I recommend using the 'broom' and 'broom.mixed' packages to summarize model
results when utilizing the '*_options' parameters for passing arguments to
functions within 'swaprinc'. This approach helps prevent
[overly extensive summaries caused by 'do.call'](https://stackoverflow.com/questions/75512192/r-do-call-function-returns-to-much/75512429#75512429).

```{r motivation}
# Create a small simulated dataset
set.seed(42)
n <- 50
x1 <- rnorm(n, 0.5, 4)
x2 <- rnorm(n, 3, 15)
x3 <- rnorm(n, -2.5, 5)
x4 <- -2.5*x2 + 3*x3 + rnorm(n, 0, 4)
x5 <- x2*x3 + rnorm(n, -5, 5)*rnorm(n, 5, 10)
x6 <- rnorm(n, -2, 4)*rnorm(n, 3, 5)
x7 <- x4 + x6 + rnorm(n, 0, 3)
y <- 1 + 2*x1 + 3*x2 + -2*x3 + .5*x4 + x5 + 1.5*x6 + x7 + rnorm(n)
data <- data.frame(y, x1, x2, x3, x4, x5, x6, x7)

# Categorize the variables
yq <- stats::quantile(data$y,c(0,1/2, 1))
x1q <- stats::quantile(data$x1,c(0,1/2, 1))
x2q <- stats::quantile(data$x2,c(0,1/4,3/4,1))
x3q <- stats::quantile(data$x3,c(0,2/5,3/5,1))
x4q <- stats::quantile(data$x4,c(0,1/5,4/5,1))
x5q <- stats::quantile(data$x5,c(0,2/5,3/5,1))
x6q <- stats::quantile(data$x6,c(0,2/5,4/5,1))
x7q <- stats::quantile(data$x7,c(0,2/5,3/5,1))

data <- data %>% dplyr::mutate(
y = cut(y, breaks=yq, labels=c("0", "1"),include.lowest = TRUE),
x1 = cut(x1, breaks=x1q, labels=c("control", "treatment"),include.lowest = TRUE),
x2 = cut(x2, breaks=x2q, labels=c("small","medium","large"),include.lowest = TRUE),
x3 = cut(x3, breaks=x3q, labels=c("short","average","tall"),include.lowest = TRUE),
x4 = cut(x4, breaks=x4q, labels=c("lowbit","most","highbit"),include.lowest = TRUE),
x5 = cut(x5, breaks=x5q, labels=c("under","healthy","over"),include.lowest = TRUE),
x6 = cut(x6, breaks=x6q, labels=c("small","medium","large"),include.lowest = TRUE),
x7 = cut(x7, breaks=x7q, labels=c("small","medium","large"),include.lowest = TRUE)) %>%
dplyr::mutate(y = as.numeric(ifelse(y == "0", 0, 1)))

# Run swaprinc with prc_eng set to Gifi
swaprinc_result <- swaprinc(data,
formula = "y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7",
pca_vars = c("x2", "x3", "x4", "x5", "x6", "x7"),
n_pca_components = 3,
prc_eng = "Gifi",
model_options = list(family = binomial(link = "logit")))

# Summarize raw model
broom::tidy(swaprinc_result$model_raw)

# Summarize pca model
broom::tidy(swaprinc_result$model_pca)

# Get model comparisons
print(swaprinc_result$comparison)
```

## Compare Multiple Models

Utilizing the same dataset as in the logistic regression model mentioned earlier,
it is beneficial to compare outcomes for various swaps. In the example below,
the compswap helper function facilitates the comparison of results with 2, 3, 4,
and 5 principal components replacing six original independent variables.

```{r compswap}
# Run swaprinc with prc_eng set to Gifi
compswap_results <- compswap(data,
formula = "y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7",
.pca_varlist = list(c("x2", "x3", "x4", "x5", "x6", "x7")),
.n_pca_list = list(2, 3, 4, 5),
.prc_eng_list = list("Gifi"),
.model_options_list = list(list(family = binomial(link = "logit"))))

# Show available models
summary(compswap_results$all_models)

# Get model comparisons
print(compswap_results$all_comparisons)

# View model summaries
lapply(compswap_results$all_models, broom::tidy)
```