https://github.com/mncube/swaprinc
Swap Principal Components into Regression Models
https://github.com/mncube/swaprinc
Last synced: 30 days ago
JSON representation
Swap Principal Components into Regression Models
- Host: GitHub
- URL: https://github.com/mncube/swaprinc
- Owner: mncube
- License: other
- Created: 2023-04-01T15:16:03.000Z (about 3 years ago)
- Default Branch: master
- Last Pushed: 2023-04-18T23:43:09.000Z (about 3 years ago)
- Last Synced: 2025-12-09T14:37:12.342Z (7 months ago)
- Language: R
- Homepage:
- Size: 124 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.Rmd
- Changelog: NEWS.md
- License: LICENSE
Awesome Lists containing this project
README
---
output: github_document
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# swaprinc
The objective of swaprinc is to streamline the comparison between a regression
model using original variables and a model in which some of these variables have
been swapped out for principal components.
## Installation
You can install the released version of swaprinc from [CRAN](https://CRAN.R-project.org) with:
```{r eval=FALSE}
install.packages("swaprinc")
```
You can install the development version of swaprinc from [GitHub](https://github.com/) with:
```{r eval=FALSE}
# install.packages("devtools")
devtools::install_github("mncube/swaprinc")
```
## A Simple Example
In the simple example provided, a regression model estimates the relationship
between x1 and y, while controlling for variables x2 through x10.
By using the default engine, "stats", the statistical model is fitted with
stats::lm, and by using the default prc_eng, "stats", principal components are
extracted with stats::prcomp.
The "raw model" is specified by the formula parameter, which is passed to stats::lm.
The pca_vars and n_pca_components parameters indicate that variables x2 to x10
will be used to extract three principal components. Subsequently, the "PCA model"
is passed to stats::lm as follows: y ~ x1 + PC1 + PC2 + PC3.
By setting the lpca_center and lpca_scale parameters to 'pca', the data in pca_vars
will be centered and scaled according to the guidelines in the
[Step-by-Step PCA](https://cran.r-project.org/package=LearnPCA/vignettes/Vig_03_Step_By_Step_PCA.pdf)
vignette before being passed to stats::prcomp. The miss_handler parameter, set to
'omit', ensures that only complete cases are included by subsetting the data frame
rows with stats::complete.cases.
```{r simple}
library(swaprinc)
# Create a small simulated dataset
set.seed(40)
n <- 50
x1 <- rnorm(n)
x2 <- rnorm(n, 5, 15)
x3 <- rnorm(n, -5.5, 20)
x4 <- rnorm(n, 3, 3) + x3*1.5
x5 <- rnorm(n, -2, 4) + x3*.25
x6 <- rnorm(n, -5, 5) + x4
x7 <- rnorm(n, -2, 6)
x8 <- rnorm(n, 2, 7)
x9 <- rnorm(n, -2, 3) +x2*.4
x10 <- rnorm(n, 5, 4)
y <- 1 + 2 * x1 + 3 * x2 + 2.5*x4 - 3.5*x5 + 2*x6 + 1.5*x7 + x8 + 2*x9 + x10 + rnorm(n)
data <- data.frame(y, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10)
# Run swaprinc with
swaprinc_result <- swaprinc(data,
formula = "y ~ x1 + x2 + x3 + x4 + x5 + x5 + x6 + x7 + x8 + x9 + x10",
pca_vars = c("x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9", "x10"),
n_pca_components = 3,
lpca_center = "pca",
lpca_scale = "pca",
miss_handler = "omit")
# Summarize raw model
summary(swaprinc_result$model_raw)
# Summarize pca model
summary(swaprinc_result$model_pca)
# Get model comparisons
print(swaprinc_result$comparison)
```
## The Motivating Example
A common challenge in applied statistics and data science involves performing
logistic regression with a set of categorical independent variables. In this
motivating example, swaprinc is employed to compare a 'raw' logistic regression
model containing seven categorical independent variables with a 'pca' logistic
regression model. The latter model replaces six of the independent variables with
their first three principal components, using Gifi::princals to extract principal
components. For a comprehensive tutorial on Gifi, refer to
[Nonlinear Principal Components Analysis: Multivariate Analysis with Optimal Scaling (MVAOS)](https://www.css.cornell.edu/faculty/dgr2/_static/files/R_html/NonlinearPCA.html#2_Package).
I recommend using the 'broom' and 'broom.mixed' packages to summarize model
results when utilizing the '*_options' parameters for passing arguments to
functions within 'swaprinc'. This approach helps prevent
[overly extensive summaries caused by 'do.call'](https://stackoverflow.com/questions/75512192/r-do-call-function-returns-to-much/75512429#75512429).
```{r motivation}
# Create a small simulated dataset
set.seed(42)
n <- 50
x1 <- rnorm(n, 0.5, 4)
x2 <- rnorm(n, 3, 15)
x3 <- rnorm(n, -2.5, 5)
x4 <- -2.5*x2 + 3*x3 + rnorm(n, 0, 4)
x5 <- x2*x3 + rnorm(n, -5, 5)*rnorm(n, 5, 10)
x6 <- rnorm(n, -2, 4)*rnorm(n, 3, 5)
x7 <- x4 + x6 + rnorm(n, 0, 3)
y <- 1 + 2*x1 + 3*x2 + -2*x3 + .5*x4 + x5 + 1.5*x6 + x7 + rnorm(n)
data <- data.frame(y, x1, x2, x3, x4, x5, x6, x7)
# Categorize the variables
yq <- stats::quantile(data$y,c(0,1/2, 1))
x1q <- stats::quantile(data$x1,c(0,1/2, 1))
x2q <- stats::quantile(data$x2,c(0,1/4,3/4,1))
x3q <- stats::quantile(data$x3,c(0,2/5,3/5,1))
x4q <- stats::quantile(data$x4,c(0,1/5,4/5,1))
x5q <- stats::quantile(data$x5,c(0,2/5,3/5,1))
x6q <- stats::quantile(data$x6,c(0,2/5,4/5,1))
x7q <- stats::quantile(data$x7,c(0,2/5,3/5,1))
data <- data %>% dplyr::mutate(
y = cut(y, breaks=yq, labels=c("0", "1"),include.lowest = TRUE),
x1 = cut(x1, breaks=x1q, labels=c("control", "treatment"),include.lowest = TRUE),
x2 = cut(x2, breaks=x2q, labels=c("small","medium","large"),include.lowest = TRUE),
x3 = cut(x3, breaks=x3q, labels=c("short","average","tall"),include.lowest = TRUE),
x4 = cut(x4, breaks=x4q, labels=c("lowbit","most","highbit"),include.lowest = TRUE),
x5 = cut(x5, breaks=x5q, labels=c("under","healthy","over"),include.lowest = TRUE),
x6 = cut(x6, breaks=x6q, labels=c("small","medium","large"),include.lowest = TRUE),
x7 = cut(x7, breaks=x7q, labels=c("small","medium","large"),include.lowest = TRUE)) %>%
dplyr::mutate(y = as.numeric(ifelse(y == "0", 0, 1)))
# Run swaprinc with prc_eng set to Gifi
swaprinc_result <- swaprinc(data,
formula = "y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7",
pca_vars = c("x2", "x3", "x4", "x5", "x6", "x7"),
n_pca_components = 3,
prc_eng = "Gifi",
model_options = list(family = binomial(link = "logit")))
# Summarize raw model
broom::tidy(swaprinc_result$model_raw)
# Summarize pca model
broom::tidy(swaprinc_result$model_pca)
# Get model comparisons
print(swaprinc_result$comparison)
```
## Compare Multiple Models
Utilizing the same dataset as in the logistic regression model mentioned earlier,
it is beneficial to compare outcomes for various swaps. In the example below,
the compswap helper function facilitates the comparison of results with 2, 3, 4,
and 5 principal components replacing six original independent variables.
```{r compswap}
# Run swaprinc with prc_eng set to Gifi
compswap_results <- compswap(data,
formula = "y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7",
.pca_varlist = list(c("x2", "x3", "x4", "x5", "x6", "x7")),
.n_pca_list = list(2, 3, 4, 5),
.prc_eng_list = list("Gifi"),
.model_options_list = list(list(family = binomial(link = "logit"))))
# Show available models
summary(compswap_results$all_models)
# Get model comparisons
print(compswap_results$all_comparisons)
# View model summaries
lapply(compswap_results$all_models, broom::tidy)
```