https://github.com/mncube/swaprinc

Swap Principal Components into Regression Models
Host: GitHub
URL: https://github.com/mncube/swaprinc
Owner: mncube
License: other
Created: 2023-04-01T15:16:03.000Z (about 3 years ago)
Default Branch: master
Last Pushed: 2023-04-18T23:43:09.000Z (about 3 years ago)
Last Synced: 2025-12-09T14:37:12.342Z (7 months ago)
Language: R
Homepage:
Size: 124 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.Rmd
- Changelog: NEWS.md
- License: LICENSE
README

---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# swaprinc

The objective of swaprinc is to streamline the comparison between a regression
model using original variables and a model in which some of these variables have
been swapped out for principal components.

## Installation

You can install the released version of swaprinc from [CRAN](https://CRAN.R-project.org) with:

```{r eval=FALSE}
install.packages("swaprinc")
```

You can install the development version of swaprinc from [GitHub](https://github.com/) with:

```{r eval=FALSE}
# install.packages("devtools")
devtools::install_github("mncube/swaprinc")
```

## A Simple Example

In the simple example provided, a regression model estimates the relationship
between x1 and y, while controlling for variables x2 through x10.

By using the default engine, "stats", the statistical model is fitted with
stats::lm, and by using the default prc_eng, "stats", principal components are
extracted with stats::prcomp.

The "raw model" is specified by the formula parameter, which is passed to stats::lm.
The pca_vars and n_pca_components parameters indicate that variables x2 to x10
will be used to extract three principal components. Subsequently, the "PCA model"
is passed to stats::lm as follows: y ~ x1 + PC1 + PC2 + PC3.

By setting the lpca_center and lpca_scale parameters to 'pca', the data in pca_vars
will be centered and scaled according to the guidelines in the
[Step-by-Step PCA](https://cran.r-project.org/package=LearnPCA/vignettes/Vig_03_Step_By_Step_PCA.pdf)
vignette before being passed to stats::prcomp. The miss_handler parameter, set to
'omit', ensures that only complete cases are included by subsetting the data frame
rows with stats::complete.cases.

```{r simple}
library(swaprinc)

# Create a small simulated dataset
set.seed(40)
n <- 50
x1 <- rnorm(n)
x2 <- rnorm(n, 5, 15)
x3 <- rnorm(n, -5.5, 20)
x4 <- rnorm(n, 3, 3) + x3 * 1.5
x5 <- rnorm(n, -2, 4) + x3 * .25
x6 <- rnorm(n, -5, 5) + x4
x7 <- rnorm(n, -2, 6)
x8 <- rnorm(n, 2, 7)
x9 <- rnorm(n, -2, 3) + x2 * .4
x10 <- rnorm(n, 5, 4)
y <- 1 + 2 * x1 + 3 * x2 + 2.5 * x4 - 3.5 * x5 + 2 * x6 + 1.5 * x7 + x8 + 2 * x9 + x10 + rnorm(n)
data <- data.frame(y, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10)

# Run swaprinc with the default engines (stats::lm and stats::prcomp)
swaprinc_result <- swaprinc(data,
                            formula = "y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10",
                            pca_vars = c("x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9", "x10"),
                            n_pca_components = 3,
                            lpca_center = "pca",
                            lpca_scale = "pca",
                            miss_handler = "omit")

# Summarize raw model
summary(swaprinc_result$model_raw)

# Summarize pca model
summary(swaprinc_result$model_pca)

# Get model comparisons
print(swaprinc_result$comparison)
```
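
The swap described in this section can be sketched with base R alone. The block below is a minimal illustration of the idea, not swaprinc's internal implementation: it assumes that centering and scaling correspond to `center = TRUE, scale. = TRUE` in stats::prcomp, it uses a smaller simulated dataset, and the names `toy`, `n_comp`, and `comparison` are hypothetical.

```r
# Sketch of the "swap" idea in base R (an assumption-based illustration,
# not the code swaprinc runs internally)
set.seed(40)
n <- 50
x1 <- rnorm(n)
x2 <- rnorm(n, 5, 15)
x3 <- rnorm(n, -5.5, 20)
x4 <- rnorm(n, 3, 3) + x3 * 1.5
y <- 1 + 2 * x1 + 3 * x2 + 2.5 * x4 + rnorm(n)
toy <- data.frame(y, x1, x2, x3, x4)

pca_vars <- c("x2", "x3", "x4")
n_comp <- 2  # hypothetical choice for this toy example

# Keep complete cases only, mirroring miss_handler = "omit"
toy <- toy[stats::complete.cases(toy), ]

# Raw model on the original variables
model_raw <- stats::lm(y ~ x1 + x2 + x3 + x4, data = toy)

# Center and scale the PCA variables, then extract the component scores
pca_fit <- stats::prcomp(toy[, pca_vars], center = TRUE, scale. = TRUE)
scores <- as.data.frame(pca_fit$x[, seq_len(n_comp), drop = FALSE])

# Swap: keep y and x1, replace x2..x4 with PC1 and PC2
toy_pca <- cbind(toy[, c("y", "x1")], scores)
model_pca <- stats::lm(y ~ x1 + PC1 + PC2, data = toy_pca)

# Side-by-side fit statistics for the two models
comparison <- data.frame(
  model = c("raw", "pca"),
  r_squared = c(summary(model_raw)$r.squared, summary(model_pca)$r.squared),
  aic = c(stats::AIC(model_raw), stats::AIC(model_pca))
)
print(comparison)
```

Because the component scores are linear combinations of the centered and scaled original variables, the PCA model here can never have a higher R-squared than the raw model; the comparison shows how much fit is given up for the smaller set of predictors.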

## The Motivating Example

A common challenge in applied statistics and data science involves performing
logistic regression with a set of categorical independent variables. In this
motivating example, swaprinc is employed to compare a 'raw' logistic regression
model containing seven categorical independent variables with a 'pca' logistic
regression model. The latter model replaces six of the independent variables with
their first three principal components, using Gifi::princals to extract principal
components. For a comprehensive tutorial on Gifi, refer to
[Nonlinear Principal Components Analysis: Multivariate Analysis with Optimal Scaling (MVAOS)](https://www.css.cornell.edu/faculty/dgr2/_static/files/R_html/NonlinearPCA.html#2_Package).

I recommend using the 'broom' and 'broom.mixed' packages to summarize model
results when utilizing the '*_options' parameters for passing arguments to
functions within 'swaprinc'. This approach helps prevent
[overly extensive summaries caused by 'do.call'](https://stackoverflow.com/questions/75512192/r-do-call-function-returns-to-much/75512429#75512429).

```{r motivation}
# Create a small simulated dataset
set.seed(42)
n <- 50
x1 <- rnorm(n, 0.5, 4)
x2 <- rnorm(n, 3, 15)
x3 <- rnorm(n, -2.5, 5)
x4 <- -2.5 * x2 + 3 * x3 + rnorm(n, 0, 4)
x5 <- x2 * x3 + rnorm(n, -5, 5) * rnorm(n, 5, 10)
x6 <- rnorm(n, -2, 4) * rnorm(n, 3, 5)
x7 <- x4 + x6 + rnorm(n, 0, 3)
y <- 1 + 2 * x1 + 3 * x2 + -2 * x3 + .5 * x4 + x5 + 1.5 * x6 + x7 + rnorm(n)
data <- data.frame(y, x1, x2, x3, x4, x5, x6, x7)

# Categorize the variables: compute quantile-based break points first
yq <- stats::quantile(data$y, c(0, 1/2, 1))
x1q <- stats::quantile(data$x1, c(0, 1/2, 1))
x2q <- stats::quantile(data$x2, c(0, 1/4, 3/4, 1))
x3q <- stats::quantile(data$x3, c(0, 2/5, 3/5, 1))
x4q <- stats::quantile(data$x4, c(0, 1/5, 4/5, 1))
x5q <- stats::quantile(data$x5, c(0, 2/5, 3/5, 1))
x6q <- stats::quantile(data$x6, c(0, 2/5, 4/5, 1))
x7q <- stats::quantile(data$x7, c(0, 2/5, 3/5, 1))

# Note: %>% must be available here; attach magrittr or dplyr if needed
data <- data %>% dplyr::mutate(
  y = cut(y, breaks = yq, labels = c("0", "1"), include.lowest = TRUE),
  x1 = cut(x1, breaks = x1q, labels = c("control", "treatment"), include.lowest = TRUE),
  x2 = cut(x2, breaks = x2q, labels = c("small", "medium", "large"), include.lowest = TRUE),
  x3 = cut(x3, breaks = x3q, labels = c("short", "average", "tall"), include.lowest = TRUE),
  x4 = cut(x4, breaks = x4q, labels = c("lowbit", "most", "highbit"), include.lowest = TRUE),
  x5 = cut(x5, breaks = x5q, labels = c("under", "healthy", "over"), include.lowest = TRUE),
  x6 = cut(x6, breaks = x6q, labels = c("small", "medium", "large"), include.lowest = TRUE),
  x7 = cut(x7, breaks = x7q, labels = c("small", "medium", "large"), include.lowest = TRUE)) %>%
  # Convert the binary outcome back to numeric 0/1 for the logistic model
  dplyr::mutate(y = as.numeric(ifelse(y == "0", 0, 1)))

# Run swaprinc with prc_eng set to Gifi
swaprinc_result <- swaprinc(data,
                            formula = "y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7",
                            pca_vars = c("x2", "x3", "x4", "x5", "x6", "x7"),
                            n_pca_components = 3,
                            prc_eng = "Gifi",
                            model_options = list(family = binomial(link = "logit")))

# Summarize raw model
broom::tidy(swaprinc_result$model_raw)

# Summarize pca model
broom::tidy(swaprinc_result$model_pca)

# Get model comparisons
print(swaprinc_result$comparison)
```
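
The recommendation to summarize with 'broom' rather than `summary()` can be illustrated with a small base R sketch. It assumes only that extra model arguments are forwarded through `do.call`, as the linked discussion describes; the `toy` data and the object names are hypothetical. When a model is fitted through `do.call`, the arguments are evaluated before the call is stored, so the stored call embeds the whole data frame, and `summary()` prints all of it.

```r
# Sketch: why a model fitted via do.call prints an oversized call
set.seed(42)
toy <- data.frame(x = rnorm(30))
toy$y <- rbinom(30, 1, plogis(0.5 * toy$x))

# Direct call: the stored call refers to `toy` by name
fit_direct <- stats::glm(y ~ x, data = toy, family = binomial(link = "logit"))

# do.call: arguments are evaluated first, so the stored call embeds the data
fit_docall <- do.call(stats::glm,
                      list(formula = y ~ x, data = toy,
                           family = binomial(link = "logit")))

# Compare the size of the two deparsed calls
length_direct <- sum(nchar(deparse(fit_direct$call)))
length_docall <- sum(nchar(deparse(fit_docall$call)))
print(c(direct = length_direct, do.call = length_docall))

# The coefficient table is the same for both fits and does not print the call
coef_table <- coef(summary(fit_docall))
print(coef_table)
```

`broom::tidy()` returns a comparable coefficient table as a data frame, which is why it sidesteps the oversized printout; the base R `coef(summary(...))` call is used here only to keep the sketch free of extra dependencies.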

## Compare Multiple Models

Utilizing the same dataset as in the logistic regression model mentioned earlier,
it is beneficial to compare outcomes for various swaps. In the example below,
the compswap helper function facilitates the comparison of results with 2, 3, 4,
and 5 principal components replacing six original independent variables.

```{r compswap}
# Run compswap with prc_eng set to Gifi for 2, 3, 4, and 5 components
compswap_results <- compswap(data,
                             formula = "y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7",
                             .pca_varlist = list(c("x2", "x3", "x4", "x5", "x6", "x7")),
                             .n_pca_list = list(2, 3, 4, 5),
                             .prc_eng_list = list("Gifi"),
                             .model_options_list = list(list(family = binomial(link = "logit"))))

# Show available models
summary(compswap_results$all_models)

# Get model comparisons
print(compswap_results$all_comparisons)

# View model summaries
lapply(compswap_results$all_models, broom::tidy)
```
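
The idea of comparing several swaps at once can also be sketched in base R. This is an assumption-based illustration, not how compswap is implemented: it uses a linear model and stats::prcomp on simulated numeric data instead of logistic regression with Gifi, and the helper `fit_with_k` and the other object names are hypothetical.

```r
# Sketch: fit one PCA model per component count and tabulate fit statistics
set.seed(1)
n <- 60
x1 <- rnorm(n)
z <- matrix(rnorm(n * 6), ncol = 6, dimnames = list(NULL, paste0("x", 2:7)))
y <- as.numeric(1 + 2 * x1 + z %*% c(1, -1, 0.5, 0, 2, -0.5) + rnorm(n))
toy <- data.frame(y, x1, z)

# Extract all components once from the six variables to be swapped out
pca_fit <- stats::prcomp(toy[, paste0("x", 2:7)], center = TRUE, scale. = TRUE)

# Hypothetical helper: regress y on x1 plus the first k component scores
fit_with_k <- function(k) {
  scores <- as.data.frame(pca_fit$x[, seq_len(k), drop = FALSE])
  stats::lm(y ~ ., data = cbind(toy[, c("y", "x1")], scores))
}

component_counts <- c(2, 3, 4, 5)
models <- lapply(component_counts, fit_with_k)
names(models) <- paste0("pca_", component_counts)

# One row per swap, so the models can be compared side by side
comparisons <- data.frame(
  n_components = component_counts,
  adj_r_squared = vapply(models, function(m) summary(m)$adj.r.squared, numeric(1)),
  aic = vapply(models, stats::AIC, numeric(1))
)
print(comparisons)
```

Since each model nests the previous one, unadjusted R-squared can only increase as components are added, so adjusted R-squared and AIC are the more informative columns when choosing a component count.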