{"id":22966625,"url":"https://github.com/favstats/multicol_sim","last_synced_at":"2026-01-12T07:40:01.809Z","repository":{"id":108351090,"uuid":"146584033","full_name":"favstats/multicol_sim","owner":"favstats","description":"Analyzing Multicollineaerity with a little simulation","archived":false,"fork":false,"pushed_at":"2018-09-05T16:20:06.000Z","size":24751,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-02T05:12:30.756Z","etag":null,"topics":["analysis","blog","collinearity-diagnostics","multicollinearity","simulations","statistics"],"latest_commit_sha":null,"homepage":"","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/favstats.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-08-29T10:29:42.000Z","updated_at":"2025-03-14T02:01:13.000Z","dependencies_parsed_at":null,"dependency_job_id":"f37fb7cc-64c6-47a6-b140-f4c5ebe6c89f","html_url":"https://github.com/favstats/multicol_sim","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/favstats/multicol_sim","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/favstats%2Fmulticol_sim","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/favstats%2Fmulticol_sim/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/favstats%2Fmulticol_sim/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/favstats%2Fmulticol_sim/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/favstats","download_url":"https://codeload.github.com/favstats/multicol_sim/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/favstats%2Fmulticol_sim/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28336624,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-12T06:09:07.588Z","status":"ssl_error","status_checked_at":"2026-01-12T06:05:18.301Z","response_time":98,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["analysis","blog","collinearity-diagnostics","multicollinearity","simulations","statistics"],"created_at":"2024-12-14T20:44:52.980Z","updated_at":"2026-01-12T07:40:01.804Z","avatar_url":"https://github.com/favstats.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"How does Collinearity Influence Linear Regressions?\n================\nFabio Votta\n29 August 2018\n\n## Load Packages\n\n## Simulation Function\n\n``` r\ngenerate_multi \u003c- function(n, cor_seq){\n  set.seed(2017)\n  x \u003c- runif(n, 1, 10)\n  models \u003c- list()\n  std.models \u003c- list()\n  for (jj in seq_along(cor_seq)) {\n    ## generate correlated variable x2\n    dat \u003c- data.frame(corgen(x = x, r = cor_seq[jj],  epsilon = 0))\n    colnames(dat) \u003c- c(\"x1\", \"x2\")\n    ## generate y variable\n    dat$y \u003c- 0.5 * dat$x1 + 0.5 * dat$x2 + rnorm(n, sd = 10)\n    ## modelling and tidy dataframe\n    models[[jj]] \u003c- tidy(lm(y ~ x1 + x2, data = dat))\n    ## get standardized betas\n    std.models[[jj]] \u003c- data.frame(lm.beta::coef.lm.beta((lm.beta::lm.beta(lm(y ~ x1 + x2, data = dat)))))\n    colnames(std.models[[jj]]) \u003c- c(\"std.estimate\")  \n    ## bind it together\n    models[[jj]] \u003c- std.models[[jj]] %\u003e% \n                      bind_cols(models[[jj]]) \n    models[[jj]]$cors \u003c- cor_seq[jj]\n  }\n  sim_dat \u003c- bind_rows(models)\n  sim_dat$col \u003c- n\n  return(sim_dat)\n}\n\ndraw.data \u003c- function(cor_seq = NULL, step_seq = NULL){\n    sim.list \u003c- list()\n    for(jj in seq_along(step_seq)) {\n      sim.list[[jj]] \u003c- generate_multi(n = step_seq[jj], cor_seq)\n      sim.list[[jj]]$n \u003c- step_seq[jj]\n      \n      cat(paste0(\"Batch: \", jj, \"\\t\"))\n      \n    } \n    sim_data \u003c- bind_rows(sim.list)\n    return(sim_data)\n}\n```\n\n## Simulate Data\n\n``` r\nsim_data \u003c- draw.data(cor_seq = seq(0,.99,0.01), step_seq = seq(50, 10000, by = 50))\n\nif(!dir.exists(\"data\")) dir.create(\"data\")\n\nsave(sim_data, file = \"data/sim_data.Rdata\")\n```\n\n## Visualizing the Influence of Collinearity\n\n``` r\nload(\"data/sim_data.Rdata\")\n```\n\n### Standard Errors\n\n``` r\nget_smooths \u003c- function(smooth_dat, n_val, y) {\n  smooth_dat \u003c- filter(smooth_dat, n == n_val \u0026 term == \"x1\")\n  # y \u003c- enquo(y)\n  fm \u003c- paste0(y,\" ~ cors\")\n  smooth_vals \u003c- predict(loess(fm, smooth_dat), smooth_dat$cors) \n  \n  smooth_dat %\u003e% \n  mutate(smooth = smooth_vals) %\u003e% \n  group_by(n) %\u003e% \n  summarise(max_smooth = max(smooth),\n            min_smooth = min(smooth)\n            )\n}\n\nsmooth_dat \u003c- c(50, 100, 150, 200, 10000) %\u003e% \n  map_df(~get_smooths(sim_data, n_val = .x, y = \"std.error\")) %\u003e% \n  mutate(n_lab = ifelse(n == 50, \"Sample Size: 50\", n))\n\nsim_data %\u003e% \n  filter(term == \"x1\") %\u003e% \n  ggplot(aes(cors, std.error, colour = n, group = n)) + \n  geom_smooth(method = \"loess\", se = F, size = 1, alpha = 0.5) +\n  xlab(expression(\"Pearson's\"~r~correlation~between~x[1]~and~x[2])) + \n  ylab(expression(x[1]~Standard~Error)) + \n  theme_hc() + \n  scale_color_viridis(\"Sample Size\", direction = -1,\n       # limits = seq(1000, 10000, 3000),\n       breaks = seq(1000, 10000, 3000),\n       labels = seq(1000, 10000, 3000)) +\n  ggtitle(\"Sample Size and Collinearity Influence on Standard Error\") +\n  geom_point(data = smooth_dat, aes(x = .99, y = max_smooth)) +\n  geom_text_repel(data = smooth_dat, aes(x = .99, y = max_smooth, label = n_lab), \n                  nudge_y = 0.07, nudge_x = 0.03) +\n  guides(colour = guide_colourbar(barwidth = 20, label.position = \"bottom\"))\n```\n\n![](multicol_sim_files/figure-gfm/unnamed-chunk-4-1.png)\u003c!-- --\u003e\n\n``` r\nggsave(filename = \"images/std_static.png\", width = 10, height = 7)\n```\n\n### T-Statistic\n\n``` r\nsim_data  %\u003e% \n     filter(term == \"x1\") %\u003e% \n     ggplot(aes(cors, statistic, colour = n, group = n)) + \n     geom_smooth(method = \"loess\", se = F, size = 1, alpha = 0.5) +\n  xlab(expression(\"Pearson's\"~r~correlation~between~x[1]~and~x[2])) + \n  ylab(expression(x[1]~t-Statistic)) + \n  theme_hc() + \n  scale_color_viridis(\"Sample Size\", direction = -1,\n       # limits = seq(1000, 10000, 3000),\n       breaks = seq(1000, 10000, 3000),\n       labels = seq(1000, 10000, 3000)) +\n  geom_hline(yintercept = 1.96, linetype = \"dashed\", alpha = 0.9) +\n  annotate(geom = \"text\", x = 0, y = 2.3, label = \"t = 1.96\") +\n  ggtitle(\"Sample Size and Collinearity Influence on t-Statistic\") +\n  guides(colour = guide_colourbar(barwidth = 20, label.position = \"bottom\"))\n```\n\n![](multicol_sim_files/figure-gfm/unnamed-chunk-5-1.png)\u003c!-- --\u003e\n\n``` r\nggsave(filename = \"images/t_static.png\", width = 10, height = 7)\n```\n\n#### P-Values\n\n``` r\nsim_data  %\u003e% \n     filter(term==\"x1\") %\u003e% \n     ggplot(aes(cors, p.value, colour = n, group = n)) + \n     geom_smooth(method = \"loess\", se = F, size = 1, alpha = 0.5) +\n  xlab(expression(\"Pearson's\"~r~correlation~between~x[1]~and~x[2])) + \n  ylab(expression(x[1]~p-value)) + \n  theme_hc() + \n  scale_color_viridis(\"Sample Size\", direction = -1,\n       # limits = seq(1000, 10000, 3000),\n       breaks = seq(1000, 10000, 3000),\n       labels = seq(1000, 10000, 3000)) +\n  geom_hline(yintercept = 0.05, linetype = \"dashed\", alpha = 0.9) +\n  annotate(geom = \"text\", x = 0, y = 0.08, label = \"p = 0.05\") +\n  ggtitle(\"Sample Size and Collinearity Influence on p-values\") +\n  guides(colour = guide_colourbar(barwidth = 20, label.position = \"bottom\"))\n```\n\n![](multicol_sim_files/figure-gfm/unnamed-chunk-6-1.png)\u003c!-- --\u003e\n\n``` r\nggsave(filename = \"images/p_static.png\", width = 10, height = 7)\n```\n\n#### B-Coefficients\n\n``` r\nsim_data  %\u003e%\n    filter(term==\"x1\") %\u003e%\n    filter(n\u003e200) %\u003e%\n    ggplot(aes(cors, estimate, colour = n, group = n)) +\n    geom_hline(yintercept = 0.5, linetype = \"dashed\", alpha = 0.9) +\n    #geom_smooth(method = \"loess\", se = F, size = 1, alpha = 0.5) +#\n    geom_line(alpha = 0.5) +\n  xlab(expression(\"Pearson's\"~r~correlation~between~x[1]~and~x[2])) + \n  ylab(expression(x[1]~b-coefficient)) + \n  theme_hc() + \n  scale_color_viridis(\"Sample Size\", direction = -1,\n       # limits = seq(1000, 10000, 3000),\n       breaks = seq(1000, 10000, 3000),\n       labels = seq(1000, 10000, 3000)) +\n  annotate(geom = \"text\", x = 0, y = 0.08, label = \"b = 0.5\") +\n  ggtitle(\"Sample Size and Collinearity Influence on b-coefficients\") +\n  guides(colour = guide_colourbar(barwidth = 20, label.position = \"bottom\"))\n```\n\n![](multicol_sim_files/figure-gfm/unnamed-chunk-7-1.png)\u003c!-- --\u003e\n\n``` r\nggsave(filename = \"images/b_static.png\", width = 10, height = 7)\n```\n\n##### Animation\n\n``` r\nlibrary(gganimate)\n\nsim_data_sub \u003c- sim_data  %\u003e%\n    filter(term == \"x1\") %\u003e%\n    filter(n \u003e 200) %\u003e% \n    filter(n %in% c(300, 1000, 10000)) %\u003e% \n  mutate(estimate_lab = round(estimate, 2) %\u003e% as.character) %\u003e% \n  mutate(n = as.factor(n))\n\n\n anim1 \u003c-  sim_data_sub  %\u003e%\n    # filter(term == \"x1\") %\u003e%\n    # filter(n \u003e 200) %\u003e%\n    ggplot(aes(cors, estimate, colour = n, group = n)) +\n    geom_hline(yintercept = 0.5, linetype = \"dashed\", alpha = 0.9) +\n  geom_line() +\n  geom_segment(aes(xend = 1, yend = estimate), linetype = 2, colour = 'grey') +\n  geom_point(size = 2) +\n  geom_text(aes(x = 1, label = n), hjust = 0, size = 4, fontface = \"bold\") +\n  geom_text(aes(x = 0.15, y = 1.8, label = paste0(\"Correlation: \", cors)), \n                hjust = 1, size = 5, color = \"black\") +\n  geom_text(aes(label = estimate_lab), hjust = 0, size = 3, fontface = \"bold\", nudge_y = 0.1) +\n  xlab(\"Pearson's r correlation between x1 and x2\") +\n  ylab(\"x1 b-coefficient\") + \n  coord_cartesian(clip = 'off') + \n  theme_hc() + \n  scale_color_viridis(\"Sample Size\", \n                      direction = -1, \n                      discrete = T,\n                      begin = 0.3,\n       # limits = seq(1000, 10000, 3000),\n       breaks = seq(1000, 10000, 3000),\n       labels = seq(1000, 10000, 3000)) +\n  guides(colour = F) +\n  theme(title = element_text(size = 15, face = \"bold\"), \n        axis.text.x = element_text(size = 14, face = \"bold\"), \n        axis.text.y = element_text(size = 10, face = \"italic\")) +\n  ggtitle(\"Sample Size and Collinearity Influence on b-coefficients (Sample Sizes: 300, 1000 and 10.000)\") +\n  # guides(colour = guide_colourbar(barwidth = 20, label.position = \"bottom\")) +\n  # Here comes the gganimate code\n  transition_reveal(n, cors) \n\n\nanim1 %\u003e% animate(\n  nframes = 500, fps = 15, width = 1000, height = 600, detail = 3\n)\n\nanim_save(\"images/b_anim.gif\")\n```\n\n![](https://github.com/favstats/multicol_sim/blob/master/images/b_anim.gif?raw=true)\u003c!-- --\u003e\n\n##### Standardized\n\n``` r\nsim_data  %\u003e% \n     filter(term==\"x1\") %\u003e% \n     filter(n\u003e200) %\u003e%\n     ggplot(aes(cors, std.estimate, colour = n, group = n)) + \n     #geom_smooth(method = \"loess\", se = F, size = 1, alpha = 0.5) +\n    geom_line(alpha = 0.5) +\n  xlab(expression(\"Pearson's\"~r~correlation~between~x[1]~and~x[2])) + \n  ylab(expression(x[1]~b-coefficient)) + \n  theme_hc() + \n  scale_color_viridis(\"Sample Size\", direction = -1,\n       # limits = seq(1000, 10000, 3000),\n       breaks = seq(1000, 10000, 3000),\n       labels = seq(1000, 10000, 3000)) +\n  ggtitle(\"Sample Size and Collinearity Influence on standardized b-coefficients\") +\n  guides(colour = guide_colourbar(barwidth = 20, label.position = \"bottom\"))\n```\n\n![](multicol_sim_files/figure-gfm/unnamed-chunk-8-1.png)\u003c!-- --\u003e\n\n``` r\nggsave(filename = \"images/b_standardized_static.png\", width = 10, height = 7)\n```\n\n``` r\nsessionInfo()\n```\n\n    ## R version 3.5.0 (2018-04-23)\n    ## Platform: x86_64-w64-mingw32/x64 (64-bit)\n    ## Running under: Windows 10 x64 (build 17134)\n    ## \n    ## Matrix products: default\n    ## \n    ## locale:\n    ## [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   \n    ## [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   \n    ## [5] LC_TIME=German_Germany.1252    \n    ## \n    ## attached base packages:\n    ## [1] grid      stats     graphics  grDevices utils     datasets  methods  \n    ## [8] base     \n    ## \n    ## other attached packages:\n    ##  [1] bindrcpp_0.2.2     ggrepel_0.8.0      lm.beta_1.5-1     \n    ##  [4] gridExtra_2.3      viridis_0.5.1      viridisLite_0.3.0 \n    ##  [7] ecodist_2.0.1      forcats_0.3.0      stringr_1.3.0     \n    ## [10] dplyr_0.7.5        readr_1.1.1        tidyr_0.8.1       \n    ## [13] tibble_1.4.2       ggplot2_3.0.0.9000 tidyverse_1.2.1   \n    ## [16] ggthemes_4.0.0     broom_0.4.4        purrr_0.2.4       \n    ## [19] arm_1.10-1         lme4_1.1-17        Matrix_1.2-14     \n    ## [22] MASS_7.3-49       \n    ## \n    ## loaded via a namespace (and not attached):\n    ##  [1] Rcpp_0.12.18     lubridate_1.7.4  lattice_0.20-35  assertthat_0.2.0\n    ##  [5] rprojroot_1.3-2  digest_0.6.15    psych_1.8.3.3    R6_2.2.2        \n    ##  [9] cellranger_1.1.0 plyr_1.8.4       backports_1.1.2  evaluate_0.10.1 \n    ## [13] coda_0.19-1      httr_1.3.1       pillar_1.2.1     rlang_0.2.1     \n    ## [17] lazyeval_0.2.1   readxl_1.1.0     minqa_1.2.4      rstudioapi_0.7  \n    ## [21] nloptr_1.0.4     rmarkdown_1.9    labeling_0.3     splines_3.5.0   \n    ## [25] foreign_0.8-70   munsell_0.4.3    compiler_3.5.0   modelr_0.1.1    \n    ## [29] pkgconfig_2.0.1  mnormt_1.5-5     htmltools_0.3.6  tidyselect_0.2.4\n    ## [33] crayon_1.3.4     withr_2.1.2      nlme_3.1-137     jsonlite_1.5    \n    ## [37] gtable_0.2.0     pacman_0.4.6     magrittr_1.5     scales_0.5.0    \n    ## [41] cli_1.0.0        stringi_1.1.7    reshape2_1.4.3   xml2_1.2.0      \n    ## [45] tools_3.5.0      glue_1.3.0       hms_0.4.2        abind_1.4-5     \n    ## [49] parallel_3.5.0   yaml_2.1.19      colorspace_1.4-0 rvest_0.3.2     \n    ## [53] knitr_1.20       bindr_0.1.1      haven_1.1.2\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffavstats%2Fmulticol_sim","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffavstats%2Fmulticol_sim","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffavstats%2Fmulticol_sim/lists"}