{"id":18430157,"url":"https://github.com/friendly/viscollin","last_synced_at":"2026-01-20T08:01:30.530Z","repository":{"id":187921240,"uuid":"677808977","full_name":"friendly/VisCollin","owner":"friendly","description":"Visualizing Collinearity Diagnostics","archived":false,"fork":false,"pushed_at":"2025-12-21T23:51:22.000Z","size":3952,"stargazers_count":1,"open_issues_count":1,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-12-23T08:54:45.859Z","etag":null,"topics":["biplots","collinearity-diagnostics","graphics","regression-models"],"latest_commit_sha":null,"homepage":"https://friendly.github.io/VisCollin/","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/friendly.png","metadata":{"files":{"readme":"README.Rmd","changelog":"NEWS.md","contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2023-08-12T17:49:34.000Z","updated_at":"2025-12-21T23:51:25.000Z","dependencies_parsed_at":null,"dependency_job_id":"b21ac2d7-aab6-48f9-aa32-8e282538106d","html_url":"https://github.com/friendly/VisCollin","commit_stats":null,"previous_names":["friendly/viscollin"],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/friendly/VisCollin","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/friendly%2FVisCollin","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/friendly%2FVisCollin/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/friendly%2FVisCollin/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/friendly%2FVisCollin/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/friendly","download_url":"https://codeload.github.com/friendly/VisCollin/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/friendly%2FVisCollin/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28598874,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-20T02:08:49.799Z","status":"ssl_error","status_checked_at":"2026-01-20T02:08:44.148Z","response_time":117,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["biplots","collinearity-diagnostics","graphics","regression-models"],"created_at":"2024-11-06T05:19:46.311Z","updated_at":"2026-01-20T08:01:30.517Z","avatar_url":"https://github.com/friendly.png","language":"R","readme":"---\noutput: github_document\n---\n\n\u003c!-- README.md is generated from 

```{r, include = FALSE}
knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE,
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)

options(digits = 3)
```

<!-- badges: start -->
[![CRAN](https://www.r-pkg.org/badges/version/VisCollin)](https://cran.r-project.org/package=VisCollin)
[![R-universe](https://friendly.r-universe.dev/badges/VisCollin)](https://friendly.r-universe.dev/VisCollin)
[![Last Commit](https://img.shields.io/github/last-commit/friendly/VisCollin)](https://github.com/friendly/VisCollin)
[![Lifecycle: stable](https://img.shields.io/badge/lifecycle-stable-green.svg)](https://lifecycle.r-lib.org/articles/stages.html#stable)
[![License](https://img.shields.io/badge/license-GPL%20%28%3E=%202%29-brightgreen.svg?style=flat)](https://www.gnu.org/licenses/gpl-2.0.html)
[![Downloads](https://cranlogs.r-pkg.org/badges/VisCollin?color=brightgreen)](https://www.r-pkg.org:443/pkg/VisCollin)
<!-- badges: end -->

# VisCollin <img src="man/figures/logo.png" style="float:right; height:200px;" />

**Visualizing Collinearity Diagnostics**

Version `r packageDescription("VisCollin")$Version`; documentation built for `pkgdown` `r Sys.Date()`

The `VisCollin` package provides methods to calculate diagnostics for multicollinearity among the predictors in a linear or generalized linear model. It also provides methods to visualize those diagnostics, following Friendly & Kwan (2009), "Where’s Waldo: Visualizing Collinearity Diagnostics", _The American Statistician_, **63**, 56–65.

These include:

* a better **tabular presentation** of collinearity diagnostics that highlights the important numbers,
* a semi-graphic **tableplot** of the diagnostics that makes warning and danger levels more salient, and
* a **collinearity biplot** of the _smallest dimensions_ of predictor space, where collinearity is most apparent.

## Installation

+----------------------+---------------------------------------------------+
| CRAN version         | `install.packages("VisCollin")`                   |
+----------------------+---------------------------------------------------+
| Development version  | `remotes::install_github("friendly/VisCollin")`   |
+----------------------+---------------------------------------------------+

## Tutorial example

```{r load}
library(VisCollin)
library(dplyr)
library(tidyr)
library(car)
library(corrplot)
library(tinytable)
```

This example uses the `cars` data set, containing various measures of size and performance for 406 models of automobiles with model years 1970--1982. Interest is focused on predicting gas mileage, `mpg`.

```{r cars}
data(cars, package = "VisCollin")
str(cars)
```

### Fit a model

Fit a model predicting gas mileage (`mpg`) from the number of cylinders, engine displacement, horsepower, weight, time to accelerate from 0 to 60 mph, and model year (1970--1982). Perhaps surprisingly, only `weight` and `year` appear to significantly predict gas mileage. What's going on here?

```{r cars-mod}
cars.mod <- lm(mpg ~ cylinder + engine + horse + weight + accel + year,
               data = cars)
Anova(cars.mod)
```
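Before turning to diagnostics, it helps to confirm that the non-significance is not because these predictors are unrelated to `mpg`. A minimal sketch (the choice of `horse` here is purely illustrative, not part of the package demo):

```{r marginal-horse}
# Marginally, horsepower is a strong predictor of gas mileage ...
coef(summary(lm(mpg ~ horse, data = cars)))

# ... but in the full model its t statistic collapses
coef(summary(cars.mod))["horse", ]
```

Most of these predictors are strongly related to `mpg` on their own; the trouble arises only when they compete in the same model.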
`lmtest::coeftest()` shows the coefficients, $\hat{\beta}_j$, their standard errors, $s(\hat{\beta}_j)$, and the associated $t$ statistics, $t = \hat{\beta}_j / s(\hat{\beta}_j)$. As we will see, the standard errors of the non-significant predictors have been inflated by high multiple correlations among the predictors, making their $t$ statistics smaller.

```{r coeftest}
lmtest::coeftest(cars.mod)
```

### Correlation matrix

It is often recommended to examine the correlation matrix of the predictors to diagnose collinearity problems. In the general case this advice is misguided, because it is not the zero-order correlations that matter, but rather the **multiple correlations** predicting each predictor from all the others, $R_{x_j \mid \text{others}}$.

Nonetheless, it is instructive to examine the correlations.

```{r cars-R}
R <- cars |>
  select(cylinder:year) |>
  tidyr::drop_na() |>
  cor()

100 * R |> round(digits = 2)
```

Or, better yet, use `corrplot::corrplot.mixed()` to visualize them, using color and shading of glyphs.

```{r cars-corrgram}
#| fig.width = 6,
#| fig.height = 6,
#| out.width = "60%",
#| fig.cap = "Corrgram display of correlations among the `cars` variables."
corrplot.mixed(R, lower = "square", upper = "ellipse", tl.col = "black")
```

The message here seems to be that there are two clusters of predictors with high correlations: {`cylinder`, `engine`, `horse`, `weight`} and {`accel`, `year`}.

### Variance inflation factors

Variance inflation factors measure the effect of multicollinearity on the standard errors of the estimated coefficients and are proportional to $1 / (1 - R^2_{x_j \mid \text{others}})$.

We check the variance inflation factors using `car::vif()`. Most predictors have high VIFs, indicating moderately severe multicollinearity.

```{r cars-vif}
vif(cars.mod)

sqrt(vif(cars.mod))
```

According to $\sqrt{\text{VIF}}$, the standard error of `cylinder` has been multiplied by 3.26, and its $t$ statistic divided by this number, compared with the case where all predictors are uncorrelated. `engine`, `horse` and `weight` suffer a similar fate.
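These values follow directly from the definition. A minimal sketch, computing each $R^2_{x_j \mid \text{others}}$ by brute force (the names `X` and `R2` are ad hoc; the results should agree with `car::vif()` up to the set of complete cases used):

```{r vif-by-hand}
# VIF_j = 1 / (1 - R^2_{x_j | others}): regress each predictor on the others
X <- cars |>
  select(cylinder:year) |>
  tidyr::drop_na()

R2 <- sapply(names(X), function(v) {
  summary(lm(reformulate(setdiff(names(X), v), response = v),
             data = X))$r.squared
})
round(1 / (1 - R2), 2)
```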
### Collinearity diagnostics

The diagnostic measures introduced by Belsley (1991) are based on the eigenvalues, $\lambda_1, \lambda_2, \dots, \lambda_p$, of the correlation matrix $R_{X}$ of the predictors (preferably centered and scaled, and not including the constant term for the intercept), and the corresponding eigenvectors in the columns of $\mathbf{V}_{p \times p}$.

`colldiag()` calculates:

* **Condition indices**: The smallest eigenvalues, those for which $\lambda_j \approx 0$, indicate collinearity, and the number of small values indicates the number of nearly collinear relations. Because the sum of the eigenvalues, $\sum_i \lambda_i = p$, increases with the number of predictors $p$, it is useful to scale them all in relation to the largest. This leads to _condition indices_, defined as $\kappa_j = \sqrt{\lambda_1 / \lambda_j}$. These have the property that the resulting numbers have common interpretations regardless of the number of predictors.

  + For completely uncorrelated predictors, all $\kappa_j = 1$.
  + $\kappa_j \rightarrow \infty$ as any $\lambda_k \rightarrow 0$.
  + In terms of the eigen-decomposition, the variance inflation factors can be expressed as
$$
\text{VIF}_j = \sum_{k=1}^{p} \frac{V^2_{jk}}{\lambda_k} \; .
$$

* **Variance decomposition proportions**: Large VIFs indicate variables that are involved in _some_ nearly collinear relations, but they don't indicate _which_ other variable(s) each is involved with. For this purpose, Belsley et al. (1980) and Belsley (1991) proposed calculating the proportions of variance of each coefficient associated with each principal component, as a decomposition of the coefficient variance for each dimension.

For the current model, the usual display contains both the condition indices and variance proportions. However, even for a small example, it is often difficult to know which numbers to pay attention to.

```{r colldiag1}
(cd <- colldiag(cars.mod, center = TRUE))
```

Belsley (1991) recommends that the sources of collinearity be diagnosed (a) only for those components with large $\kappa_j$, and (b) only for those components for which the variance proportion is large (say, $\ge 0.5$) on _two_ or more predictors. The print method for `"colldiag"` objects has a `fuzz` argument controlling this.

```{r colldiag2}
print(cd, fuzz = 0.5)
```

The mystery is solved: there are two nearly collinear relations among the predictors, corresponding to the two smallest dimensions.

* Dimension 5 reflects the high correlation between horsepower and weight.
* Dimension 6 reflects the high correlation between number of cylinders and engine displacement.

Note that the high variance proportion for `year` (0.787) on the second component creates no problem and should be ignored, because (a) the condition index is low and (b) it shares nothing with the other predictors.
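These diagnostics can be verified directly from the eigen-decomposition of the predictor correlation matrix `R` computed earlier, using the formulas above. A minimal sketch (`phi` and `vprop` are ad hoc names; small discrepancies from `colldiag()` can arise from the set of complete cases used):

```{r colldiag-by-hand}
eig <- eigen(R)
lambda <- eig$values
V <- eig$vectors

# condition indices: kappa_j = sqrt(lambda_1 / lambda_j)
round(sqrt(lambda[1] / lambda), 3)

# phi[j, k] = V[j, k]^2 / lambda[k]; row j sums to VIF_j
phi <- sweep(V^2, 2, lambda, "/")
rowSums(phi)   # the variance inflation factors, again

# variance-decomposition proportions, arranged as in colldiag():
# rows are dimensions, columns are predictors
vprop <- t(phi / rowSums(phi))
dimnames(vprop) <- list(paste0("Dim", 1:nrow(vprop)), colnames(R))
round(vprop, 3)
```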
### Tableplot

The simplified tabular display above can be improved to make the patterns of collinearity more visually apparent and to signal warnings directly to the eyes. A "tableplot" (Kwan et al., 2009) is a semi-graphic display that presents numerical information in a table, using shapes proportional to the value in a cell and other visual attributes (shape type, color fill, and so forth) to encode additional information.

For collinearity diagnostics, these show:

* the condition indices, using _squares_ whose background color signals danger (red, for condition indices > 10), warning (for values > 5), and OK (green, otherwise). The value of the condition index is encoded within this using a white square whose side is proportional to the value (up to some maximum value, `cond.max`).
* the variance decomposition proportions, shown by filled _circles_ whose radius is proportional to those values and which are filled (by default) with shades ranging from white through pink to red. Rounded values of those diagnostics are printed in the cells.

The tableplot below encodes all the information from the values of `colldiag()` printed above (but using `prop.col` color breaks such that variance proportions < 0.3 are shaded white). The visual message is that one should attend to collinearities with large condition indices **and** large variance proportions implicating two or more predictors.

<!-- ```{r cars-tableplot0} -->
<!-- knitr::include_graphics("man/figures/cars-tableplot.png") -->
<!-- ``` -->

```{r cars-tableplot}
tableplot(cd, title = "Tableplot of cars data", cond.max = 30)
```

### `tinytable` display

A new display of these collinearity diagnostics is now included in a `tt()` method for `"colldiag"` results. This is similar to the `tableplot` display, but rendered as a `tinytable`. It uses similar background shading for the condition indices and variance proportions, but also allows the font size of the variance proportions to be made proportional to their values, scaled to a given range, 1.0--1.5 in this example.

```{r tt-colldiag}
#| out-width: "60%"
tt(cd,
   descending = TRUE,
   fuzz = 0.3,
   font.scale = c(1, 1.5)) |>
  save_tt("man/figures/README-tt-colldiag.png", overwrite = TRUE)

knitr::include_graphics("man/figures/README-tt-colldiag.png")
```

### Collinearity biplot

The standard biplot (Gabriel, 1971; Gower & Hand, 1996) can be regarded as a multivariate analog of a scatterplot, obtained by projecting a multivariate sample into a low-dimensional space (typically of 2 or 3 dimensions) accounting for the greatest variance in the data. With the symmetric (PCA) scaling used here, this is equivalent to a plot of principal component scores of the mean-centered matrix $\widetilde{\mathbf{X}} = \mathbf{X} - \bar{\mathbf{X}}$ of predictors for the observations (shown as points or case labels), together with principal component coefficients for the variables (shown as vectors) in the same 2D (or 3D) space.

However, the standard biplot is less useful for visualizing the relations among the predictors that lead to nearly collinear relations. Instead, biplots of the **smallest dimensions** show these relations directly, and can show other features of the data as well, such as outliers and leverage points. We use `prcomp(X, scale. = TRUE)` to obtain the PCA of the correlation matrix of the predictors:

```{r cars-pca}
cars.X <- cars |>
  select(where(is.numeric)) |>
  select(-mpg) |>
  tidyr::drop_na()
cars.pca <- prcomp(cars.X, scale. = TRUE)
cars.pca
```

The standard deviations above are the square roots, $\sqrt{\lambda_j}$, of the eigenvalues of the correlation matrix, and are returned in the `sdev` component of the `"prcomp"` object. The eigenvectors are returned in the `rotation` component; their directions are arbitrary.

```{r cars-biplot-prep}
# Make labels for the dimensions that include the % of variance
pct <- 100 * (cars.pca$sdev^2) / sum(cars.pca$sdev^2)
lab <- glue::glue("Dimension {1:6} ({round(pct, 2)}%)")

# The direction of the eigenvectors is arbitrary; reflect them
cars.pca$rotation <- -cars.pca$rotation
```

The collinearity biplot is then constructed as follows:

```{r cars-biplot}
#| fig.show = "hold",
#| out.width = "100%"
op <- par(lwd = 2, xpd = NA)
biplot(cars.pca,
       choices = 6:5,         # only the last two dimensions
       scale = 0.5,           # symmetric biplot scaling
       cex = c(0.6, 1),       # character sizes for points and vectors
       col = c("black", "blue"),
       expand = 1.7,          # expand variable vectors for visibility
       xlab = lab[6],
       ylab = lab[5],
       xlim = c(-0.7, 0.5),
       ylim = c(-0.8, 0.5)
      )
par(op)
```

The projections of the variable vectors on the Dimension 5 and Dimension 6 axes are proportional to their variance proportions shown above. The relative lengths of these variable vectors can be taken to indicate the extent to which each variable contributes to collinearity in these two near-singular dimensions.

Thus, we see again that Dimension 6 is largely determined by `engine` size, with a substantial (negative) relation to `cylinder`. Dimension 5 has its strongest relations to `weight` and `horse`.

Moreover, there is one observation, #20, that stands out as an outlier in predictor space, far from the centroid. It turns out that this vehicle, a Buick Estate wagon, is an early-year (1970) American behemoth, with an 8-cylinder, 455 cu. in., 225 horsepower engine, able to go from 0 to 60 mph in 10 sec. (Its MPG is only slightly under-predicted by the regression model, however.)

### Remedies for collinearity: What to do?

Collinearity is often a **data** problem, for which there is no magic cure. Nevertheless, there are some general guidelines and useful techniques to address it.

* **Pure prediction**: If we are interested only in predicting or explaining an outcome, and not in the model coefficients or which predictors are "significant", collinearity can be largely ignored. The fitted values are unaffected by collinearity.

* **Structural collinearity**: Sometimes collinearity results from structural relations among the variables:

  + For example, polynomial terms, like $x, x^2, x^3$, or interaction terms, like $x_1, x_2, x_1 * x_2$, are necessarily correlated. A simple cure is to _center_ the predictors at their means, using $x - \bar{x}, (x - \bar{x})^2, (x - \bar{x})^3$ or $(x_1 - \bar{x}_1), (x_2 - \bar{x}_2), (x_1 - \bar{x}_1) * (x_2 - \bar{x}_2)$.

  + When some predictors share a common cause, as with GNP and population in time-series or cross-national data, you can reduce collinearity by re-defining predictors as _per capita measures_.

* **Model re-specification**:

  + Drop one or more regressors that have a high VIF, if they are not deemed essential.

  + Replace highly correlated regressors with linear combination(s) of them. For example, two related variables, $x_1$ and $x_2$, can be replaced without any loss of information by their sum and difference, $z_1 = x_1 + x_2$ and $z_2 = x_1 - x_2$.

* **Statistical remedies**:

  + Transform the predictors to uncorrelated principal components.

  + Use **regularization methods**, such as ridge regression and the lasso, which counteract collinearity by shrinking the coefficients towards 0, introducing a small amount of bias. See the [genridge](https://CRAN.R-project.org/package=genridge) package and its [`pkgdown` documentation](https://friendly.github.io/genridge/) for visualization methods.

  + Use Bayesian regression: if multicollinearity prevents a regression coefficient from being estimated precisely, a prior on that coefficient will help to reduce its posterior variance.
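To make two of these remedies concrete, here is a minimal, hypothetical sketch using the `cars` predictors (`std`, `z1`, `z2`, and `cars2` are ad hoc names, not part of the package):

```{r remedies-sketch}
# Structural collinearity: centering a predictor removes much of the
# correlation with its square
x <- na.omit(cars$horse)
cor(x, x^2)
cor(x - mean(x), (x - mean(x))^2)

# Re-specification: replace standardized weight and horse by their sum
# and difference; the fit is unchanged, but the collinearity is
# redistributed
std <- function(v) (v - mean(v, na.rm = TRUE)) / sd(v, na.rm = TRUE)
cars2 <- cars |>
  mutate(z1 = std(weight) + std(horse),
         z2 = std(weight) - std(horse))
cars2.mod <- update(cars.mod, . ~ . - weight - horse + z1 + z2,
                    data = cars2)
c(summary(cars.mod)$r.squared, summary(cars2.mod)$r.squared)  # identical
vif(cars2.mod)
```

Typically the difference `z2` turns out to be far less collinear with the remaining predictors than `weight` and `horse` were, while the sum `z1` still carries their shared size information.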
## References

Belsley, D. A., Kuh, E., and Welsch, R. (1980). _Regression Diagnostics_. New York: John Wiley & Sons.

Belsley, D. A. (1991). _Conditioning Diagnostics: Collinearity and Weak Data in Regression_. New York: John Wiley & Sons.

Friendly, M., & Kwan, E. (2009). "Where’s Waldo: Visualizing Collinearity Diagnostics." _The American Statistician_, **63**, 56–65. Online: [https://www.datavis.ca/papers/viscollin-tast.pdf](https://www.datavis.ca/papers/viscollin-tast.pdf). Supplementary materials: [https://www.datavis.ca/papers/viscollin/](https://www.datavis.ca/papers/viscollin/)

Gabriel, K. R. (1971). The Biplot Graphic Display of Matrices with Application to Principal Components Analysis. _Biometrika_, **58**, 453–467.

Gower, J. C., & Hand, D. J. (1996). _Biplots_. London: Chapman & Hall.

Kwan, E., Lu, I. R. R., & Friendly, M. (2009). Tableplot: A new tool for assessing precise predictions. _Zeitschrift für Psychologie / Journal of Psychology_, **217**, 38–48.