Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/friendly/mvinfluence
Influence Measures and Diagnostic Plots for Multivariate Linear Models
https://github.com/friendly/mvinfluence
multivariate-analysis multivariate-linear-regression r r-package statistics visualization
Last synced: 2 months ago
JSON representation
Influence Measures and Diagnostic Plots for Multivariate Linear Models
- Host: GitHub
- URL: https://github.com/friendly/mvinfluence
- Owner: friendly
- Created: 2018-04-09T13:19:03.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2022-09-22T22:16:05.000Z (over 2 years ago)
- Last Synced: 2024-09-21T16:38:26.012Z (4 months ago)
- Topics: multivariate-analysis, multivariate-linear-regression, r, r-package, statistics, visualization
- Language: R
- Homepage: https://friendly.github.io/mvinfluence/
- Size: 4.73 MB
- Stars: 2
- Watchers: 4
- Forks: 1
- Open Issues: 0
-
Metadata Files:
- Readme: README.Rmd
Awesome Lists containing this project
README
---
output: github_document
---```{r, echo = FALSE}
knitr::opts_chunk$set(
warning = FALSE, # avoid warnings and messages in the output
message = FALSE,
collapse = TRUE,
fig.width = 5,
fig.height = 5,
comment = "#>",
fig.path = "man/figures/README-"
)par(mar=c(3,3,1,1)+.1)
options(digits=3)
``````{r, echo=FALSE}
library(mvinfluence)
```[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/mvinfluence)](https://cran.r-project.org/package=mvinfluence)
[![](http://cranlogs.r-pkg.org/badges/grand-total/mvinfluence)](https://cran.r-project.org/package=mvinfluence)
[![DOI](https://zenodo.org/badge/128774860.svg)](https://zenodo.org/badge/latestdoi/128774860)# mvinfluence
**Influence Measures and Diagnostic Plots for Multivariate Linear Models**
Version 0.9-1
Functions in this package compute regression deletion diagnostics for multivariate linear models following
methods proposed by Barrett & Ling (1992)
and provide some associated
diagnostic plots. The diagnostic measures include hat-values (leverages), generalized Cook's distance, and
generalized squared 'studentized' residuals. Several types of plots to detect influential observations are
provided.In addition, the functions provide diagnostics for deletion of
subsets of observations of size `m>1`. This case is theoretically interesting
because sometimes pairs (`m=2`) of influential observations can mask each other,
sometimes they can have joint influence far exceeding their individual effects,
as well as other interesting phenomena described by Lawrence (1995).
Associated methods for the case
`m>1` are still under development in this package.## Documentation
Documentation for the package is now available at [https://friendly.github.io/mvinfluence/](https://friendly.github.io/mvinfluence/).## Installation
Get the released CRAN version or this development version:
| | |
|---------------------|---------------------------------------------------|
| CRAN version | `install.packages("mvinfluence")` |
| Development version | `remotes::install_github("friendly/mvinfluence")` |## Goals
The design goal for this package is that, as an extension of standard methods for univariate linear models, you should be able to fit a linear model with a **multivariate** response,
mymlm <- lm( cbind(y1, y2, y3) ~ x1 + x2 + x3, data=mydata)
and then get useful diagnostics and plots with:
influence(mymlm)
hatvalues(mymlm)
cooks.distance(mymlm)
influencePlot(mymlm, ...)As is done in comparable univariate functions in the `car` package, *noteworthy* points are identified in printed output and graphs.
## Examples
The `Rohwer` data contains data on kindergarten children designed to examine how well performance on a set of paired-associate (PA)
learning tasks can predict performance on some measures of aptitude and achievement---
`SAT` (a scholastic aptitude test),
`PPVT` (Peabody Picture Vocabulary Test), and
`Raven` ( Raven Progressive Matrices Test). The PA tasks differ in how the stimulus item was presented:
`n` (named),
`s` (still),
`ns` (named still),
`na` (named action) and
`ss` (sentence still).Here, we fit a MLM to a subset of the Rohwer data (the Low SES group).
```{r rohwer1}
data(Rohwer, package="heplots")
Rohwer2 <- subset(Rohwer, subset=group==2)
rownames(Rohwer2)<- 1:nrow(Rohwer2)
Rohwer.mod <- lm(cbind(SAT, PPVT, Raven) ~ n + s + ns + na + ss, data=Rohwer2)car::Anova(Rohwer.mod)
```### Influence plots
The default influence plot (`type="stres"`) shows the squared standardized residual against the Hat value. The areas of the circles representing the observations are proportional to generalized Cook's distances.
```{r rohwer2}
(infl <-influencePlot(Rohwer.mod, id.n=4, type = "stres"))
```As you can see above,
the function returns a data frame of the influence statistics for the identified points. "Noteworthy" points are those that are unusual on *either* Hat value (H) or the squared studentized residual (Q), so more points will be shown than the `id.n` value. It is often more useful to sort these in descending order by one of the influence measures.```{r infl}
infl |> dplyr::arrange(desc(H))
```An alternative (`type="LR"`) plots residual components against leverage components, both on log scales. Because influence is a product of residual $\times$ Leverage, this plot had the property that contours of constant Cook's distance fall on diagonal lines with slope = -1. Each successive dashed line represents
a **multiple** of Cook's D.
This plot is often easier to read than the standard version.```{r rohwer3}
influencePlot(Rohwer.mod, id.n=4, type="LR")
```We observe that case 5 has the largest leverage and it is highly influential.
Case 25 has the largest residual component and middling leverage, so it is moderately influential.
Cases 14, 29, 27 have nearly identical residuals, and their influence increases from left to right
with leverage.### Index plots
If you wish to see how the observations fare on each of the the measures (as well as Mahalanobis $D^2$ of the residuals from the origin),
the `inflIndexPlot()` function gives you index plots.There are extensive options for identifying and labeling "noteworthy"
observations, with various methods. These rely on `car::showLabels()`, where the default `id.method = "y"` label points whose
Y coordinate is very large.```{r indexplot, fig.width=9}
infIndexPlot(Rohwer.mod,
id.n=3, id.col = "red", id.cex=1.5, id.location="ab")
```In this example, note that while case 5 stands out as influential, it does not have an exceptionally large Mahalanobis
squared distance, $D^2$ of the residuals.# Robust MLMs
Influential cases and those with large residuals can sometimes be dealt with by fitting a **robust** version of
the multivariate model. The function `heplots::robmlm()` uses a simple M-estimator that down-weights cases
with large residuals. Fitting is done by iterated re-weighted least squares (IWLS), using weights based on the Mahalanobis squared distances of the current residuals from the origin, and a scaling (covariance) matrix calculated by `MASS::cov.trob()`.```{r robmlm1}
Rohwer.rmod <- heplots::robmlm(cbind(SAT, PPVT, Raven) ~ n + s + ns + na + ss,
data=Rohwer2)
```The returned object has a `weights` component, the weight for each case in the final iteration. Which ones are less than 0.9 here?
```{r rob-weights}
which(Rohwer.rmod$weights < .9)
```A simple index plot makes the down-weighted observations stand out. Case 5 is not among them, but I label it anyway.
```{r rob-index-plot, fig.width=8, fig.height=4}
par(mar = c(4,4,1,1)+.1)
wts <- Rohwer.rmod$weights
idx <- c(5, which(wts < .9))
plot(wts, type="h",
xlab = "Case index",
ylab = "Robust mlm weight",
cex.lab = 1.25)
rect(0, .9, 33, 1.1,
col=scales::alpha("gray", .25),
border=NA)
points(wts, pch = 16,
cex = ifelse(wts < .9, 1.5, 1),
col = ifelse(wts < .9, "red", "black"))
text(idx, wts[idx], label=idx, pos=3, cex=1.2, xpd=NA )
```What's up with case 5? It had the largest leverage, but it's Mahalanobis $D^2$ was not large.
Thus, it was not down-weighted, even though it is an influential observation.What difference do these observations make in the fitted regression? This calculates the percentage relative difference
between the coefficients in the standard `lm()` and the robust version. The largest changes are for the coefficients
of the `ss` task, but there is an even greater one for `PPVT` on the `n` task.
```{r}
100 * abs(coef(Rohwer.mod) - coef(Rohwer.rmod)) / abs(coef(Rohwer.mod))
```## Citation
To cite `mvinfluence` in publications, use:
```{r citation}
citation("mvinfluence")
```## References
Barrett, B. E. and Ling, R. F. (1992).
General Classes of Influence Measures for Multivariate Regression.
*Journal of the American Statistical Association*, **87**(417), 184-191.Barrett, B. E. (2003). Understanding Influence in Multivariate Regression.
*Communications in Statistics -- Theory and Methods*, **32**, 3, 667-680.Lawrence, A. J. (1995). Deletion Influence and Masking in Regression.
*Journal of the Royal Statistical Society. Series B (Methodological)* , **57**, No. 1, pp. 181-189.