Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/tidymodels/corrr

Explore correlations in R
https://github.com/tidymodels/corrr

Last synced: 2 days ago
JSON representation

Explore correlations in R

Awesome Lists containing this project

README

        

---
output: github_document
---

```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-"
)
```

# corrr

[![R-CMD-check](https://github.com/tidymodels/corrr/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/tidymodels/corrr/actions/workflows/R-CMD-check.yaml)
[![CRAN_Status_Badge](https://www.r-pkg.org/badges/version/corrr)](https://cran.r-project.org/package=corrr)
[![Codecov test coverage](https://codecov.io/gh/tidymodels/corrr/branch/main/graph/badge.svg)](https://app.codecov.io/gh/tidymodels/corrr?branch=main)

corrr is a package for exploring **corr**elations in **R**. It focuses on creating and working with **data frames** of correlations (instead of matrices) that can be easily explored via corrr functions or by leveraging tools like those in the [tidyverse](https://www.tidyverse.org/). This, along with the primary corrr functions, is represented below:

You can install:

- the latest released version from CRAN with

```{r install_cran, eval = FALSE}
install.packages("corrr")
```

- the latest development version from GitHub with

```{r install_git, eval = FALSE}
# install.packages("remotes")
remotes::install_github("tidymodels/corrr")
```

## Using corrr

Using `corrr` typically starts with `correlate()`, which acts like the base correlation function `cor()`. It differs by defaulting to pairwise deletion, and returning a correlation data frame (`cor_df`) of the following structure:

- A `tbl` with an additional class, `cor_df`
- An extra "term" column
- Standardized variances (the matrix diagonal) set to missing values (`NA`) so they can be ignored.

### API

The corrr API is designed with data pipelines in mind (e.g., to use `%>%` from the magrittr package). After `correlate()`, the primary corrr functions take a `cor_df` as their first argument, and return a `cor_df` or `tbl` (or output like a plot). These functions serve one of three purposes:

Internal changes (`cor_df` out):

- `shave()` the upper or lower triangle (set to `r NA`).
- `rearrange()` the columns and rows based on correlation strengths.

Reshape structure (`tbl` or `cor_df` out):

- `focus()` on select columns and rows.
- `stretch()` into a long format.

Output/visualizations (console/plot out):

- `fashion()` the correlations for pretty printing.
- `rplot()` the correlations with shapes in place of the values.
- `network_plot()` the correlations in a network.

## Databases and Spark

The `correlate()` function also works with database tables. The function will automatically push the calculations of the correlations to the database, collect the results in R, and return the `cor_df` object. This allows for those results integrate with the rest of the `corrr` API.

## Examples

```{r example, message = FALSE, warning = FALSE}
library(MASS)
library(corrr)
set.seed(1)

# Simulate three columns correlating about .7 with each other
mu <- rep(0, 3)
Sigma <- matrix(.7, nrow = 3, ncol = 3) + diag(3)*.3
seven <- mvrnorm(n = 1000, mu = mu, Sigma = Sigma)

# Simulate three columns correlating about .4 with each other
mu <- rep(0, 3)
Sigma <- matrix(.4, nrow = 3, ncol = 3) + diag(3)*.6
four <- mvrnorm(n = 1000, mu = mu, Sigma = Sigma)

# Bind together
d <- cbind(seven, four)
colnames(d) <- paste0("v", 1:ncol(d))

# Insert some missing values
d[sample(1:nrow(d), 100, replace = TRUE), 1] <- NA
d[sample(1:nrow(d), 200, replace = TRUE), 5] <- NA

# Correlate
x <- correlate(d)
class(x)
x
```

**NOTE: Previous to corrr 0.4.3, the first column of a `cor_df` dataframe was named "rowname". As of corrr 0.4.3, the name of this first column changed to "term".**

As a `tbl`, we can use functions from data frame packages like `dplyr`, `tidyr`, `ggplot2`:

```{r, message = FALSE, warning = FALSE}
library(dplyr)

# Filter rows by correlation size
x %>% filter(v1 > .6)
```

corrr functions work in pipelines (`cor_df` in; `cor_df` or `tbl` out):

```{r combination, warning = FALSE, fig.height = 4, fig.width = 5}
x <- datasets::mtcars %>%
correlate() %>% # Create correlation data frame (cor_df)
focus(-cyl, -vs, mirror = TRUE) %>% # Focus on cor_df without 'cyl' and 'vs'
rearrange() %>% # rearrange by correlations
shave() # Shave off the upper triangle for a clean result

fashion(x)
rplot(x)

datasets::airquality %>%
correlate() %>%
network_plot(min_cor = .2)
```

## Contributing

This project is released with a [Contributor Code of Conduct](https://contributor-covenant.org/version/2/1/CODE_OF_CONDUCT.html). By contributing to this project, you agree to abide by its terms.

- For questions and discussions about tidymodels packages, modeling, and machine learning, please [post on RStudio Community](https://community.rstudio.com/new-topic?category_id=15&tags=tidymodels,question).

- If you think you have encountered a bug, please [submit an issue](https://github.com/tidymodels/corrr/issues).

- Either way, learn how to create and share a [reprex](https://reprex.tidyverse.org/articles/articles/learn-reprex.html) (a minimal, reproducible example), to clearly communicate about your code.

- Check out further details on [contributing guidelines for tidymodels packages](https://www.tidymodels.org/contribute/) and [how to get help](https://www.tidymodels.org/help/).