Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/boxuancui/DataExplorer

Automate Data Exploration and Treatment
https://github.com/boxuancui/DataExplorer

cran data-analysis data-exploration data-science eda r r-package rstats visualization

Last synced: 2 months ago
JSON representation

Automate Data Exploration and Treatment

Host: GitHub
URL: https://github.com/boxuancui/DataExplorer
Owner: boxuancui
License: other
Created: 2015-06-28T14:48:43.000Z (about 9 years ago)
Default Branch: master
Last Pushed: 2024-01-24T00:42:25.000Z (5 months ago)
Last Synced: 2024-04-03T09:48:08.104Z (3 months ago)
Topics: cran, data-analysis, data-exploration, data-science, eda, r, r-package, rstats, visualization
Language: R
Homepage: http://boxuancui.github.io/DataExplorer/
Size: 46.8 MB
Stars: 490
Watchers: 34
Forks: 89
Open Issues: 17
Metadata Files:
- Readme: README.Rmd
- License: LICENSE
- Code of conduct: code_of_conduct.md

Lists

awesome-R - DataExplorer - Fast exploratory data analysis with minimum code. (Data Manipulation)
awesome-R - DataExplorer - Fast exploratory data analysis with minimum code. (Data Manipulation)
awesome - DataExplorer - Automate Data Exploration and Treatment (R)
fucking-awesome-R - DataExplorer - Fast exploratory data analysis with minimum code. (Data Manipulation)
jimsghstars - boxuancui/DataExplorer - Automate Data Exploration and Treatment (R)

README

        ---

output: github_document

---

```{r setup, echo = FALSE}

library(DataExplorer)

library(knitr)

library(ggplot2)

library(gridExtra)

set.seed(1)

knitr::opts_chunk$set(

	eval = FALSE,

  collapse = TRUE,

  comment = "#>",

  fig.path = "man/figures/README-"

)

```

# DataExplorer 

[![CRAN Version](http://www.r-pkg.org/badges/version/DataExplorer)](https://cran.r-project.org/package=DataExplorer)

[![Downloads](http://cranlogs.r-pkg.org/badges/DataExplorer)](https://cran.r-project.org/package=DataExplorer)

[![Total Downloads](http://cranlogs.r-pkg.org/badges/grand-total/DataExplorer)](https://cran.r-project.org/package=DataExplorer)

[![GitHub Stars](https://img.shields.io/github/stars/boxuancui/DataExplorer.svg?style=social)](https://github.com/boxuancui/DataExplorer)

[![R-CMD-check](https://github.com/boxuancui/DataExplorer/actions/workflows/check-standard.yaml/badge.svg)](https://github.com/boxuancui/DataExplorer/actions/workflows/check-standard.yaml)

[![codecov](https://codecov.io/gh/boxuancui/DataExplorer/graph/badge.svg?token=w8eMGjF8Jw)](https://app.codecov.io/gh/boxuancui/DataExplorer)

[![CII Best Practices](https://bestpractices.coreinfrastructure.org/projects/2053/badge)](https://bestpractices.coreinfrastructure.org/projects/2053)

[![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.1-4baaaa.svg)](http://boxuancui.github.io/DataExplorer/code_of_conduct.html)

## Background

[Exploratory Data Analysis (EDA)](https://en.wikipedia.org/wiki/Exploratory_data_analysis) is the initial and an important phase of data analysis/predictive modeling. During this process, analysts/modelers will have a first look of the data, and thus generate relevant hypotheses and decide next steps. However, the EDA process could be a hassle at times. This [R](https://cran.r-project.org/) package aims to automate most of data handling and visualization, so that users could focus on studying the data and extracting insights.

## Installation

The package can be installed directly from CRAN.

```{r install-cran}

install.packages("DataExplorer")

```

However, the latest stable version (if any) could be found on [GitHub](https://github.com/boxuancui/DataExplorer), and installed using `devtools` package.

```{r install-github}

if (!require(devtools)) install.packages("devtools")

devtools::install_github("boxuancui/DataExplorer")

```

If you would like to install the latest [development version](https://github.com/boxuancui/DataExplorer/tree/develop), you may install the develop branch.

```{r install-github-develop}

if (!require(devtools)) install.packages("devtools")

devtools::install_github("boxuancui/DataExplorer", ref = "develop")

```

## Examples

The package is extremely easy to use. Almost everything could be done in one line of code. Please refer to the package manuals for more information. You may also find the package vignettes [here](https://boxuancui.github.io/DataExplorer/articles/dataexplorer-intro.html).

#### Report

To get a report for the [airquality](https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/airquality.html) dataset:

```{r create-report-airquality}

library(DataExplorer)

create_report(airquality)

```

To get a report for the [diamonds](https://ggplot2.tidyverse.org/reference/diamonds.html) dataset with response variable **price**:

```{r create-report-diamonds}

library(ggplot2)

create_report(diamonds, y = "price")

```

#### Visualization

Instead of running `create_report`, you may also run each function individually for your analysis, e.g.,

```{r introduce-template}

## View basic description for airquality data

introduce(airquality)

```

```{r introduce, echo=FALSE, eval=TRUE}

kable(t(introduce(airquality)), row.names = TRUE, col.names = "", format.args = list(big.mark = ","))

```

```{r plot-intro, eval=TRUE}

## Plot basic description for airquality data

plot_intro(airquality)

```

```{r plot-missing, eval=TRUE}

## View missing value distribution for airquality data

plot_missing(airquality)

```

```{r plot-bar-template}

## Left: frequency distribution of all discrete variables

plot_bar(diamonds)

## Right: `price` distribution of all discrete variables

plot_bar(diamonds, with = "price")

```

```{r plot-bar-prepare, include=FALSE, echo=FALSE, eval=TRUE}

plot_bar_a <- plot_bar(diamonds, nrow = 3L, ncol = 1L)

plot_bar_b <- plot_bar(diamonds, with = "price", nrow = 3L, ncol = 1L)

```

```{r plot-bar, echo=FALSE, eval=TRUE}

DataExplorer:::plotDataExplorer.grid(

	plot_obj = list(plot_bar_a[[1]], plot_bar_b[[1]]),

	page_layout = list("Page 1" = seq.int(2L)),

	nrow = 1L,

	ncol = 2L,

	ggtheme = theme_gray(),

	theme_config = list(),

	title = NULL

)

```

```{r plot-bar-by, eval=TRUE}

## View frequency distribution by a discrete variable

plot_bar(diamonds, by = "cut")

```

```{r plot-histogram-template}

## View histogram of all continuous variables

plot_histogram(diamonds)

```

```{r plot-histogram, echo=FALSE, eval=TRUE}

plot_histogram(diamonds, nrow = 3L, ncol = 3L)

```

```{r plot-density-template}

## View estimated density distribution of all continuous variables

plot_density(diamonds)

```

```{r plot-density, echo=FALSE, eval=TRUE}

plot_density(diamonds, nrow = 3L, ncol = 3L)

```

```{r plot-qq-template}

## View quantile-quantile plot of all continuous variables

plot_qq(diamonds)

```

```{r plot-qq, echo=FALSE, eval=TRUE}

plot_qq(diamonds, sampled_rows = 500L, geom_qq_args = list("size" = 0.5))

```

```{r plot-qq-cut-template}

## View quantile-quantile plot of all continuous variables by feature `cut`

plot_qq(diamonds, by = "cut")

```

```{r plot-qq-cut, echo=FALSE, eval=TRUE}

plot_qq(diamonds, by = "cut", sampled_rows = 500L, geom_qq_args = list("size" = 0.5))

```

```{r plot_correlation, eval=TRUE}

## View overall correlation heatmap

plot_correlation(diamonds)

```

```{r plot_boxplot-template}

## View bivariate continuous distribution based on `cut`

plot_boxplot(diamonds, by = "cut")

```

```{r plot_boxplot, echo=FALSE, eval=TRUE}

plot_boxplot(diamonds, by = "cut", nrow = 3L, ncol = 3L)

```

```{r plot_scatterplot-template}

## Scatterplot `price` with all other continuous features

plot_scatterplot(split_columns(diamonds)$continuous, by = "price", sampled_rows = 1000L)

```

```{r plot_scatterplot, echo=FALSE, eval=TRUE}

plot_scatterplot(split_columns(diamonds)$continuous, by = "price", sampled_rows = 1000L, geom_point_args = list("size" = 0.5))

```

```{r plot_prcomp-template}

## Visualize principal component analysis

plot_prcomp(diamonds, maxcat = 5L)

```

```{r plot_prcomp, echo=FALSE, eval=TRUE}

plot_prcomp(diamonds, maxcat = 5L, nrow = 2L, ncol = 2L)

```

#### Feature Engineering

To make quick updates to your data:

```{r, eval=FALSE}

## Group bottom 20% `clarity` by frequency

group_category(diamonds, feature = "clarity", threshold = 0.2, update = TRUE)

## Group bottom 20% `clarity` by `price`

group_category(diamonds, feature = "clarity", threshold = 0.2, measure = "price", update = TRUE)

## Dummify diamonds dataset

dummify(diamonds)

dummify(diamonds, select = "cut")

## Set values for missing observations

df <- data.frame("a" = rnorm(260), "b" = rep(letters, 10))

df[sample.int(260, 50), ] <- NA

set_missing(df, list(0L, "unknown"))

## Update columns

update_columns(airquality, c("Month", "Day"), as.factor)

update_columns(airquality, 1L, function(x) x^2)

## Drop columns

drop_columns(diamonds, 8:10)

drop_columns(diamonds, "clarity")

```

## Articles

See [article wiki page](https://github.com/boxuancui/DataExplorer/wiki/Articles).