{"id":14069117,"url":"https://github.com/boxuancui/DataExplorer","last_synced_at":"2025-07-30T05:31:26.640Z","repository":{"id":34298701,"uuid":"38202431","full_name":"boxuancui/DataExplorer","owner":"boxuancui","description":"Automate Data Exploration and Treatment","archived":false,"fork":false,"pushed_at":"2024-01-24T00:42:25.000Z","size":49091,"stargazers_count":512,"open_issues_count":17,"forks_count":88,"subscribers_count":34,"default_branch":"master","last_synced_at":"2024-11-15T01:01:44.895Z","etag":null,"topics":["cran","data-analysis","data-exploration","data-science","eda","r","r-package","rstats","visualization"],"latest_commit_sha":null,"homepage":"http://boxuancui.github.io/DataExplorer/","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/boxuancui.png","metadata":{"files":{"readme":"README.Rmd","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"code_of_conduct.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2015-06-28T14:48:43.000Z","updated_at":"2024-11-09T04:23:33.000Z","dependencies_parsed_at":"2024-01-31T11:10:18.936Z","dependency_job_id":"f2218834-aeda-4924-ae8d-fccb684d0156","html_url":"https://github.com/boxuancui/DataExplorer","commit_stats":{"total_commits":193,"total_committers":5,"mean_commits":38.6,"dds":"0.17616580310880825","last_synced_commit":"f908a14b166ccbeb6fb75f5aa615a1d455b7d8a3"},"previous_names":[],"tags_count":20,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/boxuancui%2FDataExplorer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/boxuancui%2FDataExplorer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/host
s/GitHub/repositories/boxuancui%2FDataExplorer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/boxuancui%2FDataExplorer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/boxuancui","download_url":"https://codeload.github.com/boxuancui/DataExplorer/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":228092145,"owners_count":17868140,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cran","data-analysis","data-exploration","data-science","eda","r","r-package","rstats","visualization"],"created_at":"2024-08-13T07:06:37.275Z","updated_at":"2024-12-04T10:30:56.950Z","avatar_url":"https://github.com/boxuancui.png","language":"R","funding_links":[],"categories":["R","Data Manipulation"],"sub_categories":[],"readme":"---\noutput: github_document\n---\n\n```{r setup, echo = FALSE}\nlibrary(DataExplorer)\nlibrary(knitr)\nlibrary(ggplot2)\nlibrary(gridExtra)\nset.seed(1)\n\nknitr::opts_chunk$set(\n\teval = FALSE,\n  collapse = TRUE,\n  comment = \"#\u003e\",\n  fig.path = \"man/figures/README-\"\n)\n```\n\n# DataExplorer \u003cimg src=\"man/figures/logo.png\" align=\"right\" width=\"130\" height=\"150\"/\u003e\n\n[![CRAN Version](http://www.r-pkg.org/badges/version/DataExplorer)](https://cran.r-project.org/package=DataExplorer)\n[![Downloads](http://cranlogs.r-pkg.org/badges/DataExplorer)](https://cran.r-project.org/package=DataExplorer)\n[![Total 
Downloads](http://cranlogs.r-pkg.org/badges/grand-total/DataExplorer)](https://cran.r-project.org/package=DataExplorer)\n[![GitHub Stars](https://img.shields.io/github/stars/boxuancui/DataExplorer.svg?style=social)](https://github.com/boxuancui/DataExplorer)\n[![R-CMD-check](https://github.com/boxuancui/DataExplorer/actions/workflows/check-standard.yaml/badge.svg)](https://github.com/boxuancui/DataExplorer/actions/workflows/check-standard.yaml)\n[![codecov](https://codecov.io/gh/boxuancui/DataExplorer/graph/badge.svg?token=w8eMGjF8Jw)](https://app.codecov.io/gh/boxuancui/DataExplorer)\n[![CII Best Practices](https://bestpractices.coreinfrastructure.org/projects/2053/badge)](https://bestpractices.coreinfrastructure.org/projects/2053)\n[![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.1-4baaaa.svg)](http://boxuancui.github.io/DataExplorer/code_of_conduct.html)\n\n## Background\n\n[Exploratory Data Analysis (EDA)](https://en.wikipedia.org/wiki/Exploratory_data_analysis) is an initial and important phase of data analysis/predictive modeling. During this process, analysts/modelers take a first look at the data, generate relevant hypotheses, and decide on next steps. However, the EDA process can be a hassle at times. 
This [R](https://cran.r-project.org/) package aims to automate most data handling and visualization, so that users can focus on studying the data and extracting insights.\n\n## Installation\n\nThe package can be installed directly from CRAN.\n\n```{r install-cran}\ninstall.packages(\"DataExplorer\")\n```\n\nHowever, the latest stable version (if any) can be found on [GitHub](https://github.com/boxuancui/DataExplorer) and installed using the `devtools` package.\n\n```{r install-github}\nif (!require(devtools)) install.packages(\"devtools\")\ndevtools::install_github(\"boxuancui/DataExplorer\")\n```\n\nIf you would like to install the latest [development version](https://github.com/boxuancui/DataExplorer/tree/develop), you may install the develop branch.\n\n```{r install-github-develop}\nif (!require(devtools)) install.packages(\"devtools\")\ndevtools::install_github(\"boxuancui/DataExplorer\", ref = \"develop\")\n```\n\n## Examples\n\nThe package is extremely easy to use. Almost everything can be done in one line of code. Please refer to the package manuals for more information. 
You may also find the package vignettes [here](https://boxuancui.github.io/DataExplorer/articles/dataexplorer-intro.html).\n\n#### Report\nTo get a report for the [airquality](https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/airquality.html) dataset:\n\n```{r create-report-airquality}\nlibrary(DataExplorer)\ncreate_report(airquality)\n```\n\nTo get a report for the [diamonds](https://ggplot2.tidyverse.org/reference/diamonds.html) dataset with response variable **price**:\n\n```{r create-report-diamonds}\nlibrary(ggplot2)\ncreate_report(diamonds, y = \"price\")\n```\n\n#### Visualization\nInstead of running `create_report`, you may also run each function individually for your analysis, e.g.,\n\n```{r introduce-template}\n## View basic description for airquality data\nintroduce(airquality)\n```\n\n```{r introduce, echo=FALSE, eval=TRUE}\nkable(t(introduce(airquality)), row.names = TRUE, col.names = \"\", format.args = list(big.mark = \",\"))\n```\n\n```{r plot-intro, eval=TRUE}\n## Plot basic description for airquality data\nplot_intro(airquality)\n```\n\n```{r plot-missing, eval=TRUE}\n## View missing value distribution for airquality data\nplot_missing(airquality)\n```\n\n```{r plot-bar-template}\n## Left: frequency distribution of all discrete variables\nplot_bar(diamonds)\n## Right: `price` distribution of all discrete variables\nplot_bar(diamonds, with = \"price\")\n```\n\n```{r plot-bar-prepare, include=FALSE, echo=FALSE, eval=TRUE}\nplot_bar_a \u003c- plot_bar(diamonds, nrow = 3L, ncol = 1L)\nplot_bar_b \u003c- plot_bar(diamonds, with = \"price\", nrow = 3L, ncol = 1L)\n```\n\n```{r plot-bar, echo=FALSE, eval=TRUE}\nDataExplorer:::plotDataExplorer.grid(\n\tplot_obj = list(plot_bar_a[[1]], plot_bar_b[[1]]),\n\tpage_layout = list(\"Page 1\" = seq.int(2L)),\n\tnrow = 1L,\n\tncol = 2L,\n\tggtheme = theme_gray(),\n\ttheme_config = list(),\n\ttitle = NULL\n)\n```\n\n```{r plot-bar-by, eval=TRUE}\n## View frequency distribution by a discrete 
variable\nplot_bar(diamonds, by = \"cut\")\n```\n\n```{r plot-histogram-template}\n## View histogram of all continuous variables\nplot_histogram(diamonds)\n```\n\n```{r plot-histogram, echo=FALSE, eval=TRUE}\nplot_histogram(diamonds, nrow = 3L, ncol = 3L)\n```\n\n```{r plot-density-template}\n## View estimated density distribution of all continuous variables\nplot_density(diamonds)\n```\n\n```{r plot-density, echo=FALSE, eval=TRUE}\nplot_density(diamonds, nrow = 3L, ncol = 3L)\n```\n\n```{r plot-qq-template}\n## View quantile-quantile plot of all continuous variables\nplot_qq(diamonds)\n```\n\n```{r plot-qq, echo=FALSE, eval=TRUE}\nplot_qq(diamonds, sampled_rows = 500L, geom_qq_args = list(\"size\" = 0.5))\n```\n\n```{r plot-qq-cut-template}\n## View quantile-quantile plot of all continuous variables by feature `cut`\nplot_qq(diamonds, by = \"cut\")\n```\n\n```{r plot-qq-cut, echo=FALSE, eval=TRUE}\nplot_qq(diamonds, by = \"cut\", sampled_rows = 500L, geom_qq_args = list(\"size\" = 0.5))\n```\n\n```{r plot_correlation, eval=TRUE}\n## View overall correlation heatmap\nplot_correlation(diamonds)\n```\n\n```{r plot_boxplot-template}\n## View bivariate continuous distribution based on `cut`\nplot_boxplot(diamonds, by = \"cut\")\n```\n\n```{r plot_boxplot, echo=FALSE, eval=TRUE}\nplot_boxplot(diamonds, by = \"cut\", nrow = 3L, ncol = 3L)\n```\n\n```{r plot_scatterplot-template}\n## Scatterplot `price` with all other continuous features\nplot_scatterplot(split_columns(diamonds)$continuous, by = \"price\", sampled_rows = 1000L)\n```\n\n```{r plot_scatterplot, echo=FALSE, eval=TRUE}\nplot_scatterplot(split_columns(diamonds)$continuous, by = \"price\", sampled_rows = 1000L, geom_point_args = list(\"size\" = 0.5))\n```\n\n```{r plot_prcomp-template}\n## Visualize principal component analysis\nplot_prcomp(diamonds, maxcat = 5L)\n```\n\n```{r plot_prcomp, echo=FALSE, eval=TRUE}\nplot_prcomp(diamonds, maxcat = 5L, nrow = 2L, ncol = 2L)\n```\n\n#### Feature Engineering\n\nTo 
make quick updates to your data:\n\n```{r, eval=FALSE}\n## Group bottom 20% `clarity` by frequency\ngroup_category(diamonds, feature = \"clarity\", threshold = 0.2, update = TRUE)\n\n## Group bottom 20% `clarity` by `price`\ngroup_category(diamonds, feature = \"clarity\", threshold = 0.2, measure = \"price\", update = TRUE)\n\n## Dummify diamonds dataset\ndummify(diamonds)\ndummify(diamonds, select = \"cut\")\n\n## Set values for missing observations\ndf \u003c- data.frame(\"a\" = rnorm(260), \"b\" = rep(letters, 10))\ndf[sample.int(260, 50), ] \u003c- NA\nset_missing(df, list(0L, \"unknown\"))\n\n## Update columns\nupdate_columns(airquality, c(\"Month\", \"Day\"), as.factor)\nupdate_columns(airquality, 1L, function(x) x^2)\n\n## Drop columns\ndrop_columns(diamonds, 8:10)\ndrop_columns(diamonds, \"clarity\")\n```\n\n## Articles\n\nSee [article wiki page](https://github.com/boxuancui/DataExplorer/wiki/Articles).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fboxuancui%2FDataExplorer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fboxuancui%2FDataExplorer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fboxuancui%2FDataExplorer/lists"}