https://github.com/elastacloud/automatic-data-explorer

correlations covariance pca summary-statistics


# Automatic Data Explorer [![Build Status](https://travis-ci.org/elastacloud/automatic-data-explorer.svg?branch=master)](https://travis-ci.org/elastacloud/automatic-data-explorer) [![codecov](https://codecov.io/gh/elastacloud/automatic-data-explorer/branch/master/graph/badge.svg)](https://codecov.io/gh/elastacloud/automatic-data-explorer)

An R package to explore and quality check data. It provides functions for automatically checking data quality, summarising factor and numeric data, and reporting correlations:

- `targetCorrelations()`
- `ggdensity()`
- `gghistogram()`
- `SummaryStatsCat()`
- `SummaryStatsNum()`
- `autoMarkdown()`

## Using targetCorrelations

To get started, pass a data frame and the name of the column that you want target correlations for:

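```r
# Install and load purrr
install.packages("purrr")
library(purrr)

# Example data: three numeric columns and one categorical column
data <- data.frame(A = rnorm(50, 0, 1),
                   B = runif(50, 10, 20),
                   C = seq(1, 50, 1),
                   D = rep(LETTERS[1:5], 10))

# Get the correlations with the target column "B"
targetCorrelations(data, "B")
```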

This should give a report similar to the following (the exact values will differ because the example data is random):

```
         C          A 
0.40549008 0.01356416 
```
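For intuition, the kind of calculation behind a report like this can be sketched in base R. This is not the package's implementation: `target_cor_sketch()` is a hypothetical helper, and it assumes Pearson correlation, skips non-numeric columns, and sorts the result by absolute correlation.

```r
# Hypothetical helper (not part of the package): correlate every other
# numeric column with the target column and sort by absolute strength.
target_cor_sketch <- function(df, target) {
  # Keep only numeric columns; factors and characters are skipped
  numeric_cols <- names(df)[vapply(df, is.numeric, logical(1))]
  predictors <- setdiff(numeric_cols, target)

  # Pearson correlation of each predictor with the target
  cors <- vapply(predictors,
                 function(col) cor(df[[col]], df[[target]]),
                 numeric(1))

  # Strongest correlations (by absolute value) first
  cors[order(abs(cors), decreasing = TRUE)]
}

set.seed(42)  # only so that repeated runs give the same numbers
data <- data.frame(A = rnorm(50, 0, 1),
                   B = runif(50, 10, 20),
                   C = seq(1, 50, 1),
                   D = rep(LETTERS[1:5], 10))

target_cor_sketch(data, "B")
```

Column `D` is categorical, so it does not appear in the result of this sketch.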

## Using autoMarkdown

The `autoMarkdown()` function can be used to automatically generate R Markdown files directly from one or more
R scripts. The idea is to take the focus away from thinking about your Markdown styling when doing the
most important part of data science, the actual exploration and analysis.

The function requires that the R script has some formatting; the code that you wish to be incorporated into a
code chunk must be separated with a divider, e.g.

```r
#' # Summary
#' This is the summary of the mtcars dataset

#.#
summary(mtcars)
#.#

#' ## Histogram of mpg
#' This is a histogram of the mpg variable

#.#
autoHistogramPlot(mtcars, mpg, colour = "black", fill = "blue")
#.#
```

There are two things to note in this example:
- `#.#` lines are the dividers; the code between a pair of them is treated as a code chunk
- `#'` lines are recognised by `autoMarkdown()` as Roxygen comments and are treated accordingly

Say that we have saved the example script in a file called `mtcars.R`. We can now write it as R Markdown to an existing
`mtcars.Rmd` file with:

```r
autoMarkdown("mtcars.R", "mtcars.Rmd")
```
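To make the two script conventions concrete, here is a minimal base R sketch of this kind of conversion. It is not the package's code: `script_to_rmd_sketch()` is a hypothetical function, and the rules it applies (strip the `#'` prefix, turn each pair of `#.#` dividers into a chunk, drop any other code outside a chunk) are assumptions based on the description of the script format. Chunk options are left out for brevity.

```r
# Hypothetical converter (not the package's implementation): turn the
# lines of a formatted R script into the lines of an R Markdown document.
script_to_rmd_sketch <- function(lines) {
  fence <- strrep("`", 3)  # the three-backtick chunk delimiter
  out <- character(0)
  in_chunk <- FALSE

  for (line in lines) {
    if (grepl("^#\\.#[[:space:]]*$", line)) {
      # A divider opens a chunk if none is open, otherwise closes it
      out <- c(out, if (in_chunk) fence else paste0(fence, "{r}"))
      in_chunk <- !in_chunk
    } else if (in_chunk) {
      # Code between dividers is copied unchanged
      out <- c(out, line)
    } else if (grepl("^#'", line)) {
      # Roxygen-style comments become Markdown text
      out <- c(out, sub("^#' ?", "", line))
    } else if (!nzchar(trimws(line))) {
      # Blank lines are kept to separate paragraphs
      out <- c(out, "")
    }
    # Any other line outside a chunk is dropped in this sketch
  }
  out
}

script <- c("#' # Summary",
            "#' This is the summary of the mtcars dataset",
            "",
            "#.#",
            "summary(mtcars)",
            "#.#")

rmd <- script_to_rmd_sketch(script)
cat(rmd, sep = "\n")
```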

Most projects will have multiple separate scripts, perhaps covering different stages of the data science life cycle.
This makes the workflow much easier to follow and keeps the code neat and tidy. However, when it comes to reporting,
we most likely want just one report. Multiple scripts can all be written to the same .Rmd
file with:

```r
autoMarkdown(c("DataExploration.R", "DataCleaning.R", "Modelling.R"),
             "ProjectReport.Rmd", overwrite = TRUE)
```

Note the `overwrite = TRUE` argument. With this setting, any existing markdown in the .Rmd file is automatically overwritten. This is useful in most circumstances but could be dangerous if you specify the
wrong .Rmd file, so use it with caution.
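One way to reduce that risk is to copy the target file before overwriting it. The sketch below uses only base R file functions; `backup_rmd()` is a hypothetical helper that is not part of the package, and the `.bak` suffix is an arbitrary choice.

```r
# Hypothetical helper (not part of the package): copy an existing .Rmd
# file to "<name>.bak" so it can be restored after an accidental overwrite.
backup_rmd <- function(path) {
  # Nothing to back up if the report does not exist yet
  if (!file.exists(path)) {
    return(invisible(NULL))
  }
  backup <- paste0(path, ".bak")
  file.copy(path, backup, overwrite = TRUE)
  invisible(backup)
}

# Demonstration on a temporary file
report <- tempfile(fileext = ".Rmd")
writeLines("# Existing report", report)

backup <- backup_rmd(report)
readLines(backup)
```

Calling such a helper on the report path immediately before `autoMarkdown(..., overwrite = TRUE)` would leave a copy of the previous report next to the new one.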

The default setting is to create code chunks that are "quiet"; that is, they display only the results of the code,
not the code itself or any messages generated by it. Further development may include an option to specify a code chunk
that also displays the code itself.
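In knitr terms, a chunk of this kind corresponds to the standard chunk options `echo = FALSE` (hide the code) and `message = FALSE` (hide messages). The sketch below builds a chunk header with or without those options. `chunk_header_sketch()` and its `quiet` argument are hypothetical, and the exact header that `autoMarkdown()` writes is an assumption here.

```r
# Hypothetical helper (not part of the package): build the opening line
# of an R Markdown code chunk, optionally with "quiet" knitr options.
chunk_header_sketch <- function(quiet = TRUE) {
  # echo = FALSE hides the code; message = FALSE hides messages
  opts <- if (quiet) ", echo = FALSE, message = FALSE" else ""
  paste0(strrep("`", 3), "{r", opts, "}")
}

chunk_header_sketch()               # quiet chunk: results only
chunk_header_sketch(quiet = FALSE)  # chunk that also shows its code
```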