https://github.com/ncordon/smartdata

R package for data preprocessing

---
output: github_document
---

```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "README-"
)
```

[![Build Status](https://travis-ci.com/ncordon/smartdata.svg?branch=master)](https://travis-ci.com/ncordon/smartdata)
[![minimal R version](https://img.shields.io/badge/R%3E%3D-3.5.0-6666ff.svg)](https://cran.r-project.org/)
[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/smartdata)](https://cran.r-project.org/package=smartdata)
[![packageversion](https://img.shields.io/badge/Package%20version-1.0.2-orange.svg?style=flat-square)](https://github.com/ncordon/smartdata/commits/master)

# smartdata

`smartdata` is an R package that integrates preprocessing algorithms for oversampling, instance and feature selection, normalization, discretization, space transformation, and cleaning of outliers, missing values, and noise.

## Installation

You can install the latest stable release of `smartdata` from CRAN with:

```{r gh-installation, eval = FALSE}
# This sets both CRAN and Bioconductor as repositories to resolve dependencies
setRepositories(ind = 1:2)
install.packages("smartdata")
```

and load it into an R session with:

```{r results='hide', message=FALSE, warning=FALSE}
library("smartdata")
```

## Examples

`smartdata` provides the following wrappers:

* `oversample`
* `instance_selection`
* `feature_selection`
* `normalize`
* `discretize`
* `space_transformation`
* `clean_outliers`
* `impute_missing`
* `clean_noise`

To list the methods available for a given wrapper, call `which_options` with the wrapper name:

```{r options}
which_options("instance_selection")
```

To see the parameters a particular method accepts, pass the method name as the second argument:

```{r options_method}
which_options("instance_selection", "multiedit")
```

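The examples below use the built-in `iris` dataset plus datasets from the `imbalance` and `mice` packages, which must be installed. Load them first: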

```{r data_load, results = "hide"}
data(iris0, package = "imbalance")
data(ecoli1, package = "imbalance")
data(nhanes, package = "mice")
```

#### Oversampling

```{r oversample, results = "hide", message = FALSE, warning = FALSE}
super_iris <- iris0 %>% oversample(method = "MWMOTE", ratio = 0.8, filtering = TRUE)
```
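
Measuring the class balance first helps decide whether oversampling is needed and what `ratio` to request. The following is a minimal base R sketch and not part of `smartdata`: the helper `imbalance_ratio` and the toy data frame are hypothetical, and it assumes `ratio` is read as minority size divided by majority size.

```r
# Hypothetical helper (not part of smartdata): minority-to-majority class ratio
imbalance_ratio <- function(data, class_attr = "Class") {
  counts <- table(data[[class_attr]])
  min(counts) / max(counts)
}

# Toy imbalanced dataset, made up for illustration:
# 90 negative and 10 positive instances
toy <- data.frame(
  x = seq_len(100),
  Class = factor(rep(c("negative", "positive"), times = c(90, 10)))
)

# Class counts, then the ratio between the smallest and largest class
table(toy$Class)
imbalance_ratio(toy)
```

Under that assumption, a dataset whose ratio is already close to the requested value would gain few or no synthetic instances.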

#### Instance selection

```{r instance_selection, results = "hide", message = FALSE, warning = FALSE}
super_iris <- iris %>%
  instance_selection("multiedit", k = 3, num_folds = 2,
                     null_passes = 10, class_attr = "Species")
```

#### Feature selection

```{r feature_selection, results = "hide", message = FALSE, warning = FALSE}
super_ecoli <- ecoli1 %>% feature_selection("Boruta", class_attr = "Class")
```

#### Normalization

```{r normalize, results = "hide", message = FALSE, warning = FALSE}
super_iris <- iris %>% normalize("min_max", exclude = c("Sepal.Length", "Species"))
```
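
For readers unfamiliar with min-max scaling, the idea can be sketched in base R. This is an illustration only, not `smartdata`'s implementation: the helper `min_max_scale` is hypothetical, and it assumes that excluded columns are passed through unchanged.

```r
# Hypothetical helper (not smartdata's implementation): rescale numeric
# columns to [0, 1], leaving excluded and non-numeric columns untouched
min_max_scale <- function(data, exclude = character(0)) {
  for (col in setdiff(names(data), exclude)) {
    values <- data[[col]]
    if (is.numeric(values)) {
      rng <- range(values, na.rm = TRUE)
      # Skip constant columns, which would otherwise divide by zero
      if (rng[2] > rng[1]) {
        data[[col]] <- (values - rng[1]) / (rng[2] - rng[1])
      }
    }
  }
  data
}

scaled <- min_max_scale(iris, exclude = c("Sepal.Length", "Species"))
range(scaled$Petal.Length)   # rescaled column
range(scaled$Sepal.Length)   # excluded column keeps its original values
```

Each rescaled column is mapped linearly so that its minimum becomes 0 and its maximum becomes 1.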

#### Discretization

```{r discretize, results = "hide", message = FALSE, warning = FALSE}
super_iris <- iris %>% discretize("ameva", class_attr = "Species")
```

#### Space transformation

```{r space_transformation, results = "hide", message = FALSE, warning = FALSE}
super_ecoli <- ecoli1 %>% space_transformation("lle_knn", k = 3, num_features = 2)
```

#### Outliers

```{r clean_outliers, results = "hide", message = FALSE, warning = FALSE}
super_iris <- iris %>% clean_outliers("multivariate", type = "adj")
```

#### Missing values

```{r impute_missing, results = "hide", message = FALSE, warning = FALSE}
super_nhanes <- nhanes %>% impute_missing("gibbs_sampling")
```
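
Counting missing entries shows how much an imputation method has to fill in. The sketch below uses only base R on a made-up data frame (`toy` is hypothetical). Assuming `impute_missing` returns a data frame, the same two calls could be applied to `nhanes` and `super_nhanes` to compare before and after.

```r
# Toy data frame with missing entries (made up for illustration)
toy <- data.frame(
  age = c(1, 2, NA, 3, 1),
  bmi = c(NA, 22.7, 26.1, NA, 20.4)
)

# Number of missing values in each column
colSums(is.na(toy))

# Number of rows with no missing value at all
sum(complete.cases(toy))
```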

#### Noise

```{r clean_noise, results = "hide", message = FALSE, warning = FALSE}
super_iris <- iris %>%
  clean_noise("hybrid", class_attr = "Species",
              consensus = FALSE, action = "repair")
```