https://github.com/ncordon/smartdata

R package for data preprocessing

Host: GitHub
URL: https://github.com/ncordon/smartdata
Owner: ncordon
License: gpl-2.0
Created: 2017-09-13T16:05:12.000Z (almost 8 years ago)
Default Branch: master
Last Pushed: 2019-12-18T02:37:24.000Z (over 5 years ago)
Last Synced: 2025-03-31T04:31:54.414Z (3 months ago)
Language: R
Homepage: https://ncordon.github.io/smartdata
Size: 220 KB
Stars: 13
Watchers: 5
Forks: 3
Open Issues: 0
Metadata Files:
- Readme: README.Rmd
- Changelog: NEWS.md
- License: LICENSE

README

---
output: github_document
---

```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)
```

[![Build Status](https://travis-ci.com/ncordon/smartdata.svg?branch=master)](https://travis-ci.com/ncordon/smartdata)

[![minimal R version](https://img.shields.io/badge/R%3E%3D-3.5.0-6666ff.svg)](https://cran.r-project.org/)

[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/smartdata)](https://cran.r-project.org/package=smartdata)

[![packageversion](https://img.shields.io/badge/Package%20version-1.0.2-orange.svg?style=flat-square)](https://github.com/ncordon/smartdata/commits/master)

# smartdata

`smartdata` is an R package that integrates preprocessing algorithms for oversampling, instance and feature selection, normalization, discretization, space transformation, and the cleaning of outliers, missing values, and noise.

## Installation

You can install the latest stable release of smartdata from CRAN with:

```{r gh-installation, eval = FALSE}
# This sets both CRAN and Bioconductor as repositories to resolve dependencies
setRepositories(ind = 1:2)
install.packages("smartdata")
```

and load it into an R session with:

```{r results='hide', message=FALSE, warning=FALSE}
library("smartdata")
```
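Before running the examples, it may help to confirm that the package is available in the current library. The following is a minimal sketch using base R's `requireNamespace`; the variable names `pkg` and `is_installed` are illustrative and not part of smartdata.

```r
# Name of the package to check (illustrative variable, not part of smartdata)
pkg <- "smartdata"

# requireNamespace() returns TRUE or FALSE without attaching the package
is_installed <- requireNamespace(pkg, quietly = TRUE)

if (is_installed) {
  message("'", pkg, "' is installed and can be attached with library().")
} else {
  message("'", pkg, "' is not installed; install it from CRAN as shown above.")
}
```

Because the check never throws an error, it can sit at the top of a script and decide whether to proceed or stop early.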

## Examples

`smartdata` provides the following wrappers: 

* `oversample`
* `instance_selection`
* `feature_selection`
* `normalize`
* `discretize`
* `space_transformation`
* `clean_outliers`
* `impute_missing`
* `clean_noise`

To list the methods available for a given wrapper, pass the wrapper name to `which_options`:

```{r options}
which_options("instance_selection")
```

To get information about the parameters a method accepts, pass the method name as the second argument:

```{r options_method}
which_options("instance_selection", "multiedit")
```
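To browse the options of several wrappers in one go, the single-wrapper call can be looped over wrapper names. This is only a sketch: the `wrappers` vector is introduced here for illustration, it assumes `which_options` accepts each of these eight names in the same way as `"instance_selection"`, and the loop is skipped when smartdata is not installed.

```r
# Wrapper names to query (illustrative vector, typed by hand)
wrappers <- c(
  "instance_selection", "feature_selection", "normalize", "discretize",
  "space_transformation", "clean_outliers", "impute_missing", "clean_noise"
)

# Only query the package if it is actually available
if (requireNamespace("smartdata", quietly = TRUE)) {
  for (w in wrappers) {
    cat("\n== ", w, " ==\n", sep = "")
    # Assumption: which_options() prints its listing itself;
    # wrap the call in print() if it returns a value instead
    smartdata::which_options(w)
  }
}
```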

First, load the datasets used in the examples below:

```{r data_load, results = "hide"}
data(iris0,  package = "imbalance")
data(ecoli1, package = "imbalance")
data(nhanes, package = "mice")
```

#### Oversampling

```{r oversample, results = "hide", message = FALSE, warning = FALSE}
super_iris <- iris0 %>% oversample(method = "MWMOTE", ratio = 0.8, filtering = TRUE)
```
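To judge whether oversampling is needed, and to check its effect afterwards, one can compare class counts. The sketch below uses base R only. `imbalance_ratio` is a hypothetical helper, the toy labels are made up, and reading `ratio` as minority size divided by majority size is an assumption about the wrapper rather than something this README states.

```r
# Hypothetical helper: size of the smallest class divided by the largest
imbalance_ratio <- function(labels) {
  counts <- table(labels)
  min(counts) / max(counts)
}

# Toy labels (made up): 20 majority and 5 minority examples
toy_labels <- factor(c(rep("negative", 20), rep("positive", 5)))

table(toy_labels)            # class counts
imbalance_ratio(toy_labels)  # 5 / 20 = 0.25
```

Applying the same helper to the class column before and after oversampling gives a quick view of how much the balance changed.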

#### Instance selection

```{r instance_selection, results = "hide", message = FALSE, warning = FALSE}
super_iris <- iris %>% instance_selection("multiedit", k = 3, num_folds = 2,
                                          null_passes = 10, class_attr = "Species")
```

#### Feature selection

```{r feature_selection, results = "hide", message = FALSE, warning = FALSE}
super_ecoli <- ecoli1 %>% feature_selection("Boruta", class_attr = "Class")
```

#### Normalization

```{r normalize, results = "hide", message = FALSE, warning = FALSE}
super_iris <- iris %>% normalize("min_max", exclude = c("Sepal.Length", "Species"))
```
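For readers unfamiliar with min-max scaling, the following base R sketch shows the conventional formula, `(x - min) / (max - min)`, applied to every numeric column except the excluded ones. `min_max_sketch` is a hypothetical function written for this illustration; it is not smartdata's implementation and may differ from it in edge cases.

```r
# Hypothetical helper: rescale numeric columns to [0, 1], skipping `exclude`
min_max_sketch <- function(df, exclude = character(0)) {
  numeric_cols <- names(df)[vapply(df, is.numeric, logical(1))]
  cols <- setdiff(numeric_cols, exclude)
  df[cols] <- lapply(df[cols], function(x) {
    rng <- range(x, na.rm = TRUE)
    # A constant column has zero range; map it to 0 instead of dividing by 0
    if (rng[1] == rng[2]) return(x - rng[1])
    (x - rng[1]) / (rng[2] - rng[1])
  })
  df
}

scaled <- min_max_sketch(iris, exclude = c("Sepal.Length", "Species"))
summary(scaled)
```

In this sketch the excluded columns pass through unchanged, which matches the intent of the `exclude` argument in the call above.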

#### Discretization

```{r discretize, results = "hide", message = FALSE, warning = FALSE}
super_iris <- iris %>% discretize("ameva", class_attr = "Species")
```

#### Space transformation

```{r space_transformation, results = "hide", message = FALSE, warning = FALSE}
super_ecoli <- ecoli1 %>% space_transformation("lle_knn", k = 3, num_features = 2)
```

#### Outliers

```{r clean_outliers, results = "hide", message = FALSE, warning = FALSE}
super_iris <- iris %>% clean_outliers("multivariate", type = "adj")
```
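To give a feel for what multivariate outlier detection involves, here is a base R sketch of the classical Mahalanobis distance rule with a chi-squared cutoff. This technique is swapped in purely for illustration; it is not claimed to be what `clean_outliers("multivariate", type = "adj")` does internally, and the 0.975 quantile is an arbitrary but common choice.

```r
# Numeric part of iris as a matrix
x <- as.matrix(iris[, 1:4])

# Squared Mahalanobis distance of each row from the column means
d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))

# Flag rows beyond the 97.5% chi-squared quantile (assumed threshold)
cutoff <- qchisq(0.975, df = ncol(x))
is_outlier <- d2 > cutoff

# Keep only the rows that were not flagged
iris_without_outliers <- iris[!is_outlier, ]
sum(is_outlier)  # number of flagged rows
```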

#### Missing values

```{r impute_missing, results = "hide", message = FALSE, warning = FALSE}
super_nhanes <- nhanes %>% impute_missing("gibbs_sampling")
```
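It is often useful to count missing values before and after imputation. The sketch below uses base R on a small made-up data frame; `count_missing`, the `toy` data, and its column names are hypothetical and only stand in for a real dataset.

```r
# Hypothetical helper: number of NA values per column
count_missing <- function(df) colSums(is.na(df))

# Toy data (made-up values) with some missing entries
toy <- data.frame(
  age = c(1, 2, 1, 3),
  bmi = c(NA, 22.7, NA, 27.2),
  chl = c(NA, 187, 187, NA)
)

count_missing(toy)                        # age 0, bmi 2, chl 2
complete_rows <- sum(complete.cases(toy)) # only the second row is complete
complete_rows
```

Running the same helper on an imputed data frame should report zero missing values in every column if the imputation covered all of them.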

#### Noise

```{r clean_noise, results = "hide", message = FALSE, warning = FALSE}
super_iris <- iris %>% clean_noise("hybrid", class_attr = "Species",
                                   consensus = FALSE, action = "repair")
```
