https://github.com/ncordon/smartdata
R package for data preprocessing
- Host: GitHub
- URL: https://github.com/ncordon/smartdata
- Owner: ncordon
- License: gpl-2.0
- Created: 2017-09-13T16:05:12.000Z (almost 8 years ago)
- Default Branch: master
- Last Pushed: 2019-12-18T02:37:24.000Z (over 5 years ago)
- Last Synced: 2025-03-31T04:31:54.414Z (3 months ago)
- Language: R
- Homepage: https://ncordon.github.io/smartdata
- Size: 220 KB
- Stars: 13
- Watchers: 5
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.Rmd
- Changelog: NEWS.md
- License: LICENSE
README
---
output: github_document
---

```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)
```

# smartdata
`smartdata` is an R package that integrates preprocessing algorithms for oversampling, instance and feature selection, normalization, discretization, space transformation, and cleaning of outliers, missing values, and noise.
## Installation
You can install the latest stable release of `smartdata` from CRAN with:
```{r gh-installation, eval = FALSE}
# This sets both CRAN and Bioconductor as repositories to resolve dependencies
setRepositories(ind = 1:2)
install.packages("smartdata")
```

and load it into an R session with:
```{r results='hide', message=FALSE, warning=FALSE}
library("smartdata")
```

## Examples
`smartdata` provides the following wrappers:
* `instance_selection`
* `feature_selection`
* `normalize`
* `discretize`
* `space_transformation`
* `clean_outliers`
* `impute_missing`
* `clean_noise`

To list the methods available for a given wrapper, call `which_options` with the wrapper's name:
```{r options}
which_options("instance_selection")
```

To get information about the parameters available for a method, pass the method name as the second argument:
```{r options_method}
which_options("instance_selection", "multiedit")
```

First, let's load a few datasets:
```{r data_load, results = "hide"}
data(iris0, package = "imbalance")
data(ecoli1, package = "imbalance")
data(nhanes, package = "mice")
```
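
Before applying any wrapper, it can help to check class balance and missing values, since these decide whether oversampling or imputation is relevant. The following is a minimal base-R sketch, not part of `smartdata`; it uses a small made-up data frame (`toy`, a hypothetical name) instead of the datasets above so that it runs without extra packages:

```r
# Hypothetical toy data: an imbalanced two-class problem with some missing values
toy <- data.frame(
  x = c(1.2, 3.4, NA, 2.2, 5.1, 4.8, NA, 0.7),
  y = c(10, 12, 9, NA, 14, 11, 13, 8),
  Class = factor(c("negative", "negative", "negative", "negative",
                   "negative", "negative", "positive", "positive"))
)

# Number of instances per class and the minority/majority ratio
class_counts <- table(toy$Class)
imbalance_ratio <- min(class_counts) / max(class_counts)

# Number of missing values in each column
missing_per_column <- colSums(is.na(toy))

print(class_counts)
print(imbalance_ratio)
print(missing_per_column)
```

The same calls can be run on the loaded datasets, for example `table(ecoli1$Class)` (the class column name `Class` is the one passed as `class_attr` in the feature selection example) and `colSums(is.na(nhanes))`.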
#### Oversampling

```{r oversample, results = "hide", message = FALSE, warning = FALSE}
super_iris <- iris0 %>% oversample(method = "MWMOTE", ratio = 0.8, filtering = TRUE)
```

#### Instance selection
```{r instance_selection, results = "hide", message = FALSE, warning = FALSE}
super_iris <- iris %>% instance_selection("multiedit", k = 3, num_folds = 2,
                                          null_passes = 10, class_attr = "Species")
```

#### Feature selection
```{r feature_selection, results = "hide", message = FALSE, warning = FALSE}
super_ecoli <- ecoli1 %>% feature_selection("Boruta", class_attr = "Class")
```

#### Normalization
```{r normalize, results = "hide", message = FALSE, warning = FALSE}
super_iris <- iris %>% normalize("min_max", exclude = c("Sepal.Length", "Species"))
```

#### Discretization
```{r discretize, results = "hide", message = FALSE, warning = FALSE}
super_iris <- iris %>% discretize("ameva", class_attr = "Species")
```

#### Space transformation
```{r space_transformation, results = "hide", message = FALSE, warning = FALSE}
super_ecoli <- ecoli1 %>% space_transformation("lle_knn", k = 3, num_features = 2)
```

#### Outliers
```{r clean_outliers, results = "hide", message = FALSE, warning = FALSE}
super_iris <- iris %>% clean_outliers("multivariate", type = "adj")
```

#### Missing values
```{r impute_missing, results = "hide", message = FALSE, warning = FALSE}
super_nhanes <- nhanes %>% impute_missing("gibbs_sampling")
```

#### Noise
```{r clean_noise, results = "hide", message = FALSE, warning = FALSE}
super_iris <- iris %>% clean_noise("hybrid", class_attr = "Species",
                                   consensus = FALSE, action = "repair")
```