https://github.com/tylerlittlefield/ucimlr
:bar_chart: UCI Machine Learning Repository in R
datasets r rstats uci-machine-learning
Last synced: about 1 year ago
- Host: GitHub
- URL: https://github.com/tylerlittlefield/ucimlr
- Owner: tylerlittlefield
- License: other
- Created: 2019-02-09T20:17:14.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2021-01-30T19:40:32.000Z (over 5 years ago)
- Last Synced: 2025-04-04T01:32:09.487Z (about 1 year ago)
- Topics: datasets, r, rstats, uci-machine-learning
- Language: R
- Homepage: https://ucimlr.netlify.com/
- Size: 1.8 MB
- Stars: 5
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.Rmd
- Contributing: .github/CONTRIBUTING.html
- License: LICENSE
README
---
output: github_document
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
# total ucimlr datasets
datasets <- data(package = "ucimlr")
total_datasets_names <- datasets$results[, "Item"]
dataset_names <- glue::glue("* `{total_datasets_names}` \n")
total_datasets <- length(datasets$results[, "Item"])
```
# ucimlr 
[Travis build status](https://travis-ci.org/tyluRp/ucimlr)
[AppVeyor build status](https://ci.appveyor.com/project/tyluRp/ucimlr)
[Codecov test coverage](https://codecov.io/gh/tyluRp/ucimlr?branch=master)
[Netlify deploy status](https://app.netlify.com/sites/ucimlr/deploys)
The goal of `ucimlr` is to give R users easy access to datasets found at the [**U**niversity of **C**alifornia, **I**rvine **M**achine **L**earning **R**epository](https://archive.ics.uci.edu/ml/index.php). The benefits of using this package are:
1. Ease of access
2. Clean data
Note that data in this repository dates back to 1987, so the format is not consistent across datasets. Inconsistencies include the column separator and the way NA values are encoded. Luckily, data in `ucimlr` follows a consistent structure that any R user can dive into. The structure is as follows:
1. All variations of NA (null, blank character, ?, etc) are coded as NA
2. All variable names are snake case
3. Everything is `stringsAsFactors = FALSE`
4. All datasets are presented as a [`tibble`](https://github.com/tidyverse/tibble)
Note on point 3: Factors aren't evil, but I'd rather the user decide when to code something as a factor.
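As a rough illustration of points 1 to 3, here is a minimal base R sketch of how a raw file with inconsistent NA markers and headers might be normalized. It is not how `ucimlr` builds its datasets; the sample text, the set of NA markers, and the `to_snake_case()` helper are all assumptions made up for this example.

```r
# Hypothetical raw data: mixed NA markers and headers that are not snake case
raw_text <- "Make,Body Style,Num.Doors\naudi,sedan,four\nbmw,?,two\n,wagon,null\n"

# Treat several NA spellings as NA and keep strings as character vectors
df <- read.csv(
  text = raw_text,
  na.strings = c("", "?", "null", "NA"),
  stringsAsFactors = FALSE,
  check.names = FALSE
)

# Hypothetical helper: lower-case the names and collapse any run of
# non-alphanumeric characters into a single underscore
to_snake_case <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9]+", "_", x)
  gsub("^_|_$", "", x)
}

names(df) <- to_snake_case(names(df))

# The user, not the package, decides when a column becomes a factor
df$body_style <- factor(df$body_style)
```

If the tibble package is installed, `tibble::as_tibble(df)` would cover point 4 as well.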
Currently, there are `r nrow(ucimlr::ucidata())` datasets available at the official repository and `r total_datasets` available in `ucimlr`. These numbers update every time the README.Rmd is reknit.
## Installation
Keep in mind that this is a data package. As of now the package is ~`r ucimlr:::pkg_size("ucimlr")` and it will continue to grow. You can install `ucimlr` from GitHub with [`devtools`](https://github.com/r-lib/devtools):
``` r
# install.packages("devtools")
devtools::install_github("tyluRp/ucimlr")
```
## Example
We can load data by name and we can scrape the current list of datasets using the `ucidata` function:
```{r example}
library(ucimlr)
# Print a bundled dataset by name
automobile
# Scrape the current list of datasets from the UCI repository
ucidata()
ucinews()
```
I'd suggest accessing data with R's `::` operator (for example, `ucimlr::automobile`) so that you can reach every exported dataset and function without attaching the package. This prevents namespace collisions and has the added benefit of autocompleting all the datasets and functions (assuming you're using RStudio). Alternatively, to see a list of all available datasets, run `data(package = "ucimlr")`.
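The `::` approach might look like the sketch below. The `requireNamespace()` guard and the variable names `uci_available`, `automobile`, and `dataset_names` are assumptions added so the snippet also runs on a machine where `ucimlr` is not installed.

```r
# TRUE only if ucimlr is installed; the package is not attached to the search path
uci_available <- requireNamespace("ucimlr", quietly = TRUE)

if (uci_available) {
  # Pull a single dataset straight from the package namespace
  automobile <- ucimlr::automobile

  # Names of every dataset bundled with the package
  dataset_names <- data(package = "ucimlr")$results[, "Item"]
  print(head(dataset_names))
}
```

Because nothing is attached, objects in your own workspace with the same names as package datasets are left untouched.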
## Contributing
There are a lot of datasets and I'm slowly adding as many as I can. If you'd like to add a dataset, fix something, suggest an improvement, etc., please file an issue or submit a pull request!