Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/beanumber/etl

R package to facilitate ETL operations
https://github.com/beanumber/etl

Last synced: 25 days ago
JSON representation

R package to facilitate ETL operations

Awesome Lists containing this project

README

        

---
output: github_document
---

# etl

[![R-CMD-check](https://github.com/beanumber/etl/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/beanumber/etl/actions/workflows/R-CMD-check.yaml)
[![CRAN status](https://www.r-pkg.org/badges/version/etl)](https://CRAN.R-project.org/package=etl)
[![CRAN RStudio mirror downloads](https://cranlogs.r-pkg.org/badges/etl)](https://www.r-pkg.org:443/pkg/etl)

`etl` is an R package to facilitate [Extract - Transform - Load (ETL)](https://en.wikipedia.org/wiki/Extract,_transform,_load) operations for **medium data**. The end result is generally a populated SQL database, but the user interaction takes place solely within R.

`etl` is on CRAN, so you can install it in the usual way, then load it.

```{r, eval=FALSE}
install.packages("etl")
```

```{r, message=FALSE}
library(etl)
```

Instantiate an `etl` object using a string that determines the class of the resulting object, and the package that provides access to that data. The trivial `mtcars` database is built into `etl`.

```{r, warning=FALSE}
cars <- etl("mtcars")
class(cars)
```

## Connect to a local or remote database

`etl` works with a local or remote database to store your data. Every `etl` object extends a `dplyr::src_dbi` object. If, as in the example above, you do not specify a SQL source, a local `RSQLite` database will be created for you. However, you can also specify any source that inherits from `dplyr::src_dbi`.

> Note: If you want to use a database other than a local RSQLite, you must create the `mtcars` database and have permission to write to it first!

```{r, eval=FALSE}
# For PostgreSQL
library(RPostgreSQL)
db <- src_postgres(dbname = "mtcars", user = "postgres", host = "localhost")

# Alternatively, for MySQL
library(RMySQL)
db <- src_mysql(dbname = "mtcars", user = "r-user", password = "mypass", host = "localhost")
cars <- etl("mtcars", db)
```

At the heart of `etl` are three functions: `etl_extract()`, `etl_transform()`, and `etl_load()`.

## Extract

The first step is to acquire data from an online source.

```{r}
cars %>%
etl_extract()
```

This creates a local store of raw data.

## Transform

These data may need to be transformed from their raw form to files suitable for importing into SQL (usually CSVs).

```{r}
cars %>%
etl_transform()
```

## Load

Populate the SQL database with the transformed data.

```{r}
cars %>%
etl_load()
```

## Do it all at once

To populate the whole database from scratch, use `etl_create`.

```{r}
cars %>%
etl_create()
```

You can also update an existing database without re-initializing, but watch out for primary key collisions.

```{r, eval=FALSE}
cars %>%
etl_update()
```

## Do Your Analysis

Now that your database is populated, you can work with it as a `src` data table just like any other `dplyr` source.
```{r}
cars %>%
tbl("mtcars") %>%
group_by(cyl) %>%
summarise(N = n(), mean_mpg = mean(mpg))
```

## Create your own ETL packages

Suppose you want to create your own ETL package called `pkgname`. All you have to do is write a package that requires `etl`, and then you have to write **two S3 methods**:

```{r, eval=FALSE}
etl_extract.etl_pkgname()
etl_load.etl_pkgname()
```

Please see the "[Extending etl](https://github.com/beanumber/etl/blob/master/vignettes/extending_etl.Rmd)" vignette for more information.

## Use other ETL packages

- [macleish](https://github.com/beanumber/etl) [![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/macleish)](https://cran.r-project.org/package=macleish) :
Weather and spatial data from the MacLeish Field Station in Whately, MA.
- [airlines](https://github.com/beanumber/airlines): On-time flight arrival data from the Bureau of Transportation Statistics
- [citibike](https://github.com/beanumber/citibike): Municipal bike-sharing system in New York City
- [nyc311](https://github.com/beanumber/nyc311): Phone calls to New York City's feedback hotline
- [fec](https://github.com/beanumber/fec): Campaign contribution data from the Federal Election Commission
- [imdb](https://github.com/beanumber/imdb): Mirror of the Internet Movie Database

## Cite

Please see [the full manuscript](https://doi.org/10.1080/10618600.2018.1512867) for additional details.

```{r}
citation("etl")
```