Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/beanumber/etl
R package to facilitate ETL operations
https://github.com/beanumber/etl
Last synced: 25 days ago
JSON representation
R package to facilitate ETL operations
- Host: GitHub
- URL: https://github.com/beanumber/etl
- Owner: beanumber
- Created: 2015-08-14T19:08:24.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2023-10-12T18:46:12.000Z (about 1 year ago)
- Last Synced: 2024-11-10T15:12:01.487Z (about 1 month ago)
- Language: R
- Size: 313 KB
- Stars: 127
- Watchers: 10
- Forks: 21
- Open Issues: 18
-
Metadata Files:
- Readme: README.Rmd
- Changelog: NEWS.md
Awesome Lists containing this project
- jimsghstars - beanumber/etl - R package to facilitate ETL operations (R)
README
---
output: github_document
---# etl
[![R-CMD-check](https://github.com/beanumber/etl/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/beanumber/etl/actions/workflows/R-CMD-check.yaml)
[![CRAN status](https://www.r-pkg.org/badges/version/etl)](https://CRAN.R-project.org/package=etl)
[![CRAN RStudio mirror downloads](https://cranlogs.r-pkg.org/badges/etl)](https://www.r-pkg.org:443/pkg/etl)`etl` is an R package to facilitate [Extract - Transform - Load (ETL)](https://en.wikipedia.org/wiki/Extract,_transform,_load) operations for **medium data**. The end result is generally a populated SQL database, but the user interaction takes place solely within R.
`etl` is on CRAN, so you can install it in the usual way, then load it.
```{r, eval=FALSE}
install.packages("etl")
``````{r, message=FALSE}
library(etl)
```Instantiate an `etl` object using a string that determines the class of the resulting object, and the package that provides access to that data. The trivial `mtcars` database is built into `etl`.
```{r, warning=FALSE}
cars <- etl("mtcars")
class(cars)
```## Connect to a local or remote database
`etl` works with a local or remote database to store your data. Every `etl` object extends a `dplyr::src_dbi` object. If, as in the example above, you do not specify a SQL source, a local `RSQLite` database will be created for you. However, you can also specify any source that inherits from `dplyr::src_dbi`.
> Note: If you want to use a database other than a local RSQLite, you must create the `mtcars` database and have permission to write to it first!
```{r, eval=FALSE}
# For PostgreSQL
library(RPostgreSQL)
db <- src_postgres(dbname = "mtcars", user = "postgres", host = "localhost")# Alternatively, for MySQL
library(RMySQL)
db <- src_mysql(dbname = "mtcars", user = "r-user", password = "mypass", host = "localhost")
cars <- etl("mtcars", db)
```At the heart of `etl` are three functions: `etl_extract()`, `etl_transform()`, and `etl_load()`.
## Extract
The first step is to acquire data from an online source.
```{r}
cars %>%
etl_extract()
```This creates a local store of raw data.
## Transform
These data may need to be transformed from their raw form to files suitable for importing into SQL (usually CSVs).
```{r}
cars %>%
etl_transform()
```## Load
Populate the SQL database with the transformed data.
```{r}
cars %>%
etl_load()
```## Do it all at once
To populate the whole database from scratch, use `etl_create`.
```{r}
cars %>%
etl_create()
```You can also update an existing database without re-initializing, but watch out for primary key collisions.
```{r, eval=FALSE}
cars %>%
etl_update()
```## Do Your Analysis
Now that your database is populated, you can work with it as a `src` data table just like any other `dplyr` source.
```{r}
cars %>%
tbl("mtcars") %>%
group_by(cyl) %>%
summarise(N = n(), mean_mpg = mean(mpg))
```## Create your own ETL packages
Suppose you want to create your own ETL package called `pkgname`. All you have to do is write a package that requires `etl`, and then you have to write **two S3 methods**:
```{r, eval=FALSE}
etl_extract.etl_pkgname()
etl_load.etl_pkgname()
```Please see the "[Extending etl](https://github.com/beanumber/etl/blob/master/vignettes/extending_etl.Rmd)" vignette for more information.
## Use other ETL packages
- [macleish](https://github.com/beanumber/etl) [![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/macleish)](https://cran.r-project.org/package=macleish) :
Weather and spatial data from the MacLeish Field Station in Whately, MA.
- [airlines](https://github.com/beanumber/airlines): On-time flight arrival data from the Bureau of Transportation Statistics
- [citibike](https://github.com/beanumber/citibike): Municipal bike-sharing system in New York City
- [nyc311](https://github.com/beanumber/nyc311): Phone calls to New York City's feedback hotline
- [fec](https://github.com/beanumber/fec): Campaign contribution data from the Federal Election Commission
- [imdb](https://github.com/beanumber/imdb): Mirror of the Internet Movie Database## Cite
Please see [the full manuscript](https://doi.org/10.1080/10618600.2018.1512867) for additional details.
```{r}
citation("etl")
```