https://github.com/wahani/dat

Tools for Data Manipulation in R
https://github.com/wahani/dat
Last synced: 6 months ago
JSON representation
Tools for Data Manipulation in R
Host: GitHub
URL: https://github.com/wahani/dat
Owner: wahani
License: other
Created: 2015-11-24T10:47:35.000Z (over 9 years ago)
Default Branch: master
Last Pushed: 2020-12-04T09:35:13.000Z (over 4 years ago)
Last Synced: 2024-10-13T19:06:30.436Z (8 months ago)
Language: R
Homepage:
Size: 190 KB
Stars: 15
Watchers: 6
Forks: 4
Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project

jimsghstars - wahani/dat - Tools for Data Manipulation in R (R)
README

        [![Travis-CI Build Status](https://travis-ci.org/wahani/dat.svg?branch=master)](https://travis-ci.org/wahani/dat)

[![codecov.io](https://codecov.io/github/wahani/dat/coverage.svg?branch=master)](https://codecov.io/github/wahani/dat?branch=master)

[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/dat)](http://cran.r-project.org/package=dat)

[![Downloads](http://cranlogs.r-pkg.org/badges/dat?color=brightgreen)](http://www.r-pkg.org/pkg/dat)

An implementation of common higher order functions with syntactic

sugar for anonymous function. Provides also a link to 'dplyr' and

'data.table' for common transformations on data frames to work around non

standard evaluation by default.

## Installation

### From GitHub

```r

remotes::install_github("wahani/dat")

```

### From CRAN

```r

install.packages("dat")

```

## Why should you care?

- You probably have to rewrite all your dplyr / data.table code once you put it 

inside a package. I.e. working around non standard evaluation or find another

way to satisfy `R CMD check`. And you don't like that.

- `dplyr` is not respecting the class of the object it operates on; the class

attribute changes on-the-fly.

- Neither `dplyr` nor `data.table` are playing nice with S4, but you really,

really want a S4 *data.table* or *tbl_df*.

- You like currying as in `rlist` and `purrr`.

## Tools for data manipulation

The examples are from the introductory vignette of `dplyr`. You still work with

data frames: so you can simply mix in dplyr features whenever you need them.

```r

library("nycflights13")

library("dat")

```

### Select rows

We can use `mutar` to select rows. When you

reference a variable in the data frame, you can indicate this by using a one

sided formula.

```r

mutar(flights, ~ month == 1 & day == 1)

mutar(flights, ~ 1:10)

```

And for sorting:

```r

mutar(flights, ~ order(year, month, day))

```

```

## # A tibble: 336,776 x 19

##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time

##                                       

##  1  2013     1     1      517            515         2      830            819

##  2  2013     1     1      533            529         4      850            830

##  3  2013     1     1      542            540         2      923            850

##  4  2013     1     1      544            545        -1     1004           1022

##  5  2013     1     1      554            600        -6      812            837

##  6  2013     1     1      554            558        -4      740            728

##  7  2013     1     1      555            600        -5      913            854

##  8  2013     1     1      557            600        -3      709            723

##  9  2013     1     1      557            600        -3      838            846

## 10  2013     1     1      558            600        -2      753            745

## # … with 336,766 more rows, and 11 more variables: arr_delay ,

## #   carrier , flight , tailnum , origin , dest ,

## #   air_time , distance , hour , minute , time_hour 

```

### Select cols

You can use characters, logicals, regular expressions and functions to select

columns. Regular expressions are indicated by a leading "^".

```r

flights %>%

  extract(c("year", "month", "day")) %>%

  extract("^day$") %>%

  extract(is.numeric)

```

### Operations on columns

The main difference between `dplyr::mutate` and `mutar` is that you use a `~`

instead of `=`.

    

```r

mutar(

  flights,

  gain ~ arr_delay - dep_delay,

  speed ~ distance / air_time * 60

)

```

Grouping data is handled within `mutar`:

```r

mutar(flights, n ~ .N, by = "month")

```

```r

mutar(flights, delay ~ mean(dep_delay, na.rm = TRUE), by = "month")

```

You can also provide additional arguments to a formula. This is especially

helpful when you want to pass arguments from a function to such expressions. The

additional augmentation can be anything which you can use to select columns

(character, regular expression, function) or a named list where each element is

a character.

    

```r

mutar(

  flights,

  .n ~ mean(.n, na.rm = TRUE) | "^.*delay$",

  .x ~ mean(.x, na.rm = TRUE) | list(.x = "arr_time"),

  by = "month"

)

```

```

## # A tibble: 336,776 x 19

##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time

##                                       

##  1  2013     1     1      517            515      10.0    1523.            819

##  2  2013     1     1      533            529      10.0    1523.            830

##  3  2013     1     1      542            540      10.0    1523.            850

##  4  2013     1     1      544            545      10.0    1523.           1022

##  5  2013     1     1      554            600      10.0    1523.            837

##  6  2013     1     1      554            558      10.0    1523.            728

##  7  2013     1     1      555            600      10.0    1523.            854

##  8  2013     1     1      557            600      10.0    1523.            723

##  9  2013     1     1      557            600      10.0    1523.            846

## 10  2013     1     1      558            600      10.0    1523.            745

## # … with 336,766 more rows, and 11 more variables: arr_delay ,

## #   carrier , flight , tailnum , origin , dest ,

## #   air_time , distance , hour , minute , time_hour 

```

## A link to S4

Using this package you can create S4 classes to contain a data frame (or a

data.table) and use the interface to `dplyr`. Both `dplyr` and `data.table` do

not support integration with S4. The main function here is `mutar` which is

generic enough to link to subsetting of rows and cols as well as mutate and

summarise. In the background `dplyr`s ability to work on a `data.table` is being

used.

```r

library("data.table")

setClass("DataTable", "data.table")

DataTable <- function(...) {

  new("DataTable", data.table::data.table(...))

}

setMethod("[", "DataTable", mutar)

dtflights <- do.call(DataTable, nycflights13::flights)

dtflights[1:10, c("year", "month", "day")]

dtflights[n ~ .N, by = "month"]

dtflights[n ~ .N, sby = "month"]

dtflights %>%

  filtar(~month > 6) %>%

  mutar(n ~ .N, by = "month") %>%

  sumar(n ~ data.table::first(n), by = "month")

```

## Working with vectors

Inspired by `rlist` and `purrr` some low level operations on vectors are

supported. The aim here is to integrate syntactic sugar for anonymous functions.

Furthermore the functions should support the use of pipes.

- `map` and `flatmap` as replacements for the apply functions

- `extract` for subsetting

- `replace` for replacing elements in a vector

What we can do with map:

```r

map(1:3, ~ .^2)

flatmap(1:3, ~ .^2)

map(1:3 ~ 11:13, c) # zip

dat <- data.frame(x = 1, y = "")

map(dat, x ~ x + 1, is.numeric)

```

What we can do with extract:

```r

extract(1:10, ~ . %% 2 == 0) %>% sum

extract(1:15, ~ 15 %% . == 0)

l <- list(aList = list(x = 1), aAtomic = "hi")

extract(l, "^aL")

extract(l, is.atomic)

```

What we can do with replace:

```r

replace(c(1, 2, NA), is.na, 0)

replace(c(1, 2, NA), rep(TRUE, 3), 0)

replace(c(1, 2, NA), 3, 0)

replace(list(x = 1, y = 2), "x", 0)

replace(list(x = 1, y = 2), "^x$", 0)

replace(list(x = 1, y = "a"), is.character, NULL)

```
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/wahani/dat

Awesome Lists containing this project

README