https://github.com/mayer79/data_preparation_r

base R vs. tidyverse vs. data.table vs. sqldf

Host: GitHub
URL: https://github.com/mayer79/data_preparation_r
Owner: mayer79
Created: 2019-01-07T19:04:22.000Z (almost 6 years ago)
Default Branch: master
Last Pushed: 2019-01-11T09:10:18.000Z (almost 6 years ago)
Last Synced: 2024-10-04T12:57:06.011Z (3 months ago)
Size: 10.7 KB
Stars: 7
Watchers: 4
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# base R vs. dplyr vs. data.table vs. sqldf

This is a short translation table between four common ways to run basic data preparation queries in R.

- **base R**: a very good starting point; easy to program with but not always easy to read. A must for R enthusiasts.

- **dplyr/tidyverse**: very easy to write and read since each function handles exactly one task; 100% compatible with the chaining approach of `magrittr`. The internals are difficult to understand, though.

- **data.table**: extremely fast and memory efficient, so *the* approach for large data; longer queries are not always easy to read.

- **sqldf**: very easy to read, even for people who have never used R before; great for learning SQL on the fly; quite slow compared to the other approaches, so suboptimal for large data.

The examples use the data set `iris`, which has 150 observations of four numeric columns and one factor, `Species`.
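The dplyr and data.table entries in the table rely on `iris` being a tibble or a `data.table` rather than a plain `data.frame`: only then does typing `iris` print just a few rows, and only then does `iris[cond]` select rows. The following is a minimal setup sketch under that assumption; the object names `iris_tbl` and `iris_dt` are hypothetical and do not come from the original README.

```r
# Setup sketch: load the packages and create class-specific copies of iris.
library(dplyr)       # re-exports as_tibble()
library(data.table)

iris_tbl <- as_tibble(iris)       # tibble: printing shows only the first rows
iris_dt  <- as.data.table(iris)   # data.table: enables iris_dt[cond] and `:=`

dim(iris)                         # number of rows and columns
sapply(iris, class)               # column classes: four numeric, one factor
```

With these copies, `iris_dt` can stand in for `iris` in the data.table column and `iris_tbl` in the dplyr column.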

| Task | base R | dplyr/tidyverse | data.table | sqldf |
|-|-|-|-|-|
|**library**||`dplyr`, `tidyverse`|`data.table`|`sqldf`|
|**view some rows**|`head(iris)`|`iris`|`iris`|`sqldf("select * from iris limit 6")`|
|**select rows**|`iris[cond, ]` or `subset(iris, cond)`|`filter(iris, cond)`|`iris[cond]`|`sqldf("select * from iris where cond")`|
|**sort rows**|`iris[order(cols), ]`|`arrange(iris, cols)`|`iris[order(cols)]` or in-place `setorder(iris, cols)`|`sqldf("select * from iris order by cols")`|
|**select columns**|`iris[, cols]` or `subset(iris, select = cols)`|`select(iris, cols)`|`iris[, cols]`|`sqldf("select cols from iris")`|
|**remove column**|`iris$Species <- NULL`|`mutate(iris, Species = NULL)`|`iris[, Species := NULL]`|`sqldf("select other cols from iris")`|
|**add column**|`iris$x <- iris$Sepal.Length^2` or `transform(iris, x = Sepal.Length^2)`|`mutate(iris, x = Sepal.Length^2)`|`iris[, x := Sepal.Length^2]`|`sqldf("select *, power([Sepal.Length], 2) as x from iris")`|
|**grouped stats**|`aggregate(Sepal.Width ~ Species, data = iris, FUN = median)`|`iris %>% group_by(Species) %>% summarize(med = median(Sepal.Width))`|`iris[, .(med = median(Sepal.Width)), by = Species]`|`sqldf("select Species, median([Sepal.Width]) as med from iris group by Species")`|
|**left join**|`merge(iris, grouped_stats, by = "Species", all.x = TRUE)`|`left_join(iris, grouped_stats, by = "Species")`|`grouped_stats[iris, on = "Species"]` or like `merge`|`sqldf("select a.*, b.med from iris a left join grouped_stats b on a.Species = b.Species")`|
|**inner join**|`merge(iris, grouped_stats, by = "Species")`|`inner_join(iris, grouped_stats, by = "Species")`|`grouped_stats[iris, on = "Species", nomatch = 0]` or like `merge`|`sqldf("select a.*, b.med from iris a inner join grouped_stats b on a.Species = b.Species")`|
|**add grouped stats**|`transform(iris, med = ave(Sepal.Width, Species, FUN = median))`|`iris %>% group_by(Species) %>% mutate(med = median(Sepal.Width))`|`iris[, med := median(Sepal.Width), by = Species]`|group by and left join|
|**transpose to long**|`reshape(???, direction = "long")`|`gather`|`melt`|through "union all"|
|**transpose to wide**|`reshape(???, direction = "wide")`|`spread`|`dcast`|through "left joins"|
|**row bind**|`rbind(data1, data2)`|`bind_rows(data1, data2)`|`rbind(data1, data2)` or `rbindlist(list(data1, data2))`|`sqldf("select * from data1 union all select * from data2")`|
|**column bind**|`cbind(data1, data2)`|`bind_cols(data1, data2)`|`cbind(data1, data2)`|add row numbers, then join|
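
The two transpose rows leave the base R `reshape()` arguments open. As one possible way to fill them in for `iris`, the sketch below stacks the four numeric columns into long format and then spreads them back to wide. The helper objects `iris_id`, `measure_cols`, `iris_long` and `iris_wide`, as well as the column names `id`, `variable` and `value`, are assumptions made for illustration and are not part of the original README.

```r
# Add a row id so each observation can be identified after reshaping.
iris_id <- cbind(id = seq_len(nrow(iris)), iris)
measure_cols <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# Wide -> long: stack the four numeric columns into a single "value" column.
iris_long <- reshape(
  iris_id,
  direction = "long",
  varying   = measure_cols,   # columns to stack
  v.names   = "value",        # name of the stacked column
  timevar   = "variable",     # column that records the source column
  times     = measure_cols,   # labels stored in "variable"
  idvar     = "id"
)

# Long -> wide: spread "value" back into one column per level of "variable".
iris_wide <- reshape(
  iris_long,
  direction = "wide",
  idvar     = c("id", "Species"),
  timevar   = "variable",
  v.names   = "value"
)

dim(iris_long)   # one row per observation and measurement
dim(iris_wide)   # one row per observation again
```

The wide result names its columns by pasting `v.names` and the `variable` label together (for example `value.Sepal.Length`), so a rename step may be needed to recover the original column names.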