https://github.com/knapply/data.table-vs-parquet
- Host: GitHub
- URL: https://github.com/knapply/data.table-vs-parquet
- Owner: knapply
- Created: 2019-10-08T23:33:33.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2019-10-09T01:39:03.000Z (over 6 years ago)
- Last Synced: 2025-03-05T14:28:38.131Z (over 1 year ago)
- Size: 4.88 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.Rmd
README
---
title: "Read"
author: "Brendan Knapp"
date: "10/8/2019"
output: github_document
editor_options:
  chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```
```{r}
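dl_path <- "datasets/yellow_tripdata_2010-01.csv"
# Create the target directory first; download.file() fails if it is missing.
dir.create(dirname(dl_path), showWarnings = FALSE, recursive = TRUE)
if (!file.exists(dl_path)) {
  download.file(
    "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2010-01.csv",
    destfile = dl_path
  )
}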
```
```{r}
library(data.table)
library(arrow)
library(scales)
library(microbenchmark)
library(ggplot2)
```
```{r}
col_names <- c("vendor_id", "pickup_datetime", "dropoff_datetime",
               "passenger_count", "trip_distance", "pickup_longitude",
               "pickup_latitude", "rate_code", "store_and_fwd_flag",
               "dropoff_longitude", "dropoff_latitude", "payment_type",
               "fare_amount", "surcharge", "mta_tax", "tip_amount",
               "tolls_amount", "total_amount")
init <- fread(dl_path)
# Rename columns by reference (no copy of the table).
setnames(init, col_names)
# Stack five copies of the month to get a larger benchmark table.
big_df <- rbindlist(
  replicate(n = 5, init, simplify = FALSE)
)
setNames(comma(dim(big_df)), c("# rows", "# cols"))
```
```{r}
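csv_path <- "datasets/csvy-file.csv"
csvy_path <- "datasets/csvy-file.csvy"
parquet_path <- "datasets/parquet-file.parquet"
fwrite(big_df, file = csv_path)
# fwrite() only emits a CSVY YAML header when yaml = TRUE; as written,
# this file has the same contents as csv_path.
fwrite(big_df, file = csvy_path)
write_parquet(big_df, sink = parquet_path)
# On-disk size of each file, in the order csv, csvy, parquet.
number_bytes(
  file.size(c(csv_path, csvy_path, parquet_path))
)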
```
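Both `fwrite()` calls in the chunk above use the defaults, so the `.csvy` file carries no YAML header. A minimal sketch of a true CSVY round trip follows, assuming data.table 1.12.4 or later (where `fwrite()` and `fread()` gained a `yaml` argument) and an installed `yaml` package. The table `toy` and the temporary path are hypothetical stand-ins, not part of the benchmark.

```r
library(data.table)

# Hypothetical toy table standing in for big_df.
toy <- data.table(
  id   = 1:3,
  fare = c(5.5, 12.0, 7.25),
  flag = c("Y", "N", "N")
)

tmp_csvy <- tempfile(fileext = ".csvy")

# yaml = TRUE prepends a YAML header describing the column schema.
fwrite(toy, file = tmp_csvy, yaml = TRUE)

# The header is delimited by '---' lines, so the file starts with one.
first_line <- readLines(tmp_csvy, n = 1)

# yaml = TRUE makes fread() parse the header instead of treating it as data.
back <- fread(tmp_csvy, yaml = TRUE)
```

Whether the header speeds up reading a file of this size is exactly what the benchmark would have to show; this sketch only illustrates the file format.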
```{r}
res <- microbenchmark::microbenchmark(
  DT_csv = fread(csv_path, showProgress = FALSE),
  DT_csvy = fread(csvy_path, showProgress = FALSE),
  arrow_parquet = read_parquet(parquet_path),
  times = 5
)
```
```{r}
res
ggplot2::autoplot(res)
```
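The printed table and plot report absolute timings. One possible way to compare the readers directly is to express each median as a multiple of the fastest one. The sketch below uses two hypothetical cheap expressions (`fast`, `slow`) in place of the three readers so that it runs without the datasets; it relies only on a `microbenchmark` result being a data frame with `expr` and `time` columns.

```r
library(microbenchmark)

# Hypothetical cheap expressions standing in for the three readers.
toy_res <- microbenchmark(
  fast = sum(1:10),
  slow = sum(sqrt(1:1e5)),
  times = 5
)

# `time` is recorded in nanoseconds, one row per evaluation.
med <- aggregate(time ~ expr, data = as.data.frame(toy_res), FUN = median)

# Express each median as a multiple of the fastest median.
med$relative <- med$time / min(med$time)
med
```

Applied to `res`, the same two lines would give one row per reader, with the fastest reader at a relative value of 1.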
```{r}
data.table::getDTthreads()
sessionInfo()
```