https://github.com/knapply/data.table-vs-parquet

Host: GitHub
URL: https://github.com/knapply/data.table-vs-parquet
Owner: knapply
Created: 2019-10-08T23:33:33.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2019-10-09T01:39:03.000Z (over 6 years ago)
Last Synced: 2025-03-05T14:28:38.131Z (over 1 year ago)
Size: 4.88 KB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.Rmd

README

---
title: "Read"
author: "Brendan Knapp"
date: "10/8/2019"
output: github_document
editor_options: 
  chunk_output_type: console
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```

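```{r}
dl_path <- "datasets/yellow_tripdata_2010-01.csv"

# download.file() fails if the target directory is missing, so create it first.
dir.create(dirname(dl_path), showWarnings = FALSE, recursive = TRUE)

# Download the NYC yellow taxi trip data once and reuse the local copy.
if (!file.exists(dl_path)) {
  download.file(
    "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2010-01.csv",
    destfile = dl_path
  )
}
```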

```{r}
library(data.table)
library(arrow)
library(scales)
library(microbenchmark)
library(ggplot2)
```

```{r}
col_names <- c("vendor_id", "pickup_datetime", "dropoff_datetime", 
               "passenger_count", "trip_distance", "pickup_longitude", 
               "pickup_latitude", "rate_code", "store_and_fwd_flag", 
               "dropoff_longitude", "dropoff_latitude", "payment_type", 
               "fare_amount", "surcharge", "mta_tax", "tip_amount", 
               "tolls_amount", "total_amount")

init <- fread(dl_path)
# Rename by reference (the data.table idiom) instead of `colnames<-`.
setnames(init, col_names)

# Stack five copies of the month to get a larger test table.
big_df <- rbindlist(
  replicate(n = 5, init, simplify = FALSE)
)

setNames(comma(dim(big_df)), c("# rows", "# cols"))
```

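```{r}
csv_path <- "datasets/csvy-file.csv"
csvy_path <- "datasets/csvy-file.csvy"
parquet_path <- "datasets/parquet-file.parquet"

fwrite(big_df, file = csv_path)
# Note: fwrite() defaults to `yaml = FALSE`, so this call writes plain CSV;
# the .csvy file has the same contents as the .csv file.
fwrite(big_df, file = csvy_path)
write_parquet(big_df, sink = parquet_path)

# File sizes on disk, in the order csv, csvy, parquet.
number_bytes(
  file.size(c(csv_path, csvy_path, parquet_path))
)
```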
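
The byte counts above can be easier to compare as ratios. A minimal sketch, assuming a hypothetical helper `size_ratio()` that is not part of this repo: it reports each file's size as a multiple of the first path's size. The demo uses two throwaway temp files so it runs on its own.

```r
# Hypothetical helper (not in the original repo): express each file's size
# as a multiple of the first file's size.
size_ratio <- function(paths) {
  sizes <- file.size(paths)
  setNames(sizes / sizes[[1]], basename(paths))
}

# Demo on two throwaway files so the sketch is self-contained.
small <- tempfile(fileext = ".txt")
large <- tempfile(fileext = ".txt")
writeLines(strrep("a", 100), small)
writeLines(strrep("a", 200), large)

ratios <- size_ratio(c(small, large))
round(ratios, 2)
```

With the files written above, `size_ratio(c(csv_path, csvy_path, parquet_path))` would give each format's size relative to the CSV.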

```{r}
res <- microbenchmark::microbenchmark(
  DT_csv = fread(csv_path, showProgress = FALSE),
  DT_csvy = fread(csvy_path, showProgress = FALSE),
  arrow_parquet = read_parquet(parquet_path),
  times = 5
)
```

```{r}
res
ggplot2::autoplot(res)
```
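
Absolute timings depend on the machine, so a relative view may be more useful for comparing readers. A minimal sketch, assuming a toy benchmark (`toy`, with placeholder expressions) in place of `res`: `summary()` with `unit = "relative"` rescales the timings against the expression with the lowest median.

```r
library(microbenchmark)

# Toy benchmark standing in for `res`; the expressions are placeholders.
toy <- microbenchmark(
  small_sort = sort(runif(1e3)),
  large_sort = sort(runif(1e5)),
  times = 10
)

# unit = "relative" divides timings by those of the expression with the
# lowest median, so that expression gets a median of 1.
rel <- summary(toy, unit = "relative")
rel[, c("expr", "median")]
```

Applied to the benchmark above, `summary(res, unit = "relative")` would show how many times slower each reader is than the fastest one.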

```{r}
data.table::getDTthreads()
sessionInfo()
```
