https://github.com/knapply/fastgz

Fast reading of .gz files to R character vectors.

---
output:
  github_document:
    html_preview: true
  html_document:
    keep_md: yes
always_allow_html: yes
editor_options:
  chunk_output_type: console
---

```{r, echo=FALSE}
knitr::opts_chunk$set(
  # collapse = TRUE,
  fig.align = "center",
  comment = "#>",
  fig.path = "man/figures/",
  message = FALSE,
  warning = FALSE,
  out.width = "100%"
)
options(width = 120)
```

# `{fastgz}`

[![CRAN status](https://www.r-pkg.org/badges/version/fastgz)](https://cran.r-project.org/package=fastgz)
[![Lifecycle](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://www.tidyverse.org/lifecycle/#experimental)
[![GitHub last commit](https://img.shields.io/github/last-commit/knapply/fastgz.svg)](https://github.com/knapply/fastgz/commits/master)
[![Codecov test coverage](https://codecov.io/gh/knapply/fastgz/branch/master/graph/badge.svg)](https://codecov.io/gh/knapply/fastgz?branch=master)
[![AppVeyor build status](https://ci.appveyor.com/api/projects/status/github/knapply/fastgz?branch=master&svg=true)](https://ci.appveyor.com/project/knapply/fastgz)
[![Travis-CI Build Status](https://travis-ci.org/knapply/fastgz.svg?branch=master)](https://travis-ci.org/knapply/fastgz)
[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)
[![GitHub code size in bytes](https://img.shields.io/github/languages/code-size/knapply/fastgz.svg)](https://github.com/knapply/fastgz)
[![HitCount](http://hits.dwyl.io/knapply/fastgz.svg)](http://hits.dwyl.io/knapply/fastgz)

# Why?

Large text files are often stored gzip-compressed. `base::readLines()` is surprisingly quick at reading them, but we can go a tad faster. `readr::read_lines()`, on the other hand, decompresses the file before reading it, which is... less than ideal.

`{fastgz}` contains two simple helpers:

1. `fastgz::read_gz_file()` reads one or more entire files into a single `character()`
2. `fastgz::read_gz_lines()` is the equivalent of `base::readLines()`/`readr::read_lines()`

Both accept multiple file paths, so there is no need to loop with the `apply()`/`purrr::map()` families.
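As a rough sketch of the difference in calling style, the snippet below writes two tiny gzip files and reads them both ways. The file names and contents are made up for illustration, and the `{fastgz}` call is guarded so the base R part still runs if the package is not installed. Only the `read_gz_lines(files)` form used in the benchmark below is assumed here.

```r
# Write two small gzip-compressed text files to a temporary directory.
# (The file names and contents are hypothetical, purely for illustration.)
tmp_dir <- tempfile("fastgz-demo-")
dir.create(tmp_dir)
files <- file.path(tmp_dir, c("a.txt.gz", "b.txt.gz"))
contents <- list(c("alpha", "beta"), c("gamma", "delta", "epsilon"))

for (i in seq_along(files)) {
  con <- gzfile(files[[i]], open = "wt")
  writeLines(contents[[i]], con)
  close(con)
}

# Base R: readLines() takes one path at a time, so several files need a loop.
base_lines <- unlist(lapply(files, readLines), use.names = FALSE)

# {fastgz}: the same vector of paths goes straight into a single call.
if (requireNamespace("fastgz", quietly = TRUE)) {
  fastgz_lines <- fastgz::read_gz_lines(files)
  print(identical(fastgz_lines, base_lines))
}
```

If `{fastgz}` is installed, the final comparison should mirror the `identical()` check in the benchmark section.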

# Benchmarks

```{r}
library(fastgz)
library(microbenchmark)
library(ggplot2)

file_dir <- readRDS("big_file_path")

files <- dir(file_dir, pattern = "\\.gz$", full.names = TRUE)[1:3]

scales::number_bytes(sum(file.size(files)))
```

```{r}
res <- microbenchmark(
  fastgz_single = fastgz_single <- read_gz_lines(files[[1]]),
  base_single = base_single <- readLines(files[[1]]),
  fastgz_multi = fastgz_multi <- read_gz_lines(files),
  base_multi = base_multi <- unlist(lapply(files, readLines), use.names = FALSE),
  times = 3
)

identical(fastgz_single, base_single) && identical(fastgz_multi, base_multi)
lapply(list(single = fastgz_single, multi = fastgz_multi), pryr::object_size)
res
autoplot(res)
```

# Shout Outs

* [`{Rcpp}`](http://www.rcpp.org/)
* [`Gzstream`](https://www.cs.unc.edu/Research/compgeom/gzstream/)