https://github.com/knapply/fastgz

Fast reading of .gz files to R character vectors.

Host: GitHub
URL: https://github.com/knapply/fastgz
Owner: knapply
License: gpl-3.0
Created: 2019-12-08T20:24:23.000Z
Default Branch: master
Last Pushed: 2019-12-09T01:31:20.000Z
Last Synced: 2025-03-05T14:28:40.071Z
Language: C++
Homepage: https://knapply.github.io/fastgz/
Size: 58.6 KB
Stars: 1
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.Rmd
- License: LICENSE

README

---
output:
  github_document:
    html_preview: true
  html_document:
    keep_md: yes
always_allow_html: yes
editor_options: 
  chunk_output_type: console
---

```{r, echo=FALSE}
knitr::opts_chunk$set(
  # collapse = TRUE,
  fig.align = "center",
  comment = "#>",
  fig.path = "man/figures/",
  message = FALSE,
  warning = FALSE,
  out.width = "100%"
)

options(width = 120)
```

# `{fastgz}`

[![CRAN status](https://www.r-pkg.org/badges/version/fastgz)](https://cran.r-project.org/package=fastgz)

[![Lifecycle](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://www.tidyverse.org/lifecycle/#experimental)

[![GitHub last commit](https://img.shields.io/github/last-commit/knapply/fastgz.svg)](https://github.com/knapply/fastgz/commits/master)

[![Codecov test coverage](https://codecov.io/gh/knapply/fastgz/branch/master/graph/badge.svg)](https://codecov.io/gh/knapply/fastgz?branch=master)

[![AppVeyor build status](https://ci.appveyor.com/api/projects/status/github/knapply/fastgz?branch=master&svg=true)](https://ci.appveyor.com/project/knapply/fastgz)

[![Travis-CI Build Status](https://travis-ci.org/knapply/fastgz.svg?branch=master)](https://travis-ci.org/knapply/fastgz)

[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)

[![GitHub code size in bytes](https://img.shields.io/github/languages/code-size/knapply/fastgz.svg)](https://github.com/knapply/fastgz)

[![HitCount](http://hits.dwyl.io/knapply/fastgz.svg)](http://hits.dwyl.io/knapply/fastgz)

# Why?

Files of non-trivial size are typically gzip-compressed. `base::readLines()` is surprisingly quick at reading them, but we can go a tad faster. `readr::read_lines()`, on the other hand, decompresses the file before reading it, which is... less than ideal.

`{fastgz}` contains two simple helpers:

1. `fastgz::read_gz_file()` reads one or more entire files into a single `character()` vector

2. `fastgz::read_gz_lines()` is the equivalent of `base::readLines()`/`readr::read_lines()`

Rather than relying on the `apply()`/`purrr::map()` families, you can pass multiple file paths to both.
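
The multi-file calling pattern can be sketched as follows. This is a minimal illustration, not output from the package: the temporary files and their contents are made up, and the `read_gz_file()` call assumes it accepts the same character vector of paths that `read_gz_lines()` takes. The `{fastgz}` calls are guarded so the sketch still runs where the package is not installed.

```r
# Write two small gzip-compressed text files to a temporary directory
# (illustrative data only).
tmp_dir <- tempfile("fastgz-demo-")
dir.create(tmp_dir)
paths <- file.path(tmp_dir, c("a.txt.gz", "b.txt.gz"))
contents <- list(c("alpha", "beta"), c("gamma", "delta", "epsilon"))

for (i in seq_along(paths)) {
  con <- gzfile(paths[[i]], open = "wt")
  writeLines(contents[[i]], con)
  close(con)
}

# Base R baseline: readLines() reads .gz files transparently, but several
# files need lapply() + unlist().
base_lines <- unlist(lapply(paths, readLines), use.names = FALSE)

# {fastgz}: one call with a vector of paths.
if (requireNamespace("fastgz", quietly = TRUE)) {
  fastgz_lines <- fastgz::read_gz_lines(paths)
  stopifnot(identical(fastgz_lines, base_lines))

  # Assumed call shape: the whole-file reader also takes a vector of paths.
  whole_files <- fastgz::read_gz_file(paths)
}
```

The base R branch doubles as a correctness check: whatever `read_gz_lines()` returns should be identical to the `lapply()`/`unlist()` result.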

# Benchmarks

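```{r}
library(fastgz)
library(microbenchmark)
library(ggplot2)

# "big_file_path" is a local RDS file holding the path to a directory of .gz files
file_dir <- readRDS("big_file_path")
# benchmark against the first three .gz files in that directory
files <- dir(file_dir, pattern = "\\.gz$", full.names = TRUE)[1:3]

# total compressed size on disk
scales::number_bytes(sum(file.size(files)))
```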

```{r}
res <- microbenchmark(
  # a single file
  fastgz_single = fastgz_single <- read_gz_lines(files[[1]]),
  base_single   = base_single   <- readLines(files[[1]]),
  # several files: one call vs. lapply() + unlist()
  fastgz_multi  = fastgz_multi  <- read_gz_lines(files),
  base_multi    = base_multi    <- unlist(lapply(files, readLines),
                                          use.names = FALSE),
  times = 3
)

# both readers should return identical character vectors
identical(fastgz_single, base_single) && identical(fastgz_multi, base_multi)

# in-memory size of the results
lapply(list(single = fastgz_single, multi = fastgz_multi), pryr::object_size)

res

autoplot(res)
```
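
The benchmark above reads from a local directory that is not part of the repository. A rough, reproducible variant can be sketched with synthetic data and base R timing only. Everything here is an assumption made for illustration: the generated lines, the 50,000-line size, and the hypothetical `median_elapsed()` helper. It is not a substitute for the `microbenchmark()` run above.

```r
# Build a synthetic gzip file so the comparison does not depend on private data.
set.seed(1)
n_lines <- 50000L
synthetic <- sprintf(
  "row-%d,%s",
  seq_len(n_lines),
  sample(c("red", "green", "blue"), n_lines, replace = TRUE)
)
gz_path <- tempfile(fileext = ".txt.gz")
con <- gzfile(gz_path, open = "wt")
writeLines(synthetic, con)
close(con)

# Hypothetical helper: run a reader several times and keep the median
# elapsed wall-clock time in seconds.
median_elapsed <- function(reader, path, times = 3L) {
  elapsed <- vapply(
    seq_len(times),
    function(i) system.time(reader(path))[["elapsed"]],
    numeric(1)
  )
  stats::median(elapsed)
}

timings <- c(base = median_elapsed(readLines, gz_path))

# Only time {fastgz} where it is installed.
if (requireNamespace("fastgz", quietly = TRUE)) {
  timings["fastgz"] <- median_elapsed(fastgz::read_gz_lines, gz_path)
}

timings
```

`system.time()` is much coarser than `microbenchmark()`, so small differences on a file this size should not be over-interpreted.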

# Shout Outs

* [`{Rcpp}`](http://www.rcpp.org/)

* [`Gzstream`](https://www.cs.unc.edu/Research/compgeom/gzstream/)
