https://github.com/knapply/fastgz
Fast reading of .gz files to R character vectors.
- Host: GitHub
- URL: https://github.com/knapply/fastgz
- Owner: knapply
- License: gpl-3.0
- Created: 2019-12-08T20:24:23.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2019-12-09T01:31:20.000Z (over 6 years ago)
- Last Synced: 2025-03-05T14:28:40.071Z (over 1 year ago)
- Language: C++
- Homepage: https://knapply.github.io/fastgz/
- Size: 58.6 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.Rmd
- License: LICENSE
README
---
output:
  github_document:
    html_preview: true
  html_document:
    keep_md: yes
always_allow_html: yes
editor_options:
  chunk_output_type: console
---
```{r, echo=FALSE}
knitr::opts_chunk$set(
  # collapse = TRUE,
  fig.align = "center",
  comment = "#>",
  fig.path = "man/figures/",
  message = FALSE,
  warning = FALSE,
  out.width = "100%"
)
options(width = 120)
```
# `{fastgz}`
# Why?
Files of non-trivial size are typically gzip-compressed. `base::readLines()` is surprisingly quick at reading them, but we can go a tad faster. On the other hand, `readr::read_lines()` decompresses the file before reading it, which is... less than ideal.
`{fastgz}` contains two simple helpers:
1. `fastgz::read_gz_file()` reads entire file(s) into a single `character()`
2. `fastgz::read_gz_lines()` is the equivalent of `base::readLines()`/`readr::read_lines()`

Rather than relying on the `apply()`/`purrr::map()` families, you can pass multiple file paths to both.
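
To make the two helpers concrete, here is a minimal sketch of the base R patterns they are meant to replace. The sample files, the `write_gz()` helper, and the variable names are hypothetical, and the "one string per file" shape used for the whole-file case is an assumption about what `read_gz_file()` returns, not something confirmed here. The `{fastgz}` calls only run if the package is installed.

```{r}
# Hypothetical sample data: two small gzip files in a temporary directory.
tmp_dir <- tempfile("fastgz_demo_")
dir.create(tmp_dir)
paths <- file.path(tmp_dir, c("a.txt.gz", "b.txt.gz"))

# Hypothetical helper: write a character vector to a gzip-compressed file.
write_gz <- function(lines, path) {
  con <- gzfile(path, open = "wt")
  on.exit(close(con))
  writeLines(lines, con)
}
write_gz(c("alpha", "beta"), paths[[1]])
write_gz(c("gamma", "delta"), paths[[2]])

# Line-by-line, base R: one readLines() call per path, then flatten.
base_lines <- unlist(lapply(paths, readLines), use.names = FALSE)

# Whole-file, base R: collapse each file's lines into one string.
# (Assumption: this mirrors the shape of read_gz_file()'s result.)
base_whole <- vapply(
  paths,
  function(p) paste(readLines(p), collapse = "\n"),
  character(1),
  USE.NAMES = FALSE
)

# With {fastgz}, both helpers take the vector of paths directly.
if (requireNamespace("fastgz", quietly = TRUE)) {
  fastgz_lines <- fastgz::read_gz_lines(paths)
  fastgz_whole <- fastgz::read_gz_file(paths)
}
```

The base R versions need an explicit `lapply()`/`vapply()` over the paths; with `{fastgz}` that loop disappears because the helpers accept the vector of paths directly.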
# Benchmarks
```{r}
library(fastgz)
library(microbenchmark)
library(ggplot2)
file_dir <- readRDS("big_file_path")
files <- dir(file_dir, pattern = "\\.gz$", full.names = TRUE)[1:3]
scales::number_bytes(sum(file.size(files)))
```
```{r}
res <- microbenchmark(
  fastgz_single = fastgz_single <- read_gz_lines(files[[1]]),
  base_single = base_single <- readLines(files[[1]]),
  fastgz_multi = fastgz_multi <- read_gz_lines(files),
  base_multi = base_multi <- unlist(lapply(files, readLines),
                                    use.names = FALSE),
  times = 3
)
identical(fastgz_single, base_single) && identical(fastgz_multi, base_multi)
lapply(list(single = fastgz_single, multi = fastgz_multi), pryr::object_size)
res
autoplot(res)
```
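
The benchmark above reads from a local directory whose path is stored in `big_file_path`, so it cannot be rerun as-is elsewhere. As a rough sketch, synthetic files can stand in for that data. The directory, file names, and line contents below are made up for illustration, and the files are far smaller than anything worth timing seriously. Only base R is required, and the `{fastgz}` comparison is skipped if the package is not installed.

```{r}
# Hypothetical synthetic data: three gzip files of 10,000 lines each.
set.seed(1)
demo_dir <- tempfile("fastgz_bench_")
dir.create(demo_dir)
demo_files <- file.path(demo_dir, sprintf("part-%d.txt.gz", 1:3))

for (path in demo_files) {
  con <- gzfile(path, open = "wt")
  writeLines(
    sprintf("row %d: %s", 1:10000, sample(letters, 10000, replace = TRUE)),
    con
  )
  close(con)
}

# Same file-discovery step as the benchmark, pointed at the synthetic directory.
found <- dir(demo_dir, pattern = "\\.gz$", full.names = TRUE)

# Baseline used in the benchmark: readLines() per file, then flatten.
demo_lines <- unlist(lapply(found, readLines), use.names = FALSE)

# Optional check that {fastgz} returns the same lines for the same paths.
if (requireNamespace("fastgz", quietly = TRUE)) {
  same_as_base <- identical(fastgz::read_gz_lines(found), demo_lines)
}
```

Swapping `found` in for `files` in the `microbenchmark()` call should then reproduce the comparison on any machine, though timings on small synthetic files will not match those from real data.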
# Shout Outs
* [`{Rcpp}`](http://www.rcpp.org/)
* [`Gzstream`](https://www.cs.unc.edu/Research/compgeom/gzstream/)