Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/fstpackage/fst

Lightning Fast Serialization of Data Frames for R
https://github.com/fstpackage/fst

compression data-frame data-storage r

Last synced: about 2 months ago
JSON representation

Lightning Fast Serialization of Data Frames for R

Awesome Lists containing this project

README

        

---
output:
github_document
---

```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-"
)
```

[![Build Status](https://github.com/fstpackage/fst/actions/workflows/R-CMD-check.yaml/badge.svg?branch=develop)](https://github.com/fstpackage/fst/actions/workflows/R-CMD-check.yaml)
[![License: AGPL v3](https://img.shields.io/badge/License-AGPL%20v3-blue.svg)](https://www.gnu.org/licenses/agpl-3.0)
[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/fst)](https://cran.r-project.org/package=fst)
[![fastverse status badge](https://fastverse.r-universe.dev/badges/fst)](https://fastverse.r-universe.dev/ui#package:fastverse)
[![codecov](https://codecov.io/gh/fstpackage/fst/branch/develop/graph/badge.svg)](https://app.codecov.io/gh/fstpackage/fst)
[![downloads](http://cranlogs.r-pkg.org/badges/fst)](http://cran.rstudio.com/web/packages/fst/index.html)
[![total_downloads](https://cranlogs.r-pkg.org/badges/grand-total/fst)](http://cran.rstudio.com/web/packages/fst/index.html)

## Overview

The [_fst_ package][fstRepo] for R provides a fast, easy and flexible way to serialize data frames. With access speeds of multiple GB/s, _fst_ is specifically designed to unlock the potential of high speed solid state disks that can be found in most modern computers. Data frames stored in the _fst_ format have full random access, both in column and rows.

The figure below compares the read and write performance of the _fst_ package to various alternatives.

```{r speedTable, results = 'asis', message = FALSE, echo = FALSE}
require(ggplot2)
require(data.table)
require(fst)
require(knitr)

speed_results <- read_fst("res_readme.fst", as.data.table = TRUE)

speeds <- speed_results[NrOfRows == 5e7, list(
`Time (ms)` = as.integer(median(Time)),
`Size (MB)` = format(median(FileSize), digits = 2),
`Speed (MB/s)` = as.integer(median(Speed)),
N = .N),
by = "Package,Mode"]

speeds[Package == "data.table" & Mode == "Write", Method := "fwrite"]
speeds[Package == "data.table" & Mode == "Read", Method := "fread"]
speeds[Package == "fst" & Mode == "Write", Method := "write_fst"]
speeds[Package == "fst" & Mode == "Read", Method := "read_fst"]
speeds[Package == "baseR" & Mode == "Write", Method := "saveRDS"]
speeds[Package == "baseR" & Mode == "Read", Method := "readRDS"]
speeds[Package == "feather" & Mode == "Write", Method := "write_feather"]
speeds[Package == "feather" & Mode == "Read", Method := "read_feather"]
speeds[, Format := "bin"]
speeds[Package == "data.table", Format := "csv"]
setkey(speeds, Package, Mode)

display_speeds <- copy(speeds)
for (col in colnames(display_speeds)[2:ncol(display_speeds)]) {
setnames(display_speeds, col, "CurCol")
display_speeds[, CurCol := as.character(CurCol)]
display_speeds[Package == "fst", CurCol := paste0("**", CurCol, "**")]
setnames(display_speeds, "CurCol", col)
}

kable(display_speeds[, c(7, 8, 3, 4, 5, 6)])
```

These benchmarks were performed on a laptop (i7 4710HQ @2.5 GHz) with a reasonably fast SSD (M.2 Samsung SM951) using the dataset defined below. Parameter *Speed* was calculated by dividing the in-memory size of the data frame by the measured time. These results are also visualized in the following graph:

```{r speed-bench, echo = FALSE, message = FALSE, results = 'hide', fig.width = 8.5, fig.height = 6}
ggplot(speed_results[NrOfRows == 5e7, .(Speed = median(Speed)), by = "Package,Mode"]) +
geom_bar(aes(Mode, Speed, fill = Mode), colour = "darkgrey", stat = "identity") +
facet_wrap(~ Package, 1) +
ylim(0, NA) +
ylab("Read or Write speed (MB/s)") +
xlab("") +
theme(legend.position="none")
```

As can be seen from the figure, the measured speeds for the _fst_ package are very high and even top the maximum drive speed of the SSD used. The package accomplishes this by an effective combination of multi-threading and compression. The on-disk file sizes of _fst_ files are also much smaller than that of the other formats tested. This is an added benefit of _fst_'s use of type-specific compressors on each stored column.

In addition to methods for data frame serialization, _fst_ also provides methods for multi-threaded in-memory compression with the popular LZ4 and ZSTD compressors and an extremely fast multi-threaded hasher.

## Multi-threading

The _fst_ package relies heavily on multi-threading to boost the read- and write speed of data frames. To maximize throughput, _fst_ compresses and decompresses data _in the background_ and tries to keep the disk busy writing and reading data at the same time.

## Installation

The easiest way to install the package is from CRAN:

```{r, eval = FALSE}
install.packages("fst")
```

You can also use the development version from GitHub:

```{r, eval = FALSE}
# install.packages("devtools")
devtools::install_github("fstpackage/fst", ref = "develop")
```

## Basic usage

```{r, results = 'hide', echo = FALSE, message = FALSE}
require(fst)
```

Using _fst_ is simple. Data can be stored and retrieved using methods _write\_fst_ and _read\_fst_:

```{r}
# Generate some random data frame with 10 million rows and various column types
nr_of_rows <- 1e7

df <- data.frame(
Logical = sample(c(TRUE, FALSE, NA), prob = c(0.85, 0.1, 0.05), nr_of_rows, replace = TRUE),
Integer = sample(1L:100L, nr_of_rows, replace = TRUE),
Real = sample(sample(1:10000, 20) / 100, nr_of_rows, replace = TRUE),
Factor = as.factor(sample(labels(UScitiesD), nr_of_rows, replace = TRUE))
)

# Store the data frame to disk
write_fst(df, "dataset.fst")

# Retrieve the data frame again
df <- read_fst("dataset.fst")
```

_Note: the dataset defined in this example code was also used to obtain the benchmark results shown in the introduction._

## Random access

The _fst_ file format provides full random access to stored datasets. You can retrieve a selection of columns and rows with:

```{r, results = 'hide'}
df_subset <- read_fst("dataset.fst", c("Logical", "Factor"), from = 2000, to = 5000)
```

This reads rows 2000 to 5000 from columns _Logical_ and _Factor_ without actually touching any other data in the stored file. That means that a subset can be read from file **without reading the complete file first**. This is different from, say, _readRDS_ or _read\_feather_ where you have to read the complete file or column before you can make a subset.

## Compression

For compression the excellent and speedy [LZ4][lz4Repo] and [ZSTD][zstdRepo] compression algorithms are used. These compressors (in combination with type-specific bit filters), enable _fst_ to achieve high compression speeds at reasonable compression factors. The compression factor can be tuned from 0 (minimum) to 100 (maximum):

```{r, results = 'hide'}
write_fst(df, "dataset.fst", 100) # use maximum compression
```

Compression reduces the size of the _fst_ file that holds your data. But because the (de-)compression is done _on background threads_, it can increase the total read- and write speed as well. The graph below shows how the use of multiple threads enhances the read and write speed of our sample dataset.

```{r multi-threading, fig.width = 10, fig.height = 8, echo = FALSE}
speed_results[Package %in% c("baseR", "feather"), Threads := 1]
speed_results[, Threads := as.factor(Threads)]
speed_results[, Speed := 0.001 * Speed]
speed_results[, NrOfRows := ifelse(NrOfRows == 1e7, "1 million rows", "10 million rows")]

fig_results <- speed_results[, .(Speed = median(Speed)), by = "Package,Mode,NrOfRows,Threads"]

ggplot(fig_results) +
geom_bar(aes(Package, Speed, fill = Threads),
stat = "identity", position = "dodge", width = 0.8) +
facet_grid(Mode ~ NrOfRows, scales = "free_y") +
scale_fill_manual(values = colorRampPalette(c("#FBE3A5", "#433B54"))(10)[10:3]) +
ylab("Read or write speed in GB/s") +
theme_minimal()
```

The _csv_ format used by the _fread_ and _fwrite_ methods of package _data.table_ is actually a human-readable text format and not a binary format. Normally, binary formats would be much faster than the _csv_ format, because _csv_ takes more space on disk, is row based, uncompressed and needs to be parsed into a computer-native format to have any meaning. So any serializer that's working on _csv_ has an enormous disadvantage as compared to binary formats. Yet, the results show that _data.table_ is on par with binary formats and when more threads are used, it can even be faster. Because of this impressive performance, it was included in the graph for comparison.

## Bindings in other languages

**Julia**: [**`FstFileFormat.jl`**][fstformatRepo] A naive Julia binding using RCall.jl

> **Note to users**: From CRAN release v0.8.0, the _fst_ format is stable and backwards compatible. That means that all _fst_ files generated with package v0.8.0 or later can be read by future versions of the package.

```{r, echo = FALSE, results = 'hide'}
# cleanup
file.remove("dataset.fst")
```

[fstRepo]: https://github.com/fstpackage/fst
[lz4Repo]: https://github.com/lz4/lz4
[zstdRepo]: https://github.com/facebook/zstd
[fstformatRepo]: https://github.com/xiaodaigh/FstFileFormat.jl