https://github.com/paleolimbot/narrow

An R interface to the 'Apache Arrow' C API

---
output: github_document
---

```{r, include = FALSE}
library(arrow)
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# narrow

[![Codecov test coverage](https://codecov.io/gh/paleolimbot/narrow/branch/master/graph/badge.svg)](https://codecov.io/gh/paleolimbot/narrow?branch=master)
[![R-CMD-check](https://github.com/paleolimbot/narrow/workflows/R-CMD-check/badge.svg)](https://github.com/paleolimbot/narrow/actions)
[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)

The goal of narrow is to wrap the [Arrow Data C API](https://arrow.apache.org/docs/format/CDataInterface.html) and [Arrow Stream C API](https://arrow.apache.org/docs/format/CStreamInterface.html) to provide lightweight Arrow support for R packages to consume and produce streams of data in Arrow format.

## Installation

You can install the development version from [GitHub](https://github.com/) with:

``` r
# install.packages("remotes")
remotes::install_github("paleolimbot/narrow")
```

## Creating arrays

You can create an Arrow array using `as_narrow_array()`. For many types (e.g., integers and doubles), this is done without any copying of memory: narrow just arranges the existing R vector memory and protects it for the lifetime of the underlying `struct ArrowArray`.

```{r example}
library(narrow)
(array <- as_narrow_array(1:5))
```
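To make the zero-copy idea concrete, here is a minimal C sketch of how a producer might describe an existing `int32_t` buffer as a `struct ArrowArray` without copying it. The struct layout follows the Arrow C data interface specification; `wrap_int32()` and `wrap_release()` are hypothetical names used only for illustration and are not narrow's actual implementation (narrow additionally protects the R vector for the lifetime of the array).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Layout as given in the Arrow C data interface specification. */
struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

/* Hypothetical release callback: frees only the small table of buffer
   pointers. The values are borrowed, so they are not freed here. */
static void wrap_release(struct ArrowArray* array) {
  free((void*) array->buffers);
  array->release = NULL; /* marks the struct as released */
}

/* Hypothetical helper: describe `values` as a non-nullable int32 array
   without copying. The caller must keep `values` alive until the array
   is released. Returns 0 on success, -1 if allocation fails. */
int wrap_int32(const int32_t* values, int64_t length, struct ArrowArray* out) {
  assert(length >= 0);
  const void** buffers = malloc(2 * sizeof(const void*));
  if (buffers == NULL) return -1;

  buffers[0] = NULL;   /* validity bitmap may be omitted when null_count is 0 */
  buffers[1] = values; /* data buffer: a borrowed pointer, not a copy */

  out->length = length;
  out->null_count = 0;
  out->offset = 0;
  out->n_buffers = 2;
  out->n_children = 0;
  out->buffers = buffers;
  out->children = NULL;
  out->dictionary = NULL;
  out->release = &wrap_release;
  out->private_data = NULL;
  return 0;
}
```

The only allocation is the two-element pointer table, so the cost of wrapping does not depend on the length of the vector.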

For `Array`s and `RecordBatch`es from the [arrow package](https://arrow.apache.org/docs/r/), this is almost always a zero-copy operation and is instantaneous even for very large Arrays.

```{r}
library(arrow)
(array2 <- as_narrow_array(Array$create(1:5)))
```

## Exporting arrays

To convert an array object to some other type, use `from_narrow_array()`:

```{r}
str(from_narrow_array(array))
```

The narrow package has built-in defaults for converting arrays to R objects; you can also specify your own using the `ptype` argument:

```{r}
str(from_narrow_array(array, ptype = double()))
from_narrow_array(array, ptype = arrow::Array)
```

## Streams

The Arrow C API also specifies an experimental stream interface. In addition to handling streams created elsewhere, you can create streams based on a `narrow_array()`:

```{r}
stream1 <- as_narrow_array_stream(as_narrow_array(1:3))
narrow_array_stream_get_next(stream1)
narrow_array_stream_get_next(stream1)
```
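At the C level, the second call above corresponds to the stream handing back an array that is already marked as released, which is how the Arrow C stream interface signals the end of a stream. The following is a rough sketch of a single-array stream and a consumer loop, assuming the struct layouts from the Arrow specification. All function names (`null_array_init()`, `one_shot_stream_init()`, `count_rows()`) are hypothetical and are not part of narrow; schema export is deliberately left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

struct ArrowSchema; /* opaque here; see the Arrow specification for its layout */

struct ArrowArray {
  int64_t length, null_count, offset, n_buffers, n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

struct ArrowArrayStream {
  int (*get_schema)(struct ArrowArrayStream*, struct ArrowSchema* out);
  int (*get_next)(struct ArrowArrayStream*, struct ArrowArray* out);
  const char* (*get_last_error)(struct ArrowArrayStream*);
  void (*release)(struct ArrowArrayStream*);
  void* private_data;
};

static void null_array_release(struct ArrowArray* array) { array->release = NULL; }

/* Hypothetical helper: an array of the Arrow null type, which has no buffers. */
void null_array_init(int64_t length, struct ArrowArray* out) {
  out->length = length;
  out->null_count = length;
  out->offset = 0;
  out->n_buffers = 0;
  out->n_children = 0;
  out->buffers = NULL;
  out->children = NULL;
  out->dictionary = NULL;
  out->release = &null_array_release;
  out->private_data = NULL;
}

/* Move: shallow copy, then mark the source released so one owner remains. */
static void array_move(struct ArrowArray* src, struct ArrowArray* dst) {
  *dst = *src;
  src->release = NULL;
}

static int one_shot_get_schema(struct ArrowArrayStream* stream, struct ArrowSchema* out) {
  (void) stream; (void) out;
  return ENOTSUP; /* schema export is omitted from this sketch */
}

static int one_shot_get_next(struct ArrowArrayStream* stream, struct ArrowArray* out) {
  struct ArrowArray* held = stream->private_data;
  if (held->release != NULL) {
    array_move(held, out); /* first call: hand over the array */
  } else {
    out->release = NULL;   /* later calls: a released array means end of stream */
  }
  return 0;
}

static const char* one_shot_get_last_error(struct ArrowArrayStream* stream) {
  (void) stream;
  return NULL;
}

static void one_shot_release(struct ArrowArrayStream* stream) {
  struct ArrowArray* held = stream->private_data;
  if (held->release != NULL) held->release(held);
  free(held);
  stream->release = NULL;
}

/* Hypothetical producer: takes ownership of `array` and yields it once. */
int one_shot_stream_init(struct ArrowArray* array, struct ArrowArrayStream* out) {
  struct ArrowArray* held = malloc(sizeof(struct ArrowArray));
  if (held == NULL) return ENOMEM;
  array_move(array, held);
  out->get_schema = &one_shot_get_schema;
  out->get_next = &one_shot_get_next;
  out->get_last_error = &one_shot_get_last_error;
  out->release = &one_shot_release;
  out->private_data = held;
  return 0;
}

/* Hypothetical consumer: pull arrays until the end-of-stream marker. */
int count_rows(struct ArrowArrayStream* stream, int64_t* n_rows) {
  assert(stream->release != NULL);
  *n_rows = 0;
  for (;;) {
    struct ArrowArray chunk;
    int status = stream->get_next(stream, &chunk);
    if (status != 0) return status;
    if (chunk.release == NULL) return 0; /* end of stream */
    *n_rows += chunk.length;
    chunk.release(&chunk); /* each array is released independently of the stream */
  }
}
```

Note that the consumer releases each array it receives, and the stream itself still has to be released separately once it is no longer needed.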

...or based on a function that returns one or more `narrow_array()`s:

```{r}
counter <- -1
rows_per_chunk <- 5
csv_file <- readr::readr_example("mtcars.csv")

# Read zero rows to obtain the schema without loading any data
schema <- as_narrow_array(
  readr::read_csv(
    csv_file,
    n_max = 0,
    col_types = readr::cols(.default = readr::col_double())
  )
)$schema

stream2 <- narrow_array_stream_function(schema, function() {
  # Each call reads the next chunk, skipping the header and earlier chunks
  counter <<- counter + 1L
  result <- readr::read_csv(
    csv_file,
    skip = 1 + (counter * rows_per_chunk),
    n_max = rows_per_chunk,
    col_names = c(
      "mpg", "cyl", "disp", "hp", "drat",
      "wt", "qsec", "vs", "am", "gear", "carb"
    ),
    col_types = readr::cols(.default = readr::col_double())
  )

  # Returning NULL signals the end of the stream
  if (nrow(result) > 0) result else NULL
})
```

You can pass these to Arrow as a `RecordBatchReader` using `narrow_array_stream_to_arrow()`:

```{r}
reader <- narrow_array_stream_to_arrow(stream2)
as.data.frame(reader$read_table())
```

Round-trip operations for `RecordBatch` also work:

```{r}
df <- readr::read_csv(csv_file, show_col_types = FALSE)
as.data.frame(from_narrow_array(as_narrow_array(df), arrow::RecordBatch))
```

## C data access

The C data interface is ABI stable (and a version of the stream interface will be ABI stable in the future) so you can access the underlying pointers in compiled code from any R package (or inline C or C++ code). A `narrow_schema()` is an external pointer to a `struct ArrowSchema`, a `narrow_array_data()` is an external pointer to a `struct ArrowArray`, and a `narrow_array()` is a `list()` of a `narrow_schema()` and a `narrow_array_data()`.

```{r, include=FALSE}
Sys.setenv(
PKG_CPPFLAGS = paste0("-I", system.file("include", package = "narrow"))
)
```

```{c, results='hide'}
#include <R.h>
#include <Rinternals.h>
#include "narrow.h"

SEXP extract_null_count(SEXP array_data_xptr) {
  struct ArrowArray* array_data = (struct ArrowArray*) R_ExternalPtrAddr(array_data_xptr);
  return Rf_ScalarInteger(array_data->null_count);
}
```

```{r}
.Call("extract_null_count", as_narrow_array(c(NA, NA, 1:5))$array_data)
```
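One caveat when reading `null_count` directly: the Arrow C data interface allows a producer to set it to `-1` when the count has not been computed. The sketch below shows one way a consumer might fall back to counting unset bits in the validity bitmap. It assumes an array type whose first buffer is a validity bitmap (true of primitive types such as int32, but not of the null or union types); `count_nulls()` is a hypothetical helper, not a narrow function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout as given in the Arrow C data interface specification. */
struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

/* Hypothetical helper: return the number of nulls, computing it from the
   validity bitmap when the producer reported -1 (not yet computed).
   Assumes buffers[0], if present, is a validity bitmap. */
int64_t count_nulls(const struct ArrowArray* array) {
  assert(array->length >= 0 && array->offset >= 0);
  if (array->null_count >= 0) return array->null_count;

  const uint8_t* validity = (array->n_buffers > 0) ? array->buffers[0] : NULL;
  if (validity == NULL) return 0; /* no bitmap: every element is valid */

  int64_t nulls = 0;
  for (int64_t i = 0; i < array->length; i++) {
    int64_t bit = array->offset + i;
    /* Arrow bitmaps are least-significant-bit first; a set bit means valid */
    int is_valid = (validity[bit / 8] >> (bit % 8)) & 1;
    nulls += !is_valid;
  }
  return nulls;
}
```

The loop honours `offset`, so the count is correct for sliced arrays whose bitmap does not start at bit zero.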

The lifecycles of objects pointed to by the external pointers are managed by R's garbage collector: any object that gets garbage collected has its `release()` callback called (if it isn't `NULL`) and the underlying memory for the `struct Arrow...` freed. You can call the `release()` callback yourself from compiled code but you probably don't want to unless you're explicitly limiting access to an object.
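As a rough illustration of that contract, the sketch below pairs a producer whose `release()` callback frees the memory it owns with a consumer-side cleanup that only calls `release()` when it is not `NULL`. The struct layout follows the Arrow C data interface specification; `owned_int32_init()` and `finalize_array()` are hypothetical names, and this is not how narrow's finalizers are actually written.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Layout as given in the Arrow C data interface specification. */
struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

/* Producer side: free everything the array owns, then mark it released. */
static void owned_release(struct ArrowArray* array) {
  free((void*) array->buffers[1]); /* the copied values */
  free((void*) array->buffers);    /* the table of buffer pointers */
  array->release = NULL;           /* required: signals that the struct is released */
}

/* Hypothetical producer: an int32 array that owns a heap copy of `values`.
   Returns 0 on success, -1 if allocation fails. */
int owned_int32_init(const int32_t* values, int64_t length, struct ArrowArray* out) {
  assert(length >= 0);
  size_t n_bytes = (size_t) length * sizeof(int32_t);
  const void** buffers = malloc(2 * sizeof(const void*));
  int32_t* copy = malloc(n_bytes > 0 ? n_bytes : 1);
  if (buffers == NULL || copy == NULL) {
    free(buffers);
    free(copy);
    return -1;
  }
  if (n_bytes > 0) memcpy(copy, values, n_bytes);

  buffers[0] = NULL; /* no validity bitmap: there are no nulls */
  buffers[1] = copy;
  out->length = length;
  out->null_count = 0;
  out->offset = 0;
  out->n_buffers = 2;
  out->n_children = 0;
  out->buffers = buffers;
  out->children = NULL;
  out->dictionary = NULL;
  out->release = &owned_release;
  out->private_data = NULL;
  return 0;
}

/* Consumer side: roughly what a finalizer does. Calling release() only
   when it is not NULL makes the cleanup safe to run more than once. */
void finalize_array(struct ArrowArray* array) {
  if (array->release != NULL) {
    array->release(array);
  }
}
```

Because `release()` sets itself to `NULL`, an array that was released early from compiled code is simply skipped when the cleanup runs again later.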