https://github.com/paleolimbot/narrow
An R interface to the 'Apache Arrow' C API
- Host: GitHub
- URL: https://github.com/paleolimbot/narrow
- Owner: paleolimbot
- License: other
- Archived: true
- Created: 2021-08-23T20:09:11.000Z (over 3 years ago)
- Default Branch: master
- Last Pushed: 2023-10-01T01:08:04.000Z (over 1 year ago)
- Last Synced: 2024-08-03T22:16:14.663Z (8 months ago)
- Language: C
- Homepage: https://paleolimbot.github.io/narrow/
- Size: 692 KB
- Stars: 30
- Watchers: 5
- Forks: 3
- Open Issues: 2
Metadata Files:
- Readme: README.Rmd
- License: LICENSE.md
Awesome Lists containing this project
- jimsghstars - paleolimbot/narrow - An R interface to the 'Apache Arrow' C API (C)
README
---
output: github_document
---

```{r, include = FALSE}
library(arrow)
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# narrow
[](https://codecov.io/gh/paleolimbot/narrow?branch=master)
[](https://github.com/paleolimbot/narrow/actions)
[](https://lifecycle.r-lib.org/articles/stages.html#experimental)

The goal of narrow is to wrap the [Arrow Data C API](https://arrow.apache.org/docs/format/CDataInterface.html) and [Arrow Stream C API](https://arrow.apache.org/docs/format/CStreamInterface.html) to provide lightweight Arrow support for R packages to consume and produce streams of data in Arrow format.
## Installation
You can install the development version from [GitHub](https://github.com/) with:
``` r
# install.packages("remotes")
remotes::install_github("paleolimbot/narrow")
```

## Creating arrays
You can create an Arrow array using `as_narrow_array()`. For many types (e.g., integers and doubles), this is done without any copying of memory: narrow just arranges the existing R vector memory and protects it for the lifetime of the underlying `struct ArrowArray`.
```{r example}
library(narrow)
(array <- as_narrow_array(1:5))
```

For `Array`s and `RecordBatch`es from the [arrow package](https://arrow.apache.org/docs/r/), this is almost always a zero-copy operation and is instantaneous even for very large Arrays.
```{r}
library(arrow)
(array2 <- as_narrow_array(Array$create(1:5)))
```

## Exporting arrays
To convert an array object to some other type, use `from_narrow_array()`:
```{r}
str(from_narrow_array(array))
```

The narrow package has built-in defaults for converting arrays to R objects; you can also specify your own using the `ptype` argument:
```{r}
str(from_narrow_array(array, ptype = double()))
from_narrow_array(array, ptype = arrow::Array)
```

## Streams
The Arrow C API also specifies an experimental stream interface. In addition to handling streams created elsewhere, you can create streams based on a `narrow_array()`:
```{r}
stream1 <- as_narrow_array_stream(as_narrow_array(1:3))
narrow_array_stream_get_next(stream1)
narrow_array_stream_get_next(stream1)
```

...or based on a function that returns one or more `narrow_array()`s:
```{r}
counter <- -1
rows_per_chunk <- 5
csv_file <- readr::readr_example("mtcars.csv")
schema <- as_narrow_array(
  readr::read_csv(
    csv_file,
    n_max = 0,
    col_types = readr::cols(.default = readr::col_double())
  )
)$schema

stream2 <- narrow_array_stream_function(schema, function() {
  counter <<- counter + 1L
  result <- readr::read_csv(
    csv_file,
    skip = 1 + (counter * rows_per_chunk),
    n_max = rows_per_chunk,
    col_names = c(
      "mpg", "cyl", "disp", "hp", "drat",
      "wt", "qsec", "vs", "am", "gear", "carb"
    ),
    col_types = readr::cols(.default = readr::col_double())
  )

  if (nrow(result) > 0) result else NULL
})
```

You can pass these to Arrow as a `RecordBatchReader` using `narrow_array_stream_to_arrow()`:
```{r}
reader <- narrow_array_stream_to_arrow(stream2)
as.data.frame(reader$read_table())
```

Round-trip operations for `RecordBatch` also work:
```{r}
df <- readr::read_csv(csv_file, show_col_types = FALSE)
as.data.frame(from_narrow_array(as_narrow_array(df), arrow::RecordBatch))
```

## C data access
The C data interface is ABI stable (and a version of the stream interface will be ABI stable in the future) so you can access the underlying pointers in compiled code from any R package (or inline C or C++ code). A `narrow_schema()` is an external pointer to a `struct ArrowSchema`, a `narrow_array_data()` is an external pointer to a `struct ArrowArray`, and a `narrow_array()` is a `list()` of a `narrow_schema()` and a `narrow_array_data()`.
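Because a `narrow_array()` is just a list of these two external pointers, you can pull either component out from R before handing it to compiled code. A minimal sketch, reusing `as_narrow_array()` from above:

``` r
array <- as_narrow_array(1:5)
array$schema      # external pointer to a struct ArrowSchema
array$array_data  # external pointer to a struct ArrowArray
```
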
```{r, include=FALSE}
Sys.setenv(
  PKG_CPPFLAGS = paste0("-I", system.file("include", package = "narrow"))
)
```

```{c, results='hide'}
#include <R.h>
#include <Rinternals.h>
#include "narrow.h"

SEXP extract_null_count(SEXP array_data_xptr) {
  struct ArrowArray* array_data = (struct ArrowArray*) R_ExternalPtrAddr(array_data_xptr);
  return Rf_ScalarInteger(array_data->null_count);
}
```

```{r}
.Call("extract_null_count", as_narrow_array(c(NA, NA, 1:5))$array_data)
```

The lifecycles of objects pointed to by the external pointers are managed by R's garbage collector: any object that gets garbage collected has its `release()` callback called (if it isn't `NULL`) and the underlying memory for the `struct Arrow...` freed. You can call the `release()` callback yourself from compiled code, but you probably don't want to unless you're explicitly limiting access to an object.
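If you do release an object early from compiled code, check the callback before invoking it. The sketch below is not part of narrow's API; it just illustrates the Arrow C data interface convention that a conforming `release()` callback marks the struct as released by setting the callback pointer back to `NULL`:

``` c
#include "narrow.h"

// Minimal sketch: release an ArrowArray's resources at most once.
// After release() runs, the producer sets array->release to NULL,
// so calling this helper again is a no-op.
static void maybe_release(struct ArrowArray* array) {
  if (array->release != NULL) {
    array->release(array);
  }
}
```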