https://github.com/paleolimbot/narrow

An R interface to the 'Apache Arrow' C API
https://github.com/paleolimbot/narrow

Last synced: 5 months ago
JSON representation

An R interface to the 'Apache Arrow' C API

Host: GitHub
URL: https://github.com/paleolimbot/narrow
Owner: paleolimbot
License: other
Archived: true
Created: 2021-08-23T20:09:11.000Z (about 4 years ago)
Default Branch: master
Last Pushed: 2023-10-01T01:08:04.000Z (about 2 years ago)
Last Synced: 2024-11-13T13:39:00.706Z (11 months ago)
Language: C
Homepage: https://paleolimbot.github.io/narrow/
Size: 692 KB
Stars: 30
Watchers: 5
Forks: 3
Open Issues: 2
Metadata Files:
- Readme: README.Rmd
- License: LICENSE.md

Awesome Lists containing this project

jimsghstars - paleolimbot/narrow - An R interface to the 'Apache Arrow' C API (C)

README

          ---

output: github_document

---

```{r, include = FALSE}

library(arrow)

knitr::opts_chunk$set(

  collapse = TRUE,

  comment = "#>",

  fig.path = "man/figures/README-",

  out.width = "100%"

)

```

# narrow

[![Codecov test coverage](https://codecov.io/gh/paleolimbot/narrow/branch/master/graph/badge.svg)](https://codecov.io/gh/paleolimbot/narrow?branch=master)

[![R-CMD-check](https://github.com/paleolimbot/narrow/workflows/R-CMD-check/badge.svg)](https://github.com/paleolimbot/narrow/actions)

[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)

The goal of narrow is to wrap the [Arrow Data C API](https://arrow.apache.org/docs/format/CDataInterface.html) and [Arrow Stream C API](https://arrow.apache.org/docs/format/CStreamInterface.html) to provide lightweight Arrow support for R packages to consume and produce streams of data in Arrow format.

## Installation

You can install the development version from [GitHub](https://github.com/) with:

``` r

# install.packages("remotes")

remotes::install_github("paleolimbot/narrow")

```

## Creating arrays

You can create an Arrow array using `as_narrow_array()`. For many types (e.g., integers and doubles), this is done without any copying of memory: narrow just arranges the existing R vector memory and protects it for the lifetime of the underlying `struct ArrowArray`.

```{r example}

library(narrow)

(array <- as_narrow_array(1:5))

```

For `Array`s and `RecordBatch`es from the [arrow package](https://arrow.apache.org/docs/r/), this is almost always a zero-copy operation and is instantaneous even for very large Arrays.

```{r}

library(arrow)

(array2 <- as_narrow_array(Array$create(1:5)))

```

## Exporting arrays

To convert an array object to some other type, use `from_narrow_array()`:

```{r}

str(from_narrow_array(array))

```

The narrow package has built-in defaults for converting arrays to R objects; you can also specify your own using the `ptype` argument:

```{r}

str(from_narrow_array(array, ptype = double()))

from_narrow_array(array, ptype = arrow::Array)

```

## Streams

The Arrow C API also specifies an experimental stream interface. In addition to handling streams created elsewhere, you can create streams based on a `narrow_array()`:

```{r}

stream1 <- as_narrow_array_stream(as_narrow_array(1:3))

narrow_array_stream_get_next(stream1)

narrow_array_stream_get_next(stream1)

```

...or based on a function that returns one or more `narrow_array()`s:

```{r}

counter <- -1

rows_per_chunk <- 5

csv_file <- readr::readr_example("mtcars.csv")

schema <- as_narrow_array(

  readr::read_csv(

    csv_file,

    n_max = 0,

    col_types = readr::cols(.default = readr::col_double())

  )

)$schema

stream2 <- narrow_array_stream_function(schema, function() {

  counter <<- counter + 1L

  result <- readr::read_csv(

    csv_file,

    skip = 1 + (counter * rows_per_chunk),

    n_max = rows_per_chunk,

    col_names = c(

      "mpg", "cyl", "disp", "hp", "drat",

      "wt", "qsec", "vs", "am", "gear", "carb"

    ),

    col_types = readr::cols(.default = readr::col_double())

  )

  if (nrow(result) > 0) result else NULL

})

```

You can pass these to Arrow as a `RecordBatchReader` using `narrow_array_stream_to_arrow()`:

```{r}

reader <- narrow_array_stream_to_arrow(stream2)

as.data.frame(reader$read_table())

```

Round-turn operations for `RecordBatch` also work:

```{r}

df <- readr::read_csv(csv_file, show_col_types=FALSE)

as.data.frame(from_narrow_array(as_narrow_array(df), arrow::RecordBatch)) 

```

## C data access

The C data interface is ABI stable (and a version of the stream interface will be ABI stable in the future) so you can access the underlying pointers in compiled code from any R package (or inline C or C++ code). A `narrow_schema()` is an external pointer to a `struct ArrowSchema`, a `narrow_array_data()` is an external pointer to a `struct ArrowArray`, and a `narrow_array()` is a `list()` of a `narrow_schema()` and a `narrow_array_data()`.

```{r, include=FALSE}

Sys.setenv(

  PKG_CPPFLAGS = paste0("-I", system.file("include", package = "narrow"))

)

```

```{c, results='hide'}

#include 

#include 

#include "narrow.h"

SEXP extract_null_count(SEXP array_data_xptr) {

  struct ArrowArray* array_data = (struct ArrowArray*) R_ExternalPtrAddr(array_data_xptr);

  return Rf_ScalarInteger(array_data->null_count);

}

```

```{r}

.Call("extract_null_count", as_narrow_array(c(NA, NA, 1:5))$array_data)

```

The lifecycles of objects pointed to by the external pointers are managed by R's garbage collector: any object that gets garbage collected has its `release()` callback called (if it isn't `NULL`) and the underlying memory for the `struct Arrow...` freed. You can call the `release()` callback yourself from compiled code but you probably don't want to unless you're explicitly limiting access to an object.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/paleolimbot/narrow

Awesome Lists containing this project

README