:package: A local content-based storage system for NEON data
https://github.com/cboettig/neonstore

---
output:
  github_document:
    df_print: tibble
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%",
  cache = FALSE
)
library(neonstore)
library(dplyr)
library(stringr)
Sys.setenv("NEONSTORE_HOME" = tempfile())
Sys.setenv("NEONSTORE_DB" = tempfile())

```

# neonstore

[![R build status](https://github.com/cboettig/neonstore/workflows/R-CMD-check/badge.svg)](https://github.com/cboettig/neonstore/actions)
[![Codecov test coverage](https://codecov.io/gh/cboettig/neonstore/branch/master/graph/badge.svg)](https://app.codecov.io/gh/cboettig/neonstore?branch=master)
[![CRAN status](https://www.r-pkg.org/badges/version/neonstore)](https://CRAN.R-project.org/package=neonstore)
[![R-CMD-check](https://github.com/cboettig/neonstore/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/cboettig/neonstore/actions/workflows/R-CMD-check.yaml)

`neonstore` provides quick access to, and persistent storage of, NEON data tables.
It emphasizes simplicity and a clean data provenance trail (see the
Provenance section below).

## Installation

Install the development version from [GitHub](https://github.com/) with:

``` r
# install.packages("devtools")
devtools::install_github("cboettig/neonstore")
```

## Quickstart

```{r, message=FALSE}
library(neonstore)
library(tidyverse)
```

Discover data products of interest:

```{r}
products <- neon_products()
products |>
  filter(str_detect(keywords, "bird")) |>
  select(productName, productCode)
```

You may also prefer to explore the [NEON Data Portal](https://data.neonscience.org/) website interactively.

## Download-based workflow

Once we have identified a data product code, we can download all associated data files, e.g. for the bird survey data. Optionally, we can restrict the download to a set of sites or a date range of interest (see the function documentation for details).

```{r}
neon_download("DP1.10003.001")
```
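
As a minimal sketch of such a restricted download, the call below limits the request to two sites and a two-year window. It assumes that `neon_download()` accepts `site`, `start_date`, and `end_date` arguments; the site codes and dates are illustrative, so check `?neon_download` for the exact argument names in your installed version.

``` r
library(neonstore)

# Illustrative values: two NEON site codes and a date window
sites <- c("HARV", "BART")
start <- "2018-01-01"
end   <- "2019-12-31"

# Assumes `site`, `start_date`, and `end_date` are accepted by neon_download();
# only files matching these sites and dates are requested.
neon_download(
  product    = "DP1.10003.001",
  site       = sites,
  start_date = start,
  end_date   = end
)
```

Restricting the request this way should also reduce the number of API calls, which matters given the rate limits described later in this README.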

View your store of NEON products:

```{r }
neon_index()
```

These files persist between sessions, so you only need to download them once,
or again to retrieve updates. `neon_index()` can take arguments to filter by product
or by a pattern (regular expression) in the table name, e.g. `neon_index(table = "brd")`.
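
For example, the index can be narrowed before inspecting it. This is a sketch that assumes `neon_index()` takes a `product` argument alongside the `table` argument mentioned above, and that the bird survey product has already been downloaded.

``` r
library(neonstore)

# All locally stored files for the bird survey product
# (the `product` argument is assumed from the description above)
bird_files <- neon_index(product = "DP1.10003.001")

# Only files whose table name matches the regular expression "brd"
brd_files <- neon_index(table = "brd")

# Number of files matched by each filter
nrow(bird_files)
nrow(brd_files)
```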

## Database backend

`neonstore` now also supports a relational database backend. Import data from
the raw downloaded files using `neon_store()`:

```{r}
neon_store(product = "DP1.10003.001")
```

Access an imported table using `neon_table()` instead of `neon_read()`:

```{r}
neon_table("brd_countdata")
```

Note that we need to include the product name in the table name when accessing the database, as table names alone may not be unique. RStudio users can also list and explore all tables interactively in the Connections pane using the function `neon_pane()`.

## Larger-than-RAM data

When working with data from many sites or years at once, the data can easily become too large to fit in R's working memory. This is especially true of sensor data. `neonstore` makes it easy to work with such data using `dplyr` operations: just include the option `lazy = TRUE`, and [most dplyr](https://dbplyr.tidyverse.org/reference/index.html) operations will execute quickly on disk instead (by leveraging the `dbplyr` backend and the `duckdb` database).

```{r}
brd <- neon_table("brd_countdata", lazy = TRUE)
# unique species per site?
brd |>
  distinct(siteID, scientificName) |>
  count(siteID, sort = TRUE) |>
  collect()
```

Use the function `collect()` at the end of a chain of dplyr functions to bring the resulting data into R.
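
A sketch of the same idea with an extra filtering step: the filter and count below are translated to SQL and run inside the database, so only the small summary is pulled into memory by `collect()`. The site code `HARV` is an illustrative choice, and `show_query()` (from `dplyr`) is used only to inspect the generated SQL.

``` r
library(neonstore)
library(dplyr)

brd <- neon_table("brd_countdata", lazy = TRUE)

# Build the query lazily; nothing is read into R yet
harv_species <- brd |>
  filter(siteID == "HARV") |>
  count(scientificName, sort = TRUE)

# Inspect the SQL that will be sent to the database
show_query(harv_species)

# Run the query and bring only the first ten rows of the result into R
harv_top <- harv_species |>
  head(10) |>
  collect()
```

Keeping `collect()` as the last step means every preceding operation runs on disk rather than in R's memory.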

## NEW: Cloud-based workflow

It is now possible to access data directly from NEON's cloud storage system without downloading. (Note: this must still query the NEON API to obtain the most recent list of files, and those requests are subject to rate limits.) Like the local database approach, this strategy works for larger-than-RAM data and can be substantially faster than downloading. However, if you work frequently with the same data products and have ample disk space, a one-time download will likely be faster in the long run.

```{r}
brd <- neon_cloud("brd_countdata", product = "DP1.10003.001")

brd |>
  distinct(siteID, scientificName) |>
  count(siteID, sort = TRUE) |>
  collect()
```

## Note on API limits

If `neon_download()` exceeds the API request limit (with or without the token),
`neonstore` will simply pause for the required amount of time to avoid
rate-limit-based errors.

[The NEON API now rate-limits requests](https://data.neonscience.org/data-api/rate-limiting/#api-tokens).
Using a personal token will increase the number of requests you can make before
encountering this delay. See the link for directions on registering for a token.
Then pass this token in the `.token` argument of `neon_download()`,
or for frequent use, add this token as an environment variable, `NEON_TOKEN`,
to the `.Renviron` file in your user's home directory.
`neon_download()` must first query the API for each NEON site that collects
that product, for each month the product is collected.

(It would be much more efficient on the NEON server if the API could take
queries of the form `/data//`, and pool the results, rather than
require each month of sampling separately!)
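
A minimal sketch of supplying a token explicitly. The environment variable name `NEON_TOKEN` used here is an assumption for illustration; because the value is passed through the `.token` argument, the call does not depend on which variable name the package reads by default.

``` r
library(neonstore)

# Read the token from an environment variable set in ~/.Renviron,
# e.g. a line of the form NEON_TOKEN=<your token> (variable name assumed)
token <- Sys.getenv("NEON_TOKEN")

if (!nzchar(token)) {
  message("No token found; requests will be made without a token.")
}

# Pass the token explicitly via the `.token` argument
neon_download("DP1.10003.001", .token = token)
```

Keeping the token in `.Renviron` rather than in a script avoids committing it to version control by accident.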

```{r include=FALSE}
unlink("my_neondata.zip")
unlink("index.csv")
Sys.unsetenv("NEONSTORE_HOME")
Sys.unsetenv("NEONSTORE_DB")
```

## Non-stacking files and low-level interface

At its core, `neonstore` is simply a mechanism to download files from the NEON API.
While the `.csv` files from the Observation Systems (OS, e.g. bird count surveys)
and Instrument Systems (e.g. aquatic sensors) are typically stacked into large
tables, other products, such as the `.laz` and `.tif` files produced by the
airborne observation platform (AOP) sensors such as LIDAR and cameras, still
require the user to work directly with the downloaded files returned
by `neon_index()`. Note that the local database can process Eddy Covariance
data (h5 files), but at present this does not work with `neon_cloud()`.
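
As a sketch of that low-level workflow, the index can be filtered down to the files of interest and their paths handed to whatever tool reads that format. This assumes that some non-tabular files have already been downloaded and that the data frame returned by `neon_index()` has a `path` column holding the local file location; the `.tif` pattern is just an example.

``` r
library(neonstore)

# Index of every file in the local store
idx <- neon_index()

# Keep only GeoTIFF files (assumes a `path` column with local file paths)
tif_files <- idx[grepl("\\.tif$", idx$path), ]

# These paths can be passed to any raster-reading package of your choice
tif_paths <- tif_files$path
head(tif_paths)
```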