https://github.com/cboettig/taxalight
A lightning-fast taxonomic database store backed by LMDB
- Host: GitHub
- URL: https://github.com/cboettig/taxalight
- Owner: cboettig
- License: other
- Created: 2020-09-16T03:04:38.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2021-09-13T15:56:59.000Z (about 4 years ago)
- Last Synced: 2025-03-17T12:41:15.824Z (7 months ago)
- Topics: package, r, taxonomy
- Language: R
- Homepage:
- Size: 166 KB
- Stars: 5
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.Rmd
- License: LICENSE
README
---
output:
  github_document:
    df_print: tibble
---

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```

# taxalight :zap: :zap:
[](https://github.com/cboettig/taxalight/actions)
[](https://CRAN.R-project.org/package=taxalight)

`taxalight` provides a lightweight, lightning-fast query for resolving taxonomic identifiers to taxonomic names, and vice versa, using a Lightning Memory-Mapped Database (LMDB) backend. Compared to `taxadb`, it has fewer dependencies, fewer functions, and faster performance.
If you just need to resolve scientific names to identifiers and vice versa, `taxalight` is a fast and simple option. `taxalight` currently supports names from Integrated Taxonomic Information System (ITIS), National Center for Biotechnology Information (NCBI), Global Biodiversity Information Facility (GBIF), Catalogue of Life (COL), and Open Tree Taxonomy (OTT). Like `taxadb`, `taxalight` uses annual stable version snapshots from these providers and presents the naming data in the simple and consistent tabular format of the Darwin Core Standard.
## Installation
You can install the released version of taxalight from [CRAN](https://CRAN.R-project.org) with:
``` r
install.packages("taxalight")
```

And the development version from [GitHub](https://github.com/) with:
``` r
# install.packages("devtools")
devtools::install_github("cboettig/taxalight")
```

## Quickstart
`taxalight` must first download and import the provider naming databases. This can take a while, but only needs to be done once.
```{r example}
library(taxalight)
tl_create("itis")
```

Now we can look up species by name, ID, or a mix of both. Even vernacular names are recognized as keys, though note that only exact matches are supported. ITIS (`itis`) is the default provider; GBIF, COL, OTT, and NCBI are also available.
```{r}
tl("Homo sapiens", provider = "itis")
```

```{r}
id <- c("ITIS:180092", "ITIS:179913", "Dendrocygna autumnalis", "Snow Goose")
tl(id, provider = "itis")
```

For convenience, we can request just the name or ID as a character vector (paralleling functionality in `taxize`). If the name is recognized as an accepted name, the corresponding identifier from the provider is returned.
```{r}
get_ids("Homo sapiens")
```

```{r}
get_names("ITIS:179913")
```

## Benchmarks
```{r}
library(bench)
```

```{r}
sp <- c("Dendrocygna autumnalis", "Dendrocygna bicolor",
"Chen canagica", "Chen caerulescens" )
```

```{r}
taxadb::td_create("itis", schema="dwc")
```

```{r}
bench::bench_time(
df_tb <- taxadb::filter_name(sp, "itis")
)
df_tb
```

```{r}
bench::bench_time(
df_tl <- taxalight::tl(sp, "itis")
)
df_tl
```

```{r}
bench::bench_time(
id_tb <- taxadb::get_ids(sp, "itis")
)
id_tb
```

```{r}
bench::bench_time(
id_tl <- taxalight::get_ids(sp, "itis")
)
id_tl
```

## A provenance-backed data import
Under the hood, `taxalight` consumes a [DCAT2/PROV-O based description](https://raw.githubusercontent.com/boettiger-lab/taxadb-cache/master/prov.json) of the provenance of the standard-format tables imported by `taxalight` (and `taxadb`), which are generated from the original data published by the naming providers. All data and scripts are identified by content-based identifiers, which can be resolved with the R package `contentid`. This provides several benefits over resolving data from a URL source:
1. We have cryptographic certainty that we get the expected bytes every time
1. We can automatically cache and reference a local copy. If the hash matches the requested identifier, then we don't even need to check eTags or other indications that the version we have already is the right one.
1. By registering multiple sources, the data can remain accessible even if one link rots away.

Input data and scripts for transforming the data into the desired format are similarly archived and referenced by content identifiers in the provenance trace.
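The resolution step above can be sketched with `contentid` directly. The identifier below is a hypothetical placeholder, not one of the actual hashes from the provenance record; real identifiers can be read out of the linked `prov.json`.

``` r
library(contentid)

# Hypothetical content identifier (placeholder hash, for illustration only);
# real identifiers for the imported tables appear in the provenance record.
id <- "hash://sha256/0000000000000000000000000000000000000000000000000000000000000000"

# resolve() checks local stores and known registries for content matching
# this hash, verifies the downloaded bytes against the identifier, and
# returns a local file path, caching a copy for future calls.
path <- contentid::resolve(id, store = TRUE)
```

Because the identifier encodes the expected hash, any registered mirror can serve the bytes and the result is still verifiable.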