https://github.com/cboettig/taxalight
A lightning-fast taxonomic database store backed by LMDB
- Host: GitHub
- URL: https://github.com/cboettig/taxalight
- Owner: cboettig
- License: other
- Created: 2020-09-16T03:04:38.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2021-09-13T15:56:59.000Z (about 4 years ago)
- Last Synced: 2025-03-17T12:41:15.824Z (7 months ago)
- Topics: package, r, taxonomy
- Language: R
- Homepage:
- Size: 166 KB
- Stars: 5
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.Rmd
- License: LICENSE
README
---
output:
  github_document:
    df_print: tibble
---

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```

# taxalight :zap: :zap:
[](https://github.com/cboettig/taxalight/actions)
[](https://CRAN.R-project.org/package=taxalight)

`taxalight` provides a lightweight, lightning-fast query for resolving taxonomic identifiers to taxonomic names, and vice versa, using a Lightning Memory-Mapped Database (LMDB) backend. Compared to `taxadb`, it has fewer dependencies, fewer functions, and faster performance.
If you just need to resolve scientific names to identifiers and vice versa, `taxalight` is a fast and simple option. `taxalight` currently supports names from Integrated Taxonomic Information System (ITIS), National Center for Biotechnology Information (NCBI), Global Biodiversity Information Facility (GBIF), Catalogue of Life (COL), and Open Tree Taxonomy (OTT). Like `taxadb`, `taxalight` uses annual stable version snapshots from these providers and presents the naming data in the simple and consistent tabular format of the Darwin Core Standard.
## Installation
You can install the released version of taxalight from [CRAN](https://CRAN.R-project.org) with:
``` r
install.packages("taxalight")
```

And the development version from [GitHub](https://github.com/) with:
``` r
# install.packages("devtools")
devtools::install_github("cboettig/taxalight")
```

## Quickstart
`taxalight` must first download and import the provider naming databases. This can take a while, but only needs to be done once.
```{r example}
library(taxalight)
tl_create("itis")
```

Now we can look up species by name, ID, or a mix of both. Even vernacular names are recognized as keys, though note that only exact matches are supported. ITIS (`itis`) is the default provider; GBIF, COL, OTT, and NCBI are also available.
```{r}
tl("Homo sapiens", provider = "itis")
```

```{r}
id <- c("ITIS:180092", "ITIS:179913", "Dendrocygna autumnalis", "Snow Goose")
tl(id, provider = "itis")
```

For convenience, we can request just the name or ID as a character vector (paralleling functionality in `taxize`). If the name is recognized as an accepted name, the corresponding identifier from the provider is returned.
```{r}
get_ids("Homo sapiens")
```

```{r}
get_names("ITIS:179913")
```

## Benchmarks
```{r}
library(bench)
```

```{r}
sp <- c("Dendrocygna autumnalis", "Dendrocygna bicolor",
"Chen canagica", "Chen caerulescens" )
```

```{r}
taxadb::td_create("itis", schema="dwc")
```

```{r}
bench::bench_time(
df_tb <- taxadb::filter_name(sp, "itis")
)
df_tb
```

```{r}
bench::bench_time(
df_tl <- taxalight::tl(sp, "itis")
)
df_tl
```

```{r}
bench::bench_time(
id_tb <- taxadb::get_ids(sp, "itis")
)
id_tb
```

```{r}
bench::bench_time(
id_tl <- taxalight::get_ids(sp, "itis")
)
id_tl
```

## A provenance-backed data import
Under the hood, `taxalight` consumes a [DCAT2/PROV-O based description](https://raw.githubusercontent.com/boettiger-lab/taxadb-cache/master/prov.json) of the provenance of the standard-format tables imported by `taxalight` (and `taxadb`), which are generated from the original data published by the naming providers. All data and scripts are identified by content-based identifiers, which can be resolved with the R package `contentid`. This provides several benefits over resolving data from a URL source:
1. We have cryptographic certainty that we get the expected bytes every time
1. We can automatically cache and reference a local copy. If the hash matches the requested identifier, then we don't even need to check eTags or other indications that the version we have already is the right one.
1. By registering multiple sources, the data can remain accessible even if one link rots away.

Input data and scripts for transforming the data into the desired format are similarly archived and referenced by content identifiers in the provenance trace.
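The resolution step above can be sketched with `contentid` directly. The identifier below is a hypothetical placeholder, not one of the actual hashes from the provenance record; real identifiers can be read out of the linked `prov.json`.

``` r
library(contentid)

# Hypothetical content identifier (placeholder hash, for illustration only);
# real identifiers for the imported tables appear in the provenance record.
id <- "hash://sha256/0000000000000000000000000000000000000000000000000000000000000000"

# resolve() checks local stores and known registries for content matching
# this hash, verifies the downloaded bytes against the identifier, and
# returns a local file path, caching a copy for future calls.
path <- contentid::resolve(id, store = TRUE)
```

Because the identifier encodes the expected hash, any registered mirror can serve the bytes and the result is still verifiable.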