Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/ropensci/taxadb

:package: Taxonomic Database
https://github.com/ropensci/taxadb

Last synced: 8 days ago
JSON representation

:package: Taxonomic Database

Awesome Lists containing this project

README

        

---
output: github_document
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
warning = TRUE,
comment = "#>"
)
taxadb:::td_disconnect()
```

# taxadb

[![R-CMD-check](https://github.com/ropensci/taxadb/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/ropensci/taxadb/actions/workflows/R-CMD-check.yaml)
[![lifecycle](https://img.shields.io/badge/lifecycle-maturing-blue.svg)](https://lifecycle.r-lib.org/articles/stages.html)
[![CRAN status](https://www.r-pkg.org/badges/version/taxadb)](https://cran.r-project.org/package=taxadb)
[![DOI](https://zenodo.org/badge/130153207.svg)](https://zenodo.org/badge/latestdoi/130153207)

The goal of `taxadb` is to provide *fast*, *consistent* access to taxonomic data, supporting common tasks such as resolving taxonomic names to identifiers, looking up higher classification ranks of given species, or returning a list of all species below a given rank. These tasks are particularly common when synthesizing data across large species assemblies, such as combining occurrence records with trait records.

Existing approaches to these problems typically rely on web APIs, which can make them impractical for work with large numbers of species or in more complex pipelines. Queries and returned formats also differ across the different taxonomic authorities, making tasks that query multiple authorities particularly complex. `taxadb` creates a *local* database of most readily available taxonomic authorities, each of which is transformed into consistent, standard, and researcher-friendly tabular formats.

## Install and initial setup

To get started, install from CRAN

``` r
install.packages("taxadb")
```

or install the development version directly from GitHub:

``` r
devtools::install_github("ropensci/taxadb")
```

```{r message = FALSE}
library(taxadb)
library(dplyr) # Used to illustrate how a typical workflow combines nicely with `dplyr`
```

Create a local copy of the (current) Catalogue of Life database:

```{r }
td_create("col")
```

Read in the species list used by the Breeding Bird Survey:

```{r, message = FALSE}
bbs_species_list <- system.file("extdata/bbs.tsv", package="taxadb")
bbs <- read.delim(bbs_species_list)
```

## Getting names and ids

Two core functions are `get_ids()` and `get_names()`. These functions take a vector of names or ids (respectively), and return a vector of ids or names (respectively). For instance, we can use this to attempt to resolve all the bird names in the Breeding Bird Survey against the Catalogue of Life:

```{r}
birds <- bbs %>%
select(species) %>%
mutate(id = get_ids(species, "col"))

head(birds, 10)
```

Note that some names cannot be resolved to an identifier. This can occur because of miss-spellings, non-standard formatting, or the use of a synonym not recognized by the naming provider. Names that cannot be uniquely resolved because they are known synonyms of multiple different species will also return `NA`. The `filter_name` filtering functions can help us resolve this last case (see below).

`get_ids()` returns the IDs of accepted names, that is `dwc:AcceptedNameUsageID`s. We can resolve the IDs into accepted names:

```{r}
birds %>%
mutate(accepted_name = get_names(id, "col")) %>%
head()
```

This illustrates that some of our names, e.g. *Dendrocygna bicolor* are accepted in the Catalogue of Life, while others, *Anser canagicus* are **known synonyms** of a different accepted name: **Chen canagica**. Resolving synonyms and accepted names to identifiers helps us avoid the possible miss-matches we could have when the same species is known by two different names.

## Taxonomic Data Tables

Local access to taxonomic data tables lets us do much more than look up names and ids. A family of `filter_*` functions in `taxadb` help us work directly with subsets of the taxonomic data. As we noted above, this can be useful in resolving certain ambiguous names.

For instance, *Agrostis caespitosa* does not resolve to an identifier in ITIS:

```{r}
get_ids("Agrostis caespitosa", "itis")
```

Using `filter_name()`, we find this is because the name resolves not to zero matches, but is a known synonym to more than one accepted name (as indicated by the accepted name usage id)

```{r}
filter_name('Agrostis caespitosa', 'itis')
```

We can resolve the scientific name to the acceptedNameUsage using `get_names()` on the _accepted_ IDs: (These also correspond to the genus and specificEpithet column, as the classification is always given only based on acceptedNameUsageID).

```{r}
filter_name("Agrostis caespitosa") %>%
mutate(acceptedNameUsage = get_names(acceptedNameUsageID)) %>%
select(scientificName, taxonomicStatus, acceptedNameUsage, acceptedNameUsageID)
```

Similar functions `filter_id`, `filter_rank`, and `filter_common` take IDs, scientific ranks, or common names, respectively. Here, we can get taxonomic data on all bird names in the Catalogue of Life:

```{r}
filter_rank(name = "Aves", rank = "class", provider = "col")
```

Combining these with `dplyr` functions can make it easy to explore this data: for instance, which families have the most species?

```{r}
filter_rank(name = "Aves", rank = "class", provider = "col") %>%
filter(taxonomicStatus == "accepted", taxonRank=="species") %>%
group_by(family) %>%
count(sort = TRUE) %>%
head()
```

## Using the database connection directly

`filter_*` functions by default return in-memory data frames. Because they are filtering functions, they return a subset of the full data which matches a given query (names, ids, ranks, etc), so the returned data.frames are smaller than the full record of a naming provider. Working directly with the SQL connection to the MonetDBLite database gives us access to all the data. The `taxa_tbl()` function provides this connection:

```{r}
taxa_tbl("col")
```

We can still use most familiar `dplyr` verbs to perform common tasks. For instance: which species has the most known synonyms?

```{r}
taxa_tbl("itis") %>%
count(acceptedNameUsageID, sort=TRUE)
```

However, unlike the `filter_*` functions which return convenient in-memory tables, this is still a remote connection. This means that direct access using the `taxa_tbl()` function (or directly accessing the database connection using `td_connect()`) is more low-level and requires greater care. For instance, we cannot just add a `%>% mutate(acceptedNameUsage = get_names(acceptedNameUsageID))` to the above, because `get_names` does not work on a remote collection. Instead, we would first need to use a `collect()` to pull the summary table into memory. Users familiar with remote databases in `dplyr` will find using `taxa_tbl()` directly to be convenient and fast, while other users may find the `filter_*` approach to be more intuitive.

## Learn more

- See richer examples the package [Tutorial](https://docs.ropensci.org/taxadb/articles/articles/intro.html).

- Learn about the underlying data sources and formats in [Data Sources](https://docs.ropensci.org/taxadb/articles/data-sources.html)

- Get better performance by selecting an alternative [database backend](https://docs.ropensci.org/taxadb/articles/backends.html) engines.

```{r include=FALSE}
taxadb:::td_disconnect()

if(require(codemeta)) codemeta::write_codemeta()
```

----

Please note that this project is released with a [Contributor Code of Conduct](https://ropensci.org/code-of-conduct/).
By participating in this project you agree to abide by its terms.

[![ropensci_footer](https://ropensci.org/public_images/ropensci_footer.png)](https://ropensci.org)