https://github.com/brownag/labtaxa

Analysis-ready Rocker RStudio Server-based Container for the USDA-NRCS-NCSS Kellogg Soil Survey Lab Data Mart 'SQLite' Snapshot

database docker geopackage kssl lab ncss r rstudio rstudio-server soil soil-survey


---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# labtaxa

[![R-CMD-check](https://github.com/brownag/labtaxa/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/brownag/labtaxa/actions/workflows/R-CMD-check.yaml)
[![docker-publish](https://github.com/brownag/labtaxa/actions/workflows/docker-publish.yml/badge.svg)](https://github.com/brownag/labtaxa/actions/workflows/docker-publish.yml)
[![Docker Pulls](https://badgen.net/docker/pulls/brownag/labtaxa?icon=docker&label=pulls)](https://hub.docker.com/r/brownag/labtaxa/)
[![Docker Size](https://badgen.net/docker/size/brownag/labtaxa?icon=docker&label=size)](https://github.com/users/brownag/packages/container/package/labtaxa)
[![pkgdown](https://img.shields.io/badge/docs-HTML-informational)](https://brownag.github.io/labtaxa/)

## Overview

**labtaxa** provides reproducible access to the USDA-NRCS Kellogg Soil Survey Laboratory (KSSL) database snapshots in R.

The package automatically downloads, caches, and loads the complete Lab Data Mart database (~65,000 soil profiles with detailed laboratory analyses) for soil science research and education.

## Docker

To get up and running quickly you can use the Docker container. The `labtaxa` container is based on **rocker/rstudio** with a pinned R version (see Dockerfile for current version) for reproducibility. In addition to the standard RStudio tools, the container has:

- **Cached Lab Data Mart GeoPackage** - Complete KSSL database (lab and spatial data)
- **Morphologic database** - Derived from NASIS pedon field descriptions
- **Pre-processed data** - `SoilProfileCollection` objects cached as RDS files
- **Curated packages** - All dependencies for soil science analysis
- **RStudio Server** - Full IDE accessible via web browser

Data and packages are pinned to exact versions rather than floating ones, so a dated Docker tag (for example `2026.02`) always gives the same environment and data. Only the `latest` tag moves as new snapshots are built.

From Docker Hub:

``` sh
docker pull brownag/labtaxa:latest
```

Or from GitHub:

``` sh
docker pull ghcr.io/brownag/labtaxa:latest
```

### Quick Start with Docker Compose (Recommended)

The easiest way to run the container is with **Docker Compose**. A `docker-compose.yml` file is included in this repository:

``` sh
# Clone the repository
git clone https://github.com/brownag/labtaxa.git
cd labtaxa

# Start the container (downloads image on first run)
docker-compose up -d

# Stop the container
docker-compose down
```

Then open your web browser and navigate to `http://localhost:8787`. The default username is `rstudio` and the default password is `soilscience`.

**Features:**

- Persistent `projects/` directory for your work
- Automatic volume management for package cache
- Pre-configured resource limits (16GB memory, 4 CPUs)
- Health checks to monitor container status

### Running with Docker Run

Alternatively, you can run the container directly:

``` sh
docker run -d -p 8787:8787 -e PASSWORD=mypassword -v ~/Documents:/home/rstudio/Documents -e ROOT=TRUE brownag/labtaxa
```

Then open your web browser and navigate to `http://localhost:8787`. The username is `rstudio` and the password is whatever you set with `-e PASSWORD` (`mypassword` in the command above).

## R Package Installation

You can install the development version of {labtaxa} from GitHub:

``` r
# Install from GitHub only if {labtaxa} is not already available
if (!require("labtaxa"))
  remotes::install_github("brownag/labtaxa")
```

## Example

Download (and cache) the latest Lab Data Mart SQLite snapshot like so:

```{r example, echo = TRUE}
library(labtaxa)
ldm <- get_LDM_snapshot()
```

Downloaded and derived files are cached with `cache_labtaxa()` in a platform-specific directory, whose path is given by `ldm_data_dir()`.
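To see where the cache lives on your machine, a minimal sketch is shown here. It assumes that `ldm_data_dir()` can be called without arguments and returns a single directory path; check the package documentation if your version differs.

```r
library(labtaxa)

# Assumption: ldm_data_dir() needs no arguments and returns the cache path
cache_dir <- ldm_data_dir()
cache_dir

# List whatever has been cached so far
# (expected to be empty before the first call to get_LDM_snapshot())
list.files(cache_dir, recursive = TRUE)
```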

In the Docker container the snapshot has already been created and cached from the latest data available when the container was built. The container is rebuilt on a monthly schedule, and also whenever the method used to create the cache is updated.

The cached data let you start analyzing the entire KSSL database quickly using the [{aqp}](https://cran.r-project.org/package=aqp) R package toolchain.

The lab data are pre-loaded in a large `SoilProfileCollection` object (over 65,000 profiles). Within a few seconds of the Docker container loading, you can be filtering and processing the lab data object. By contrast, downloading archives of the complete databases can take tens of minutes to a couple of hours, depending on your internet connection. A cache update is only required when you need the absolute most recent data.

The downloaded databases (GeoPackage, SQLite) are queried locally using the {soilDB} functions `fetchLDM()` and `fetchNASIS()`. These functions can take a couple of minutes to process databases this large, so the container build front-loads these more costly steps. A query like this precedes essentially every analysis. {soilDB} provides standard aggregation methods that produce {aqp} `SoilProfileCollection` objects, a convenient data structure for working with horizon- and site-level data associated with specific soil profiles.

When you start up {labtaxa} in the Docker container you will have the latest database and the first-step data object (as if you ran the {soilDB} functions) readily available for post-processing for answering specific questions.

```{r example2, echo = TRUE}
ldm
```
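As a rough illustration of the kind of post-processing a `SoilProfileCollection` supports, the sketch below builds a tiny toy collection with {aqp} and then filters and summarizes it. The column names (`pedon_id`, `clay`, `taxorder`) and all values are made up for the example and are not the column names of the real `ldm` object.

```r
library(aqp)

# Toy horizon data standing in for the KSSL lab data (all values are made up)
toy <- data.frame(
  pedon_id = c("P1", "P1", "P2", "P2", "P3"),
  top      = c(0, 20, 0, 15, 0),
  bottom   = c(20, 50, 15, 60, 30),
  clay     = c(18, 32, 8, 12, 45)
)

# Promote the data.frame to a SoilProfileCollection: profile ID ~ top + bottom
depths(toy) <- pedon_id ~ top + bottom

# Attach site-level data: one row per profile, joined on the profile ID column
site(toy) <- data.frame(
  pedon_id = c("P1", "P2", "P3"),
  taxorder = c("Alfisols", "Entisols", "Vertisols")
)

# Site-level filter: keep only the profiles classified as Alfisols
alfisols <- subset(toy, taxorder == "Alfisols")
profile_id(alfisols)

# Horizon-level summary: thickness-weighted mean clay content per profile
h <- horizons(toy)
h$thickness <- h$bottom - h$top
wt_clay <- sapply(
  split(h, h$pedon_id),
  function(d) weighted.mean(d$clay, d$thickness)
)
wt_clay
```

The same pattern (filter on site-level attributes, then summarize horizon-level values) should carry over to the full lab data object once you substitute its actual column names.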

If you are running on your own machine you will have to run `get_LDM_snapshot()` at least once (as above) before the `load_labtaxa()` command works. In future runs you will not need to re-download or prepare the data unless you need to update the cache.
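On your own machine, the two-step workflow described above might look like the following sketch. It assumes that `load_labtaxa()` can be called without arguments and returns the cached `SoilProfileCollection`; consult the function's help page for the actual signature.

```r
library(labtaxa)

# First session on a new machine: download, process, and cache the snapshot.
# Uncomment for the first run only.
# ldm <- get_LDM_snapshot()

# Later sessions: read the cached object instead of downloading again.
# Assumption: load_labtaxa() needs no arguments and returns the cached object.
ldm <- load_labtaxa()
```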

## Data Versioning Strategy

**labtaxa** uses date-based version tags for reproducibility:

- `latest` - Most recent data snapshot (always updated)
- `YYYY.MM` - Specific month snapshot (e.g., `2026.02` for February 2026 data)
- `YYYY.MM.DD` - Specific day snapshot (rare, for patch builds)

**For reproducible research**, always specify a version tag:

```bash
# Use a specific month snapshot (recommended for publications)
docker pull ghcr.io/brownag/labtaxa:2026.02
```
**GitHub Releases** contain checksums for verification:

- Each release is tagged with the data snapshot date
- Download `snapshot-metadata.json` to verify file integrity
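
Comparing a downloaded file against a published checksum can be sketched in base R as follows. The hash algorithm and the field names used in `snapshot-metadata.json` are not stated here, so MD5 via `tools::md5sum()` is an assumption chosen only because it ships with R, and `verify_checksum()` is a hypothetical helper rather than part of labtaxa.

```r
# Return TRUE when the MD5 checksum of `path` matches `expected`
# (hypothetical helper, not part of labtaxa; MD5 is an assumption)
verify_checksum <- function(path, expected) {
  actual <- unname(tools::md5sum(path))
  identical(actual, tolower(expected))
}

# Stand-in file and "published" checksum so the example runs anywhere
demo_file <- tempfile(fileext = ".txt")
writeLines("labtaxa checksum demo", demo_file)
published <- unname(tools::md5sum(demo_file))

verify_checksum(demo_file, published)           # checksums match
verify_checksum(demo_file, "0123456789abcdef")  # checksums differ
```

In practice you would replace `demo_file` with the path to the downloaded snapshot and `published` with the value recorded for that file in the release metadata.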

### Cite the Data

When publishing research that uses NCSS laboratory data, you should cite:

> National Cooperative Soil Survey
> National Cooperative Soil Survey Soil Characterization Database
> http://ncsslabdatamart.sc.egov.usda.gov/
> Accessed

You can cite both the package and the data version:

```bibtex
@misc{labtaxa2026,
  author = {Brown, Andrew},
  title = {labtaxa: USDA KSSL Database Snapshots},
  year = {2026},
  url = {https://github.com/brownag/labtaxa}
}
```
### Getting Help

- **Documentation**: https://brownag.github.io/labtaxa/
- **Report Issues**: https://github.com/brownag/labtaxa/issues
- **Discussions**: https://github.com/brownag/labtaxa/discussions

## Related Resources

- **Lab Data Mart**: https://ncsslabdatamart.sc.egov.usda.gov/
- **aqp Package** (SoilProfileCollection object): https://cran.r-project.org/package=aqp
- **soilDB Package** (soil database tools): https://cran.r-project.org/package=soilDB