Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/NIEHS/chopin
Scalable GIS methods for environmental and climate data analysis
https://github.com/NIEHS/chopin
Last synced: 3 months ago
JSON representation
Scalable GIS methods for environmental and climate data analysis
- Host: GitHub
- URL: https://github.com/NIEHS/chopin
- Owner: NIEHS
- License: other
- Created: 2023-08-18T19:39:33.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-05-28T20:39:53.000Z (5 months ago)
- Last Synced: 2024-05-29T08:15:21.800Z (5 months ago)
- Language: R
- Homepage: https://niehs.github.io/chopin/
- Size: 68.4 MB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 8
-
Metadata Files:
- Readme: README.Rmd
- Contributing: .github/CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Codemeta: codemeta.json
Awesome Lists containing this project
- open-sustainable-technology - chopin - Computation for Climate and Health research On Parallelized infrastructure is a Scalable GIS methods for environmental and climate data analysis. (Climate Change / Climate Data Processing and Analysis)
README
---
output:
github_document:
html_preview: false
always_allow_html: true
---```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```# Computation of Spatial Data by Hierarchical and Objective Partitioning of Inputs for Parallel Processing
[![cov](https://NIEHS.github.io/chopin/badges/coverage.svg)](https://github.com/NIEHS/chopin/actions)
[![R-CMD-check](https://github.com/NIEHS/chopin/actions/workflows/check-standard.yaml/badge.svg)](https://github.com/NIEHS/chopin/actions/workflows/check-standard.yaml)
[![Status at rOpenSci Software Peer Review](https://badges.ropensci.org/638_status.svg)](https://github.com/ropensci/software-review/issues/638)
[![Lifecycle:
experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)## Objective and target users
### Objective
- This package automates [parallelization](https://en.wikipedia.org/wiki/Parallel_computing) in spatial operations with `chopin` functions as well as [sf](https://github.com/r-spatial/sf)/[terra](https://github.com/rspatial/terra) functions. With [GDAL](https://gdal.org)-compatible files and database tables, `chopin` functions help to calculate spatial variables from vector and raster data with no external software requirements.### For whom `chopin` is useful
- All who need to perform geospatial operations with large datasets may find this package useful to accelerate the covariate calculation process for further analysis and modeling.
- We assume that users--
- Have basic knowledge of [geographic information system data models](https://r.geocompx.org/spatial-class), [coordinate systems and transformations](https://doi.org/10.22224/gistbok/2023.1.2), [spatial operations](https://r.geocompx.org/spatial-operations), and [raster-vector overlay](https://r.geocompx.org/raster-vector);
- Understood and planned **what they want to calculate**; and
- Collected **datasets they need**## Overview
- Processing functions accept [terra](https://github.com/rspatial/terra)/[sf](https://github.com/r-spatial/sf) classes for spatial data. Raster-vector overlay is done with `exactextractr`.
- From version 0.8.0, this package supports three basic functions that are readily parallelized over multithread environments:
- `extract_at`: extract raster values with point buffers or polygons with or without kernel weights
- `summarize_sedc`: calculate sums of [exponentially decaying contributions](https://mserre.sph.unc.edu/BMElab_web/SEDCtutorial/index.html)
- `summarize_aw`: area-weighted covariates based on target and reference polygons- When processing points/polygons in parallel, the entire study area will be divided into partly overlapped grids or processed through its own hierarchy. We suggest two flowcharts to help which function to use for parallel processing below. The upper flowchart is raster-oriented and the lower one is vector-oriented. They are separated but supplementary to each other. When a user follows the raster-oriented one, they might visit the vector-oriented flowchart at each end of the raster-oriented flowchart.
- `par_grid`: parallelize over artificial grid polygons that are generated from the maximum extent of inputs. `par_pad_grid` is used to generate the grid polygons before running this function.
- `par_hierarchy`: parallelize over hierarchy coded in identifier fields (for example, census blocks in each county in the US)
- `par_multirasters`: parallelize over multiple raster files
- These functions are designed to be used with `future` and `future.mirai` packages to parallelize over multiple CPU threads. Users can choose the number of threads to be used in the parallelization process. Users always need to register parallel workers with `future` before running the three functions above.```r
future::plan(future.mirai::mirai_multisession, workers = 4L)
# future::multisession, future::cluster are available,
# See future.batchtools and future.callr for other options
# the number of workers are up to users' choice
``````{r flowchart-mermaid-raster, echo = FALSE, eval = (Sys.getenv("IN_GALLEY") == "")}
mermaid_chart_raster <-
'
graph LR
n6695079["Is the spatial resolution finer than 100 meters?"]
n11509997["Are there multiple rasters?"]
n72001430["exact_extract with suitable max_cells_in_memory value"]
n27284812["Do they have the same extent and resolution?"]
n83137384["Is a single raster larger than your free memory space?"]
n83318893["Do you have memory larger than the total raster file size?"]
n14786842["exact_extract with low max_cells_in_memory"]
n17102479["exact_extract with high max_cells_in_memory argument value"]
n7037868["Stack rasters then process in the single thread"]
n58642837["par_multirasters"]
n6695079 -->|Yes| n11509997
n6695079 -->|No| n72001430
n11509997 -->|Yes| n27284812
n11509997 -->|No| n83137384
n27284812 -->|Yes| n83318893
n27284812 -->|No| n58642837
n83137384 -->|No| n14786842
n83137384 -->|Yes| n17102479
n83318893 -->|Yes| n7037868
n83318893 -->|No| n58642837
'DiagrammeR::mermaid(mermaid_chart_raster, width = 500, height = 1200)
``````{r flowchart-mermaid-vector, echo = FALSE, eval = (Sys.getenv("IN_GALLEY") == "")}
mermaid_chart_vector <-
'
graph LR
n21640044["Are there 100K+ features in the input vectors?"]
n84295645["Are they hierarchical?"]
n82902796["single thread processing"]
n34878990["Are the data grouped in similar sizes?"]
n27787116["Are they spatially clustered?"]
n89847105["par_hierarchy"]
n90014927["par_pad_balanced"]
n94475834["par_pad_grid(..., mode = \'grid_quantile\') or par_make_gridset_mode = \'grid_advanced\')"]
n77415399["par_pad_grid(..., mode = \'grid\'"]
n64849552["par_grid"]
n21640044 -->|Yes| n84295645
n21640044 -->|No| n82902796
n84295645 -->|Yes| n34878990
n84295645 -->|No| n27787116
n34878990 -->|Yes| n89847105
n34878990 -->|No| n90014927
n34878990 -->|No| n94475834
n27787116 -->|Yes| n94475834
n27787116 -->|No| n77415399
n90014927 --> n64849552
n94475834 --> n64849552
n77415399 --> n64849552
'DiagrammeR::mermaid(mermaid_chart_vector, width = 500, height = 1200)
```## Installation
- `chopin` can be installed using `remotes::install_github` (also possible with `pak::pak` or `devtools::install_github`).
```r
rlang::check_installed("remotes")
remotes::install_github("NIEHS/chopin")
```## Examples
- Examples will navigate `par_grid`, `par_hierarchy`, and `par_multirasters` functions in `chopin` to parallelize geospatial operations.```{r load-packages}
# check and install packages to run examples
pkgs <- c("chopin", "dplyr", "sf", "terra", "future", "future.mirai", "mirai")
# install packages if anything is unavailable
rlang::check_installed(pkgs)library(chopin)
library(dplyr)
library(sf)
library(terra)
library(future)
library(future.mirai)
library(mirai)# disable spherical geometries
sf::sf_use_s2(FALSE)# parallelization-safe random number generator
set.seed(2024, kind = "L'Ecuyer-CMRG")
```### `par_grid`: parallelize over artificial grid polygons
- Please refer to a small example below for extracting mean altitude values at circular point buffers and census tracts in North Carolina.
- Before running code chunks below, set the cloned `chopin` repository as your working directory with `setwd()````{r read-nc}
ncpoly <- system.file("shape/nc.shp", package = "sf")
ncsf <- sf::read_sf(ncpoly)
ncsf <- sf::st_transform(ncsf, "EPSG:5070")
plot(sf::st_geometry(ncsf))
```#### Generate random points in NC
- Ten thousands random point locations were generated inside the counties of North Carolina.
```{r gen-ncpoints}
ncpoints <- sf::st_sample(ncsf, 1e4)
ncpoints <- sf::st_as_sf(ncpoints)
ncpoints$pid <- sprintf("PID-%05d", seq(1, 1e4))
plot(sf::st_geometry(ncpoints))
```#### Target raster dataset: [Shuttle Radar Topography Mission](https://www.usgs.gov/centers/eros/science/usgs-eros-archive-digital-elevation-shuttle-radar-topography-mission-srtm-1)
- We use an elevation dataset with and a moderate spatial resolution (approximately 400 meters or 0.25 miles).```{r load-srtm}
# data preparation
wdir <- system.file("extdata", package = "chopin")
srtm <- file.path(wdir, "nc_srtm15_otm.tif")# terra SpatRaster objects are wrapped when exported to rds file
srtm_ras <- terra::rast(srtm)
terra::crs(srtm_ras) <- "EPSG:5070"
srtm_ras
terra::plot(srtm_ras)
``````{r srtm-extract-single}
# ncpoints_tr <- terra::vect(ncpoints)
system.time(
ncpoints_srtm <-
chopin::extract_at(
x = srtm,
y = ncpoints,
id = "pid",
mode = "buffer",
radius = 1e4L # 10,000 meters (10 km)
)
)```
#### Generate regular grid computational regions
- `chopin::par_pad_grid` takes a spatial dataset to generate regular grid polygons with `nx` and `ny` arguments with padding. Users will have both overlapping (by the degree of `radius`) and non-overlapping grids, both of which will be utilized to split locations and target datasets into sub-datasets for efficient processing.
```{r gen-compregions}
compregions <-
chopin::par_pad_grid(
ncpoints,
mode = "grid",
nx = 2L,
ny = 2L,
padding = 1e4L
)
```- `compregions` is a list object with two elements named `original` (non-overlapping grid polygons) and `padded` (overlapping by `padding`). The figures below illustrate the grid polygons with and without overlaps.
```{r compare-compregions, fig.width = 8, fig.height = 8}
names(compregions)oldpar <- par()
par(mfrow = c(2, 1))
terra::plot(
terra::vect(compregions$original),
main = "Original grids"
)
terra::plot(
terra::vect(compregions$padded),
main = "Padded grids"
)```
#### Parallel processing
- Using the grid polygons, we distribute the task of averaging elevations at 10,000 circular buffer polygons, which are generated from the random locations, with 10 kilometers radius by `chopin::par_grid`.
- Users always need to **register** multiple CPU threads (logical cores) for parallelization.
- `chopin::par_*` functions are flexible in terms of supporting generic spatial operations in `sf` and `terra`, especially where two datasets are involved.
- Users can inject generic functions' arguments (parameters) by writing them in the ellipsis (`...`) arguments, as demonstrated below:
```{r}
future::plan(future.mirai::mirai_multisession, workers = 4L)system.time(
ncpoints_srtm_mthr <-
par_grid(
grids = compregions,
fun_dist = extract_at,
x = srtm,
y = ncpoints,
id = "pid",
radius = 1e4L,
.standalone = FALSE
)
)ncpoints_srtm <-
extract_at(
x = srtm,
y = ncpoints,
id = "pid",
radius = 1e4L
)```
```{r compare-single-multi}
colnames(ncpoints_srtm_mthr)[2] <- "mean_par"
ncpoints_compar <- merge(ncpoints_srtm, ncpoints_srtm_mthr)
# Are the calculations equal?
all.equal(ncpoints_compar$mean, ncpoints_compar$mean_par)
``````{r plot results}
ncpoints_s <-
merge(ncpoints, ncpoints_srtm)
ncpoints_m <-
merge(ncpoints, ncpoints_srtm_mthr)plot(ncpoints_s[, "mean"], main = "Single-thread", pch = 19, cex = 0.33)
plot(ncpoints_m[, "mean_par"], main = "Multi-thread", pch = 19, cex = 0.33)
```### `chopin::par_hierarchy`: parallelize geospatial computations using intrinsic data hierarchy
- In real world datasets, we usually have nested/exhaustive hierarchies. For example, land is organized by administrative/jurisdictional borders where multiple levels exist. In the U.S. context, a state consists of several counties, counties are split into census tracts, and they have a group of block groups.
- `chopin::par_hierarchy` leverages such hierarchies to parallelize geospatial operations, which means that a group of lower-level geographic units in a higher-level geography is assigned to a process.
- A demonstration below shows that census tracts are grouped by their counties then each county will be processed in a CPU thread.#### Read data
```{r}
# nc_hierarchy.gpkg includes two layers: county and tracts
path_nchrchy <- file.path(wdir, "nc_hierarchy.gpkg")nc_data <- path_nchrchy
nc_county <- sf::st_read(nc_data, layer = "county")
nc_tracts <- sf::st_read(nc_data, layer = "tracts")# reproject to Conus Albers Equal Area
nc_county <- sf::st_transform(nc_county, "EPSG:5070")
nc_tracts <- sf::st_transform(nc_tracts, "EPSG:5070")
nc_tracts$COUNTY <- substr(nc_tracts$GEOID, 1, 5)
```#### Extract average SRTM elevations by single and multiple threads
```{r compare-runtime-hierarchy}
# single-thread
system.time(
nc_elev_tr_single <-
chopin::extract_at(
x = srtm,
y = nc_tracts,
id = "GEOID"
)
)# hierarchical parallelization
system.time(
nc_elev_tr_distr <-
chopin::par_hierarchy(
regions = nc_county, # higher level geometry
regions_id = "GEOID", # higher level unique id
fun_dist = extract_at,
x = srtm,
y = nc_tracts, # lower level geometry
id = "GEOID", # lower level unique id
func = "mean"
)
)```
### `par_multirasters`: parallelize over multiple rasters
- There is a common case of having a large group of raster files at which the same operation should be performed.
- `chopin::par_multirasters` is for such cases. An example below demonstrates where we have five elevation raster files to calculate the average elevation at counties in North Carolina.```{r prep-multiraster}
# nccnty <- sf::st_read(nc_data, layer = "county")
ncelev <- terra::rast(srtm)
terra::crs(ncelev) <- "EPSG:5070"
names(ncelev) <- c("srtm15")
tdir <- tempdir()terra::writeRaster(ncelev, file.path(tdir, "test1.tif"), overwrite = TRUE)
terra::writeRaster(ncelev, file.path(tdir, "test2.tif"), overwrite = TRUE)
terra::writeRaster(ncelev, file.path(tdir, "test3.tif"), overwrite = TRUE)
terra::writeRaster(ncelev, file.path(tdir, "test4.tif"), overwrite = TRUE)
terra::writeRaster(ncelev, file.path(tdir, "test5.tif"), overwrite = TRUE)# check if the raster files were exported as expected
testfiles <- list.files(tdir, pattern = "*.tif$", full.names = TRUE)
testfiles
``````{r run-multiraster}
system.time(
res <-
chopin::par_multirasters(
filenames = testfiles,
fun_dist = extract_at,
x = ncelev,
y = nc_county,
id = "GEOID",
func = "mean"
)
)
knitr::kable(head(res))# remove temporary raster files
file.remove(testfiles)
```## Parallelization of a generic geospatial operation
- Other than `chopin` internal macros, `chopin::par_*` functions support generic geospatial operations.
- An example below uses `terra::nearest`, which gets the nearest feature's attributes, inside `chopin::par_grid`.```{r prep-par-generic}
path_ncrd1 <- file.path(wdir, "ncroads_first.gpkg")# Generate 5000 random points
pnts <- sf::st_sample(nc_county, 5000)
pnts <- sf::st_as_sf(pnts)
# assign identifiers
pnts$pid <- sprintf("RPID-%04d", seq(1, 5000))
rd1 <- sf::st_read(path_ncrd1)# reproject
pntst <- sf::st_transform(pnts, "EPSG:5070")
rd1t <- sf::st_transform(rd1, "EPSG:5070")# generate grids
nccompreg <-
chopin::par_pad_grid(
input = pntst,
mode = "grid",
nx = 4L,
ny = 2L,
padding = 5e4L
)```
- The figure below shows the padded grids (50 kilometers), primary roads, and points. Primary roads will be selected by a padded grid per iteration and used to calculate the distance from each point to the nearest primary road. Padded grids and their overlapping areas will look different according to `padding` argument in `chopin::par_pad_grid`.
```{r map-all}
# plot
terra::plot(nccompreg$padded, border = "orange")
terra::plot(terra::vect(ncsf), add = TRUE)
terra::plot(rd1t, col = "blue", add = TRUE)
terra::plot(pntst, add = TRUE, cex = 0.3)
legend(1.02e6, 1.72e6,
legend = c("Computation grids (50km padding)", "Major roads"),
lty = 1, lwd = 1, col = c("orange", "blue"),
cex = 0.5)
``````{r compare-generic}
# terra::nearest run
system.time(
restr <- terra::nearest(x = terra::vect(pntst), y = terra::vect(rd1t))
)pnt_path <- file.path(tdir, "pntst.gpkg")
sf::st_write(pntst, pnt_path)# we use four threads that were configured above
system.time(
resd <-
chopin::par_grid(
grids = nccompreg,
fun_dist = nearest,
x = pnt_path,
y = path_ncrd1,
pad_y = TRUE
)
)
```- We will compare the results from the single-thread and multi-thread calculation.
```{r compare-distance}
resj <- merge(restr, resd, by = c("from_x", "from_y"))
all.equal(resj$distance.x, resj$distance.y)
```- Users should be mindful of potential caveats in the parallelization of nearest feature search, which may result in no or excess distance depending on the distribution of the target dataset to which the nearest feature is searched.
- For example, when one wants to calculate the nearest interstate from rural homes with fine grids, some grids may have no interstates then homes in such grids will not get any distance to the nearest interstate.
- Such problems can be avoided by choosing `nx`, `ny`, and `padding` values in `par_pad_grid` meticulously.### Notes on data restrictions
- `chopin` works best with **two-dimensional** (**planar**) geometries. Users should disable `s2` spherical geometry mode in `sf` by setting. Running any `chopin` functions at spherical or three-dimensional (e.g., including M/Z dimensions) geometries may produce incorrect or unexpected results.
```r
sf::sf_use_s2(FALSE)
```