Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/evamaerey/sf2stat

helpers to get data ready to be used in stat or geom layer w ggplot2
https://github.com/evamaerey/sf2stat

Last synced: 11 days ago
JSON representation

helpers to get data ready to be used in stat or geom layer w ggplot2

Awesome Lists containing this project

README

        

---
output:
github_document:
toc: FALSE
---

[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
eval = T,
warning = F,
message = F
)

library(tidyverse)

```

# Part 0. Proposal

Proposing the {sf2stat} package! 🦄

The goal of {sf2stat} is to make it easier to prep *sf data* for use in a ggproto Stat computation; the Stat then can be used for creating a stat/geom function to be used in ggplot2 plots.

Without the package, we live in the effortful world, in which we'd have to prep our own data including figuring out the bounding box for each geometry, and, if we want labeling functionality, the centroid for each geometry.

With the {sf2stat} package, we'll live in a different world (🦄 🦄 🦄) where the task is a snap 🫰:

Proposed API is:

```
library(sf2stat)
--
--
read.csv("nc-midterms.csv") |>
ggplot() +
aes(county_name = str_to_title(desc_county)) +
geom_county() +
aes(fill = cd_party) +
geom_county_text()
```

# Package build Part I. Work out functionality ✅

In this section we'll use the nc sf dataframe to check out how our functions work.

## Select toy sf data

```{r}
nc_ref <- sf::st_read(system.file("shape/nc.shp", package="sf")) |>
select(county_name = NAME, fips = FIPS)

read.csv("nc-midterms.csv") |>
mutate(county_name = str_to_title(desc_county)) |>
left_join(nc_ref) %>%
ggplot() +
geom_sf() +
aes(fill = cd_party,
label = county_name,
geometry = geometry)+
geom_sf_text(check_overlap = T)

```

```{r}
# we want our stat to do stuff that StatSf and StatSfCoordinates does.
prep_geo_reference <- function(ref_data, id_index = 1){

ref_data |>
ggplot2::StatSf$compute_panel(coord = ggplot2::CoordSf) |>
ggplot2::StatSfCoordinates$compute_group(coord = ggplot2::CoordSf) %>%
mutate(id_col = .[[id_index]])

}

# Flip the script... prepare compute (join) to happen in layer (NEW!)
compute_panel_region <- function(data, scales, ref_data, id_index = 1,
stamp = FALSE, keep_id = NULL,
drop_id = NULL){

ref_data %>%
prep_geo_reference(id_index = id_index) ->
ref_data

if(!is.null(keep_id)){

ref_data %>%
filter(id_col %in% keep_id) ->
ref_data

}

if(!is.null(drop_id)){

ref_data %>%
filter(!(id_col %in% drop_id)) ->
ref_data

}

if(stamp){

ref_data

}else{

ref_data %>%
inner_join(data)

}

}
```

```{r}
nc_ref <- sf::st_read(system.file("shape/nc.shp", package="sf")) |>
select(county_name = NAME, fips = FIPS)

read.csv("nc-midterms.csv") |>
mutate(county_name = str_to_title(desc_county)) |>
select(county_name) |>
compute_panel_region(ref_data = nc_ref)

read.csv("nc-midterms.csv") |>
mutate(county_name = str_to_title(desc_county)) |>
select(county_name) |>
compute_panel_region(ref_data = nc_ref, keep_id = "Mecklenburg")
```

# wrapping up more

```{r}
# same as geom_sf but geom (and stat) is flexible
qlayer_sf_crs <- function (mapping = aes(), data = NULL, geom = "sf",
stat = "sf", position = "identity",
na.rm = FALSE, show.legend = NA, inherit.aes = TRUE,
crs, ...) {

c(layer_sf(geom = geom, data = data, mapping = mapping,
stat = stat, position = position, show.legend = show.legend,
inherit.aes = inherit.aes, params = rlang::list2(na.rm = na.rm,
...)),
coord_sf(crs = crs))
}

stat_region <- function(ref_data = getOption("sf2stat.ref_data", nc_ref),
id_index = 1,
required_aes = getOption("sf2stat.required_aes", "fips|county_name"),
geom = GeomSf, ...){

# if(!is.null(stamp)){if(stamp){required_aes = c()}}

StatSfJoin <- ggproto("StatSfJoin", Stat,
compute_panel = compute_panel_region,
default_aes = aes(label = after_stat(id_col)),
required_aes = required_aes)

qlayer_sf_crs(stat = StatSfJoin,
geom = geom,
ref_data = ref_data,
crs = sf::st_crs(ref_data),
id_index = id_index, ...)

}

GeomOutline <- ggproto("GeomOutline", GeomSf,
default_aes = aes(!!!modifyList(GeomSf$default_aes,
aes(fill = NA,
color = "black"))))

geom_region_sf <- function(...){stat_region(geom = GeomSf, ...)}
geom_region <- geom_region_sf # convenience short name
geom_region_outline <- function(...){stat_region(geom = GeomOutline, ...)}
geom_region_label <- function(...){stat_region(geom = GeomLabel,...)}
geom_region_text <- function(...){stat_region(geom = GeomText, ...)}

stamp_region_sf <- function(...){geom_region_sf(stamp = T, ...)}
stamp_region <- stamp_region_sf
stamp_region_outline <- function(...){geom_region_outline(stamp = T, ...)}
stamp_region_label <- function(...){geom_region_label(stamp = T, ...)}
stamp_region_text <- function(...){geom_region_text(stamp = T, ...)}

options(sf2stat.ref_data = nc_ref,
sf2stat.required_aes = "fips|county_name")

# north carolina counties....
read.csv("nc-midterms.csv") |>
ggplot() +
aes(county_name = str_to_title(desc_county)) +
stamp_region(fill = 'darkgrey') +
geom_region() +
aes(fill = n/1000) +
stamp_region_outline(
keep_id = "Mecklenburg",
color = "orange",
linewidth = 1) +
geom_region_text(check_overlap = T,
color = "whitesmoke")

chilemapas::generar_regiones() %>%
mutate(numero_region = as.numeric(codigo_region)) %>%
options(sf2stat.ref_data = .,
sf2stat.required_aes = "codigo_region|numero_region")

chilemapas::censo_2017_comunas %>%
mutate(region = str_extract(codigo_comuna, "..")) %>%
summarise(pop = sum(poblacion), .by = c(region, sexo)) %>%
ggplot() +
aes(codigo_region = region, fill = pop/100000) +
geom_region(linewidth = .01) +
facet_wrap(~sexo) +
scale_fill_viridis_b(transform = "log") +
stamp_region_outline(color = "red",
keep_id = "05")

library(tmap)

data(World)

World %>%
select(country_name = name,
iso3c = iso_a3) %>%
options(sf2stat.ref_data = .,
sf2stat.required_aes = "country_name|iso3c")

tidyr::world_bank_pop %>%
filter(indicator == "SP.POP.GROW") %>%
ggplot() +
aes(iso3c = country) +
geom_region() +
aes(fill = `2000`) +
scale_fill_viridis_c() +
labs(title = "Population Growth, 2000")

data("NLD_prov")

NLD_prov %>%
select(prov_name = name, prov_code = code) %>%
options(sf2stat.ref_data = .,
sf2stat.required_aes = "prov_code|prov_name")

NLD_prov %>%
sf::st_drop_geometry() %>%
ggplot() +
aes(prov_code = code) +
geom_region() +
aes(fill = population/100000) +
geom_region_text(check_overlap = T,
size = 2,
color = "whitesmoke")

last_plot() +
aes(label = round(population/100000, 3))

usmapdata::us_map() %>%
select(state_name = full, state_abb = abbr, fips,
geometry = geom) %>%
options(sf2stat.ref_data = .,
sf2stat.required_aes = "state_name|state_abb|fips")

USArrests %>%
rownames_to_column("state") %>%
ggplot() +
aes(state_name = state) +
geom_region(alpha = .75) +
aes(fill = UrbanPop) +
scale_fill_viridis_c()

ggseg::aseg$data %>%
# filter(!is.na(label)) %>%
select(region_id = label) %>%
options(sf2stat.ref_data = .,
sf2stat.required_aes = "region_id")

ggplot() +
aes(region_id = 1) +
stamp_region() +
stamp_region(keep_id = c("Right-Amygdala", "Left-Amygdala"),
fill = "darkred")

```

# sf2stat Proposed usage

https://twitter.com/EmilyRiederer/status/1816820773581127781

```{r, fig.height=2}
# given some flatfile data of interest
read.csv("nc-midterms.csv") |> head()

# and being aware of geographic data with geometry shape column
nc <- sf::st_read(system.file("shape/nc.shp", package="sf"))

# select relevant id columns (this will keep geometry column)
nc_ref <- nc |>
select(county_name = NAME, fips = FIPS)

# do this routine, change out the ref data, required_aes and 'county' in convenience funtion names.
stat_county <- function(...){stat_region(ref_data = nc_ref, required_aes = "county_name|fips", id_index = 1, ...)} # uses GeomSf as default

GeomOutline <- ggproto("GeomOutline", GeomSf,
default_aes = aes(!!!modifyList(GeomSf$default_aes,
aes(fill = NA,
color = "black"))))

geom_county_sf <- function(...){stat_county(geom = GeomSf,...)}
geom_county <- geom_county_sf # convenience short name
geom_county_outline <- function(...){stat_county(geom = GeomOutline, ...)}
geom_county_label <- function(...){stat_county(geom = GeomLabel,...)}
geom_county_text <- function(...){stat_county(geom = GeomText, ...)}

stamp_county_sf <- function(...){geom_county_sf(stamp = T, ...)}
stamp_county <- stamp_county_sf
stamp_county_outline <- function(...){geom_county_outline(stamp = T, ...)}
stamp_county_label <- function(...){geom_county_label(stamp = T, ...)}
stamp_county_text <- function(...){geom_county_text(stamp = T, ...)}
```

A first NC map shows when we map desc_county to county name.

```{r}
read.csv("nc-midterms.csv") |>
ggplot() +
aes(county_name = str_to_title(desc_county)) +
geom_county()
```

We see that there are actually undiscovered counties, as exact name matching can be a little dicy. Using fips which would probably perform better, which is possible with any plot data with an input fips column (I decided to skip adding fips even though I had a cross-walk)

We can do the following 'stamp' convenience layer to get the full map. I think of this as an annotation layer - it doesn't refer to global data, but 'brings its own data'. annotate_county is just too long.

```{r}
read.csv("nc-midterms.csv") |>
ggplot() +
aes(county_name = str_to_title(desc_county)) +
stamp_county(fill = 'darkgrey')

```

Then we use geom_county(), which reflects your data and the success of the underlying join process.

```{r}
last_plot() +
geom_county()
```

Then look at population choropleth (fill = n) and highlight Mecklenburg with convenience annotation layer 'stamp_county'

```{r}
options(scipen = 10)
last_plot() +
aes(fill = n/100000)
```

highlight at county of interest...

```{r}
last_plot() +
stamp_county_outline(keep_id = "Mecklenburg",
color = "orange",
linewidth = 1)
```

We can add a text layer defaults to ref_data column 1 (id_index setting)...

```{r}
last_plot() +
geom_county_text(color = "white",
check_overlap = T,
size = 2)
```

We can look at another variable...

```{r}
last_plot() +
aes(fill = cd_party)
```

And another...

```{r}
last_plot() +
aes(fill = ind_vote)
```

And look at some values for that variable

```{r}
last_plot() +
aes(label = round(ind_vote, 2))

```

```{r}
knitr::knit_exit()

```

```{r}
sf_df_add_xy_center_coords <- function(sf_df){

sf_df |>
dplyr::pull(geometry) |>
sf::st_zm() |>
sf::st_point_on_surface() ->
points_sf

the_coords <- do.call(rbind, sf::st_geometry(points_sf)) |>
tibble::as_tibble() |> setNames(c("x","y"))

cbind(sf_df, the_coords)

}
```

```{r}
nc |> sf_df_add_xy_center_coords()
```

## `sf_df_return_bbox_df()`

Second we have a function that's going to return bounding boxes as a dataframe. For our reference data we need these xmin, xmax variables for each row in our data.

```{r sf_df_return_bbox_df}
sf_df_return_bbox_df <- function(sf_df){

bb <- sf::st_bbox(sf_df)

data.frame(xmin = bb[1], ymin = bb[2],
xmax = bb[3], ymax = bb[4])

}
```

```{r}
nc[10,] |> sf_df_return_bbox_df()
```

## `sf_df_prep_for_stat()`

Finally we bundle this into the user-facing function that will take an sf dataframe and add required columns for display in ggplot2 sf layer.

```{r sf_df_prep_for_stat}
sf_df_prep_for_stat <- function(sf_df, id_col_name = NULL){

sf_df |>
# using purrr allows us to get bb for each row
dplyr::mutate(bb =
purrr::map(geometry,
sf_df_return_bbox_df)) |>
tidyr::unnest(bb) |>
data.frame() |>
sf_df_add_xy_center_coords() ->
sf_df_w_bb_and_centers

# use first column as keep/drop column unless otherwise specified
if(is.null(id_col_name)){id_col_name <- 1}

sf_df_w_bb_and_centers$id_col <- sf_df_w_bb_and_centers[,id_col_name]

return(sf_df_w_bb_and_centers)

}
```

----

```{r}
knitr::opts_chunk$get(eval = T)
```

## Fully worked example: How you'd use sf2stat to build functionality with scope, region type, and location name

Let's see how we might recreate the functionality in the ggnorthcarolina package using some templates in this readme.

In the example, the scope of the package is 'northcarolina'. The region of interest is 'county', and the location names that we are using are the county names.

### Step 00. prep reference data

```{r, eval = F}
usethis::use_data_raw()
```

##

```{r nc_geo_reference_prep, eval = T}
nc <- sf::st_read(system.file("shape/nc.shp", package="sf"))

geo_reference_northcarolina_county <- nc |>
dplyr::select(county_name = NAME, fips = FIPS) |>
sf_df_prep_for_stat(id_col_name = "county_name")

```

```{r, eval = F}
usethis::use_data(geo_reference_northcarolina_county)
```

```{r stat_region_template, eval = F}
compute_panel_scope_region <- function(data, scales, keep_id = NULL,
drop_id = NULL, stamp = FALSE){

if(!stamp){data <- dplyr::inner_join(data, geo_reference_scope_region)}
if( stamp){data <- geo_reference_scope_region }

if(!is.null(keep_id)){ data <- dplyr::filter(data, id_col %in% keep_id) }
if(!is.null(drop_id)){ data <- dplyr::filter(data, !(id_col %in% drop_id)) }

data

}

# step 2
StatSfscoperegion <- ggplot2::ggproto(`_class` = "StatSfscoperegion",
`_inherit` = ggplot2::Stat,
# required_aes = c("fips|county_name"),
compute_panel = compute_panel_scope_region,
default_aes = ggplot2::aes(label = after_stat(id_col)))

stat_region <- function(
mapping = NULL,
data = NULL,
geom = ggplot2::GeomSf,
position = "identity",
na.rm = FALSE,
show.legend = NA,
inherit.aes = TRUE,
crs = "NAD27", # "NAD27", 5070, "WGS84", "NAD83", 4326 , 3857
...) {

c(ggplot2::layer_sf(
stat = StatSfscoperegion, # proto object from step 2
geom = geom, # inherit other behavior
data = data,
mapping = mapping,
position = position,
show.legend = show.legend,
inherit.aes = inherit.aes,
params = rlang::list2(na.rm = na.rm, ...)
),

coord_sf(crs = crs,
default_crs = sf::st_crs(crs),
datum = crs,
default = TRUE)
)
}
```

```{r}
readme2pkg::chunk_variants_to_dir(chunk_name = "stat_region_template",
file_name = "stat_county",
replace1 = "scope",
replacements1 = "northcarolina",
replace2 = "region",
replacements2 = "county")
```

# test it out

```{r}
source("./R/stat_county")

library(ggplot2)
nc |>
sf::st_drop_geometry() |>
ggplot() +
aes(fips = FIPS) +
stat_county() +
aes(fill = BIR79)
```

### Make derivitive functions, aliases

```{r geom_region_template, eval = F}
geom_region <- stat_region
geom_region_label <- function(...){stat_region(geom = "text",...)}
stamp_region <- function(...){
stat_region(stamp = T,
data = mtcars,
aes(fill = NULL, color = NULL, label = NULL,
fips = NULL, region_name = NULL),
...)}
stamp_region_label <- function(...){
stat_region(stamp = T,
geom = "text",
data = mtcars,
aes(fill = NULL, color = NULL,
fips = NULL, region_name = NULL),
...)}
```

```{r}
readme2pkg::chunk_variants_to_dir(chunk_name = "geom_region_template",
file_name = "geom_county",
replace1 = "region",
replacements1 = "county")
```

# try those out

```{r}
source("./R/geom_county")

nc |>
sf::st_drop_geometry() |>
ggplot() +
aes(fips = FIPS) +
geom_county() +
geom_county_label(check_overlap = T,
color = "grey85") +
aes(fill = BIR79)

last_plot() +
stamp_county() +
stamp_county_label()

ggplot() +
stamp_county()

last_plot() +
stamp_county_label(check_overlap = T)

last_plot() +
stamp_county(keep_id = "Wake", fill = "darkred")
```

# Wanting even more?

## Stamps for each polygon?

```{r stamp_region_location}
#' Title
#'
#' @param ...
#'
#' @return
#' @export
#'
#' @examples
stamp_region_location <- function(...){stamp_region(keep_id = 'Location', ...)}

#' Title
#'
#' @param ...
#'
#' @return
#' @export
#'
#' @examples
stamp_region_label_location <- function(...){stamp_region_label(keep_id = 'Location', ...)}

```

```{r eval=F}
ids <- geo_reference_northcarolina_county$county_name
ids_snake <- tolower(geo_reference_northcarolina_county$county_name) |>
stringr::str_replace_all(" ", "_")

readme2pkg::chunk_variants_to_dir(chunk_name = "stamp_region_location",
file_name = "stamp_county_locations.R",
replace1 = "region",
replacements1 = rep("county", length(ids)),
replace2 = "location",
replacements2 = ids_snake,
replace3 = "Location",
replacements3 = ids)
```

```{r}
source("./R/stamp_county_locations.R")

nc |>
sf::st_drop_geometry() |>
ggplot() +
aes(fips = FIPS) +
stamp_county() +
stamp_county_ashe(fill = "darkred")
```

# Template functions. Some old ideas that we're moving away from.

These are more of an experiment. The code to write a layer is multistep and verbose, so maybe providing some templates is a good idea. But maybe this isn't the right place or implementation.

## `template_compute_panel_code()`

```{r template_compute_panel_code}
template_compute_panel_code <- function(){

"compute_panel_geo_XXXX <- function(data, scales, keep_id = NULL, drop_id = NULL){

if(!is.null(keep_id)){ data <- filter(data, id_col %in% keep_id) }
if(!is.null(drop_id)){ data <- filter(data, !(id_col %in% drop_id)) }

if(!stamp){data <- dplyr::inner_join(data, geo_ref_XXXX)}
if( stamp){data <- geo_ref_XXXX }

data

}" |> cat()

}
```

## `template_stat_code()`

```{r template_stat_code}
template_stat_code <- function(){

'StatXXXXsf <- ggplot2::ggproto(`_class` = "StatXXXXsf",
`_inherit` = ggplot2::Stat,
required_aes = c("fips|county_name|XXXX"),
compute_panel = compute_panel_geo_XXXX,
default_aes = c(label = ggplot2::after_stat(id_col)))' |> cat()
}
```

## `template_layer_code()`

```{r template_layer_code}
template_layer_code <- function(){ 'stat_XXXX <- function(
mapping = NULL,
data = NULL,
geom = ggplot2::GeomSf,
position = "identity",
na.rm = FALSE,
show.legend = NA,
inherit.aes = TRUE,
crs = "NAD27", # "NAD27", 5070, "WGS84", "NAD83", 4326 , 3857
...) {

c(ggplot2::layer_sf(
stat = StatXXXX, # proto object from step 2
geom = geom, # inherit other behavior
data = data,
mapping = mapping,
position = position,
show.legend = show.legend,
inherit.aes = inherit.aes,
params = rlang::list2(na.rm = na.rm, ...)
),

coord_sf(crs = crs,
default_crs = sf::st_crs(crs),
datum = crs,
default = TRUE)
)
}' |> cat()

}

```

# Part II. Packaging and documentation 🚧 ✅

## Phase 1. Minimal working package

### Bit A. Created package archetecture, running `devtools::create(".")` in interactive session. 🚧 ✅

```{r, eval = F}
devtools::create(".")
```

### Bit B. Added roxygen skeleton? 🚧 ✅

Use a roxygen skeleton for auto documentation and making sure proposed functions are *exported*. Generally, early on, I don't do much (anything) in terms of filling in the skeleton for documentation, because things may change.

### Bit C. Managed dependencies ? 🚧 ✅

Package dependencies managed, i.e. `depend::function()` in proposed functions and declared in the DESCRIPTION

```{r, eval = F}
usethis::use_package("sf")
usethis::use_package("dplyr")
usethis::use_package("tibble")
usethis::use_package("tidyr")
usethis::use_package("purrr")

```

### Bit D. Moved functions R folder? 🚧 ✅

Use new {readme2pkg} function to do this from readme...

```{r, eval = F}
readme2pkg::chunk_to_r("sf_df_return_bbox_df")
readme2pkg::chunk_to_r("sf_df_add_xy_center_coords")
readme2pkg::chunk_to_r("sf_df_prep_for_stat")
readme2pkg::chunk_to_r("template_compute_panel_code")
readme2pkg::chunk_to_r("template_stat_code")
readme2pkg::chunk_to_r("template_layer_code")
```

### Bit E. Run `devtools::check()` and addressed errors. 🚧 ✅

```{r, eval = F}
devtools::check(pkg = ".")
```

### Bit F. [Install](https://r-pkgs.org/whole-game.html#install) and restart package 🚧 ✅

```{r, eval = F}
devtools::build()
```

### Bit G. Write traditional README that uses built package (also serves as a test of build. 🚧 ✅

The goal of the {xxxx} package is to ...

Install package with:

```
remotes::installgithub("EvaMaeRey/readme2pkg.template")
```

Once functions are exported you can remove go to two colons, and when things are are really finalized, then go without colons (and rearrange your readme...)

```{r, eval = F}
library(sf2stat) ##<< change to your package name here

nc <- sf::st_read(system.file("shape/nc.shp", package="sf"))

nc |>
dplyr::select(county_name = NAME, fips = FIPS) |>
sf2stat::sf_df_prep_for_stat(id_col_name = "county_name") ->
nc_geo_reference
```

### Bit H. Chosen a [license](https://r-pkgs.org/license.html)? 🚧 ✅

```{r, eval = F}
usethis::use_mit_license()
```

### Bit I. Add [lifecycle badge](https://r-pkgs.org/lifecycle.html) (experimental)

```{r, eval = F}
usethis::use_lifecycle_badge("experimental")
```

## Phase 2: Listen & iterate 🚧 ✅

Try to get feedback from experts on API, implementation, default decisions. Is there already work that solves this problem?

## Phase 3: Let things settle

### Bit A. Settle on [examples](https://r-pkgs.org/man.html#sec-man-examples). Put them in the roxygen skeleton and readme. 🚧 ✅

### Bit B. Written formal [tests](https://r-pkgs.org/testing-basics.html) of functions and save to test that folders 🚧 ✅

That would look like this...

```{r test_calc_times_two_works, eval = F}
library(testthat)

test_that("calc times 2 works", {
expect_equal(times_two(4), 8)
expect_equal(times_two(5), 10)

})
```

```{r, eval = F}
readme2pkg::chunk_to_tests_testthat("test_calc_times_two_works")
```

### Bit C. Added a description and author information in the DESCRIPTION file 🚧 ✅

### Bit D. Addressed *all* notes, warnings and errors. 🚧 ✅

## Phase 4. Promote to wider audience...

### Bit A. Package website built? 🚧 ✅

### Bit B. Package website deployed? 🚧 ✅

## Phase 5: Harden/commit

### Submit to CRAN/RUniverse? 🚧 ✅

# Appendix: Reports, Environment

## Edit Description file

```{r, eval = F}
readLines("DESCRIPTION")
```

## Environment

Here I just want to print the packages and the versions

```{r}
all <- sessionInfo() |> print() |> capture.output()
all[11:17]
```

## `devtools::check()` report

```{r, error = T, results="hide", warning=F, eval=F}
devtools::check(pkg = ".")
```