Code used for species distribution modelling of chagas vectors
https://github.com/adibender/chagas-vector-sdm
# Species distribution models of Chagas vectors

This folder contains code used to fit species distribution models for Chagas vectors as described here: https://www.biorxiv.org/content/10.1101/738310v1.

Note that the article is in revision at PLOS NTD and has been updated since.

**Disclaimer**: Our workflow relies on some data (particularly the satellite data) that require users to sign up to a series of separate open access licences with various organisations. We therefore created a demo (within this README) that can be run directly: the code calls functions that download presence data as well as environmental covariates. The demo illustrates the workflow for one species. All code used for the analyses in the manuscript is also available in the repository and described in this README.

If anything is unclear or not working, please don't hesitate to open up an issue.

# Exemplary analysis

This section illustrates the workflow presented in [Bender, Python, Lindsay, et al. (2019)](https://www.biorxiv.org/content/10.1101/738310v1). Note that full details are given in the manuscript and this code repository (see also the description of the folder structure below the demo analysis).

```r

# for first run install packages from this repository

# devtools::install("mastergrids", dependencies = TRUE)

# devtools::install("tcruziutils", dependencies = TRUE)

# libraries

devtools::load_all("mastergrids")

devtools::load_all("tcruziutils")

library(dplyr)

library(purrr)

# viz

library(ggplot2)

theme_set(theme_bw())

# modeling

library(scam)

# defaults

# this is just the maximum extent of the endemic zone

# used to crop environmental variables, etc.

extent_tcruzi   <- tcruziutils::extent_tcruzi

# country polygons

data(wrld_simpl, package = "maptools")

countries <- wrld_simpl %>% raster::crop(extent_tcruzi)

# colors

Set1   <- RColorBrewer::brewer.pal(9, "Set1")

## set seed as generation of folds is random

set.seed(101850)

```

## Data Import

- In the publication we used the data from [Ceccarelli, Balsalobre, Medone, et al. (2018)](https://www.nature.com/articles/sdata201871) plus a small amount of additional data. The former is openly available ([figshare download link](https://doi.org/10.6084/m9.figshare.c.3946936), Excel file).

- The code below downloads and imports the file and performs some preprocessing.

```r

library(httr)

GET("https://ndownloader.figshare.com/files/10302303",

  write_disk(tf <- tempfile(fileext = ".xls")))

```

```

## Response [https://s3-eu-west-1.amazonaws.com/pstorage-npg-968563215/10302303/SciDataData_CitationCeccarellietal.2018.xls]

##   Date: 2020-05-10 16:54

##   Status: 200

##   Content-Type: binary/octet-stream

##   Size: 5.74 MB

##   /tmp/Rtmpbi1Llj/file32cc611e8598.xls

```

```r

df <- readxl::read_excel(tf, 1L, na = c("", " ", "NR", "NA"))

presence_vector <- df %>%

  mutate(

    area = 25L, # analysis will be performed at 5x5 square km resolution

    Start_year = as.integer(substr(year, 1L, 4L)),

    End_year   = as.integer(substr(year, 6L, 9L)))

presence_vector$reference <- df[[17]]

presence_vector[, 17] <- NULL

presence_vector <- presence_vector %>%

  mutate(

    Public_year = as.integer(stringr::str_extract(.data$reference, "[0-9]{4}")),

    Public_year = ifelse(Public_year > 2018L, NA, Public_year),

    End_year    = ifelse(is.na(End_year), Start_year, End_year)) %>%

  select(scientificName, Start_year, End_year, Public_year, area,

    individualCount, starts_with("decimal"), habitat, reference) %>%

  rename(

    species    = scientificName,

    Latitude   = decimalLatitude,

    Longitude  = decimalLongitude,

    n_observed = individualCount) %>%

  rename_all(~tolower(sub(" ", "_", .))) %>%

  select(-reference)

```
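To make the year handling in the block above concrete, here is a minimal base R sketch on made-up values. It assumes (this is not confirmed by the data description) that the `year` column holds either a single year such as `"2005"` or a range such as `"2003-2006"`; the toy vector `year` is hypothetical and not taken from the data set.

```r
# hypothetical example values, not taken from the downloaded file
year <- c("2005", "2003-2006", NA)

# characters 1-4 hold the first year, characters 6-9 the second one (if any)
start_year <- as.integer(substr(year, 1L, 4L))
end_year   <- as.integer(substr(year, 6L, 9L))

# single-year entries have no second year, so fall back to the start year
end_year <- ifelse(is.na(end_year), start_year, end_year)

data.frame(year, start_year, end_year)
```

Entries without any year information stay `NA` at this point; these are the rows handled by the imputation step.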

- imputation of the "year" variable if missing

```r

presence_vector <- presence_vector %>%

  rename_all(tolower) %>%

  filter(!(is.na(start_year) & is.na(public_year))) %>%

  filter(start_year <= end_year | is.na(start_year) | is.na(end_year)) %>%

  filter(start_year <= public_year | is.na(start_year) | is.na(public_year)) %>%

  filter(start_year >= 2000 | is.na(start_year)) %>%

  filter(public_year >= 2000 | is.na(public_year)) %>%

  filter(end_year <= public_year | is.na(end_year) | is.na(public_year)) %>%

  mutate(

    imputed = is.na(start_year),

    id      = row_number())

# impute start and end year

lm_start <- scam(start_year ~  s(public_year, bs = "mpi"),

  data = presence_vector)

lm_end <- scam(end_year ~ s(start_year, bs = "mpi") +

  s(I(public_year - start_year), bs = "mpi"), data = presence_vector)

presence_vector <- presence_vector %>%

  mutate(start_year = ifelse(is.na(start_year),

    as.integer(floor(predict(lm_start, .))), start_year)) %>%

  mutate(end_year = ifelse(is.na(end_year),

    as.integer(ceiling(predict(lm_end, .))), end_year)) %>%

  # few imputed end year larger than public year

  mutate(end_year = pmin(end_year, public_year))  %>%

  # set end_year to start year if still na (not imputed b/c public year = NA)

  mutate(end_year = if_else(is.na(end_year), start_year, end_year))

```

- remove observations before 2000

```r

presence_vector <- presence_vector %>%

  filter(start_year >= 2000)

```

## Adding spatial/environmental covariates to data set

- In the publication we used raster files available from the servers of the [Malaria Atlas Project](https://malariaatlas.org/).

- The environmental variables were extracted for each observation based on the (imputed) year of observation.

- Here, we use the `raster::getData()` function to obtain the environmental covariates (we use the same covariate layers for all years).

```r

covs <- raster::getData("worldclim", var = "bio", res = 5)[[1:19]]

env_grids <- raster::crop(covs, extent_tcruzi)

# combine presence + covariate data

covs_ex <- raster::extract(env_grids, as_spatial(presence_vector))

presence_vector <- cbind(presence_vector, covs_ex)

```

## Workflow for one species

The procedure below was iterated over all species (with enough observations) and consists of the following steps:

1. Create a presence/background column for the species of interest (here `Panstrongylus megistus`)

2. Split the data into

  - "training/test" (using spatial blocks/spatial CV), used later for model selection (folds 1-4) and to assess the model's "extrapolation performance"

  - "evaluation" (random subsample), used later to assess the model's "interpolation performance"

3. Set up the spatial blocks/spatial CV using the function `get_sp_folds`, which is a wrapper around the package [**`blockCV`**](https://github.com/rvalavi/blockCV) ([Valavi, Elith, Lahoz-Monfort, et al., 2018](https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.13107))

4. Fit the model on the "training data" (spatial folds 1-4). Here, for conciseness, we perform a very narrow model comparison of only two competing models.

5. Assess the model's performance on data not used during model selection/fit

### 1. Create presence/background dummy

```r

presence_vector <- presence_vector %>%

  mutate(presence = 1L * (species == "Panstrongylus megistus"))

table(presence_vector$presence)

```

```

##

##     0     1

## 13412  1184

```

### 2. Split data

```r

it_pv <- presence_vector %>%

  rsample::initial_split(strata = "species", prop = 4 / 5)

# evaluation data (only used at the very end for the "interpolation error" estimate)

evaluation_df <- rsample::testing(it_pv)

# train test is split in 5 spatial folds

train_test_df <- rsample::training(it_pv)

```

### 3. Create spatial folds

```r

sp_tt <- as_spatial(train_test_df)

raster::crs(countries) <- raster::crs(sp_tt)

sp_fold_pans_meg <- get_sp_folds(

  data        = sp_tt,

  species     = "Panstrongylus megistus",

  mask        = countries,

  species_var = "species",

  n_blocks    = 50, # number of blocks

  k           = 5, # number of folds

  width       = 5, # width of extended hull (outside of observed)

  calc_range  = FALSE)

```

```

## The best folds was in iteration 84:

##   train_0 train_1 test_0 test_1

## 1    2800     795    485    144

## 2    2574     763    711    176

## 3    2505     706    780    233

## 4    2555     718    730    221

## 5    2706     774    579    165

```

```r

# graphical depiction of spatial folds + presence/background

# within extended hull

tmap_cv(

  sp_fold_pans_meg,

  countries = countries)

```







### 4. Fit the model

- In the paper we run a CV for different model specifications on folds 1-4 (fit on 3 folds, evaluate on the held-out 4th fold).

- The best setting/model w.r.t. average performance over the 4 CV runs is then refit on folds 1-4 and evaluated using data from fold 5 (this is the AUC value reported in the table).

- This model was also used for evaluation on the random hold-out data (`evaluation_df`) above.

- After evaluation, the final model was obtained by refitting this model on all available data (folds 1-5 and the random evaluation data).

- Here we only compare two different models (one with smooth covariate effects, one with linear effects) to keep code and runtime short.
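The fold logic described in this list can be sketched on simulated data in base R. This is only an illustration under loud assumptions: the data frame `toy`, the covariate `x` and the helper `auc_rank()` are hypothetical, the folds are assigned at random rather than by spatial blocks, and a plain `glm()` stands in for the GAM used in the demo.

```r
set.seed(1)

# simulated presence/background data with 5 (random, non-spatial) folds
n   <- 500
toy <- data.frame(
  x    = rnorm(n),
  fold = sample(rep(1:5, length.out = n)))
toy$presence <- rbinom(n, 1, plogis(-1 + 1.5 * toy$x))

# rank-based (Mann-Whitney) AUC; hypothetical helper, not MLmetrics::AUC
auc_rank <- function(pred, obs) {
  r  <- rank(pred)
  n1 <- sum(obs == 1)
  n0 <- sum(obs == 0)
  (sum(r[obs == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# model selection: CV on folds 1-4, each fold held out once
cv_folds <- 1:4
cv_data  <- toy[toy$fold %in% cv_folds, ]
cv_auc   <- vapply(cv_folds, function(k) {
  fit  <- glm(presence ~ x, family = binomial(),
    data = cv_data[cv_data$fold != k, ])
  held <- cv_data[cv_data$fold == k, ]
  auc_rank(predict(fit, newdata = held, type = "response"), held$presence)
}, numeric(1))
mean(cv_auc) # average CV performance, used to compare model specifications

# refit the selected specification on folds 1-4, evaluate on fold 5
final    <- glm(presence ~ x, family = binomial(), data = cv_data)
test     <- toy[toy$fold == 5, ]
auc_test <- auc_rank(predict(final, newdata = test, type = "response"),
  test$presence)
auc_test
```

In the actual workflow the same loop is run once per candidate model formula, and the specification with the highest mean CV AUC is the one that gets refit.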

#### Fit the GAM

- Create a `formula` that specifies the linear/additive predictor using `make_gam_formula`

```r

# create formula

# make_gam_formula: only vars with unique(var) > 20 used

# formula with smooth effects

mod_formula <-

  make_gam_formula(

    as.data.frame(sp_fold_pans_meg$train),

    candidates=names(env_grids),

    type = "smooth") %>%

  add_gp() # add gaussian process smooth (see ?mgcv::gp.smooth)

mod_formula

```

```

## presence ~ s(bio1, by = NA) + s(bio2, by = NA) + s(bio3, by = NA) +

##     s(bio4, by = NA) + s(bio5, by = NA) + s(bio6, by = NA) +

##     s(bio7, by = NA) + s(bio8, by = NA) + s(bio9, by = NA) +

##     s(bio10, by = NA) + s(bio11, by = NA) + s(bio12, by = NA) +

##     s(bio13, by = NA) + s(bio14, by = NA) + s(bio15, by = NA) +

##     s(bio16, by = NA) + s(bio17, by = NA) + s(bio18, by = NA) +

##     s(bio19, by = NA) + s(longitude, latitude, bs = "gp")

## 

```

```r

# formula with linear effects

mod_formula2 <-

  make_gam_formula(

    as.data.frame(sp_fold_pans_meg$train),

    candidates=names(env_grids),

    type = "linear")

mod_formula2

```

```

## presence ~ bio1 + bio2 + bio3 + bio4 + bio5 + bio6 + bio7 + bio8 +

##     bio9 + bio10 + bio11 + bio12 + bio13 + bio14 + bio15 + bio16 +

##     bio17 + bio18 + bio19

## 

```

- Fit the GAM using **`mgcv`**

- `mgcv::bam` could be replaced by `mgcv::gam`, but it needs less memory and offers a significant speed-up, especially with the `discrete = TRUE` option ([Wood, Li, Shaddick, et al., 2017](https://doi.org/10.1080/01621459.2016.1195744))

- Set `discrete = FALSE` when calling the `predict` function to obtain smoother predictions

```r

# see ?mgcv::bam for references

folds <- 1:4

train <- as.data.frame(sp_fold_pans_meg$train)

##  create 4 models (each without one of the folds)

# smooth effects

models_smooth <- purrr::map(folds,

  ~ mgcv::bam(mod_formula, as.data.frame(train[train$fold != .x, ]),

      family = "binomial", method = "fREML", discrete = TRUE, gamma = 2L))

# linear effects

models_linear <- purrr::map(folds,

  ~ mgcv::bam(mod_formula2, as.data.frame(train[train$fold != .x, ]),

      family = "binomial", method = "fREML", gamma = 2L))

# auc for each model

auc_smooth <- map_dbl(

  folds,

  ~{

    predicted <- predict(models_smooth[[.x]], train[train$fold == .x, ], discrete = FALSE)

    observed <- train[train$fold == .x, "presence"]

    MLmetrics::AUC(predicted, observed)

  })

# AUC linear models

auc_linear <- map_dbl(

  folds,

  ~{

    predicted <- predict(models_linear[[.x]], train[train$fold == .x, ], discrete = FALSE)

    observed <- train[train$fold == .x, "presence"]

    MLmetrics::AUC(predicted, observed)

  })

# comparison

mean(auc_smooth)

```

```

## [1] 0.8603036

```

```r

mean(auc_linear)

```

```

## [1] 0.8401573

```

```r

# in this case we would select the model with smooth effects of covariates

## Refit model for evaluation

mod <- mgcv::bam(

  formula  = mod_formula,

  data     = as.data.frame(sp_fold_pans_meg$train),

  family   = binomial(),

  method   = "fREML", # fast REML

  discrete = TRUE, # speeds up computation

  gamma    = 2L)

```

### 5. Evaluate the model

```r

prediction_test <- predict(

  mod,

  newdata  = as.data.frame(sp_fold_pans_meg$test),

  type     = "response",

  discrete = FALSE)

prediction_eval_df <- predict(

  mod,

  newdata = evaluation_df,

  type = "response",

  discrete = FALSE

)

# evaluation w.r.t. extrapolation (i.e. fold 5)

MLmetrics::AUC(prediction_test, sp_fold_pans_meg$test$presence)

```

```

## [1] 0.8935416

```

```r

# evaluation w.r.t. interpolation (i.e. random hold-out data)

MLmetrics::AUC(prediction_eval_df, evaluation_df$presence)

```

```

## [1] 0.9632264

```

### 6. Refit model on all data

```r

# extract data points within extended hull of Panstrongylus megistus

df_all <- presence_vector %>%

  as_spatial() %>%

  raster::crop(sp_fold_pans_meg$hull)

# refit model with all data

mod_all <- update(mod, data = as.data.frame(df_all))

```

### 7. Visualize results

- Final prediction:

```r

# newdata with covariate values for each 5x5 pixel within hull of Panstrongylus megistus

env_grids <- env_grids %>%

  raster::crop(sp_fold_pans_meg$hull) %>%

  raster::mask(sp_fold_pans_meg$hull)

ndf <- grid_to_df(env_grids)

# calculate predictions, set discrete = FALSE to obtain smoother predictions

prediction <- predict(mod_all, newdata = ndf, type = "link", discrete = FALSE,

  se = TRUE)

ndf$prediction <- exp(prediction$fit)/(1 + exp(prediction$fit))

# calculate CI

ndf$se <- prediction$se

ci_lower <- prediction$fit - 2*prediction$se

ci_upper <- prediction$fit + 2*prediction$se

ndf$ci_lower <- exp(ci_lower)/(1 + exp(ci_lower))

ndf$ci_upper <- exp(ci_upper)/(1 + exp(ci_upper))

ndf$ci <- ndf$ci_upper - ndf$ci_lower

# retransform df to raster for plotting

pred_raster <- df_to_grid(ndf, env_grids[[1]], "prediction")

tm_shape(raster::crop(countries, raster::extent(sp_fold_pans_meg$hull))) +

  tm_borders() +

  tm_shape(pred_raster) +

  tm_raster(style = "cont", palette = viridis::magma(1e3),

    breaks = seq(0, 1, by = .2), alpha = .8)

```
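The interval computation in the block above works on the link (logit) scale and back-transforms the bounds afterwards, which keeps them inside [0, 1]. A minimal base R sketch with made-up numbers follows; `fit` and `se` are hypothetical stand-ins for `prediction$fit` and `prediction$se`, and `inv_logit()` is a helper defined only for this illustration. `stats::plogis()` is the built-in inverse logit and should give the same result as `exp(x) / (1 + exp(x))`.

```r
# hypothetical values on the logit scale, not output of the fitted model
fit <- c(-3, 0, 2.5)    # linear predictor
se  <- c(0.8, 0.3, 1.2) # standard errors of the linear predictor

# the transformation written out as in the demo code
inv_logit <- function(x) exp(x) / (1 + exp(x))

# build the interval on the link scale, then transform to probabilities
prob     <- plogis(fit)
ci_lower <- plogis(fit - 2 * se)
ci_upper <- plogis(fit + 2 * se)

all.equal(prob, inv_logit(fit)) # both transformations agree
ci_upper - ci_lower             # interval width on the probability scale
```

Because the transformation is non-linear, the intervals are not symmetric around the predicted probability.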







- Bivariate map (not well supported in R at the moment, so manual hacks are required; alternatively, one could predict the upper/lower CI and plot the CI alongside the prediction)

```r

## Note, this is just for illustration. Specific cut-offs and color palette

# for bivariate maps were used in the publication

# create map + legend

# cut points could be specified

bivar_map <- tm_bivariate(ndf, env_grids[[1]], sp_fold_pans_meg)

# draw figure, x and y control position of legend

tm_bivar_draw(bivar_map, x = .55, y = .05)

```







## References

[Bender, A, A. Python, S. Lindsay, et al.](#cite-bender_modelling_2019) (2019). "Modelling geospatial distributions of the triatomine vectors of Trypanosoma cruzi in Latin America". In: _bioRxiv_, p. 738310. DOI: [10.1101/738310](https://doi.org/10.1101%2F738310). URL: [https://www.biorxiv.org/content/10.1101/738310v1](https://www.biorxiv.org/content/10.1101/738310v1) (visited on Aug. 17, 2019).

[Ceccarelli, S, A. Balsalobre, P. Medone, et al.](#cite-ceccarelli_datatri_2018) (2018). "DataTri, a database of American triatomine species occurrence". In: _Scientific Data_ 5, p. 180071. ISSN: 2052-4463. DOI: [10.1038/sdata.2018.71](https://doi.org/10.1038%2Fsdata.2018.71). URL: [https://www.nature.com/articles/sdata201871](https://www.nature.com/articles/sdata201871) (visited on Nov. 14, 2018).

[Valavi, R, J. Elith, J. J. Lahoz-Monfort, et al.](#cite-valavi_blockcv_2018) (2018). "blockCV: An r package for generating spatially or environmentally separated folds for k-fold cross-validation of species distribution models". In: _Methods in Ecology and Evolution_ 0.0. ISSN: 2041-210X. DOI: [10.1111/2041-210X.13107](https://doi.org/10.1111%2F2041-210X.13107). URL: [https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.13107](https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.13107) (visited on Feb. 13, 2019).

[Wood, S. N, Z. Li, G. Shaddick, et al.](#cite-wood_gigadata_2017) (2017). "Generalized Additive Models for Gigadata: Modeling the U.K. Black Smoke Network Daily Data". In: _Journal of the American Statistical Association_ 112.519, pp. 1199-1210. ISSN: 0162-1459. DOI: [10.1080/01621459.2016.1195744](https://doi.org/10.1080%2F01621459.2016.1195744). URL: [https://doi.org/10.1080/01621459.2016.1195744](https://doi.org/10.1080/01621459.2016.1195744) (visited on May 31, 2018).

## Session Info

```r

sessionInfo()

```

```

## R version 4.0.0 (2020-04-24)

## Platform: x86_64-pc-linux-gnu (64-bit)

## Running under: Ubuntu 18.04.4 LTS

##

## Matrix products: default

## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1

## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

##

## locale:

##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C

##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8

##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8

##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C

##  [9] LC_ADDRESS=C               LC_TELEPHONE=C

## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

##

## attached base packages:

## [1] stats     graphics  grDevices utils     datasets  methods   base

##

## other attached packages:

##  [1] sf_0.9-3          httr_1.4.1        scam_1.2-5        mgcv_1.8-31

##  [5] nlme_3.1-147      ggplot2_3.3.0     purrr_0.3.4       dplyr_0.8.99.9002

##  [9] tcruziutils_0.0.7 mastergrids_0.0.3 RefManageR_1.2.12 knitr_1.28

##

## loaded via a namespace (and not attached):

##   [1] readxl_1.3.1          spam_2.5-1            backports_1.1.6

##   [4] lwgeom_0.2-3          plyr_1.8.6            lazyeval_0.2.2

##   [7] sp_1.4-1              splines_4.0.0         crosstalk_1.1.0.1

##  [10] listenv_0.8.0         leaflet_2.0.3         gstat_2.0-5

##  [13] usethis_1.6.1         digest_0.6.25         foreach_1.5.0

##  [16] htmltools_0.4.0       viridis_0.5.1         fansi_0.4.1

##  [19] magrittr_1.5          checkmate_2.0.0       memoise_1.1.0

##  [22] MLmetrics_1.1.1       tensor_1.5            remotes_2.1.1

##  [25] recipes_0.1.12        globals_0.12.5        gower_0.2.1

##  [28] xts_0.12-0            rsample_0.0.6         prettyunits_1.1.1

##  [31] colorspace_1.4-1      rgdal_1.4-8           xfun_0.13

##  [34] leafem_0.1.1          callr_3.4.3           crayon_1.3.4

##  [37] jsonlite_1.6.1        spatstat_1.63-3       spatstat.data_1.4-3

##  [40] zoo_1.8-8             survival_3.1-12       iterators_1.0.12

##  [43] glue_1.4.0            stars_0.4-1           polyclip_1.10-0

##  [46] pals_1.6              gtable_0.3.0          ipred_0.9-9

##  [49] pkgbuild_1.0.8        maps_3.3.0            abind_1.4-5

##  [52] scales_1.1.0          mvtnorm_1.1-0         DBI_1.1.0

##  [55] bibtex_0.4.2.2        Rcpp_1.0.4.6          viridisLite_0.3.0

##  [58] progress_1.2.2        units_0.6-6           mapproj_1.2.7

##  [61] dotCall64_1.0-0       Formula_1.2-3         intervals_0.15.2

##  [64] stats4_4.0.0          lava_1.6.7            prodlim_2019.11.13

##  [67] dismo_1.1-4           htmlwidgets_1.5.1     FNN_1.1.3

##  [70] RColorBrewer_1.1-2    geosphere_1.5-10      ellipsis_0.3.0

##  [73] farver_2.0.3          reshape_0.8.8         pkgconfig_2.0.3

##  [76] XML_3.99-0.3          nnet_7.3-14           deldir_0.1-25

##  [79] caret_6.0-86          tidyselect_1.0.0      rlang_0.4.6.9000

##  [82] reshape2_1.4.4        tmaptools_3.0         cellranger_1.1.0

##  [85] munsell_0.5.0         tools_4.0.0           xgboost_1.0.0.2

##  [88] cli_2.0.2             generics_0.0.2        devtools_2.3.0

##  [91] evaluate_0.14         stringr_1.4.0         goftest_1.2-2

##  [94] ModelMetrics_1.2.2.2  processx_3.4.2        leafsync_0.1.0

##  [97] fs_1.4.1              timereg_1.9.4         blockCV_2.1.1

## [100] pec_2019.11.03        pbapply_1.4-2         future_1.17.0

## [103] xml2_1.3.2            compiler_4.0.0        rstudioapi_0.11

## [106] curl_4.3              png_0.1-7             e1071_1.7-3

## [109] testthat_2.3.2        spatstat.utils_1.17-0 spacetime_1.2-3

## [112] tibble_3.0.1          stringi_1.4.6         highr_0.8

## [115] ps_1.3.3              desc_1.2.0            fields_10.3

## [118] rgeos_0.5-3           lattice_0.20-41       Matrix_1.2-18

## [121] classInt_0.4-3        vctrs_0.3.0           pillar_1.4.4

## [124] lifecycle_0.2.0       furrr_0.1.0           pammtools_0.2.2

## [127] cowplot_1.0.0         data.table_1.12.8     raster_3.1-5

## [130] R6_2.4.1              gridExtra_2.3         KernSmooth_2.23-17

## [133] sessioninfo_1.1.1     codetools_0.2-16      dichromat_2.0-0

## [136] MASS_7.3-51.6         assertthat_0.2.1      pkgload_1.0.2

## [139] rprojroot_1.3-2       withr_2.2.0           hms_0.5.3

## [142] parallel_4.0.0        grid_4.0.0            rpart_4.1-15

## [145] timeDate_3043.102     tidyr_1.0.3           class_7.3-17

## [148] automap_1.0-14        tmap_3.0              pROC_1.16.2

## [151] numDeriv_2016.8-1.1   lubridate_1.7.8       base64enc_0.1-3

```

# Folder structure

Overview of project files and folders:

- Preliminary notes:

    + The initial steps of this project require access to the **`mastergrids`** folder at the Malaria Atlas Project and will thus not be fully reproducible

- **`mastergrids`**: An **`R`** package that facilitates the import of environmental raster data from the **`mastergrids`** folder at BDI MAP (the data import won't work outside of the BDI/without connection to mastergrids). It also contains two utility functions, `grid_to_df` and `df_to_grid`, which convert a RasterLayer/RasterBrick object to a data frame and vice versa.

- **`tcruziutils`**: An **`R`** package that facilitates all steps of the analysis. Most functions are specific to this project and should not be used for general purpose projects. It can be loaded at the beginning of the script using `devtools::load_all("/tcruziutils")` or installed via `devtools::install("/tcruziutils")` and then loaded as usual with `library(tcruziutils)`

- **`infection data`** (not included): Contains data on infection prevalence and presence in vectors and humans

    + **`External vector database`**: Additional data set (compiled by another research group) that contains *presence only* data on different vector species

- **`endemic zone`** (not included): Contains a shapefile that defines the endemic zone of the disease ("mask"). Can be used to crop environmental grid data and other spatial objects. The spatial extent defined by this mask is stored in the **`tcruziutils`** package as `raster::extent` and `sp::bbox` objects for convenience (see `tcruziutils::extent_tcruzi` and `tcruziutils::bbox_tcruzi`).

- **`polygon boundaries`** (not included): Shapefiles containing polygon boundaries on administrative district levels 1 and 2. Used to extract location/area and polygon information based on the GAUL code. Cropped according to the endemic zone and stored as `shp_admin_1.Rds` and `shp_admin_2.Rds` in the **`preprocessing`** folder

- **`preprocessing`**: This folder contains the main pre-processing scripts and stores the pre-processed data sets that will be used for modeling

    + `import.R`: Initial import of the presence/prevalence data. For observations recorded on a polygon level, also adds the centroid and area information of the respective polygon to the data (see function `add_spat_dat` in the **`tcruziutils`** package)

    + `prep-for-modeling.Rmd`: Builds on the `import.R` script and pre-processes the initially imported data for modeling purposes. This includes:

      - application of inclusion/exclusion criteria

      - imputation of (some) missing data

      - addition of covariate layers to the observed presence/absence data (based on coordinates; see `raster::extract` and `tcruziutils::add_grid_data`)

      - splitting the data set into *train*/*test* data (and possibly *evaluation* data), and creating a block-wise cross-validation scheme, stored as the `fold` variable in the original data

      - some additional visualizations (stored in `tcruzi/figures` as `pdf` and `png`)

- **`modeling`**:

    + **`vector_occurrence-sdm`**: Folder containing Species Distribution Models (SDM) based on presence-only data (see the README therein)
