Run global-scale MaxEnt model

Overview

In the last vignette, I produced the background point dataset for the global-scale model. In this vignette, I will start training MaxEnt models and obtaining the outputs I will need for my analysis! This vignette will train the global-scale model and the next few vignettes will train the three regional-scale models. This is the first step toward my goal of analyzing the realized niche of Lycorma delicatula at multiple spatial scales. The output from this global-scale model will be compared directly against the output from an ensemble of regional-scale models to determine the impact of climate change on the suitability of SLF in important viticultural regions. I hypothesize that by creating models at multiple spatial scales, I can have more confidence in suitability predictions made for specific SLF populations and important viticultural regions, and can predict the risk of specific populations spreading now and under climate change. I also hypothesize that the regional-scale models, once ensembled into a single prediction, will make more refined predictions than a global-scale model could.

I begin setup for this model by associating the presence data with the covariates in a data frame that will be fed into MaxEnt. I also separate these data into random folds for cross-validation. Once these are done, I can train the MaxEnt model. Outputs from this model will include a list of summary statistics, suitability maps and point-wise suitability values for all SLF presences and for important viticultural regions. I will also create projected map and suitability outputs for the 3 CMIP6 scenarios I plan to use in my analysis.

This vignette has 3 subvignettes (051-053) that each outline and comment the basic workflow for the 3 R functions created for use in this vignette. These functions include compute_MaxEnt_summary_statistics_CV(), create_MaxEnt_suitability_maps_CV(), and predict_xy_suitability_CV(). Please see these subvignettes for a step-by-step guide to the creation of these functions. This package also has versions of the same functions that are designed for MaxEnt models without cross-validation. These share the same name but without the _CV ending and will be used for the regional models.

Setup

CURRENT MODEL VERSION: v4

# general tools
library(tidyverse)  #data manipulation
library(here) #making directory pathways easier on different instances
# here::here() starts at the root folder of this package.
library(devtools)

# SDMtune and dependencies
library(SDMtune) # main package used to run SDMs
library(dismo) # package underneath SDMtune
library(rJava) # for running MaxEnt
library(plotROC) # plots ROCs

# spatial data handling
library(raster) 
library(terra) 
library(sf)

library(viridis)

# you should also build and install this package 
library(scari)

library(blockCV) # appendix, not necessary to run this vignette

Note: I will be setting the global options of this document so that only certain code chunks are rendered in the final .html file. I will set the eval = FALSE so that none of the code is re-run (preventing files from being overwritten during knitting) and will simply overwrite this in chunks with plots.

SDMtune will run MaxEnt through java via the rJava package. You will need to ensure that your machine has the proper version of java installed (x32 or x64) for your operating system.

checkMaxentInstallation(verbose = TRUE)

## ✔ Java is installed.

## ✔ The packege rJava is installed.

## ✔ The file maxent.jar is present in the correct folder.

## [1] TRUE

This chunk sets the java memory allocation (Xmx). I will increase the memory allocation from 512m (the default) to 2GB of memory.

# xmx sets java memory allocation
options(java.parameters = "-Xmx2048m")

# xss sets java stack size
# options(java.parameters = c("-Xss2560k", "-Xmx2048m"))

1. Format Data for Global Model

I will load in the datasets I will need for the MaxEnt models. These are labeled at the beginning of the object name by the parameter they will be used in the SDMtune::train() function (x, p or a). I will begin by loading in the covariate data and then by loading in the points datasets.

1.1 Import data

First, I will load in the SLF presence dataset created in vignette 020 and the background points dataset created in vignette 040.

# slf presences
p_slf_points <- readr::read_rds(file = file.path(here::here(), "data", "slf_all_coords_final_2025-07-07.rds")) %>%
  dplyr::select(-species) 

# entire eastern USA background point set
a_global_background_points <- readr::read_rds(file = file.path(here::here(), "data", "global_background_points_v4.rds"))

Next, I will load in and stack the covariates needed for the model, which were trimmed in vignette 030.

# path to directory
mypath <- file.path(here::here() %>% 
                       dirname(),
                     "maxent/historical_climate_rasters/chelsa2.1_30arcsec")

# the env covariates used to train the model global
x_global_hist_env_covariates_list <- list.files(path = file.path(mypath, "v1_maxent_10km"), pattern = "\\.asc$", full.names = TRUE) %>%
    grep("bio2_1981-2010_global.asc|bio11_1981-2010_global.asc|bio12_1981-2010_global.asc|bio15_1981-2010_global.asc", ., value = TRUE)

I will load in 3 sets of CMIP6 covariates, 1 per ssp scenario (126/370/585).

# path to directory
  mypath <- file.path(here::here() %>% 
                       dirname(),
                     "maxent/future_climate_rasters/chelsa2.1_30arcsec")



# SSP126
x_global_126_env_covariates_list <- list.files(
  path = file.path(mypath, "2041-2070_ssp126_GFDL", "v1_maxent_10km"), pattern = "\\_global.asc$", full.names = TRUE
  ) 


# SSP370
x_global_370_env_covariates_list <- list.files(
  path = file.path(mypath, "2041-2070_ssp370_GFDL", "v1_maxent_10km"), pattern = "\\_global.asc$", full.names = TRUE
  )

## SSP585
x_global_585_env_covariates_list <- list.files(
  path = file.path(mypath, "2041-2070_ssp585_GFDL", "v1_maxent_10km"), pattern = "\\_global.asc$", full.names = TRUE
  )

I will import rasters of the environmental covariates, stack them and gather summary statistics. I will also shorten their names and exclude possible operators from layer names (for example, using the dash symbol was found to interfere with SDMtune making predictions for tables downstream).

# layer name object. Check order of layers first
env_layer_names <- c("bio11", "bio12", "bio15", "bio2")

# stack env covariates
x_global_hist_env_covariates <- terra::rast(x = x_global_hist_env_covariates_list)

# attributes
nlyr(x_global_hist_env_covariates)
names(x_global_hist_env_covariates)
minmax(x_global_hist_env_covariates) 
ext(x_global_hist_env_covariates)
crs(x_global_hist_env_covariates)


# I will change the name of the variables because they are throwing errors in SDMtune
names(x_global_hist_env_covariates) <- env_layer_names
# confirmed- SDMtune doesnt like dashes in column names (it is read as a mathematical operation)

Now perform the same operation but times 3 for the CMIP6 models.

# stack env covariates
x_global_126_env_covariates <- terra::rast(x = x_global_126_env_covariates_list)

# attributes
nlyr(x_global_126_env_covariates)
names(x_global_126_env_covariates)
minmax(x_global_126_env_covariates)
# ext(x_global_126_env_covariates)
# crs(x_global_126_env_covariates)

names(x_global_126_env_covariates) <- env_layer_names


# stack env covariates
x_global_370_env_covariates <- terra::rast(x = x_global_370_env_covariates_list)

# attributes
nlyr(x_global_370_env_covariates)
names(x_global_370_env_covariates)
minmax(x_global_370_env_covariates)
# ext(x_global_370_env_covariates)
# crs(x_global_370_env_covariates)

names(x_global_370_env_covariates) <- env_layer_names


# stack env covariates
x_global_585_env_covariates <- terra::rast(x = x_global_585_env_covariates_list)

# attributes
nlyr(x_global_585_env_covariates)
names(x_global_585_env_covariates)
minmax(x_global_585_env_covariates)
# ext(x_global_585_env_covariates)
# crs(x_global_585_env_covariates)

names(x_global_585_env_covariates) <- env_layer_names

rm(x_global_hist_env_covariates_list)
rm(x_global_126_env_covariates_list)
rm(x_global_370_env_covariates_list)
rm(x_global_585_env_covariates_list)

1.2 create input data objects

I will need to combine the above covariate, presence and background point datasets into a single dataset to feed into MaxEnt (an SWD, samples with data object). This dataset will need to contain point-wise values for each of the predictor covariates at the presence and background point coordinates. SDMtune takes an SWD object for this purpose, containing the presences and background points with associated covariate data to be fed into the model, so I will create this now.

global_SWD <- SDMtune::prepareSWD(
  species = "Lycorma delicatula",
  env = x_global_hist_env_covariates,
  p = p_slf_points,
  a = a_global_background_points,
  verbose = TRUE # print helpful messages
  )

global_SWD@coords # coordinates
global_SWD@pa # presence / absence (background counted as absence)
global_SWD@data # extracted data from

I usually look at how many records were dropped before I save it, because SDMtune gives an undefined warning about how many samples were discarded. In this case, we only lost 2 presence records, so this is acceptable. I will also save the output to be used later.

SDMtune::swd2csv(swd = global_SWD, file_name = c(
  file.path(here::here(), "vignette-outputs", "data-tables", "global_slf_presences_with_data_v4.csv"),
  file.path(here::here(), "vignette-outputs", "data-tables", "global_background_points_with_data_v4.csv")
  ))

Create training / test split

I will randomly split the presences into training and testing, using 80% of the points for training and 20% for testing. I will then use SDMtune::randomFold() to split the training data into 10 partitions for cross-validation. This method was loosely adapted from Srivastava et.al, 2021.

In my first attempts, I tried a blocked cross-validation methodology using the function blockCV::cv_spatial(). I attempted it because many papers say that spatially segmenting the k-folds is a more rigorous method than simple random k-fold. However, this method did not work for my dataset and left many test folds with 0 presence points. I believe this is due to both the spatial scale I am working in for this model (global) and the comparatively narrow species range (limited to only east Asia and the USA). The block size that reduces autocorrelation was about 1.5 million meters, but this made the blocks too large for each block to receive adequate species presences. I also attempted spatial blocking by distance (cv_cluster()), but this yielded similar results. See the Appendix in this file for code.

Let’s create the training/testing data split.

global_trainTest <-  SDMtune::trainValTest(
  x = global_SWD,
  test = 0.2, # set used for predicting maps and suitability values
  only_presence = TRUE, # only split presences and recycle bg points
  seed = 3
)

# separate off training data
global_train <- global_trainTest[[1]]
global_test <- global_trainTest[[2]]

global_train@coords # coordinates
global_train@pa # presence / absence (background counted as absence)
global_train@data # extracted data from

Now split the training data for 10-fold random cross validation.

# create random folds
global_trainFolds <- SDMtune::randomFolds(
  data = global_train,
  k = 10, # 10 folds
  only_presence = TRUE,
  seed = 4
)

2. Train & Tune Global Model

First, I will train a model with 10 cross-validated replicates. The model will only be trained on 80% of the SLF presence data, with the other 20% being held apart and used downstream for analysis. The default settings will be used for MaxEnt, apart from the following changes:

ALL feature types used (l = linear, q = quadratic, p = product, h = hinge, t = threshold)
replicates = 10
iterations = 5000. This is the max number of iterations for the optimization algorithm to perform before stopping training. Increasing this number from the default of 500 allows the algorithm to make more refined predictions.

I will use these hyperparameters to train the initial model, but these parameters may change based on model tuning, which I will perform next.

global_model <- SDMtune::train(
  method = "Maxent",
  data = global_train,
  folds = global_trainFolds, # 10 folds for dataset
  fc = "lqpht", # feature classes set to ALL
  iter = 5000, # number of iterations
  progress = TRUE
)

2.1 Tuning MaxEnt Model

Next, I will perform model tuning by varying specific hyperparameters. I selected the hyperparameters that I thought were best for the model based on numerous papers. However, modeling papers for other species have made different choices for the parameter selection that I did not. Luckily, SDMtune makes it easy to select the best fit set of hyperparameters from a range of possibilities using the gridSearch() function. This function trains a different model for every possible combination of hyperparameters that is listed. I can then pick the best model based on AUC, TSS or AICc. I selected these hyperparameters for specific reasons listed in the literature.

Feature classes: h, qpht, lqpht Feature classes are important because they govern how your environmental covariates are transformed by the algorithm (for example, if most presences fall beyond a specific point on the range of values, the algorithm might determine that a threshold feature is most appropriate.)

Elith et.al, 2011 used only hinge features or all features and showed that hinge features alone may perform better. They referred to a MaxEnt model using only hinge features as resembling an additive model, not unlike a GAM.
The supplemental materials from the same paper demonstrated that linear features are redundant if also using hinge features.
Most SLF modeling papers, including Huron et.al and others for SLF, simply use all features.

Regularization Multiplier: 0.25, 0.5, 1, 1.5, 2 Regularization penalizes overfitting in the model. If a model is too fit to a range of covariate values, to the point that this model could not be applied to another area with other values, it receives a penalty.

Radosavljevic and Anderson 2014 used a range of regularization multipliers, from 0.25 to 2, to test for the most appropriate value, which I will adopt.

I will compare the models with the highest test AUC and TSS values. I will also save all of the model variations in a .rds object.

# the arguments that can be tuned later
getTunableArgs(global_model)

# list of hyperparameters
hypers <- list(
  reg = c(0.25, 0.5, 1, 1.5, 2), # regularization multiplier
  fc = c("h", "qpht", "lqpht") # feature classes
  )

if (FALSE) {

  # Train a model with all possible combinations of hyperparameters
  global_model_tuned_auc <- SDMtune::gridSearch(
    model = global_model, 
    hypers = hypers, # hyperparameter list
    metric = "auc",
    save_models = TRUE,
    interactive = TRUE, 
    progress = TRUE
    )
  
  
  # show tuning results
  global_model_tuned_auc@results
  # select number of best model
  best.model.auc <- which.max(global_model_tuned_auc@results$test_AUC)
  
}

Based on the test AUC metric, the highest value was attributed to two models: model 11 with regularization multiplier of 1.5 and qpht features, and model 7 with a reg multiplier of 1 and only hinge features. Model 11 was chosen more frequently across previous model versions, so I will choose this model for the next step of tuning. I will also use this model as the base model for all future models performed for SLF at the regional scale. Lets also check the TSS to be sure.

if (FALSE) {

  # Train a model with all possible combinations of hyperparameters
  global_model_tuned_tss <- SDMtune::gridSearch(
    model = global_model, 
    hypers = hypers, # hyperparameter list
    metric = "tss", 
    save_models = TRUE,
    interactive = TRUE, 
    progress = TRUE
    )
  
  
  # show tuning results
  global_model_tuned_tss@results
  # select number of best model
  best.model.tss <- which.max(global_model_tuned_tss@results$test_TSS)
  
}

We get slightly different results based on the test TSS metric. We see that the highest value was attributed to the model with a regularization multiplier of 0.25 and qpht features (model 2).

We will need to choose either model 11 or model 2. Since the difference in test and training TSS are incredibly small between both models (0.001), I will choose model 11 because this model has been consistently more fit across model versions. These tuned parameters will also be applied to all future models performed for SLF at the regional scale.

global_model <- global_model_tuned_auc@models[[11]]

2.2 Summary Statistics

I will apply my function compute_MaxEnt_summary_statistics_CV() to produce all summary statistics for this model. This list includes marginal and univariate response curves, jackknife tests, variable importance, thresholds and confusion matrices, among other statistics. For the complete and annotated workflow used to create this function, see vignettes/051_maxent_workflow.

# output directory
mypath <- file.path(here::here() %>% 
                       dirname(),
                     "maxent/models/slf_global_v4")

scari::compute_MaxEnt_summary_statistics_CV(
  model.obj = global_model, 
  model.name = "global", 
  mypath = mypath, 
  create.dir = FALSE, # create subdirectory
  env.covar.obj = x_global_hist_env_covariates, # env covariates raster stacked
  train.obj = global_train, # training data used to create model
  trainFolds.obj = global_trainFolds,  # k-folds of training data
  test.obj = global_test, # data you wish to use to test the model
  jk.test.type = c("train", "test"), # types of jackknife curves to be created
  plot.type = c("cloglog", "logistic") # types of univariate and marginal response curves to be created
  )

I will now save the results from the model tuning that I performed a few chunks ago, because I have created the directory for these model results. I will also save the gridded tuning result figures manually from the Viewer pane to the model directory folder that was created using compute_MaxEnt_summary_statistics_CV().

# save gridsearch interactive tables from Viewer pane

# save df results of model tuning
write.csv(global_model_tuned_auc@results, file = file.path(mypath, "global_model_tuning_auc_results.csv"))
write.csv(global_model_tuned_tss@results, file = file.path(mypath, "global_model_tuning_tss_results.csv"))

# save tuned models
readr::write_rds(global_model_tuned_auc, file = file.path(mypath, "global_model_tuned_auc.rds"))
readr::write_rds(global_model_tuned_tss, file = file.path(mypath, "global_model_tuned_tss.rds"))

3. Create Outputs for Analysis

mypath <- file.path(here::here() %>% 
                       dirname(),
                     "maxent/models/slf_global_v4")

global_model <- read_rds(file = file.path(mypath, "global_model.rds"))

3.1 Create suitability maps- mean and thresholded

I will use the SDMtune::predict() function to predict the suitability for the range of interest. I will threshold the map by the MTP, MTSS and 10_percentile thresholds. The workflow for this application will use my function create_MaxEnt_suitability_maps_CV().

I will take the above map and reclassify it so that areas that aren’t suitable are more obvious. I will use the MTSS (maximum training sensitivity plus specificity threshold), as this threshold is one of the most rigorous available. It maximizes the TSS, a threshold dependent goodness of fit metric and has been shown to perform well and produce consistent predictions for presence-only methods (Liu et al, 2013; Liu et al, 2016). I will also use the MTP (minimum training presence) and 10_percentile (10 percentile training threshold) thresholds. These thresholds represent a continuum of training data exclusion. The MTP threshold does not exclude any data and assumes confidence in the validity of the training data, making the most conservative predictions (the largest suitable range for invasive species) The 10_percentile threshold excludes the top 10% of suitable training samples, which would be more appropriate if I have less confidence in my training data. Finally, the MTSS threshold maximizes the sensitivity (the likelihood of detecting a true positive) (Maryam Bordkhani (n.d.)).

I will retrieve the specific value for the MTSS threshold for this model and use it to define a what is considered unsuitable in this raster. Then, I will reclassify the mean raster so that anything below this threshold is now “unsuitable”. I will use the mean cloglog threshold for all iterations. I will create a binary version of the mean raster, with unsuitable regions (below the MTSS training threshold) as 0, and those above the threshold as 1. Then, I can reclassify averaged raster of MaxEnt predictions. I will load in the averaged prediction for this model and reclassify the raster using a temporary raster layer that I will create. This raster will only have values where the model predicts the climate to be unsuitable, so that it can be plotted overtop the averaged raster to mask unsuitable areas.

scari::create_MaxEnt_suitability_maps_CV(
  model.obj = global_model,
  model.name = "global", 
  mypath = mypath, 
  create.dir = FALSE, 
  env.covar.obj = x_global_hist_env_covariates, 
  describe.proj = "globe_1981-2010", # name of area or time period projected to
  clamp.pred = TRUE,
  thresh = c("MTP", "MTSS"),
  map.thresh = TRUE, # whether thresholded versions of these maps should be created
  map.thresh.extra = "MTP", # whether an added underlayer threshold should be plotted (MTP plotted under MTSS)
  summary.file = file.path(mypath, "global_summary_all_iterations.csv")
)

I will also create suitability maps for each of the projected 2041-2070 climate change scenarios.

scari::create_MaxEnt_suitability_maps_CV(
  model.obj = global_model,
  model.name = "global", 
  mypath = mypath, 
  create.dir = FALSE, 
  env.covar.obj = x_global_126_env_covariates, 
  describe.proj = "globe_2041-2070_GFDL_ssp126", # name of area or time period projected to
  clamp.pred = TRUE,
  thresh = c("MTP", "MTSS"),
  map.thresh = TRUE, # whether thresholded versions of these maps should be created
  map.thresh.extra = "MTP",
  summary.file = file.path(mypath, "global_summary_all_iterations.csv")
)

scari::create_MaxEnt_suitability_maps_CV(
  model.obj = global_model,
  model.name = "global", 
  mypath = mypath, 
  create.dir = FALSE, 
  env.covar.obj = x_global_370_env_covariates, 
  describe.proj = "globe_2041-2070_GFDL_ssp370", # name of area or time period projected to
  clamp.pred = TRUE,
  thresh = c("MTP", "MTSS"),
  map.thresh = TRUE, # whether thresholded versions of these maps should be created
  map.thresh.extra = "MTP",
  summary.file = file.path(mypath, "global_summary_all_iterations.csv")
)

scari::create_MaxEnt_suitability_maps_CV(
  model.obj = global_model,
  model.name = "global", 
  mypath = mypath, 
  create.dir = FALSE, 
  env.covar.obj = x_global_585_env_covariates, 
  describe.proj = "globe_2041-2070_GFDL_ssp585", # name of area or time period projected to
  clamp.pred = TRUE,
  thresh = c("MTP", "MTSS"),
  map.thresh = TRUE, # whether thresholded versions of these maps should be created
  map.thresh.extra = "MTP",
  summary.file = file.path(mypath, "global_summary_all_iterations.csv")
)

3.2 Predict suitability for known SLF presences

I will get projected suitability values, calculated on the cloglog scale (between 0 and 1) for each of the known SLF populations I used as input data for my models. These suitability values will be added back to the original data frame and saved. The workflow for this application will use my function predict_xy_suitability_CV(). This function extracts covariate data from each data point and then uses the model object to predict the suitability for SLF establishment at that location based on the climate variables. This function has two modes for predicting the suitability: simple and buffered. I will use simple (the suitability at that exact point) for the location of SLF populations.

Later in my analysis, I will use these point-wise suitability predictions to create a scatter plot in the global vs regional models.

mypath <- file.path(here::here() %>% 
                       dirname(),
                     "maxent/models/slf_global_v4")

# slf presence data
slf_presences <- read.csv(file = file.path(here::here(), "vignette-outputs", "data-tables", "slf_all_coords_final_2025-07-07.csv")) %>%
  dplyr::select(-species)

scari::predict_xy_suitability_CV(
  xy.obj = slf_presences,
  xy.type = "Lycorma delicatula",
  env.covar.obj = x_global_hist_env_covariates,
  model.obj = global_model,
  mypath = mypath,
  clamp.pred = TRUE, # clamp predictions
  buffer.pred = FALSE, # will not buffer for SLF occurrence data because this is more accurate than viticultural region data
  predict.type = c("cloglog", "logistic"),
  output.name = "global_slf_all_coords_1981-2010"
)

I will repeat the same process for the CMIP6 data.

scari::predict_xy_suitability_CV(
  xy.obj = slf_presences,
  xy.type = "Lycorma delicatula",
  env.covar.obj = x_global_126_env_covariates,
  model.obj = global_model,
  mypath = mypath,
  clamp.pred = TRUE,
  predict.type = c("cloglog", "logistic"),
  output.name = "global_slf_all_coords_2041-2070_GFDL_ssp126"
)

scari::predict_xy_suitability_CV(
  xy.obj = slf_presences,
  xy.type = "Lycorma delicatula",
  env.covar.obj = x_global_370_env_covariates,
  model.obj = global_model,
  mypath = mypath,
  clamp.pred = TRUE,
  predict.type = c("cloglog", "logistic"),
  output.name = "global_slf_all_coords_2041-2070_GFDL_ssp370"
)

scari::predict_xy_suitability_CV(
  xy.obj = slf_presences,
  xy.type = "Lycorma delicatula",
  env.covar.obj = x_global_585_env_covariates,
  model.obj = global_model,
  mypath = mypath,
  clamp.pred = TRUE,
  predict.type = c("cloglog", "logistic"),
  output.name = "global_slf_all_coords_2041-2070_GFDL_ssp585"
)

3.3 Predict suibability for IVR regions

I will perform the same action as above, but for the locations of over 1,000 important wineries around the world. I will use the buffered method of predicting suitability for this application. The buffered method creates a buffer zone around the point and considers all values within that buffer for predicting suitability. This is important because these points represent regions that are important to viticulture and are not exact geospatial coordinates, but are likely mapped to the centroid of areas with vast expanses of vineyards. So, using a buffer zone accounts for this lack of exact coordinates. I will take this maximum value from this buffer, but the mean, minimum and others can be used as well. I will use a buffer radial distance of 20,000m (the default, 20km) because the resolution of the predictions is at 10km, so this distance considers the two surrounding cells on any side.

mypath <- file.path(here::here() %>% 
                       dirname(),
                     "maxent/models/slf_global_v4")

# load in all IVR points
IVR_regions <- readr::read_rds(file.path(here::here(), "data", "wineries_esri54017.rds"))

IVR_regions <- IVR_regions %>%
  dplyr::select(x, y)

scari::predict_xy_suitability_CV(
  xy.obj = IVR_regions,
  xy.type = "IVR regions",
  env.covar.obj = x_global_hist_env_covariates,
  model.obj = global_model,
  mypath = mypath,
  clamp.pred = TRUE,
  buffer.pred = TRUE, # 20km buffer around points
  predict.type = c("cloglog", "logistic"),
  output.name = "global_wineries_1981-2010"
)

I will repeat the same process for the CMIP6 data.

scari::predict_xy_suitability_CV(
  xy.obj = IVR_regions,
  xy.type = "IVR regions",
  env.covar.obj = x_global_126_env_covariates,
  model.obj = global_model,
  mypath = mypath,
  clamp.pred = TRUE,
  buffer.pred = TRUE,
  predict.type = c("cloglog", "logistic"),
  output.name = "global_wineries_2041-2070_GFDL_ssp126"
)

scari::predict_xy_suitability_CV(
  xy.obj = IVR_regions,
  xy.type = "IVR regions",
  env.covar.obj = x_global_370_env_covariates,
  model.obj = global_model,
  mypath = mypath,
  clamp.pred = TRUE,
  buffer.pred = TRUE,
  predict.type = c("cloglog", "logistic"),
  output.name = "global_wineries_2041-2070_GFDL_ssp370"
)

scari::predict_xy_suitability_CV(
  xy.obj = IVR_regions,
  xy.type = "IVR regions",
  env.covar.obj = x_global_585_env_covariates,
  model.obj = global_model,
  mypath = mypath,
  clamp.pred = TRUE,
  buffer.pred = TRUE,
  predict.type = c("cloglog", "logistic"),
  output.name = "global_wineries_2041-2070_GFDL_ssp585"
)

Now I have all of the outputs I need from the global model! These outputs will be used in comparison with the regional version of their time period to analyze the realized regional niche for SLF at multiple spatial scales. These will be used to make inferences about the impacts of climate change on suitability for SLF.

In the next vignette, I will set up for training each of the 3 regional models.

summary_global <- read_csv(file.path(mypath, "global_summary_all_iterations.csv"))

readr::write_rds(summary_global, file = file.path(here::here(), "data", "global_summary_all_iterations_v4.rds"))

4. Take mean of ssp scenario predictions

I will load the three ssp scenario rasters I just created back into R and will take a mean of those rasters for my predictions step later downstream.

# ssp 126
pred_global_2055_126 <- terra::rast(
  x = file.path(mypath, "global_pred_suit_clamped_cloglog_globe_2041-2070_GFDL_ssp126_mean.asc")
)
# ssp 370
pred_global_2055_370 <- terra::rast(
  x = file.path(mypath, "global_pred_suit_clamped_cloglog_globe_2041-2070_GFDL_ssp370_mean.asc")
)
# ssp 585
pred_global_2055_585 <- terra::rast(
  x = file.path(mypath, "global_pred_suit_clamped_cloglog_globe_2041-2070_GFDL_ssp585_mean.asc")
)

pred_global_2055_stack <- c(pred_global_2055_126, pred_global_2055_370, pred_global_2055_585)

# apply mean between layers
pred_global_2055_mean <- terra::app(
  x = pred_global_2055_stack,
  fun = "mean",
  filename = file.path(mypath, "global_pred_suit_clamped_cloglog_globe_2041-2070_GFDL_ssp_averaged.asc"),
  filetype = "AAIGrid",
  overwrite = FALSE
)

References

Elith, J., Phillips, S. J., Hastie, T., Dudík, M., Chee, Y. E., & Yates, C. J. (2011). A statistical explanation of MaxEnt for ecologists: Statistical explanation of MaxEnt. Diversity and Distributions, 17(1), 43–57. https://doi.org/10.1111/j.1472-4642.2010.00725.x
Feng, X. (2022). Shandongfx/nimbios_enm [HTML]. https://github.com/shandongfx/nimbios_enm (Original work published 2018).
Gallien, L., Douzet, R., Pratte, S., Zimmermann, N. E., & Thuiller, W. (2012). Invasive species distribution models – how violating the equilibrium assumption can create new insights. Global Ecology and Biogeography, 21(11), 1126–1136. https://doi.org/10.1111/j.1466-8238.2012.00768.x
Liu, C., Newell, G., & White, M. (2016). On the selection of thresholds for predicting species occurrence with presence-only data. Ecology and Evolution, 6(1), 337–348. https://doi.org/10.1002/ece3.1878
Liu, C., White, M., & Newell, G. (2013). Selecting thresholds for the prediction of species occurrence with presence-only data. Journal of Biogeography, 40(4), 778–789. https://doi.org/10.1111/jbi.12058
Maryam Bordkhani. (n.d.). Threshold rule [Online post].
Radosavljevic, A., & Anderson, R. P. (2014). Making better Maxent models of species distributions: Complexity, overfitting and evaluation. Journal of Biogeography, 41(4), 629–643. https://doi.org/10.1111/jbi.12227
Phillips, S. J., Anderson, R. P., & Schapire, R. E. (2006). Maximum entropy modeling of species geographic distributions. Ecological Modelling, 190(3), 231–259. https://doi.org/10.1016/j.ecolmodel.2005.03.026
Steven Phillips. (2017). A Brief Tutorial on Maxent. http://biodiversityinformatics.amnh.org/open_source/maxent/.
Steven J. Phillips, Miroslav Dudík, & Robert E. Schapire. (2023). Maxent software for modeling species niches and distributions (Version 3.4.3 (Java)) [Computer software]. http://biodiversityinformatics.amnh.org/open_source/maxent/.
Srivastava, V., Roe, A. D., Keena, M. A., Hamelin, R. C., & Griess, V. C. (2021). Oh the places they’ll go: Improving species distribution modelling for invasive forest pests in an uncertain world. Biological Invasions, 23(1), 297–349. https://doi.org/10.1007/s10530-020-02372-9
Valavi, R., Elith, J., Lahoz-Monfort, J. J., & Guillera-Arroita, G. (2019). blockCV: An r package for generating spatially or environmentally separated folds for k-fold cross-validation of species distribution models. Methods in Ecology and Evolution, 10(2), 225–232. https://doi.org/10.1111/2041-210X.13107
VanDerWal, J., Shoo, L. P., Graham, C., & Williams, S. E. (2009). Selecting pseudo-absence data for presence-only distribution modeling: How far should you stray from what you know? Ecological Modelling, 220(4), 589–594. https://doi.org/10.1016/j.ecolmodel.2008.11.010
Vignali, S., Barras, A. G., Arlettaz, R., & Braunisch, V. (2020). SDMtune: An R package to tune and evaluate species distribution models. Ecology and Evolution, 10(20), 11488–11506. https://doi.org/10.1002/ece3.6786

Appendix

Attempt at Spatially and environmentally Blocked Cross-validation

I attempted this method because of the body of literature that says blocked cross validation is more rigorous than random CV. Blocked CV results is more spatially separated training and testing data, which prevents autocorrelation within the environmental covariates. However, this method did not work because of the combination of the scale of the covariates and the comparatively narrow range of SLF presences. The global scale of analysis means that the block size needs to be quite large (about 1.5 million meters) to reduce spatial autocorrelation between the training points. The narrow spread of presence points in comparison means that not every blocked CV group (out of 10) received test presences and some had no test points. I also tried this with 5 groups, but point distributions were still not ideal.

#presences
global_presences_with_data <- read_csv(
  file.path(here::here(), "vignette-outputs", "data-tables", "global_slf_presences_with_data_v4.csv")
  ) %>%
  # add presence/absence column
  mutate(pa = 1)

# background points
global_background_with_data <-  read_csv(
  file.path(here::here(), "vignette-outputs", "data-tables", "global_background_points_with_data_v4.csv")
  ) %>%
  mutate(pa = 0)


# join
global_all_points_with_data <- full_join(global_presences_with_data, global_background_with_data) %>%
  sf::st_as_sf(coords = c("X", "Y"), crs = 4326)

First, check spatial autocorrelation level. This determines the appropriate block size to prevent spatial autocorrelation.

# set seed for consistency of random sampling
set.seed(2)

# take spatial autocorrelation distance
spatial_autocor <- blockCV::cv_spatial_autocor(
  x = global_all_points_with_data,
  column = "pa",
  plot = TRUE,
  progress = TRUE
)

spatial_autocor$range
spatial_autocor$plots

# OLD METHOD
# convert to a Lambert projection because variograms do not work with lat/long projections
# using EPSG:9834, Lambert Cylindrical Equal Area (Spherical)
#global_raster <- terra::project(x = x_global_hist_env_covariates, y = "EPSG:9834", method = "bilinear", threads = TRUE)
# test for spatial autocorrelation distance
#spatial_autocor <- blockCV::cv_spatial_autocor(
#  r = global_raster,
#  num_sample = 20000, # quadrupled from default due to raster size
#  plot = TRUE,
#  progress = TRUE
#)

The range of spatial autocorrelation is estimated at 2,263,953 meters for the presence/absence data I am using to build the model. I will also use an interactive tool to visualize block size based on pa data.

blockCV::cv_block_size(
  x = global_all_points_with_data,
  column = "pa"
)

I will go ahead and use the 2.25 million meter distance for the spatial blocks.

global_folds <- blockCV::cv_spatial(
  x = global_all_points_with_data,
  column = "pa",
  r = x_global_hist_env_covariates,
  k = 5,
  size = 2263953,
  hexagon = FALSE, # I am pretty sure the function used to calculate the autocorrelation distance used square blocks
  selection = "random",
  iteration = 50,
  seed = 3,
  progress = TRUE, 
  report = TRUE
)

# 
View(global_folds$folds_list)
# this fold list contains each block 
global_folds$records

blockCV::cv_plot(
  cv = global_folds,
  r = x_global_hist_env_covariates
)

This method did not work because about 5 test folds out of 10 received 0 test points. The spread of the presences seems too narrow for each fold to receive adequate training and test presences. I will also try a spatial clustering algorithm.

global_folds <- blockCV::cv_cluster(
  x = global_all_points_with_data,
  column = "pa",
  k = 5,
  r = x_global_hist_env_covariates,
  scale = TRUE,
  raster_cluster = TRUE,
  report = TRUE
)

Clustering based on distance also did not put test presences in each fold (some had 0), so the blocked CV method will not be used and I will revert to k-fold random CV.

Samuel M. Owens1

2025-07-07