
Run global-scale MaxEnt model
Samuel M. Owens1
2024-08-05
Source: vignettes/050_run_global_MaxEnt_model.Rmd
Overview
In the last vignette, I produced the background point dataset for the global-scale model. In this vignette, I will start training MaxEnt models and obtaining the outputs I will need for my analysis! This vignette will train the global-scale model and the next few vignettes will train the three regional-scale models. This is the first step toward my goal of analyzing the realized niche of Lycorma delicatula at multiple spatial scales. The output from this global-scale model will be compared directly against the output from an ensemble of regional-scale models to determine the impact of climate change on suitability for SLF in important viticultural regions. I hypothesize that by creating models at multiple spatial scales, I can have more confidence in suitability predictions made for specific SLF populations and important viticultural regions, and can predict the risk of specific populations spreading now and under climate change. I also hypothesize that the regional-scale models, once ensembled into a single prediction, will make more refined predictions than a global-scale model could.
I begin setup for this model by associating the presence data with the covariates in a data frame that will be fed into MaxEnt. I also separate these data into random folds for cross-validation. Once these are done, I can train the MaxEnt model. Outputs from this model will include a list of summary statistics, suitability maps and point-wise suitability values for all SLF presences and for important viticultural regions. I will also create projected map and suitability outputs for the 3 CMIP6 scenarios I plan to use in my analysis.
This vignette has 3 subvignettes (051-053) that each outline and comment the basic workflow for the 3 R functions created for use in this vignette. These functions include compute_MaxEnt_summary_statistics_CV(), create_MaxEnt_suitability_maps_CV(), and predict_xy_suitability_CV(). Please see these subvignettes for a step-by-step guide to the creation of these functions. This package also has versions of the same functions that are designed for MaxEnt models without cross-validation. These share the same names, but without the _CV ending, and will be used for the regional models.
Setup
CURRENT MODEL VERSION: v3
# general tools
library(tidyverse) #data manipulation
library(here) #making directory pathways easier on different instances
# here::here() starts at the root folder of this package.
library(devtools)
# SDMtune and dependencies
library(SDMtune) # main package used to run SDMs
library(dismo) # package underneath SDMtune
library(rJava) # for running MaxEnt
library(plotROC) # plots ROCs
# spatial data handling
library(raster)
library(terra)
library(sf)
library(viridis)
library(blockCV) # appendix, not necessary to run this vignette
Note: I will be setting the global options of this document so that only certain code chunks are rendered in the final .html file. I will set eval = FALSE so that none of the code is re-run (preventing files from being overwritten during knitting) and will simply override this setting in the chunks that produce plots.
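For reference, a minimal sketch of the setup chunk implied by this note (the exact chunk options in the source .Rmd may differ):
# document-wide chunk options: do not re-run code when knitting, so saved files are not overwritten
knitr::opts_chunk$set(eval = FALSE)
# individual plotting chunks can override this with eval = TRUE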
SDMtune will run MaxEnt through Java via the rJava package. You will need to ensure that your machine has the proper version of Java installed (x32 or x64) for your operating system.
checkMaxentInstallation(verbose = TRUE)
## ✔ Java is installed.
## ✔ The package rJava is installed.
## ✔ The file maxent.jar is present in the correct folder.
## [1] TRUE
This chunk sets the Java memory allocation (Xmx). I will increase the memory allocation from 512m (the default) to 2GB of memory.
# xmx sets java memory allocation
options(java.parameters = "-Xmx2048m")
# xss sets java stack size
# options(java.parameters = c("-Xss2560k", "-Xmx2048m"))
1. Format Data for Global Model
I will load in the datasets I will need for the MaxEnt models. These are labeled at the beginning of the object name by the parameter they will be supplied to in the SDMtune::train() function (x, p, or a). I will begin by loading in the covariate data, followed by the points datasets.
1.1 Import data
First, I will load in the SLF presence dataset created in vignette 020 and the background points dataset created in vignette 040.
# slf presences
p_slf_points <- readr::read_rds(file = file.path(here::here(), "data", "slf_all_coords_final_2024-08-05.rds")) %>%
dplyr::select(-species)
# entire eastern USA background point set
a_global_background_points <- readr::read_rds(file = file.path(here::here(), "data", "global_background_points_v3.rds"))
Next, I will load in and stack the covariates needed for the model, which were trimmed in vignette 030.
# path to directory
mypath <- file.path(here::here() %>%
dirname(),
"maxent/historical_climate_rasters/chelsa2.1_30arcsec")
# the env covariates used to train the model global
x_global_hist_env_covariates_list <- list.files(path = file.path(mypath, "v1_maxent_10km"), pattern = "\\.asc$", full.names = TRUE) %>%
grep("bio2_1981-2010_global.asc|bio11_1981-2010_global.asc|bio12_1981-2010_global.asc|bio15_1981-2010_global.asc", ., value = TRUE)
I will load in 3 sets of CMIP6 covariates, 1 per ssp scenario (126/370/585).
# path to directory
mypath <- file.path(here::here() %>%
dirname(),
"maxent/future_climate_rasters/chelsa2.1_30arcsec")
# SSP126
x_global_126_env_covariates_list <- list.files(
path = file.path(mypath, "2041-2070_ssp126_GFDL", "v1_maxent_10km"), pattern = "\\_global.asc$", full.names = TRUE
) %>%
# dont include Access to cities
grep(pattern = "atc_2015", value = TRUE, invert = TRUE)
# SSP370
x_global_370_env_covariates_list <- list.files(
path = file.path(mypath, "2041-2070_ssp370_GFDL", "v1_maxent_10km"), pattern = "\\_global.asc$", full.names = TRUE
) %>%
# dont include Access to cities
grep(pattern = "atc_2015", value = TRUE, invert = TRUE)
## SSP585
x_global_585_env_covariates_list <- list.files(
path = file.path(mypath, "2041-2070_ssp585_GFDL", "v1_maxent_10km"), pattern = "\\_global.asc$", full.names = TRUE
) %>%
# dont include Access to cities
grep(pattern = "atc_2015", value = TRUE, invert = TRUE)
I will import rasters of the environmental covariates, stack them, and gather summary statistics. I will also shorten their names and exclude possible operators from layer names (for example, the dash symbol was found to interfere with SDMtune when making predictions for tables downstream).
# layer name object. Check order of layers first
env_layer_names <- c("bio11", "bio12", "bio15", "bio2")
# stack env covariates
x_global_hist_env_covariates <- terra::rast(x = x_global_hist_env_covariates_list)
# attributes
nlyr(x_global_hist_env_covariates)
names(x_global_hist_env_covariates)
minmax(x_global_hist_env_covariates)
# ext(x_global_hist_env_covariates)
# crs(x_global_hist_env_covariates)
# I will change the name of the variables because they are throwing errors in SDMtune
names(x_global_hist_env_covariates) <- env_layer_names
# confirmed: SDMtune doesn't like dashes in column names (they are read as a mathematical operation)
Now I will perform the same stacking and renaming for each of the 3 CMIP6 scenario covariate sets.
# stack env covariates
x_global_126_env_covariates <- terra::rast(x = x_global_126_env_covariates_list)
# attributes
nlyr(x_global_126_env_covariates)
names(x_global_126_env_covariates)
minmax(x_global_126_env_covariates)
# ext(x_global_126_env_covariates)
# crs(x_global_126_env_covariates)
names(x_global_126_env_covariates) <- env_layer_names
# stack env covariates
x_global_370_env_covariates <- terra::rast(x = x_global_370_env_covariates_list)
# attributes
nlyr(x_global_370_env_covariates)
names(x_global_370_env_covariates)
minmax(x_global_370_env_covariates)
# ext(x_global_370_env_covariates)
# crs(x_global_370_env_covariates)
names(x_global_370_env_covariates) <- env_layer_names
# stack env covariates
x_global_585_env_covariates <- terra::rast(x = x_global_585_env_covariates_list)
# attributes
nlyr(x_global_585_env_covariates)
names(x_global_585_env_covariates)
minmax(x_global_585_env_covariates)
# ext(x_global_585_env_covariates)
# crs(x_global_585_env_covariates)
names(x_global_585_env_covariates) <- env_layer_names
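As an aside, the three scenario stacks could also be built in a single pass; a compact sketch that is functionally equivalent to the three chunks above, using the file lists created earlier:
# sketch: stack and rename all three CMIP6 scenario covariate sets in one loop
scenario_lists <- list(
  ssp126 = x_global_126_env_covariates_list,
  ssp370 = x_global_370_env_covariates_list,
  ssp585 = x_global_585_env_covariates_list
)
scenario_stacks <- lapply(scenario_lists, function(files) {
  r <- terra::rast(files)       # stack the scenario covariates
  names(r) <- env_layer_names   # harmonize layer names for SDMtune
  r
})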
1.2 Create input data objects
I will need to combine the above covariate, presence, and background point datasets into a single dataset to feed into MaxEnt (an SWD, or samples with data, object). This dataset will need to contain point-wise values for each of the predictor covariates at the presence and background point coordinates. SDMtune takes an SWD object for this purpose, containing the presences and background points with associated covariate data to be fed into the model, so I will create this now.
global_SWD <- SDMtune::prepareSWD(
species = "Lycorma delicatula",
env = x_global_hist_env_covariates,
p = p_slf_points,
a = a_global_background_points,
verbose = TRUE # print helpful messages
)
global_SWD@coords # coordinates
global_SWD@pa # presence / absence (background counted as absence)
global_SWD@data # extracted covariate data
I usually look at how many records were dropped before saving, because SDMtune gives only a nonspecific warning about how many samples were discarded. In this case, we only lost 2 presence records, so this is acceptable. I will also save the output to be used later.
SDMtune::swd2csv(swd = global_SWD, file_name = c(
file.path(here::here(), "vignette-outputs", "data-tables", "global_slf_presences_with_data_v3.csv"),
file.path(here::here(), "vignette-outputs", "data-tables", "global_background_points_with_data_v3.csv")
))
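As a rough illustration of the dropped-record check mentioned above, the counts in the SWD object can be compared against the input datasets (a sketch, not part of the original workflow):
# sketch: compare input record counts against what made it into the SWD object
n_presence_in <- nrow(p_slf_points)
n_presence_swd <- sum(global_SWD@pa == 1)
n_background_in <- nrow(a_global_background_points)
n_background_swd <- sum(global_SWD@pa == 0)
# records dropped because they fell on NA covariate cells
n_presence_in - n_presence_swd
n_background_in - n_background_swd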
Create training / test split
I will randomly split the presences into training and testing sets, using 80% of the points for training and 20% for testing. I will then use SDMtune::randomFolds() to split the training data into 10 partitions for cross-validation. This method was loosely adapted from Srivastava et al. (2021).
In my first attempts, I tried a blocked cross-validation methodology using the function blockCV::cv_spatial(). I attempted it because many papers say that spatially segmenting the k-folds is a more rigorous method than a simple random k-fold. However, this method did not work for my dataset and left many test folds with 0 presence points. I believe this is due to both the spatial scale I am working at for this model (global) and the comparatively narrow species range (limited to only east Asia and the USA). The block size that reduces autocorrelation was about 1.5 million meters, but this made the blocks too large for each block to receive adequate species presences. I also attempted spatial blocking by clustering (cv_cluster()), but this yielded similar results. See the Appendix of this file for the code.
Let’s create the training/testing data split.
global_trainTest <- SDMtune::trainValTest(
x = global_SWD,
test = 0.2, # set used for predicting maps and suitability values
only_presence = TRUE, # only split presences and recycle bg points
seed = 3
)
# separate off training data
global_train <- global_trainTest[[1]]
global_test <- global_trainTest[[2]]
global_train@coords # coordinates
global_train@pa # presence / absence (background counted as absence)
global_train@data # extracted covariate data
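A quick sanity check of the 80/20 presence split, using the @pa slot shown above (a sketch):
# presences in each partition (background points are recycled, not split)
sum(global_train@pa == 1) # ~80% of presences
sum(global_test@pa == 1)  # ~20% of presences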
Now split the training data for 10-fold random cross validation.
# create random folds
global_trainFolds <- SDMtune::randomFolds(
data = global_train,
k = 10, # 10 folds
only_presence = TRUE,
seed = 4
)
2. Train & Tune Global Model
First, I will train a model with 10 cross-validated replicates. The model will only be trained on 80% of the SLF presence data, with the other 20% being held apart and used downstream for analysis. The default settings will be used for MaxEnt, apart from the following changes:
- ALL feature types used (l = linear, q = quadratic, p = product, h = hinge, t = threshold)
- replicates = 10
- iterations = 5000. This is the max number of iterations for the optimization algorithm to perform before stopping training. Increasing this number from the default of 500 allows the algorithm to make more refined predictions.
I will use these hyperparameters to train the initial model, but these parameters may change based on model tuning, which I will perform next.
global_model <- SDMtune::train(
method = "Maxent",
data = global_train,
folds = global_trainFolds, # 10 folds for dataset
fc = "lqpht", # feature classes set to ALL
iter = 5000, # number of iterations
progress = TRUE
)
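Before tuning, the cross-validated performance of this initial model can be inspected; a sketch assuming SDMtune's auc() and tss() return fold-averaged values when given a cross-validated model:
# mean training and testing metrics across the 10 folds
SDMtune::auc(global_model)              # training AUC
SDMtune::auc(global_model, test = TRUE) # mean testing AUC across folds
SDMtune::tss(global_model)              # training TSS
SDMtune::tss(global_model, test = TRUE) # mean testing TSS across folds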
2.1 Tuning MaxEnt Model
Next, I will perform model tuning by varying specific hyperparameters. I selected the hyperparameters that I thought were best for the model based on numerous papers. However, modeling papers for other species have made parameter selections different from mine. Luckily, SDMtune makes it easy to select the best-fit set of hyperparameters from a range of possibilities using the gridSearch() function. This function trains a different model for every possible combination of the hyperparameters listed. I can then pick the best model based on AUC, TSS, or AICc. I selected the following hyperparameters for specific reasons given in the literature.
Feature classes: h, qpht, lqpht
Feature classes are important because they govern how your environmental covariates are transformed by the algorithm (for example, if most presences fall beyond a specific point on the range of values, the algorithm might determine that a threshold feature is most appropriate).
- Elith et al. (2011) used only hinge features or all features and showed that hinge features alone may perform better. They described a MaxEnt model using only hinge features as resembling an additive model, not unlike a GAM.
- The supplemental materials from the same paper demonstrated that linear features are redundant if also using hinge features.
- Most SLF modeling papers, including Huron et al. and others, simply use all features.
Regularization multiplier: 0.25, 0.5, 1, 1.5, 2
Regularization penalizes overfitting in the model. If a model is fit so tightly to a range of covariate values that it could not be applied to another area with different values, it receives a penalty.
- Radosavljevic and Anderson 2014 used a range of regularization multipliers, from 0.25 to 2, to test for the most appropriate value, which I will adopt.
I will compare the models with the highest test AUC and TSS values. I will also save all of the model variations in a .rds object.
# the arguments that can be tuned later
getTunableArgs(global_model)
# list of hyperparameters
hypers <- list(
reg = c(0.25, 0.5, 1, 1.5, 2), # regularization multiplier
fc = c("h", "qpht", "lqpht") # feature classes
)
if (FALSE) {
# Train a model with all possible combinations of hyperparameters
global_model_tuned_auc <- SDMtune::gridSearch(
model = global_model,
hypers = hypers, # hyperparameter list
metric = "auc",
save_models = TRUE,
interactive = TRUE,
progress = TRUE
)
# show tuning results
global_model_tuned_auc@results
# select number of best model
best.model.auc <- which.max(global_model_tuned_auc@results$test_AUC)
}
Based on the test AUC metric, the highest value was attributed to the model with a regularization multiplier of 1.5 and qpht features (model 11). I will also check based on the TSS.
if (FALSE) {
# Train a model with all possible combinations of hyperparameters
global_model_tuned_tss <- SDMtune::gridSearch(
model = global_model,
hypers = hypers, # hyperparameter list
metric = "tss",
save_models = TRUE,
interactive = TRUE,
progress = TRUE
)
# show tuning results
global_model_tuned_tss@results
# select number of best model
best.model.tss <- which.max(global_model_tuned_tss@results$test_TSS)
}
We get slightly different results based on the test TSS metric. We see that the highest value was attributed to the model with a regularization multiplier of 1.5 and lqpht (ALL) features (model 12).
We will need to choose either model 11 or model 12. Since the differences in test AUC and TSS between the two models are incredibly small (0.002), I will choose model 12 because it has consistently been the better fit across model versions. These tuned parameters will also be applied to all future models trained for SLF at the regional scale.
global_model <- global_model_tuned_auc@models[[12]]
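As a quick check that model 12 carries the intended hyperparameters, the corresponding row of the tuning results can be printed (a sketch):
# confirm the selected model uses reg = 1.5 and fc = "lqpht"
global_model_tuned_auc@results[12, ]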
2.2 Summary Statistics
I will apply my function compute_MaxEnt_summary_statistics_CV() to produce all summary statistics for this model. This list includes marginal and univariate response curves, jackknife tests, variable importance, thresholds and confusion matrices, among other statistics. For the complete and annotated workflow used to create this function, see vignettes/051_maxent_workflow.
# output directory
mypath <- file.path(here::here() %>%
dirname(),
"maxent/models/slf_global_v3")
scari::compute_MaxEnt_summary_statistics_CV(
model.obj = global_model,
model.name = "global",
mypath = mypath,
create.dir = FALSE, # create subdirectory
env.covar.obj = x_global_hist_env_covariates, # env covariates raster stacked
train.obj = global_train, # training data used to create model
trainFolds.obj = global_trainFolds, # k-folds of training data
test.obj = global_test, # data you wish to use to test the model
jk.test.type = c("train", "test"), # types of jackknife curves to be created
plot.type = c("cloglog", "logistic") # types of univariate and marginal response curves to be created
)
I will now save the results from the model tuning that I performed a few chunks ago, because I have created the directory for these model results. I will also save the gridded tuning result figures manually from the Viewer pane to the model directory folder that was created using compute_MaxEnt_summary_statistics_CV().
# save gridsearch interactive tables from Viewer pane
# save df results of model tuning
write.csv(global_model_tuned_auc@results, file = file.path(mypath, "global_model_tuning_auc_results.csv"))
write.csv(global_model_tuned_tss@results, file = file.path(mypath, "global_model_tuning_tss_results.csv"))
# save tuned models
readr::write_rds(global_model_tuned_auc, file = file.path(mypath, "global_model_tuned_auc.rds"))
readr::write_rds(global_model_tuned_tss, file = file.path(mypath, "global_model_tuned_tss.rds"))
3. Create Outputs for Analysis
3.1 Create suitability maps: mean and thresholded
I will use the SDMtune::predict() function to predict the suitability for the range of interest. I will threshold the map by the MTP, MTSS, and 10_percentile thresholds. The workflow for this application will use my function create_MaxEnt_suitability_maps_CV().
I will take the resulting map and reclassify it so that areas that aren't suitable are more obvious. I will use the MTSS (maximum training sensitivity plus specificity) threshold, as this threshold is one of the most rigorous available. It maximizes the TSS, a threshold-dependent goodness-of-fit metric, and has been shown to perform well and produce consistent predictions for presence-only methods (Liu et al., 2013; Liu et al., 2016). I will also use the MTP (minimum training presence) and 10_percentile (10th percentile training presence) thresholds. These thresholds represent a continuum of training data exclusion. The MTP threshold does not exclude any data and assumes confidence in the validity of the training data, making the most conservative predictions (the largest suitable range for an invasive species). The 10_percentile threshold excludes the 10% of training presences with the lowest predicted suitability, which would be more appropriate if I have less confidence in my training data. Finally, the MTSS threshold maximizes the sum of sensitivity (the likelihood of detecting a true positive) and specificity (Maryam Bordkhani, n.d.).
I will retrieve the specific value for the MTSS threshold for this model and use it to define what is considered unsuitable in this raster; I will use the mean cloglog threshold across all iterations. Then, I will reclassify the mean raster so that anything below this threshold is flagged as unsuitable. I will create a binary version of the mean raster, with unsuitable regions (below the MTSS training threshold) as 0 and those above the threshold as 1. With that in hand, I can reclassify the averaged raster of MaxEnt predictions: I will load in the averaged prediction for this model and reclassify it using a temporary raster layer that I will create. This temporary raster will only have values where the model predicts the climate to be unsuitable, so that it can be plotted over the averaged raster to mask unsuitable areas.
scari::create_MaxEnt_suitability_maps_CV(
model.obj = global_model,
model.name = "global",
mypath = mypath,
create.dir = FALSE,
env.covar.obj = x_global_hist_env_covariates,
describe.proj = "globe_1981-2010", # name of area or time period projected to
clamp.pred = TRUE,
thresh = c("MTP", "MTSS"),
map.thresh = TRUE, # whether thresholded versions of these maps should be created
map.thresh.extra = "MTP", # whether an added underlayer threshold should be plotted (MTP plotted under MTSS)
summary.file = file.path(mypath, "global_summary_all_iterations.csv")
)
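For illustration of the reclassification step described above (which create_MaxEnt_suitability_maps_CV() handles internally), a minimal sketch with terra; the file name and threshold value are placeholders:
# sketch: binary reclassification of the mean prediction at the MTSS threshold
mean_pred <- terra::rast(file.path(mypath, "global_pred_suit_clamped_cloglog_globe_1981-2010_mean.asc")) # placeholder file name
mtss_value <- 0.35 # placeholder; use the mean cloglog MTSS reported in the summary statistics
# cells at or above the threshold become 1 (suitable), cells below become 0 (unsuitable)
mean_pred_binary <- mean_pred >= mtss_value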
I will also create suitability maps for each of the projected 2041-2070 climate change scenarios.
scari::create_MaxEnt_suitability_maps_CV(
model.obj = global_model,
model.name = "global",
mypath = mypath,
create.dir = FALSE,
env.covar.obj = x_global_126_env_covariates,
describe.proj = "globe_2041-2070_GFDL_ssp126", # name of area or time period projected to
clamp.pred = TRUE,
thresh = c("MTP", "MTSS"),
map.thresh = TRUE, # whether thresholded versions of these maps should be created
map.thresh.extra = "MTP",
summary.file = file.path(mypath, "global_summary_all_iterations.csv")
)
scari::create_MaxEnt_suitability_maps_CV(
model.obj = global_model,
model.name = "global",
mypath = mypath,
create.dir = FALSE,
env.covar.obj = x_global_370_env_covariates,
describe.proj = "globe_2041-2070_GFDL_ssp370", # name of area or time period projected to
clamp.pred = TRUE,
thresh = c("MTP", "MTSS"),
map.thresh = TRUE, # whether thresholded versions of these maps should be created
map.thresh.extra = "MTP",
summary.file = file.path(mypath, "global_summary_all_iterations.csv")
)
scari::create_MaxEnt_suitability_maps_CV(
model.obj = global_model,
model.name = "global",
mypath = mypath,
create.dir = FALSE,
env.covar.obj = x_global_585_env_covariates,
describe.proj = "globe_2041-2070_GFDL_ssp585", # name of area or time period projected to
clamp.pred = TRUE,
thresh = c("MTP", "MTSS"),
map.thresh = TRUE, # whether thresholded versions of these maps should be created
map.thresh.extra = "MTP",
summary.file = file.path(mypath, "global_summary_all_iterations.csv")
)
3.2 Predict suitability for known SLF presences
I will get projected suitability values, calculated on the cloglog scale (between 0 and 1), for each of the known SLF populations I used as input data for my models. These suitability values will be added back to the original data frame and saved. The workflow for this application will use my function predict_xy_suitability_CV(). This function extracts covariate data from each data point and then uses the model object to predict the suitability for SLF establishment at that location based on the climate variables. This function has two modes for predicting suitability: simple and buffered. I will use the simple mode (the suitability at that exact point) for the locations of SLF populations.
Later in my analysis, I will use these point-wise suitability predictions to create a scatter plot comparing the global and regional models.
mypath <- file.path(here::here() %>%
dirname(),
"maxent/models/slf_global_v3")
# slf presence data
slf_presences <- read.csv(file = file.path(here::here(), "vignette-outputs", "data-tables", "slf_all_coords_final_2024-08-05.csv")) %>%
dplyr::select(-species)
scari::predict_xy_suitability_CV(
xy.obj = slf_presences,
xy.type = "Lycorma delicatula",
env.covar.obj = x_global_hist_env_covariates,
model.obj = global_model,
mypath = mypath,
clamp.pred = TRUE, # clamp predictions
predict.type = c("cloglog", "logistic"),
output.name = "global_slf_all_coords_1981-2010"
)
I will repeat the same process for the CMIP6 data.
scari::predict_xy_suitability_CV(
xy.obj = slf_presences,
xy.type = "Lycorma delicatula",
env.covar.obj = x_global_126_env_covariates,
model.obj = global_model,
mypath = mypath,
clamp.pred = TRUE,
predict.type = c("cloglog", "logistic"),
output.name = "global_slf_all_coords_2041-2070_GFDL_ssp126"
)
scari::predict_xy_suitability_CV(
xy.obj = slf_presences,
xy.type = "Lycorma delicatula",
env.covar.obj = x_global_370_env_covariates,
model.obj = global_model,
mypath = mypath,
clamp.pred = TRUE,
predict.type = c("cloglog", "logistic"),
output.name = "global_slf_all_coords_2041-2070_GFDL_ssp370"
)
scari::predict_xy_suitability_CV(
xy.obj = slf_presences,
xy.type = "Lycorma delicatula",
env.covar.obj = x_global_585_env_covariates,
model.obj = global_model,
mypath = mypath,
clamp.pred = TRUE,
predict.type = c("cloglog", "logistic"),
output.name = "global_slf_all_coords_2041-2070_GFDL_ssp585"
)
3.3 Predict suitability for IVR regions
I will perform the same action as above, but for the locations of over 1,000 important wineries around the world. I will use the buffered method of predicting suitability for this application. The buffered method creates a buffer zone around each point and considers all values within that buffer when predicting suitability. This is important because these points represent regions that are important to viticulture rather than exact geospatial coordinates; they are likely mapped to the centroid of areas with vast expanses of vineyards, so using a buffer zone accounts for this lack of exact coordinates. I will take the maximum value from within this buffer, but the mean, minimum, and other summaries can be used as well. I will use a buffer radial distance of 20,000 m (the default, 20 km) because the resolution of the predictions is 10 km, so this distance considers the two surrounding cells on any side.
mypath <- file.path(here::here() %>%
dirname(),
"maxent/models/slf_global_v3")
# load in all IVR points
load(file = file.path(here::here(), "data", "wineries.rda"))
IVR_regions <- wineries %>%
dplyr::select(x, y)
scari::predict_xy_suitability_CV(
xy.obj = IVR_regions,
xy.type = "IVR regions",
env.covar.obj = x_global_hist_env_covariates,
model.obj = global_model,
mypath = mypath,
clamp.pred = TRUE,
predict.type = c("cloglog", "logistic"),
output.name = "global_wineries_1981-2010",
# buffer
buffer.pred = TRUE
)
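To illustrate the buffered idea itself (this is not the scari implementation), a rough sketch with terra using a placeholder prediction raster:
# sketch: maximum predicted suitability within 20 km of each winery point
pred_raster <- terra::rast(file.path(mypath, "global_pred_suit_clamped_cloglog_globe_1981-2010_mean.asc")) # placeholder file name
ivr_vect <- terra::vect(IVR_regions, geom = c("x", "y"), crs = "EPSG:4326")
ivr_buffers <- terra::buffer(ivr_vect, width = 20000) # 20,000 m radius around each point
suit_max <- terra::extract(pred_raster, ivr_buffers, fun = max, na.rm = TRUE)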
I will repeat the same process for the CMIP6 data.
scari::predict_xy_suitability_CV(
xy.obj = IVR_regions,
xy.type = "IVR regions",
env.covar.obj = x_global_126_env_covariates,
model.obj = global_model,
mypath = mypath,
clamp.pred = TRUE,
predict.type = c("cloglog", "logistic"),
output.name = "global_wineries_2041-2070_GFDL_ssp126",
buffer.pred = TRUE
)
scari::predict_xy_suitability_CV(
xy.obj = IVR_regions,
xy.type = "IVR regions",
env.covar.obj = x_global_370_env_covariates,
model.obj = global_model,
mypath = mypath,
clamp.pred = TRUE,
predict.type = c("cloglog", "logistic"),
output.name = "global_wineries_2041-2070_GFDL_ssp370",
buffer.pred = TRUE
)
scari::predict_xy_suitability_CV(
xy.obj = IVR_regions,
xy.type = "IVR regions",
env.covar.obj = x_global_585_env_covariates,
model.obj = global_model,
mypath = mypath,
clamp.pred = TRUE,
predict.type = c("cloglog", "logistic"),
output.name = "global_wineries_2041-2070_GFDL_ssp585",
buffer.pred = TRUE
)
Now I have all of the outputs I need from the global model! These outputs will be compared with the regional-model outputs for the same time periods to analyze the realized regional niche of SLF at multiple spatial scales. They will also be used to make inferences about the impacts of climate change on suitability for SLF.
In the next vignette, I will set up for training each of the 3 regional models.
4. Take mean of ssp scenario predictions
I will load the three ssp scenario rasters I just created back into R and will take a mean of those rasters for my predictions step later downstream.
# ssp 126
pred_global_2055_126 <- terra::rast(
x = file.path(mypath, "global_pred_suit_clamped_cloglog_globe_2041-2070_GFDL_ssp126_mean.asc")
)
# ssp 370
pred_global_2055_370 <- terra::rast(
x = file.path(mypath, "global_pred_suit_clamped_cloglog_globe_2041-2070_GFDL_ssp370_mean.asc")
)
# ssp 585
pred_global_2055_585 <- terra::rast(
x = file.path(mypath, "global_pred_suit_clamped_cloglog_globe_2041-2070_GFDL_ssp585_mean.asc")
)
pred_global_2055_stack <- c(pred_global_2055_126, pred_global_2055_370, pred_global_2055_585)
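A cell-wise mean of the three scenario predictions can then be taken; a minimal sketch (the object name is my own):
# cell-wise mean suitability across the three SSP scenarios
pred_global_2055_mean <- terra::app(pred_global_2055_stack, fun = mean, na.rm = TRUE)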
References
Elith, J., Phillips, S. J., Hastie, T., Dudík, M., Chee, Y. E., & Yates, C. J. (2011). A statistical explanation of MaxEnt for ecologists: Statistical explanation of MaxEnt. Diversity and Distributions, 17(1), 43–57. https://doi.org/10.1111/j.1472-4642.2010.00725.x
Feng, X. (2022). Shandongfx/nimbios_enm [HTML]. https://github.com/shandongfx/nimbios_enm (Original work published 2018).
Gallien, L., Douzet, R., Pratte, S., Zimmermann, N. E., & Thuiller, W. (2012). Invasive species distribution models – how violating the equilibrium assumption can create new insights. Global Ecology and Biogeography, 21(11), 1126–1136. https://doi.org/10.1111/j.1466-8238.2012.00768.x
Liu, C., Newell, G., & White, M. (2016). On the selection of thresholds for predicting species occurrence with presence-only data. Ecology and Evolution, 6(1), 337–348. https://doi.org/10.1002/ece3.1878
Liu, C., White, M., & Newell, G. (2013). Selecting thresholds for the prediction of species occurrence with presence-only data. Journal of Biogeography, 40(4), 778–789. https://doi.org/10.1111/jbi.12058
Maryam Bordkhani. (n.d.). Threshold rule [Online post].
Radosavljevic, A., & Anderson, R. P. (2014). Making better Maxent models of species distributions: Complexity, overfitting and evaluation. Journal of Biogeography, 41(4), 629–643. https://doi.org/10.1111/jbi.12227
Phillips, S. J., Anderson, R. P., & Schapire, R. E. (2006). Maximum entropy modeling of species geographic distributions. Ecological Modelling, 190(3), 231–259. https://doi.org/10.1016/j.ecolmodel.2005.03.026
Steven Phillips. (2017). A Brief Tutorial on Maxent. http://biodiversityinformatics.amnh.org/open_source/maxent/.
Steven J. Phillips, Miroslav Dudík, & Robert E. Schapire. (2023). Maxent software for modeling species niches and distributions (Version 3.4.3 (Java)) [Computer software]. http://biodiversityinformatics.amnh.org/open_source/maxent/.
Srivastava, V., Roe, A. D., Keena, M. A., Hamelin, R. C., & Griess, V. C. (2021). Oh the places they’ll go: Improving species distribution modelling for invasive forest pests in an uncertain world. Biological Invasions, 23(1), 297–349. https://doi.org/10.1007/s10530-020-02372-9
Valavi, R., Elith, J., Lahoz-Monfort, J. J., & Guillera-Arroita, G. (2019). blockCV: An r package for generating spatially or environmentally separated folds for k-fold cross-validation of species distribution models. Methods in Ecology and Evolution, 10(2), 225–232. https://doi.org/10.1111/2041-210X.13107
VanDerWal, J., Shoo, L. P., Graham, C., & Williams, S. E. (2009). Selecting pseudo-absence data for presence-only distribution modeling: How far should you stray from what you know? Ecological Modelling, 220(4), 589–594. https://doi.org/10.1016/j.ecolmodel.2008.11.010
Vignali, S., Barras, A. G., Arlettaz, R., & Braunisch, V. (2020). SDMtune: An R package to tune and evaluate species distribution models. Ecology and Evolution, 10(20), 11488–11506. https://doi.org/10.1002/ece3.6786
Appendix
Attempt at Spatially and Environmentally Blocked Cross-Validation
I attempted this method because of the body of literature that says blocked cross-validation is more rigorous than random CV. Blocked CV results in more spatially separated training and testing data, which reduces the influence of spatial autocorrelation in the environmental covariates. However, this method did not work because of the combination of the scale of the covariates and the comparatively narrow range of SLF presences. The global scale of analysis means that the block size needs to be quite large (about 1.5 million meters) to reduce spatial autocorrelation between the training points. In comparison, the narrow spread of presence points means that not every blocked CV fold (out of 10) received test presences; some had none. I also tried this with 5 folds, but point distributions were still not ideal.
#presences
global_presences_with_data <- read_csv(
file.path(here::here(), "vignette-outputs", "data-tables", "global_slf_presences_with_data_v3.csv")
) %>%
# add presence/absence column
mutate(pa = 1)
# background points
global_background_with_data <- read_csv(
file.path(here::here(), "vignette-outputs", "data-tables", "global_background_points_with_data_v3.csv")
) %>%
mutate(pa = 0)
# join
global_all_points_with_data <- full_join(global_presences_with_data, global_background_with_data) %>%
sf::st_as_sf(coords = c("X", "Y"), crs = 4326)
First, check spatial autocorrelation level. This determines the appropriate block size to prevent spatial autocorrelation.
# set seed for consistency of random sampling
set.seed(2)
# take spatial autocorrelation distance
spatial_autocor <- blockCV::cv_spatial_autocor(
x = global_all_points_with_data,
column = "pa",
plot = TRUE,
progress = TRUE
)
spatial_autocor$range
spatial_autocor$plots
# OLD METHOD
# convert to a Lambert projection because variograms do not work with lat/long projections
# using EPSG:9834, Lambert Cylindrical Equal Area (Spherical)
#global_raster <- terra::project(x = x_global_hist_env_covariates, y = "EPSG:9834", method = "bilinear", threads = TRUE)
# test for spatial autocorrelation distance
#spatial_autocor <- blockCV::cv_spatial_autocor(
# r = global_raster,
# num_sample = 20000, # quadrupled from default due to raster size
# plot = TRUE,
# progress = TRUE
#)
The range of spatial autocorrelation is estimated at 2,263,953 meters for the presence/absence data I am using to build the model. I will also use an interactive tool to visualize block size based on pa data.
blockCV::cv_block_size(
x = global_all_points_with_data,
column = "pa"
)
I will go ahead and use the roughly 2.26 million meter distance (the estimated autocorrelation range) for the spatial blocks.
global_folds <- blockCV::cv_spatial(
x = global_all_points_with_data,
column = "pa",
r = x_global_hist_env_covariates,
k = 5,
size = 2263953,
hexagon = FALSE, # I am pretty sure the function used to calculate the autocorrelation distance used square blocks
selection = "random",
iteration = 50,
seed = 3,
progress = TRUE,
report = TRUE
)
#
View(global_folds$folds_list)
# this fold list contains each block
global_folds$records
blockCV::cv_plot(
cv = global_folds,
r = x_global_hist_env_covariates
)
This method did not work because about 5 test folds out of 10 received 0 test points. The spread of the presences seems too narrow for each fold to receive adequate training and test presences. I will also try a spatial clustering algorithm.
global_folds <- blockCV::cv_cluster(
x = global_all_points_with_data,
column = "pa",
k = 5,
r = x_global_hist_env_covariates,
scale = TRUE,
raster_cluster = TRUE,
report = TRUE
)
Clustering based on distance also did not put test presences in each fold (some had 0), so the blocked CV method will not be used and I will revert to k-fold random CV.