
Workflow for the functions 'compute_MaxEnt_summary_statistics()'
Samuel M. Owens1
2024-08-27
Source:vignettes/051_compute_MaxEnt_summary_statistics_workflow.Rmd
051_compute_MaxEnt_summary_statistics_workflow.Rmd
Note This vignette is an example of the workflow for the
function within this package called
compute_MaxEnt_summary_statistics()
, which is used in
vignette 050 to obtain the final results. This vignette does not need to
be run to obtain the final results of the workflow.
Overview
This vignette gives an example of the workflow for producing a MaxEnt model in R and all of its associated outputs. Chunks 1-10 share the same formatting as with vignette 050, which is the parent of this vignette. This vignette was created to isolate the entire workflow for creating a model. Some processes in this vignette are not used for all models produced, such as a rasterized predictions of suitability.
The workflow depends heavily on the R package SDMtune. This workflow can be used for any dataset to train a maxent model, extract the summary statistics and make predictions for pointwise species presence data or entire areas (both are exemplified in this workflow). Here is a list of the outputs and summary statistics that will be obtained from the models produced by this vignette:
- .RDS file of the MaxEnt model
- Presence data used to train and test the model
- suitability map projected to the area of interest
- presence/absence map generated using the Maximum Sensitivity plus specificity threshold produced by the model
- point-wise predictions for species presence data
- AUC
- ROC plot
- TSS
- Suitability thresholds (including minimum training presence, 10% training presence, etc)
- Marginal response curves
- Uni-variate response curves
- Variable permutation importance
- Jackknife leave-one-out test and plot
- Other model settings
Most of these can also be obtained using a function in this package
called compute_MaxEnt_summary_statistics()
, which creates
outputs of the summary statistics, writes the .RDS file and records the
presence data used to train and test the model (so it does NOT create a
suitability map, presence/absence map or point-wise predictions).
To run this vignette, first run up until step 2 of vig 050. This will create the necessary files and objects to run this vignette.
Setup
Note: I will be setting the global options of this
document so that only certain code chunks are rendered in the final
.html file. I will set the eval = FALSE
so that none of the
code is re-run (preventing files from being overwritten during knitting)
and will simply overwrite this in chunks with plots.
library(tidyverse) #data manipulation
library(here) #making directory pathways easier on different instances
# here() starts at the root folder of this package.
library(devtools)
library(rJava) # for running MaxEnt
library(dismo) # package underneath SDMtune
library(SDMtune) # main package used to run SDMs
# dependencies of SDMtune
library(kableExtra) # required for producing model reports
library(plotROC) # plots ROCs
library(rasterVis)
library(viridis)
# spatial data handling
library(terra)
library(pROC)
map_style <- list(
xlab("longitude"),
ylab("latitude"),
theme_classic(),
theme(legend.position = "bottom",
panel.background = element_rect(fill = "lightblue",
colour = "lightblue")
),
coord_equal()
)
mypath <- file.path(here::here() %>%
dirname(),
"maxent/models")
# create directory for model
dir.create(path = file.path(mypath, "slf_global_v3_TEST"))
dir.create(path = file.path(mypath, "slf_global_v3_TEST", "plots"))
NOTE the output created from this point to the end of the
vignette can be created using the function
scari::compute_MaxEnt_summary_statistics()
. This does not
include the ending appendix or the chunks that create an output using
SDMtune::predict()
.
mypath <- file.path(here::here() %>%
dirname(),
"maxent/models/slf_global_v3_TEST")
# save model
saveRDS(object = global_model, file = file.path(mypath, "global_model.rds"))
From here::here, the analysis is performed in SDMtune
.
If you wanted to perform your analysis in dismo
, the
function SDMtune::SDMmodel2MaxEnt()
will convert the above
model object into a MaxEnt object in the dismo package.
Get AUC, TSS, and variable contributions
I will retrieve AUC and TSS values for both training and test data and save this output as a .csv.
# vector of AUC values
AUC <- c(
# get training AUC value
SDMtune::auc(global_model),
# get test AUC value
SDMtune::auc(global_model, test = global_test)
)
# vector of TSS values
TSS <- c(
# get training TSS value
SDMtune::tss(global_model),
# get test TSS value
SDMtune::tss(global_model, test = global_test)
)
# compile to data frame
auc_tss <- data.frame(AUC, TSS, row.names = c("Training", "Test"))
# write csv
write.csv(auc_tss, file = file.path(mypath, "global_model_auc_tss.csv"), row.names = TRUE)
The relative contribution and permutation importance of each
environmental covariate is also available via the
SDMtune::maxentVarImp()
function. These will be put into a
table.
# variable importance
var_imp <- SDMtune::maxentVarImp(global_model)
# write to csv
write.csv(var_imp, file = file.path(mypath, "global_model_var_contrib.csv"), row.names = FALSE)
Plot ROC, univariate and marginal response curves
First, I will plot the ROC curves, which show the true positive vs false positive rates. This will be done for each iteration of the model. An AUC of 0.5 or better is desirable.
# loop that plots an ROC curve for each model iteration
for (a in seq(length(global_model@models))) {
# load in temp plot object
plot.ROC <- SDMtune::plotROC(global_model@models[[a]], test = global_test)
# add title
plot.ROC <- plot.ROC +
ggtitle(paste("ROC curve (sensitivity vs 1 - Specificity) for", "global_model", "iteration", a))
# save output
ggsave(plot.ROC,
filename = file.path(mypath, "plots", paste0("global_model_", "iter", a, "_ROC.jpg")),
height = 8,
width = 10,
device = "jpeg",
dpi = "retina")
# remove temp object
rm(plot.ROC)
}
Next, I will plot both the univariate and marginal response curves.
According to the MaxEnt output, the univariate response curves “represent a different model, namely, a Maxent model created using only the corresponding variable” (Phillips et.al, 2006). These models were trained with each variable alone, to show how that variable affects the probability of presence.
threshold.types <- c("cloglog", "logistic")
# for each variable used in the model, create a response curve
for (a in names(x_env_covariates)) {
# load in temp plot object
plot.univar <- SDMtune::plotResponse(
model = global_model,
var = a,
type = "cloglog",
fun = "mean",
rug = TRUE
)
# add title
plot.univar <- plot.univar +
ggtitle(paste("global_model", a, "univariate response curve- cloglog output"))
# save output
ggsave(plot.univar,
filename = file.path(mypath, "plots", paste0("global_model_", a, "_univar_resp_curve_cloglog", ".jpg")),
height = 8,
width = 10,
device = "jpeg",
dpi = "retina")
# remove temp object
rm(plot.univar)
}
Meanwhile, the marginal response curves “show how each environmental variable affects the Maxent prediction. The curves show how the predicted probability of presence changes as each environmental variable is varied, keeping all other environmental variables at their average sample value” (Phillips et.al, 2006). The marginal response curves show the effects of changing that variable.
# for each variable used in the model, create a response curve
for (a in names(x_env_covariates)) {
# for each threshold type, plot a response curve
for (b in threshold.types) {
# load in temp plot object
plot.response <- SDMtune::plotResponse(
model = global_model,
var = a,
type = b,
marginal = TRUE,
fun = "mean",
rug = TRUE
)
# add title
plot.response <- plot.response +
ggtitle(paste("global_model", a, "marginal response curve-", b, "output"))
# save output
ggsave(plot.response,
filename = file.path(mypath, "plots", paste0("global_model_", a, "_marg_resp_curve_", b, ".jpg")),
height = 8,
width = 10,
device = "jpeg",
dpi = "retina")
# remove temp object
rm(plot.response)
}
}
Extract summary statistics
Next, I will extract the summary statistics from each model that include the presence/absence thresholds, training statistics and other useful metrics. Specifically, I am interested in extracting particular thresholds (minimum training presence, 10 percentile training presence and maximum training sensitivity plus specificity) so that I can create presence/absence maps based on these. I will save each of the individual tables. I will also extract the results of each model iteration into a table, take the mean of those statistics, and save it.
# create empty date frame for joining
empty.table <- as.data.frame(global_model@models[[1]]@model@results) %>%
mutate(statistic = row.names(global_model@models[[1]]@model@results)) %>% # add column of statistic names
dplyr::select(statistic)
# loop to write summary statistics for each iteration of the model
for (i in seq(length(global_model@models))) {
# load in temp object
data.obj <- as.data.frame(global_model@models[[i]]@model@results)
# change colname
colnames(data.obj) <- paste0("iter_", i, "_summary")
# write to csv
write.csv(
x = data.obj,
file = file.path(mypath, paste0("global_model_summary_iter", i, ".csv")),
row.names = TRUE
)
# add column name to data object
data.obj <- mutate(data.obj, statistic = row.names(global_model@models[[1]]@model@results))
# while i has not reached the length of the model replicates, append the summary statistics to the empty table
while (i < length(global_model@models)) {
# join data.obj temporary object to empty table
empty.table <- right_join(empty.table, data.obj, by = "statistic")
break
}
# if i has reached the length of the model replicates, compute a mean and write the final table to .csv
if (i == length(global_model@models)) {
# compute the mean of the 5 columns
empty.table <- mutate(empty.table, mean = rowMeans(empty.table[, -1]))
# write to csv
write.csv(
x = empty.table,
file = file.path(mypath, paste0("global_model_summary_all_iterations.csv")),
row.names = FALSE
)
}
# remove temp object
rm(data.obj)
}
Write training and testing data to file
It is always a good idea to have the exact data used to train and test a model on hand, so I will write these to the same directory.
# write training data to csv
SDMtune::swd2csv(swd = global_train, file_name = file.path(mypath, "global_train_Kfold.csv"))
# load csv in again
global_train_Folds <- read.csv(file = file.path(mypath, "global_train_Kfold.csv"))
# bind training data with folds
global_train_Folds <- cbind(global_train_Folds, global_trainFolds)
# overwrite csv for training
write_csv(x = global_train_Folds, file = file.path(mypath, "global_train_Kfold.csv"))
# write testing data to csv
SDMtune::swd2csv(swd = global_test, file_name = file.path(mypath, "global_test.csv"))
Jackknife test for variable importance
The jackknife test is a leave-one-out test that iteratively leaves out each predictor variable and produces a model with all other variables. It also has the option to create a model for that predictor variable alone. It then records the gain or loss in AUC value. This is very informative test and is designed to test variable importance. I will perform a jackknife for the entire model’s training data alone. I will also create one for each model iteration using both training and testing data.
# jackknife
global_JK <- SDMtune::doJk(
model = global_model,
# test = global_test, # not used for models with replicates
metric = "auc",
with_only = TRUE, # also run test for each variable alone
progress = TRUE
)
# write raw data to csv
write_csv(x = global_JK, file = file.path(mypath, "global_model_jackknife_all_iterations_training.csv"))
# plot jackknife
global_JK_plot <- plotJk(jk = global_JK, type = "train")
global_JK_plot <- global_JK_plot +
ggtitle(paste("global_entire- jackknife test")) +
labs(subtitle = "all iterations, training data")
ggsave(global_JK_plot,
filename = file.path(mypath, "plots", "global_model_jackknife_all_iterations_training.jpg"),
height = 8,
width = 10,
device = "jpeg",
dpi = "retina")
I will also perform a jackknife for each of the model iterations, based on training and test data.
plot.types <- c("train", "test")
# for loop to output csv file of jackknife test, one per model iteration
for (a in seq(length(global_model@models))) {
# jackknife
jk.obj <- SDMtune::doJk(
model = global_model@models[[a]],
test = global_test,
metric = "auc",
with_only = TRUE, # also run test for each variable alone
progress = TRUE
)
# write raw data to csv
write_csv(x = jk.obj, file = file.path(mypath, paste0("global_model_jackknife_iter", a, "_training_testing.csv")))
# sub-loop to output 2 types of plots per test, training and testing
for(b in plot.types) {
# plot training jackknife
jk.obj_plot <- plotJk(jk = jk.obj, type = b)
# modify objects to add titles and subtitles
jk.obj_plot <- jk.obj_plot +
ggtitle(paste("global_entire- jackknife test")) +
labs(subtitle = paste0("iter ", a, ", ", b, "ing data"))
ggsave(jk.obj_plot,
filename = file.path(mypath, "plots", paste0("global_model_jackknife_iter", a, "_", b, "ing.jpg")),
height = 8,
width = 10,
device = "jpeg",
dpi = "retina")
# remove temp object
rm(jk.obj_plot)
}
# remove temp object
rm(jk.obj)
}
Variable importance
Lastly, the variable importance will be calculated as a percentage probability.
# calculate variable importance
var_imp <- SDMtune::varImp(model = global_model, progress = TRUE)
# write to csv
write_csv(x = var_imp, file = file.path(mypath, "global_model_variable_importance.csv"))
# plot
var_imp_plot <- SDMtune::plotVarImp(df = var_imp)
var_imp_plot <- var_imp_plot +
ggtitle("Variable importance for global_model")
ggsave(var_imp_plot,
filename = file.path(mypath, "plots", "global_model_variable_importance.jpg"),
height = 8,
width = 10,
device = "jpeg",
dpi = "retina")
Confusion Matrix for common threshold values
Lastly, I will pick out the mean cloglog threshold values for each type of threshold used in the model and run a confusion matrix for each value. A confusion matrix will show the number of true positives, false positives, true negatives and false negatives per threshold value and will help to determine which threshold makes the most sense, given our data. I will need to create a vector of these values, which can be found in the model_summary_all_iterations file created by this workflow.
# load in summary file and slice the threshold values
conf.matr.data <- read.csv(file.path(mypath, "global_model_summary_all_iterations.csv")) %>%
dplyr::slice(20, 24, 28, 32, 36, 40, 44, 48, 52)
# create empty table with threshold labels
conf.matr.output <- as.data.frame(conf.matr.data[, 1]) %>%
rename("hold_type" = "conf.matr.data[, 1]")
for (a in seq(length(global_model@models))) {
conf.matr.hold <- SDMtune::confMatrix(
model = global_model@models[[a]],
test = global_test,
type = "cloglog",
th = conf.matr.data[, 6] # the mean value for all cloglog thresholds
)
# bind threshold names
conf.matr.hold <- cbind(conf.matr.hold, conf.matr.data[, 1]) %>%
rename("hold_type" = "conf.matr.data[, 1]") %>%
dplyr::select(hold_type, everything())
# write individual run results to file
write.table(
x = conf.matr.hold,
sep = ",",
file = file.path(mypath, paste0("global_model_thresh_confusion_matrix_iter", a, ".csv")),
row.names = FALSE,
col.names = c("threshold_type", "threshold_value", "tp", "fp", "fn", "tn")
)
# before all model iterations have been summarized, append to empty table
while (a < length(global_model@models)) {
# join conf.matr.hold temporary object to empty table
conf.matr.output <- right_join(conf.matr.output, conf.matr.hold, by = "hold_type")
break
}
# when all model iterations have been appended, take the mean threshold value and write to csv
if (a == length(global_model@models)) {
# compute the mean of the 5 columns
conf.matr.output <- mutate(conf.matr.output,
tp_mean = rowMeans(dplyr::across(contains("tp"))), # compute the average of all "tp" columns
fp_mean = rowMeans(dplyr::across(contains("fp"))),
fn_mean = rowMeans(dplyr::across(contains("fn"))),
tn_mean = rowMeans(dplyr::across(contains("tn"))),
th = rowMeans(dplyr::across(contains("th")))
) %>%
dplyr::select(hold_type, th, tp_mean, fp_mean, fn_mean, tn_mean) %>%
rename("threshold_value" = "th",
"threshold_type" = "hold_type") %>%
dplyr::select(threshold_type, threshold_value, everything())
# write to csv
write.csv(
x = conf.matr.output,
file = file.path(mypath, paste0("global_model_thresh_confusion_matrix_all_iterations.csv")),
row.names = FALSE
)
}
#remove temp object
rm(conf.matr.hold)
}
References
Feng, X. (2022). Shandongfx/nimbios_enm [HTML]. https://github.com/shandongfx/nimbios_enm (Original work published 2018).
Phillips, S. J., Anderson, R. P., & Schapire, R. E. (2006). Maximum entropy modeling of species geographic distributions. Ecological Modelling, 190(3), 231–259. https://doi.org/10.1016/j.ecolmodel.2005.03.026
Srivastava, V., Roe, A. D., Keena, M. A., Hamelin, R. C., & Griess, V. C. (2021). Oh the places they’ll go: Improving species distribution modelling for invasive forest pests in an uncertain world. Biological Invasions, 23(1), 297–349. https://doi.org/10.1007/s10530-020-02372-9
Vignali, S., Barras, A. G., Arlettaz, R., & Braunisch, V. (2020). SDMtune: An R package to tune and evaluate species distribution models. Ecology and Evolution, 10(20), 11488–11506. https://doi.org/10.1002/ece3.6786