Workflow for the functions 'compute_MaxEnt_summary

Note This vignette is an example of the workflow for the function within this package called compute_MaxEnt_summary_statistics(), which is used in vignette 050 to obtain the final results. This vignette does not need to be run to obtain the final results of the workflow.

Overview

This vignette gives an example of the workflow for producing a MaxEnt model in R and all of its associated outputs. Chunks 1-10 share the same formatting as with vignette 050, which is the parent of this vignette. This vignette was created to isolate the entire workflow for creating a model. Some processes in this vignette are not used for all models produced, such as a rasterized predictions of suitability.

The workflow depends heavily on the R package SDMtune. This workflow can be used for any dataset to train a maxent model, extract the summary statistics and make predictions for pointwise species presence data or entire areas (both are exemplified in this workflow). Here is a list of the outputs and summary statistics that will be obtained from the models produced by this vignette:

.RDS file of the MaxEnt model
Presence data used to train and test the model
suitability map projected to the area of interest
presence/absence map generated using the Maximum Sensitivity plus specificity threshold produced by the model
point-wise predictions for species presence data

AUC
ROC plot
TSS
Suitability thresholds (including minimum training presence, 10% training presence, etc)
Marginal response curves
Uni-variate response curves
Variable permutation importance
Jackknife leave-one-out test and plot
Other model settings

Most of these can also be obtained using a function in this package called compute_MaxEnt_summary_statistics(), which creates outputs of the summary statistics, writes the .RDS file and records the presence data used to train and test the model (so it does NOT create a suitability map, presence/absence map or point-wise predictions).

To run this vignette, first run up until step 2 of vig 050. This will create the necessary files and objects to run this vignette.

Setup

Note: I will be setting the global options of this document so that only certain code chunks are rendered in the final .html file. I will set the eval = FALSE so that none of the code is re-run (preventing files from being overwritten during knitting) and will simply overwrite this in chunks with plots.

library(tidyverse)  #data manipulation

library(here) #making directory pathways easier on different instances
# here() starts at the root folder of this package.
library(devtools)
library(rJava) # for running MaxEnt

library(dismo) # package underneath SDMtune
library(SDMtune) # main package used to run SDMs

# dependencies of SDMtune
library(kableExtra) # required for producing model reports
library(plotROC) # plots ROCs
library(rasterVis)

library(viridis)

# spatial data handling
library(terra) 

library(pROC)

map_style <- list(
  xlab("longitude"),
  ylab("latitude"),
  theme_classic(),
  theme(legend.position = "bottom",
        panel.background = element_rect(fill = "lightblue",
                                colour = "lightblue")
        ),
  coord_equal() 
)

mypath <- file.path(here::here() %>% 
                       dirname(),
                     "maxent/models")

# create directory for model
dir.create(path = file.path(mypath, "slf_global_v3_TEST"))
dir.create(path = file.path(mypath, "slf_global_v3_TEST", "plots"))

NOTE the output created from this point to the end of the vignette can be created using the function scari::compute_MaxEnt_summary_statistics(). This does not include the ending appendix or the chunks that create an output using SDMtune::predict().

mypath <- file.path(here::here() %>% 
                       dirname(),
                     "maxent/models/slf_global_v3_TEST")

# save model 
saveRDS(object = global_model, file = file.path(mypath, "global_model.rds"))

From here::here, the analysis is performed in SDMtune. If you wanted to perform your analysis in dismo, the function SDMtune::SDMmodel2MaxEnt() will convert the above model object into a MaxEnt object in the dismo package.

Get AUC, TSS, and variable contributions

I will retrieve AUC and TSS values for both training and test data and save this output as a .csv.

# vector of AUC values
AUC <- c(
  # get training AUC value
  SDMtune::auc(global_model),
  # get test AUC value
  SDMtune::auc(global_model, test = global_test)
)

# vector of TSS values
TSS <- c(
   # get training TSS value
  SDMtune::tss(global_model),
  # get test TSS value
  SDMtune::tss(global_model, test = global_test)
)

# compile to data frame
auc_tss <- data.frame(AUC, TSS, row.names = c("Training", "Test"))

# write csv
write.csv(auc_tss, file = file.path(mypath, "global_model_auc_tss.csv"), row.names = TRUE)

The relative contribution and permutation importance of each environmental covariate is also available via the SDMtune::maxentVarImp() function. These will be put into a table.

# variable importance
var_imp <- SDMtune::maxentVarImp(global_model)
# write to csv
write.csv(var_imp, file = file.path(mypath, "global_model_var_contrib.csv"), row.names = FALSE)

Plot ROC, univariate and marginal response curves

First, I will plot the ROC curves, which show the true positive vs false positive rates. This will be done for each iteration of the model. An AUC of 0.5 or better is desirable.

# loop that plots an ROC curve for each model iteration
for (a in seq(length(global_model@models))) {
  
  # load in temp plot object 
  plot.ROC <- SDMtune::plotROC(global_model@models[[a]], test = global_test)
  
  # add title
  plot.ROC <- plot.ROC +
    ggtitle(paste("ROC curve (sensitivity  vs 1 - Specificity) for", "global_model", "iteration", a))  
  
  # save output
  ggsave(plot.ROC, 
         filename = file.path(mypath, "plots", paste0("global_model_", "iter", a, "_ROC.jpg")),
         height = 8, 
         width = 10,
         device = "jpeg",
         dpi = "retina")
  
  # remove temp object 
  rm(plot.ROC)

}

Next, I will plot both the univariate and marginal response curves.

According to the MaxEnt output, the univariate response curves “represent a different model, namely, a Maxent model created using only the corresponding variable” (Phillips et.al, 2006). These models were trained with each variable alone, to show how that variable affects the probability of presence.

threshold.types <- c("cloglog", "logistic")

# for each variable used in the model, create a response curve
for (a in names(x_env_covariates)) {
  
  # load in temp plot object 
  plot.univar <- SDMtune::plotResponse(
    model = global_model,
    var = a,
    type = "cloglog",
    fun = "mean",
    rug = TRUE
    )
  
  # add title
  plot.univar <- plot.univar +
    ggtitle(paste("global_model", a, "univariate response curve- cloglog output"))  
  
  # save output
  ggsave(plot.univar, 
         filename = file.path(mypath, "plots", paste0("global_model_", a, "_univar_resp_curve_cloglog", ".jpg")),
         height = 8, 
         width = 10,
         device = "jpeg",
         dpi = "retina")
  
  # remove temp object 
  rm(plot.univar)
  
}

Meanwhile, the marginal response curves “show how each environmental variable affects the Maxent prediction. The curves show how the predicted probability of presence changes as each environmental variable is varied, keeping all other environmental variables at their average sample value” (Phillips et.al, 2006). The marginal response curves show the effects of changing that variable.

# for each variable used in the model, create a response curve
for (a in names(x_env_covariates)) {
  
  # for each threshold type, plot a response curve
  for (b in threshold.types) {
  
    # load in temp plot object 
    plot.response <- SDMtune::plotResponse(
      model = global_model,
      var = a,
      type = b,
      marginal = TRUE,
      fun = "mean",
      rug = TRUE
      )
    
    # add title
    plot.response <- plot.response +
      ggtitle(paste("global_model", a, "marginal response curve-", b, "output"))  
    
    # save output
    ggsave(plot.response, 
           filename = file.path(mypath, "plots", paste0("global_model_", a, "_marg_resp_curve_", b, ".jpg")),
           height = 8, 
           width = 10,
           device = "jpeg",
           dpi = "retina")
    
    # remove temp object 
    rm(plot.response)
  
  }

}

Extract summary statistics

Next, I will extract the summary statistics from each model that include the presence/absence thresholds, training statistics and other useful metrics. Specifically, I am interested in extracting particular thresholds (minimum training presence, 10 percentile training presence and maximum training sensitivity plus specificity) so that I can create presence/absence maps based on these. I will save each of the individual tables. I will also extract the results of each model iteration into a table, take the mean of those statistics, and save it.

# create empty date frame for joining
empty.table <- as.data.frame(global_model@models[[1]]@model@results) %>%
  mutate(statistic = row.names(global_model@models[[1]]@model@results)) %>% # add column of statistic names
  dplyr::select(statistic) 

# loop to write summary statistics for each iteration of the model
for (i in seq(length(global_model@models))) {
  
  # load in temp object 
  data.obj <- as.data.frame(global_model@models[[i]]@model@results)
  
  # change colname
  colnames(data.obj) <- paste0("iter_", i, "_summary")
  # write to csv
  write.csv(
    x = data.obj, 
    file = file.path(mypath, paste0("global_model_summary_iter", i, ".csv")), 
    row.names = TRUE
    )
  
  # add column name to data object
  data.obj <- mutate(data.obj, statistic = row.names(global_model@models[[1]]@model@results))
  
  # while i has not reached the length of the model replicates, append the summary statistics to the empty table
  while (i < length(global_model@models)) {
    # join data.obj temporary object to empty table
    empty.table <- right_join(empty.table, data.obj, by = "statistic")
    
    break
    
  }
  
  # if i has reached the length of the model replicates, compute a mean and write the final table to .csv
  if (i == length(global_model@models)) {
     # compute the mean of the 5 columns     
     empty.table <- mutate(empty.table, mean = rowMeans(empty.table[, -1]))
     # write to csv
     write.csv(
       x = empty.table, 
       file = file.path(mypath, paste0("global_model_summary_all_iterations.csv")),
       row.names = FALSE
       )
      
    }
  
  # remove temp object
  rm(data.obj)
  
}

Write training and testing data to file

It is always a good idea to have the exact data used to train and test a model on hand, so I will write these to the same directory.

# write training data to csv
SDMtune::swd2csv(swd = global_train, file_name = file.path(mypath, "global_train_Kfold.csv"))
# load csv in again
global_train_Folds <- read.csv(file = file.path(mypath, "global_train_Kfold.csv"))
# bind training data with folds
global_train_Folds <- cbind(global_train_Folds, global_trainFolds)
# overwrite csv for training 
write_csv(x = global_train_Folds, file = file.path(mypath, "global_train_Kfold.csv"))

# write testing data to csv
SDMtune::swd2csv(swd = global_test, file_name = file.path(mypath, "global_test.csv"))

Jackknife test for variable importance

The jackknife test is a leave-one-out test that iteratively leaves out each predictor variable and produces a model with all other variables. It also has the option to create a model for that predictor variable alone. It then records the gain or loss in AUC value. This is very informative test and is designed to test variable importance. I will perform a jackknife for the entire model’s training data alone. I will also create one for each model iteration using both training and testing data.

# jackknife
global_JK <- SDMtune::doJk(
  model = global_model,
  # test = global_test, # not used for models with replicates
  metric = "auc",
  with_only = TRUE, # also run test for each variable alone
  progress = TRUE
  )

# write raw data to csv
write_csv(x = global_JK, file = file.path(mypath, "global_model_jackknife_all_iterations_training.csv"))

# plot jackknife
global_JK_plot <- plotJk(jk = global_JK, type = "train")

global_JK_plot <- global_JK_plot +
  ggtitle(paste("global_entire- jackknife test")) +
  labs(subtitle = "all iterations, training data")

ggsave(global_JK_plot, 
           filename = file.path(mypath, "plots", "global_model_jackknife_all_iterations_training.jpg"),
           height = 8, 
           width = 10,
           device = "jpeg",
           dpi = "retina")

I will also perform a jackknife for each of the model iterations, based on training and test data.

plot.types <- c("train", "test")

# for loop to output csv file of jackknife test, one per model iteration
for (a in seq(length(global_model@models))) {
  
  # jackknife
  jk.obj <- SDMtune::doJk(
    model = global_model@models[[a]],
    test = global_test, 
    metric = "auc",
    with_only = TRUE, # also run test for each variable alone
    progress = TRUE
    )
  
  # write raw data to csv
  write_csv(x = jk.obj, file = file.path(mypath, paste0("global_model_jackknife_iter", a, "_training_testing.csv")))
    
  # sub-loop to output 2 types of plots per test, training and testing
    for(b in plot.types) {
    
      # plot training jackknife
      jk.obj_plot <- plotJk(jk = jk.obj, type = b)
      
      # modify objects to add titles and subtitles
      jk.obj_plot <- jk.obj_plot +
        ggtitle(paste("global_entire- jackknife test")) +
        labs(subtitle = paste0("iter ", a, ", ", b, "ing data"))
      
      ggsave(jk.obj_plot, 
             filename = file.path(mypath, "plots", paste0("global_model_jackknife_iter", a, "_", b, "ing.jpg")),
             height = 8, 
             width = 10,
             device = "jpeg",
             dpi = "retina")
      
      # remove temp object 
      rm(jk.obj_plot)
  
  }
  
  # remove temp object 
  rm(jk.obj)
  
}

Variable importance

Lastly, the variable importance will be calculated as a percentage probability.

# calculate variable importance
var_imp <- SDMtune::varImp(model = global_model, progress = TRUE)
# write to csv
write_csv(x = var_imp, file = file.path(mypath, "global_model_variable_importance.csv"))

# plot
var_imp_plot <- SDMtune::plotVarImp(df = var_imp)

var_imp_plot <- var_imp_plot +
  ggtitle("Variable importance for global_model")  

ggsave(var_imp_plot, 
       filename = file.path(mypath, "plots", "global_model_variable_importance.jpg"),
       height = 8, 
       width = 10,
       device = "jpeg",
       dpi = "retina")

Confusion Matrix for common threshold values

Lastly, I will pick out the mean cloglog threshold values for each type of threshold used in the model and run a confusion matrix for each value. A confusion matrix will show the number of true positives, false positives, true negatives and false negatives per threshold value and will help to determine which threshold makes the most sense, given our data. I will need to create a vector of these values, which can be found in the model_summary_all_iterations file created by this workflow.

# load in summary file and slice the threshold values
conf.matr.data <- read.csv(file.path(mypath, "global_model_summary_all_iterations.csv")) %>%
  dplyr::slice(20, 24, 28, 32, 36, 40, 44, 48, 52)

# create empty table with threshold labels
  conf.matr.output <- as.data.frame(conf.matr.data[, 1]) %>%
    rename("hold_type" = "conf.matr.data[, 1]")

for (a in seq(length(global_model@models))) {

  conf.matr.hold <- SDMtune::confMatrix(
    model = global_model@models[[a]],
    test = global_test,
    type = "cloglog",
    th = conf.matr.data[, 6] # the mean value for all cloglog thresholds
    ) 
  
  # bind threshold names
  conf.matr.hold <- cbind(conf.matr.hold, conf.matr.data[, 1]) %>%
    rename("hold_type" = "conf.matr.data[, 1]") %>%
    dplyr::select(hold_type, everything())
  
  # write individual run results to file
   write.table(
    x = conf.matr.hold, 
    sep = ",",
    file = file.path(mypath, paste0("global_model_thresh_confusion_matrix_iter", a, ".csv")), 
    row.names = FALSE,
    col.names = c("threshold_type", "threshold_value", "tp", "fp", "fn", "tn")
    )
  
   # before all model iterations have been summarized, append to empty table
   while (a < length(global_model@models)) {
      # join conf.matr.hold temporary object to empty table
      conf.matr.output <- right_join(conf.matr.output, conf.matr.hold, by = "hold_type")
      
      break
      
      }
    
     # when all model iterations have been appended, take the mean threshold value and write to csv
    if (a == length(global_model@models)) {
      # compute the mean of the 5 columns     
      conf.matr.output <- mutate(conf.matr.output, 
                                 tp_mean = rowMeans(dplyr::across(contains("tp"))), # compute the average of all "tp" columns
                                 fp_mean = rowMeans(dplyr::across(contains("fp"))),
                                 fn_mean = rowMeans(dplyr::across(contains("fn"))),
                                 tn_mean = rowMeans(dplyr::across(contains("tn"))),
                                 th = rowMeans(dplyr::across(contains("th")))
                                 ) %>%
        dplyr::select(hold_type, th, tp_mean, fp_mean, fn_mean, tn_mean) %>%
        rename("threshold_value" = "th",
               "threshold_type" = "hold_type") %>%
        dplyr::select(threshold_type, threshold_value, everything())
     
      # write to csv
      write.csv(
        x = conf.matr.output, 
        file = file.path(mypath, paste0("global_model_thresh_confusion_matrix_all_iterations.csv")),
        row.names = FALSE
        )
  
       }
  
  #remove temp object
  rm(conf.matr.hold)

}

References

Feng, X. (2022). Shandongfx/nimbios_enm [HTML]. https://github.com/shandongfx/nimbios_enm (Original work published 2018).
Phillips, S. J., Anderson, R. P., & Schapire, R. E. (2006). Maximum entropy modeling of species geographic distributions. Ecological Modelling, 190(3), 231–259. https://doi.org/10.1016/j.ecolmodel.2005.03.026
Srivastava, V., Roe, A. D., Keena, M. A., Hamelin, R. C., & Griess, V. C. (2021). Oh the places they’ll go: Improving species distribution modelling for invasive forest pests in an uncertain world. Biological Invasions, 23(1), 297–349. https://doi.org/10.1007/s10530-020-02372-9
Vignali, S., Barras, A. G., Arlettaz, R., & Braunisch, V. (2020). SDMtune: An R package to tune and evaluate species distribution models. Ecology and Evolution, 10(20), 11488–11506. https://doi.org/10.1002/ece3.6786

Workflow for the functions 'compute_MaxEnt_summary_statistics()'

Samuel M. Owens¹

2024-08-27

Overview

Setup

Get AUC, TSS, and variable contributions

Plot ROC, univariate and marginal response curves

Extract summary statistics

Write training and testing data to file

Jackknife test for variable importance

Variable importance

Confusion Matrix for common threshold values

References

Workflow for the functions 'compute_MaxEnt_summary_statistics()'

Samuel M. Owens1

2024-08-27

Overview

Setup

Get AUC, TSS, and variable contributions

Plot ROC, univariate and marginal response curves

Extract summary statistics

Write training and testing data to file

Jackknife test for variable importance

Variable importance

Confusion Matrix for common threshold values

References

Samuel M. Owens¹