Data Tidying • slfrsk

Wrangle and Tidy Package data

All of the following data sources will be read in as raw data from slfrsk/data/ (originally cleaned a bit from raw data versions in slfrsk/data-raw/) and then cleaned up individually in this vignette (see /data-raw/convert_data_rda.R for script and commenting that includes earlier unit conversion and transformation of raw data from .csv to .rda files used here). Depending on the situation, the resultant “cleaned/tidied” data may be saved to slfrsk/data/ as a new .rda file. In this vignette, the following data are treated:

Establishment potential
Transport potential
Impact potential

Setup

Here we load the required packages to prepare the data.

library(slfrsk) #this package for data sources
library(tidyverse) #data tidying and cleaning
library(here) #making directory pathways easier on different instances

Establishment

Maxent Ensemble Model

Previously, we built three separate species distribution models (SDMs) with MaxEnt (See species distribution modeling analysis for more information). We also obtained summary statistics for suitability in each geopolitical unit (country and U.S. state) in each SDM (as well as an ensemble SDM), which were transferred to the states_extracts and countries_extracts datasets (or states_extracts_ensemble and countries_extracts_ensemble for the ensemble SDM).

Here, we explore the ensemble maximum suitability by obtaining the maximum suitability per geopolitical unit for each SDM and calculating the mean and standard error across the models (or just use the extracted maximum value for ensemble models). These summarized suitabilities constitute our predicted Establishment potential.

We also strip Western Sahara from the countries datasets here and for other datsets, as it is not informative for our analyses (zero trade for Transport potential below).

#new extract of ensemble
if(TRUE){
  data("states_extracts_ensemble")

summary_states_extracts <- states_extracts_ensemble %>%
  group_by(geopol_unit) %>%
  summarize(grand_mean_max = mean(obs_max))

data("countries_extracts_ensemble")
  
summary_countries_extracts <- countries_extracts_ensemble %>%
  group_by(geopol_unit) %>%
  summarize(grand_mean_max = mean(obs_max))

#rm W. Sahara
summary_countries_extracts <- summary_countries_extracts %>%
    filter(geopol_unit != "Western Sahara")
  
}

Transport

Import Trade Data

Now that Establishment data are nicely organized, we can move on to Transport. We estimate Transport potential based on the average annual mass of goods traded (in metric tons) with SLF-established U.S. states over 2012–2017 (we also track value of traded goods in USD). Our existing data can be partitioned into geopolitical units as above (states and countries), but can further be separated into two categories: 1. states with confirmed SLF populations (present) and 2. states that have a high probability of hosting SLF populations in the near future based on proximity to present states and records of isolated observations (future).

Although we maintain several possible permutations of data to analyze, we focus on mass of goods traded for present scenarios, as these data are unlikely to be skewed by trade of high value goods that are less likely to lead to transportation of egg masses (e.g., electronics) and provide us with a more careful outlook for risk of slf paninvasion.

U.S. States

We first treat the states data by adding two letter codes and SLF-establishment status. For present, we treat the following states as having established SLF: CT, DE, MD, NJ, NY, OH, PA, VA, and WV. For future, we add the following states as having established SLF: NC. From here, the data are tidied and gathered before calculating the mean and standard error (SE) of trade with established states for 2012–2017. To do so, we first removed all cases of intrastate trade for established states. Then, we calculated the total trade with established states for each year before obtaining the overall mean and SE.

Below, we demonstrate the code for both present and future trade data in value (USD) for U.S. states.

#load states value
data("states_trade_summary_slf")
data("states_trade_summary_slf_future")

#code snippet to attach infection status and add state abbreviations
#state id's for adding to extracts
stateid <- c("AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA", "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT", "VA", "WA", "DC", "WV", "WI", "WY")
stateid <- tibble(stateid, states_trade_summary_slf$destination)
colnames(stateid) <- c("ID", "geopol_unit")

#PRESENT

#add in the infection status of each state while adding the abbreviations
 states_trade_summary_slf <-  left_join(states_trade_summary_slf, stateid, by = c("destination" = "geopol_unit")) %>%
    mutate(status = ifelse(ID %in% c("CT", "DE", "MD", "NJ", "NY", "OH", "PA", "VA", "WV"), "established", "not established")) %>%
    dplyr::select(destination, ID, status, everything())
  
  #tidying of data
  states_trade_summary_slf <- states_trade_summary_slf %>%
    gather(-destination:-status, key = "infected_state", value = "trade") %>%
    as.data.frame(.)
  
  #total infected states trade, average/SE
  summary_states_trade_value <- states_trade_summary_slf %>%
    group_by(destination) %>%
    #filter to be just trade with infected states present
    filter(str_detect(infected_state, "pa_") | 
             str_detect(infected_state, "de_") | 
             str_detect(infected_state, "va_") | 
             str_detect(infected_state, "nj_") | 
             str_detect(infected_state, "md_") | 
             str_detect(infected_state, "ct_") | 
             str_detect(infected_state, "ny_") | 
             str_detect(infected_state, "oh_") | 
             str_detect(infected_state, "wv_")
           ) %>%
    #remove the all state trade from consideration
    filter(str_detect(infected_state, "201") & !str_detect(infected_state, "all")) %>%  
    #make a temporary col
    mutate(tempcol = str_split(infected_state, "_")) %>%  
    #group by row
    rowwise() %>% 
    #split out the info about states and years of trade
    mutate(source = unlist(tempcol)[1], year = unlist(tempcol)[2]) %>%
    #rm the temp col
    dplyr::select(-tempcol) %>% 
    #group based now on the destination, year, and infection status
    #filter out the intrastate trade for infected states to avoid extra low averages by creating a hybrid string (DestinationState+abbreviated_infected state)
    group_by(destination, year, status) %>% 
    mutate(clear_intra = paste0(destination, source)) %>%
    filter(!clear_intra %in% 
             c("Connecticutct", 
               "Delawarede", 
               "Marylandmd", 
               "New Jerseynj", 
               "New Yorkny", 
               "Ohiooh", 
               "Pennsylvaniapa", 
               "Virginiava", 
               "West Virginiawv")
           ) %>%
    #remove workaround infected intrastate column
    dplyr::select(-clear_intra) %>%
    #add up the trade with infected states for each destination state and year (THIS IS NOT AVG INFECTED TRADE YET)
    summarize(avg_infected_trade = sum(trade)) %>%  
    group_by(destination, status) %>%
    #obtain the standard error (SE) of infected trade across years (this repeats across all of the entries across years and thus becomes its own value)
    mutate(se = (sd(avg_infected_trade, na.rm = T) / sqrt(length(year)))) %>% 
    #summarize the mean and SE across years! (there is no need to actually take the mean of SE here, but it makes summarizing simplier)
    summarise(avg_infected_trade = mean(avg_infected_trade, na.rm = T), se = mean(se))  
  
#FUTURE
    states_trade_summary_slf_future <-  left_join(states_trade_summary_slf_future, stateid, by = c("destination" = "geopol_unit")) %>%
    mutate(status = ifelse(ID %in% c("CT", "DE", "MD", "NC", "NJ", "NY", "OH", "PA", "VA", "WV"), "established", "not established")) %>%
    dplyr::select(destination, ID, status, everything())
  
  #tidying of data
  states_trade_summary_slf_future <- states_trade_summary_slf_future %>%
    gather(-destination:-status, key = "infected_state", value = "trade") %>%
    as.data.frame(.)
  
  #total infected states trade, average/SE
  summary_states_trade_value_future <- states_trade_summary_slf_future %>%
    group_by(destination) %>%
    filter(str_detect(infected_state, "pa_") | 
             str_detect(infected_state, "de_") | 
             str_detect(infected_state, "va_") | 
             str_detect(infected_state, "nj_") | 
             str_detect(infected_state, "md_") | 
             str_detect(infected_state, "nc_") | 
             str_detect(infected_state, "ny_") | 
             str_detect(infected_state, "oh_") | 
             str_detect(infected_state, "wv_") | 
             str_detect(infected_state, "ct_")
           ) %>%
    filter(str_detect(infected_state, "201") & !str_detect(infected_state, "all")) %>%
    mutate(tempcol = str_split(infected_state, "_")) %>%
    rowwise() %>%
    mutate(source = unlist(tempcol)[1], year = unlist(tempcol)[2]) %>%
    dplyr::select(-tempcol) %>%
    group_by(destination, year, status) %>%
    #filter out the intrastate trade for infected states to avoid extra low averages
    mutate(clear_intra = paste0(destination, source)) %>%
    filter(!clear_intra %in% 
             c("Connecticutct", 
               "Delawarede", 
               "Marylandmd", 
               "North Carolinanc", 
               "New Jerseynj", 
               "New Yorkny", 
               "Ohiooh", 
               "Pennsylvaniapa", 
               "Virginiava", 
               "West Virginiawv"
               )
           ) %>%
    #remove workaround infected intrastate column
    dplyr::select(-clear_intra) %>%
    summarize(avg_infected_trade = sum(trade)) %>%
    group_by(destination, status) %>%
    mutate(se = (sd(avg_infected_trade, na.rm = T) / sqrt(length(year)))) %>%
    summarise(avg_infected_trade = mean(avg_infected_trade, na.rm = T), se = mean(se))

We repeat the same process for states but for the total mass of imported goods (mass). In doing so, we convert the existing units from kg to metric tons.

#load states mass
data("states_trade_mass_summary_slf")
data("states_trade_mass_summary_slf_future")

#PRESENT
  #add in the infection status of each state while adding the abbreviations
   states_trade_mass_summary_slf <- left_join(states_trade_mass_summary_slf, stateid, by = c("destination" = "geopol_unit")) %>%
    mutate(status = ifelse(ID %in% c("CT", "DE", "MD", "NJ", "NY", "OH", "PA", "VA", "WV"), "established", "not established")) %>%
    dplyr::select(destination, ID, status, everything())

     #tidying of data and dividing by 1000 to convert from kg to metric tons
  states_trade_mass_summary_slf <- states_trade_mass_summary_slf %>%
    gather(-destination:-status, key = "infected_state", value = "mass") %>%
    mutate(mass = mass / 1000) %>%
    as.data.frame(.)
  
  #mass
  summary_states_trade_mass <- states_trade_mass_summary_slf %>%
    group_by(destination) %>%
    #filter to be just trade with infected states present
    filter(str_detect(infected_state, "pa_") | 
             str_detect(infected_state, "de_") | 
             str_detect(infected_state, "va_") | 
             str_detect(infected_state, "nj_") | 
             str_detect(infected_state, "md_") | 
             str_detect(infected_state, "ct_") | 
             str_detect(infected_state, "ny_") | 
             str_detect(infected_state, "oh_") | 
             str_detect(infected_state, "wv_")
           ) %>%
    filter(str_detect(infected_state, "201") & !str_detect(infected_state, "all")) %>%
    mutate(tempcol = str_split(infected_state, "_")) %>%
    rowwise() %>%
    mutate(source = unlist(tempcol)[1], year = unlist(tempcol)[2]) %>%
    dplyr::select(-tempcol) %>%
    group_by(destination, year, status) %>%
    #filter out the intrastate trade for infected states to avoid extra low averages
    mutate(clear_intra = paste0(destination, source)) %>%
    filter(!clear_intra %in% 
             c("Connecticutct", 
               "Delawarede", 
               "Marylandmd", 
               "New Jerseynj", 
               "New Yorkny", 
               "Ohiooh", 
               "Pennsylvaniapa", 
               "Virginiava", 
               "West Virginiawv")
           ) %>%
    #remove workaround infected intrastate column
    dplyr::select(-clear_intra) %>%
    summarize(avg_infected_mass = sum(mass)) %>%
    group_by(destination, status) %>%
    mutate(se = (sd(avg_infected_mass, na.rm = T) / sqrt(length(year)))) %>%
    summarise(avg_infected_mass = mean(avg_infected_mass, na.rm = T), se = mean(se))

#FUTURE
   #add col to identify infected states
  states_trade_mass_summary_slf_future <-  left_join(states_trade_mass_summary_slf_future, stateid, by = c("destination" = "geopol_unit")) %>%
    mutate(status = ifelse(ID %in% c("CT", "DE", "MD", "NC", "NJ", "NY", "OH", "PA", "VA", "WV"), "established", "not established")) %>%
    dplyr::select(destination, ID, status, everything())
  
  #tidying of data and dividing by 1000 to convert from kg to metric tons
  states_trade_mass_summary_slf_future <- states_trade_mass_summary_slf_future %>%
    gather(-destination:-status, key = "infected_state", value = "mass") %>%
    mutate(mass = mass / 1000) %>%
    as.data.frame(.)
  
  #mass
  summary_states_trade_mass_future <- states_trade_mass_summary_slf_future %>%
    group_by(destination) %>%
    filter(str_detect(infected_state, "pa_") | 
             str_detect(infected_state, "de_") | 
             str_detect(infected_state, "va_") | 
             str_detect(infected_state, "nj_") | 
             str_detect(infected_state, "md_") | 
             str_detect(infected_state, "nc_") | 
             str_detect(infected_state, "ny_") | 
             str_detect(infected_state, "oh_") | 
             str_detect(infected_state, "wv_") | 
             str_detect(infected_state, "ct_")
           ) %>%    
    filter(str_detect(infected_state, "201") & !str_detect(infected_state, "all")) %>%
    mutate(tempcol = str_split(infected_state, "_")) %>%
    rowwise() %>%
    mutate(source = unlist(tempcol)[1], year = unlist(tempcol)[2]) %>%
    dplyr::select(-tempcol) %>%
    group_by(destination, year, status) %>%
    #filter out the intrastate trade for infected states to avoid extra low averages
    mutate(clear_intra = paste0(destination, source)) %>%
    filter(!clear_intra %in% 
             c("Connecticutct", 
               "Delawarede", 
               "Marylandmd", 
               "North Carolinanc", 
               "New Jerseynj", 
               "New Yorkny", 
               "Ohiooh", 
               "Pennsylvaniapa", 
               "Virginiava", 
               "West Virginiawv"
               )
           ) %>%
    #remove workaround infected intrastate column
    dplyr::select(-clear_intra) %>%
    summarize(avg_infected_mass = sum(mass)) %>%
    group_by(destination, status) %>%
    mutate(se = (sd(avg_infected_mass, na.rm = T) / sqrt(length(year)))) %>%
    summarise(avg_infected_mass = mean(avg_infected_mass, na.rm = T), se = mean(se))

Here, you can see the summary for the first 10 states for the future trade data for mass of trade (metric tons).

destination	status	avg_infected_mass	se
Alabama	not established	8042185.41	89597.110
Alaska	not established	91775.66	3473.339
Arizona	not established	1541726.22	30613.203
Arkansas	not established	2122765.62	15105.761
California	not established	16449015.73	447722.953
Colorado	not established	1789258.81	25597.146
Connecticut	established	24300544.86	249553.550
Delaware	established	17481966.16	345415.534
District of Columbia	not established	3122179.60	30663.092
Florida	not established	13669574.02	397146.471

Countries

From here, we turn to complete the same analyses for countries as states (value of trade in USD and then mass of trade in metric tons). Notably, we do not have to worry about the intrastate trade that we filtered out in the states datasets. As such, we do not bother with attaching a $status identifier to each country (this is done as needed for visualization later).

#COUNTRIES

#VALUE
#load countries values
data("countries_trade_summary_slf")
data("countries_trade_summary_slf_future")

#PRESENT

#tidying of data and log transforming values
countries_trade_summary_slf <- countries_trade_summary_slf %>%
  gather(-destination, key = "infected_state", value = "trade") %>%
  as.data.frame(.)

  #value
  summary_countries_trade_value <- countries_trade_summary_slf %>%
    group_by(destination) %>%
    filter(str_detect(infected_state, "pa_") | 
             str_detect(infected_state, "de_") | 
             str_detect(infected_state, "va_") | 
             str_detect(infected_state, "nj_") | 
             str_detect(infected_state, "md_") | 
             str_detect(infected_state, "ct_") | 
             str_detect(infected_state, "ny_") | 
             str_detect(infected_state, "oh_") | 
             str_detect(infected_state, "wv_")
           ) %>%
    filter(str_detect(infected_state, "201") & !str_detect(infected_state, "all")) %>%
    mutate(tempcol = str_split(infected_state, "_")) %>%
    rowwise() %>%
    mutate(source = unlist(tempcol)[1], year = unlist(tempcol)[2]) %>%
    dplyr::select(-tempcol) %>%
    group_by(destination, year) %>%
    summarize(avg_infected_trade = sum(trade)) %>%
    group_by(destination) %>%
    mutate(se = (sd(avg_infected_trade, na.rm = T) / sqrt(length(year)))) %>%
    summarise(avg_infected_trade = mean(avg_infected_trade, na.rm = T), se = mean(se))

#FUTURE
  
#tidying of data and log transforming values
countries_trade_summary_slf_future <- countries_trade_summary_slf_future %>%
  gather(-destination, key = "infected_state", value = "trade") %>%
  as.data.frame(.)

  #value
  summary_countries_trade_value_future <- countries_trade_summary_slf_future %>%
    group_by(destination) %>%
    filter(str_detect(infected_state, "pa_") | 
             str_detect(infected_state, "de_") | 
             str_detect(infected_state, "va_") | 
             str_detect(infected_state, "nj_") | 
             str_detect(infected_state, "md_") | 
             str_detect(infected_state, "nc_") | 
             str_detect(infected_state, "ny_") | 
             str_detect(infected_state, "oh_") | 
             str_detect(infected_state, "wv_") | 
             str_detect(infected_state, "ct_")
           ) %>%    
    filter(str_detect(infected_state, "201") & !str_detect(infected_state, "all")) %>%
    mutate(tempcol = str_split(infected_state, "_")) %>%
    rowwise() %>%
    mutate(source = unlist(tempcol)[1], year = unlist(tempcol)[2]) %>%
    dplyr::select(-tempcol) %>%
    group_by(destination, year) %>%
    summarize(avg_infected_trade = sum(trade)) %>%
    group_by(destination) %>%
    mutate(se = (sd(avg_infected_trade, na.rm = T) / sqrt(length(year)))) %>%
    summarise(avg_infected_trade = mean(avg_infected_trade, na.rm = T), se = mean(se))

#MASS
#load countries mass
data("countries_trade_mass_summary_slf")
data("countries_trade_mass_summary_slf_future")

#PRESENT

#tidying of data and converting from kg to metric tons 
countries_trade_mass_summary_slf <- countries_trade_mass_summary_slf %>%
  gather(-destination, key = "infected_state", value = "mass") %>%
  mutate(mass = mass / 1000) %>%
  as.data.frame(.)

  #mass
  summary_countries_trade_mass <- countries_trade_mass_summary_slf %>%
    group_by(destination) %>%
    filter(str_detect(infected_state, "pa_") | 
             str_detect(infected_state, "de_") | 
             str_detect(infected_state, "va_") | 
             str_detect(infected_state, "nj_") | 
             str_detect(infected_state, "md_") | 
             str_detect(infected_state, "ct_") | 
             str_detect(infected_state, "ny_") | 
             str_detect(infected_state, "oh_") | 
             str_detect(infected_state, "wv_")
           ) %>%
    filter(str_detect(infected_state, "201") & !str_detect(infected_state, "all")) %>%
    mutate(tempcol = str_split(infected_state, "_")) %>%
    rowwise() %>%
    mutate(source = unlist(tempcol)[1], year = unlist(tempcol)[2]) %>%
    dplyr::select(-tempcol) %>%
    group_by(destination, year) %>%
    summarize(avg_infected_mass = sum(mass)) %>%
    group_by(destination) %>%
    mutate(se = (sd(avg_infected_mass, na.rm = T) / sqrt(length(year)))) %>%
    summarise(avg_infected_mass = mean(avg_infected_mass, na.rm = T), se = mean(se))
  
#FUTURE
  
#tidying of data and converting from kg to metric tons 
countries_trade_mass_summary_slf_future <- countries_trade_mass_summary_slf_future %>%
  gather(-destination, key = "infected_state", value = "mass") %>%
  mutate(mass = mass / 1000) %>%
  as.data.frame(.)

  #mass
  summary_countries_trade_mass_future <- countries_trade_mass_summary_slf_future %>%
    group_by(destination) %>%
    filter(str_detect(infected_state, "pa_") | 
             str_detect(infected_state, "de_") | 
             str_detect(infected_state, "va_") | 
             str_detect(infected_state, "nj_") | 
             str_detect(infected_state, "md_") | 
             str_detect(infected_state, "nc_") | 
             str_detect(infected_state, "ny_") | 
             str_detect(infected_state, "oh_") | 
             str_detect(infected_state, "wv_") | 
             str_detect(infected_state, "ct_")
           ) %>%    
    filter(str_detect(infected_state, "201") & !str_detect(infected_state, "all")) %>%
    mutate(tempcol = str_split(infected_state, "_")) %>%
    rowwise() %>%
    mutate(source = unlist(tempcol)[1], year = unlist(tempcol)[2]) %>%
    dplyr::select(-tempcol) %>%
    group_by(destination, year) %>%
    summarize(avg_infected_mass = sum(mass)) %>%
    group_by(destination) %>%
    mutate(se = (sd(avg_infected_mass, na.rm = T) / sqrt(length(year)))) %>%
    summarise(avg_infected_mass = mean(avg_infected_mass, na.rm = T), se = mean(se))

Here, you can see the summary for the first 10 countries for the future trade data for mass of trade (metric tons), just like the states data.

destination	avg_infected_mass	se
Afghanistan	29071.49650	9715.82565
Albania	13101.98933	3989.92572
Algeria	138022.73517	35664.42137
Andorra	96.19667	24.11481
Angola	29848.68950	3197.03340
Anguilla	1281.39267	184.55033
Antigua and Barbuda	4080.22700	1238.47426
Argentina	643884.60350	151554.71666
Armenia	6604.53150	968.13989
Aruba	57376.83367	30835.63316

Combined Final Trade Data

Now that we have several permutations of trade with SLF-established states data (both present and future scenarios) for both geopolitical units (states and countries) as measured in value and mass (USD and metric tons, respectively), we can combine some of these data into fewer dataframes to minimize the number of files we are juggling for future analyses. To do this, we merge the value and mass data for each geopolitical unit dataset.

As we do so, we take the time to create $\log_{10}$ transformed versions of the trade value and mass data. To avoid possible errors for this transformation, we add a constant ($value+1$) to data prior to transformation. We use the same transformation process for Impact potential data.

#STATES
summary_states_trade <- left_join(summary_states_trade_value, summary_states_trade_mass, by = c("destination", "status"), suffix = c("_trade", "_mass")) %>%
    mutate(log10_avg_infected_trade = log10(avg_infected_trade+1), log10_avg_infected_mass = log10(avg_infected_mass+1))

summary_states_trade_future <- left_join(summary_states_trade_value_future, summary_states_trade_mass_future, by = c("destination", "status"), suffix = c("_trade", "_mass")) %>%
    mutate(log10_avg_infected_trade = log10(avg_infected_trade+1), log10_avg_infected_mass = log10(avg_infected_mass+1))

#COUNTRIES

summary_countries_trade <- left_join(summary_countries_trade_value, summary_countries_trade_mass, by = c("destination"), suffix = c("_trade", "_mass")) %>%
    filter(destination != "Western Sahara") %>%
  mutate(log10_avg_infected_trade = log10(avg_infected_trade+1), log10_avg_infected_mass = log10(avg_infected_mass+1))

summary_countries_trade_future <- left_join(summary_countries_trade_value_future, summary_countries_trade_mass_future, by = c("destination"), suffix = c("_trade", "_mass"))  %>%
    filter(destination != "Western Sahara") %>%
  mutate(log10_avg_infected_trade = log10(avg_infected_trade+1), log10_avg_infected_mass = log10(avg_infected_mass+1))

Here is the merged states data for future:

destination	status	avg_infected_trade	se_trade	avg_infected_mass	se_mass	log10_avg_infected_trade	log10_avg_infected_mass
Alabama	not established	25746519575	86757733	8042185.41	89597.110	10.41072	6.905374
Alaska	not established	3682900711	231580906	91775.66	3473.339	9.56619	4.962732
Arizona	not established	16164900001	134829727	1541726.22	30613.203	10.20857	6.188008
Arkansas	not established	10808696870	42330206	2122765.62	15105.761	10.03377	6.326902
California	not established	114295367886	471914734	16449015.73	447722.953	11.05803	7.216140
Colorado	not established	15392144303	237938106	1789258.81	25597.146	10.18730	6.252673
Connecticut	established	50771613052	367967442	24300544.86	249553.550	10.70562	7.385616
Delaware	established	26854722185	498504512	17481966.16	345415.534	10.42902	7.242590
District of Columbia	not established	11196345230	573341994	3122179.60	30663.092	10.04908	6.494458
Florida	not established	67241610026	770574348	13669574.02	397146.471	10.82764	7.135755

And here is the merged countries data for future:

destination	avg_infected_trade	se_trade	avg_infected_mass	se_mass	log10_avg_infected_trade	log10_avg_infected_mass
Afghanistan	235695855	86773679.7	29071.49650	9715.82565	8.372352	4.463482
Albania	16536138	2452105.0	13101.98933	3989.92572	7.218434	4.117370
Algeria	503940704	64383082.2	138022.73517	35664.42137	8.702379	5.139954
Andorra	1457978	174345.8	96.19667	24.11481	6.163751	1.987651
Angola	119484021	34733545.6	29848.68950	3197.03340	8.077310	4.474940
Anguilla	3681561	196681.7	1281.39267	184.55033	6.566032	3.108021
Antigua and Barbuda	14561905	2078482.2	4080.22700	1238.47426	7.163218	3.610791
Argentina	1201585227	96144901.7	643884.60350	151554.71666	9.079755	5.808809
Armenia	22230451	1940347.2	6604.53150	968.13989	7.346948	3.819908
Aruba	76165589	9470751.5	57376.83367	30835.63316	7.881759	4.758744

Impact

Given the close relationship between cultivation of grapes (grapes) and winemaking (wine), we evaluate impact potential of SLF based on data from both.

Grapes

For grapes, two forms of cultivation data were available:

grape production (metric tons)
grape yield per unit area (metric tons/hectare)

Although we provide both sources of data, we focus more on production as a total measure of output per geopolitical unit.

data("states_grapes")

#summarize and log transform
summary_states_grapes <- states_grapes %>%
  pivot_wider(names_from = Item, values_from = Value, id_cols = c(geopol_unit, Year)) %>%
  group_by(geopol_unit) %>%
  summarize(avg_yield = mean(grape_yield, na.rm = TRUE), 
            se_yield = sd(grape_yield) / sqrt(n_distinct(Year)),
            avg_prod = mean(grape_production, na.rm = T), se_prod = sd(grape_production) / sqrt(n_distinct(Year))
            ) %>%
  drop_na(.) %>%
  mutate(log10_avg_yield = log10(avg_yield+1), log10_avg_prod = log10(avg_prod+1))

data("countries_grapes")

#first rm Western Sahara, then summarize and log transform, while adding in the step of changing NA's to zeros
summary_countries_grapes <- countries_grapes %>%
  filter(geopol_unit != "Western Sahara") %>%
  pivot_wider(names_from = Item, values_from = Value, id_cols = c(geopol_unit, Year)) %>%
  group_by(geopol_unit) %>%
  summarize(avg_yield = mean(grape_yield, na.rm = TRUE), 
            se_yield = sd(grape_yield) / sqrt(n_distinct(Year)),
            avg_prod = mean(grape_production, na.rm = T), se_prod = sd(grape_production) / sqrt(n_distinct(Year))
            ) %>%
  drop_na(.) %>%
  mutate(log10_avg_yield = log10(avg_yield+1), log10_avg_prod = log10(avg_prod+1))

#fix some problematic names of countries
summary_countries_grapes$geopol_unit[grep(pattern = "Ivoire", x = summary_countries_grapes$geopol_unit, value = F)] <- "Ivory Coast"
summary_countries_grapes$geopol_unit[grep(pattern = "Bolivia (Plurinational State of)", x = summary_countries_grapes$geopol_unit, value = F)] <- "Bolivia"
summary_countries_grapes$geopol_unit[grep(pattern = "China, Taiwan Province of", x = summary_countries_grapes$geopol_unit, value = F)] <- "Taiwan"
summary_countries_grapes$geopol_unit[grep(pattern = "Iran (Islamic Republic of)", x = summary_countries_grapes$geopol_unit, value = F)] <- "Iran"
summary_countries_grapes$geopol_unit[grep(pattern = "Venezuela (Bolivarian Republic of)", x = summary_countries_grapes$geopol_unit, value = F)] <- "Venezuela"
summary_countries_grapes$geopol_unit[grep(pattern = "Viet Nam", x = summary_countries_grapes$geopol_unit, value = F)] <- "Vietnam"

Wine

For wine, we were able to obtain wine production for both geopolitical unit datasets in units of mass, metric tons.

data("states_wine")

#summarize and log transform
summary_states_wine <- states_wine %>%
  group_by(geopol_unit) %>%
  summarize(avg_wine = mean(Mass, na.rm = T), se_wine = sd(Mass) / sqrt(n_distinct(Year)), log10_avg_wine = log10(mean(Mass, na.rm = T)+1))

data("countries_wine")

#summarize and log transform, then rm anything that started out as NA or zero
summary_countries_wine <- countries_wine %>%
  group_by(geopol_unit) %>%
  summarize(avg_wine = mean(Mass, na.rm = T), se_wine = sd(Mass) / sqrt(n_distinct(Year)), log10_avg_wine = log10(mean(Mass, na.rm = T)+1)) %>%
  filter(!is.infinite(log10_avg_wine) & !is.nan(log10_avg_wine))

Potentials Summary

Now that all of these data sources have been summarized and tidied, we will merge them together with left_join() sequentially, using the trade data (Transport potential) as the inital dataset that all others are joined to. The resultant datasets contain all permuatations of geopolitical units (states and countries) and infection scenarios (present and future).

During this process, we also use the wine and grape data to produce binary categorical data for each geopolitical unit to indicate whether it is involved directly in that industry. For this, we also manually add the United States as a grape producer in the countries datasets. Additionally, we manually add $status to the countries datasets as follows:

Native: China, India, Taiwan, and Vietnam
Established: Japan, South Korea, and United States

#STATES

##PRESENT

#add the binaries
states_summary_present <- summary_states_extracts %>%
  mutate(grapes = ifelse(geopol_unit %in% summary_states_grapes$geopol_unit, yes = "yes", no = "no")) %>%
  mutate(wine = ifelse(geopol_unit %in% summary_states_wine$geopol_unit, yes = "yes", no = "no")) %>%
  mutate(grapes = factor(grapes, levels = c("yes", "no")), wine = factor(wine, levels = c("yes", "no"))) 

#now add in the other data
states_summary_present <- states_summary_present %>%
  left_join(., summary_states_trade, by = c("geopol_unit" = "destination")) %>%   #trade
  left_join(., summary_states_wine, by = "geopol_unit") %>%   #wine
  left_join(., summary_states_grapes, by = "geopol_unit")     #grapes

#deal with NA
states_summary_present[is.na(states_summary_present)] <- 0

#join the new ID's
states_summary_present <- left_join(stateid, states_summary_present, by = "geopol_unit")
#################################
#tidy the order of cols
states_summary_present <- states_summary_present %>%
  #dplyr::select(ID, geopol_unit, status, grand_mean_max, grand_se_max, wine, grapes, avg_wine:log10_avg_prod, everything())
  dplyr::select(ID, geopol_unit, status, grand_mean_max, wine, grapes, avg_wine:log10_avg_prod, everything())


##FUTURE

#add the binaries
states_summary_future <- summary_states_extracts %>%
  mutate(grapes = ifelse(geopol_unit %in% summary_states_grapes$geopol_unit, yes = "yes", no = "no")) %>%
  mutate(wine = ifelse(geopol_unit %in% summary_states_wine$geopol_unit, yes = "yes", no = "no")) %>%
  mutate(grapes = factor(grapes, levels = c("yes", "no")), wine = factor(wine, levels = c("yes", "no"))) 

#now add in the other data
states_summary_future <- states_summary_future %>%
  left_join(., summary_states_trade_future, by = c("geopol_unit" = "destination")) %>%   #trade
  left_join(., summary_states_wine, by = "geopol_unit") %>%   #wine
  left_join(., summary_states_grapes, by = "geopol_unit")     #grapes

#deal with NA
states_summary_future[is.na(states_summary_future)] <- 0

#join the new ID's
states_summary_future <- left_join(stateid, states_summary_future, by = "geopol_unit")

#tidy the order of cols
states_summary_future <- states_summary_future %>%
    #dplyr::select(ID, geopol_unit, status, grand_mean_max, grand_se_max, wine, grapes, avg_wine:log10_avg_prod, everything())
    dplyr::select(ID, geopol_unit, status, grand_mean_max, wine, grapes, avg_wine:log10_avg_prod, everything())


#################################



#COUNTRIES

##PRESENT

#add in the binaries
countries_summary_present <- summary_countries_extracts %>%
  mutate(grapes = ifelse(geopol_unit %in% summary_countries_grapes$geopol_unit, yes = "yes", no = "no")) %>%
  mutate(wine = ifelse(geopol_unit %in% summary_countries_wine$geopol_unit, yes = "yes", no = "no"))

#change U.S. from no to yes for grapes
countries_summary_present$grapes[countries_summary_present$geopol_unit == "United States"] <- "yes"

#The factor fixing step is now here
countries_summary_present <- countries_summary_present %>%
  mutate(grapes = factor(grapes, levels = c("yes", "no")), wine = factor(wine, levels = c("yes", "no"))) 

#now drop NAs for the extracts and add in the other data
countries_summary_present <- countries_summary_present %>%
  drop_na(.) %>%
  left_join(., summary_countries_trade, by = c("geopol_unit" = "destination")) %>%   #trade
  left_join(., summary_countries_wine, by = "geopol_unit") %>%   #wine
  left_join(., summary_countries_grapes, by = "geopol_unit")     #grapes

#deal with NA
countries_summary_present[is.na(countries_summary_present)] <- 0

# add ID to make is simpler for labeling points
countries_summary_present <- countries_summary_present %>%  mutate(ID = geopol_unit)

# add a column for status establishment
countries_summary_present <- countries_summary_present %>%  mutate(status = "not established")
countries_summary_present$status[countries_summary_present$geopol_unit %in% c("China", "India", "Taiwan", "Vietnam")] <- "native"
countries_summary_present$status[countries_summary_present$geopol_unit %in% c("Japan", "South Korea", "United States")] <- "established"

#tidy the order of cols
countries_summary_present <- countries_summary_present %>%
    #dplyr::select(ID, geopol_unit, status, grand_mean_max, grand_se_max, wine, grapes, avg_wine:log10_avg_prod, everything())
    dplyr::select(ID, geopol_unit, status, grand_mean_max, wine, grapes, avg_wine:log10_avg_prod, everything())

##FUTURE

#add in the binaries
countries_summary_future <- summary_countries_extracts %>%
  mutate(grapes = ifelse(geopol_unit %in% summary_countries_grapes$geopol_unit, yes = "yes", no = "no")) %>%
  mutate(wine = ifelse(geopol_unit %in% summary_countries_wine$geopol_unit, yes = "yes", no = "no"))

# NOTE: must This also includes manually changing the **United States** to a grape producer.
#change U.S. from no to yes for grapes
countries_summary_future$grapes[countries_summary_future$geopol_unit == "United States"] <- "yes"

#The factor fixing step is now here
countries_summary_future <- countries_summary_future %>%
  mutate(grapes = factor(grapes, levels = c("yes", "no")), wine = factor(wine, levels = c("yes", "no"))) 

#now drop NAs for the extracts and add in the other data
countries_summary_future <- countries_summary_future %>%
  drop_na(.) %>%
  left_join(., summary_countries_trade_future, by = c("geopol_unit" = "destination")) %>%   #trade
  left_join(., summary_countries_wine, by = "geopol_unit") %>%   #wine
  left_join(., summary_countries_grapes, by = "geopol_unit")     #grapes

#deal with NA
countries_summary_future[is.na(countries_summary_future)] <- 0

# add ID to make is simpler for labeling points
countries_summary_future <- countries_summary_future %>%  mutate(ID = geopol_unit)

# add a column for status establishment
countries_summary_future <- countries_summary_future %>%  mutate(status = "not established")
countries_summary_future$status[countries_summary_future$geopol_unit %in% c("China", "India", "Taiwan", "Vietnam")] <- "native"
countries_summary_future$status[countries_summary_future$geopol_unit %in% c("Japan", "South Korea", "United States")] <- "established"

#tidy the order of cols
countries_summary_future <- countries_summary_future %>%
    #dplyr::select(ID, geopol_unit, status, grand_mean_max, grand_se_max, wine, grapes, avg_wine:log10_avg_prod, everything())
    dplyr::select(ID, geopol_unit, status, grand_mean_max, wine, grapes, avg_wine:log10_avg_prod, everything())

Save data

This section provides the code to save each of the final summary objects to the /data directory. They are currently set to not run, as the package should ship with the files already in /data. Note that this chunk can be run by setting the top if from FALSE to TRUE!

if(FALSE){
save(states_summary_present, file = file.path(here(), "data", "states_summary_present.rda"))
save(states_summary_future, file = file.path(here(), "data", "states_summary_future.rda"))
save(countries_summary_present, file = file.path(here(), "data", "countries_summary_present.rda"))
save(countries_summary_future, file = file.path(here(), "data", "countries_summary_future.rda"))
}

if(FALSE){
states_summary_future_ensemble <- states_summary_future
states_summary_present_ensemble <- states_summary_present
countries_summary_future_ensemble <- countries_summary_future
countries_summary_present_ensemble <- countries_summary_present
  
save(states_summary_future_ensemble, file = file.path(here(), "data", "states_summary_future_ensemble.rda"))
save(states_summary_present_ensemble, file = file.path(here(), "data", "states_summary_present_ensemble.rda"))
save(countries_summary_future_ensemble, file = file.path(here(), "data", "countries_summary_future_ensemble.rda"))
save(countries_summary_present_ensemble, file = file.path(here(), "data", "countries_summary_present_ensemble.rda"))

}