vignette-042-trade-presence.Rmd
Evaluate if Trade with Established States Relates to Regulatory Incidents or Establishment
This vignette fits a simple logistic regression model in the form of a binary dependent variable (either states with regulatory incidents or states with established SLF) on a continuous independent variable (trade with established states, mass in metric tons transformed as \(log_{10}(value+1)\)).
#packages to load
library(tidyverse) #data tidying
library(magrittr) #fast pipe assign
library(stargazer) #make easy model table for pub
library(sf) #manipulate by_county
library(here) #easy relative pathing
#packages to use to extract raw data
#library(lubridate) #extract years
#library(tigris) #county level shapefiles
#library(lycormap) #package to get tinyslf dataset from
#options(tigris_use_cache = TRUE) # make sure that tigris does not keep downloading the files
#data to load
#slfrsk data (trade)
data("states_summary_present_ensemble", package = "slfrsk")
#recode the established variable as binary
states_summary_present_ensemble <- states_summary_present_ensemble %>%
mutate(estab = ifelse(status == "established", 1, 0)) %>%
dplyr::select(ID:status, estab, everything())
The data used here for the spread figure are based on an upcoming
suite of packages (lycodata
and lycormap
)
produced by the iEcoLab. The records in these packages contain data that
are not yet available to the public, so all questions regarding the raw
data should be directed to the creator of the lycor*
suite,
Seba De Bona (SDB, (sebastiano.debona@temple.edu))
or to Matthew R. Helmus (mrhelmus@temple.edu).
Notably these data are used only to get regulatory events. The
slfrsk
data already has established states, but it lacks
the states that have regulatory events recorded but no established
populations.
Here, we will read in a .rda
file that was created using
the example code below from the raw data. These example code chunks are
adapted from SDB code.
#data call
data(by_county, package = "slfrsk")
We obtain the counties shapefile as a simple feature (sf) from the
tigris
package.
counties <- counties(cb = T, resolution = "5m")
Now, we read in the SLF records from the package
lycormap
. We also obtain the supplemental SLF records that
contain the updated transportation data for uninvaded states, which was
obtained from the raw data in the lycodata
package. At the
time of publication of slfrsk
, additional data were saved
as raw data that comes with lycordata
(slf_county_record_timeline.csv
).
data("tinyslf", package = "lycormap")
# reading manually added data (for a later step)
man_county_data <- read_csv(file.path(here(), "..", "..", "lycordata", "data-raw", "additional", "slf_county_record_timeline.csv"),
col_types = cols(.default = "c", FIPS = "d",
RT_check = "l", ST_check = "l", RS_match = "l"))
This code finds the county that each presence point corresponds to and attaches the county shapefile data to those rows. It also removes records that do not fall in a U.S. county (likely erroneous records). We also remove an errant record for Wasatch county, UT.
#adapted from SDB code in lycordata
#add column of temporary row ID's for later reference
tinyslf %<>%
add_column(row_ID = 1:nrow(.))
tinyslf %<>%
# reducing data to coordinates only, and ID for future merging
dplyr::select(latitude, longitude, row_ID) %>%
# transforming into sf object
st_as_sf(coords = c("longitude", "latitude"),
crs = st_crs(counties)) %>%
#intersecting state polygons with data coordinates to match them to U.S. counties
st_join(., counties, join = st_intersects) %>%
#make it a tibble now
as_tibble() %>%
#simplify to row ID, county name and identity
dplyr::select(row_ID, county = NAME, GEOID) %>%
#then joining this into main tinyslf data
left_join(tinyslf, ., by = "row_ID") %>%
#remove the temporary id column
dplyr::select(-c(row_ID))
#remove the NA values for county. which are the points that do not fall in a county
tinyslf %<>%
filter(!is.na(county))
#rm errant Wasatch co, UT record
tinyslf %<>%
filter(!(state %in% "UT" & county %in% "Wasatch"))
Add the observations and establishment dates to the county outline
data. Now, the earliest year of establishment or recording of an
observation (dead or alive) are identified from the tinyslf
dataset and then added accordingly to county shapefile data. Lastly, the
projection is confirmed.
#get the year of establishment
by_county <- tinyslf %>%
group_by(GEOID) %>%
arrange(year) %>%
filter(slf_established) %>%
summarize(YearOfEstablishment = min(year)) %>%
ungroup() %>%
#join to the counties shapefile
left_join(counties, ., by = "GEOID")
# and add the first record year
by_county <- tinyslf %>%
group_by(GEOID) %>%
arrange(year) %>%
filter(slf_present) %>%
summarize(FirstRecord = min(year)) %>%
ungroup() %>%
#join to shapefile that also includes year of establishment
left_join(by_county, ., by = "GEOID")
# providing correct projections
by_county %<>%
st_transform('+proj=longlat +datum=WGS84')
Because our first record and establishment data are constantly
updating from multiple sources, we have a googlesheet that is the source
of man_by_county
(saved as
slf_county_record_timeline.csv
). We again adapt code from
SDB to tidy and splice these new records into the counties dataset.
#we can use the FIPS code to merge to the main data
#extract the status and the year of record/establishment
#there are three date columns that hopefully can be collapsed into 1
#standardize date format
man_county_data %<>%
mutate_at(vars(starts_with("Date")), .funs = ~parsedate::parse_date(.))
#data can have multiple dates associated to it, meaning a county can have info on when the first record was scored separately from establishment date
#not all infestations have an establishment data set, so cases where it is absent will be set to 2020 based on when data were analyzed
# for Date_Alive/Date_Morbound, we'll take the smallest, if both are present
man_county_data %<>%
mutate(Year_Alive = year(Date_Alive),
Year_Morbund = year(Date_Morbund),
Year_FirstRecord = ifelse(!(is.na(Year_Alive) & is.na(Year_Morbund)),
pmin(Year_Alive, Year_Morbund, na.rm = T),
NA)
) %>%
dplyr::select(-c(Year_Alive, Year_Morbund))
#create two variables that echo those in the by_county for years of first record and establishment
man_county_data %<>%
filter(!is.na(Status)) %>%
#first grabbing the date (if present) from the Date_Establish or Date_Alive/Morbound columns or setting 2020 to established records without dates
#establishment
mutate(YearOfEstablishment_man = ifelse(
#if infestation and has a date of establishment
Status == "Infestation" & !is.na(Date_Establish),
year(Date_Establish),
#if infestation and no date, sets to 2020
ifelse(Status == "Infestation" & is.na(Date_Establish), 2020, NA)),
#first record
FirstRecord_man = ifelse(
#not infested and has a year of first record
Status != "Infestation" & !is.na(Year_FirstRecord),
Year_FirstRecord,
#infested and does not have a year of first record is set to 2020
ifelse(Status == "Infestation" & is.na(Year_FirstRecord), 2020, NA))) %>%
dplyr::select(-Year_FirstRecord)
Now that both datasets are cleaned to have years of first record and establishment, they can be combined. A combined FIPS code is added to the full county dataset first.
#create combined FIPS columns
by_county %<>%
mutate(FIPS = as.numeric(paste0(STATEFP, COUNTYFP)))
#merge
by_county <- man_county_data %>%
dplyr::select(FIPS, YearOfEstablishment_man, FirstRecord_man) %>%
left_join(by_county, ., by = "FIPS")
There may be some duplicate rows, and so the first record and establishment rows are retained. This dataset contains incomplete data for 2021, so we recode these as well, since our data are for as of 2020.
#pick earliest year (between the records mined from the data and the manually compiled records)
by_county %<>%
mutate(YearOfEstablishment = pmin(YearOfEstablishment, YearOfEstablishment_man, na.rm = T),
FirstRecord = pmin(FirstRecord, FirstRecord_man, na.rm = T)) %>%
#recode the 2021 data
#get cases where the first record is 2021, and set record to NA
mutate(FirstRecord = case_when(
FirstRecord == 2021 ~ NA_real_,
TRUE ~ FirstRecord
),
#get cases where the first record or year of establishment is 2021 and set establish to NA
YearOfEstablishment = case_when(
FirstRecord == 2021 ~ NA_real_,
YearOfEstablishment == 2021 ~ NA_real_,
TRUE ~ YearOfEstablishment
)
)
This current version uses the
data(by_country, package = "slfrsk")
. Again,
states_summary_present_ensemble
comes with established
state data.
#by_county version
reg_states <- by_county %>%
st_drop_geometry() %>%
as_tibble() %>%
filter(!is.na(FirstRecord) | !is.na(YearOfEstablishment) | !is.na(FirstRecord_man) | !is.na(YearOfEstablishment_man)) %>%
summarize(states = toupper(unique(STUSPS))) %>%
#clear it to just unique values
distinct(states)
#recode the regulatory event states
states_summary_present_ensemble <- states_summary_present_ensemble %>%
mutate(regulatory = ifelse(ID %in% reg_states$states, 1, 0)) %>%
dplyr::select(ID:estab, regulatory, everything())
We have two models that we look at, specifically the relationship between trade with established states AND:
estab
)regulatory
)
#logistic regression
glm.fit1 <- glm(estab ~ log10_avg_infected_mass, data = states_summary_present_ensemble, family = "binomial")
#summarize to see fit
summary(glm.fit1)
#>
#> Call:
#> glm(formula = estab ~ log10_avg_infected_mass, family = "binomial",
#> data = states_summary_present_ensemble)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.47592 -0.19554 -0.03601 -0.00405 2.02849
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -42.740 15.266 -2.800 0.00512 **
#> log10_avg_infected_mass 5.643 2.034 2.774 0.00554 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 47.532 on 50 degrees of freedom
#> Residual deviance: 19.608 on 49 degrees of freedom
#> AIC: 23.608
#>
#> Number of Fisher Scoring iterations: 8
#logistic regression for regulatory event states
glm.fit2 <- glm(regulatory ~ log10_avg_infected_mass, data = states_summary_present_ensemble, family = "binomial")
#summarize to see fit
summary(glm.fit2)
#>
#> Call:
#> glm(formula = regulatory ~ log10_avg_infected_mass, family = "binomial",
#> data = states_summary_present_ensemble)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.6670 -0.6181 -0.2227 0.5172 2.6132
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -21.5407 6.0197 -3.578 0.000346 ***
#> log10_avg_infected_mass 2.9962 0.8472 3.537 0.000405 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 64.924 on 50 degrees of freedom
#> Residual deviance: 38.506 on 49 degrees of freedom
#> AIC: 42.506
#>
#> Number of Fisher Scoring iterations: 6
star_table <- stargazer(glm.fit1, glm.fit2,
type = "html",
intercept.bottom = T,
ci = FALSE,
digits = 2,
style = "io",
#style = "default",
title = "SI Table: Logistic regression of SLF status on trade",
out = paste0(here::here(),"/vignettes/logistic_models.doc")
)
estab | regulatory | |
(1) | (2) | |
LOG10_AVG_INFECTED_MASS | 5.64*** | 3.00*** |
(2.03) | (0.85) | |
CONSTANT | -42.74*** | -21.54*** |
(15.27) | (6.02) | |
Observations | 51 | 51 |
Log likelihood | -9.80 | -19.25 |
Akaike information criterion | 23.61 | 42.51 |
Notes: | p < .01; p < .05; p < .1 |