#1: Isolating dispersal jumps and diffusive spread

Aim and setup

The spread of invasive species is due both to diffusive spread and human-assisted jump dispersal, often caused by species hitchhiking on vehicles. The aim of this vignette is to differentiate diffusive spread from jump dispersal in invasive species based on occurrence data. This identification required the development of a directional analysis of species presence. We demonstrate this method using occurrence data of the spotted lanternfly, Lycorma delicatula, an invasive Hemiptera in the United States.

1. Data overlook

The SLF dataset is available from the lydemapr companion package. In the download_data folder, then in the v2_2023 subfolder, download the lyde_data_v2.zip file and place it in the data folder of your local jumpID project. At the date at which this vignette is written, lydemapr contains data up to year 2022.

We load this dataset that contains all the occurrences of the spotted lanternfly, and take a look at it.

##   bio_year latitude longitude lyde_established
## 1     2015 40.41441 -75.65882            FALSE
## 2     2016 40.33333 -75.63529            FALSE
## 3     2016 40.36036 -75.48235            FALSE
## 4     2016 40.36937 -75.62353            FALSE
## 5     2016 40.37838 -75.71765            FALSE
## 6     2016 40.47748 -75.62353            FALSE

The dataset contains 831,039 rows, each corresponding to a SLF survey at a specific date and location. The columns that we will using are:
* bio_year: the biological year of the occurrence (see lydemapr for a description of the difference between year and bio_year)
* latitude and longitude (XY coordinates). The precise location of SLF surveys is here summarized in 1-km² cells. All coordinates are in WGS84 (EPSG 4326)
* lyde_established: whether an established SLF population was found in this survey (at least two live individuals or an egg mass, see lydemapr)

2. Data formatting

First, to use the jumpID package, we need to rename the columns with generic names expected by the package. We also remove rows where the SLF establishment is NA.

slf %<>% rename(year = bio_year,
                established = lyde_established) %>% 
  filter(is.na(established) == F)

Second, multiple surveys were sometimes conducted at the same location during the same year, resulting in a redundant dataset for the purpose of our analysis. We summarize the information available at each location every year, so that a row now represents the detection status at a location for a given year.

Note: when several surveys indicate that SLF are “present” the same year at the same location, it could be tempting to categorize SLF as “established”. However, the category “present” often refers to dead individuals, although this information is not explicitly available. We use a conservative approach and consider SLF as established in a cell only if one of the surveys considered it as established.

grid_data <- slf %>%
  dplyr::group_by(year, latitude, longitude) %>%
  dplyr::summarise(established = any(established)) %>% 
  ungroup()

## `summarise()` has grouped output by 'year', 'latitude'. You can override using
## the `.groups` argument.

head(grid_data)

## # A tibble: 6 × 4
##    year latitude longitude established
##   <int>    <dbl>     <dbl> <lgl>      
## 1  2014     39.8     -76.8 FALSE      
## 2  2014     39.8     -76.3 FALSE      
## 3  2014     39.8     -76.4 FALSE      
## 4  2014     39.9     -76.6 FALSE      
## 5  2014     39.9     -76.9 FALSE      
## 6  2014     39.9     -76.5 FALSE

The dataset now has 123,532 rows, corresponding to 1280 x 1596 cells. Let’s look at these data on a map.

Cells were predominantly surveyed in the Eastern US, but there are also cells surveyed in other parts of the US, starting in 2019. Positive cells (established SLF) are only in the Eastern US, so maps will be zoomed in on the Eastern US from here.

Every year, there is a large number of cells surveyed with positive (SLF established) or negative results (SLF not established). A zoomed-in map allows to see the cell grid.

The progression of the invasion will be measured starting from the introduction point that has been documented in Barringer et al. 2015. We store the coordinates of the introduction point in an object for the next analyses. If the precise introduction point is unknown, it may be replaced by the centroid of the invasion at the time the invasive species was discovered in the invasive range.

# Coordinates of the introduction site, extracted from Barringer et al. 2015
# As a table (for distance calculations):
centroid_df <- data.frame(longitude = -75.675340, 
                        latitude = 40.415240) 


# Replace by the centroid of the invasion in the first years of the invasion if the 
# introduction point is unknown:
# centroid_df <- slf %>% filter(year == 2014) %>%
#   summarise(latitude = mean(.$latitude),
#             longitude = mean(.$longitude))

To finish preparing the dataset for analysis, we calculate the distance between each survey cell and the introduction point.

grid_data %<>% 
  dplyr::mutate(DistToIntro = geosphere::distGeo(grid_data[,c('longitude','latitude')],
                                                 centroid_df[,c('longitude', 'latitude')])/1000) 
# the distance is obtained in meters, and we divide it by 1000 to obtain kilometers 
# given the scale of this invasion.

head(grid_data)

## # A tibble: 6 × 5
##    year latitude longitude established DistToIntro
##   <int>    <dbl>     <dbl> <lgl>             <dbl>
## 1  2014     39.8     -76.8 FALSE             121. 
## 2  2014     39.8     -76.3 FALSE              83.7
## 3  2014     39.8     -76.4 FALSE              90.4
## 4  2014     39.9     -76.6 FALSE             101. 
## 5  2014     39.9     -76.9 FALSE             121. 
## 6  2014     39.9     -76.5 FALSE              92.2

The dataset is now ready for the jump analysis.

3. Differentiating diffusive spread and jump dispersal: principle

A set of four successive functions will be run to separate cells where SLF is established due to continuous, diffusive spread, from cells where SLF is established due to jump dispersal. We will first run the functions on a tentative set of parameters to understand how the analysis works. The optimization of the parameters will be described later in this vignette.

a- attribute_sectors()

Considering that the expansion of the invasion is heterogeneous in space, the jump analyses requires the invasion range to be divided into sectors with the introduction site as the origin. The first function, attribute_sectors() attributes a sector number to each cell of the dataset, to be used in the jump analysis. Here, we divide the invasion range into 16 sectors. See vignette #2 “Decreasing calculation time” to learn how changing the number of sectors may make your analysis faster.

grid_data_sectors <- jumpID::attribute_sectors(dataset = grid_data, 
                                                 # dataset to be explored
                             nb_sectors = 16, 
                             # number of sectors to divide the invasion range
                             centroid = c(-75.675340, 40.415240)
                             # vector containing the centroid coordinates as long/lat
                             )

## 2024-08-22 14:34:47.474122 Start sector attribution... Sector attribution completed.

The jump analysis will go through each sector successively.

b- find_thresholds()

The function find_thresholds() will then search for the first discontinuity in the SLF distribution. Points are sorted by increasing distance to the introduction point, and the function goes through each consecutive pair of points, starting from the introduction point and going outwards. It stops once it identifies a distance larger than a defined gap_size between two consecutive points, marking a discontinuity in the SLF distribution. The last point before this discontinuity marks the putative limit of the diffusive spread. The function runs independently for each sector. It is considered that populations do not go extinct, and data are cumulated over time. The limit of diffusive spread cannot go back towards the introduction point over time.

The input dataset is the output of attribute_sectors(). The gap_size parameter defines the minimal distance between two consecutive points that marks a discontinuity in the SLF distribution, in km. In this example, it is set to 15 km.
If the option “negatives” is set to TRUE (default), the function will check that there are negative surveys in the discontinuity, so that the absence of SLF established is not due to an absence of surveys in the area.

gap_size = 15 

Results_thresholds <- jumpID::find_thresholds(dataset = grid_data_sectors, 
                                                gap_size = gap_size, 
                                                negatives = T)

## 2024-08-22 14:35:00.539064 Start finding thresholds... Sector 1/16...  2/16...  3/16...  4/16...  5/16...  6/16...  7/16...  8/16...  9/16...  10/16...  11/16...  12/16...  13/16...  14/16...  15/16...  16/16
## Threshold analysis done. 4243 potential jumps were found.

# Resulting objects are
#- Results_thresholds$preDist, a data frame of threshold cells = extreme points of the 
# diffusive spread. Will be completed in find_secDiff()
#- Results_thresholds$potJumps, a data frame of potential jumps. Will be
# pruned in find_jumps()

On this map, blue points indicate the invasion front identified by find_thresholds(), i.e., the limits of the diffusive spread in every direction, for every year. At this point of the analysis, all points farther than these thresholds are stored as potential jumps (green points).

c- find_jumps()

The function find_jumps() prunes the list of potential jumps of cells that are less than from a positive cell from a previous year, as the species likely spread from this other cell. The final list of jumps is obtained in the object Jumps.

Results_jumps <- jumpID::find_jumps(grid_data = grid_data, 
                       potJumps = Results_thresholds$potJumps,
                       gap_size = gap_size)

## 2024-08-22 14:41:08.43217 Start finding jumps... Year 2014 ... Year 2015 ... Year 2016 ... Year 2017 ... Year 2018 ... Year 2019 ... Year 2020 ... Year 2021 ... Year 2022 ... Jump analysis done. 387 jumps were identified.

# Resulting objects are:
#- Results_jumps$Jumps, a data frame containing all jumps
#- Results_jumps$diffusers, a data frame of positive cells stemming from diffusive spread
#- Results_jumps$potDiffusion, a data frame of remaining cells, containing a mix of 
# secondary diffusion and additional threshold points. Will be pruned in find_secDiff()

The analysis may be stopped here if only dispersal jumps are of biological interest. As a complement, secondary diffusion may be examined in the last function, find_secDiff().

d- find_secDiff()

Cells that were discarded from the jump list are either additional threshold points or secondary diffusion from a previous jump. find_secDiff() attributes them to the correct category by checking whether they are close from a previous jump, or a diffusion point.

Results_secDiff <- jumpID::find_secDiff(potDiffusion = Results_jumps$potDiffusion,
                                Jumps = Results_jumps$Jumps,
                                diffusers = Results_jumps$diffusers,
                                Dist = Results_thresholds$preDist,
                                gap_size = gap_size)

## 2024-08-22 14:41:53.572058 Start finding secondary diffusion... Year 2017 ...Year 2018 ...Year 2019 ...Year 2020 ...Year 2021 ...Year 2022 ...Analysis of secondary diffusion done.

# Resulting objects are:
#- Results_secDiff$secDiff, a data frame of all cells that are secondary diffusion 
# from a previous jump
#- Results_secDiff$Dist, a data frame of all extreme points on the invasion front (thresholds)

4. Wrapper for jump analysis

All these analyses can be done in one instance using the wrapper function.

jumps_wrapper <- jumpID::find_jumps_wrapper(dataset = grid_data, 
                                              nb_sectors = 16,
                                              centroid = c(-75.675340, 40.415240),
                                              gap_size = 15,
                                              negatives = T)

## 2024-08-22 14:43:16.131378 Start sector attribution... Sector attribution completed. 
## 2024-08-22 14:43:16.164782 Start finding thresholds... Sector 1/16...  2/16...  3/16...  4/16...  5/16...  6/16...  7/16...  8/16...  9/16...  10/16...  11/16...  12/16...  13/16...  14/16...  15/16...  16/16
## Threshold analysis done. 4243 potential jumps were found. 
## 2024-08-22 14:49:13.378279 Start finding jumps... Year 2014 ... Year 2015 ... Year 2016 ... Year 2017 ... Year 2018 ... Year 2019 ... Year 2020 ... Year 2021 ... Year 2022 ... Jump analysis done. 387 jumps were identified.
## 2024-08-22 14:49:39.405896 Start finding secondary diffusion... Year 2017 ...Year 2018 ...Year 2019 ...Year 2020 ...Year 2021 ...Year 2022 ...Analysis of secondary diffusion done.

We plot the results on a map.

5. Rarefy jump clusters

Some jumps were clustered in the same area. Jump clusters may stem from independent dispersal jumps, i.e. SLF hitchhiked multiple times to these locations the same year. Alternatively, jump clusters may result from SLF quickly spreading around a single dispersal jump. Finally, they can be a mix between these two hypotheses. We identify these jump clusters to better understand jump locations.

First, we delineate these jump clusters as jumps located less than the gap size from each other. Check how many points there are per jump cluster.

a- group_jumps()

Jump_groups <- jumpID::group_jumps(Jumps, gap_size = 15)

## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.

## # A tibble: 6 × 2
##    year mean_njumps
##   <int>       <dbl>
## 1  2017        1.5 
## 2  2018        8.29
## 3  2019        4.23
## 4  2020        1.65
## 5  2021        2.28
## 6  2022        2.41

b- rarefy_groups()

We summarize each jump cluster by summarizing each of them by their most central point.

Jump_clusters <- jumpID::rarefy_groups(Jump_groups) %>%
  dplyr::mutate(Rarefied = TRUE)

These 387 jumps were rarefied into 152 jump clusters.

6. Data is ready for biological analysis and interpretation!

– end of vignette –

Nadege Belouard1

2024-08-22