This vignette explains how records were obtained from GBIF and added to the data base. The code may have issues due to update with the ‘rgbif’ package. The outputs of the code are also not shown to make rendering the vignette easier.
The aim of this vignette is to show how we update our herpetological records data with species records from the Global Biodiversity Information Facility (GBIF) via two functions in the ‘caribmacro’ package.
First we need to load in the caribmacro package.
# Project package
library(caribmacro)
If you do not have this package installed, then run
devtools::load_all
in order to load all of the necessary
functions if you are running this code in the caribmacro R project in
the shared folder. If you are not, then see the Functions vignette for
the functions needed.
# Load all of the functions of the caribmacro package
devtools::load_all("./", export_all = FALSE)
The other R packages needed for this are:
The first step to updating the herpetogical records in our database
is to retrieve all of the records of herpetofauna in the Caribbean
bioregion as described in Hedges et al. (2019) and Bermuda. This is done
by loading in a shapefile that outlines the area(s) of interest
(i.e. the region previously described) and using the
gbif_records()
function.
The shapefile needs to read in using the
rgdal::readOGR()
function.
# Load in shapefile for Geographic bounds
outline<-readOGR(file.path(here(), 'data_raw', 'gis', 'Carib_Poly.shp'))
Now we can use gbif_records()
to download the GBIF
records for certain groups of taxa. Here we are downloading the records
for the taxonomic classes ‘Reptilia’ and ‘Amphibia’ for the past two
years. The groups have to be of the same taxonomic rank.
# Collect reptile and amphibian occurrences
h.occ <- gbif_records(groups = c('Amphibia','Reptilia'),
rank = 'class',
years = 2,
bounds = outline)
# Remove the shapefile since it is large and can slow down the computer
rm(outline)
It is important to know the default setting for this function because
you should check that these defaults returns all of the records you
want. The first default you should check is the number of records that
was returned by the function. The default setting for this is
limit = 50000
, and to check if this is enough, all you have
to do is to see if the number of records for he the taxanomic groups
idividually is less than 50,000 (i.e. make sure the number of rows of
the data frame for a taxanomic group is less than 50,000). If it does,
then you should increase the limit (i.e. limit = 60000
) and
check again.
(nrow(h.occ[which(h.occ$Group == 'Amphibia'), ]) == 50000)
(nrow(h.occ[which(h.occ$Group == 'Reptilia'), ]) == 50000)
As you can see we did not max out the limit of records for either
group. However, if either of the line of code above returned a
TRUE
, then we would have had to increase the
limit
argument. The function takes a while to run and will
take even longer the more records it searches for. Therefore, it is
recommended to only increase the limit in small increments.
The maximum number of records that can be retrieved for a single group is 200,000. If you max out that limit then you need to break up the time period into smaller intervals. The intervals are by year so if you still max out the 200,000 record limit, then use smaller groups of species (i.e. instead of using classes, like Reptilia, use orders, like Squamata).
In order to break up the time period from which you want to retrieve
records, you need to understand how the function works. The
years
argument sets the earliest year you want records by
subtracting the number of years you tell it to from another argument,
yr_from
. The default for yr_from
is the
current year of the computer (i.e. substr(Sys.Date(), 1, 4)
which at the time this vignette was made
substr(Sys.Date(), 1, 4)
= 2022). Therefore, if you, for
example, want the records from 2016 through 2018, then you would set
yr_from = 2018
and years = 2
, but if you want
the records from 2018 through the current year, you only have to set
years = 2
and use the default for yr_from
.
Now let see what is returned by gbif_records()
.
head(h.occ)
As you can see only a subset of information is returned by
gbif_records()
(much more information is available about
the records in GBIF). However, this is the minimum amount of information
you need to check the credibility of these records.
You may have noticed that the gbif_records()
does not
tell you what geographic feature from which they came. For our purposes
we want to know from which what island banks the records came.
Unfortunately, the coordinates of the records are not that great making
the identification of the islands from which they came nearly
impossible. HOwever, this is currently being worked on to make this
possible.
To determine from where the records came we need a shapefile of the
geographic features of interest. We will do this using the
sf::st_read()
function because it is faster and
sf
objects are easier to work with.
# Load in shapefile of Geographic Features
carib.bnk <- st_read(file.path(here(), 'data_raw', 'gis', 'Carib_Banks.shp'))
The function we will use to label the records with their geographic origin also filters out the records of species that we already have in our database. Therefore, we also need to load our data.
# Load in our data
herp <- read.csv(file.path(here(), 'data_raw', 'IBT_Herp_Records_final.csv'), header=TRUE)
Now we can use the rec_update()
function to select the
records that we do not have and label them with the island bank on which
they were recorded.
Note, that in rec_update()
the argument
geo_name = 'bank'
refers the name of the attribute in the
shapefile with the geographic feature names and the column in the data
that has those same names. These must match for the function to work
properly.
new_recs <- rec_update(gbif_occ = h.occ,
geography = carib.bnk,
geo_name = 'bank',
data = herp,
sp_name = 'binomial')
Now when we look at the records we can see that a new column has been created to denote the island bank the to which the records belong.
names(new_recs)
However, because the coordinates for the records are not that great, you should plot a few of the island banks to see how the records match their banks.
tmp1 <- carib.bnk[which(carib.bnk$bank == 'puerto rico' | carib.bnk$bank == 'mona' |
carib.bnk$bank == 'monito' | carib.bnk$bank == 'st croix'), ]
tmp2 <- droplevels(new_recs[which(new_recs$bank == 'puerto rico' | new_recs$bank == 'mona' |
new_recs$bank == 'monito'| new_recs$bank == 'st croix'), ])
tmp2 <- sf::st_as_sf(tmp2, coords = c("decimalLongitude", "decimalLatitude"),
crs = st_crs(tmp1)$proj4string)
names(tmp2)[which(names(tmp2) == 'bank')] <- 'bank2'
ggplot() +
geom_sf(data = carib.bnk, fill = 'azure2', color = 'black') +
geom_sf(data = tmp1, aes(fill = bank), color = 'black', alpha = 0.3) +
geom_sf(data = tmp2, aes(fill = bank2), shape = 21, color = 'black', alpha = 1) +
coord_sf(xlim = c(-68.25, -63.75), ylim = c(17, 19.5), expand = FALSE) +
xlab('Longitude') +
ylab('Latitude') +
ggtitle('Puerto Rico and Nearby Banks')
tmp1 <- carib.bnk[which(carib.bnk$bank == 'great bahama' | carib.bnk$bank == 'little bahama'), ]
tmp2 <- droplevels(new_recs[which(new_recs$bank == 'great bahama' |
new_recs$bank == 'little bahama'), ])
tmp2 <- sf::st_as_sf(tmp2, coords = c("decimalLongitude", "decimalLatitude"),
crs = st_crs(tmp1)$proj4string)
names(tmp2)[which(names(tmp2) == 'bank')] <- 'bank2'
ggplot() +
geom_sf(data = carib.bnk, fill = 'azure2', color = 'black') +
geom_sf(data = tmp1, aes(fill = bank), color = 'black', alpha = 0.3) +
geom_sf(data = tmp2, aes(fill = bank2), shape = 21, color = 'black', alpha = 1) +
coord_sf(xlim = c(-80.5, -74), ylim = c(22, 28), expand = FALSE) +
xlab('Longitude') +
ylab('Latitude') +
ggtitle('The Bahama Banks')
These plot show the locations of the records in relation to their banks. From these plots you can see how well the records match their banks. Most likely, you will have many records that do not match that well. However, even if all the records you download match perfectly, you still need to check all of the records.
To check the records you first need to export the new records from R. Because we want to keep a records of the GBIF records downloaded and keep the package folder small, you need to export these records to the shared folder. The code below will export the records into the appropriate folder with the appropriate file name for the Caribbean Macrosystems Project.
Now that the new records are exported you need to check the records
for there credibility. This needs to be outside of R
, but
let’s see what the new records data look like.
head(new_recs)
As you can see, the new_recs
data frame looks the same
as the h.occ
data frame. However there is another column in
the new_recs
data frame that denotes if there are more than
one record for that species on that particular bank. This is the first
step in determining the credibility of the records. If a species has at
least five records for a patricular bank then you can mark it to be
included in the database. The species with fewer than five records for a
particular bank probably should not be included in the data base.
However, just because there is only a few records does not mean that
that record is bad. Additionally, you can reduce the number of records
you will have to check the credibility for by comparing the species
names to the list of synonyms in the database (this may be built into
the rec_update()
function in the future).
The easiest way you can check the credibility of the record is to look at the ‘basisOfRecord’ column. If that column says that record is ‘PRESERVED_SPECIMEN’ then you can take that record as research grade and mark it to be included in the database.
If the record’s ‘basisOfRecord’ column says ‘HUMAN_OBSERVATION’ then you need to check the source of the record. The ‘occurrenceID’ column has a link to the iNaturalist posting for that record. By following that link you can see the credibility of the identification. We only what IDs that are designated “Research Grade” by iNaturalist. Most of the records will be from iNaturalist. If there is no link in the ‘occurrenceID’ column, then check the ‘references’ column for the source of the record and check that. On iNaturalist, you can also see all of the other records of that species in a map. You can use also this to get an idea of the first recording of the species and the validity of the record. For instance if you see the species has been recorded on a particular bank for multiple years than the validity of that species being on that bank is high. All of the records you deem having high validity should be marked for inclusion in the database.
Once you have marked all of the credible records, you need to do literature searches following the literature search protocols to see if there are any published records of that species on that bank. If so then record the earliest reference.
After you do all of this, the records that were marked for inclusion should be added to the database following those protocols.
Temple University, jmg5214@gmail.com↩︎