vignettes/121_data_formatting.Rmd
This section provides guidelines for how to build and format data tables so that they are easily manipulated, edited, analyzed, and, most importantly, understood. We also discuss the proper formatting of other types of data, such as spatial and audio-visual data. The formatting outlined in this section should be established well before the final version of the data is made; early files that are still being manipulated or edited do not necessarily have to be formatted this way. However, it is highly recommended that the data (especially data tables) be formatted following these guidelines during the first edit (i.e. the edit(s) creating version 1).
- Data tables should be saved as CSV files and be in long format
- Column headers are required, should be consistent, and easy to understand (see iEco Header Conventions)
- All text should be formatted consistently with dates in the YYYY-MM-DD format
- All cells should have values and if a value is missing then an NA should be entered
All data tables should be saved as a comma-delimited file (never tab-delimited), such as a CSV (comma-separated values) file. The reason for this is that Excel files may not be easily readable by researchers without Microsoft Office. The same is true of other proprietary file types such as Access; if you are using relational database software like Access, the various tables should also be exported as CSV files. This also applies to data stored in R file types (e.g. Rdata), because not everyone knows how to use R.
Even though CSV and comma-delimited text files are essentially the same, CSV files are the preferred format for data tables over text files because the .csv extension provides an easy way to sort or filter the data files, since the documentation files (see the File Documentation section) are often saved as text files (.txt extension). Additionally, CSV files are readable by most software and easy to load into most computer programs.
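As a point of reference, writing a finished data table out of R as a plain CSV is a single call; the data frame and file name below are placeholders and not part of any real project.

```r
# Hypothetical data table
bird_counts <- data.frame(date = "2020-04-24", site = "a",
                          common = "grackle", sex = "m", count = 2)

# Save as a comma-delimited file without row names
write.csv(bird_counts, "bird_counts_v1.csv", row.names = FALSE)
```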
Data tables should always be in a long format unless a long format is too cumbersome (e.g. large community data). A long-formatted data table is one where each row corresponds to the minimum observable unit of data (e.g. a single trial for an individual in a repeated behavioral assay), with grouping factors (e.g. individual) repeated across rows and a single column for each measured variable (e.g. choice). Even when a long format seems cumbersome, it is better than a wide format for many reasons: long formats are easy to convert to other formats, and they make aggregation, subsetting, and analysis straightforward.
Below is an example of data in a long format:
Date | Site | Species | Sex | Count |
---|---|---|---|---|
04/24/2020 | A | Grackle | M | 2 |
04/24/2020 | A | Grackle | F | 1 |
04/24/2020 | A | House Finch | M | 7 |
04/24/2020 | A | House Finch | F | 9 |
04/24/2020 | A | American Robin | M | 4 |
04/24/2020 | A | American Robin | F | 3 |
04/24/2020 | B | Grackle | M | 0 |
04/24/2020 | B | Grackle | F | 2 |
04/24/2020 | B | House Finch | M | 10 |
04/24/2020 | B | House Finch | F | 15 |
04/24/2020 | B | American Robin | M | 4 |
04/24/2020 | B | American Robin | F | 6 |
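To illustrate how easily a long table converts to other formats, here is a sketch using the tidyr package (an assumption; base R’s reshape() also works) on a few rows of the table above.

```r
library(tidyr)

# A few rows of the example table above, already in long format
counts <- data.frame(
  date    = "2020-04-24",
  site    = c("A", "A", "B", "B"),
  species = "Grackle",
  sex     = c("M", "F", "M", "F"),
  count   = c(2, 1, 0, 2)
)

# Long -> wide: one column per sex (occasionally handy for presentation)
wide <- pivot_wider(counts, names_from = sex, values_from = count)

# Wide -> long: back to one row per observation for analysis
long <- pivot_longer(wide, cols = c(M, F),
                     names_to = "sex", values_to = "count")
```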
Every data table is REQUIRED to have column headers. Column headers should be short and easy to understand. Avoid symbols and abbreviations whenever possible. For example, instead of using ‘dc’ to denote the distance to Cuba, use ‘distance_cuba’. Abbreviations are helpful, especially when writing analysis code, but can make headers hard to comprehend; if you do use abbreviations, use common, easily understood ones. For instance, instead of ‘distance_cuba’ we could use ‘dist_cuba’, since “dist” is an often-used abbreviation for distance. Here at the iEco Lab we prefer that you not use abbreviations except when the column header is very long (a general guideline is that headers should be no more than 15 characters, so ‘distance_cuba’ is not too long). Additionally, spaces in column headers should be replaced with an underscore (i.e. “_”). Some analysis software and R functions cannot handle spaces in column headers, which can cause data to load erroneously or return errors. Column header conventions should be established for individual projects with multiple data tables or for whole labs; they make data easier to combine for analysis and easier to understand.
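If a raw table arrives with headers that break these rules, they can be cleaned in one pass. Below is a minimal base R sketch on a hypothetical data frame; the janitor package, if installed, offers a ready-made alternative.

```r
# Hypothetical data frame with messy headers
dat <- data.frame(`Distance Cuba` = c(120, 95), Count = c(3, 5),
                  check.names = FALSE)

# Lower-case every header and replace spaces with underscores
names(dat) <- gsub(" ", "_", tolower(names(dat)))
names(dat)  # "distance_cuba" "count"

# Equivalent with the janitor package:
# dat <- janitor::clean_names(dat)
```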
The iEco column header conventions can be found HERE. These conventions can be updated as needed, so if you are creating a data table and the convention for a type of data in a column of your table has not been made, feel free to make the convention for the lab (remember, with great power comes great responsibility).
Column headers also should not include units. The units of measurement should be included in your metadata for that file (see File Documentation section). If you have multiple columns that contain the same data in different units that are easy to convert between (e.g. m and km), delete one of those columns and convert in your analysis code if needed. The only exceptions to excluding units from the header are when you have the same data in different units that are not easily converted between each other and are on very different scales (e.g. atmospheric pressure, temperature), or when for some reason you need both metric and imperial units (e.g. for making maps for US and non-US users).
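For example, if a raw file happened to contain distances in both metres and kilometres, keep one column and derive the other in the analysis script. A minimal sketch with a hypothetical distance_cuba column recorded in metres:

```r
dat <- data.frame(distance_cuba = c(120000, 95000))  # hypothetical, in metres
dat$distance_cuba_km <- dat$distance_cuba / 1000     # convert only when needed
```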
The format of any text (including numbers) in your data table should be as basic as possible. The file types that should be used (i.e. CSV or text files) strip all formatting from text, which can cause issues if you use special characters, symbols, or unusual fonts. Therefore, all special characters and symbols should be avoided. Additionally, most analysis software does not know how to read special characters or symbols.
Here is a list of acceptable symbols and the top 10 worst symbols:
\(\underline{Acceptable \space Symbols:}\)
Name | Symbol |
---|---|
Underscore | _ |
Dash | - |
Vertical Line | \| |
Semicolon | ; |
Period | . |
\(\underline{Symbols \space to \space Avoid \space at \space All \space Cost:}\)
Name | Symbol |
---|---|
Dollar Sign | $ |
Question Mark | ? |
Percent Sign | % |
Number Sign | # |
Commercial At | @ |
Comma | , |
Colon | : |
Backslash (Reverse Solidus) | \ |
Quotation Mark | " |
Apostrophe | ’ |
For text strings (column headers and factors with more than one character), avoid capital letters where possible. If you do use capital letters, be consistent in how you use them. For example, if you have binomial species names and you capitalize the genus but not the species epithet, use that same scheme throughout. In the example table above, the first letter of each word in the species name is capitalized throughout that column. A good rule of thumb, however, is to avoid capitalization wherever possible.
Entry IDs should be included in all tables you produce. Entry IDs are unique identifiers for each entry (i.e. row) in a data table. They allow you to easily track specific changes you make to the data and to create relational keys between your data tables for easy merging. Avoid using simple row numbers for these IDs. Good entry IDs are alpha-numeric text strings that provide some information about the entry, allow for easy sorting, and have the same number of characters for every entry. These IDs should not be typed in by hand along with the data, and they can be simple (few numbers and characters) or complex. It is often easiest to make IDs by concatenating multiple columns together with a separating symbol.
For example, a good entry ID for the first entry in the above table would be “20200424_A_02_M”, which tells us that the entry is for site A, the second species alphabetically, males, and was made on a particular survey day. We could also use that ID to sort the table so that the sites, and the species within those sites, are alphabetical from the earliest survey to the last.
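A sketch of how such an ID can be built by concatenating columns, assuming the example data sit in a data frame called counts with ISO-formatted dates; the two-digit species code is the species’ alphabetical rank, and everything is lower-cased to match the reworked table shown further below.

```r
counts <- data.frame(
  date    = "2020-04-24",
  site    = c("A", "A", "B"),
  species = c("Grackle", "House Finch", "American Robin"),
  sex     = c("M", "M", "F")
)

# Two-digit alphabetical species code (american robin = 01, grackle = 02, ...)
counts$species_code <- sprintf("%02d", as.integer(factor(counts$species)))

counts$entry_id <- paste(
  format(as.Date(counts$date), "%Y%m%d"),  # survey date, e.g. 20200424
  tolower(counts$site),                    # site, e.g. a
  counts$species_code,                     # species code, e.g. 02
  tolower(counts$sex),                     # sex, e.g. m
  sep = "_"
)
counts$entry_id  # "20200424_a_02_m" "20200424_a_03_m" "20200424_b_01_f"
```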
Entry IDs should also be unique to the data table, meaning that the ID scheme should not be repeated across tables that may later be joined. The entry ID should be created at some point before the final version of the data is made; the exact timing depends on the specific needs of the project and the data editing. Once the entry ID is made, it is set for the entire lifetime of the data and should not be changed. It is therefore recommended to create the entry ID after all of the large edits to the data are made, because if large sections of data are deleted you will have large gaps in the entry ID series. Do not worry about adding data after an entry ID is created, since it is easy to continue the series where you left off. Importantly, entry IDs should never be recycled, even if an entry has been deleted; recycling IDs could make following the change logs difficult.
Dates in data tables should be in the format YYYY-MM-DD, and no other format should be used. In the table above, the format used for the date is MM/DD/YYYY. This format is bad for multiple reasons. First, not every culture, and therefore not every computer, uses that format; in many regions the day comes before the month (i.e. DD/MM/YYYY). This can cause issues for dates like 4/5/2020: is that April 5th or May 4th? Computers configured for different regions will ask the same question and may simply assume that the first number is the day or the month depending on their default settings. The year-month-day format with dashes between the year, month, and day is understood by all computers and software. Additionally, dates in the YYYY-MM-DD format sort chronologically. Years should always be written with four digits so that 2020 is not confused with 1920. If you like having the character form of months (e.g. March or Mar) in your data tables, put that in a new column separate from the date.
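In R, the MM/DD/YYYY dates from the first example table can be parsed once, explicitly, and kept in ISO form from then on; a minimal sketch:

```r
raw_dates <- c("04/24/2020", "05/04/2020")            # as in the first table
iso_dates <- as.Date(raw_dates, format = "%m/%d/%Y")  # parse unambiguously
iso_dates                                             # "2020-04-24" "2020-05-04"
```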
Missing data should be handled the same way throughout the data table, including in any comment columns. NEVER leave cells blank for missing data. Blank cells are ambiguous: they could mean that there was no data for that cell, that the cell was accidentally deleted, or that it was erroneously skipped during transcription. Instead, use something like ‘NA’ to denote missing data; ‘NA’ is preferred because R reads it as missing automatically. No matter what the cause of the missing data is (e.g. data not recorded, data not able to be recorded, etc.), missing data should always be represented by an NA. If it is important for the project to differentiate between different causes of NAs, record the cause in a ‘notes’ or ‘comments’ column. If a specific value is wanted to denote a certain cause of missing data (e.g. NR for data not recorded), that should be a decision made by the PI and Project Lead. However, be careful when reading such data into statistical software, since the software will not know what that value (e.g. NR) means and will likely read it as a character, making the whole column a character variable.
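If a table does use a project-specific code such as NR, the ambiguity can be handled at read time by telling the software which strings mean missing; a sketch using base R’s read.csv() and a hypothetical file name:

```r
dat <- read.csv("bird_counts_v1.csv",
                na.strings = c("NA", "NR"),  # treat both codes as missing
                stringsAsFactors = FALSE)
```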
When creating data tables, the most important things are that the data be consistent and that the nature of the data be recorded in the metadata file (see File Documentation section). For example, species names in a single column should always be either all common names or all scientific names, and in comment columns a comment that means the same thing should be written the same way every time (e.g. “date not recorded” is always entered, and never “date was not recorded”, “date missed”, “unrecorded date”, and so on). Further, DO NOT use commas to separate clauses in comments, because in CSV files commas are used to separate fields (i.e. columns).
Lastly, make sure that there are no leading or trailing spaces in your data. Stray spaces can cause some software to misread your data or to treat text strings as separate levels of a factor even though they are identical except for a leading or trailing space.
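Stray spaces are easy to strip either at read time or afterwards; a base R sketch with a hypothetical file name:

```r
# Strip white space around fields while reading
dat <- read.csv("bird_counts_v1.csv", strip.white = TRUE)

# Or clean every character column of an existing data frame
chr_cols <- vapply(dat, is.character, logical(1))
dat[chr_cols] <- lapply(dat[chr_cols], trimws)
```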
Now, if we rework the above example table into the ideal format, it looks like this:
entry_id | date | site | common | sex | count |
---|---|---|---|---|---|
20200424_a_02_m | 2020-04-24 | a | grackle | m | 2 |
20200424_a_02_f | 2020-04-24 | a | grackle | f | 1 |
20200424_a_03_m | 2020-04-24 | a | house finch | m | 7 |
20200424_a_03_f | 2020-04-24 | a | house finch | f | 9 |
20200424_a_01_m | 2020-04-24 | a | american robin | m | 4 |
20200424_a_01_f | 2020-04-24 | a | american robin | f | 3 |
20200424_b_02_m | 2020-04-24 | b | grackle | m | 0 |
20200424_b_02_f | 2020-04-24 | b | grackle | f | 2 |
20200424_b_03_m | 2020-04-24 | b | house finch | m | 10 |
20200424_b_03_f | 2020-04-24 | b | house finch | f | 15 |
20200424_b_01_m | 2020-04-24 | b | american robin | m | 4 |
20200424_b_01_f | 2020-04-24 | b | american robin | f | 6 |
This table now has an informative alpha-numeric entry ID, no stray capitalization, correctly formatted dates, and headers that adhere to the iEcoLab Header Conventions (i.e. ‘Species’ was changed to ‘common’ and there are no uppercase letters).
Spatial data is complex and requires thorough documentation in order to be read properly without distortion. The formatting of the data itself is often dictated by the type of data it is, which means that no single overarching rule applies to the formatting of all spatial data. At a bare minimum, spatial data needs to have the spheroid, datum, and, when applicable, the projection well documented for each file. Many of the file types used for spatial data include this information in the file itself or in auxiliary files. However, spatial data stored as a CSV often does not include this information, so it needs to be supplied in a user-made documentation file (see File Documentation section).
Point and vector (e.g. polygon) data can be stored as CSV files or shapefiles. However, CSV files are better suited to point data than to vector data. For point data, the coordinates should be kept in two columns named ‘latitude’ and ‘longitude’ for lat-long coordinates or ‘northing’ and ‘easting’ for UTM coordinates. For UTM coordinates, the UTM quadrat or zone needs to be documented, either in a new column of the CSV file if working across multiple quadrats/zones or in the file’s metadata file. It is recommended to use lat-long coordinates for final versions of the data that will be shared.
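A sketch of loading such a point CSV into R as spatial data, assuming the sf package, columns named longitude and latitude, and WGS84 (EPSG:4326) coordinates; the file name is hypothetical.

```r
library(sf)

pts <- read.csv("site_coordinates.csv")
pts_sf <- st_as_sf(pts, coords = c("longitude", "latitude"), crs = 4326)
```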
For vector data, CSV files can be used but are cumbersome to work with. Shapefiles are therefore recommended, since they are readable with open-source software like QGIS. Shapefiles automatically create the documentation needed for geographic systems to read them properly. However, this is done through multiple auxiliary files, which are easily lost if one is not careful with file management. It is recommended that each shapefile and its auxiliary files be kept in their own folder, which can be turned into a zip folder for easy sharing. Keeping these files together also aids version control when creating and editing vector data and avoids the need for long file names, for which some geographic software has limits. Point data can also be saved as shapefiles if desired, but CSV files are often smaller. KML files (i.e. Google Earth files) should be avoided for storing and sharing these kinds of data since they are difficult to read outside of Google’s systems.
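Shapefiles can also be read (and re-written, auxiliary files and all) with the sf package; the file path below is hypothetical.

```r
library(sf)

sites <- st_read("site_polygons/site_polygons.shp")  # reads the .shp plus its auxiliaries
st_crs(sites)                                        # datum/projection carried with the file

# st_write(sites, "site_polygons/site_polygons.shp", delete_layer = TRUE)  # re-save in place
```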
Raster data should be saved as GeoTIFF file(s). GeoTIFFs are TIF images that are georeferenced and tagged with the important georeferencing information (i.e. spheroid, datum, projection, etc.). These files can be read by most software and programming languages, and the format is in continual development, meaning that any compatibility issues with common software or programming languages will most likely be addressed in future releases. GeoTIFF files support both single- and multi-band raster data (e.g. RGB aerial images). For multi-band raster data, the individual bands can be stored as individual TIF images or combined into one file. If you store the bands as individual files, follow the same folder organization described for shapefiles in the paragraph above.
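A sketch of reading and writing GeoTIFFs in R with the terra package (the older raster package works similarly); file names are hypothetical.

```r
library(terra)

elev <- rast("elevation.tif")   # single- or multi-band GeoTIFF
crs(elev)                       # spheroid, datum, and projection stored in the file
writeRaster(elev, "elevation_copy.tif", overwrite = TRUE)
```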
Spatial data with high dimensionality, such as raster data with bands for different data types, should be saved as netCDF files. This file type is a scientific standard for storing and sharing multidimensional spatial data, is recommended by NASA and other government agencies, and can be read by most programming languages and open-source programs (e.g. Panoply). However, netCDF files can be difficult to work with, so they should only be used when the dimensionality of the data makes the other file types too cumbersome.
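netCDF files can also be opened from R: terra reads them as multi-layer rasters, and the ncdf4 package exposes the full variable and attribute structure. The file name below is hypothetical.

```r
library(terra)

clim <- rast("climate_stack.nc")  # each variable/time step becomes a layer

# For lower-level access to dimensions and attributes:
# library(ncdf4)
# nc <- nc_open("climate_stack.nc")
```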
For more information on proper spatial data management see:
Ramapriyan, H. K., and P. J. T. Leonard. 2020. Data Product Development Guide (DPDG) for Data Producers version 1. NASA Earth Science Data and Information System Standards Office, 9 July 2020. LINK
Audio-visual data should only be stored in common file types that can be read by most software. These include MP3 and WAV for audio, JPEG and TIFF for images, and MPEG formats (i.e. MPEG and MPEG-4; .mpg and .mp4 file extensions, respectively) for video.
For images, JPEG is recommended for photos taken in the field or lab if the loss of quality will not diminish the data in the photograph (e.g. high-contrast images, animal images, etc.). JPEG files are compressed, and the more they are manipulated the lower the quality of the image becomes. However, because they are compressed, they are usually smaller and therefore take up less memory. Additionally, JPEGs are readable on any operating system, can have metadata embedded in the file, and the loss of quality due to compression can be mitigated with high-quality cameras. Other image file types, such as RAW formats, PNG, GIF, etc., should be avoided due to their compression techniques (for all but RAW) and because they are more difficult to read, often requiring proprietary software.
The TIFF file type should largely be reserved for images where the loss of quality would greatly diminish the data in the image. TIFF files do not lose information when they are compressed, manipulated, copied, re-saved, etc., which makes them great for figures, maps, and other complex images. However, because TIFF files do not lose information during compression, they can be quite large and take up a lot of memory. When using TIFF files for figures, save them at a resolution of at least 300 dpi (i.e. 300 dots per inch), since most publishers require figures to be a minimum of 300 dpi (some require 600 dpi).
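A sketch of writing a 300 dpi TIFF straight from R using the base tiff() graphics device (ggplot2 users can pass dpi = 300 to ggsave()); the plot and file name are placeholders.

```r
tiff("figure_1.tiff", width = 6, height = 4, units = "in", res = 300)
plot(1:10, 1:10)  # placeholder figure
dev.off()
```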
Like images, there are many file formats available for video, but many of them are proprietary and require specific software to read. MPEG is readable by most software and retains video quality while keeping file sizes relatively small. MPEG formats are therefore recommended over MOV or WMV, which were made specifically for Apple and Microsoft products, respectively. Some digital video recorders (DVRs) record video in H.264, which should be avoided when possible. H.264 is typically used for constant recording because the file sizes are extremely small, which can be advantageous for 24-hour monitoring of, for example, a nest. However, the H.264 format requires expensive software to read, and even though there is a free program for working with these files, it is extremely difficult to use.
Often the audio-visual file format is dictated by the hardware used to record it, and the quality of the recorder trumps the recommended file types. It is therefore important to make sure that you can read the file types a specific piece of hardware produces before purchasing it.
Outside of file type, the formatting of audio-visual data is dependent on the use of the data. However, whenever recording audio-visual data, you should take the appropriate steps to make sure the recordings and/or images are of the highest quality possible. There are many tutorials online for how to use various hardware and the best practices for recording certain types of data.
A few practices are recommended for all types of audio-visual data: include the information necessary to comprehend the data both in the file itself and in a separate data table that lists the file name in a column. At a bare minimum, record the date, time, study name, and the person taking the recording, along with any other information pertinent to understanding what the data show or are for. This information can take the form of a label in the frame of a photograph, a message board at the start of a video, or a spoken message at the start of an audio recording. For data that are taken remotely (e.g. camera trap images/videos), this information can be tagged to the video or image with many kinds of image- or video-editing software; we recommend Adobe Bridge for tagging audio-visual files because it is free to download and easy to use for editing a media file’s embedded metadata. The information in the data table and the labels, messages, and tags are important because they protect against information loss during the editing, copying, and/or sharing of data.
Temple University, jmg5214@gmail.com
Temple University, sebastiano.debona@gmail.com
Temple University, mrhelmus@temple.edu
Temple University, jebehm@temple.edu