vignettes/113_versioning.Rmd
Data version control should be done manually by adding the version at the end of the filename
At least three versions of the data are required: raw, v0, and final
Version control is another essential part of proper data management. With good version control, an error made during data processing is never permanent, because you can always revert to an earlier version of your data. How often you create a new version will largely depend on how you process your data. You should create a new version every time you add, remove, or edit a large portion of the data. Small changes do not require a new version until they add up to a large portion of the data. However, all changes, no matter how small, need to be logged in both your file documentation (i.e., the changelog; see File Documentation section) and in the code you wrote for data processing (see R Packaging section).
At the very least you should have 3 versions of your data:
a raw file that you should never edit (denoted ‘raw’ in the file name)
an initial version (denoted ‘v0’ in the file name)
a final version that you used for your publication (denoted ‘final’ in the file name).
Any additional versions created during your data editing and processing should be denoted by v1, … , v{number of versions before final} in the file name. If you think you will have more than ten versions, then put a ‘0’ in front of the version number for versions 0 through 9 (i.e., make them v00, v01, v02, … , v09). If you move directly from the initial version to the final version you will use for analysis, then you will only have these three versions (raw, v0, and final). However, it is good data hygiene to export several versions from your R code so that any mistakes can be easily diagnosed if the need arises.
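The labelling rule above can be sketched in a few lines of R. This is a minimal illustration, not part of the DMP itself: the function names `version_labels()` and `versioned_filename()`, the underscore separator, and the base name `slf_occurrences` are all assumptions made for the example, so substitute whatever the file naming convention for your project requires.

```r
# Build the version labels described above: "raw", "v0", "v1", ..., "final".
# `n_versions` is the number of numbered versions expected, counting v0.
version_labels <- function(n_versions) {
  stopifnot(n_versions >= 1)
  ids <- seq_len(n_versions) - 1L          # 0, 1, ..., n_versions - 1
  # Zero-pad to two digits only when more than ten numbered versions are expected
  fmt <- if (n_versions > 10) "v%02d" else "v%d"
  c("raw", sprintf(fmt, ids), "final")
}

# Put the version label at the end of the filename.
# The "_" separator and default extension are assumptions for illustration.
versioned_filename <- function(base, version, ext = "csv") {
  paste0(base, "_", version, ".", ext)
}

version_labels(3)
# "raw" "v0" "v1" "v2" "final"

versioned_filename("slf_occurrences", version_labels(12)[2:4])
# "slf_occurrences_v00.csv" "slf_occurrences_v01.csv" "slf_occurrences_v02.csv"
```

Deciding on the padding up front keeps filenames sorting in version order in a file browser, which is the practical reason for the leading zero.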
The easiest way to do manual versioning as described above is to set data processing goals and create a new version for each goal. For instance, if you are working with spotted lanternfly occurrence data, your first data processing goal may be to check, correct, and standardize every location name. Once you have done this, you would create version v1. You may then set the goal of including data from a different state (the data from the other state will have its own raw file in the raw data subfolder). Once you have done that, you would create a new version, v2, and so on. A typical first data processing goal is reformatting the data to adhere to the guidelines set forth by the DMP (see Data Formatting section), meaning that the v0 version does not have to be formatted as the DMP states. How often you make versions is up to you, but the more versions you make, the easier it will be to fix any mistakes made during data processing.
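The goal-by-goal workflow described above might look like the following in R. This is only a sketch under stated assumptions: the data frame, its column names, the location corrections, the helper `export_version()`, and the filename pattern are all made up for illustration, and a temporary folder stands in for your project's data folder.

```r
# Hypothetical occurrence data; the columns and values are illustrative only.
occ <- data.frame(
  location = c("philadelphia ", "Phila.", "Harrisburg"),
  count = c(12, 7, 3),
  stringsAsFactors = FALSE
)

# A temporary folder stands in for the project's data folder.
data_dir <- tempfile("data")
dir.create(data_dir, recursive = TRUE)

# Write one version of the data, with the version at the end of the filename.
export_version <- function(data, version, dir = data_dir,
                           base = "slf_occurrences") {
  path <- file.path(dir, paste0(base, "_", version, ".csv"))
  write.csv(data, path, row.names = FALSE)
  path
}

# Initial version, exported before any edits
export_version(occ, "v0")

# Goal 1: check, correct, and standardize every location name, then export v1
occ$location <- trimws(occ$location)
occ$location[occ$location %in% c("philadelphia", "Phila.")] <- "Philadelphia"
export_version(occ, "v1")

# Goal 2: include data from a different state, then export v2
other_state <- data.frame(location = "Trenton", count = 5,
                          stringsAsFactors = FALSE)
occ <- rbind(occ, other_state)
export_version(occ, "v2")

list.files(data_dir)
```

Because each export follows one clearly named goal, a problem found later can be traced to the step that introduced it by comparing two adjacent versions.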
These versioning guidelines apply to anything produced during a project, including, but not limited to, data tables, GIS data, photographic data, video data, non-R scripts, figures, and manuscripts. Versioning for R scripts is discussed in the Git R Script Version Control section.
Here is an example of file versioning that does not follow typical data versioning. If you are working from multiple data files that will be compiled (e.g., data collected by many researchers), then the compiled data file will be your raw data and have ‘raw’ as the version in the filename. This data file will be saved in the raw subfolder of the data folder of your project and never edited (except perhaps if you compile the data again). You will then copy that data file into your root data folder, where the copy becomes your v0 version, and edit it. These edits will be logged in any code you use AND in the changelog. Once you have edited a significant portion of the data, you will resave the data, changing the version in the filename to v1. The v0 version will then be moved to the ‘old’ subfolder of your data folder. You will then edit the v1 version. These steps are repeated, updating the version number each time, until you have created your final version. In your data folder, this file will still have a version number, in case you continue editing these data after publication of your final version. All older versions are moved to the ‘old’ subfolder. The final version will then be saved in the data_raw subfolder of your R project folder and have ‘final’ as the version in the filename. This is the file you will use in your analyses and upload to a data repository upon publication. If you do all of your editing with R code and do not want to export any incremental versions, then you will load the v0 version into R, edit it, and then export the final version into both the root data folder and the data_raw folder of the R project folder.
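The save-and-archive cycle in this example can be sketched in R as follows. Treat it as one possible way to automate the steps, not a required implementation: the helper `save_new_version()`, the base name `compiled_counts`, the filename pattern, and the example data are hypothetical, and a temporary folder stands in for the project's data folder. The ‘raw’ and ‘old’ subfolder names follow the description above.

```r
# Save a new version in the root data folder and move the previous
# version into the 'old' subfolder. The filename pattern is an assumption.
save_new_version <- function(data, base, new_version, old_version, data_dir) {
  old_dir <- file.path(data_dir, "old")
  dir.create(old_dir, showWarnings = FALSE, recursive = TRUE)

  new_path <- file.path(data_dir, paste0(base, "_", new_version, ".csv"))
  old_name <- paste0(base, "_", old_version, ".csv")
  old_path <- file.path(data_dir, old_name)

  # Refuse to overwrite a version that already exists
  stopifnot(!file.exists(new_path))
  write.csv(data, new_path, row.names = FALSE)

  # Archive the previous version
  if (file.exists(old_path)) {
    file.rename(old_path, file.path(old_dir, old_name))
  }
  new_path
}

# A temporary folder stands in for the project's data folder.
data_dir <- tempfile("project_data")
raw_dir <- file.path(data_dir, "raw")
dir.create(raw_dir, recursive = TRUE)

# The compiled data become the raw file, which is never edited.
compiled <- data.frame(site = c("A", "B"), count = c(4, NA))
raw_path <- file.path(raw_dir, "compiled_counts_raw.csv")
write.csv(compiled, raw_path, row.names = FALSE)

# Copy the raw file into the root data folder as v0.
v0_path <- file.path(data_dir, "compiled_counts_v0.csv")
file.copy(raw_path, v0_path)

# Edit the data (here: drop rows with a missing count), then save v1.
dat <- read.csv(v0_path)
dat <- dat[!is.na(dat$count), , drop = FALSE]
v1_path <- save_new_version(dat, "compiled_counts", "v1", "v0", data_dir)

list.files(data_dir, recursive = TRUE)
```

The check against overwriting is a deliberate choice: an existing version should only ever be replaced by hand, so that earlier versions stay available for reverting. Remember that the edit itself still needs an entry in the changelog.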
For more complicated data, or for filetypes that may limit the number of characters in a filename (e.g., GIS files), keep the files in a folder dedicated to that specific data file. Name that folder using the file naming convention and version control the folder (or, in some cases, the simpler file names inside the folder).
If you would like to use versioning software, talk to the PIs about which tool you would like to use. Whatever versioning method you use, everyone involved with the data processing for your project should have access to previous versions, or you need to be available as a point of contact for everyone who has data questions (decisions about data access should be made on a project-by-project basis with the PIs; see the Pre-Project Planning section).
See LINK for a list of software. However, use caution, since automated versioning of some data types (e.g., GIS data, large data sets) can require an immense amount of memory, which makes such software extremely inconvenient to use. When looking for versioning software, make sure the user can control when new versions are made.
Temple University, jmg5214@gmail.com
Temple University, sebastiano.debona@gmail.com
Temple University, mrhelmus@temple.edu
Temple University, jebehm@temple.edu