My apologies if this is old hat for people. I come from a relatively low-data, high-computation world, and have taught Software Carpentry workshops a number of times. I'm moving into a world where data integrity, etc., is quite important and I'm curious if there is a resource detailing best practices and workflows. Just to give you an idea of the types of operations we expect to do:
1. Importing and pre-processing of raw data into some refined database
   - this is likely infrequent and possibly computationally intensive
2. Construction of "input files" from a refined database, plus some other "assumptions" (basically, additional input specific to a given instance)
   - the additional input may be grouped, may be up to "medium" sized (say kB-MB uncompressed), may be related to a sensitivity analysis, etc.
   - for collaboration, publication, etc., tagging and publishing a "version" of the data is important
3. The data is "living", so tracking changes is important (basically a DB/binary-file version of version control)
I know these are all well-known topics for the code analog in software engineering, namely separating concerns and using version control. I'm curious whether this community has any suggested resources for someone with a decent background in software best practices to get up to speed quickly.
I'm interested in the results of this too. I found Christopher Gandrud's book "Reproducible Research in R and RStudio" to be a source of interesting ideas on at least (1) and (3). Gandrud's principle is that everything manually created should be in version control. I believe that means in your examples the raw data files and the code needed to turn them into DB/binary files would be versioned, but the binary file would not. You would keep the code needed to construct your input files in (2) under version control, but not necessarily the input files themselves.
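To make that concrete, a hypothetical `.gitignore` along those lines might look like the sketch below (the patterns are just placeholders for whatever derived files your pipeline produces):

```
# Version what is created by hand (raw data, scripts, Rmd files);
# ignore what can be regenerated from them.
*.sqlite
*.docx
*.html
output/
```

The raw files and the scripts stay tracked, and anything that can be rebuilt from them is left out of the repository.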
I'm a relative newcomer to using version control with data projects, so no idea if that's appropriate or not. I'm curious about combining binary files and version control -- good idea or bad?
Hi @atiretoo, can you be more specific about which "binary files" you wish to commit to version control? In general, it is possible to commit binary files, but text-based files are preferable so that version control "diff" tools can be used to display differences.
All sorts of binary files, but in particular docx or HTML5 presentations files that result from an R Markdown file. In my recent playing around with Git I kept getting warnings about diffs that were too big (several megabytes in size). I had assumed that was caused by the docx or HTML5 presentation files? There were no other substantial files in the directory. I might have been doing something else wrong too!
I'm interested in the best way to handle docx files because the vast majority of my students (and myself) will need to produce such files for their supervisors and collaborators. RStudio + knitr + pandoc now does a pretty good job of producing those files, but then the best path to follow after getting a file full of corrections back from a collaborator isn't clear to me.
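For what it's worth, producing the Word file can be as simple as something like this (assuming the rmarkdown package is installed; `report.Rmd` is just a placeholder name):

```r
# knit the R Markdown source and let pandoc write a Word document
rmarkdown::render("report.Rmd", output_format = "word_document")
```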
Binary databases are of interest too. I have a project with 29 years of plant demography data stored as a CSV file. A script cleans up that historical data and puts it in an SQLite table. Then I am working on a shiny app to add new data to that table. So what should I put under version control? Scripts obviously, but not sure about the database file unless there is a way to turn off the attempts to diff such files? Write out CSV files as backups and version those?
If I understand correctly, the main advantage of diff tools is when it comes time to merge changes by more than 1 collaborator. In my case there would only be one person entering data, but even if there were multiple people you'd handle that differently by setting up the database on a server and allowing multiple people to enter records simultaneously. Data isn't like code. I suppose you could have multiple people entering the same records to catch data entry errors, but seems like overkill ...
Diffs really have to do with displaying the line-by-line changes between files. pandoc can be hooked into git to compare docx files and see the differences. But git+pandoc only helps when using git on your local machine. As of right now, GitHub will not show line-by-line differences in docx files, so when you try to view a commit that modifies a docx file on GitHub you'll likely get a "diff too big" message.
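If anyone wants to try that locally, here is a minimal sketch of the hookup (the `pandoc` attribute name is arbitrary, and the flags may need adjusting for your pandoc version):

```
# .gitattributes in the repository root
*.docx diff=pandoc
```

```
# run once in the repository (writes to .git/config)
git config diff.pandoc.textconv "pandoc --from=docx --to=markdown"
```

With that in place, `git diff` and `git log -p` convert the .docx to markdown before diffing, so you get readable text diffs locally; it doesn't change what GitHub displays.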
Caveat: Others certainly know more about data versioning than I do, so take the following with a grain of salt; I'm not a data scientist.
Does your SQLite DB have more than one table in it? If not, then sharing the "raw" CSV and the "cleaned" CSV would be best, so other researchers would know what you started with and what transformations you made. If the SQLite DB is essential, then in general I would suggest committing the script that converts the CSV into the SQLite DB and letting users run that conversion themselves. Doing this would also help them convert their own data.
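As a rough sketch of what that committed conversion script could look like (file and table names here are invented, and I'm assuming the DBI and RSQLite packages):

```r
# Build the SQLite database from the cleaned CSV.
# Version this script and the CSV, not the resulting .sqlite file.
library(DBI)
library(RSQLite)

# read the historical data
demography <- read.csv("demography.csv", stringsAsFactors = FALSE)

# ... any cleaning steps go here ...

# write the table into a local SQLite database
con <- dbConnect(RSQLite::SQLite(), "demography.sqlite")
dbWriteTable(con, "demography", demography, overwrite = TRUE)
dbDisconnect(con)
```

Anyone who clones the repository can regenerate the database by running the script, so the binary file itself never needs to be committed.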
Hope that all makes reasonable sense,
Yes, I saw a blog post on how to do that, but haven't tried it yet.
Thanks for the thoughts on the DB. At the moment there's just the one table, but I am working on other aspects (geographic location of sites, climate data, remote sensing data, etc.) that would represent additional tables.
Right -- and that's what @gidden was originally asking -- who does know and where have they written it all down!
Hi folks, thanks for the conversation so far. @atiretoo has the gist of it -- I'm curious if there's some basic information about best practices. Perhaps there's another forum to ask such questions? Maybe I could go to the SWC lists if this isn't the best place?
See the discussion started by Gary Wilson over at swcarpentry on this issue. Quite a few good references, but no knockout, definitive answers. Yet!