Motivation

Real-world data, such as logs, comments, and reviews, are usually messy and require substantial (pre-)processing. Data may also be updated during a project, for example when data collection is ongoing (as is the case for many COVID-19 papers). This can result in several versions of a dataset over the course of a research project (e.g., v2, v3, featureXpresent, featureYabsent) and can make it hard to reproduce an analysis that was conducted a few weeks earlier. Moreover, the datasets that researchers in many fields work with are becoming increasingly large, often exceeding several gigabytes. Even if the source code is published, it may not be feasible for other researchers to reproduce results whose computation takes hours to complete. Versioning intermediate datasets with Git is often not feasible either, due to the storage restrictions of code repositories and the technical difficulty of handling large files efficiently.

Current Practices

I distributed a survey¹ through the channels of the Open Science Fellows Program (n=15) to understand current research data practices among researchers from various disciplines². Specifically, I asked how they version the different parts of their research projects, such as the manuscript, the data analysis scripts, and the data itself, to ensure reproducibility. As the graph below shows, sophisticated versioning practices (using GitHub or the built-in functionality of cloud storage) were prevalent for the manuscript itself (~50%) and the data analysis scripts (~66%). This shows that topics such as reproducibility and versioning are on the radar of many researchers. However, few researchers used Git for the data itself (n=3), and none of them were aware of dedicated tools for versioning larger datasets. Instead, several researchers voiced their interest in such tools and asked to be kept up to date about this project. In sum, the results indicate that version control of data is not yet a prevalent practice, nor do researchers know about the advantages and disadvantages of specific tools. This reveals a gap in how researchers version the different parts of their research projects.

Existing Tools

These findings were a strong motivation for my project because they indicated a clear need in the research community (within the admittedly narrow sample of participants, of course). While searching for specific tools, I came across a Stack Exchange post from 2013 asking whether a "Git for data" existed.

'Is there a Git for data' on Stack Exchange (source: https://opendata.stackexchange.com/questions/748/is-there-a-git-for-data)

Interestingly, back then, a "Git for data" did not exist as a software tool. With growing data volumes and the broad adoption of machine learning in many organizations, however, the need to reproduce and share data has increased. More recently, this has led to a rise in open source tools that can track versions of datasets over time. During my research, I came across several tools, from which I selected five³ that fulfilled the following criteria: First, they should be entirely open source. Second, they should not be restricted to a specific application area, such as machine learning or extract, transform, load (ETL) pipelines. The available tools range from dedicated databases with version control capabilities (e.g., Dolt, TerminusDB) to storage-agnostic tools that allow for versioning the entire data pipeline (e.g., DVC).

| Name | URL | Focus |
|------|-----|-------|
| DVC | https://dvc.org | Data version control & ML |
| Dolt | https://www.dolthub.com | Revision-controlled SQL database |
| Git LFS | https://git-lfs.github.com | Versioning of large files in Git |
| Qri | https://qri.io | Data version control |
| TerminusDB | https://terminusdb.com | Revision-controlled graph database |

Which Tools Should Researchers Use?

It depends on the use case. As outlined above, different kinds of tools are available. Dolt, for example, is an open source SQL database that can track differences in the data stored in it. If multiple researchers are crawling data and each of them submits their piece to a shared database, such a solution might be ideal. However, researchers then do not have full flexibility as to where their data is stored. DVC, on the other hand, is a lightweight, storage-agnostic tool that only saves metadata about changes in the datasets; these metadata files can then be tracked with Git. As these examples show, researchers should first identify their area of application and then choose a matching solution.
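To make the DVC workflow more concrete, here is a minimal sketch in Python using DVC's `dvc.api` module. The repository URL, file path, and tag are hypothetical placeholders; the point is that a collaborator can retrieve an exact dataset version from the small metadata files tracked in Git.

```python
# Minimal sketch: reading a specific version of a DVC-tracked dataset.
# The repository URL, file path, and revision below are hypothetical.
import dvc.api

# Fetch the dataset exactly as it existed at a given Git tag or commit,
# without manually checking out the repository or its data remote.
data_v2 = dvc.api.read(
    "data/reviews.csv",                      # path tracked by DVC
    repo="https://github.com/user/project",  # Git repo holding the .dvc metadata
    rev="v2",                                # tag or commit marking the version
)

print(data_v2[:200])  # first characters of that exact dataset version
```

Because only the metadata lives in Git, the data itself can sit in a remote of the researcher's choice (e.g., cloud storage or a lab server), and any tagged version remains retrievable later.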

Overall, while researchers have increasingly embraced Git to version control their research projects, "Git for data" tools are not yet on the radar of many of them. In this first post, I outlined the motivation and gave an overview of available tools. Due to the heterogeneity of the tool landscape, I will post further in-depth discussions of individual tools in this blog series. The first tool that I evaluate is DVC; the post is available here.


  1. I used formr because it is an open source tool that produces well-formatted surveys based on CSV files.↩︎

  2. A summary of the results is available from: https://de.wikiversity.org/wiki/Wikiversity:Fellow-Programm_Freies_Wissen/Einreichungen/Data_Version_Control:_Best_Practice_for_Reproducible,_Shareable_Science%3F/Book and the raw data is available from https://osf.io/vjxkq/.↩︎

  3. A more comprehensive list of tools is available from: https://docs.google.com/spreadsheets/d/1jGQY_wjj7dYVne6toyzmU7Ni0tfm-fUEmdh7Nw_ZH0k/edit?ts=5fc6a2a5#gid=0. Moreover, a discussion of these tools is available from: https://www.youtube.com/watch?v=r5uxntl_hWg.↩︎