Data Storage

This article discusses different aspects of data storage, such as different types of storage, data backup, and data retention. While Data_Security is an important topic closely related to data storage, it is not covered in this article, as there is a dedicated article for that.


Read First

  • include here key points you want to make sure all readers understand


Version control for data

There is no industry standard for version control of data that is as dominant as Git is for version control of code. We use Git both to keep track of when changes are made to the code and to restore old versions of the code. No system handles those two tasks as elegantly for data, so we suggest different methods for those two use cases when it comes to data.

To keep track of when changes are made to data we use checksums, data signatures, or hashes. They are implemented differently but follow the same principle: the data is boiled down to a short signature of a fixed format. A very simplified version of this for individual words would be to take each letter's position in the alphabet, sum those numbers, and then add the digits of the result until only a single digit remains. For "cat" we would get 3+1+20=24 (c=3, a=1, t=20), and then 2+4=6, so the hash signature for "cat" would be 6. For "horse" we would get 8+15+18+19+5=65, then 6+5=11 and 1+1=2, so the hash signature would be 2.
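
As an illustration, here is a minimal Python sketch of the toy hash described above. The function name toy_hash is made up for this example and is not part of any real hashing library.

  # Toy hash: sum each letter's position in the alphabet, then repeatedly
  # sum the digits of the result until only a single digit remains.
  def toy_hash(word):
      total = sum(ord(letter) - ord("a") + 1 for letter in word.lower())
      while total >= 10:
          total = sum(int(digit) for digit in str(total))
      return total

  print(toy_hash("cat"))    # 24 -> 2+4 = 6
  print(toy_hash("horse"))  # 65 -> 6+5 = 11 -> 1+1 = 2

Real hash algorithms follow the same boil-it-down principle, just with vastly more possible output values.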

A neat property of this is that no matter how big your data is (in this case, how long the word is), the hash signature is always the same size. The main problem with this very simplified hash algorithm is that there is only a single digit's worth of possible signatures, so many words would share the same signature. Real-world hash algorithms, however, like the checksum command in Stata, produce signatures of the format "2694850408", which allows 10^10 possible signatures.

This makes it easy to compare the checksums of your data to see whether anything has changed, and the chance that the checksum signature stays the same after your data has changed is only about one in 10^10, which is an acceptable risk. If needed, there are other implementations, like SHA-256, that have 16^64 (2^256) possible signatures. Stata also has the command datasignature, which produces a signature that combines a hash value with some basic information about the dataset, such as the number of observations. While the checksum command can be used on any file, datasignature can only be used on Stata datasets.
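
For data files outside Stata, the same idea can be applied with a standard hashing library. Below is a minimal Python sketch, shown only as an illustration, that computes the SHA-256 checksum of a file and compares it to a previously recorded value; the file name data.csv and the stored signature are placeholders for this example.

  import hashlib

  # Compute the SHA-256 checksum of a file, reading it in chunks so that
  # large data files can be hashed without loading them into memory at once.
  def file_checksum(path):
      sha = hashlib.sha256()
      with open(path, "rb") as f:
          for chunk in iter(lambda: f.read(1024 * 1024), b""):
              sha.update(chunk)
      return sha.hexdigest()

  # Compare against a signature recorded earlier; a mismatch means the data has changed.
  recorded_signature = "<checksum saved the last time the data was verified>"
  if file_checksum("data.csv") != recorded_signature:
      print("The data file has changed since the signature was recorded.")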

In addition to detecting changes, the other important aspect of version control is being able to restore old versions. Since data files tend to be big, there is no good long-term solution for this. Some of the data storage types discussed below offer some ability to restore old files, but in those systems there is always a time limit on how long old versions are stored, as they would otherwise take up too much disk space. This time limit is usually a couple of months, but since we have projects that span multiple years, this cannot be the solution we rely on.

Instead of trying to retain the ability to restore each version of the dataset itself, we recommend that you securely back up the raw dataset (see below) and make sure that you can restore the code that generated each version of your dataset. Git is an excellent tool for this: as long as you track your code in Git and keep a copy of the original data, you indirectly have a way to restore each version of your data.
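
One way to combine the two is sketched below: keep a small text file with the checksum of the raw data under version control together with your code, so that you can always confirm the backed-up raw data is intact before re-running the code. This is only an illustration, and the file names raw_survey_data.csv and data_signature.txt are hypothetical.

  import hashlib

  # Hypothetical names for this sketch: the backed-up raw data file and a small
  # signature file that is committed to Git together with the code.
  raw_data = "raw_survey_data.csv"
  signature_file = "data_signature.txt"

  # One-shot SHA-256 of the raw data (fine for files that fit in memory).
  with open(raw_data, "rb") as f:
      current = hashlib.sha256(f.read()).hexdigest()

  # The signature file holds the checksum recorded when the raw data was backed up.
  with open(signature_file) as f:
      recorded = f.read().strip()

  if current == recorded:
      print("Raw data is intact; re-run the Git-tracked code to reproduce any version of the derived data.")
  else:
      print("Raw data does not match the recorded signature; restore it from your backup.")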

Data storage types

You should never use the same files or storage solution for the backup copy of your original or raw data as for your day-to-day work. While some storage types can be used for both day-to-day work and backup, this section covers the former use case; backing up data is covered in a later section below. There are multiple types of storage, and which type is best for you depends on the following factors: the size of your data, where it will be used, and cost.

Storage for different size of data

A useful rule of thumb when deciding which type of storage is suitable given the size of your data is whether the total size of all data in your project fits in the space usually available on a typical user's laptop. If the data is small enough to fit on a regular laptop, then synced storage (such as World Bank OneDrive or Dropbox) becomes an option. In synced storage each user has their own copy saved on their computer, and the sync software makes sure that all users have identical files. This is different from storing data on network drives or cloud storage, where each user accesses the same file rather than an identical copy. When each user has their own copy of the file, access to that file tends to be faster.

If the total size of all data in the project folder is too big to fit on a regular laptop, but the size of the files relevant to each user is not, then synced storage can still be used if the syncing service allows you to sync only specific folders or files in your project folder.

If the data in the project folder is too big to be synced to a typical laptop, then the data can be stored on a network drive or in cloud storage. However, in these solutions there is no copy of the file stored on each user's hard drive, and depending on the exact service used and connectivity speed, this can be slow. So even if network or cloud storage has next to unlimited storage capacity, the speed at which the data can be accessed over the connection is often the practical constraint.

Back-up protocols

Back-up storage types

Data retention

Back to Parent

This article is part of the topic *topic name, as listed on main page*


Additional Resources

  • list here other articles related to this topic, with a brief description and link