Data Storage

This article discusses different aspects of data storage, such as different types of storage, data backup, and data retention. While Data Security is a very important topic related to data storage, it is not covered in this article, as there is a dedicated article for that.


Read First

  • There is no version control standard for data as dominant as Git is for code: use checksums or data signatures to detect when data has changed, and back up the raw data together with Git-tracked code so that any version of your data can be recreated.
  • Which storage type is best for your day-to-day data depends on the size of your data, where it will be used, and cost.
  • Never use the same file or storage solution for the backup copy of your original or raw data as for your day-to-day work.


Version control for data

There is no industry standard for version control of data that is as dominant as Git is for version control of code. We use Git both to keep track of when changes are made to the code and to restore old versions of the code. No system deals with those two tasks as elegantly for data, so we suggest different methods for those two use cases when it comes to data.

To keep track of when changes are made to data we have to use checksums, data signatures, or hashes. They all work differently but follow the same principle: the data is boiled down to a single value of a fixed size. A very simplified version of this for individual words would be to take each letter's position in the alphabet, sum those numbers, and then repeatedly sum the digits of the result until you have a single digit. For "cat" we would get 3+1+20=24 (c=3, a=1, t=20), and then 2+4=6, so the hash signature for "cat" would be 6. "Horse" would be 8+15+18+19+5=65, then 6+5=11 and 1+1=2, so the hash signature would be 2.
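To make the arithmetic concrete, here is a minimal sketch of this toy algorithm written as a Stata program (Stata is also where the checksum and datasignature commands discussed below live). The program name toyhash is our own invention and this is not a real hash algorithm.

 * A toy illustration of the digit-sum "hash" described above -- not a real hash algorithm
 capture program drop toyhash
 program define toyhash
     args word
     local word = strlower("`word'")
     * Sum each letter's position in the alphabet (a=1, ..., z=26)
     local total = 0
     forvalues i = 1/`=strlen("`word'")' {
         local total = `total' + strpos("abcdefghijklmnopqrstuvwxyz", substr("`word'", `i', 1))
     }
     * Repeatedly sum the digits until a single digit remains
     while `total' > 9 {
         local newtotal = 0
         forvalues i = 1/`=strlen("`total'")' {
             local newtotal = `newtotal' + real(substr("`total'", `i', 1))
         }
         local total = `newtotal'
     }
     display as result `"Toy hash of "`word'": `total'"'
 end

 toyhash cat    // 3 + 1 + 20 = 24 -> 2 + 4 = 6
 toyhash horse  // 8 + 15 + 18 + 19 + 5 = 65 -> 6 + 5 = 11 -> 1 + 1 = 2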

A neat practicality about this is that no matter how big your data is (in this case, how long the word is), the hash signature is always the same size. The main problem with this very simplified hash algorithm is that there are only 10 possible values, so many words would share the same signature. Real-world hash algorithms, however, like the one behind the checksum command in Stata, produce signatures in a format like "2694850408", which allows for up to 10^10 possible signatures.

It is much easier to compare the checksums of your data to see if anything in your data has changed, and the chance that the checksum stays the same after your data has changed is roughly one in 10 billion, which is an acceptable risk. If needed, there are other implementations, like SHA-256, that have 2^256 possible signatures. Stata also has the command datasignature, whose signature combines a hash value with some basic information about the dataset, such as the number of observations. While the checksum command can be used on any file, datasignature can only be used on Stata datasets.
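As a minimal sketch of how these two commands can be used, assume a raw file called mydata.csv and a Stata dataset mydata.dta; both file names are placeholders for your own files.

 * checksum works on any file: compare the reported value over time to detect changes
 checksum "mydata.csv"
 return list                   // stored results, including the checksum value

 * datasignature works on the Stata dataset currently in memory
 use "mydata.dta", clear
 datasignature                 // display the signature for the data in memory
 datasignature set             // store the signature with the dataset (save the dataset to keep it)
 datasignature confirm         // later: throws an error if the data no longer match the signature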

In addition to detecting changes, the other important aspect of version control is being able to restore old versions. Since data files tend to be big, there is no good long-term solution for this. Some data storage types discussed below offer some ability to restore old files, but in those systems there is always a time limit on how long old versions are stored, as it would otherwise take too much disk space. This time limit is usually a couple of months, but since we have projects that span multiple years, this cannot be the solution we rely on.

Instead of trying to keep the ability to restore the dataset itself, we recommend that you securely back up the raw dataset (see below) and make sure that you can restore the code that generated each version of your dataset. Git is an excellent tool for this, and as long as you track your code in Git and have a copy of the original data, you indirectly have a way to restore each version of your data.
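As a minimal sketch of what that workflow can look like from within Stata, assume the cleaning code is tracked in a Git repository as cleaning.do and the backed-up raw data is unchanged; the file name and the <commit-hash> placeholder are illustrative only.

 * Find the commit that produced the version of the data you need, check out
 * that version of the do-file, and re-run it on the backed-up raw data.
 * Replace <commit-hash> with a hash reported by git log.
 !git log --oneline -- cleaning.do
 !git checkout <commit-hash> -- cleaning.do
 do cleaning.do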

Data storage types

You should never use the same file or storage solution for the backup copy of your original or raw data as for the files and storage solution you use in your day-to-day work. While some storage types could be used for both day-to-day work and backup, this section covers the former use case; backing up data is covered in a later section below. There are multiple types of storage, and which type is best for you depends on the following factors: the size of your data, where it will be used, and cost.

Storage for different sizes of data

A good starting point when deciding which type of storage is suitable for the size of your data is whether the total size of all data in your project can fit in the space usually available on a typical user's laptop. If the data is small enough to fit on a regular laptop, then synced storage solutions (such as World Bank OneDrive or Dropbox) become available. In synced storage, each user has their own copy saved on their computer, and the sync software makes sure that all copies are kept identical whenever someone makes a change.

Back-up protocols

Back-up storage types

Data retention

Back to Parent

This article is part of the topic *topic name, as listed on main page*


Additional Resources

  • list here other articles related to this topic, with a brief description and link