Difference between revisions of "Data Storage"
Kbjarkefur (talk | contribs) |
Revision as of 21:35, 24 February 2021
This article discusses different aspects of data storage (such as different types of storage, data back-up and data retention). While Data_Security is a very important topic related to data storage, it is not covered in this article as there is a dedicated article for that.
Read First
- include here key points you want to make sure all readers understand
Version control for data
There is no industry standard for version control of data that is as dominant as Git is for version control of code. We use Git both to keep track of when changes are made to the code and to restore old versions of the code. There is no system that handles both as elegantly for data, and we therefore suggest different methods for those two use cases when it comes to data.
To keep track of when changes are made to data we have to use checksums, data signatures or hashes. They all work differently but follow the same principle: the data is boiled down to a single short value of fixed size. A very simplified version of this for individual words would be to take the corresponding number in the alphabet for each letter, sum those numbers, and then add the digits of the result, repeating until you have a single digit. So for "cat" we would get 3+1+20=24 (c=3, a=1, t=20), and then 2+4=6, so the hash signature for "cat" would be 6. "Horse" would be 8+15+18+19+5=65, then 6+5=11 and 1+1=2, so the hash signature would be 2. The good thing about this is that no matter how big your data is (in this case, how long the word is) the hash signature is always the same size. The main problem with this very simplified hash algorithm is that there are only nine possible values (1 to 9), so many words would share the same signature. However, real-world hash algorithms like the checksum command in Stata have signatures in the format "2694850408", a number of up to ten digits, which allows for at most 10^10 possible signatures. It is much easier to compare the checksums of your data than the data itself to see if anything has changed, and the chance that the checksum stays the same after your data has changed is on the order of one in several billion, which is an acceptable risk. If needed, there are other implementations like SHA256 that have 16^64 possible signatures. Stata also has the command datasignature, whose signature combines a hash value with some basic information about the dataset, such as the number of observations. While the checksum command can be used on any file, datasignature can only be used on Stata datasets.
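The simplified word hash described above can be sketched in a few lines of Python. This is only an illustration of the toy algorithm from the text, not of how Stata's checksum or any real hash works, and the function name toy_hash is made up for this example.

```python
def toy_hash(word: str) -> int:
    """Toy signature: sum the alphabet positions of the letters,
    then add the digits of the result until one digit remains."""
    # a=1, b=2, ..., z=26; characters that are not letters are ignored.
    total = sum(ord(ch) - ord("a") + 1 for ch in word.lower() if "a" <= ch <= "z")
    # Repeatedly replace the number by the sum of its digits.
    while total >= 10:
        total = sum(int(digit) for digit in str(total))
    return total


# "cat" gives 6 and "Horse" gives 2, as in the worked example in the text.
# "act" has the same letters as "cat", so it gets the same signature:
# a collision, which is why such a small signature is not useful in practice.
for word in ["cat", "Horse", "act"]:
    print(word, toy_hash(word))
```

However long the word is, the result is always a single digit, which shows both the fixed-size property and why collisions are unavoidable with so few possible values.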
In addition to detecting changes, the other important aspect of version control is being able to restore old versions. Since data files tend to be big, there is no good long-term solution for this. Some data storage types discussed below offer some ability to restore old files, but in those systems there is always a time limit for how long old versions are stored, as it would otherwise take too much disk space. This time limit is usually a couple of months, but since we have projects that span multiple years this cannot be the solution we rely on. Instead of trying to have the ability to restore the dataset itself, we recommend that you securely back up the raw dataset (see below) and make sure that you can restore the code that generated each version of your dataset. Git is an excellent tool for this, and as long as you track your code in Git and have a version of the original data, you indirectly have a way to restore each version of your data.
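Restoring a dataset by re-running code only works if the raw data is still exactly the file the code was written for. One way to check this is to record a signature of the raw file when it is backed up and compare it before re-running the code. The sketch below uses SHA-256 from Python's standard hashlib module. The function names and the file name raw_survey.csv are hypothetical, and this is one possible approach rather than a procedure prescribed by this article.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 signature of a file as a 64-character hex string."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so that large data files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path: Path, recorded_signature: str) -> bool:
    """Compare the file's current signature with one recorded earlier."""
    return file_sha256(path) == recorded_signature


# Demonstration on a small temporary file (hypothetical name and content).
with tempfile.TemporaryDirectory() as folder:
    raw = Path(folder) / "raw_survey.csv"
    raw.write_text("id,income\n1,100\n2,250\n")
    recorded = file_sha256(raw)          # store this next to the code, e.g. in Git
    print(is_unchanged(raw, recorded))   # True: nothing has changed

    raw.write_text("id,income\n1,100\n2,251\n")
    print(is_unchanged(raw, recorded))   # False: one value was edited
```

The recorded signature is a short text string, so unlike the data itself it can be tracked in Git together with the code.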
Data storage types
You should never use the same files or storage solution for the back-up copy of your original or raw data as you use for your day-to-day work. While some storage types could be used for both day-to-day work and back-up, this section covers the former use case. Backing up data is covered in a later section below. There are multiple types of storage, and which type is best for you depends on the following factors: the size of your data, where it will be used, and cost.
Storage for different sizes of data
A good starting point when deciding which type of storage is suitable for the size of your data is whether the total size of all data in your project fits in the space usually available on a typical user's laptop. If the data is small enough to fit on a regular laptop, then synced storage (such as World Bank OneDrive or DropBox) becomes an option. In synced storage each user has their own copy saved on their computer, and the sync software makes sure that the
Back-up protocols
Back-up storage types
Data retention
Back to Parent
This article is part of the topic *topic name, as listed on main page*
Additional Resources
- list here other articles related to this topic, with a brief description and link