Data Storage
This article discusses different aspects of data storage, such as different types of storage, data backup, and data retention. While Data_Security is a very important topic related to data storage, it is not covered in this article, as it has a dedicated article of its own.
Read First
- include here key points you want to make sure all readers understand
Version control for data
There is no dominant standard for version control of data the way Git is the dominant standard for version control of code. Git can be used both to keep track of changes made to code and to restore older versions of that code. No system does both of those things as elegantly for data. This section therefore suggests several methods, and the project team should pick the method best suited to their exact use case.
Version control the code that generates the data
All datasets should be generated by code and should be reproducible by re-running that code. Therefore, if you have a good back-up system for the original data and version control all your code using Git, then you have an implicit version control system for your data. As long as the original data is unchanged, changes to the datasets you generate can only happen through changes to the code tracked in Git. Git can also be used to restore old versions of the data: simply restore an old version of the code and re-generate the data.
While this method is often an excellent option, it does not work when the original data is updated frequently (ongoing data collection or data streams) or when the code is not accessible (someone else is generating the data).
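As a minimal sketch of what this looks like in practice, the do-file below re-creates the analysis dataset from the backed-up original data every time it is run (all file names here are hypothetical examples, not a prescribed folder structure). Restoring an old version of the data is then just a matter of checking out an old version of this code in Git and re-running it.

 * Minimal sketch of a reproducible pipeline: the analysis dataset is never
 * edited by hand, it is only ever re-created by running this do-file.
 * (All file names are hypothetical.)
 use "raw/original_data.dta", clear     // backed-up original data, never modified
 do "code/cleaning.do"                  // every change to the data happens in code
 save "data/analysis_data.dta", replace // re-generated output, tracked implicitly via Git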
Version control using checksums/hashes
One way to keep track of changes made to data is to use checksums, hashes or data signatures. These three concepts all work slightly differently, but they follow the same principle and serve the same purpose in the context of version control of data. The principle they all follow is that a dataset is boiled down to a short string of text or a number. That string or number can then be compared across datasets to test whether they are identical or not.
One very simplified way to explain how this works is the following example. In this hashing algorithm we start with a word instead of a dataset, but real-world hashing algorithms can handle both. Take the corresponding number in the alphabet for each letter of the word, sum those numbers, and then repeatedly add the digits of the result until you have a single digit. For "cat" we would get 3+1+20=24 (c=3, a=1, t=20), and in the next step 24 is turned into 6 (2+4=6). So the hash signature for "cat" would be 6. "Horse" would be 8+15+18+19+5=65, then 6+5=11 and 1+1=2, so its hash signature would be 2.
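A sketch of this toy algorithm as a Stata program is shown below. The program name toyhash is made up for this illustration, and the repeated digit-summing is computed in one step as the digital root, 1 + mod(n-1, 9), which gives the same single digit.

 * Toy hash sketch: sum the alphabet positions of the letters in a word,
 * then reduce the sum to a single digit. The name "toyhash" is hypothetical.
 capture program drop toyhash
 program define toyhash, rclass
     args word
     local word = lower("`word'")
     local sum = 0
     forvalues i = 1/`=strlen("`word'")' {
         local pos = strpos("abcdefghijklmnopqrstuvwxyz", substr("`word'", `i', 1))
         local sum = `sum' + `pos'
     }
     * Repeatedly summing the digits equals the digital root: 1 + mod(n - 1, 9)
     local hash = 1 + mod(`sum' - 1, 9)
     display "`word': sum = `sum', hash signature = `hash'"
     return scalar hash = `hash'
 end

 toyhash cat      // sum = 24, hash signature = 6
 toyhash horse    // sum = 65, hash signature = 2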
No matter how big the data is (how long the word is in the example above), the hash signature always has the same size and format, so we can quickly test whether two signatures are the same. The main problem with the very simplified hash algorithm above is that there are only 10 possible values, so many words would share the same signature. Real-world hash algorithms, like the one behind the checksum command in Stata, produce signatures in a format like "2694850408", which allows 10^10 possible signatures.
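In Stata this looks as follows; the file name mydata.dta is a hypothetical example.

 * Compute the checksum of a saved file (the file name is hypothetical)
 checksum "mydata.dta"
 display r(checksum)   // the hash signature, for example 2694850408
 display r(filelen)    // the size of the file in bytes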
With 10^10 possible signatures there is a chance that two different datasets have the same checksum. This is called a "collision". However, checksum algorithms are implemented so that two similar datasets are very unlikely to have a collision, or even similar checksums, making the risk of two versions of the same dataset colliding extremely low. Other algorithms are implemented so that there are many more possible signatures, but then the hash signature gets longer.
Stata also has the command datasignature, which produces a signature that combines a hash value with some basic facts about the dataset, such as the number of observations. While this provides some useful human-readable information, it can only be used on Stata .dta files.
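For example, run on the dataset in memory (here Stata's built-in auto dataset), datasignature returns a signature where the number of observations and variables can be read directly from the start of the string:

 * Signature of the dataset currently in memory
 sysuse auto, clear
 datasignature
 display r(datasignature)   // e.g. 74:12(71728):3831085005:1395876116,
                            // i.e. 74 observations and 12 variables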
The drawback of all these methods is that they say nothing about how the dataset was created, so there is no way to recreate a dataset based only on its checksum or hash signature. If you have the checksum of a dataset used in the past, but not the corresponding version of the dataset, then all you can do is say whether the current version of the same dataset is identical to it. You cannot say what differs if they are different. If you used the Stata command datasignature you can get some clues about what differs, such as the number of observations or the number of variables, but you will still not be able to recreate the old version of the data.
This method is a good fit when you want a quick way to test that a dataset has not changed, and the details of what changed either do not matter or can be found out in another acceptable, perhaps manual, way. One example is when you are accessing someone else's dataset and want to know whether that person has made changes to it. Another is to check which datasets, if any, change when the code is updated. Sometimes we want to make changes to the code that should not change the dataset it produces, and this method is a great way to verify that, as in the sketch below.
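One way to set this up in Stata is sketched below, using datasignature's saving() and using options to keep the signature in a small separate file, so the check survives the dataset being re-created. The file names are hypothetical.

 * Before changing the code: save the current signature to a separate file
 use "data/analysis_data.dta", clear
 datasignature set, saving("data/analysis_data_sig", replace)

 * After re-running the updated code that should not change the data:
 use "data/analysis_data.dta", clear
 datasignature confirm using "data/analysis_data_sig"
 * datasignature confirm exits with an error if the data no longer
 * match the saved signature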
Data storage types
You should never use the same file or storage solution for the back-up copy of your original or raw data as for the files you use in your day-to-day work. While some storage types could be used both for day-to-day work and for backup, this section covers the former use case; backing up data is covered in a later section below. There are multiple types of storage, and which type is best for you depends on the following factors: the size of your data, where it will be used, and cost.
Storage for different sizes of data
A useful rule of thumb when deciding which type of storage is suitable, given the size of your data, is whether the total size of all data in your project can fit in the space usually available on a typical user's laptop. If the data is small enough to fit on a regular laptop, then synced storage (such as World Bank OneDrive or Dropbox) becomes an option. In synced storage each user has their own copy saved on their computer, and the sync software makes sure that all users have identical files. This is different from storing data on network drives or in cloud storage, where all users work on the very same file rather than on identical copies. When each user has their own copy of the file, access to that file tends to be faster.
If the total size of all data in the project folder is too big to fit on a regular laptop, but the size of the files relevant to each user is not, then synced storage can still be used, provided that the syncing service allows you to sync only specific folders or files in your project folder.
If the data in the project folder is too big to be synced to a typical laptop, then the data can be stored on a network drive or in cloud storage. However, with these solutions there is no copy of the file stored on each user's hard drive, and depending on the exact service used and the connectivity speed, this can be slow. And even if network or cloud storage has next to unlimited storage capacity, the speed at which the data can be accessed over the network may still be a limiting factor.
Back-up protocols
Back-up storage types
Data retention
Back to Parent
This article is part of the topic *topic name, as listed on main page*
Additional Resources
- list here other articles related to this topic, with a brief description and link