Difference between revisions of "Data Storage"
Kbjarkefur (talk | contribs) |
Revision as of 18:08, 3 June 2021
This article discusses different aspects of data storage, such as types of storage, data back-up and data retention. While Data_Security is a very important topic related to data storage, it is not covered here because it has its own dedicated article.
Read First
- Data should be version controlled. Ideally, the version control system should both detect changes to the data and make it possible to revert to old versions.
- The best data storage option depends on how big the data is, whether the data contains confidential information, and how the data will be used.
- All original data (the data sets collected or received by the team) should be backed-up. All derivative data (data sets created from original data) should be created by reproducible code and that code should be version controlled in Git. Then derivative data does not need to be backed-up as it can be re-created.
Version control for data
Version control serves two purposes. The first is to keep track of changes and modifications to files, and the second is to make it possible to revert a file to an old version. For code there is an industry standard that efficiently fulfills both purposes: Git. Git can be used both to keep track of changes made to code and to restore older versions of code. Unfortunately, there is no system that does both of these things as elegantly for data over a longer period of time. There is therefore no industry-wide, one-size-fits-all solution for version control of data. This article instead suggests several methods, and the project team should pick the method that best fits their exact use case.
Version control code that generates data
All derivative datasets (datasets that the project team creates from data they received or collected) should be generated by code and should be reproducible by re-running that code. Therefore, if the original data is properly backed-up and all code is version controlled using Git, then the derivative datasets are implicitly version controlled.
While this method is often an excellent option, it does not work when the original data is updated frequently (ongoing data collection or data streams) or when the code is not accessible (someone else is generating the data). However, when the original data is unchanged, changes to derivative datasets can only happen through changes to the code tracked in Git. Git can also be used to restore old versions of the code, which in turn can be used to restore old versions of the derivative data.
Version control using checksums/hashes
One way to keep track of changes made to data is to use checksums, hashes or data signatures. These three concepts all work slightly differently, but they all follow the same principle and serve the same purpose in the context of version control of data. The principle is that a dataset is boiled down to a short string of text or a number. That string or number can then be compared across datasets to test whether they are identical.
One very simplified way to explain how this works is the following example. In this toy hashing algorithm we start with a word instead of a dataset, but real-world hashing algorithms can handle both. Take the position in the alphabet of each letter in the word, sum those numbers, and then add the digits of the sum together, repeating until a single digit remains. For "cat" we get 3+1+20=24 (c=3, a=1, t=20), and in the next step 24 is turned into 6 (2+4=6), so the hash signature for "cat" is 6. For "Horse" we get 8+15+18+19+5=65, then 6+5=11 and 1+1=2, so the hash signature is 2. No matter how big the data is (how long the word is in this simple example), the hash signature always has the same size and format, and we can quickly test whether two signatures are the same. The main problem with this very simplified algorithm is that it has only nine possible values (1 through 9), so many words share the same signature. Real-world algorithms, such as the checksum command in Stata, produce signatures in a format like "2694850408", which allows for up to 10^10 possible signatures.
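The toy algorithm above can be sketched in a few lines of Python. This is only an illustration of the example in the text, not a real hashing algorithm; the function name `toy_hash` is made up for this sketch, and ignoring non-letter characters is an added assumption.

```python
def toy_hash(word: str) -> int:
    """Toy hash: sum the alphabet positions of the letters, then sum digits until one digit remains."""
    # Position in the alphabet for each letter (a=1, ..., z=26).
    # Assumption for this sketch: characters other than a-z are ignored.
    total = sum(ord(ch) - ord("a") + 1 for ch in word.lower() if "a" <= ch <= "z")
    # Repeatedly add the digits together until a single digit is left.
    while total >= 10:
        total = sum(int(digit) for digit in str(total))
    return total


print(toy_hash("cat"))    # 3+1+20=24, then 2+4=6
print(toy_hash("Horse"))  # 8+15+18+19+5=65, then 6+5=11, then 1+1=2
print(toy_hash("act"))    # same letters as "cat", so it gets the same signature
```

The last call shows why so few possible values are a problem: "act" and "cat" are different words but end up with the same signature.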
Even with 10^10 possible signatures there is a chance that two different datasets have the same checksum. This is called a "collision". However, checksums are implemented so that two similar datasets are very unlikely to collide or even to have similar checksums, which makes the risk of a collision between two versions of the same dataset extremely low. Other algorithms allow many more combinations, but then the hash signature gets longer. Stata also has the command datasignature, which produces a signature that combines a hash value with some basic information about the dataset, such as the number of observations. While this provides some useful human-readable information, it can only be used on Stata .dta files.
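As a rough illustration of the trade-off between signature length and the number of possible signatures, the sketch below computes a short and a long signature for two nearly identical datasets using Python's standard library. CRC-32 (`zlib.crc32`) and SHA-256 (`hashlib.sha256`) are stand-ins chosen for this example; they are not claimed to be the algorithms behind Stata's checksum or datasignature commands, and the two small CSV snippets are invented.

```python
import hashlib
import zlib


def signatures(data: bytes) -> tuple[int, str]:
    """Return a short 32-bit checksum and a long 256-bit hash for the same bytes."""
    crc = zlib.crc32(data)                  # integer between 0 and 2**32 - 1
    sha = hashlib.sha256(data).hexdigest()  # 64 hexadecimal characters
    return crc, sha


# Two hypothetical versions of a tiny dataset; a single digit differs.
version_1 = b"id,income\n1,1000\n2,2500\n"
version_2 = b"id,income\n1,1000\n2,2501\n"

for label, data in [("version 1", version_1), ("version 2", version_2)]:
    crc, sha = signatures(data)
    print(label, crc, sha)
```

Both signatures change when one digit of the data changes. The longer SHA-256 signature has vastly more possible values, which makes accidental collisions far less likely, at the cost of a string that is harder to read and compare by eye.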
The drawback of all these methods is that they say nothing about how the dataset was created, so there is no way to recreate a dataset based only on its checksum or hash signature. If you have the checksum of a dataset used in the past but not the corresponding version of the dataset, then all you can do is say whether the current version of the dataset is identical to it. If they differ, you cannot say what differs. If you used the Stata command datasignature you can get some clues about what differs, such as the number of observations or the number of variables, but you will still not be able to recreate the old version of the data.
This method is a good fit when you want a quick way to test that a dataset has not changed, and either the details of what changed do not matter or you have another acceptable, perhaps manual, way to find out what those details are. One example is when you are accessing someone else's dataset and want to know whether that person makes changes to it. Another use is to check which datasets, if any, change when the code is updated. Sometimes we want to make changes to the code that should not change the dataset it produces, and this method is a great way to verify that.
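The workflow described above, recording signatures once and later checking which datasets have changed, could look roughly like the following Python sketch. The manifest file, the function names and the demo file names are all hypothetical, and SHA-256 is used as one possible hash; adapt the idea to whichever checksum tool the team already uses.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Hash a file in chunks so that large data files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def save_manifest(paths, manifest_path: Path) -> None:
    """Record the current signature of each data file in a JSON manifest."""
    manifest = {Path(p).name: file_sha256(Path(p)) for p in paths}
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))


def changed_files(paths, manifest_path: Path) -> list:
    """Return the names of files whose signature differs from the manifest."""
    manifest = json.loads(Path(manifest_path).read_text())
    return [Path(p).name for p in paths
            if manifest.get(Path(p).name) != file_sha256(Path(p))]


def demo() -> tuple:
    """Small demonstration using throwaway files in a temporary folder."""
    with tempfile.TemporaryDirectory() as folder:
        first = Path(folder) / "households.csv"
        second = Path(folder) / "villages.csv"
        first.write_text("id,size\n1,4\n2,6\n")
        second.write_text("id,name\n1,A\n2,B\n")
        manifest_path = Path(folder) / "checksums.json"

        save_manifest([first, second], manifest_path)
        before = changed_files([first, second], manifest_path)

        # Simulate re-running code that unexpectedly alters one dataset.
        second.write_text("id,name\n1,A\n2,C\n")
        after = changed_files([first, second], manifest_path)
        return before, after


print(demo())
```

As the text notes, a check like this only reports that a file changed, not what changed or how to get the old version back.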
Version control in sync software
File syncing software (read more about it below) often includes a version control system that allows you both to detect changes made to data files and to restore old versions of those files. However, the way this is implemented in these systems is so storage-inefficient that you can only restore file versions that are less than a few months old. If your project is completed within that time frame, then this is a great solution. Typically, however, a project runs for much longer than that.
Data storage types
There are different types of data storage options, and which one is best for each use case typically depends on how big the data is, whether the data contains confidential information, and how the data will be used. Below are some storage types with their pros and cons. This article considers data stored as data files and not, for example, data stored in databases.
File sync services
Common examples of file sync services are DropBox, OneDrive and Box, among several others. This storage type is suitable when the data files required for a project are small enough to be stored on a regular laptop or desktop computer. In file sync services a copy of each file exists locally on every computer that has the folder synced. This means that accessing a file is quick and that multiple people can access the file at the same time without problems, as each of them has their own copy of it.
However, because all files are saved locally, a person who works on many projects risks not having space for all of those projects' files on their computer. Many file syncing services have options to mitigate this. For example, you can choose not to sync the parts of the project folder that you know you do not need access to. Some sync services also allow you to keep non-synced files ready to be downloaded on demand (sometimes referred to as smart sync). However, this is not great for data work, as data files are usually too big to be downloaded on demand when your code tries to access them, which can cause your programming software to crash when it tries to read the file.
Another concern with file syncing services is privacy when sharing confidential data. There are enterprise subscriptions that can be installed on your own servers. This makes sharing data much more secure, as the data is never stored on or transferred through servers that belong to the company that offers the syncing service. If you do not have an enterprise subscription, or are not sure whether you do, you should always encrypt the data before saving it in a synced folder. DIME Analytics has published guidelines for encrypting files with Veracrypt, which is a secure and free software often used for this purpose.
Cloud storage
Cloud storage comes in many different kinds, and it is outside the scope of this wiki article to describe them all. This article only covers general points about cloud storage. As the article only covers data stored in files, cloud storage here typically means S3 storage on AWS or Blob storage on Azure, but it is not limited to those.
The benefit of cloud storage is that you do not need to buy or upgrade any hardware. To get started or to increase your capacity you only need to click a button or it might even be upgraded automatically. However, you will be charged more the more you use.
Cloud storage might not be the best option when the data will frequently be accessed from laptops and desktops. Even on a fast internet connection, the download time will be too long for most use cases if the data is downloaded each time the script is run. Instead, cloud storage is best used when it is frequently accessed from another cloud resource. However, make sure that the cloud storage is in the same physical location as the cloud resource that will use it. A single cloud provider has data centers across the world, and you might have issues if your storage is on the other side of the world from the resource that will use it. There are ways to address these issues, but they are outside the scope of this article. The main takeaway is that cloud storage is only a great solution if the resources that need access to it have a very fast connection to it.
At the same time, this does not mean that cloud storage should never be accessed from a laptop or a desktop. Cloud storage can be a great place to store data that is infrequently accessed by users on regular computers. There will still be a delay, but that can be acceptable as the data is only infrequently accessed.
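A back-of-the-envelope calculation can help judge whether a delay is acceptable. The sketch below estimates the ideal transfer time from file size and connection speed. The function name, the 5 GB file size and the connection speeds are hypothetical numbers chosen for illustration, and real transfers are usually slower because of latency and overhead.

```python
def download_seconds(file_size_gb: float, bandwidth_mbps: float) -> float:
    """Ideal transfer time in seconds.

    file_size_gb: size in gigabytes (10^9 bytes).
    bandwidth_mbps: connection speed in megabits per second.
    """
    size_megabits = file_size_gb * 8 * 1000  # 1 gigabyte = 8,000 megabits
    return size_megabits / bandwidth_mbps


# Hypothetical 5 GB data file.
print(download_seconds(5, 100))     # laptop on a 100 Mbit/s connection: 400.0 seconds
print(download_seconds(5, 10_000))  # resource with a 10 Gbit/s link to the storage: 4.0 seconds
```

Waiting several minutes every time a script runs is impractical, while waiting that long once for infrequently accessed data may be perfectly acceptable.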
Network drive storage
Network drive storage is in many ways similar to the types of cloud storage discussed above. The major difference in the context discussed here is that a network drive can be located in the same place as a user accessing it from a laptop/desktop, which makes it viable for such a user to access it frequently. This is a great solution when a regular laptop/desktop frequently needs access to the data and the combined size of all of the data files is too big for them all to be stored on the laptop/desktop.
The problem with this solution is that you need to purchase hardware and run your own network drive. This can be complicated if you do not work at a large organization with an IT team that can do this for you. Another drawback is that if you work at an organization that is spread out geographically, you will have the same issues with slow access speed as with cloud storage.
Back-up protocols
Use
Back-up storage types
You should never use the same files or storage solution for the back-up copy of your original or raw data as you use for your day-to-day work. While some storage types could be used for both day-to-day work and back-up, this section covers the former use case. Backing up of data is covered in a later section below. There are multiple types of storage, and which type is best for you depends on the following factors: the size of your data, where it will be used, and cost.
Data retention