Difference between revisions of "Data Storage"

Revision as of 18:19, 1 June 2021

This article discusses different aspects of data storage, such as different types of storage, data back-up and data retention. While Data Security is a very important topic related to data storage, it is not covered in this article as there is a dedicated article for it.




Version control for data

There is no dominant standard for version control of data the way Git is the dominant standard for version control of code. Git can be used both to keep track of changes made to code and to restore older versions of code. There is no system that does both of those things as elegantly for data. This section therefore suggests different methods, and the project team should pick the method that best fits their use case.

Version control the code that generates the data

All datasets should be generated by code and should be reproducible by re-running that code. Therefore, if you have a good back-up system for the original data and version control all your code using Git, then you have implicit version control for your data. As long as the original data is unchanged, changes to the datasets you generate can only happen through changes to the code tracked in Git. Git can also be used to restore an old version of the data: you simply restore the old version of the code and re-generate the data.

While this method is often an excellent option, it does not work when the original data is updated frequently (ongoing data collection or data streams) or when the code is not accessible (someone else is generating the data).

Version control using checksums/hashes

One way to keep track of changes made to data is to use checksums, hashes or data signatures. These three concepts all work slightly differently, but they all follow the same principle and serve the same purpose in the context of version control of data. The principle is that a dataset is boiled down to a short string of text or a number. That string or number can then be compared across datasets to test whether they are identical.

One very simplified way to explain how this works is the following example. In this hashing algorithm we start with a word instead of a dataset, but real-world hashing algorithms can handle both. Start with a word and take the position in the alphabet of each letter, sum those numbers, and then add the digits of that sum, repeating until a single digit remains. So for "cat" we would get 3+1+20=24 (c=3, a=1, t=20), and in the next step 24 is turned into 6 (2+4=6). So the hash signature for "cat" would be 6. "Horse" would be 8+15+18+19+5=65, then 6+5=11 and 1+1=2. So the hash signature would be 2. No matter how big the data is (how long the word is in this simple example), the hash signature always has the same size and format, and we can quickly test whether two signatures are the same. The main problem with this very simplified hash algorithm is that there are only nine possible signatures (the digits 1 to 9), so many words share the same signature. However, real-world hash algorithms like the checksum command in Stata have signatures in a format like "2694850408", which allows for 10^10 possible signatures.
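The toy algorithm described above can be sketched in a few lines of Python. This is only an illustration of the example in the text; the function name `toy_hash` is made up here, and ignoring anything that is not a letter from a to z is an assumption added for the sketch.

```python
def toy_hash(word):
    """Toy hash: sum the alphabet positions of the letters, then add digits until one digit remains."""
    # Map each letter to its position in the alphabet (a=1, ..., z=26).
    # Assumption: characters other than a-z are ignored.
    total = sum(ord(ch) - ord("a") + 1 for ch in word.lower() if "a" <= ch <= "z")
    # Repeatedly add the digits of the running total until a single digit is left.
    while total >= 10:
        total = sum(int(digit) for digit in str(total))
    return total


print(toy_hash("cat"))    # 3+1+20=24, then 2+4=6
print(toy_hash("Horse"))  # 8+15+18+19+5=65, then 6+5=11, then 1+1=2
print(toy_hash("act"))    # same letters as "cat", so the same signature: 6
```

The last call shows the weakness mentioned in the text: "cat" and "act" are different words but get the same signature.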

With 10^10 possible signatures there is still a chance that two different datasets have the same checksum. This is called a "collision". However, checksums are implemented so that two similar datasets are very unlikely to collide or even to have similar checksums, which makes the risk of a collision between two versions of the same dataset extremely low. Other algorithms are implemented so that there are many more possible signatures, but then the hash signature gets longer. Stata also has the command datasignature, whose signature combines a hash value with some basic information about the dataset, such as the number of observations. While this provides some useful human-readable information, it can only be used on Stata .dta files.
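As an illustration of an algorithm with many more possible signatures, the sketch below hashes a file with SHA-256 from Python's standard `hashlib` module. SHA-256 is chosen here only as an example and is not the algorithm behind Stata's checksum command; the helper name `file_sha256` and the file names are hypothetical. The signature is 64 hexadecimal characters long, which shows the trade-off between the number of possible signatures and the length of the signature.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hash of a file as a 64-character hexadecimal string."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so that large data files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Two hypothetical data files that differ in a single value.
with tempfile.TemporaryDirectory() as folder:
    first = Path(folder) / "survey_v1.csv"
    second = Path(folder) / "survey_v2.csv"
    first.write_text("id,age\n1,34\n2,51\n")
    second.write_text("id,age\n1,34\n2,52\n")
    print(file_sha256(first))
    print(file_sha256(second))
    print(file_sha256(first) == file_sha256(second))  # False: the contents differ
```

Because the hash is computed from the raw bytes of the file, this approach works for any file format, but unlike datasignature it carries no human-readable information about the dataset.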

The drawback of all these methods is that they say nothing about how the dataset was created, so there is no way to recreate a dataset based only on its checksum or hash signature. If you have the checksum of a dataset used in the past but not the corresponding version of the dataset, then all you can do is say whether the current version of the dataset is identical to it. You cannot say what differs if they are different. If you used the Stata command datasignature you can get some clues about what differs, such as the number of observations or the number of variables, but you will still not be able to recreate the old version of the data.

This method is a good fit when you want a quick way to test that a dataset has not changed, and the details of what has changed do not matter or you have another acceptable, perhaps manual, way to find out what those details are. This can be the case if you are accessing someone else's dataset and want to know whether that person makes changes to it. Another use is to check which datasets, if any, change when the code is updated. Sometimes we want to make changes to the code that should not change the dataset it produces, and then this method is a great way to verify that.
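The last use case, checking which datasets change when the code is updated, could look roughly like the sketch below. It is a minimal illustration under several assumptions: the function names `snapshot` and `changed_files` and the file names are invented for this example, SHA-256 is used as the hash, and each file is read into memory in one go, which may not be practical for very large files.

```python
import hashlib
import tempfile
from pathlib import Path


def snapshot(folder):
    """Map the relative path of each file in the folder to the SHA-256 hash of its contents."""
    root = Path(folder)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(before, after):
    """Return the paths that were added, removed or modified between two snapshots."""
    return sorted(
        name for name in before.keys() | after.keys()
        if before.get(name) != after.get(name)
    )


with tempfile.TemporaryDirectory() as folder:
    data = Path(folder)
    (data / "households.csv").write_text("id,size\n1,4\n")
    (data / "villages.csv").write_text("id,name\n1,A\n")
    before = snapshot(data)
    # Stand-in for re-running the updated code, which here alters one file.
    (data / "households.csv").write_text("id,size\n1,5\n")
    after = snapshot(data)
    print(changed_files(before, after))  # ['households.csv']
```

An empty list means that no dataset changed. As noted in the text, a non-empty list only tells you which files differ, not what differs inside them.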

Data storage types

There are different types of data storage options, and which one is best for each use case typically depends on how big the data is, whether there is confidential information in the data and how the data will be used. Below are some different storage types with their pros and cons. The type of data considered in this article is data files and not, for example, data stored in databases.

File sync services

Common examples of file sync services are Dropbox, OneDrive and Box, among several others. This storage type is suitable when the data files required for a project are small enough to be stored on a regular laptop or desktop computer. In file sync services a copy of each file exists locally on all computers that have the folder synced. This means that accessing the file is quick and that there is never a problem with multiple people accessing the file at the same time, as each person has their own copy of the file.

However, because all files are saved locally, this also means that if one person works on many projects there is a risk that there is not enough space for all files for those projects on that person's computer. Many file syncing services have options to mitigate this. For example, you can choose not to sync the parts of the project folder that you know you do not need access to. Some sync services also allow you to have non-synced files ready to be downloaded on demand (sometimes referred to as smart sync). However, this is not great for data work, as data files are usually too big to be downloaded on demand at the moment your code tries to access them, which can cause your programming software to crash when trying to read the file.

Another concern with file syncing services is privacy when sharing confidential data. There are enterprise subscriptions that can be installed on your own servers. This makes sharing data much more secure, as the data is never stored on or transferred through servers that belong to the company that offers the syncing service. If you do not have an enterprise subscription, or are not sure if you do, you should always encrypt the data before saving it in a synced folder. DIME Analytics has published guidelines to encrypt files with VeraCrypt (https://github.com/worldbank/dime-standards/blob/master/dime-research-standards/pillar-4-data-security/data-security-resources/veracrypt-guidelines.md), which is secure and free software often used for this purpose.

Network drives

Cloud storage

Storage for different size of data

A useful rule of thumb when deciding which type of storage is suitable given the size of your data is whether the total size of all data in your project fits in the space usually available on a typical user's laptop. If the data is small enough to fit on a regular laptop, then synced storage (such as World Bank OneDrive or Dropbox) becomes an option. In synced storage each user has their own copy saved on their computer, and the sync software makes sure that all users have identical files. This is different from storing data on network drives or cloud storage, where each user works with the same file, not just an identical file. When each user has their own copy of the file, access to that file tends to be faster.
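The rule of thumb above can be checked with a short script that adds up the size of all files in a project folder. This is a rough sketch: the function names are invented for this example, and the 100 GiB budget is purely an illustrative assumption, since the space available on a typical laptop varies between teams and over time.

```python
import tempfile
from pathlib import Path

# Illustrative assumption only: space a typical laptop can spare for one project.
ASSUMED_LAPTOP_BUDGET_BYTES = 100 * 1024**3  # 100 GiB


def folder_size_bytes(folder):
    """Return the total size in bytes of all files in the folder and its subfolders."""
    return sum(path.stat().st_size for path in Path(folder).rglob("*") if path.is_file())


def fits_on_laptop(folder, budget_bytes=ASSUMED_LAPTOP_BUDGET_BYTES):
    """Return True if all files in the folder fit within the assumed laptop budget."""
    return folder_size_bytes(folder) <= budget_bytes


# Small demonstration with a temporary stand-in for a project folder.
with tempfile.TemporaryDirectory() as folder:
    (Path(folder) / "raw.csv").write_bytes(b"0" * 2048)
    (Path(folder) / "clean.csv").write_bytes(b"0" * 1024)
    print(folder_size_bytes(folder))  # 3072
    print(fits_on_laptop(folder))     # True
```

In practice you would point `folder_size_bytes` at the real project folder and pick a budget that reflects the laptops your team actually uses.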

If the total size of all data in the project folder is too big to fit on a regular laptop, but the size of the files relevant to each user is not too big, then synced storage can still be used if the syncing service allows you to sync only specific folders or files in your project folder.

If the data in the project folder is too big to be synced to a typical laptop, then the data can be stored on a network drive or in cloud storage. However, in these solutions there is no copy of the file stored on each user's hard drive, and depending on the exact service used and the connectivity speed, access can be slow. And even if network or cloud storage has next to unlimited storage capacity

Back-up protocols

Back-up storage types

You should never use the same files or storage solution for the back-up copy of your original or raw data as you use for your day-to-day work. While some storage types could be used for both day-to-day work and back-up, this section covers the former use case. Backing up of data is covered in a later section below. There are multiple types of storage, and which type is best for you depends on the following factors: the size of your data, where it will be used and cost.

Data retention
