Data Storage


This article discusses different aspects of data storage, such as types of storage, data backup and data retention. While Data Security is a very important topic related to data storage, it is not covered here because it has a dedicated article.


Read First

  • Data should be version controlled. Ideally, the version control system should be able both to detect changes to the data and to revert to old versions.
  • The right data storage depends on how big the data is, whether the data contains confidential information and how the data will be used.
  • All original data (the data sets collected or received by the team) should be backed up. All derivative data (data sets created from original data) should be created by reproducible code, and that code should be version controlled in Git. Derivative data then does not need to be backed up, as it can be re-created.

Version control for data

Version control serves two purposes. The first is to keep track of changes and modifications to files, and the second is to make it possible to revert a file to an old version. For code there is an industry standard that efficiently fulfills both purposes, and that system is called Git. Git can be used both to keep track of changes made to code and to restore older versions of code. Unfortunately, there is no system that does both of those things as elegantly for data over a longer period of time. There is therefore no industry-wide, one-size-fits-all solution for version control of data. This article instead suggests different methods, and the project team should pick the method that best fits their exact use case.

Version control code that generates data

All derivative datasets (datasets that the project team creates from original data they received or collected) should be generated by reproducible code so that they can be re-created by re-running that code. Therefore, if the original data is properly backed up and all code is version controlled using Git, then the derivative datasets are already implicitly version controlled, since changes to derivative datasets can only happen through changes to the code tracked in Git. Git can also be used to restore old versions of the code, which in turn can be used to restore old versions of the derivative data.

While this method is often an excellent option for version controlling derivative data, it does not work when the original data is updated frequently (ongoing data collection, or data that is continuously received) or when the code is not accessible (someone else is generating the data). In these cases, derivative data should still be generated using reproducible code tracked in Git, but those derivative datasets are no longer implicitly version controlled as a result.
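As a minimal sketch of what "reproducible code" means in this section, the Python function below builds a small derivative dataset from original data in a deterministic way, so re-running the same version of the code on the same original data always gives the same result. The function name build_derivative, the column names and the example values are hypothetical and only serve as illustration.

```python
import csv
import io


def build_derivative(original_csv):
    """Create a derivative dataset (total income per household) from original data."""
    totals = {}
    for row in csv.DictReader(io.StringIO(original_csv)):
        household = row["household"]
        totals[household] = totals.get(household, 0) + int(row["income"])

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["household", "total_income"])
    # Sort the rows so the output does not depend on the input row order.
    for household in sorted(totals):
        writer.writerow([household, totals[household]])
    return out.getvalue()


if __name__ == "__main__":
    original = "household,income\nB,10\nA,5\nB,7\n"
    print(build_derivative(original))
```

Because the function has no hidden inputs (no random numbers, timestamps or manual edits), checking out an older version of this code from Git and re-running it on the backed-up original data would re-create the older version of the derivative dataset.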

Version control using checksums/hashes

One way to keep track of changes made to data is to use checksums, hashes or data signatures. These three concepts all work slightly differently, but they follow the same principle and serve the same purpose in the context of version control of data. The principle is that a dataset is boiled down to a short string of text or a number. The same data always boils down to the same string or number, but even a small difference in the data boils down to a different string or number. That string or number can then quickly and effortlessly be compared across datasets to test whether they are identical.
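As a minimal sketch of this principle, the Python snippet below computes a SHA-256 hash of a file using only the standard library. The function name file_checksum and the example file contents are illustrative assumptions and do not come from any tool mentioned in this article.

```python
import hashlib
import tempfile
from pathlib import Path


def file_checksum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large data files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def demo():
    """Hash two identical files and one file that differs by one character."""
    with tempfile.TemporaryDirectory() as folder:
        first = Path(folder) / "first.csv"
        copy = Path(folder) / "copy.csv"
        edited = Path(folder) / "edited.csv"
        first.write_bytes(b"id,income\n1,100\n2,250\n")
        copy.write_bytes(b"id,income\n1,100\n2,250\n")
        edited.write_bytes(b"id,income\n1,100\n2,251\n")  # one digit differs
        return file_checksum(first), file_checksum(copy), file_checksum(edited)


if __name__ == "__main__":
    same_a, same_b, different = demo()
    print(same_a == same_b)     # identical data gives identical hashes
    print(same_a == different)  # a one-character change gives a different hash
```

Note that this sketch hashes the bytes of the file, so saving the same data in a way that changes the bytes (for example with different line endings) also changes the hash.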

Most checksums or hashes boil down to a string or a number that has no interpretable meaning to humans. It is only meant to answer the yes/no question of whether two datasets are identical. It is usually impossible to tell from a hash or a checksum whether the files are almost identical or very different. One exception is the Stata command datasignature, which generates a signature that combines a hash value with some basic information about the dataset, such as the number of observations. While this provides some useful human-readable information, it can only be used on Stata .dta files.
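The idea of combining a hash with some human-readable information can be sketched for other file types as well. The Python function below is only an illustration of that idea for CSV text: it is not the algorithm or the output format of Stata's datasignature, and the name describe_csv and the signature layout (rows:columns:hash prefix) are assumptions made for this example.

```python
import csv
import hashlib
import io


def describe_csv(text):
    """Return 'rows:columns:hash' for CSV text that starts with a header row."""
    rows = list(csv.reader(io.StringIO(text)))
    header, records = rows[0], rows[1:]
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    # Keep only a short prefix of the hash to make the signature easy to read.
    return f"{len(records)}:{len(header)}:{digest[:12]}"


if __name__ == "__main__":
    old = "id,income\n1,100\n2,250\n"
    new = "id,income\n1,100\n2,250\n3,90\n"
    print(describe_csv(old))
    print(describe_csv(new))
```

Comparing the two signatures shows not only that the datasets differ, but also that the number of rows changed while the number of columns did not.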

The drawback of all these methods is that they say nothing about how the dataset was created, so there is no way to re-create a dataset based only on the checksum or hash signature. If you have the checksum of a dataset used in the past but not the corresponding version of the dataset, then all you can say is whether that dataset was identical to the current version of the same dataset. If they are different, you cannot say what differs. If you used the Stata command datasignature you can get some clues about what differs, such as the number of observations or the number of variables, but you will still not be able to re-create the old version of the data.

However, this method is a good fit when you want a quick way to test that a dataset has not changed, and the details of what has changed either do not matter or can be found out in another way (perhaps manually). One example is when you are accessing someone else's dataset and you want to know if that person makes changes to it. Another use is to check which datasets, if any, are updated when the code is updated. Sometimes we want to make changes to the code that should not change the dataset it generates, and then this method is a great way to verify that.
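One possible way to check which datasets changed after a code update is to record a checksum for every file in a data folder before the update and compare it with the checksums after re-running the code. The Python sketch below assumes this workflow. The function names, the file names and the use of a JSON manifest are hypothetical choices for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def checksum(path):
    """Return the SHA-256 hex digest of a (small) file."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def build_manifest(folder):
    """Map the relative path of each file in `folder` to its checksum."""
    root = Path(folder)
    return {
        path.relative_to(root).as_posix(): checksum(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(old, new):
    """List files that were added, removed or modified between two manifests."""
    names = old.keys() | new.keys()
    return sorted(name for name in names if old.get(name) != new.get(name))


def demo():
    with tempfile.TemporaryDirectory() as folder:
        data = Path(folder)
        (data / "households.csv").write_text("id,size\n1,4\n")
        (data / "villages.csv").write_text("id,name\n1,A\n")
        # The manifest is small text, so it could be stored next to the code.
        stored = json.dumps(build_manifest(data), indent=2)

        # Stand-in for re-running updated code that rewrites both outputs.
        (data / "households.csv").write_text("id,size\n1,4\n")  # same content
        (data / "villages.csv").write_text("id,name\n1,B\n")    # new content
        return changed_files(json.loads(stored), build_manifest(data))


if __name__ == "__main__":
    print(demo())  # only the file whose content changed is listed
```

If the code change was meant to leave all datasets untouched, an empty list confirms that. A non-empty list shows which files to inspect, but not what changed inside them.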

Version control in sync software

File syncing software often has version control features that allow you both to detect changes made to data files and to restore old versions of those files. However, the way this is done in these systems is so storage inefficient that you can only restore file versions that are less than a few months old. If your project is completed within that time frame, then this is a great solution. Typically, however, a project runs for much longer than that.

Data storage types

There are different types of data storage options, and which one is best for each use case typically depends on how big the data is, whether the data contains confidential information and how the data will be used. Below are some different storage types with their pros and cons. The type of data considered in this article is data files, and not, for example, data stored in databases.

File sync services

Common examples of file sync services are DropBox, OneDrive and Box, among several others. This storage type is suitable when the data files required for a project are small enough to be stored on a regular laptop or desktop computer. With file sync services, a copy of each file exists locally on every computer that has the folder synced. This means that accessing the file is quick, and that multiple people can access the file at the same time without problems, as each of them has their own copy of the file.

However, because all files are saved locally, there is a risk that a person who works on many projects does not have space for all the files of those projects on their computer. Many file syncing services have options to mitigate this. For example, you can choose not to sync the parts of the project folder that you know you do not need access to. Some sync services also allow you to have non-synced files ready to be downloaded on demand (sometimes referred to as smart sync). However, this is not great for data work, as data files are usually too big to be downloaded on demand at the moment your code tries to access them, which can cause your programming software to crash when it tries to read the file.

Another concern with file syncing services is privacy when sharing confidential data. There are enterprise subscriptions that can be installed on your own servers. This makes sharing data much more secure, as the data is never stored on or transferred through servers that belong to the company offering the syncing service. If you do not have an enterprise subscription, or are not sure whether you do, you should always encrypt the data before saving it in a synced folder. DIME Analytics has published guidelines for encrypting files with Veracrypt, which is a secure and free software often used for this purpose.

Cloud storage

Cloud storage comes in many different kinds, and it is outside the scope of this wiki article to describe them all. This article only covers general points about cloud storage. Since it only covers data stored in files, cloud storage here typically means S3 storage on AWS or Blob storage on Azure, but is not limited to those.

The benefit of cloud storage is that you do not need to buy or upgrade any hardware. To get started or to increase your capacity you only need to click a button or it might even be upgraded automatically. However, you will be charged more the more you use.

Cloud storage might not be the best option when the data will frequently be accessed from laptops and desktops. Even on a fast internet connection, the download time will be too long for most use cases if the data is downloaded each time a script is run. Instead, cloud storage is best used when it is frequently accessed from another cloud resource. However, make sure that the cloud storage is in the same physical location as the cloud resource that will use it. A single cloud provider has data centers across the world, and you might have issues if your storage is on the other side of the world from the resource that will use it. There are ways to address these issues, but they are outside the scope of this article. The main takeaway is that cloud storage is only a great solution if the resources that need access to it have a very quick connection to it.

At the same time, this does not mean that cloud storage should never be accessed from a laptop or a desktop. Cloud storage can be a great place to store data that is infrequently accessed by users on regular computers. There will still be a delay, but that can be acceptable as the data is only infrequently accessed.

Network drive storage

Network drive storage is in many ways similar to the types of cloud storage discussed above. The major difference in this context is that a network drive can be located in the same place as a user accessing it from a laptop or desktop, which makes it viable for such a user to access it frequently. This is a great solution when a regular laptop or desktop frequently needs access to the data and the combined size of all the data files is too big for them all to be stored on the laptop or desktop.

The problem with this solution is that you need to purchase hardware and run your own network drive. This can be complicated if you do not work at a large organization with an IT team that can do this for you. Another drawback is that if you work at an organization that is spread out geographically, you will have the same issues with slow access speed as with cloud storage.

Back-up protocols

Use

Back-up storage types

You should never use the same file or storage solution for the back-up copy of your original or raw data as you use for your day-to-day work. While some storage types could be used for both day-to-day work and back-up, this section covers the former use case. Backing up of data is covered in a later section below. There are multiple types of storage, and which type is best for you depends on the following factors: the size of your data, where it will be used and cost.

Data retention
