Difference between revisions of "Data Storage"
Kbjarkefur (talk | contribs) |
Kbjarkefur (talk | contribs) |
||
(40 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
This article discusses different aspects of data storage (such as different types of storage, data back up and data retention). It is important to make sure you have appropriate data storage solutions before you start receiving data. You should plan your data storage for the full life-cycle of a project and not just for your immediate needs. Changing data storage solution mid-project can be costly and break the code already written for the project making earlier research outputs non-reproducible. | |||
== Read First == | |||
* Data should be version controlled. Ideally the version control system used should both be able to detect changes to the data and provide a possibility to revert to old versions | |||
* Data storage depends on how big the data is, if there is confidential information in the data and how the data will be used. | |||
* All original data (the data sets collected or received by the team) should be backed-up. All derivative data (data sets created from original data) should be created by reproducible code and that code should be version controlled in Git. Then derivative data does not need to be backed-up as it can be re-created. | |||
== Version control for data == | |||
Version control is used for two purposes. The first is to keep track of changes and modifications to files and the second one is to provide the possibility to revert a file an old version. For code there is an industry standard that efficiently fulfills these two purposes and that system is called Git. Git can be used to both keep track of changes made to code as well as restore older versions of code. Unfortunately there is no system that as elegantly does both of those two things for data over a longer period of time. There is therefore no industry wide one-size-fits-all solutions for version control of data. This article therefore suggests different methods and the project team should pick the method best for their exact use case. | |||
=== Version control code that generates data === | |||
All derivative datasets (datasets that the project team creates from original data they received or collected) should be generated by reproducible code so that it can be re-created by re-running that code. Therefore, if the original data is properly [[Data_Storage#Back-up_protocols | backed-up]] and all code is version controlled using Git, then the derivative datasets are already implicitly version controlled since changes to derivative datasets can only happen through changes to the code tracked in Git. Git can also be used to restore old version of code that in turn can be used to restore old versions of the derivative data. | |||
While this method often is an excellent option to version control derivative data, it does not work when the original data is updated frequently (ongoing data collection or when data is continuously received) or when the code is not accessible (someone else is generating the data). Derivative data should, in these cases, still be generated using reproducible code tracked in Git, but the effect is no longer that those derivative datasets are implicitly properly version controlled. | |||
=== Version control using checksums/hashes === | |||
One way to keep track of changes done to data is to use ''checksums'', ''hashes'' or ''data signatures''. These three concepts all work slightly differently but all follows the same principle and serves the same purpose in the context of version control of data. The principle they all follow is that a datasets is boiled down to a short string of text or a number. The same data is always boiled down to the same string or number, but even small differences to the data boils down to a different string or number. That string or number can then quickly and effortlessly be compared across datasets to test if they are identical or not. | |||
Most checksums or hashes boils down to a string or a number that has no interpretable meaning to humans, it is just meant to serve the purpose of answering the yes/no question if the two datasets are identical or not. It is usually impossible to tell from a has or a checksum if the files are almost identical or very different. One exception to this is the Stata command <code>datasignature</code> that generates a signature that combines a hash value with some basic data on the dataset such as number of observations. While this provides some useful human readable information, it can only be used on Stata <code>.dta</code> files. | |||
The drawback of all these methods is that it says nothing about how the dataset was created. So there is no way to recreate a dataset based only on the checksum or hash signature. If you have the checksum of a dataset used in the past but not the corresponding version of the dataset, then all you can do is to say that dataset was identical to the current version of the same dataset or not. You cannot say what differs if they are different. If you used the Stata command <code>datasignature</code> you can get some clues of what differs, such as number of observations or number of variables, but you will still not be able to recreate the old version of the data. | |||
However, this method is a good fit for when you want to have a quick way to test that the dataset has not changed, and details of what has changed either does not matter or you have another way to find out (perhaps manually) what those differences are. Examples of this can be if you are accessing someone else's dataset and you want to know if that person makes changes to that dataset. Another way this can be used is to check which if any datasets are updated when the code is updated. Sometimes we want to make changes to the code that should not change the dataset it generates, and then this method is a great way to verify that. | |||
== | === Version control in sync software === | ||
[[Data_Storage#File_sync_services | File syncing software]] often have version control systems that allows you to both detect changes made to data files and allows you to restore old versions of those files. However, the way they it is done in these systems is so storage in-efficient that you can only restore files version that are less than a few months old. If your project is completed with that time frame, then this is a great solution, however, typically a project runs for much longer than that. | |||
== Data storage types == | |||
There are different types of data storage option and which one that is best for each use case typically depends on how big the data is, if there is confidential information in the data and how the data will be used. Below are some different storage types with their pros and cons. The type of data that is considered in this article are data files, and not, for example, data stored in data bases. | |||
=== File sync services === | |||
Common examples of file sync services are DropBox, OneDrive, Box among several others. This storage type is suitable when the data files required for a project is not bigger than that they can be stored on a regular laptop or desktop computer. In file sync services a copy of the file exists locally on all computers that has the folder synced. This means that accessing the file is quick and that there is never a problem for multiple people to access the file at the same time as they have their own version of the file. | |||
However, because all files are saved locally this also means that if one person works on many projects there is a risk that there is not space for all files for those projects on that persons computer. Many file syncing services have options to mitigate this. For example you can chose to not sync the parts of the project folder that you know you do not need access to. Some sync services also allow you to have non-synced files ready to be downloaded on demand (sometimes referred to as ''smart sync''). However, this is not great for data work as data files are usually to big to be instantly downloaded on demand when your code trying to access them leading to your code to crash when trying to read the file. | |||
Another concern with file syncing services are privacy concerns when sharing confidential data. There are enterprise subscriptions that can be installed on your own servers. This makes sharing data much more secure as the data is never stored or transferred on the servers that belong to the company that offers the syncing service. If you do not have an enterprise subscription or are not sure if you do, you should always [[encryption| encrypt ]] any non-public data before saving it in a synced folder. DIME Analytics have published [https://github.com/worldbank/dime-standards/blob/master/dime-research-standards/pillar-4-data-security/data-security-resources/veracrypt-guidelines.md guidelines to encrypt files with Veracrypt] which is a secure and free software often used for this purpose. | |||
=== Cloud storage === | |||
Cloud storage comes in many different kinds and it is out of the scope of this wiki article to describe them all. This article will only cover general points about cloud storage. This article only covers data stored in files, so cloud storage here typically means [https://aws.amazon.com/s3/ S3 storage on AWS] or [https://docs.microsoft.com/en-us/azure/storage/blobs/ Blob storage on Azure], but are not limited to those. | |||
The benefit of cloud storage is that you do not need to buy or upgrade any hardware. To get started or to increase your capacity you only need to click a button or it might even be upgraded automatically. However, you will be charged more the more you use. | |||
Cloud storage might not be the best option when the data will frequently be accessed from laptops and desktops. The download time will be too long even on a fast internet connection for most use cases if the data is downloaded each time the script is run. Instead, cloud storage is best used when frequently accessed from another cloud resource. | |||
However, make sure that the other cloud storage is in the same physical location as the cloud resource that will use it. The same cloud provider have data centers across the world and you might have issues if your storage is on the other side of the world from the resource that will use it. There are ways to address these issues but they are outside of the scope of this article. The main takeaway is that cloud storage is only a great solution if the resources that needs access to that have a very quick connection to them. | |||
At the same time, this does not mean that cloud storage should never be accessed from a laptop or a desktop. Cloud storage can be a great place to store data that is infrequently accessed by users on regular computers. There will still be a delay, but that can be acceptable as the data is only infrequently accessed. | |||
=== Network drive storage === | |||
Network drive storage is in many way similar to the types of cloud storage that we discussed above. The major difference for the context discussed here is that they can be located in the same location as a user accessing them from a laptop/desktop making it a viable option for such a user to access it frequently. This is a great solution when a regular laptop/desktop frequently needs access to the data and the combined size of all of the data files are too big to all be stored on the laptop/desktop. | |||
The problem with this solution is that you need to purchase hardware and run your own network drive. This can be complicated if you do not work at a large organization where there is an IT team that can do this for you. Another drawback is that if you work on an organization that is spread out geographically then you will have the same issues as with cloud storage with slow access speed. | |||
== Back-up protocols == | |||
The DIME Analytics recommendation for data back up starts by classifying all data into either original data or derivative data. Original data is data that was received or collected by the project team, and derivative data is data created from the original data using reproducible code. Original data must be backed up, but derivative data does not have to be backed up as long as the code used to create it is backed up (typically through GitHub or another version control system for code). If the original data and the code is backed up, then the derivative data is implicitly backed up as well. By this method, backing up every single derivative data file which would be storage space inefficient is avoided. | |||
There are many good alternative data back-up protocols but a great place to start for anyone who has not adapted a back-up protocol for their projects yet is the [https://www.backblaze.com/blog/the-3-2-1-backup-strategy 3-2-1 strategy]. This strategy means that the project has 3 copies of the data, where 2 are easily accessible and 1 is stored on an remote location. This means that one copy, called the first copy, is the only copy that the code ever should read the data from. This copy can be shared in a file syncing service. If the original data file has [[Personally_Identifiable_Information_(PII) | PII information]] (which they often do) the file needs to be encrypted at all times. A de-identified version of this data that does not need to be encrypted can be created for easier day-to-day use. | |||
If the first copy of the data is accidentally deleted or modified then the team will access the second copy and make a new first copy of the second copy. The second copy could, for example, be stored on a external drive. This external drive can be stored in the office of someone in the project team. If the data has PII data it must be encrypted unless the external drive is stored in a locked safe. When the team has access to a secure physical safe, then it is a good idea to keep the data unencrypted on the drive that is stored locked in the safe. This is to reduce the risk of loosing data due to lost encryption keys. If an external hard drive is not an option, then the second copy can be stored in any other file storage that is not the same service as the first copy. | |||
Finally, the third copy should be stored off-site. In today's world this usually means somewhere in the cloud. The second copy can also be stored in the cloud, but the difference from the third copy is that the second copy may be stored locally. The third copy may never be stored locally. This is meant to make sure that if theft, fire, disaster etc. or anything similar happens, then the third copy should be stored in a way there is no way it is affected by the same event. The other important consideration for the third copy storage is that it is a suitable long term storage. This means that this storage should not be affected by a single person leaving the team or if it does there should be a clear protocol for how access is maintained when that person is leaving the team. If you are using a sync service for the third copy it should ideally be a different sync service than for the first copy. Regardless which sync service you are using, if you are using a sync service you should never have the third copy synced to any computer. This file should live in an online-only folder. Alternatives to sync services for the third copy would be S3 on AWS or Blob storage on Azure. If you are using such services, see if you can make these files read-only so there is no chance that they are accidentally deleted or modified. | |||
== Data retention == | |||
The project team should have a written document that clearly regulates how long identified data should be kept by the research team before it is permanently destroyed. It is a bad, but unfortunately common, practice that research teams keeps identified datasets without any timeline on when to delete them. Note that it is not a bad practice to keep or publish de-identified datasets, but best data privacy practice is to have a clear timeline for when datasets should be deleted. The document regulating this timeline is called a "data retention policy". There is a tradeoff between the length of time that the data will be retained and the right to privacy of the human subject participating in your research. Therefore, a data retention policy should not allow the research team to keep the data longer than what the research team can justify by linking it to a research utility. Keeping identified data just for the sake that it might be good to have one day is bad data privacy practice. | |||
Here is a [https://kb.mit.edu/confluence/display/istcontrib/Removing+Sensitive+Data good guide] for how to make sure that data is actually deleted: | |||
== Back to Parent == | == Back to Parent == | ||
Line 23: | Line 79: | ||
== Additional Resources == | == Additional Resources == | ||
* | * Pillar 4 in the [https://github.com/worldbank/dime-standards DIME Research Standards] covers data security | ||
[[Category: | [[Category: Data Management]] |
Latest revision as of 14:37, 21 June 2021
This article discusses different aspects of data storage (such as different types of storage, data back up and data retention). It is important to make sure you have appropriate data storage solutions before you start receiving data. You should plan your data storage for the full life-cycle of a project and not just for your immediate needs. Changing data storage solution mid-project can be costly and break the code already written for the project making earlier research outputs non-reproducible.
Read First
- Data should be version controlled. Ideally the version control system used should both be able to detect changes to the data and provide a possibility to revert to old versions
- Data storage depends on how big the data is, if there is confidential information in the data and how the data will be used.
- All original data (the data sets collected or received by the team) should be backed-up. All derivative data (data sets created from original data) should be created by reproducible code and that code should be version controlled in Git. Then derivative data does not need to be backed-up as it can be re-created.
Version control for data
Version control is used for two purposes. The first is to keep track of changes and modifications to files and the second one is to provide the possibility to revert a file an old version. For code there is an industry standard that efficiently fulfills these two purposes and that system is called Git. Git can be used to both keep track of changes made to code as well as restore older versions of code. Unfortunately there is no system that as elegantly does both of those two things for data over a longer period of time. There is therefore no industry wide one-size-fits-all solutions for version control of data. This article therefore suggests different methods and the project team should pick the method best for their exact use case.
Version control code that generates data
All derivative datasets (datasets that the project team creates from original data they received or collected) should be generated by reproducible code so that it can be re-created by re-running that code. Therefore, if the original data is properly backed-up and all code is version controlled using Git, then the derivative datasets are already implicitly version controlled since changes to derivative datasets can only happen through changes to the code tracked in Git. Git can also be used to restore old version of code that in turn can be used to restore old versions of the derivative data.
While this method often is an excellent option to version control derivative data, it does not work when the original data is updated frequently (ongoing data collection or when data is continuously received) or when the code is not accessible (someone else is generating the data). Derivative data should, in these cases, still be generated using reproducible code tracked in Git, but the effect is no longer that those derivative datasets are implicitly properly version controlled.
Version control using checksums/hashes
One way to keep track of changes done to data is to use checksums, hashes or data signatures. These three concepts all work slightly differently but all follows the same principle and serves the same purpose in the context of version control of data. The principle they all follow is that a datasets is boiled down to a short string of text or a number. The same data is always boiled down to the same string or number, but even small differences to the data boils down to a different string or number. That string or number can then quickly and effortlessly be compared across datasets to test if they are identical or not.
Most checksums or hashes boils down to a string or a number that has no interpretable meaning to humans, it is just meant to serve the purpose of answering the yes/no question if the two datasets are identical or not. It is usually impossible to tell from a has or a checksum if the files are almost identical or very different. One exception to this is the Stata command datasignature
that generates a signature that combines a hash value with some basic data on the dataset such as number of observations. While this provides some useful human readable information, it can only be used on Stata .dta
files.
The drawback of all these methods is that it says nothing about how the dataset was created. So there is no way to recreate a dataset based only on the checksum or hash signature. If you have the checksum of a dataset used in the past but not the corresponding version of the dataset, then all you can do is to say that dataset was identical to the current version of the same dataset or not. You cannot say what differs if they are different. If you used the Stata command datasignature
you can get some clues of what differs, such as number of observations or number of variables, but you will still not be able to recreate the old version of the data.
However, this method is a good fit for when you want to have a quick way to test that the dataset has not changed, and details of what has changed either does not matter or you have another way to find out (perhaps manually) what those differences are. Examples of this can be if you are accessing someone else's dataset and you want to know if that person makes changes to that dataset. Another way this can be used is to check which if any datasets are updated when the code is updated. Sometimes we want to make changes to the code that should not change the dataset it generates, and then this method is a great way to verify that.
Version control in sync software
File syncing software often have version control systems that allows you to both detect changes made to data files and allows you to restore old versions of those files. However, the way they it is done in these systems is so storage in-efficient that you can only restore files version that are less than a few months old. If your project is completed with that time frame, then this is a great solution, however, typically a project runs for much longer than that.
Data storage types
There are different types of data storage option and which one that is best for each use case typically depends on how big the data is, if there is confidential information in the data and how the data will be used. Below are some different storage types with their pros and cons. The type of data that is considered in this article are data files, and not, for example, data stored in data bases.
File sync services
Common examples of file sync services are DropBox, OneDrive, Box among several others. This storage type is suitable when the data files required for a project is not bigger than that they can be stored on a regular laptop or desktop computer. In file sync services a copy of the file exists locally on all computers that has the folder synced. This means that accessing the file is quick and that there is never a problem for multiple people to access the file at the same time as they have their own version of the file.
However, because all files are saved locally this also means that if one person works on many projects there is a risk that there is not space for all files for those projects on that persons computer. Many file syncing services have options to mitigate this. For example you can chose to not sync the parts of the project folder that you know you do not need access to. Some sync services also allow you to have non-synced files ready to be downloaded on demand (sometimes referred to as smart sync). However, this is not great for data work as data files are usually to big to be instantly downloaded on demand when your code trying to access them leading to your code to crash when trying to read the file.
Another concern with file syncing services are privacy concerns when sharing confidential data. There are enterprise subscriptions that can be installed on your own servers. This makes sharing data much more secure as the data is never stored or transferred on the servers that belong to the company that offers the syncing service. If you do not have an enterprise subscription or are not sure if you do, you should always encrypt any non-public data before saving it in a synced folder. DIME Analytics have published guidelines to encrypt files with Veracrypt which is a secure and free software often used for this purpose.
Cloud storage
Cloud storage comes in many different kinds and it is out of the scope of this wiki article to describe them all. This article will only cover general points about cloud storage. This article only covers data stored in files, so cloud storage here typically means S3 storage on AWS or Blob storage on Azure, but are not limited to those.
The benefit of cloud storage is that you do not need to buy or upgrade any hardware. To get started or to increase your capacity you only need to click a button or it might even be upgraded automatically. However, you will be charged more the more you use.
Cloud storage might not be the best option when the data will frequently be accessed from laptops and desktops. The download time will be too long even on a fast internet connection for most use cases if the data is downloaded each time the script is run. Instead, cloud storage is best used when frequently accessed from another cloud resource.
However, make sure that the other cloud storage is in the same physical location as the cloud resource that will use it. The same cloud provider have data centers across the world and you might have issues if your storage is on the other side of the world from the resource that will use it. There are ways to address these issues but they are outside of the scope of this article. The main takeaway is that cloud storage is only a great solution if the resources that needs access to that have a very quick connection to them.
At the same time, this does not mean that cloud storage should never be accessed from a laptop or a desktop. Cloud storage can be a great place to store data that is infrequently accessed by users on regular computers. There will still be a delay, but that can be acceptable as the data is only infrequently accessed.
Network drive storage
Network drive storage is in many way similar to the types of cloud storage that we discussed above. The major difference for the context discussed here is that they can be located in the same location as a user accessing them from a laptop/desktop making it a viable option for such a user to access it frequently. This is a great solution when a regular laptop/desktop frequently needs access to the data and the combined size of all of the data files are too big to all be stored on the laptop/desktop.
The problem with this solution is that you need to purchase hardware and run your own network drive. This can be complicated if you do not work at a large organization where there is an IT team that can do this for you. Another drawback is that if you work on an organization that is spread out geographically then you will have the same issues as with cloud storage with slow access speed.
Back-up protocols
The DIME Analytics recommendation for data back up starts by classifying all data into either original data or derivative data. Original data is data that was received or collected by the project team, and derivative data is data created from the original data using reproducible code. Original data must be backed up, but derivative data does not have to be backed up as long as the code used to create it is backed up (typically through GitHub or another version control system for code). If the original data and the code is backed up, then the derivative data is implicitly backed up as well. By this method, backing up every single derivative data file which would be storage space inefficient is avoided.
There are many good alternative data back-up protocols but a great place to start for anyone who has not adapted a back-up protocol for their projects yet is the 3-2-1 strategy. This strategy means that the project has 3 copies of the data, where 2 are easily accessible and 1 is stored on an remote location. This means that one copy, called the first copy, is the only copy that the code ever should read the data from. This copy can be shared in a file syncing service. If the original data file has PII information (which they often do) the file needs to be encrypted at all times. A de-identified version of this data that does not need to be encrypted can be created for easier day-to-day use.
If the first copy of the data is accidentally deleted or modified then the team will access the second copy and make a new first copy of the second copy. The second copy could, for example, be stored on a external drive. This external drive can be stored in the office of someone in the project team. If the data has PII data it must be encrypted unless the external drive is stored in a locked safe. When the team has access to a secure physical safe, then it is a good idea to keep the data unencrypted on the drive that is stored locked in the safe. This is to reduce the risk of loosing data due to lost encryption keys. If an external hard drive is not an option, then the second copy can be stored in any other file storage that is not the same service as the first copy.
Finally, the third copy should be stored off-site. In today's world this usually means somewhere in the cloud. The second copy can also be stored in the cloud, but the difference from the third copy is that the second copy may be stored locally. The third copy may never be stored locally. This is meant to make sure that if theft, fire, disaster etc. or anything similar happens, then the third copy should be stored in a way there is no way it is affected by the same event. The other important consideration for the third copy storage is that it is a suitable long term storage. This means that this storage should not be affected by a single person leaving the team or if it does there should be a clear protocol for how access is maintained when that person is leaving the team. If you are using a sync service for the third copy it should ideally be a different sync service than for the first copy. Regardless which sync service you are using, if you are using a sync service you should never have the third copy synced to any computer. This file should live in an online-only folder. Alternatives to sync services for the third copy would be S3 on AWS or Blob storage on Azure. If you are using such services, see if you can make these files read-only so there is no chance that they are accidentally deleted or modified.
Data retention
The project team should have a written document that clearly regulates how long identified data should be kept by the research team before it is permanently destroyed. It is a bad, but unfortunately common, practice that research teams keeps identified datasets without any timeline on when to delete them. Note that it is not a bad practice to keep or publish de-identified datasets, but best data privacy practice is to have a clear timeline for when datasets should be deleted. The document regulating this timeline is called a "data retention policy". There is a tradeoff between the length of time that the data will be retained and the right to privacy of the human subject participating in your research. Therefore, a data retention policy should not allow the research team to keep the data longer than what the research team can justify by linking it to a research utility. Keeping identified data just for the sake that it might be good to have one day is bad data privacy practice.
Here is a good guide for how to make sure that data is actually deleted:
Back to Parent
This article is part of the topic *topic name, as listed on main page*
Additional Resources
- Pillar 4 in the DIME Research Standards covers data security