Data Security
This page outlines the steps in a typical research project and lists each topic within data security that a research team should consider at that point.
Read First
- The central topic in this conversation is encryption.
- Data must always be encrypted when being sent over the internet. This is very important, but fortunately it is almost always implemented in the background without you having to worry about it.
- Data should be encrypted when it is stored at a third-party service provider.
- When data is saved locally, create a de-identified version of the data that the research team can access. The identified data should either be encrypted or not saved on computers at all.
Sending data from field to cloud
Every single time your data is transferred on the internet, you absolutely must encrypt it. Unencrypted traffic on the internet can quite easily be read by anyone with the technical skills to do so. There are many different types of encryption and methods to encrypt your data. Luckily, we do not need to understand the technical details, as any established data collection service will encrypt all data submitted from the field automatically while in transit (i.e., during upload or download). So if you use servers hosted by a provider such as SurveyCTO or SurveySolutions, this is nothing you need to worry about. Your data will be encrypted from the time it leaves the device (in tablet-assisted data collection) or your browser (in web data collection) until it reaches the server. Anyone spying on your traffic will not be able to make out any data.
If you are using a less common service for other technical reasons, it is still likely that your data is encrypted in transit. Encryption in transit is not technically difficult for a service provider to implement, and it is a basic standard of modern internet traffic security. Still, whenever you start using a new service, ask the provider whether it encrypts data in transit and do some research of your own, because your data may be openly visible on the internet if it does not.
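As a first, very rough check when evaluating a new service, you can confirm that the address your data is sent to uses HTTPS. The sketch below is only an illustration: the function name and the example URLs are hypothetical, and an "https" address is a necessary sign of encryption in transit, not a full security review.

```python
from urllib.parse import urlparse


def uses_encrypted_transport(url: str) -> bool:
    """Return True if the URL uses HTTPS, i.e. traffic to it is sent over TLS."""
    return urlparse(url).scheme == "https"


# Hypothetical upload addresses, used for illustration only.
print(uses_encrypted_transport("https://example.org/upload"))
print(uses_encrypted_transport("http://example.org/upload"))
```

This does not replace asking the provider directly, but an address that starts with "http://" is a clear warning sign.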
Storing data in the cloud
Unlike encryption in transit, encryption at rest (i.e., in cloud storage) is rarely the default in data collection services, even in well-established services like SurveyCTO and SurveySolutions. One reason for this is that encryption at rest is somewhat resource-intensive, and to save server time and costs the provider typically will not want to add it unless the user requests it. However, the much more important reason is that the server host will not encrypt user data unless they are sure that the user knows how to operate the system and assumes its risks. Encryption at rest is different from password protection: it makes the underlying data itself unreadable, even to someone who gains access to it, unless that person has a specific private key file. Encryption at rest requires active participation from you, the user, and you should be fully aware that if the private key in your private/public key pair is lost, there is absolutely no way to recover your data. This is why even well-established services do not make this the default for all data collection.
Encryption at rest is the only way to ensure that PII data remains private when it is stored on someone else’s server on the open internet. The World Bank’s and many of our donors’ security requirements for data storage can only be fulfilled by this method. We recommend keeping your data encrypted at rest whenever PII data is collected, which in practice means we recommend it for all field data collection.
Different data collection services implement encryption at rest differently, but the main principles will always be the same. For example, encryption at rest in SurveyCTO must be affirmatively enabled at the time you create a new form. If you have already developed your form and want to encrypt its incoming data, you have to create a new form, upload the form file to that new form, and enable encryption at that point. In encrypted forms, the data is already encrypted on the tablet or in the browser before being sent to the server, and it is not decrypted until you use your private key in the browser or in SurveyCTO Sync, so the cloud server never handles the data unencrypted. In SurveyCTO you can also make certain fields publishable, meaning that they can be downloaded or viewed in the data explorer without the private key.
Downloading data from the cloud
This is very similar to the section Sending data from field to cloud. If you use an established service like SurveyCTO or SurveySolutions, you do not need to worry about this. If you are not using a well-established service, then make sure that the data is at least encrypted in transit.
Locally storing identified PII data
Almost all the data we collect includes information that can be used to identify who the respondent is. We call this PII (personally identifiable information) data, and typical examples are names, GPS coordinates of households, telephone numbers, etc. However, almost all types of data can be identifying depending on the context. Read the PII Data section for more information on what is identifying.
If we have enabled encryption at rest in the cloud, then the PII data is protected at the server. To keep the data protected after we download it, we want to do something similar on our local computers. Eventually we will de-identify the data, i.e. remove all identifying information, so we have a “de-identified” data set (without PII data) that we can share freely among the research team. But first we need to create a safe storage place for the PII data.
VeraCrypt is a software solution that is approved at the World Bank in the sense that we have permission to download and install it. ITS has not reviewed this practice, however, and has not commented on whether it is a best practice. On World Bank computers, we should install it using the Software Center.
When you use VeraCrypt to create a secure storage container, VeraCrypt creates a single encrypted file and asks you to choose a password for it (optionally combined with keyfiles). VeraCrypt does not generate a private/public key pair; access to the container is protected by the password you choose. If your computer were to be lost or stolen, or if someone in any other way gets access to your computer, the encrypted container will be visible, but its contents will look like gibberish to anyone who tries to read them. When you need to access the encrypted files, you mount the container in VeraCrypt by entering the password, and the software presents the contents of the container as a “virtual thumb drive”. Your computer will recognize this just as if you had plugged in a physical device, and the contents of the container are exposed as normal files that you can read using Stata, Word, etc. The files remain accessible until you tell VeraCrypt to dismount the container. The data stored on disk stays encrypted the whole time, because VeraCrypt encrypts and decrypts it on the fly as files are written and read.
Note that in all of these cases, whether the secret is the private key of a private/public key pair or the password to an encrypted container, there is no way to regenerate the secret once it is lost. If it is lost, there is no way of decrypting the data. So it is important that private keys and passwords are stored in a safe and convenient way, for example in a password manager. (You may use the enterprise-level OneDrive account for storage of unencrypted data, but we recommend this only as a backup for your peace of mind, not for long-term project storage or sharing. Other secure long-term storage solutions can be provided by World Bank ITS on a project-by-project basis.)
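To see why a lost secret cannot simply be recovered, it helps to look at how an encryption key can be derived from a password. The sketch below uses PBKDF2 from Python's standard library. It illustrates the general idea only and is not VeraCrypt's actual implementation: the function name, salt size, iteration count, and example passwords are all assumptions made for this example.

```python
import hashlib
import os


def derive_key(password: str, salt: bytes, iterations: int = 200_000) -> bytes:
    """Derive a 32-byte key from a password using PBKDF2-HMAC-SHA512."""
    return hashlib.pbkdf2_hmac(
        "sha512", password.encode("utf-8"), salt, iterations, dklen=32
    )


# The salt is random but not secret; it is stored alongside the encrypted data.
salt = os.urandom(16)

key = derive_key("correct horse battery staple", salt)
same = derive_key("correct horse battery staple", salt)
wrong = derive_key("correct horse battery stapel", salt)

print(key == same)   # same password and salt give the same key
print(key == wrong)  # a one-letter typo gives an unrelated key
```

Only the exact password reproduces the key, and there is no practical way to work backwards from the encrypted data to the password. This is why the secret must be stored safely.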
If you set up your project folder using iefolder, a dedicated folder called /EncryptedData/ has already been created for this purpose. Create your encrypted containers inside this folder, and save the de-identified versions of the data outside of the /EncryptedData/ folder.
Storage of de-identified data
The purpose of de-identifying the data is to create a dataset that can be shared freely among the analysis team. This dataset should be constructed so that, if it were stolen, the privacy of the respondents would not be compromised. This usually means removing the fields with identifying information but not making any other changes to the dataset. It should be safe to store this dataset unencrypted in Dropbox so that it can be used by the research team, and it should have basically everything that is needed for data analysis, so that most of the team never even needs to access the encrypted PII dataset.
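The basic step of removing identifying fields can be sketched as follows. This is a minimal illustration and not a complete de-identification procedure: the column names and the example records are hypothetical, and which fields count as identifying depends on the context of each project.

```python
import csv
import io

# Hypothetical PII column names; adjust these to the actual dataset.
PII_FIELDS = {"name", "phone", "gps_latitude", "gps_longitude"}


def deidentify(rows, pii_fields=PII_FIELDS):
    """Return copies of the rows with the PII columns dropped."""
    return [
        {field: value for field, value in row.items() if field not in pii_fields}
        for row in rows
    ]


# Made-up example data standing in for a file stored in the encrypted container.
raw = io.StringIO(
    "hh_id,name,phone,income\n"
    "101,Ana,555-0101,1200\n"
    "102,Ben,555-0102,950\n"
)
rows = list(csv.DictReader(raw))
clean = deidentify(rows)
print(clean)
```

Only the output of a step like this would be saved outside the encrypted storage. The original rows stay in the encrypted container, together with whatever is needed to link the two datasets.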