Data Security

Revision as of 20:49, 26 August 2020 by Kbjarkefur (talk | contribs) (Locally storing identified PII data)

This page outlines the steps in a typical research project and lists the data security topics that a research team should consider at each step. If you follow these best practices, not even the full research team will have access to identifying data, but identifying data is very rarely needed for the analysis.

Read First

  • DIME Analytics Data Security Policy Details
  • The central topic in this conversation is encryption.
  • Data must always be encrypted when being sent over the internet. This is very important, but fortunately it is almost always implemented in the background without you having to worry about it.
  • Data should be encrypted when it is stored with a third-party service provider.
  • When data is saved locally, create a de-identified version of the data that the research team has access to, and either encrypt the identified data or do not save it on computers at all.

Sending data from field to cloud

Data must be encrypted every time it is transferred over the internet. Unencrypted traffic on the internet can quite easily be read by anyone with the technical skills to intercept it. Encrypting data while it is being transferred is called encryption in transit. There are many types of encryption and many methods to encrypt your data, but luckily we do not need to understand the technical details, as any established data collection service automatically encrypts all data submitted from the field while it is in transit (i.e., during upload or download). So if you use servers hosted by a provider such as SurveyCTO or SurveySolutions, this is nothing you need to worry about. Your data will be encrypted from the time it leaves the device (in tablet-assisted data collection) or your browser (in web data collection) until it reaches the server. Anyone spying on your traffic will not be able to make out any data.

If you are using a less commonly used service for other technical reasons, it is still likely that your data is encrypted in transit. Encryption in transit is not technically difficult for a service provider to implement, and it is a basic standard of modern internet traffic security. But do ask the service provider whether it encrypts data in transit, and research any new service before you use it, as your data may be openly visible on the internet if it is not encrypted.
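As a first, very rough check on an unfamiliar service, you can confirm that the address you submit data to uses HTTPS. The sketch below is only an illustration: the function name and the example URLs are hypothetical, and an `https` address does not replace asking the provider how data is handled.

```python
from urllib.parse import urlparse


def uses_encrypted_transport(url: str) -> bool:
    """Return True if the URL uses the https scheme.

    This only inspects the address. It does not verify the server's
    certificate or how the provider stores the data once received.
    """
    return urlparse(url).scheme == "https"


if __name__ == "__main__":
    # Hypothetical upload addresses, used only for illustration.
    for url in ("https://example.org/submit", "http://example.org/submit"):
        print(url, uses_encrypted_transport(url))
```

A `False` result is a clear warning sign. A `True` result only means the connection itself is encrypted, so the questions to the service provider are still worth asking.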

Storing data in the cloud

Unlike encryption in transit, encryption at rest (i.e., in cloud storage) is rarely the default in data collection services, even in well-established services like SurveyCTO and SurveySolutions. One reason is that encryption at rest is somewhat resource-intensive, so to save server time and costs the provider typically will not add it unless the user requests it. The much more important reason, however, is that the server host will not encrypt user data unless it is sure that the user knows how to operate the system and accepts its risks. The main risk is that the data is lost forever if a few simple, but very important, steps are not followed. But if you follow the best practices related to public/private key pairs, your data will be safely encrypted and you will not risk permanently losing access to it.

Encryption at rest is different from password protection. Passwords restrict who has access to the data files, but encryption at rest makes those files unreadable to anyone who does not have the private encryption key. Encryption at rest requires active participation from you, the user, and you should be fully aware that if the private key in your public/private key pair is lost, there is absolutely no way to recover your data. This is why even well-established services do not make this the default for all data collection.

Encryption at rest is the only way to ensure that PII data remains private when it is stored on someone else’s server on the open internet. The World Bank’s and many of our donors’ security requirements for data storage can only be fulfilled by this method. We recommend keeping your data encrypted at rest whenever PII data is collected. Since almost all field data collection includes PII data, we recommend it for all field data collection.

Different data collection services implement encryption at rest differently, but the main principles are always the same. For example, encryption at rest in SurveyCTO must be affirmatively enabled at the time you create a new form. If you have already developed your form and want to encrypt its incoming data, you have to create a new form, upload the form file to that form, and enable encryption at that point. In encrypted forms, the data is already encrypted on the tablet or in the browser before it is sent to the server, and it is not decrypted until you use your private key in the browser or in SurveyCTO Sync, so the cloud server never handles the data unencrypted. In SurveyCTO you can also make certain fields publishable, meaning that they can be downloaded or viewed in the data explorer without the private key.

Downloading data from the cloud

This step is very similar to the one described in the section Sending data from field to cloud above. If you use an established service like SurveyCTO or SurveySolutions, you do not need to worry about this. If you are not using a well-established service, then make sure that the data is at least encrypted in transit.

Locally storing identified PII data

Almost all the data we collect includes information that can be used to identify the respondent. We call this PII data (personally identifiable information); typical examples are names, GPS coordinates of households, and telephone numbers. However, almost any type of data can be identifying depending on the context. Read the PII Data section for more information on what is identifying.

If we have enabled encryption at rest in the cloud, then the PII data is protected at the server. To keep the data protected after we download it, we want to do something similar on our local computers. Eventually we will de-identify the data, i.e. remove all identifying information, so we have a de-identified data set (without PII data) that we can share freely among the research team. But first we need to create a safe storage place for the PII data.

One software solution that is approved for World Bank computers is VeraCrypt. Approved here means that we have permission to download and install it; ITS has not reviewed this practice and has not commented on whether it is a best practice. On World Bank computers, install it using the Software Center.

In VeraCrypt you can set up secure encrypted folders. If your computer is lost or stolen, or if someone gets access to it in any other way, the encrypted folders will be visible, but their contents will look like gibberish to anyone who tries to read them.

When creating secure folders in VeraCrypt, you need to create a key that is used like a password to encrypt and decrypt the folder. As with any other proper encryption, if you lose the key there is no way to ever restore the information in the encrypted folder. So it is important that the key is stored in a safe and convenient way, for example in a password manager. A password manager can also help you generate a long random string of letters, digits and symbols, which is the strongest type of key. See DIME's step-by-step guide to VeraCrypt, which also links to best practices related to password managers.
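If your password manager does not offer a generator, a long random key of letters, digits and symbols can be produced with a few lines of code. This is a minimal sketch using Python's standard `secrets` module; the function name and the default length of 32 characters are illustrative assumptions, not a DIME or VeraCrypt requirement.

```python
import secrets
import string

# Letters, digits and symbols, as described in the text above.
ALPHABET = string.ascii_letters + string.digits + string.punctuation


def generate_key(length: int = 32) -> str:
    """Return a random key drawn from letters, digits and symbols.

    The default length is an illustrative choice, not an official rule.
    """
    if length < 1:
        raise ValueError("length must be at least 1")
    # secrets.choice uses a cryptographically secure random source.
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


if __name__ == "__main__":
    # Copy the printed key straight into your password manager.
    print(generate_key())
```

The `secrets` module is used instead of `random` because `random` is not designed for security purposes. Store the generated key in the password manager before using it to create the encrypted folder, since a lost key cannot be recovered.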

You may use the enterprise-level OneDrive account for storage of unencrypted data if you are not syncing the folder to your computer, but we recommend this only as a backup for your peace of mind, not for long-term project storage or sharing. See more details here. Other secure long-term storage solutions can be provided by World Bank ITS on a project-by-project basis.

If you set up your project folder using iefolder, a dedicated folder called /EncryptedData/ has already been created for this purpose. Create your encrypted containers inside this folder, and save the de-identified versions of the data outside of the /EncryptedData/ folder.

Storage of de-identified data

The purpose of de-identifying the data is to create a dataset that can be shared freely among the analysis team. This dataset should be constructed so that, if it were stolen, the privacy of the respondents would not be compromised. This usually means removing the fields with identifying information but not making any other changes to the dataset. It should be possible to store this dataset unencrypted in Dropbox so that the research team can use it, and it should contain basically everything that is needed for data analysis, so that most of the team never even needs to access the encrypted PII dataset.
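Removing the identifying fields while leaving everything else unchanged can be sketched as follows. The column names (`name`, `phone`, `gps_lat`, `gps_lon`) and the function name are hypothetical; which fields count as identifying depends on the project and its context, so the list has to be reviewed for each dataset.

```python
import csv
import io

# Hypothetical PII column names; adapt this list to your own dataset.
PII_COLUMNS = {"name", "phone", "gps_lat", "gps_lon"}


def deidentify_csv(src, dst, pii_columns=PII_COLUMNS):
    """Copy CSV rows from src to dst, dropping the listed PII columns.

    All other columns and values are written unchanged.
    Returns the list of columns that were kept.
    """
    reader = csv.DictReader(src)
    keep = [col for col in reader.fieldnames if col not in pii_columns]
    # extrasaction="ignore" silently skips the dropped columns in each row.
    writer = csv.DictWriter(dst, fieldnames=keep, extrasaction="ignore")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return keep


if __name__ == "__main__":
    # Made-up example data, standing in for a file in the encrypted folder.
    raw = io.StringIO("hhid,name,phone,income\n101,Ana,555-0101,1200\n")
    clean = io.StringIO()
    deidentify_csv(raw, clean)
    print(clean.getvalue())
```

In practice the source would be a file read from inside the encrypted folder and the destination a file saved outside it. Dropping columns is only the mechanical part: remaining variables can still be identifying in combination, so the result should be reviewed before it is shared.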

Software suggestions

One method is to keep raw data in a secure cloud location such as Amazon Web Services S3. Another is to use an enterprise-grade solution such as Microsoft OneDrive, but these can be expensive. Dropbox is not a secure storage location by default, and neither is GitHub. Depending on your specific security requirements, it may be acceptable to create an encrypted version of the dataset using software like VeraCrypt and store that in Dropbox. To securely transfer information, you can use Dropbox with encrypted data only, use another service such as WeTransfer, or share a password-protected .zip file created with 7-Zip, WinRAR, or in the macOS Terminal by writing:

   zip -er [archive.zip] [folder]

Email is never a secure data transfer method, unless the file is password-protected or encrypted.

We recommend that you use strong passwords and two-factor authentication for all of your online accounts. If you are handling data properly, it is going to end up password-protected in one way or another. In a collaborative, long-term project, you need to be able to store and share these passwords, so you will have to get smart about password management and account security. We recommend a service like LastPass. LastPass is free for personal use and cheap for sharing features. It allows you to store an unlimited number of passwords, and you will have many: one (or more) for every website, for every project, for every encrypted dataset, and so on. It also allows you to safely store small encryption keyfiles from services like SurveyCTO, attached to secure notes. A service like LastPass lets you store and share these passwords inside its own ecosystem, which matters because emailing passwords is just as vulnerable to theft or loss as writing them down.

LastPass also provides an app-based implementation for two-factor authentication (2FA), both for its own service and for other services you use (Gmail, Dropbox, Facebook, Amazon, etc.), and we recommend you enable this feature for both your business and personal accounts. Two-factor authentication means you have to enter a second confirmation using your mobile device to log into any of your password-protected services from new locations. This is important not because anyone is up to anything nefarious, but because random hacking is everywhere these days and can arbitrarily damage, delete, or lock you out of your files in the worst case.

Additional Resources

Back to Parent

This article is part of the topic Data_Management