Difference between revisions of "Encryption"

(45 intermediate revisions by 2 users not shown)
Line 1: Line 1:
In today's world of research, researchers regularly handle data, send it over the internet and store it in the cloud. At any point, especially when the internet is involved, the data is exposed to some risk. Keeping data safe and encrypted is hence a key component of [[IRB Approval | IRB requirements]] and [[Research Ethics | research ethics]]. Encryption should take place whenever dealing with [[Personally Identifiable Information (PII) | sensitive data]] in any stage of research: from [[Sampling & Power Calculations | sampling]] and [[Primary Data Collection | data collection]] to [[Data Cleaning | cleaning]] and [[Data Analysis | analysis]]. This page discusses encryption in transit and at rest; key pairs; password management; and encryption with SurveyCTO data.  
In today's world of research, researchers regularly handle data, send it over the internet, and store it in the cloud. At any point, especially when the internet is involved, the data is exposed to some risk. Keeping data safe and encrypted is hence a key component of [[IRB Approval | IRB requirements]] and [[Research Ethics | research ethics]]. Encryption should take place whenever dealing with [[Personally Identifiable Information (PII) | PII]] in any stage of research: from [[Power Calculations | sampling]] and [[Primary Data Collection | data collection]] to [[Data Cleaning | cleaning]] and [[Data Analysis | analysis]]. This page discusses three different types of encryption and two different contexts when encryption is needed, in addition to some shorter related topics.  


== Read First ==
== Read First ==
*Store encrypted data in <code>[[iefolder]]</code>’s [[DataWork_Folder#Survey_Encrypted_Data | EncryptedData]] folder. Note that while <code>[[iefolder]]</code> makes the folder, it does not encrypt it.
* There are three main types of encryption algorithms relevant to our field. They are '''symmetric encryption''', '''asymmetric encryption''', and '''hashing'''.
*World Bank SurveyCTO server data must be encrypted via SurveyCTO.
* When transferring or sharing [[Personally Identifying Information (PII)|PII]], the data needs to be both '''encrypted in transit''' and '''encrypted at rest'''.
* Almost all encryption depends on a [[Encryption#Public.2FPrivate_Key_Pair |public/private key pair]], which should be securely stored in a password manager.
* Encryption keys that can decrypt  PII must be shared using password managers.
* [[Encryption#Encryption_In_Transit | Encryption in transit]], or encryption while data is sent over the internet, is extremely important: there is never a case when not using encryption in transit is at all ok. [[Encryption#Encryption_At_Rest | Encryption at rest]], or encryption while data is stored on a server or computer, is also important.


==Encryption in Transit and at Rest==
== Three types of Encryption ==


=== Encryption in Transit ===
There are three types of encryption which you must know to safely operate in the digital world and to handle [[Personally Identifying Information (PII)|PII]] data. The differences between these three encryption types are:
Encryption in transit is by far the most important type of encryption. Service providers (e.g. Survey Solutions, [[SurveyCTO_Coding_Practices|SurveyCTO]], OneDrive) almost always take care of it. However, when using less established services, confirm that they use encryption in transit by looking at the internet address. If the service provider uses secure transfer methods, you will see <code>https://</code> in the internet address instead of simply <code>http://</code>. Data sent from an API using <code>http://</code> can easily be spied on.
*the circumstances under which the information can be decrypted
*the key or password needed to do so.  


Never send anything of importance over the internet unless the URL starts with <code>https://</code>. Data transferred over an <code>http://</code> connection can often be openly read by every server through which the data passes. Those servers are controlled by governments and private companies; hackers can easily tap into their traffic and read data, copy files, read passwords etc. <code>https://</code> is not the only secure way to transfer data over the internet, but it is the one that researchers should know of as we use it frequently. If setting up advanced protocols to send files, make sure that they are set up to be secure. For example, instead of using ''FTP,'' use ''FTPS''.
The three types are:
*'''Symmetric Encryption''': information can be both encrypted and decrypted and the same password / key is used when encrypting and decrypting
*'''Asymmetric Encryption''': information can both be encrypted and decrypted but different passwords / keys are used when encrypting and decrypting
*'''One-way hashing''': no password is used when encrypting, and there is no way to decrypt anything that was one-way hashed


Note that encryption in transit has nothing to do with a service requiring a username or password. While a password-protected resource can only be requested by someone with the correct password, once it is in transit to the authorized user, the servers handling it can still see it if it is not encrypted.  
For each of the three types of encryption, there are multiple algorithms that can be used to implement them. A typical researcher does not need to know the details in which they differ, but what a researcher should know is that some of these algorithms are outdated and can easily be broken. So even if a service provides, for example, '''symmetric encryption''', it is possible that it is implemented with an outdated algorithm that is not secure. However, if you are using well-established services for which no warnings are found after a quick Google search, then you should be fine. All services recommended on this page are known to have up-to-date implementations of encryption algorithms.


=== Encryption at Rest ===
== Symmetric Encryption ==
Encryption at rest means that the data stored on the server is scrambled in such a way that it is unreadable by anyone – even if they can access the file directly. If the data is not encrypted at rest, then anyone with access to the server can read that data (including, for example, the host company and your team’s administrators). If the data is encrypted at rest, however, the data is impossible to read even if someone gained unauthorized access to the database or the files where the data is stored. Encryption at rest uses an authorization tool called a [[Encryption#Public.2FPrivate_Key_Pair |public/private key pair]].
In '''symmetric encryption''', you both encrypt and decrypt information using the same key. “''Key''” in this context is almost synonymous with password, but passwords are only one type of key. Advanced encryption may use such long passwords as a key that the key becomes a file. This encryption is best suited for when you share a file both ways, meaning that both the sender and the receiver have equal access to the file. Sharing a file in syncing services like DropBox is an example of this.


Encryption at rest is not as easy to implement as encryption in transit. Both types of encryption use a private/public key pair to ensure that no unauthorized person gets access to the data. Since the time during which the data must be encrypted in transit is so short, the web servers never need to give users the key pair. Once the transfer is complete, the key pair is discarded and never needed or used again. However, in encryption at rest, the research team encrypts the data at one point in time and will access the data at a later point in time. The computer therefore must give the key pair to a human. As is so often the case, the weakest link is always the human factor. No private/public key pair is secure if the computer that generated the key saves it or is able to re-generate it. Thus, humans must safekeep the key pair. If we lose it, the data is lost forever and there is no way whatsoever to regain access to it.  
Encrypting using '''symmetric encryption''' takes two inputs, the encryption key and the information to be encrypted; the output is the encrypted unreadable version of the information. Decryption also takes two inputs: the exact same key used when encrypting and the encrypted unreadable version of the information while the output is the original readable version of the information.  
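The inputs and outputs described above can be sketched in a few lines of Python. This is a toy illustration only, not a secure cipher and not how any service mentioned on this page works: the function names, the key and the sample record are all made up for the example, and the "encryption" is a simple XOR with bytes derived from the key. It shows that the same key is used in both directions and that a different key does not recover the information.

```python
import hashlib

def keystream(key: bytes, length: int) -> bytes:
    """Stretch the key into `length` pseudo-random bytes (toy construction)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def toy_encrypt(key: bytes, data: bytes) -> bytes:
    """XOR the data with the keystream; applying it twice restores the data."""
    return bytes(d ^ k for d, k in zip(data, keystream(key, len(data))))

# Symmetric: decryption is the same operation with the same key.
toy_decrypt = toy_encrypt

key = b"correct horse battery staple"           # hypothetical key
message = b"household_id=1042, name=Jane Doe"   # hypothetical record

ciphertext = toy_encrypt(key, message)              # unreadable without the key
recovered = toy_decrypt(key, ciphertext)            # same key: original comes back
wrong = toy_decrypt(b"some other key", ciphertext)  # different key: still unreadable

print(recovered == message)
print(wrong == message)
```

For real data, use established software such as the tools recommended on this page rather than anything written by hand.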


The exact way in which encryption at rest is implemented depends on the service. For more information, read the instruction specific to your service.  
Anyone with access to both the encrypted file and the key can decrypt and read the information, so when third party services encrypt our data it is never secure enough if someone other than you and your team has access to the key. In research, when you have [[IRB Approval | IRB approval]] (which you should always have if you collect information on individuals), no one but the people listed on the '''IRB''' should have access to the key. This is a central aspect when assessing if an encryption is good enough.


== Public/Private Key Pair ==
One extremely important aspect of '''symmetric encryption''' is that there is absolutely no way to decrypt the file if the key is lost. We recommend '''password managers''' as a secure long term storage of keys. If you are using any service that claims to be able to recover information encrypted with '''symmetric encryption''' without the key, then they are either promising the impossible or that service is not secure enough as they keep a copy of your key.


The keys used in the private/public key pair are either strings or small files. Exactly how the private/public key pair is created differs depending on the service. Complex mathematical relationships connect the two keys (we never need to understand these), allowing anyone with the private key to decrypt anything encrypted by someone with the public key. In essence, the public/private key pair system is like a vault with two doors. One door has a tiny opening where you can only put things in the vault but not take anything out. To open this door, you only need the public key. Since the door cannot be used to take anything out of the vault, it is safe for multiple people to have this key. The second door is a big door that can be used to take out all the content of the vault. To open this door, you need the private key. It is therefore very important to restrict access to the private key.
For example, services like DropBox and Amazon AWS offer automatic encryption on their servers. The encryption algorithms they are using are state-of-the-art and correctly implemented. However, in order to make it automatic, they keep a copy of the key, and as long as DropBox or Amazon are not included in your project's '''IRB''', this is not good enough encryption. There is nothing wrong with them encrypting our files automatically, as it means that no-one but them can read the files (as long as they are not hacked), but we also need to exclude service providers when we protect [[Personally Identifying Information (PII)|sensitive data]].


The following principles always hold for public/private key pairs:


* If you lose your private key, there is no way of decrypting your data ever again. Your data is lost forever. To safeguard the private key, use a password manager.
=== Symmetric encryption software - VeraCrypt ===
* The key pair can only be created once. Any services that claim to be able to re-generate the private key for you are not safe: this is equivalent to them having your password. This either means that the data is not properly encrypted or that they saved a copy of your private key that gives them full access to your encrypted data.
If we want to share a file with [[Personally Identifying Information (PII)|sensitive data]] over insecure services like e-mail or DropBox, we need to manually encrypt the file before we share it. We recommend the free of charge and open source software VeraCrypt. The software can be downloaded [https://www.veracrypt.fr here] and DIME Analytics has a user guide written with researchers as a target audience [https://osf.io/fm7sh/ here].
* It is perfectly fine if the service keeps a copy of your public key. This allows them to encrypt new data as it is coming in. Some services only give you the private key and you do not have to worry about the public key. It depends on the context.
 
* Most services provide a convenient way of decrypting your data, but if you are not asked for the key every time the data is viewed or decrypted, then your data is not encrypted properly.
VeraCrypt creates a ''secure folder'' that we will compare to a safe, and helps you generate a secure key that opens and locks this safe. Just like a physical safe, anyone with access to the safe who also has the key will be able to access its content. You have to specify how much data the ''secure folder'' should be able to store, for example 100MB, and then that folder will take that amount of disk space on your computer regardless of whether the folder is empty or full. This is again similar to a physical safe, which has the same size whether it is full or empty, which means that no hacker can tell if your ''secure folder'' has content or not. You cannot store more data than the limit you set when you create the ''secure folder'', so if you run out of space, you must create a new ''secure folder'' with a bigger capacity and move the files there.
 
VeraCrypt decrypts the ''secure folder'' in such a way that it stays decrypted only for as long as it is actively kept open. As soon as you choose to stop decrypting it, VeraCrypt stops working, or your computer shuts down or loses power, the folder is immediately encrypted again. When a ''secure folder'' is no longer decrypted, it is always re-encrypted with the key that was used when it was initially created.
 
Once the ''secure folder'' is created, you can share it, empty or with content, over insecure connections (like e-mail, DropBox, etc.) as long as you share the key in a secure fashion. If you share the key in an insecure way, for example by putting it in an e-mail or in DropBox, it would be like shipping a physical safe with the key taped to the outside of it. The secure way to share the key is to use a type of software called '''password managers'''. Examples of '''password managers''' are [https://lastpass.com LastPass] (closed source but World Bank approved) and [https://bitwarden.com BitWarden] (open source but can only be accessed through the browser on World Bank computers). You use a '''password manager''' to safely share keys.
 
You can also use VeraCrypt to secure files for when you are not sharing them with anyone. This is a good idea for storing important documents, such as tax papers, etc., on your computer. This is done the same way but you simply do not share the ''secure folder'' or key with anyone. You still need the key yourself to decrypt the information later, so we strongly recommend using a '''password manager''' as long term storage of keys even if you will not share them with anyone.
 
'''Is open source less secure?'''
Open source is often said to be less secure, but that is not true if you use software that has been around for a long time and has been examined and used by many cyber security experts. VeraCrypt fulfills both those requirements. When it comes to encryption, it is safer to use scrutinized open source software, since it is much harder to hide a “backdoor”, that is, a secret master key that can decrypt all ''secure folders'' created by that software. Non-open source software packages may be pressured or compelled by national intelligence agencies to implement “backdoors” so that government agencies get access to your encrypted data. It is much more difficult to add backdoors to open source software without it being visible to everyone. So if a large community of users trusts an open source software package, then that is your most secure option.
 
== Asymmetric Encryption ==
 
While '''symmetric encryption''' is like a safe where everyone viewing, modifying or adding content must have the same key, '''asymmetric encryption''' is like the post office’s post collection box where anyone can add content but only the post office staff has the key that can open and view the content. So instead of one key, in '''asymmetric encryption''' there is a public/private key pair. In the post box example, there would be a public key to open the mail chute that anyone has access to and a private key that only the postal workers have that they use when they empty all the letters. While you do not risk compromising any data you have already collected by publishing the public key to the world, you should not do that, as it would allow a malicious person to send so much fake data to your collection box that it might be difficult to tell what data is real and what data isn't.
 
While the main benefit of '''asymmetric encryption''' is that you do not share the private key with anyone who sends you data, that is also the main limitation as it is only secure for one-way communication. Another benefit is that it allows for a more automatic set-up as it is ok for servers to handle sharing of the public key. If you set up the private/public key pair and you want to share the private key with anyone so that they can also decrypt the data, then that private key needs to be shared using a '''password manager''' as in '''symmetric encryption'''.
 
Just as with '''symmetric encryption''', there is absolutely no way to decrypt files if the private key is lost. So even if you do not intend to share the private key with anyone, we still strongly recommend that you store it in a '''password manager''' for the future. The public key cannot be re-generated either if lost, but that doesn’t tend to be an issue, as you can create a new digital collection box and use the new private/public key pair.
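To make the public/private key pair less abstract, here is a toy sketch of the RSA scheme, one well-known asymmetric algorithm, using tiny textbook numbers. It is an illustration under heavy simplifying assumptions: real keys are hundreds of digits long, real software adds padding and other safeguards, and the variable and function names below are made up for this example. The point is only that what the public key locks, only the private key opens.

```python
# Toy RSA with tiny textbook primes. Never use numbers this small in practice.
p, q = 61, 53
n = p * q                    # modulus, part of both keys
phi = (p - 1) * (q - 1)
e = 17                       # public exponent
d = pow(e, -1, phi)          # private exponent (modular inverse, Python 3.8+)

public_key = (e, n)          # safe to hand to whoever sends you data
private_key = (d, n)         # must stay secret

def apply_key(value: int, key) -> int:
    """Raise `value` to the key's exponent modulo n."""
    exponent, modulus = key
    return pow(value, exponent, modulus)

message = 65                                   # a message encoded as a small number
ciphertext = apply_key(message, public_key)    # anyone with the public key can do this
recovered = apply_key(ciphertext, private_key) # only the private key reverses it
not_recovered = apply_key(ciphertext, public_key)  # the public key cannot undo it

print(recovered == message)
print(not_recovered == message)
```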
 
 
=== Use case - sending data from the field ===
'''Asymmetric encryption''' is perfect for [[Primary Data Collection|data collection]] as the data is only intended to flow one way, and we do not want to have to set up '''password managers''' on the devices used in '''data collection'''. By using '''asymmetric encryption''', we can allow the tablets to send data securely to the server without making it possible for anyone using a tablet to see what is already on the server. It is as if we are sending a post collection box where the tablet can safely store the information, and no one will be able to access it apart from us, not even anyone using the tablet where the data was first encrypted.
 
Just as in '''symmetric encryption''', it is important that no one who is not on the [[IRB Approval | IRB]] has access to the decryption key, so the private key cannot be shared with any third party service we use for '''data collection'''. The private key should not be used to decrypt the data while it is still on the server; it should only be decrypted when or after it is downloaded. If you did not set up encryption or if you decrypted your data while it was still on the server, then the third party '''data collection''' service provider can read your data and they are most likely not listed on your '''IRB'''. Some '''data collection''' service providers let you view encrypted data in your browser without downloading it by providing the decryption key. It is perfectly possible to securely implement that without the service provider gaining access to the data, but make sure that you trust that service provider's ability to do so.
 
When you download the data, you should decrypt it with the private key, and then put it in a folder that is encrypted using '''symmetric encryption'''. The reason we shouldn’t keep using the asymmetric public/private key pair is that we need different keys to view and add data and there is no practical way of modifying already encrypted data. In '''symmetric encryption''', the same key is used to view, modify and add data, which is much more practical when you start working with your data on your computer.
 
== One-way hashing ==
This type of encryption is quite different from '''symmetric and asymmetric encryption''', as anything ''hashed'' (encrypted with one-way hashing) is impossible to decrypt, or un-hash. There are use-cases where it makes sense to hash our data, but encryption is not one of them as there is no way of decrypting data that has been hashed. The reason why we are bringing it up is that it is very central to secure online activities, especially password handling, and that is related to encryption and data security.
 
While it is impossible to un-hash anything, the same piece of information is always hashed the same way as long as the same hashing algorithm is used. This means the output, or the “hash”, is always the same if the input is the same. It also means that if the outputs of a hashing algorithm are different, then the inputs used could never have been the same.
 
There is no practically possible way to calculate the input based on the output even if you know exactly which algorithm was used. The input to output conversion takes milliseconds but the output to input conversion would take millennia if attempted, rendering it practically impossible even though it is theoretically possible. Good hashing algorithms are implemented so that two similar inputs still have widely different outputs, so there is no way to guess an input just because its output is similar to another output you know the input for. Also, if you hash the output again, you do not get anything similar to the input. In fact, re-hashing something many times is a common technique to make a hash even harder to crack.
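These two properties, that the same input always gives the same hash and that similar inputs give very different hashes, can be checked with Python's standard <code>hashlib</code> module. The sketch below uses SHA-256 as one example of a hashing algorithm; the helper name and the sample strings are made up for the illustration.

```python
import hashlib

def sha256_hex(text: str) -> str:
    """Return the SHA-256 hash of a string as 64 hexadecimal characters."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

first = sha256_hex("household 1042")
again = sha256_hex("household 1042")    # same input
similar = sha256_hex("household 1043")  # input differs by one character

# Same input, same algorithm: the hashes are identical.
print(first == again)

# Similar inputs: count how many of the 64 characters happen to match by position.
matching = sum(a == b for a, b in zip(first, similar))
print(matching, "of", len(first), "characters match")
```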
 
 
=== Example of use-case of one-way hashing ===
When you create an account on any service like Facebook or DropBox, they take your password and use it in a one-way hashing algorithm, and then only save the output, or the hash, in their database. If their servers were hacked, or if their database engineers were undercover hackers, all they would see is the output of the one-way hash algorithm and not the input, i.e. your password. Every time you log in to your account, your password will be put into the same hashing algorithm and the output will be compared to what they saved in their database when you created your account. If the outputs or hashes are the same, then they know that you used the same password both times, even though they never saved your password when you signed up.
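The sign-up and log-in flow described above could look roughly like the following sketch. It is an assumption-laden illustration, not the implementation of any real service: the function names and the iteration count are made up, it uses PBKDF2 (a standard way of re-hashing a password many times), and it adds a random ''salt'' per user, a common safeguard that this page does not otherwise discuss.

```python
import hashlib
import hmac
import secrets

ITERATIONS = 100_000  # illustrative value, not a recommendation

def hash_password(password: str, salt: bytes) -> bytes:
    """Hash the password many times over with PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)

def sign_up(password: str):
    """Return what the service stores: a random salt and the hash, never the password."""
    salt = secrets.token_bytes(16)
    return salt, hash_password(password, salt)

def log_in(attempt: str, salt: bytes, stored_hash: bytes) -> bool:
    """Hash the attempt the same way and compare it with the stored hash."""
    return hmac.compare_digest(hash_password(attempt, salt), stored_hash)

salt, stored_hash = sign_up("a long random passphrase")
print(log_in("a long random passphrase", salt, stored_hash))
print(log_in("a wrong guess", salt, stored_hash))
```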
 
This is not difficult to implement, so any service that does not hash user passwords before saving them is extremely insecure and should always be avoided. No web company will ever show you their database, but here are examples of how you can know that they did not hash your password before storing it. If you ever encounter any of the scenarios below, stop using that service and report it on cyber-security forums immediately.
 
* They are able to send you the password if you forget it. Secure services will never be able to send you your password; instead they send you a temporary password or a password reset link
* Anyone in customer support is able to tell you your password, or they say that they can see it on their screen
 
'''Weaknesses of one-way hashing''': common input leads to common output. Very common passwords like “Password1”, “qwerty” or “12345678” have very well known hashes, and hackers have already calculated and published the hashes for hundreds of thousands of common passwords. So if someone hacks into a database and you have used “12345678” as your password, then your hash is quickly identified as a known hash, and the hacker immediately knows your password. In addition to common passwords like “Password1”, hackers have already calculated the hashes for all possible combinations of passwords up to a few letters long. That is why services require you to have a password of a certain length and to use symbols and numbers: letter combinations are the first thing hackers calculate as they work through all combinations of longer and longer passwords.
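The weakness can be demonstrated with a small lookup table. The sketch below assumes, for illustration only, that a service stored plain SHA-256 hashes of passwords; the helper name and the "leaked" hash are made up. An attacker who has pre-computed the hashes of common passwords simply looks the leaked hash up.

```python
import hashlib

def sha256_hex(text: str) -> str:
    """Return the SHA-256 hash of a string as hexadecimal text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Pre-computed table mapping known hashes back to the common passwords behind them.
common_passwords = ["Password1", "qwerty", "12345678"]
known_hashes = {sha256_hex(p): p for p in common_passwords}

# Hash found in a (hypothetical) breached database.
leaked_hash = sha256_hex("12345678")
cracked = known_hashes.get(leaked_hash)  # found instantly, no un-hashing needed

# A long random password is not in the table, so the lookup finds nothing.
uncommon = known_hashes.get(sha256_hex("t7#Lq9!vRz2&mX4pW"))

print(cracked)
print(uncommon)
```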
 
== Encryption in Transit ==
Almost all internet traffic is '''encrypted in transit''', but when it is not, your data and meta-data (passwords, IP-addresses etc.) are completely open for anyone to view. This means that traffic not '''encrypted in transit''' can be read by web-servers owned by the internet service providers of both the sender and the receiver, web-servers of the governments where the sender and the receiver are, and the owners of all web-servers the traffic passes through. Traffic not '''encrypted in transit''' can be read by all those web-servers, and is often stored in log-files. Hackers often pose as fake web servers online, just sitting there waiting for un-encrypted traffic that they can read and perhaps exploit.
 
While the consequences of not '''encrypting data in transit''' cannot be overstated, it is luckily straightforward to implement and all well-established service providers we use (e.g. Survey Solutions, SurveyCTO, DropBox, OneDrive) have already implemented this. However, when using less established services, confirm that they use '''encryption in transit''' by looking at the internet address. If the service provider uses secure transfer methods, you will see <code>https://</code> in the internet address instead of simply <code>http://</code>. Data sent from an '''API''' using <code>http://</code> can easily be spied on. Modern browsers warn the user if the response was not signed in the first place or if the signature could not be confirmed. This is usually shown with a pop-up box saying “connection is insecure”, “a secure connection could not be established”, or something similar. You should never send passwords, credit card information or any other sensitive information if you get any of those warnings or if the URL starts with <code>http://</code> instead of <code>https://</code>.
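When a script rather than a browser sends data, for example to an '''API''', the same check on the internet address can be automated. The following is a minimal sketch using Python's standard <code>urllib.parse</code> module; the function name and the example addresses are hypothetical. Note that it only inspects the address and does not verify the server's certificate the way a browser does.

```python
from urllib.parse import urlparse

def uses_https(url: str) -> bool:
    """Return True only if the address uses the https:// scheme."""
    return urlparse(url).scheme.lower() == "https"

# Hypothetical addresses, for illustration only.
for address in ["https://example.org/upload", "http://example.org/upload"]:
    if uses_https(address):
        print(address, "-> encrypted in transit")
    else:
        print(address, "-> do NOT send sensitive data")
```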
 
Secure <code>https://</code> uses '''asymmetric encryption''' to secure internet traffic. In simplified terms, when you communicate with an internet server over <code>https://</code>, you first send a message with no [[Personally Identifying Information (PII)|sensitive data]] to the server saying that you are ready to send something in a secure way. The server then creates a public/private key pair and sends the public key back to you. Your browser then encrypts your data using the public key and sends it to the server. The server then decrypts it with the private key, which was kept secret as it never left the server. That specific public/private key pair is then discarded, never to be used again. In practice, the asymmetric step is mainly used to verify the server and to agree on a temporary symmetric key that encrypts the rest of the session, but the principle is the same. This is repeated many times each time you use the internet. All browsers handle this automatically and it is done in milliseconds, so you have used this millions of times in your life and probably a few hundred times already today without even noticing.
 
Note that '''encryption in transit''' has nothing to do with a service requiring a username or password. While a password-protected resource can only be requested by someone with the correct password, once it is in transit to the authorized user, the servers handling it can still see it if it is not encrypted.
 
== Encryption at Rest ==
'''Encryption at rest''' means that the data stored on the server is scrambled in such a way that it is unreadable by anyone – even if they can access the file directly. If the data is not '''encrypted at rest''', then anyone with access to the server can read that data (including, for example, the host company and your team’s administrators). If the data is '''encrypted at rest''', however, the data is impossible to read even if someone gains unauthorized access to the database or the files where the data is stored. '''Encryption at rest''' can use either '''symmetric encryption''' or '''asymmetric encryption''' depending on whether the person or device uploading the data to the server is meant to also be able to read the data.
 
'''Encryption at rest''' is not as easy to implement as '''encryption in transit'''. The main reason for this is that in '''encryption in transit''', we trust the server to handle the decryption key as it is used for such a short time and thereafter deleted. When we '''encrypt data at rest''' we need the key over a long time as we might access the data over months or years; therefore the web server handling incoming and outgoing requests is not a safe place to store the key. Instead we need to manually set this up using encryption software or use features for this in third party software providers.
 
The encryption we set up needs to make sure that we are in full control of the decryption key. If the key gives access to [[Personally Identifiable Information (PII)|sensitive data]], we need to make sure that key is not exposed to anyone who is not listed on the [[IRB Approval | IRB]]. We want to make sure that the data is encrypted before it is sent to the server where it will be '''encrypted at rest'''. We cannot trust the web server to do this for us, as the web-server can be compromised and then it is too late to encrypt it by the time it reaches that server. When we manually encrypt the data before sending it and carefully share the key, then any part of the internet infrastructure can be compromised and all that the hackers can see is encrypted information which without the key is unreadable gibberish.
 
== Other recommendations ==
 
=== Encryption and decryption keys ===
 
The following principles always hold for keys:
 
* Keys can only be created once. Any services that claim to be able to restore lost decryption keys for you are not safe: this is equivalent to them having your password. This either means that the data is not properly encrypted or that they saved a copy of your decryption key that gives them full access to your encrypted data.
* If you lose one key in a public/private key pair, you need to create a new pair. Anything encrypted with the original public key can never be decrypted with anything but the original private key.
* Many services offer a convenient way of decrypting your data, but if you are not asked for the key every time the data is viewed or decrypted, then your data is not encrypted securely enough as the service provider keeps a key for you.


=== Password and Key Management ===
=== Password and Key Management ===
When using public/private key pairs, you need to keep a lot of keys safe and well-organized. You can never store, organize, or transfer them using unsecure methods like email or Dropbox. DIME Analytics recommends using password managers as a convenient and safe solution for storing private keys and passwords. While there are many password managers out there, DIME Analytics recommends [https://www.lastpass.com/business-password-manager LastPass], whose free, basic tier satisfies all requirements a research team would ever need.
When using public/private key pairs, you need to keep a lot of keys safe and well-organized. You can never store, organize, or transfer them using unsecure methods like email or Dropbox. DIME Analytics recommends using '''password managers''' as a convenient and safe solution for storing private keys and passwords. While there are many '''password managers''' out there, DIME Analytics recommends [https://www.lastpass.com LastPass], whose free, basic tier satisfies all requirements a research team would ever need.  
 
Password managers like LastPass allow you to store small text files or string fields, as private/public key pairs sometimes come in file format. LastPass can save keys as small text files or as strings for you as ''secure notes,'' which you can also use for other important information, like noting to which encryption container the private key pertains. Password managers like LastPass can also help you randomize passwords so that your passwords are impossible-to-guess, long strings. Since the password manager stores and remembers this password for you, you will not need to memorize it and you can easily have a different long random string for each account you have.


Some password managers allow you to share passwords across accounts. This is a great feature: if one person updates the password, then everyone else has the updated password in their account immediately. Password sharing is sometimes a paid feature. One way around paying is for the research team to create one account and share access to that account. However, make sure to not share this account to more people than those who really needs it.
'''Password managers''' like LastPass allow you to store small text files or string fields, as private/public key pairs sometimes come in this file format. LastPass can save keys as small text files or as strings for you as ''secure notes,'' which you can also use for other important information, like noting to which encryption container the private key pertains. '''Password managers''' like LastPass can also help you [[Randomization|randomize]] passwords so that your passwords are impossible-to-guess, long strings. Since the '''password manager''' stores and remembers this password for you, you will not need to memorize it and you can easily have a different long random string for each account you have.


==Encryption with SurveyCTO Data==
Some '''password managers''' allow you to share passwords across accounts. This is a great feature: if one person updates the password, then everyone else has the updated password in their account immediately. Password sharing is sometimes a paid feature. One way around paying is for the research team to create one account and share access to that account. However, make sure to not share this account to more people than those who really needs it.
When [[Questionnaire Programming | programming]] questionnaires in SurveyCTO, use the key generator to create a public-private key pair for your data with the name of your intended survey. These will download two files to your laptop: these are the keys. Create an encrypted survey with the public key. Then put the keys in the vault via LastPass or VeraCrypt. Note that you will need the private key for the form data download. Store encrypted data <code>[[iefolder]]</code>’s [[DataWork_Folder#Survey_Encrypted_Data | EncryptedData]] folder. Note, however, that while <code>[[iefolder]]</code> makes the folder, it does not encrypt it.


In your first [[Data Cleaning | cleaning]] do-file, [[De-identification | de-identify]] the [[Personally Identifiable Information (PII) | PII data]] to create a de-identified dataset. If you’re using Veracrypt, the <code>veracrypt</code> command allows Stata to call for VeraCrypt to mount the drive. Then you have to manually enter the password anytime you run the do-file. The first cleaning file should move the non-PII version of your data to the regular data folder.
=== Encryption with SurveyCTO Data ===
When [[Questionnaire Programming | programming questionnaires]] in SurveyCTO, use the key generator to create a public-private key pair for your data with the name of your intended [[Survey Pilot|survey]]. These will download two files to your laptop: these are the keys. Create an encrypted '''survey''' with the public key. Then store the keys securely in a '''password manager'''. Note that you will need the private key for downloading the data. Store encrypted data in <code>[[iefolder]]</code>’s [[DataWork_Folder#Survey_Encrypted_Data | EncryptedData]] folder. Note, however, that while <code>iefolder</code> makes the folder, it does not encrypt it.


== Back to Parent ==

Latest revision as of 20:22, 9 August 2023

In today's world of research, researchers regularly handle data, send it over the internet, and store it in the cloud. At any point, especially when the internet is involved, the data is exposed to some risk. Keeping data safe and encrypted is hence a key component of IRB requirements and research ethics. Encryption should take place whenever dealing with PII in any stage of research: from sampling and data collection to cleaning and analysis. This page discusses three different types of encryption and two different contexts when encryption is needed, in addition to some shorter related topics.

Read First

  • There are three main types of encryption algorithms relevant to our field. They are symmetric encryption, asymmetric encryption, and hashing.
  • When transferring or sharing PII, the data needs to be both encrypted in transit and encrypted at rest.
  • Encryption keys that can decrypt PII must be shared using password managers.

Three types of Encryption

There are three types of encryption which you must know to safely operate in the digital world and to handle PII data. The differences between these three encryption types are:

  • the circumstances under which the information can be decrypted
  • the key or password needed to do so.

The three types are:

  • Symmetric Encryption: information can be both encrypted and decrypted and the same password / key is used when encrypting and decrypting
  • Asymmetric Encryption: information can both be encrypted and decrypted but different passwords / keys are used when encrypting and decrypting
  • One-way hashing: no password is used when encrypting, and there is no way to decrypt anything that was one-way hashed

For each of the three types of encryption, there are multiple algorithms that can be used to implement them. A typical researcher does not know the details in which they differ, but what a researcher should know is that some of these algorithms are outdated and can easily be hacked. So even if a service provides, for example, symmetric encryption, it is possible that it is implemented with an outdated algorithm that is not secure. However, if you are using well-established services for which no warnings are found after a quick Google search, then you should be fine. All services recommended on this page are known to have up-to-date implementations of encryption algorithms.

Symmetric Encryption

In symmetric encryption, you both encrypt and decrypt information using the same key. “Key” in this context is almost synonymous with password, but passwords are only one type of key. Advanced encryption may use such long passwords as a key that the key becomes a file. This encryption is best suited for when you share a file both ways, meaning that both the sender and the receiver have equal access to the file. Sharing a file in syncing services like DropBox is an example of this.

Encrypting using symmetric encryption takes two inputs, the encryption key and the information to be encrypted; the output is the encrypted unreadable version of the information. Decryption also takes two inputs: the exact same key used when encrypting and the encrypted unreadable version of the information while the output is the original readable version of the information.
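The inputs and outputs described above can be sketched in a few lines of Python. This is only an illustration of the principle that the same key both encrypts and decrypts, not a tool for protecting real data: it uses a simple byte-by-byte XOR with a random key as long as the message, and the function name xor_bytes and the sample message are made up for this example. For real files, use established software such as VeraCrypt.

```python
import secrets


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """Combine data and key byte by byte. Applying it twice with the same key undoes it."""
    if len(key) < len(data):
        raise ValueError("key must be at least as long as the data")
    return bytes(d ^ k for d, k in zip(data, key))


# Hypothetical piece of sensitive information
message = b"respondent_id=1042; village=Example"

# The shared secret key: random bytes, as long as the message
key = secrets.token_bytes(len(message))

# Encryption: key + readable information -> unreadable information
encrypted = xor_bytes(message, key)

# Decryption: the exact same key + unreadable information -> readable information
decrypted = xor_bytes(encrypted, key)

# Any other key does not give back the original information
wrong_key = bytes(b ^ 0xFF for b in key)
garbled = xor_bytes(encrypted, wrong_key)

print(decrypted == message)  # the original is recovered with the right key
print(garbled == message)    # but not with a different key
```

Note that nothing in the sketch can recover the message if the key is lost, which is exactly the property discussed in this section.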

Anyone with access to both the encrypted file and the key can decrypt and read the information, so when third party services encrypt our data it is never secure enough if someone other than you and your team has access to the key. In research, when you have IRB approval (which you should always have if you collect information on individuals), no one but the people listed on the IRB should have access to the key. This is a central aspect when assessing if an encryption is good enough.

One extremely important aspect of symmetric encryption is that there is absolutely no way to decrypt the file if the key is lost. We recommend password managers as secure long term storage of keys. If you are using any service that claims to be able to recover information encrypted with symmetric encryption without the key, then they are either promising the impossible or that service is not secure enough as they keep a copy of your key.

For example, services like DropBox and Amazon AWS offer automatic encryption on their servers. The encryption algorithms they are using are state-of-the-art and correctly implemented. However, in order to make it automatic, they keep a copy of the key, and as long as DropBox or Amazon are not included in your project's IRB, this is not good enough encryption. There is nothing wrong with them encrypting our files automatically, as it means that no-one but them can read the files (as long as they are not hacked), but we also need to exclude service providers when we protect sensitive data.


Symmetric encryption software - VeraCrypt

If we want to share a file with sensitive data over insecure services like e-mail or DropBox, we need to manually encrypt the file before we share it. We recommend the free of charge and open source software VeraCrypt. The software can be downloaded here and DIME Analytics has a user guide written with researchers as a target audience here.

VeraCrypt creates a secure folder that we will compare to a safe, and helps you generate a secure key that opens and locks this safe. Just like a physical safe, anyone with access to the safe who also has the key will be able to access its content. You have to specify how much data the secure folder should be able to store, for example 100MB, and then that folder will take that amount of disk space on your computer regardless if the folder is empty or full. This is again similar to a physical safe where it has the same size whether it is full or empty, which means that no hacker can tell if your secure folder has content or not. You cannot store more data than the limit you set when you create the secure folder, so if you run out of space, you must create a new secure folder with a bigger capacity and move the files there.

VeraCrypt decrypts the secure folder in such a way that it is only decrypted for as long as you actively keep it decrypted. This means that as soon as you choose to stop decrypting it, VeraCrypt stops working, or your computer shuts down or loses power, the folder is immediately encrypted again. When a secure folder is no longer decrypted it is always re-encrypted with the key that was used when it was initially created.

Once the secure folder is created, you can share it while it's empty or with content over insecure connections (like e-mail, DropBox, etc.) as long as you share the key in a secure fashion. If you share the key in an insecure way, for example by putting it in an e-mail or in DropBox, it would be like shipping a physical safe with the key taped to the outside of it. The secure way to share the key is to use a type of software called password managers. Examples of password managers are LastPass (closed source but World Bank approved) and BitWarden (open source but can only be accessed through the browser on World Bank computers).

You can also use VeraCrypt to secure files for when you are not sharing them with anyone. This is a good idea for storing important documents, such as tax papers, etc., on your computer. This is done the same way but you simply do not share the secure folder or key with anyone. You still need the key yourself to decrypt the information later, so we strongly recommend using a password manager as long term storage of keys even if you will not share them with anyone.

Is open source less secure? Open source is often said to be less secure, but that is not true if you use software that has been around for a long time and has been examined and used by many cyber security experts. VeraCrypt fulfills both those requirements. When it comes to encryption, it is safer to use scrutinized open source software since you can be sure there is no “backdoor” implemented where there is a secret master key that can decrypt all secure folders created by that software. Non-open source software packages are often forced by national intelligence agencies to implement “backdoors” so that government agencies get access to your encrypted data. But it is much more difficult to add backdoors in open source software without it being visible to everyone. So if a large community of users trusts an open source software package, then that is your most secure option.

Asymmetric Encryption

While symmetric encryption is like a safe where everyone viewing, modifying or adding content must have the same key, asymmetric encryption is like the post office’s post collection box where anyone can add content but only the post office staff has the key that can open and view the content. So instead of one key, in asymmetric encryption there is a public/private key pair. In the post box example, there would be a public key to open the mail chute that anyone has access to and a private key that only the postal workers have that they use when they empty all the letters. While you do not risk compromising any data you have already collected by publishing the public key to the world, you should not do that as it would allow a malicious person to send so much fake data to your collection box that it might be difficult to tell what data is real and what data isn't.

While the main benefit of asymmetric encryption is that you do not share the private key with anyone who sends you data, that is also the main limitation as it is only secure for one-way communication. Another benefit is that it allows for a more automatic set-up as it is ok for servers to handle sharing of the public key. If you set up the private/public key pair and you want to share the private key with anyone so that they can also decrypt the data, then that private key needs to be shared using a password manager as in symmetric encryption.

Just as with symmetric encryption, there is absolutely no way to decrypt files if the private key is lost. So even if you do not intend to share the private key with anyone, we still strongly recommend that you store it in a password manager for the future. The public key cannot be re-generated either if lost, but that doesn’t tend to be an issue, as you can create a new digital collection box and use the new private/public key pair.
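To make the public/private key pair more concrete, here is a toy sketch in Python using textbook RSA arithmetic. It is an assumption of this example that RSA is the algorithm, and the primes are deliberately tiny so the numbers stay readable; real key pairs use numbers hundreds of digits long and are generated by established software, never by hand. The function names encrypt and decrypt are made up for this illustration.

```python
# Toy key pair built from two tiny primes (illustration only, not secure)
p, q = 61, 53
n = p * q                  # shared by both keys
phi = (p - 1) * (q - 1)
e = 17                     # public exponent
d = pow(e, -1, phi)        # private exponent, the modular inverse of e (Python 3.8+)

public_key = (e, n)        # the "mail chute": can be handed to whoever sends data
private_key = (d, n)       # the "postal worker's key": kept secret


def encrypt(number: int, key: tuple) -> int:
    """Encrypt a number smaller than n with the public key."""
    exponent, modulus = key
    return pow(number, exponent, modulus)


def decrypt(number: int, key: tuple) -> int:
    """Decrypt a number with the private key."""
    exponent, modulus = key
    return pow(number, exponent, modulus)


message = 65                                   # hypothetical piece of data
ciphertext = encrypt(message, public_key)      # anyone with the public key can do this
recovered = decrypt(ciphertext, private_key)   # only the private key holder can do this

print(recovered == message)
```

The sender never needs the private key, which is what makes this set-up suitable for one-way communication.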


Use case - sending data from the field

Asymmetric encryption is perfect for data collection as the data is only intended to flow one way, and we do not want to have to set up password managers on the devices used in data collection. By using asymmetric encryption, we can allow the tablets to send data securely to the server without making it possible for anyone using a tablet to see what is already on the server. It is as if we are sending a post collection box where the tablet can safely store the information, and no one will be able to access it apart from us, not even anyone using the tablet where the data was first encrypted.

Just as in symmetric encryption, it is important that no one who is not on the IRB has access to the decryption key, so the private key cannot be shared with any third party service we use for data collection. The private key should not be used to decrypt the data while it is still on the server; it should only be decrypted when or after it is downloaded. If you did not set up encryption or if you decrypted your data while it was still on the server, then the third party data collection service provider can read your data and they are most likely not listed on your IRB. Some data collection service providers let you view encrypted data in your browser without downloading it by providing the decryption key. It is perfectly possible to securely implement that without the service provider gaining access to the data, but make sure that you trust that service provider's ability to do so.

When you download the data, you should decrypt it with the private key, and then put it in a folder that is encrypted using symmetric encryption. The reason we shouldn’t keep using the asymmetric public/private key pair is that we need different keys to view and add data and there is no practical way of modifying already encrypted data. In symmetric encryption, the same key is used to view, modify and add data, which is much more practical when you start working with your data on your computer.

One-way hashing

This type of encryption is quite different from symmetric and asymmetric encryption, as anything hashed (encrypted with one-way hashing) is impossible to decrypt, or un-hash. There are use-cases where it makes sense to hash our data, but encryption is not one of them as there is no way of decrypting data that has been hashed. The reason why we are bringing it up is that it is very central to secure online activities, especially password handling, and that is related to encryption and data security.

While it is impossible to un-hash anything, the same piece of information is always hashed the same way as long as the same hashing algorithm is used. This means the output, or the “hash”, is always the same if the input is the same. This also means that if the outputs of a hashing algorithm are different, then the inputs used could never have been the same.

There is no practically possible way to calculate the input based on the output even if you know exactly which algorithm was used. The input to output conversion takes milliseconds but the output to input conversion would take millennia if attempted, rendering it practically impossible even though it is theoretically possible. Good hashing algorithms are implemented so that two similar inputs still have widely different outputs, so there is no way to guess the input just because the output is similar to another output you know the input for. Also, if you hash the output again, you do not get anything similar to the input. In fact, re-hashing something 10 times is a common trick to make it even harder to crack a hash.
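These properties are easy to see for yourself. The sketch below uses SHA-256 from Python's standard hashlib module; the choice of SHA-256 and the example inputs are assumptions made for this illustration, and the helper names sha256_hex and differing_bits are made up.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Return the SHA-256 hash of a string as 64 hexadecimal characters."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def differing_bits(hex_a: str, hex_b: str) -> int:
    """Count how many of the bits differ between two hashes."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


first = sha256_hex("household_0001")
again = sha256_hex("household_0001")     # same input
similar = sha256_hex("household_0002")   # input differs by one character

print(first == again)                    # same input, same hash
print(first == similar)                  # different input, different hash
print(differing_bits(first, similar))    # number of the 256 bits that changed
```

Running this shows that changing a single character in the input changes a large share of the bits in the output, so a similar-looking hash tells you nothing about the input.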


Example of use-case of one-way hashing

When you create an account on any service like Facebook or DropBox, they take your password and use it in a one-way hashing algorithm, and then only save the output, or the hash, in their database. If their servers were hacked, or if their database engineers were undercover hackers, all they would see is the output of the one-way hash algorithm and not the input, i.e. your password. Every time you log in to your account, your password will be put into the same hashing algorithm and the output will be compared to what they saved in their database when you created your account. If the outputs or hashes are the same, then they know that you used the same password both times, even though they never saved your password when you signed up.
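A minimal sketch of this sign-up and log-in flow is shown below, using only Python's standard library. It is not how any particular service implements it. Two details go beyond the description above and are assumptions of this example: a random salt is stored next to each hash, and the hashing is repeated many times through PBKDF2 (the iteration count is arbitrary here). The function names make_record and check_password are hypothetical.

```python
import hashlib
import hmac
import secrets
from typing import Tuple

ITERATIONS = 100_000  # assumed value for illustration


def make_record(password: str) -> Tuple[bytes, bytes]:
    """At sign-up: return (salt, hash). The password itself is never stored."""
    salt = secrets.token_bytes(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def check_password(password: str, record: Tuple[bytes, bytes]) -> bool:
    """At log-in: hash the attempt the same way and compare the two hashes."""
    salt, stored_digest = record
    attempt = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(attempt, stored_digest)


# What the service keeps in its database for one hypothetical user
record = make_record("a long random passphrase")

print(check_password("a long random passphrase", record))  # correct password
print(check_password("a wrong guess", record))              # incorrect password
```

The database only ever holds the salt and the hash, so there is nothing from which the service could send you your original password.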

This is not difficult to implement, so any service that does not hash user passwords before saving them is extremely insecure and should always be avoided. No web-company will ever show you their database but here are examples of how you can know that they did not hash your password before storing it. If you ever encounter any of the scenarios below, stop using that service and report it on cyber-security forums immediately.

  • They are able to send you the password if you forget it. Secure services will never be able to send you your password; instead they send you a temporary password or a password reset link
  • Anyone in customer support is able to tell you your password, or they say that they can see it on their screen

Weaknesses of one-way hashing: common input leads to common output. Very common passwords like “Password1”, “qwerty” or “12345678” have very well known hashes, and hackers have already calculated and published the hashes for hundreds of thousands of common passwords. So if someone hacks into a database and you have used “12345678” as your password, then your hash is quickly identified as a known hash, and the hacker immediately knows your password. In addition to common passwords like “Password1”, hackers have already calculated all hashes for all possible combinations of passwords up to a few letters long. That is why services require you to have passwords of a certain length, and to use symbols and numbers: letter combinations are the first thing hackers calculate as they cover all combinations of longer and longer passwords.
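The weakness can be sketched as a simple lookup table. In this Python illustration the "published" list contains only the three passwords mentioned above, and plain unsalted SHA-256 is assumed as the hashing algorithm; the names LOOKUP and crack are made up for the example.

```python
import hashlib
import secrets


def sha256_hex(text: str) -> str:
    """Return the SHA-256 hash of a string as hexadecimal text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# A tiny stand-in for the published lists of pre-calculated hashes
COMMON_PASSWORDS = ["Password1", "qwerty", "12345678"]
LOOKUP = {sha256_hex(password): password for password in COMMON_PASSWORDS}


def crack(stolen_hash: str):
    """Return the password if its hash is already known, otherwise None."""
    return LOOKUP.get(stolen_hash)


# A hash stolen from a database where someone used a common password
print(crack(sha256_hex("12345678")))

# A hash of a long random password is not in any pre-calculated list
strong_password = secrets.token_urlsafe(24)
print(crack(sha256_hex(strong_password)))
```

No un-hashing takes place here: the attacker only compares the stolen hash with hashes calculated in advance, which is why long, random passwords stay safe.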

Encryption in Transit

Almost all internet traffic is encrypted in transit, but when it is not, your data and meta-data (passwords, IP-addresses etc.) are completely open for anyone to view. This means that traffic not encrypted in transit can be read, and often stored in log-files, by web-servers owned by the internet service providers of both the sender and the receiver, web-servers of the governments where the sender and the receiver are, and the owners of all web-servers the traffic passes through. Hackers often pose as fake web servers online, just sitting there waiting for un-encrypted traffic that they can read and perhaps exploit.

While the consequences of not encrypting data in transit should not be underestimated, it is luckily straightforward to implement and all well-established service providers we use (e.g. Survey Solutions, SurveyCTO, DropBox, OneDrive) have already implemented this. However, when using less established services, confirm that they use encryption in transit by looking at the internet address. If the service provider uses secure transfer methods, you will see https:// in the internet address instead of simply http://. Data sent from an API using http:// can easily be spied on. Modern browsers warn the user if the response was not signed in the first place or if the signature could not be confirmed. This is usually shown with a pop-up box saying “connection is insecure”, “a secure connection could not be established”, or something similar. You should never send passwords, credit card information or any other sensitive information if you get any of those warnings or if the URL starts with http:// instead of https://.
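If you download data through a script rather than a browser, the same check on the internet address can be done in code before anything is sent. Below is a minimal Python sketch; the function name is_encrypted_in_transit and the example.org addresses are placeholders for this illustration. The check only looks at the address and does not replace the certificate warnings described above.

```python
from urllib.parse import urlparse


def is_encrypted_in_transit(url: str) -> bool:
    """Return True only if the address uses https:// rather than http://."""
    return urlparse(url).scheme == "https"


# Placeholder addresses for illustration
print(is_encrypted_in_transit("https://example.org/api/data"))  # True
print(is_encrypted_in_transit("http://example.org/api/data"))   # False
```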

Secure https:// uses asymmetric encryption to secure internet traffic. In simplified terms, when you communicate with an internet server over https://, you first send a message with no sensitive data to the server saying that you are ready to send something in a secure way. The server then creates a public/private key pair and sends the public key back to you. Your browser then encrypts your data using the public key and sends it to the server. The server then decrypts it with the private key, which was kept secret as it never left the server. That specific public/private key pair is then discarded to never again be used. This is repeated thousands of times each time you use the internet. All browsers handle this automatically and it is done in milliseconds, so you have used this millions of times in your life and probably a few hundred times already today without even noticing.

Note that encryption in transit has nothing to do with a service requiring a username or password. While a password-protected resource can only be requested by someone with the correct password, once it is in transit to the authorized user, the servers handling it can still see it if it is not encrypted.

Encryption at Rest

Encryption at rest means that the data stored on the server is scrambled in such a way that it is unreadable by anyone, even if they can access the file directly. If the data is not encrypted at rest, then anyone with access to the server can read that data (including, for example, the host company and your team’s administrators). If the data is encrypted at rest, however, the data is impossible to read even if someone gains unauthorized access to the database or the files where the data is stored. Encryption at rest can use either symmetric encryption or asymmetric encryption depending on whether the person or device uploading the data to the server is meant to also be able to read the data.

Encryption at rest is not as easy to implement as encryption in transit. The main reason for this is that in encryption in transit, we trust the server to handle the decryption key as it is used for such a short time and thereafter deleted. When we encrypt data at rest we need the key over a long time as we might access the data over months or years; therefore the web server handling incoming and outgoing requests is not a safe place to store the key. Instead we need to manually set this up using encryption software or use features for this in third party software providers.

The encryption we set up needs to make sure that we are in full control of the decryption key. If the key gives access to sensitive data, we need to make sure that key is not exposed to anyone who is not listed on the IRB. We want to make sure that the data is encrypted before it is sent to the server where it will be encrypted at rest. We cannot trust the web server to do this for us, as the web-server can be compromised and then it is too late to encrypt it by the time it reaches that server. When we manually encrypt the data before sending it and carefully share the key, then any part of the internet infrastructure can be compromised and all that the hackers can see is encrypted information which without the key is unreadable gibberish.

Other recommendations

Encryption decryption keys

The following principles always hold for keys:

  • Keys can only be created once. Any services that claim to be able to restore lost decryption keys for you are not safe: this is equivalent to them having your password. This either means that the data is not properly encrypted or that they saved a copy of your decryption key that gives them full access to your encrypted data.
  • If you lose one key in a public/private key pair you need to create a new pair. Anything encrypted with the original public key can never be decrypted with anything but the original private key.
  • Many services offer a convenient way of decrypting your data, but if you are not asked for the key every time the data is viewed or decrypted, then your data is not encrypted securely enough as the service provider keeps a key for you.

Password and Key Management

When using public/private key pairs, you need to keep a lot of keys safe and well-organized. You can never store, organize, or transfer them using insecure methods like email or Dropbox. DIME Analytics recommends using password managers as a convenient and safe solution for storing private keys and passwords. While there are many password managers out there, DIME Analytics recommends LastPass, whose free, basic tier satisfies all requirements a research team would ever need.

Password managers like LastPass allow you to store small text files or string fields, as private/public key pairs sometimes come in this file format. LastPass can save keys as small text files or as strings for you as secure notes, which you can also use for other important information, like noting to which encryption container the private key pertains. Password managers like LastPass can also help you randomize passwords so that your passwords are impossible-to-guess, long strings. Since the password manager stores and remembers this password for you, you will not need to memorize it and you can easily have a different long random string for each account you have.
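To give a sense of what such a randomized password looks like, here is a short Python sketch built on the standard secrets module. It is not how LastPass or any other password manager generates passwords; the function name random_password, the default length of 32 and the minimum of 16 characters are assumptions for this illustration.

```python
import secrets
import string

# Letters, digits and symbols that the password may contain
ALPHABET = string.ascii_letters + string.digits + string.punctuation


def random_password(length: int = 32) -> str:
    """Return a password of random characters drawn from ALPHABET."""
    if length < 16:
        raise ValueError("use at least 16 characters")
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


# A different long random string for each account
password_for_account_a = random_password()
password_for_account_b = random_password()
print(len(password_for_account_a))
```

Strings like these are practically impossible to memorize, which is why they only make sense together with a password manager that stores them for you.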

Some password managers allow you to share passwords across accounts. This is a great feature: if one person updates the password, then everyone else has the updated password in their account immediately. Password sharing is sometimes a paid feature. One way around paying is for the research team to create one account and share access to that account. However, make sure not to share this account with more people than those who really need it.

Encryption with SurveyCTO Data

When programming questionnaires in SurveyCTO, use the key generator to create a public-private key pair for your data with the name of your intended survey. This will download two files to your laptop: these are the keys. Create an encrypted survey with the public key. Then store the keys securely in a password manager. Note that you will need the private key for downloading the data. Store encrypted data in iefolder’s EncryptedData folder. Note, however, that while iefolder makes the folder, it does not encrypt it.

Back to Parent

This article is part of the topic Data Security

Additional Research