Difference between revisions of "De-identification"

 
(18 intermediate revisions by 5 users not shown)
Line 1: Line 1:
== Read First ==
+
De-identification is the process of removing or masking [[Personally Identifiable Information (PII) | personally identifiable information (PII)]] in order to reduce the risk that subjects’ identities be connected with data. De-identification is a critical component of [[Research Ethics | ethical]] [[Protecting Human Research Subjects | human subjects]]  research. This page will discuss how to handle and de-identify incoming PII data before [[Data Cleaning | cleaning]], [[Data Analysis | analyzing]], or [[Publishing Data | publishing]] data.
<onlyinclude>
+
* Some survey variables allow identification of individual respondents. These are called Personally Identifiable Information (PII). Which variables count as PII varies with the context of the survey. It is the responsibility of researchers to make sure this data is kept private and safely stored, and no PII may ever be publicly released without explicit consent.
+
==Read First==
* Variables containing personally identifiable information that is not related to the research question should be dropped as early as possible in the project, and all PII must be stored in an encrypted folder. PII variables that are needed for analysis can be either encoded or masked, depending on the type of information they contain and who has access to the data.
+
*In general, the research team should always work with and analyze de-identified data, except when planning follow-up data collection or [[Monitoring Data Quality | monitoring]] data.
</onlyinclude>
+
*Publicly released data or replication data shared with other researchers must always be carefully de-identified.
==Personally Identifiable Information ==
+
*To de-identify data, 1) drop PII variables not necessary for the analysis, then 2) de-identify PII variables necessary for the analysis by masking, encoding, and anonymizing. For more details on what constitutes PII, see [[Personally Identifiable Information (PII)]].
In the context of a survey, personally identifiable information (PII) refers to the variables that can, either on their own or in combination with other variables, identify a single surveyed individual with reasonable certainty. The following variables may lead to personal identification:
 
* Names of survey respondent, household members, enumerators and other individuals
 
* Names of schools, clinics, villages and possibly other administrative units (depending on the survey)
 
* Dates of birth
 
* GPS coordinates
 
* Contact information
 
* Record identifier (social security number, process number, medical record number, national clinic code, license plate, IP address)
 
* Pictures (of individuals, houses, etc)
 
  
 +
== Data Flow==
  
A few examples of sensitive variables that, depending on the survey context, may contain personally identifying information:
+
The following steps ensure proper handling and storage of PII:
* Age
 
* Gender
 
* Ethnicity
 
* Grades, salary,  job position
 
  
 +
# Save the raw, identified data to the [[DataWork_Folder#Survey_Encrypted_Data | Survey Encrypted Data]] folder, housed in the [[DataWork_Survey_Round#Encrypted_Round_Folder | Encrypted Round Folder]]. The data in this folder should be exactly as you got it: absolutely no changes should be made to it. 
 +
# De-identify the data by dropping the PII variables not necessary for analysis and masking, coding, or anonymizing the PII variables necessary for the analysis. Make sure to create [[Reproducible Research | reproducible]] do-files for the de-identification process. Save these do-files in the [[DataWork_Survey_Round#Dofiles_Import | Dofiles Import Folder]], housed in the [[DataWork_Survey_Round#Encrypted_Round_Folder | Encrypted Round Folder]].
 +
# Save the de-identified data set in the [[DataWork_Survey_Round#DataSets_Folder#De-identified_Folder | De-identified Folder]], housed in the [[DataWork_Survey_Round#DataSets_Folder | DataSets Folder]]. This is the raw data set with which the research team will begin to work.
  
As these variables exemplify, what exactly is PII will depend on the context of each survey. For example, if a survey covers a small farming community, variables such as plot size and crops cultivated can be combined to identify an individual household. Administrative units can be considered PII if there are few individuals in each of them.
+
In general, the research team should only use the data in the [[DataWork_Folder#Survey_Encrypted_Data | Survey Encrypted Data]] folder to plan follow-up data collection or to [[Monitoring Data Quality | monitor]] data quality. Otherwise, the research team should work with de-identified data in the DataSets Folder. If necessary, the research team can work with data sets containing PII for purposes other than follow-ups and monitoring, provided that they take special measures to keep the data set secure and protected. Note, however, that not all file sharing services facilitate secure sharing of encrypted files.
Details on how to calculate the disclosure risk, that is, the risk of someone being able to trace individual respondents from the available data, can be found in [https://dimewiki.worldbank.org/wiki/De-identification#Additional_Resources Additional Resources]. It is common to define a threshold for the minimum number of individuals that must share a given value of a variable for that value to be considered safe to disclose. For example, if a school has fewer than 10 students of a certain age, then age is considered PII, as it may be used with other information to identify these students. The value of this threshold depends on the context of the survey.
 
The guidelines for dealing with PII are discussed below, but four common solutions are to (1) restrict access to the data, (2) drop PII variables, (3) use anonymous codes for categorical variables, and (4) mask variable values. The first two solutions make the data unavailable, while the last two alter the information shared relative to the original survey data.
 
  
==Access restriction==
+
The remainder of this page details how to de-identify a dataset before saving it to the De-Identified Folder.
Data sets that are only available to the research team may contain identifiable information, while publicly released data, such as analysis data sets submitted as replication files for an academic paper, must be carefully de-identified. In between these two extremes, it is also common to share some relatively identifiable data under conditional access. The conditions required to access the data depend on how easy it is to identify an individual from it.
 
==De-identification==
 
There are different ways to de-identify data sets, resulting in different levels of information loss. It is advisable to remove directly identifying variables such as names and contact information as early as possible in the project and to store them under encryption, but which other information should be de-identified depends on how relevant the information is to the research question and on who has access to the data.
 
Any identifiable information that is not related to the research question should be dropped, but there's a trade-off between ensuring data privacy and losing information and results quality when dealing with relevant variables. For example, a common practice is to create perturbed data, meaning some change is made to the shared variable compared to the original survey. Different methods to introduce change affect regression results and inference in different ways, and it is important to document the type of changes introduced so researchers can take this into account.
 
  
=== Drop variables===
+
== Dropping PII Not Necessary for Analysis==
Variables such as individual names (including survey respondents, family members, employees, and enumerators), household coordinates, birth dates, contact information, IP addresses, and job positions should be dropped. This applies to any PII that is not necessary for analysis. Such variables may be needed for high-frequency checks, back-checks, and monitoring of intervention implementation and survey progress, but they should be dropped from any data set not used for those purposes.
 
  
===Encode variables===
+
To begin de-identification, drop all PII variables not necessary for analysis. This may include household coordinates; birth dates; contact information; IP address; and/or the names of survey respondents, family members, employees, and enumerators. If the research team later needs this information for follow-up surveys, high-frequency checks, [[Back Checks | back-checks]], or other monitoring, they should refer to the [[DataWork_Folder#Survey_Encrypted_Data | Survey Encrypted Data Folder]]. Otherwise, the data regularly handled by the research team should not include this information.  
Personally identifiable categoric variables that are needed for analysis, such as administrative units, ethnicity, etc, can be de-identified by encoding. That means dropping the [https://dimewiki.worldbank.org/wiki/Data_Cleaning#Labels value label] of a factor variable, so it is possible to tell which individuals are in the same group, but not what group that is.  Be careful to use [https://dimewiki.worldbank.org/wiki/ID_Variable_Properties#Fifth_property:_Anonymous_IDs anonymous IDs] in this case, not some pre-existing code such as the State code used by the National Statistics Bureau or other authority.
 
  
===Mask values===
+
== De-identifying PII Necessary for Analysis==
For numeric variables that are related to the research question and may be used to identify individuals, there are different methods that can be used to limit disclosure. This is necessary if the data is publicly available. Some of the most used methods, as well as their advantages and disadvantages, are discussed below. See [https://dimewiki.worldbank.org/wiki/De-identification#Additional_Resources Additional Resources] for more detailed information on how to implement each of them.
 
When editing a variable’s values, make sure to do it in a way that cannot be reversed, for example by adding different random values to different variables and observations. If you instead shift every GPS coordinate two kilometres south, the original coordinates can easily be traced back. Similarly, if you create one single noise variable with different values for each observation and add it to multiple variables to de-identify them, their original values can be recovered more easily than if you add different noise to each variable.
 
* '''Categorization''': continuous variables can be transformed into categoric variables. This is done by reporting such variable in ranges instead of an individual’s specific value. For example, you can categorize ages and say that an individual is between 18 and 25 years old instead of 22. The range of each category will depend on how many individual observations exist in each of them.
 
* '''Micro-aggregation''': This is done by forming groups with a certain number of observations and substituting the individual values with the group mean. This may affect estimation because, even though the variable mean is not affected, the variance is. However, this change in the variance is small if the groups are small.
 
* '''Adding noise''': white noise can be created by generating a new variable with mean zero and positive variance and adding it to the original variable. This causes the variable’s variance to be altered, therefore affecting inference.
 
* '''Rounding''': consists of defining a rounding base, often randomly, and rounding each observation to its nearest multiple.
 
* '''Top-coding''': when only a few extreme values can be individually identified, such values can be capped so that, for example, any farmer producing more than a certain quantity of a crop is assigned that quantity.
 
  
===Anonymous IDs===
+
Next, de-identify all PII necessary for analysis by masking or encoding variables. When choosing between methods of masking and encoding, researchers face a trade-off between ensuring data privacy and losing information and thus results quality: different methods alter regression results and inference in different ways. This section details methods and limitations.
When a survey sample comes from a previously existing registry, or when survey data needs to be matched to administrative data, it is common to use a pre-existing ID variable from that registry or database, e.g. State codes or clinic registries. Note that if these codes are publicly available, the data set created with them will still be personally identifiable, even if all names are deleted.
 
  
In general, it is not recommended to use IDs that people outside the team have access to. It is preferable to create a new, anonymous code. However, there are exceptions to this general rule. Read the [https://dimewiki.worldbank.org/wiki/ID_Variable_Properties#Fifth_property:_Anonymous_IDs Anonymous IDs] article for more information on how to deal with this specific issue.
+
===Encoding Categorical Variables===
 +
Encoding is the process of de-identifying categorical PII variables needed for analysis (e.g. administrative units, ethnicity) by dropping the [https://dimewiki.worldbank.org/wiki/Data_Cleaning#Labels value label] of a factor variable. The unlabeled data then indicates which individuals are in the same group, but not what the group is. When encoding categorical variables, avoid using pre-existing codes such as State codes used by the National Statistics Bureau or another authority, as this would no longer constitute de-identification. Instead, use [https://dimewiki.worldbank.org/wiki/ID_Variable_Properties#Fifth_property:_Anonymous_IDs anonymous IDs] to encode variables.
  
== Back to Parent ==
+
===Masking Continuous Variables===
This article is part of the topic [[Data Cleaning]]
 
  
 +
Masking is the process of limiting disclosure of continuous PII variables needed for analysis. Some of the most used methods, as well as their advantages and disadvantages, are discussed below. See [[De-identification#Additional_Resources | Additional Resources]] for more detailed information on how to implement each of them.
 +
 +
* '''Categorization''' is the process of transforming continuous variables into categorical variables by reporting a range rather than the specific value. For example, a 22-year-old individual might be classified as “between 18 and 25 years old.” The range of each category will depend on how many individual observations fall within it.
 +
* '''Micro-aggregation''' is the process of forming groups with a certain number of observations and substituting the individual values with the group mean. This method alters the variable variance and, accordingly, may affect estimation. However, the change in variance is small if the groups are small.
 +
* '''Adding noise''' is the process of creating white noise by generating and adding to the original variable a new variable with mean zero and positive variance. This method alters the original variable’s variance, therefore affecting inference.
 +
* '''Rounding''' is the process of defining, often randomly, a rounding base and rounding each observation to its nearest multiple.
 +
* '''Top-coding''' is used when only a few extreme values can be individually identified. In this process, extremely high values are rounded so that, for example, any farmers producing more than a certain quantity of a crop are assigned that quantity.
 +
 +
When masking a variable, make sure to do so in a way that a third party could not reverse to uncover the true value. For example, if you dislocate every GPS coordinate two kilometers south, one could easily trace the value back to the original coordinates. Similarly, if you create one single noise variable with different values for each observation and add it to multiple variables to de-identify them, their original value can be obtained more easily than if you add different noises to different variables.
 +
 +
It is important to [[Data Documentation | document]] any changes made to variables during de-identification so that researchers can take them into account when conducting analysis and interpreting results. Save this documentation in a secure, encrypted location.
 +
 +
== Anonymizing Data ==
 +
 +
When a survey sample comes from a previously existing registry, or when survey data needs to be matched to [[Administrative_and_Monitoring_Data#Administrative_Data|administrative data]], it is common to use a pre-existing [[ID Variable Properties|ID variable]] from the same registry or database, for example, State codes or clinic registries. Since people outside of the [[Impact Evaluation Team|research team]] have access to these IDs, there is no way to guarantee [[Data Security|protection]] or privacy of the collected data. In such cases, it is a best practice to create a new ID variable with no association to the external ID. There are however some [[ID_Variable_Properties#Fifth_property:_Anonymous_IDs|exceptions]] to this general rule.
 +
 +
== Statistical Disclosure Control and sdcMicro ==
 +
 +
Another important aspect of dealing with statistical data is '''statistical disclosure control (SDC)'''. The concept of '''SDC''' seeks to modify and treat the data in such a way that the data can be [[Publishing Data|published]] or released without revealing the [[Personally Identifiable Information (PII)|confidential information]] it contains. At the same time, '''SDC''' tries to limit information loss from the '''data anonymization'''.
 +
 +
In this regard, [https://www.jstatsoft.org/article/view/v067i04 Matthias Templ, Alexander Kowarik and Bernhard Meindl] have created the <code>sdcMicro</code> package in R. This package can be used for generating '''anonymized''' [[Microdata Catalog|microdata]], that is, data for public and scientific use. To install and load <code>sdcMicro</code> in R, run the following commands:
 +
<syntaxhighlight lang="R" line>install.packages("sdcMicro") # install the sdcMicro package (needed only once)
 +
library(sdcMicro) # load the sdcMicro package </syntaxhighlight>
 +
 +
Further, as part of its efforts to help researchers with '''anonymizing data''', the [https://ihsn.org/ International Household Survey Network (IHSN)] has released the following resources:
 +
* A [https://sdcpractice.readthedocs.io/en/latest/ practice guide] for <code>sdcMicro</code>,
 +
* An [https://sdctheory.readthedocs.io/en/latest/ introduction] to the theory of '''SDC''', and
 +
* A [https://sdcappdocs.readthedocs.io/en/latest/ guide] to using the '''graphical user interface (GUI)''' for <code>sdcMicro</code>.
 +
 +
== Related Pages ==
 +
[[Special:WhatLinksHere/De-identification|Click here for pages that link to this topic.]]
  
 
== Additional Resources ==
 
*[https://projecteuclid.org/download/pdfview_1/euclid.ssu/1296828958 Matthews, Gregory J., and Ofer Harel. "Data confidentiality: A review of methods for statistical disclosure limitation and methods for assessing privacy." Statistics Surveys 5 (2011): 1-29.]
+
* DIME Analytics (World Bank), [https://osf.io/5p68f/ Research Ethics & Data Security]
*[http://repository.cmu.edu/jpc/vol2/iss1/7/ Shlomo, Natalie (2010) "Releasing Microdata: Disclosure Risk Estimation, Data Masking and Assessing Utility," Journal of Privacy and Confidentiality: Vol. 2 : Iss. 1 , Article 7. ]
+
* DIME Analytics (World Bank), [https://osf.io/zakgv/ Encryption 101]
*[https://nces.ed.gov/pubs2011/2011603.pdf Guidelines for Protecting PII from the Institute of Education Sciences]
+
* Institute of Education Sciences (IES), [https://nces.ed.gov/pubs2011/2011603.pdf Guidelines for protecting personally identifiable information (PII)]
 +
* J-PAL, [https://www.povertyactionlab.org/sites/default/files/resources/J-PAL-guide-to-deidentifying-data.pdf Guide to De-identifying Data]
 +
* Ori Heffetz and Katrina Ligett, [https://pubs.aeaweb.org/doi/pdfplus/10.1257/jep.28.2.75 Privacy and Data Based Research]
 +
* Thijs Benschop and Matthew Welch, [http://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.46/2019/mtg1/SDC2019_S1_2_IO_Benschop_Welch_P.pdf A Practice Guide for Microdata Anonymization]
 
[[Category: Data Cleaning]] [[Category: Publishing Data]]
 

Latest revision as of 20:10, 17 November 2020

De-identification is the process of removing or masking personally identifiable information (PII) in order to reduce the risk that subjects’ identities be connected with data. De-identification is a critical component of ethical human subjects research. This page will discuss how to handle and de-identify incoming PII data before cleaning, analyzing, or publishing data.

Read First

  • In general, the research team should always work with and analyze de-identified data, except when planning follow-up data collection or monitoring data.
  • Publicly released data or replication data shared with other researchers must always be carefully de-identified.
  • To de-identify data, 1) drop PII variables not necessary for the analysis, then 2) de-identify PII variables necessary for the analysis by masking, encoding, and anonymizing. For more details on what constitutes PII, see Personally Identifiable Information (PII).

Data Flow

The following steps ensure proper handling and storage of PII:

  1. Save the raw, identified data to the Survey Encrypted Data folder, housed in the Encrypted Round Folder. The data in this folder should be exactly as you got it: absolutely no changes should be made to it.
  2. De-identify the data by dropping the PII variables not necessary for analysis and masking, coding, or anonymizing the PII variables necessary for the analysis. Make sure to create reproducible do-files for the de-identification process. Save these do-files in the Dofiles Import Folder, housed in the Encrypted Round Folder.
  3. Save the de-identified data set in the De-identified Folder, housed in the DataSets Folder. This is the raw data set with which the research team will begin to work.

In general, the research team should only use the data in the Survey Encrypted Data folder to plan follow-up data collection or to monitor data quality. Otherwise, the research team should work with de-identified data in the DataSets Folder. If necessary, the research team can work with data sets containing PII for purposes other than follow-ups and monitoring, provided that they take special measures to keep the data set secure and protected. Note, however, that not all file sharing services facilitate secure sharing of encrypted files.

The remainder of this page details how to de-identify a dataset before saving it to the De-Identified Folder.

Dropping PII Not Necessary for Analysis

To begin de-identification, drop all PII variables not necessary for analysis. This may include household coordinates; birth dates; contact information; IP address; and/or the names of survey respondents, family members, employees, and enumerators. If the research team later needs this information for follow-up surveys, high-frequency checks, back-checks, or other monitoring, they should refer to the Survey Encrypted Data Folder. Otherwise, the data regularly handled by the research team should not include this information.
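
As a minimal sketch of this step in R, the snippet below drops a list of PII columns from a small made-up data set. The data frame and every column name in it are hypothetical and only serve the illustration; the list of variables to drop has to be decided for each survey.

```r
# Hypothetical raw survey data; all column names are assumptions for illustration.
raw <- data.frame(
  hh_id        = 1:3,
  resp_name    = c("A", "B", "C"),
  phone        = c("555-0101", "555-0102", "555-0103"),
  gps_lat      = c(-1.95, -1.96, -1.97),
  gps_lon      = c(30.06, 30.07, 30.08),
  plot_size_ha = c(0.5, 1.2, 0.8),
  stringsAsFactors = FALSE
)

# PII variables that are not needed for analysis (assumed list).
pii_to_drop <- c("resp_name", "phone", "gps_lat", "gps_lon")

# Stop if an expected PII column is missing, rather than silently skipping it.
stopifnot(all(pii_to_drop %in% names(raw)))

# Keep every column that is not on the drop list.
deidentified <- raw[, setdiff(names(raw), pii_to_drop), drop = FALSE]
```

Listing the variables to drop explicitly, rather than the variables to keep, is a design choice: it keeps the script readable as a record of what was removed, at the cost of having to update the list when new PII variables are added to the survey.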

De-identifying PII Necessary for Analysis

Next, de-identify all PII necessary for analysis by masking or encoding variables. When choosing between methods of masking and encoding, researchers face a trade-off between ensuring data privacy and losing information and thus results quality: different methods alter regression results and inference in different ways. This section details methods and limitations.

Encoding Categorical Variables

Encoding is the process of de-identifying categorical PII variables needed for analysis (e.g. administrative units, ethnicity) by dropping the value label of a factor variable. The unlabeled data then indicates which individuals are in the same group, but not what the group is. When encoding categorical variables, avoid using pre-existing codes such as State codes used by the National Statistics Bureau or another authority, as this would no longer constitute de-identification. Instead, use anonymous IDs to encode variables.
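
One possible way to encode a categorical variable in R is sketched below. The data, the variable names, and the seed are hypothetical; the codes are assigned in random order so that they do not mirror alphabetical order or any official code.

```r
set.seed(20201117)  # hypothetical seed, set only so the script is reproducible

# Hypothetical data: households and the district they live in.
survey <- data.frame(
  hh_id    = 1:6,
  district = c("North", "South", "North", "East", "South", "East"),
  stringsAsFactors = FALSE
)

# Map each distinct label to a randomly ordered integer code.
labels <- unique(survey$district)
codes  <- sample(seq_along(labels))
names(codes) <- labels

# Households in the same district receive the same code.
survey$district_code <- unname(codes[survey$district])

# The label-to-code crosswalk re-identifies the groups, so it belongs
# with the identified data, not with the working data set.
crosswalk <- data.frame(district = labels, district_code = unname(codes),
                        stringsAsFactors = FALSE)

# Remove the labelled variable from the working data set.
survey$district <- NULL
```

Because the script and its seed make it possible to regenerate the codes, they should be treated with the same care as the crosswalk itself.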

Masking Continuous Variables

Masking is the process of limiting disclosure of continuous PII variables needed for analysis. Some of the most used methods, as well as their advantages and disadvantages, are discussed below. See Additional Resources for more detailed information on how to implement each of them.

  • Categorization is the process of transforming continuous variables into categorical variables by reporting a range rather than the specific value. For example, a 22-year-old individual might be classified as “between 18 and 25 years old.” The range of each category will depend on how many individual observations fall within it.
  • Micro-aggregation is the process of forming groups with a certain number of observations and substituting the individual values with the group mean. This method alters the variable variance and, accordingly, may affect estimation. However, the change in variance is small if the groups are small.
  • Adding noise is the process of creating white noise by generating and adding to the original variable a new variable with mean zero and positive variance. This method alters the original variable’s variance, therefore affecting inference.
  • Rounding is the process of defining, often randomly, a rounding base and rounding each observation to its nearest multiple.
  • Top-coding is used when only a few extreme values can be individually identified. In this process, extremely high values are rounded so that, for example, any farmers producing more than a certain quantity of a crop are assigned that quantity.
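
Four of the methods above (categorization, micro-aggregation, rounding, and top-coding) can be sketched in base R as follows. The data, the category breaks, the group size, the rounding base, and the cap are all arbitrary assumptions chosen for the illustration, and the micro-aggregation shown is a simple univariate version with fixed-size groups.

```r
# Hypothetical data: ages and crop output for ten respondents.
age    <- c(19, 22, 24, 31, 35, 38, 44, 47, 52, 58)
output <- c(120, 150, 90, 300, 210, 180, 2500, 160, 140, 4000)

# Categorization: report an age range instead of the exact age.
age_group <- cut(age, breaks = c(18, 25, 40, 60), include.lowest = TRUE)

# Micro-aggregation: sort the values, form groups of k neighbouring
# observations, and replace each value with the mean of its group.
k   <- 5
ord <- order(output)
grp <- ceiling(seq_along(output) / k)
output_microagg <- numeric(length(output))
output_microagg[ord] <- ave(output[ord], grp)

# Rounding: round each observation to the nearest multiple of a base.
base <- 50
output_rounded <- base * round(output / base)

# Top-coding: assign the cap to every value above it.
cap <- 300
output_topcoded <- pmin(output, cap)
```

In this sketch the micro-aggregated variable keeps the original mean while its variance shrinks, which is the effect on estimation described in the list above.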

When masking a variable, make sure to do so in a way that a third party could not reverse to uncover the true value. For example, if you dislocate every GPS coordinate two kilometers south, one could easily trace the value back to the original coordinates. Similarly, if you create one single noise variable with different values for each observation and add it to multiple variables to de-identify them, their original value can be obtained more easily than if you add different noises to different variables.
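
A minimal sketch of the safer approach, in R, draws a separate noise vector for each variable instead of reusing one. The data, the seed, and the choice of noise equal to 10% of each variable's standard deviation are assumptions for illustration only.

```r
set.seed(42)  # hypothetical seed; it allows the noise to be regenerated, so keep it secure

# Hypothetical data with two continuous variables to be masked.
df <- data.frame(
  income    = c(1200, 950, 3100, 780, 1500),
  plot_size = c(0.5, 1.2, 0.8, 2.0, 0.3)
)

# Share of each variable's standard deviation used as the noise
# standard deviation (arbitrary assumption).
noise_share <- 0.10

# lapply() calls rnorm() once per column, so each variable receives
# its own independent mean-zero noise draw.
masked <- as.data.frame(lapply(df, function(x) {
  x + rnorm(length(x), mean = 0, sd = noise_share * sd(x))
}))
```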

It is important to document any changes made to variables during de-identification so that researchers can take them into account when conducting analysis and interpreting results. Save this documentation in a secure, encrypted location.

Anonymizing Data

When a survey sample comes from a previously existing registry, or when survey data needs to be matched to administrative data, it is common to use a pre-existing ID variable from the same registry or database, for example, State codes or clinic registries. Since people outside of the research team have access to these IDs, there is no way to guarantee protection or privacy of the collected data. In such cases, it is a best practice to create a new ID variable with no association to the external ID. There are however some exceptions to this general rule.
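
As a sketch under stated assumptions, a new ID with no association to the external one could be created in R as shown below. The registry, the clinic codes, and the variable names are made up for the example.

```r
set.seed(1234)  # hypothetical seed, set only so the script is reproducible

# Hypothetical registry data keyed by an external, publicly known code.
registry <- data.frame(
  clinic_code = c("KG-0041", "KG-0107", "KG-0012", "KG-0388"),
  patients    = c(210, 95, 340, 120),
  stringsAsFactors = FALSE
)

# Assign IDs in random order so that the new ID carries no information
# about the external code or its sort order.
registry$anon_id <- sample(nrow(registry))

# The crosswalk linking both IDs is itself identifying: store it only
# with the encrypted, identified data.
crosswalk <- registry[, c("clinic_code", "anon_id")]

# The working data set keeps only the anonymous ID.
working <- registry[, c("anon_id", "patients")]
```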

Statistical Disclosure Control and sdcMicro

Another important aspect of dealing with statistical data is statistical disclosure control (SDC). The concept of SDC seeks to modify and treat the data in such a way that the data can be published or released without revealing the confidential information it contains. At the same time, SDC tries to limit information loss from the data anonymization.

In this regard, Matthias Templ, Alexander Kowarik and Bernhard Meindl have created the sdcMicro package in R. This package can be used for generating anonymized microdata, that is, data for public and scientific use. To install and load sdcMicro in R, run the following commands:

install.packages("sdcMicro") # install the sdcMicro package (needed only once)
library(sdcMicro) # load the sdcMicro package

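Further, as part of its efforts to help researchers with anonymizing data, the International Household Survey Network (IHSN) has released the following resources:

  • A practice guide for sdcMicro,
  • An introduction to the theory of SDC, and
  • A guide to using the graphical user interface (GUI) for sdcMicro.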

Related Pages

Click here for pages that link to this topic.

Additional Resources