Difference between revisions of "De-identification"
Latest revision as of 14:00, 17 August 2023
De-identification is the process of removing or masking personally identifiable information (PII) in order to reduce the risk that subjects’ identities can be connected with the data. De-identification is a critical component of ethical human subjects research. This page discusses how to handle and de-identify incoming data that contains PII before cleaning, analyzing, or publishing it.
Read First
- In general, the research team should always work with and analyze de-identified data, except when planning follow-up data collection or monitoring data quality.
- Publicly released data or replication data shared with other researchers must always be carefully de-identified.
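- To de-identify data:
  - drop PII variables not necessary for the analysis
  - de-identify PII variables necessary for the analysis by masking, encoding, and anonymizing.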
Data Flow
The following steps ensure proper handling and storage of PII:
- Save the raw, identified data to the Survey Encrypted Data folder, housed in the Encrypted Round Folder. The data in this folder should be exactly as you got it: absolutely no changes should be made to it.
- De-identify the data by dropping the PII variables not necessary for analysis and masking, encoding, or anonymizing the PII variables necessary for the analysis. Make sure to create reproducible do-files for the de-identification process. Save these do-files in the Dofiles Import Folder, housed in the Encrypted Round Folder.
- Save the de-identified dataset in the De-identified Folder, housed in the DataSets Folder. This is the raw dataset with which the research team will begin to work.
In general, the research team should only use the data in the Survey Encrypted Data folder to plan follow-up data collection or to monitor data quality. Otherwise, the research team should work with de-identified data in the DataSets Folder. If necessary, the research team can work with datasets containing PII for purposes other than follow-ups and monitoring, provided that they take special measures to ensure that the dataset is secure and protected. Note, however, that not all file sharing services facilitate secure sharing of encrypted files.
The remainder of this page details how to de-identify a dataset before saving it to the De-Identified Folder.
Dropping PII Not Necessary for Analysis
To begin de-identification, drop all PII variables not necessary for data analysis. This may include household coordinates; birth dates; contact information; IP address; and/or the names of survey respondents, family members, employees, and enumerators. If the research team later needs this information for follow-up surveys, high-frequency checks, back-checks, or other monitoring, they should refer to the Survey Encrypted Data Folder. Otherwise, the data regularly handled by the research team should not include this information.
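As an illustration of this step, the following is a minimal sketch in base R, assuming a small hypothetical dataset. All variable names and values are invented for the example and do not come from any real survey.

```r
# Hypothetical raw survey data; all column names and values are illustrative.
raw <- data.frame(
  hh_id           = c(101, 102, 103),
  respondent_name = c("A. Silva", "B. Khan", "C. Mensah"),
  phone_number    = c("555-0101", "555-0102", "555-0103"),
  gps_latitude    = c(-1.95, -1.97, -2.01),
  gps_longitude   = c(30.06, 30.10, 30.12),
  birth_date      = as.Date(c("1985-03-14", "1990-11-02", "1978-07-23")),
  crop_yield_kg   = c(420, 515, 380),
  stringsAsFactors = FALSE
)

# PII variables that the analysis does not need (an assumed list).
pii_to_drop <- c("respondent_name", "phone_number",
                 "gps_latitude", "gps_longitude", "birth_date")

# Keep every column that is not on the drop list.
deidentified <- raw[, !(names(raw) %in% pii_to_drop), drop = FALSE]

# Fail loudly if any listed PII variable survived.
stopifnot(!any(pii_to_drop %in% names(deidentified)))
```

Listing the variables to drop explicitly, rather than the variables to keep, is a design choice for this sketch only. A keep-list may be safer when new PII variables can appear in later survey rounds.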
De-identifying PII Necessary for Analysis
Next, de-identify all PII necessary for analysis by masking or encoding variables. When choosing between methods of masking and encoding, researchers face a trade-off between ensuring data privacy and losing information and thus results quality: different methods alter regression results and inference in different ways. This section details methods and limitations.
Encoding Categorical Variables
Encoding is a process of de-identifying PII categorical variables needed for analysis (e.g., administrative units or ethnicity) by dropping the value label of a factor variable. The unlabeled data then indicates which individuals are in the same group, but not what the group is. When encoding categorical variables, avoid using pre-existing codes such as State codes used by the National Statistics Bureau or another authority, as this would no longer constitute de-identification. Instead, use anonymous IDs to encode variables.
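One possible way to encode a categorical variable with anonymous codes is sketched below in base R. The dataset, the variable names (`district`, `district_code`) and the seed are assumptions made for illustration only.

```r
set.seed(20230817)  # assumed seed; it determines the codes, so store it securely

# Hypothetical data: 'district' is a labeled categorical PII variable.
survey <- data.frame(
  district = c("North", "South", "North", "East", "South", "East"),
  income   = c(120, 95, 130, 80, 110, 70),
  stringsAsFactors = FALSE
)

# Draw one random code per distinct group so that the codes carry no meaning
# (unlike alphabetical order or official administrative codes).
groups <- unique(survey$district)
codes  <- sample(seq_along(groups))
names(codes) <- groups

# Crosswalk from label to code: keep this only in the encrypted folder.
crosswalk <- data.frame(district = groups, district_code = unname(codes))

# Replace the labeled variable with the anonymous code.
survey$district_code <- unname(codes[survey$district])
survey$district <- NULL
```

After this step the working data still shows which observations share a district, but not which district it is. The crosswalk is what allows the team to recover the labels if needed, so it belongs with the identified data.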
Masking Continuous Variables
Masking is the process of limiting disclosure of continuous PII variables needed for analysis. Some of the most used methods, as well as their advantages and disadvantages, are discussed below. See the Additional Resources section below for more detailed information on how to implement each of them.
- Categorization is the process of transforming continuous variables into categorical variables by reporting a variable range rather than its specific value. For example, a 22-year-old individual might be classified as “between 18 and 25 years old.” The range of each category will depend on how many individual observations fall into it.
- Micro-aggregation is the process of forming groups with a certain number of observations and substituting the individual values with the group mean. This method reduces the variance of the variable and, accordingly, may affect estimation. However, the change in variance is small if the groups are small.
- Adding noise is the process of generating a new white-noise variable with mean zero and positive variance and adding it to the original variable. This method alters the variance of the variable and therefore affects inference.
- Rounding is the process of defining, often randomly, a rounding base and rounding each observation to its nearest multiple.
- Top-coding is used when only a few extreme values can be individually identified. In this process, extremely high values are capped at a threshold so that, for example, any farmers producing more than a certain quantity of a crop are assigned that quantity.
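The five methods above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not a recommended implementation: the data, the category breaks, the group size `k`, the noise scale, the rounding base (fixed here rather than drawn randomly) and the cap are all hypothetical choices, and `microaggregate`, `round_to_base` and `top_code` are helper names invented for this example.

```r
set.seed(1)  # assumed seed, used only to make the noise draw reproducible

age   <- c(22, 31, 47, 19, 65, 38)          # hypothetical ages
yield <- c(410, 395, 5200, 430, 388, 402)   # hypothetical crop yields (kg)

# Categorization: report an age range instead of the exact age.
age_group <- cut(age, breaks = c(18, 26, 41, 66), right = FALSE,
                 labels = c("18-25", "26-40", "41-65"))

# Micro-aggregation: sort, form groups of k, replace values by the group mean.
# If length(x) is not a multiple of k, the last group is smaller than k.
microaggregate <- function(x, k) {
  ord    <- order(x)
  groups <- ceiling(seq_along(x) / k)
  means  <- ave(x[ord], groups)        # group mean for each sorted value
  out    <- numeric(length(x))
  out[ord] <- means                    # restore the original row order
  out
}
age_micro <- microaggregate(age, k = 3)

# Adding noise: independent mean-zero draws with positive variance.
yield_noisy <- yield + rnorm(length(yield), mean = 0, sd = 0.05 * sd(yield))

# Rounding: round each value to the nearest multiple of a base.
round_to_base <- function(x, base) base * round(x / base)
yield_rounded <- round_to_base(yield, base = 50)

# Top-coding: values above a cap are replaced by the cap.
top_code <- function(x, cap) pmin(x, cap)
yield_top <- top_code(yield, cap = 500)
```

In this toy data, micro-aggregation leaves the mean of `age` unchanged while shrinking its variance, and top-coding only changes the single extreme yield. Which parameters are appropriate depends on the distribution of the real data and on the disclosure risk the team is willing to accept.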
When masking a variable, make sure to do so in a way that a third party could not reverse to uncover the true value. For example, if you displace every GPS coordinate two kilometers south, one could easily trace the values back to the original coordinates. Similarly, if you create one single noise variable with different values for each observation and add it to multiple variables to de-identify them, their original values can be recovered more easily than if you add a separate noise variable to each variable.
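To see why reusing one noise variable is risky, consider the following small sketch in base R. The variables `income` and `savings`, their values, the noise scale and the seed are hypothetical.

```r
set.seed(42)  # assumed seed, for a reproducible illustration

income  <- c(1200, 950, 1800, 700)   # hypothetical values
savings <- c(300, 120, 640, 50)

# Risky: one noise variable reused for both variables.
shared_noise   <- rnorm(length(income), mean = 0, sd = 100)
income_shared  <- income  + shared_noise
savings_shared <- savings + shared_noise

# The shared noise cancels out, so the masked data reveal the exact
# difference between the two original variables for every observation.
leaked_gap <- income_shared - savings_shared
true_gap   <- income - savings

# Safer: draw a separate noise variable for each variable.
income_masked  <- income  + rnorm(length(income),  mean = 0, sd = 100)
savings_masked <- savings + rnorm(length(savings), mean = 0, sd = 100)
masked_gap <- income_masked - savings_masked
```

With shared noise, `leaked_gap` equals `true_gap`, so anyone who learns one original value for an observation can recover the noise and therefore the other value. With independent draws, the difference between the masked variables no longer matches the true difference.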
It is important to document any changes made to variables during de-identification so that researchers can take them into account when conducting analysis and interpreting results. Save this documentation in a secure, encrypted location.
Anonymizing Data
When a survey sample comes from a previously existing registry, or when survey data needs to be matched to administrative data, it is common to use a pre-existing ID variable from the same registry or database, for example State codes or clinic registries. Since people outside of the research team have access to these IDs, there is no way to guarantee protection or privacy of the collected data. In such cases, it is a best practice to create a new ID variable with no association to the external ID. There are however some exceptions to this general rule.
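A minimal base R sketch of creating such an ID is shown below. The registry, the variable names (`clinic_registry_id`, `anon_id`), the ID range and the seed are all assumptions for illustration.

```r
set.seed(8675)  # assumed seed; keep it with the encrypted files

# Hypothetical sample drawn from a clinic registry.
registry <- data.frame(
  clinic_registry_id = c("KGL-0042", "KGL-0107", "MSZ-0011", "HUY-0230"),
  visits             = c(3, 1, 5, 2),
  stringsAsFactors = FALSE
)

# Assign IDs in random order so that they do not track the registry order.
n <- nrow(registry)
registry$anon_id <- sample(seq_len(n)) + 1000

# Crosswalk between external and new IDs: store only in the encrypted folder.
crosswalk <- registry[, c("clinic_registry_id", "anon_id")]

# Working dataset: external ID removed, anonymous ID kept.
deidentified <- registry[, c("anon_id", "visits")]

# The new ID must still uniquely identify each observation.
stopifnot(!anyDuplicated(deidentified$anon_id))
```

The crosswalk is the only link back to the external registry, so matching to administrative data would be done with the identified files and the crosswalk, not with the working dataset.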
Statistical Disclosure Control and sdcMicro
Another important aspect of dealing with statistical data is statistical disclosure control (SDC). The concept of SDC seeks to modify and treat the data in such a way that the data can be published or released without revealing the confidential information it contains. At the same time, SDC tries to limit information loss from the data anonymization.
In this regard, Matthias Templ, Alexander Kowarik and Bernhard Meindl have created the sdcMicro package in R. This package can be used for generating anonymized microdata, that is, data for public and scientific use. To install and load sdcMicro in R, run the following commands:
install.packages("sdcMicro") # installing the sdcMicro package (only needed once)
library(sdcMicro) # loading the sdcMicro package
Further, as part of its efforts to help researchers with anonymizing data, the International Household Survey Network (IHSN) has released the following resources:
- A practice guide for sdcMicro,
- An introduction to the theory of SDC, and
- A guide to using the graphical user interface (GUI) for sdcMicro.
Additional Resources
- DIME Analytics (World Bank), Protect Privacy and Share Data Securely
- DIME Analytics (World Bank), Encryption 101
- Institute of Education Sciences (IES), Guidelines for protecting personally identifiable information (PII)
- J-PAL, Guide to De-identifying Data
- Ori Heffetz and Katrina Ligett, Privacy and Data Based Research
- Thijs Benschop and Matthew Welch, A Practice Guide for Microdata Anonymization