Difference between revisions of "De-identification"

Jump to: navigation, search
Line 61: Line 61:
 
*[https://nces.ed.gov/pubs2011/2011603.pdf Guidelines for Protecting PII from the Institute of Education Siences]
 
*[https://nces.ed.gov/pubs2011/2011603.pdf Guidelines for Protecting PII from the Institute of Education Siences]
 
[[Category: Data Cleaning]] [[Category: Publishing Data]]
 
[[Category: Data Cleaning]] [[Category: Publishing Data]]
 +
*Heffetz and Ligett's [https://pubs.aeaweb.org/doi/pdfplus/10.1257/jep.28.2.75 Privacy and Data Based Research]

Revision as of 16:43, 16 April 2019

Read First

  • Some survey variables allow identification of individual respondents. This is called Personally Identifiable Information (PII). What variables are considered PII or not varies with the context of the survey. It is the responsibility of researchers to make sure this data is private and safely stored, and no PII can ever be publicly released without explicit consent
  • Variables including personally identifiable information that is not related to the research question should be dropped as soon as possible in the project, and all PII must be stored in an encrypted folder. PII variables that are needed for analysis can either encoded or masked, depending on the type of information they contain and who has access to the data

Personally Identifiable Information

In the context of a survey, Personally identifiable information (PII) are the variables that can, either on their own or in combination with other variables, lead to identifying a single surveyed individual with reasonable certainty. Here's a list of variables that may lead to personal identification:

  • Names of survey respondent, household members, enumerators and other individuals
  • Names of schools, clinics, villages and possibly other administrative units (depending on the survey)
  • Dates of birth
  • GPS coordinates
  • Contact information
  • Record identifier (social security number, process number, medical record number, national clinic code, license plate, IP address)
  • Pictures (of individuals, houses, etc)


A few examples of sensitive variables that depending on survey context may contain personally identifying information:

  • Age
  • Gender
  • Ethnicity
  • Grades, salary, job position


As these variables exemplify, what exactly is PII will depend on the context of each survey. For example, if a survey covers a small farming community, variables such as plot size and crops cultivated can be combined to identify an individual household. Administrative units can be considered PII if there are few individuals in each of them. Details on how to calculate the disclosure risk -- that is, the risk of someone being able to track individual respondents from the available data can be found in Additional Resources. It is common to define a threshold on the minimum number of individuals with a certain value of a variable that needs to be observed for it to be considered safe to disclose it. For example, if a school has less than 10 students of a certain age, then age is considered PII, as it may be used with other information to identify these students. The value of this thresholds depends on the context of the survey. The guidelines to deal with PII will be discussed below, but for common solutions are (1) restrict access to the data, (2) drop PII variables, (3) use anonymous codes for categoric variables, and (3) mask their values. The two first solutions make the data unavailable, while the last one edits the information shared when compared to the original survey data.

Access restriction

Data sets that are only available to the research team may contain identifiable information, and publicly released data, such as analysis datasets submitted as replication files for academic paper must be carefully de-identified. In between these two extremes, it is also common to share some relatively identifiable data under conditional access. The conditions required to access the data depend on how easy it is to identify an individual from it.

De-identification

There are different ways to de-identify data sets, resulting in different levels of information loss. It is advisable to remove immediately identifying variables such as names and contact information as early as possible in the project and stored under encryption, but what other information should be de-identified depends on how relevant the information is to the research question, and who has access to the data. Any identifiable information that is not related to the research question should be dropped, but there's a trade-off between ensuring data privacy and losing information and results quality when dealing with relevant variables. For example, a common practice is to create perturbed data, meaning some change is made to the shared variable compared to the original survey. Different methods to introduce change affect regression results and inference in different ways, and it is important to document the type of changes introduced so researchers can take this into account.

Drop variables

Variables such as individual names (including survey respondent, family members, employees, enumerators), household coordinates, birth dates, contact information, IP address, job position should be dropped. This applies to any PII that is not necessary for analysis. They may be needed for high-frequency checks, back-checks and monitoring of intervention implementation and survey progress, but should be dropped from any data sets that are not used exactly for that.

Encode variables

Personally identifiable categoric variables that are needed for analysis, such as administrative units, ethnicity, etc, can be de-identified by encoding. That means dropping the value label of a factor variable, so it is possible to tell which individuals are in the same group, but not what group that is. Be careful to use anonymous IDs in this case, not some pre-existing code such as the State code used by the National Statistics Bureau or other authority.

Mask values

For numeric variables that are related to the research question and may be used to identify individuals, there are different methods that can be used to limit disclosure. This is necessary if the data is publicly available. Some of the most used methods, as well as their advantages and disadvantages, are discussed below. See Additional Resources for more detailed information on how to implement each of them. When editing variable’s values, make sure to do it in a wait that cannot be reversed, for example by adding different random values to different variables and observations. For example, if you dislocate every GPS coordinate two kilometres South, the original coordinates can easily be traced back. Similarly, if you create one single noise variable with different values for each observation and add it to multiple variables to de-identify them, their original value can be obtained more easily than if you add different noises to different variables.

  • Categorization: continuous variables can be transformed into categoric variables. This is done by reporting such variable in ranges instead of an individual’s specific value. For example, you can categorize ages and say that an individual is between 18 and 25 years old instead of 22. The range of each category will depend on how many individual observations exist in each of them.
  • Micro-aggregation: This is done by forming groups with a certain number of observations and substituting the individual values with the group mean. This may affect estimation as even though the variable mean is not affected, the variance is. However, this change is the variance is small if the groups are small.
  • Adding noise: white noise can be created by generating a new variable with mean zero and positive variance and adding it to the original variable. This causes the variable’s variance to be altered, therefore affecting inference.
  • Rounding: consists in defining, often randomly, a rounding base and round each observation to its nearest multiple.
  • Top-coding: when only a few extreme values can be individually identified, such values can be rounded so that, for example, any farmers producing more than a certain quantity of a crop is assigned that quantity.

Anonymous IDs

When a survey sample comes from a previously existing registry, or when survey data needs to be matched to administrative data, it is common to use a pre-existing ID variable from such registry or database, e.g. as State codes or clinic registries. Note that if these codes are publicly available, the data set created with them will still be personally identified, even if all names are deleted.

In general, it is not recommended to use IDs that people outside the team have access to. It would be preferable to create a new, anonymous code. However, that are exceptions to this general rule. Read the Anonymous IDs article for more information on how to deal with this specific issue.

Back to Parent

This article is part of the topic Data Cleaning


Additional Resources