Difference between revisions of "De-identification"

Revision as of 16:50, 17 November 2017

Read First

Some survey variables allow individual respondents to be identified. These variables are called Personally Identifiable Information (PII).
It is the responsibility of researchers to keep this data private and safely stored.
PII must be saved in encrypted folders and removed from data sets as early as possible in the project.
No PII can ever be publicly released without explicit consent.

Personally Identifiable Information

In the context of a survey, personally identifiable information (PII) refers to the variables that can, either on their own or in combination with other variables, be used to identify a single surveyed individual. The following variables may lead to personal identification:

Names of survey respondent, household members, enumerators and other individuals
Names of schools, clinics, villages and possibly other administrative units (depending on the survey)
Dates of birth
GPS coordinates
Contact information
Record identifiers (social security number, process number, medical record number, national clinic code, license plate, IP address)
Pictures (of individuals, houses, etc.)

A few examples of sensitive variables that, depending on the survey context, may contain personally identifying information:

Age
Gender
Ethnicity
Grades, salary, job position

As these variables show, what exactly counts as PII depends on the context of each survey. For example, if a survey covers a small farming community, variables such as plot size and crops cultivated can be combined to identify an individual household. Administrative units can be considered PII if there are few individuals in each of them. Guidelines for dealing with PII are discussed below, but three common solutions are (1) dropping PII variables, (2) using anonymous codes instead of names, and (3) introducing white noise.
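One way to see whether a combination of seemingly harmless variables singles out respondents is to count how many records share each combination of values. The following is a minimal sketch under stated assumptions, not a prescribed procedure: the function name `risky_combinations`, the variable names and the example households are all hypothetical.

```python
from collections import Counter


def risky_combinations(records, variables, min_group_size=2):
    """Return value combinations shared by fewer than min_group_size records."""
    counts = Counter(
        tuple(record[var] for var in variables) for record in records
    )
    return {combo: n for combo, n in counts.items() if n < min_group_size}


# Hypothetical survey of a small farming community.
households = [
    {"hh_id": 1, "plot_size_ha": 2.0, "main_crop": "maize"},
    {"hh_id": 2, "plot_size_ha": 2.0, "main_crop": "maize"},
    {"hh_id": 3, "plot_size_ha": 7.5, "main_crop": "coffee"},
]

# Combinations that occur only once point to a single household.
flagged = risky_combinations(households, ["plot_size_ha", "main_crop"])
print(flagged)  # {(7.5, 'coffee'): 1}
```

Any combination flagged this way is a candidate for one of the three solutions listed above. The threshold `min_group_size` is an illustrative choice and would depend on the survey.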

Guidelines

Folder Encryption

De-identification

Drop variables

Variables such as individual names (including those of survey respondents, family members, employees and enumerators), household coordinates, birth dates, contact information, IP addresses and job positions should be dropped. This applies to any PII that is not necessary for analysis. Such variables may be needed for high-frequency checks, back-checks and monitoring of intervention implementation and survey progress, but they should be dropped from any data set that is not used exactly for those purposes.
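As an illustration, dropping PII can be as simple as keeping an explicit list of PII variables and removing them when the analysis data set is created. This is a sketch only, assuming the data is held as a list of dictionaries; the function `drop_pii` and all variable names and values are hypothetical.

```python
def drop_pii(records, pii_variables):
    """Return copies of the records without the listed PII variables."""
    pii = set(pii_variables)
    return [
        {name: value for name, value in record.items() if name not in pii}
        for record in records
    ]


# Hypothetical list of PII variables for this survey.
PII_VARIABLES = ["respondent_name", "phone", "date_of_birth", "gps_lat", "gps_lon"]

raw = [
    {
        "hh_id": 101,
        "respondent_name": "A. Example",
        "phone": "555-0100",
        "date_of_birth": "1980-01-01",
        "gps_lat": -1.95,
        "gps_lon": 30.06,
        "income": 1200,
    },
]

# The raw data stays untouched (and encrypted); only the copy is shared.
analysis = drop_pii(raw, PII_VARIABLES)
print(analysis)  # [{'hh_id': 101, 'income': 1200}]
```

Keeping the list of PII variables in one place makes it easier to review what is removed and to apply the same rule to every data set derived from the raw data.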

Encode variables

Personally identifiable categorical variables that are needed for analysis, such as administrative units or ethnicity, can be de-identified by encoding. That means dropping the value labels of a factor variable, so that it is possible to tell which individuals are in the same group, but not which group that is. Be careful to use anonymous IDs in this case, not a pre-existing code such as the State code used by the National Statistics Bureau or another authority.
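The encoding step could look roughly like the sketch below, which replaces each group label with an arbitrary numeric code. The function `encode_groups`, the variable `district` and the example labels are hypothetical. The codes are shuffled on the assumption that codes assigned in alphabetical order could help someone guess the original labels.

```python
import random


def encode_groups(records, variable, rng=None):
    """Replace the labels of a categorical variable with arbitrary codes.

    Returns the encoded records and the codebook (label -> code).
    """
    rng = rng or random.SystemRandom()
    labels = sorted({record[variable] for record in records})
    codes = list(range(1, len(labels) + 1))
    rng.shuffle(codes)  # avoid codes that follow the alphabetical order
    codebook = dict(zip(labels, codes))
    encoded = [{**record, variable: codebook[record[variable]]} for record in records]
    return encoded, codebook


# Hypothetical data with an identifying administrative unit.
sample = [
    {"hh_id": 1, "district": "North"},
    {"hh_id": 2, "district": "South"},
    {"hh_id": 3, "district": "North"},
]

encoded, codebook = encode_groups(sample, "district")
# Households 1 and 3 still share a code, but the label itself is gone.
print(encoded[0]["district"] == encoded[2]["district"])  # True
```

The codebook links the codes back to the original labels, so it is itself identifying and would need to be stored with the encrypted PII, or deleted if it is no longer needed.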

Introduce white noise

For numeric variables that can be used to identify individuals, such as GPS coordinates, white noise (small random perturbations) can be added to the recorded values.
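A minimal sketch of this idea, assuming Gaussian noise with mean zero is an acceptable form of white noise for the variable in question. The function `add_white_noise`, the coordinates and the standard deviation are hypothetical; for reference, 0.01 degrees of latitude is roughly 1.1 km.

```python
import random


def add_white_noise(values, sd, rng=None):
    """Add independent Gaussian noise with mean 0 and standard deviation sd."""
    rng = rng or random.Random()
    return [value + rng.gauss(0.0, sd) for value in values]


# Hypothetical household latitudes and longitudes, in decimal degrees.
latitudes = [-1.9501, -1.9533, -1.9478]
longitudes = [30.0588, 30.0612, 30.0570]

# Noise is drawn separately for each coordinate of each household.
noisy_latitudes = add_white_noise(latitudes, sd=0.01)
noisy_longitudes = add_white_noise(longitudes, sd=0.01)
```

The amount of noise is a trade-off: too little leaves households identifiable, too much makes the variable useless for analysis. The seed or random state used should not be published, since it could be used to reverse the perturbation.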

Anonymous IDs

When a survey sample comes from a previously existing registry, or when survey data needs to be matched to administrative data, it is common to use a pre-existing ID variable from such a registry or database, e.g. State codes or clinic registries. Note that if these codes are publicly available, the data set created with them will still be personally identified, even if all names are deleted.

In general, it is not recommended to use IDs that people outside the team have access to. It would be preferable to create a new, anonymous code. However, there are exceptions to this general rule. Read the Anonymous IDs article (https://dimewiki.worldbank.org/wiki/ID_Variable_Properties#Fifth_property:_Anonymous_IDs) for more information on how to deal with this specific issue.
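Creating a new, anonymous code could be sketched as follows: each pre-existing ID is mapped to a random number, and the mapping (crosswalk) is kept separately. This is an illustration under stated assumptions, not the method described in the Anonymous IDs article; the function `make_anonymous_ids`, the six-digit format and the example registry codes are hypothetical.

```python
import secrets


def make_anonymous_ids(original_ids, n_digits=6):
    """Map each distinct pre-existing ID to a unique random n-digit ID."""
    lowest = 10 ** (n_digits - 1)
    n_available = 9 * lowest  # how many n-digit numbers exist
    distinct = list(dict.fromkeys(original_ids))  # drop repeats, keep order
    if len(distinct) > n_available:
        raise ValueError("not enough n-digit IDs for the number of units")

    crosswalk = {}
    used = set()
    for original in distinct:
        new_id = lowest + secrets.randbelow(n_available)
        while new_id in used:  # redraw on the rare collision
            new_id = lowest + secrets.randbelow(n_available)
        used.add(new_id)
        crosswalk[original] = new_id
    return crosswalk


# Hypothetical clinic registry codes that are publicly known.
registry_codes = ["CL-001", "CL-002", "CL-001"]
crosswalk = make_anonymous_ids(registry_codes)

# Only the new IDs go into the shared data set.
anonymous_ids = [crosswalk[code] for code in registry_codes]
```

Because the new IDs are random, they carry no information about the registry codes. The crosswalk does, so it would need to be stored in an encrypted folder alongside the other PII.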

Back to Parent

This article is part of the topic Data Analysis

Additional Resources

