Personally Identifiable Information (PII)
Latest revision as of 19:10, 14 August 2023
In the context of a survey, personally identifiable information (PII) refers to variables that can, either on their own or in combination with other variables, be used to identify a single surveyed individual with reasonable certainty. During all steps of research and field work, research teams must protect PII through encryption and de-identification. This page explains how to identify PII and how to calculate its disclosure risk.
Read First
- All PII must be stored in an encrypted folder.
- PII should be masked, encoded, or removed from the working dataset and any shared or published datasets. See de-identification for details on how to de-identify data.
- No PII can ever be publicly released without explicit consent. Researchers must ensure that this data remains private and safely stored.
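As a rough illustration of the second point above, the sketch below removes PII from a working dataset and replaces it with a random ID. The column names, the `PII_COLUMNS` set, and the `deidentify` helper are hypothetical and not part of any DIME tool; the link table it returns contains PII and would itself need to be kept in the encrypted folder.

```python
import secrets

# Hypothetical list of PII columns; the real list depends on the survey.
PII_COLUMNS = {"name", "phone", "gps_lat", "gps_lon"}

def deidentify(records):
    """Split records into a working dataset without PII and a link table.

    The link table maps each random ID back to the removed PII values,
    so it must be stored encrypted and never shared.
    """
    working, link = [], {}
    for rec in records:
        anon_id = secrets.token_hex(8)  # random ID, not derived from any PII value
        link[anon_id] = {k: v for k, v in rec.items() if k in PII_COLUMNS}
        clean = {k: v for k, v in rec.items() if k not in PII_COLUMNS}
        clean["anon_id"] = anon_id
        working.append(clean)
    return working, link

# Made-up example data for illustration only.
respondents = [
    {"name": "Respondent One", "phone": "000-0001", "gps_lat": 0.0, "gps_lon": 0.0, "crop": "maize"},
    {"name": "Respondent Two", "phone": "000-0002", "gps_lat": 0.0, "gps_lon": 0.0, "crop": "coffee"},
]
working, link = deidentify(respondents)
print(sorted(working[0].keys()))
```

Whether removal alone is enough depends on the remaining variables; see de-identification for the full process.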
Personally Identifiable Information
Common PII variables include:
- Names of survey respondents, household members, enumerators and other individuals
- Names of schools, clinics, villages and/or other administrative units (depending on the survey)
- Date of birth
- GPS coordinates
- Contact information
- Record identifiers (e.g. social security number, process number, medical record number, national clinic code, license plate, IP address)
- Pictures of individuals or houses
Depending on survey context, the following variables may also be PII:
- Age
- Gender
- Ethnicity
- Grades, salary, job position
These lists aren’t exhaustive: what exactly is PII depends on the context of each survey. For example, if a survey covers a small farming community, variables such as plot size and crops cultivated could be combined to identify an individual household and, as such, would be PII. Administrative units could also be considered PII if there are few individuals in each of them.
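The farming-community example can be checked mechanically by looking for records whose combination of values is unique in the dataset. The following is a minimal sketch under stated assumptions: the `unique_combinations` helper, the variable names, and the data are all made up for illustration.

```python
from collections import Counter

def unique_combinations(records, variables):
    """Return the records whose combination of values on `variables` occurs only once."""
    def key(rec):
        return tuple(rec[v] for v in variables)

    counts = Counter(key(rec) for rec in records)
    return [rec for rec in records if counts[key(rec)] == 1]

# Made-up households; "hh" is only a row label for this example.
households = [
    {"hh": 1, "plot_size_ha": 2, "crop": "maize"},
    {"hh": 2, "plot_size_ha": 2, "crop": "maize"},
    {"hh": 3, "plot_size_ha": 2, "crop": "coffee"},
    {"hh": 4, "plot_size_ha": 5, "crop": "maize"},
]
at_risk = unique_combinations(households, ["plot_size_ha", "crop"])
print([rec["hh"] for rec in at_risk])  # [3, 4]
```

Neither variable identifies households 3 and 4 on its own, but the combination of the two does, which is why both would be treated as PII in this context.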
Disclosure Risk
To calculate disclosure risk, researchers typically define a threshold: the minimum number of individuals who must share a given value of a variable for that variable to be considered safe to disclose. If the threshold is not met, the variable is considered PII. For example, at a threshold of 10, if a school has fewer than 10 students of a certain age, then age is considered PII, as it could be combined with other information to identify these students. The appropriate threshold depends on the context of the survey. See publishing data for more details.
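Further, the US Census Bureau has published a brief on disclosure avoidance (https://www.census.gov/content/dam/Census/library/working-papers/2020/demo/disclosure_avoidance_and_the_census_brief.pdf), which lists the various forms of disclosure and steps to avoid them.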
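The threshold rule from the school example could be sketched as follows. This is an illustrative sketch, not an established procedure: the `values_below_threshold` helper, the variable names, and the counts are assumptions, and the threshold of 10 is simply the one used in the example.

```python
from collections import Counter

def values_below_threshold(records, group_var, variable, threshold=10):
    """Return the (group, value) cells that contain fewer than `threshold` individuals."""
    counts = Counter((rec[group_var], rec[variable]) for rec in records)
    return {cell: n for cell, n in counts.items() if n < threshold}

# Made-up student records: 15 and 4 students in school A, 11 in school B.
students = (
    [{"school": "A", "age": 12}] * 15
    + [{"school": "A", "age": 13}] * 4
    + [{"school": "B", "age": 12}] * 11
)
risky = values_below_threshold(students, "school", "age", threshold=10)
print(risky)  # {('A', 13): 4}
```

Here only four students in school A are aged 13, so under the rule above age would be treated as PII for this dataset.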
Additional Resources
- DIME Analytics (World Bank), Encryption 101 (https://osf.io/94aw2/)
- DIME Analytics (World Bank), Research Ethics (https://osf.io/6fcvk)
- J-PAL, pii_scan: A Stata program to scan for personally identifiable information (PII) (https://github.com/J-PAL/stata_PII_scan)
- Matthews and Harel (University of Connecticut), Data confidentiality: A review of methods for statistical disclosure limitation and methods for assessing privacy (https://projecteuclid.org/download/pdfview_1/euclid.ssu/1296828958)
- Natalie Shlomo (University of Southampton), Releasing Microdata: Disclosure Risk Estimation, Data Masking and Assessing Utility (https://journalprivacyconfidentiality.org/index.php/jpc/article/view/584)