ID Variable Properties

Latest revision as of 15:17, 13 April 2021

An ID variable is a variable that identifies each entity in a dataset (person, household, etc.) with a distinct value. This article lists five properties of ID variables that researchers should keep in mind when creating, collecting, and merging data.

Read First

  • ID variables should be uniquely identifying, fully identifying, constant across a project, constant throughout the duration of the project, and anonymous.
  • Property 1 and Property 2 should be tested when starting to work with a new dataset, while Properties 3, 4, and 5 are more relevant when creating an ID variable or assigning ID values to newly added observations.
  • Note that this page refers to ID variables that identify observations across datasets in the project folder. Some Stata commands, like reclink, require a masterid() and a userid(). The ID variables created temporarily for these commands do not need to have all the properties outlined on this page.

Property 1: Uniquely Identifying

An ID variable is uniquely identifying when there are no duplicates -- that is, when no two observations share an ID variable value. This property is easily testable in a single dataset via the Stata code duplicates report idvar, where idvar is the ID variable. It is also easily testable via the Stata command isid idvar. While duplicates report provides a more informative output, isid is a quick and easy way to test for both the first and the second property.
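
As a minimal sketch of the two tests described above, the following Stata snippet builds a small hypothetical dataset (the variable names hhid and village and all values are made up for illustration) and runs both commands on it:

```stata
* Hypothetical example data: hhid stands in for idvar
clear
input long hhid str10 village
1001 "A"
1002 "A"
1002 "B"
1003 "B"
end

* Informative summary of how many observations share an ID value
duplicates report hhid

* Flag and list the duplicated observations for follow-up
duplicates tag hhid, generate(dup)
list hhid village if dup > 0

* isid exits with an error when hhid is not unique or contains missing
* values; capture lets the do-file continue so the return code can be shown
capture isid hhid
display "isid return code: " _rc
```

In a real do-file it is usually preferable to run isid without capture, so that the do-file stops as soon as the ID variable fails the test.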

Testing Property 1 in multiple related datasets is more complex but equally important. To do so, make sure that all observations are added to the master dataset. Then test the ID variable in the master dataset as described in the previous paragraph.
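
One possible way to sketch this check in Stata is shown below. The two datasets are hypothetical stand-ins created inside the example: a master list of household IDs and one survey dataset. Note that Stata's merge command calls the dataset in memory the "master", which is not the same thing as the project's master dataset.

```stata
* Hypothetical project master dataset: one row per household
clear
input long hhid
1001
1002
1003
end
isid hhid                      // Property 1 (and 2) in the master dataset
tempfile master
save `master'

* Hypothetical survey dataset whose IDs should all exist in the master dataset
clear
input long hhid float income
1001 50
1003 70
end

* merge 1:1 itself fails if hhid is duplicated in either dataset
merge 1:1 hhid using `master'

* _merge == 1 would mean a survey observation that is absent from the
* project master dataset, so it should never occur
assert _merge != 1
```

Households that appear only in the master dataset (_merge == 2) are not a problem here; they are simply households that were not part of this survey dataset.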

Property 2: Fully Identifying

An ID variable is fully identifying when all observations have an ID variable value. In other words, no ID variable values are missing. As with Property 1, Property 2 is easily testable in a single dataset. The Stata command isid idvar, where idvar is the ID variable, tests for both Property 1 and Property 2. Note that a missing value should not be used as an ID value, even though it could technically identify a single observation. Since a missing value implies that the information is missing, isid treats any missing value as a sign that the ID variable does not fully identify the dataset.

If all observations in all datasets have been added to the master dataset, then they should all have a value in the ID variable. Each time you modify the master dataset, test for this property to be sure.
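
A minimal sketch of a check for missing ID values, using a small hypothetical dataset (hhid is an assumed variable name):

```stata
* Hypothetical example data with one missing ID value
clear
input long hhid
1001
.
1003
end

* Count the observations that lack an ID value
count if missing(hhid)
display "Observations without an ID: " r(N)

* assert exits with an error if any ID is missing; capture lets the
* do-file continue so the return code can be displayed
capture assert !missing(hhid)
display "assert return code: " _rc
```

The missing() function also works for string ID variables, where it treats the empty string as missing.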

Property 3: Constant Across a Project

An ID variable is constant across a project when no observation has a different ID in a different dataset. Datasets collected from different sources might have different IDs when they are first included in the project. If this is the case, make one ID variable constant and dominant. If there is a reason to keep the other ID variable in the dataset, clearly indicate via the name, label or otherwise that it is not the main ID variable for this project.
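
As an illustration of marking a secondary ID, the sketch below renames and labels an external ID so that it cannot be mistaken for the project ID. The variable names (hhid, census_id) and the values are hypothetical.

```stata
* Hypothetical dataset that arrived with its own ID (census_id)
* after the project ID (hhid) had been assigned
clear
input long hhid long census_id
1001 550012
1002 550047
end

* Make the secondary status explicit in both the name and the label
rename census_id census_id_secondary
label variable hhid "Project household ID (main ID variable)"
label variable census_id_secondary "Census ID from external source: NOT the project ID"

describe
```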

Property 3 is an important one to follow when creating an ID variable. Carefully adding all observations to the master dataset typically ensures that no observation has two distinct ID variable values. It is also useful to keep the same primary ID variable in all datasets after the observations have been added to the master dataset.

There is no specific test for Property 3.

Property 4: Constant ID Value

An ID variable is constant throughout the duration of the project when each observation keeps the same ID variable value throughout the project. The ID assigned to an observation at baseline, for example, should not change during the rest of the project. One exception to this rule is when there is a mistake in the ID variable. This should happen very rarely, because correcting an ID is labor-intensive: every project do-file must be reviewed to make sure that values are updated and the code still runs smoothly.

It is always best practice to keep ID variable values constant throughout the project. However, if a project runs out of ID values and the ID variable format consequently needs to be modified, a violation of Property 4 may be justified. In this case, base the new ID variable on the old value. For example, append two additional digits to the old variable to create the new variable. The old ID variable can then be kept so that old code does not have to be updated. While it is best practice to update all references to the old ID variable with the new one, time constraints may make this infeasible.
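
The sketch below shows one way the "append two digits" approach could look for a numeric ID. The variable names hhid and hhid_v2 and the choice of appending "00" to existing IDs are assumptions made for illustration.

```stata
* Hypothetical data with the original four-digit ID
clear
input long hhid
1001
1002
1003
end

* Append two digits to every existing ID (1001 becomes 100100).
* The old variable hhid is kept unchanged so that old code still runs.
generate long hhid_v2 = hhid * 100
label variable hhid    "Old household ID (kept for old code)"
label variable hhid_v2 "New household ID (old ID plus two digits)"

* The new ID must still satisfy Property 1 and Property 2
isid hhid_v2
list
```

A long can hold integers only up to roughly 2.1 billion, so for longer IDs a double or a string variable would be the safer storage type.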

Property 5: Anonymity

The fifth property is less a requirement and more a good practice. Sometimes we have access to IDs that satisfy the first four properties, but we should be very careful before using them. Consider, for example, individual national IDs, public company IDs, or a hospital's patient ID. Since people outside of the research team have access to these IDs, there is no way to guarantee the protection or privacy of the data collected with them. In all of these cases, create a new ID variable with no association to the external ID. The new ID variable should be unique to your project. The master dataset can include the external ID to facilitate quick and easy merges, but then the master dataset becomes even more sensitive than usual. Encryption in this case is key.

If a project has a high-level unit of observation for which the project team is absolutely certain it will not collect sensitive data, and there is an official code for it, then researchers can sometimes use this code. This could be done, for example, for districts or regions in order to more easily include publicly available data from those districts or regions. However, if there is any chance of including data that is not publicly available, such as district budgets, then make your own ID variable even for these units of observation. If there is a unit of observation for which one or more instances have only a few observations of another level mapped to them (e.g., a school with few students or a village with few households), then create anonymous IDs for all instances at that level: not just that one school or village, but all schools or villages. Otherwise, the ID of the school or the village can be used to work out who each of those students or farmers is, even if the student ID and the farmer ID are anonymous.

It is never incorrect to create an anonymous ID. If there is any uncertainty about whether a public ID can be used or not, then always go for the anonymous option.
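
One common way to create an ID with no association to an external ID is to put the observations in random order and number them. The sketch below assumes hypothetical national ID strings, an arbitrary seed, and an arbitrary starting value of 1000; none of these come from this page.

```stata
* Hypothetical external IDs that must not be used as the project ID
clear
input str12 national_id
"AB-449201"
"ZK-100384"
"QT-775120"
end
isid national_id

* Random order, so the new ID carries no information about the
* external ID or about the original sort order
set seed 20210413
generate double rand = runiform()
sort rand
generate long person_id = 1000 + _n
drop rand

* The anonymous ID must still be uniquely and fully identifying
isid person_id
list
```

A dataset that holds both national_id and person_id is the link between the two and is therefore highly sensitive; in line with the text above, it would belong in encrypted storage.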

Project ID

A project ID is the main identifying (or ID) variable used in a project to identify observations. A unit of observation should never have multiple project IDs. For each level of observation, the corresponding project ID variable must uniquely and fully identify all observations in the project.

For example, if the level of observation is households, then the variable hhid (household ID) is the project ID.
