ID Variable Properties
An ID variable that identifies an observation should have the properties listed below. Note that this applies to the ID variable that identifies observations across data sets in our project folder. Some commands in Stata, for example reclink, require an idmaster() and an idusing() variable; ID variables created temporarily for such a command do not need to have all of these properties.
- A data set should typically have one variable as the ID variable. One common exception to this rule is a panel data set, where the combination of the ID variable and the time variable identifies each observation.
- This article lists 5 properties of ID variables. Property 1 and Property 2 should be tested for when starting to work with a new data set. Properties 3, 4, and 5 are more relevant when creating an ID variable or assigning ID values to newly-added observations.
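The panel exception can be illustrated with a minimal sketch. The data and the variable names hh_id, year and income are hypothetical and only serve the example.

```stata
* Hypothetical panel data: two households observed in two years
clear
input hh_id year income
1 2019 250
1 2020 260
2 2019 400
2 2020 410
end

* hh_id alone repeats across years, so it does not identify the rows;
* capture stores the return code instead of stopping the do-file
capture isid hh_id
display "isid hh_id return code: " _rc

* The combination of ID variable and time variable does identify the rows
isid hh_id year
```

If the last command runs without an error, the pair of variables identifies every observation in the panel.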
Property 1: Uniquely Identifying
An ID variable is uniquely identifying when there are no duplicates, that is, no two observations share a value in the ID variable. This is easy to test for in a single data set, but it is more complex (and equally important) across multiple related data sets, i.e. across all the data for an IE project. To test Property 1 for a full project, first make sure that all observations have been added to the master data set, and then test the master data set as described in the next paragraph.
There are several ways to test for this in Stata. One is duplicates report idvar, where idvar is the ID variable. Another is isid idvar. While duplicates report provides more informative output, isid is a quick and easy way to test for both the first and the second property.
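A minimal sketch of both tests on a small made-up data set, in which one ID value is deliberately duplicated. The data and the variable names name and dup are assumptions for illustration; idvar stands for the ID variable as in the text.

```stata
* Hypothetical toy data with one duplicated ID value (102)
clear
input idvar str10 name
101 "Ana"
102 "Ben"
102 "Chi"
104 "Dev"
end

* Informative summary of how many observations share an ID value
duplicates report idvar

* Flag the duplicated IDs so that they can be inspected and resolved
duplicates tag idvar, generate(dup)
list idvar name if dup > 0

* isid exits with an error when idvar does not uniquely identify the data;
* capture stores the return code instead of stopping the do-file
capture isid idvar
display "isid return code: " _rc
```

In a real do-file it is usually preferable to run isid without capture, so that the code stops as soon as the ID variable no longer identifies the data.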
Property 2: Fully Identifying
An ID variable is fully identifying when all observations have a value in the ID variable, i.e. no values are missing. Like the first property, this is very easy to test on a single data set; testing it for a full project depends on how well the master data set has been kept up to date. If all observations in all data sets have been added to the master data set, then they should all have been given a value in the ID variable, but each time you modify the master data set you should test for this property to be sure.
There are several ways to test for this in Stata, but the command isid idvar, where idvar is the ID variable, is often used as it tests for both the first and the second property. Note that a missing value should not be used as an ID value even though it could technically identify a single observation. A missing value implies that the information is missing, so by default isid treats a data set with a missing value in the ID variable as not identified and returns an error.
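A minimal sketch of testing for missing ID values, again on made-up data. The variable name idvar follows the text; the data are hypothetical.

```stata
* Hypothetical toy data where one observation has no ID value
clear
input idvar str10 name
101 "Ana"
.   "Ben"
103 "Chi"
end

* Count observations without an ID value;
* missing() works for both numeric and string variables
count if missing(idvar)

* assert stops a do-file if any ID value is missing;
* capture keeps this sketch running and stores the return code
capture assert !missing(idvar)
display "assert return code: " _rc

* isid also fails here because it does not accept missing values by default
capture isid idvar
display "isid return code: " _rc
```

Both return codes are non-zero for this data, which signals that the ID variable is not fully identifying.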
Property 3: Constant Across a Project
The third property says that no observation should have different IDs in different data sets. Data sets collected from different sources might have different IDs when they are first included in the project, but one ID variable should be made the primary one. If there is a reason to keep another ID variable in the data set at all, it should be clearly marked as not being the main ID variable for this project.
There is no specific test for this; it is a rule to follow when creating an ID variable. Following the best practice of carefully adding all observations to the master data set usually ensures that no observation has two values in the ID variable. It also makes it easy to keep the same primary ID variable in all data sets once the observations have been added to the master data set.
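Although this property cannot be tested directly, a related sanity check is to confirm that every ID value in a data set also exists in the master data set. The sketch below is one possible way to do that, not a test of the property itself; the data sets and the variable names hh_id, village and income are hypothetical.

```stata
* Hypothetical master data set: one row per observation with the project ID
clear
input hh_id str10 village
1001 "North"
1002 "North"
1003 "South"
end
tempfile master
save `master'

* Hypothetical survey data set that should use the same project ID
clear
input hh_id income
1001 250
1003 400
1005 310
end

* IDs that exist only in the survey data (_merge == 1) suggest an observation
* that was never added to the master data set or that carries a different ID
merge 1:1 hh_id using `master', keepusing(village)
list hh_id if _merge == 1
count if _merge == 1
```

Here the value 1005 is listed, so that observation would need to be traced before the data set is used together with the rest of the project data.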
Property 4: Constant Throughout the Duration of a Project
The fourth property is similar to the third, but it says that the same observation should have the same value in the ID variable throughout the project. The ID that an observation was assigned at baseline (or whenever it was assigned) should not be changed during the rest of the project. One obvious exception to this rule is when we find a mistake in the ID variable. This hopefully happens rarely, as it is very labor intensive to go through all do-files in a project to make sure that no values have to be updated for the code to work as intended.
Another exception is when the format of the ID variable needs to be extended because a project runs out of IDs. This is one of the rare cases in which having more than one ID variable can be justified, although it is never the best practice. In this case it might be a good idea to create a new ID variable where the new value is based on the old value, for example by adding two more digits. The old ID variable can then be kept so that old code does not have to be updated. It is best practice to update all references to the old ID variable with the new one, but this can be infeasible when it would take too much time.
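Extending the ID format could look like the following sketch. It assumes that the two extra digits are appended as 00; the variable names old_id and new_id and the data are hypothetical.

```stata
* Hypothetical data with a three-digit ID that has run out of values
clear
input old_id
101
102
999
end

* Assumption: the new ID appends two digits (here 00) to the old value,
* so the new value is based on the old one
generate long new_id = old_id * 100
format new_id %9.0f

* Both variables should identify the data on their own
isid old_id
isid new_id

* Mark clearly which variable is the main ID
label variable old_id "Old ID (kept for legacy code, not the main ID)"
label variable new_id "Main project ID"
```

New observations would then receive values only in the extended format, while old code that refers to old_id keeps working for the original observations.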
Property 5: Anonymous
The fifth property is less a requirement and more a good practice. Sometimes we have access to IDs that satisfy all of the first four properties above, but we should be very careful before using them. Examples include individual national IDs, public company IDs and a hospital's patient IDs. Since records linked to those IDs are available to people outside our team, there is no way for us to guarantee that we can protect the privacy of the data we collect if we use these IDs. In all of these cases we need to create our own ID that has no association with the ID created by someone else and is unique to our project, so that it is an anonymous ID that only identifies the observation to us. In the master data set we can include the other ID to enable us to merge data quickly, but then the information in the master data set becomes even more sensitive than usual.
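One possible way to create such an anonymous ID is to assign new values in a random order, as in the sketch below. The variable names national_id, rand and anon_id, the seed and the starting value 1000 are all assumptions for illustration.

```stata
* Hypothetical data identified by a public ID
clear
input str12 national_id
"A-5521-77"
"B-1093-42"
"C-8841-10"
"D-2207-63"
end

* Check the public ID and sort on it so that the starting order is stable
isid national_id, sort

* Fix the seed so that the random order is reproducible (the value is arbitrary)
set seed 215597
generate double rand = runiform()

* Sorting on a random number removes any link between
* the order of the public ID and the new ID
sort rand
generate long anon_id = 1000 + _n
drop rand

* The new ID should identify the data just as the public ID did
isid anon_id
```

The crosswalk between national_id and anon_id would be kept only in the master data set, and national_id would be dropped from every other data set.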
There is an exception to this rule that can simplify the data work, but it should only be used after careful consideration. If a project has a high-level unit of observation for which the project team is absolutely certain that it will not collect sensitive data, and there is an official code for it, then we could sometimes use this code. This could for example be done for districts or regions so that we can more easily include publicly available data for those districts or regions. However, if there is any chance that we would include data that is not publicly available, for example district budgets, then we need to make our own ID variable even for these units of observation. Also, if one or more instances of a unit of observation have only a few observations of another level mapped to them, for example a school with few students or a village with few households, then we have to create anonymous IDs for all instances at that level: not just for that one school or village, but for all schools or villages. Otherwise the ID of the school or the village can be used to work out who each of those students or farmers is, even though the student ID and the farmer ID are anonymous.
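Whether any higher-level unit has only a few lower-level observations mapped to it can be checked with a short sketch like the one below. The data, the variable names and the threshold of fewer than 3 students are assumptions; what counts as "few" depends on the project.

```stata
* Hypothetical student-level data with a school identifier
clear
input school_id student_id
1 11
1 12
1 13
2 21
3 31
3 32
end

* Number of students mapped to each school
bysort school_id: generate n_students = _N

* Assumption: fewer than 3 students counts as "few"; the threshold is illustrative
generate byte small_school = n_students < 3
tabulate school_id if small_school
```

If any school is flagged, all schools would get an anonymous ID, not only the flagged ones.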
It is never incorrect to create an anonymous ID, so if there is any uncertainty whether a public ID can be used, then always go for the anonymous option.