Difference between revisions of "ID Variable Properties"

Revision as of 03:42, 7 February 2017

An ID variable that identifies an observation should have the properties listed below. Note that this applies to the ID variable that identifies observations across data sets in our project folder. Some Stata commands, for example reclink, require an idmaster() and an idusing() variable. ID variables created temporarily for such a command do not need to have all of these properties.

Read First


First property: Uniquely Identifying

The first and the second properties are the most commonly cited properties of an ID variable. An ID variable is uniquely identifying when no two observations share a value in the ID variable. The next paragraph shows that this is easy to test for a single data set. However, the first property does not apply only to a single data set; it applies to the full project. To test the first property for a full project, one must first make sure that all observations are added to the master data set, and then test for the first property as described in the next paragraph.

There are several ways to test for this in Stata. One is duplicates report idvar, where idvar is the ID variable. It is also possible to test the first property using the command isid idvar. While duplicates report provides more informative output, isid is a quick and easy way to test for both the first and the second property.
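The two tests can be sketched as a short do-file. This is a minimal illustration, not a prescribed workflow: the file name master_dataset.dta is hypothetical, and idvar stands in for the project's actual ID variable.

```stata
* Load the master data set (hypothetical file name)
use "master_dataset.dta", clear

* Tabulate how many observations share a value of idvar.
* Every observation should be reported with 1 copy.
duplicates report idvar

* If duplicates exist, list the offending observations for inspection
duplicates list idvar

* isid exits with an error if idvar has duplicate values or
* missing values, so the do-file stops here if the test fails
isid idvar
```

Because isid stops execution when the test fails, it works well as an assertion at the top of a do-file, while duplicates report and duplicates list are more useful when investigating why the test failed.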

Second property: Fully Identifying

Third property: Constant Across a Project

Fourth property: Constant Throughout the Duration of a Project

Fifth property: Anonymous IDs

The fifth property is less a requirement and more a good practice. Sometimes we have access to IDs that satisfy all the properties above, but we should be very careful before using them. Examples include individual national IDs, public company IDs, and a hospital's patient IDs. Since records linked to those IDs are available to people outside our team, we cannot guarantee that we can protect the privacy of the data we collect. In all of these cases we need to create our own ID that has no association with the ID created by someone else and that is unique to our project, so that it is an anonymous ID that identifies the observation only to us. In the master data set we can include the other ID to enable us to merge data quickly, but then the information in the master data set becomes even more sensitive than usual.

There is an exception to this rule that can simplify the data work, but it should be used only with care. If a project has a high-level unit of observation for which the project team is absolutely certain that it will not collect sensitive data, and there is an official code for it, then we could perhaps use this code. This could, for example, be done for districts or regions so that we can more easily include publicly available data from those districts or regions. However, if there is any chance that we would include data that is not publicly available, for example district budgets, then we need to make our own code. Also, if even a single instance of a unit of observation contains only a few observations of another level, for example a school with few students or a village with few households, then we have to create anonymous IDs for all instances at that level: not just that one school or village, but all schools or villages.

It is never incorrect to create an anonymous ID, so if there is any uncertainty whether a public ID can be used, then always go for the anonymous option.
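One way to create an anonymous ID of the kind described above is to assign IDs in a random order, so that the new ID carries no information about the public one. The sketch below is only an illustration under stated assumptions: the file name master_dataset.dta, the variable names national_id and project_id, the seed, and the starting value 100000 are all hypothetical.

```stata
* Load the master data set (hypothetical file name)
use "master_dataset.dta", clear

* The public ID (hypothetical name) must itself be unique and non-missing
isid national_id

* Fix the starting order and the seed so the assignment is reproducible
sort national_id
set seed 20170207

* Put the observations in a random order
generate double rand = runiform()
sort rand

* Assign IDs in that random order; the offset only gives all IDs equal length
generate long project_id = 100000 + _n
drop rand

* Confirm the new ID is uniquely identifying with no missing values
isid project_id
```

After this step the public ID would be kept only in the master data set, and all other data sets in the project would carry project_id alone.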

Back to Parent

This article is part of the topic Data Management

