Unit of Observation

While the specific term unit of observation is not always well known, it is a concept that everyone who works with data has come across. Having an exact understanding of the concept, getting into the habit of thinking about your data sets in terms of their unit of observation, and organizing your data sets and project folder accordingly are key to efficient data work. Mistakes related to this concept are more common than one would expect, and those mistakes will bias your analysis.


Read First

  • Never trust the file name by itself as an indicator of what the unit of observation is. Always perform some tests to convince yourself of the unit of observation.
  • A data set should always have exactly one unit of observation. It is always incorrect to include two different units of observation in a single data set.

Definition

The most common context in which the concept of unit of observation is used is to describe a data set. A non-technical way to explain unit of observation in this context is that it is what each row in the data set represents. Just as a distance does not make sense unless we know whether it is measured in miles or kilometers, we need to know the unit of our data set. We often have a good idea of what the unit of observation is at first glance, but do not trust this impression; always test that your assumption is correct.

While in most cases the unit of observation can correctly be described as what a row of a data set represents, there are cases where it is more than that. For example, a regression might pull different types of observations from different data sets, or a regression might collapse observations to another unit of observation that never existed in a data set. However, until you have a deeper understanding of this concept, you will be fine thinking of it as what a row in a data set represents.

Note that a data set is always incorrectly constructed if it has more than one unit of observation. Such a data set should be separated into two data sets, one per unit of observation. Even if the two units of observation have the same variables, including them in the same data set is incorrect, bad practice, and a huge source of error.
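As an illustration of separating a mixed data set, here is a minimal Python sketch. It assumes that each row carries a variable naming its unit of observation; the variable name unit, the function split_by_unit, and the example rows are all hypothetical. Real mixed data sets often lack such a marker, in which case the rows have to be classified by other means before they can be split.

```python
from collections import defaultdict


def split_by_unit(rows, unit_var):
    """Split rows into one data set per value of unit_var.

    unit_var is a hypothetical column that names the unit of
    observation of each row (an assumption made for this sketch).
    """
    datasets = defaultdict(list)
    for row in rows:
        datasets[row[unit_var]].append(row)
    return dict(datasets)


# Hypothetical data set that wrongly mixes household and village rows.
mixed = [
    {"unit": "household", "id": "H1", "size": 4},
    {"unit": "village", "id": "V1", "size": 120},
    {"unit": "household", "id": "H2", "size": 6},
]

by_unit = split_by_unit(mixed, "unit")
households = by_unit["household"]  # two rows, one per household
villages = by_unit["village"]      # one row, one per village
```

After the split, each resulting data set has a single unit of observation and can be saved and documented separately.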

Methods to confirm the Unit of Observation in a data set

The first time you use a data set that you did not create yourself, you should always start by making sure that you have no doubt about what the unit of observation is. You often get this information from the name of the data file, but you should always test it before believing it. The most obvious way to find out the unit of observation is to ask the person who sent you the data set, but the rest of this section assumes that, for whatever reason, you cannot confirm the unit of observation that easily.

If you open a data set for which you have good reason to believe that the unit of observation is, for example, the household, then look for a household ID variable and test whether it uniquely and fully identifies the data set, meaning that no two rows share the same ID and no row is missing an ID. If this is the case, then you are done. However, if you do not find such a variable, you will have to find other information that uniquely and fully identifies the data set. In this case, for example, you would look for a variable with the name of the household head and test whether it uniquely identifies all observations. Names are often not unique across a country, so you might have to add region name and village name to the test.
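The test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data set is represented as a list of dictionaries, and the function name check_identifier as well as the variable names region, village and head_name are hypothetical. A missing value is taken to be None or an empty string.

```python
from collections import Counter


def check_identifier(rows, key_vars):
    """Check whether key_vars fully and uniquely identify the rows."""
    keys = [tuple(row.get(var) for var in key_vars) for row in rows]
    # Fully identifying: no row has a missing value in any key variable.
    missing_rows = [
        i for i, key in enumerate(keys)
        if any(value is None or value == "" for value in key)
    ]
    # Uniquely identifying: no key value appears on more than one row.
    counts = Counter(keys)
    duplicate_keys = [key for key, n in counts.items() if n > 1]
    return {
        "fully": not missing_rows,
        "uniquely": not duplicate_keys,
        "missing_rows": missing_rows,
        "duplicate_keys": duplicate_keys,
    }


# Hypothetical data set believed to have one row per household.
rows = [
    {"region": "North", "village": "A", "head_name": "Amina"},
    {"region": "North", "village": "B", "head_name": "Amina"},
    {"region": "South", "village": "A", "head_name": "Joseph"},
]

name_only = check_identifier(rows, ["head_name"])
name_and_place = check_identifier(rows, ["region", "village", "head_name"])
```

In this made-up example the head name alone fails the uniqueness test because two households share it, while the combination of region, village and head name passes both tests.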

Usages other than in data sets

The examples below all have many similarities to how unit of observation is used in the context of a data set. They are included to give further explanation to the concept or highlight small differences in usage.

Regressions

The unit of observation in a regression is what the N (the number of observations) represents. This is closely related to how the concept is used for data sets, as N is the number of rows from the data set included in the regression. Interpreting the regression correctly therefore depends on understanding the unit of observation. In most cases this is trivial, but we have had cases where regressions were misinterpreted because a monitoring data set was believed to have the unit of observation "households" while it actually was "packages distributed to households". Since the vast majority of households received only one package each, this mistake was easier to make than it might first seem.

Note that some regressions collapse your data set, so that the unit of observation in the regression is different from the unit of observation in your data set. This is one example of when the unit of observation cannot be described as a row in a data set.
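To make the idea of collapsing concrete, here is a minimal Python sketch that turns a package-level data set into a household-level one, using the packages example from this section. The variable names package_id and household_id, the function collapse_to_household, and the data are hypothetical, and counting packages is just one of many possible ways to aggregate.

```python
from collections import defaultdict


def collapse_to_household(package_rows):
    """Collapse one-row-per-package data to one row per household."""
    totals = defaultdict(int)
    for row in package_rows:
        totals[row["household_id"]] += 1
    return [
        {"household_id": hh, "packages_received": n}
        for hh, n in sorted(totals.items())
    ]


# Hypothetical monitoring data: each row is a package, not a household.
packages = [
    {"package_id": 1, "household_id": "H1"},
    {"package_id": 2, "household_id": "H2"},
    {"package_id": 3, "household_id": "H1"},
]

households = collapse_to_household(packages)
n_packages = len(packages)      # N if the unit of observation is packages
n_households = len(households)  # N if the unit of observation is households
```

The two values of N differ as soon as any household received more than one package, which is why the unit of observation has to be known before N can be interpreted.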

Surveys

The concept of unit of observation can also be used to describe, for example, surveys. The unit of observation in a survey is the type of respondent, for example household, company, or school. In the cases of company and school the respondent is a person, for example the CEO or the principal, but they provide answers about the company or the school. If they were asked questions about themselves, then the unit of observation would be CEOs and principals.

Back to Parent

This article is part of the topic Data Management

