Difference between revisions of "Unit of Observation"

Line 1: Line 1:
<onlyinclude>While the specific term ''Unit of Observation'' is not always well known, it is a concept that all people who work with data have come across. Having an exact understanding of this concept and getting into the habit of thinking about your data sets in terms of unit of observation, and organizing your data sets and your project folder accordingly is key to a efficient data work. Mistakes done in regards to this concepts are more common that what one would expect, and those mistakes will bias your analysis.</onlyinclude>
<onlyinclude>The ''unit of observation'' is the “who” or “what” about which survey data is collected and analysis is focused. Common examples include individual, household, or community. Clearly identifying your ''unit of observation'' in datasets and project folders will lead to a more efficient workflow and a more accurate analysis. 
 
</onlyinclude>
== Read First ==
== Read First ==
* Never trust the file name by itself as an indicator of what the unit of observations is. Always perform some tests to convince yourself of the unit of observation.
*Mistakes related to ''unit of observation'' introduce bias into analyses. Always double check the ''unit of observation'' before working with data.
* A data set always only has one unit of observation. It is incorrect to include two different units of observation in a single data set.


==Definition==
==Definition==


The ''unit of observation'' is the who or what about which data is collected in a survey or the who or what is being studied in an analysis. In a data set, this is represented by a row in the data set. ''Unit of observation'' refers to the category, type or classification that each who or what belongs to, not to the specific people or objects included. The term ''unit of analysis'' is synonymous to ''unit of observation'' in the context of analysis, but it is used slightly different in the context of data collection. For example, if data is collected on both students and schools, but the analysis only focuses on students, then both school and students are ''unit of observation'', but schools would rarely be referred to as ''unit of observation'' in this case.
The ''unit of observation'' is the “who” or “what” about which survey data is collected and analysis is focused. Common examples include individual, household, or community. Note that the ''unit of observation'' refers to the category, type, or classification of data -- not to specific parties. For example, while student and school are units of observation, “Ali Jones” or “Cedar Elementary School” are not. 
 
In many cases, there is little risk for confusion in terms of ''unit of observation'', but errors due to a clear understanding of this are more common than what one might first think. Just as distance data does not make sense unless we know whether it is measured in miles or kilometer, we need to know the unit of our data set. We often have a good idea what the ''unit of observation'' is already at the first glance of a data set, but do not trust this, and always test that your assumption is correct. Even if you are quite sure that you know the ''unit of observation'' of a data set that you are working with, always make sure that you are sure beyond any reasonable doubt before working with a data set.
 
A data set is always incorrectly constructed if one data set has more than one ''unit of observation''. Even if the two units of observation has the same variables, it is incorrect, bad practice, and a huge source of error if they were included in the same data set. All such data sets should be separated into two data sets.


===Methods to confirm the Unit of Observation in a data set===
==Confirming the Unit of Observation==


The first time you are using a data set you have not created yourself, you should always start by making sure that you have no doubt what the ''unit of observation'' is. You often get this information from the name of the data file, but you should always test that before believing it. The most obvious method to make sure you know what is the unit of observation is to ask the person that sent you the data set, but the rest of this section assumes that you for any reason cannot confirm the ''unit of observation'' that easily.
Just as distance data does not make sense until we know whether its unit is miles or kilometers, survey data and any resulting analyses do not make sense until we know the ''unit of observation''. In many cases, there is seemingly little risk for confusion in terms of ''unit of observation.'' We often have a good intuition for the ''unit of observation'' at the first glance of a dataset or a file name. However, always test that your assumption is correct: errors due to an unclear understanding of ''unit of observation'' are more common than one might imagine. When working with a dataset that you have not created yourself, start by clearly identifying the unit of observation. The most obvious way to do so is by asking the person from whom you received the dataset.


If you open up a data set for which you have a good reason to believe the ''unit of observations'' is, for example, household, then look for a household ID variable and test if it is [[ID Variable Properties|uniquely and fully identifying ]] the data set. If this is the case, then you are done. However, if you do not find such variable, you will have to find other information that uniquely and fully identifies the data set. For example, in this case, you would look for variables with information of household head name. Test if this variable uniquely identifies all observations. Names are often not unique across a country, so you might have to add region name and village name to the test.
Consider, however, that you have a dataset for which you do not know the unit of observation and you cannot reach the person from whom you received the dataset. You believe that the ''unit of observation'' is household. To confirm, open up the dataset, look for a household ID variable and test if it is [[ID Variable Properties|uniquely and fully identifying]] the dataset. If this is the case, then you are done. However, if you do not find such variable, search for other information that uniquely and fully identifies the dataset. In this case, for example, look for variables with information of household head name. Test if this variable uniquely identifies all observations. Names are often not unique across a country, so you might have to add region name and village name to the test. Once you have found the information that uniquely and fully identifies the dataset, make sure you create an appropriate [[ID variable Properties|ID Variable]] accordingly if it does not yet exist.  


==Usages other than in data sets==
Note that a dataset is always incorrectly constructed if it has more than one unit of observation. Even if the two units of observation have the same variables, it is incorrect, bad practice, and a huge source of error if they were included in the same dataset. All such datasets should be separated into two datasets.


The examples below all have many similarities to how ''unit of observation'' is used in the context of a data set. They are included to give further explanation to the concept or highlight small differences in usage.
==Applications==
The examples below all have many similarities to how ''unit of observation'' is used in the context of a dataset. They are included to give further explanation to the concept or highlight small differences in usage.


===Regressions===
===Regressions===
The unit of observation in a regression is what the N (or number of observations) represents. That is very much related to how the concept is used in the data set, as the N is the number of rows from the data set included in the regression. To be able to interpret the regression correctly therefore depends on understanding the ''unit of observation''. In most cases this is trivial, but we have had issues where regressions have been misinterpreted as a monitoring data that was believed to have the unit of observation "households", while it actually was "packages distributed to households". Since the vast majority of households only received one package each, it was easy to make this mistake than what it first might seem.
In a regression, N (or the number of observations) represents the unit of observation. A correct interpretation of the regression depends on a clear understanding of the unit of observation. In most cases this is trivial, but not always. Consider, for example, monitoring data that is believed to have the ''unit of observation'' "households," though its true ''unit of observation'' is "packages distributed to households." Since the vast majority of households only received one package each, it is easy yet problematic to make this mistake.


Note that some regressions collapse your data set, so the unit of observation in the regression is different from the unit of observation in your data set. This is one example when unit of observation cannot described as a row in a data set.
Note that some regressions collapse your dataset, so the ''unit of observation'' in the regression is different from the ''unit of observation'' in your dataset. This is one example when ''unit of observation'' cannot described as a row in a dataset.


===Surveys===
===Surveys===
The concept of unit of observation can also be used to describe for example surveys. The unit of observation in a survey is the type of respondent. For example, household, company, school etc. In the cases of company and school the respondent is a person, for example the CEO or the principal, but they provide answers about the company or the school. If they would be asked questions about themselves, then the ''unit of observation'' would be CEOs and principals.
The concept of ''unit of observation'' can also be used to describe for example surveys. The ''unit of observation'' in a survey is the type of respondent. For example, household, company, school etc. In the cases of company and school the respondent is a person, for example the CEO or the principal, but they provide answers about the company or the school. If they would be asked questions about themselves, then the ''unit of observation'' would be CEOs and principals.


== Back to Parent ==
== Back to Parent ==
This article is part of the topic [[Data Management]]
This article is part of the topic [[Data Management]]


== Additional Resources ==
== Additional Resources ==

Revision as of 17:17, 29 March 2019

The unit of observation is the “who” or “what” about which survey data is collected and on which analysis is focused. Common examples include individual, household, or community. Clearly identifying your unit of observation in datasets and project folders will lead to a more efficient workflow and a more accurate analysis.

Read First

  • Mistakes related to unit of observation introduce bias into analyses. Always double check the unit of observation before working with data.

Definition

The unit of observation is the “who” or “what” about which survey data is collected and on which analysis is focused. Common examples include individual, household, or community. Note that the unit of observation refers to the category, type, or classification that each “who” or “what” belongs to, not to the specific people or objects included. For example, while student and school are units of observation, “Ali Jones” and “Cedar Elementary School” are not.

Confirming the Unit of Observation

Just as distance data does not make sense until we know whether its unit is miles or kilometers, survey data and any resulting analyses do not make sense until we know the unit of observation. In many cases, there is seemingly little risk for confusion in terms of unit of observation. We often have a good intuition for the unit of observation at the first glance of a dataset or a file name. However, always test that your assumption is correct: errors due to an unclear understanding of unit of observation are more common than one might imagine. When working with a dataset that you have not created yourself, start by clearly identifying the unit of observation. The most obvious way to do so is by asking the person from whom you received the dataset.

Consider, however, that you have a dataset for which you do not know the unit of observation and you cannot reach the person from whom you received the dataset. You believe that the unit of observation is household. To confirm, open the dataset, look for a household ID variable, and test whether it uniquely and fully identifies the dataset. If it does, then you are done. If you do not find such a variable, search for other information that uniquely and fully identifies the dataset. In this case, for example, look for a variable that records the name of the household head, and test whether it uniquely identifies all observations. Names are often not unique across a country, so you might have to add region name and village name to the test. Once you have found the information that uniquely and fully identifies the dataset, create an appropriate ID Variable from it if one does not yet exist.
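The test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name check_identifies, the column names, and the records are hypothetical, and "fully identifying" is taken here to mean that no row has a missing value in any of the candidate ID columns.

```python
from collections import Counter


def check_identifies(rows, id_columns):
    """Return (fully_identifying, uniquely_identifying, duplicate_keys).

    rows is a list of dicts, one per row of the dataset (hypothetical layout).
    """
    keys = []
    rows_with_missing = 0
    for row in rows:
        key = tuple(row.get(col) for col in id_columns)
        if any(value is None or value == "" for value in key):
            rows_with_missing += 1  # this row cannot be identified at all
        else:
            keys.append(key)
    counts = Counter(keys)
    duplicates = sorted(key for key, n in counts.items() if n > 1)
    return rows_with_missing == 0, not duplicates, duplicates


# Hypothetical household records without a household ID variable.
households = [
    {"region": "North", "village": "Abo", "head_name": "Ali Jones"},
    {"region": "North", "village": "Bel", "head_name": "Ali Jones"},
    {"region": "South", "village": "Abo", "head_name": "Mia Chen"},
]

# The head name alone is not unique; adding region and village makes it so.
name_only = check_identifies(households, ["head_name"])
name_and_place = check_identifies(households, ["region", "village", "head_name"])
print(name_only)
print(name_and_place)
```

In this made-up example the name alone fails the uniqueness test, while the combination of region, village, and name passes, which mirrors the advice to widen the set of variables until every row is identified.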

Note that a dataset is always incorrectly constructed if it has more than one unit of observation. Even if the two units of observation have the same variables, including them in the same dataset is incorrect, bad practice, and a huge source of error. Any such dataset should be separated into one dataset per unit of observation.

Applications

The examples below all have many similarities to how unit of observation is used in the context of a dataset. They are included to give further explanation to the concept or highlight small differences in usage.

Regressions

In a regression, N (the number of observations) counts units of observation: it is the number of rows from the dataset that are included in the regression. A correct interpretation of the regression therefore depends on a clear understanding of the unit of observation. In most cases this is trivial, but not always. Consider, for example, monitoring data that is believed to have the unit of observation "households," though its true unit of observation is "packages distributed to households." Since the vast majority of households received only one package each, this mistake is easy to make and still distorts the interpretation.
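A quick way to catch the households-versus-packages mistake is to compare the number of rows with the number of distinct values of each candidate ID. The sketch below uses made-up records and hypothetical column names (household_id, package_id); it is not taken from the monitoring data mentioned above.

```python
# Hypothetical monitoring records: one row per package delivered.
deliveries = [
    {"household_id": "H1", "package_id": "P1"},
    {"household_id": "H2", "package_id": "P2"},
    {"household_id": "H2", "package_id": "P3"},  # H2 received a second package
    {"household_id": "H3", "package_id": "P4"},
]

n_rows = len(deliveries)
n_households = len({row["household_id"] for row in deliveries})
n_packages = len({row["package_id"] for row in deliveries})

# If rows were households, the row count would equal the household count.
rows_are_households = n_rows == n_households
rows_are_packages = n_rows == n_packages

print(n_rows, n_households, n_packages)
print(rows_are_households, rows_are_packages)
```

Here a single household with two packages is enough to show that N counts packages, not households, even though most households appear only once.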

Note that some regressions collapse your dataset, so the unit of observation in the regression is different from the unit of observation in your dataset. This is one example of when the unit of observation cannot be described as a row in a dataset.
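To illustrate how collapsing changes the unit of observation, the sketch below aggregates hypothetical package-level rows to one row per household. The function name collapse, the column names, and the values are assumptions made for the example; how a particular regression command collapses data internally may differ.

```python
from collections import defaultdict


def collapse(rows, key, value):
    """Sum `value` within each distinct `key`, returning one row per key."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[value]
    return [{key: k, value: total} for k, total in sorted(totals.items())]


# Hypothetical package-level data: the unit of observation is the package.
packages = [
    {"household_id": "H1", "weight_kg": 2.0},
    {"household_id": "H2", "weight_kg": 1.5},
    {"household_id": "H2", "weight_kg": 2.5},
]

# After collapsing, the unit of observation is the household.
by_household = collapse(packages, "household_id", "weight_kg")
print(len(packages), len(by_household))
print(by_household)
```

The row count drops from three packages to two households, so an N computed on the collapsed data refers to households even though the stored dataset has one row per package.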

Surveys

The concept of unit of observation can also be used to describe surveys. The unit of observation in a survey is the type of respondent, for example household, company, or school. In the cases of company and school, the respondent is a person, such as the CEO or the principal, but they provide answers about the company or the school. If they were asked questions about themselves, then the unit of observation would be CEOs and principals.

Back to Parent

This article is part of the topic Data Management

Additional Resources

Please add here related articles, including a brief description and link.