Difference between revisions of "Unit of Observation"

 
(19 intermediate revisions by 6 users not shown)
Line 1: Line 1:
While the specific term ''Unit of Observation'' is not always well known, it is a concept that everyone who works with data has come across. Understanding this concept precisely, getting into the habit of thinking about your data sets in terms of unit of observation, and organizing your data sets and project folder accordingly are key to efficient data work. Mistakes related to this concept are more common than one would expect, and those mistakes will bias your analysis.
The unit of observation is the unit at or for which data is collected. Common examples include individual, household, community, or school. Clearly identifying the unit of observation is important for a logical [[Questionnaire Design | survey design]], organized [[Primary Data Collection | data collection]], a sound [[DataWork Folder | data folder]] set-up, and an unbiased [[Data Analysis | analysis]]. This page discusses unit of observation in the context of surveys and datasets and explains how to confirm the unit of observation for a given dataset.  
 


== Read First ==
* Never trust the file name by itself as an indicator of what the unit of observation is. Always perform some tests to convince yourself of the unit of observation.
* When working with a dataset that you have not created yourself, identifying the unit of observation is the first step to understanding the data.  
 
*Mistakes related to unit of observation introduce bias into analyses. Always double check the unit of observation before working with data.
==Definition==
The most common context where the concept of unit of observation is used is to describe a data set. A non-technical way to explain ''unit of observation'' in this context is what each row in the data set represents. Just as a distance does not make sense unless we know whether it is measured in miles or kilometers, we need to know the unit of our data set. We often have a good idea of what the unit of observation is at first glance, but do not trust this: always test that your assumption is correct.
 
While in most cases the ''unit of observation'' can correctly be described as what a row of a data set represents, there are cases when it is more than that. For example, a regression might pull different types of observations from different data sets, or a regression might collapse observations to another ''unit of observation'' that never existed in a data set. However, until you have a deeper understanding of this concept, you will be fine thinking of it as what a row in a data set represents.


Note that a data set is always incorrectly constructed if it has more than one unit of observation. Such a data set should be separated into two data sets. Even if the two units of observation have the same variables, it is incorrect, bad practice, and a huge source of error if they are included in the same data set.
==Unit of Observation in Surveys==


===Methods to confirm the Unit of Observation in a data set===
In the context of a survey, the unit of observation describes the unit at or for which survey data is collected. Many times, the unit of observation in a survey is the type of respondent. However, sometimes a respondent provides answers about a larger entity, which is the unit of observation. For example, if school principals are the survey respondents but they provide answers about their schools, the unit of observation is school. If mothers are the survey respondents but they provide answers about their households, the unit of observation is household. However, if school principals are the survey respondents and they provide answers about themselves, then the unit of observation is principal. Similarly, if mothers are the survey respondents and they provide answers about themselves, the unit of observation is mother. Identifying the unit of observation early in the study design is critical for designing a high-quality survey and effectively planning [[Primary Data Collection | primary data collection]].


The first time you use a data set you have not created yourself, you should always start by making sure that you have no doubt what the ''unit of observation'' is. You often get this information from the name of the data file, but you should always test it before believing it. The most obvious way to make sure you know the unit of observation is to ask the person who sent you the data set, but the rest of this section assumes that, for whatever reason, you cannot confirm the ''unit of observation'' that easily.
==Unit of Observation in Datasets==


If you open up a data set for which you have a good reason to believe the ''unit of observation'' is, for example, household, then look for a household ID variable and test whether it uniquely and fully identifies the data set. If this is the case then you are done. However, if you do not find such a variable you will have to find other information that uniquely and fully identifies the data set. For example, in this case, you would look for variables with information on the household head's name. Test whether this variable uniquely identifies all observations. Names are often not unique across a country, so you might have to add region name and village name to the test.
When working with a dataset that you have not created yourself, always start by identifying the unit of observation. In many cases, there is seemingly little risk for confusion in terms of unit of observation. We often have a good intuition for the unit of observation at the first glance of a dataset or a file name. However, always test that your assumption is correct: errors due to an unclear understanding of unit of observation are more common than one might imagine. Consider, for example, monitoring data whose unit of observation is “packages distributed to households.” However, since most households in the dataset only received one package, one could easily mistake the unit of observation for “household.” Clarifying and confirming the unit of observation before beginning to work with a dataset avoids biased [[Data Analysis | analysis]] and paves the way for a correct interpretation of regression and analysis results.


==Usages other than in data sets==
Note that a dataset is always incorrectly constructed if it has more than one unit of observation. Even if the two units of observation have the same variables, it is incorrect, bad practice, and a huge source of error if they are included in the same dataset. All such datasets should be separated into two datasets.


The examples below all have many similarities to how ''unit of observation'' is used in the context of a data set. They are included to give further explanation to the concept or highlight small differences in usage.
===Confirming Unit of Observation===


===Regressions===
The most obvious way to confirm the unit of observation in a new dataset is by asking the person from whom you received the dataset. If you can’t do this for whatever reason, begin by inferring the unit of observation. Imagine you believe the unit of observation is household. Then, open up the dataset, look for a household ID variable and test if it is [[ID Variable Properties|uniquely and fully identifying]]. If it is, then you are done. If not, search for other information that uniquely and fully identifies the dataset. In this case, for example, look for variables with information on the household head's name. Test if this variable uniquely identifies all observations. Names are often not unique across a country, so you might have to add region name and village name to the test. Once you have found the information that uniquely and fully identifies the dataset, make sure you create an appropriate [[ID Variable Properties|ID variable]] accordingly if it does not yet exist.
The unit of observation in a regression is what the N (or number of observations) represents. This is closely related to how the concept is used for a data set, as the N is the number of rows from the data set included in the regression. Interpreting the regression correctly therefore depends on understanding the ''unit of observation''. In most cases this is trivial, but we have had issues where regressions were misinterpreted because monitoring data was believed to have the unit of observation "households", while it actually was "packages distributed to households". Since the vast majority of households only received one package each, this mistake was easier to make than it might first seem.
 
Note that some regressions collapse your data set so the unit of observation in the regression is different from the unit of observation in your data set. This is one example of when the unit of observation cannot be described as a row in a data set.
 
===Surveys===
The concept of unit of observation can also be used to describe, for example, surveys. The unit of observation in a survey is the type of respondent, for example household, company, or school. In the cases of company and school the respondent is a person, for example the CEO or the principal, but they provide answers about the company or the school. If they were asked questions about themselves, then the ''unit of observation'' would be CEOs and principals.


== Back to Parent ==
This article is part of the topic [[Data Management]]


== Additional Resources ==
*In [https://www.bmj.com/content/bmj/348/bmj.g3840.full.pdf Unit of observation versus unit of analysis], Philip Sedgwick explains that “the unit of observation, sometimes referred to as the unit of measurement, is defined statistically as the “who” or “what” for which data are measured or collected. The unit of analysis is defined statistically as the “who” or “what” for which information is analysed and conclusions are made.”


== Additional Resources ==
--
* list here other articles related to this topic, with a brief description and link


[[Category: *category name* ]]
[[Category: Data Management ]]

Latest revision as of 17:50, 21 May 2019

The unit of observation is the unit at or for which data is collected. Common examples include individual, household, community, or school. Clearly identifying the unit of observation is important for a logical survey design, organized data collection, a sound data folder set-up, and an unbiased analysis. This page discusses unit of observation in the context of surveys and datasets and explains how to confirm the unit of observation for a given dataset.

Read First

  • When working with a dataset that you have not created yourself, identifying the unit of observation is the first step to understanding the data.
  • Mistakes related to unit of observation introduce bias into analyses. Always double check the unit of observation before working with data.

Unit of Observation in Surveys

In the context of a survey, the unit of observation describes the unit at or for which survey data is collected. Many times, the unit of observation in a survey is the type of respondent. However, sometimes a respondent provides answers about a larger entity, which is the unit of observation. For example, if school principals are the survey respondents but they provide answers about their schools, the unit of observation is school. If mothers are the survey respondents but they provide answers about their households, the unit of observation is household. However, if school principals are the survey respondents and they provide answers about themselves, then the unit of observation is principal. Similarly, if mothers are the survey respondents and they provide answers about themselves, the unit of observation is mother. Identifying the unit of observation early in the study design is critical for designing a high-quality survey and effectively planning primary data collection.

Unit of Observation in Datasets

When working with a dataset that you have not created yourself, always start by identifying the unit of observation. In many cases, there is seemingly little risk for confusion in terms of unit of observation. We often have a good intuition for the unit of observation at the first glance of a dataset or a file name. However, always test that your assumption is correct: errors due to an unclear understanding of unit of observation are more common than one might imagine. Consider, for example, monitoring data whose unit of observation is “packages distributed to households.” However, since most households in the dataset only received one package, one could easily mistake the unit of observation for “household.” Clarifying and confirming the unit of observation before beginning to work with a dataset avoids biased analysis and paves the way for a correct interpretation of regression and analysis results.
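
The package example can be made concrete with a small sketch. The data, the variable names (hh_id, hh_size), and the numbers below are invented for illustration and do not come from any real project; the point is only that treating each row as a household shifts a simple average when one household appears twice.

```python
from collections import Counter

# Hypothetical monitoring data: one row per package distributed to a household.
packages = [
    {"hh_id": "H1", "hh_size": 4},
    {"hh_id": "H2", "hh_size": 6},
    {"hh_id": "H3", "hh_size": 2},
    {"hh_id": "H3", "hh_size": 2},  # H3 received a second package
]

# Count rows per household; any count above 1 means a row is not a household.
rows_per_household = Counter(row["hh_id"] for row in packages)
repeated = {hh: n for hh, n in rows_per_household.items() if n > 1}

# Mean household size if each row is wrongly treated as one household.
naive_mean = sum(row["hh_size"] for row in packages) / len(packages)

# Mean household size with exactly one record per household.
size_by_household = {row["hh_id"]: row["hh_size"] for row in packages}
household_mean = sum(size_by_household.values()) / len(size_by_household)

print(repeated)        # {'H3': 2}
print(naive_mean)      # 3.5
print(household_mean)  # 4.0
```

Here the household that received two packages is counted twice in the row-level average, which is the kind of bias that confirming the unit of observation is meant to prevent.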

Note that a dataset is always incorrectly constructed if it has more than one unit of observation. Even if the two units of observation have the same variables, it is incorrect, bad practice, and a huge source of error if they are included in the same dataset. All such datasets should be separated into two datasets.

Confirming Unit of Observation

The most obvious way to confirm the unit of observation in a new dataset is by asking the person from whom you received the dataset. If you can’t do this for whatever reason, begin by inferring the unit of observation. Imagine you believe the unit of observation is household. Then, open up the dataset, look for a household ID variable and test if it is uniquely and fully identifying. If it is, then you are done. If not, search for other information that uniquely and fully identifies the dataset. In this case, for example, look for variables with information on the household head's name. Test if this variable uniquely identifies all observations. Names are often not unique across a country, so you might have to add region name and village name to the test. Once you have found the information that uniquely and fully identifies the dataset, make sure you create an appropriate ID variable accordingly if it does not yet exist.
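
The test described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a prescribed procedure: the helper name identifies, the column names (region, village, head_name, hh_id), and the sample rows are all hypothetical, and "fully identifying" is taken here to mean that no key value is missing or empty.

```python
def identifies(rows, keys):
    """Return True if `keys` are never missing and never repeated across rows."""
    values = [tuple(row.get(k) for k in keys) for row in rows]
    # Fully identifying: every row has a non-missing value for every key.
    fully = all(v not in (None, "") for value in values for v in value)
    # Uniquely identifying: no two rows share the same combination of values.
    uniquely = len(set(values)) == len(values)
    return fully and uniquely


# Hypothetical household data that arrived without an ID variable.
households = [
    {"region": "North", "village": "A", "head_name": "Amina Diallo"},
    {"region": "North", "village": "B", "head_name": "Amina Diallo"},
    {"region": "South", "village": "A", "head_name": "Joseph Okoro"},
]

# The head's name alone repeats, so it does not identify the rows.
name_only = identifies(households, ["head_name"])

# Adding region and village makes every combination distinct.
with_location = identifies(households, ["region", "village", "head_name"])

# Once an identifying combination is found, create a simple ID variable.
for number, row in enumerate(households, start=1):
    row["hh_id"] = number

new_id_ok = identifies(households, ["hh_id"])

print(name_only, with_location, new_id_ok)  # False True True
```

If no combination of variables passes the test, that is itself a signal that the assumed unit of observation may be wrong and is worth revisiting.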

Back to Parent

This article is part of the topic Data Management

Additional Resources

  • In Unit of observation versus unit of analysis, Philip Sedgwick explains that “the unit of observation, sometimes referred to as the unit of measurement, is defined statistically as the “who” or “what” for which data are measured or collected. The unit of analysis is defined statistically as the “who” or “what” for which information is analysed and conclusions are made.”
